By Joseph Perez
Published on July 25, 2024
A man walks into a hotel… Wait! This isn’t a joke! He checks in with one credit card, orders room service from his cell phone, and settles his bill with a different credit card. Simple, right?
Not if you’re the data analyst trying to comprehend this single customer. At every transaction and interaction, disparate systems register a distinct customer with unique information. This can lead to a lot of duplicate records and chaos for those who set out to analyze data.
Master Data Management (MDM) emerges as a vital solution. This discipline ensures consistent, reliable data across an organization – so one customer registers as truly one customer (no matter how many credit cards or nicknames he throws around).
In my recent Alation brief, I gave newcomers a tour of MDM and reference data, unpacking their significance and demonstrating how a data catalog supports them. This blog covers the key takeaways, with the full transcript included below if you’d prefer to watch the video and follow along.
Master data is the core information that is essential for operations in a business. This includes data about people, products, and organizations. This could be customer, supplier, vendor, financial policy, insurance policy, et cetera. In contrast to transactions or analytical data, master data is more static. It’s consistent and reliable.
When managed correctly, master data leads to increased efficiency and reliability. It ensures that everyone in the organization is on the same page, making informed decisions based on accurate, consistent data.
Reference data, often confused with master data, includes the sets of permissible values used by other data fields. This can be as simple as country codes or as complex as industry-specific taxonomies. For instance, a standardized list of country codes ensures that ‘USA’ and ‘United States’ are not treated as separate entities.
Usually reference data lives externally to your organization. Take country codes as an example. USA, United States, and the ISO numeric code 840: they all mean the same thing, but are written differently. Your end business users might say, “840? I've never heard of that country!” We want to point users to a table or crosswalk of some sort, so they understand that ‘840’ means ‘United States’. That creates standardization and shared understanding.
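A crosswalk like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular catalog stores reference data: the names `COUNTRY_CROSSWALK`, `COUNTRY_NAMES`, and `standardize_country` are hypothetical, and the table holds only the United States example from the text.

```python
# Hypothetical crosswalk: every known spelling of a country maps to one standard code.
COUNTRY_CROSSWALK = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "840": "US",  # ISO 3166-1 numeric code for the United States
}

# Hypothetical display names, keyed by the standard code.
COUNTRY_NAMES = {"US": "United States"}


def standardize_country(raw):
    """Return (code, name) for a raw value, or None if it is not in the crosswalk."""
    key = str(raw).strip().lower()
    code = COUNTRY_CROSSWALK.get(key)
    if code is None:
        return None
    return code, COUNTRY_NAMES[code]


if __name__ == "__main__":
    for value in ["USA", "United States", 840, "Atlantis"]:
        print(repr(value), "->", standardize_country(value))
```

Returning `None` for unknown values, rather than guessing, keeps unmapped codes visible so a steward can add them to the crosswalk.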
Let’s check back on the man in the hotel. This is a good example of how a customer’s interactions with a business would register within MDM architecture (and create duplicate records, necessitating a solution MDM provides).
The guest may check in under one name or credit card. But maybe when he checks out he uses a different credit card, or forgets to share his loyalty account, or sends the bill to a work email instead of his personal email. Across these systems of record, we have disparate information that differs from source to source as we move from support to service. Judging by the various attributes, that man almost looks like three different people.
MDM resolves these discrepancies by registering these various interactions as one single customer. It then assigns the individual a unique global ID, based on rules the business defines. In this way, MDM aligns various departments within an organization, fostering a unified approach to data management. Once we have mastered this individual, we know this man to be Joseph Perez, Platinum Loyalty Member, who loves a cheeseburger upon arrival (in contrast to three individuals: Joseph Perez, Loyalty Member 626, and a cheeseburger lover).
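To make the idea of a global ID concrete, here is a minimal Python sketch. It assumes a deliberately simple rule, that two records belong to the same person when they share an email, phone number, or loyalty number, which is an illustration only; real MDM tools apply richer, configurable match rules. The function name `assign_global_ids`, the field names, and the `G0001`-style ID format are all hypothetical.

```python
from itertools import count


def assign_global_ids(records, keys=("email", "phone", "loyalty_id")):
    """Give records that share any identifier the same global ID (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen = {}  # (key, value) -> index of the first record carrying that value
    for i, rec in enumerate(records):
        for key in keys:
            value = rec.get(key)
            if not value:
                continue
            if (key, value) in seen:
                parent[find(i)] = find(seen[(key, value)])
            else:
                seen[(key, value)] = i

    ids, counter, result = {}, count(1), []
    for i in range(len(records)):
        root = find(i)
        if root not in ids:
            ids[root] = f"G{next(counter):04d}"
        result.append(ids[root])
    return result


if __name__ == "__main__":
    records = [
        {"source": "front_desk", "name": "Joseph Perez", "phone": "555-0100", "loyalty_id": "626"},
        {"source": "room_service", "name": "Joe Perez", "phone": "555-0100"},
        {"source": "checkout", "name": "J. Perez", "loyalty_id": "626", "email": "joe@work.example"},
        {"source": "checkout", "name": "Ann Lee", "email": "ann@example.com"},
    ]
    for rec, gid in zip(records, assign_global_ids(records)):
        print(gid, rec["source"], rec["name"])
```

The three Perez records end up under one ID even though no single identifier appears on all of them: the phone number links the first two, and the loyalty number links the first and third.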
Reference data management, on the other hand, provides a consistent set of identifiers, reducing discrepancies and enhancing data quality. Together, they create a single source of truth, essential for accurate reporting and analytics. This integration improves overall data governance, ensuring that everyone in the organization can trust the data they use.
Alation supports MDM by cataloging all data objects; this entails not only source systems but also mastered data. The platform enables data users to better grasp how an MDM record is created and certified, allowing them to peer “under the hood” to see how this sausage is made and request changes if needed.
By cataloging and documenting MDM objects and assets, Alation provides a centralized repository for managing master data (and how it’s rendered). This includes defining rules and standards for master data and automating reference data processes. For example, cataloging source tables and columns, and creating glossaries and terms, helps in standardizing reference data across the organization.
In practice, leveraging Alation for reference data management means creating a comprehensive reference data catalog set. This ensures that entities like country codes or state codes are consistently applied across all systems.
MDM is very much a messy process in some instances. With a data catalog, we can really peel back the layers and see how a record was created, where it lives, and who governs it.
Master Data Management and reference data management are crucial for any organization aiming for data accuracy and consistency. Alation provides a robust platform to manage and catalog this data effectively, giving technical experts and business users alike a single place to collaborate, find trusted master data, and better grasp how it’s determined. By leveraging Alation to “master their master data”, organizations can ensure better data governance, improved collaboration, and ultimately, more reliable and trustworthy data.
For those looking to dive deeper, watch the full webinar.
[00:00:00] I'm Joe Perez and I sit within Alation's professional services as a professional consultant. I have a tough act to follow. Last week, we had Jim Barker touching on data quality in a similar tone in a similar fashion. I'll be touching on MDM. Like Deb mentions, this is a part two. The part one was from chaos to clarity: how we can leverage the data catalog to stand up an MDM implementation. Today's lens is going to be slightly different. It's going to be MDM focused. We are going to define what MDM is. We're going to get into what reference data is. That's a little new there. But we're also going to be talking about [00:00:40] how MDM is already existing within your organization. How can we capitalize on that and leverage the catalog to capture some of that information?
[00:00:49] So again serving as a refresher, I want to talk about, you know, these definitions. And I took some of the content from last time, some of the questions I've got, and tailored it slightly to really make sure that we address some of those questions. And you know, make sure we call it out.
So what is master data? Master data is data about the business entity. So the who's or the what's within an organization. This could be terms like people product or org. More specifically this could be customer, supplier, vendor, financial policy, insurance policy, things of that nature. So we really want to focus on this static element within the organization. It's not transactions. It's not analytical data, but it's really this consistent, reliable data. And what MDM serves to do and what MDM serves to accomplish is to increase consistency, increase reliability, increase data quality, and overall make your organization that much better based off those elements.
[00:01:51] It's important to have alignment across units. And MDM is usually cross-silo. So it's not a specific unit. That's more entity resolution. This is really thinking, you know, big picture across your organization.
[00:02:05] We're also going to be talking about reference data. And reference data is in a similar fashion a type of master data. There's some, you know, slight differences. Usually reference data lives externally to your organization. So the example we have here is something like country codes. You know we have USA, we have United States, we have 840. And then we have this ISO code. So these various flavors of the same things. It's important within your organization that we go about mastering this reference data. We want all of our users to be on the same page. We want that standardization. So typically again these are going to be even more static than master data. They may come externally to our organization, but it's equally important that we do capture, and standardize these things. And at the very least, if we are already using standardization (so maybe your organization already uses ISO codes) [00:03:01] we want users to know what the ISO code means. So when a user sees ISO 31662, we want users to know that this is the US, or they see the numeric value 840. Your end business users might say, "840? I've never heard of that country, you know. Where is that?" We want users to be able to be pointed to a table, a crosswalk of some sort, to be able to understand that this is the United States, this is the USA to, again, create that standardization across your organization.
[00:03:31] Now I'm going to talk about both of these as disciplines. So Forrester really describes MDM or master data management as the process to create that single source of truth. Again, this is going to be across silos. We're going to get the best versions of each one of those records.
So if we're looking at a customer name, you may have John Smith that lives at this address with this phone number and one record, you may have the same John Smith, but with a different phone number. We want to establish that best source of truth with MDM and establish what should a golden record for John Smith the customer be. MDM helps accomplish that. It gives us better insights. It helps remove things like duplications and really addresses on those six dimensions of data quality, namely completeness and making sure we have a full record, even if it's a Frankenstein of a record coming from different disparate sources. We want consistency, so we want John Smith to have the same phone number and email across all use cases. And we want uniqueness. We don't want to be calling John Smith twice just because he has a different phone number or email address. They could be very much the same person. When a CEO reports to Wall Street about how many unique customers, we want them to have the confidence in their data that they have x many customers. And if you have duplications, if you don't have entity resolution, if you don't have master data, it's almost impossible to say that in complete honesty.
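One way to picture the golden record described here is a small survivorship routine. The sketch below assumes a simple "most recently updated non-empty value wins" policy per field; that policy, the function name `build_golden_record`, and the `updated` field are illustrative assumptions, not the rule any specific MDM product applies.

```python
def build_golden_record(records, fields):
    """For each field, keep the newest non-empty value across duplicate records."""
    # ISO-formatted dates sort correctly as strings, so newest comes first.
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        golden[field] = next((r[field] for r in newest_first if r.get(field)), None)
    return golden


if __name__ == "__main__":
    duplicates = [
        {"name": "John Smith", "phone": "555-0101", "email": "john@example.com", "updated": "2023-01-15"},
        {"name": "John Smith", "phone": "555-0199", "email": None, "updated": "2024-06-01"},
    ]
    print(build_golden_record(duplicates, ["name", "phone", "email"]))
```

The result takes the phone number from the newer record and the email from the older one, which is the "Frankenstein" record in practice: complete, consistent, and one row per customer.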
[00:05:01] On the other hand, on a similar note, we have reference data management. The Gartner Institute describes it as the utilization of a consistent and uniform set of identifiers. So again, that standardization piece, that universality, it's strategic for us to all be on the same page about what we're calling things. It makes our lives so much easier when that standardization exists.
[00:06:07] Now, what does this look like in the catalog? I'm going to start with MDM. And I'm really going to take a step back and look at the bigger picture here in this next slide. So this is the application of how we can layer the catalog on top of our MDM architecture. So I'm going to break down this architecture from top to bottom. At the very, very top, we have these various systems of record. So I'm going to use a hotel chain, for example. We may have a CRM that's, maybe, loyalty numbers, things of that nature. Just customer records. And in there, each person may need to have a unique email address. As we continue and we move to the right, we may have a retail POS. That might be our hotel's checkout system. That hotel checkout system may have the same names, may have the same phone numbers. But maybe when I'm checking out I use a different credit card, or I don't mention my loyalty account, or I send it to my work email instead of my personal email.
[00:06:25] So again, we're seeing that at these systems of record, we have disparate information that differs from source to source as we move on to support and service. You know, that might be, oh, I made a phone call to upgrade my room or something. So they captured my work phone number instead of my cell number. Again, we have a different Joe Perez in each one of these systems. That's not even to say I don't, you know, confuse them even more by going by Joseph Perez at checkout. But then Joe Perez within my loyalty account, and then Jojo Perez when I'm on the support service line, or Jerry or something like that. I'm all the same person across these systems, but from an understanding of the various attributes, I look very different.
[00:07:09] Where MDM hopes to resolve is looking at all these systems and resolving and saying, this is all the same. Joseph Perez, with this email, with this phone, and with this loyalty number, it then goes to assign a unique global ID to me based on a set of conditions that I've reached. And MDM is very much rules-based. And these rules are going to take account. Oh, he has a similar first name, the exact last name, and the exact state and billing code. Let's consider these two people the same, where his name was a little bit fuzzy, but it wasn't close enough. Let's consider these two different hotel guests. And again, MDM goes through your data across sources and creates these records.
[00:07:52] Once these records are created, it then may feed it downstream to various sources for consumption. This could be your data lakes, your data warehouses, your data marts, your app analytical tools like Tableau or Power BI. I got the question last session surrounding "what connectors does Alation have to MDM?" Currently, we don't have anything that connects directly to the MDM application. And that's okay because as we look at this we see that the MDM record can be written back to the source of record. It can be written to an enterprise data warehouse. So, if it's being written into that enterprise data warehouse for downstream consumption, we can connect to Snowflake. We can connect to Databricks. If the CRM system is, let's say, based on the SQL server, we can connect to that and capture the MDM information. And what this also goes to talk about, it's not just the architecture of MDM, but in those orange and peach boxes. The orange boxes denote where an article, where a term, where a policy, word documents, can be done. And this can document assets. So defining rule sets, capturing the process, defining what the global ID's encompassed. So using a term for that. We also have the peach objects. And that's going to be what we're more used to. That's just cataloging the object at various levels. So this is, you know, cataloging a table that is used for MDM, so a source table for instance, or cataloging a golden column. So what the master column looks like, is going to be denoted in peach. So this is all just layered on top of your MDM records. So, really, Alation is serving as the platform to capture all this insight and all this information.
[00:09:34] And now, what I'm about to go into in greater detail is the actual applications of cataloging MDM objects, cataloging and defining document assets, and things of that nature. And then I'm going to move into reference data in a little bit more detail.
[00:09:34] So the application: cataloging MDM objects. We had two images down below on the screen. The one on the left is what I'm going to call a source table. It's going to be considered transactional data. So it's under that transactional domain. I see that the source system is Shipwell. I see that we have MDM stewards assigned to it. I see what the golden source is. So I'm pointing users to that information. I'm talking about trust decay. So I'm capturing the information of trust. And what trust is in the context of MDM is how quickly or how fresh data is. Going back to that architecture slide, those different systems of record aren't going to stay fresh at the same rate. So your CRM system may not be updated by your end user regularly, but your checkouts are going to be more fresh. So, something like this may have a medium trust decay compared to a slow trust decay. And that just shows how long we can trust this data, how long this sourced information has priority. So again, different elements that we could capture, you know, the trust decay, the master viewer. Is it a critical data element? When looking at the column level, this is a mastered record. So we call out that this field is mastered, we call out that it's not used for matching because it's mastered. So this is a byproduct of MDM. In this case, we consider it a critical data element. It's country. We talk about the entity types so that it revolves around product. So this country isn't used for, you know, a user's address. This is used for where the product might be shipped to or created or where it's currently sitting. So it's again that entity type that we're capturing here. Again we have the golden source. So what table this country lives in, where it's stored. And we also talk about the source system. So this country could come from a variety of sources. It could be your Shipwell system, your Juno OCFO connector, and information like that is how this record was comprised.
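The trust decay idea can be sketched numerically. The example below is purely illustrative: it assumes trust falls linearly from a maximum to a minimum over a fixed decay period, and the function name `trust_score`, the scores, and the day counts are all made-up values rather than settings from any MDM product.

```python
def trust_score(age_days, max_trust, min_trust, decay_days):
    """Linear decay: trust falls from max_trust to min_trust over decay_days."""
    age_days = max(age_days, 0)
    if age_days >= decay_days:
        return min_trust
    return max_trust - (max_trust - min_trust) * age_days / decay_days


if __name__ == "__main__":
    # Hypothetical profiles: the CRM decays slowly, the checkout system at a medium rate.
    crm = trust_score(age_days=400, max_trust=90, min_trust=30, decay_days=730)
    checkout = trust_score(age_days=30, max_trust=80, min_trust=20, decay_days=365)
    winner = "checkout" if checkout > crm else "crm"
    print(f"crm={crm:.1f} checkout={checkout:.1f} -> prefer {winner}")
```

Under these assumed numbers, the fresher checkout value outranks the CRM value even though the CRM starts with a higher maximum trust, which is the sense in which a source's information "has priority" only for so long.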
So we get to really see how the sausage is made. You know, MDM is very much a messy process in some instances. We really want to, you know, peel back the layers and see, okay, how this record was created? Where does this record live? Who governs it? And things of that nature.
[00:12:08] So again this is how we could catalog MDM objects. So the actual action of looking at it at a metadata level. Tables, sources, schemas, and columns even display this information. So you just have the confidence: "I want to use this country here. It's been mastered. This is the best country field to use for my machine learning applications, for my BI dashboards, for things of that nature."
[00:12:34] We also want to document MDM assets. So like we saw in that architecture, when the catalog is laid on top of everything, we have the, you know, action of cataloging objects. But we also want to document MDM assets. This could be by creating a simple glossary, you know, containing MDM asset terms. So, what is a customer entity? What is a product entity? What is an entity? And I know I'm going to get the question, hey, where can we find this? So a little Easter egg in the appendix I define a bunch of different types of entities. I define what an entity is so you guys can build this out with a little bit more ease. So I think I provided 16 different entity types that you can then go through and see which ones are applicable to you with a short and sweet description.
[00:13:19] We also want to maybe leverage a domain approach where we have domains capture what's transactional data, what's analytical data, and what's master data. So here we get to see that master data living in that domain, you know, definitely possible. And then users can just go to that master data domain to find data. Again, leveraging the catalog as a platform to make discovery, context, and understanding that much more efficient. We may even want to define the rules. So I have a policy here called customer 360. I have the columns I'm using to create that golden record for matching. I also have the rule set. So rule one is I need everything, every field to be exact. I need an exact first name and exact last name and exact state as I move down that list. Rule two might loosen up, but I'm capturing that it's fuzzy now, exact and exact. Again, now any user can come here to the catalog and see how this was actually captured. Maybe I disagree that we should have fuzzy first and last names. This is something I can then leverage conversations or talk to the MDM steward to really see how is this being created. Is this the best thing appropriate? But now there's that visibility. It's that context that didn't exist prior to the catalog. This doesn't now live in an Excel sheet in SharePoint tucked away somewhere. This is visible. This is consumable. This is accessible within your catalog. And then some other information I call out: the entity type again is customer here. The business units of who this actually relates to is sales and marketing. Again, allowing for that easy discoverability and now allowing for, you know, process and repeatability.
[00:14:57] Some other things that aren't called out in these images are just documenting the overall process of MDM, training your stewards on, you know, what makes a good match, what matches should be considered for merge, and really allowing Alation to be the place to document these processes.
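The two-rule set described above (rule one: exact first name, exact last name, exact state; rule two: fuzzy first name, exact last name, exact state) can be sketched as follows. This is an assumption-laden illustration: the fuzzy comparison uses Python's standard `difflib.SequenceMatcher`, the 0.8 threshold is arbitrary, and the names `RULES` and `first_matching_rule` are hypothetical rather than anything defined in the webinar or in Alation.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Case-insensitive exact comparison."""
    return a.strip().lower() == b.strip().lower()


def fuzzy(a, b, threshold=0.8):
    """Similarity ratio from the standard library; the threshold is an assumption."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio() >= threshold


# Rules are tried in order, strictest first.
RULES = [
    ("rule_1", {"first_name": exact, "last_name": exact, "state": exact}),
    ("rule_2", {"first_name": fuzzy, "last_name": exact, "state": exact}),
]


def first_matching_rule(a, b, rules=RULES):
    """Return the name of the first rule both records satisfy, or None."""
    for name, comparators in rules:
        if all(compare(a[field], b[field]) for field, compare in comparators.items()):
            return name
    return None


if __name__ == "__main__":
    john = {"first_name": "John", "last_name": "Smith", "state": "TX"}
    jon = {"first_name": "Jon", "last_name": "Smith", "state": "TX"}
    mary = {"first_name": "Mary", "last_name": "Smith", "state": "TX"}
    print(first_matching_rule(john, john))
    print(first_matching_rule(john, jon))
    print(first_matching_rule(john, mary))
```

Recording which rule fired, not just whether the records matched, is what lets a steward or analyst question a specific rule, such as whether fuzzy first names are appropriate.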
[00:15:14] The next thing I want to get into is using Alation to create a reference data catalog set. And what do I mean by this? So this video is going to play on a loop. So don't worry if I'm going too slow or too fast. We want to leverage things like glossaries and terms. What we're seeing now is the glossary that has all these different country codes, describes what a country code is. Now we want to apply this to a catalog set. So I made the membership rules be based off a regular expression with various forms of country. And it came up with these 15 columns with country. We're going to drill down into one of them, and now we're going to have the show fields all have that reference term country codes. And now we apply it. If we now click on one of these countries, we see that country codes exist. We don't want to just say, hey, have this term and manually apply it every single time country is mentioned in your catalog. We want to go a step farther here and show you how you can automate this process using a catalog set for your reference data assets. Country codes are one example. You can also have state codes. You can have Social Security number format. You can have cell phone area codes. The sky's really the limit when we think about what reference data is. It's really standardization within your catalog. And this is just one way we can accomplish that to make sure everyone's on the same page. When they see a code, when they see a country, they could really map it out. And this is just one easy and efficient way to accomplish that.
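The membership rule described here, a regular expression that picks up the various forms of "country" in column names, can be approximated outside the catalog with plain Python. This sketch does not use Alation's API; the pattern, the sample column names, and the functions `select_members` and `apply_term` are hypothetical stand-ins for what the catalog set does in the demo.

```python
import re

# Assumed pattern: "country" or a common abbreviation as a whole
# underscore-delimited token in the column name.
COUNTRY_COLUMN = re.compile(r"(^|_)(country|cntry|ctry)(_|$)", re.IGNORECASE)


def select_members(columns, pattern=COUNTRY_COLUMN):
    """Return the column names that satisfy the membership rule."""
    return [name for name in columns if pattern.search(name)]


def apply_term(columns, term):
    """Map every member column to the reference term it should carry."""
    return {name: term for name in select_members(columns)}


if __name__ == "__main__":
    columns = ["country", "ship_country", "CNTRY_CD", "county", "country_code", "customer_name"]
    print(apply_term(columns, "Country Codes"))
```

Anchoring the pattern on token boundaries keeps near misses such as `county` out of the set, which matters when the term is applied automatically rather than reviewed column by column.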
[00:16:49] The next thing I want to show is another option we have. So another option is you could treat your reference data, because again it is just data, as you would your normal data. So you could have any reference data source that has tables. You can leverage these tables to reside. And then we can run titling. So at the column level we can leverage this automatic titling. We can say from the current table or from a separate table. And then we can map these things out. So here if we look closely, pill is the title of this value C25394. We want all users to have that title. This is a pill. We don't want there to be any confusion with what C25394 means. Leveraging this automatic titling at the column value level, whether it's within the table or externally, is a great way for us to master reference data and have everyone be on the same page of what this given value means.
[00:17:47] So again, super easy. I think it's about four clicks in order to get all this information mapped out, and the productivity, the clarity that you create, the control that you're able to have by mastering these things using the catalog is going to go well beyond those four clicks.
[00:18:07] The next thing I want to talk about, the last thing I want to talk about, is 360 capitalization. You know, we often think and we often hear customer 360, product 360. We just want to capitalize on the whole 360 area. We want to have Alation be a place for users to collaborate on your MDM and RDM type endeavors. And why do you want to create or collaborate? Why do you want the ecosystem? Why do you want to govern? To capitalize on the synergies that exist between MDM and the data catalog. You know, MDM creates data. We should treat that data like we would analytical or transactional data. We want to focus on visibility through collaboration. We also want to think about the catalog as a platform to enable discovery and that understanding of your master data. So leveraging the ecosystem and really capitalizing on those aspects that the Alation catalog does so well. And governance: MDM is a part of DQ, it's a part of data governance. You know, I can be a doctor at the DQ level. I'm a cardiologist at the data catalog level of the data quality level. And then I'm a heart surgeon at the master data management level. That doesn't mean I shouldn't be doing all three. You know, my doctor should know just about as much about my heart as my surgeon. It should be all that base level understanding. And the catalog is that level playing field to improve things like data literacy and data governance. Like I said, there is a jam-packed appendix, and that appendix includes things like enterprise data types, just simple MDM architecture, catalog activities, and more references for that. The country codes that I showed, more areas surrounding entity information. So what an entity is, various entity types. We have customer entity, we have legal entity, taxonomy and product category entities, talking about the MDM framework, MDM steps, and then the various implementation styles, which really hit on the area of how do I connect to MDM.
Well it depends. Are you writing back to your source systems? Is MDM the single source of truth and kind of walking through those various flavors?