By Michael Meyer
Published on June 30, 2022
As important as the responsibilities of a data curator are to self-service analytics, data curation best practices are still in their infancy. The evolution of communications technologies created an environment ripe for new methods of curation: as technology allowed for more publishers and a higher volume of content, information curation thrived.
We are now seeing a similar transformation in the world of data, where there’s tension between the old world (single-source-of-truth data warehouses with top-down data governance) and the new world (distributed, self-service analytics with grassroots management).
In organizations of all sizes, self-service reporting and analysis are becoming the norm. Where people previously were given data in the form of a packaged report, today they’re free to discover and explore their own data.
In today’s data-driven world, many data workers are struggling with high volumes of often redundant data… and many long for a data user’s version of Wikipedia. Self-service analytics is a fragmented reality — there is no single source of truth. The data warehouse, once considered that source of truth, now shares the stage with data from files, streams, wikis, data dictionaries, metadata management tools, raw web content, emails, chats, and many other forms of data communication.
Data leaders in large organizations know that to make trust-based decisions, data users need context about the sources of data they wield. In other words, they need data knowledge, or an understanding of the nuances of the underlying physical data assets. Why was a dataset created? Who built it? How is it used today, and how has it been used in the past?
Today, that knowledge comprises business descriptions and explanations of how the data has been used historically. It includes an understanding of the quality of the data and how applicable it might be for different use cases. This knowledge, or metadata, is a crucial guide for newcomers to that data, granting them the context to use new data with confidence.
Data quality can change over time. The goal of a self-service analytics organization is to give employees a one-stop shop for data knowledge, and that requires the organization to control the quality of its data. Data curation helps document whether a dataset is the right one to be using. Data quality automation checks those quality rules against the current state of the data. Together, the two ensure the quality of data so the business can make sound decisions.
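To make that concrete, here is a minimal sketch in Python of how automated quality rules can be checked against the current state of a dataset. The `orders` table and both rules are invented for illustration, not taken from any particular tool:

```python
import pandas as pd

# Hypothetical snapshot of an "orders" table; in practice this would be
# loaded from the warehouse on a schedule.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, None, 12],
    "amount": [25.0, -5.0, 40.0, 60.0],
})

# Each rule pairs a human-readable name with a predicate over the data.
quality_rules = {
    "customer_id is never null": lambda df: df["customer_id"].notna().all(),
    "amount is non-negative": lambda df: (df["amount"] >= 0).all(),
}

# Running the rules surfaces the current quality state for curators.
for name, rule in quality_rules.items():
    status = "PASS" if rule(orders) else "FAIL"
    print(f"{status}: {name}")
```

Here both rules fail, which is exactly the signal a curator needs before declaring the dataset fit for use.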
Getting started with data curation can be a challenging endeavor due to the broad distribution of data knowledge across an organization. Pieces of data knowledge are often spread across wiki pages, data dictionaries, email, chat, social media, and raw web content, all of which the data curator needs to identify, understand, and propagate.
Some challenges for the data curator include:
You can’t curate your entire data landscape. In most organizations, data knowledge is distributed across too many places, and the data is changing too rapidly; the velocity of data growth outpaces the rate at which people can be assigned to document it. Focus instead on the data that is used the most in your organization: monitor and analyze the most frequently used datasets and start with them.
Data assets are often redundantly replicated in different formats and in multiple storage locations. How do you know what to trust? How often data is used, and by whom, is not always predictable or recorded. It’s hard to make data knowledge easily discoverable so that people have it when they need it.
Distinguishing high-quality data from inaccurate or stale resources can be difficult because such judgments often require subject matter expertise in the business function associated with that data. Therefore, it can be nearly impossible for a non-expert to know which data source is accurate.
Even an accurate, up-to-date data asset can be used in different ways by different teams. For instance, a product team might analyze clickstream data in two-minute intervals while the marketing team considers two-day sessions. Both methods are valid but result in different numbers for the same metric.
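A small sketch shows how easily this happens. The click timestamps and gap thresholds below are made up, but the same event stream sessionized two ways produces two different values for the "same" metric:

```python
from datetime import datetime, timedelta

# Hypothetical click timestamps for a single user.
clicks = [
    datetime(2022, 6, 1, 9, 0),
    datetime(2022, 6, 1, 9, 1),   # 1 minute later
    datetime(2022, 6, 1, 9, 30),  # 29 minutes later
    datetime(2022, 6, 2, 10, 0),  # the next day
]

def count_sessions(timestamps, gap):
    """Count sessions, starting a new one whenever the gap between
    consecutive events exceeds the threshold."""
    sessions = 1
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > gap:
            sessions += 1
    return sessions

# A short window vs. a much longer one, per the example above:
print(count_sessions(clicks, timedelta(minutes=2)))  # -> 3 sessions
print(count_sessions(clicks, timedelta(days=2)))     # -> 1 session
```

Neither answer is wrong; they simply encode different business definitions of a session, which is why the context needs to be documented.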
Today, most organizations have an abundance of data. But in order to know which data to trust and use for your use case, you must understand how it maps to your business processes, how recently it has been accessed, and how it is being used. You need assurances that the data is high quality, and is useful for the kind of analysis you are performing — all processes that require machine learning and human intelligence working together.
With that in mind, here are the best practices for finding your organization’s optimal balance between humans and machines for successful data curation:
Machines can be effectively trained to pattern-match and find the most important data, and machine learning with data intelligence can save a data curator and their organization a tremendous amount of time. The key is that a curator needs to provide the business metadata for a data element only once; machines can then find the same element elsewhere and automate the process of updating all of the other occurrences.
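As a rough illustration of the idea (a sketch, not any specific product's matching algorithm), fuzzy string matching on column names is one simple way a machine could spot likely copies of an element a curator has already documented. The column names, description, and 0.7 threshold are all invented for the example:

```python
from difflib import SequenceMatcher

# Business metadata a curator documented once, keyed by column name.
curated = {"customer_id": "Unique identifier assigned to a customer at signup."}

# Columns discovered in other tables across the landscape (hypothetical).
discovered = ["cust_id", "customerid", "order_total", "CUSTOMER_ID"]

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Propagate a curated description to columns that match closely enough.
for column in discovered:
    for known, description in curated.items():
        if similarity(column, known) > 0.7:
            print(f"{column}: inherits description -> {description!r}")
```

In practice, a production matcher would weigh far more signals (data types, value distributions, lineage), but the division of labor is the same: the human documents once, the machine fans the knowledge out.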
The key to creating context is to document the data effectively and provide the most useful information possible to enable appropriate use. This is not just about documenting technical information (e.g., columns, labels, tables), but actually creating context that helps people understand how they should use the information. In addition, the ability to cross-reference related articles of information enhances the overall understanding of the data.
There may be hundreds of different uses and definitions for one data source. For example, when defining what constitutes a “U.S. state,” the shipping department might not include Hawaii because it’s a shipping exception, while the finance department would include it in a list of states as a revenue source.
Once you know the context of your data, you also need to make the data discoverable. This is done via push methods, such as emails and alert notifications; just-in-time methods, such as a suggestion-oriented query tool; and pull methods, such as data catalogs.
Finally, you need to propagate changes to the data knowledge; that is, you need to stay on top of technical changes to the data. For example, when a data curator updates a column label, the change should automatically propagate to the other tables and sources that draw on that same data element. This is difficult to do without technology.
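The general shape of that propagation might look like the sketch below, assuming a metadata store that binds physical columns to a shared logical element. All table, column, and element names here are hypothetical:

```python
# Logical data elements carry the curated label in one place.
logical_elements = {
    "elem_customer_id": {"label": "Customer ID"},
}

# Physical columns across tables bound to the same logical element.
column_bindings = {
    ("sales.orders", "cust_id"): "elem_customer_id",
    ("crm.accounts", "customer_id"): "elem_customer_id",
}

def update_label(element_id, new_label):
    """Change the label once; every bound column picks it up."""
    logical_elements[element_id]["label"] = new_label
    for (table, column), elem in column_bindings.items():
        if elem == element_id:
            print(f"{table}.{column} now labeled {new_label!r}")

update_label("elem_customer_id", "Customer Identifier")
```

The curator edits one record, and the tooling, rather than a manual sweep of every table, keeps the rest of the landscape consistent.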
With these new technological advancements, organizations are experimenting with how to integrate machines into the data curation process, while still providing data curators with the appropriate amount of control.
Just as Yelp serves as a guide to all of the restaurants in a given place, a data catalog organizes all of the data assets spread across a company’s various systems. A data catalog documents tribal knowledge and best practices by presenting the data in context.
A crowdsourced approach to data curation means that data analysts can move at the speed of business. Consider Wikipedia, which by all accounts is more accurate and up-to-date than the Encyclopedia Britannica because it’s constantly updated by a community of people, many of whom are subject matter experts and professionals within their given domain.
Another example is Pinterest, where you follow people with interests similar to yours and who can add their pins to your list of saved pins. Or Amazon, which has built a complex algorithm that recommends future purchases based on what you’ve purchased in the past.
Like these consumer catalogs, the value of a data catalog comes from its ability to surface the connections and context around different sets of data. A data catalog may let you upvote or downvote specific data assets, it may let you annotate these assets or deprecate them, and it may let you follow particular users and have conversations around the data.
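One way to picture such a catalog entry is as a record that carries these curation signals alongside the asset itself. This is a minimal sketch with illustrative field names, not any specific catalog's schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogAsset:
    """A catalog entry that carries curation signals with the asset."""
    name: str
    description: str = ""
    upvotes: int = 0
    downvotes: int = 0
    deprecated: bool = False
    annotations: list = field(default_factory=list)

# A curator endorses and annotates one asset, and deprecates a stale copy.
daily = CatalogAsset("sales.orders_daily", "Canonical daily orders rollup.")
daily.upvotes += 1
daily.annotations.append("Use this instead of the legacy orders_v1 extract.")

legacy = CatalogAsset("sales.orders_v1", deprecated=True)
```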
Similarly, the value of your data curation program springs from the value of the human expertise around that data. Data leaders should ask, “What does it take to encourage knowledge sharing?”
Start with the basics. Provide guidance on how to get started, and address basic questions, such as “What makes a good description?” Some people will be hesitant to contribute because they won’t think their content is good enough. Show them that every contribution is valuable and part of a collaborative learning process, and others will jump in with their own questions and concerns.
From there, data teams should determine which data to prioritize. They should also explore how automation in the data catalog can alleviate some of the manual tasks.
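Query logs are one natural signal for that prioritization. This small sketch, with made-up log entries, ranks datasets by how often they are queried so curators know where to start:

```python
from collections import Counter

# Hypothetical query-log entries: the table each query touched.
query_log = [
    "sales.orders", "sales.orders", "crm.accounts",
    "sales.orders", "finance.ledger", "crm.accounts",
]

# Rank datasets by usage; curate the most-queried ones first.
usage = Counter(query_log)
for table, hits in usage.most_common(3):
    print(f"{table}: {hits} queries")
```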
Data curation fosters the reuse of the data knowledge that already exists in the organization and results in higher analyst productivity. When analysts can quickly learn the best practices for a dataset and other tips for success, they get back time to focus on new ideas and analysis. Secondarily, democratizing the wisdom of the few and positioning them as experts encourages others to share their knowledge, too. But these processes are only as strong as the technology behind them.
The Alation Data Catalog provides a modern approach to finding, understanding, and establishing trust with your data — all efforts that make it easier to establish successful data curation.
With the help of the Alation Data Catalog, data curators can create a broader awareness throughout an organization of how data can be applied to make informed decisions, improving the accuracy of data knowledge in the organization. Ultimately, investing in data knowledge with the Alation Data Catalog can inspire individuals to be more data-savvy and promote a data-driven culture.