By Dave Wells
Published on February 13, 2020
This blog was last updated in September 2023
Data catalogs have quickly become a core component of modern data management. Organizations with successful data catalog implementations see remarkable changes in the speed and quality of data analysis, and in the engagement and enthusiasm of people who need to perform data analysis. By contrast, organizations without a data catalog often have these questions: What is a data catalog? Why do we need a data catalog? What does a data catalog do? These are all good questions and a logical place to start your data cataloging journey.
A Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness of data for intended uses.
This brief definition makes several points about data catalogs—data management, searching, data inventory, and data evaluation—but all depend on the central capability to provide a collection of metadata.
Fundamentally, metadata is data that provides information about other data. In other words, it’s “data about data.” It consists of labels or markers that describe information, making it easier to find, understand, organize, and use. Metadata can describe a wide range of data formats, including documents, images, videos, and databases.
Data catalogs have become the standard for metadata management in the age of big data and self-service business intelligence. The metadata that we need today is more expansive than metadata in the BI era. A data catalog focuses first on datasets (the inventory of available data) and connects those datasets with rich information to inform people who work with data. Figure 1 illustrates the typical metadata subjects contained in a data catalog.
Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource. People metadata describes those who work with data: consumers, curators, stewards, subject matter experts, etc. Search metadata supports tagging and keywords to help people find data. Processing metadata describes transformations and derivations that are applied as data is managed through its lifecycle. Supplier metadata is especially important for data acquired from external sources, informing about sources and subscription or licensing constraints. If you're interested in learning more, I've taken a deep dive into catalog metadata in my blog post, "Data Catalogs vs. Metadata Management."
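As a rough illustration, the metadata subjects described above could be modeled as a single record per dataset. This is a minimal sketch, not the schema of any particular catalog product: the class name `CatalogEntry`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: one dataset plus its related metadata."""

    name: str       # dataset metadata: the file or table being described
    location: str   # where it lives (data lake path, warehouse table, ...)
    stewards: List[str] = field(default_factory=list)      # people metadata
    tags: List[str] = field(default_factory=list)          # search metadata
    derived_from: List[str] = field(default_factory=list)  # processing metadata
    supplier: Optional[str] = None       # supplier metadata (external data only)
    license_terms: Optional[str] = None  # subscription or licensing constraints


# Illustrative values only.
entry = CatalogEntry(
    name="customer_orders",
    location="warehouse.sales.customer_orders",
    stewards=["sales-data-team"],
    tags=["sales", "orders"],
    derived_from=["raw.orders", "raw.customers"],
)
```

An internally produced dataset like this one leaves `supplier` and `license_terms` empty; a purchased dataset would fill them in.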
A modern data catalog includes many features and functions that all depend on the core capability of cataloging data: collecting the metadata that identifies and describes the inventory of shareable data. It is impractical to attempt cataloging as a manual effort. Automated discovery of datasets, both for the initial catalog build and for ongoing discovery of new datasets, is essential. Use of AI and machine learning for metadata collection, semantic inference, and tagging is important to get maximum value from automation and minimize manual effort.
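To make automated discovery concrete, here is a minimal sketch that harvests basic metadata (table names, columns, and row counts) from a SQLite database using Python's standard `sqlite3` module. The function name `discover_datasets` and the sample tables are assumptions for illustration; a real catalog would crawl many source types and layer semantic inference and tagging on top.

```python
import sqlite3


def discover_datasets(conn):
    """Collect table names, columns, and row counts from a SQLite database."""
    inventory = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        inventory.append(
            {"dataset": table, "columns": columns, "row_count": row_count}
        )
    return inventory


# Hypothetical source system, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO orders (amount) VALUES (19.99), (5.00)")

inventory = discover_datasets(conn)
```

Running the same function on a schedule and comparing the result with the previous inventory is one simple way to detect newly added datasets.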
With robust metadata as the core of the data catalog, many other features and functions are supported, the most essential including:
Robust search capabilities include search by facets, keywords, and business terms. Natural language search capabilities are especially valuable for non-technical users. Ranking search results by relevance and by frequency of use is particularly useful.
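The combination of facets, keywords, and ranking can be sketched in a few lines. This is a deliberately simple illustration, assuming catalog entries are plain dictionaries with hypothetical `name`, `description`, `tags`, `domain`, and `usage_count` fields; production catalogs use real search engines and far richer relevance scoring.

```python
def search_catalog(entries, keyword, facets=None):
    """Filter by facets, match a keyword, rank by relevance then usage."""
    facets = facets or {}
    keyword = keyword.lower()
    results = []
    for entry in entries:
        # Facet filter: every requested facet must match exactly.
        if any(entry.get(key) != value for key, value in facets.items()):
            continue
        # Relevance: number of keyword hits across name, description, and tags.
        text = " ".join([entry["name"], entry["description"], *entry["tags"]])
        relevance = text.lower().count(keyword)
        if relevance:
            results.append((relevance, entry["usage_count"], entry["name"]))
    # Most relevant first; frequency of use breaks ties.
    results.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [name for _, _, name in results]


# Illustrative entries only.
entries = [
    {"name": "sales_orders", "description": "Daily orders by region",
     "tags": ["sales"], "domain": "sales", "usage_count": 120},
    {"name": "sales_forecast", "description": "Quarterly forecast",
     "tags": ["sales"], "domain": "sales", "usage_count": 300},
    {"name": "hr_headcount", "description": "Headcount by department",
     "tags": ["hr"], "domain": "hr", "usage_count": 50},
]

sales_hits = search_catalog(entries, "sales", facets={"domain": "sales"})
```

In this sample both sales datasets match the keyword equally often, so the more frequently used one ranks first.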
Choosing the right datasets depends on the ability to evaluate their suitability for an analysis use case without needing to download or acquire the data first. Important evaluation features include capabilities to preview a dataset, see all associated metadata, see user ratings, read user reviews and curator annotations, and view data quality information.
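Two of these evaluation features, previewing a dataset and viewing data quality information, can be illustrated with a small profiling function. This is a sketch under simple assumptions: rows are dictionaries, and "quality" is reduced to per-column completeness (the share of non-empty values). The function name `profile_dataset` and the sample rows are hypothetical.

```python
def profile_dataset(rows, preview_size=3):
    """Return a row count, a short preview, and per-column completeness."""
    columns = list(rows[0].keys()) if rows else []
    completeness = {}
    for column in columns:
        # A value counts as filled when it is neither None nor an empty string.
        filled = sum(1 for row in rows if row.get(column) not in (None, ""))
        completeness[column] = filled / len(rows)
    return {
        "row_count": len(rows),
        "preview": rows[:preview_size],
        "completeness": completeness,
    }


# Illustrative rows only.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]

profile = profile_dataset(rows)
```

An analyst who sees that a key column is only partly populated can rule a dataset out before spending any time acquiring it.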
The path from search to evaluation and then to data access should be a seamless user experience, with the catalog knowing access protocols and either providing access directly or interoperating with access technologies. Data access functions include access protections for security-sensitive, privacy-sensitive, and compliance-sensitive data.
Robust data catalog software should provide many other capabilities, including support for data curation and collaborative data management, data usage tracking, intelligent dataset recommendations, and a variety of data governance features.
Data catalogs make data work better. They help you find data easily, avoid duplication, understand data better, ensure data rules are followed, make data integration smoother, and encourage teamwork.
Data catalogs boost data understanding with detailed information about datasets. This includes where they come from, their quality, who uses them, how they should be used, and how they connect to other datasets. This information makes it easier for users to grasp the data's meaning, importance, and suitability. As a result, users can make better decisions and perform better analysis.
Data catalogs cut down on errors. They provide data quality information and detailed descriptions, track data history, support compliance with metadata rules, promote teamwork, control access, and assist with data preparation. All this helps users handle data more accurately, reducing errors in analysis and usage.
Data catalogs improve data analysis in many ways. They make it easier to find data, offer context with metadata, ensure data quality, enable teamwork, and simplify data integration. Users can find and use datasets faster, saving time on preparation. Detailed metadata provides insights into data quality and relationships. Collaboration tools help teams share insights. With a data catalog, analysts can make informed decisions, reduce errors, and analyze data more efficiently.
Data catalogs have evolved to meet the changing needs of organizations in the digital age.
In the late 20th century, they began as digital versions of physical catalogs, offering basic information for books and documents. With the rise of digital libraries, these catalogs evolved to simplify the discovery of online resources like e-books.
As organizations started using databases and data warehouses, enterprise data catalogs were created to provide descriptive metadata as a guide. These catalogs grew to include data assets, making it easier for users to find specific data elements in these systems.
In the early 21st century, the need for thorough metadata management led to catalogs that provided information about data lineage, quality, connections, and business context. These catalogs became crucial for data governance.
In the age of big data and self-service analytics, data catalogs changed to handle different data sources and became vital for finding and preparing data.
Today, modern data catalogs use AI and ML to automate curation and metadata creation, improving data discovery. They also integrate into broader data management systems, providing customized data management for specific roles. In short, data catalogs have evolved from simple lists to powerful tools for efficient data management and analytics in the digital age.
The data management benefits of a data catalog become apparent by reflecting on the value of metadata and the capabilities that are created with comprehensive metadata. The greatest value, however, is often seen in the impact on analysis activities. We work in an age of self-service analytics. IT organizations can’t provide all of the data needed by the ever-increasing numbers of people who analyze data. But today’s business and data analysts are often working blind, without visibility into the datasets that exist, the contents of those datasets, and the quality and usefulness of each. They spend too much time finding and understanding data, often recreating datasets that already exist. They frequently work with inadequate datasets resulting in inadequate and incorrect analysis. Figure 2 illustrates how analysis processes change when analysts work with a data catalog.
Figure 2 – Process With and Without a Data Catalog
Without a catalog, analysts look for data by sorting through documentation, talking to colleagues, relying on tribal knowledge, or simply working with familiar datasets because they know about them. The process is fraught with trial and error, waste and rework, and repeated dataset searching that often leads to working with “close enough” data as time is running out. With a data catalog the analyst is able to search and find data quickly, see all of the available datasets, evaluate and make informed choices for which data to use, and perform data preparation and analysis efficiently and with confidence. It is common to shift from 80% of time spent finding data and only 20% on analysis to 20% finding and preparing data with 80% for analysis. Quality of analysis is substantially improved and organizational analysis capacity increases without adding more analysts.
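The capacity claim follows from simple arithmetic, worked through here. The 40-hour week is an assumption for illustration; the 80% and 20% figures are the ones cited above.

```python
def weekly_analysis_hours(hours_per_week, pct_finding):
    """Hours left for analysis after the share spent finding and preparing data."""
    return hours_per_week * (100 - pct_finding) / 100


HOURS_PER_WEEK = 40  # assumed work week, not a figure from this article

without_catalog = weekly_analysis_hours(HOURS_PER_WEEK, pct_finding=80)
with_catalog = weekly_analysis_hours(HOURS_PER_WEEK, pct_finding=20)
capacity_gain = with_catalog / without_catalog

print(f"Without catalog: {without_catalog:.0f} analysis hours per week")
print(f"With catalog:    {with_catalog:.0f} analysis hours per week")
print(f"Capacity gain:   {capacity_gain:.0f}x per analyst")
```

Under these assumptions each analyst goes from 8 to 32 hours of analysis per week, a fourfold increase with no change in headcount. The ratio does not depend on the length of the work week.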
To make the most of a data catalog and ensure it becomes an integral part of your data-driven journey, encourage effective adoption among users with these strategies:
Launch thorough training and onboarding programs to teach users how to use the data catalog effectively. Offer workshops, tutorials, and documentation to help them navigate the catalog with ease.
Foster teamwork in the organization. Urge users to comment on datasets, share ideas, and work together on data projects using the catalog. Recognize and reward contributors, and highlight team achievements. Consider hosting "curation power-hour" events where teams can share their knowledge, making the platform better for everyone. This builds a sense of community and shared data knowledge.
Highlight real-life examples of how the data catalog has made a big difference in finding, preparing, and analyzing data. Share success stories and how the catalog helps various teams and projects. This shows how useful it is and encourages more people to use it.
These strategies help users welcome the data catalog as a valuable tool for their data tasks and encourage its effective use across the organization.
Managing data in the age of big data, data lakes, and self-service is challenging. Data catalogs help to step up to those challenges. Active data curation is a core element of data catalog success and a critical practice for modern data management. In my next blog I’ll answer the question: What Is Data Curation?
The benefits of a data catalog are improved data efficiency, improved data context, reduced risk of error, and improved data analysis.