How to Ensure Top-Tier Data Quality With a Data Catalog

Published on July 30, 2024

In today's data-driven world, businesses rely heavily on accurate and reliable data to make informed decisions. However, the quality of this data can often be compromised, leading to poor business outcomes. This is where data quality management, facilitated by a data catalog (increasingly known as a data intelligence platform), becomes crucial.

Understanding data quality

What is data quality?

Data quality refers to the condition of a set of values of qualitative or quantitative variables. In simpler terms, it is the measure of the data's accuracy, consistency, and relevance. For a business, data quality is the degree to which data meets a company's expectations of accuracy, validity, completeness, and consistency. Today, data quality is essential to data management, as it ensures that data used for business decision-making, analysis, and reporting is reliable and trustworthy.

High-quality data is essential for effective decision-making, while poor-quality data can lead to misguided strategies and operational inefficiencies. It also poses a serious risk to AI models, which can wreak havoc on a business when fed bad data.

The importance of data quality

High-quality data drives better business decisions. For instance, a retail company can use accurate sales data to identify best-selling products and optimize inventory levels. Conversely, poor data quality can lead to disastrous outcomes. Imagine a healthcare provider making treatment decisions based on incorrect patient information, which could result in dire consequences.

Data quality dimensions

In the realm of data management, ensuring top-tier data quality is paramount for any organization aiming to leverage its data assets effectively. Key dimensions of data quality, including completeness, consistency, accuracy, and timeliness, play a critical role in this endeavor. 

For example, in the world of finance, a bank relying on accurate and timely customer credit data can make well-informed lending decisions, minimizing the risk of defaults and enhancing its financial stability. Understanding and maintaining these dimensions through a robust data intelligence platform is essential for achieving superior data quality and driving business success.

Completeness

Completeness ensures all required data is present. Missing data can skew analysis and lead to incorrect conclusions. For example, a customer database missing email addresses might hinder marketing efforts. To handle missing data, businesses can use imputation techniques or data enrichment services.
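
As a quick illustration, here is a minimal pandas sketch, using made-up customer data, that measures per-column completeness and imputes a missing numeric field:

```python
import pandas as pd

# Illustrative customer data with gaps in two fields.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", None],
    "lifetime_value": [120.0, 85.0, None, 240.0],
})

# Completeness rate per column: share of non-null values.
completeness = customers.notna().mean()
print(completeness)  # customer_id 1.0, email 0.5, lifetime_value 0.75

# Simple imputation for a numeric field (median); a field like email
# usually needs enrichment from an external service instead.
customers["lifetime_value"] = customers["lifetime_value"].fillna(
    customers["lifetime_value"].median()
)
```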

Consistency

Consistency means uniformity of data across different datasets. Discrepancies can occur when different departments record data differently. Identifying and resolving these discrepancies is vital. For instance, a customer's name should be spelled the same way across all records to avoid confusion or erroneous duplicate records.

[Image: a single business name rendered many different ways across systems]

Caption: Inconsistencies abound across enterprises. Here’s one example from Scott Taylor that demonstrates the many ways data users can input a single company name.
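
For illustration, here is a minimal Python sketch of one way to normalize company-name variants before matching; the suffix list and rules are illustrative rather than production-grade:

```python
import re

# Illustrative normalization: lowercase, strip legal suffixes and
# punctuation, collapse whitespace. A real matcher is far more thorough.
LEGAL_SUFFIXES = re.compile(
    r"\b(incorporated|corporation|inc|corp|llc|co)\b\.?", re.IGNORECASE
)

def normalize_company(name: str) -> str:
    name = LEGAL_SUFFIXES.sub("", name.lower().strip())  # drop legal suffixes
    name = re.sub(r"[^\w\s]", "", name)                  # drop punctuation
    return re.sub(r"\s+", " ", name).strip()             # collapse whitespace

variants = ["ACME Inc.", "Acme, Incorporated", "acme inc", "A.C.M.E. Corp"]
print({v: normalize_company(v) for v in variants})  # all map to "acme"
```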

Accuracy

Accuracy refers to the correctness of data. Ensuring data accuracy involves validating it against trusted sources. For example, verifying customer addresses against postal service databases can prevent delivery issues and ensure your documented data is indeed accurate.
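
Here is a minimal pandas sketch that cross-checks recorded postal codes against a trusted reference set; the reference is a made-up stand-in for a real postal-service dataset:

```python
import pandas as pd

# Illustrative records and a stand-in reference of trusted postal codes.
records = pd.DataFrame({"customer_id": [1, 2, 3],
                        "postal_code": ["94105", "10001", "99999"]})
reference = {"94105", "10001", "60601"}  # illustrative trusted values

# Flag any postal code absent from the trusted reference.
records["postal_code_valid"] = records["postal_code"].isin(reference)
print(records[~records["postal_code_valid"]])  # rows needing review
```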

Timeliness

Timeliness ensures data is up-to-date and relevant. Outdated data can lead to incorrect decisions. Wall Street offers perhaps the most pertinent example: stock market analysts need real-time data to make accurate, informed trading decisions.

Businesses can use automated data refresh processes to maintain timeliness. Some data intelligence platforms offer integrations with spreadsheets (which often house outdated data) that can auto-refresh from the source on schedule to maintain accuracy and timeliness. 
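
A simple building block for this is a staleness check. The sketch below, with an illustrative 24-hour SLA and made-up refresh timestamps, flags tables that breach their freshness window:

```python
import pandas as pd

# Illustrative refresh log for three tables.
df = pd.DataFrame({
    "table": ["sales_daily", "stock_quotes", "customer_dim"],
    "last_refreshed": pd.to_datetime(
        ["2024-07-30 06:00", "2024-07-29 02:00", "2024-07-30 05:30"]
    ),
})
now = pd.Timestamp("2024-07-30 08:00")  # in practice: pd.Timestamp.utcnow()
sla = pd.Timedelta(hours=24)            # illustrative freshness SLA

# Flag tables whose last refresh falls outside the agreed window.
df["stale"] = (now - df["last_refreshed"]) > sla
print(df[df["stale"]])  # stock_quotes breaches the freshness SLA
```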

Uniqueness

Uniqueness ensures no data duplication. Duplicate data is a common enterprise problem, particularly in a siloed data landscape, and can inflate metrics and lead to incorrect or repetitive analyses. Strategies for identifying and eliminating duplicate data include using unique identifiers and data deduplication tools.
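
For illustration, here is a minimal pandas sketch that finds and removes duplicates on a unique identifier; real pipelines often add fuzzy matching to catch near-duplicates:

```python
import pandas as pd

# Illustrative data with a duplicated customer_id.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name": ["Ada", "Grace", "Grace", "Alan"],
})

# Inspect duplicates on the unique identifier before removing them.
dupes = df[df.duplicated(subset="customer_id", keep=False)]
print(dupes)  # both rows for customer_id 102

deduped = df.drop_duplicates(subset="customer_id", keep="first")
```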

Validity

Validity ensures data adheres to the required format and standards of the given system. For example, a phone number should follow a specific, consistent format. Processes to ensure validity include data validation rules and regular audits.
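
As a small example, the sketch below enforces one phone-number format with a validation rule; the E.164-style pattern is an assumption, so substitute whatever standard your system actually mandates:

```python
import re

# Illustrative rule: country code plus digits, no separators (E.164-style).
PHONE = re.compile(r"^\+\d{1,3}\d{6,12}$")

def is_valid_phone(value: str) -> bool:
    return bool(PHONE.match(value))

print(is_valid_phone("+14155550123"))  # True
print(is_valid_phone("415-555-0123"))  # False: fails the agreed format
```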

Data quality measures

The following measures ensure that data is reliable and fit for its intended use, which is crucial for decision-making and operational efficiency. A data intelligence platform helps maintain these measures by providing a centralized, organized repository where data quality can be continuously monitored, assessed, and managed across the enterprise. As a central storage hub for metadata, a data intelligence platform can also deliver qualitative information about data assets, helping newcomers more readily assess an asset’s fitness for purpose and trustworthiness.

Quantitative measures

Quantitative measures use statistical methods to assess data quality. Common metrics include error rates and completeness rates. One example of a quantitative quality metric is the percentage of assets with errors; by calculating the percentage of missing values in a dataset, data users can identify key gaps and holistically assess larger datasets or domains for their relative quality. This view is an essential tool for the enterprise data steward.
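
For illustration, here is a minimal pandas sketch that computes two such metrics over a couple of made-up tables: the percentage of missing values per table, and the share of assets with any error:

```python
import pandas as pd

# Illustrative stand-ins for real tables in a domain.
tables = {
    "orders": pd.DataFrame({"qty": [1, None, 3], "sku": ["A", "B", None]}),
    "payments": pd.DataFrame({"amount": [9.99, 5.00], "method": ["card", "cash"]}),
}

# Percentage of missing values per table.
missing_pct = {
    name: 100 * df.isna().sum().sum() / df.size for name, df in tables.items()
}
# Share of assets with at least one error (here: any missing value).
assets_with_errors = sum(df.isna().any().any() for df in tables.values())

print(missing_pct)  # {'orders': 33.3..., 'payments': 0.0}
print(f"{100 * assets_with_errors / len(tables):.0f}% of assets have errors")
```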

Qualitative measures

Qualitative measures involve user feedback and expert assessment. Methods for gathering qualitative information about data quality include surveys and focus groups. For instance, asking sales teams about data usability can provide qualitative insights into data quality issues.

A data intelligence platform houses qualitative information in the form of conversations, lineage, top users, common joins, and popular queries associated with a given asset. Such information helps data users more holistically comprehend specific data assets and use them more wisely. 

Automated vs. manual measures

Automated data quality tools offer efficiency and scalability. After all, in a data landscape of terabytes, asking an individual to assess even a portion of that estate is impractical. However, manual checks remain essential for more nuanced assessment, and for diagnosing the deeper issues behind low quality. Combining both approaches ensures thorough data quality checks. For example, automated tools can flag anomalies, while manual checks can validate context-specific accuracy and propose a solution.
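
As a sketch of the automated half, the Python below flags daily row counts that deviate sharply from the norm using a robust, median-based score (the counts are made up); a flagged day would then go to a human for context-specific diagnosis:

```python
import pandas as pd

# Illustrative daily row counts; one day's load clearly broke.
daily_rows = pd.Series(
    [10_120, 9_980, 10_250, 10_040, 2_310, 10_110],
    index=pd.date_range("2024-07-01", periods=6),
)

# Modified z-score: median-based, so the outlier itself can't distort
# the baseline the way it would with a plain mean/std z-score.
median = daily_rows.median()
mad = (daily_rows - median).abs().median()       # median absolute deviation
robust_z = 0.6745 * (daily_rows - median) / mad
anomalies = daily_rows[robust_z.abs() > 3.5]
print(anomalies)  # 2024-07-05: 2310, likely a broken load; review manually
```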

| | Automated DQ measures | Manual DQ measures |
| --- | --- | --- |
| Use cases | Large datasets, real-time data processing, continuous monitoring | Small datasets, one-time projects, ad-hoc analysis |
| Pros | Efficiency in processing large volumes of data; consistent and repeatable processes; real-time error detection and correction; scalability across various data sources | Greater flexibility in handling unique data scenarios; detailed attention to specific data issues; lower initial cost for small-scale implementations |
| Cons | High initial setup and maintenance costs; requires technical expertise; potential for over-reliance on automated systems, missing nuanced errors | Time-consuming and labor-intensive; prone to human error and inconsistency; not scalable for large datasets |

Data quality software

Overview of data quality tools

Data quality tools help manage and improve data quality. Key features to consider include data profiling, cleansing, observability, reporting, monitoring, and alerting. For example, data profiling tools analyze data to identify quality issues.
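
For a rough feel of what profiling computes, here is a minimal pandas sketch deriving a few per-column statistics from made-up data; the negative age and the 'us'/'US' split are exactly the kinds of issues such statistics surface:

```python
import pandas as pd

# Illustrative data with two planted quality issues.
df = pd.DataFrame({
    "age": [34, 29, -1, 41, 29],
    "country": ["US", "US", "us", "DE", None],
})

# Basic per-column profile: null rate, cardinality, value range.
profile = pd.DataFrame({
    "null_pct": df.isna().mean() * 100,
    "distinct": df.nunique(),
    "min": df.min(),
    "max": df.max(),
})
print(profile)  # the -1 age and the 'us'/'US' split both warrant review
```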

Popular data quality tools

Leading tools in the market include Anomalo, BigEye, and Monte Carlo. These tools offer robust features for data quality management. For example, Anomalo provides automated anomaly detection to identify data quality issues.

Integration with data intelligence platforms

Integrating data quality tools with data intelligence platforms enhances and scales data management. Benefits of a data quality solution that integrates with your data intelligence platform include centralized data governance and streamlined workflows. For instance, combining Anomalo with Alation provides a comprehensive view of data quality across your entire data landscape, traversed by both data scientists (your power users) and data analysts (your business users).

Why do data quality vendors integrate with Alation?

Data quality is everybody’s responsibility. By integrating your DQ solution with a platform like Alation, you empower all data users in your organization to support data quality. As a business-friendly catalog, it centralizes quality and observability information in one platform. Other reasons DQ vendors integrate with Alation include:

  • Users can leverage the Alation popularity function to figure out what data to prioritize for DQ initiatives

  • Vendors can surface a range of health metrics in the catalog, signaling data trustworthiness to all users at the point of consumption

  • Users can quickly detect and resolve issues with a holistic view of their data landscape

  • Alation’s lineage graph enables users to understand the full scope of related downstream impacts

  • Data users can easily communicate impacts to the organization, team, or data stewards to avoid using bad data (and incurring the costs that result)

  • Policy Center offers a repository for context on DQ rules and policies, so users can self-educate and stewards can monitor quality at scale

Data quality monitoring

Setting up monitoring processes

Effective data quality monitoring involves several steps. First, define data quality standards and metrics. Next, implement monitoring tools and establish regular review cycles. For example, setting up automated alerts for data quality issues ensures timely intervention.

Key Performance Indicators (KPIs)

KPIs help track data quality performance. Examples include data accuracy rates and error resolution times. Monitoring these KPIs helps identify trends and areas for improvement. 

Automated alerts and notifications

Automated alerts notify stakeholders of data quality issues in real time. Best practices for managing alerts include setting thresholds and prioritizing alerts based on severity. For example, high-priority alerts for critical data ensure prompt resolution.
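
For illustration, here is a minimal Python sketch of threshold-based alerting with severity tiers; the metric names, thresholds, and notify() stub are all illustrative assumptions:

```python
# Illustrative thresholds: metric -> (warn_below, critical_below).
THRESHOLDS = {
    "orders.completeness": (0.98, 0.90),
    "orders.accuracy":     (0.99, 0.95),
}

def notify(severity: str, message: str) -> None:
    # Stand-in for a real channel (Slack, email, pager).
    print(f"[{severity}] {message}")

def check(metric: str, value: float) -> None:
    warn, critical = THRESHOLDS[metric]
    if value < critical:
        notify("CRITICAL", f"{metric}={value:.2%} breached the {critical:.0%} floor")
    elif value < warn:
        notify("WARNING", f"{metric}={value:.2%} breached the {warn:.0%} floor")

check("orders.completeness", 0.87)  # -> [CRITICAL] ...
check("orders.accuracy", 0.97)      # -> [WARNING] ...
```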

Data governance for data quality

Data governance frameworks support data quality initiatives by establishing standard policies and procedures for data usage as it relates to quality. By surfacing key health metrics in a data intelligence platform, leaders can seamlessly integrate data quality information into workflows, empowering every data user with the confidence of high-quality data.

Policies and standards

Developing data quality policies involves defining data standards and guidelines. Ensuring compliance with these standards helps maintain data quality. For instance, standardizing data formats across departments ensures consistency and enables teams to share and use data across the organization more effectively. 

Data stewardship

Data stewards play a vital role in maintaining data quality. Data stewards’ responsibilities include monitoring data quality, resolving issues, and ensuring compliance with data policies. For instance, a data steward might oversee data entry processes to ensure accuracy and consistency at scale. Some solutions may offer automated stewardship capabilities to ease the challenges of standardizing data at high volumes. 

How Keller Williams launched an enterprise-grade data quality platform with Alation and Anomalo

Keller Williams (KW), the world's largest real estate franchise, leverages Alation and Anomalo to deliver high-quality data to its 190,000 agents. Managing over 70 TB of data daily, KW faced challenges in data validation and accessibility. Upon joining KW, Cliff Miller, Enterprise Data Architect, and Dan Djuric, Head of Enterprise Data and Advanced Analytics, identified data governance and cataloging as critical areas for improvement.

KW needed best-in-class data cataloging and quality monitoring solutions. They chose Alation for data governance and Anomalo for data quality due to their seamless integration. This integration allowed KW to prioritize monitoring for 250 high-use tables, significantly improving data management.

Anomalo’s machine learning capabilities automatically detect unusual patterns, providing both day-to-day observability and deeper business-metrics monitoring. This helps prevent corrupted data from reaching users and enhances overall data literacy across the company.

The integration of Alation and Anomalo led to a 10X cost savings compared to bundled legacy solutions. KW's Enterprise Information Management (EIM) team now benefits from a robust data governance framework, enabling better data-driven decision-making and fostering a data-centric culture. This transformation positions KW to maintain a competitive edge in the real estate industry.

By centralizing data information and documentation in Alation, KW has successfully made its data accessible and trustworthy for all stakeholders, boosting data literacy and empowering agents with reliable data insights.

In Miller’s words, “We were in need of two core platform competencies, we didn’t need ten. We wanted those things to be best of breed at what they did — it’s a great benefit that Alation and Anomalo integrate with each other so seamlessly.”

How an open framework supports the unique demands of data quality

Alation launched the Open Data Quality Initiative to empower customers to choose their preferred data quality vendor while ensuring seamless integration with Alation’s data catalog and data governance applications. This initiative accelerates data governance and simplifies metadata security, essential for maintaining data quality.

The principle of "garbage in, garbage out" underscores the importance of high-quality data. Poor quality data leads to poor analyses and insights, which can significantly impact decision-making. Data governance ensures the correct use of high-quality data, which is vital for consistent regulatory compliance and improved data management. However, data quality is a rapidly evolving field with diverse approaches, including profiling statistics, aggregate quality scores, and different methodologies like sampling or machine learning-based rules. This diversity is driven by the varying needs of industries and departments, necessitating specialized data quality tools.

Alation has been at the forefront of evolving the data catalog into a comprehensive platform for data intelligence. This transformation is driven by two core principles: delivering the right data to the right people at the right time and providing openness and extensibility to integrate with other tools. The Open Data Quality Initiative furthers these goals by allowing customers to integrate their chosen data quality tools with Alation, creating a unified system of reference for data quality.

Key features that enhance data quality in Alation

  • SmartSuggestions: AI-powered suggestions to enhance SQL, based on how experts leverage SQL in your organization

  • Data Health Tab: Description of checks, status, and object health values

  • Data Lineage: View and understand the full scope of related impacts (Use Impact Analysis and Upstream Audit to quickly discern not just what data is affected, but who and why)

  • Data Profiling: Providing statistical insights into data

  • Alerting: Triggering notifications for data quality issues

Finally, trust flags in Alation are customizable by other DQ vendors, as Alation APIs were built to allow for custom development. This means DQ vendors can automatically endorse or warn others away from data, via trust flags, based on the checks run. For example, if a table fails any DQ checks, Anomalo will flag the relevant asset as a deprecation and provide a visual “stoplight” warning. By contrast, if the table passes the check, Anomalo endorses the table with a “green light.”
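
To give a feel for how such an integration might look, here is a hypothetical Python sketch of a DQ tool posting a trust flag after a check. The endpoint path, payload shape, and auth header are assumptions for illustration, so confirm them against Alation's current API documentation:

```python
import requests

# Hypothetical sketch: a DQ tool endorses or deprecates a table after a
# check runs. Endpoint path, payload shape, and auth header are assumptions;
# verify against Alation's current API documentation before using.
ALATION_URL = "https://mycompany.alationcloud.com"  # hypothetical instance
HEADERS = {"TOKEN": "<api-token>"}                  # hypothetical auth header

def set_trust_flag(table_id: int, passed_checks: bool) -> None:
    payload = {
        "flag_type": "ENDORSEMENT" if passed_checks else "DEPRECATION",
        "subject": {"otype": "table", "id": table_id},
    }
    resp = requests.post(f"{ALATION_URL}/integration/flag/",
                         json=payload, headers=HEADERS)
    resp.raise_for_status()

# e.g., after a failed check: show the "stoplight" warning on the asset.
set_trust_flag(table_id=42, passed_checks=False)
```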

By integrating these features into the data catalog, Alation ensures that data quality is continuously monitored and easily accessible, enabling faster, data-backed business decisions.

The Open Data Quality Initiative includes an Open Data Quality Framework (ODQF) and a starter kit for data quality partners. This kit provides an open DQ API, developer documentation, and integration best practices, allowing seamless integration of specialty data quality information. This ensures that important data context and quality are readily available to all data consumers.

As a result, customers gain a complete view of data trustworthiness, integrate data quality context into workflows, and enable better data governance. This initiative not only supports diverse data quality needs but also enhances overall data intelligence by providing high-quality, trustworthy data for decision-making.

By reinforcing data governance with robust data quality measures, Alation’s Open Data Quality Initiative empowers organizations to achieve better data intelligence and business value. This initiative represents a significant advancement in ensuring data quality and governance, making Alation a critical tool for modern data management.

Conclusion

Maintaining high data quality is essential for businesses to make informed decisions and stay competitive. By implementing effective data quality assessment practices, supported by data catalogs and governance frameworks, organizations can ensure their data is accurate, consistent, and reliable.

Curious to learn more about how Alation can help you scale data quality? Book a demo today.
