By Robert Seiner
Published on April 6, 2023
Data governance is traditionally applied to structured data assets that are most often found in databases and information systems. Other forms of governance address specific sets or domains of data including information governance (for unstructured data), metadata governance (for data documentation), and domain-specific data (master, customer, product, etc.). This blog focuses on governing spreadsheets that contain data, information, and metadata, and must themselves be governed.
Spreadsheets have been referred to as the dark matter of the data universe, yet they underpin many business and operational decisions. There are millions of advanced spreadsheet users, and they spend more than a quarter of their time repeating the same or similar steps every time a spreadsheet or data source is updated or refreshed. Even so, spreadsheets offer advantages that make them critical to organizations everywhere. They are a convenient, low-cost, user-friendly alternative to larger databases and information systems. For millions of users, they are a powerful self-service tool that requires only a limited skill base. They provide quick answers to questions and address urgent needs of the business.
The ubiquity of spreadsheets creates disadvantages as well. Spreadsheets are not typically developed and managed for enterprise use, which opens the door to risk from malicious actors as well as human error. Some historical “whoops” moments caused by spreadsheets include:
JP Morgan lost $6 billion in the London “Whale” disaster. A “Value at Risk” (VaR) model operated on a series of spreadsheets, which were built manually, via copy and paste. Under pressure to meet deadlines, the team accelerated work… and the errors added up.
TransAlta, a Canadian power generator, overspent $24 million on contracts due to a cut-and-paste error in Excel spreadsheets.
RedEnvelope lost more than 25% of its value after overestimating gross margins due to a budgeting error in (what else?) a spreadsheet. The company revealed that one misrecorded number in a single cell skewed its sales forecast.
It’s easy to see why these errors occur. Spreadsheets are not typically controlled, resulting in numerous versions of the same data and/or individuals with “corrected” versions of the truth. Spreadsheets often require cumbersome links to external data, are hard to combine, present confidentiality, privacy, and security concerns, and are constrained in their size and complexity. Despite this, the advantages of spreadsheets are plentiful – which is why spreadsheets are foundational to business operations. However, the disadvantages must be addressed if spreadsheets are to remain an integral part of the organization’s data landscape.
A data catalog (or data intelligence platform) is a collection of metadata, combined with data management and search tools, which helps analysts and other data users find the data that they need, alongside context that helps them understand and use that data. It serves as an inventory of available data and provides information to evaluate fitness of data for intended uses. The data catalog contains a wealth of knowledge about the data, which ultimately reaches spreadsheets and becomes foundational to business operations. Data catalogs and spreadsheets are related in many ways. I define metadata as “data housed in an IT tool, that provides business and technical understanding of other data and data-related assets.” Simply put, metadata adds context. Business communities require metadata to empower people to locate the data they need, comprehend the data and where it came from, and collaborate with the appropriate business owners and stewards. This metadata is an organizational asset that is housed, connected to the data, and made available through the data catalog.
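To make the idea of “metadata plus search” concrete, here is a minimal Python sketch of a catalog as a list of entries that can be searched by keyword. It is an illustration only, not how any particular catalog product works; the `CatalogEntry` fields, the `search_catalog` helper, and the sample assets are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset plus the metadata that gives it context (hypothetical shape)."""
    name: str
    description: str
    owner: str
    source: str
    tags: list[str] = field(default_factory=list)


def search_catalog(entries: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose name, description, or tags mention the term."""
    needle = term.lower()
    return [
        e for e in entries
        if needle in e.name.lower()
        or needle in e.description.lower()
        or any(needle in t.lower() for t in e.tags)
    ]


# Illustrative entries only; real catalogs hold far richer metadata.
catalog = [
    CatalogEntry("customer_master", "Golden record for each customer",
                 owner="Sales Ops", source="warehouse.crm", tags=["customer", "master"]),
    CatalogEntry("product_prices", "Current list price per product",
                 owner="Finance", source="warehouse.pricing", tags=["product"]),
]

matches = search_catalog(catalog, "customer")
for entry in matches:
    print(f"{entry.name}: {entry.description} (owner: {entry.owner}, source: {entry.source})")
```

Even in this toy form, the search result carries the owner and source along with the data asset's name, which is the context a user needs to judge whether the data fits the intended use.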
As a spreadsheet retrieves external data, valuable context is often lost in the process. The data in the spreadsheet often becomes orphaned from the data’s meaning, which introduces risk, making it harder for both business and technical users to understand and use that data.
Spreadsheets have become a staple in the organization’s data landscape as data resources that are instrumental to operational success. Yet metadata about the data contained in spreadsheets, including (but not limited to) the name, location, purpose, data source, and ownership, often does not exist. Organizations struggle to inventory their structured data resources, like databases and information systems, and seldom recognize spreadsheets as enterprise data resources because of how they are governed. Providing metadata through the data catalog about both the spreadsheets and the data within them gives organizations better insight into which spreadsheets are most valuable and widely used. Data catalogs provide access to data that is organized, defined, and governed, but valuable context is lost when that data is moved to a spreadsheet. By providing access to catalog metadata within spreadsheets, spreadsheet users can understand and use that data more powerfully. Leaders can also enjoy a holistic view of all spreadsheets being utilized by the business.
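As a rough illustration of what a spreadsheet inventory could check, the Python sketch below flags spreadsheets whose metadata is incomplete. The five required fields mirror the ones named above (name, location, purpose, data source, ownership); the record layout, the `missing_metadata` helper, and the sample file paths are hypothetical.

```python
# Metadata fields named in the text; the key spellings are an assumption.
REQUIRED_FIELDS = ("name", "location", "purpose", "data_source", "owner")


def missing_metadata(record: dict) -> list[str]:
    """List the required metadata fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


# Hypothetical inventory records for two spreadsheets.
inventory = [
    {"name": "Q3 margin forecast", "location": "//finance/forecasts/q3.xlsx",
     "purpose": "Quarterly gross margin forecast", "data_source": "warehouse.sales",
     "owner": "Finance"},
    {"name": "Contract bids", "location": "//trading/bids.xlsx",
     "purpose": "", "data_source": None, "owner": "Trading desk"},
]

for record in inventory:
    gaps = missing_metadata(record)
    status = "complete" if not gaps else "missing " + ", ".join(gaps)
    print(f"{record['name']}: {status}")
```

A report like this gives stewards a starting list of the spreadsheets that need documentation before they can be treated as enterprise data resources.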
Spreadsheets have long been a critical element of operational efficiency and effectiveness. People who view spreadsheets as a good thing are looking for the convenience of a controlled data environment, ease of data manipulation, and the ability to manage their own data sets. Others consider spreadsheets to be trouble. Spreadsheets are not going away any time soon, so it makes sense to incorporate them into the data landscape.
Data intelligence platforms enable people to better document, understand, analyze, and gain insights from the data within that data landscape. Accepting and incorporating spreadsheets as a critical element of the landscape is extremely beneficial to the organization. Alation Connected Sheets is the first tool that brings spreadsheets and the catalog together as part of a single landscape.
How do spreadsheet users benefit from Alation Connected Sheets? The ability to connect straight to the source allows knowledge workers to work natively in spreadsheets, pulling data directly from true data sources like the data warehouse or data lake. This connection enables the worker to find, select, filter, and import data, all while having the powerful context of metadata at their fingertips.
In this way, increased efficiency and self-service accuracy are immediate benefits of the catalog and spreadsheets working together. Leaders seek to use strategic data with confidence more often. At the same time, they also must reduce knowledge-worker risk by ensuring accessible data is governed and defined.
When data catalogs are connected to spreadsheets, users can pull trusted, governed, and accurate data into spreadsheets directly. Users no longer need to copy and paste, or export data from other tools, nor must they rely on data teams for clean, up-to-date data. Connecting spreadsheets to data sources boosts efficiency and accuracy while providing self-service capabilities to spreadsheet users.
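The difference between copy-and-paste and pulling from the source can be sketched in a few lines of Python. This is not Alation's API or implementation; it is a generic illustration using the standard `sqlite3` and `csv` modules, with an in-memory table standing in for a warehouse and a hypothetical `column_context` dictionary standing in for catalog metadata.

```python
import csv
import io
import sqlite3

# Stand-in for a governed source such as a warehouse table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 1200.0), ("West", 950.5)])

# Hypothetical catalog metadata that travels with the data.
column_context = {
    "region": "Sales region as defined by Sales Ops",
    "revenue": "Booked revenue in USD",
}


def export_with_context(connection, query, context):
    """Run the query and return CSV text: header, description row, then the data."""
    cursor = connection.execute(query)
    columns = [col[0] for col in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerow([context.get(c, "") for c in columns])
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()


sheet_text = export_with_context(
    conn, "SELECT region, revenue FROM sales ORDER BY region", column_context)
print(sheet_text)
```

Because the rows come from a query rather than a manual paste, refreshing the sheet means rerunning the query, and the description row keeps each column's meaning attached to the data.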
Managing spreadsheets is a difficult task for even the most data-savvy professional. The problems “spreadsheet jockeys” encounter on a daily basis include relying on data engineers to provide data, not knowing what data they have access to or where to go for trusted data, and not knowing the right person to ask about the data.
Bringing trusted, governed data to spreadsheets is a huge problem solver. One Alation user [1] said, “Data teams will now be able to focus on complex analyses that drive the business forward; time otherwise spent pulling and verifying fresh data or tracing lineage.”
This integration helps spreadsheet users by increasing both their knowledge of the data, through rich context, and their confidence in it, for the spreadsheet creator and its audience alike. It also lets users leverage their existing security credentials. With this new tool, knowledge workers are equipped with the data they need to “excel” in their job function.
Even non-spreadsheet users benefit. Data teams will be able to focus on complex analyses that drive the business forward, rather than spending that time pulling and verifying fresh data or tracing lineage on behalf of spreadsheet users.
Alation Connected Sheets brings discovery and trust to the most popular tool for data users everywhere. Based on my recent discussions with Alation, going forward, Alation will enhance governance and discovery capabilities for spreadsheets, effectively ending the silo problem and bringing governance to bear on data that has been historically impossible to verify or trust. Stay tuned for more updates that make spreadsheets:
Findable. Data teams will be able to curate spreadsheets and publish them back into the catalog for others to discover.
Traceable. With the details on a spreadsheet’s history, data teams will be able to conduct impact analysis, tracing lineage from spreadsheet to upstream source applications.
Governable. Lineage and other curation details will offer governance guidance to new users, and persona-based access controls will mask sensitive data from those not authorized to see it – while still enabling them to conduct analysis.
Authoritative. By establishing a centralized repository of spreadsheets, data leaders can break down silos and guide more curious minds to the one trusted version of a given spreadsheet, eliminating confusion and building trust across the organization.
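To show what “traceable” could look like in practice, here is a small Python sketch of lineage tracing and impact analysis over a dependency graph. It says nothing about how Alation will implement these capabilities; the `upstream` mapping, the asset names, and both helper functions are assumptions invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it reads from.
upstream = {
    "board_report.xlsx": ["margin_forecast.xlsx"],
    "margin_forecast.xlsx": ["warehouse.sales", "warehouse.costs"],
    "warehouse.sales": ["crm_app"],
    "warehouse.costs": ["erp_app"],
}


def trace_upstream(asset, edges):
    """Breadth-first walk from an asset to everything it depends on."""
    seen, order = {asset}, []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def impacted_by(source, edges):
    """Impact analysis: every asset that directly or indirectly reads from source."""
    return sorted(a for a in edges if source in trace_upstream(a, edges))


print(trace_upstream("board_report.xlsx", upstream))
print(impacted_by("crm_app", upstream))
```

The first call walks from a spreadsheet back to its source applications; the second answers the reverse question of which spreadsheets and tables would be affected if a source application changed.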
1. Alation blog from Jason Lim – https://www.alation.com/blog/alation-2022-4-alation-connected-sheets-release/