
Anyone who works with data knows that it’s long past time for data catalogs to catch up with the rest of the modern data stack. Data is no longer consumed primarily by the IT team. Today, data teams include data analysts, data scientists, product managers, business analysts, citizen data scientists, and more. Each of these people has their own favorite data tools and even different languages for describing data.
Too often, collaboration dissolves into chaos and confusion. Frustrated questions like “What does this column name mean?” and “Why are the numbers on the dashboard wrong again?” slow data teams to a crawl. To keep this from happening, data teams and other users can turn to metadata for answers.
Metadata enables collaboration across business units and makes data easier to find and use. Documentation, queries, history, glossaries, and other metadata are what make data understandable.
Anyone who has used a library catalog is familiar with metadata; tags like author, date of publication, subject, Dewey Decimal Number and more help you locate a book and determine whether it’s useful for what you have in mind. However, the modern idea of metadata dates to the 1990s and the rise of the Internet.
As the Internet grew, data and metadata exploded. IT teams were given ownership of data in most companies and placed in charge of creating an “inventory of data,” the way a grocery store might inventory apples and soap. Setting up these inventories and keeping them current were constant struggles for IT.
Data catalogs arose during the Hadoop era (2010s). They evolved as companies realized that they needed to improve the data inventories of the 1990s-2000s by adding new business metadata. The idea was to help the expanding class of business users understand their datasets and put the data in a business context.
These early data catalogs were clumsy, and specialized solutions were lacking. The earliest adopters of the modern data stack and most large tech companies resorted to building their own proprietary solutions, such as Airbnb’s Dataportal, Facebook’s Nemo, LinkedIn’s DataHub, Lyft’s Amundsen, Netflix’s Metacat, and Uber’s Databook. Small companies, without the resources for such in-house projects, had to wait for solutions to arrive.
And arrive they did, eventually, with tools such as Apache Atlas. Still, while the rest of the data stack has evolved in the past few years, and tools like Fivetran and Snowflake let users set up a data warehouse in hours once they are installed, data catalogs have not kept up. Even trying out current metadata tools involves significant engineering time for setup, plus weeks of back and forth with a sales rep to get a demo.
It’s time for a metadata solution that is just as fast, flexible, and scalable as the rest of the modern data stack. In January 2021, Prukalpa Sankar wrote on towardsdatascience.com, “[I]n the next few years there will be the rise of a modern metadata management product that takes its rightful place in the modern data stack.” These new data catalogs will be based on principles of data and data use that have developed alongside the data stack.
Today’s BI dashboards, code snippets, SQL queries, models, recordings, presentations, and Jupyter notebooks are all data assets. All can be searched and analyzed for valuable information. All can be enriched and made more usable through appropriate metadata.
A modern data catalog should leverage metadata as a form of data that can be searched, analyzed, and maintained in the same way as all other types of data. The ability to process and understand metadata will help teams understand and trust their data better.
For example, query logs are just one kind of metadata available today. SQL query logs, properly analyzed, allow us to create column-level lineage, assign a popularity score to every data asset, and even deduce the potential owners and experts for each asset. Quality ratings from users, indexed by a data source, can identify source problems that can be addressed to improve data quality throughout the organization.
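To make the query-log idea concrete, here is a minimal Python sketch of how a popularity score and a list of likely experts could be derived from SQL query logs. Everything in it is an assumption made for illustration: the `QUERY_LOG` records, the table names, and the helper functions `tables_in`, `summarize`, and `likely_experts` are hypothetical and do not describe how DataOS or any particular catalog works. The regular expression only catches simple `FROM` and `JOIN` references; column-level lineage would need a real SQL parser.

```python
import re
from collections import Counter, defaultdict

# Hypothetical query-log records (assumption): a real log would come from
# the warehouse's query history, with many more fields.
QUERY_LOG = [
    {"user": "ana", "sql": "SELECT id, total FROM sales.orders WHERE total > 100"},
    {"user": "ana", "sql": "SELECT o.id FROM sales.orders o "
                           "JOIN sales.customers c ON o.customer_id = c.id"},
    {"user": "raj", "sql": "SELECT name FROM sales.customers"},
    {"user": "ana", "sql": "select count(*) from sales.orders"},
]

# Naive pattern: the identifier that follows FROM or JOIN.
# It ignores subqueries, CTEs, and quoted identifiers.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the set of lower-cased table names referenced in one query."""
    return {name.lower() for name in TABLE_REF.findall(sql)}


def summarize(log):
    """Count queries per table (popularity) and queries per user per table."""
    popularity = Counter()
    users_by_table = defaultdict(Counter)
    for entry in log:
        for table in tables_in(entry["sql"]):
            popularity[table] += 1
            users_by_table[table][entry["user"]] += 1
    return popularity, users_by_table


def likely_experts(users_by_table, table, n=1):
    """Return the n users who query the given table most often."""
    return [user for user, _ in users_by_table[table].most_common(n)]


if __name__ == "__main__":
    popularity, users_by_table = summarize(QUERY_LOG)
    print(popularity.most_common())
    print(likely_experts(users_by_table, "sales.orders"))
```

Query frequency is only a rough proxy for popularity and expertise; a production catalog would likely weight recent queries more heavily and combine the counts with other signals, such as the user quality ratings mentioned above.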
Today’s data catalogs have greatly improved discoverability, but they still do not give organizations a “single source of truth” for their data. Information about data assets is usually spread across tools for data lineage, data quality, data preparation and cleanup, and more. Data silos still impede discovery and enrichment. And dark data remains dark, hidden, and unused (let alone catalogued).
DataOS has these principles at its heart. Its metadata engine, Metis, allows DataOS to apply rich metadata covering all aspects of a dataset, from lineage to documentation. Sitting atop your data ecosystem, DataOS accesses every dataset in your organization, without moving the data, to eliminate silos and dark data.
For more information, or to arrange a demonstration (in days, not weeks), email us at [email protected].