
Data governance can be a powerful agent in scaling the use and distribution of trusted data throughout the company. However, more often than not, it conjures up the idea of a central authority strictly guarding against such access. In this 3-part series, we’ll cover the critical and often misunderstood components of data governance and offer perspective on how to implement data governance strategies that deliver trusted data at the speed of business. If you missed it, make sure to catch up on Part 1 – Data Timeliness.
A taxonomy, very broadly, is a system of organized information that allows the user to classify and show relationships between things. A common example of a taxonomy is the Dewey Decimal System of library classification, in which numbers form a code that correlates to topics, subtopics, and sub-subtopics. Wikipedia illustrates the way this hierarchy is set up:
500 Natural sciences and mathematics
    510 Mathematics
        516 Geometry
            516.3 Analytic geometries
                516.37 Metric differential geometries
                    516.375 Finsler geometry
In the Dewey classification system, each number is associated unambiguously with a single entry in the hierarchy. A number such as 516.375 above identifies a book or other resource specifically as dealing with Finsler Geometry. That number also shows how that book relates to others above and below it in the hierarchy.
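To make the "above and below" relationship concrete, here is a minimal Python sketch that walks from a Dewey number up to its top-level class. The entries are the six from the example above; the function names `parent` and `lineage` are hypothetical, and the digit-trimming rule is a simplification that happens to hold for this excerpt rather than a full implementation of the Dewey system.

```python
from typing import List, Optional

# Entries copied from the hierarchy excerpt above.
DEWEY = {
    "500": "Natural sciences and mathematics",
    "510": "Mathematics",
    "516": "Geometry",
    "516.3": "Analytic geometries",
    "516.37": "Metric differential geometries",
    "516.375": "Finsler geometry",
}


def parent(code: str) -> Optional[str]:
    """Return the code one level up, or None at the top of the hierarchy."""
    if "." in code:
        whole, frac = code.split(".")
        # Drop the last decimal digit: 516.37 -> 516.3, and 516.3 -> 516.
        return f"{whole}.{frac[:-1]}" if len(frac) > 1 else whole
    # Zero out the last non-zero digit: 516 -> 510 -> 500.
    stripped = code.rstrip("0")
    if len(stripped) <= 1:
        return None
    return stripped[:-1].ljust(3, "0")


def lineage(code: str) -> List[str]:
    """List the labels from the top-level class down to the given code."""
    chain = []
    current: Optional[str] = code
    while current is not None:
        chain.append(DEWEY.get(current, "(unlisted)"))
        current = parent(current)
    return list(reversed(chain))
```

Calling `lineage("516.375")` returns the six labels in order from "Natural sciences and mathematics" down to "Finsler geometry", which is the path a reader would follow through the shelves.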
A data taxonomy uses a system of unambiguous metadata terms (such as a filename or tags attached to a file) that allow an enterprise to classify a file or dataset into important business categories. Categories can be configured in any way that meets the needs of the organization, but some common ones include the date of creation, date last modified, account name of the creator/modifier, required access privileges, personal identifying information (PII), the department that owns the dataset, and the primary business use of the dataset.
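As a rough illustration, the categories listed above could be captured as one metadata record per dataset. The sketch below is in Python. The class name, field names, sample datasets, and the `find` helper are all hypothetical and do not reflect any particular product's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class DatasetTags:
    """One metadata record per dataset, using the categories named above."""
    name: str
    created: date
    last_modified: date
    modified_by: str
    required_access: str
    contains_pii: bool
    owning_department: str
    primary_use: str


# Hypothetical sample entries, for illustration only.
catalog = [
    DatasetTags("q3_pos_transactions", date(2023, 7, 1), date(2023, 10, 2),
                "jdoe", "finance_read", True, "sales", "revenue reporting"),
    DatasetTags("campaign_clicks", date(2023, 8, 15), date(2023, 9, 30),
                "asmith", "marketing_read", False, "marketing", "campaign analysis"),
]


def find(records: List[DatasetTags], **criteria) -> List[DatasetTags]:
    """Return the records whose fields match every given criterion exactly."""
    return [
        record for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]
```

With records like these, a question such as "which datasets owned by sales contain PII?" becomes a lookup, `find(catalog, owning_department="sales", contains_pii=True)`, rather than a manual search.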
Properly designed and developed, a data taxonomy improves discoverability, observability, and security for your data. Data that is properly classified, catalogued, and tagged is usually well-governed data.
A proper data taxonomy addresses many problems in your data and metadata.
Data discoverability starts with a data catalog, and building a catalog starts with tagging data in business vocabulary so users can easily find the data they need. A data taxonomy makes cataloging much more powerful, improving data quality and discoverability. DataOS® can automate tagging and indexing to add incoming data to your catalog immediately.
The two keys to building a usable data taxonomy from scratch are keeping each change focused and using the language of your users as much as possible.
Focus your taxonomy on one business area at a time. Balance your choice of area by beginning with high-priority targets, while keeping your scope manageable. For example, don’t begin with something like compliance with HIPAA or GDPR. Those are too large and too sweeping to start with. Save those to address after you build the taxonomies for a few smaller areas, such as marketing, sales, or security. Not only will this give you more practice with the methods of taxonomy, but much of what you build there will be needed for something like GDPR, so you’re whittling the scope of that project down as you go.
Use your narrow focus to plan and keep milestones as your taxonomy progresses from one target to the next.
More than many other data projects, a data taxonomy is a team effort. Your IT team or data steward can’t do it on their own. A data taxonomy needs to use the language of your business users, which means a polling process and meetings with users to learn how they think of their data.
You may add a hierarchy to your taxonomy to address the variety of terms that users may have for the same thing. If users have terms like “POS revenues,” “sales,” and “revenues,” then you can set up the taxonomy so all of those searches point back to “sales,” which is the tag that appears in your metadata. This is one of the primary ways in which a taxonomy enforces consistency and aids discoverability.
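The synonym handling described above can be sketched as a simple lookup. This Python sketch assumes a hand-maintained table. The names `SYNONYMS` and `canonical_tag` are hypothetical, and the three terms are the ones from the example.

```python
# Every user-facing term maps to the one tag that appears in the metadata.
SYNONYMS = {
    "pos revenues": "sales",
    "revenues": "sales",
    "sales": "sales",
}


def canonical_tag(search_term: str) -> str:
    """Normalize a search term and return the tag stored in the metadata."""
    # Lowercase and collapse whitespace so "POS  Revenues" matches "pos revenues".
    key = " ".join(search_term.lower().split())
    # Unknown terms pass through unchanged so they can be reviewed and added later.
    return SYNONYMS.get(key, key)
```

A search for "POS Revenues", "revenues", or "sales" then resolves to the same "sales" tag, so users keep their own vocabulary while the metadata stays consistent.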
The focus of your taxonomy efforts can also help users see the value of the taxonomy to their particular projects, increasing enthusiasm and interest in developing the vocabulary for their area.
Most modern businesses spend a lot of money on collecting their data. The ROI on that effort depends on deriving business insights from the data. A data taxonomy makes data easier to find and easier to use while improving data governance and data quality. It makes your data more valuable to your business.
Data democratization is one of the data world’s favorite buzzwords, but data management platforms have a hard time delivering it, even when they say they will. This article examines some of the goals and roadblocks on the way to data democratization and shows how DataOS makes the process simpler and, above all, genuine.
Data democratization is easy to describe but hard to implement. The goal is to give every user access to the data they need to make business decisions. There are many considerations when realizing this goal. Business users will need data skills that they didn’t need under the old “gatekeeper” approach, when access to data was strictly limited. Those same users will need self-service BI/Analytics tools to help them extract insights from the data they have. The business will also need wide-ranging changes in business and data culture, such as transferring ownership and maintenance of domain-specific datasets from IT to the departments that use them.
As complex as all of these issues are, though, some of the most intractable considerations have to do with data governance.
Governance problems arise as soon as we formulate the definition of data democratization. “Access for all” can never mean “access to everything for everyone.” Matters of privacy, data security, and simple data overload necessitate some form of access controls. Just what data does a given user “need” in their work, and how can they get access to it while maintaining privacy, data security, and regulatory compliance? It may not be practical to restrict or permit access to individual cells in a data table. More often, entire columns must be restricted, or even entire datasets, in order to restrict the information at all. As a result, users lose access to data that they could use and ought to have.
Solutions to these issues include curated “data marts” and selected data catalogs that allow users to find and search domain-specific datasets. One benefit of this approach is that it spares users from searching the entire enterprise data domain in order to find information they want, but it also means that they may miss relevant data that isn’t included in their data mart.
Another governance problem is the tendency of users to construct customized databases for their own particular needs. It makes sense in terms of making queries faster and simpler, but it creates new data silos just as the business is getting rid of them in the course of modernizing its data architecture. All the problems of silos, such as stale, duplicated data, come right back.
Attribute-based access control (ABAC) is a central part of governance with DataOS. DataOS provides a system of tags applied to both users (as permissions) and datasets at the row and column level. ABAC allows the governance policy to become highly granular and to grant access to selected parts of diverse datasets. In effect, ABAC allows a governance team to create an unlimited number of “virtual datasets” with access granted by privilege level and business unit, without having to create data marts that provide access only to specific curated datasets.
For example, a governance team might create three tags: business_1, business_2, and business_3, offering decreasing access from 1 to 3. A column in a dataset might be tagged “business_1” and give access only to users with a business_1 tag on their account. Another column in the same dataset could be tagged “business_3,” offering access to users with any of the three access tags. Data is presented to the user with filters and hashes to obscure data that the user has not been cleared to see.
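The three-tag example can be sketched in a few lines of Python. This is an illustration of the idea only, not how DataOS implements ABAC. The names `LEVELS`, `can_read`, `mask`, and `present` are hypothetical, as are the sample columns, and the use of a truncated SHA-256 digest as the "hash" is an assumption.

```python
import hashlib

# Lower number = broader access, following the business_1..business_3 example.
LEVELS = {"business_1": 1, "business_2": 2, "business_3": 3}


def can_read(user_tag: str, column_tag: str) -> bool:
    """A user may read a column tagged at their own level or any less restricted one."""
    return LEVELS[user_tag] <= LEVELS[column_tag]


def mask(value) -> str:
    """Obscure a value by replacing it with a short SHA-256 digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def present(rows, column_tags, user_tag):
    """Return a copy of the rows with unauthorized columns hashed."""
    return [
        {
            col: (val if can_read(user_tag, column_tags[col]) else mask(val))
            for col, val in row.items()
        }
        for row in rows
    ]


# Hypothetical dataset: one open column and one restricted column.
column_tags = {"region": "business_3", "salary": "business_1"}
rows = [{"region": "EMEA", "salary": 91000}]
```

Here `present(rows, column_tags, "business_1")` returns the rows unchanged, while a user tagged business_3 sees the region in clear text and a 12-character digest in place of the salary. The source data is never modified, and only the view handed to the user differs.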
Attributes like these can be changed, added, or removed immediately as business needs and user access change, either from the DataOS GUI or with a few lines of YAML code through a command-line interface. Attributes can also control data retention and many other governance functions quickly and simply.
DataOS eliminates the problem of data duplication and siloed datasets because users never have to move data out of the datastores in which it resides. The data returned by a query is automatically presented to the user as a virtual dataset, which can be used by analytics and BI tools just as any other dataset would be.
Data is never duplicated or siloed. Each search presents the data as it exists in the most current versions of the source datasets.
DataOS delivers what other platforms only promise: simple, transparent, genuine data democratization. It lets you focus on the cultural changes needed for the new paradigm without worrying about your data management platform sabotaging your efforts.
To find out more about how DataOS can transform your data ecosystem, contact us for a demo today.
Modern data platforms are infamously fragmented and still proliferating. Feature sets are at once duplicative and bewilderingly diverse. In order to find the best solution for a business, we need a set of standard functions that every data platform should fulfill. We can then add specialty functions as needed to identify the platform that best meets our needs.
In a previous post, The Core Principles of a Modern Data Platform, we examined the core design principles to show what users should expect from any modern data platform worthy of the name. You’ll see that those principles give shape to these essential functions.
This article will examine the essential functions of a modern data platform and how DataOS fits in. You will see how DataOS works with common data tools and, in many cases, can replace them.
DataOS is the first true data operating system. It does for your data and data stack what your computer’s operating system does for your files and applications. DataOS facilitates operations between apps, supports and enhances security, and helps your data stack perform better and more consistently. As a result, it interacts with the elements and functions of your data stack in ways that no other product does.
Everything starts with data ingestion: bringing data in from all the data sources you use into the storage systems you’ve designated for each source. This is the beginning of many data pipelines and workflows.
DataOS uses a built-in implementation of Flare to ingest data. The process can be automated and set to occur on a schedule or triggered by events elsewhere in the data stack. The operating system can also automate ingestion by any standard tool.
The data storage and processing layer is fundamental to the modern data platform. This is also one of the most rapidly evolving functions. Currently, three architectures are most common: warehouse, lake, and lakehouse.
Each architecture has its own set of tools to accomplish tasks.
DataOS is storage-agnostic. Any sort of storage volume, from a Snowflake warehouse to the hard drive of a single desktop, is abstracted as a “depot” in DataOS. DataOS works with data from any depot without moving or copying the data.
If you’re using a data warehouse, then you’re probably using a tool that leverages native SQL for transformation. The other common approach is using an orchestration engine coupled with custom transformation in a programming language like Python.
DataOS can work with a transformation tool like dbt or Matillion, passing data to the tool for transformation, then passing the transformed data to another tool for analysis or to any depot in its network. Alternatively, the built-in Flare can write and execute transformation jobs that run on Apache Spark. DataOS supports both batch and incremental workloads.
The latest BI and analytics tools are built to fit within a larger modern data platform. Generally, the goal is self-service data dashboards for individual users who want to manipulate and explore data rather than using static graphs and tables.
DataOS can pass data to a visualization tool such as Tableau, or it can use its native Apache Atlas to build reports and dashboards customized to different stakeholders.
If there is any area where modern data platforms struggle, it is aiding discovery, enabling trust, and bringing context to data. One of the best ways to do this is with rich and editable metadata. In fact, metadata is becoming “big data” in its own right. The main approaches to this function are either open-source tools or proprietary tools developed in-house.
DataOS has Apache Atlas built in for building data catalogs, along with other tools such as Metis for adding and editing metadata. But as with most other cases in this article, DataOS can also work through other tools in your data stack.
Especially with the growth of regulatory frameworks for data privacy (e.g., HIPAA or GDPR), companies must manage privacy and access controls throughout their data stacks. Several tools are emerging to meet this need.
DataOS handles privacy and security through its Attribute-Based Access Control (ABAC). This is one of its most powerful features — you can assign users their access level, as well as any special access privileges, through tags. You can tag a data table, row, or column with access requirements, and from then on users will be given only the data they are authorized to see, automatically and transparently at every touch point in the network.
Download the DataOS Overview and learn how DataOS delivers on all six essential functions of a modern data platform.