Data catalogs

How data catalogs are evolving to become "the front door" of the modern data stack.

As adoption of the modern data stack increases, new user needs emerge. I am particularly interested in user needs that exist at the intersection of various tools and layers in the modern data stack. In this post, I take a look at one such user need – data discovery – and explore how data discovery tools are evolving.

Stack overflow

The modern data stack is now a dizzying array of ETL/reverse ETL pipelines, warehouses, lakehouses, data transformation tools, and metrics dashboards. Each of these components contains valuable information. Making the most of it is a function of knowing where to find what you are looking for. This can be challenging when information is spread across disparate systems.

Data catalogs today

This dynamic has led to the rise of data catalogs. Data catalogs sit upstream from your data sources and pipelines, suck in your metadata and act as a reference point for what is going on in your data stack. They act as an aggregation layer on top of your data resources: tables, transformations, queries, aggregated metrics, etc.

Data catalogs unlock a few benefits:

  • Discovery — what exists in our stack?
    Search is the central functionality of a data catalog. It's typical to have several data resources in an org owned by different stakeholders. Having a centralized search layer on top of all these data resources makes it easier for a consumer to know what exists across their data stack.

  • Knowledge management — what does a particular resource mean?
    It's one thing to be able to search for a resource and another to understand its intended use. If data is an asset then the knowledge surrounding data is valuable IP. Data catalogs serve as a documentation hub for agreed-upon descriptions. They pull in existing documentation from the underlying source and allow users to define key terms that make sense to various stakeholders.

  • Collaboration
    Different stakeholders interact at different levels of the modern data stack and have varying needs. For example, a data consumer such as a PM may make a request to a data engineer for a particular set of columns to be included in a particular analysis. Two data engineers may go back and forth on a canonical SQL query. Data catalogs serve as a collaboration hub to streamline these interactions.

Data catalogs tomorrow: a front door to the modern data stack

One framing that helps as a guide for thinking about where data catalogs are today and where they are headed is that of a front door to the modern data stack. That is, a control plane for everything happening in your data stack. We already see the beginnings of this:

  • As data catalogs increase the expected value of all data assets, adoption increases among both data producers and consumers
    Every piece of data produced in an organization has a discounted expected value if it is hard to discover or use. As data catalogs make it easier to find what you are looking for, they become the first reference point for data consumers in an organization and give teams confidence that they are making the most of their data. Additionally, data catalogs passively index your metadata, aggregate data assets across multiple sources, and surface the most relevant results (e.g. via popularity rank) for a given search. As more data consumers in an organization are exposed to this benefit, they will come to rely on data catalogs as the default view for data exploration.

  • Data catalogs are a high recurring interaction node
    Data catalogs are evolving to become not just a reference point but a workflow tool for engineers, data scientists, PMs, and other stakeholders in an organization. They create value at the intersection of different parts of the data stack and different types of stakeholders. As teams adopt more tools in the modern data stack, it is more useful for interactions that reference various data resources to happen inside a data catalog versus in any one tool. Better built-in collaboration tools and aggregation features will continue to compound this value.

If you’re thinking about user needs that emerge out of increased adoption of the modern data stack, reach out at