Skill Guide

Data cataloging, lineage tracking, and metadata management

The systematic process of discovering, organizing, documenting, and governing an organization's data assets, their origins, transformations, and associated business context to ensure findability, understanding, and trustworthiness.

This skill is critical for data-driven decision-making, regulatory compliance (e.g., GDPR, CCPA), and operational efficiency. It directly impacts business outcomes by reducing data discovery time, enabling accurate impact analysis for changes, and building a foundation for reliable analytics and AI initiatives.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn Data cataloging, lineage tracking, and metadata management

1. **Core Terminology**: Master definitions of metadata (technical, operational, business), data lineage (procedural vs. declarative), and data catalog vs. data dictionary.
2. **Conceptual Framework**: Understand the three pillars: Cataloging (the 'what'), Lineage (the 'where'), and Governance (the 'so what').
3. **Hands-on Exploration**: Use open-source tools like Apache Atlas or DataHub to explore a sample dataset and document its schema, descriptions, and owners.

1. **Implement in a Staging Environment**: Build a proof-of-concept catalog for a specific data pipeline (e.g., a sales ETL job). Focus on capturing technical lineage automatically and enriching it with business glossary terms.
2. **Tackle Common Pitfalls**: Avoid 'catalog rot' by establishing clear ownership and update cadences. Learn to differentiate between physical and logical lineage.
3. **Scenario Practice**: Perform impact analysis by tracing where a change to a source column propagates. Audit data quality issues by backtracking through lineage to the origin.
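The impact-analysis exercise can be sketched as a graph traversal. The example below is a minimal illustration, not the API of any particular catalog: the `LINEAGE` mapping and all column names are hypothetical, and a real catalog would supply these edges from harvested metadata.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns listed as its value.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": [
        "marts.sales.daily_revenue",
        "marts.finance.refunds",
    ],
    "marts.sales.daily_revenue": ["dashboard.revenue.monthly_total"],
}


def downstream_impact(lineage, source):
    """Return every asset reachable from `source`, in breadth-first order."""
    seen = {source}  # guards against cycles in the lineage graph
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_impact(LINEAGE, "raw.orders.amount")
for asset in impacted:
    print(asset)
```

Breadth-first order lists the nearest dependents first, which is usually the order in which owners need to be notified of a change.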

1. **Architect Scalable Solutions**: Design federated governance models where domain teams own their metadata, while a central platform team maintains the infrastructure and standards.
2. **Strategic Integration**: Embed lineage and cataloging into CI/CD pipelines for data infrastructure (DataOps). Link data catalog APIs to BI tools and ticketing systems.
3. **Executive Communication**: Develop metrics to measure catalog adoption and data quality improvements. Mentor engineers on building 'metadata-as-code' practices within their development workflow.
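One simple adoption metric is documentation coverage: the share of catalog entries that carry the fields you consider essential. The sketch below assumes entries are plain dictionaries and that `owner` and `description` are the required fields; both the entry shape and the sample data are hypothetical.

```python
def documentation_coverage(entries, required=("owner", "description")):
    """Fraction of entries whose required fields are all non-empty."""
    if not entries:
        return 0.0
    complete = sum(
        1
        for entry in entries
        if all(str(entry.get(field) or "").strip() for field in required)
    )
    return complete / len(entries)


# Hypothetical catalog export.
catalog = [
    {"name": "customers", "owner": "crm-team", "description": "One row per customer."},
    {"name": "orders", "owner": "", "description": "One row per order."},
    {"name": "events", "owner": "web-team", "description": None},
    {"name": "refunds", "owner": "finance", "description": "Refund ledger."},
]

print(f"Documentation coverage: {documentation_coverage(catalog):.0%}")
```

Tracked over time, a number like this gives executives a trend line rather than a one-off snapshot.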

Practice Projects

Beginner

Project

Catalog a Single Database Table with Full Context

Scenario

You are given a `customer_transactions` table in a PostgreSQL database. Your task is to create a comprehensive, searchable entry for it in a data catalog.

How to Execute

1. Extract the technical metadata (schema, column names, data types, constraints) using SQL queries against `information_schema` or a tool like `pg_dump --schema-only`.
2. Document business metadata: define the table's purpose, the business owner, the data steward, and key column definitions (e.g., `txn_amount` = total in USD).
3. Document operational metadata: describe the ETL job that populates it, its schedule, and the source system(s).
4. Upload all metadata to an open-source catalog (e.g., DataHub) or a structured Confluence/SharePoint page, ensuring tags and glossary terms are linked.
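Steps 1 to 3 can be sketched as follows. The query reads standard columns of PostgreSQL's `information_schema.columns` view; the `%s` placeholders assume a psycopg-style driver. To keep the example runnable without a database, the query is not executed and `sample_rows` stands in for its result. Apart from `customer_transactions` and `txn_amount`, every column, team, and job name is hypothetical.

```python
import json

# Technical metadata query (psycopg-style placeholders are an assumption).
COLUMN_QUERY = """
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
"""


def build_catalog_entry(table, column_rows, business, operational, column_docs=None):
    """Combine technical, business, and operational metadata into one entry."""
    column_docs = column_docs or {}
    columns = [
        {
            "name": name,
            "type": data_type,
            "nullable": is_nullable == "YES",  # information_schema uses 'YES'/'NO'
            "description": column_docs.get(name, ""),
        }
        for name, data_type, is_nullable in column_rows
    ]
    return {
        "table": table,
        "technical": {"columns": columns},
        "business": business,
        "operational": operational,
    }


# Stand-in for the rows COLUMN_QUERY would return (hypothetical schema).
sample_rows = [
    ("txn_id", "bigint", "NO"),
    ("customer_id", "bigint", "NO"),
    ("txn_amount", "numeric", "NO"),
    ("txn_ts", "timestamp with time zone", "YES"),
]

entry = build_catalog_entry(
    "customer_transactions",
    sample_rows,
    business={"purpose": "Record of customer purchases", "owner": "sales-ops", "steward": "data-governance"},
    operational={"etl_job": "load_customer_transactions", "schedule": "daily", "sources": ["pos_system"]},
    column_docs={"txn_amount": "Total in USD"},
)
print(json.dumps(entry, indent=2))
```

The resulting JSON document can be pasted into a wiki page as-is or mapped onto whatever ingestion format the chosen catalog expects.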

Intermediate

Project

Trace and Document a Data Pipeline's Lineage End-to-End

Scenario

A BI dashboard showing 'Monthly Active Users' is reporting incorrect numbers. You must trace the data from the dashboard back to its source to identify the root cause.

How to Execute

1. Start at the dashboard metric and identify the underlying SQL query or semantic model (e.g., in Looker or Power BI).
2. Use a lineage tool (or manual SQL parsing) to map the tables, views, and transformations used in that query.
3. Visually diagram the lineage, noting any filters, joins, or aggregations.
4. Walk the lineage backward, verifying logic at each stage. Identify discrepancies (e.g., a filter excluding a user segment incorrectly). Present your findings in a document that includes the lineage diagram and the root cause analysis.
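The backward walk in step 4 can be sketched as a traversal over an upstream map in which every edge carries a note about the transformation applied. The `UPSTREAM` structure, the table names, and the transformation notes below are all hypothetical; in practice they would come from a lineage tool or from parsing the SQL by hand.

```python
# Hypothetical upstream map: asset -> list of (parent asset, transformation note).
UPSTREAM = {
    "dashboard.monthly_active_users": [
        ("marts.user_activity_monthly", "COUNT(DISTINCT user_id) per month"),
    ],
    "marts.user_activity_monthly": [
        ("staging.events", "filter: event_type = 'login'"),
        ("staging.users", "inner join on user_id"),
    ],
    "staging.events": [("raw.app_events", "deduplicate on event_id")],
    "staging.users": [("raw.crm_users", "filter: is_internal = false")],
}


def trace_upstream(upstream, target):
    """Return (child, parent, step) hops walking from `target` back to its sources."""
    hops = []
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        for parent, step in parents:
            hops.append((node, parent, step))
        # Reverse so the first-listed parent is explored first.
        stack.extend(parent for parent, _ in reversed(parents))
    return hops


hops = trace_upstream(UPSTREAM, "dashboard.monthly_active_users")
for child, parent, step in hops:
    print(f"{child} <- {parent}  [{step}]")

# Assets with no recorded parents are the original sources.
sources = sorted({parent for _, parent, _ in hops if parent not in UPSTREAM})
print("Sources:", sources)
```

Each printed hop is one stage to verify. In this made-up graph, the inner join and the login-only filter would be the first places to check for users being dropped.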

Advanced

Case Study/Exercise

Design a Governance Framework for a New Data Mesh Domain

Scenario

Your company is adopting Data Mesh. The 'Customer' domain is launching. You must define the metadata governance model that ensures discoverability and interoperability while preserving domain autonomy.

How to Execute

1. **Define Federated Ownership**: Specify that the Customer domain team owns their data product metadata, but must adhere to central standards (e.g., mandatory fields: owner, SLA, refresh frequency).
2. **Standardize Contract & Lineage**: Mandate that all data products expose a standardized data contract (schema, semantics) and a declarative lineage API.
3. **Integrate with Central Platform**: Configure the central data catalog to automatically ingest metadata via these APIs. Define cross-domain linking rules (e.g., how a 'Customer ID' is linked across domains).
4. **Establish Auditing & Compliance**: Set up automated checks to ensure metadata completeness and lineage connectivity. Create a process for handling GDPR right-to-erasure requests by leveraging lineage to identify all downstream datasets.
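The automated completeness check in step 4 might look like the sketch below. It assumes data product metadata arrives as dictionaries and uses the three mandatory fields named in step 1; the field keys (`owner`, `sla`, `refresh_frequency`) and the sample products are hypothetical.

```python
# Mandatory fields from the central standard (key names are an assumption).
MANDATORY_FIELDS = ("owner", "sla", "refresh_frequency")


def missing_fields(product, mandatory=MANDATORY_FIELDS):
    """List the mandatory fields that are absent or blank on one data product."""
    return [f for f in mandatory if not str(product.get(f) or "").strip()]


def audit(products):
    """Map product name -> missing mandatory fields, for non-compliant products only."""
    report = {}
    for product in products:
        gaps = missing_fields(product)
        if gaps:
            report[product["name"]] = gaps
    return report


# Hypothetical data products published by the Customer domain.
products = [
    {"name": "customer_profile", "owner": "customer-domain", "sla": "99.9%", "refresh_frequency": "hourly"},
    {"name": "customer_segments", "owner": "customer-domain", "sla": "", "refresh_frequency": "daily"},
    {"name": "customer_churn_scores", "owner": None},
]

report = audit(products)
for name, gaps in report.items():
    print(f"{name}: missing {', '.join(gaps)}")
```

Run in CI, a non-empty report could fail the build, which keeps enforcement central while leaving the content of each field to the domain team.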

Tools & Frameworks

Software & Platforms

Apache Atlas, DataHub (LinkedIn), Collibra, Alation, AWS Glue Data Catalog

Open-source solutions (Atlas, DataHub) are well suited to building custom catalogs. Commercial platforms (Collibra, Alation) offer robust governance workflows, business glossaries, and stewardship tools out of the box. Cloud-native services (e.g., AWS Glue) are tightly integrated with their respective ecosystems for automatic technical metadata harvesting.

Standards & Methodologies

OpenMetadata Standard, W3C DCAT (Data Catalog Vocabulary), Data Mesh Principles, Active Metadata Management

OpenMetadata and DCAT provide interoperability schemas for metadata. Data Mesh is a socio-technical framework that mandates domain-oriented, self-serve data products with embedded metadata. Active Metadata Management is a paradigm shift from static documentation to metadata that drives automation (e.g., auto-classifying PII, triggering pipeline alerts).
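As an illustration of metadata driving automation, the sketch below tags likely PII columns from their names alone. The patterns and tag labels are hypothetical and deliberately small. Name matching is only a first-pass heuristic and will produce both false positives and misses, so treat the output as suggestions for a steward to confirm.

```python
import re

# Hypothetical name-based rules; extend or replace to match local naming conventions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "NATIONAL_ID": re.compile(r"ssn|national[-_]?id|passport", re.IGNORECASE),
}


def classify_columns(columns):
    """Map each column name to the PII tags whose pattern matches it."""
    tags = {}
    for column in columns:
        matched = [tag for tag, pattern in PII_PATTERNS.items() if pattern.search(column)]
        if matched:
            tags[column] = matched
    return tags


suggested = classify_columns(["customer_email", "txn_amount", "mobile_number", "passport_no"])
for column, tags in suggested.items():
    print(f"{column}: {', '.join(tags)}")
```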

Interview Questions

Answer Strategy

The interviewer is testing for prioritization, pragmatism, and an understanding of change management. The answer should focus on a phased, value-driven approach, not just tooling. **Sample Answer**: 'I would not start by buying a tool. First, I'd partner with BI and analytics leaders to identify 3-5 high-pain, high-visibility data domains (e.g., Customer, Revenue). I would then launch a targeted, manual documentation sprint for these domains using a simple template, focusing on business context and ownership. Simultaneously, I'd evaluate and deploy a lightweight catalog tool (like DataHub) to house this curated content. The goal for 90 days is to have a highly usable, albeit narrow, catalog that solves a real pain point for key stakeholders, creating advocates for further rollout.'

Answer Strategy

This tests technical empathy and the ability to frame benefits in engineering terms. **Sample Answer**: 'I'd acknowledge their point: the code is truth. But I'd argue that lineage is the *map* to that truth, which is essential for debugging, onboarding, and impact analysis at scale. I would propose integrating lineage generation directly into their CI/CD pipeline using tools like `dbt` or Airflow's lineage API. By making it an automated byproduct of their existing workflow, where they review and approve the generated lineage as part of a pull request, we transform it from a chore into a valuable artifact that improves system observability and reduces their own support burden.'