Skill Guide

Data lineage tracking and provenance documentation

Data lineage tracking is the technical process of mapping the complete lifecycle of data (its origin, transformations, movements, and consumption), while provenance documentation is the formal, auditable record of this journey, including the 'who, what, when, and why' of every change.

It is the foundational control for data trust, regulatory compliance (GDPR, CCPA, SOX), and impact analysis, directly reducing risk and enabling confident, data-driven decision-making. It transforms data from a potential liability into a verifiable asset, crucial for audit readiness and operational efficiency.


How to Learn Data lineage tracking and provenance documentation

1. **Core Terminology:** Master terms like source/target, transformation, ETL/ELT, metadata, and data catalog.
2. **Diagramming Basics:** Practice manually drawing data flows for simple pipelines (e.g., CSV to database to dashboard) using tools like Lucidchart or draw.io.
3. **Basic SQL & Metadata Queries:** Learn to query system tables (e.g., in PostgreSQL, Snowflake) to trace table creation and dependencies.
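As a rough illustration of the metadata-query idea, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL or Snowflake (SQLite's `sqlite_master` table plays the role of the system catalog). The table and view names are hypothetical, and the dependency detection is a naive text match rather than a real SQL parser.

```python
import sqlite3

# In-memory database with two hypothetical tables and one view built on them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marketing_spend (campaign TEXT, cost REAL);
    CREATE TABLE campaign_performance (campaign TEXT, revenue REAL);
    CREATE VIEW roi AS
        SELECT p.campaign, (p.revenue - s.cost) / s.cost AS roi
        FROM campaign_performance AS p
        JOIN marketing_spend AS s ON s.campaign = p.campaign;
""")


def view_dependencies(conn):
    """Map each view to the tables mentioned in its stored SQL definition."""
    rows = conn.execute("SELECT type, name, sql FROM sqlite_master").fetchall()
    tables = [name for kind, name, _ in rows if kind == "table"]
    deps = {}
    for kind, name, sql in rows:
        if kind == "view":
            # Naive substring match on the view's SQL text.
            deps[name] = sorted(t for t in tables if t in sql.lower())
    return deps


print(view_dependencies(conn))
```

On a production warehouse the same question is usually answered from the platform's own catalog views rather than by matching SQL text.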

1. **Tool Proficiency:** Implement lineage in a dedicated tool like Apache Atlas, OpenLineage, or a cloud-native solution (AWS Glue, Azure Purview, Google Dataplex).
2. **Pipeline Integration:** Embed lineage capture into a real ETL job (e.g., using dbt, Apache Spark) to automate metadata collection.
3. **Common Pitfalls:** Avoid assuming lineage is only for SQL; address incomplete lineage from manual processes, dark data, and real-time streams.
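To make "lineage capture" concrete, here is a minimal sketch of the kind of run event an OpenLineage-compatible backend accepts, built with the standard library only instead of the official client. The job, namespace, dataset, and producer values are hypothetical, and the spec version in `schemaURL` is an assumption that should be checked against the backend in use.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "example_pipeline", "name": job_name},
        "inputs": [{"namespace": "example_warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "example_warehouse", "name": n} for n in outputs],
        # Hypothetical producer URI identifying the emitting code.
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; adjust to the one your backend expects.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event("COMPLETE", "load_orders", ["raw.orders"], ["staging.orders"])
print(json.dumps(event, indent=2))
```

In practice an integration (for dbt, Spark, or Airflow) emits these events automatically; building one by hand is mainly useful for understanding what the backend stores.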

1. **Architectural Strategy:** Design a cross-platform, unified lineage strategy that spans on-prem, multi-cloud, and SaaS applications.
2. **Governance Integration:** Link lineage to a formal data governance framework, establishing stewardship and quality rules.
3. **Mentorship & Evangelism:** Champion lineage's ROI to business stakeholders, translating technical maps into business impact narratives for risk and compliance teams.

Practice Projects

Beginner

Project

Manual Lineage Mapping for a Marketing Dashboard

Scenario

You have a CSV file of 'Marketing_Spend', a SQL database with 'Campaign_Performance', and a Tableau dashboard showing 'ROI'. Your task is to trace how the final ROI metric is calculated.

How to Execute

1. Interview the marketing analyst to identify all source files and SQL queries.
2. Document the transformation logic (e.g., joining tables, calculating ROI as (Revenue-Cost)/Cost).
3. Create a visual diagram showing the flow from CSV → SQL → Tableau, noting each step's owner and timing.
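Before drawing the diagram, it can help to record each hop as structured data. The sketch below is one possible shape for that record, assuming a simple linear flow; the owners and schedules are invented placeholders, and `roi` just encodes the formula from step 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    source: str
    target: str
    transformation: str
    owner: str      # hypothetical owner, filled in from the interview
    schedule: str   # hypothetical timing, filled in from the interview


steps = [
    LineageStep("Marketing_Spend (CSV)", "Campaign_Performance (SQL)",
                "load CSV and join on campaign", "marketing analyst", "daily"),
    LineageStep("Campaign_Performance (SQL)", "ROI (Tableau)",
                "ROI = (Revenue - Cost) / Cost", "BI developer", "hourly refresh"),
]


def roi(revenue, cost):
    """ROI as documented in the transformation logic: (Revenue - Cost) / Cost."""
    if cost == 0:
        raise ValueError("ROI is undefined when cost is zero")
    return (revenue - cost) / cost


def trace(steps, target):
    """Walk backward from a target to its origin; assumes a linear, acyclic flow."""
    by_target = {s.target: s for s in steps}
    path = [target]
    while path[0] in by_target:
        path.insert(0, by_target[path[0]].source)
    return path


print(" -> ".join(trace(steps, "ROI (Tableau)")))
print(roi(1500.0, 1000.0))
```

A record like this doubles as the legend for the visual diagram, since every arrow has a named transformation, owner, and schedule.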

Intermediate

Project

Implement Automated Lineage in a dbt Project

Scenario

You have a dbt project transforming raw Salesforce data into analytics models. You need to automatically capture lineage between source tables, staging models, and final marts.

How to Execute

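1. Generate dbt's built-in metadata artifacts (such as manifest.json and catalog.json) so that model and source dependencies are recorded.
2. Set up an OpenLineage-compatible lineage backend (e.g., Marquez).
3. Run dbt build with the OpenLineage dbt integration enabled and verify the lineage graph in the Marquez UI (OpenLineage itself is a specification and has no UI of its own).
4. Document how to investigate a broken upstream dependency.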
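For investigating a broken upstream dependency without a lineage UI, dbt's `manifest.json` already contains a `parent_map` keyed by node ID. The following is a small sketch that walks that map; the project name `crm` and the model names are hypothetical, and the default path assumes dbt's usual `target/` output directory.

```python
import json
from pathlib import Path


def load_parent_map(manifest_path="target/manifest.json"):
    """Read the parent_map section from a dbt manifest file."""
    return json.loads(Path(manifest_path).read_text())["parent_map"]


def upstream(parent_map, node_id):
    """Return every direct and transitive parent of node_id."""
    seen = set()
    stack = [node_id]
    while stack:
        current = stack.pop()
        for parent in parent_map.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hand-written stand-in for a real manifest's parent_map (hypothetical IDs).
example_parent_map = {
    "model.crm.mart_accounts": ["model.crm.stg_accounts"],
    "model.crm.stg_accounts": ["source.crm.salesforce.account"],
    "source.crm.salesforce.account": [],
}

print(sorted(upstream(example_parent_map, "model.crm.mart_accounts")))
```

Against a real project, `upstream(load_parent_map(), "<node id>")` would list everything a failing mart depends on, which narrows down where to look first.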

Advanced

Case Study/Exercise

Lineage-Driven Root Cause Analysis for a Regulatory Breach

Scenario

A financial regulator questions a reported quarterly profit figure. You suspect a data pipeline error in the consolidation layer. You must rapidly trace the data's path to identify the point of failure and produce an auditable report.

How to Execute

1. Use the enterprise data catalog (e.g., Collibra) to immediately pull the lineage graph for the 'Quarterly_Profit' metric.
2. Identify all upstream dependencies and the transformation code for each node.
3. Systematically check logs and metadata timestamps for each transformation step to pinpoint the anomaly (e.g., a failed job, incorrect join).
4. Compile a provenance report showing the correct vs. actual data flow, with timestamps and owners, for the regulator.
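The tracing in steps 2 and 3 can be sketched as a breadth-first walk over the lineage graph that flags failed runs and nodes that ran before one of their inputs finished. Everything below is illustrative: the node names other than 'Quarterly_Profit', the run statuses, and the timestamps are made up, and a real investigation would read them from the catalog and job logs.

```python
from collections import deque

# Hypothetical lineage: each node maps to its direct upstream dependencies.
upstream = {
    "Quarterly_Profit": ["consolidated_ledger"],
    "consolidated_ledger": ["emea_ledger", "apac_ledger"],
    "emea_ledger": [],
    "apac_ledger": [],
}

# Hypothetical last-run metadata (ISO timestamps in one format compare as strings).
last_run = {
    "Quarterly_Profit": {"status": "success", "finished": "2024-04-02T06:00"},
    "consolidated_ledger": {"status": "success", "finished": "2024-04-02T05:00"},
    "emea_ledger": {"status": "success", "finished": "2024-04-02T04:00"},
    "apac_ledger": {"status": "failed", "finished": "2024-04-02T05:30"},
}


def find_suspects(metric, upstream, last_run):
    """Walk upstream from the metric, collecting (node, reason) pairs."""
    suspects = []
    seen = {metric}
    queue = deque([metric])
    while queue:
        node = queue.popleft()
        run = last_run[node]
        if run["status"] != "success":
            suspects.append((node, "last run " + run["status"]))
        for parent in upstream.get(node, []):
            # A node that finished before its input did was built on stale data.
            if last_run[parent]["finished"] > run["finished"]:
                suspects.append((node, f"ran before upstream {parent} finished"))
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return suspects


for node, reason in find_suspects("Quarterly_Profit", upstream, last_run):
    print(f"{node}: {reason}")
```

The resulting list of nodes and reasons is a natural starting point for the provenance report in step 4, once owners and corrected timestamps are attached.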

Tools & Frameworks

Software & Platforms (Lineage-Specific)

- OpenLineage + Marquez
- Apache Atlas
- AWS Glue DataBrew / Azure Purview / Google Dataplex Catalog
- dbt (with metadata)
- DataHub (LinkedIn)

OpenLineage is an open standard for lineage events; Marquez is its reference implementation. Cloud catalogs are essential for native cloud estates. dbt is widely used for lineage within SQL-based analytics transformations. Apache Atlas and DataHub are robust open-source platforms for Hadoop/hybrid ecosystems.

Governance & Metadata Frameworks

- Data Management Body of Knowledge (DMBOK)
- COBIT for IT Governance
- FAIR Data Principles

DMBOK provides the formal processes for metadata management. COBIT aligns data governance (including provenance) with enterprise IT governance and audit controls. FAIR principles guide the design of findable, accessible, interoperable, and reusable data, which lineage enables.

Interview Questions

Answer Strategy

Use the STAR method (Situation, Task, Action, Result). Focus on the systematic process: 1) Starting from the point of failure, 2) Using lineage tools or manual tracing backward, 3) Isolating the transformation or source layer, 4) Validating the fix. Example: 'In my last role, a sales report showed a 15% discrepancy. Starting from the report metric, I used our Azure Purview lineage graph to identify all upstream tables. I then manually checked the ETL job logs for that date and found a failed join in a Spark job that was silently dropping records. I corrected the code, backfilled the data, and documented the fix in our data catalog.'

Answer Strategy

This question tests understanding of ML-specific lineage (data, code, environment, and model). The answer must cover: 1) Data Provenance: tracking source datasets, versions, and transformations (feature stores). 2) Code Provenance: version control for scripts and models (Git). 3) Environment Provenance: containerization (Docker) and dependency management. 4) Model Provenance: logging training runs, hyperparameters, and final artifacts. A sample answer: 'I'd architect it with four pillars: data lineage via DVC or Delta Lake time travel, code lineage via Git commits linked to training runs, environment lineage via Docker images, and model lineage via MLflow or Weights & Biases to log all parameters and metrics. This creates a complete, immutable audit trail from raw data to deployed model.'
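The four pillars can be illustrated with a single provenance record per training run. The sketch below uses only the standard library as a stand-in for tools such as DVC or MLflow; the commit hash, image tag, parameters, and metrics are placeholder values, and hashing the record is just one simple way to make tampering detectable.

```python
import hashlib
import json


def provenance_record(data, git_commit, docker_image, params, metrics):
    """Bundle data, code, environment, and model provenance into one record."""
    record = {
        "data": {"sha256": hashlib.sha256(data).hexdigest()},
        "code": {"git_commit": git_commit},
        "environment": {"docker_image": docker_image},
        "model": {"params": params, "metrics": metrics},
    }
    # Content-derived ID: any change to the four pillars changes the ID.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_id"] = hashlib.sha256(canonical).hexdigest()
    return record


# Placeholder values for illustration only.
rec = provenance_record(
    data=b"feature_a,feature_b,label\n1,2,0\n",
    git_commit="0000000",
    docker_image="example/trainer:placeholder",
    params={"learning_rate": 0.01},
    metrics={"accuracy": 0.9},
)
print(json.dumps(rec, indent=2))
```

A dedicated experiment tracker stores the same kinds of fields with far more detail; the point of the sketch is only that each run ties together one dataset version, one commit, one image, and one set of parameters.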