Skill Guide

Data lineage, provenance, and residency tracking across AI pipelines

The systematic process of tracking the origin, movement, transformation, and storage location of data throughout the entire AI/ML pipeline lifecycle to ensure auditability, compliance, and data integrity.

This skill is critical for mitigating regulatory risk, ensuring reproducibility of AI models, and maintaining trust in data-driven decisions. It directly impacts business outcomes by preventing costly compliance failures, enabling rapid root cause analysis for model errors, and supporting ethical AI governance.

1 Careers

1 Categories

8.7 Avg Demand

15% Avg AI Risk

How to Learn Data lineage, provenance, and residency tracking across AI pipelines

1. Master core terminology: lineage (data flow), provenance (origin and history), residency (physical/logical location).
2. Understand the components of a standard ML pipeline (data ingestion, preprocessing, feature engineering, model training, deployment).
3. Study basic data cataloging and metadata management concepts.
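To make the three terms concrete, here is a minimal sketch of one metadata record that keeps them apart. The class name, field names, and the region string are illustrative assumptions, not a schema from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; every field name here is an assumption."""
    name: str
    # Provenance: where the data originally came from and what was done to it.
    origin: str
    history: tuple[str, ...] = ()
    # Lineage: the datasets this one was derived from (data-flow edges).
    derived_from: tuple[str, ...] = ()
    # Residency: where the data is physically or logically stored.
    storage_region: str = "unknown"


raw = DatasetRecord(name="customers_raw", origin="CRM export", storage_region="eu-west-1")
clean = DatasetRecord(
    name="customers_clean",
    origin=raw.origin,                       # provenance carries over from the parent
    history=("drop_nulls", "standardize_dates"),
    derived_from=(raw.name,),                # lineage edge back to the raw dataset
    storage_region=raw.storage_region,       # residency is unchanged by the cleaning step
)
```

Keeping the three as separate fields makes it harder to conflate them: a dataset can change region without gaining a new lineage edge, and it can gain a transformation step without changing its origin.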

1. Practice implementing lineage tracking in a simple pipeline using tools like MLflow or OpenLineage.
2. Analyze real-world scenarios where lineage breaks (e.g., data drift from undocumented source changes).
3. Learn to read and create simple data flow diagrams (DFDs) and lineage graphs.

A common mistake at this stage is to track only the code and ignore the data storage layers.

1. Architect enterprise-scale lineage systems that integrate with existing data lakes, warehouses, and MLOps stacks.
2. Design for compliance with specific regulations (GDPR, CCPA, China's Data Security Law, PIPL) by mapping data flows to legal requirements.
3. Mentor teams on embedding lineage practices into CI/CD pipelines and model cards.

Practice Projects

Beginner

Project

Lineage Tracker for a Simple CSV Dataset

Scenario

You have a CSV file containing customer data. You perform cleaning (remove nulls, standardize dates), create a new feature (age group), and train a basic logistic regression model to predict churn.

How to Execute

1. Use Python (pandas) to log each transformation step, input/output file paths, and parameters in a separate metadata file (JSON/YAML).
2. Record the source of the original CSV (e.g., 'CRM export 2023-10-26').
3. Document the final model's location and the exact dataset version used to train it.
4. Draw a simple diagram showing the flow: Raw CSV -> Cleaned CSV -> Featured CSV -> Model.
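The logging steps above can be sketched as follows. This is a minimal illustration that uses only the standard library so it runs anywhere; a pandas transformation would be wrapped the same way. The helper names (file_sha256, log_step, drop_null_rows), the file names, and the sample rows are assumptions made for the example, and using a content hash as the dataset version is one possible choice among several.

```python
import csv
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Content hash, used here as the dataset version identifier."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def log_step(log_path: Path, step: str, inputs: list[Path],
             outputs: list[Path], params: dict) -> dict:
    """Append one transformation record to a JSON metadata file."""
    record = {
        "step": step,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "inputs": [{"path": str(p), "sha256": file_sha256(p)} for p in inputs],
        "outputs": [{"path": str(p), "sha256": file_sha256(p)} for p in outputs],
        "params": params,
    }
    records = json.loads(log_path.read_text()) if log_path.exists() else []
    records.append(record)
    log_path.write_text(json.dumps(records, indent=2))
    return record


def drop_null_rows(src: Path, dst: Path) -> None:
    """Copy src to dst, skipping rows that contain an empty field."""
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if all(value != "" for value in row.values()):
                writer.writerow(row)


# Demo on made-up data in a temporary directory.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "customers_raw.csv"
clean = workdir / "customers_clean.csv"
log = workdir / "lineage.json"
raw.write_text("customer_id,signup_date,age\n1,2021-03-04,34\n2,,51\n3,2022-11-30,27\n")

drop_null_rows(raw, clean)
entry = log_step(log, "drop_null_rows", [raw], [clean],
                 {"source": "CRM export 2023-10-26"})
```

Each later step (date standardization, the age-group feature, model training) would append its own record, so the metadata file ends up describing the full Raw CSV -> Cleaned CSV -> Featured CSV -> Model chain, with a hash that pins the exact dataset version at every hop.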

Intermediate

Project

Integrate OpenLineage with an Airflow ETL and MLflow Pipeline

Scenario

Your team runs a daily Airflow DAG that ingests data from an API, stores it in a PostgreSQL database, transforms it, and triggers an MLflow training run. Auditors need to trace any model prediction back to the source API call.

How to Execute

1. Configure the Airflow OpenLineage integration to emit lineage events for each task (extract, transform, load).
2. In your ML training script, use the OpenLineage Python client to log the dataset inputs and the model output.
3. Use Marquez (the OpenLineage reference backend) to visualize the end-to-end lineage graph.
4. Practice a drill: given a model version, query the lineage system to find all upstream data sources and their run IDs.
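The trace-back drill in the last step boils down to a backward graph walk. The sketch below runs that walk over an in-memory list of simplified run records rather than a live backend. The record shape (run_id, job, inputs, outputs), the dataset URIs, and the run IDs are assumptions for illustration; they are loosely modeled on lineage events but are not the OpenLineage schema or the Marquez API.

```python
from collections import deque

# Simplified, made-up run records: each run reads `inputs` and writes `outputs`.
events = [
    {"run_id": "run-extract-01", "job": "extract_api",
     "inputs": ["api://orders"], "outputs": ["postgres://raw.orders"]},
    {"run_id": "run-transform-01", "job": "transform",
     "inputs": ["postgres://raw.orders"], "outputs": ["postgres://features.orders"]},
    {"run_id": "run-train-01", "job": "train_model",
     "inputs": ["postgres://features.orders"], "outputs": ["mlflow://models/demo/3"]},
]


def upstream(events, target):
    """Walk backwards from `target`; return (root source datasets, run ids visited)."""
    # Index: dataset -> runs that produced it.
    producers = {}
    for event in events:
        for out in event["outputs"]:
            producers.setdefault(out, []).append(event)

    seen = {target}
    run_ids = []
    sources = set()
    queue = deque([target])
    while queue:
        dataset = queue.popleft()
        if dataset not in producers:
            # Nothing in the log produced this dataset, so it is an external source.
            if dataset != target:
                sources.add(dataset)
            continue
        for event in producers[dataset]:
            if event["run_id"] not in run_ids:
                run_ids.append(event["run_id"])
            for parent in event["inputs"]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return sources, run_ids


sources, run_ids = upstream(events, "mlflow://models/demo/3")
```

Here the walk starts at the model version and ends at the API source, collecting the training, transform, and extract run IDs along the way. Against a real lineage backend the same question would be answered by its query interface; the traversal logic is what the drill is meant to exercise.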

Advanced

Project

Design a Data Residency-Aware Multi-Region ML Platform

Scenario

Your global company must train a fraud detection model using European user data (subject to GDPR) and Asian user data (subject to local residency laws). The final model will be deployed in both regions. You must ensure no regulated data leaves its jurisdiction during training or inference.

How to Execute

1. Architect a federated training approach or use region-specific feature stores with strict access controls.
2. Implement metadata and lineage tracking that tags every data asset and model artifact with its permissible jurisdiction(s).
3. Use policy-as-code tools (e.g., OPA) to enforce residency rules in the pipeline orchestration layer.
4. Design an audit trail that proves compliance by showing that data processing occurred only within approved geographic boundaries.
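The tagging and enforcement steps above can be prototyped in plain Python before moving the rules into a policy engine. This is a sketch under stated assumptions: the Asset and Step classes, the region labels, and the treatment of aggregated weight updates as movable between regions are all hypothetical, and the last of these is a legal decision that code cannot make. In production the check would typically be expressed in a policy-as-code tool such as OPA rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A data asset or model artifact tagged with its permitted jurisdictions."""
    name: str
    allowed_regions: frozenset[str]


@dataclass(frozen=True)
class Step:
    """A pipeline step, the region it executes in, and the assets it reads."""
    name: str
    region: str
    inputs: tuple[Asset, ...]


def violations(steps):
    """Return (step, asset) pairs where a step runs outside an input's allowed regions."""
    return [
        (step.name, asset.name)
        for step in steps
        for asset in step.inputs
        if step.region not in asset.allowed_regions
    ]


def derived_regions(inputs):
    """An output may only live where every one of its inputs may live."""
    regions = [asset.allowed_regions for asset in inputs]
    return frozenset.intersection(*regions) if regions else frozenset()


eu_tx = Asset("eu_transactions", frozenset({"eu"}))
apac_tx = Asset("apac_transactions", frozenset({"apac"}))
# Assumption: aggregated model updates have been cleared for both regions.
updates = Asset("model_weight_updates", frozenset({"eu", "apac"}))

plan = [
    Step("train_eu", "eu", (eu_tx,)),
    Step("train_apac", "apac", (apac_tx,)),
    Step("aggregate", "eu", (updates,)),
    Step("bad_join", "eu", (eu_tx, apac_tx)),  # would pull APAC data into the EU
]

found = violations(plan)
joined_regions = derived_regions((eu_tx, apac_tx))
```

Running the check on the plan flags only the step that would read APAC data in the EU, and the empty result of derived_regions shows that a dataset joining both sources has nowhere it is allowed to live. Persisting the output of each check alongside the run metadata is one way to build the audit trail described in the final step.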

Tools & Frameworks

Lineage & Metadata Platforms

OpenLineage, Apache Atlas, Marquez, DataHub (LinkedIn)

These are widely used open-source tools for emitting, collecting, and visualizing data lineage events. Use OpenLineage as a vendor-agnostic event standard, Marquez as its reference backend, Atlas for Hadoop ecosystems, and DataHub for a modern metadata platform.

MLOps & Pipeline Orchestration

MLflow, Kubeflow Pipelines, Apache Airflow, Dagster

These tools have native or plugin-based lineage tracking capabilities. MLflow tracks experiment lineage, Kubeflow and Airflow manage pipeline execution lineage, and Dagster emphasizes software-defined assets with built-in lineage.

Data Governance & Cataloging

Alation, Collibra, AWS Glue Data Catalog, Microsoft Purview (formerly Azure Purview)

These are enterprise data catalogs that are expanding to include ML model lineage. Use them for comprehensive data governance, business glossary integration, and access control that complements technical lineage tracking.

Interview Questions

Answer Strategy

Focus on the systematic, trace-back approach. 'First, I would trace the lineage of the model that failed to identify the exact dataset version and feature engineering code used. Then, I would compare the lineage graph of the failed run to that of a previous successful run and pinpoint the divergence point, most likely a change in an upstream data source or in the transformation logic. For communication, I would present a clear lineage diagram highlighting the breaking change, the impacted model versions, and a proposed fix to the data pipeline.'
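The run-to-run comparison in this answer can be illustrated with a small diff over versioned lineage nodes. This is a sketch only: the node names, the hash and commit strings, and the idea of storing each run's lineage as a flat mapping from node to version are assumptions for the example.

```python
def lineage_diff(good: dict, bad: dict) -> dict:
    """Return {node: (good_version, bad_version)} for every node whose version differs."""
    return {
        node: (good.get(node), bad.get(node))
        for node in sorted(good.keys() | bad.keys())
        if good.get(node) != bad.get(node)
    }


# Pipeline order, upstream first (made-up node names).
order = ["source:crm_export", "code:clean.py", "code:features.py", "dataset:features"]

good_run = {
    "source:crm_export": "sha256:aaa",
    "code:clean.py": "git:1f3e",
    "code:features.py": "git:9b2c",
    "dataset:features": "sha256:bbb",
}
bad_run = {
    "source:crm_export": "sha256:ccc",   # upstream source changed
    "code:clean.py": "git:1f3e",
    "code:features.py": "git:9b2c",
    "dataset:features": "sha256:ddd",    # changed as a consequence
}

changed = lineage_diff(good_run, bad_run)
# The divergence point is the most upstream node that changed.
divergence = next(node for node in order if node in changed)
```

In this made-up case the code versions match across both runs, so the earliest changed node is the source export, which points the investigation at the upstream data rather than at the transformation logic.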

Answer Strategy

This tests communication and business alignment. 'I framed provenance not as technical documentation, but as a business risk mitigation and revenue protection tool. I used a concrete example: showing how provenance could have quickly identified the root cause of a model that started making incorrect recommendations, potentially saving weeks of investigation and lost revenue. I also linked it directly to upcoming regulatory requirements, positioning proactive provenance as a cost-saving measure versus reactive, expensive compliance fixes.'