Learning Roadmap
How to Become an AI Data Lineage Analyst
A step-by-step, phase-based learning path from beginner to job-ready AI Data Lineage Analyst. Estimated completion: 6 months across 6 phases.
-
Foundations: Data Ecosystems & SQL Mastery
4 weeks
Goals
- Achieve fluency in SQL across at least two major dialects (Postgres, Snowflake, or BigQuery)
- Understand relational and columnar data storage, partitioning, and schema evolution
- Learn core data concepts: ETL vs ELT, data warehouses, data lakes, and lakehouse architectures
Resources
- Mode Analytics SQL Tutorial (free)
- Fundamentals of Data Engineering by Joe Reis & Matt Housley
- Snowflake Hands-On Essentials lab series
- dbt Learn documentation and tutorials
Milestone
You can write complex SQL queries, explain data warehouse architecture, and set up a local dbt project with source and staging models.
-
Data Lineage Concepts & Metadata Fundamentals
4 weeks
Goals
- Understand what data lineage is, why it matters, and the difference between technical and business lineage
- Learn metadata management principles: technical metadata, operational metadata, business metadata
- Explore the OpenLineage specification and its integrations with Airflow, dbt, and Spark
Resources
- OpenLineage documentation and GitHub examples
- DataHub Getting Started Guide (LinkedIn open-source)
- Fundamentals of Data Catalogs (O'Reilly report)
- Practical Data Lineage blog series by DataKitchen
Milestone
You can set up OpenLineage with a local Airflow instance, extract lineage events, and visualize them in Marquez.
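To make "lineage events" concrete, here is a minimal sketch of what an OpenLineage-style run event looks like when built by hand as a plain dictionary. The field names follow the core of the OpenLineage run event model as I understand it; real events also carry a schemaURL and optional facets, which are omitted here, so check the current specification before relying on this shape. The namespace, job name, dataset names, and producer URI are all placeholders invented for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like a (simplified) OpenLineage run event.

    Assumption: namespace and producer are placeholder values, and
    facets/schemaURL are left out to keep the sketch short.
    """
    namespace = "example-namespace"  # hypothetical namespace
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/my-lineage-script",  # placeholder URI
    }


# One job run that reads one dataset and writes another.
event = build_run_event(
    "COMPLETE", "daily_orders_load", ["raw.orders"], ["staging.orders"]
)
print(json.dumps(event, indent=2))
```

In practice the Airflow, dbt, and Spark integrations emit these events for you; building one by hand is mainly useful for understanding what Marquez receives.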
-
ML Pipelines & AI-Specific Lineage
5 weeks
Goals
- Learn ML pipeline components: data ingestion, feature engineering, training, evaluation, deployment
- Understand feature stores (Feast, Tecton) and model registries (MLflow, Weights & Biases)
- Study how RAG pipelines work: document ingestion, chunking, embedding, vector store, retrieval, generation
Resources
- Made With ML by Goku Mohandas (open-source MLOps course)
- MLflow documentation and tutorials
- LangChain documentation: Retrieval and Memory modules
- Feature Store for ML (O'Reilly article by Mike Del Balso)
Milestone
You can trace a dataset from raw CSV through feature engineering to an MLflow-registered model and identify every transformation step.
-
Governance, Regulation & Compliance Frameworks
3 weeks
Goals
- Study GDPR Articles 15, 17, and 22 as they relate to automated decision-making and data traceability
- Learn the EU AI Act risk classification system and its data documentation requirements
- Understand the NIST AI Risk Management Framework (AI RMF) and its MAP function
- Explore industry-specific regulations: BCBS 239 (finance), HIPAA (healthcare), SOX (audit)
Resources
- EU AI Act full text (EUR-Lex)
- NIST AI Risk Management Framework 1.0
- GDPR.eu practical guides
- Responsible AI Practices by Google (free documentation)
Milestone
You can map regulatory requirements to specific lineage artifacts and produce an audit-ready data flow diagram with compliance annotations.
-
Advanced Tooling & Graph-Based Lineage
4 weeks
Goals
- Implement lineage storage in a graph database (Neo4j) with queryable relationships
- Build automated lineage extraction pipelines using Python, APIs, and AST parsing
- Integrate lineage dashboards with data quality monitoring (Great Expectations, Monte Carlo)
Resources
- Neo4j Graph Data Science documentation
- Great Expectations tutorial and gallery
- Astroid library for Python AST parsing
- Custom OpenLineage transport development guide
Milestone
You can build a production-quality lineage pipeline that extracts metadata from dbt, Airflow, and MLflow into a Neo4j graph with a queryable API.
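As a small illustration of lineage extraction via AST parsing, the sketch below scans a Python script for read and write calls and reports the file paths involved. It uses the standard library `ast` module rather than Astroid, and it assumes a deliberately narrow convention: the path is a string literal passed as the first positional argument to a method named like `read_csv` or `to_csv`. The sample script and the sets of method names are illustrative assumptions, not a complete extractor.

```python
import ast

# Sample pipeline script to analyze; it is only parsed, never executed.
SOURCE = '''
import pandas as pd
orders = pd.read_csv("raw/orders.csv")
customers = pd.read_csv("raw/customers.csv")
joined = orders.merge(customers, on="customer_id")
joined.to_csv("staging/orders_enriched.csv")
'''

# Assumed method names that mark a dataset read or write.
READ_CALLS = {"read_csv", "read_parquet"}
WRITE_CALLS = {"to_csv", "to_parquet"}


def extract_io(source):
    """Return (inputs, outputs) as sorted lists of literal paths found in calls."""
    inputs, outputs = [], []
    for node in ast.walk(ast.parse(source)):
        # Only consider attribute-style calls such as pd.read_csv(...) or df.to_csv(...).
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if not node.args:
            continue
        first = node.args[0]
        # Skip paths built dynamically; only string literals are resolvable statically.
        if not (isinstance(first, ast.Constant) and isinstance(first.value, str)):
            continue
        if node.func.attr in READ_CALLS:
            inputs.append(first.value)
        elif node.func.attr in WRITE_CALLS:
            outputs.append(first.value)
    return sorted(inputs), sorted(outputs)


inputs, outputs = extract_io(SOURCE)
print("inputs:", inputs)
print("outputs:", outputs)
```

Static parsing like this misses paths assembled at runtime, which is one reason to combine it with runtime events such as those emitted through OpenLineage.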
-
Portfolio Projects & Industry Readiness
4 weeks
Goals
- Complete 2-3 end-to-end portfolio projects demonstrating lineage across different pipeline types
- Prepare for interviews with scenario-based practice and regulatory knowledge
- Contribute to an open-source lineage project (OpenLineage, DataHub, or OpenMetadata)
Resources
- Your own GitHub portfolio repository
- OpenLineage GitHub issues (good-first-issue label)
- Mock interview platforms and Data Engineering Discord communities
- Conference talk recordings from Data Council, dbt Coalesce, and OpenLineage Community Days
Milestone
You have a polished GitHub portfolio, a published blog post on lineage best practices, and can confidently interview for AI Data Lineage Analyst roles.
Practice Projects
Apply your skills with hands-on projects. Each project is labeled with its difficulty level.
End-to-End Lineage Tracker for an ML Pipeline
Beginner
Build a simple ML pipeline (data ingestion → feature engineering → model training → evaluation) using Airflow and scikit-learn, then implement OpenLineage integration to automatically capture and visualize lineage in Marquez. Produce a lineage diagram showing every dataset, transformation, and model artifact.
dbt + DataHub Column-Level Lineage Implementation
Intermediate
Create a dbt project with 30+ models across staging, intermediate, and mart layers. Ingest the dbt project into LinkedIn DataHub and configure column-level lineage tracking. Write a Python script that queries the DataHub API to generate a column-level lineage report for any given mart table.
RAG Pipeline Provenance Tracker
Advanced
Build a LangChain RAG pipeline over a set of PDF documents using ChromaDB. Implement custom LangChain callbacks that log every retrieval event (which chunks were retrieved, similarity scores, source documents) and store provenance metadata in a PostgreSQL lineage database. Create a dashboard that traces any chatbot response back to its source documents.
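The provenance store at the heart of this project can be sketched without any RAG framework. The example below is a minimal sketch that uses the standard library `sqlite3` module as a stand-in for PostgreSQL and does not use any LangChain API; in a real build, `log_retrieval` would be called from whatever retrieval callback your framework provides. The table layout, function names, and sample chunks are all hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the provenance store (SQLite here, standing in for PostgreSQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS retrieval_events (
               response_id TEXT, chunk_id TEXT, source_doc TEXT, score REAL)"""
    )
    return conn


def log_retrieval(conn, response_id, chunks):
    """Record every chunk that was retrieved while producing one response."""
    conn.executemany(
        "INSERT INTO retrieval_events VALUES (?, ?, ?, ?)",
        [(response_id, c["chunk_id"], c["source_doc"], c["score"]) for c in chunks],
    )
    conn.commit()


def sources_for(conn, response_id):
    """Trace a response back to the distinct source documents behind it."""
    rows = conn.execute(
        "SELECT DISTINCT source_doc FROM retrieval_events "
        "WHERE response_id = ? ORDER BY source_doc",
        (response_id,),
    )
    return [row[0] for row in rows]


conn = open_store()
log_retrieval(
    conn,
    "resp-001",  # hypothetical response identifier
    [
        {"chunk_id": "c1", "source_doc": "policy.pdf", "score": 0.91},
        {"chunk_id": "c2", "source_doc": "handbook.pdf", "score": 0.84},
        {"chunk_id": "c3", "source_doc": "policy.pdf", "score": 0.80},
    ],
)
print(sources_for(conn, "resp-001"))
```

Keying every row on a response identifier is what lets the dashboard answer "which documents produced this answer?" with a single query.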
GDPR Compliance Lineage Audit Tool
Intermediate
Design a Neo4j graph database schema for data lineage. Build an ETL pipeline that extracts metadata from dbt and Airflow into the graph. Write Cypher queries that support GDPR use cases: tracing all storage locations for a specific data subject's PII, identifying all models trained on PII data, and generating deletion propagation reports.
Blast Radius Analyzer with Automated Alerting
Advanced
Build a service that monitors upstream data source changes (schema modifications, volume anomalies) and automatically traverses a lineage graph to compute the blast radius: all downstream ML models, dashboards, and reports affected by a change. Integrate with Slack for automated alerting, with severity scoring based on downstream criticality.
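The core of blast radius computation is a downstream graph traversal. Here is a minimal sketch using breadth-first search over an in-memory adjacency map; in the project itself the graph would come from your lineage store. The node names, criticality weights, and the scoring rule (a simple sum of weights) are assumptions made up for this example.

```python
from collections import deque

# Hypothetical lineage edges: upstream node -> direct downstream nodes.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "features.order_stats"],
    "features.order_stats": ["model.churn_v2"],
    "mart.revenue": ["dashboard.finance"],
}

# Hypothetical criticality weights; unlisted nodes get a default weight.
CRITICALITY = {"model.churn_v2": 3, "dashboard.finance": 2}


def blast_radius(graph, source):
    """Map every node downstream of `source` to its hop distance from it."""
    distances = {}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in graph.get(node, []):
            # Skip already-visited nodes so cycles cannot loop forever.
            if child not in distances and child != source:
                distances[child] = depth + 1
                queue.append((child, depth + 1))
    return distances


def severity(affected, criticality, default=1):
    """Score an incident as the summed criticality of the affected nodes."""
    return sum(criticality.get(node, default) for node in affected)


affected = blast_radius(LINEAGE, "raw.orders")
for node, hops in sorted(affected.items(), key=lambda item: item[1]):
    print(f"{node} ({hops} hop(s) downstream)")
print("severity:", severity(affected, CRITICALITY))
```

Tracking hop distance alongside reachability is a design choice: it lets alerts distinguish directly affected assets from those several transformations away.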
AI Model Card Generator with Lineage Integration
Intermediate
Build a tool that automatically generates model cards (following Google's Model Card Toolkit) by extracting lineage metadata from MLflow and OpenLineage. The model card should include training data provenance, feature lineage, data quality metrics at training time, and known limitations derived from lineage gap analysis.