Interview Prep
AI Data Lineage Analyst Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A strong answer covers the what (tracing data from source to consumption), the why (reproducibility, debugging, compliance), and AI-specific concerns (training data provenance, bias tracing, model explainability).
Technical lineage maps column-level transformations and code-level dependencies; business lineage maps data flows to business concepts like KPIs, customer segments, or regulatory reports.
A directed acyclic graph (DAG) represents task dependencies in pipelines (Airflow, dbt); lineage systems use similar graph structures to model how data flows between nodes.
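As a minimal sketch of the idea, the standard library's `graphlib` can order a small dependency graph so that every node comes after its upstream nodes. The table names are hypothetical and stand in for pipeline tasks or datasets.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key depends on the nodes in its value set.
dependencies = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_sales": {"stg_orders", "stg_customers"},
    "churn_features": {"fct_sales"},
}

# static_order() yields each node only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

If the mapping contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is the "acyclic" constraint in practice.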
Technical metadata (schemas, types, owners), operational metadata (run times, row counts, freshness), and business metadata (descriptions, domains, sensitivity labels).
OpenLineage is an open standard for lineage event collection with integrations for Airflow, dbt, and Spark; proprietary solutions like Collibra or Alation offer full platforms at the cost of vendor lock-in.
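To make "lineage event" concrete, the sketch below builds a plain dictionary shaped like an OpenLineage run event using only the standard library. The helper name, namespace, and producer URL are hypothetical, and the `schemaURL` version is an assumption that should be checked against the current spec; in practice the official client libraries would build and send these events.

```python
import json
import uuid
from datetime import datetime, timezone

def build_run_event(event_type, job_name, inputs, outputs, namespace="example_namespace"):
    """Build a dict shaped like an OpenLineage run event (illustrative, not validated)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical producer URI; real integrations identify themselves here.
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; verify against the published OpenLineage schema.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }

event = build_run_event("COMPLETE", "daily_orders", ["raw.orders"], ["analytics.orders"])
print(json.dumps(event, indent=2))
```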
Intermediate
10 questions
A great answer discusses dbt's lineage support (model-level lineage from the dbt-core DAG; column-level lineage is a dbt Cloud Explorer feature rather than part of dbt-core), the manifest.json and catalog.json artifacts, SQL parsing limitations, and complementary tools like SQLGlot for AST-based extraction.
Should cover schema monitoring via Airflow sensors or Great Expectations, Slack/webhook alerting, blast radius analysis using the lineage graph, and rollback strategies.
Key challenges include tracing document provenance through chunking and embedding, tracking which retrieved chunks influenced which generation, vector database metadata consistency, and handling document updates that invalidate cached embeddings.
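One way to picture chunk-level provenance and embedding invalidation is sketched below: each chunk carries its source document ID, character offsets, and the document hash at chunking time, so a later hash mismatch flags embeddings that need recomputing. The `Chunk` fields, fixed-size chunking, and function names are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    doc_hash: str  # hash of the full source document at chunking time
    start: int     # character offset into the source document
    end: int
    text: str

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunk_document(doc_id: str, text: str, size: int = 40) -> list[Chunk]:
    """Split text into fixed-size chunks that remember where they came from."""
    doc_hash = sha256(text)
    return [
        Chunk(doc_id, doc_hash, i, min(i + size, len(text)), text[i:i + size])
        for i in range(0, len(text), size)
    ]

def stale_chunks(chunks: list[Chunk], current_docs: dict[str, str]) -> list[Chunk]:
    """Chunks whose source document changed or disappeared since chunking."""
    return [c for c in chunks if sha256(current_docs.get(c.doc_id, "")) != c.doc_hash]
```

Storing `doc_id`, `start`, and `end` alongside each vector is what later allows a generated answer to be traced back to a span of a specific document version.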
Should discuss feature store metadata APIs, offline-vs-online lineage divergence, feature versioning, and how to link training-time feature snapshots to inference-time feature lookups.
Covers SparkListener API or OpenLineage-Spark integration, parsing logical and physical plans, handling UDFs as lineage blind spots, and persisting lineage events to a catalog or graph store.
Blast radius analysis identifies all downstream assets affected by an upstream change; implementation involves graph traversal (BFS/DFS) in a lineage graph database with impact scoring.
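A minimal sketch of the traversal, assuming the lineage graph is available as an in-memory adjacency mapping from each asset to its direct downstream assets (a graph database would run the equivalent query). Hop distance is returned as a crude input to impact scoring.

```python
from collections import deque

def blast_radius(downstream: dict[str, list[str]], changed: str) -> dict[str, int]:
    """Return every asset reachable from `changed`, mapped to its hop distance."""
    distance: dict[str, int] = {}
    queue = deque([(changed, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in downstream.get(node, []):
            # Skip assets already visited so shared dependencies are counted once.
            if child not in distance and child != changed:
                distance[child] = depth + 1
                queue.append((child, depth + 1))
    return distance

# Hypothetical lineage graph.
graph = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["fct.sales", "ml.features"],
    "fct.sales": ["dash.revenue"],
    "ml.features": ["model.churn"],
}
print(blast_radius(graph, "raw.orders"))
```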
Lineage maps where personal data flows, enabling identification of all storage locations that must be purged; includes cascade deletion tracking and verification that deletion propagated fully.
Orchestrator lineage captures task-level dependencies and run metadata; transformation lineage captures column-level data flow and logic. Neither alone is complete; both are needed.
Discusses DVC for data versioning, dbt manifest snapshots in Git, model registry versioning in MLflow, and strategies for aligning lineage graph versions with deployment releases.
A strong answer connects Great Expectations checkpoints to specific lineage nodes, auto-blocks downstream processing when upstream quality fails, and logs quality metrics as lineage metadata.
Advanced
10 questions
Should cover a federated metadata collection architecture (OpenLineage collectors per domain), a central lineage graph (Neo4j or Amazon Neptune), a unified API layer, and a governance dashboard with RBAC.
Covers logging every agent invocation with input/output hashes, storing intermediate results in an append-only log, linking chain steps via correlation IDs, and handling non-deterministic outputs.
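The logging pattern could look roughly like the sketch below. `InvocationLog` is a hypothetical in-memory stand-in for an append-only store; because outputs may be non-deterministic, the log records the hash of what was actually produced rather than assuming a rerun would reproduce it.

```python
import hashlib
import json

def digest(payload) -> str:
    """Stable SHA-256 over a JSON-serialisable payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()

class InvocationLog:
    """In-memory stand-in for an append-only store; entries are only ever appended."""

    def __init__(self):
        self._records = []

    def record(self, correlation_id, step, agent, inputs, outputs):
        entry = {
            "correlation_id": correlation_id,  # ties all steps of one request together
            "step": step,
            "agent": agent,
            "input_hash": digest(inputs),
            "output_hash": digest(outputs),
        }
        self._records.append(entry)
        return entry

    def chain(self, correlation_id):
        """All steps for one request, in step order."""
        return sorted(
            (r for r in self._records if r["correlation_id"] == correlation_id),
            key=lambda r: r["step"],
        )
```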
Identifies leakage patterns: test data in training sets, future data in time-series splits, target leakage through feature engineering; uses lineage graph to validate temporal and logical separation.
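Two of these checks, shared records between splits and temporal overlap, can be sketched as follows. The row shape `(record_id, event_date)` is an assumption; target leakage through engineered features needs feature-level lineage and is not covered here.

```python
from datetime import date

def find_leakage(train_rows, test_rows):
    """Each row is (record_id, event_date). Returns the problems detected."""
    train_ids = {record_id for record_id, _ in train_rows}
    test_ids = {record_id for record_id, _ in test_rows}
    latest_train = max(event_date for _, event_date in train_rows)
    earliest_test = min(event_date for _, event_date in test_rows)
    return {
        # The same record appearing in both splits.
        "shared_ids": sorted(train_ids & test_ids),
        # Training data at or after the start of the test window.
        "temporal_overlap": latest_train >= earliest_test,
    }

train = [("a", date(2024, 1, 1)), ("b", date(2024, 2, 1))]
test = [("b", date(2024, 1, 15)), ("c", date(2024, 3, 1))]
print(find_leakage(train, test))
```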
Discusses metadata federation without exposing raw data, cryptographic provenance (zero-knowledge proofs or hash chains), shared lineage schemas across organizational APIs, and regulatory jurisdiction mapping.
Covers SQL parsing limitations (SQLGlot, sqlparse), the need for semantic analysis beyond syntactic parsing, handling of dynamic SQL through runtime query logging, and fallback to heuristic lineage.
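The "fallback to heuristic lineage" can be illustrated with a deliberately naive regex pass. This is not how a real parser such as SQLGlot works; it will misread CTE names as source tables and ignores subqueries, which is why the result is tagged as low confidence.

```python
import re

# Naive patterns: good enough for simple statements, wrong for many real ones.
TARGET = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?table(?:\s+if\s+not\s+exists)?)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def heuristic_lineage(sql: str) -> dict:
    """Guess source and target tables from SQL text without parsing it."""
    targets = set(TARGET.findall(sql))
    sources = set(SOURCE.findall(sql)) - targets
    return {"sources": sorted(sources), "targets": sorted(targets), "confidence": "low"}

sql = (
    "INSERT INTO analytics.daily_sales "
    "SELECT o.id FROM raw.orders o JOIN raw.customers c ON o.cid = c.id"
)
print(heuristic_lineage(sql))
```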
Should describe scoring based on downstream model criticality, data sensitivity classification, SLA impact, number of affected consumers, and historical failure frequency, all derived from the lineage graph.
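One simple form such a score could take is a weighted sum of normalised signals. The weights below are hypothetical and would need calibrating against real incidents; each signal is assumed to be scaled to the range 0 to 1 before scoring.

```python
# Hypothetical weights; they sum to 1.0 so the score also falls in 0..1.
WEIGHTS = {
    "model_criticality": 0.30,
    "data_sensitivity": 0.25,
    "sla_impact": 0.20,
    "consumer_count": 0.15,
    "failure_frequency": 0.10,
}

def impact_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals; missing signals count as 0."""
    for name, value in signals.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown signal: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalised to 0..1, got {value}")
    return round(sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS), 3)

print(impact_score({"model_criticality": 1.0, "data_sensitivity": 0.5}))
```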
Covers capturing the generative model's lineage (which real data trained it), parameterized generation seeds, statistical similarity metrics as lineage metadata, and provenance chains from synthetic → trained → deployed models.
Discusses metadata reconciliation strategies, conflict resolution rules (preference for transformation-level over orchestration-level), manual override workflows, and confidence scoring per lineage edge.
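A rough sketch of one such rule set: when several sources report the same edge, keep the record from the highest-priority origin, breaking ties by confidence. The priority table (manual override first, then transformation-level, then orchestration-level) and the edge dictionary shape are assumptions.

```python
# Higher number wins; unknown origins rank lowest.
PRIORITY = {"manual": 3, "transformation": 2, "orchestration": 1}

def reconcile(edges):
    """edges: iterable of dicts with source, target, origin, confidence."""
    best = {}
    for edge in edges:
        key = (edge["source"], edge["target"])
        rank = (PRIORITY.get(edge["origin"], 0), edge["confidence"])
        # Keep only the highest-ranked report of each edge.
        if key not in best or rank > best[key][0]:
            best[key] = (rank, edge)
    return [edge for _, edge in best.values()]

reported = [
    {"source": "raw.orders", "target": "fct.sales", "origin": "orchestration", "confidence": 0.9},
    {"source": "raw.orders", "target": "fct.sales", "origin": "transformation", "confidence": 0.6},
    {"source": "raw.users", "target": "fct.sales", "origin": "orchestration", "confidence": 0.8},
]
print(reconcile(reported))
```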
Covers linking classification tags to retention rules, traversing lineage to propagate TTLs to derived datasets, automated archival/deletion workflows, and compliance attestation logging.
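TTL propagation can be sketched as a walk over the lineage graph in dependency order, where each derived dataset inherits the shortest retention period among its upstream datasets. "Shortest retention wins" is a policy assumption made for this example, and the dataset names are hypothetical.

```python
from graphlib import TopologicalSorter

def propagate_ttl(upstream: dict[str, set[str]], source_ttl_days: dict[str, int]) -> dict[str, int]:
    """Return a TTL for every dataset that has at least one upstream TTL."""
    ttl = dict(source_ttl_days)
    # Visit parents before children so inherited values are already final.
    for node in TopologicalSorter(upstream).static_order():
        parent_ttls = [ttl[p] for p in upstream.get(node, ()) if p in ttl]
        if parent_ttls:
            inherited = min(parent_ttls)
            ttl[node] = min(ttl.get(node, inherited), inherited)
    return ttl

lineage = {
    "stg.users": {"raw.users"},
    "features.users": {"stg.users", "raw.events"},
}
print(propagate_ttl(lineage, {"raw.users": 30, "raw.events": 365}))
```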
Discusses content-addressable storage (hash-based lineage), metadata extraction pipelines for unstructured data, linking preprocessing steps (resizing, normalization) to model artifacts, and provenance standards like C2PA.
Scenario-Based
10 questions
Should describe checking upstream data source changes, examining feature pipeline modifications, comparing training data lineage snapshots across versions, looking for demographic shifts in data sources, and producing a root cause report.
Walk through extracting lineage from model registry → training dataset → feature store → raw data source, attaching consent/legal basis metadata at each hop, and generating an audit-ready lineage report with evidence links.
Should describe querying the lineage graph for all downstream dependents, classifying impact (critical vs. informational), notifying affected team leads via automated alerts, and recommending rollback priorities.
Covers PII classification tagging in the data catalog, lineage graph traversal from PII-tagged columns to model training pipelines, visualization of the PII flow graph, and a summary report with risk ratings.
Discusses dual-tracking lineage during migration, mapping legacy Hive table references to Snowflake equivalents, updating OpenLineage integrations for new compute engines, and validating lineage completeness post-cutover.
Should cover querying the lineage graph for the individual's data records, tracing through every transformation and aggregation, determining if the data is still identifiable in training sets, and producing a GDPR-compliant response.
Discusses metadata schema harmonization, building a unified lineage graph abstraction, reconciling naming conventions and ownership models, and prioritizing lineage mapping for high-risk/regulatory datasets first.
Should walk through lineage-driven drift analysis: compare current upstream data profiles against training-time baselines, identify the earliest lineage node where distribution shift appears, check for source system changes, and recommend remediation.
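The "earliest node where shift appears" step might be sketched as below, walking nodes in upstream-to-downstream order and comparing a profile statistic against its training-time baseline. Using the distance between means in baseline standard deviations is a crude stand-in for a proper drift test, and the threshold is an assumption.

```python
from statistics import mean, pstdev

def earliest_drift(lineage_order, baseline, current, threshold=3.0):
    """Return (node, distance) for the first node whose mean has shifted, else None."""
    for node in lineage_order:
        base = baseline[node]
        sigma = pstdev(base) or 1e-9  # avoid division by zero on constant baselines
        distance = abs(mean(current[node]) - mean(base)) / sigma
        if distance > threshold:
            return node, round(distance, 2)
    return None

# Hypothetical profiles of one numeric column at each lineage node.
order = ["raw.tx", "stg.tx", "features.tx"]
baseline = {node: [10, 11, 9, 10] for node in order}
current = {"raw.tx": [10, 10, 11, 9], "stg.tx": [20, 21, 19, 20], "features.tx": [20, 21, 19, 20]}
print(earliest_drift(order, baseline, current))
```

Here the raw source is unchanged and the shift first appears at the staging node, which points the investigation at the transformation between them rather than at the source system.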
Covers document ingestion pipeline lineage, chunking and embedding provenance, vector database metadata, retrieval log traceability, and prompt-response attribution to source documents.
Should describe extracting the model's training dataset lineage, checking for missing source tables, validating row counts against upstream sources, checking for NULL/drop transformations, and comparing schema expectations vs. reality.
AI Workflow & Tools
10 questions
Covers installing the OpenLineage-Airflow provider, configuring the lineage backend, ensuring each operator emits correct lineage events, and validating lineage data in Marquez or DataHub.
Covers dbt's metadata emission (manifest.json, run_results.json), DataHub's dbt ingestion connector, column-level lineage extraction, and enriching with business metadata in DataHub's UI.
Discusses wrapping LangChain callbacks to log document retrieval events, capturing chunk IDs and source document metadata, persisting retrieval logs to a lineage store, and linking to the generation step via request IDs.
Covers logging dataset fingerprints (hashes, row counts) as MLflow tags, linking to feature store snapshots, storing preprocessing pipeline configs as artifacts, and querying MLflow's API for lineage retrieval.
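A dataset fingerprint of this kind could be computed as sketched below. The tag names are hypothetical; the returned dictionary holds only strings so that it could be passed to a tracking call such as `mlflow.set_tags`, which is not invoked here to keep the example dependency-free.

```python
import hashlib

def dataset_fingerprint(rows) -> dict[str, str]:
    """Order-insensitive fingerprint: hash each row, sort the digests, hash again."""
    row_digests = sorted(
        # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
        hashlib.sha256("\x1f".join(map(str, row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return {"dataset.sha256": combined, "dataset.row_count": str(len(row_digests))}

tags = dataset_fingerprint([(1, "x"), (2, "y")])
print(tags)
```

Sorting the row digests makes the fingerprint independent of row order, which is usually what is wanted when the same data may be read back in a different order.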
Discusses AWS Glue Data Catalog as the metadata backbone, SageMaker lineage tracking API, S3 object tagging for data provenance, and AWS CloudTrail for access-based lineage augmentation.
Covers defining expectations per pipeline node, linking checkpoint results to lineage graph nodes, auto-blocking downstream processing on failures, and storing validation results as lineage metadata.
Discusses Unity Catalog's automatic lineage capture for SQL and Python workloads, column-level lineage visualization, API access for programmatic lineage queries, and integration with external catalogs.
Covers designing the graph schema (nodes: datasets, jobs, models, columns; edges: produces, consumes, transforms), building ingestion pipelines for each source, handling schema conflicts, and querying with Cypher.
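Before committing to a graph database, the schema can be prototyped in plain Python. The sketch below mirrors the node and edge types named above and rejects the simplest kind of schema conflict (one ID registered under two node types); the class and method names are assumptions, and a production system would express the same model in Cypher.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    DATASET = "dataset"
    JOB = "job"
    MODEL = "model"
    COLUMN = "column"

class EdgeType(Enum):
    PRODUCES = "produces"
    CONSUMES = "consumes"
    TRANSFORMS = "transforms"

@dataclass
class LineageGraph:
    nodes: dict = field(default_factory=dict)  # node id -> NodeType
    edges: set = field(default_factory=set)    # (source id, EdgeType, target id)

    def add_node(self, node_id: str, node_type: NodeType) -> None:
        existing = self.nodes.setdefault(node_id, node_type)
        if existing is not node_type:
            raise ValueError(f"{node_id} already registered as {existing.value}")

    def add_edge(self, source: str, edge_type: EdgeType, target: str) -> None:
        for node_id in (source, target):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.edges.add((source, edge_type, target))

    def targets(self, source: str, edge_type: EdgeType) -> list:
        return sorted(t for s, e, t in self.edges if s == source and e is edge_type)
```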
Covers configuring monitors for schema changes, volume anomalies, and freshness violations, linking alerts to lineage graph nodes for blast radius assessment, and automating incident ticket creation.
Covers logging training data hashes and preprocessing configs alongside the HuggingFace Trainer, tracking model artifacts in MLflow or SageMaker Model Registry, and linking inference endpoints back to training lineage.
Behavioral
5 questions
Look for evidence of stakeholder empathy, data-driven persuasion (showing a real incident where lineage would have helped), incremental adoption strategy, and measuring adoption success.
Strong answers show systematic investigation, clear communication of risk to stakeholders, a concrete remediation plan, and preventive measures implemented afterward.
Look for a risk-based prioritization framework (regulatory exposure, revenue impact, data sensitivity, incident history, and stakeholder urgency) rather than one based only on ease of implementation.
Look for use of visual metaphors (flowcharts, water-pipe analogies), business impact framing, avoidance of jargon, and the ability to connect technical lineage to business risk or value.
Strong answers mention specific communities (Data Council, dbt Community, OpenLineage Slack), publications, conferences, hands-on experimentation, and engagement with open-source projects.