AI Data Lineage Analyst
An AI Data Lineage Analyst maps, monitors, and audits the complete lifecycle of data as it flows through AI and machine learning p…
Skill Guide
Data lineage graph modeling is the process of creating a structured, directed acyclic graph (DAG) that maps the complete lifecycle and transformations of data, from origin to final consumption.
Scenario
You have a dbt project with staging models that source from raw CSVs and a final mart model. You need to visualize the dependency graph.
Scenario
An Airflow DAG triggers a Spark job that writes to a BigQuery table, which is then used by a Looker dashboard. Trace the full lineage.
Scenario
Your organization has multiple data domains (e.g., Marketing, Sales) each owning their pipelines. You must implement a cross-domain lineage view for global data quality monitoring without centralizing control.
OpenLineage is the de-facto open standard for lineage event collection. Apache Atlas provides enterprise-grade metadata and lineage for Hadoop ecosystems. dbt and Airflow have built-in lineage for their respective domains. Marquez is a reference implementation of an OpenLineage backend. Graph databases are used for storing and querying complex lineage relationships at scale.
The OpenLineage spec defines the event schema. Data Contracts define quality and ownership, which lineage validates. W3C PROV is a foundational academic model for provenance that informs lineage standards.
Answer Strategy
The candidate should demonstrate knowledge of instrumentation points and event-driven architecture. Start by acknowledging the complexity, then propose using OpenLineage. Describe instrumenting Spark jobs via the OpenLineage-Spark integration to capture column-level transformations, emitting events to a central backend. Mention that Snowflake and Looker would require custom or vendor-supported emitters. The goal is a unified graph queried via the backend's API.
Answer Strategy
This tests the application of the skill to drive business value. The candidate should use the STAR method (Situation, Task, Action, Result). A strong answer would detail a scenario like a broken dashboard, using lineage to trace back to a source schema change in an upstream API, coordinating with the upstream team to fix it, and preventing future incidents by adding schema validation to the pipeline.
1 career found
Try a different search term.