AI Data Quality Analyst
An AI Data Quality Analyst ensures the accuracy, consistency, and fitness-for-purpose of datasets powering machine learning models…
Skill Guide
Data lineage tracking is the technical process of mapping the complete lifecycle of data-its origin, transformations, movements, and consumption-while provenance documentation is the formal, auditable record of this journey, including the 'who, what, when, and why' of every change.
Scenario
You have a CSV file of 'Marketing_Spend', a SQL database with 'Campaign_Performance', and a Tableau dashboard showing 'ROI'. Your task is to trace how the final ROI metric is calculated.
Scenario
You have a dbt project transforming raw Salesforce data into analytics models. You need to automatically capture lineage between source tables, staging models, and final marts.
Scenario
A financial regulator questions a reported quarterly profit figure. You suspect a data pipeline error in the consolidation layer. You must rapidly trace the data's path to identify the point of failure and produce an auditable report.
OpenLineage is the open standard for lineage events; Marquez is a reference implementation. Cloud catalogs are essential for native cloud estates. dbt is the standard for analytical lineage. Apache Atlas and DataHub are robust open-source platforms for Hadoop/hybrid ecosystems.
DMBOK provides the formal processes for metadata management. COBIT aligns data governance (including provenance) with enterprise IT governance and audit controls. FAIR principles guide the design of findable, accessible, interoperable, and reusable data, which lineage enables.
Answer Strategy
Use the STAR method (Situation, Task, Action, Result). Focus on the systematic process: 1) Starting from the point of failure, 2) Using lineage tools or manual tracing backward, 3) Isolating the transformation or source layer, 4) Validating the fix. Example: 'In my last role, a sales report showed a 15% discrepancy. Starting from the report metric, I used our Azure Purview lineage graph to identify all upstream tables. I then manually checked the ETL job logs for that date and found a failed join in a Spark job that was silently dropping records. I corrected the code, backfilled the data, and documented the fix in our data catalog.'
Answer Strategy
Tests understanding of ML-specific lineage (data, code, environment). The answer must cover: 1) Data Provenance: tracking source datasets, versions, and transformations (feature stores). 2) Code Provenance: version control for scripts and models (Git). 3) Environment Provenance: containerization (Docker) and dependency management. 4) Model Provenance: logging training runs, hyperparameters, and final artifacts. A sample answer: 'I'd architect it with four pillars: data lineage via DVC or Delta Lake time travel, code lineage via Git commits linked to training runs, environment lineage via Docker images, and model lineage via MLflow or Weights & Biases to log all parameters and metrics. This creates a complete, immutable audit trail from raw data to deployed model.'
1 career found
Try a different search term.