AI Data Lineage Analyst
An AI Data Lineage Analyst maps, monitors, and audits the complete lifecycle of data as it flows through AI and machine learning p…
Skill Guide
The systematic practice of capturing, organizing, and governing technical, operational, and business metadata about data assets to create a searchable, trustworthy, and contextual data catalog using platforms like Apache Atlas, DataHub, or OpenMetadata.
Scenario
A sales analytics team frequently asks, 'What does the 'revenue' column in the 'q4_sales' table actually mean, and where does this data come from?'
Scenario
A critical Tableau dashboard displays 'Customer Lifetime Value (CLV)'. The data engineering team needs to understand all upstream dependencies to safely modify a source table without breaking the dashboard.
Scenario
The legal team mandates that all Personally Identifiable Information (PII) in the data lake must be identified, classified, and have its access governed within 60 days.
Atlas is enterprise-grade, tightly integrated with Hadoop ecosystems (Hive, HBase). DataHub is a modern, cloud-native platform with strong search/discovery and a graph-based architecture. OpenMetadata offers a unified platform for metadata, data quality, and data governance with a strong focus on developer experience and automation.
Airflow and dbt are common sources of operational metadata (run status, transformations). Their metadata can be pushed/pulled into catalogs. Custom scripts are essential for building connectors to proprietary systems or extending platform functionality.
DAMA provides the foundational 'what' of data governance and stewardship roles. FAIR (Findable, Accessible, Interoperable, Reusable) is a guiding principle for catalog design. Data Mesh demands a federated computational governance model, where a metadata catalog is the central enabling technology for enforcing policies as code.
Answer Strategy
Structure the answer around Diagnosis, Root Cause, and Solution. Start with verifying the ingestion job health and connector logs. The root cause is often missing hooks in custom applications, Spark jobs without lineage emission, or incorrect permissions. The solution is multi-pronged: implement a 'lineage completeness' metric, mandate lineage emission via CI/CD checks for data pipelines, and establish a stewardship process for manual lineage curation as a fallback.
Answer Strategy
This tests understanding of Data Mesh principles and federated governance. The answer must show a clear separation of concerns: global vs. local. Use the metaphor of a 'global schema' vs. 'local dialect'. The strategy is to define mandatory global metadata standards (e.g., for ownership, data product SLAs, core glossary terms) enforced via automation, while allowing domains to extend with their own contextual metadata.
1 career found
Try a different search term.