Skill Guide

Metadata management and catalog design (Apache Atlas, DataHub, OpenMetadata)

Metadata management and catalog design is the systematic practice of capturing, organizing, and governing technical, operational, and business metadata about data assets. The goal is a searchable, trustworthy, and contextual data catalog built on a platform such as Apache Atlas, DataHub, or OpenMetadata.

It directly enables data governance, regulatory compliance (GDPR, CCPA), and operational efficiency by providing a single source of truth for data lineage, ownership, and quality. This reduces data discovery time, mitigates risk, and increases trust in data-driven decision-making.

1 Careers

1 Categories

8.7 Avg Demand

18% Avg AI Risk

How to Learn Metadata management and catalog design (Apache Atlas, DataHub, OpenMetadata)

1. Core Concepts: Understand the metadata types (technical: schemas, lineage; operational: pipelines, freshness; business: glossary terms, ownership, PII tags).
2. Tool Familiarization: Set up a local Docker instance of OpenMetadata or DataHub. Ingest sample metadata from a local database.
3. Basic Proficiency: Learn to navigate the catalog UI, search for assets, and manually tag a dataset with a business term and owner.
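The three metadata types can be pictured as separate facets of one catalog entry. The following is a minimal sketch, not the data model of Atlas, DataHub, or OpenMetadata; every class name, field name, and sample value is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class TechnicalMetadata:
    # Structure of the asset: column name mapped to data type.
    schema: Dict[str, str]
    # Names of the assets this one is derived from (lineage).
    upstream: List[str] = field(default_factory=list)


@dataclass
class OperationalMetadata:
    # Pipeline that last wrote the asset, and when (freshness).
    pipeline: str
    last_updated: datetime


@dataclass
class BusinessMetadata:
    owner: Optional[str] = None
    glossary_terms: List[str] = field(default_factory=list)
    # Column names flagged as PII.
    pii_columns: List[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Hypothetical entry for a sales table.
entry = CatalogEntry(
    name="sales.q4_sales",
    technical=TechnicalMetadata(
        schema={"revenue": "numeric", "email": "text"},
        upstream=["raw.orders"],
    ),
    operational=OperationalMetadata(
        pipeline="load_q4_sales",
        last_updated=datetime(2024, 1, 5, tzinfo=timezone.utc),
    ),
    business=BusinessMetadata(
        owner="Sales Ops Manager",
        glossary_terms=["Net Sales"],
        pii_columns=["email"],
    ),
)
```

Keeping the facets separate mirrors how they are usually sourced: technical metadata comes from ingestion, operational metadata from pipelines, and business metadata from stewards.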

1. Automation & Integration: Configure automated metadata ingestion from cloud data warehouses (e.g., Snowflake, BigQuery), BI tools (e.g., Tableau), and orchestration platforms (e.g., Airflow) into your chosen catalog.
2. Lineage & Impact Analysis: Trace data lineage from source to report for a critical business metric. Use this to perform a simulated impact analysis for a proposed schema change.

Avoid the mistake of over-collecting metadata without a clear governance use case.

1. Strategic Governance: Design and implement a full metadata governance framework, including role-based access control (RBAC) for metadata itself, data classification policies (e.g., auto-tagging PII), and stewardship workflows for term assignment and approval.
2. Platform Architecture: Evaluate and architect a scalable, multi-tenant metadata platform, potentially extending it with custom APIs or connectors. Mentor data engineers and stewards on effective catalog use.

Practice Projects

Beginner

Project

Catalog a Single Database for a Sales Team

Scenario

A sales analytics team frequently asks, "What does the 'revenue' column in the 'q4_sales' table actually mean, and where does this data come from?"

How to Execute

1. Install OpenMetadata locally.
2. Connect it to a sample PostgreSQL database containing sales data.
3. Run the ingestion workflow to bring in table and column metadata.
4. Manually add business glossary terms ('Annual Revenue', 'Net Sales') to key columns, assign a 'Data Owner' (e.g., Sales Ops Manager), and document column descriptions in the UI.
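After step 4, it helps to check which columns still lack documentation. The sketch below assumes the table metadata has been exported into a plain dictionary; that shape, the function name, and the sample description are hypothetical and do not reflect the OpenMetadata API.

```python
def undocumented_columns(table):
    """Return names of columns that lack a description or a glossary term."""
    return [
        col["name"]
        for col in table["columns"]
        if not col.get("description") or not col.get("glossary_term")
    ]


# Hypothetical export of the q4_sales table after manual curation.
q4_sales = {
    "name": "q4_sales",
    "owner": "Sales Ops Manager",
    "columns": [
        {
            "name": "revenue",
            "description": "Invoiced amount net of returns.",
            "glossary_term": "Net Sales",
        },
        {"name": "region", "description": "", "glossary_term": None},
    ],
}

missing = undocumented_columns(q4_sales)
print(missing)  # ['region']
```

A list like this gives the data owner a concrete to-do list instead of a general request to "document the table".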

Intermediate

Project

Establish Data Lineage for a Customer 360 Dashboard

Scenario

A critical Tableau dashboard displays 'Customer Lifetime Value (CLV)'. The data engineering team needs to understand all upstream dependencies to safely modify a source table without breaking the dashboard.

How to Execute

1. Configure DataHub to ingest metadata from Snowflake (tables/views) and Tableau (dashboards, data sources).
2. Ingest metadata and run the lineage job.
3. Navigate to the 'CLV' metric in DataHub and explore the automatically generated lineage graph.
4. Perform an impact analysis: click on a source table node and verify all downstream dashboards and datasets are correctly mapped. Document any gaps in the lineage.
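The impact analysis in step 4 amounts to a graph traversal over lineage edges. The sketch below shows the idea with a breadth-first search over (upstream, downstream) pairs. It does not call DataHub; the edge list and asset names are hypothetical stand-ins for what the catalog would return.

```python
from collections import defaultdict, deque


def downstream_assets(edges, start):
    """Return every asset reachable downstream of `start`."""
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # The starting asset is not its own dependent.
    return seen - {start}


# Hypothetical lineage for the CLV dashboard.
edges = [
    ("snowflake.raw.orders", "snowflake.analytics.customer_ltv"),
    ("snowflake.raw.customers", "snowflake.analytics.customer_ltv"),
    ("snowflake.analytics.customer_ltv", "tableau.datasource.clv"),
    ("tableau.datasource.clv", "tableau.dashboard.customer_360"),
]

impacted = downstream_assets(edges, "snowflake.raw.orders")
```

Here `impacted` contains the analytics table, the Tableau data source, and the dashboard. If a known dependent is absent from such a result, that is a lineage gap to document.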

Advanced

Project

Implement Automated PII Detection and Governance Workflow

Scenario

The legal team mandates that all Personally Identifiable Information (PII) in the data lake must be identified, classified, and have its access governed within 60 days.

How to Execute

1. In OpenMetadata, configure a custom classification policy using regex patterns or NLP models to auto-detect columns like 'email', 'ssn'.
2. Set up a stewardship workflow where auto-tagged PII assets are sent to a designated Data Steward for review and approval.
3. Integrate the catalog's tag system with your data warehouse's masking policies (e.g., Snowflake dynamic data masking) via a custom API script.
4. Create a governance report dashboard showing PII coverage, steward review status, and masking policy compliance.
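Steps 1 to 3 can be sketched end to end in plain Python: match column names against regex patterns, hold each match for steward review, and emit masking statements only for approved tags. This is an illustrative sketch, not OpenMetadata's classification API. The tag names, patterns, table name, and masking policy names are hypothetical, and the generated SQL assumes the Snowflake masking policies already exist.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns matched against column names only.
PII_PATTERNS = {
    "PII.Email": re.compile(r"(^|_)e?mail($|_)", re.IGNORECASE),
    "PII.SSN": re.compile(r"(^|_)(ssn|social_security)($|_)", re.IGNORECASE),
}


@dataclass
class TagSuggestion:
    table: str
    column: str
    tag: str
    # A steward later changes this to "approved" or "rejected".
    status: str = "pending_review"


def suggest_pii_tags(table, columns):
    """Create a pending suggestion for every column name matching a pattern."""
    suggestions = []
    for column in columns:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(column):
                suggestions.append(TagSuggestion(table, column, tag))
    return suggestions


def masking_statements(suggestions, policy_by_tag):
    """Build ALTER TABLE statements for approved suggestions only."""
    return [
        f"ALTER TABLE {s.table} MODIFY COLUMN {s.column} "
        f"SET MASKING POLICY {policy_by_tag[s.tag]};"
        for s in suggestions
        if s.status == "approved" and s.tag in policy_by_tag
    ]


suggestions = suggest_pii_tags(
    "crm.customers", ["id", "customer_email", "ssn", "mailing_list"]
)
suggestions[0].status = "approved"  # the steward approves the email tag only
statements = masking_statements(
    suggestions, {"PII.Email": "mask_email", "PII.SSN": "mask_ssn"}
)
```

Name-based matching misses PII stored in generically named columns, which is one reason the steward review in step 2 remains part of the workflow.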

Tools & Frameworks

Software & Platforms

Apache Atlas, DataHub (LinkedIn), OpenMetadata

Atlas is enterprise-grade, tightly integrated with Hadoop ecosystems (Hive, HBase). DataHub is a modern, cloud-native platform with strong search/discovery and a graph-based architecture. OpenMetadata offers a unified platform for metadata, data quality, and data governance with a strong focus on developer experience and automation.

Integration & Orchestration

Airflow, dbt, Custom Python Scripts (pyatlasclient, acryl-datahub)

Airflow and dbt are common sources of operational metadata (run status, transformations). Their metadata can be pushed/pulled into catalogs. Custom scripts are essential for building connectors to proprietary systems or extending platform functionality.
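As one example of pulling transformation metadata, a custom script can turn dbt's manifest into lineage edges. The sketch below assumes the manifest has a top-level `nodes` mapping whose entries list their parents under `depends_on.nodes`; verify that layout against the dbt version in use. The project and model names are hypothetical, and a real script would load the manifest file with `json.load` rather than use an inline dictionary.

```python
def lineage_edges(manifest):
    """Return (parent_id, child_id) pairs from a dbt-manifest-shaped dict."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent_id, node_id))
    return edges


# Hypothetical, heavily trimmed manifest content.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw.orders"]}
        },
        "model.shop.customer_ltv": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]}
        },
    }
}

edges = lineage_edges(manifest)
```

The resulting pairs can then be pushed to the catalog through whichever client library the platform provides.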

Conceptual Frameworks

DAMA-DMBOK Data Governance Framework, FAIR Principles for Data, Data Mesh Architecture

DAMA provides the foundational 'what' of data governance and stewardship roles. The FAIR principles (Findable, Accessible, Interoperable, Reusable) guide catalog design. Data Mesh calls for federated computational governance, in which a metadata catalog is the central enabling technology for enforcing policies as code.

Interview Questions

Answer Strategy

Structure the answer around Diagnosis, Root Cause, and Solution. Start with verifying the ingestion job health and connector logs. The root cause is often missing hooks in custom applications, Spark jobs without lineage emission, or incorrect permissions. The solution is multi-pronged: implement a 'lineage completeness' metric, mandate lineage emission via CI/CD checks for data pipelines, and establish a stewardship process for manual lineage curation as a fallback.
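A 'lineage completeness' metric can be as simple as the share of non-source datasets that have at least one upstream edge. The sketch below is one possible definition, not a standard one; the function name, dataset names, and the idea of excluding declared source systems are assumptions.

```python
def lineage_completeness(datasets, edges, sources=frozenset()):
    """Fraction of non-source datasets with at least one upstream edge."""
    has_upstream = {downstream for _, downstream in edges}
    expected = [d for d in datasets if d not in sources]
    if not expected:
        # Nothing is expected to have lineage, so nothing is missing.
        return 1.0
    covered = sum(1 for d in expected if d in has_upstream)
    return covered / len(expected)


# Hypothetical catalog contents: mart.churn has no recorded upstream.
datasets = ["raw.orders", "stg.orders", "mart.clv", "mart.churn"]
edges = [("raw.orders", "stg.orders"), ("stg.orders", "mart.clv")]

score = lineage_completeness(datasets, edges, sources={"raw.orders"})
```

With these inputs, two of the three non-source datasets have lineage. A CI/CD check could fail a pipeline change when the score drops below an agreed threshold.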

Answer Strategy

This tests understanding of Data Mesh principles and federated governance. The answer must show a clear separation of concerns: global vs. local. Use the metaphor of a 'global schema' vs. 'local dialect'. The strategy is to define mandatory global metadata standards (e.g., for ownership, data product SLAs, core glossary terms) enforced via automation, while allowing domains to extend with their own contextual metadata.
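The global versus local split can be enforced with a small validation step: mandatory fields are checked centrally, and any additional fields a domain adds are left alone. The sketch below assumes data product metadata arrives as a dictionary; the required field names and the sample products are hypothetical.

```python
# Hypothetical global standard: every data product must declare these.
REQUIRED_GLOBAL_FIELDS = {"owner", "sla_hours", "glossary_terms"}


def validate_data_product(metadata):
    """Return an error for each required global field that is missing or empty."""
    errors = []
    for field_name in sorted(REQUIRED_GLOBAL_FIELDS):
        # Empty values (None, "", [], 0) count as missing in this sketch.
        if not metadata.get(field_name):
            errors.append(f"missing required global field: {field_name}")
    return errors


# A domain may add its own keys, such as campaign_channel below.
marketing_product = {
    "owner": "marketing-data@example.com",
    "sla_hours": 24,
    "glossary_terms": ["Customer Lifetime Value"],
    "campaign_channel": "email",
}
incomplete_product = {"sla_hours": 24, "glossary_terms": []}

ok_errors = validate_data_product(marketing_product)
bad_errors = validate_data_product(incomplete_product)
```

Running such a check in each domain's deployment pipeline keeps the global standard automated while leaving local extensions to the domain teams.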