Skill Guide

Data discovery, classification, and lineage tracking at scale

The systematic process of automatically locating all data assets within an organization, tagging them based on sensitivity and business context, and mapping their flow from source to consumption to ensure governance, compliance, and analytical integrity.

This skill is critical for regulatory compliance (GDPR, CCPA, SOX), reducing data breach risk, and enabling trusted self-service analytics. It directly impacts operational efficiency, audit costs, and the ability to monetize data assets reliably.

How to Learn Data discovery, classification, and lineage tracking at scale

Focus on 1) Understanding core data types (PII, PHI, confidential) and common regulatory frameworks. 2) Learning basic SQL and querying metadata catalogs. 3) Grasping the concept of data lineage graphs and ETL pipelines.
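As a minimal sketch of what "querying metadata" looks like in practice, the example below uses Python's built-in sqlite3 module as a stand-in for a real warehouse. The table and column names are hypothetical, and most warehouses expose the same information through an information_schema view rather than SQLite's PRAGMA statements.

```python
import sqlite3

# In-memory database standing in for a real warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, customer_email TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, order_amount REAL)")


def list_columns(conn):
    """Return (table, column, declared_type) tuples read from SQLite's own metadata."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    inventory = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
            inventory.append((table, name, col_type))
    return inventory


inventory = list_columns(conn)
for row in inventory:
    print(row)
```

The resulting inventory is the raw material for a catalog: each tuple is one asset that can later be classified and linked into a lineage graph.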

Move to practice by 1) Implementing a column-level lineage tracker for a small data warehouse using tools like OpenLineage or dbt's lineage features. 2) Developing custom classification rules using regex or ML models for semi-structured data. A common mistake at this stage is ignoring unstructured data and focusing only on structured databases.
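A regex-based classification rule can be sketched roughly as follows. The patterns, the tag names, and the 0.8 match ratio are illustrative assumptions, not tuned production rules; the idea is simply to tag a column when most of its sampled values match a pattern.

```python
import re

# Illustrative patterns only; real rules need tuning and validation.
RULES = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values, min_match_ratio=0.8):
    """Return the first tag whose pattern matches enough non-empty sample values, else None."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return None
    for tag, pattern in RULES.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= min_match_ratio:
            return tag
    return None


print(classify_column(["a@example.com", "b.c@example.org", "", "d@example.net"]))
print(classify_column(["123-45-6789", "987-65-4321"]))
print(classify_column(["hello", "world"]))
```

Classifying on a ratio rather than on a single hit tolerates a few dirty values in a column while avoiding a tag triggered by one stray match.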

Master the skill by 1) Designing enterprise-wide metadata architecture integrating discovery, classification, and lineage into a single data fabric. 2) Building automated policy engines that trigger actions based on classification (e.g., masking, encryption). 3) Aligning data governance metrics with business KPIs for executive reporting.
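One way to picture a classification-driven policy engine is a lookup from tag to action, applied to each record. The sketch below is a toy under stated assumptions: the tag names, the POLICIES mapping, and both actions are hypothetical, and the hashing shown is unsalted and not a substitute for real encryption or tokenization.

```python
import hashlib


def mask(value):
    """Keep the last 4 characters and replace the rest with '*' (short values stay visible)."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def pseudonymize(value):
    """Replace the value with a truncated SHA-256 digest (unsalted, for illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical mapping from classification tag to enforcement action.
POLICIES = {"PII": pseudonymize, "FINANCIAL": mask}


def apply_policies(record, column_tags):
    """Return a copy of the record with the action for each column's tag applied."""
    protected = {}
    for column, value in record.items():
        action = POLICIES.get(column_tags.get(column))
        protected[column] = action(str(value)) if action else value
    return protected


record = {"order_id": 1001, "customer_email": "jane@example.com", "card_number": "4111111111111111"}
tags = {"customer_email": "PII", "card_number": "FINANCIAL"}
protected = apply_policies(record, tags)
print(protected)
```

Keeping the tag-to-action mapping as data means a governance team could change policy without touching the pipeline code that calls apply_policies.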

Practice Projects

Beginner

Project

Build a Data Catalog for a Sample E-commerce Dataset

Scenario

You have a set of CSV files representing customer orders, product inventory, and user logs. The goal is to document all data assets, classify sensitive columns, and trace a sample metric (e.g., 'Total Order Value') back to its sources.

How to Execute

1) Use a tool like Apache Atlas or a simple spreadsheet to list all tables/columns. 2) Manually classify columns (e.g., 'customer_email' as PII, 'order_amount' as financial). 3) Write SQL queries or use a lineage tool to document the joins and transformations needed to calculate 'Total Order Value' from raw tables.
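The three steps can be prototyped in a few lines before reaching for a catalog tool. This is a rough sketch: the file contents, the name hints, and the hand-written METRIC_LINEAGE mapping are all hypothetical stand-ins for the sample dataset and for the manual classification described above.

```python
import csv
import io

# Hypothetical sample files; in practice these would be opened from disk.
FILES = {
    "orders.csv": "order_id,customer_email,order_amount\n1,a@example.com,19.99\n",
    "products.csv": "product_id,product_name,stock_count\n10,Widget,3\n",
}

# Name-based hints, a stand-in for the manual classification step.
NAME_HINTS = {"email": "PII", "amount": "financial"}


def build_catalog(files):
    """List every column of every file with a classification guessed from its name."""
    catalog = []
    for filename, content in files.items():
        header = next(csv.reader(io.StringIO(content)))
        for column in header:
            tag = next((t for hint, t in NAME_HINTS.items() if hint in column.lower()),
                       "unclassified")
            catalog.append({"file": filename, "column": column, "classification": tag})
    return catalog


# Lineage for the sample metric, recorded by hand as metric -> source columns.
METRIC_LINEAGE = {"Total Order Value": [("orders.csv", "order_amount")]}


def trace_metric(metric, catalog):
    """Return (file, column, classification) for each source column of a metric."""
    index = {(e["file"], e["column"]): e["classification"] for e in catalog}
    return [(f, c, index.get((f, c), "unknown")) for f, c in METRIC_LINEAGE[metric]]


catalog = build_catalog(FILES)
print(trace_metric("Total Order Value", catalog))
```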

Intermediate

Project

Implement Automated PII Detection in a Data Lake

Scenario

You manage a data lake with thousands of unstructured JSON files. The requirement is to scan for potential PII (names, emails, SSNs) and tag those files for access control.

How to Execute

1) Set up a scanning job using a tool like Amazon Macie or Google Cloud DLP. 2) Define and test custom detection rules for domain-specific PII (e.g., internal employee IDs). 3) Integrate scan results with a metadata store (e.g., AWS Glue Catalog) and create an automated tagging pipeline. 4) Test by injecting known PII into sample files and verifying detection and tagging.
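Step 4 is easy to rehearse locally before a managed scanner is involved. The sketch below is not how Amazon Macie or Google Cloud DLP work internally; it is a small regex-based stand-in, with illustrative patterns and hypothetical documents, that shows the seed-and-verify idea. Names are deliberately left out because they cannot be detected reliably with patterns alone.

```python
import json
import re

# Illustrative patterns; managed scanners use far richer detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def iter_strings(node):
    """Yield every string value in a nested JSON structure (keys are not scanned)."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def scan_document(raw_json):
    """Return the sorted list of PII tags found in one JSON document."""
    found = set()
    for text in iter_strings(json.loads(raw_json)):
        for tag, pattern in PATTERNS.items():
            if pattern.search(text):
                found.add(tag)
    return sorted(found)


# Inject known PII into one sample and keep a clean control document.
seeded = json.dumps({"user": {"contact": "jane@example.com", "notes": ["SSN 123-45-6789"]}})
clean = json.dumps({"event": "page_view", "path": "/home"})
print(scan_document(seeded), scan_document(clean))
```

The clean control matters as much as the seeded file: a scanner that tags everything would pass a seeded-only test.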

Advanced

Project

Design a Cross-Platform Lineage Mesh for Regulatory Auditing

Scenario

Your data flows from SAP (ERP) into Snowflake (data warehouse), through dbt (transformation), and into Tableau (visualization). An auditor requires end-to-end lineage proof for a financial report metric.

How to Execute

1) Deploy a centralized lineage hub (e.g., DataHub, OpenLineage with Marquez). 2) Configure metadata extractors for each platform (SAP, Snowflake, dbt, Tableau) to push lineage events. 3) Build a reconciliation job that validates lineage completeness by sampling metrics and tracing them. 4) Create an automated audit report that exports the lineage graph and classification tags for the specified metric.
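The reconciliation job in step 3 boils down to a graph question: does the metric's upstream lineage reach every platform it should? A minimal sketch follows, assuming the lineage hub can export edges as a plain mapping; every node name here is hypothetical.

```python
# Edges point downstream: source -> consumers. Node names are hypothetical.
LINEAGE_EDGES = {
    "sap.gl_line_items": ["snowflake.raw_gl_items"],
    "snowflake.raw_gl_items": ["dbt.stg_gl_items"],
    "dbt.stg_gl_items": ["dbt.fct_revenue"],
    "dbt.fct_revenue": ["tableau.net_revenue"],
}


def upstream(target, edges):
    """Return every node that feeds `target`, directly or indirectly."""
    parents = {}
    for src, dests in edges.items():
        for dest in dests:
            parents.setdefault(dest, set()).add(src)
    seen, stack = set(), [target]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def missing_platforms(target, edges, required=("sap", "snowflake", "dbt")):
    """List required platforms that have no node in the target's upstream lineage."""
    platforms = {node.split(".", 1)[0] for node in upstream(target, edges)}
    return [p for p in required if p not in platforms]


print(missing_platforms("tableau.net_revenue", LINEAGE_EDGES))

# Simulate an extractor gap: the SAP -> Snowflake edge was never reported.
broken = {k: v for k, v in LINEAGE_EDGES.items() if k != "sap.gl_line_items"}
print(missing_platforms("tableau.net_revenue", broken))
```

A non-empty result flags a break in the chain before the auditor finds it, which is the point of running the check on sampled metrics.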

Tools & Frameworks

Software & Platforms

Apache Atlas, DataHub (LinkedIn), Alation, Atlan, Collibra

Commercial and open-source data catalog platforms used for automated discovery, metadata management, and lineage visualization. Atlas is foundational in Hadoop ecosystems; DataHub is modern and event-driven.

Lineage & Transformation Tools

dbt (data build tool), OpenLineage, Marquez, Azure Data Factory

dbt builds model-level lineage (its DAG) from the references between models in the transformation layer, with column-level lineage offered in dbt Cloud. OpenLineage is an open standard for lineage event collection; Marquez is its reference implementation. ADF offers pipeline lineage in Azure environments.

Classification & Scanning Engines

Amazon Macie, Google Cloud DLP, Microsoft Purview, BigID

Cloud-native and specialized SaaS tools for sensitive data discovery using pattern matching, ML, and predefined taxonomies. They are essential for scanning data lakes and warehouses at petabyte scale.

Standards & Frameworks

Data Catalog Vocabulary (DCAT), W3C PROV (Provenance), ISO/IEC 27001 Annex A.8

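DCAT defines a standard for describing datasets. W3C PROV provides a model for provenance (lineage). ISO 27001 controls guide information asset classification and handling requirements; Annex A.8 is the asset management section of the 2013 edition, and the 2022 edition renumbers these controls.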

Interview Questions

Answer Strategy

Focus on event-driven architecture and decoupling. Sample answer: 'I would implement a passive lineage collection model using OpenLineage, where agents or sidecar processes listen to logs and API calls from orchestration tools (Airflow) and stream processors (Kafka Streams) rather than querying production databases. This metadata is published to a dedicated event bus (Kafka) and consumed asynchronously by a lineage service, which keeps lineage collection off the critical path of operational systems.'
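To make the sample answer concrete, the sketch below assembles a dict shaped like an OpenLineage run event using only the standard library. The top-level field names follow the OpenLineage spec as commonly documented, but the job, dataset, and producer values are hypothetical, and the schemaURL shown is a placeholder that must be replaced with the RunEvent schema URL of the spec version in use. Publishing to Kafka is left out; the serialized payload is what a producer would send.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs,
                    producer, schema_url, run_id=None):
    """Assemble a dict shaped like an OpenLineage run event (no facets included)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": producer,
        "schemaURL": schema_url,
    }


event = build_run_event(
    event_type="COMPLETE",
    job_namespace="airflow-prod",                      # hypothetical namespace
    job_name="daily_orders.load_orders",               # hypothetical job
    inputs=[("postgres://orders-db", "public.orders")],
    outputs=[("snowflake://analytics", "raw.orders")],
    producer="https://example.com/lineage-sidecar",    # hypothetical producer URI
    schema_url="https://example.com/REPLACE-with-RunEvent-schema-url",  # placeholder
)
payload = json.dumps(event)
print(payload)
```

Because the event names datasets rather than reading them, emitting it requires no query against the source or target database.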

Answer Strategy

Tests problem-solving and iterative improvement. Core competency: Accuracy vs. Coverage trade-off management. Sample answer: 'I would implement a three-phase approach: 1) Tune existing rules by analyzing false positives/negatives to adjust regex patterns and ML model thresholds. 2) Introduce a human-in-the-loop review process for borderline cases, using the feedback to retrain models. 3) Establish a governance workflow where data owners validate and certify classification tags for critical datasets, creating a feedback loop for continuous improvement.'
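The first phase of that answer, tuning thresholds against reviewed results, can be illustrated with a small precision and recall calculation. The scores and labels below are made-up examples of a human-reviewed sample, not real results.

```python
def precision_recall(scored, threshold):
    """scored: list of (model_score, is_truly_sensitive) pairs from a reviewed sample."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical reviewed sample: (classifier score, reviewer says it is sensitive).
REVIEWED = [(0.95, True), (0.90, True), (0.70, False),
            (0.65, True), (0.40, False), (0.30, False)]

for threshold in (0.5, 0.8):
    p, r = precision_recall(REVIEWED, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this sample, raising the threshold from 0.5 to 0.8 removes the one false positive but misses a truly sensitive item, which is the accuracy versus coverage trade-off the question is probing. Items scoring between the two thresholds are natural candidates for the human review queue.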