Skill Guide

SQL and data pipeline design for regulatory data lineage

The practice of designing SQL queries and data pipelines that automatically capture, store, and visualize the complete, auditable history of data transformations from source to final regulatory report.

It is non-negotiable for financial institutions and heavily regulated sectors to demonstrate to regulators (e.g., SEC, FINRA, EBA) that their reported numbers are traceable and trustworthy, directly mitigating massive compliance risk and potential fines.

1 Careers

1 Categories

8.5 Avg Demand

20% Avg AI Risk

How to Learn SQL and data pipeline design for regulatory data lineage

1. Master foundational SQL, focusing on `JOIN`s, Common Table Expressions (CTEs), and window functions (`ROW_NUMBER()`, `LAG()`) to track changes. 2. Understand the concepts of Slowly Changing Dimensions (Type 2) and how they store historical data states. 3. Learn basic data pipeline orchestration (e.g., Apache Airflow DAGs) and the principle of idempotency for re-running jobs safely.

1. Practice designing and implementing a SCD Type 2 table for a core business entity (e.g., customer, account) within a star schema. 2. Build a pipeline that not only transforms data but also logs each transformation step with timestamps, row counts, and checksums into a dedicated metadata store. 3. Common mistake: Using `UPDATE` statements that overwrite historical data, thus breaking the lineage chain.

1. Architect end-to-end lineage solutions integrating with tools like Collibra or Apache Atlas to auto-map lineage from SQL and pipeline code. 2. Design frameworks for reusable, parameterized lineage-capture macros (e.g., in dbt) that standardize logging across teams. 3. Align technical design with specific regulatory standards (BCBS 239, GDPR Article 30) to ensure the captured metadata directly answers audit questions.

Practice Projects

Beginner

Project

Build a Lineage-Aware Customer Dimension Table

Scenario

You have a raw `customers_raw` feed with customer data. You need to create a `dim_customer` table that preserves every historical state of a customer's key attributes (e.g., address, risk rating) for audit.

How to Execute

1. Design a `dim_customer` table with `customer_key` (surrogate), `customer_id` (natural), all descriptive columns, `effective_date`, `expiry_date`, and `is_current` flag. 2. Write a SQL merge/upsert script that compares incoming raw data to the current record in the dimension. If an attribute has changed, expire the old record (`is_current=0`, `expiry_date=now()`) and insert a new record. 3. Add a step to log each run: the number of new records, updated records, and the timestamp to a `pipeline_audit` table.

Intermediate

Project

Lineage-Enhanced Regulatory Report Pipeline

Scenario

You must produce a daily Capital Adequacy Report. The regulator demands to see, for any given number on the report, the exact source data, transformations, and intermediate tables that produced it.

How to Execute

1. Model your pipeline using a tool like dbt. Structure your models in staged layers (staging -> intermediate -> marts). 2. In your SQL models, use dbt macros to consistently inject audit columns (`_loaded_at`, `_batch_id`, `_source_system`). 3. Implement a post-run hook that queries the dbt metadata (via `dbt run-operation`) and loads model dependency, test results, and row count history into a lineage warehouse. 4. Build a simple dashboard (e.g., in Metabase) that joins this lineage metadata to the final report tables.

Advanced

Project

Design an Automated Lineage Governance Framework

Scenario

Your organization has hundreds of data pipelines for regulatory reporting. Manual documentation is impossible. You need a scalable system to automatically detect, capture, and govern data lineage across the enterprise.

How to Execute

1. Implement a centralized metadata store (e.g., using OpenLineage standard with Marquez or Apache Atlas). 2. Instrument all data processing frameworks (Spark, SQL, dbt) with OpenLineage client libraries to emit lineage events during runs. 3. Develop custom SQL parsers (using ANTLR or similar) to capture column-level lineage from complex SQL scripts not covered by existing tools. 4. Create a data governance council process that uses the automated lineage graphs to identify critical data elements, assign owners, and trigger compliance reviews when lineage breaks.

Tools & Frameworks

Software & Platforms

Apache Airflowdbt (Data Build Tool)Apache Atlas / OpenLineage

Use Airflow for orchestrating pipeline tasks with dependency tracking. Use dbt for version-controlled SQL transformations with built-in lineage and documentation. Use Atlas/OpenLineage for centralized, cross-platform metadata storage and lineage visualization.

Database & Query Features

Slowly Changing Dimensions (SCD) Type 2Common Table Expressions (CTEs)Window Functions (`LAG`, `LEAD`)

SCD Type 2 is the foundational pattern for storing history. CTEs make complex lineage-tracking logic readable. Window functions are essential for calculating change flags and sequencing historical records within SQL.

Interview Questions

Answer Strategy

The interviewer is testing for a structured, end-to-end thought process. The candidate should outline: 1) Source identification and staging, 2) Intermediate transformation with explicit history capture (e.g., SCD2 on the entity driving the metric), 3) Final metric calculation with clear, auditable logic, and 4) The audit trail design for each step. Sample Answer: 'First, I'd stage the raw transactions and counterparty data, adding load timestamps. Then, I'd build an intermediate SCD2 dimension for counterparty-country mappings to preserve changes. The exposure calculation would join fact tables to the current *and* relevant historical dimension records. Finally, I'd log every pipeline step's execution metadata-row counts, SQL hash, and batch ID-to a dedicated audit schema, creating a verifiable chain from source to report.'

Answer Strategy

This tests operational rigor and tool proficiency. The answer must be procedural. The core competency is debugging with lineage. Sample Answer: 'I would start in our reporting mart table for that number and use our dbt lineage graph or Atlas lineage view to trace it back through the intermediate models to the staging layer. I'd check the pipeline run logs for that specific execution batch, looking for any failed tests, row count anomalies, or source data freshness issues in the upstream tables. If the logic itself is questioned, I'd point to the version-controlled SQL transformation in dbt that produced it, showing the exact code and parameters used for that run.'