Skill Guide

Data quality engineering (Great Expectations, dbt tests, Soda)

The engineering discipline of systematically validating, monitoring, and improving data integrity across pipelines using declarative testing frameworks like Great Expectations, dbt tests, and Soda.

It prevents costly data errors from propagating to analytics and machine learning models, directly protecting revenue and operational decisions. By embedding quality checks into the pipeline, it reduces data incident response time and builds organizational trust in data assets.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn Data quality engineering (Great Expectations, dbt tests, Soda)

1. Core Concepts: Understand data quality dimensions (completeness, consistency, accuracy, timeliness) and testing types (schema, value, distribution).
2. Tool Fundamentals: Master one framework (e.g., dbt tests) by running basic tests (not_null, unique) on a simple dataset.
3. Process Mindset: Shift from reactive bug-fixing to proactive validation by defining quality expectations before data transformation.

1. Integration & Orchestration: Integrate tests into CI/CD pipelines (e.g., GitHub Actions running dbt test) and schedule monitoring (e.g., Soda scans).
2. Advanced Assertions: Move beyond basic tests to custom SQL assertions, set-membership checks (Great Expectations' expect_column_values_to_be_in_set), distributional checks (e.g., expect_column_mean_to_be_between), and cross-table referential integrity.
3. Metadata & Alerting: Configure alerting on test failures (Slack/email) and learn to query test metadata for trend analysis.

1. Strategic Governance: Design organization-wide data quality frameworks with SLAs, data contracts, and a catalog of certified expectations.
2. Complex System Design: Implement anomaly detection pipelines, reconcile data across heterogeneous sources (data lake, warehouse, SaaS apps), and build self-healing data pipelines.
3. Mentorship & Evangelism: Lead cross-functional data quality initiatives, train data producers on their quality responsibilities, and quantify quality ROI in business terms.

Practice Projects

Beginner

Project

Build a Validated dbt Model for Sales Data

Scenario

You are given a raw, messy sales CSV with columns: order_id, customer_id, amount, order_date, status. Transform it into a clean model with basic quality guarantees.

How to Execute

1. Create a dbt model (`stg_sales`) that selects and renames columns from the raw source.
2. Add schema tests in the YAML file: `not_null` on order_id and amount, `unique` on order_id, `accepted_values` on status ('pending', 'shipped', 'cancelled').
3. Run `dbt test` and fix any failures by modifying the model logic.
4. Document the tested model in dbt docs.
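To make the logic behind those schema tests concrete, here is a minimal Python sketch (not dbt itself) that runs equivalent queries against an in-memory SQLite table. It follows the convention dbt uses, where a test is a query that selects failing rows and passes when it returns none. The sample rows and the check names are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical sample rows standing in for the raw sales CSV.
RAW_SALES = [
    (1, 101, 25.0, "2024-01-05", "pending"),
    (2, 102, None, "2024-01-06", "shipped"),    # null amount
    (2, 103, 40.0, "2024-01-06", "cancelled"),  # duplicate order_id
    (3, 104, 15.5, "2024-01-07", "refunded"),   # status outside the accepted list
]

# Each check selects the offending rows (or keys); zero rows means the check passes.
CHECKS = {
    "not_null_order_id": "SELECT order_id FROM stg_sales WHERE order_id IS NULL",
    "not_null_amount": "SELECT order_id FROM stg_sales WHERE amount IS NULL",
    "unique_order_id": (
        "SELECT order_id FROM stg_sales "
        "GROUP BY order_id HAVING COUNT(*) > 1"
    ),
    "accepted_values_status": (
        "SELECT order_id FROM stg_sales "
        "WHERE status NOT IN ('pending', 'shipped', 'cancelled')"
    ),
}


def run_checks(rows):
    """Load rows into an in-memory table and return failure counts per check."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stg_sales "
        "(order_id INTEGER, customer_id INTEGER, amount REAL, "
        "order_date TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?, ?)", rows)
    results = {
        name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()
    }
    conn.close()
    return results


results = run_checks(RAW_SALES)
for name, failures in results.items():
    status = "PASS" if failures == 0 else f"FAIL ({failures})"
    print(f"{name}: {status}")
```

In a real project the same rules would live in the model's YAML file and run through `dbt test`; the sketch only shows what each rule checks.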

Intermediate

Project

Implement a Great Expectations Data Context for a Data Pipeline

Scenario

You have a daily ETL job loading user event data into a Snowflake table. You need to catch data drift and missing data automatically.

How to Execute

1. Initialize a Great Expectations (GE) Data Context and connect it to your Snowflake datasource.
2. Create an Expectation Suite for the `events` table: expect `event_timestamp` to fall within the last 30 days, expect `user_id` to be unique, and expect no more than 5% nulls in `page_url`.
3. Add a validation step to your Airflow DAG that runs a GE Checkpoint after the load. Older GE releases exposed this as `context.run_validation_operator`; Checkpoints have since replaced validation operators, so check which API your installed version supports.
4. Configure a Slack notification action in GE to alert the data engineering channel on failure.
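The three expectations in that suite can be sketched in plain Python to show what each one asserts. This is not the Great Expectations API. The rough GE counterparts would be `expect_column_values_to_be_between`, `expect_column_values_to_be_unique`, and `expect_column_values_to_not_be_null` with a `mostly` threshold. The function name, row layout, and sample batches are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def validate_events(rows, now=None, max_age_days=30, max_null_ratio=0.05):
    """Return check name -> bool for a batch of event rows.

    Each row is a dict with keys event_timestamp (timezone-aware datetime),
    user_id, and page_url (may be None).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)

    timestamps = [r["event_timestamp"] for r in rows]
    user_ids = [r["user_id"] for r in rows]
    null_urls = sum(1 for r in rows if r["page_url"] is None)

    return {
        # Every timestamp must fall inside the trailing window.
        "event_timestamp_within_window": all(cutoff <= t <= now for t in timestamps),
        # No user_id may appear more than once in the batch.
        "user_id_unique": len(user_ids) == len(set(user_ids)),
        # An empty batch is treated as a failure, since it signals missing data.
        "page_url_null_ratio_ok": bool(rows)
        and null_urls / len(rows) <= max_null_ratio,
    }


NOW = datetime(2024, 3, 1, tzinfo=timezone.utc)  # fixed clock for a repeatable demo

good_batch = [
    {"event_timestamp": NOW - timedelta(days=i), "user_id": i, "page_url": f"/page/{i}"}
    for i in range(20)
]
bad_batch = good_batch + [
    {"event_timestamp": NOW - timedelta(days=45), "user_id": 3, "page_url": None},
    {"event_timestamp": NOW - timedelta(days=1), "user_id": 99, "page_url": None},
]

good_result = validate_events(good_batch, now=NOW)
bad_result = validate_events(bad_batch, now=NOW)
print(good_result)
print(bad_result)
```

A function like this could run as a post-load task in the Airflow DAG, failing the task when any value is False.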

Advanced

Project

Architect a Cross-System Data Reconciliation Framework

Scenario

Critical financial data is replicated from a PostgreSQL OLTP to a Redshift OLAP. Discrepancies directly impact financial reporting. You must build an automated, scalable reconciliation system.

How to Execute

1. Define reconciliation metrics: row counts, sum of key financial columns (e.g., revenue), and hash-based checksums for critical rows.
2. Use Soda to create checks that query both systems and compare the metrics within a defined tolerance.
3. Build a metadata layer that logs reconciliation runs, status, and drift percentages over time.
4. Create a dashboard (e.g., in Metabase) powered by this metadata to give finance stakeholders visibility, and set up escalation paths for when tolerances are breached.
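The reconciliation metrics from step 1 can be sketched as follows. This is an illustration under stated assumptions, not a Soda configuration: two in-memory SQLite databases stand in for PostgreSQL and Redshift, and the `payments` table, its columns, and the default tolerance are hypothetical.

```python
import hashlib
import sqlite3


def table_metrics(conn, table, key_col, amount_col):
    """Compute row count, column sum, and a checksum over rows sorted by key.

    Assumes the amount column contains no NULLs.
    """
    rows = conn.execute(
        f"SELECT {key_col}, {amount_col} FROM {table} ORDER BY {key_col}"
    ).fetchall()
    digest = hashlib.sha256()
    for key, amount in rows:
        digest.update(f"{key}|{amount:.2f}\n".encode())
    return {
        "row_count": len(rows),
        "total": sum(amount for _, amount in rows),
        "checksum": digest.hexdigest(),
    }


def reconcile(source, target, tolerance=0.001):
    """Compare metric dicts; the sum may drift by at most `tolerance` (relative)."""
    denom = abs(source["total"]) or 1.0
    drift = abs(source["total"] - target["total"]) / denom
    return {
        "row_count_match": source["row_count"] == target["row_count"],
        "total_within_tolerance": drift <= tolerance,
        "checksum_match": source["checksum"] == target["checksum"],
        "drift_pct": round(drift * 100, 4),
    }


def make_db(rows):
    """Create an in-memory database holding a small payments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (payment_id INTEGER, revenue REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    return conn


oltp = make_db([(1, 100.00), (2, 250.50), (3, 75.25)])
olap = make_db([(1, 100.00), (2, 250.50), (3, 75.20)])  # one row differs by 0.05

report = reconcile(
    table_metrics(oltp, "payments", "payment_id", "revenue"),
    table_metrics(olap, "payments", "payment_id", "revenue"),
)
print(report)
```

In this demo the totals agree within tolerance while the checksum differs, which is why aggregate comparisons alone are not enough for critical rows. Logging each `report` with a timestamp would give the metadata layer described in step 3.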

Tools & Frameworks

Software & Platforms

Great Expectations, dbt (data build tool), Soda SQL / Soda Cloud

Great Expectations is a Python-centric framework for creating detailed, shareable Expectation Suites and validation reports. dbt provides a declarative YAML-based testing layer integrated directly into transformation models. Soda offers a concise, YAML-based check syntax that compiles to SQL, plus a cloud platform for visualization and alerting.

Infrastructure & Orchestration

Apache Airflow, Prefect, GitHub Actions / CI/CD

Used to schedule, trigger, and monitor data quality validation runs as part of the data pipeline orchestration. They enable quality gates in deployment workflows, preventing bad data from reaching production.

Monitoring & Alerting

PagerDuty, Slack / Microsoft Teams Webhooks, Datadog

Integrate with quality tools to provide real-time notifications (email, chat, SMS) when tests fail, ensuring rapid incident response. They are essential for moving from batch reporting to operational monitoring.
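As a rough illustration of the chat-notification path, the sketch below builds a message for a Slack incoming webhook, which accepts a JSON body with a `text` field, and posts it using only the standard library. The function names, the message wording, and the `SLACK_WEBHOOK_URL` environment variable are assumptions. Nothing is sent unless that variable is set.

```python
import json
import os
import urllib.request


def build_alert(check_name, table, failed_rows):
    """Build the JSON body for a Slack incoming webhook message."""
    text = (
        f"Data quality check `{check_name}` failed on `{table}` "
        f"({failed_rows} failing rows)."
    )
    return {"text": text}


def send_alert(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_alert("unique_order_id", "stg_sales", 3)
print(json.dumps(payload))

# Hypothetical environment variable; the alert is only sent when it is configured.
webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
if webhook_url:
    send_alert(webhook_url, payload)
```

Keeping message construction separate from delivery makes the alert text easy to unit test and lets the same payload builder feed other channels.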

Interview Questions

Answer Strategy

Use a structured debugging framework: Isolate, Hypothesize, Validate, Fix. First, isolate the failure by querying the view directly in Snowflake to find the duplicated `user_id` and its source tables. Hypothesize the cause (e.g., a missing JOIN filter, a late-arriving dimension). Validate by examining the join logic and data freshness. Fix by adjusting the dbt model SQL or the upstream data, then, if a temporary narrower check is needed, add a `unique` test scoped with a `where` config. Emphasize communication with upstream data producers.
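The Isolate and Validate steps can be illustrated with a small sketch in which a join against a table holding several rows per user fans out and duplicates `user_id`. SQLite stands in for Snowflake here, and the `users` and `user_plans` tables, the `is_current` flag, and both model queries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE user_plans (user_id INTEGER, plan TEXT, is_current INTEGER);
    INSERT INTO users VALUES (1, 'ana'), (2, 'ben');
    -- user 2 has one historical and one current plan row
    INSERT INTO user_plans VALUES (1, 'free', 1), (2, 'free', 0), (2, 'pro', 1);
    """
)

# Isolate: list keys that appear more than once in the model output.
FIND_DUPLICATES = """
SELECT user_id, COUNT(*) AS n
FROM ({model_sql})
GROUP BY user_id
HAVING COUNT(*) > 1
"""

# The join lacks a filter on is_current, so user 2 fans out into two rows.
BROKEN_MODEL = (
    "SELECT u.user_id AS user_id, p.plan AS plan "
    "FROM users u JOIN user_plans p ON u.user_id = p.user_id"
)
# Fix: restrict the join to the current plan row.
FIXED_MODEL = BROKEN_MODEL + " AND p.is_current = 1"


def duplicated_keys(model_sql):
    """Return (user_id, count) pairs for keys duplicated by the given query."""
    return conn.execute(FIND_DUPLICATES.format(model_sql=model_sql)).fetchall()


print("before fix:", duplicated_keys(BROKEN_MODEL))
print("after fix:", duplicated_keys(FIXED_MODEL))
```

Running the duplicate query before and after the change gives concrete evidence that the hypothesis was right, which is worth stating explicitly in an interview answer.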

Answer Strategy

This question tests prioritization skills and business alignment. Use a framework based on Impact (downstream usage) and Urgency (freshness, SLA). Highlight collaboration with stakeholders to define severity. The sample answer should show a concrete example, not just theory.