Skill Guide

Data engineering and ETL pipeline design using Airflow, dbt, and cloud-native tools

The practice of designing, building, and maintaining automated, version-controlled data pipelines that extract, transform, and load (ETL/ELT) data from source systems to analytical destinations using Apache Airflow for orchestration, dbt for in-warehouse transformation, and cloud-native services for scalability.

This skill directly enables data-driven decision-making by ensuring reliable, timely, and high-quality data flows. It reduces operational costs through automation and empowers analytics teams to trust their data, accelerating time-to-insight for competitive advantage.

1 Careers

1 Categories

8.5 Avg Demand

20% Avg AI Risk

How to Learn Data engineering and ETL pipeline design using Airflow, dbt, and cloud-native tools

1. **Core Concepts**: Understand DAGs (Directed Acyclic Graphs), tasks, and operators in Airflow, and the ELT paradigm (Extract-Load-Transform).
2. **SQL & dbt Fundamentals**: Master advanced SQL and learn to write dbt models, tests, and documentation.
3. **Cloud Basics**: Get familiar with a major cloud platform's storage (e.g., S3, GCS) and data warehouse (e.g., Snowflake, BigQuery) services.
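
To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library rather than Airflow itself. The task names are hypothetical; the point is that each task runs only after its upstream tasks, and that a cycle makes the graph impossible to schedule.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract_sales": set(),
    "load_to_staging": {"extract_sales"},
    "transform_with_dbt": {"load_to_staging"},
    "notify": {"transform_with_dbt"},
}


def execution_order(deps):
    """Return a run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)

# A cycle cannot be ordered, which is why the graph must be acyclic.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

An orchestrator such as Airflow performs this kind of ordering for you; the sketch only shows the underlying constraint.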

1. **Pipeline Orchestration**: Design idempotent tasks and implement dynamic task generation in Airflow (for example, creating tasks in a loop or with dynamic task mapping). Use `PythonOperator` for custom logic and Jinja templating for run-specific parameters such as the logical date.
2. **dbt Proficiency**: Use dbt macros, seeds, and sources, and build multi-layered models (staging, intermediate, marts). Implement data quality tests.
3. **Common Pitfalls**: Avoid hard-coded paths, non-idempotent loads, and missing error alerting. Use Airflow's `BranchPythonOperator` for conditional logic and Airflow variables/connections for configuration.
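
The idempotency point is easiest to see in code. The following is a minimal sketch, assuming a hypothetical `staging_sales` table and using an in-memory SQLite database as a stand-in for the warehouse. The load replaces the whole partition for a run date inside one transaction, so a retried run does not duplicate rows.

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Replace all rows for run_date so that reruns do not create duplicates."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM staging_sales WHERE sale_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO staging_sales (sale_date, order_id, amount) VALUES (?, ?, ?)",
            [(run_date, order_id, amount) for order_id, amount in rows],
        )


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (sale_date TEXT, order_id TEXT, amount REAL)")

batch = [("o-1", 10.0), ("o-2", 25.5)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(row_count)
```

A plain `INSERT` without the preceding `DELETE` would leave four rows after the retry, which is the non-idempotent load the pitfalls list warns about.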

1. **System Architecture**: Design scalable, fault-tolerant multi-environment (dev/staging/prod) pipeline infrastructure using Infrastructure as Code (IaC).
2. **Cost & Performance Optimization**: Profile and optimize Airflow scheduler load, dbt incremental models, and warehouse compute costs.
3. **Strategic Leadership**: Establish data mesh principles, define SLAs/SLOs for data pipelines, and mentor teams on best practices for data reliability engineering.
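
One way to make a pipeline SLO measurable is a freshness check on the most recent load. The sketch below is illustrative only: the thresholds and timestamps are assumptions, and the `warn_after`/`error_after` names are borrowed from the vocabulary of dbt source freshness rather than taken from any specific project.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_loaded_at, now, warn_after, error_after):
    """Classify data freshness as 'pass', 'warn', or 'error' based on its age."""
    age = now - last_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


# Hypothetical SLO: warn when data is older than 6 hours, fail after 24 hours.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 2, 5, 0, tzinfo=timezone.utc)

status = freshness_status(
    last_load,
    now,
    warn_after=timedelta(hours=6),
    error_after=timedelta(hours=24),
)
print(status)
```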

Practice Projects

Beginner

Project

Automated Daily Sales Report Pipeline

Scenario

Build a pipeline that daily extracts raw sales data from a CSV in S3, loads it into a BigQuery staging table, transforms it using dbt into a summarized sales mart, and sends a Slack alert on success or failure.

How to Execute

1. Set up a local Airflow instance with a LocalExecutor and configure Airflow connections for AWS (S3), Google Cloud (GCS and BigQuery), and Slack.
2. Write an Airflow DAG with `S3ToGCSOperator`, `GCSToBigQueryOperator`, and a `BashOperator` to run `dbt run`.
3. Define dbt models to clean and aggregate raw sales data.
4. Use `SlackAPIPostOperator` (or `EmailOperator` if email is preferred) to send notifications. To alert on failure as well as success, attach an `on_failure_callback` or give the notification task a suitable `trigger_rule`.
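
The clean-and-aggregate step would normally be written as dbt SQL models. To show the logic in isolation, here is a small Python sketch. The column names, the sample data, and the cleaning rule (skip rows with a missing amount) are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw extract; a real pipeline would read this from S3.
RAW_CSV = """order_id,sale_date,amount
o-1,2024-01-01,10.00
o-2,2024-01-01,25.50
o-3,2024-01-02,
o-4,2024-01-02,7.25
"""


def summarize_daily_sales(csv_text):
    """Drop rows without an amount, then total orders and revenue per day."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["amount"].strip():
            continue  # cleaning rule (assumption): skip rows with a missing amount
        day = totals[row["sale_date"]]
        day["orders"] += 1
        day["revenue"] += float(row["amount"])
    return dict(totals)


summary = summarize_daily_sales(RAW_CSV)
for sale_date, stats in sorted(summary.items()):
    print(sale_date, stats["orders"], stats["revenue"])
```

In the dbt version, the cleaning rule would typically live in a staging model and the aggregation in a mart model that selects from it.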

Intermediate

Project

Multi-Source CRM Data Integration & Modeling

Scenario

Integrate data from Salesforce (via REST API) and a legacy PostgreSQL database into Snowflake. Use dbt to create a unified customer 360 view, handling schema changes and data quality checks.

How to Execute

1. Use Airflow's `HttpOperator` (named `SimpleHttpOperator` in older HTTP provider releases) with authentication to extract Salesforce data. For the legacy database, note that `PostgresOperator` only executes SQL; to pull rows out, use `PostgresHook` inside a Python task or a transfer operator.
2. Load raw JSON and CSV data into Snowflake raw stages.
3. Design dbt models using `source` freshness checks, `ref` functions, and `dbt_utils` macros for schema normalization.
4. Implement dbt tests (`unique`, `not_null`, `accepted_values`) and Airflow sensors to wait for upstream data availability.
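
To clarify what the `unique`, `not_null`, and `accepted_values` tests assert, the sketch below re-implements their intent in plain Python. This is not how dbt runs them (dbt compiles each test to a SQL query that returns the failing rows); the function names and the sample customer rows are hypothetical.

```python
from collections import Counter


def failing_unique(rows, column):
    """Return non-null values that appear more than once."""
    counts = Counter(row[column] for row in rows if row[column] is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def failing_not_null(rows, column):
    """Return rows whose column is None."""
    return [row for row in rows if row[column] is None]


def failing_accepted_values(rows, column, accepted):
    """Return non-null values that fall outside the accepted set."""
    return sorted(
        {row[column] for row in rows
         if row[column] is not None and row[column] not in accepted}
    )


# Hypothetical unified customer rows with three deliberate problems.
customers = [
    {"customer_id": 1, "email": "a@example.com", "status": "active"},
    {"customer_id": 2, "email": None, "status": "churned"},
    {"customer_id": 2, "email": "c@example.com", "status": "unknown"},
]

duplicate_ids = failing_unique(customers, "customer_id")
missing_emails = failing_not_null(customers, "email")
bad_statuses = failing_accepted_values(customers, "status", {"active", "churned"})
print(duplicate_ids, len(missing_emails), bad_statuses)
```

A test passes when its list of failures is empty, which matches dbt's convention that a test query returning zero rows is a success.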

Advanced

Project

Real-Time Event Streaming Pipeline with dbt Models

Scenario

Design a hybrid pipeline where streaming clickstream data from Kafka is landed in a cloud data lake (e.g., Delta Lake on S3) via a connector, then processed in near-real-time by scheduled dbt jobs triggered by Airflow, handling late-arriving data and schema evolution.

How to Execute

1. Architect a solution using Confluent Kafka Connect or AWS Kinesis Firehose to land streaming data into an S3 data lake partitioned by date/hour.
2. Use Airflow with `ExternalTaskSensor` or `S3KeySensor` to trigger dbt jobs upon new data arrival.
3. Leverage dbt incremental models with the `merge` strategy and `is_incremental()` logic to efficiently process only new or updated records.
4. Implement data validation and alerting for data drift, for example with Great Expectations running alongside the dbt project or the `dbt-expectations` package inside it.
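
The interaction between incremental filtering, merging, and late-arriving data can be sketched without a warehouse. The example below is an illustration under stated assumptions, not dbt's implementation: the row shape, the `event_id` key, and the three-hour lookback window are hypothetical. It filters on a high-water mark minus a lookback window, then upserts by key so that rows re-read inside the window are not duplicated.

```python
from datetime import datetime, timedelta


def incremental_filter(source_rows, target, lookback):
    """Keep source rows newer than the target's high-water mark minus a lookback."""
    if not target:
        return list(source_rows)  # first run: load everything
    watermark = max(row["event_time"] for row in target.values())
    cutoff = watermark - lookback
    return [row for row in source_rows if row["event_time"] > cutoff]


def merge_rows(target, new_rows):
    """Upsert by event_id so re-read rows overwrite instead of duplicating."""
    merged = dict(target)
    for row in new_rows:
        merged[row["event_id"]] = row
    return merged


def at(hour):
    """Build a timestamp on a fixed, hypothetical day."""
    return datetime(2024, 1, 1, hour)


target = {
    "e1": {"event_id": "e1", "event_time": at(10), "page": "/home"},
    "e2": {"event_id": "e2", "event_time": at(12), "page": "/cart"},
}
source = [
    {"event_id": "e2", "event_time": at(12), "page": "/cart"},      # already loaded
    {"event_id": "e3", "event_time": at(11), "page": "/search"},    # late arrival
    {"event_id": "e4", "event_time": at(13), "page": "/checkout"},  # new event
]

batch = incremental_filter(source, target, lookback=timedelta(hours=3))
updated_target = merge_rows(target, batch)
print(sorted(updated_target))
```

With a lookback of zero, the late event `e3` would be filtered out because its event time is older than the high-water mark. The trade-off is that a wider window re-scans more data on every run.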

Tools & Frameworks

Orchestration & Workflow Management

Apache Airflow, Prefect, Dagster

Airflow is the industry standard for defining, scheduling, and monitoring complex workflows via Python code (DAGs). Prefect and Dagster are modern alternatives offering more dynamic, programmatic orchestration and built-in observability.

In-Warehouse Transformation

dbt (data build tool), SQLMesh

dbt enables analytics engineers to transform data in the warehouse using SQL SELECT statements, promoting version control, documentation, and testing. SQLMesh offers similar functionality with built-in virtual data environments and advanced lineage.

Cloud Data Platforms

Snowflake, Google BigQuery, Amazon Redshift, Databricks Lakehouse

These are the target analytical engines. Understanding their specific SQL dialects, storage formats, compute scaling models, and cost structures is critical for efficient pipeline design.

Infrastructure & Observability

Terraform, Docker, Datafold, Monte Carlo

Terraform manages cloud infrastructure as code. Docker containerizes Airflow and dbt for consistent environments. Datafold and Monte Carlo provide data diffing, quality monitoring, and observability for pipeline outputs.

Interview Questions

Answer Strategy

The candidate should demonstrate knowledge of Airflow's built-in features for resilience. Key points: 1) Use `retries` and `retry_delay` parameters. 2) Implement `trigger_rule` (e.g., `all_success`, `one_failed`) to control downstream execution. 3) Configure `email_on_failure` and `email_on_retry`. 4) Use `on_failure_callback` or `on_success_callback` for custom alerting (e.g., to PagerDuty). 5) For critical paths, consider using Airflow Pools to limit concurrent task execution and prevent resource exhaustion.
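
The semantics behind `retries`, `retry_delay`, and `on_failure_callback` can be illustrated with a plain-Python sketch. This is not Airflow's implementation; the `run_with_retries` helper, its callback signature, and the flaky task are hypothetical. It only models the behavior described above: a task gets one initial attempt plus up to `retries` further attempts, and the failure callback fires once the last attempt fails.

```python
import time


def run_with_retries(task, retries, retry_delay_seconds, on_failure_callback=None):
    """Run task(); retry up to `retries` more times, then alert and re-raise."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            if attempt >= retries:
                if on_failure_callback is not None:
                    on_failure_callback(exc, attempt + 1)  # total attempts made
                raise
            attempt += 1
            time.sleep(retry_delay_seconds)


# Hypothetical task that fails twice before succeeding.
calls = {"n": 0}


def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return "ok"


alerts = []  # stand-in for a PagerDuty or Slack notification


def record_alert(exc, attempts):
    alerts.append((str(exc), attempts))


result = run_with_retries(
    flaky_extract,
    retries=3,
    retry_delay_seconds=0,
    on_failure_callback=record_alert,
)
print(result, calls["n"], alerts)
```

Because the task succeeds on its third attempt, no alert is recorded. Retries like this are only safe when the task is idempotent.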

Answer Strategy

This tests architectural thinking and process discipline. Answer: 'First, I would audit the current state using `dbt docs generate` and the DAG visualization to map dependencies. Second, I would establish a foundation by adding source definitions (`sources.yml`) and fundamental tests (`unique`, `not_null`) to all models. Third, I would refactor incrementally, breaking monolithic models into a layered architecture (staging -> intermediate -> marts) using dbt's `ref` function. Throughout, I would enforce governance by requiring new PRs to include documentation and tests, and run `dbt test` in CI/CD.'