Skill Guide

ETL/ELT pipeline design and orchestration using Airflow, Prefect, or Dagster

The systematic process of designing, building, and scheduling data movement workflows that extract data from sources, transform it (ETL) or load it first (ELT), and orchestrate their execution using specialized workflow management platforms.

This skill is fundamental to building reliable, scalable data infrastructure that powers analytics and machine learning. It directly impacts business outcomes by ensuring data freshness, integrity, and availability for decision-making.

1 Careers

1 Categories

8.5 Avg Demand

25% Avg AI Risk

How to Learn ETL/ELT pipeline design and orchestration using Airflow, Prefect, or Dagster

1. Understand core data concepts: source/target, extraction methods, transformation logic, loading strategies (ETL vs ELT). 2. Learn basic programming in Python and SQL. 3. Familiarize yourself with the CLI and core abstractions (DAGs, tasks) of one orchestrator (e.g., Airflow).

1. Design pipelines for specific scenarios like incremental loading, handling schema evolution, and idempotency. 2. Implement error handling, logging, and alerting within your chosen framework. 3. Common mistake: Creating overly complex, monolithic DAGs; instead, focus on modular, testable tasks.

1. Architect multi-team, cross-domain data platform solutions with governance and observability. 2. Optimize for cost and performance at scale (e.g., dynamic task generation, cluster autoscaling). 3. Define organizational standards, mentor engineers, and evaluate toolchain evolution (e.g., moving from Airflow to Dagster for asset-centric workflows).

Practice Projects

Beginner

Project

Build a Daily Sales Report Pipeline

Scenario

Extract daily sales data from a PostgreSQL database, perform basic aggregations (total sales per category), and load the results into a summary table for a dashboard.

How to Execute

1. Set up a local Airflow/Prefect/Dagster instance. 2. Write an extraction task using Python/SQLAlchemy. 3. Write a transformation task using Pandas or SQL. 4. Write a load task to insert results. 5. Schedule the DAG to run daily.

Intermediate

Project

Implement a Modular ELT Pipeline with dbt

Scenario

Build a pipeline that extracts raw data from an API (e.g., Shopify) and loads it into a cloud data warehouse (e.g., BigQuery), then uses dbt for transformation and testing.

How to Execute

1. Design the extraction as a reusable Airflow Operator or Prefect Task that handles pagination and rate limits. 2. Load raw data into a staging schema. 3. Integrate dbt as a separate task (e.g., using the dbt-core operator) to run models and tests. 4. Implement proper dependency orchestration and failure handling.

Advanced

Project

Design a Self-Service Data Platform with Asset-Based Orchestration

Scenario

Create an internal platform where data analysts can define their own datasets (assets) with metadata, and the orchestrator automatically manages dependencies, freshness, and lineage.

How to Execute

1. Implement an asset registry (e.g., using Dagster's Software-Defined Assets or Airflow Datasets). 2. Build a system for analysts to register assets via a YAML config or UI. 3. The orchestrator dynamically generates the dependency graph and triggers jobs based on data freshness SLAs. 4. Integrate with a data catalog and monitoring stack for observability.

Tools & Frameworks

Software & Platforms

Apache AirflowPrefectDagsterdbt (Data Build Tool)

Airflow is the industry standard for programmatic, code-first workflow authoring. Prefect offers a more modern, dynamic API with a focus on hybrid execution. Dagster emphasizes software engineering principles and asset-based orchestration. dbt is the standard for defining transformations within the warehouse.

Infrastructure & Deployment

DockerKubernetesTerraform/PulumiCloud Data Warehouses (BigQuery, Snowflake, Redshift)

Containerization (Docker) and orchestration (Kubernetes) are essential for deploying scalable, isolated pipeline tasks. IaC tools manage the underlying cloud infrastructure. The data warehouse is the primary target for modern ELT.

Interview Questions

Answer Strategy

Focus on the strategy for identifying new/changed records (e.g., a high-water mark timestamp or change tracking columns). Explain the use of a staging area, deduplication logic, and a merge (upsert) operation in the warehouse. Discuss idempotency and how to handle late data (e.g., re-processing a window). Sample: 'I'd use an incremental strategy based on the `updated_at` timestamp. The pipeline extracts all records modified since the last successful run into a staging table. I'd then use a dbt incremental model with a merge strategy to upsert into the dimension table, deduplicating on the primary key. To handle late data, the model would include a lookback window and I'd set up monitoring on row count anomalies.'

Answer Strategy

Tests incident response, root cause analysis, and improvement mindset. The answer should follow a clear structure: Alert & Triage, Mitigation, Root Cause, Prevention. Sample: 'When a key sales dashboard pipeline failed due to an API schema change, I first implemented a manual data refresh to restore service. The root cause was a lack of schema contract testing. I then added a pre-flight check task in the DAG that validates the source schema against an expected contract, failing fast with a clear alert. Systemically, I championed the adoption of a schema registry for all critical source systems.'