Skill Guide

ETL/ELT pipeline design and orchestration (Airflow, Dagster, Prefect, Mage)

ETL/ELT pipeline design and orchestration is the engineering discipline of architecting, building, scheduling, monitoring, and managing automated data workflows that extract, transform, and load data between systems, using orchestration frameworks like Airflow, Dagster, Prefect, or Mage as the control plane.

This skill is foundational for data-driven organizations: reliable, scalable, and observable data movement directly affects analytics accuracy, ML model freshness, and operational decision-making. Mastery of orchestration tools reduces pipeline failures, data latency, and engineering toil, which translates into business agility and trusted data products.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn ETL/ELT pipeline design and orchestration (Airflow, Dagster, Prefect, Mage)

Beginner

Focus on understanding core data pipeline concepts (extract, transform, and load stages; batch vs. stream; idempotency), learning the basic architecture and vocabulary of a single orchestration framework (e.g., Airflow's DAGs, Tasks, Operators, Schedules), and executing simple, local DAGs that move data from a static source to a static destination.
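Idempotency is the concept here that benefits most from a concrete example. The following is a minimal sketch, not tied to any orchestrator, of a "delete the partition, then insert" load that gives the same result no matter how many times it runs for a given date. The `events` table, its columns, and the helper names are all hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_date TEXT, user_id INTEGER, action TEXT)"
    )
    return conn


def load_events(conn, run_date, rows):
    """Idempotently load one day's rows: clear that day's partition, then insert."""
    with conn:  # single transaction: both statements commit together or roll back
        conn.execute("DELETE FROM events WHERE event_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, action) VALUES (?, ?, ?)",
            [(run_date, user_id, action) for user_id, action in rows],
        )


def count_events(conn):
    """Return the total number of rows in the events table."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Running `load_events` twice for the same `run_date` leaves one copy of that day's rows, which is what makes a retried or backfilled task safe.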

Intermediate

Transition to building robust, production-style pipelines. Work with dynamic DAG generation, implement error handling and alerting, manage connections and secrets securely, and integrate with cloud storage (S3, GCS) and data warehouses (BigQuery, Snowflake). Understand common pitfalls like circular dependencies and non-idempotent tasks.
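The circular-dependency pitfall can be illustrated without any orchestrator. The sketch below uses Python's standard `graphlib` module to compute a run order from a task dependency map and to report a cycle when one exists. The function name and the task names are hypothetical; orchestrators perform their own equivalent validation when a pipeline is parsed.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return a valid run order for the tasks, or raise ValueError on a cycle.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"circular dependency: {cycle}") from exc
```

For example, `execution_order({"load": {"transform"}, "transform": {"extract"}, "extract": set()})` yields the tasks in extract, transform, load order, while a map in which two tasks depend on each other raises `ValueError`.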

Advanced

Master the design of large-scale, multi-team data platform ecosystems. Focus on advanced patterns: data-aware scheduling, lineage tracking, cost optimization, multi-framework interoperability, and establishing governance and testing standards (e.g., DAG unit testing, CI/CD for pipelines). Lead architectural decisions and mentor teams on scalability and observability.

Practice Projects

Beginner

Project

Automated Daily Data Ingestion with Airflow

Scenario

A startup needs to ingest its user activity logs daily from a REST API into a local PostgreSQL database for basic reporting.

How to Execute

1. Set up a local Airflow instance (Docker is recommended).
2. Define a DAG that runs daily, using the PythonOperator to call a script that fetches data from a mock API and writes it to a CSV.
3. Use the BashOperator or PostgresOperator to load that CSV into PostgreSQL (recent Postgres provider releases favor the generic SQLExecuteQueryOperator over PostgresOperator, so check which one your installed version supports).
4. Add basic email alerts on task failure using Airflow's built-in mechanism.
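The fetch-and-write logic from step 2 could look roughly like the sketch below. It deliberately uses only the standard library so it runs without Airflow installed; in the real DAG, `write_activity_csv` would be passed as the `python_callable` of a PythonOperator. The field names, the file naming scheme, and the `fetch_activity` stand-in for the mock API are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# Hypothetical schema for the activity records.
FIELDS = ["user_id", "action", "timestamp"]


def fetch_activity(run_date):
    """Stand-in for the mock API call; a real version would issue an HTTP request."""
    return [
        {"user_id": 1, "action": "login", "timestamp": f"{run_date}T08:00:00"},
        {"user_id": 2, "action": "click", "timestamp": f"{run_date}T09:30:00"},
    ]


def write_activity_csv(run_date, out_dir, fetch=fetch_activity):
    """Fetch one day's records and write them to <out_dir>/activity_<run_date>.csv."""
    out_path = Path(out_dir) / f"activity_{run_date}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Opening in "w" mode overwrites any earlier file, so re-runs stay idempotent.
    with out_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(fetch(run_date))
    return str(out_path)
```

Keeping the logic in a plain function, with the fetcher injectable, also makes it straightforward to unit test outside the scheduler.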

Intermediate

Project

Cloud-Native ELT Pipeline with dbt and Dagster

Scenario

A mid-size company uses a cloud data warehouse (Snowflake) and wants to implement an ELT pattern where raw data is loaded first, then transformed in-warehouse using dbt, orchestrated reliably with Dagster.

How to Execute

1. Configure Dagster to connect to Snowflake and a cloud object store (S3).
2. Build an asset-based pipeline in Dagster where the first asset is defined in Python and loads raw CSV files from S3 into a Snowflake raw schema.
3. Define a downstream dbt asset that runs the dbt transformations (staging, intermediate, final models).
4. Implement asset checks in Dagster for data quality (e.g., not null, unique keys) and set up schedules or sensors to trigger the pipeline on new file arrivals or daily.
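The data quality rules in step 4 can be prototyped as plain Python before they are wired into Dagster's asset check mechanism. The sketch below is framework-free; the function names and the result dictionary shape are hypothetical, and rows are modeled as simple dictionaries rather than warehouse query results.

```python
def check_not_null(rows, column):
    """Report the indexes of rows where `column` is missing or None."""
    failures = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "check": f"not_null({column})",
        "passed": not failures,
        "failing_rows": failures,
    }


def check_unique(rows, column):
    """Report the indexes of rows whose `column` value was already seen."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return {
        "check": f"unique({column})",
        "passed": not duplicates,
        "failing_rows": duplicates,
    }
```

In a real deployment the same rules would more likely run as SQL against Snowflake (or as dbt tests), with the orchestrator recording only the pass or fail outcome.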

Advanced

Project

Cross-Departmental Data Mesh Orchestration Platform

Scenario

A large enterprise is adopting a data mesh paradigm. The central data platform team must build a self-service orchestration layer that allows domain teams (Marketing, Finance) to independently develop, deploy, and monitor their own data products using preferred tools (some use Airflow, others Prefect), while enforcing governance, lineage, and SLOs.

How to Execute

1. Design a federated architecture: deploy isolated, managed Airflow/Prefect instances per domain team.
2. Implement a central metadata layer (using OpenLineage, DataHub) that aggregates lineage and pipeline status from all domains via API hooks or log scrapers.
3. Build a unified monitoring dashboard (Grafana) that tracks global SLOs (freshness, success rate) and a cost dashboard per domain.
4. Create a CLI/GitLab CI pipeline template that standardizes project structure, testing, and deployment to the respective orchestrator, incorporating policy-as-code checks (e.g., using OPA) for data contracts and security.
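The two SLOs named in step 3, freshness and success rate, reduce to simple calculations over run records once the metadata layer has collected them. The sketch below assumes a hypothetical record shape (a dictionary with `status` and `finished_at` keys) and hypothetical thresholds; a production version would read from the central metadata store and export the results as metrics.

```python
from datetime import datetime, timedelta, timezone


def success_rate(runs):
    """Fraction of runs that succeeded, or None when there are no runs."""
    if not runs:
        return None
    return sum(1 for run in runs if run["status"] == "success") / len(runs)


def evaluate_slo(runs, now, max_age, min_success_rate):
    """Evaluate freshness and success-rate SLOs for one data product."""
    successes = [run["finished_at"] for run in runs if run["status"] == "success"]
    rate = success_rate(runs)
    # Fresh means the most recent successful run finished within max_age.
    fresh = bool(successes) and (now - max(successes)) <= max_age
    meets = fresh and rate is not None and rate >= min_success_rate
    return {"success_rate": rate, "fresh": fresh, "meets_slo": meets}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the evaluation deterministic and easy to test.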

Tools & Frameworks

Orchestration Platforms

Apache Airflow, Dagster, Prefect, Mage

The core control planes for defining, scheduling, and monitoring pipelines. Airflow offers mature extensibility; Dagster emphasizes software-defined assets and testability; Prefect focuses on dynamic, Python-native flows; Mage is a newer, integrated notebook-like editor. Choice depends on team maturity, use case (batch vs. event-driven), and ecosystem needs.

Transformation & Storage

dbt (data build tool), Apache Spark, Snowflake/BigQuery/Redshift, S3/GCS/Azure Blob

Used within orchestrated tasks. dbt manages SQL-based ELT transformations in-warehouse. Spark handles large-scale batch/stream processing. Cloud data warehouses are the primary compute and storage targets. Object stores are the landing zone for raw data.

Infrastructure & Observability

Docker/Kubernetes, Terraform, OpenLineage, Grafana/Prometheus, PagerDuty

Containerization (Docker/K8s) ensures environment consistency. Terraform manages cloud infrastructure as code. OpenLineage provides data lineage. Grafana/Prometheus monitor pipeline metrics and resource usage. PagerDuty handles incident alerting and on-call rotation.

Interview Questions

Answer Strategy

Focus on a systematic approach:
**1) Isolate and Modularize** by breaking the DAG into smaller, domain-specific DAGs with clear ownership, preferably as independent DAGs triggered by sensors or API calls. SubDAGs are deprecated in Airflow 2 and are best avoided; TaskGroups cover visual grouping within a single DAG.
**2) Introduce Idempotency and Retries** by redesigning tasks to be safe to re-run and configuring exponential backoff retries with alerts.
**3) Implement Data Awareness** by replacing purely time-based schedules with data-aware triggers (e.g., S3KeySensor) so tasks run only when upstream data is available.
**4) Add Testing** by wrapping task logic in testable Python functions and using Airflow's testing utilities or mocking frameworks.
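The exponential backoff mentioned in point 2 can be sketched independently of Airflow, which exposes the same idea through task arguments such as `retries`, `retry_delay`, and `retry_exponential_backoff`. The function names, default values, and the injectable `sleep` parameter below are assumptions chosen to keep the example small and testable.

```python
import time


def backoff_delays(retries, base_delay=1.0, factor=2.0, max_delay=60.0):
    """Delay before each retry: base_delay * factor**attempt, capped at max_delay."""
    return [min(base_delay * factor**attempt, max_delay) for attempt in range(retries)]


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `task` until it succeeds, waiting longer after each failure.

    The task must be idempotent, because it may run up to retries + 1 times.
    """
    delays = backoff_delays(retries, base_delay)
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the error so alerting can fire
            sleep(delays[attempt])
```

With the defaults, `backoff_delays(4)` produces waits of 1, 2, 4, and 8 seconds; the cap prevents a long retry chain from stalling a pipeline for hours.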

Answer Strategy

The question tests **architectural thinking and tool evaluation**. A strong answer contrasts the paradigms: Airflow focuses on **task orchestration** (how to do it), making it flexible but requiring manual lineage and dependency tracking. Dagster's **asset-based** model focuses on **what to produce** (the data assets), making lineage, freshness, and quality first-class concepts, which improves developer experience and observability for data products. For a new project focused on data as a product, Dagster's model may accelerate development; for complex, non-data workflows or teams deeply embedded in Airflow's ecosystem, Airflow's flexibility may be preferable.
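The asset-based idea can be made concrete with a toy example. In the sketch below, each function declares what it produces, and its parameter names declare which upstream assets it needs, so the run order is derived from the data dependencies rather than wired by hand. The `asset` decorator and `materialize_all` helper are hypothetical stand-ins written for this illustration; they mimic the style of Dagster's model but are not its API.

```python
import inspect
from graphlib import TopologicalSorter

# Registry of asset name -> function; a toy stand-in for an orchestrator's asset graph.
ASSETS = {}


def asset(fn):
    """Register a function as an asset named after the function."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 35}]


@asset
def order_total(raw_orders):
    # The parameter name declares the dependency on the raw_orders asset.
    return sum(order["amount"] for order in raw_orders)


def materialize_all(assets=ASSETS):
    """Compute every asset in dependency order and return the results by name."""
    deps = {name: set(inspect.signature(fn).parameters) for name, fn in assets.items()}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        results[name] = assets[name](**{dep: results[dep] for dep in deps[name]})
    return results
```

Because the graph is built from the declared inputs and outputs, lineage comes for free: the `deps` mapping is already a record of which asset feeds which.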