Skill Guide

ETL/ELT pipeline orchestration (Airflow, Dagster, Prefect)

ETL/ELT pipeline orchestration is the automated management, scheduling, monitoring, and dependency resolution of data movement and transformation workflows across distributed systems using specialized platforms.

It is the foundational operational layer that ensures data reliability, timeliness, and cost-efficiency, directly enabling data-driven decision-making and product functionality. Without robust orchestration, data pipelines become fragile, opaque, and a bottleneck for engineering and business teams.

1 Careers

1 Categories

8.7 Avg Demand

18% Avg AI Risk

How to Learn ETL/ELT pipeline orchestration (Airflow, Dagster, Prefect)

1. **Core Concepts:** Understand Directed Acyclic Graphs (DAGs), idempotency, task dependencies, and the distinction between ETL and ELT.
2. **Local Setup:** Install and run Apache Airflow locally using its standalone mode or Docker Compose.
3. **First DAGs:** Write basic DAGs with PythonOperator and BashOperator, focusing on defining dependencies and scheduling with cron expressions.
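To make the DAG concept concrete, here is a minimal sketch of dependency resolution using only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow code: the task names and the `execution_order` and `has_cycle` helpers are hypothetical, chosen only to show how an orchestrator might derive a valid run order and reject cyclic definitions.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# Task names are illustrative only.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def execution_order(deps):
    """Return one valid run order for the given dependency mapping."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependency mapping is not a valid DAG."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


order = execution_order(dependencies)
print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

Because the example graph is a single chain, there is only one valid order. With parallel branches, several orders would be valid and an orchestrator is free to run independent tasks concurrently.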

Move beyond tutorials by building pipelines for real data sources (e.g., a public API, a CSV file). Practice implementing error handling (retries, alerts), parameterizing DAGs, and using variables/connections. Common mistakes include monolithic DAGs, passing large datasets through XCom (it is meant for small values such as file paths or row counts; store the data externally and pass a reference), and neglecting idempotency. Start exploring asset-centric orchestration concepts (like Dagster's Software-Defined Assets).
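Idempotency is easiest to see in a load step. The sketch below assumes a hypothetical `daily_totals` table keyed by run date and uses an in-memory SQLite database as a stand-in for a real warehouse; the point is only that re-running the same logical date overwrites the row instead of appending a duplicate.

```python
import sqlite3


def load_daily_total(conn, run_date, total):
    """Idempotent load: a rerun for the same run_date replaces the existing row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_totals ("
        "run_date TEXT PRIMARY KEY, total REAL)"
    )
    # The primary key on run_date makes INSERT OR REPLACE overwrite, not duplicate.
    conn.execute(
        "INSERT OR REPLACE INTO daily_totals (run_date, total) VALUES (?, ?)",
        (run_date, total),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_daily_total(conn, "2024-01-01", 10.0)
load_daily_total(conn, "2024-01-01", 12.5)  # simulated retry or backfill of the same day
rows = conn.execute("SELECT run_date, total FROM daily_totals").fetchall()
print(rows)
```

A task written this way can be retried or backfilled safely, which is what makes automatic retries a reasonable default rather than a source of duplicated data.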

Master orchestration as a system design challenge. Architect for scalability by implementing separate scheduler and worker fleets, managing resource queues, and optimizing task concurrency. Implement complex patterns: dynamic DAG generation, cross-DAG dependencies, and blue/green deployments for pipeline code. Align orchestration strategy with business SLAs and cost models, and mentor teams on design patterns and observability.

Practice Projects

Beginner

Project

Build a Weather Data Ingestion Pipeline

Scenario

Create a daily pipeline that fetches weather data from a public API (e.g., Open-Meteo), transforms it (converts units, selects fields), and loads it into a local PostgreSQL database.

How to Execute

1. Write an Airflow DAG with three tasks: `extract` (PythonOperator calling the API), `transform` (PythonOperator cleaning data), `load` (PythonOperator writing to Postgres).
2. Define task dependencies.
3. Use Airflow Connections to manage database credentials.
4. Schedule it to run daily and monitor in the Airflow UI.
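The three task bodies can be prototyped as plain Python functions before they are wrapped in operators. This is a sketch under stated assumptions: `extract` is a stub that returns a hard-coded payload instead of calling the API, the payload shape (`hourly`, `time`, `temperature_2m`) is assumed rather than taken from the source, and `load` writes to a dictionary standing in for the Postgres table.

```python
def extract():
    """Stub for the API call; the payload shape here is an assumption."""
    return {
        "hourly": {
            "time": ["2024-01-01T00:00", "2024-01-01T01:00"],
            "temperature_2m": [0.0, 10.0],  # degrees Celsius
        }
    }


def transform(payload):
    """Select the needed fields and convert Celsius to Fahrenheit."""
    hourly = payload["hourly"]
    return [
        {"observed_at": ts, "temp_f": round(celsius * 9 / 5 + 32, 1)}
        for ts, celsius in zip(hourly["time"], hourly["temperature_2m"])
    ]


def load(rows, table):
    """Stand-in for the Postgres write: keyed by timestamp so reruns do not duplicate."""
    for row in rows:
        table[row["observed_at"]] = row
    return len(rows)


table = {}
loaded = load(transform(extract()), table)
print(loaded, table["2024-01-01T01:00"])
```

Keeping the functions free of orchestrator imports makes them easy to unit test; the DAG file then only wires them together and sets the schedule.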

Intermediate

Project

Orchestrate a Multi-Source Data Mart Refresh

Scenario

Build a pipeline that pulls user data from a mock REST API and order data from a CSV file, joins them in a transformation, and materializes a final analytics table in a data warehouse (e.g., BigQuery). The pipeline must handle API failures gracefully.

How to Execute

1. Use Airflow or Dagster to define two parallel extraction branches.
2. Implement a `transform` task that joins the datasets.
3. Use Airflow's BranchPythonOperator or Dagster's conditional execution to skip the load if upstream data quality checks fail.
4. Add Slack notifications (e.g., `SlackWebhookOperator` from Airflow's Slack provider package) or email alerts for task failures.
5. Parameterize the DAG for different environments (dev/prod).
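The join and the quality gate from the steps above can be sketched in plain Python. A branching callable in Airflow returns the ID of the task to follow, so `choose_next_task` below mirrors that shape; the field names, the task IDs `load` and `skip_load`, and the 0.9 match threshold are all hypothetical.

```python
def join_users_orders(users, orders):
    """Inner-join orders to users on user_id, adding the user's email to each order."""
    users_by_id = {user["user_id"]: user for user in users}
    return [
        {**order, "email": users_by_id[order["user_id"]]["email"]}
        for order in orders
        if order["user_id"] in users_by_id
    ]


def choose_next_task(joined, orders, min_match_ratio=0.9):
    """Return the task id to run next, based on how many orders matched a user."""
    if not orders:
        return "skip_load"
    match_ratio = len(joined) / len(orders)
    return "load" if match_ratio >= min_match_ratio else "skip_load"


users = [{"user_id": 1, "email": "a@example.com"}, {"user_id": 2, "email": "b@example.com"}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 2},
    {"order_id": 12, "user_id": 99},  # no matching user: counts against quality
]
joined = join_users_orders(users, orders)
decision = choose_next_task(joined, orders)
print(len(joined), decision)
```

Here one of three orders has no matching user, so the match ratio falls below the threshold and the load is skipped rather than publishing an incomplete table.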

Advanced

Project

Design a Modular, Self-Service Orchestration Platform

Scenario

Your data platform team needs to support 50+ data scientists and analysts who must deploy their own pipelines with governance. Design an orchestration layer that allows users to define pipelines in a templated DSL, with centralized monitoring, access control, and cost tracking.

How to Execute

1. Architect a system where user-defined YAML/SQL configs are dynamically compiled into orchestrator DAGs (Airflow DAG factories or Dagster Definitions).
2. Implement a metadata service for lineage and SLA tracking.
3. Set up resource pools and queues to isolate workloads and control costs.
4. Create a deployment pipeline (CI/CD) for pipeline code.
5. Develop a runbook for incident response and onboard teams with templates.
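One way to picture the compile step is a validator that turns a parsed config into a checked execution plan before any orchestrator object is created. The sketch below assumes the YAML has already been parsed into a dictionary; the config schema (`name`, `tasks`, `depends_on`, `pool`), the pool names, and `compile_pipeline` itself are hypothetical, not part of Airflow or Dagster.

```python
from graphlib import CycleError, TopologicalSorter

ALLOWED_POOLS = {"default", "heavy_sql"}  # hypothetical pool names set by the platform team


def compile_pipeline(config):
    """Validate a user config and return a plan: run order plus pool per task."""
    names = [task["name"] for task in config["tasks"]]
    if len(names) != len(set(names)):
        raise ValueError("duplicate task names")
    deps, pools = {}, {}
    for task in config["tasks"]:
        upstream = set(task.get("depends_on", []))
        unknown = upstream - set(names)
        if unknown:
            raise ValueError(f"unknown dependencies: {sorted(unknown)}")
        pool = task.get("pool", "default")
        if pool not in ALLOWED_POOLS:
            raise ValueError(f"pool {pool!r} is not allowed")
        deps[task["name"]] = upstream
        pools[task["name"]] = pool
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError("dependency cycle in pipeline config") from exc
    return {"name": config["name"], "order": order, "pools": pools}


config = {
    "name": "orders_mart",
    "tasks": [
        {"name": "stage_orders"},
        {"name": "build_mart", "depends_on": ["stage_orders"], "pool": "heavy_sql"},
    ],
}
plan = compile_pipeline(config)
print(plan)
```

Running this validation in CI lets users get fast feedback on bad configs, while the platform team keeps control of which pools and patterns are permitted.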

Tools & Frameworks

Orchestration Platforms

Apache Airflow, Dagster, Prefect, Mage

Core platforms for defining, scheduling, and monitoring workflows. Airflow is the de facto standard with a massive ecosystem. Dagster offers a strong asset-centric model and type system. Prefect emphasizes dynamic workflows and a modern API.

Infrastructure & Deployment

Docker, Kubernetes, Helm Charts, Terraform

Containers and orchestration are used to deploy the orchestrator itself and run tasks in isolated environments. Kubernetes is standard for scalable, production-grade deployments. Helm/Terraform manage configuration as code.

Monitoring & Observability

Grafana, Prometheus, PagerDuty, Datadog

Used to monitor orchestrator health (scheduler, workers), pipeline performance (run duration, task latency), and trigger alerts. Custom metrics are often emitted from pipelines to these systems.

Data Tooling Integration

dbt, Spark, AWS Glue, Google Cloud Dataflow

Orchestrators are the glue that triggers and manages other data tools. For example, an Airflow DAG can trigger a dbt build, a Spark job on EMR, or a Dataflow pipeline.

Interview Questions

Answer Strategy

The candidate should demonstrate system design thinking. They should discuss: 1) choosing the right tool and executor (Prefect, or Airflow with the Celery or Kubernetes executor, depending on latency needs), 2) designing for high availability (multiple schedulers, external metadata database), 3) implementing robust monitoring and alerting, 4) using idempotency and dead-letter queues for reliability, and 5) strategies for zero-downtime deployments of pipeline code. Sample answer: 'For a real-time feature pipeline with strict SLAs, I'd likely use Prefect, or Airflow with the CeleryExecutor where task-start latency matters most and the KubernetesExecutor where per-task isolation and elastic scaling matter more than pod startup time. I'd deploy the orchestrator's scheduler and webserver in a highly available configuration with an external Postgres database. Idempotency would be baked into each task. We'd implement granular monitoring with Prometheus metrics exported to Grafana and use canary deployments for pipeline updates to avoid downtime.'

Answer Strategy

Tests operational maturity, problem-solving, and learning from failure. The answer should follow a clear structure: Situation, Task, Action, Result (STAR). Focus on the post-mortem process and systemic fixes, not just the immediate fix. Sample answer: 'A daily aggregation pipeline failed due to an upstream API rate limit being exceeded during peak hours. My immediate action was to implement a retry with exponential backoff. The root cause was poor scheduling design. In the post-mortem, I led a change to stagger our DAG start times and added a `check_upstream_health` task as a gate before extraction. I also documented the API's limits in our internal wiki and updated our runbook to include rate-limit checks.'
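The retry with exponential backoff mentioned in the sample answer can be sketched in a few lines of standard-library Python. The helper names, the default delays, and the use of `RuntimeError` to represent a rate-limit failure are assumptions for illustration; a production version would typically catch the client library's specific exception and add jitter.

```python
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Delays in seconds before each retry: base, base*factor, ... capped at cap."""
    return [min(cap, base * factor**attempt) for attempt in range(retries)]


def call_with_retries(fn, retries=3, base=1.0, sleep=time.sleep):
    """Call fn, retrying on RuntimeError with exponentially growing waits."""
    delays = backoff_delays(retries, base=base)
    for attempt in range(retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries:
                raise  # retries exhausted: surface the failure to the orchestrator
            sleep(delays[attempt])


# Simulate an API that is rate-limited twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_fetch, retries=3, base=1.0, sleep=waits.append)
print(result, waits)
```

Injecting the `sleep` function keeps the example fast to run and test; the same pattern lets a unit test assert on the wait schedule without waiting.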