Skill Guide

ETL/ELT pipeline design and orchestration (Airflow, Prefect, Dagster)

The systematic design, construction, and automated management of workflows that extract data from sources, transform or load it into target systems, ensuring reliability, scalability, and observability.

It is the central nervous system of the data platform; a robust pipeline ensures data is timely, trustworthy, and actionable for analytics and ML, directly impacting decision quality and operational efficiency. Poorly designed pipelines create data debt, causing project failures, distrust in data, and costly manual interventions.

1 Careers

1 Categories

8.7 Avg Demand

25% Avg AI Risk

How to Learn ETL/ELT pipeline design and orchestration (Airflow, Prefect, Dagster)

1. Master core data movement concepts: ETL vs. ELT, batch vs. streaming, idempotency, and incremental loading. 2. Learn basic SQL and Python scripting for data transformation tasks. 3. Install and configure a local instance of Apache Airflow; build a simple DAG that moves data from a CSV to a database.

1. Design pipelines for real-world scenarios like handling schema evolution, late-arriving data, and complex dependencies. 2. Implement advanced Airflow patterns (SubDAGs, TaskGroups, XComs, dynamic tasks) and integrate with cloud services (S3, BigQuery, Snowflake). 3. Common mistake: Failing to implement proper error handling, retries, and alerts, leading to silent pipeline failures.

1. Architect multi-environment (dev/staging/prod) orchestration systems with infrastructure-as-code (Terraform). 2. Evaluate and implement hybrid orchestration (e.g., Airflow for batch + Prefect for event-driven tasks). 3. Establish data platform standards, contribute to internal tooling, and mentor teams on building observable, cost-optimized pipelines.

Practice Projects

Beginner

Project

Daily Sales Report Pipeline

Scenario

Your manager needs a daily report of sales totals from a PostgreSQL database, delivered as a CSV to a shared S3 bucket by 8 AM.

How to Execute

1. Write a Python script using psycopg2 to extract the previous day's sales and calculate totals. 2. Create an Airflow DAG with a PythonOperator for the extraction script and a BashOperator to upload the CSV to S3 using the AWS CLI. 3. Schedule the DAG with a `@daily` interval and implement email alerts on failure. 4. Test the pipeline locally with a small date range.

Intermediate

Project

Multi-Source Customer 360 Data Mart

Scenario

Build a pipeline that combines daily CRM data (API), transaction logs (JSON files), and support tickets (database) into a unified customer dimension table in a data warehouse, with full history (SCD Type 2).

How to Execute

1. Design a modular Airflow DAG with separate extraction tasks for each source. 2. Implement dbt (data build tool) for the transformation logic to handle SCD Type 2 merges. 3. Use Airflow sensors for the API and file availability checks. 4. Implement data quality checks (e.g., using Great Expectations) between extraction and loading stages. 5. Parameterize the DAG for backfills and incremental loads.

Advanced

Project

Unified Orchestration Platform for ML & Analytics

Scenario

As the data platform lead, design an orchestration strategy that serves both batch analytics pipelines and real-time ML feature generation, ensuring resource isolation, observability, and cost control across AWS and GCP.

How to Execute

1. Evaluate and propose a hybrid toolset: use Dagster for its software-defined assets and strong typing for ML feature pipelines, while retaining Airflow for legacy batch jobs. 2. Architect a shared metadata and logging layer (e.g., using OpenTelemetry). 3. Implement a self-service pipeline creation framework for data scientists with guardrails. 4. Define and enforce SLA policies, resource quotas, and cost-allocation tags. 5. Build a comprehensive dashboard for pipeline health, data freshness, and cloud spend.

Tools & Frameworks

Orchestration Platforms

Apache AirflowPrefectDagsterMage.ai

Airflow: industry standard, code-first (Python DAGs), vast ecosystem, best for complex batch dependencies. Prefect: modern, hybrid execution model, strong focus on developer experience and dynamic workflows. Dagster: software-defined assets, excellent for data-aware orchestration and ML pipelines. Mage.ai: notebook-centric, developer-friendly for rapid pipeline prototyping.

Transformation & Data Quality

dbt (data build tool)Great ExpectationsSQLMesh

dbt: SQL-based transformation, version control, and documentation for the analytics layer. Great Expectations: data validation, profiling, and documentation to prevent bad data from propagating. SQLMesh: a dbt alternative with built-in virtual environments and incremental by default.

Infrastructure & IaC

TerraformDockerKubernetesHelm

Terraform: provision and manage the underlying infrastructure for orchestrators (e.g., cloud VMs, Kubernetes clusters, managed Airflow like MWAA). Docker/Kubernetes: containerize tasks and orchestrators for portability and resource management. Helm: package and deploy complex orchestrator applications (like Airflow) to Kubernetes.

Interview Questions

Answer Strategy

The interviewer is assessing system design thinking and knowledge of distributed processing. Focus on the architecture, not just code. Discuss choosing ELT vs. ETL, partitioning, and parallelization. Sample answer: "I'd use an ELT approach within Snowflake for performance. The Airflow DAG would first use a Sensor to check for the S3 file, then trigger a Spark (or Snowflake COPY) job to load the raw data into a Snowflake staging table. The transformation would be a series of dbt models running in Snowflake, leveraging its compute for the large join. The final model would be the aggregated mart. I'd use Snowflake's task scheduling for the heavy SQL work, orchestrated by Airflow for dependency management and SLA monitoring."

Answer Strategy

This tests operational maturity and systematic problem-solving. Demonstrate a structured approach. Sample answer: "First, I'd check the Airflow UI for task instance details, looking at the full traceback and XCom values. I'd examine recent changes to the DAG code or dependencies. If it's a resource issue, I'd check the worker logs and infrastructure metrics (CPU/Memory). I'd enable debug-level logging for a test run and try to reproduce the failure with a subset of data locally. A common intermittent issue is upstream data delays; I'd verify the data source's freshness and potentially add or adjust a sensor's timeout and poke interval."