Skill Guide

Real-time and batch data pipeline design (ETL/ELT)

The architectural discipline of designing systems that extract, transform, and load data from source systems into analytical targets, with separate paradigms for low-latency streaming (real-time) and high-volume scheduled (batch) processing.

It is the core infrastructure enabling data-driven decision making, directly impacting operational efficiency, real-time customer engagement, and regulatory compliance. Poor pipeline design creates data silos, increases latency, and incurs massive operational debt.

1 Careers

1 Categories

9.0 Avg Demand

20% Avg AI Risk

How to Learn Real-time and batch data pipeline design (ETL/ELT)

Master core concepts: 1) Understanding ETL (Extract, Transform, Load) vs. ELT (Extract, Load, Transform) paradigms and their trade-offs. 2) Grasping fundamental data modeling for analytical schemas (Star/Snowflake). 3) Learning basic SQL for transformations and understanding source system APIs (e.g., REST, CDC logs).

Transition to building with specific tools. Focus on 1) Implementing a batch pipeline using orchestrators (Airflow, Prefect) and processing engines (Spark, dbt). 2) Implementing a streaming pipeline using a message queue (Kafka, Pulsar) and a stream processor (Flink, Spark Structured Streaming). 3) Critical mistake to avoid: Treating batch and streaming as identical; understand state management, windowing, and exactly-once semantics in streaming.

Master system design and optimization. Focus on 1) Architecting the 'Lambda' or 'Kappa' architecture hybrid, deciding when to use which. 2) Designing for scalability, fault tolerance, and cost optimization (e.g., partitioning strategies, autoscaling). 3) Establishing data governance, observability, and lineage within the pipeline framework.

Practice Projects

Beginner

Project

Build a Basic Batch ETL Pipeline

Scenario

Ingest daily CSV sales data from an SFTP server, clean and aggregate it, and load it into a PostgreSQL data warehouse for a BI dashboard.

How to Execute

1) Use Python (Pandas) or SQL for data extraction and transformation logic. 2) Schedule the script with a simple cron job or an orchestrator like Prefect Cloud (free tier). 3) Implement error handling and logging. 4) Document the pipeline's purpose, schedule, and dependencies.

Intermediate

Project

Implement a Real-time Dashboard with Stream Processing

Scenario

Build a system to track website user clicks in real-time (<5s latency) to power a live dashboard showing popular products and user journeys.

How to Execute

1) Set up a message broker (e.g., Confluent Cloud or Redpanda). 2) Use a connector or lightweight producer to send clickstream events to a Kafka topic. 3) Use a stream processing framework (e.g., Flink SQL or Spark Structured Streaming) to perform windowed aggregations. 4) Sink the aggregated results to a real-time OLAP database (ClickHouse, Druid) or a caching layer (Redis) for the dashboard API.

Advanced

Project

Design a Hybrid Lambda Architecture

Scenario

You are tasked with designing a data platform for an e-commerce company that requires both real-time inventory updates and sub-second product recommendations, alongside nightly comprehensive reporting.

How to Execute

1) Design the 'Speed Layer': A streaming pipeline using CDC from the inventory database (Debezium) -> Kafka -> Flink for real-time updates and feature generation. 2) Design the 'Batch Layer': A batch pipeline using Spark/dbt to rebuild the master recommendation model and reporting tables nightly. 3) Design the 'Serving Layer': A query engine (e.g., Trino, Presto) that can federate queries across the real-time store (Redis) and the batch store (data lake) for a unified view. 4) Create a data catalog and lineage map for the entire system.

Tools & Frameworks

Orchestration & Workflow Management

Apache AirflowPrefectDagsterdbt (for SQL ELT)

Used to define, schedule, monitor, and manage complex DAGs of data pipeline tasks. Airflow is the industry standard; Dagster/Prefect offer more modern APIs. dbt handles the 'T' in ELT for SQL-centric transformations within the warehouse.

Stream Processing & Messaging

Apache KafkaApache FlinkSpark Structured StreamingAmazon Kinesis

Kafka (and alternatives like Pulsar, Redpanda) are the durable backbone for event streams. Flink is the leading framework for complex, stateful stream processing. Spark Streaming offers micro-batch integration with Spark's ecosystem.

Batch Processing & Storage

Apache SparkCloud Data Warehouses (Snowflake, BigQuery, Redshift)Data Lakehouses (Delta Lake, Apache Iceberg, Apache Hudi)

Spark is the workhorse for large-scale batch processing. Modern data platforms are built on cloud warehouses (for structured data) or lakehouses (for cost-effective, ACID-compliant storage of all data types on cheap object storage).

Interview Questions

Answer Strategy

Do not jump to 'just add Kafka.' Strategy: 1) Analyze the pipeline to identify which transformations/tables require the low latency (the 'hot' path). 2) Propose a targeted migration of that specific data flow to a streaming architecture (CDC/Kafka/Flink). 3) Explain how you'd maintain the existing batch pipeline for historical, complex processing (the 'cold' path). 4) Discuss the new challenges: operational complexity, cost, and monitoring for two systems. Sample: 'I'd implement a hybrid approach. First, I'd audit the pipeline to isolate the 15-minute SLA metrics. For that hot path, I'd use CDC to capture source changes into Kafka, process them with Flink for real-time aggregation, and land the results in a low-latency store like Druid. The remaining bulk processing would stay in the nightly Spark batch job. This avoids a full rewrite while meeting the new requirement.'

Answer Strategy

Tests operational maturity, ownership, and systematic thinking. The answer must show a blameless post-mortem mindset. Focus on: 1) Clearly defining the failure's business impact. 2) Identifying the technical root cause (e.g., schema drift, missing backpressure, resource exhaustion). 3) The specific, durable fix you implemented (e.g., added schema registry validation, implemented circuit breakers, added data contract tests). Sample: 'A batch pipeline failed because a source team added a new field without notice, breaking our deserialization. Business impact was a 12-hour delay in daily sales reports. Root cause was lack of schema evolution communication. I fixed the immediate failure and led the implementation of a Schema Registry with compatibility checks in CI/CD. Now, schema changes are a required PR for source teams, and our pipelines handle evolution gracefully.'