Skill Guide

Real-time streaming ML pipelines using Kafka, Flink, or Spark Structured Streaming

The architecture and operationalization of end-to-end machine learning systems that ingest, process, and serve predictions on unbounded data streams in real-time using distributed streaming frameworks.

This skill enables organizations to make automated, data-driven decisions at the speed of events, transforming latent data into immediate business value for use cases like fraud detection, dynamic pricing, and real-time recommendations. It directly impacts revenue capture and operational efficiency by eliminating batch processing latency.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn Real-time streaming ML pipelines using Kafka, Flink, or Spark Structured Streaming

1. Master the foundational theory: understand stream processing semantics (exactly-once, at-least-once), event time vs. processing time, and watermarking. 2. Gain proficiency in one core distributed system (Apache Flink or Spark Structured Streaming) by completing its official tutorials on stateful stream processing. 3. Learn Apache Kafka fundamentals: topics, partitions, consumer groups, and the producer/consumer API.

1. Move from toy examples to production concerns: implement a pipeline with real data sources (e.g., Kafka) and sinks (e.g., Elasticsearch, a feature store), focusing on state management (RocksDB state backend) and checkpointing for fault tolerance. 2. Integrate a simple ML model: deploy a pre-trained model (e.g., a scikit-learn classifier serialized with PMML or ONNX) for online inference within a stream processor, handling model loading and prediction errors. 3. Avoid the common mistake of ignoring data skew and late events; practice using windowed aggregations and side outputs for late data.

1. Architect complex, multi-stage pipelines that perform online feature computation, model serving, and feedback loops (e.g., for reinforcement learning), ensuring sub-second end-to-end latency. 2. Master performance tuning and cost optimization: profile jobs for bottlenecks (serialization, state size), tune parallelism, and manage cluster resources on Kubernetes or cloud-managed services. 3. Lead the design of an MLOps platform component for streaming ML, including CI/CD for pipeline code, model versioning, and monitoring for data/concept drift.

Practice Projects

Beginner

Project

Real-time Clickstream Anomaly Detector

Scenario

Build a system that processes a simulated stream of website click events from Kafka, identifies sessions with unusually high activity (potential bots), and logs the results.

How to Execute

1. Set up a local Kafka broker and produce synthetic clickstream events (user_id, timestamp, page) using a Python script. 2. Write a Spark Structured Streaming or Flink job that reads the stream, keys by user_id, and applies a windowed count (e.g., 1-minute tumbling window). 3. Implement a simple threshold-based anomaly rule (e.g., >50 clicks/minute) and write matched events to a console sink or a local database. 4. Add basic checkpointing to handle job restarts gracefully.

Intermediate

Project

Online Feature Store Integration for Fraud Prediction

Scenario

Enhance a streaming transaction pipeline to compute real-time user behavior features (e.g., rolling 5-minute spending total) and serve them alongside a pre-trained XGBoost model for fraud scoring.

How to Execute

1. Extend the Kafka producer to emit transaction events (user_id, amount, merchant, timestamp). 2. In Flink, compute stateful features using KeyedProcessFunction or Spark's mapGroupsWithState, storing state in RocksDB. 3. Serialize the feature vector and call a RESTful model endpoint (e.g., served via TensorFlow Serving or a simple Flask app) for each transaction. 4. Augment the original event with the fraud score and sink it to a dashboard (e.g., Grafana with a PostgreSQL backend) for real-time monitoring.

Advanced

Project

Multi-Modal Feature Pipeline with Late Data Handling

Scenario

Design and deploy a pipeline that ingests two disparate streams (e.g., user transactions and website clicks), performs complex event processing to join them, handles late-arriving data up to 24 hours, and manages model drift by periodically retraining on fresh features.

How to Execute

1. Architect a Lambda or Kappa-style architecture, using Kafka as the unified log for both streams. 2. Implement a Flink application that uses event-time timers and side outputs to handle late data, routing it to a separate sink for batch reprocessing. 3. Use a stateful co-process to join transaction and click streams per user, outputting a fused feature vector. 4. Implement an automated retraining pipeline (e.g., on Airflow) that consumes the late-data sink and the main feature log, retrains the model, and promotes a new version to the serving endpoint, with canary testing.

Tools & Frameworks

Streaming & Messaging

Apache KafkaApache FlinkSpark Structured StreamingApache Pulsar

Kafka is the de facto standard for durable, ordered event streams. Flink is preferred for complex, low-latency stateful computations and advanced windowing. Spark Structured Streaming is chosen for teams with existing Spark expertise needing integrated batch-streaming logic. Pulsar is a cloud-native alternative with built-in multi-tenancy.

ML Serving & State Management

TensorFlow ServingTorchServeRedis (as a feature/state store)RocksDB

Dedicated model servers (TF/Torch Serve) handle scalable inference and model versioning. Redis is used for low-latency online feature lookup in lambda architectures. RocksDB is the embedded state backend for Flink/Spark, enabling large state with efficient checkpointing.

Orchestration & Monitoring

Apache AirflowKubernetes (K8s)Prometheus/GrafanaGreat Expectations

Airflow orchestrates batch retraining and data validation tasks triggered by streams. Kubernetes manages the deployment and scaling of streaming job containers. Prometheus and Grafana are essential for monitoring pipeline health (latency, throughput, backpressure) and data quality. Great Expectations validates data schema and statistics within streams.

Interview Questions

Answer Strategy

Demonstrate a methodical approach: 1) Check metrics (heap/non-heap memory, state size per operator, checkpoint duration). 2) Verify state backend config (using RocksDB vs. heap). 3) Analyze state TTL and clean-up logic. 4) Consider state serialization efficiency. 5) Discuss scaling options. Sample Answer: 'First, I'd inspect Flink's metrics dashboard for state size per task and checkpoint breakdown. If the state backend is heap-based, I'd switch to RocksDB with incremental checkpoints. I'd audit my stateful functions for proper state TTL settings and ensure I'm clearing expired state. Finally, I'd profile serialization to ensure POJOs are efficient, and if necessary, increase checkpoint timeout and interval while tuning parallelism.'

Answer Strategy

Tests architectural judgment and experience with fundamental stream processing trade-offs. The answer should reveal understanding of watermarks, late data, and business requirements. Sample Answer: 'In a real-time bidding system, we set an aggressive watermark delay of 5 seconds to trigger windowed computations quickly. We knew some events would be late. We used a side output to capture late events, sending them to a separate 'repair' topic for batch correction. The trade-off was accepting ~0.5% of events being scored with a slightly stale feature set versus maintaining bid latency under 50ms, which was critical for revenue.'