Skill Guide

Stream processing (Kafka, Flink, Spark Streaming, Kinesis)

Stream processing is the real-time computation of unbounded data sequences, enabling continuous ingestion, transformation, and analysis of event streams with millisecond-to-second latency.

This skill is the engine of real-time decision-making, enabling businesses to react instantly to market changes, customer actions, and operational anomalies. It directly impacts revenue through dynamic pricing, fraud detection, and personalized user experiences while reducing costs via real-time monitoring and automated operations.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn Stream processing (Kafka, Flink, Spark Streaming, Kinesis)

1. Core Concepts: Master event-driven architecture, state management, exactly-once semantics, and windowing (tumbling, sliding, session). Understand the Lambda vs. Kappa architecture debate.
2. Tool Fundamentals: Learn Kafka's publish-subscribe model and how topics, partitions, and consumer groups relate. Start with Flink's DataStream API, and learn how Spark's micro-batch execution differs from record-at-a-time streaming.
3. Language Prerequisites: Solid Java/Scala skills are non-negotiable; Python proficiency is a strong secondary skill for prototyping.
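To make the windowing vocabulary concrete, here is a minimal sketch of how an event timestamp maps to tumbling and sliding windows. It is plain Python rather than any engine's API; the function names and the millisecond units are assumptions chosen for illustration.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end) in milliseconds


def tumbling_window(ts_ms: int, size_ms: int) -> Window:
    """Return the single tumbling window that contains ts_ms."""
    start = ts_ms - (ts_ms % size_ms)
    return (start, start + size_ms)


def sliding_windows(ts_ms: int, size_ms: int, slide_ms: int) -> List[Window]:
    """Return every sliding window that contains ts_ms, newest first."""
    windows = []
    # Latest window start at or before ts_ms, aligned to the slide interval.
    start = ts_ms - (ts_ms % slide_ms)
    # Walk backwards while the window still covers ts_ms.
    while start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows


print(tumbling_window(12_500, 5_000))
print(sliding_windows(12_500, 10_000, 5_000))
```

An event belongs to exactly one tumbling window but to several sliding windows (size divided by slide, when the size is a multiple of the slide), which is why sliding windows cost more state.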

1. Production Pitfalls: Practice handling late-arriving data (watermarks in Flink), backpressure (often visible as growing Kafka consumer lag), and stateful processing with checkpointing and recovery.
2. Integration Patterns: Build pipelines that read from Kafka, process in Flink or Spark, and write to sinks (databases, Elasticsearch, alert systems).
3. Performance Tuning: Learn Kafka partition sizing, Flink parallelism configuration, and serialization optimizations (Avro, Protobuf). Avoid the common mistake of over-complicating state before mastering stateless transformations.
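The late-data handling mentioned above can be sketched without an engine. The following is a simplified model of a bounded-out-of-orderness watermark, not Flink's implementation: the watermark trails the largest event time seen so far by a fixed bound, and an event that arrives with a timestamp behind the watermark is treated as late. The function name and the example values are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def split_on_time_and_late(
    event_times_ms: Iterable[int], max_out_of_orderness_ms: int
) -> Tuple[List[int], List[int]]:
    """Classify events, in arrival order, as on time or late."""
    on_time: List[int] = []
    late: List[int] = []
    max_seen: Optional[int] = None
    for ts in event_times_ms:
        # Watermark as it stood before this event arrived.
        if max_seen is not None and ts < max_seen - max_out_of_orderness_ms:
            late.append(ts)
        else:
            on_time.append(ts)
        max_seen = ts if max_seen is None else max(max_seen, ts)
    return on_time, late


arrivals = [1000, 2000, 1500, 5000, 2500, 4200]
print(split_on_time_and_late(arrivals, max_out_of_orderness_ms=1000))
```

A larger bound tolerates more disorder but delays results, since windows cannot close until the watermark passes them.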

1. Architectural Strategy: Design hybrid batch-stream architectures (e.g., Flink + batch layer). Implement complex event processing (CEP) for fraud patterns.
2. Operational Mastery: Master Kubernetes-based deployment (Flink on K8s), monitoring with Prometheus/Grafana (consumer lag, checkpoint duration), and chaos engineering for stream resilience.
3. Executive Alignment: Translate business SLAs (e.g., "sub-second fraud alerts") into architectural decisions (choosing Flink's event-time processing over Spark's micro-batch). Mentor teams on stream-first design thinking.

Practice Projects

Beginner

Project

Real-Time Log Monitoring Dashboard

Scenario

Build a system to ingest web server logs via Kafka, process errors/warnings with Spark Streaming or Flink, and visualize counts per minute in a live Grafana dashboard.

How to Execute

1. Set up a local Kafka cluster (using Docker).
2. Write a producer to simulate or tail a log file.
3. Implement a streaming job (e.g., in Flink) that filters ERROR-level logs, which is a stateless step, and then computes tumbling-window counts, which requires a small amount of windowed state.
4. Sink results to a time-series DB (InfluxDB) and connect Grafana.
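Before wiring up Kafka and Flink, it can help to prototype the core logic of step 3 on a plain list of lines. The sketch below assumes a hypothetical log format of `<ISO timestamp> <LEVEL> <message>`; real web server logs will need a different parser.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def count_by_minute(
    lines: Iterable[str], levels: Tuple[str, ...] = ("ERROR", "WARN")
) -> Dict[Tuple[str, str], int]:
    """Count matching log lines per (minute, level) tumbling window."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # malformed line, skip it
        ts_text, level, _message = parts
        if level not in levels:
            continue  # the stateless filter step
        try:
            ts = datetime.fromisoformat(ts_text)
        except ValueError:
            continue  # unparseable timestamp
        # Truncate to the minute: this is the one-minute tumbling window key.
        minute = ts.replace(second=0, microsecond=0).isoformat()
        counts[(minute, level)] += 1
    return dict(counts)


sample = [
    "2024-01-01T12:00:05 ERROR db timeout",
    "2024-01-01T12:00:40 ERROR db timeout",
    "2024-01-01T12:00:41 INFO request ok",
    "2024-01-01T12:01:02 WARN slow response",
]
print(count_by_minute(sample))
```

In a real job the counter would live in the engine's managed window state and each (minute, level) pair would be emitted to InfluxDB when its window closes.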

Intermediate

Project

Exactly-Once E-Commerce Order Pipeline

Scenario

Design a pipeline where orders from Kafka are enriched with inventory data (from a database), processed for exactly-once delivery to a downstream analytics system and a transactional database, handling failures.

How to Execute

1. Configure Flink's Kafka source with checkpointing enabled.
2. Use Flink's Async I/O to call the inventory service for enrichment.
3. Implement a two-phase commit sink (e.g., Flink's JDBC sink with XA transactions) or idempotent writes.
4. Introduce chaos (kill TaskManagers) to test recovery and exactly-once guarantees.
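The idempotent-write option in step 3 can be illustrated with a small simulation. This is a sketch under stated assumptions: the in-memory dictionary stands in for a database table with a primary key on `order_id`, and `run` stands in for a job that restarts from its last checkpointed offset. All names here are hypothetical.

```python
from typing import Dict, List, Optional


class IdempotentOrderSink:
    """Upsert keyed by order_id, so replaying a record changes nothing."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.writes_applied = 0

    def write(self, order: dict) -> bool:
        order_id = order["order_id"]
        if self.rows.get(order_id) == order:
            return False  # replayed record, already stored
        self.rows[order_id] = order
        self.writes_applied += 1
        return True


def run(sink: IdempotentOrderSink, orders: List[dict],
        start_offset: int, stop_before: Optional[int] = None) -> None:
    """Process orders from start_offset; stop early to simulate a crash."""
    end = len(orders) if stop_before is None else stop_before
    for offset in range(start_offset, end):
        sink.write(orders[offset])


orders = [{"order_id": f"o{i}", "total": i * 10} for i in range(5)]
sink = IdempotentOrderSink()
run(sink, orders, start_offset=0, stop_before=4)  # crash after offset 3
run(sink, orders, start_offset=2)                 # restart from checkpoint at 2
print(len(sink.rows), sink.writes_applied)
```

Offsets 2 and 3 are processed twice, but the table ends with one row per order. This gives effectively-once results in the sink without a two-phase commit, at the price of requiring a stable key for every record.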

Advanced

Project

Multi-Region Fraud Detection with Complex Event Processing

Scenario

Build a system to detect fraudulent transaction patterns (e.g., rapid consecutive high-value transactions from different geographies) across global Kafka topics, with sub-second alerting and minimal false positives.

How to Execute

1. Architect a multi-region Kafka setup with MirrorMaker 2 for geo-replication.
2. Use the Flink CEP library to define stateful fraud patterns (sequences, conditions).
3. Implement dynamic rule updating via a separate configuration stream.
4. Design the alert sink with idempotency and integrate with a case management system, ensuring alerts are deduplicated across regions.
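The pattern from the scenario (high-value transactions on one card from different geographies in quick succession) can be prototyped in plain Python before expressing it in Flink CEP. The sketch below is not the Flink CEP API; the `Txn` fields, thresholds, and function name are assumptions, and it expects input ordered by event time.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Txn:
    txn_id: str
    card_id: str
    amount: float
    country: str
    ts: float  # event time in seconds


def detect_geo_velocity(
    txns: Iterable[Txn], min_amount: float = 1000.0, within_s: float = 60.0
) -> List[Tuple[str, str]]:
    """Return (earlier_txn_id, later_txn_id) pairs that match the pattern."""
    recent: Dict[str, Deque[Txn]] = defaultdict(deque)  # per-card keyed state
    alerts: List[Tuple[str, str]] = []
    seen: Set[Tuple[str, str]] = set()  # alert keys already emitted
    for txn in txns:
        if txn.amount < min_amount:
            continue
        window = recent[txn.card_id]
        # Evict transactions that fell out of the time bound.
        while window and txn.ts - window[0].ts > within_s:
            window.popleft()
        for earlier in window:
            key = (earlier.txn_id, txn.txn_id)
            if earlier.country != txn.country and key not in seen:
                seen.add(key)
                alerts.append(key)
        window.append(txn)
    return alerts


stream = [
    Txn("t1", "c1", 1500.0, "US", 0.0),
    Txn("t3", "c1", 2000.0, "FR", 30.0),
    Txn("t3", "c1", 2000.0, "FR", 30.0),  # same record replicated from another region
]
print(detect_geo_velocity(stream))
```

Keying each alert by the pair of transaction IDs is one way to meet the deduplication requirement in step 4: a record replicated across regions produces the same key and is emitted once.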

Tools & Frameworks

Core Streaming Engines

Apache Flink, Apache Kafka Streams, Apache Spark Structured Streaming, Amazon Kinesis Data Analytics

Flink: Use for complex stateful processing, event-time guarantees, and low latency. Kafka Streams: Embedded library for simple, exactly-once processing tied directly to Kafka. Spark Structured Streaming: Use for unified batch-stream SQL and ML integration. Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink): Best suited to serverless, AWS-native architectures with managed scaling.

Messaging & Storage

Apache Kafka, Amazon Kinesis Data Streams, Azure Event Hubs, Confluent Schema Registry

Kafka: The industry standard for durable, high-throughput event streaming; use Schema Registry to enforce data contracts. Kinesis/Event Hubs: Managed alternatives for cloud-native deployments. For state storage, RocksDB (Flink) and managed state backends are critical.

Observability & Operations

Prometheus + Grafana, Confluent Control Center, Flink Web UI, Chaos Mesh

Monitor consumer lag, checkpoint duration, and throughput. Use Control Center for Kafka health. Integrate Chaos Mesh for resilience testing in production-like environments.

Interview Questions

Answer Strategy

Tests operational troubleshooting skills. The answer should follow a structured approach:
1. Check consumer group health (is the group rebalancing?).
2. Analyze partition skew (are there hot partitions?).
3. Assess the consumer code (slow processing, serialization, external service calls?).
4. Evaluate infrastructure (CPU, memory, and network I/O on consumer hosts).
Resolution paths: add consumer instances (up to the partition count; consumers beyond that sit idle), optimize the processing logic, or scale the Kafka cluster or partition count.
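Steps 1 and 2 come down to simple arithmetic on offsets. The sketch below assumes the log-end and committed offsets per partition have already been fetched (for example from a consumer-group describe command or a metrics exporter); the function names and the median-based skew rule are illustrative choices, not a standard.

```python
from statistics import median
from typing import Dict, List


def partition_lag(
    end_offsets: Dict[int, int], committed_offsets: Dict[int, int]
) -> Dict[int, int]:
    """Lag per partition = log-end offset minus committed offset."""
    return {
        p: max(end - committed_offsets.get(p, 0), 0)
        for p, end in end_offsets.items()
    }


def hot_partitions(lag: Dict[int, int], skew_factor: float = 3.0) -> List[int]:
    """Partitions whose lag exceeds skew_factor times the median lag."""
    if not lag:
        return []
    # Floor the baseline at 1 so an all-caught-up median of 0 flags nothing tiny.
    baseline = max(median(lag.values()), 1)
    return sorted(p for p, value in lag.items() if value > skew_factor * baseline)


end = {0: 1000, 1: 1000, 2: 1000, 3: 50000}
committed = {0: 990, 1: 995, 2: 1000, 3: 10000}
lag = partition_lag(end, committed)
print(lag, hot_partitions(lag))
```

Lag concentrated in one partition points to key skew or a slow consumer instance, where adding consumers will not help; lag spread evenly across partitions points to insufficient overall processing capacity.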

Answer Strategy

Tests the ability to translate business requirements into a technical design. The answer should:
1. Define 'active user' as a session window problem.
2. Flink: use session windows with 30s allowed lateness and stateful processing per user ID.
3. Spark: use micro-batches of 10-20s with watermarking for late data.
4. Choose Flink for true event-time semantics and lower latency; choose Spark if batch integration is also needed.
Highlight the trade-offs in state management and complexity.
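To show what "active user as a session window" means in practice, here is a batch-style sketch that groups each user's events into sessions separated by an inactivity gap. It sorts events up front, so it sidesteps the late-data handling that a streaming engine provides; the gap value and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Session = Tuple[float, float]  # (first event time, last event time) in seconds


def sessionize(
    events: Iterable[Tuple[str, float]], gap_s: float
) -> Dict[str, List[Session]]:
    """Merge each user's events into sessions split by gaps of gap_s or more."""
    by_user: Dict[str, List[float]] = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)
    sessions: Dict[str, List[Session]] = {}
    for user, times in by_user.items():
        times.sort()
        start = end = times[0]
        user_sessions: List[Session] = []
        for ts in times[1:]:
            if ts - end < gap_s:
                end = ts  # still inside the current session
            else:
                user_sessions.append((start, end))
                start = end = ts
        user_sessions.append((start, end))
        sessions[user] = user_sessions
    return sessions


def active_users_at(
    sessions: Dict[str, List[Session]], t: float, gap_s: float
) -> List[str]:
    """Users with a session that is still open at time t."""
    return sorted(
        user for user, spans in sessions.items()
        if any(start <= t < end + gap_s for start, end in spans)
    )


events = [("a", 0), ("a", 20), ("a", 100), ("b", 10), ("a", 25)]
sessions = sessionize(events, gap_s=30)
print(sessions["a"], active_users_at(sessions, t=26, gap_s=30))
```

A streaming engine keeps the same per-user state incrementally and merges sessions as out-of-order events arrive, which is where the state-management trade-off between the two engines shows up.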