Skill Guide

ML model serving architectures (batch, real-time, streaming inference)

ML model serving architectures are the infrastructure patterns and design principles for deploying trained machine learning models to handle inference requests, categorized by latency and throughput requirements into batch (offline), real-time (synchronous online), and streaming (asynchronous online) paradigms.

The choice of serving architecture directly determines the feasibility, cost-efficiency, and business impact of an ML model, because it aligns computational resources with the latency the business use case actually requires. Choosing well often separates a profitable, scalable ML product from a prototype that is too slow or too expensive to operate.

How to Learn ML model serving architectures (batch, real-time, streaming inference)

1. Understand the core trade-off triangle: latency vs. cost vs. complexity.
2. Learn the definitions, use cases, and key components (e.g., message queues, REST/gRPC endpoints, feature stores) for batch, online, and streaming serving.
3. Study the architecture diagrams of reference systems like Uber's Michelangelo or Netflix's ML platform.

1. Move from diagrams to hands-on deployment using managed services (e.g., AWS SageMaker endpoints, Vertex AI prediction, Azure ML managed endpoints).
2. Focus on performance tuning: optimizing batch size, choosing a model serialization format (ONNX, TensorRT), and understanding hardware trade-offs (GPU vs. CPU).
3. Avoid a common mistake: not designing for failure. Implement proper health checks, fallback models, and monitoring (p99 latency, error rates).
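The failure-handling advice above can be sketched in a few lines. This is a minimal illustration, not a production pattern: `GuardedModel`, `percentile`, and the predictor signature are hypothetical names introduced here, and the metrics are kept in memory rather than exported to a monitoring system.

```python
import math
import time
from typing import Callable, List, Sequence

# Assumption: a predictor is any callable taking a feature vector and returning a score.
Predictor = Callable[[Sequence[float]], float]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples recorded")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


class GuardedModel:
    """Wraps a primary predictor with a fallback and basic in-memory metrics."""

    def __init__(self, primary: Predictor, fallback: Predictor) -> None:
        self.primary = primary
        self.fallback = fallback
        self.latencies_ms: List[float] = []
        self.requests = 0
        self.errors = 0

    def predict(self, features: Sequence[float]) -> float:
        self.requests += 1
        start = time.perf_counter()
        try:
            return self.primary(features)
        except Exception:
            # Count the failure, then degrade to the fallback instead of erroring out.
            self.errors += 1
            return self.fallback(features)
        finally:
            # Recorded for every request, including ones served by the fallback.
            self.latencies_ms.append((time.perf_counter() - start) * 1000)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def p99_latency_ms(self) -> float:
        return percentile(self.latencies_ms, 99)
```

In a real deployment the fallback might be a smaller model or a cached default, and the counters would typically feed a metrics backend instead of instance attributes.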

1. Architect for state and scale: design systems handling online feature computation, model versioning, canary/A-B deployments, and auto-scaling policies based on QPS (queries per second).
2. Master cost optimization strategies (spot instances, serverless inference like AWS Lambda, model distillation).
3. Lead cross-functional alignment: ensure the serving architecture supports business needs for real-time personalization, batch reporting, and event-driven actions.
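Two of the ideas above, QPS-based scaling and canary routing, reduce to small calculations. The sketch below is a simplified illustration under stated assumptions: `desired_replicas` and `route_version` are hypothetical helpers, the per-replica QPS target is something you would measure by load testing, and real autoscalers add smoothing and cooldowns that are omitted here.

```python
import hashlib
import math


def desired_replicas(observed_qps: float, target_qps_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Simplified sizing rule: enough replicas to keep each one at or below its target QPS."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(observed_qps / target_qps_per_replica)
    # Clamp to the configured bounds so a traffic spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, needed))


def route_version(request_key: str, canary_percent: int) -> str:
    """Deterministically assigns a key (e.g. a user ID) to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Hashing the key rather than picking randomly keeps each user on the same model version across requests, which makes canary metrics easier to compare.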

Practice Projects

Beginner

Project

Deploy a Simple Model as a REST API

Scenario

You have a trained scikit-learn model for iris classification. You need to serve it so a web application can send flower measurements and get a species prediction instantly.

How to Execute

1. Serialize the model using joblib or pickle.
2. Build a lightweight web service using Flask or FastAPI that loads the model once at startup and exposes a `/predict` endpoint.
3. Containerize the service with Docker.
4. Deploy the container to a simple cloud target (e.g., an AWS EC2 instance or GCP Cloud Run) and test it with Postman or curl.
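Steps 1 and 2 can be sketched with the standard library alone, which keeps the example runnable without extra installs. Everything here is an assumption for illustration: the centroid table is a hypothetical stand-in for the trained scikit-learn model, `http.server` stands in for Flask or FastAPI, and the request shape `{"features": [...]}` is invented for this sketch.

```python
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, List, Tuple

# Hypothetical stand-in for a trained model: one illustrative centroid per species
# over (sepal length, sepal width, petal length, petal width), in cm.
CENTROIDS: Dict[str, List[float]] = {
    "setosa": [5.0, 3.4, 1.5, 0.2],
    "versicolor": [5.9, 2.8, 4.3, 1.3],
    "virginica": [6.6, 3.0, 5.6, 2.0],
}

# Step 1: serialize the model; a real service would read these bytes from a file.
MODEL_BYTES = pickle.dumps(CENTROIDS)
# Load once at import time, not per request. Only unpickle artifacts you trust.
MODEL: Dict[str, List[float]] = pickle.loads(MODEL_BYTES)


def predict_species(features: List[float]) -> str:
    """Returns the species whose centroid is nearest to the input."""
    def squared_distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(MODEL, key=lambda name: squared_distance(MODEL[name]))


def handle_predict(body: bytes) -> Tuple[int, dict]:
    """Validates the request body and returns (HTTP status, JSON-serializable result)."""
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON like {'features': [f1, f2, f3, f4]}"}
    if len(features) != 4:
        return 400, {"error": "exactly 4 features required"}
    return 200, {"species": predict_species(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_predict(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8000) -> HTTPServer:
    """Builds the server without starting it; call serve_forever() to run."""
    return HTTPServer(("0.0.0.0", port), PredictHandler)
```

To try it, call `make_server(8000).serve_forever()` and send `curl -X POST localhost:8000/predict -d '{"features": [5.1, 3.5, 1.4, 0.2]}'`. Keeping validation in `handle_predict` separate from the HTTP handler makes the logic easy to unit test and to move to another web framework.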

Intermediate

Project

Build a Batch Scoring Pipeline with an Orchestration Tool

Scenario

A marketing team needs a weekly report of all new users scored with a propensity-to-churn model. The data lives in a data warehouse (e.g., BigQuery, Snowflake).

How to Execute

1. Use an orchestration tool like Apache Airflow or Prefect to define a DAG (Directed Acyclic Graph).
2. Create tasks to: a) extract new user data from the warehouse, b) preprocess features, c) run batch inference using a library like PySpark or a dedicated batch serving framework (e.g., Seldon Core batch jobs), d) write predictions back to the warehouse.
3. Schedule the DAG to run weekly.
4. Add monitoring for task success/failure and prediction drift.
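The extract, score, and write-back tasks above can be sketched as plain functions that an orchestrator would call. This is an illustration under several assumptions: an in-process SQLite database stands in for the warehouse, the `users` and `churn_scores` tables and their columns are hypothetical, `churn_score` is a toy formula in place of a real model, and user IDs are assumed to be positive integers.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, int, float]  # (user_id, days_since_last_login, sessions_per_week)


def extract_chunk(conn: sqlite3.Connection, since: str,
                  after_id: int, limit: int) -> List[Row]:
    """Task a: reads one page of new users, ordered by ID (keyset pagination)."""
    return conn.execute(
        "SELECT user_id, days_since_last_login, sessions_per_week FROM users "
        "WHERE signup_date >= ? AND user_id > ? ORDER BY user_id LIMIT ?",
        (since, after_id, limit),
    ).fetchall()


def churn_score(days_inactive: int, sessions: float) -> float:
    """Hypothetical stand-in for a model's predicted probability, clamped to [0, 1]."""
    raw = 0.05 * days_inactive - 0.1 * sessions + 0.5
    return max(0.0, min(1.0, raw))


def run_batch(conn: sqlite3.Connection, since: str, chunk_size: int = 1000) -> int:
    """Tasks b-d: scores new users chunk by chunk and writes results back."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_scores "
        "(user_id INTEGER PRIMARY KEY, score REAL)"
    )
    written = 0
    after_id = 0  # assumes user IDs are positive
    while True:
        chunk = extract_chunk(conn, since, after_id, chunk_size)
        if not chunk:
            break
        scored = [(uid, churn_score(days, sessions)) for uid, days, sessions in chunk]
        # INSERT OR REPLACE keeps the job idempotent if a weekly run is retried.
        conn.executemany("INSERT OR REPLACE INTO churn_scores VALUES (?, ?)", scored)
        written += len(scored)
        after_id = chunk[-1][0]
    conn.commit()
    return written
```

Processing in fixed-size chunks bounds memory use regardless of how many users signed up that week, and the idempotent write makes retries from the orchestrator safe.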

Advanced

Project

Design a Real-Time Feature and Inference System

Scenario

You are building a real-time fraud detection system for a fintech app. The model requires features computed from the last 5 minutes of user transaction history (streaming features) and static user attributes.

How to Execute

1. Architect the data flow: use a streaming platform (Kafka/Pulsar) for transaction events.
2. Build a streaming feature store (e.g., Tecton, or Feast with an online store) to compute and serve windowed aggregates in real time.
3. Deploy the model behind a high-performance serving layer (e.g., Triton Inference Server, KServe) that fetches features from the feature store for each incoming request.
4. Design the system for resilience: use circuit breakers, fall back to batch features if the stream is delayed, and set up comprehensive observability (metrics, logs, traces).
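The windowed aggregates in step 2 and the batch fallback in step 4 can be illustrated in memory. This is a sketch only: `TransactionWindow` and `assemble_features` are hypothetical names, the feature names are invented, and `stream_lag_seconds` is assumed to be measured elsewhere (for example from consumer lag). A real feature store would also persist state and partition it per user.

```python
from collections import deque
from typing import Deque, Dict, Tuple

WINDOW_SECONDS = 300  # the 5-minute window from the scenario


class TransactionWindow:
    """Keeps one user's recent transactions and aggregates over a sliding window."""

    def __init__(self, window_seconds: int = WINDOW_SECONDS) -> None:
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, float]] = deque()  # (timestamp, amount)

    def add(self, timestamp: float, amount: float) -> None:
        # Assumes events arrive in timestamp order.
        self.events.append((timestamp, amount))

    def features(self, now: float) -> Dict[str, float]:
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        return {
            "txn_count_5m": float(len(self.events)),
            "txn_amount_5m": sum(amount for _, amount in self.events),
        }


def assemble_features(window: TransactionWindow,
                      static_attrs: Dict[str, float],
                      batch_fallback: Dict[str, float],
                      now: float,
                      stream_lag_seconds: float,
                      max_lag_seconds: float = 30.0) -> Dict[str, object]:
    """Merges static and dynamic features, using batch values if the stream is delayed."""
    if stream_lag_seconds <= max_lag_seconds:
        dynamic = window.features(now)
        source = "stream"
    else:
        dynamic = dict(batch_fallback)
        source = "batch_fallback"
    # Recording the source lets monitoring track how often the fallback is used.
    return {**static_attrs, **dynamic, "feature_source": source}
```

Tagging each feature vector with its source is a small design choice that pays off in observability: a rising share of `batch_fallback` requests is an early signal that the streaming path is unhealthy.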

Tools & Frameworks

Serving Frameworks & Platforms

TensorFlow Serving, Triton Inference Server (NVIDIA), TorchServe, Seldon Core / KServe, BentoML

Core software for loading models and exposing them via APIs (REST/gRPC). Triton suits performance-critical, multi-framework, GPU-optimized serving. Seldon Core and KServe are Kubernetes-native and handle deployment orchestration. BentoML simplifies packaging and deployment.

Orchestration & Infrastructure

Kubernetes, Apache Airflow / Prefect, Apache Kafka / Pulsar, Docker

Kubernetes is the de facto standard for scaling and managing containerized serving workloads. Airflow/Prefect orchestrate batch pipelines. Kafka/Pulsar are the backbone for streaming inference architectures, handling high-throughput event flows.

Feature Stores & MLOps

Feast, Tecton, Hopsworks, MLflow (for model registry/tracking)

Feature stores (Feast, Tecton, Hopsworks) serve the same feature definitions to both batch and real-time models, which reduces training-serving skew. MLflow manages model versions, packaging, and deployment.

Interview Questions

Answer Strategy

The interviewer is testing the ability to translate business requirements (real-time, user-facing) into a technical architecture. Use the 'data path' framework: source -> processing -> serving. Sample answer: 'This requires a real-time online architecture. I'd propose: 1) A streaming pipeline (Kafka) to capture user click events. 2) A streaming feature store to compute short-term engagement features (e.g., last 5 viewed items). 3) The model, served via a low-latency framework like Triton, would call the feature store for each user's session. 4) The endpoint would be a gRPC service for minimal latency, deployed on a Kubernetes cluster with horizontal pod autoscaling to handle traffic spikes.'

Answer Strategy

This tests the understanding of the batch-to-streaming continuum and incremental processing. The core competency is system evolution and cost management. Sample answer: 'First, I'd profile the current batch job to identify the bottleneck: is it model computation, data I/O, or feature engineering? For a move to hourly, a micro-batch approach using Spark Structured Streaming or AWS Batch with smaller time windows is often more cost-effective than full streaming. I would implement a change-data-capture (CDC) pattern to process only new or updated data each hour, drastically reducing compute. The key is to incrementally validate the new job's accuracy against the existing batch baseline before a full switch.'
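The incremental idea in the sample answer can be sketched with a watermark: each run scores only records updated since the previous run. This is a simplification, not true log-based CDC. `Record`, `score_incrementally`, and `max_abs_diff` are hypothetical names, and the `updated_at` field is assumed to be reliably maintained by the source system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Record:
    key: str
    updated_at: float  # epoch seconds of the last change
    features: Tuple[float, ...]


def score_incrementally(records: Iterable[Record], watermark: float,
                        model: Callable[[Tuple[float, ...]], float]
                        ) -> Tuple[Dict[str, float], float]:
    """Scores only records changed after the watermark; returns scores and the new watermark."""
    scores: Dict[str, float] = {}
    new_watermark = watermark
    for record in records:
        if record.updated_at > watermark:
            scores[record.key] = model(record.features)
            new_watermark = max(new_watermark, record.updated_at)
    return scores, new_watermark


def max_abs_diff(candidate: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Largest score disagreement on keys present in both runs (0.0 if none overlap)."""
    shared = candidate.keys() & baseline.keys()
    return max((abs(candidate[k] - baseline[k]) for k in shared), default=0.0)
```

Persisting the returned watermark between hourly runs is what keeps the work proportional to the change volume, and `max_abs_diff` gives a simple check of the new job against the existing batch baseline before switching over.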