Skill Guide

Capacity planning and predictive scaling for inference infrastructure

The systematic process of forecasting compute, memory, and network resource requirements for ML inference services, and dynamically adjusting infrastructure capacity to meet demand with minimal cost and latency.

This skill directly prevents service degradation and cost overruns in production ML systems. It ensures high availability and performance for AI applications, which is critical for user retention and revenue in products relying on real-time inference.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn Capacity planning and predictive scaling for inference infrastructure

1. Understand core metrics: latency (p95/p99), throughput (QPS/RPS), GPU utilization, and cost-per-inference. 2. Learn basic queueing theory (Little's Law) and simple time-series forecasting (moving averages). 3. Study cloud pricing models (on-demand, spot, reserved instances) for GPU/AI accelerators.
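As a minimal sketch of the two ideas in step 2, the snippet below applies Little's Law (average requests in flight = arrival rate × average time in system) and a simple moving-average forecast. The function names and the sample numbers are illustrative assumptions, not values from any real service.

```python
def requests_in_flight(arrival_rate_qps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the average number of requests in the system."""
    return arrival_rate_qps * avg_latency_s


def moving_average_forecast(history: list[float], window: int) -> float:
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(history) < window:
        raise ValueError("window must be positive and no longer than the history")
    recent = history[-window:]
    return sum(recent) / window


# Illustrative numbers: 200 QPS at 50 ms average latency.
concurrency = requests_in_flight(200.0, 0.05)

# Illustrative hourly QPS samples; forecast the next hour from the last three.
next_hour_qps = moving_average_forecast([100.0, 120.0, 140.0, 160.0], window=3)
```

If each replica can hold a fixed number of concurrent requests, dividing `concurrency` by that number gives a first rough replica estimate.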

1. Model load patterns using historical data: identify diurnal/weekly cycles and correlated events (e.g., marketing campaigns). 2. Implement scaling policies in Kubernetes (HPA) or cloud provider autoscalers (AWS ASG, GCP MIG) using custom metrics. 3. Conduct load testing (Locust, k6) to validate scaling triggers and failure modes. Common mistake: scaling on CPU/memory instead of inference-specific metrics like GPU SM occupancy or request queue depth.
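To make the point about inference-specific metrics concrete, here is a rough sketch of the replica calculation the Kubernetes HPA documents: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The metric here is assumed to be average request queue depth per replica; the function name, bounds, and sample values are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_queue_depth: float,   # observed average queue depth per replica (assumed metric)
    target_queue_depth: float,    # queue depth per replica we want to hold
    min_replicas: int = 1,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """HPA-style calculation driven by queue depth instead of CPU."""
    ratio = current_queue_depth / target_queue_depth
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas with an observed queue depth of 30 against a target of 10 would scale to 12, while a depth of 10.5 stays at 4 because it is within tolerance.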

1. Architect multi-layered scaling: combine predictive (scheduled), reactive (threshold-based), and cost-optimized (spot/preemptible) strategies. 2. Develop ML models to forecast traffic and model-specific resource usage (e.g., the impact of input size and batch size). 3. Design cost-performance trade-off frameworks, including right-sizing instances, bin-packing, and multi-region failover. 4. Mentor teams on establishing Service Level Objectives (SLOs) for latency and on managing error budgets.
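One way to frame the right-sizing part of the cost-performance trade-off is to pick the cheapest instance type per inference among those that meet the latency SLO. The sketch below assumes each instance type has already been benchmarked; the `InstanceProfile` type, its fields, and all numbers are hypothetical and not real cloud prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceProfile:
    name: str
    hourly_cost_usd: float   # assumed price, not a real quote
    sustained_qps: float     # throughput measured in a benchmark
    p99_latency_ms: float    # p99 latency at that throughput


def cost_per_million_requests(profile: InstanceProfile) -> float:
    """Cost of serving one million requests at the sustained throughput."""
    requests_per_hour = profile.sustained_qps * 3600
    return profile.hourly_cost_usd / requests_per_hour * 1_000_000


def cheapest_meeting_slo(
    profiles: list[InstanceProfile], p99_slo_ms: float
) -> Optional[InstanceProfile]:
    """Return the lowest cost-per-request profile that satisfies the p99 SLO."""
    eligible = [p for p in profiles if p.p99_latency_ms <= p99_slo_ms]
    if not eligible:
        return None
    return min(eligible, key=cost_per_million_requests)


# Hypothetical benchmark results for three made-up instance sizes.
candidates = [
    InstanceProfile("small", 1.0, 50.0, 80.0),
    InstanceProfile("large", 3.0, 200.0, 40.0),
    InstanceProfile("xlarge", 8.0, 300.0, 30.0),
]
```

With these made-up numbers, a 100 ms SLO selects "large" because it has the lowest cost per request, while a 35 ms SLO forces the more expensive "xlarge".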

Practice Projects

Beginner

Project

Basic Inference Service Scaling Simulation

Scenario

You have a simple text classification model served via a REST API on a single cloud GPU instance. Traffic follows a predictable daily pattern with peaks during business hours.

How to Execute

1. Deploy a mock inference service using Flask/FastAPI on a cloud VM with a GPU. 2. Use a load testing tool (e.g., Locust) to simulate a 24-hour traffic pattern with sinusoidal or square-wave load. 3. Configure a simple Kubernetes Horizontal Pod Autoscaler (HPA) or cloud ASG to scale on CPU or on a custom request-queue metric. 4. Monitor latency, cost, and instance count, then assess how well the scaling policy tracked the load and what it cost.
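The sinusoidal pattern in step 2 can be sketched as a plain function that maps the hour of day to a target number of simulated users. The base and peak user counts and the 13:00 peak are assumptions chosen for illustration.

```python
import math


def diurnal_users(
    hour_of_day: float,
    base_users: int = 10,     # assumed overnight floor
    peak_users: int = 100,    # assumed business-hours peak
    peak_hour: float = 13.0,  # assumed time of the daily peak
) -> int:
    """Sinusoidal daily load: peak_users at peak_hour, base_users 12 hours later."""
    phase = 2 * math.pi * (hour_of_day - peak_hour) / 24.0
    level = (1 + math.cos(phase)) / 2  # 1.0 at the peak, 0.0 at the trough
    return round(base_users + (peak_users - base_users) * level)


# One value per hour, e.g. to drive a load generator on a compressed timeline.
hourly_profile = [diurnal_users(h) for h in range(24)]
```

In Locust, a function like this could be called from a custom load shape's `tick()` method, with the 24 hours compressed into a shorter test run; that wiring is left out here.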

Intermediate

Project

Multi-Metric Predictive Scaling Pipeline

Scenario

You manage a video analysis inference service. Traffic spikes are unpredictable (e.g., viral content) but correlated with external events. Cost sensitivity is high.

How to Execute

1. Ingest historical request logs and external event data (e.g., social media trends) into a time-series database (InfluxDB, Prometheus). 2. Build a predictive model (e.g., Prophet, LSTM) to forecast QPS 1 to 2 hours ahead. 3. Develop a custom Kubernetes controller or cloud function that translates forecasted QPS into required replica counts, taking model-specific resource profiles into account. 4. Implement a hybrid scaling strategy: use predictive scaling for base load, reactive scaling to absorb forecast errors, and spot instances for batch-style workloads.
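Steps 3 and 4 can be sketched as a small pure function that a controller might call on each reconciliation loop. It assumes a benchmarked per-replica throughput and a fixed headroom fraction; the function names, the 20% headroom, and the replica bounds are hypothetical, and the forecast itself is taken as an input rather than produced here.

```python
import math


def replicas_for_qps(qps: float, qps_per_replica: float, headroom: float = 0.2) -> int:
    """Replicas needed to serve `qps` with a fractional safety headroom."""
    return math.ceil(qps * (1 + headroom) / qps_per_replica)


def hybrid_replicas(
    forecast_qps: float,      # output of the forecasting model (assumed given)
    observed_qps: float,      # current measured traffic
    qps_per_replica: float,   # from benchmarking the model on target hardware
    headroom: float = 0.2,
    min_replicas: int = 1,
    max_replicas: int = 50,
) -> int:
    """Predictive layer sets the base; reactive layer covers forecast misses."""
    predictive = replicas_for_qps(forecast_qps, qps_per_replica, headroom)
    reactive = replicas_for_qps(observed_qps, qps_per_replica, headroom)
    # Taking the larger value means an under-forecast is corrected by live traffic.
    return max(min_replicas, min(max_replicas, max(predictive, reactive)))
```

Taking the maximum of the two layers biases toward availability; a cost-sensitive deployment might instead cap how far the reactive layer may exceed the forecast.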

Advanced

Case Study/Exercise

Global Inference Fleet Cost & Performance Optimization

Scenario

As the inference platform lead for a multinational SaaS company, you must optimize a $2M/month inference bill across AWS and GCP regions while guaranteeing p99 latency under 100 ms globally for a large language model (LLM) serving service.

How to Execute

1. Audit current infrastructure: map model versions to instance types, analyze utilization, and identify low-hanging fruit (e.g., over-provisioned instances). 2. Design a tiered architecture: use reserved instances for baseline, spot instances for non-critical batch jobs, and on-demand for scaling peaks. Implement cross-region traffic routing based on real-time latency and cost. 3. Develop a unified control plane that aggregates metrics, runs optimization algorithms (bin-packing, right-sizing), and executes scaling actions across clouds. 4. Establish a continuous review process with engineering and finance to align scaling policies with product roadmaps and budget cycles.
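The bin-packing mentioned in step 3 could start from a standard heuristic such as first-fit decreasing. The sketch below packs model replicas onto GPUs by memory footprint only; treating memory as the single constraint, and the sizes used in the example, are simplifying assumptions.

```python
def first_fit_decreasing(item_sizes_gb: list[int], bin_capacity_gb: int) -> list[list[int]]:
    """Pack items into as few fixed-capacity bins as the heuristic finds.

    Items are placed largest first, each into the first bin with enough room.
    """
    bins: list[list[int]] = []   # contents of each bin (GPU)
    free: list[int] = []         # remaining capacity of each bin
    for size in sorted(item_sizes_gb, reverse=True):
        if size > bin_capacity_gb:
            raise ValueError(f"item of {size} GB exceeds bin capacity")
        for i, remaining in enumerate(free):
            if size <= remaining:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            # No existing bin fits: open a new one.
            bins.append([size])
            free.append(bin_capacity_gb - size)
    return bins


# Hypothetical replica memory footprints (GB) packed onto 80 GB GPUs.
placement = first_fit_decreasing([30, 10, 40, 20, 35, 5], bin_capacity_gb=80)
```

A production placement would also need to account for compute contention, latency isolation, and failure domains, which this heuristic ignores.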

Tools & Frameworks

Infrastructure & Orchestration

Kubernetes (Horizontal Pod Autoscaler, Cluster Autoscaler, KEDA); AWS Auto Scaling Groups (ASG) + CloudWatch; Google Cloud Managed Instance Groups (MIG) + Cloud Monitoring (formerly Stackdriver)

Core platforms for defining and executing scaling policies. KEDA is essential for event-driven scaling based on custom metrics like message queue length.

Monitoring & Observability

Prometheus + Grafana; Datadog; OpenTelemetry

For collecting and visualizing key inference metrics (GPU utilization, latency, error rates) which are the signals that drive scaling decisions.

Load Testing & Simulation

Locust; k6; Artillery

Used to generate realistic traffic patterns to test and validate autoscaling policies before they are applied to production.

Forecasting & Modeling

Prophet (by Meta); TensorFlow/PyTorch for custom LSTM models; Amazon Forecast / Azure Anomaly Detector

For building predictive models to anticipate future demand, which is the core of predictive scaling.

Interview Questions

Answer Strategy

Structure the answer around: 1) Baseline modeling (requests/user, p95 latency SLO), 2) Traffic forecasting (using historical data from similar features, marketing plans), 3) Resource profiling (benchmark the model on target hardware to get QPS/GPU), 4) Incorporating a safety buffer and cost constraints. Sample: 'I'd start by profiling the model to establish QPS per GPU. Then, using product launch forecasts, I'd model peak traffic scenarios. I'd calculate required GPUs as (Peak QPS * Safety Factor) / QPS_per_GPU, then validate this with a staged load test. Finally, I'd choose a mix of reserved and spot instances to meet cost targets.'
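As a worked instance of the sizing formula in the sample answer, with a ceiling added because GPUs come in whole units. The symbols and the numbers (1,200 peak QPS, a safety factor of 1.3, 90 QPS per GPU) are illustrative assumptions, not figures from the source.

```latex
\[
N_{\text{GPU}}
  = \left\lceil \frac{Q_{\text{peak}} \times S}{q_{\text{GPU}}} \right\rceil
  = \left\lceil \frac{1200 \times 1.3}{90} \right\rceil
  = \lceil 17.33 \rceil
  = 18
\]
```

Here \(Q_{\text{peak}}\) is the forecast peak QPS, \(S\) the safety factor, and \(q_{\text{GPU}}\) the benchmarked QPS per GPU.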

Answer Strategy

Tests debugging skills and learning from failure. Use the STAR method. Focus on the technical root cause (e.g., metric lag, incorrect threshold) and the systemic fix (e.g., added a new metric, implemented a predictive layer). Sample: 'During a flash sale, our HPA didn't scale fast enough due to CPU metric lag, and we breached our latency SLOs. The root cause was scaling on CPU rather than inference queue depth. I fixed it by implementing a custom metrics adapter to expose queue depth to the HPA and added a predictive scaling rule for known sale times, cutting over-provisioning by 40%.'