Skill Guide

Capacity planning and demand forecasting for training and inference workloads

The systematic process of estimating future compute resource requirements (GPU/CPU/TPU, memory, storage, network) for AI model training and inference, aligning infrastructure provisioning with anticipated workload to optimize cost and performance.

This skill helps prevent cost overruns and service degradation by scaling infrastructure ahead of demand rather than in reaction to it. It helps turn AI/ML from a research cost center into a scalable, profitable production function by enabling predictable budgeting and SLA-compliant performance.

How to Learn Capacity planning and demand forecasting for training and inference workloads

1. Master core concepts: GPU memory hierarchy (HBM vs. GDDR), FLOPS, utilization metrics (MFU), and batch size effects.
2. Understand workload signatures: Differentiate training (bursty, high parallelism) from inference (steady-state, latency-sensitive).
3. Learn basic monitoring: Use tools like Prometheus to track existing workload metrics (GPU utilization, memory usage, request latency).
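As a rough illustration of the MFU metric named above, the sketch below divides achieved model FLOPs per second by the hardware peak. It assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass. The function name, the 7B-parameter model, the throughput, and the cluster size are all hypothetical; 312 TFLOPS is the A100's published dense BF16 peak.

```python
def model_flops_utilization(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Return approximate MFU as a fraction of hardware peak."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward),
    # ignoring attention FLOPs and any activation recomputation.
    achieved_flops_per_second = 6.0 * params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical job: 7B parameters, 200k tokens/s across 64 A100s.
mfu = model_flops_utilization(
    params=7e9,
    tokens_per_second=200_000,
    num_gpus=64,
    peak_flops_per_gpu=312e12,  # A100 dense BF16 peak
)
print(f"MFU: {mfu:.1%}")
```

An MFU well below what comparable jobs achieve is usually a signal to look at data loading, communication, or batch size before requesting more GPUs.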

1. Move from observation to modeling: Use historical utilization data to create baseline forecasts using time-series analysis (e.g., ARIMA, Prophet).
2. Perform scenario planning: Model demand for different model versions, batch sizes, or user growth rates.
3. Avoid common mistakes: Never assume linear scaling; account for non-parallelizable parts (Amdahl's Law) and for overhead from kernel launches and collective communication (CUDA, NCCL).
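Before reaching for ARIMA or Prophet, it can help to have a trivial baseline to beat. The sketch below is a seasonal-naive forecast with a week-over-week growth factor, written in plain Python. It is a stand-in baseline, not either of the models named above, and the function name and the daily GPU-hour figures are made up for illustration.

```python
from statistics import mean


def seasonal_naive_forecast(history, season=7, horizon=7):
    """Repeat the last season, scaled by the latest season-over-season growth."""
    if len(history) < 2 * season:
        raise ValueError("need at least two full seasons of history")
    last = history[-season:]
    prev = history[-2 * season:-season]
    growth = mean(last) / mean(prev)
    # Each further season into the future compounds the growth factor once more.
    return [last[h % season] * growth ** (h // season + 1) for h in range(horizon)]


# Hypothetical daily GPU-hours: week 2 is 10% above week 1.
week1 = [100, 120, 130, 125, 140, 90, 80]
daily_gpu_hours = week1 + [x * 1.1 for x in week1]
forecast = seasonal_naive_forecast(daily_gpu_hours)
print([round(x, 1) for x in forecast])
```

If a statistical model cannot clearly outperform this baseline on held-out weeks, the extra complexity is probably not buying much.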

1. Architect for elasticity: Design systems using orchestration (Kubernetes) to auto-scale inference endpoints and manage spot instances for training.
2. Implement feedback loops: Integrate capacity forecasts directly into FinOps and cloud procurement workflows.
3. Strategic alignment: Tie infrastructure plans directly to product roadmaps, quantifying the cost impact of each new model feature or user cohort.

Practice Projects

Beginner

Project

Forecasting a Single Training Job's GPU Hours

Scenario

Your team plans to train a new vision transformer model on a 10TB dataset. You have access to logs from a similar, smaller model trained previously.

How to Execute

1. Extract key metrics from prior training: dataset size, GPU hours per epoch, final convergence epoch.
2. Calculate a scaling factor based on the increase in dataset size and model parameters (estimate 1.8-2.2x compute per parameter).
3. Add a 15-25% buffer for experimentation and failed runs.
4. Create a simple spreadsheet projecting total GPU-weeks and cost at current cloud pricing.
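The four steps above can be sketched as a small calculation before moving them into a spreadsheet. This version treats the 1.8-2.2x figure as a single multiplier on compute from the larger model and assumes the new run converges in the same number of epochs as the prior one; both are simplifying assumptions. All inputs (120 GPU-hours per epoch, 30 epochs, a 4x larger dataset, $2.50 per GPU-hour) are hypothetical.

```python
HOURS_PER_WEEK = 168


def forecast_training_gpu_hours(prior_gpu_hours_per_epoch, prior_epochs,
                                dataset_scale, compute_scale, buffer):
    """Scale a prior run's GPU-hours to a new run and add a safety buffer."""
    prior_total = prior_gpu_hours_per_epoch * prior_epochs
    return prior_total * dataset_scale * compute_scale * (1 + buffer)


# Hypothetical prior run and pricing; replace with values from your own logs.
price_per_gpu_hour = 2.50
estimates = {}
for label, buffer in (("low", 0.15), ("high", 0.25)):
    gpu_hours = forecast_training_gpu_hours(
        prior_gpu_hours_per_epoch=120,
        prior_epochs=30,
        dataset_scale=4.0,   # new dataset is 4x the prior one (assumed)
        compute_scale=2.0,   # midpoint of the 1.8-2.2x range
        buffer=buffer,
    )
    estimates[label] = gpu_hours
    print(f"{label}: {gpu_hours:,.0f} GPU-hours, "
          f"{gpu_hours / HOURS_PER_WEEK:,.1f} GPU-weeks, "
          f"${gpu_hours * price_per_gpu_hour:,.0f}")
```

Reporting a low and a high estimate rather than a single number makes the uncertainty in the buffer visible to whoever approves the budget.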

Intermediate

Project

Building an Inference Auto-Scaling Policy

Scenario

A recommendation model serving 1000 QPS is experiencing latency spikes during peak hours (4-6 PM). You need to create a scaling policy that balances cost and latency SLA (<200ms p99).

How to Execute

1. Profile the model to determine its baseline performance: requests per GPU per second, memory footprint.
2. Define scaling metrics: Use a custom metric (e.g., 'requests_per_gpu') rather than just CPU utilization.
3. Set scaling thresholds: Scale out when average requests/GPU > 80% of max capacity for 2 minutes.
4. Implement and test the policy in a staging environment using load testing tools (Locust, k6) to simulate peak traffic and validate SLA compliance.
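The threshold logic in step 3 can be prototyped offline before it is encoded in an autoscaler. The sketch below is plain Python, not a Kubernetes or cloud API. The function names and the profiled capacity of 60 requests per second per GPU are hypothetical, and the samples are assumed to cover the 2-minute evaluation window.

```python
import math


def desired_gpus(total_qps, max_qps_per_gpu, target_utilization=0.8, min_gpus=1):
    """GPUs needed so that average load per GPU stays at or below the target."""
    usable_qps_per_gpu = max_qps_per_gpu * target_utilization
    return max(min_gpus, math.ceil(total_qps / usable_qps_per_gpu))


def should_scale_out(window_qps_per_gpu, max_qps_per_gpu, threshold=0.8):
    """True if average requests/GPU over the window exceeds the threshold."""
    if not window_qps_per_gpu:
        return False
    average = sum(window_qps_per_gpu) / len(window_qps_per_gpu)
    return average > threshold * max_qps_per_gpu


# Hypothetical profile: one GPU sustains 60 requests/s within the latency SLA.
print(desired_gpus(total_qps=1000, max_qps_per_gpu=60))
print(should_scale_out([50, 52, 49, 51], max_qps_per_gpu=60))
print(should_scale_out([40, 45, 50, 44], max_qps_per_gpu=60))
```

The maximum per-GPU rate should come from load tests at the p99 latency target, since throughput measured without a latency bound tends to overstate usable capacity.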

Advanced

Case Study/Exercise

Global Capacity Board Decision Simulation

Scenario

You lead the ML Infra team. Product leadership wants to launch a new, 5x larger LLM feature in Q3, requiring 2000 A100 GPUs for training. Cloud provider commitments renew in Q2 with a 20% discount for a 3-year commitment. Finance is pressuring for cost certainty. You must present a plan.

How to Execute

1. Build a multi-faceted forecast: Separate needs for new training (Q3 burst), current production inference (steady growth), and experimentation.
2. Model procurement options: Compare reserved instances (for baseline inference) vs. spot/on-demand (for training bursts).
3. Quantify risk: Model scenarios for feature delay (stranded reserved capacity) and success (need for immediate burst expansion).
4. Propose a hybrid strategy: Commit to reserved instances for 70% of baseline inference need, use spot for training, and negotiate on-demand rates for overflow.
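One way to make the risk scenarios in steps 3 and 4 concrete is a small cost model for the inference baseline. The sketch below compares a 70% reserved commitment at the 20% discount against paying on-demand for everything. It assumes demand is flat over the year and covers inference only; the 400-GPU baseline, the $3.00 hourly rate, and the scenario demand levels are hypothetical.

```python
HOURS_PER_YEAR = 8760


def annual_inference_cost(demand_gpus, reserved_gpus, on_demand_rate, discount):
    """Return (annual cost, stranded reserved GPUs) for a flat demand level."""
    # Reserved capacity is billed whether or not it is used.
    reserved_cost = reserved_gpus * HOURS_PER_YEAR * on_demand_rate * (1 - discount)
    overflow_gpus = max(0, demand_gpus - reserved_gpus)
    overflow_cost = overflow_gpus * HOURS_PER_YEAR * on_demand_rate
    stranded_gpus = max(0, reserved_gpus - demand_gpus)
    return reserved_cost + overflow_cost, stranded_gpus


baseline_gpus = 400                           # hypothetical inference baseline
reserved_gpus = round(0.70 * baseline_gpus)   # 70% commitment
scenarios = {"feature delayed": 250, "as planned": 400, "feature succeeds": 600}

results = {}
for name, demand in scenarios.items():
    hybrid, stranded = annual_inference_cost(demand, reserved_gpus, 3.00, 0.20)
    on_demand_only, _ = annual_inference_cost(demand, 0, 3.00, 0.20)
    results[name] = (hybrid, on_demand_only, stranded)
    print(f"{name}: hybrid ${hybrid:,.0f}, on-demand only ${on_demand_only:,.0f}, "
          f"stranded GPUs {stranded}")
```

Running the same model with a range of commitment fractions shows where the break-even point sits for the delay scenario, which is the number Finance is most likely to ask about.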

Tools & Frameworks

Monitoring & Data Platforms

Prometheus + Grafana, Datadog, Custom logging to time-series DB

Non-negotiable for gathering the raw utilization data (GPU MFU, memory, network I/O) that forms the empirical basis of any forecast. Prometheus with Grafana is a widely used open-source stack for custom metrics; keep label cardinality bounded, because high-cardinality labels are expensive in Prometheus.

Forecasting & Analysis Tools

Python (Pandas, statsmodels, Prophet), Excel/Google Sheets, Cost exploration tools (AWS Cost Explorer, GCP Cost Management)

Used to analyze historical data, build statistical forecasting models, and visualize cost/usage trends. Prophet is excellent for forecasting with multiple seasonalities (e.g., daily/weekly user patterns).

Orchestration & Provisioning

Kubernetes (with Karpenter/Cluster Autoscaler), Terraform/Pulumi, Cloud-specific ML services (SageMaker, Vertex AI)

The operational layer where capacity plans are executed. Kubernetes enables auto-scaling and bin-packing. Infrastructure-as-Code (Terraform) allows capacity changes to be version-controlled, repeatable, and auditable.

Mental Models & Methodologies

FinOps Framework, Amdahl's Law / Gustafson's Law, Utilization Target Setting (e.g., 70-80% for production)

FinOps provides the cultural framework for linking cloud cost to business value. Amdahl's Law is critical for understanding the limits of parallelizing training jobs. Setting utilization targets prevents both over-provisioning and risky saturation.
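To see why Amdahl's Law limits the payoff from adding GPUs, the sketch below evaluates the standard formula, speedup = 1 / ((1 - p) + p / n), for a job whose parallelizable fraction p is 0.95. The function name and the 0.95 figure are illustrative; in practice p would be estimated from profiling.

```python
def amdahl_speedup(parallel_fraction, num_workers):
    """Upper bound on speedup when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_workers)


# Illustrative job: 95% of the step time parallelizes across GPUs.
for gpus in (1, 8, 32, 256):
    print(f"{gpus:>3} GPUs: {amdahl_speedup(0.95, gpus):.2f}x")
```

With a 5% serial portion the speedup can never exceed 20x however many GPUs are added, which is why forecasts that assume linear scaling overstate the benefit of large clusters.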

Interview Questions

Answer Strategy

Use a structured debugging framework:
1. Verify the problem: Check scalability metrics (MFU, speedup ratio).
2. Isolate the bottleneck: Is it communication (NCCL overhead), data loading (I/O), or model architecture (non-parallelizable)?
3. Propose targeted solutions: Profile first, then optimize (better data loader, larger batch size, different parallelism strategy).
4. Make a business decision: Is the 4x cost increase justified when the time saving is less than proportional?

Sample Answer: 'I would first ask for their scaling metrics to confirm the bottleneck. I'd have them run a profiling job with PyTorch Profiler to see whether time is spent in NCCL all-reduce calls or in data loading. Often, optimizing the data pipeline or switching to model parallelism yields better gains than just adding GPUs. Before provisioning 32 GPUs, I'd run a cost-benefit analysis on the expected time reduction versus the 4x cost.'
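The cost-benefit step in the sample answer can be worked through with measured throughput. The sketch below assumes the comparison is between 8 and 32 GPUs, which is one reading of the 4x figure, and uses hypothetical throughput numbers. It reports scaling efficiency along with wall-clock time and total GPU-hours for a fixed amount of work.

```python
def scaling_efficiency(base_gpus, base_throughput, new_gpus, new_throughput):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = new_throughput / base_throughput
    ideal_speedup = new_gpus / base_gpus
    return speedup / ideal_speedup


def run_hours(total_samples, samples_per_second):
    """Wall-clock hours to process a fixed number of samples."""
    return total_samples / samples_per_second / 3600


# Hypothetical measurements: 8 GPUs reach 8,000 samples/s, 32 GPUs reach 20,000.
total_samples = 2.88e9
efficiency = scaling_efficiency(8, 8_000, 32, 20_000)
hours_8 = run_hours(total_samples, 8_000)
hours_32 = run_hours(total_samples, 20_000)
gpu_hours_8 = 8 * hours_8
gpu_hours_32 = 32 * hours_32

print(f"scaling efficiency: {efficiency:.1%}")
print(f"8 GPUs:  {hours_8:.0f} h wall clock, {gpu_hours_8:.0f} GPU-hours")
print(f"32 GPUs: {hours_32:.0f} h wall clock, {gpu_hours_32:.0f} GPU-hours")
```

In this made-up case the hourly spend is 4x, the run finishes 2.5x sooner, and total GPU-hours rise by 1.6x, so the decision comes down to what the saved wall-clock time is worth.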

Answer Strategy

This question tests crisis management, communication, and pragmatic scaling. Show a calm, methodical approach. Sample Answer: 'I'd immediately initiate a war room with Product, Finance, and Infra. My first action is to quantify the exact new requirement (model, QPS, latency). I'd explore short-term mitigation: enabling request queuing, extending the response cache TTL so more requests are served from cache, and negotiating emergency cloud capacity. For the mid-term, I'd work with the team to right-size the model (distillation, quantization) and adjust auto-scaling policies aggressively. Simultaneously, I'd provide Finance with a clear cost projection for the options, ensuring we don't make panic-buying decisions.'