AI Cost Optimization Engineer
An AI Cost Optimization Engineer specializes in reducing and right-sizing the financial footprint of AI and ML workloads across cl…
Skill Guide
The systematic process of maximizing the computational efficiency of ML model serving by grouping multiple inference requests together (batching), dynamically adjusting group size based on load (dynamic batching), and strategically balancing the tradeoff between processing time per request (latency) and requests processed per unit time (throughput).
Scenario
Deploy a pre-trained ResNet-50 model on a single GPU to classify images. Measure the impact of different static batch sizes (1, 8, 32, 128) on latency and throughput.
Scenario
Serve a text classification model with an SLA of <100ms P99 latency. Traffic is bursty. Configure and test a dynamic batching server to maximize throughput while respecting the SLO.
Scenario
You are deploying two models: Model A (critical, low-latency user-facing fraud check) and Model B (background, high-throughput ad scoring). Both share a GPU cluster.
Core platforms for deploying models in production. They handle the actual batching logic, model versioning, and HTTP/gRPC endpoints. Triton is the industry leader for complex, multi-model, GPU-optimized deployments.
Nsight Systems profiles GPU kernel execution and memory transfers to identify bottlenecks. Locust/vegeta generate realistic client-side load to benchmark latency percentiles (P50, P99) and throughput under pressure.
Used *before* serving to compile and optimize models (quantization, layer fusion) to reduce per-batch inference time. This is a prerequisite for enabling more aggressive batching within a given latency SLO.
Answer Strategy
Use a structured problem-solving framework: 1) Diagnose (profile to find the bottleneck - is it data transfer, compute, or memory?), 2) Hypothesize (is the batch too large for the hardware/memory?), 3) Test (implement dynamic batching with a timeout to collect a batch within a time budget, e.g., 30ms), 4) Iterate (tune timeout and max_batch_size, potentially use model optimizations like TensorRT to reduce compute time). Sample Answer: 'First, I'd profile with Nsight to see if the bottleneck is data loading or GPU kernel execution. Given the 50ms SLO, I'd implement dynamic batching with a max_wait time of ~20ms and a max_batch_size of 16, then load test to find the optimal point. I'd also look at compiling the model with TensorRT to reduce the per-batch latency, which would allow us to potentially increase the batch size.'
Answer Strategy
Testing strategic thinking and business acumen. The candidate should connect technical decisions to cost and user experience. Sample Answer: 'For a video processing service, increasing the batch size from 8 to 32 tripled throughput and cut cloud costs by 40%, but increased latency from 100ms to 450ms. I presented a cost-benefit analysis to stakeholders showing the savings were significant, but the latency increase would hurt user engagement for real-time previews. We compromised by using two different serving configurations: one for real-time interactive requests (low batch, low latency) and another for batch processing jobs (high batch, high throughput). This optimized cost while preserving the user experience.'
1 career found
Try a different search term.