AI Inference Optimization Engineer
An AI Inference Optimization Engineer specializes in making trained AI models faster, cheaper, and more efficient when serving pre…
Skill Guide
The systematic design of request scheduling and batch assembly strategies for LLM inference to maximize hardware utilization (GPU throughput) and minimize latency.
Scenario
You have a local LLM (e.g., Mistral-7B) and need to serve a workload of random user queries. You must demonstrate why dynamic batching outperforms static batching.
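The gap between the two strategies can be shown before touching a GPU. The following is a minimal simulation sketch, not a benchmark: it assumes every running request produces one token per iteration, ignores prefill cost, and uses hypothetical function names. It counts how many decode iterations each strategy needs to drain the same queue.

```python
def static_batching_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Iteration-level batching: a finished request frees its slot at once."""
    queue = list(output_lens)[::-1]  # reversed so pop() yields FIFO order
    active = []                      # tokens still to generate per running request
    steps = 0
    while queue or active:
        # Refill free slots from the queue before every iteration.
        while queue and len(active) < batch_size:
            active.append(queue.pop())
        steps += 1
        # Each active request emits one token; drop those that just finished.
        active = [left - 1 for left in active if left > 1]
    return steps


def utilization(output_lens, batch_size, steps):
    """Fraction of slot-iterations that produced a token."""
    return sum(output_lens) / (steps * batch_size)


if __name__ == "__main__":
    lens = [2, 10, 2, 10]  # two short and two long generations
    for name, fn in [("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)]:
        steps = fn(lens, batch_size=2)
        print(name, steps, round(utilization(lens, 2, steps), 2))
```

With mixed output lengths, the static variant keeps short requests' slots idle until the longest request in the batch ends, which is the effect a real measurement on the local model should reproduce.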
Scenario
Your service must handle both short chat messages and very long document summaries. Long prefills are blocking short requests, causing high TTFT for chat users.
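The blocking effect in this scenario can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not any engine's scheduler: time is measured in processed tokens, decode work is ignored, and the function name, `token_budget`, and `chunk_size` are hypothetical. It compares when each prompt's prefill completes with and without chunking.

```python
def prefill_finish_times(prompt_lens, token_budget, chunk_size=None):
    """Return, per prompt, the clock value at which its prefill completes.

    Time is counted in processed tokens (real iteration time is not strictly
    linear in token count).
    chunk_size=None -> one whole prompt per iteration, FIFO; the budget is
                       ignored to model an unchunked whole-prompt prefill.
    chunk_size=k    -> each iteration walks the queue in FIFO order and gives
                       every unfinished prompt at most k tokens until
                       token_budget is spent.
    """
    remaining = list(prompt_lens)
    finish = [None] * len(remaining)
    clock = 0
    while any(remaining):
        used = 0
        done_now = []
        for i, left in enumerate(remaining):
            if left == 0:
                continue
            if chunk_size is None:
                take = left
            else:
                take = min(left, chunk_size, token_budget - used)
            if take == 0:
                break  # budget exhausted for this iteration
            remaining[i] -= take
            used += take
            if remaining[i] == 0:
                done_now.append(i)
            if chunk_size is None:
                break  # unchunked: only one prompt per iteration
        clock += used
        for i in done_now:
            finish[i] = clock
    return finish


if __name__ == "__main__":
    prompts = [8000, 50]  # a long document followed by a short chat message
    print(prefill_finish_times(prompts, token_budget=2048))
    print(prefill_finish_times(prompts, token_budget=2048, chunk_size=512))
```

In this model the short prompt finishes its prefill far earlier once the long one is chunked, while the long prompt finishes slightly later. That trade is the point of the technique: interactive TTFT improves at a small cost to the long request.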
Scenario
You are the lead architect for a multi-tenant LLM platform. One tenant runs latency-sensitive interactive chat (SLO: P99 TTFT < 500ms), another runs batch processing of long documents (SLO: maximize throughput). The scheduler must allocate GPU resources dynamically.
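One possible starting point for such a scheduler is a work-conserving split of each iteration's token budget. This is a sketch under stated assumptions rather than a production design: the function name, the per-iteration `budget`, and the `batch_floor_pct` guarantee are all hypothetical. The interactive tenant is served first, the batch tenant keeps a small guaranteed floor so it cannot starve, and any share one tenant does not need flows to the other.

```python
def split_token_budget(budget, interactive_demand, batch_demand, batch_floor_pct=20):
    """Split one iteration's token budget between two tenants.

    budget             -- tokens the GPU can process this iteration
    interactive_demand -- tokens the latency-sensitive tenant wants
    batch_demand       -- tokens the throughput tenant wants
    batch_floor_pct    -- share (in percent) guaranteed to the batch tenant
    Returns (interactive_tokens, batch_tokens).
    """
    # The floor only applies to work the batch tenant actually has queued.
    floor = min(batch_demand, budget * batch_floor_pct // 100)
    # Interactive traffic gets everything above the floor, up to its demand.
    interactive = min(interactive_demand, budget - floor)
    # Whatever the interactive tenant leaves unused goes to batch work.
    batch = min(batch_demand, budget - interactive)
    return interactive, batch


if __name__ == "__main__":
    print(split_token_budget(1000, 5000, 5000))  # both tenants saturated
    print(split_token_budget(1000, 0, 5000))     # interactive tenant idle
    print(split_token_budget(1000, 5000, 50))    # batch tenant nearly idle
```

A real platform would drive the floor from observed SLO headroom (for example, lowering it when P99 TTFT approaches 500ms) instead of fixing it as a constant.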
These are production-grade engines that implement advanced batching strategies (continuous batching, chunked prefill, PagedAttention) under the hood. Use them as your primary runtime; deep skill involves configuring their scheduler parameters (max_num_batched_tokens, max_num_seqs) and understanding their kernel-level optimizations.
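To build intuition for how the two named parameters interact, the sketch below treats `max_num_seqs` as a per-iteration cap on sequences and `max_num_batched_tokens` as a per-iteration cap on tokens. That interpretation is an assumption based on how engines such as vLLM document them, and the `admit` function is a hypothetical toy, not any engine's actual scheduler.

```python
def admit(waiting_prompt_lens, running_seqs, max_num_seqs, max_num_batched_tokens):
    """Admit waiting prompts (FIFO) for one iteration under both caps.

    Each already-running sequence is assumed to consume one token of the
    budget for its decode step. Returns (admitted_prompt_lens, tokens_used).
    """
    tokens = running_seqs
    admitted = []
    for prompt_len in waiting_prompt_lens:
        if running_seqs + len(admitted) >= max_num_seqs:
            break  # sequence cap binds
        if tokens + prompt_len > max_num_batched_tokens:
            break  # token cap binds
        admitted.append(prompt_len)
        tokens += prompt_len
    return admitted, tokens


if __name__ == "__main__":
    waiting = [100, 300, 700]
    # Token cap binds: the 700-token prompt does not fit this iteration.
    print(admit(waiting, running_seqs=4, max_num_seqs=8, max_num_batched_tokens=512))
    # Sequence cap binds: only one free slot remains.
    print(admit(waiting, running_seqs=4, max_num_seqs=5, max_num_batched_tokens=512))
```

Checking which cap binds for a given workload is a useful first step before tuning either value.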
Nsight and PyTorch Profiler are used for low-level GPU kernel and CUDA memory analysis to identify bottlenecks in the batching pipeline. Grafana/Prometheus are used for high-level, real-time monitoring of inference metrics (TTFT, TPOT, queue depth, KV-cache usage) to make data-driven scheduling decisions.
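The two latency metrics named above can be derived from per-token timestamps. The sketch below assumes one common definition: TTFT is the delay from request arrival to the first output token, and TPOT is the mean gap between subsequent tokens. The function name and arguments are hypothetical.

```python
def ttft_and_tpot(arrival_time, token_times):
    """Compute (TTFT, TPOT) in the same time unit as the inputs.

    arrival_time -- when the request reached the server
    token_times  -- timestamps at which each output token was emitted
    TPOT is None when fewer than two tokens were produced.
    """
    ttft = token_times[0] - arrival_time
    if len(token_times) < 2:
        return ttft, None
    # Mean inter-token gap, excluding the first token (covered by TTFT).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    print(ttft_and_tpot(10.0, [10.5, 10.75, 11.0, 11.25]))
```

Exporting these per request, then aggregating to P50 and P99 in the monitoring stack, gives the dashboards something concrete to alert on.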
Necessary to generate realistic, concurrent request patterns that mimic production traffic. Locust is a general-purpose load testing tool; LLMPerf and custom scripts allow for precise control over prompt lengths and arrival rates to stress-test specific batching strategies.
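A custom script of the kind mentioned here usually starts from an arrival schedule. The following sketch assumes Poisson arrivals (exponentially distributed gaps) and a two-point prompt-length mix; the function name and every parameter are hypothetical, and sending the requests to a server is left out.

```python
import random


def make_workload(n, rate_per_s, short_len, long_len, long_fraction, seed=0):
    """Generate n (arrival_time_s, prompt_len) pairs.

    rate_per_s    -- mean arrival rate; gaps are exponentially distributed
    long_fraction -- probability that a request carries the long prompt
    seed          -- fixed so that a run can be reproduced exactly
    """
    rng = random.Random(seed)
    t = 0.0
    workload = []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)  # exponential gap -> Poisson arrivals
        prompt_len = long_len if rng.random() < long_fraction else short_len
        workload.append((t, prompt_len))
    return workload


if __name__ == "__main__":
    for arrival, prompt_len in make_workload(5, rate_per_s=5.0, short_len=64,
                                             long_len=4096, long_fraction=0.1):
        print(f"{arrival:.3f}s  {prompt_len} tokens")
```

Sweeping `rate_per_s` and `long_fraction` independently makes it possible to see which batching strategy degrades first under load versus under prompt-length skew.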
Answer Strategy
First, define both terms precisely. Then, link chunked prefill to solving the 'long prompt blocking' problem in mixed workloads. Finally, state a drawback, such as the overhead of managing KV-cache fragments or the risk of increasing TPOT if the chunk size is misconfigured.
Sample Answer: 'Continuous batching dynamically groups requests at the iteration level to prevent decode-phase idle time. Chunked prefill breaks the compute-heavy prefill phase of long prompts into smaller chunks, allowing decode iterations for other requests to be interleaved. You'd implement it when serving heterogeneous workloads where a few long-document requests would otherwise monopolize the GPU, crippling TTFT for interactive users. A key drawback is the added scheduler complexity and potential memory fragmentation from managing partial KV-caches across chunks.'
Answer Strategy
Show a systematic, observability-driven debugging approach. The answer must isolate whether the cause is a change in traffic pattern, a resource leak, or a scheduler bug.
Sample Answer: '1. Check the request queue depth and latency distribution in Grafana; a spike in P99 queue wait time indicates the scheduler is overwhelmed.
2. Analyze the incoming prompt length distribution; a shift toward much longer prompts can cause memory pressure and scheduler contention.
3. Examine KV-cache utilization through the serving engine's own metrics rather than nvidia-smi, which reports only total GPU memory and cannot show how much of a preallocated cache is in use; if utilization is near the maximum, the scheduler may be unable to admit new requests and may fall back to preemption or queueing.
4. Review recent deployment logs for configuration changes to batch size or chunking parameters.
The root cause is often a sudden influx of long prompts that the current batching strategy cannot handle without excessive preemption or queueing.'
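The prompt-length check in step 2 can be automated. The sketch below is one hedged way to do it: it compares a nearest-rank percentile of prompt lengths between a baseline window and the current window. The function names, the integer percentile `p`, and the `ratio` threshold are all hypothetical choices.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integers
    return ordered[max(rank, 1) - 1]


def prompt_length_shift(baseline, current, p=99, ratio=2):
    """Flag a shift when the current percentile is >= ratio x the baseline.

    Returns (shifted, baseline_percentile, current_percentile).
    """
    before = percentile(baseline, p)
    after = percentile(current, p)
    return after >= ratio * before, before, after


if __name__ == "__main__":
    last_week = list(range(1, 101))      # stand-in for logged prompt lengths
    today = [3 * n for n in last_week]   # same shape, three times longer
    print(prompt_length_shift(last_week, today))
```

Running the same comparison on queue wait times covers step 1 with no extra code.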