AI Inference Optimization Engineer
An AI Inference Optimization Engineer specializes in making trained AI models faster, cheaper, and more efficient when serving pre…
Skill Guide
The systematic optimization of GPU memory allocation and the efficient reuse of intermediate key/value tensors during transformer inference to minimize memory footprint and maximize throughput.
Scenario
You have a pre-trained LLM (e.g., GPT-2) and a set of sample prompts. The goal is to quantify the memory consumed by the KV-cache during autoregressive generation.
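As a rough starting point for this scenario, the sketch below traces how the cache would grow during generation. It assumes the GPT-2 small shape (12 layers, 12 heads, head dimension 64) and a float16 cache; the prompt lengths and the helper names `kv_bytes_per_token` and `cache_trace` are made up for illustration. It is an estimate to compare against profiler measurements, not a measurement itself.

```python
# Assumed GPT-2 small shape; adjust for other checkpoints.
GPT2_LAYERS = 12
GPT2_HEADS = 12
GPT2_HEAD_DIM = 64
BYTES_FP16 = 2  # assumption: the cache is stored in float16


def kv_bytes_per_token(layers, heads, head_dim, bytes_per_element):
    """Bytes added per token: one key and one value vector per head per layer."""
    return 2 * layers * heads * head_dim * bytes_per_element


def cache_trace(prompt_tokens, new_tokens, per_token):
    """Estimated cache size after prefill and after each generated token.

    Assumes every token processed so far has an entry in the cache.
    """
    return [(prompt_tokens + step) * per_token for step in range(new_tokens + 1)]


per_token = kv_bytes_per_token(GPT2_LAYERS, GPT2_HEADS, GPT2_HEAD_DIM, BYTES_FP16)

# Hypothetical token counts for three sample prompts.
sample_prompt_lengths = {"short": 8, "medium": 64, "long": 512}

for name, prompt_tokens in sample_prompt_lengths.items():
    trace = cache_trace(prompt_tokens, new_tokens=128, per_token=per_token)
    print(
        f"{name}: {trace[0] / 2**20:.2f} MiB after prefill, "
        f"{trace[-1] / 2**20:.2f} MiB after generation"
    )
```

Comparing these estimates with the allocator statistics reported by a profiler is one way to separate cache growth from weights and temporary activations.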
Scenario
You need to reduce the memory footprint of the KV-cache for a multi-head attention layer without a significant drop in output quality.
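One common way to approach this scenario is to store cached keys and values in int8 instead of float16, which roughly halves cache memory at the cost of a small rounding error. The sketch below shows symmetric per-vector int8 quantization in plain Python; the sample values and function names are hypothetical, and a real implementation would operate on GPU tensors rather than lists.

```python
def quantize_int8(vector):
    """Symmetric per-vector int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to the symmetric int8 range after rounding.
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Made-up key vector for one head at one position.
key_vector = [0.12, -0.80, 0.33, 0.05, -0.41, 0.77, -0.02, 0.60]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print("codes:", codes)
print("max absolute error:", max_error)
```

The rounding error per element is bounded by half the scale, so the quality impact depends on how large the outliers in each vector are. Checking output quality on held-out prompts is still needed before adopting any such scheme.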
Scenario
Build a system for a serving framework that dynamically manages KV-cache blocks, allowing for memory sharing between requests and reducing fragmentation.
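The bookkeeping behind such a system can be sketched as a block allocator with reference counts, in the spirit of PagedAttention. The class and method names below (`BlockManager`, `append_token`, `fork`, `free`) are hypothetical and do not correspond to any framework's API; the sketch tracks block ownership only and does not store or copy tensor data.

```python
class BlockManager:
    """Toy KV-cache block allocator with reference counting (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}    # physical block id -> number of requests using it
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def _allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def append_token(self, request_id):
        """Reserve space for one more token of this request."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:
            # Last block is full (or there is none yet): take a new one.
            table.append(self._allocate())
        elif self.ref_counts[table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared,
            # so this request gets a private replacement.
            self.ref_counts[table[-1]] -= 1
            table[-1] = self._allocate()
        self.lengths[request_id] = length + 1

    def fork(self, parent_id, child_id):
        """Let a new request share all of the parent's blocks."""
        table = list(self.block_tables[parent_id])
        for block in table:
            self.ref_counts[block] += 1
        self.block_tables[child_id] = table
        self.lengths[child_id] = self.lengths[parent_id]

    def free(self, request_id):
        """Release a request; blocks return to the pool when unreferenced."""
        for block in self.block_tables.pop(request_id):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                del self.ref_counts[block]
                self.free_blocks.append(block)
        del self.lengths[request_id]


manager = BlockManager(num_blocks=8, block_size=4)
for _ in range(6):
    manager.append_token("a")  # request "a" now occupies two blocks
manager.fork("a", "b")         # "b" shares both blocks with "a"
manager.append_token("b")      # triggers copy-on-write for the last block
print(manager.block_tables, len(manager.free_blocks))
```

Because blocks are fixed-size and allocated on demand, wasted space is limited to the unused tail of each request's last block, and a shared prefix is stored once for all requests that fork from it.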
PyTorch provides the low-level primitives for memory manipulation and profiling. vLLM and Triton Inference Server are production-grade serving stacks: vLLM manages the KV-cache itself, while Triton relies on the backend it hosts for KV-cache optimizations. Hugging Face Transformers is the primary interface for loading and interacting with models.
Nsight Systems and PyTorch Profiler provide kernel-level and operator-level memory timeline views, respectively. `nvidia-smi` is for quick, real-time monitoring of GPU memory usage during experiments.
PagedAttention stores the cache in fixed-size blocks, which removes most memory fragmentation. Sliding window attention caps how many past tokens are cached per layer. Checkpointing (recomputation) trades compute for memory. Quantization reduces the bit-width of cached tensors.
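To make the sliding window idea concrete, the sketch below keeps only the most recent `window` key/value entries using a bounded deque. The class name `SlidingWindowCache` and the string placeholders standing in for tensors are assumptions for illustration.

```python
from collections import deque


class SlidingWindowCache:
    """Keeps key/value entries for only the most recent `window` tokens."""

    def __init__(self, window):
        # A deque with maxlen drops the oldest entry automatically.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowCache(window=4)
for position in range(10):
    # Placeholder strings stand in for per-token key/value tensors.
    cache.append(f"k{position}", f"v{position}")

print(len(cache), list(cache.keys))
```

Cache memory stays constant once the window is full, at the cost of the model no longer attending directly to tokens that have been evicted.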
Answer Strategy
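The candidate must derive the formula: `2 * batch_size * num_layers * num_heads * sequence_length * head_dim * bytes_per_element`, where the leading 2 accounts for storing both keys and values (for grouped-query or multi-query attention, `num_heads` is the number of key/value heads). Peak inference memory is this cache size plus the memory for model parameters and activations; optimizer states exist only during training. The sample answer should explicitly state that the cache grows linearly with sequence length, batch size, and model depth; the quadratic term in sequence length belongs to attention compute, not to the cache.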
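A worked instance of the formula can help check an answer. The sketch below uses a hypothetical model shape (32 layers, 32 heads, head dimension 128, float16) that is not taken from any specific checkpoint, and the function name `kv_cache_bytes` is an assumption.

```python
def kv_cache_bytes(batch_size, num_layers, num_heads, sequence_length,
                   head_dim, bytes_per_element):
    """KV-cache size in bytes; the leading 2 covers keys and values."""
    return (2 * batch_size * num_layers * num_heads
            * sequence_length * head_dim * bytes_per_element)


# Hypothetical shape: 32 layers, 32 heads, head_dim 128, float16 cache.
base = kv_cache_bytes(1, 32, 32, 4096, 128, 2)
double_seq = kv_cache_bytes(1, 32, 32, 8192, 128, 2)
double_batch = kv_cache_bytes(2, 32, 32, 4096, 128, 2)

print(f"batch 1, 4096 tokens: {base / 2**30:.1f} GiB")
print(f"doubling sequence length scales the cache by {double_seq / base:.0f}x")
print(f"doubling batch size scales the cache by {double_batch / base:.0f}x")
```

Under these assumptions each token costs 0.5 MiB of cache, so a single 4096-token sequence needs 2 GiB, and doubling either the sequence length or the batch size doubles that figure.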
Answer Strategy
This question tests systems thinking. Likely bottlenecks are: (1) memory fragmentation preventing large batches, (2) excessive padding in variable-length requests wasting cache space, and (3) inefficient memory transfer between host and device. Diagnosis: use a profiler such as Nsight Systems to look for large gaps in GPU utilization and for memory allocation/deallocation patterns. The sample answer should mention analyzing batch composition and memory fragmentation ratios.
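Analyzing batch composition can be as simple as comparing reserved cache slots with slots actually used. The sketch below computes the wasted fraction for two allocation strategies: padding every request to the longest one in the batch, and allocating fixed-size blocks per request. The request lengths, the block size of 16, and the function names are hypothetical.

```python
import math


def padded_waste(lengths):
    """Wasted fraction when every request is padded to the longest one."""
    reserved = max(lengths) * len(lengths)
    return 1 - sum(lengths) / reserved


def paged_waste(lengths, block_size):
    """Wasted fraction when each request gets whole fixed-size blocks."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return 1 - sum(lengths) / reserved


# Hypothetical token counts for one batch of four requests.
lengths = [12, 40, 500, 64]

print(f"padded to max length: {padded_waste(lengths):.1%} wasted")
print(f"paged, block size 16: {paged_waste(lengths, 16):.1%} wasted")
```

A batch dominated by one long request, as in this example, wastes most of a padded cache, which is the kind of evidence a candidate could cite when arguing that padding rather than raw capacity is the bottleneck.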