AI Runtime Engineer
AI Runtime Engineers are the architects behind reliable, high-performance AI systems in production - owning model deployment, infe…
Skill Guide
Distributed and parallel inference patterns are system-level techniques for partitioning large neural network models across multiple accelerators (GPUs/TPUs). They make it possible to serve models that exceed a single device's memory, and to reduce latency and increase throughput during the forward pass.
Scenario
Serve a 13B-parameter dense model (e.g., a LLaMA variant) on 2x GPUs, each with 24GB VRAM, where the full model does not fit on a single device.
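The memory arithmetic behind this scenario can be sketched in a few lines. This is a rough estimate, not a sizing tool: it assumes FP16/BF16 weights (2 bytes per parameter) and an even split under tensor parallelism, and it ignores the KV cache, activations, and framework overhead. The function names are illustrative.

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GiB (assumption: 2 bytes per parameter, FP16/BF16)."""
    return num_params * bytes_per_param / 2**30


def per_gpu_weight_gib(num_params: float, tp_degree: int,
                       bytes_per_param: int = 2) -> float:
    """Weights held by each GPU if tensor parallelism shards them evenly."""
    return weight_gib(num_params, bytes_per_param) / tp_degree


total = weight_gib(13e9)                       # slightly above 24 GiB
per_gpu = per_gpu_weight_gib(13e9, tp_degree=2)
print(f"total weights: {total:.1f} GiB, per GPU at TP=2: {per_gpu:.1f} GiB")
```

Under these assumptions the weights alone exceed one 24GB card, while a two-way split leaves roughly half of each card for the KV cache and activations.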
Scenario
Serve a 70B model across 4 GPUs (2 nodes) using pipeline parallelism, balancing memory load and minimizing the 'bubble' (idle time) in the pipeline.
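Two quantities drive this scenario: how evenly the layers are split across stages, and how much of the schedule is idle. The sketch below uses the standard idealized estimate for a forward-only pipeline with equal-cost stages, where the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The 80-layer figure is an assumption for a typical 70B model, and the function names are illustrative.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of an idealized pipeline with equal-cost stages."""
    return (stages - 1) / (micro_batches + stages - 1)


def split_layers(num_layers: int, stages: int) -> list[int]:
    """Assign layers to stages as evenly as possible (earlier stages get extras)."""
    base, extra = divmod(num_layers, stages)
    return [base + (1 if i < extra else 0) for i in range(stages)]


print(split_layers(80, 4))      # assumption: an 80-layer model on 4 stages
for m in (1, 4, 16):
    print(f"micro-batches={m:2d}  bubble={bubble_fraction(4, m):.2f}")
```

More micro-batches in flight shrink the bubble, which is why pipeline parallelism pairs naturally with batched or continuously batched serving rather than single-request latency workloads.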
Scenario
Design and deploy a serving architecture for a large sparse MoE model (e.g., Mixtral-8x22B, roughly 141B total parameters) on a cluster of 16x A100 GPUs, optimizing for both cost and latency under a 100ms SLA.
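A central design question in this scenario is where each token's experts live. The toy sketch below shows top-2 routing over 8 experts (the Mixtral-style configuration) plus one hypothetical placement in which each expert is sharded over a contiguous group of GPUs. The placement scheme, function names, and logits are assumptions for illustration, not how any particular framework assigns experts.

```python
import math


def top_k_route(logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts; weights are a softmax over those k."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]   # subtract peak for stability
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]


def expert_to_gpus(expert_id: int, num_experts: int, num_gpus: int) -> list[int]:
    """Hypothetical placement: each expert owns a contiguous group of GPUs."""
    if num_gpus % num_experts != 0:
        raise ValueError("this sketch assumes num_gpus is a multiple of num_experts")
    group = num_gpus // num_experts
    return list(range(expert_id * group, (expert_id + 1) * group))


router_logits = [0.0, 3.0, 1.0, 0.5, -2.0, 2.0, 0.0, 0.1]   # made-up values
for expert, weight in top_k_route(router_logits):
    print(expert, round(weight, 3), expert_to_gpus(expert, 8, 16))
```

Because each token touches only two experts, per-token compute is far below what the total parameter count suggests, but tokens must be shuffled to the GPUs holding their experts, so the all-to-all traffic becomes a key term in the latency budget.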
Use inference serving frameworks for production deployment. vLLM excels at continuous batching and PagedAttention; TensorRT-LLM provides deep NVIDIA kernel optimization and graph fusion; FasterTransformer is an older NVIDIA library for high-performance TP/PP whose development has moved to TensorRT-LLM.
Use profiling tools to identify bottlenecks. Nsight Systems visualizes GPU kernels and NCCL communications on a timeline; PyTorch Profiler gives a CPU/GPU breakdown of operator execution; gathering per-rank timings with a distributed collective helps spot hanging or slow processes.
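Once per-rank step times have been collected on one process, flagging slow ranks is a small comparison against the median. The sketch below works on a plain list; in a real job that list might come from a call such as `torch.distributed.all_gather_object`, which is an assumption about the setup rather than part of this example. The function name and the 1.5x threshold are illustrative.

```python
from statistics import median


def find_stragglers(step_times_s: list[float], tolerance: float = 1.5) -> list[int]:
    """Return ranks whose step time exceeds `tolerance` times the median."""
    cutoff = tolerance * median(step_times_s)
    return [rank for rank, t in enumerate(step_times_s) if t > cutoff]


# Hypothetical timings in seconds, indexed by rank.
gathered = [0.101, 0.099, 0.100, 0.240]
print(find_stragglers(gathered))
```

A rank that hangs outright never contributes a timing, so this kind of check needs to be paired with a timeout on the collective itself.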
NCCL is the standard for GPU collective communications (AllReduce, AllGather). Use `torch.distributed` for Python-level orchestration. Understand MPI concepts for legacy or CPU-based clusters.
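The semantics of the two collectives named above can be shown without any GPU. The sketch below simulates ranks as plain Python lists; it illustrates what each rank holds afterwards, not how NCCL implements the operation (real implementations use ring or tree algorithms over the interconnect). The function names are illustrative.

```python
def all_reduce_sum(per_rank: list[list[float]]) -> list[list[float]]:
    """Every rank ends with the elementwise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank: list[list[float]]) -> list[list[float]]:
    """Every rank ends with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in per_rank for x in buf]
    return [list(gathered) for _ in per_rank]


shards = [[1, 2], [10, 20]]          # rank 0 and rank 1
print(all_reduce_sum(shards))
print(all_gather(shards))
```

In tensor parallelism, AllReduce combines partial outputs of a sharded matrix multiply, while AllGather reassembles a tensor that was split along one dimension.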
Answer Strategy
Structure your answer around three steps: 1) Measurement: profile to separate queueing, prefill, and decode times. 2) Bottleneck analysis: is prefill compute-bound? Is decode communication-bound or memory-bandwidth-bound? Is the KV cache exhausting GPU memory? 3) Specific countermeasures: for example, switch from static to continuous batching, increase the TP degree to reduce per-GPU compute, or manage the KV cache with paged attention.
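The measurement step can be made concrete with four timestamps per request. The sketch below is one possible breakdown, assuming the serving layer can record when a request arrived, when it was scheduled, when the first token was emitted, and when generation finished. The `RequestTrace` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrived: float        # seconds, request entered the queue
    scheduled: float      # request admitted to a batch
    first_token: float    # first output token emitted (end of prefill)
    finished: float       # last output token emitted
    output_tokens: int


def breakdown(t: RequestTrace) -> dict[str, float]:
    """Split end-to-end latency into queueing, prefill, and decode phases."""
    decode_steps = max(t.output_tokens - 1, 1)   # tokens after the first
    decode_s = t.finished - t.first_token
    return {
        "queue_s": t.scheduled - t.arrived,
        "prefill_s": t.first_token - t.scheduled,
        "decode_s": decode_s,
        "ttft_s": t.first_token - t.arrived,      # time to first token
        "tpot_s": decode_s / decode_steps,        # time per output token
    }


print(breakdown(RequestTrace(0.0, 0.5, 0.75, 2.75, 101)))
```

A large `queue_s` points at batching or capacity, a large `prefill_s` at compute, and a large `tpot_s` at decode-side memory bandwidth or communication.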
Answer Strategy
This question tests the candidate's understanding of hardware constraints and performance bottlenecks. The core trade-off is communication volume vs. pipeline bubbles. TP communicates frequently (collectives inside every layer), so it depends on fast interconnects such as NVLink to keep latency low. PP communicates rarely (only at stage boundaries), which tolerates slower links between nodes, but it suffers from pipeline bubbles and load imbalance.
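The difference in communication frequency can be put in numbers. The sketch below counts communication events per forward pass, assuming a Megatron-style tensor-parallel layout with two all-reduces per transformer layer (one after attention, one after the MLP) and one point-to-point transfer per pipeline stage boundary. The 80-layer model and the function names are illustrative assumptions.

```python
def tp_allreduces_per_forward(num_layers: int, per_layer: int = 2) -> int:
    """All-reduces per forward pass (assumption: 2 per layer, Megatron-style)."""
    return num_layers * per_layer


def pp_transfers_per_forward(stages: int) -> int:
    """Point-to-point activation transfers: one per stage boundary."""
    return stages - 1


def activation_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Size of the activation tensor moved by each communication event."""
    return tokens * hidden * bytes_per_elem


layers, stages = 80, 4                      # assumed model depth and PP degree
print("TP all-reduces:", tp_allreduces_per_forward(layers))
print("PP transfers:  ", pp_transfers_per_forward(stages))
print("bytes per event for 1 token, hidden=8192:", activation_bytes(1, 8192))
```

During decode each event moves only a small tensor, so TP cost is dominated by the per-collective latency of the link, which is why it is usually kept within a node while PP spans nodes.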