AI Runtime Engineer
AI Runtime Engineers are the architects behind reliable, high-performance AI systems in production - owning model deployment, infe…
Skill Guide
The process of setting up, tuning, and optimizing large language model serving software (like vLLM, TensorRT-LLM) to maximize throughput, minimize latency, and ensure stable, efficient model execution in production environments.
Scenario
You have a Llama-3 8B model and need to serve it via an API for a prototype chat application with low latency.
Scenario
Your production service must serve a 70B-parameter model on 4x A100 GPUs under a strict SLA (time to first token, TTFT, under 200 ms), and you must reduce cost per token.
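A quick sizing check helps frame this scenario before any tuning. The sketch below estimates how much weight memory lands on each GPU when the model is sharded evenly across devices. It assumes the 80 GB A100 variant (the scenario does not say which), counts weights only, and ignores KV cache, activations, and runtime overhead; the function name is illustrative.

```python
def weight_gb_per_gpu(params_billion: float, bits_per_weight: int, num_gpus: int) -> float:
    """Approximate weight memory per GPU in decimal GB, assuming an even shard."""
    total_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes per param
    return total_gb / num_gpus


GPU_MEMORY_GB = 80  # assumption: 80 GB A100; the 40 GB variant changes the picture

for bits in (16, 8, 4):
    per_gpu = weight_gb_per_gpu(70, bits, num_gpus=4)
    headroom = GPU_MEMORY_GB - per_gpu  # left for KV cache, activations, overhead
    print(f"{bits}-bit weights: {per_gpu:.2f} GB per GPU, {headroom:.2f} GB headroom")
```

The headroom figure matters because KV cache capacity bounds how many requests can be batched together, which in turn drives cost per token.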
Scenario
Your platform must serve a pool of models (code gen, chat, embedding) with varying hardware needs and traffic patterns, requiring dynamic scaling and model switching.
Core execution environments. Use vLLM for rapid prototyping; its PagedAttention manages KV-cache memory efficiently. Use TensorRT-LLM for peak performance on NVIDIA GPUs. Use Triton Inference Server as an orchestrator for multiple frameworks and models. Use ONNX Runtime for CPU/edge deployment or cross-platform model serving.
Critical for identifying bottlenecks (GPU kernel stalls, batch assembly time). Use built-in framework metrics first, then drill down with system-level profilers. Monitor p95 latency, throughput, and GPU memory utilization in production.
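As a rough illustration of the production metrics mentioned above, the following sketch computes a nearest-rank p95 latency and a token throughput figure from raw samples. The sample values and function names are made up for the example; a real deployment would normally read these from the serving framework's metrics endpoint rather than compute them by hand.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_tokens_per_s(total_tokens, window_seconds):
    """Generated tokens per second over a measurement window."""
    return total_tokens / window_seconds


# Hypothetical per-request latencies in milliseconds; note the single slow outlier.
latencies_ms = [120, 95, 110, 480, 105, 98, 130, 101, 99, 115]

print("p50:", percentile(latencies_ms, 50))  # 105
print("p95:", percentile(latencies_ms, 95))  # 480
print("tokens/s:", throughput_tokens_per_s(total_tokens=12_000, window_seconds=60))
```

The gap between p50 and p95 here shows why tail percentiles, not averages, are the usual SLA target: one slow request barely moves the median but dominates the tail.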
Used to prepare models for efficient inference. Quantization reduces memory footprint and can increase throughput. Conversion scripts format models for specific engine consumption (e.g., HF to TRT-LLM).
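To make the memory-versus-precision trade-off concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is meant only to show the idea of mapping floats to a small integer range with a shared scale; it is not how AWQ, FP8, or any particular engine's quantizer works, and the weight values are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)   # [50, -127, 0, 100]
print(max_error)   # bounded by scale / 2
```

Each value now needs one byte instead of two (FP16) or four (FP32), at the cost of a rounding error of at most half the scale. Small weights such as 0.003 collapse to zero, which is one reason accuracy should be re-benchmarked after quantizing.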
Answer Strategy
The interviewer is testing structured problem-solving and deep knowledge of the inference stack. Use a systematic approach: 1) Verify the metrics (is the problem truly inter-token latency, ITL, or overall latency?), 2) Profile the engine (check for inefficient batch composition and small batch sizes), 3) Examine the hardware (is the GPU bound by compute or by memory bandwidth?), 4) Apply targeted fixes (increase `--max_batch_size`, optimize CUDA graphs, adjust `--max_seq_len` to reduce padding).
Sample Answer: 'First, I'd confirm the bottleneck by checking Triton's ITL metrics and GPU profiling. If the GPU is underutilized, that suggests a batching or scheduling issue. I would increase the `--max_batch_size` parameter in the TensorRT-LLM engine build so that more requests can be processed concurrently in each step. If memory bandwidth is the limit, I'd verify that our FP8 quantization is active and consider using `--use_gemm_plugin` to optimize critical kernels. Finally, I'd validate the fix with a load test simulating real traffic.'
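The first step of that strategy, confirming which latency metric is actually degraded, can be sketched with a small helper that separates TTFT from ITL for a single streamed request. The timestamps and the function name are hypothetical; in practice they would come from client-side timing of a streaming response.

```python
def ttft_and_mean_itl(request_sent_s, token_times_s):
    """Return (TTFT, mean ITL) in seconds from one request's token arrival times."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    # ITL is the gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


# Hypothetical timings: request sent at t=0, four tokens streamed back.
ttft, mean_itl = ttft_and_mean_itl(0.0, [0.18, 0.21, 0.24, 0.30])
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {mean_itl * 1000:.0f} ms")
```

A high TTFT with normal ITL usually points at queueing or prefill, while normal TTFT with high ITL points at the decode phase, so the two numbers steer the rest of the investigation in different directions.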
Answer Strategy
This tests business and architectural judgment. The STAR method (Situation, Task, Action, Result) is effective here. Focus on quantifiable outcomes.
Sample Answer: 'In my last role, we served a 34B model via vLLM. At scale, GPU costs were prohibitive. I led the evaluation of 4-bit AWQ quantization. The trade-off was a 1.5% drop on our internal accuracy benchmark versus a 40% reduction in cost per token and a 25% increase in throughput. I built a business case showing that the accuracy drop was within our product's tolerance, while the cost savings allowed us to expand the feature to 10x more users. We deployed the quantized version and used A/B testing to monitor user engagement, which showed no negative impact.'
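When quoting figures like these, it helps to be able to show the arithmetic behind them. The sketch below computes cost per million tokens from GPU price, GPU count, and throughput. Every number in it is hypothetical: the sample answer does not say how its saving was achieved, and the combination shown (25% higher throughput on 25% fewer GPUs) is just one scenario that happens to yield a 40% reduction.

```python
def cost_per_million_tokens(gpu_hourly_cost_usd, num_gpus, tokens_per_second):
    """Serving cost in USD per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical baseline: 4 GPUs at $2.00/hour each, 2000 tokens/s in aggregate.
baseline = cost_per_million_tokens(2.00, num_gpus=4, tokens_per_second=2000)

# Hypothetical quantized deployment: the model fits on 3 GPUs and runs 25% faster.
quantized = cost_per_million_tokens(2.00, num_gpus=3, tokens_per_second=2500)

reduction = 1 - quantized / baseline
print(f"baseline: ${baseline:.3f}, quantized: ${quantized:.3f}, reduction: {reduction:.0%}")
```

Note that a throughput gain alone on unchanged hardware gives a smaller saving (25% more tokens for the same spend is a 20% lower cost per token), so a convincing answer states where the rest of the reduction came from.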