AI Local LLM Engineer
An AI Local LLM Engineer specializes in deploying, optimizing, and maintaining large language models that run entirely on local or…
Skill Guide
Inference engine configuration involves deploying and optimizing large language models (LLMs) for production serving by tuning key parameters such as batching strategies, quantization, memory management, and hardware acceleration across frameworks like vLLM, llama.cpp, TensorRT-LLM, and TGI.
Scenario
You have a consumer-grade NVIDIA GPU (e.g., RTX 3060 with 12GB VRAM) and need to serve a 7B parameter model locally for a demo chatbot.
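A rough sizing check for this scenario, as a minimal sketch: weight memory is approximately the parameter count times bits per weight, divided by 8. The function name and the 1.5 GB runtime allowance are illustrative assumptions, and the estimate ignores the KV cache and quantization metadata such as scales and zero points.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


VRAM_GB = 12.0      # RTX 3060 from the scenario
OVERHEAD_GB = 1.5   # assumed allowance for CUDA context and activations

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    need = weight_memory_gb(7e9, bits)
    fits = need + OVERHEAD_GB <= VRAM_GB
    print(f"{label}: ~{need:.1f} GB weights, fits in {VRAM_GB:.0f} GB: {fits}")
```

Under these assumptions, FP16 weights (about 14 GB) exceed the 12 GB card, while 8-bit (about 7 GB) and 4-bit (about 3.5 GB) leave headroom for the KV cache.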
Scenario
Your team needs to deploy a 13B model on an A10G GPU to handle a sustained load of 50 concurrent requests with a target time to first token of under 2 seconds.
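For a scenario like this, concurrency is usually limited by KV-cache memory rather than by weights alone. The sketch below estimates how many tokens can be cached at once. It assumes a Llama-2-13B-like shape (40 layers, 40 KV heads, head dimension 128), an FP16 KV cache, 24 GB of VRAM on the A10G, 4-bit weights (about 6.5 GB), and a 2 GB reserve; all of these figures and names are assumptions for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, bytes_per_elem: int = 2) -> int:
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem


def max_cached_tokens(vram_gb: float, weights_gb: float,
                      reserve_gb: float, shape: ModelShape) -> int:
    """Tokens that fit in the VRAM left after weights and a fixed reserve."""
    free_bytes = (vram_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token(shape)))


SHAPE_13B = ModelShape(n_layers=40, n_kv_heads=40, head_dim=128)  # assumed shape
per_token = kv_bytes_per_token(SHAPE_13B)
tokens = max_cached_tokens(vram_gb=24.0, weights_gb=6.5, reserve_gb=2.0,
                           shape=SHAPE_13B)
print(f"KV cache per token: {per_token / 1e6:.2f} MB")
print(f"Cacheable tokens: {tokens}, per request at 50 concurrent: {tokens // 50}")
```

Note that FP16 weights for a 13B model (about 26 GB) would not fit in 24 GB at all, so under these assumptions some form of quantization is a precondition, and the per-request token budget then bounds the usable context length at 50 concurrent requests.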
Scenario
You architect a system that serves three different LLMs (a 1B for simple tasks, a 7B for general use, a 70B for complex analysis) behind a single API endpoint, automatically routing requests based on complexity and user tier.
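One way to picture the routing layer is a small function that scores a request and picks a model. This is a minimal sketch under stated assumptions: the model names, the two user tiers, the keyword list, and the word-count thresholds are all hypothetical placeholders, and a production router would more likely use a trained classifier or a small LLM to judge complexity.

```python
from dataclasses import dataclass

# Hypothetical backend names for the three models in the scenario.
MODELS = {"small": "llm-1b", "medium": "llm-7b", "large": "llm-70b"}
ANALYSIS_KEYWORDS = ("analyze", "compare", "prove", "step by step")


@dataclass
class Request:
    prompt: str
    user_tier: str  # assumed tiers: "free" or "premium"


def complexity_score(prompt: str) -> int:
    """Crude heuristic: long prompts and analysis keywords raise the score."""
    text = prompt.lower()
    n_words = len(text.split())
    score = 0
    if n_words > 50:
        score += 1
    if n_words > 300:
        score += 1
    if any(keyword in text for keyword in ANALYSIS_KEYWORDS):
        score += 1
    return score


def route(req: Request) -> str:
    """Return the name of the model that should serve this request."""
    score = complexity_score(req.prompt)
    if score == 0:
        return MODELS["small"]
    if score >= 2 and req.user_tier == "premium":
        return MODELS["large"]  # only premium users reach the 70B model
    return MODELS["medium"]


print(route(Request("What is the capital of France?", "free")))
```

Keeping the scoring function separate from the routing policy lets the tier rules change without touching the complexity heuristic.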
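vLLM (high throughput via PagedAttention and dynamic batching), llama.cpp (CPU and edge devices, GGUF-quantized models), TensorRT-LLM (peak NVIDIA GPU performance, at the cost of a more complex build and optimization workflow), TGI (Hugging Face Text Generation Inference: Hugging Face integration, production-ready defaults). Choose among them based on the target hardware and performance requirements.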
Quantization tools for reducing model size and memory footprint. GPTQ and AWQ target GPU inference, GGUF is the format used by llama.cpp (CPU, with optional GPU offload), and bitsandbytes (BnB) offers easy integration, including during training. Critical for fitting large models onto consumer or cost-effective hardware.
Use Locust to simulate load and measure latency and throughput. Use nvidia-smi and PyTorch Profiler for GPU- and CUDA-level bottleneck analysis. Both are essential for moving from 'it runs' to 'it runs efficiently'.
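Whatever tool generates the load, the raw samples usually need to be reduced to a few headline numbers. The sketch below shows one way to summarize time-to-first-token samples and throughput. The function names and the sample values are made up for illustration, and the nearest-rank percentile is one of several common definitions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(ttft_s: list[float], total_tokens: int, wall_s: float) -> dict:
    """Reduce a load-test run to median/tail latency and token throughput."""
    return {
        "p50_ttft_s": percentile(ttft_s, 50),
        "p95_ttft_s": percentile(ttft_s, 95),
        "tokens_per_s": total_tokens / wall_s,
    }


# Illustrative samples only: four fast requests and one slow outlier.
ttft_samples = [0.2, 0.3, 0.4, 0.5, 3.0]
report = summarize(ttft_samples, total_tokens=1000, wall_s=10.0)
print(report)
```

Reporting a tail percentile alongside the median matters because a single slow request, like the 3.0 s sample here, can break a latency target while leaving the median untouched.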
Answer Strategy
Structure the answer by comparing key decision factors: hardware utilization efficiency, ease of integration, latency vs. throughput focus, and operational complexity. Sample: 'I would start by evaluating the workload profile. For maximum throughput on fixed NVIDIA hardware, TensorRT-LLM would be my first candidate due to its optimized kernels. If the team values ease of use and HuggingFace model compatibility, TGI is a strong contender. vLLM offers an excellent balance with PagedAttention for high throughput and dynamic batching. My final choice would depend on a POC benchmarking each with our specific prompt/completion length distribution.'
Answer Strategy
The interviewer is testing systematic debugging, observability, and practical knowledge. Use the STAR method. Sample: 'In a previous role, our vLLM service saw a 40% latency spike after a model update. Using Prometheus, I observed that GPU utilization was maxed out while time to first token (TTFT) was high. Profiling revealed excessive KV-cache fragmentation caused by a change in our prefix handling, and the root cause was a misconfiguration in the prefix caching settings after an upgrade. We rolled back the config change, and I implemented a canary deployment pipeline with gradual rollout to prevent recurrence.'