AI Code Generation Engineer
An AI Code Generation Engineer designs, builds, and optimizes systems that automatically produce, transform, and evaluate source c…
Skill Guide
The systematic engineering practice of reducing the computational expense and response time of large language models generating code, while maintaining output quality, across high-volume production environments.
Scenario
You need to deploy a 1B-7B parameter code model (e.g., StarCoder, CodeLlama) for a latency-tolerant internal tool running on cost-effective CPU servers.
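A quick sizing check helps decide whether such a model fits in CPU RAM at all. The sketch below estimates memory for the weights only; the function name is hypothetical, and it assumes a flat number of bits per weight. Real 4-bit formats store extra scale data, and the KV cache and activations need memory on top, so treat the result as a lower bound.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Assumption: every weight uses exactly `bits_per_weight` bits.
    Quantization scales, KV cache, and activations are not included.
    """
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    # Compare a 7B-parameter model at common precisions.
    for bits in (16, 8, 4):
        print(f"7B @ {bits}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Under these assumptions, dropping from 16-bit to 4-bit weights cuts the footprint to a quarter, which is what makes commodity CPU servers a realistic target.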
Scenario
Your production API using a large model (70B+) has high per-token decode latency because autoregressive generation emits one token per forward pass. You have access to a smaller, faster draft model (7B).
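The draft-and-verify idea behind speculative decoding can be sketched with plain callables standing in for the two models. This is a toy of the greedy variant only, not the sampling-based acceptance rule used by production servers; `speculative_decode`, `target`, and `draft` are hypothetical names. A real implementation would verify all drafted positions in one batched forward pass of the large model, which is where the speedup comes from.

```python
from typing import Callable, List

Token = int
# A "model" here is any function mapping a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int,
                       k: int = 4) -> List[Token]:
    """Greedy speculative decoding; output matches greedy target decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position in order.
        #    (Done per position here; batched in a real system.)
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # All k drafts accepted: the target contributes one more token.
            accepted.append(target(out + proposal))

        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

Every round emits at least one token chosen by the target, so quality is unchanged; the gain depends entirely on how often the draft agrees with the target.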
Scenario
You are designing the inference backend for a high-traffic IDE extension where query complexity varies widely (simple vs. complex refactoring) and caching identical prompts is common.
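One way to combine the two ideas in this scenario is an exact-match prompt cache in front of a tiered router. The sketch below is a minimal illustration: `LRUCache`, `pick_tier`, `complete`, the "refactor" keyword rule, and the 2000-character threshold are all hypothetical, and the model backends are plain callables. Caching completions this way assumes deterministic decoding or that replaying an earlier answer is acceptable.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Dict, Optional


class LRUCache:
    """Small in-memory LRU cache keyed by a hash of the exact prompt."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        k = self._key(prompt)
        if k not in self._store:
            return None
        self._store.move_to_end(k)  # mark as most recently used
        return self._store[k]

    def put(self, prompt: str, completion: str) -> None:
        k = self._key(prompt)
        self._store[k] = completion
        self._store.move_to_end(k)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used


def pick_tier(prompt: str, long_prompt_chars: int = 2000) -> str:
    """Hypothetical heuristic; a trained classifier could replace it."""
    wants_refactor = "refactor" in prompt.lower()
    if wants_refactor or len(prompt) > long_prompt_chars:
        return "large"
    return "small"


def complete(prompt: str, models: Dict[str, Callable[[str], str]],
             cache: LRUCache) -> str:
    """Serve from cache when possible, else route to a model tier."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    result = models[pick_tier(prompt)](prompt)
    cache.put(prompt, result)
    return result
```

Checking the cache before routing means repeated prompts never reach either model, so the large tier only pays for novel, complex requests.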
Core production servers that implement critical optimizations like PagedAttention, continuous batching, and kernel fusion. Use vLLM or TGI for open-source LLM serving; use TensorRT-LLM for maximum performance on NVIDIA hardware with model-specific compilation.
Tools for model compression and hardware-specific optimization. Use ONNX Runtime for cross-platform quantization and serving. Use llama.cpp (built on GGML) for efficient CPU/edge inference with 4-bit quantization. Use Apache TVM for advanced compiler-level graph optimizations.
Essential for identifying bottlenecks. PyTorch Profiler and NVIDIA Nsight trace GPU kernels and memory operations. Prometheus collects metrics (QPS, latency, GPU utilization), which Grafana visualizes. OpenTelemetry provides distributed tracing across microservices.
For scaling and cost control. Kubernetes orchestrates containers and auto-scales inference pods. Kubecost provides granular cost allocation. Running non-critical, interruptible workloads on Spot/Preemptible instances can reduce compute costs by roughly 60-70%, depending on provider and region.
Answer Strategy
Structure the answer using a systematic profiling framework. Start with observability: use metrics from the serving framework to determine whether the bottleneck is TTFT (time to first token, dominated by compute-bound prefill and queueing) or ITL (inter-token latency, dominated by memory-bandwidth-bound decoding). For high TTFT, investigate queueing and batching inefficiencies and implement continuous batching if it is not already present. For high ITL or memory issues, analyze KV-cache memory usage and consider PagedAttention. Finally, suggest architectural changes such as speculative decoding or tuning the degree of Tensor Parallelism used to shard the model. Sample Answer: 'First, I'd instrument the endpoint with detailed metrics to separate the TTFT and ITL components. If TTFT is the issue, I'd verify batching is optimized; moving from static to continuous batching, as in vLLM, could yield up to a 10x throughput gain and shorten queueing delays. If memory is the limit, I'd profile KV-cache allocation and use PagedAttention to reduce fragmentation. For this scale, I'd also test speculative decoding with a 7B draft model to cut per-token decode latency by roughly 2-3x, and validate the output against our quality benchmarks.'
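Separating TTFT from ITL only requires the request start time and the arrival time of each streamed token. The helper below is a minimal sketch; `latency_breakdown` is a hypothetical name, and it assumes timestamps in seconds from a monotonic clock recorded on the client side.

```python
from statistics import mean
from typing import List, Tuple


def latency_breakdown(request_start: float,
                      token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for one streamed response.

    request_start: time the request was sent.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    # Time to first token covers queueing plus prompt prefill.
    ttft = token_times[0] - request_start
    # Inter-token latency is the gap between consecutive tokens.
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl
```

A high TTFT with a low ITL points at queueing or prefill, while the reverse points at the decode loop, which tells you which of the optimizations above to try first.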
Answer Strategy
Tests business acumen and technical diplomacy. The candidate should frame the discussion around value, not just cost. Use data to propose alternatives. Sample Answer: 'I'd first quantify the business value: what's the expected uplift in user retention or conversion for this feature? Then, I'd analyze the cost structure. Instead of routing all queries to the expensive model, I'd propose a tiered approach: a classifier identifies queries needing the large model, while simpler ones use a cheaper, faster one. I'd also implement aggressive caching for common refactoring patterns. This lets us launch the feature with a controlled cost-per-query, and we can A/B test to measure if the user value justifies the compute expense.'
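The cost argument in the sample answer can be made concrete with a small blended-cost formula. This is a back-of-the-envelope sketch: the function name is hypothetical, the prices in the usage lines are placeholder values rather than measurements, and cache hits are assumed to cost nothing.

```python
def blended_cost_per_query(cost_small: float, cost_large: float,
                           frac_large: float,
                           cache_hit_rate: float) -> float:
    """Expected cost per query for a cached, two-tier deployment.

    cost_small / cost_large: cost of one query on each model tier.
    frac_large:     share of uncached queries routed to the large model.
    cache_hit_rate: share of all queries answered from cache (assumed free).
    """
    if not (0.0 <= frac_large <= 1.0 and 0.0 <= cache_hit_rate <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    miss_rate = 1.0 - cache_hit_rate
    per_miss = frac_large * cost_large + (1.0 - frac_large) * cost_small
    return miss_rate * per_miss


if __name__ == "__main__":
    # Placeholder prices, chosen only to illustrate the comparison.
    all_large = blended_cost_per_query(0.001, 0.02, 1.0, 0.0)
    tiered = blended_cost_per_query(0.001, 0.02, 0.2, 0.3)
    print(f"all-large: {all_large:.5f}  tiered+cache: {tiered:.5f}")
```

Plugging in measured routing fractions and cache hit rates from an A/B test turns the proposal into a number that can be weighed against the feature's business value.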