Interview Prep
AI Inference Optimization Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A strong answer covers forward-pass-only execution, production serving constraints (latency, throughput, cost), and the absence of gradient computation.
Latency is time-per-request; throughput is requests-per-second. Larger batches increase throughput but add per-request latency - a classic tradeoff.
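A toy cost model makes the tradeoff concrete. The sketch below assumes each batch costs a fixed overhead plus a per-request cost; the function name and the 20 ms / 2 ms figures are illustrative assumptions, not measurements.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Return (latency_ms, throughput_rps) for one batch under a linear cost model."""
    # Every request in the batch waits for the whole batch to finish.
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_rps


for b in (1, 8, 32):
    latency, throughput = batch_stats(b)
    print(f"batch={b:>2}  latency={latency:.0f} ms  throughput={throughput:.1f} req/s")
```

Under these assumed costs, both numbers rise with batch size: the fixed overhead is amortized over more requests, but each request waits longer.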
Reducing numerical precision (FP32→INT8/INT4) to shrink model size, reduce memory bandwidth, and speed up computation with acceptable accuracy tradeoffs.
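As a minimal sketch of the idea, the code below applies symmetric per-tensor INT8 quantization to a short list of floats. Production toolkits typically add per-channel scales, zero points, and calibration; the function names and sample weights here are assumptions for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [x * scale for x in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The small weight 0.003 rounds to zero, which shows the accuracy cost: values much smaller than the scale are lost.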
GPUs (NVIDIA A100/H100), CPUs (for smaller models), TPUs, AWS Inferentia, edge accelerators (Jetson, Apple Neural Engine), and FPGAs.
Static batching groups fixed-size requests; dynamic batching groups requests arriving within a time window. Dynamic is more efficient for variable workloads.
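The time-window idea can be sketched as follows. This is a simplified offline model, assuming sorted arrival timestamps and a hypothetical 10 ms window with a maximum batch size of 4; a real server would run this logic against a live queue with timers.

```python
def dynamic_batches(arrival_times_ms, window_ms=10.0, max_batch=4):
    """Group sorted arrival times: a batch closes when its window elapses or it is full."""
    batches, current, window_start = [], [], None
    for t in arrival_times_ms:
        window_over = current and t - window_start > window_ms
        batch_full = len(current) == max_batch
        if window_over or batch_full:
            batches.append(current)
            current = []
        if not current:
            window_start = t  # the first request in a batch opens the window
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 2, 3, 15, 16, 17, 18, 19, 40]
print(dynamic_batches(arrivals))
```

Batch sizes adapt to traffic: bursts fill a batch quickly, while a lone request is dispatched once its window expires instead of waiting for a fixed batch size.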
Intermediate
10 questions
Static quantization uses calibration data to pre-compute scales; dynamic quantization computes scales at runtime; quantization-aware training (QAT) simulates quantization during training for best accuracy.
ONNX is an open model interchange format enabling cross-framework optimization and deployment via ONNX Runtime; strengths include graph optimization passes, but it struggles with dynamic control flow.
It stores key and value tensors from previous tokens to avoid recomputation; for long sequences and large models, KV-cache memory can exceed model weights memory.
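A back-of-the-envelope calculation supports the memory claim. The sketch assumes a hypothetical 7B-class model shape (32 layers, 32 KV heads, head dimension 128) with FP16 cache entries; these figures are illustrative and not taken from any specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Total KV-cache size in bytes for a decoder-only transformer."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=16)
weights = 7_000_000_000 * 2  # assumed 7B parameters stored in FP16
print(f"KV-cache: {cache / 1024**3:.0f} GiB, weights: {weights / 1024**3:.1f} GiB")
```

With these assumed numbers the cache for 16 concurrent 4096-token sequences is more than twice the size of the weights, which is why grouped-query attention and cache quantization are common follow-up topics.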
TensorRT fuses layers, selects optimal kernels per GPU architecture, applies precision calibration, and builds an optimized engine through graph parsing → optimization → engine serialization.
Distillation trains a smaller 'student' model to mimic a larger 'teacher', producing a fundamentally smaller architecture; quantization compresses the same architecture to lower precision.
FP16/BF16 maintain high accuracy with a 2x memory reduction relative to FP32; INT8 offers a 4x reduction with minor quality loss; INT4 maximizes compression (8x) but requires careful calibration and may degrade on edge cases.
Continuous batching allows new requests to join a batch mid-generation as others complete, dramatically improving GPU utilization for variable-length outputs.
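The utilization gain can be illustrated with a small step-count simulation. It assumes one decode step produces one token per running sequence and ignores prefill cost; the function names and the workload are hypothetical.

```python
from collections import deque


def static_steps(output_lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_steps(output_lengths, batch_size):
    """Continuous batching: a finished sequence frees its slot immediately."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step for every running sequence
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 2, 8, 2, 2, 2, 2]
print(static_steps(lengths, 2), continuous_steps(lengths, 2))
```

In this workload short sequences no longer wait for long ones in the same batch, so the same tokens are produced in fewer decode steps.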
Triton handles model versioning, concurrent multi-model serving, dynamic batching, health monitoring, metrics export, and protocol abstraction (gRPC/HTTP).
Use tools like PyTorch Profiler or Nsight to capture per-layer execution time, memory transfers, kernel launches, and GPU idle periods, then identify the critical path.
A smaller 'draft' model proposes tokens that the larger model verifies in parallel, accelerating generation; limitations include complexity, draft model quality, and reduced benefit for short outputs.
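The propose-and-verify loop can be sketched with toy models. This is the greedy variant only, assuming both models are plain functions from a token list to the next token; sampling-based speculative decoding uses a probabilistic acceptance rule instead. All names below are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)
    # 2. The target model checks every position. On real hardware this is a
    #    single batched forward pass; here it is a loop for clarity.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # all k accepted: the target adds a bonus token
    return accepted


def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    # Agrees with the target except after token 3, to force one rejection.
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next))
print(speculative_step([5], draft_next, target_next))
```

The output always matches what the target model would have generated alone; the speedup depends on how often the draft agrees with it.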
Advanced
10 questions
Apply INT4 quantization (GPTQ/AWQ), use tensor parallelism across 4 GPUs, enable PagedAttention for KV-cache, configure continuous batching, and tune chunked prefill.
FlashAttention computes exact attention in tiles that fit in on-chip SRAM, so the full n×n attention matrix is never written to or read from HBM; memory drops from quadratic to linear in sequence length, and the reduced HBM traffic also makes it faster.
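The core trick behind tiling is an online softmax that rescales partial results as new tiles arrive. The sketch below shows it for a single query vector in pure Python and checks it against a naive implementation; it omits the 1/sqrt(d) scaling, batching, and the GPU kernel itself, and the function names are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then apply softmax."""
    scores = [dot(q, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only a running max, sum, and output."""
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial results
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [2.0, 0.0], [-1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

Only one tile of scores exists at a time, which is what lets the real kernel keep its working set in SRAM while still producing exact results.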
Fuse element-wise operations (bias + activation + dropout), combine attention QKV projections, and merge normalization layers - reducing kernel launch overhead and memory round-trips.
Different optimal precision formats, memory hierarchies, operator support gaps, data transfer bottlenecks between devices, and the need for hardware-specific compilation pipelines.
Tensor parallelism splits individual layers across GPUs (low latency, high communication); pipeline parallelism assigns whole layers to GPUs (lower communication, pipeline bubbles for small batches).
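To make "splits individual layers" concrete, the sketch below shards a weight matrix by columns across two simulated devices and concatenates the partial outputs, which stands in for an all-gather. It is a pure-Python illustration with hypothetical names; real implementations use collective communication libraries across GPUs.

```python
def matmul(x, w):
    """Multiply x of shape [m][k] by w of shape [k][n]."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))]
            for i in range(len(x))]


def split_columns(w, parts):
    """Shard the weight matrix by output columns, one shard per device."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts=2):
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]  # each would run on its own GPU
    # Stand-in for all-gather: concatenate the output columns from every device.
    return [[value for partial in partials for value in partial[i]]
            for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w))
```

Every layer needs this gather step, which is the communication cost the hint refers to; pipeline parallelism avoids it by sending activations only at stage boundaries.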
PagedAttention borrows virtual memory paging concepts to store KV-cache in non-contiguous blocks, eliminating memory waste from pre-allocated contiguous buffers and enabling higher batch sizes.
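The paging idea can be sketched as a block allocator with a per-sequence block table. This is a simplified illustration, not vLLM's implementation; the class name, block size, and method names are assumptions.

```python
class PagedKVAllocator:
    """Maps each sequence to a list of fixed-size physical blocks, allocated on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.lengths = {}       # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(4):
    alloc.append_token("a")
for _ in range(3):
    alloc.append_token("b")
alloc.append_token("a")  # sequence "a" grows into a non-adjacent block
print(alloc.block_tables)
```

Memory is claimed one block at a time, so waste is limited to the unused tail of each sequence's last block rather than a buffer sized for the maximum length.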
Implement shadow traffic routing, capture per-request latency/quality/cost metrics, use statistical significance testing, and ensure identical request distributions across variants.
Structured pruning removes entire channels/attention heads and is hardware-friendly; unstructured prunes individual weights but requires sparse computation support that most hardware lacks.
Use optimization profiles with min/opt/max shape ranges, dynamic shape support in ONNX, and padding strategies; profile at the optimal shape and benchmark degradation at boundaries.
Profile each modality's encoder separately, consider modality-specific precision strategies, implement asynchronous pre-processing pipelines, and explore dedicated compute paths per modality.
Scenario-Based
10 questions
Profile current utilization, apply INT8 quantization with quality regression testing, implement continuous batching, explore speculative decoding, evaluate right-sizing GPU instances, and build cost monitoring dashboards.
Identify unsupported operators, evaluate ONNX export feasibility, write custom TensorRT plugins for missing ops, consider alternative serving frameworks, and assess whether architecture modifications could improve compatibility.
Profile to find the bottleneck (likely compute-bound), apply INT4 quantization, consider a distilled smaller model, enable FlashAttention, optimize the tokenizer and pre/post-processing pipeline, and explore speculative decoding.
Use aggressive 4-bit quantization (GGUF/AWQ), apply structured pruning to reduce parameter count, distill to a 1-3B parameter model, optimize for the target NPU/GPU, and profile on-device latency.
Deploy behind a load balancer across multiple GPU nodes, use INT8 quantization, configure vLLM with optimal batch size, implement request queuing and prioritization, use model replicas for fault tolerance, and build autoscaling based on queue depth.
Evaluate per-language calibration data representation, apply mixed-precision quantization (keep sensitive layers at higher precision), augment calibration data with minority language samples, and implement language-detection-based routing to different model variants.
Benchmark current CPU performance, evaluate GPU cost-per-request vs. CPU, convert model to TensorRT or ONNX Runtime GPU execution, handle the data pipeline transition, run parallel serving during migration, and validate quality parity.
Implement request routing by length buckets, use continuous batching with PagedAttention to eliminate padding waste, apply chunked prefill for long sequences, and configure separate optimization profiles per length tier.
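Length-bucket routing can be sketched in a few lines. The tier boundaries below are hypothetical, and the padding comparison assumes every request is padded to its tier's limit.

```python
import bisect

BUCKET_LIMITS = [128, 512, 2048]  # hypothetical tier boundaries, in tokens


def bucket_for(length):
    """Return the smallest tier limit that can hold a request of this length."""
    idx = bisect.bisect_left(BUCKET_LIMITS, length)
    if idx == len(BUCKET_LIMITS):
        raise ValueError(f"request of {length} tokens exceeds the largest tier")
    return BUCKET_LIMITS[idx]


def padding_waste(lengths, pad_to):
    """Total padded tokens when each length is padded to pad_to(length)."""
    return sum(pad_to(n) - n for n in lengths)


lengths = [10, 100, 300, 2000]
single_tier = padding_waste(lengths, lambda n: BUCKET_LIMITS[-1])
bucketed = padding_waste(lengths, bucket_for)
print(single_tier, bucketed)
```

Continuous batching with PagedAttention removes padding altogether; bucketing remains useful where engines are compiled for fixed shape ranges.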
Evaluate on-premise GPU hardware options, build air-gapped serving infrastructure, implement request/response logging with tamper-proof audit trails, optimize for the specific hardware selected, and establish on-premise monitoring and alerting.
Automate the quantization and compilation pipeline end-to-end, implement regression benchmarks that gate deployment, use modular serving configuration, build fallback to FP16 if optimized build fails, and maintain version-specific optimization profiles.
AI Workflow & Tools
10 questions
Launch the inference workload under Nsight capture, analyze the timeline view for GPU utilization gaps, identify CPU-GPU synchronization stalls, examine kernel execution patterns, and iterate on bottlenecks.
Export to ONNX → parse with TensorRT → define optimization profile → create INT8 calibrator with representative dataset → build engine with layer fusion and kernel auto-tuning → validate accuracy against FP32 baseline.
Define benchmark request sets, run latency/throughput/accuracy benchmarks on every model or config change, compare against baselines with statistical thresholds, gate deployments on regression, and visualize trends over time.
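A deployment gate of this kind can be sketched as a comparison against baseline metrics. The version below uses fixed relative tolerances as a simplification of statistical thresholds; the metric names and tolerance values are assumptions.

```python
# metric name -> (which direction is better, allowed relative regression)
THRESHOLDS = {
    "p99_latency_ms": ("lower_is_better", 0.05),
    "throughput_rps": ("higher_is_better", 0.05),
    "accuracy": ("higher_is_better", 0.01),
}


def find_regressions(baseline, candidate, thresholds=THRESHOLDS):
    """Return the metrics where the candidate is worse than baseline beyond tolerance."""
    failures = []
    for metric, (direction, tolerance) in thresholds.items():
        change = (candidate[metric] - baseline[metric]) / baseline[metric]
        regression = change if direction == "lower_is_better" else -change
        if regression > tolerance:
            failures.append(metric)
    return failures


baseline = {"p99_latency_ms": 120.0, "throughput_rps": 400.0, "accuracy": 0.80}
candidate = {"p99_latency_ms": 123.0, "throughput_rps": 410.0, "accuracy": 0.78}
failures = find_regressions(baseline, candidate)
print("BLOCK deploy:" if failures else "OK to deploy", failures)
```

In a CI pipeline the non-empty failure list would fail the job; a fuller version would run repeated trials and test significance rather than compare single numbers.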
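Load the model through the serving engine's own entry point (in vLLM, the LLM class or the server CLI rather than Hugging Face's from_pretrained), configure max_num_seqs, max_num_batched_tokens, and tensor_parallel_size, set the quantization parameter, set gpu_memory_utilization, and monitor throughput with the built-in metrics endpoint.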
Export to ONNX, use onnxruntime.transformers optimizer for graph fusion, enable quantization via onnxruntime.quantization, configure session options for multi-threading, and benchmark with I/O binding for memory efficiency.
Use optimum-cli to export to ONNX, apply quantization with ORTQuantizer, optimize graph with ORTOptimizer, serve with Optimum's inference API or deploy to Triton, and validate with the evaluation harness.
Export metrics from the serving framework (latency histograms, throughput, GPU utilization, batch sizes), pipe to Prometheus/Grafana, set alerts on P99 latency and error rate thresholds, and track per-model-version performance.
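The alerting logic can be sketched independently of any monitoring stack. In practice Prometheus would compute quantiles from histograms and evaluate alert rules; the code below is a plain-Python stand-in with hypothetical limits, using the nearest-rank percentile definition.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division without floats
    return ordered[max(rank, 1) - 1]


def check_alerts(latencies_ms, errors, total, p99_limit_ms=500, error_rate_limit=0.01):
    """Return human-readable alerts for P99 latency and error rate."""
    alerts = []
    p99 = percentile(latencies_ms, 99)
    if p99 > p99_limit_ms:
        alerts.append(f"p99 latency {p99} ms exceeds {p99_limit_ms} ms")
    if total and errors / total > error_rate_limit:
        alerts.append(f"error rate {errors / total:.2%} exceeds {error_rate_limit:.2%}")
    return alerts


latencies = [100] * 98 + [900, 950]
print(check_alerts(latencies, errors=0, total=100))
```

The mean of these samples is about 117 ms, so an average-based alert would stay silent while the slowest requests take around 900 ms, which is why tail percentiles are tracked.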
Initialize DeepSpeed inference engine with the model, specify tensor parallel degree and checkpoint, configure injection policy for the model architecture, and run the inference with automatic tensor-parallel execution.
Compile the model for each target (Neuron SDK for Inferentia, TensorRT for GPU), run identical benchmark request sets, measure latency distributions, throughput, cost-per-request, and power consumption, then compare total cost of ownership.
Load the base model once in GPU memory, maintain adapter weights separately, implement a request-routing layer that applies the correct LoRA at inference time, use vLLM's LoRA support or custom serving logic, and cache frequently-used adapters.
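The adapter-caching piece can be sketched as a small LRU cache in front of a loader. The class and the load function are hypothetical placeholders; a real system would load adapter tensors onto the GPU and account for their memory.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps the most recently used LoRA adapters resident; evicts the least recent."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self._load_fn = load_fn
        self._cache = OrderedDict()

    def get(self, adapter_id):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)  # mark as most recently used
            return self._cache[adapter_id]
        adapter = self._load_fn(adapter_id)      # cache miss: load from storage
        self._cache[adapter_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict the least recently used
        return adapter


loads = []


def fake_load(adapter_id):
    # Placeholder for reading adapter weights; records each load for inspection.
    loads.append(adapter_id)
    return {"id": adapter_id}


cache = AdapterCache(capacity=2, load_fn=fake_load)
for request_adapter in ["a", "b", "a", "c", "b"]:
    cache.get(request_adapter)
print(loads)
```

The routing layer would call `get` with the adapter named in each request, so hot adapters stay loaded while rarely used ones are reloaded on demand.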
Behavioral
5 questions
Show structured thinking about stakeholder requirements, quantitative analysis of the accuracy-latency Pareto frontier, how you communicated tradeoffs, and the outcome of the decision.
Demonstrate the ability to translate technical tradeoffs into business metrics - cost savings, user experience improvements, competitive advantages - using concrete numbers and analogies.
Show initiative, technical depth in profiling or analysis, ability to build a business case, and how you drove the optimization to production impact.
Reference specific papers, conferences, open-source projects, or communities; demonstrate a systematic learning habit and ability to evaluate new techniques critically before adopting them.
Show pragmatic decision-making, awareness of long-term costs, how you documented and tracked technical debt, and whether you returned to address it later.