Question 1

What is the primary difference between latency and throughput in an AI inference context?

Accepted Answer

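Explain that latency is the time to complete a single request, while throughput is the number of requests (or tokens) processed per unit of time. Then explain how the two trade off: for example, batching raises throughput, but each request may have to wait for the batch to fill and shares compute with the others, so its latency can increase.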
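
The batching trade-off can be illustrated with a toy cost model. This is only a sketch: the fixed and per-item costs below are made-up numbers, not measurements from any real system, and `batch_stats` is a hypothetical helper.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Toy cost model: one batch takes fixed_ms + per_item_ms * batch_size.

    fixed_ms and per_item_ms are illustrative assumptions, not measurements.
    """
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_rps


if __name__ == "__main__":
    for b in (1, 8, 32):
        lat, thr = batch_stats(b)
        print(f"batch={b:>2}  latency={lat:.0f} ms  throughput={thr:.1f} req/s")
```

Under these assumptions, both numbers rise with batch size: larger batches amortize the fixed cost (higher throughput) while every request in the batch waits for the whole batch to finish (higher latency).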

Question 2

Explain the concept of post-training quantization (PTQ) and why it's useful for latency optimization.

Accepted Answer

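Explain that PTQ converts an already-trained model to lower precision (e.g., FP32 to INT8) without retraining, typically using a small calibration dataset to choose scaling factors. Discuss how this shrinks the memory footprint and memory traffic and lets the model use low-precision hardware paths, leading to faster computation, and mention the accuracy vs. speed trade-off.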
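
A candidate might sketch the core idea as symmetric per-tensor INT8 quantization. The example below is a minimal illustration in plain Python, assuming one scale per tensor derived from the maximum absolute value; real toolchains use more elaborate calibration, and the helper names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


if __name__ == "__main__":
    weights = [0.52, -1.27, 0.003, 0.9, -0.41]  # illustrative values
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, max_err)
```

The rounding error per value is bounded by half the scale, which is the accuracy cost paid for storing each weight in one byte instead of four.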

Question 3

What is a GPU kernel, and why is its performance critical for deep learning inference?

Accepted Answer

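Define it as a function launched from the host and executed in parallel by many threads on the GPU. Explain that a model's forward pass is a sequence of such kernels, so a single inefficient kernel can become the bottleneck for the entire model, which makes kernel optimization essential.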

Question 4

Name two common metrics you would track to monitor the performance of a deployed LLM service.

Accepted Answer

Expect metrics like Time-to-First-Token (TTFT), Time-Between-Tokens (TBT, also called inter-token latency), P99 end-to-end latency, and tokens per second.
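
These metrics can be derived from token arrival timestamps. The following is a rough sketch, assuming timestamps in seconds are already being recorded per request; the function names and dictionary keys are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math


def token_metrics(request_start, token_times):
    """Per-request streaming metrics from token arrival timestamps (seconds)."""
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens give the inter-token latency.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    e2e = token_times[-1] - request_start
    return {
        "ttft": ttft,
        "mean_tbt": mean_tbt,
        "e2e": e2e,
        "tokens_per_s": len(token_times) / e2e,
    }


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, `percentile(e2e_latencies, 99)` over a window of requests gives the P99 end-to-end latency, where `e2e_latencies` is a list of the `e2e` values collected above.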

Question 5

What is the role of a model zoo or model repository in a production inference system?

Accepted Answer

Describe it as a central, version-controlled storage for pre-optimized model formats (like ONNX or TensorRT engines) to ensure consistency and enable rapid deployment.

Question 6

Explain the concept of KV-cache in autoregressive LLMs and its impact on latency for long sequences.

Accepted Answer

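Describe how caching the key and value tensors of previous tokens avoids recomputing them at every decoding step, so each new token only attends over cached state. Then explain that the cache grows linearly with sequence length and batch size, so managing its memory becomes a critical challenge for long sequences, leading to techniques like PagedAttention.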
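
The memory pressure is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes a standard multi-head attention layout with one key and one value tensor per layer; the model dimensions in the example are illustrative, not tied to a specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    """Approximate KV-cache size in bytes (bytes_per_elem=2 assumes FP16/BF16)."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Illustrative dimensions: 32 layers, 32 KV heads, head_dim 128.
    size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
    print(f"{size / 1024**3:.1f} GiB per sequence")
```

With these assumed dimensions a single 4096-token sequence needs 2 GiB of cache, and the figure scales linearly with both sequence length and batch size.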

Question 7

What is speculative decoding, and under what conditions does it improve latency?

Accepted Answer

Explain the use of a smaller, faster 'draft' model to propose several tokens that the large 'target' model then verifies in a single parallel pass. It is beneficial when the draft model's acceptance rate is high, the draft model is much cheaper than the target, and the target model's latency dominates.
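
The conditions can be made concrete with a simple estimate. This sketch assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, and that one draft-model call costs `draft_cost_ratio` times one target-model call; all three are simplifying assumptions, and real acceptance rates vary by position and prompt.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target-model pass with k drafted tokens.

    Assumes i.i.d. acceptance with probability alpha; the target model always
    contributes one token (a correction or a bonus token) per pass.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio):
    """Tokens per step divided by the relative cost of one step."""
    step_cost = k * draft_cost_ratio + 1  # k draft calls plus one target call
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    print(estimated_speedup(alpha=0.8, k=4, draft_cost_ratio=0.1))
    print(estimated_speedup(alpha=0.3, k=4, draft_cost_ratio=0.3))
```

Under this model, a high acceptance rate with a cheap draft gives an estimated speedup above 1, while a low acceptance rate with an expensive draft falls below 1, meaning speculative decoding would slow generation down.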

Question 8

Compare Tensor Parallelism (TP) and Pipeline Parallelism (PP) for model serving. When would you choose one over the other?

Accepted Answer

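Explain that TP splits individual matrix multiplications across GPUs (good for reducing per-layer latency), while PP splits the model's layers into sequential stages on different GPUs (good for throughput and for fitting large models). TP requires a fast interconnect because GPUs exchange partial results within every layer; PP communicates only at stage boundaries but can introduce pipeline bubbles. A reasonable rule of thumb is TP within a node that has a fast interconnect when latency matters, and PP across slower links.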
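
The TP idea can be shown in miniature with a column-wise split of a weight matrix. This is a plain-Python sketch in which each "shard" stands in for one GPU; it assumes the column count divides evenly by the number of shards, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(w, parts):
    """Split weight matrix w column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Compute x @ w by multiplying against each shard, then gathering."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # All-gather step: concatenate each row's slices in shard order.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

Each shard does a fraction of the multiply-adds, which is where the latency reduction comes from, but the gather step at the end is the per-layer communication that makes interconnect speed matter.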

Question 9

How does the choice of activation function (e.g., GELU, SiLU) affect inference latency on modern GPUs?

Accepted Answer

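Discuss how activations are elementwise operations that are usually fused into neighboring kernels (for example, the preceding matrix multiplication) to avoid extra round trips to GPU memory, so fusion support in the serving stack often matters more than the function itself. Also note that some activations are cheaper to evaluate or have fast approximations (e.g., the tanh approximation of GELU).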
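
For reference, the functions in question are small elementwise formulas. The sketch below writes them out in plain Python to show the arithmetic involved; it says nothing about GPU timing, which depends on how a given stack fuses and implements them.

```python
import math


def gelu_exact(x):
    """GELU defined via the Gaussian CDF (uses erf)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


if __name__ == "__main__":
    xs = [i / 10 for i in range(-50, 51)]
    max_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
    print(f"max |exact - tanh approx| on [-5, 5]: {max_diff:.6f}")
```

The approximation stays close to the exact form over this range, which is why it is often an acceptable substitute when it is cheaper on the target hardware.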

Question 10

Walk me through the steps you would take to diagnose why a newly deployed model has unexpectedly high P99 latency.

Accepted Answer

Expect a systematic approach: check system-level metrics (CPU, GPU, memory, network), then profile the inference stack for bottlenecks (data loading, pre-processing, kernel execution), and finally examine request patterns (e.g., load spikes, long prompts).
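
One way to make the profiling step concrete is to compare per-stage time in the slowest requests against the overall average. The sketch below assumes per-request stage timings are already being logged as dictionaries; the stage names and the `tail_breakdown` helper are hypothetical.

```python
def tail_breakdown(requests, tail_fraction=0.01):
    """Return {stage: (mean over all requests, mean over the slowest requests)}.

    `requests` is a list of dicts mapping stage name to time in milliseconds.
    """
    ordered = sorted(requests, key=lambda r: sum(r.values()), reverse=True)
    n_tail = max(1, int(len(ordered) * tail_fraction))
    tail = ordered[:n_tail]
    report = {}
    for stage in requests[0]:
        overall = sum(r[stage] for r in requests) / len(requests)
        in_tail = sum(r[stage] for r in tail) / len(tail)
        report[stage] = (overall, in_tail)
    return report


if __name__ == "__main__":
    # Illustrative data: 99 typical requests and one slow outlier.
    logs = [{"queue": 1, "preprocess": 2, "model": 10} for _ in range(99)]
    logs.append({"queue": 50, "preprocess": 2, "model": 11})
    for stage, (overall, in_tail) in tail_breakdown(logs).items():
        print(f"{stage:<10} overall={overall:.2f} ms  tail={in_tail:.2f} ms")
```

A stage whose tail mean is far above its overall mean (queueing, in this made-up data) is the first place to look, and it points toward load spikes rather than kernel execution.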

AI Latency Optimization Engineer Interview Questions
