AI Runtime Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A great answer covers inference vs. training compute profiles, latency requirements, the need for production-grade reliability, and why serving is a distinct engineering discipline.
Cover latency requirements, cost implications, use case examples (recommendation pre-computation vs. chatbot responses), and how serving architecture differs for each.
Discuss environment reproducibility, dependency isolation (CUDA version conflicts), consistent deployment across dev/staging/prod, and container orchestration enablement.
Cover latency percentiles (P50/P95/P99), throughput (requests per second), error rates, GPU utilization and memory, model quality metrics, and cost per inference.
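To make the latency and throughput metrics above concrete, here is a minimal sketch that summarizes one measurement window. The function names and the nearest-rank percentile method are illustrative assumptions; monitoring systems such as Prometheus typically estimate percentiles from histograms instead.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms, window_seconds):
    """Return P50/P95/P99 latency and throughput for one measurement window."""
    ordered = sorted(latencies_ms)
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "requests_per_second": len(ordered) / window_seconds,
    }
```

A useful talking point: P99 is driven by the slowest 1% of requests, so it can degrade sharply while P50 stays flat.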
Explain parallel compute architecture, matrix multiplication advantages, memory bandwidth, and note that not all inference requires GPUs (small models, low-throughput use cases).
Intermediate
10 questions
Cover load balancing, statelessness requirements for horizontal scaling, GPU memory limits for vertical scaling, cost curves, and diminishing returns.
Discuss running two identical environments, traffic switching via load balancer or service mesh, health checks, rollback procedures, and model warm-up considerations.
Cover INT8/INT4/FP8 data types, memory reduction, throughput improvement, calibration processes, perplexity or accuracy benchmarks, and tools like GPTQ/AWQ.
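The memory-reduction idea can be illustrated with a small sketch of symmetric per-tensor INT8 quantization using plain round-to-nearest. This is deliberately simpler than GPTQ or AWQ, which use calibration data to choose better quantized values; the function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each weight is stored in one byte instead of two (FP16) or four (FP32), and the round-trip error per weight is at most about half the scale, which is why accuracy or perplexity benchmarks are needed to confirm the loss is acceptable.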
Discuss artifact storage, metadata tracking (metrics, training data hash), promotion from staging to production, rollback capabilities, and tools like MLflow or W&B.
Cover Protocol Buffers serialization efficiency, bidirectional streaming, lower latency for high-throughput internal services, versus REST's simplicity and ecosystem for external APIs.
Discuss HPA with custom metrics (GPU utilization, queue depth, request latency), scale-from-zero challenges, cold start latency, warm pool strategies, and cluster autoscaler integration.
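The scaling decision behind HPA follows the rule documented by Kubernetes: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that core rule; the function name and the replica bounds are illustrative, and the real controller also applies stabilization windows and scaling policies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% GPU utilization against a 60% target would scale to 6 under this rule.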
Explain batching as a classic trade-off (increases throughput but adds latency), GPU utilization efficiency, and how request queuing strategies balance both.
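A toy cost model helps quantify the trade-off. It assumes each batch costs a fixed overhead plus a per-item cost; both numbers below are made-up placeholders, not measurements, and queue wait time is ignored.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Return (batch latency in ms, throughput in requests/second).

    Assumed cost model: latency = fixed_ms + per_item_ms * batch_size.
    """
    batch_latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / (batch_latency_ms / 1000)
    return batch_latency_ms, throughput_rps
```

Under these assumptions, going from batch size 1 to 32 raises per-batch latency from 22 ms to 84 ms while throughput grows from roughly 45 to roughly 381 requests per second, because the fixed overhead is amortized over more requests.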
Cover Pydantic models for Python, JSON Schema validation, input sanitization, type checking, graceful error responses, and guarding against adversarial or malformed inputs.
Discuss model pre-loading, warm instances/pools, lazy vs. eager initialization, smaller model variants for fast path, serverless provisioned concurrency, and snapshot/restore techniques.
Cover statistical distribution shifts in input data, performance degradation over time, monitoring with tools like Evidently or Whylogs, alerting thresholds, and retraining triggers.
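One common way to quantify a distribution shift is the Population Stability Index (PSI). The sketch below is a minimal version for a single numeric feature; the bin edges and the `eps` floor for empty bins are assumptions chosen for illustration, and tools like Evidently or Whylogs provide their own drift metrics.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""

    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A widely quoted rule of thumb treats PSI above about 0.25 as a significant shift, but the alerting threshold should be tuned per feature.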
Advanced
10 questions
Discuss tensor parallelism splitting individual layers across GPUs (low latency, high communication), pipeline parallelism splitting model stages (better for very large models), and hybrid approaches.
Explain iteration-level scheduling, dynamic request insertion/removal during generation, reduced GPU idle time, PagedAttention's role in memory management, and throughput gains.
Cover model repository structure, concurrent model execution, backend abstraction (TensorRT, PyTorch, ONNX), dynamic batching, ensemble models, and model analyzer.
Discuss key-value pair storage for attention computation, memory growth with sequence length, PagedAttention (vLLM), prefix caching, KV cache quantization, and token dropping strategies.
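The memory growth can be estimated with simple arithmetic: each token stores one key and one value vector per layer and per KV head. The sketch below assumes a standard decoder-only transformer; the example model shape is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_value=2):
    """Estimate KV cache size; bytes_per_value=2 corresponds to FP16/BF16."""
    # Factor of 2: one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, FP16.
GIB = 1024 ** 3
full_heads_gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1) / GIB
```

For this hypothetical shape a single 4096-token sequence needs 2 GiB of cache, which is why grouped-query attention (fewer KV heads), KV cache quantization, and paging matter at high concurrency.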
Cover draft model generating candidate tokens, verification by target model in parallel, speedup conditions (draft model accuracy), rejection sampling, and when it fails to help.
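The propose-then-verify loop can be sketched with toy models. Note the simplification: this version uses greedy verification (accept a draft token only if it equals the target's greedy choice) rather than the rejection-sampling scheme used for sampled decoding, and the "models" are hypothetical functions from a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One speculative decoding step with greedy verification."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each position. A real system scores all
    #    k positions in a single batched forward pass.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(seq):
    """Hypothetical target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Hypothetical draft model: agrees with the target except after a 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

The output always matches what the target alone would generate greedily; the speedup depends on how many draft tokens are accepted per step, so a draft model that disagrees often adds overhead without benefit.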
Discuss edge deployment vs. central inference, model artifact replication, region-aware load balancing, eventual consistency for model updates, data residency constraints, and failover design.
Explain IO-aware exact attention computation, tiling strategy to avoid materializing full attention matrix, memory reduction from O(n²) to O(n), and integration with inference frameworks.
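The tiling idea rests on an online softmax: keys and values are processed block by block while a running maximum, running normalizer, and running output are rescaled as needed. The pure-Python sketch below shows that accumulation for a single query vector and checks against a naive reference; it illustrates the math only, not the fused GPU kernel, and the function names are illustrative.

```python
import math


def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Reference: materializes the full row of scores before the softmax."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_v = values[start:start + block_size]
        scores = _scores(q, keys[start:start + block_size])
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 at first).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, block_v))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because the result is exact rather than approximate, the technique changes memory traffic and footprint without changing model outputs beyond floating-point rounding.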
Cover priority queues, weighted fair queuing, preemption policies, separate resource pools, rate limiting per tier, and monitoring SLA compliance per customer.
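A minimal sketch of the priority-queue piece, using Python's `heapq`. The class name and the tier numbering (lower number means higher priority) are illustrative assumptions.

```python
import heapq
import itertools


class TieredQueue:
    """Strict-priority request queue; FIFO within the same tier."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay in FIFO order.
        self._counter = itertools.count()

    def push(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Strict priority can starve lower tiers under sustained premium load, which is the motivation for weighted fair queuing or separate resource pools.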
Discuss NVIDIA Nsight Systems for timeline analysis, Nsight Compute for kernel-level profiling, identifying memory vs. compute bottlenecks, occupancy analysis, and kernel fusion opportunities.
Cover graph optimization levels, operator support breadth, build/compilation time, latency benchmarks, hardware portability, framework lock-in, and developer experience.
Scenario-Based
10 questions
Cover checking monitoring dashboards (GPU utilization, memory, queue depth), recent deployments or config changes, infrastructure issues (network, node health), model-level issues (degenerate inputs), and the rollback decision.
Discuss immediate rollback procedure, investigating distribution mismatch between test and production data, implementing shadow scoring for future deployments, root cause analysis, and expanding evaluation datasets.
Cover utilization analysis (idle GPUs, underutilized instances), right-sizing, spot/preemptible instances, request batching optimization, model quantization for smaller instances, autoscaling tuning, and reserved capacity planning.
Discuss comparing staging vs. production load profiles, checking for resource contention (noisy neighbors), analyzing request distribution (long-tail inputs), GPU thermal throttling, network latency, and implementing synthetic load testing.
Cover quantization (4-bit AWQ or GPTQ), model parallelism across cost-effective GPUs, aggressive batching, prompt caching, considering distilled smaller models, spot instances, and latency vs. cost trade-off analysis.
Discuss CPU bottlenecks (preprocessing/tokenization), GIL contention in Python, data loading bottlenecks, kernel launch overhead, insufficient batching, model serialization overhead, and profiling to find the true bottleneck.
Cover asynchronous logging decoupled from the inference path, structured log formats, cost-effective cold storage (S3 Glacier), PII detection and redaction, query access patterns, and retention policy automation.
Discuss model weight snapshots and memory-mapped loading, pre-provisioned warm pools, model serialization formats (safetensors, TorchScript), smaller draft models, and moving away from serverless to always-on with autoscaling.
Cover pausing or slowing the rollout, quantifying the drift magnitude, alerting ML engineering team, implementing input validation guards, logging anomalous inputs for investigation, and establishing drift monitoring as a deployment gate.
Discuss assessing both stacks' strengths, defining a migration timeline, maintaining service continuity during transition, creating abstraction layers (API gateway), standardizing on observability, and knowledge transfer planning.
AI Workflow & Tools
10 questions
Cover Prometheus adapter for custom metrics, NVIDIA DCGM exporter for GPU metrics, HPA YAML configuration with custom metrics API, scale-up/down policies, and cooldown periods.
Discuss model export (ONNX or TorchScript), Triton model repository structure, config.pbtxt configuration, dynamic batching setup, instance count and GPU allocation, and performance testing with Perf Analyzer.
Cover custom application metrics (tokens/second, time-to-first-token, queue depth), GPU metrics via DCGM exporter, Prometheus scrape configuration, Grafana dashboard panels, alerting rules, and SLO definitions.
Discuss Ray Serve deployment graphs, fractional GPU allocation, dynamic request routing between models, autoscaling replicas based on queue depth, and Ray cluster configuration on Kubernetes.
Cover triggering on model registry events, automated unit/integration tests, model quality evaluation (accuracy, latency benchmarks), container image building and scanning, staged deployment (dev → staging → prod), and approval gates.
Discuss capturing a profiling trace, analyzing GPU kernel timelines, identifying gaps (CPU-bound operations), memory transfer bottlenecks, kernel overlap opportunities, and iterative optimization workflow.
Cover traffic splitting at the load balancer or feature flag level, ensuring consistent user-to-variant assignment, defining evaluation metrics, minimum sample size calculation, and monitoring for regressions during the experiment.
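Consistent user-to-variant assignment is often done by hashing a stable identifier rather than storing assignments. The sketch below is one such approach; the function name, the `experiment` salt, and the variant labels are illustrative assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'."""
    # Salting with the experiment name keeps assignments independent
    # across experiments for the same user.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

The same user always lands in the same variant without any shared state, so every replica behind the load balancer makes the same decision.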
Cover VPC and subnet configuration, EC2 GPU instance launch templates with Deep Learning AMIs, security groups for inference endpoints, EKS cluster provisioning, and integration with NVIDIA device plugin.
Discuss OpenTelemetry SDK instrumentation in each service, trace context propagation, span attributes for model metadata, Jaeger or Tempo as the trace backend, and correlating traces with GPU profiling data.
Cover vLLM launch command with tensor-parallel-size, GPU memory utilization settings, max-model-len configuration, quantization flags, benchmarking with the built-in benchmark tool, and tuning batch size and max-num-seqs.
Behavioral
5 questions
A strong answer demonstrates structured thinking under pressure, balancing quick mitigation with root cause investigation, clear communication with stakeholders, and postmortem-driven improvement.
Look for evidence of data-driven argumentation, proposing alternative solutions, understanding business context, and maintaining relationships while upholding engineering standards.
Strong candidates discuss structured learning habits, evaluating maturity through benchmarks and community traction, prototyping before committing, and balancing innovation with operational stability.
Look for proactive monitoring insights, pattern recognition across incidents, building automation or safeguards, and convincing the team to invest in preventive work.
Strong answers cover runbooks for common incidents, architecture decision records, onboarding playbooks, regular knowledge-sharing sessions, and making infrastructure understandable to non-infrastructure engineers.