
Interview Prep

AI Runtime Engineer Interview Questions

50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.

Beginner: 5 | Intermediate: 10 | Advanced: 10 | Scenario-Based: 10 | AI Workflow & Tools: 10 | Behavioral: 5

Beginner

5 questions
What a great answer covers:

Cover how inference and training compute profiles differ, latency requirements, the need for production-grade reliability, and why serving is a distinct engineering discipline.

What a great answer covers:

Cover latency requirements, cost implications, use case examples (recommendation pre-computation vs. chatbot responses), and how serving architecture differs for each.

What a great answer covers:

Discuss environment reproducibility, dependency isolation (CUDA version conflicts), consistent deployment across dev/staging/prod, and container orchestration enablement.

What a great answer covers:

Cover latency percentiles (P50/P95/P99), throughput (requests per second), error rates, GPU utilization and memory, model quality metrics, and cost per inference.
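
To make the percentile and throughput metrics concrete, here is a minimal sketch using a nearest-rank percentile. The function names and the sample data are hypothetical, and production systems usually rely on histogram-based estimates from a metrics backend rather than sorting raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, window_seconds):
    """Summarize one observation window of request latencies."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_rps": len(latencies_ms) / window_seconds,
    }


# Hypothetical window: 100 requests in 10 seconds with latencies of 1..100 ms.
latencies = list(range(1, 101))
stats = summarize(latencies, window_seconds=10)
print(stats)  # p50=50, p95=95, p99=99, throughput=10.0 rps
```

The gap between P50 and P99 is the part worth discussing in an interview: averages hide the tail that users actually notice.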

What a great answer covers:

Explain parallel compute architecture, matrix multiplication advantages, memory bandwidth, and note that not all inference requires GPUs (small models, low-throughput use cases).

Intermediate

10 questions
What a great answer covers:

Cover load balancing, statelessness requirements for horizontal scaling, GPU memory limits for vertical scaling, cost curves, and diminishing returns.

What a great answer covers:

Discuss running two identical environments, traffic switching via load balancer or service mesh, health checks, rollback procedures, and model warm-up considerations.

What a great answer covers:

Cover INT8/INT4/FP8 data types, memory reduction, throughput improvement, calibration processes, perplexity or accuracy benchmarks, and tools like GPTQ/AWQ.
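
As an illustration of the memory-versus-accuracy trade, the sketch below applies symmetric per-tensor INT8 quantization with a max-abs scale. This is the simplest possible scheme and is not how GPTQ or AWQ work internally; those tools use calibration data to choose scales more carefully. All names and values here are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weight values.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each INT8 code takes one byte instead of the four bytes of an FP32 weight, which is where the memory reduction comes from; the rounding error is what accuracy or perplexity benchmarks then need to measure.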

What a great answer covers:

Discuss artifact storage, metadata tracking (metrics, training data hash), promotion from staging to production, rollback capabilities, and tools like MLflow or W&B.

What a great answer covers:

Cover Protocol Buffers serialization efficiency, bidirectional streaming, lower latency for high-throughput internal services, versus REST's simplicity and ecosystem for external APIs.

What a great answer covers:

Discuss HPA with custom metrics (GPU utilization, queue depth, request latency), scale-from-zero challenges, cold start latency, warm pool strategies, and cluster autoscaler integration.
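
The scaling decision itself follows the formula in the Kubernetes documentation: desired replicas equal the ceiling of current replicas times the ratio of the current metric to the target metric, with a tolerance band that suppresses small changes. The sketch below reproduces only that arithmetic; the metric (queue depth per replica), the bounds, and the function name are assumptions, and the real controller also applies stabilization windows and scaling policies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical target: 10 queued requests per replica.
print(desired_replicas(4, 15, 10))    # 6: load is 1.5x the target
print(desired_replicas(4, 30, 10))    # 10: 12 wanted, capped at max_replicas
print(desired_replicas(4, 10.5, 10))  # 4: inside the tolerance band
```

Note that this formula never reaches zero on its own, which is one reason scale-from-zero needs a separate mechanism and why cold start latency matters.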

What a great answer covers:

Explain batching as a classic trade-off (increases throughput but adds latency), GPU utilization efficiency, and how request queuing strategies balance both.
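
A toy cost model can show the shape of this trade-off. It assumes each batch pays a fixed overhead plus a per-item cost; both constants are made up for illustration, and the model ignores the time requests spend waiting in the queue for a batch to fill.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Return (latency in ms, throughput in requests/s) for one batch size."""
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / (latency_ms / 1000)
    return latency_ms, throughput_rps


results = {size: batch_stats(size) for size in (1, 8, 32)}
for size, (latency_ms, throughput_rps) in results.items():
    print(f"batch={size:>2}  latency={latency_ms:.0f} ms  throughput={throughput_rps:.0f} req/s")
```

Under these assumptions every request in a larger batch waits longer, but the fixed overhead is amortized over more requests, so throughput rises with diminishing returns.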

What a great answer covers:

Cover Pydantic models for Python, JSON Schema validation, input sanitization, type checking, graceful error responses, and guarding against adversarial or malformed inputs.

What a great answer covers:

Discuss model pre-loading, warm instances/pools, lazy vs. eager initialization, smaller model variants for fast path, serverless provisioned concurrency, and snapshot/restore techniques.

What a great answer covers:

Cover statistical distribution shifts in input data, performance degradation over time, monitoring with tools like Evidently or Whylogs, alerting thresholds, and retraining triggers.
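
One common way to quantify a distribution shift is the Population Stability Index (PSI), which compares binned proportions of a reference sample and a live sample. The sketch below is a plain-Python version for a single numeric feature; it is not the API of Evidently or Whylogs, and the bin count, the epsilon floor, and the sample data are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range land in the first or last bin.
            idx = max(0, min(bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: a reference window and a shifted live window.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in reference]

print(psi(reference, reference))  # 0.0 for identical samples
print(psi(reference, shifted))    # clearly positive for the shifted sample
```

An alerting threshold on a score like this is what turns monitoring into a retraining trigger; the right threshold depends on the feature and should be tuned rather than copied.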

Advanced

10 questions
What a great answer covers:

Discuss tensor parallelism splitting individual layers across GPUs (low latency, high communication), pipeline parallelism splitting model stages (better for very large models), and hybrid approaches.
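
The idea behind tensor parallelism can be sketched with a column-split matrix multiply: each shard holds a slice of the weight columns, computes its partial output independently, and the slices are concatenated afterward. The pure-Python version below simulates the shards in one process with hypothetical helper names; a real deployment would perform the gather step with a collective communication library across GPUs.

```python
def matmul(x, w):
    """Multiply x ([rows][k]) by w ([k][cols]) using nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, shards):
    """Give each shard an equal, contiguous slice of the weight columns."""
    cols = len(w[0])
    if cols % shards != 0:
        raise ValueError("column count must be divisible by the shard count")
    step = cols // shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(shards)]


def column_parallel_matmul(x, w, shards):
    """Compute x @ w as independent per-shard matmuls plus a concatenation."""
    partials = [matmul(x, shard) for shard in split_columns(w, shards)]
    # Stand-in for an all-gather: stitch the column slices back together.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, shards=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

The concatenation step is where the communication cost lives: it happens inside every sharded layer, which is why tensor parallelism favors fast interconnects.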

What a great answer covers:

Explain iteration-level scheduling, dynamic request insertion/removal during generation, reduced GPU idle time, PagedAttention's role in memory management, and throughput gains.
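
A small simulation can show why iteration-level scheduling reduces idle time. It counts decode steps for a static batcher, which holds every slot until the longest sequence in the batch finishes, and for a continuous batcher, which refills free slots after each step. The request lengths and function names are hypothetical, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Admit waiting requests into free slots before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration produces one token per request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical output lengths (tokens) for four requests, two slots.
lengths = [2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

In the static case a short request leaves its slot idle while its batch partner finishes; the continuous scheduler hands that slot to the next waiting request immediately.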

What a great answer covers:

Cover model repository structure, concurrent model execution, backend abstraction (TensorRT, PyTorch, ONNX), dynamic batching, ensemble models, and model analyzer.

What a great answer covers:

Discuss key-value pair storage for attention computation, memory growth with sequence length, PagedAttention (vLLM), prefix caching, KV cache quantization, and token dropping strategies.
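
The memory growth can be estimated directly: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below computes that size and the number of fixed-size blocks a paged allocator would need. The model dimensions and the block size of 16 tokens are illustrative assumptions, not the configuration of any specific model or engine.

```python
import math


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed for keys and values (hence the factor of 2)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def blocks_needed(seq_len, block_size=16):
    """Fixed-size cache blocks required to hold seq_len tokens."""
    return math.ceil(seq_len / block_size)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(total / 2**30)       # 2.0 GiB for one 4096-token sequence
print(blocks_needed(100))  # 7 blocks of 16 tokens
```

The size is linear in sequence length and batch size, so long contexts and high concurrency compete for the same memory. Halving `bytes_per_elem` models KV cache quantization, and allocating in blocks avoids reserving the maximum length up front.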

What a great answer covers:

Cover draft model generating candidate tokens, verification by target model in parallel, speedup conditions (draft model accuracy), rejection sampling, and when it fails to help.
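
The speedup conditions can be reasoned about with a simplified model that assumes each draft token is accepted independently with the same probability. Under that assumption, the expected number of tokens produced per target-model pass is (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k. The sketch below also divides by a relative cost term to estimate speedup; `draft_cost`, the draft model's per-token cost as a fraction of the target model's, is a hypothetical parameter, and real acceptance rates vary by token.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model verification pass."""
    a = acceptance_rate
    if a == 1.0:
        return draft_len + 1  # every draft token accepted, plus one bonus token
    return (1 - a ** (draft_len + 1)) / (1 - a)


def estimated_speedup(acceptance_rate, draft_len, draft_cost):
    """Tokens per step divided by the step's cost in target-pass units."""
    step_cost = draft_len * draft_cost + 1
    return expected_tokens_per_step(acceptance_rate, draft_len) / step_cost


# A cheap, accurate draft model helps; an inaccurate, costly one hurts.
print(estimated_speedup(0.8, draft_len=4, draft_cost=0.05))
print(estimated_speedup(0.3, draft_len=4, draft_cost=0.2))
```

The second call returns a value below 1, which is the case where speculative decoding fails to help: the draft tokens are rejected too often to repay the cost of generating them.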

What a great answer covers:

Discuss edge deployment vs. central inference, model artifact replication, region-aware load balancing, eventual consistency for model updates, data residency constraints, and failover design.

What a great answer covers:

Explain IO-aware exact attention computation, tiling strategy to avoid materializing full attention matrix, memory reduction from O(n²) to O(n), and integration with inference frameworks.
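
The tiling strategy rests on an online softmax: keys and values are processed one tile at a time while a running maximum, a running denominator, and a running weighted sum are rescaled as new tiles arrive. The sketch below shows this for a single query vector in plain Python and checks it against a naive version that computes all scores first. It illustrates only the numerical trick, not the GPU kernel or its memory-hierarchy handling, and the function names are hypothetical.

```python
import math


def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference version: compute every score, then one softmax."""
    scores = [score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        v_tile = values[start:start + tile]
        scores = [score(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.5, 0.5], [2.0, 0.3], [-1.0, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.3, 0.9]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

Because the result is rescaled rather than approximated, the tiled output matches the naive one up to floating-point rounding, which is what "exact attention" refers to.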

What a great answer covers:

Cover priority queues, weighted fair queuing, preemption policies, separate resource pools, rate limiting per tier, and monitoring SLA compliance per customer.
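
A strict priority queue is the simplest of these mechanisms and makes a useful starting point. The sketch below orders requests by tier and keeps first-in, first-out order within a tier by adding an insertion counter. The tier names and the class are hypothetical.

```python
import heapq
import itertools


class TieredQueue:
    """Strict priority by tier, FIFO within a tier."""

    PRIORITY = {"premium": 0, "standard": 1, "free": 2}  # hypothetical tiers

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, tier, request):
        heapq.heappush(self._heap, (self.PRIORITY[tier], next(self._counter), request))

    def pop(self):
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)


queue = TieredQueue()
queue.push("free", "a")
queue.push("premium", "b")
queue.push("standard", "c")
queue.push("premium", "d")

order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['b', 'd', 'c', 'a']
```

Strict priority can starve the lowest tier under sustained premium load, which is the motivation for weighted fair queuing, per-tier rate limits, or separate resource pools.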

What a great answer covers:

Discuss NVIDIA Nsight Systems for timeline analysis, Nsight Compute for kernel-level profiling, identifying memory vs. compute bottlenecks, occupancy analysis, and kernel fusion opportunities.

What a great answer covers:

Cover graph optimization levels, operator support breadth, build/compilation time, latency benchmarks, hardware portability, framework lock-in, and developer experience.

Scenario-Based

10 questions
What a great answer covers:

Cover checking monitoring dashboards (GPU utilization, memory, queue depth), recent deployments or config changes, infrastructure issues (network, node health), model-level issues (degenerate inputs), and rollback decision.

What a great answer covers:

Discuss immediate rollback procedure, investigating distribution mismatch between test and production data, implementing shadow scoring for future deployments, root cause analysis, and expanding evaluation datasets.

What a great answer covers:

Cover utilization analysis (idle GPUs, underutilized instances), right-sizing, spot/preemptible instances, request batching optimization, model quantization for smaller instances, autoscaling tuning, and reserved capacity planning.

What a great answer covers:

Discuss comparing staging vs. production load profiles, checking for resource contention (noisy neighbors), analyzing request distribution (long-tail inputs), GPU thermal throttling, network latency, and implementing synthetic load testing.

What a great answer covers:

Cover quantization (4-bit AWQ or GPTQ), model parallelism across cost-effective GPUs, aggressive batching, prompt caching, considering distilled smaller models, spot instances, and latency vs. cost trade-off analysis.

What a great answer covers:

Discuss CPU bottlenecks (preprocessing/tokenization), GIL contention in Python, data loading bottlenecks, kernel launch overhead, insufficient batching, model serialization overhead, and profiling to find the true bottleneck.

What a great answer covers:

Cover asynchronous logging decoupled from the inference path, structured log formats, cost-effective cold storage (S3 Glacier), PII detection and redaction, query access patterns, and retention policy automation.

What a great answer covers:

Discuss model weight snapshots and memory-mapped loading, pre-provisioned warm pools, model serialization formats (safetensors, TorchScript), smaller draft models, and moving away from serverless to always-on with autoscaling.

What a great answer covers:

Cover pausing or slowing the rollout, quantifying the drift magnitude, alerting ML engineering team, implementing input validation guards, logging anomalous inputs for investigation, and establishing drift monitoring as a deployment gate.

What a great answer covers:

Discuss assessing both stacks' strengths, defining a migration timeline, maintaining service continuity during transition, creating abstraction layers (API gateway), standardizing on observability, and knowledge transfer planning.

AI Workflow & Tools

10 questions
What a great answer covers:

Cover Prometheus adapter for custom metrics, NVIDIA DCGM exporter for GPU metrics, HPA YAML configuration with custom metrics API, scale-up/down policies, and cooldown periods.

What a great answer covers:

Discuss model export (ONNX or TorchScript), Triton model repository structure, config.pbtxt configuration, dynamic batching setup, instance count and GPU allocation, and performance testing with Perf Analyzer.

What a great answer covers:

Cover custom application metrics (tokens/second, time-to-first-token, queue depth), GPU metrics via DCGM exporter, Prometheus scrape configuration, Grafana dashboard panels, alerting rules, and SLO definitions.

What a great answer covers:

Discuss Ray Serve deployment graphs, fractional GPU allocation, dynamic request routing between models, autoscaling replicas based on queue depth, and Ray cluster configuration on Kubernetes.

What a great answer covers:

Cover triggering on model registry events, automated unit/integration tests, model quality evaluation (accuracy, latency benchmarks), container image building and scanning, staged deployment (dev → staging → prod), and approval gates.

What a great answer covers:

Discuss capturing a profiling trace, analyzing GPU kernel timelines, identifying gaps (CPU-bound operations), memory transfer bottlenecks, kernel overlap opportunities, and iterative optimization workflow.

What a great answer covers:

Cover traffic splitting at the load balancer or feature flag level, ensuring consistent user-to-variant assignment, defining evaluation metrics, minimum sample size calculation, and monitoring for regressions during the experiment.
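
Consistent user-to-variant assignment is often done by hashing a stable identifier, so the same user always lands in the same arm without any stored state. The sketch below is one way to do it; the experiment name, the 10% treatment share, and the function name are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.1):
    """Deterministically map a user to 'treatment' or 'control'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical experiment comparing a new model version against the current one.
variants = [assign_variant(f"user-{i}", "model-v2-rollout") for i in range(10_000)]
share = variants.count("treatment") / len(variants)
print(f"treatment share: {share:.3f}")
```

Including the experiment name in the hash key keeps assignments independent across experiments, so a user in the treatment arm of one test is not systematically in the treatment arm of the next.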

What a great answer covers:

Cover VPC and subnet configuration, EC2 GPU instance launch templates with Deep Learning AMIs, security groups for inference endpoints, EKS cluster provisioning, and integration with NVIDIA device plugin.

What a great answer covers:

Discuss OpenTelemetry SDK instrumentation in each service, trace context propagation, span attributes for model metadata, Jaeger or Tempo as the trace backend, and correlating traces with GPU profiling data.

What a great answer covers:

Cover vLLM launch command with tensor-parallel-size, GPU memory utilization settings, max-model-len configuration, quantization flags, benchmarking with the built-in benchmark tool, and tuning batch size and max-num-seqs.

Behavioral

5 questions
What a great answer covers:

A strong answer demonstrates structured thinking under pressure, balancing quick mitigation with root cause investigation, clear communication with stakeholders, and postmortem-driven improvement.

What a great answer covers:

Look for evidence of data-driven argumentation, proposing alternative solutions, understanding business context, and maintaining relationships while upholding engineering standards.

What a great answer covers:

Strong candidates discuss structured learning habits, evaluating maturity through benchmarks and community traction, prototyping before committing, and balancing innovation with operational stability.

What a great answer covers:

Look for proactive monitoring insights, pattern recognition across incidents, building automation or safeguards, and convincing the team to invest in preventive work.

What a great answer covers:

Strong answers cover runbooks for common incidents, architecture decision records, onboarding playbooks, regular knowledge-sharing sessions, and making infrastructure understandable to non-infrastructure engineers.