AI Infrastructure Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A great answer covers parallelism, tensor cores, memory bandwidth, and the embarrassingly parallel nature of matrix operations in neural networks.
Cover environment reproducibility, dependency isolation, CUDA library management, and sharing consistent environments across dev/training/serving.
Discuss orchestration, auto-scaling, self-healing, resource scheduling (especially GPUs), and managing heterogeneous workloads.
Cover reproducibility, version control of infrastructure, drift detection, and mention Terraform, Pulumi, or CloudFormation with a concrete example.
Contrast latency requirements, cost models, scaling patterns, and give examples like nightly batch scoring vs. a live chatbot endpoint.
Intermediate
10 questions
Cover node selection (InfiniBand topology), NCCL configuration, checkpoint storage, fault tolerance, gang scheduling, and tools like Slurm or KubeRay.
Discuss data validation, model performance gates, shadow deployments, canary releases, rollback strategies, and tools like GitHub Actions with MLflow or ZenML.
Mention NVIDIA device plugin, nvidia.com/gpu resource type, time-slicing vs. MIG, and how the scheduler places pods on GPU nodes.
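To make the nvidia.com/gpu resource type in the hint above concrete, here is a minimal sketch that builds a Pod manifest as a plain Python dictionary. The pod name and image are hypothetical placeholders, and the sketch assumes the NVIDIA device plugin is already running on the cluster so that the resource is advertised by GPU nodes.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod spec that requests whole GPUs as an extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs are requested under limits; Kubernetes uses the
                    # limit as the request value for extended resources.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("train-job", "example.com/trainer:latest", gpus=2)
print(json.dumps(manifest, indent=2))
```

The scheduler will only place this pod on a node that has at least two unallocated nvidia.com/gpu units. Time-slicing and MIG change what one unit means, not how the pod asks for it.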
Cover dynamic batching, model format support, multi-framework serving, tensor parallelism, quantization support, and operational complexity tradeoffs.
Discuss GPU compute utilization, memory utilization, SM occupancy, PCIe/NVLink bandwidth, and how to use DCGM, Prometheus, and Grafana for observability.
Contrast DDP vs. FSDP vs. tensor/pipeline parallelism; mention PyTorch, DeepSpeed, Megatron-LM, and when model size exceeds single-GPU memory.
Cover spot/reserved/on-demand mix, auto-scaling policies, workload scheduling (train during off-peak), right-sizing instances, and using managed services strategically.
Explain offline/online feature serving, consistency between training and inference, point-in-time correctness, and mention Feast or Tecton.
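Point-in-time correctness from the hint above can be illustrated with a small as-of lookup: each training row may only see the feature value that was known at or before its event time. This is a standard-library sketch and not the API of Feast or Tecton. The feature history and timestamps are made-up illustration data.

```python
from bisect import bisect_right


def feature_as_of(history, event_time):
    """Return the latest value with timestamp <= event_time, or None.

    history: list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history: (timestamp, 7-day purchase count).
purchases_7d = [(100, 1), (200, 3), (300, 8)]

# Joining labels at these event times never leaks a future feature value.
training_rows = [(t, feature_as_of(purchases_7d, t)) for t in (50, 200, 250, 999)]
print(training_rows)
```

A naive join on the latest value would give every row the value 8, which leaks information the model could not have had at serving time.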
Cover model versioning, lineage (data, code, hyperparameters), performance metrics, approval workflows, and deployment stage transitions.
Discuss tensor parallelism, pipeline parallelism, model sharding, quantization (GPTQ, AWQ), KV cache management, and tools like vLLM or TGI.
Advanced
10 questions
Cover namespace isolation, resource quotas, priority classes, queue-based scheduling (e.g., Kueue), network policies, cost attribution, and self-service abstractions like custom CRDs.
Cover NCCL debug environment variables, network health checks, InfiniBand diagnostics, RDMA issues, checkpoint resume strategies, and proactive watchdog patterns.
Discuss virtual memory-inspired KV cache management, reduced memory fragmentation, continuous batching, and tuning parameters like max_num_seqs, gpu_memory_utilization, and swap space.
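The fragmentation argument in the hint above can be checked with rough arithmetic. The sketch below compares reserving a contiguous maximum-length KV cache per sequence against allocating fixed-size blocks on demand. The model dimensions, the block size of 16 tokens, and the 2048-token maximum are illustrative assumptions, not values read from any particular model or from a vLLM configuration.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # The factor of 2 covers the key and the value tensor at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def paged_tokens_reserved(seq_lens, block_size):
    # Each sequence holds whole blocks, so waste is under one block per sequence.
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def contiguous_tokens_reserved(seq_lens, max_len):
    # Every sequence reserves room for the longest allowed sequence up front.
    return len(seq_lens) * max_len


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(32, 32, 128)
seq_lens = [100, 700, 30, 1500]  # current lengths of four in-flight requests

paged = paged_tokens_reserved(seq_lens, block_size=16)
contiguous = contiguous_tokens_reserved(seq_lens, max_len=2048)

print(f"KV cache per token: {per_token / 2**10:.0f} KiB")
print(f"paged reservation:      {paged * per_token / 2**20:.0f} MiB")
print(f"contiguous reservation: {contiguous * per_token / 2**20:.0f} MiB")
```

The memory freed by block-level allocation is what lets a continuous-batching scheduler admit more concurrent sequences, which is where parameters such as max_num_seqs and gpu_memory_utilization come into play.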
Cover traffic splitting at the load balancer or service mesh level, shadow traffic, statistical quality monitoring (not just latency), automated rollback triggers, and progressive rollout.
Combine infrastructure metrics (Prometheus/Grafana) with ML metrics (Evidently AI, Arize), statistical tests (PSI, KS), automated alerting thresholds, and feedback loop integration.
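As one example of the statistical tests named above, here is a minimal sketch of the Population Stability Index (PSI) in pure Python. The bin edges, the sample data, and the 0.2 alert threshold are assumptions for illustration. The threshold is a common rule of thumb and should be tuned per feature. The sketch also assumes every value falls within the given edges.

```python
import math


def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    # eps keeps empty bins from producing log(0) or division by zero.
    return [max(c / total, eps) for c in counts]


def psi(expected, actual, edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_proportions(expected, edges)
    a = bin_proportions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]      # training-time sample
shifted = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95]      # production sample

PSI_ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold
score = psi(baseline, shifted, edges)
print(f"PSI = {score:.3f}, alert = {score > PSI_ALERT_THRESHOLD}")
```

In practice the score would be exported as a metric so that the same alerting stack used for infrastructure metrics can page on drift.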
Cover quantization (INT4/AWQ), tensor parallelism across 2-4 GPUs, vLLM with continuous batching, auto-scaling based on queue depth, load testing with realistic traffic, and CDN caching for common prompts.
Discuss model pre-loading, warm pools, model weight caching with NFS/shared memory, snapshot-based loading (CUDA graphs, torch.compile artifacts), and predictive pre-scaling.
Cover hardware-level partitioning vs. software-level time-sharing, isolation guarantees, latency predictability, use cases (multi-tenant serving vs. best-effort batch), and configuration tradeoffs.
Discuss DVC or LakeFS for data versioning, deterministic feature pipelines, metadata stores (e.g., OpenLineage), integration with experiment tracking, and provenance for auditing.
Cover CUDA/ROCm compatibility issues, cost and performance-per-watt advantages, performance benchmarks for inference-heavy vs. training-heavy workloads, ecosystem maturity, and container image portability.
Scenario-Based
10 questions
Check data pipeline integrity (shuffling, label alignment), verify GPU numerical stability, inspect gradient norms, validate data versioning, and implement automated quality gates in the pipeline.
Cover load testing, horizontal auto-scaling configuration, model optimization (quantization), caching strategies, graceful degradation (model cascading), pre-provisioning capacity, and runbook creation.
Audit utilization metrics (many GPUs may be idle), identify oversized instances, implement spot instances for training, add auto-scaling to reduce idle serving capacity, consolidate workloads, and set up cost allocation tagging.
Set up distributed training with FSDP or DeepSpeed ZeRO Stage 3, configure multi-node communication, implement data parallelism with gradient accumulation, and provide a self-service interface for launching jobs.
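A back-of-the-envelope calculation helps justify the choice of ZeRO Stage 3 and gradient accumulation in the hint above. The sketch assumes mixed-precision Adam, where each parameter costs about 16 bytes of model state (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state), and it ignores activations, buffers, and fragmentation. The 7-billion-parameter model and the 8-GPU cluster are hypothetical.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # fp16 weights + fp16 grads + fp32 Adam state


def model_state_gb(num_params: int, world_size: int, shard: bool = True) -> float:
    """Approximate per-GPU model-state memory in GB (10**9 bytes)."""
    total_bytes = num_params * BYTES_PER_PARAM
    if shard:
        # Stage 3 partitions weights, gradients, and optimizer state across ranks.
        total_bytes /= world_size
    return total_bytes / 1e9


def effective_batch(micro_batch: int, accum_steps: int, world_size: int) -> int:
    """Global batch size seen by each optimizer step."""
    return micro_batch * accum_steps * world_size


params = 7_000_000_000  # hypothetical 7B-parameter model
gpus = 8

print(f"unsharded per GPU: {model_state_gb(params, gpus, shard=False):.0f} GB")
print(f"sharded per GPU:   {model_state_gb(params, gpus):.0f} GB")
print(f"effective batch:   {effective_batch(4, 16, gpus)}")
```

Under these assumptions the unsharded state does not fit on a single 80 GB GPU, while the sharded state leaves headroom for activations. Gradient accumulation then recovers a large global batch from a small per-GPU micro-batch.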
Check model input distribution changes (longer prompts), GPU thermal throttling, memory fragmentation, batch size regression, KV cache eviction patterns, and infrastructure changes like network or storage latency.
Cover VPC isolation, encryption at rest and in transit, access logging, least-privilege IAM, dedicated GPU nodes, data residency requirements, BAA with cloud providers, and audit trails for model access.
Implement resource quotas and priority classes in Kubernetes, use gang scheduling for training, preemption policies for inference, separate node pools with taints/tolerations, and cost chargeback.
Cover CUDA compute capability differences, FP8 support, memory bandwidth improvements, cost modeling, container image rebuilds, network topology changes, and re-benchmarking model performance and throughput.
Check load balancer timeout settings, connection pooling limits, request queue overflow, auto-scaler lag (scale-up delay), and implement circuit breakers, request batching, and better backpressure mechanisms.
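The circuit breaker mentioned above can be sketched in a few lines. This is a simplified illustration, not a production library: the threshold, the cool-down period, and the class and method names are all assumptions, and it handles neither concurrency nor per-endpoint state.

```python
import time


class CircuitBreaker:
    """Stop calling a failing backend until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def call(self, fn, *args, **kwargs):
        if not self.allow():
            # Fail fast instead of queueing more work on a struggling backend.
            raise RuntimeError("circuit open: request rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

Wrapping each call to the model endpoint in breaker.call(...) turns slow timeouts into fast rejections while the auto-scaler catches up, which is one simple form of backpressure.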
Cover vector database selection and sharding, embedding pipeline architecture, index freshness strategy, hybrid search, latency budgeting for retrieval + generation, caching, and monitoring retrieval quality metrics.
AI Workflow & Tools
10 questions
Cover DAG definition, component reuse, parameterization, artifact passing between steps, retry/failure policies, and integration with the model registry for deployment gating.
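The ideas in the hint above (DAG definition, artifact passing, retries, and a registry gate) can be shown with a toy runner built on the standard library. This is not the API of any pipeline framework. The step names, the 0.9 accuracy gate, and the hard-coded metric are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(steps, deps, max_retries=1):
    """Run steps in dependency order, passing artifacts downstream.

    steps: name -> callable(artifacts) returning that step's artifact
    deps:  name -> set of upstream step names
    """
    artifacts = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                artifacts[name] = steps[name](artifacts)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retry budget exhausted: fail the whole run
    return artifacts


# Hypothetical four-step training pipeline.
deps = {
    "ingest": set(),
    "train": {"ingest"},
    "evaluate": {"train"},
    "register": {"evaluate"},
}
steps = {
    "ingest": lambda a: [1, 2, 3, 4],
    "train": lambda a: sum(a["ingest"]) / len(a["ingest"]),  # toy "model"
    "evaluate": lambda a: {"accuracy": 0.93},                # made-up metric
    # Deployment gate: only register the model if it clears the bar.
    "register": lambda a: a["evaluate"]["accuracy"] >= 0.9,
}

result = run_pipeline(steps, deps)
print(result)
```

Real orchestrators add caching, containerized steps, and persistent artifact stores, but the dependency-ordered execution and the gate before registration follow the same shape.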
Cover experiment logging (params, metrics, artifacts), model signatures, registry with stage transitions, transition triggers via CI/CD, and deployment integration with serving infrastructure.
Discuss W&B agent configuration in pods, logging GPU metrics alongside training metrics, sweep configuration, custom panels for multi-node communication stats, and cost-per-epoch tracking.
Cover Ray Serve deployments with autoscaling configs, model multiplexing, dynamic request routing, deployment groups for latency tiers, and integration with Kubernetes for resource management.
Cover model repository structure, ensemble model configuration, batching strategies per model, shared memory for inter-model data transfer, and performance profiling with Perf Analyzer.
Cover drift detection (Evidently, Arize), trigger mechanisms, automated data validation gates, retraining orchestration, A/B testing the new model, and rollback if performance degrades.
Cover modular Terraform design (VPC, EKS, node groups with GPU AMI, IAM roles, S3 buckets, CloudWatch dashboards), state management, and environment promotion (dev/staging/prod).
Cover DVC remote configuration, .dvc tracking files, data pipeline definitions (dvc.yaml), integration with Git branching strategies, and CI steps that validate data versions before training.
Cover Helm chart structure, values files for environment overrides, dependency management (subcharts for MLflow, Feast, Argo), resource limits, and secrets management with external secrets operators.
Cover self-hosted GPU runners, Docker build with CUDA base images, integration tests with model validation, registry push, Kubernetes manifest apply (or Argo CD sync), and approval gates for production.
Behavioral
5 questions
Look for ability to translate technical tradeoffs into business impact, use analogies, create simple visuals, and demonstrate empathy for stakeholder concerns around cost, risk, or timeline.
Assess incident triage skills, communication during high-pressure situations, root cause analysis rigor, and whether they drove systemic improvements (not just a quick fix).
Look for structured prioritization frameworks, stakeholder communication skills, ability to negotiate scope, and evidence of balancing urgency vs. strategic impact.
Assess ability to disagree constructively, use data to support positions, listen to opposing views, and reach consensus or escalate appropriately while maintaining team relationships.
Look for flexibility, proactive communication about scope/cost/timeline impacts, modular design thinking, and evidence of maintaining code/infrastructure quality despite shifting requirements.