Interview Prep

AI Infrastructure Engineer Interview Questions

50 expert questions, from beginner fundamentals to advanced AI workflow scenarios. Each question comes with a hint on what a well-structured answer should cover.

Beginner: 5, Intermediate: 10, Advanced: 10, Scenario-Based: 10, AI Workflow & Tools: 10, Behavioral: 5

Beginner

5 questions
What a great answer covers:

Cover massive parallelism, tensor cores, high memory bandwidth, and the highly parallel nature of the matrix operations in neural networks.

What a great answer covers:

Cover environment reproducibility, dependency isolation, CUDA library management, and sharing consistent environments across dev/training/serving.

What a great answer covers:

Discuss orchestration, auto-scaling, self-healing, resource scheduling (especially GPUs), and managing heterogeneous workloads.

What a great answer covers:

Cover reproducibility, version control of infrastructure, drift detection, and mention Terraform, Pulumi, or CloudFormation with a concrete example.

What a great answer covers:

Contrast latency requirements, cost models, scaling patterns, and give examples like nightly batch scoring vs. a live chatbot endpoint.

Intermediate

10 questions
What a great answer covers:

Cover node selection (InfiniBand topology), NCCL configuration, checkpoint storage, fault tolerance, gang scheduling, and tools like Slurm or KubeRay.

What a great answer covers:

Discuss data validation, model performance gates, shadow deployments, canary releases, rollback strategies, and tools like GitHub Actions with MLflow or ZenML.

What a great answer covers:

Mention the NVIDIA device plugin, the nvidia.com/gpu extended resource, time-slicing vs. MIG, and how the scheduler places pods on GPU nodes.
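
To make the resource request concrete, here is a minimal sketch that builds a Pod manifest as a Python dictionary. The pod name and image are hypothetical placeholders. It assumes the NVIDIA device plugin is already running on the cluster, since that is what advertises nvidia.com/gpu on each node.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests whole GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs are requested under limits; the scheduler only
                    # places the pod on a node advertising enough of them.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = gpu_pod_manifest("cuda-smoke-test", "example.com/cuda-test:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with kubectl, which accepts JSON manifests as well as YAML.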

What a great answer covers:

Cover dynamic batching, model format support, multi-framework serving, tensor parallelism, quantization support, and operational complexity tradeoffs.
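
Dynamic batching can be sketched in a few lines. The class below is an illustrative toy, not the implementation used by any particular serving framework. The names DynamicBatcher, max_batch_size, and max_wait_s are assumptions chosen for this example. Timestamps are passed in explicitly so the behavior is deterministic.

```python
from typing import Any, List, Optional


class DynamicBatcher:
    """Groups requests until a size limit or a wait deadline is reached."""

    def __init__(self, max_batch_size: int = 4, max_wait_s: float = 0.010):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._pending: List[Any] = []
        self._oldest_ts: Optional[float] = None

    def add(self, request: Any, now: float) -> Optional[List[Any]]:
        """Queue a request; return a full batch if the size limit is hit."""
        if not self._pending:
            self._oldest_ts = now
        self._pending.append(request)
        if len(self._pending) >= self.max_batch_size:
            return self._flush()
        return None

    def poll(self, now: float) -> Optional[List[Any]]:
        """Return a partial batch once the oldest request has waited too long."""
        if self._pending and now - self._oldest_ts >= self.max_wait_s:
            return self._flush()
        return None

    def _flush(self) -> List[Any]:
        batch, self._pending, self._oldest_ts = self._pending, [], None
        return batch
```

The tradeoff to discuss is visible in the two parameters: a larger batch size raises throughput, while a shorter wait bounds the latency added to the first request in each batch.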

What a great answer covers:

Discuss GPU compute utilization, memory utilization, SM occupancy, PCIe/NVLink bandwidth, and how to use DCGM, Prometheus, and Grafana for observability.

What a great answer covers:

Contrast DDP vs. FSDP vs. tensor/pipeline parallelism; mention PyTorch, DeepSpeed, Megatron-LM, and when model size exceeds single-GPU memory.
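
A back-of-the-envelope estimate helps show when a model stops fitting on one GPU. The sketch below assumes mixed-precision training with Adam at roughly 16 bytes of state per parameter, and it ignores activations, buffers, and fragmentation, so real usage will be higher. The function name and the 7-billion-parameter example are illustrative assumptions.

```python
def training_state_gib(num_params: float, num_gpus: int, fully_sharded: bool) -> float:
    """Approximate per-GPU memory for weights, gradients, and optimizer state."""
    # Mixed-precision Adam: 2 bytes fp16 weights + 2 bytes fp16 gradients
    # + 12 bytes fp32 master weights, momentum, and variance.
    bytes_per_param = 16
    total_bytes = num_params * bytes_per_param
    # DDP replicates all state on every GPU; full sharding (FSDP or
    # ZeRO Stage 3) splits it evenly across GPUs.
    per_gpu_bytes = total_bytes / num_gpus if fully_sharded else total_bytes
    return per_gpu_bytes / 2**30


ddp_gib = training_state_gib(7e9, num_gpus=8, fully_sharded=False)
fsdp_gib = training_state_gib(7e9, num_gpus=8, fully_sharded=True)
print(f"DDP: {ddp_gib:.1f} GiB per GPU, fully sharded: {fsdp_gib:.1f} GiB per GPU")
```

Under these assumptions the replicated state alone exceeds the memory of an 80 GB GPU, which is the point at which sharding or model parallelism becomes necessary.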

What a great answer covers:

Cover spot/reserved/on-demand mix, auto-scaling policies, workload scheduling (train during off-peak), right-sizing instances, and using managed services strategically.

What a great answer covers:

Explain offline/online feature serving, consistency between training and inference, point-in-time correctness, and mention Feast or Tecton.
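
Point-in-time correctness means a training row may only see feature values that were known at the time of its event. A minimal sketch of that lookup, independent of any feature store, assuming the feature history is a list of (timestamp, value) pairs sorted by timestamp:

```python
from bisect import bisect_right


def point_in_time_value(feature_history, event_ts):
    """Return the latest feature value with timestamp <= event_ts, or None."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_ts)
    if idx == 0:
        # No value existed yet at event time; never fall forward to a later one.
        return None
    return feature_history[idx - 1][1]


# Hypothetical history of one feature for one entity.
history = [(10, 1.0), (20, 2.0), (30, 3.0)]
print(point_in_time_value(history, 25))
```

Using the value at timestamp 30 for an event at timestamp 25 would leak future information into training, which is the failure this lookup avoids.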

What a great answer covers:

Cover model versioning, lineage (data, code, hyperparameters), performance metrics, approval workflows, and deployment stage transitions.

What a great answer covers:

Discuss tensor parallelism, pipeline parallelism, model sharding, quantization (GPTQ, AWQ), KV cache management, and tools like vLLM or TGI.

Advanced

10 questions
What a great answer covers:

Cover namespace isolation, resource quotas, priority classes, queue-based scheduling (e.g., Kueue), network policies, cost attribution, and self-service abstractions like custom CRDs.

What a great answer covers:

Cover NCCL debug environment variables, network health checks, InfiniBand diagnostics, RDMA issues, checkpoint resume strategies, and proactive watchdog patterns.

What a great answer covers:

Discuss virtual memory-inspired KV cache management, reduced memory fragmentation, continuous batching, and tuning parameters like max_num_seqs, gpu_memory_utilization, and swap space.
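
The memory pressure behind these tuning parameters can be estimated with simple arithmetic. The sketch below computes the KV cache size per token and how many fixed-size blocks fit in a memory budget. The model shape, the 20 GiB budget, and the 16-token block size are illustrative assumptions, not values taken from a specific deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def num_cache_blocks(budget_bytes, bytes_per_token, block_size_tokens=16):
    """Number of fixed-size cache blocks that fit in the budget."""
    return budget_bytes // (bytes_per_token * block_size_tokens)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, fp16 cache.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
blocks = num_cache_blocks(20 * 2**30, per_token)
print(per_token, blocks)
```

Because blocks are allocated on demand rather than reserved for the maximum sequence length, the number of blocks bounds how many concurrent sequences can be served.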

What a great answer covers:

Cover traffic splitting at the load balancer or service mesh level, shadow traffic, statistical quality monitoring (not just latency), automated rollback triggers, and progressive rollout.
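
Traffic splitting is normally configured in the load balancer or service mesh, but the underlying idea can be sketched in application code. The function below is an illustrative assumption, not a mesh API: it hashes a request or user ID into one of 100 buckets so that the same ID always lands on the same variant.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically assign a request to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


# Roughly canary_percent of IDs should reach the canary.
ids = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(i, 10) == "canary" for i in ids) / len(ids)
print(f"canary share: {canary_share:.3f}")
```

Sticky assignment matters for progressive rollout: raising the percentage only adds users to the canary, so quality comparisons are not confounded by users switching variants.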

What a great answer covers:

Combine infrastructure metrics (Prometheus/Grafana) with ML metrics (Evidently AI, Arize), statistical tests (PSI, KS), automated alerting thresholds, and feedback loop integration.
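
As one example of the statistical tests mentioned, the Population Stability Index compares the binned distribution of a feature in production against a reference window. This is a minimal sketch that assumes the data is already binned into counts. The eps floor for empty bins is an implementation choice made for this example.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the fractions so empty bins do not produce log(0).
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score


reference = [50, 50]   # hypothetical training-time bin counts
production = [90, 10]  # hypothetical serving-time bin counts
print(round(psi(reference, production), 3))
```

A common rule of thumb treats values above about 0.25 as significant drift, though alert thresholds should be calibrated per feature.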

What a great answer covers:

Cover INT4 quantization (e.g., AWQ), tensor parallelism across 2-4 GPUs, vLLM with continuous batching, auto-scaling based on queue depth, load testing with realistic traffic, and response caching for common prompts.
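
Queue-depth auto-scaling can be illustrated with the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: scale replicas by the ratio of the observed metric to its target. The sketch below omits the tolerance band and stabilization windows that a real autoscaler applies, and the parameter names are assumptions for this example.

```python
import math


def desired_replicas(current_replicas, queue_depth_per_replica, target_depth,
                     min_replicas=1, max_replicas=20):
    """Proportional scaling: ceil(current * observed / target), then clamp."""
    desired = math.ceil(current_replicas * queue_depth_per_replica / target_depth)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas, each with 30 queued requests against a target of 10.
print(desired_replicas(4, queue_depth_per_replica=30, target_depth=10))
```

Queue depth is often a better signal than GPU utilization for LLM serving, because a saturated GPU looks the same whether 5 or 500 requests are waiting.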

What a great answer covers:

Discuss model pre-loading, warm pools, model weight caching with NFS/shared memory, snapshot-based loading (CUDA graphs, torch.compile artifacts), and predictive pre-scaling.

What a great answer covers:

Cover hardware-level partitioning vs. software-level time-sharing, isolation guarantees, latency predictability, use cases (multi-tenant serving vs. best-effort batch), and configuration tradeoffs.

What a great answer covers:

Discuss DVC or LakeFS for data versioning, deterministic feature pipelines, metadata stores (e.g., OpenLineage), integration with experiment tracking, and provenance for auditing.

What a great answer covers:

Cover CUDA/ROCm compatibility issues, cost-per-watt advantages, performance benchmarks for inference-heavy vs. training-heavy workloads, ecosystem maturity, and container image portability.

Scenario-Based

10 questions
What a great answer covers:

Check data pipeline integrity (shuffling, label alignment), verify GPU numerical stability, inspect gradient norms, validate data versioning, and implement automated quality gates in the pipeline.

What a great answer covers:

Cover load testing, horizontal auto-scaling configuration, model optimization (quantization), caching strategies, graceful degradation (model cascading), pre-provisioning capacity, and runbook creation.

What a great answer covers:

Audit utilization metrics (many GPUs may be idle), identify oversized instances, implement spot instances for training, add auto-scaling to reduce idle serving capacity, consolidate workloads, and set up cost allocation tagging.

What a great answer covers:

Set up distributed training with FSDP or DeepSpeed ZeRO Stage 3, configure multi-node communication, implement data parallelism with gradient accumulation, and provide a self-service interface for launching jobs.

What a great answer covers:

Check model input distribution changes (longer prompts), GPU thermal throttling, memory fragmentation, batch size regression, KV cache eviction patterns, and infrastructure changes like network or storage latency.

What a great answer covers:

Cover VPC isolation, encryption at rest and in transit, access logging, least-privilege IAM, dedicated GPU nodes, data residency requirements, BAA with cloud providers, and audit trails for model access.

What a great answer covers:

Implement resource quotas and priority classes in Kubernetes, use gang scheduling for training, preemption policies for inference, separate node pools with taints/tolerations, and cost chargeback.

What a great answer covers:

Cover CUDA compute capability differences, FP8 support, memory bandwidth improvements, cost modeling, container image rebuilds, network topology changes, and re-benchmarking model performance and throughput.

What a great answer covers:

Check load balancer timeout settings, connection pooling limits, request queue overflow, auto-scaler lag (scale-up delay), and implement circuit breakers, request batching, and better backpressure mechanisms.
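
A circuit breaker is one of the simpler mechanisms to sketch. The class below is an illustrative minimal version with hypothetical names and thresholds. It opens after a number of consecutive failures and allows a trial request once a cool-down has passed. Timestamps are passed in explicitly to keep the example deterministic.

```python
class CircuitBreaker:
    """Fails fast after repeated errors instead of queueing more requests."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent to the backend."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return now - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now
```

Rejecting requests early gives the auto-scaler time to add capacity and keeps queues from growing until every request times out.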

What a great answer covers:

Cover vector database selection and sharding, embedding pipeline architecture, index freshness strategy, hybrid search, latency budgeting for retrieval + generation, caching, and monitoring retrieval quality metrics.
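
Hybrid search needs a way to merge keyword and vector results. One common option, used here as an illustrative choice rather than the only approach, is reciprocal rank fusion, which scores each document by the sum of 1 / (k + rank) over the result lists. The constant k=60 is a conventional default and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # hypothetical keyword (e.g., BM25) results
vector_hits = ["b", "d", "a"]   # hypothetical embedding search results
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Because the method uses only ranks, it avoids having to normalize keyword scores and embedding similarities onto a common scale.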

AI Workflow & Tools

10 questions
What a great answer covers:

Cover DAG definition, component reuse, parameterization, artifact passing between steps, retry/failure policies, and integration with the model registry for deployment gating.
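
The core ideas can be sketched without any orchestration framework. The example below uses only the Python standard library and is not the API of a specific pipeline tool. The step names, the toy step functions, and the 0.8 accuracy gate are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
dag = {
    "validate_data": set(),
    "train": {"validate_data"},
    "evaluate": {"train"},
    "register_model": {"evaluate"},
}

# Toy step functions; each receives the artifacts of its dependencies.
steps = {
    "validate_data": lambda inputs: {"rows": 100},
    "train": lambda inputs: {"model": "m-v1", "rows": inputs["validate_data"]["rows"]},
    "evaluate": lambda inputs: {"accuracy": 0.9},
    # Deployment gate: only register models that clear the threshold.
    "register_model": lambda inputs: (
        "registered" if inputs["evaluate"]["accuracy"] >= 0.8 else "rejected"
    ),
}


def run_pipeline(dag, steps, max_retries=2):
    """Run steps in dependency order, retrying each failed step."""
    artifacts = {}
    for name in TopologicalSorter(dag).static_order():
        inputs = {dep: artifacts[dep] for dep in dag[name]}
        for attempt in range(max_retries + 1):
            try:
                artifacts[name] = steps[name](inputs)
                break
            except Exception:
                if attempt == max_retries:
                    raise
    return artifacts


artifacts = run_pipeline(dag, steps)
print(artifacts["register_model"])
```

Real orchestrators add caching, parallel execution of independent steps, and persistent artifact storage on top of this same structure.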

What a great answer covers:

Cover experiment logging (params, metrics, artifacts), model signatures, registry with stage transitions, transition triggers via CI/CD, and deployment integration with serving infrastructure.

What a great answer covers:

Discuss W&B agent configuration in pods, logging GPU metrics alongside training metrics, sweep configuration, custom panels for multi-node communication stats, and cost-per-epoch tracking.

What a great answer covers:

Cover Ray Serve deployments with autoscaling configs, model multiplexing, dynamic request routing, deployment groups for latency tiers, and integration with Kubernetes for resource management.

What a great answer covers:

Cover model repository structure, ensemble model configuration, batching strategies per model, shared memory for inter-model data transfer, and performance profiling with Perf Analyzer.

What a great answer covers:

Cover drift detection (Evidently, Arize), trigger mechanisms, automated data validation gates, retraining orchestration, A/B testing the new model, and rollback if performance degrades.

What a great answer covers:

Cover modular Terraform design (VPC, EKS, node groups with GPU AMI, IAM roles, S3 buckets, CloudWatch dashboards), state management, and environment promotion (dev/staging/prod).

What a great answer covers:

Cover DVC remote configuration, .dvc tracking files, data pipeline definitions (dvc.yaml), integration with Git branching strategies, and CI steps that validate data versions before training.

What a great answer covers:

Cover Helm chart structure, values files for environment overrides, dependency management (subcharts for MLflow, Feast, Argo), resource limits, and secrets management with external secrets operators.

What a great answer covers:

Cover self-hosted GPU runners, Docker build with CUDA base images, integration tests with model validation, registry push, Kubernetes manifest apply (or Argo CD sync), and approval gates for production.

Behavioral

5 questions
What a great answer covers:

Look for ability to translate technical tradeoffs into business impact, use analogies, create simple visuals, and demonstrate empathy for stakeholder concerns around cost, risk, or timeline.

What a great answer covers:

Assess incident triage skills, communication during high-pressure situations, root cause analysis rigor, and whether they drove systemic improvements (not just a quick fix).

What a great answer covers:

Look for structured prioritization frameworks, stakeholder communication skills, ability to negotiate scope, and evidence of balancing urgency vs. strategic impact.

What a great answer covers:

Assess ability to disagree constructively, use data to support positions, listen to opposing views, and reach consensus or escalate appropriately while maintaining team relationships.

What a great answer covers:

Look for flexibility, proactive communication about scope/cost/timeline impacts, modular design thinking, and evidence of maintaining code/infrastructure quality despite shifting requirements.