AI Infrastructure Engineer
AI Infrastructure Engineers design, build, and maintain the foundational systems that power machine learning workloads at scale - …
Skill Guide
The architectural, operational, and administrative discipline of provisioning, partitioning, allocating, and monitoring shared physical GPU resources across multiple users, teams, or workloads within a high-performance computing (HPC) or cloud-native environment.
Scenario
Build a small, functional GPU cluster for a research team using three servers with NVIDIA GPUs. Configure Slurm to allow two separate teams to submit jobs to their own dedicated partitions.
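One way to sketch the partition layout for this scenario is to generate the relevant slurm.conf lines from a short description of nodes and teams. The node names (gpu01 to gpu03), the GPU count per node, and the team and Unix group names below are hypothetical. NodeName, Gres, PartitionName, Nodes, AllowGroups, Default, MaxTime, and State are standard slurm.conf keywords, but a working cluster would also need a matching gres.conf and the rest of the controller configuration, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpus: int  # number of GPUs exposed as a generic resource (GRES)


@dataclass
class Team:
    partition: str    # Slurm partition name
    unix_group: str   # Unix group allowed to submit to the partition
    nodes: list[str]  # nodes dedicated to this team


def render_slurm_conf(nodes: list[Node], teams: list[Team]) -> str:
    """Render the GRES, node, and partition lines of a slurm.conf fragment."""
    lines = ["GresTypes=gpu"]
    for n in nodes:
        lines.append(f"NodeName={n.name} Gres=gpu:{n.gpus} State=UNKNOWN")
    for t in teams:
        lines.append(
            f"PartitionName={t.partition} Nodes={','.join(t.nodes)} "
            f"AllowGroups={t.unix_group} Default=NO MaxTime=INFINITE State=UP"
        )
    return "\n".join(lines)


# Hypothetical layout: three 4-GPU servers split between two teams.
nodes = [Node("gpu01", 4), Node("gpu02", 4), Node("gpu03", 4)]
teams = [
    Team("vision", "vision", ["gpu01", "gpu02"]),
    Team("nlp", "nlp", ["gpu03"]),
]
conf = render_slurm_conf(nodes, teams)
print(conf)
```

Because each node appears in exactly one partition and AllowGroups restricts who may submit, each team can only land jobs on its own hardware. A job would then request GPUs with something like `sbatch --partition=vision --gres=gpu:1`.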
Scenario
An organization needs to run ML workloads for multiple teams on a shared Kubernetes cluster, requiring fair sharing of NVIDIA A100 GPUs with MIG (Multi-Instance GPU) capability.
Scenario
A 100-node GPU cluster has averaged only 40% GPU utilization over a quarter, causing project delays and high cloud costs. You are tasked with leading a cross-functional team to diagnose the cause and implement a solution.
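To make the scale of the problem concrete, the idle capacity can be estimated with simple arithmetic. The 100 nodes and 40% utilization come from the scenario; the 8 GPUs per node and the 90-day quarter are assumptions chosen only for illustration.

```python
NODES = 100                 # from the scenario
AVG_UTILIZATION_PCT = 40    # from the scenario
GPUS_PER_NODE = 8           # assumption: not stated in the scenario
DAYS_IN_QUARTER = 90        # assumption: approximate quarter length

# Total GPU-hours the cluster could have delivered over the quarter.
total_gpu_hours = NODES * GPUS_PER_NODE * 24 * DAYS_IN_QUARTER

# GPU-hours that went unused at the reported average utilization.
idle_gpu_hours = total_gpu_hours * (100 - AVG_UTILIZATION_PCT) // 100

print(f"total GPU-hours: {total_gpu_hours:,}")
print(f"idle GPU-hours:  {idle_gpu_hours:,}")
```

Under these assumptions, roughly a million GPU-hours went unused in the quarter, which is the figure a diagnosis effort would aim to reduce.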
Used for traditional, large-scale batch job scheduling in research and on-premises environments. Focus on Slurm, given its dominance in AI research clusters.
The cloud-native stack for managing ML workloads as containers. The GPU Operator simplifies driver/plugin management; Volcano adds gang scheduling and fair-share for AI jobs.
DCGM provides deep, low-level GPU telemetry (SM clocks, memory bandwidth, errors). Prometheus scrapes and stores these metrics, which can then drive cluster-wide health and utilization dashboards and alerts in a tool such as Grafana.
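As a rough illustration of what this telemetry looks like once exported, the sketch below parses a few lines in the Prometheus text exposition format and derives an average utilization and a list of hot GPUs. DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are metric names used by NVIDIA's dcgm-exporter; the sample values, the host name, the reduced label set, and the 85 °C threshold are hypothetical. In practice Prometheus does the scraping and the aggregation would be written as a query, so this is only a stand-in for the idea.

```python
import re
from statistics import mean

# Hypothetical scrape output; real dcgm-exporter lines carry more labels.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node01"} 92
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node01"} 8
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node01"} 64
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node01"} 88
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs

TEMP_THRESHOLD_C = 85.0  # assumption: pick a limit suited to your hardware


def parse(text):
    """Return (metric_name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE.match(line)
        if m is None:
            continue  # ignore lines this simple parser does not understand
        name, labels, value = m.groups()
        samples.append((name, dict(LABEL.findall(labels)), float(value)))
    return samples


samples = parse(SAMPLE)

# Average utilization across all GPUs in the scrape.
avg_util = mean(v for name, _, v in samples if name == "DCGM_FI_DEV_GPU_UTIL")

# GPUs at or above the temperature threshold, as (host, gpu index) pairs.
hot_gpus = [
    (labels["Hostname"], labels["gpu"])
    for name, labels, v in samples
    if name == "DCGM_FI_DEV_GPU_TEMP" and v >= TEMP_THRESHOLD_C
]

print(f"average utilization: {avg_util:.1f}%")
print(f"GPUs over {TEMP_THRESHOLD_C:.0f} C: {hot_gpus}")
```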
Answer Strategy
Demonstrate knowledge of isolation and scheduling primitives. Answer: 'First, I'd enable MIG on the A100s to create fixed-size GPU slices. Then I'd create a namespace for each team and apply ResourceQuotas to cap each team's total GPU requests. For workload priority, I'd deploy the Volcano scheduler with a fair-share queue policy, so that high-priority inference jobs can preempt lower-priority training jobs when needed.'
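The namespace-and-quota step of this answer could be sketched by generating one ResourceQuota per team. The team namespaces, quota names, and slice counts below are hypothetical, and the resource name nvidia.com/mig-1g.5gb assumes the NVIDIA device plugin is running with the mixed MIG strategy on 40 GB A100s; other setups expose different resource names. Kubernetes accepts quotas on extended resources only through the requests. prefix, which is why the key is built that way.

```python
import json


def mig_quota(namespace: str, mig_profile: str, max_slices: int) -> dict:
    """Build a ResourceQuota capping how many MIG slices a namespace may request."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-gpu-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Extended resources are quota'd via the "requests." prefix.
                f"requests.nvidia.com/{mig_profile}": str(max_slices),
            }
        },
    }


# Hypothetical teams and limits: each value is a number of MIG slices.
team_limits = {"team-a": 8, "team-b": 6}

manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [mig_quota(ns, "mig-1g.5gb", n) for ns, n in team_limits.items()],
}

print(json.dumps(manifest, indent=2))
```

The quota only caps how much each team can request; ordering and preemption between jobs would still come from the scheduler, for example Volcano queues as described in the answer.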
Answer Strategy
Tests problem-solving and systems thinking. Use the STAR method. Sample response: 'We saw 50% GPU idle time despite a full queue. I used DCGM and found that GPUs were being clocked down by thermal throttling. After checking airflow and fan speeds, I discovered a cooling unit failure. I coordinated with facilities to fix it and implemented a Grafana alert on GPU temperatures to prevent recurrence. Idle time dropped to under 5%.'