Skill Guide

GPU and compute resource scheduling across shared clusters (AWS, GCP, Azure)

The orchestration of GPU, CPU, memory, and storage resources across multi-tenant cloud or hybrid clusters to maximize utilization, minimize cost, and enforce fair-share policies for AI/ML workloads.

Good scheduling can reduce cloud infrastructure spend by an estimated 30-60% while ensuring that critical training and inference jobs meet their SLAs. It lets organizations scale AI initiatives without proportional cost growth, turning infrastructure from a bottleneck into a competitive moat.

1 Careers

1 Categories

8.7 Avg Demand

30% Avg AI Risk

How to Learn GPU and compute resource scheduling across shared clusters (AWS, GCP, Azure)

1. Core cloud resource primitives: Understand GPU instance types (AWS p4d, GCP a2-highgpu, Azure NCv3-series), pricing models (on-demand, spot, reserved), and networking basics (VPC, subnets, ENIs).
2. Container orchestration fundamentals: Master Docker, Kubernetes concepts (pods, nodes, resource requests/limits), and the role of a scheduler.
3. Basic scheduling metrics: Learn to read GPU utilization (nvidia-smi), memory pressure, and queue depth from dashboards like Grafana or cloud-native monitoring.
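As a starting point for reading GPU metrics, the sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` into per-GPU utilization and memory pressure. The sample text and the 10% "idle" cutoff are assumptions made for illustration, not real measurements, and `read_gpus` only works on a host where `nvidia-smi` is installed.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        index, util, mem_used, mem_total = (field.strip() for field in rec)
        rows.append({
            "gpu": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(mem_used),
            "mem_pct": round(100 * float(mem_used) / float(mem_total), 1),
        })
    return rows


def read_gpus():
    """Query the local driver; requires nvidia-smi on PATH."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd, text=True))


# Hypothetical sample output for two GPUs (illustrative values only).
SAMPLE = "0, 93, 14250, 15360\n1, 4, 310, 15360\n"

gpus = parse_gpu_csv(SAMPLE)
idle = [g["gpu"] for g in gpus if g["util_pct"] < 10]  # 10% cutoff is an assumption
print(gpus)
print("idle GPUs:", idle)
```

On a real node, replace `parse_gpu_csv(SAMPLE)` with `read_gpus()`. Some GPUs report `[N/A]` for certain fields, which this sketch does not handle.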

1. Implement a multi-queue scheduler: Deploy and configure Kueue on a managed Kubernetes cluster (EKS, GKE, AKS), or Slurm on its own cluster. Practice defining queues with different priorities and resource quotas for teams.
2. Cost-driven spot instance handling: Design a resilient training job that tolerates preemption using checkpointing (PyTorch Lightning's ModelCheckpoint, TensorFlow's CheckpointManager) and a mix of spot/on-demand instances.
3. Common mistake to avoid: Do not treat GPU scheduling like CPU scheduling. By default, Kubernetes allocates GPUs as whole devices: a container cannot request a fraction of a GPU, and GPUs cannot be overcommitted. Focus on node-level bin-packing rather than fractional GPU allocation, unless you are using MIG, time-slicing, or vGPU.
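To make node-level bin-packing concrete, here is a minimal sketch that compares plain first-fit placement in arrival order with first-fit decreasing (largest GPU request first). The job names, sizes, and the two-node cluster are hypothetical, and a real scheduler weighs many more constraints (topology, priorities, preemption).

```python
def first_fit(jobs, node_count, gpus_per_node):
    """Place each (name, gpus) job on the first node with enough free whole GPUs."""
    free = [gpus_per_node] * node_count
    placed, pending = {}, []
    for name, need in jobs:
        node = next((i for i, avail in enumerate(free) if avail >= need), None)
        if node is None:
            pending.append(name)  # no single node can host this job right now
        else:
            free[node] -= need
            placed[name] = node
    return placed, pending


def first_fit_decreasing(jobs, node_count, gpus_per_node):
    """Same as first_fit, but consider the largest GPU requests first."""
    ordered = sorted(jobs, key=lambda job: -job[1])
    return first_fit(ordered, node_count, gpus_per_node)


# Hypothetical arrivals on a cluster of 2 nodes with 4 GPUs each.
arrivals = [("d", 1), ("b", 2), ("c", 2), ("a", 3)]

naive = first_fit(arrivals, node_count=2, gpus_per_node=4)
packed = first_fit_decreasing(arrivals, node_count=2, gpus_per_node=4)
print("arrival order:", naive)
print("largest first:", packed)
```

In this example, arrival-order placement fragments the free GPUs so the 3-GPU job cannot fit on either node even though 3 GPUs are free in total, while largest-first placement fits all four jobs.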

1. Architect a multi-cloud burst strategy: Design a system where low-priority jobs burst to a secondary cloud provider when primary cluster capacity is exhausted, using a multi-cluster tool such as KubeFed (now archived; Karmada and Kueue's MultiKueue are maintained alternatives) or a custom controller.
2. Implement fair-share with decay: Configure Slurm's Fair Tree algorithm or Kueue's fair sharing so that teams with low recent usage get priority, preventing resource hoarding.
3. Mentor teams on defining SLOs: Teach ML engineers to specify resource SLOs (e.g., '100 GPU-hours within 24h for job class A') and translate them into scheduler configurations.
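The idea behind fair-share with decay can be sketched in a few lines. The example below weights past GPU-hours with an exponential half-life and then scores each team with a factor of the form 2^(-usage/share), loosely modeled on the classic fair-share formula in Slurm's multifactor priority plugin. It is not the Fair Tree algorithm or Kueue's implementation, and the teams, shares, and half-life are assumptions chosen for illustration.

```python
def decayed_usage(samples, now_h, half_life_h):
    """Sum GPU-hours, halving each sample's weight every half_life_h hours.

    samples: list of (timestamp_in_hours, gpu_hours) tuples.
    """
    return sum(
        gpu_h * 0.5 ** ((now_h - t_h) / half_life_h) for t_h, gpu_h in samples
    )


def fairshare_factor(usage, total_usage, share):
    """Return 2 ** -(usage fraction / entitled share).

    The factor is 0.5 when a team has used exactly its share, approaches 1.0
    for under-users, and approaches 0.0 for heavy over-users.
    """
    if total_usage == 0:
        return 1.0
    return 2 ** (-(usage / total_usage) / share)


# Hypothetical history: both teams used 100 GPU-hours, but team-a did so a
# week (168 h) ago and team-b just now. Half-life of one week is an assumption.
NOW_H = 168.0
HALF_LIFE_H = 168.0
history = {"team-a": [(0.0, 100.0)], "team-b": [(168.0, 100.0)]}
shares = {"team-a": 0.5, "team-b": 0.5}

usage = {t: decayed_usage(s, NOW_H, HALF_LIFE_H) for t, s in history.items()}
total = sum(usage.values())
factors = {t: fairshare_factor(usage[t], total, shares[t]) for t in usage}
next_team = max(factors, key=factors.get)
print(usage)
print(factors)
print("next to be served:", next_team)
```

Because team-a's usage has decayed to half its original weight, it scores higher and is served first, which is the behavior the decay is meant to produce.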

Practice Projects

Beginner

Project

Deploy a GPU-aware Kubernetes Cluster with Basic Prioritization

Scenario

Your data science team has 3 projects. Project A is a high-priority model retraining for a product launch. Project B is ad-hoc experimentation. Project C is a long-running data preprocessing job. You have a shared EKS cluster with 5 nodes, each with 4 NVIDIA T4 GPUs.

How to Execute

1. Provision the EKS cluster using eksctl with a node group of g4dn.12xlarge instances (4 T4 GPUs each; g4dn.xlarge has only one GPU).
2. Install the NVIDIA device plugin for Kubernetes.
3. Install Kueue and create two ResourceFlavors: 'high-priority' (for on-demand nodes) and 'low-priority' (for spot nodes).
4. Create a ClusterQueue with a cohort, defining a nominal quota for 'high-priority' GPUs.
5. Create LocalQueues for the teams that point to the ClusterQueue, assign priorities to the jobs themselves (e.g., priority classes 'high' and 'low'), and test job submission with kubectl.
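The Kueue objects from these steps can be sketched as plain dictionaries and printed as JSON, which `kubectl apply -f -` accepts. This is a minimal sketch assuming the `kueue.x-k8s.io/v1beta1` API and EKS managed node groups (which label nodes with `eks.amazonaws.com/capacityType`). The queue names, namespaces, cohort name, and the 12/8 split of the scenario's 20 GPUs are hypothetical.

```python
import json

API = "kueue.x-k8s.io/v1beta1"  # assumed API version; check your Kueue release
GPU = "nvidia.com/gpu"


def resource_flavor(name, node_labels):
    """A ResourceFlavor maps a quota bucket onto nodes carrying these labels."""
    return {"apiVersion": API, "kind": "ResourceFlavor",
            "metadata": {"name": name}, "spec": {"nodeLabels": node_labels}}


def cluster_queue(name, cohort, gpu_quotas):
    """A ClusterQueue with one nominal GPU quota per flavor."""
    flavors = [
        {"name": flavor, "resources": [{"name": GPU, "nominalQuota": quota}]}
        for flavor, quota in gpu_quotas.items()
    ]
    return {"apiVersion": API, "kind": "ClusterQueue",
            "metadata": {"name": name},
            "spec": {"cohort": cohort,
                     "namespaceSelector": {},  # empty selector: all namespaces
                     "resourceGroups": [
                         {"coveredResources": [GPU], "flavors": flavors}]}}


def local_queue(name, namespace, cluster_queue_name):
    """A namespaced LocalQueue that feeds the given ClusterQueue."""
    return {"apiVersion": API, "kind": "LocalQueue",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {"clusterQueue": cluster_queue_name}}


# 5 nodes x 4 GPUs = 20 GPUs; the on-demand/spot split is an assumption.
gpu_quotas = {"high-priority": 12, "low-priority": 8}
manifests = [
    resource_flavor("high-priority", {"eks.amazonaws.com/capacityType": "ON_DEMAND"}),
    resource_flavor("low-priority", {"eks.amazonaws.com/capacityType": "SPOT"}),
    cluster_queue("shared-gpus", "ml-cohort", gpu_quotas),
    local_queue("team-a-queue", "team-a", "shared-gpus"),
    local_queue("team-b-queue", "team-b", "shared-gpus"),
]

bundle = {"apiVersion": "v1", "kind": "List", "items": manifests}
print(json.dumps(bundle, indent=2))
```

Piping the output to `kubectl apply -f -` would create the objects. A production ClusterQueue would normally also cover `cpu` and `memory`, since jobs requesting resources the queue does not cover are not admitted.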

Intermediate

Project

Implement Spot Instance Resilience with Automatic Checkpointing

Scenario

Your organization wants to cut training costs by 50% by using Spot instances but fears job interruptions. You need to run a 72-hour BERT fine-tuning job on GCP's preemptible A2 instances.

How to Execute

1. Write a training script using PyTorch Lightning with a ModelCheckpoint callback saving to a GCS bucket every 1000 steps.
2. Create a Kubernetes Job YAML with a restartPolicy of 'OnFailure'.
3. Use a node selector to target preemptible nodes.
4. Add a startup check in your container (for example, in the entrypoint script) that looks for a recent checkpoint file and confirms the job has resumed from it after preemption instead of starting over.
5. Test by simulating a preemption via `gcloud compute instances stop` and verify automatic recovery.
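The checkpoint-and-resume pattern behind these steps can be illustrated without any ML framework. The sketch below is a stand-in for what a framework checkpointer does, not PyTorch Lightning's API: a local directory plays the role of the GCS bucket, a counter plays the role of model state, and the `Preempted` exception and `preempt_at` parameter are hypothetical devices for simulating an interruption.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Raised by the simulation to mimic the node being reclaimed."""


def save_checkpoint(ckpt_dir, step, state):
    """Write to a temp file, then rename, so a kill mid-write leaves no partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, os.path.join(ckpt_dir, "last.json"))


def load_checkpoint(ckpt_dir):
    """Return (step, state) from the latest checkpoint, or a fresh start."""
    path = os.path.join(ckpt_dir, "last.json")
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, every_n_steps=1000, preempt_at=None):
    """Resume from the last checkpoint and run until total_steps."""
    step, state = load_checkpoint(ckpt_dir)
    while step < total_steps:
        if preempt_at is not None and step == preempt_at:
            raise Preempted(step)
        state += 1.0  # stand-in for one optimizer step
        step += 1
        if step % every_n_steps == 0:
            save_checkpoint(ckpt_dir, step, state)
    save_checkpoint(ckpt_dir, step, state)
    return step, state


with tempfile.TemporaryDirectory() as ckpt_dir:
    try:
        train(ckpt_dir, total_steps=5000, preempt_at=2500)
    except Preempted:
        resumed_from, _ = load_checkpoint(ckpt_dir)
    final_step, final_state = train(ckpt_dir, total_steps=5000)

print("resumed from step", resumed_from)
print("finished at step", final_step, "with state", final_state)
```

The simulated preemption at step 2500 costs 500 steps of work, because the last checkpoint was written at step 2000. Shortening the checkpoint interval reduces that loss at the price of more frequent writes.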

Advanced

Project

Design a Multi-Cluster, Multi-Cloud Scheduler with Cost-Aware Bursting

Scenario

Your primary Azure AKS cluster is at capacity due to a new large language model training run. You need to offload lower-priority hyperparameter tuning jobs to a secondary GCP GKE cluster while maintaining a unified interface for your ML engineers.

How to Execute

1. Deploy a central multi-cluster control plane, such as KubeFed (now archived) or a maintained alternative like Karmada or Kueue's MultiKueue, to manage workloads across AKS and GKE.
2. Create a custom scheduler controller that watches the pending Kueue jobs on the primary cluster.
3. When pending GPU-hours exceed a threshold, the controller automatically creates a corresponding job targeting the secondary GKE cluster.
4. Implement a cost model: jobs burst to GKE only if their estimated cost there is ≤ 70% of their estimated cost at the primary cluster's spot price.
5. Use a unified logging stack (Loki) to aggregate job status from both clouds for the ML team.
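The decision logic of such a controller (backlog threshold plus the 70% cost rule) might look like the sketch below. It covers only the decision, not the Kubernetes watch or job creation. The `PendingJob` fields, job names, prices, and threshold are hypothetical, and the cost comparison assumes a job takes the same number of GPU-hours on either cluster, so comparing per-GPU-hour prices is enough.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    name: str
    gpus: int
    est_hours: float
    priority: str  # "high" or "low"


def pending_gpu_hours(pending):
    """Total estimated GPU-hours waiting in the primary cluster's queue."""
    return sum(job.gpus * job.est_hours for job in pending)


def jobs_to_burst(pending, threshold_gpu_hours, primary_spot_price,
                  secondary_price, max_cost_ratio=0.70):
    """Return names of low-priority jobs to offload, or [] if bursting is not justified.

    Prices are per GPU-hour. Bursting requires both a backlog above the
    threshold and a secondary price within max_cost_ratio of the primary price.
    """
    if pending_gpu_hours(pending) <= threshold_gpu_hours:
        return []
    if secondary_price > max_cost_ratio * primary_spot_price:
        return []
    return [job.name for job in pending if job.priority == "low"]


# Hypothetical queue on the primary cluster.
queue = [
    PendingJob("llm-eval", gpus=8, est_hours=10.0, priority="high"),
    PendingJob("hpo-sweep-1", gpus=4, est_hours=6.0, priority="low"),
    PendingJob("hpo-sweep-2", gpus=2, est_hours=5.0, priority="low"),
]

cheap = jobs_to_burst(queue, threshold_gpu_hours=100,
                      primary_spot_price=1.00, secondary_price=0.65)
pricey = jobs_to_burst(queue, threshold_gpu_hours=100,
                       primary_spot_price=1.00, secondary_price=0.80)
print("backlog GPU-hours:", pending_gpu_hours(queue))
print("burst when secondary is cheap:", cheap)
print("burst when secondary is pricey:", pricey)
```

Keeping the decision in a pure function like this makes the policy easy to unit-test separately from the controller that applies it.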

Tools & Frameworks

Scheduling & Orchestration

Kubernetes (EKS, GKE, AKS); Kueue; Slurm; Volcano

Kubernetes is the substrate. Kueue is the modern, Kubernetes-native job scheduler for batch workloads, handling quotas and fair-share. Slurm is the HPC standard for large-scale, multi-tenant clusters. Volcano is a Kubernetes-native batch system optimized for AI/ML and big data.

Monitoring & Cost Management

Grafana + Prometheus; Kubecost; NVIDIA DCGM Exporter; Cloud Provider Cost Explorer

Prometheus collects scheduler and GPU metrics. Grafana visualizes cluster health and job queues. Kubecost provides real-time Kubernetes cost allocation per team/job. DCGM Exporter gives deep GPU telemetry (SM occupancy, memory errors). Cloud Cost Explorer tracks spend trends.

Resilience & Checkpointing

PyTorch Lightning ModelCheckpoint; TensorFlow CheckpointManager; TorchElastic / TorchFaultTolerance; Custom Sidecar Containers

Framework-native checkpointers are essential for spot/preemptible instance usage. TorchElastic enables fault-tolerant distributed training by automatically restarting workers. Custom sidecars can handle job state persistence for non-PyTorch workloads.

Infrastructure as Code (IaC)

Terraform; AWS CloudFormation; Pulumi

Critical for reproducible, version-controlled cluster provisioning. Use Terraform modules for creating node groups with specific GPU instance types, configuring auto-scaling, and attaching spot instance pools. This ensures scheduling decisions are based on consistent infrastructure.

Interview Questions

Answer Strategy

Use the 'Fair-Share with Decay' framework. Explain how you would define ClusterQueues with nominal quotas aligned to team budgets, then implement a hierarchical fair-share tree in which unused quota from lower-priority teams is redistributed to higher-priority ones. Mention using Kueue's Cohorts for inter-team borrowing. Provide a concrete example: Team A gets 40% of cluster GPUs as its nominal quota but can borrow up to 60% if others are idle, with recorded usage decaying over a 7-day window so that past heavy use does not starve a team indefinitely.
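The borrowing example can be checked with a small admission function. This sketch reads "borrow up to 60%" as a ceiling of 60% of the cluster in total for Team A (40% nominal plus at most 20% borrowed), which is one interpretation of the example. The 100-GPU cluster and the request sizes are hypothetical, and real Kueue admission involves more than this arithmetic.

```python
def admit(request, team_in_use, nominal, borrowing_limit, cohort_unused):
    """Decide whether a GPU request fits under nominal quota plus capped borrowing.

    Returns (admitted, newly_borrowed_gpus).
    cohort_unused is the quota other teams in the cohort are not using right now.
    """
    after = team_in_use + request
    if after <= nominal:
        return True, 0  # fits entirely within the team's own quota
    over = after - nominal                       # total held above nominal afterwards
    already_borrowed = max(0, team_in_use - nominal)
    newly_borrowed = over - already_borrowed
    if over <= borrowing_limit and newly_borrowed <= cohort_unused:
        return True, newly_borrowed
    return False, 0


# Hypothetical 100-GPU cluster: Team A has 40 nominal and may borrow 20 more.
NOMINAL, BORROW_LIMIT = 40, 20

within_quota = admit(4, team_in_use=30, nominal=NOMINAL,
                     borrowing_limit=BORROW_LIMIT, cohort_unused=30)
small_borrow = admit(8, team_in_use=36, nominal=NOMINAL,
                     borrowing_limit=BORROW_LIMIT, cohort_unused=30)
over_ceiling = admit(20, team_in_use=44, nominal=NOMINAL,
                     borrowing_limit=BORROW_LIMIT, cohort_unused=30)
others_busy = admit(16, team_in_use=44, nominal=NOMINAL,
                    borrowing_limit=BORROW_LIMIT, cohort_unused=10)
print(within_quota, small_borrow, over_ceiling, others_busy)
```

The last two cases show the two ways borrowing is refused: the team would exceed its ceiling, or the rest of the cohort has no idle quota to lend.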

Answer Strategy

Demonstrate a systematic troubleshooting approach. First, check cloud provider and Kubernetes event logs for preemption reasons (capacity vs. price), and use DCGM metrics to rule out GPU faults. Second, analyze the checkpoint frequency and recovery time: if recovery takes longer than the average spot instance lifetime, preemptions appear catastrophic because the job never makes progress. Third, implement a mixed-instance strategy: run the job on a blend of 70% spot and 30% on-demand instances using Kueue's ResourceFlavors, ensuring the on-demand 'anchor' instances keep the job alive. Finally, reduce checkpoint size and write time (for example, with sharded or asynchronous checkpoint saving; gradient checkpointing saves training memory but does not shrink saved checkpoints) to cut recovery time, for instance from 15 minutes to 2 minutes.
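The relationship between recovery time and spot instance lifetime can be made concrete with a rough back-of-the-envelope model. The sketch assumes that each preemption costs the recovery time plus, on average, half a checkpoint interval of redone work, and it ignores the time spent writing checkpoints. The 120-minute and 20-minute lifetimes and the 30-minute checkpoint interval are hypothetical; only the 15-minute and 2-minute recovery times come from the answer above.

```python
def goodput(mean_lifetime_min, checkpoint_interval_min, recovery_min):
    """Approximate fraction of wall-clock time spent on work that is kept.

    Per preemption cycle: lifetime minus recovery minus the expected redone
    work (half a checkpoint interval), floored at zero.
    """
    lost = recovery_min + checkpoint_interval_min / 2
    return max(0.0, (mean_lifetime_min - lost) / mean_lifetime_min)


slow_recovery = goodput(mean_lifetime_min=120, checkpoint_interval_min=30, recovery_min=15)
fast_recovery = goodput(mean_lifetime_min=120, checkpoint_interval_min=30, recovery_min=2)
short_lived = goodput(mean_lifetime_min=20, checkpoint_interval_min=30, recovery_min=15)

print(f"15 min recovery: {slow_recovery:.0%} useful time")
print(f" 2 min recovery: {fast_recovery:.0%} useful time")
print(f"20 min lifetime: {short_lived:.0%} useful time")
```

Under these assumptions, the third case is the "catastrophic" regime: when recovery plus redone work exceeds the typical instance lifetime, the job makes no net progress regardless of how cheap the instances are.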