Skill Guide

GPU/TPU architecture and utilization profiling

The systematic process of analyzing the hardware architecture and runtime behavior of GPUs/TPUs to identify performance bottlenecks, optimize resource allocation, and maximize computational throughput for machine learning and high-performance computing workloads.

This skill directly translates to reduced cloud computing costs, faster model training/inference times, and more efficient resource utilization, enabling organizations to scale AI operations economically. It is critical for maintaining competitive advantage in performance-sensitive domains like large language model training and real-time inference services.

1 Careers

1 Categories

8.5 Avg Demand

20% Avg AI Risk

How to Learn GPU/TPU architecture and utilization profiling

1. Master the fundamental hardware architecture. For GPUs: CUDA cores, Tensor Cores, SMs (Streaming Multiprocessors), the memory hierarchy (L1/L2 cache, shared memory, global memory/VRAM), and PCIe/NVLink interconnects. For TPUs: MXUs (Matrix Multiply Units) and their systolic array design.
2. Learn to read and interpret basic metrics from profiling tools (e.g., GPU utilization %, memory bandwidth utilization, compute-bound vs. memory-bound kernels).
3. Understand the basics of the software stack: CUDA/cuDNN for NVIDIA GPUs, XLA and the TPU runtime for Google Cloud TPUs.
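As a first exercise in reading basic metrics, the sketch below summarizes a utilization log. It assumes the log was captured with `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits -l 1`; the sample values and the function name `summarize_gpu_log` are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical captured output, one sample per second.
# Columns: GPU utilization (%), memory used (MiB), memory total (MiB).
SAMPLE_LOG = """\
62, 10240, 40960
78, 10240, 40960
65, 10312, 40960
80, 10312, 40960
"""


def summarize_gpu_log(text):
    """Return mean/min GPU utilization and peak memory fraction from a CSV log."""
    rows = [
        [float(field) for field in row]
        for row in csv.reader(io.StringIO(text), skipinitialspace=True)
        if row  # skip blank lines
    ]
    utilization = [r[0] for r in rows]
    memory_fraction = [r[1] / r[2] for r in rows]
    return {
        "mean_util_pct": statistics.mean(utilization),
        "min_util_pct": min(utilization),
        "peak_mem_fraction": max(memory_fraction),
    }


summary = summarize_gpu_log(SAMPLE_LOG)
print(summary)
```

Keep in mind that the utilization figure reported by `nvidia-smi` is the share of time during which at least one kernel was running. A high value does not by itself show that the SMs or the memory system are well used, which is why the kernel-level tools matter.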

1. Move from observation to analysis: use NVIDIA Nsight Systems/Compute or the Google Cloud TPU Profiler to generate and interpret timelines, kernel execution traces, and memory transfer statistics. Focus on identifying common bottlenecks: kernel launch overhead, PCIe transfer latency, memory bandwidth saturation, or low arithmetic intensity.
2. Practice profiling a standard model (e.g., ResNet-50) and make one targeted optimization (e.g., increasing batch size to give each kernel enough work to fill the SMs, or using mixed precision to leverage Tensor Cores). Avoid the common mistake of optimizing in isolation; always correlate system-level metrics (such as CPU-GPU synchronization) with kernel-level data.

1. Architect for profiling: design systems with built-in observability, and use profiling to inform hardware selection (A100 vs. H100 vs. TPU v4) and framework choices (PyTorch vs. JAX/XLA).
2. Master advanced techniques: Roofline analysis to determine whether kernels are compute-bound or memory-bound, memory coalescing and shared-memory bank conflicts, warp/wavefront occupancy, and multi-GPU/TPU pod communication (AllReduce, Tensor Parallelism).
3. Develop cost/performance models to guide strategic decisions on scaling training jobs or designing inference pipelines.
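A cost/performance model can start very small. The sketch below is one possible shape, not a standard formula: the function name `training_cost`, the throughput figures, the scaling efficiencies, and the hourly price are all hypothetical inputs that would come from your own profiling runs and cloud pricing.

```python
def training_cost(total_samples, samples_per_sec_per_gpu, num_gpus,
                  scaling_efficiency, price_per_gpu_hour):
    """Estimate wall-clock hours and total cost of one training run.

    scaling_efficiency is the measured fraction of ideal linear speedup
    (1.0 means N GPUs are exactly N times faster than one).
    """
    throughput = samples_per_sec_per_gpu * num_gpus * scaling_efficiency
    hours = total_samples / throughput / 3600
    cost = hours * num_gpus * price_per_gpu_hour
    return hours, cost


# Hypothetical job: 115.2M samples, 2500 samples/s per GPU, $3 per GPU-hour.
hours_8, cost_8 = training_cost(115_200_000, 2500, 8, 0.9, 3.0)
hours_16, cost_16 = training_cost(115_200_000, 2500, 16, 0.8, 3.0)

print(f"8 GPUs:  {hours_8:.2f} h, ${cost_8:.2f}")
print(f"16 GPUs: {hours_16:.2f} h, ${cost_16:.2f}")
```

With these made-up numbers, doubling the GPU count shortens the run but raises the total cost, because efficiency drops from 0.9 to 0.8. Whether that trade is worth it depends on how much the faster turnaround is valued.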

Practice Projects

Beginner

Project

Profile and Optimize a Simple CUDA Kernel

Scenario

You have a naive CUDA kernel that performs element-wise vector addition. It runs slower than expected on a discrete GPU.

How to Execute

1. Write the kernel in CUDA C++, or use PyTorch's profiler on a simple tensor operation.
2. Use NVIDIA Nsight Systems to generate a timeline and identify the kernel (the legacy `nvprof` tool works only on older GPUs; it does not support compute capability 8.0 and above).
3. Use NVIDIA Nsight Compute to analyze the kernel's memory throughput and occupancy.
4. Refactor the kernel to improve memory access patterns (e.g., ensure coalesced reads/writes) and re-profile to quantify the improvement.
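To quantify the improvement in step 4, one common yardstick is effective bandwidth: bytes read plus bytes written, divided by kernel time. The sketch below applies it to vector addition. The kernel times and the peak bandwidth figure are hypothetical placeholders; substitute the values from your profiler and your GPU's datasheet.

```python
def effective_bandwidth_gbs(num_elements, bytes_per_element, kernel_time_s):
    """Effective bandwidth in GB/s for c[i] = a[i] + b[i].

    Each element costs two reads (a, b) and one write (c).
    """
    bytes_moved = 3 * num_elements * bytes_per_element
    return bytes_moved / kernel_time_s / 1e9


N = 1 << 26                   # 67,108,864 float32 elements
PEAK_GBS = 1555.0             # assumed peak memory bandwidth of the GPU

before = effective_bandwidth_gbs(N, 4, 1.0e-3)   # hypothetical: 1.0 ms
after = effective_bandwidth_gbs(N, 4, 0.6e-3)    # hypothetical: 0.6 ms

print(f"before: {before:.1f} GB/s ({before / PEAK_GBS:.0%} of peak)")
print(f"after:  {after:.1f} GB/s ({after / PEAK_GBS:.0%} of peak)")
```

Vector addition does almost no arithmetic per byte, so the fraction of peak bandwidth reached is a more meaningful target than any FLOP/s figure.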

Intermediate

Project

Optimize a PyTorch Model Training Loop for GPU Utilization

Scenario

A ResNet-50 training job on a single GPU shows GPU utilization fluctuating between 60% and 80%, with frequent small memory copies in the profiler timeline.

How to Execute

1. Use PyTorch's Profiler (`torch.profiler`) with CUDA activity enabled to record a few training steps.
2. Analyze the trace in Chrome (`chrome://tracing`) or Perfetto. Identify gaps between kernel executions, which are likely caused by CPU-side data loading or metric logging.
3. Implement optimizations: set `pin_memory=True` and increase `num_workers` in the DataLoader, move metric computation to the GPU, and use CUDA graphs to launch batches of kernels with less CPU overhead.
4. Re-profile to verify that the gaps between kernels have shrunk, aiming for sustained GPU utilization above 90%.
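The gap analysis in step 2 can also be done numerically on an exported trace. The sketch below is a minimal version under stated assumptions: the file follows the Chrome trace event format with complete events (`"ph": "X"`, `ts` and `dur` in microseconds), and GPU kernels carry the category `"kernel"`. The category name and the sample events are assumptions; check them against your own trace.

```python
import json

# Hypothetical miniature trace: three GPU kernels and one CPU op.
SAMPLE_TRACE = json.dumps({
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "conv_fwd", "ts": 0, "dur": 400},
        {"ph": "X", "cat": "kernel", "name": "relu", "ts": 400, "dur": 300},
        {"ph": "X", "cat": "cpu_op", "name": "dataloader", "ts": 650, "dur": 340},
        {"ph": "X", "cat": "kernel", "name": "conv_bwd", "ts": 1000, "dur": 500},
    ]
})


def gpu_busy_fraction(trace_json, kernel_category="kernel"):
    """Return (busy fraction, list of idle gaps) over the traced kernel span."""
    events = json.loads(trace_json)["traceEvents"]
    intervals = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in events
        if e.get("ph") == "X" and e.get("cat") == kernel_category
    )
    if not intervals:
        return 0.0, []

    busy = 0
    gaps = []
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start > cur_end:
            # The GPU was idle between cur_end and start.
            gaps.append((cur_end, start))
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back kernels: merge the intervals.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    span = cur_end - intervals[0][0]
    return busy / span, gaps


fraction, gaps = gpu_busy_fraction(SAMPLE_TRACE)
print(f"GPU busy {fraction:.0%} of the traced span; idle gaps (us): {gaps}")
```

Comparing the busy fraction before and after each DataLoader or logging change gives a single number to track, alongside the visual timeline.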

Advanced

Project

Profile and Optimize Multi-GPU/TPU Distributed Training Communication

Scenario

A large language model training job on a multi-node GPU cluster (8x A100 per node) shows scaling efficiency that degrades significantly beyond 16 GPUs. Network profiling indicates high latency in AllReduce operations.

How to Execute

1. Capture traces on every node with a distributed-aware profiling setup: `torch.profiler` (or Nsight Systems with NVTX range markers) for GPU clusters, or the TensorFlow Profiler (`tf.profiler`) for TPU pods.
2. Analyze the communication patterns: check how much AllReduce time overlaps with computation, measure bus bandwidth utilization (NVLink/PCIe/InfiniBand), and detect synchronization stalls.
3. Implement advanced optimizations: try a hierarchical AllReduce (reduce within each node over NVLink first, then across nodes), apply gradient compression, increase communication-computation overlap (e.g., by tuning gradient bucketing in PyTorch's `DistributedDataParallel`), or tune the collective algorithm for your specific topology.
4. Run comparative benchmarks to measure the improvement in scaling efficiency and time-to-train, and correlate it with cost savings.
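Before changing anything, it helps to know what the interconnect could deliver at best. The sketch below estimates a bandwidth-only lower bound for a ring AllReduce, in which each GPU sends and receives 2(n-1)/n times the message size, and computes scaling efficiency from measured throughput. Latency and overlap with compute are ignored, and the gradient size, link speed, and throughput numbers are hypothetical.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only lower bound on ring AllReduce time.

    Each GPU sends and receives 2 * (n - 1) / n of the message in total.
    """
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / link_bytes_per_s


def scaling_efficiency(throughput_n, throughput_1, num_gpus):
    """Measured throughput on N GPUs relative to ideal linear scaling."""
    return throughput_n / (num_gpus * throughput_1)


# Hypothetical: 1e9 parameters with fp16 gradients (2 bytes each),
# slowest link at 25 GB/s (for example a 200 Gb/s network interface).
GRAD_BYTES = 1_000_000_000 * 2
LINK = 25e9

t16 = ring_allreduce_seconds(GRAD_BYTES, 16, LINK)
t32 = ring_allreduce_seconds(GRAD_BYTES, 32, LINK)
eff = scaling_efficiency(throughput_n=2240.0, throughput_1=100.0, num_gpus=32)

print(f"AllReduce lower bound: {t16:.3f} s at 16 GPUs, {t32:.3f} s at 32 GPUs")
print(f"Scaling efficiency at 32 GPUs: {eff:.0%}")
```

If the AllReduce time in the trace is far above this bound, the cause is more likely latency, stragglers, or a poor topology mapping than raw bandwidth. If it is close to the bound, reducing the bytes sent or hiding the transfer behind computation is the more promising direction.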

Tools & Frameworks

Profiling & Analysis Software

NVIDIA Nsight Systems, NVIDIA Nsight Compute, Google Cloud TPU Profiler, PyTorch Profiler (`torch.profiler`), TensorFlow Profiler

Nsight Systems provides system-wide timeline analysis (CPU/GPU sync, memory transfers, kernel launches). Nsight Compute offers deep kernel-level analysis (memory throughput, occupancy, instruction mix). TPU Profiler and framework profilers (PyTorch/TF) are essential for application-level traces and are often the first step in identifying hotspots.

Monitoring & Observability Platforms

Grafana + Prometheus (with DCGM exporter), Weights & Biases System Metrics, AWS CloudWatch / Google Cloud Monitoring

Used for continuous, production-level monitoring of GPU/TPU metrics (utilization, memory, temperature, power). Essential for detecting regressions, capacity planning, and cost anomaly detection in live deployments. DCGM (Data Center GPU Manager) exporter is a key tool for NVIDIA GPUs.
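As an illustration of the kind of check such monitoring enables, the sketch below scans Prometheus text-format metrics for underutilized GPUs. `DCGM_FI_DEV_GPU_UTIL` is the DCGM field for GPU utilization; the sample lines, label set, threshold, and function name `underutilized_gpus` are hypothetical, and in practice this rule would more likely live in a Prometheus alert than in a script.

```python
import re

# Hypothetical scrape of a DCGM exporter endpoint (labels trimmed).
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 93
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 67
"""

METRIC_LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def underutilized_gpus(text, threshold_pct=50.0):
    """Return (hostname, gpu index, utilization) for GPUs below the threshold."""
    flagged = []
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if not match or match["name"] != "DCGM_FI_DEV_GPU_UTIL":
            continue  # comments and other metrics are ignored
        labels = dict(LABEL_PAIR.findall(match["labels"]))
        value = float(match["value"])
        if value < threshold_pct:
            flagged.append((labels.get("Hostname"), labels.get("gpu"), value))
    return flagged


flagged = underutilized_gpus(SAMPLE_METRICS)
print(flagged)
```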

Conceptual Frameworks & Methodologies

Roofline Model Analysis, Compute vs. Memory Bound Analysis, Occupancy Calculators (NVIDIA)

The Roofline Model is an analytical framework for determining whether a kernel is limited by compute capacity or by memory bandwidth. Occupancy calculators help tune kernel launch parameters (blocks, threads per block, shared memory) to maximize the number of resident warps per SM. Apart from the calculators, these are mental models for diagnosis rather than software.
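The roofline bound itself is a one-line formula: attainable FLOP/s is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below applies it to two kernels. The peak figures are assumptions, roughly the published FP32 and HBM numbers for an A100 40GB, and should be replaced with the values for your hardware and precision.

```python
PEAK_FLOPS = 19.5e12       # assumed peak FP32 throughput, FLOP/s
PEAK_BANDWIDTH = 1.555e12  # assumed peak memory bandwidth, bytes/s


def attainable_flops(arithmetic_intensity, peak_flops=PEAK_FLOPS,
                     peak_bandwidth=PEAK_BANDWIDTH):
    """Roofline bound: min(compute roof, intensity * bandwidth roof)."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def classify(arithmetic_intensity, peak_flops=PEAK_FLOPS,
             peak_bandwidth=PEAK_BANDWIDTH):
    """Compare intensity (FLOP/byte) with the ridge point of the roofline."""
    ridge = peak_flops / peak_bandwidth
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"


# float32 vector add: 1 FLOP per 12 bytes moved (two reads, one write).
vector_add_ai = 1 / 12
# Hypothetical well-tiled kernel with 50 FLOP per byte of memory traffic.
tiled_ai = 50.0

for name, ai in [("vector add", vector_add_ai), ("tiled kernel", tiled_ai)]:
    bound = attainable_flops(ai)
    print(f"{name}: {classify(ai)}, at most {bound / 1e12:.2f} TFLOP/s")
```

A kernel measured well below its roofline bound has headroom that profiling can explain. A kernel sitting on the bandwidth roof can only get faster by moving fewer bytes.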

Interview Questions

Answer Strategy

The interviewer is testing a systematic approach and knowledge of the profiling stack. Use a layered strategy: start with system-level tools, then drill down. Sample Answer: 'I would start with `nvidia-smi` to check for thermal throttling or memory errors. Then, I'd use PyTorch's Profiler to generate a system trace. The first metrics I examine are GPU utilization percentage and the timeline for gaps between kernel executions. Large gaps often point to CPU-side bottlenecks or inefficient data loading. Simultaneously, I'd look at the communication profiler to see if AllReduce operations are stalling, which would indicate a network or synchronization issue in the distributed setup.'

Answer Strategy

This tests the ability to interpret profiling data and translate it into action. The core competency is diagnosing memory-bound kernels and knowing the optimization levers. Sample Answer: 'This indicates the kernel is memory-bandwidth bound, not compute-bound. My diagnosis is low arithmetic intensity: the kernel performs too few compute operations per byte of memory traffic. My first three optimization steps would be: 1. Increase data reuse by leveraging shared memory or tiling to reduce global memory accesses. 2. Ensure memory coalescing to maximize the utilization of each memory transaction. 3. Consider a more compact data format (e.g., half precision/fp16) to halve the memory traffic, which is often the quickest win.'
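The effect of the first and third levers can be estimated on paper. The sketch below uses a simplified model of a matrix multiply in which each loaded element of A and B is reused `tile` times from shared memory. It ignores the output write and hardware caches, and the function name `matmul_intensity` is hypothetical.

```python
def matmul_intensity(tile, bytes_per_element):
    """FLOPs per byte of global-memory traffic for a tiled matrix multiply.

    tile=1 models the naive kernel: every multiply-add loads one element
    of A and one of B from global memory. With tile x tile shared-memory
    tiles, each loaded element is reused `tile` times.
    """
    flops_per_step = 2                 # one multiply and one add
    loads_per_step = 2 / tile          # A and B elements, amortized by reuse
    return flops_per_step / (loads_per_step * bytes_per_element)


naive_fp32 = matmul_intensity(tile=1, bytes_per_element=4)
tiled_fp32 = matmul_intensity(tile=32, bytes_per_element=4)
tiled_fp16 = matmul_intensity(tile=32, bytes_per_element=2)

print(f"naive fp32:  {naive_fp32} FLOP/byte")
print(f"32x32 fp32:  {tiled_fp32} FLOP/byte")
print(f"32x32 fp16:  {tiled_fp16} FLOP/byte")
```

Under this model, tiling raises arithmetic intensity in proportion to the tile size, and halving the element size doubles it again, which is why both appear among the first steps in the sample answer.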