Skill Guide

Model profiling and bottleneck identification (Nsight, PyTorch Profiler)

The systematic process of using specialized software to measure, analyze, and visualize the runtime performance of machine learning models, with the goal of precisely identifying computational, memory, or communication bottlenecks.

This skill directly translates to reduced cloud compute costs and faster iteration cycles by eliminating performance guesswork. It enables engineering teams to ship optimized, production-ready models within tight operational budgets and latency requirements.

How to Learn Model profiling and bottleneck identification (Nsight, PyTorch Profiler)

1. Master PyTorch's basic execution model: understand how tensors flow through operations on the GPU.
2. Learn to use `torch.profiler.profile()` to generate a basic trace of CPU and CUDA activities.
3. Familiarize yourself with interpreting the `chrome://tracing` viewer to identify obvious CPU-side stalls or gaps in CUDA kernel execution.
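
The first two steps can be tried on a toy model before touching real code. The following is a minimal sketch, not a reference workflow: the two-layer model, the batch shape, the `forward_pass` label, and the `toy_trace.json` file name are all made up for illustration, and the CUDA activity is only requested when a GPU is present.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Fall back to CPU-only profiling when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Hypothetical toy model and input, used only to produce some operations.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to(device).eval()
x = torch.randn(32, 256, device=device)

with torch.no_grad():
    model(x)  # warm-up, so one-time initialization is not profiled
    with profile(activities=activities) as prof:
        # record_function adds a named range that shows up in the trace.
        with record_function("forward_pass"):
            model(x)

# Write a Chrome-trace JSON file that chrome://tracing can load.
trace_path = os.path.join(tempfile.gettempdir(), "toy_trace.json")
prof.export_chrome_trace(trace_path)
print("trace written to", trace_path)
```

Loading the resulting file in `chrome://tracing` shows the `forward_pass` range on the CPU track, with CUDA kernels on a separate track when a GPU was used.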

1. Move beyond basic traces: correlate GPU kernel execution with memory transfers using Nsight Systems (nsys).
2. Analyze kernel-level performance metrics with Nsight Compute (ncu) to understand SM occupancy, memory bandwidth, and arithmetic intensity.
3. Common mistake: optimizing a kernel that is not the bottleneck. Always profile end-to-end first, then drill down.
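
Arithmetic intensity is easier to reason about with numbers in hand. The sketch below classifies a kernel as memory-bound or compute-bound in the roofline sense. The hardware peaks (100 TFLOP/s, 2 TB/s) are invented round numbers, not the specification of any real GPU, and the FLOP and byte counts are idealized estimates that ignore caching.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (intensity, attainable FLOP/s, verdict) for one kernel."""
    intensity = flops / bytes_moved        # FLOP per byte of DRAM traffic
    ridge = peak_flops / peak_bandwidth    # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    verdict = "memory-bound" if intensity < ridge else "compute-bound"
    return intensity, attainable, verdict


# Hypothetical hardware limits (assumptions for illustration only).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 2e12    # 2 TB/s

# Elementwise fp32 add: 1 FLOP per element, 12 bytes (two reads, one write).
n = 1 << 20
add_result = roofline(n, 12 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# Square fp16 matmul: 2*m^3 FLOPs, three m x m matrices of 2-byte elements.
m = 4096
matmul_result = roofline(2 * m**3, 3 * m * m * 2, PEAK_FLOPS, PEAK_BANDWIDTH)

print("add:   ", add_result)
print("matmul:", matmul_result)
```

With these assumed peaks the ridge point sits at 50 FLOP/byte, so the elementwise add lands far to the left of it and the matmul far to the right. In practice the FLOP and byte counts would come from Nsight Compute measurements rather than hand estimates.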

1. Profile and optimize distributed training across multiple GPUs/nodes, identifying communication/computation overlap issues.
2. Integrate profiling into CI/CD pipelines to track performance regressions.
3. Mentor teams on establishing a culture of performance-first development, aligning profiling efforts with business-critical metrics like cost-per-inference or training-time-to-accuracy.
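
One simple way to track regressions in CI is to compare a measured step time against a stored baseline. The sketch below assumes the pipeline already collects step times; the function name `check_step_time`, the 5% tolerance, and all timing values are hypothetical.

```python
from statistics import median


def check_step_time(baseline_ms, samples_ms, tolerance=0.05):
    """Compare the median measured step time against a stored baseline.

    Returns (ok, measured_ms, relative_change). The median is used so a
    single noisy sample does not fail the build.
    """
    measured_ms = median(samples_ms)
    change = (measured_ms - baseline_ms) / baseline_ms
    return change <= tolerance, measured_ms, change


# Made-up numbers: a 120 ms baseline and five fresh measurements.
ok, measured, change = check_step_time(
    120.0, [118.9, 121.4, 131.0, 130.2, 129.8]
)
print(f"median={measured:.1f} ms, change={change:+.1%}, ok={ok}")
```

A CI job could fail the build when `ok` is false. The right tolerance depends on how noisy the benchmark machine is.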

Practice Projects

Beginner

Project

Profile a Pre-trained Image Classification Model

Scenario

You are given a standard ResNet-50 model pre-trained on ImageNet. Your task is to identify the top 3 most time-consuming operations during inference on a single GPU.

How to Execute

1. Wrap the model's forward pass in `torch.profiler.profile()` with `activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]`.
2. After a few warm-up iterations, run inference on a batch of dummy images under the profiler and export the trace.
3. Rank operations by GPU time with `prof.key_averages().table(sort_by="cuda_time_total")`, and load the exported trace into `chrome://tracing` to inspect the timeline.
4. Identify and document the top 3 kernels (e.g., `gemm`, `cudnn convolution`).
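
The ranking step might look like the sketch below. To stay self-contained it uses a small convolutional stand-in instead of ResNet-50 (an assumption; swap in the real model for the project), and it sorts by CPU self time when no GPU is available.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical stand-in for ResNet-50, kept tiny so the sketch runs anywhere.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).to(device).eval()
images = torch.randn(8, 3, 64, 64, device=device)  # dummy image batch

activities = [ProfilerActivity.CPU]
if use_cuda:
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    for _ in range(3):  # warm-up iterations are left out of the profile
        model(images)
    with profile(activities=activities) as prof:
        model(images)

# Aggregate events by operator name and keep the three most expensive rows.
sort_key = "self_cuda_time_total" if use_cuda else "self_cpu_time_total"
top3_table = prof.key_averages().table(sort_by=sort_key, row_limit=3)
print(top3_table)
```

Sorting by self time rather than total time keeps wrapper operators, whose time is mostly spent in their children, from crowding out the kernels that do the work.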

Intermediate

Project

Optimize Data Loading Pipeline Bottleneck

Scenario

Your training script is showing low GPU utilization (~40%). Profiling reveals large gaps between GPU kernels and significant CPU activity. You suspect the data pipeline is the bottleneck.

How to Execute

1. Use PyTorch Profiler's `with_stack=True` to record Python stack frames during data loading. The profiler only sees the process it runs in, so set `num_workers=0` for this run to keep the loading work in the main process.
2. Analyze the trace to identify slow CPU operations (e.g., complex augmentations, slow disk I/O).
3. Implement a solution: increase `num_workers`, switch to a faster library like NVIDIA DALI, or pre-process the dataset.
4. Re-profile to validate that GPU utilization has increased to >80%.
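
The gaps between GPU kernels can also be measured directly from an exported trace. The sketch below rests on two assumptions: the events follow the Chrome trace format (`ph`, `ts`, `dur` fields, times in microseconds), and kernel events are tagged with `cat == "kernel"`, which may differ between profiler versions. The event list is made-up sample data, not real profiler output.

```python
def gpu_busy_fraction(events):
    """Return (busy_fraction, gaps) over the span covered by kernel events.

    Assumes Chrome-trace style dicts; kernels are complete events
    (ph == "X") whose category is "kernel".
    """
    intervals = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in events
        if e.get("ph") == "X" and e.get("cat") == "kernel"
    )
    if not intervals:
        return 0.0, []

    busy = 0
    gaps = []
    first_start = intervals[0][0]
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start > cur_end:
            # Idle time between two kernels: close the current busy block.
            busy += cur_end - cur_start
            gaps.append((cur_end, start))
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back kernels extend the busy block.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    return busy / (cur_end - first_start), gaps


# Made-up events: three kernels and one CPU op that should be ignored.
sample_events = [
    {"ph": "X", "cat": "kernel", "name": "conv", "ts": 0, "dur": 40},
    {"ph": "X", "cat": "cpu_op", "name": "augment", "ts": 40, "dur": 60},
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 100, "dur": 40},
    {"ph": "X", "cat": "kernel", "name": "relu", "ts": 140, "dur": 20},
]
fraction, gaps = gpu_busy_fraction(sample_events)
print(f"GPU busy {fraction:.1%} of the traced span, gaps: {gaps}")
```

For a real trace, the list could come from `json.load(f)["traceEvents"]`. The busy fraction is only a rough proxy for the utilization figure a monitoring tool reports, but it makes before-and-after comparisons concrete.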

Advanced

Project

Multi-GPU Training Performance Root-Cause Analysis

Scenario

A distributed training job using DDP (DistributedDataParallel) scales poorly from 4 to 8 GPUs. The expected 2x speedup is only 1.3x.

How to Execute

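1. Use Nsight Systems (`nsys profile`) to capture a timeline of CUDA kernels, collective communication, and memory transfers across the GPUs. The `--mpi-impl openmpi` option applies only when the job is launched under Open MPI with MPI tracing enabled; DDP usually communicates through NCCL, whose collectives appear as CUDA kernels in the trace. NVLink/PCIe throughput can be added through GPU metrics sampling where the hardware supports it.
2. Analyze the trace in the Nsight Systems GUI: look for `AllReduce` or `AllGather` operations that are not overlapping with backward pass computation.
3. Use Nsight Compute to profile a single kernel from the critical path, checking whether memory-related stall reasons dominate.
4. If the collectives are exposed, the likely root cause is insufficient overlap. Adjust `bucket_cap_mb` in DDP, or restructure the model so that each communication step is hidden behind a larger chunk of computation.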
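
Before opening a trace, it helps to quantify how far the job is from ideal scaling. The sketch below turns the scenario's numbers (1.3x speedup for 2x the GPUs) into a scaling efficiency, and inverts Amdahl's Law to estimate what share of the 4-GPU step time actually scaled. The function name is hypothetical, and the Amdahl inversion is a simplifying assumption: it treats the step as one perfectly scaling part plus one part that does not scale at all.

```python
def scaling_report(speedup, resource_ratio):
    """Return (efficiency, scalable_fraction) for a measured speedup.

    efficiency: achieved speedup divided by the ideal speedup.
    scalable_fraction: p solved from speedup = 1 / ((1 - p) + p / r),
    i.e. the share of the original step time that scaled with resources.
    """
    efficiency = speedup / resource_ratio
    scalable_fraction = (1 - 1 / speedup) / (1 - 1 / resource_ratio)
    return efficiency, scalable_fraction


# Scenario numbers: going from 4 to 8 GPUs (2x) gave a 1.3x speedup.
efficiency, scalable = scaling_report(speedup=1.3, resource_ratio=2.0)
print(f"scaling efficiency: {efficiency:.0%}")
print(f"share of step time that scaled: {scalable:.0%}")
print(f"share that did not scale: {1 - scalable:.0%}")
```

Under this simplified model, more than half of the 4-GPU step time did not shrink when the GPU count doubled. That share is where to look in the Nsight Systems timeline, starting with exposed collectives.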

Tools & Frameworks

Software & Platforms

PyTorch Profiler, Nsight Systems (nsys), Nsight Compute (ncu), chrome://tracing

PyTorch Profiler is the integrated, high-level starting point. Nsight Systems (nsys) provides a holistic, system-level view of CPU/GPU timelines. Nsight Compute (ncu) is the deep-dive tool for kernel-level hardware metrics. chrome://tracing is a viewer for traces exported in the Chrome trace (JSON) format, such as those written by PyTorch Profiler; Nsight reports open in their own GUIs.

Mental Models & Methodologies

Top-Down Performance Analysis, Roofline Model, Amdahl's Law

Top-Down Analysis: Start with the end-to-end trace, identify the largest time block, drill down. Roofline Model: A framework to determine if a kernel is memory-bound or compute-bound by comparing operational intensity to hardware limits. Amdahl's Law: Use to calculate the theoretical maximum speedup by optimizing a specific part of the system, guiding effort prioritization.
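
Amdahl's Law is short enough to keep as a helper when deciding where to spend optimization effort. The sketch below compares two hypothetical candidates; the time fractions and local speedups are invented for illustration.

```python
def amdahl_speedup(fraction, local_speedup):
    """End-to-end speedup when `fraction` of runtime gets `local_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Candidate A: a kernel that is 20% of step time, made 4x faster.
speedup_a = amdahl_speedup(0.20, 4.0)
# Candidate B: a data pipeline that is 60% of step time, made 1.5x faster.
speedup_b = amdahl_speedup(0.60, 1.5)
# Upper bound for candidate A even if the kernel took no time at all.
bound_a = 1.0 / (1.0 - 0.20)

print(f"A: {speedup_a:.3f}x (cannot exceed {bound_a:.2f}x)")
print(f"B: {speedup_b:.3f}x")
```

With these numbers, the modest 1.5x gain on the large component beats the 4x gain on the small one. This is the arithmetic behind profiling end-to-end before drilling down.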

Interview Questions

Answer Strategy

The candidate must demonstrate a systematic, drill-down approach using the correct tools. They should avoid jumping to code changes. 'I would use Nsight Compute (ncu) to profile that specific kernel in isolation. I'd run it with metrics focused on memory bandwidth, SM occupancy, and warp execution efficiency. This tells me if the kernel is memory-bound (low arithmetic intensity, high DRAM traffic) or compute-bound, and whether it's underutilizing the GPU's streaming multiprocessors. Only after this analysis would I consider specific optimizations like improving memory access patterns or increasing parallelism.'

Answer Strategy

This tests the ability to formulate a complete investigation plan. The answer should follow a logical sequence. 'First, I'd clarify the performance target: latency, throughput, or cost. Then, I'd start with PyTorch Profiler on a representative batch to get a high-level CPU/CUDA timeline. I'd look for data loading stalls, excessive CPU-side Python overhead, or CUDA synchronization gaps. If the GPU is busy but slow, I'd use Nsight Systems to see if kernels are overlapping with communication or host tasks. Finally, I'd select the top 2-3 GPU kernels by time and profile them with Nsight Compute to identify hardware-level bottlenecks like memory latency or occupancy issues.'