AI Inference Optimization Engineer
An AI Inference Optimization Engineer specializes in making trained AI models faster, cheaper, and more efficient when serving pre…
Skill Guide
The systematic process of using specialized software to measure, analyze, and visualize the runtime performance of machine learning models, with the goal of precisely identifying computational, memory, or communication bottlenecks.
Scenario
You are given a standard ResNet-50 model pre-trained on ImageNet. Your task is to identify the top 3 most time-consuming operations during inference on a single GPU.
Scenario
Your training script is showing low GPU utilization (~40%). Profiling reveals large gaps between GPU kernels and significant CPU activity. You suspect the data pipeline is the bottleneck.
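One quick way to test the data-pipeline suspicion before reaching for a full profiler is to time how long each iteration waits for the next batch versus how long the step itself takes. The sketch below is a minimal illustration using only the standard library; `measure_data_wait`, `slow_loader`, and `fast_step` are hypothetical names, and the sleeping loader merely stands in for a real data pipeline.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_data_wait(batches: Iterable, step: Callable[[object], None]) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent inside step)."""
    data_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)                   # time spent in the training/inference step
        t2 = time.perf_counter()
        data_s += t1 - t0
        step_s += t2 - t1
    return data_s, step_s


def slow_loader(n: int, delay_s: float = 0.01):
    """Hypothetical stand-in for a slow data loader."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


def fast_step(batch) -> None:
    """Hypothetical stand-in for the model step (trivial CPU work)."""
    sum(range(1000))


data_s, step_s = measure_data_wait(slow_loader(5), fast_step)
wait_fraction = data_s / (data_s + step_s)
print(f"share of loop time spent waiting on data: {wait_fraction:.0%}")
```

With a real GPU workload the step usually returns before the kernels finish, so the step would need to synchronize with the device before the second timestamp for these numbers to be meaningful. A high wait fraction supports the data-pipeline hypothesis; a low one points elsewhere.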
Scenario
A distributed training job using DDP (DistributedDataParallel) scales poorly from 4 to 8 GPUs: doubling the GPU count should give close to a 2x speedup, but the measured speedup is only 1.3x.
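The scaling numbers in this scenario can be turned into a single efficiency figure, which makes it easier to judge how much time is being lost. The following is a small sketch of that arithmetic; the function names are illustrative, and it assumes the 4-GPU run is the baseline and that ideal scaling is linear in GPU count.

```python
def scaling_efficiency(measured_speedup: float, gpus_before: int, gpus_after: int) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal_speedup = gpus_after / gpus_before
    return measured_speedup / ideal_speedup


def overhead_share(measured_speedup: float, gpus_before: int, gpus_after: int) -> float:
    """Fraction of the scaled-up step time that exceeds the ideal step time."""
    return 1.0 - scaling_efficiency(measured_speedup, gpus_before, gpus_after)


eff = scaling_efficiency(1.3, gpus_before=4, gpus_after=8)
lost = overhead_share(1.3, gpus_before=4, gpus_after=8)
print(f"scaling efficiency: {eff:.0%}, time beyond ideal: {lost:.0%}")
```

Here the efficiency is 1.3 / 2 = 0.65, so about 35% of each 8-GPU step is time that ideal scaling would not have spent. This figure does not say where the time goes; a timeline view is still needed to tell communication, stragglers, or input stalls apart.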
PyTorch Profiler: The integrated, high-level starting point.
Nsight Systems (nsys): A system-level view of CPU and GPU activity on a shared timeline.
Nsight Compute (ncu): The deep-dive tool for kernel-level hardware metrics.
chrome://tracing: A widely used viewer for trace files exported in the Chrome trace format, such as those written by PyTorch Profiler.
Top-Down Analysis: Start with the end-to-end trace, identify the largest block of time, and drill down into it.
Roofline Model: A framework for determining whether a kernel is memory-bound or compute-bound by comparing its operational intensity (FLOPs per byte of memory traffic) to the hardware's peak compute and memory bandwidth.
Amdahl's Law: Use it to estimate the theoretical maximum overall speedup from optimizing one part of the system, which helps prioritize effort.
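As a worked illustration of Amdahl's Law, the sketch below computes the overall speedup when a fraction p of the runtime is accelerated by a factor s, using speedup = 1 / ((1 - p) + p / s). The 40% share and 2x factor are made-up numbers chosen for the example, not measurements.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on overall speedup as s grows without limit (requires p < 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


# Hypothetical case: one kernel takes 40% of end-to-end time.
overall = amdahl_speedup(p=0.4, s=2.0)   # kernel made 2x faster
ceiling = amdahl_limit(p=0.4)            # kernel made infinitely fast
print(f"overall speedup: {overall:.2f}x, best possible: {ceiling:.2f}x")
```

Doubling the speed of a kernel that takes 40% of the time yields only a 1.25x end-to-end gain, and no amount of work on that kernel can exceed roughly 1.67x. That ceiling is a useful check on whether a candidate optimization is worth the effort.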
Answer Strategy
The candidate must demonstrate a systematic, drill-down approach using the correct tools. They should avoid jumping to code changes. 'I would use Nsight Compute (ncu) to profile that specific kernel in isolation. I'd run it with metrics focused on memory bandwidth, SM occupancy, and warp execution efficiency. This tells me if the kernel is memory-bound (low arithmetic intensity, high DRAM traffic) or compute-bound, and whether it's underutilizing the GPU's streaming multiprocessors. Only after this analysis would I consider specific optimizations like improving memory access patterns or increasing parallelism.'
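The memory-bound versus compute-bound distinction in this answer can be illustrated with a simple roofline calculation. The sketch below is a simplified model, not a substitute for measured profiler metrics: the `Device` figures (10 TFLOP/s peak, 1 TB/s bandwidth) are hypothetical round numbers, and the kernel FLOP and byte counts are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float      # peak compute throughput, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s


def ridge_point(dev: Device) -> float:
    """Operational intensity (FLOP/byte) where the two roofs meet."""
    return dev.peak_flops / dev.mem_bandwidth


def attainable_flops(dev: Device, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(dev.peak_flops, intensity * dev.mem_bandwidth)


def classify(dev: Device, flops: float, bytes_moved: float) -> str:
    """Label a kernel by comparing its intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "memory-bound" if intensity < ridge_point(dev) else "compute-bound"


# Hypothetical device: 10 TFLOP/s peak, 1 TB/s memory bandwidth.
dev = Device(peak_flops=10e12, mem_bandwidth=1e12)

# Assumed kernel: float32 elementwise add does 1 FLOP per 12 bytes
# (two 4-byte reads and one 4-byte write per element).
print(classify(dev, flops=1.0, bytes_moved=12.0))
print(f"attainable at 2 FLOP/byte: {attainable_flops(dev, 2.0):.1e} FLOP/s")
```

A kernel far below the ridge point gains little from more arithmetic throughput, so the effort goes into reducing or coalescing memory traffic; a kernel above it is limited by compute instead.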
Answer Strategy
This tests the ability to formulate a complete investigation plan. The answer should follow a logical sequence. 'First, I'd clarify the performance target: latency, throughput, or cost. Then, I'd start with PyTorch Profiler on a representative batch to get a high-level CPU/CUDA timeline. I'd look for data loading stalls, excessive CPU-side Python overhead, or CUDA synchronization gaps. If the GPU is busy but slow, I'd use Nsight Systems to see if kernels are overlapping with communication or host tasks. Finally, I'd select the top 2-3 GPU kernels by time and profile them with Nsight Compute to identify hardware-level bottlenecks like memory latency or occupancy issues.'