AI Inference Optimization Engineer
An AI Inference Optimization Engineer specializes in making trained AI models faster, cheaper, and more efficient when serving pre…
Skill Guide
Deep knowledge of GPU hardware execution models (SIMT, memory hierarchy, warp scheduling), combined with the ability to write, analyze, and refactor CUDA C/C++ code to maximize computational throughput and memory bandwidth utilization.
Scenario
You have two large 1D arrays (N=10^8 elements). You need to perform element-wise addition on the GPU.
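One way to approach this scenario is sketched below, assuming `float` data and a grid-stride loop so that a fixed-size grid covers any N. The helper `run_vector_add`, the block size of 256, and the cap of 4096 blocks are illustrative assumptions, not values given by the scenario. At N=10^8 the three arrays need roughly 1.2 GB of device memory, and the kernel is bound by memory bandwidth rather than arithmetic.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: a fixed-size grid walks the whole array. Adjacent threads
// touch adjacent elements, so global memory accesses are coalesced.
__global__ void vector_add(const float* a, const float* b, float* c, size_t n) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns true
// only if every output element equals the CPU result.
bool run_vector_add(size_t n) {
    std::vector<float> h_a(n), h_b(n), h_c(n, -1.0f);
    for (size_t i = 0; i < n; ++i) {
        h_a[i] = float(i % 1000);          // small integers: sums are exact in float
        h_b[i] = 2.0f * float(i % 1000);
    }
    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);
        const int threads = 256;                           // assumed block size
        size_t needed = (n + threads - 1) / threads;
        int blocks = needed < 4096 ? (int)needed : 4096;   // assumed cap; tune per GPU
        vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        // This copy runs after the kernel on the default stream.
        ok = cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (size_t i = 0; ok && i < n; ++i) ok = (h_c[i] == h_a[i] + h_b[i]);
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return ok;
}
```

Calling `run_vector_add(100000000)` from `main` exercises the full scenario size; a smaller N is enough to check correctness.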
Scenario
Implement a high-performance matrix multiplication (C = A * B) for large square matrices.
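A common first step for this scenario is shared-memory tiling, sketched below for row-major square matrices. The tile size of 16 and the helper `run_matmul` are assumptions for illustration; this is a teaching baseline and is not expected to approach the throughput of a tuned library.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile edge; the kernel expects TILE x TILE thread blocks

// C = A * B for row-major n x n matrices. Each block stages one tile of A and
// one tile of B in shared memory, so each global element is loaded once per
// block instead of once per thread.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites it
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: multiplies two n x n test matrices on the GPU and
// returns true if the result matches a straightforward CPU triple loop.
bool run_matmul(int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    bool ok = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int r = 0; ok && r < n; ++r)
        for (int c = 0; ok && c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            ok = (hC[r * n + c] == ref);  // inputs are small integers, so sums are exact
        }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}
```

Further steps usually discussed for this exercise, such as register blocking, vectorized loads, and tensor cores, are deliberately left out of the sketch.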
Scenario
Optimize a pipeline that applies a 3x3 Gaussian blur, followed by edge detection (Sobel filter), to a 4K video stream at 60 fps.
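An unoptimized baseline for this pipeline might look like the sketch below, which assumes a single-channel `float` frame and clamped borders. The kernel names and the helper `run_blur_sobel` are hypothetical. The point it illustrates is that the blurred frame stays on the device between the two kernels; kernel fusion, shared-memory tiling, and overlapping transfers with streams are the optimizations the scenario asks for and are not shown here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reads a pixel, clamping coordinates to the image border.
__device__ inline float load_clamped(const float* img, int x, int y, int w, int h) {
    x = min(max(x, 0), w - 1);
    y = min(max(y, 0), h - 1);
    return img[y * w + x];
}

// 3x3 Gaussian blur with the kernel [1 2 1; 2 4 2; 1 2 1] / 16.
__global__ void gaussian_blur3x3(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    const float k[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += k[dy + 1][dx + 1] * load_clamped(in, x + dx, y + dy, w, h);
    out[y * w + x] = sum / 16.0f;
}

// Sobel gradient magnitude sqrt(gx^2 + gy^2).
__global__ void sobel_magnitude(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float p00 = load_clamped(in, x - 1, y - 1, w, h), p01 = load_clamped(in, x, y - 1, w, h);
    float p02 = load_clamped(in, x + 1, y - 1, w, h), p10 = load_clamped(in, x - 1, y, w, h);
    float p12 = load_clamped(in, x + 1, y, w, h),     p20 = load_clamped(in, x - 1, y + 1, w, h);
    float p21 = load_clamped(in, x, y + 1, w, h),     p22 = load_clamped(in, x + 1, y + 1, w, h);
    float gx = (p02 + 2.0f * p12 + p22) - (p00 + 2.0f * p10 + p20);
    float gy = (p20 + 2.0f * p21 + p22) - (p00 + 2.0f * p01 + p02);
    out[y * w + x] = sqrtf(gx * gx + gy * gy);
}

// Hypothetical helper: builds a test frame that is 0 left of edge_x and 1 from
// edge_x onward, runs blur then Sobel, and returns the largest edge response.
float run_blur_sobel(int w, int h, int edge_x) {
    std::vector<float> img(w * h), edges(w * h, 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) img[y * w + x] = (x >= edge_x) ? 1.0f : 0.0f;
    size_t bytes = size_t(w) * h * sizeof(float);
    float *d_in = nullptr, *d_blur = nullptr, *d_edges = nullptr;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_blur, bytes); cudaMalloc(&d_edges, bytes);
    cudaMemcpy(d_in, img.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    gaussian_blur3x3<<<grid, block>>>(d_in, d_blur, w, h);   // intermediate stays on the GPU
    sobel_magnitude<<<grid, block>>>(d_blur, d_edges, w, h);
    cudaMemcpy(edges.data(), d_edges, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_blur); cudaFree(d_edges);
    float max_edge = 0.0f;
    for (float v : edges) if (v > max_edge) max_edge = v;
    return max_edge;
}
```

A uniform frame (`edge_x = 0`) should produce no edge response, while a frame with a vertical step should produce a positive one. For a 4K frame the same launch code applies with `w = 3840` and `h = 2160`.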
Nsight Compute is the primary tool for kernel-level performance analysis, showing metrics like SM occupancy, memory throughput, and instruction mix. Nsight Systems provides a timeline view for identifying system-level bottlenecks like kernel launch latency or memory transfer stalls. The CUDA Toolkit and its binary utilities (such as `cuobjdump` and `nvdisasm`) are essential for compiling, inspecting PTX/SASS assembly, and understanding low-level code generation.
Use libraries such as cuBLAS and Thrust as benchmarks and as high-performance building blocks. Writing your own kernel to beat cuBLAS is a master-level exercise; understanding *why* cuBLAS is faster (e.g., its use of tensor cores and specialized memory access patterns) is the real learning objective. Thrust provides a high-level, STL-like interface for parallel operations and is excellent for rapid prototyping.
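To give a sense of the STL-like style, here is a minimal Thrust sketch that adds two device vectors and sums the result without any hand-written kernel. The function name `thrust_add_and_sum` and the test data are assumptions made for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Hypothetical helper: computes sum(a[i] + b[i]) on the GPU,
// where a = 0, 1, ..., n-1 and b is filled with ones.
long long thrust_add_and_sum(int n) {
    thrust::device_vector<int> a(n), b(n, 1), c(n);
    thrust::sequence(a.begin(), a.end());  // a = 0, 1, ..., n-1
    // Element-wise c = a + b, executed on the device.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<int>());
    // Parallel reduction, accumulating in 64 bits to avoid overflow.
    return thrust::reduce(c.begin(), c.end(), 0LL, thrust::plus<long long>());
}
```

For `n = 1000` the result is 499500 + 1000 = 500500. A prototype like this is also a convenient correctness reference when a hand-written kernel is developed later.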
Answer Strategy
The candidate must articulate the hierarchy: registers (fastest, per-thread), shared memory (on-chip, per-block, programmer-managed), L1/L2 caches (hardware-managed), and global memory (slowest, off-chip). The key insight is that shared memory is a *scratchpad* explicitly managed by the programmer: it is declared with `__shared__`, and accesses to it are synchronized across the block with `__syncthreads()`. It's used for data that will be reused by multiple threads in a block to avoid redundant global memory fetches.
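The reuse argument can be illustrated with a 1D stencil, sketched below under the assumption of a radius of 3 and a block size of 256. Each input element is needed by up to seven threads, but it is fetched from global memory only once per block and then read from the shared-memory tile. The helper `run_stencil` is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;   // assumed block size; the kernel must be launched with it
constexpr int RADIUS = 3;    // assumed stencil radius

// out[i] = sum of in[i - RADIUS .. i + RADIUS], treating out-of-range inputs as 0.
__global__ void stencil_1d(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];      // block's elements plus halo on each side
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile
    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {                     // first RADIUS threads also load the halos
        int left = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[l + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // every thread reaches this barrier; the tile is complete after it
    if (g < n) {
        float sum = 0.0f;
        for (int off = -RADIUS; off <= RADIUS; ++off) sum += tile[l + off];
        out[g] = sum;
    }
}

// Hypothetical host helper: applies the stencil to an all-ones array of length n.
std::vector<float> run_stencil(int n) {
    std::vector<float> h_in(n, 1.0f), h_out(n, -1.0f);
    size_t bytes = size_t(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    stencil_1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return h_out;
}
```

With an all-ones input, interior outputs equal 7 and the two end elements equal 4, including at block boundaries such as indices 255 and 256. Note that the bounds check comes after `__syncthreads()`, so no thread skips the barrier.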
Answer Strategy
This tests systematic problem-solving. The answer should outline: 1) Use Nsight Compute to identify the limiting factor (registers per thread vs. shared memory per block vs. thread blocks per SM). 2) Based on the bottleneck, apply the matching remedy: reduce register pressure via `__launch_bounds__` or by limiting loop unrolling (aggressive unrolling usually increases register use), reduce shared memory allocation, or adjust the number of threads per block (within hardware limits). 3) Re-profile to validate the improvement, noting that peak occupancy is not the goal; maximizing throughput is. Lower occupancy with higher ILP or better memory access patterns is sometimes faster.
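Alongside the profiler, theoretical occupancy can be estimated in code with the CUDA runtime occupancy API, as in the sketch below. The kernel `scale_kernel` and both helper functions are hypothetical, and the launch bound of 256 is an assumed value; the numbers returned are upper bounds from resource limits, not measured occupancy.

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(256) promises the compiler that this kernel is never
// launched with more than 256 threads per block, which lets it budget
// registers per thread accordingly.
__global__ void __launch_bounds__(256) scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: theoretical occupancy in (0, 1] for a given block size,
// computed as resident threads per SM divided by the SM's thread limit.
float theoretical_occupancy(int block_size) {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, scale_kernel,
                                                  block_size, 0 /* dynamic smem bytes */);
    return float(blocks_per_sm * block_size) / float(prop.maxThreadsPerMultiProcessor);
}

// Hypothetical helper: block size the runtime suggests for maximum occupancy.
int suggested_block_size() {
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, scale_kernel, 0, 0);
    return block_size;
}
```

Comparing `theoretical_occupancy(64)`, `theoretical_occupancy(128)`, and `theoretical_occupancy(256)` shows how block size alone changes the bound; whether a higher value actually improves throughput still has to be confirmed by re-profiling.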