Skill Guide

GPU architecture understanding and CUDA kernel optimization

Deep knowledge of GPU hardware execution models (SIMT, memory hierarchy, warp scheduling), combined with the ability to write, analyze, and refactor CUDA C/C++ code to maximize computational throughput and memory bandwidth utilization.

This skill translates directly into lower operating costs and faster time-to-insight for compute-intensive workloads such as AI training, scientific simulation, and high-frequency trading. For kernels that start out memory-bound or poorly parallelized, a well-tuned implementation can run an order of magnitude faster than a naive baseline, which gives organizations with this expertise a meaningful competitive advantage.

How to Learn GPU architecture understanding and CUDA kernel optimization

1. Core Concepts: Understand the NVIDIA GPU architecture: the hardware units (Streaming Multiprocessor, CUDA core) and the execution hierarchy that maps onto them (thread, warp, thread block, grid).
2. Memory Model: Grasp the differences between registers and global, shared, and constant memory, including their scope and latency/bandwidth characteristics.
3. Basic Kernel Anatomy: Learn to write and launch a simple kernel using `__global__`, the `<<<grid, block>>>` launch syntax, and `cudaMalloc`/`cudaMemcpy`.
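As a minimal sketch of the kernel anatomy described above, the following program allocates device buffers, copies inputs over, launches one thread per element, and copies the result back. The names `add_one_per_thread`, `add_on_gpu`, and `CUDA_CHECK` are illustrative choices, and the block size of 256 is an assumption rather than a tuned value.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// __global__ marks a function that runs on the GPU and is launched from the host.
__global__ void add_one_per_thread(const float* a, const float* b, float* c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid < n) c[tid] = a[tid] + b[tid];            // guard the partial last block
}

// Hypothetical host wrapper: allocate, copy in, launch, copy out, free.
std::vector<float> add_on_gpu(const std::vector<float>& a, const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));
    CUDA_CHECK(cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;                           // assumed block size
    const int blocks = (n + threads - 1) / threads;    // round up to cover n
    add_one_per_thread<<<blocks, threads>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());                    // catches launch errors

    std::vector<float> c(n);
    CUDA_CHECK(cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    return c;
}

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    std::vector<float> c = add_on_gpu(a, b);
    std::printf("c[0] = %.1f, c[999] = %.1f\n", c[0], c[999]);
    return 0;
}
```

The blocking `cudaMemcpy` back to the host also waits for the kernel to finish, so no explicit synchronization is needed in this sketch.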

Move from toy examples to real-world patterns. Focus on memory coalescing to avoid scattered global memory accesses, on minimizing warp divergence caused by `if/else` branches that split the threads of a single warp, and on using shared memory to exploit data reuse and reduce global memory traffic. Common mistake: optimizing before profiling.
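To make the warp divergence point concrete, here is a small sketch with two kernels that do the same two operations but choose the branch differently. The kernel names and the helper `run_branch_kernel` are hypothetical. The branch bodies are deliberately trivial, so the compiler may turn them into predicated instructions and the timing difference may be negligible; the contrast matters more when each branch does substantial work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// Divergent: even and odd threads of the same warp take different branches,
// so the warp may have to execute both paths one after the other.
__global__ void branch_by_thread(float* x, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (tid % 2 == 0) x[tid] *= 2.0f;
    else              x[tid] += 1.0f;
}

// Uniform: all 32 threads of a warp agree on the condition.
// Assumes a 1D block whose size is a multiple of warpSize.
__global__ void branch_by_warp(float* x, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if ((tid / warpSize) % 2 == 0) x[tid] *= 2.0f;
    else                           x[tid] += 1.0f;
}

// Hypothetical helper: fills x with 3.0f, runs one kernel, returns the result.
std::vector<float> run_branch_kernel(bool by_warp, int n) {
    std::vector<float> h(n, 3.0f);
    float* d;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    const int threads = 256, blocks = (n + threads - 1) / threads;
    if (by_warp) branch_by_warp<<<blocks, threads>>>(d, n);
    else         branch_by_thread<<<blocks, threads>>>(d, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d));
    return h;
}

int main() {
    std::vector<float> t = run_branch_kernel(false, 128);
    std::vector<float> w = run_branch_kernel(true, 128);
    std::printf("by thread: x[0]=%.0f x[1]=%.0f\n", t[0], t[1]);
    std::printf("by warp:   x[31]=%.0f x[32]=%.0f\n", w[31], w[32]);
    return 0;
}
```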

Master occupancy tuning to balance register and shared memory usage against the number of active warps per SM. Analyze and optimize instruction-level parallelism (ILP). Understand the hardware scheduler, and use asynchronous memory copies (`cudaMemcpyAsync`) with streams and pinned host memory to overlap computation and data transfer. Develop expertise in reading the PTX and SASS that `nvcc` generates so you can hand-tune critical inner loops.
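One way to explore occupancy without a profiler is the runtime's occupancy API. The sketch below, assuming device 0 and a placeholder kernel named `scale`, reports the theoretical occupancy for several block sizes. The helper `theoretical_occupancy` is a hypothetical name; the result is an upper bound from resource limits, not a measured value.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// Placeholder kernel whose resource usage we want to inspect.
__global__ void scale(float* x, float s, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) x[tid] *= s;
}

// Fraction of an SM's maximum resident threads that `scale` can keep
// active at the given block size (no dynamic shared memory assumed).
double theoretical_occupancy(int block_size) {
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    int blocks_per_sm = 0;
    CUDA_CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, scale, block_size, 0));
    return static_cast<double>(blocks_per_sm) * block_size /
           prop.maxThreadsPerMultiProcessor;
}

int main() {
    int min_grid = 0, suggested_block = 0;
    // Ask the runtime for a block size that maximizes theoretical occupancy.
    CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid, &suggested_block, scale, 0, 0));
    std::printf("suggested block size: %d (min grid %d)\n", suggested_block, min_grid);

    const int sizes[] = {64, 128, 256, 512, 1024};
    for (int bs : sizes)
        std::printf("block %4d -> occupancy %.2f\n", bs, theoretical_occupancy(bs));
    return 0;
}
```

Treat the suggested block size as a starting point for experiments; higher theoretical occupancy does not always mean higher throughput.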

Practice Projects

Beginner

Project

Vector Addition with Memory Coalescing Check

Scenario

You have two large 1D arrays (N=10^8 elements). You need to perform element-wise addition on the GPU.

How to Execute

1. Write a naive kernel in which each thread computes `A[tid] + B[tid]`.
2. Profile with Nsight Systems to see memory copy and kernel execution times.
3. Intentionally break coalescing (e.g., have threads access `A[tid * stride]`, keeping the index within the array bounds) and re-profile to observe the performance penalty.
4. Write a report comparing the two.
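The steps above could be sketched as a single kernel with a `stride` parameter, timed with CUDA events. This is an illustration under assumptions: the names `add_strided` and `time_add_ms` are made up here, the index is wrapped with a modulo to stay in bounds, and a smaller N than the scenario's 10^8 is used so it fits on modest GPUs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// stride == 1: consecutive threads read consecutive floats (coalesced).
// stride > 1:  threads of a warp read addresses far apart (uncoalesced).
__global__ void add_strided(const float* a, const float* b, float* c, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // 64-bit math avoids overflow; the modulo keeps the read in bounds.
        size_t src = (static_cast<size_t>(tid) * stride) % n;
        c[tid] = a[src] + b[src];
    }
}

// Returns the time in milliseconds of one kernel launch after a warm-up.
float time_add_ms(int n, int stride) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *a, *b, *c;
    CUDA_CHECK(cudaMalloc(&a, bytes));
    CUDA_CHECK(cudaMalloc(&b, bytes));
    CUDA_CHECK(cudaMalloc(&c, bytes));
    CUDA_CHECK(cudaMemset(a, 0, bytes));
    CUDA_CHECK(cudaMemset(b, 0, bytes));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    const int threads = 256, blocks = (n + threads - 1) / threads;
    add_strided<<<blocks, threads>>>(a, b, c, n, stride);  // warm-up launch
    CUDA_CHECK(cudaEventRecord(start));
    add_strided<<<blocks, threads>>>(a, b, c, n, stride);  // timed launch
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(a));
    CUDA_CHECK(cudaFree(b));
    CUDA_CHECK(cudaFree(c));
    return ms;
}

int main() {
    const int n = 1 << 24;
    std::printf("stride 1:  %.3f ms\n", time_add_ms(n, 1));
    std::printf("stride 32: %.3f ms\n", time_add_ms(n, 32));
    return 0;
}
```

Event timing gives a quick number for the report; the profiler's memory metrics explain why the strided version moves more data per useful byte.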

Intermediate

Project

Tiled Matrix Multiplication with Shared Memory

Scenario

Implement a high-performance matrix multiplication (C = A * B) for large square matrices.

How to Execute

1. Start with a naive global-memory-only implementation.
2. Implement the standard tiled algorithm using shared memory: each thread block loads a tile of A and a tile of B into `__shared__` memory, accumulates partial dot products, then writes its results to C in global memory.
3. Use Nsight Compute to analyze the 'Memory Throughput' and 'Compute (SM) Throughput' sections to identify whether you are memory-bound or compute-bound.
4. Experiment with tile sizes and loop unrolling factors to move the kernel closer to its roofline bound.
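A minimal sketch of the tiled algorithm from step 2 follows, assuming square row-major matrices and a 16x16 tile. `TILE`, `matmul_tiled`, and `matmul_gpu` are illustrative names, and the tile size is a starting point for the experiments in step 4, not a recommendation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

constexpr int TILE = 16;  // assumed tile width; blocks are TILE x TILE threads

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        const int a_col = t * TILE + threadIdx.x;
        const int b_row = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; pad with zeros at the edges.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tiles must be fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host wrapper for n x n row-major matrices.
std::vector<float> matmul_gpu(const std::vector<float>& A, const std::vector<float>& B, int n) {
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));
    CUDA_CHECK(cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice));
    const dim3 block(TILE, TILE);
    const dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());
    std::vector<float> C(static_cast<size_t>(n) * n);
    CUDA_CHECK(cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    return C;
}

int main() {
    // Multiplying by the 2x2 identity should return B unchanged.
    std::vector<float> C = matmul_gpu({1, 0, 0, 1}, {5, 6, 7, 8}, 2);
    std::printf("%.0f %.0f %.0f %.0f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```

With tiling, each element of A and B is read from global memory once per tile pass rather than once per output element, which is the reuse the project is meant to demonstrate.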

Advanced

Project

CUDA Kernel for Real-Time Image Processing Pipeline

Scenario

Optimize a pipeline that applies a 3x3 Gaussian blur, followed by edge detection (Sobel filter) to a 4K video stream at 60 fps.

How to Execute

1. Design a kernel fusion strategy that combines both filters into one kernel launch, avoiding intermediate writes to global memory.
2. Use two `cudaStream_t` streams and double-buffering with `cudaMemcpyAsync` (from pinned host memory) to overlap the host-to-device transfer of the next frame with kernel execution on the current frame.
3. Profile to confirm that per-frame processing fits within the roughly 16.7 ms budget (1000 ms / 60 frames).
4. If it does not, try texture memory for the read-only input image to exploit the GPU's caching of 2D spatially local reads.
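The double-buffering in step 2 might look like the sketch below. It is not the fused blur and Sobel kernel: `process_frame` is a stand-in that simply inverts each pixel, and `run_pipeline` is a hypothetical driver that assumes all frames are already in pinned host memory. Frames alternate between two streams, so the copy for one frame can overlap the kernel for the other.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// Stand-in for the fused filter kernel: inverts each 8-bit pixel.
__global__ void process_frame(const unsigned char* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Processes `frames` frames of `frame_bytes` each; returns the mismatch count.
int run_pipeline(int frames, int frame_bytes) {
    const size_t total = static_cast<size_t>(frames) * frame_bytes;
    unsigned char *h_in, *h_out, *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    CUDA_CHECK(cudaMallocHost(&h_in, total));   // pinned: needed for async overlap
    CUDA_CHECK(cudaMallocHost(&h_out, total));
    for (size_t i = 0; i < total; ++i) h_in[i] = static_cast<unsigned char>(i % 251);
    for (int s = 0; s < 2; ++s) {
        CUDA_CHECK(cudaMalloc(&d_in[s], frame_bytes));
        CUDA_CHECK(cudaMalloc(&d_out[s], frame_bytes));
        CUDA_CHECK(cudaStreamCreate(&stream[s]));
    }
    const int threads = 256, blocks = (frame_bytes + threads - 1) / threads;
    for (int f = 0; f < frames; ++f) {
        const int s = f % 2;  // alternate buffers and streams
        const size_t off = static_cast<size_t>(f) * frame_bytes;
        // Work within one stream is ordered, so reusing buffer s for frame f+2 is safe.
        CUDA_CHECK(cudaMemcpyAsync(d_in[s], h_in + off, frame_bytes,
                                   cudaMemcpyHostToDevice, stream[s]));
        process_frame<<<blocks, threads, 0, stream[s]>>>(d_in[s], d_out[s], frame_bytes);
        CUDA_CHECK(cudaMemcpyAsync(h_out + off, d_out[s], frame_bytes,
                                   cudaMemcpyDeviceToHost, stream[s]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());
    int bad = 0;
    for (size_t i = 0; i < total; ++i)
        bad += (h_out[i] != static_cast<unsigned char>(255 - h_in[i]));
    for (int s = 0; s < 2; ++s) {
        CUDA_CHECK(cudaStreamDestroy(stream[s]));
        CUDA_CHECK(cudaFree(d_in[s]));
        CUDA_CHECK(cudaFree(d_out[s]));
    }
    CUDA_CHECK(cudaFreeHost(h_in));
    CUDA_CHECK(cudaFreeHost(h_out));
    return bad;
}

int main() {
    // Four single-channel 4K frames (3840 x 2160 bytes each).
    std::printf("mismatches: %d\n", run_pipeline(4, 3840 * 2160));
    return 0;
}
```

Whether the copies and kernels actually overlap depends on the device's copy engines; the Nsight Systems timeline is the place to confirm it.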

Tools & Frameworks

Development & Profiling Tools

NVIDIA Nsight Compute (NCU), NVIDIA Nsight Systems (NSYS), CUDA Toolkit (nvcc), CUDA Binary Utilities (cuobjdump, nvdisasm)

Nsight Compute is the primary tool for kernel-level performance analysis, showing metrics like SM occupancy, memory throughput, and instruction mix. Nsight Systems provides a timeline view for identifying system-level bottlenecks like kernel launch latency or memory transfer stalls. The Toolkit and binary tools are essential for compiling, inspecting PTX/SASS assembly, and understanding low-level code generation.

Core CUDA Libraries

cuBLAS, cuFFT, cuRAND, Thrust

Use these as benchmarks and for high-performance building blocks. Writing your own kernel to beat cuBLAS is a master-level exercise; understanding *why* it's faster (e.g., its use of tensor cores, specialized memory access patterns) is the real learning objective. Thrust provides a high-level, STL-like interface for parallel operations and is excellent for rapid prototyping.

Interview Questions

Answer Strategy

The candidate must articulate the hierarchy: registers (fastest, per-thread), shared memory (on-chip, per-block, programmer-managed), L1/L2 caches (hardware-managed), and global memory (slowest, off-chip). The key insight is that shared memory is a *scratchpad* explicitly managed by the programmer via `__shared__` declarations and `__syncthreads()`. It's used for data that will be reused by multiple threads in a block to avoid redundant global memory fetches.
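A candidate could illustrate the scratchpad idea with a small stencil, sketched here under assumptions: a 3-point sum over a 1D array with zero padding at the ends, where each input value is needed by three neighbouring threads. The names `stencil_1d`, `stencil_on_gpu`, `BLOCK`, and `RADIUS` are hypothetical. Each block stages its inputs plus a one-element halo in `__shared__` memory once, then every thread reads its neighbours from there.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

constexpr int BLOCK = 256;  // threads per block; the launch must match this
constexpr int RADIUS = 1;   // one neighbour on each side

__global__ void stencil_1d(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    const int gid = blockIdx.x * BLOCK + threadIdx.x;
    const int lid = threadIdx.x + RADIUS;
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also load the left and right halo cells.
        const int left = gid - RADIUS;
        const int right = gid + BLOCK;
        tile[lid - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[lid + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // every load must land before any thread reads neighbours
    if (gid < n) out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

// Hypothetical host wrapper.
std::vector<float> stencil_on_gpu(const std::vector<float>& in) {
    const int n = static_cast<int>(in.size());
    const size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    CUDA_CHECK(cudaMalloc(&d_in, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice));
    stencil_1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());
    std::vector<float> out(n);
    CUDA_CHECK(cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return out;
}

int main() {
    std::vector<float> out = stencil_on_gpu(std::vector<float>(1000, 1.0f));
    std::printf("edge %.0f, interior %.0f, block boundary %.0f\n", out[0], out[1], out[256]);
    return 0;
}
```

On current GPUs the L1 cache would absorb much of this particular reuse anyway, so the example is about the mechanics of `__shared__` and `__syncthreads()` rather than a guaranteed speedup.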

Answer Strategy

This tests systematic problem-solving. The answer should outline: 1) Use Nsight Compute to identify the occupancy limiter (registers per thread, shared memory per block, or thread blocks per SM). 2) Apply the matching remedy: reduce register pressure with `__launch_bounds__`, the `-maxrregcount` compiler flag, or less aggressive loop unrolling; reduce the shared memory allocated per block; or adjust the number of threads per block (within hardware limits). 3) Re-profile to validate the improvement, noting that peak occupancy is not the goal; maximizing throughput is. Lower occupancy with higher ILP or better memory access patterns sometimes performs better.
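For the register-pressure remedy, a short sketch shows where `__launch_bounds__` goes and how to read back a kernel's register count with `cudaFuncGetAttributes`. The SAXPY kernels and the helpers `regs_bounded` and `regs_plain` are hypothetical, and the bounds (256 threads, 2 blocks per SM) are arbitrary. A kernel this small will likely report the same count either way; the qualifier matters for kernels that would otherwise use too many registers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// Promise at most 256 threads per block and request at least 2 resident
// blocks per SM; the compiler may limit register use to honor that.
__global__ void __launch_bounds__(256, 2)
saxpy_bounded(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Same kernel without the qualifier, for comparison.
__global__ void saxpy_plain(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Registers per thread as reported by the runtime for each kernel.
int regs_bounded() {
    cudaFuncAttributes attr;
    CUDA_CHECK(cudaFuncGetAttributes(&attr, saxpy_bounded));
    return attr.numRegs;
}

int regs_plain() {
    cudaFuncAttributes attr;
    CUDA_CHECK(cudaFuncGetAttributes(&attr, saxpy_plain));
    return attr.numRegs;
}

int main() {
    std::printf("registers per thread: bounded=%d plain=%d\n", regs_bounded(), regs_plain());
    return 0;
}
```

Launching `saxpy_bounded` with more than 256 threads per block fails at launch time, so the bound has to match how the kernel is actually launched.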