Skill Guide

Understanding of GPU architecture, CUDA programming, and hardware-aware optimization

The engineering discipline of mapping software algorithms onto the massively parallel, hierarchical hardware architecture of GPUs, requiring explicit management of thousands of concurrent threads, memory subsystems, and execution units to extract maximum computational throughput.

This skill can cut infrastructure costs and time-to-result for compute-bound workloads (e.g., AI training, scientific simulation, rendering) by as much as 10-100x, depending on the workload and the baseline. It is a primary differentiator for organizations building or deploying high-performance, cost-effective AI and HPC systems.

Careers: 1 | Categories: 1 | Avg Demand: 9.0 | Avg AI Risk: 15%

How to Learn GPU Architecture, CUDA Programming, and Hardware-Aware Optimization

1. Understand the GPU hardware model: SMs (Streaming Multiprocessors), warps, CUDA cores, and the memory hierarchy (registers, shared memory, L1/L2 cache, global memory).
2. Write basic CUDA kernels such as vector add and matrix multiply, focusing on thread indexing and grid/block dimensions.
3. Learn to read basic performance metrics (achieved occupancy, memory throughput) with Nsight Compute, or with the legacy `nvprof` on older GPUs; `nvprof` does not support devices of compute capability 8.0 and newer.
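As a starting point for step 2, here is a minimal vector-add sketch. The helper name `run_vector_add` and the block size of 256 are illustrative assumptions, not recommendations from this guide.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element. The bounds check guards the last, partially
// filled block when n is not a multiple of the block size.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host helper: runs the kernel on n elements and returns true
// when every GPU result equals the CPU result.
bool run_vector_add(int n) {
    std::vector<float> ha(n), hb(n), hc(n);
    for (int i = 0; i < n; ++i) { ha[i] = 0.5f * i; hb[i] = 2.0f; }

    size_t bytes = n * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;                     // threads per block (assumed value)
    int grid = (n + block - 1) / block;  // round up so every element is covered
    vector_add<<<grid, block>>>(da, db, dc, n);

    // A blocking copy on the default stream also waits for the kernel.
    cudaError_t err = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (hc[i] != ha[i] + hb[i]) return false;
    return true;
}
```

Calling `run_vector_add(1000)` exercises the partial last block, since 1000 is not a multiple of 256.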

1. Master memory coalescing: ensure that the threads of a warp access contiguous global memory, and verify it with the profiler (e.g., the `nvprof` metric `gld_efficiency`, or its Nsight Compute equivalent on newer GPUs).
2. Optimize shared memory: implement and debug shared memory tiling for matrix operations, and avoid bank conflicts.
3. Apply algorithmic patterns: implement reduction, scan, and stencil patterns in CUDA, using warp-level primitives such as `__shfl_down_sync` where they fit.
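The reduction pattern in step 3 could look like the sketch below. It pairs a grid-stride loop with a `__shfl_down_sync` warp reduction and one `atomicAdd` per warp. The names `sum_kernel` and `gpu_sum_of_ones`, and the launch configuration, are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Sums v across the 32 lanes of a warp. After the loop, lane 0 holds the total.
__device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void sum_kernel(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: consecutive threads read consecutive elements,
    // so the global loads of each warp are coalesced.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    v = warp_reduce_sum(v);
    // One atomic per warp instead of one per thread.
    if (threadIdx.x % warpSize == 0) atomicAdd(out, v);
}

// Hypothetical host helper: sums n ones on the GPU and returns the result.
float gpu_sum_of_ones(int n) {
    std::vector<float> h(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    int block = 256;  // must be a multiple of the warp size for the full mask
    int grid = std::min((n + block - 1) / block, 64);
    sum_kernel<<<grid, block>>>(d_in, d_out, n);

    float result = -1.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The order of floating-point atomics is not deterministic, so results for general inputs can differ in the last bits from run to run. The all-ones input is chosen here because every partial sum is a small integer and therefore exact.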

1. Architect for the specific GPU generation (e.g., Ampere, Hopper): design kernels that leverage hardware features such as Tensor Cores and asynchronous memory copies, and programming-model features such as cooperative groups.
2. Profile and optimize at the micro-architectural level: use Nsight Compute to analyze stall reasons (e.g., memory dependency, execution dependency) and SM utilization.
3. Develop performance-critical libraries or AI frameworks: integrate CUDA Graphs, multi-stream concurrency, and kernel fusion to minimize launch overhead and maximize hardware saturation across complex pipelines.
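Of the techniques in step 3, multi-stream concurrency is the simplest to sketch in isolation. The example below splits a buffer into chunks and issues copy-in, kernel, and copy-out per chunk on separate streams so that transfers and compute can overlap. The helper name `run_streamed_scale`, the stream count, and the scale factor are illustrative assumptions. CUDA Graphs and kernel fusion are not shown.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Hypothetical helper: scales n ones by 3 using several streams and
// returns true when every element comes back as 3.
bool run_streamed_scale(int n) {
    const int kStreams = 4;  // assumed chunk/stream count
    size_t bytes = n * sizeof(float);

    // Pinned host memory is needed for copies to be truly asynchronous.
    float* h = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&h), bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    cudaMalloc(&d, bytes);

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = (n + kStreams - 1) / kStreams;
    for (int s = 0; s < kStreams; ++s) {
        int off = s * chunk;
        int len = std::min(chunk, n - off);
        if (len <= 0) continue;
        // Work within one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(d + off, h + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(len + 255) / 256, 256, 0, streams[s]>>>(d + off, len, 3.0f);
        cudaMemcpyAsync(h + off, d + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for all streams

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 3.0f);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

Whether the chunks actually overlap depends on the device's copy engines and the chunk sizes; a system-level timeline from Nsight Systems is the usual way to confirm it.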

Practice Projects

Beginner

Project

CUDA Matrix Multiplication with Shared Memory Tiling

Scenario

Implement a high-performance matrix multiplication (C = A * B) for large matrices (4096x4096) that significantly outperforms a naive global-memory-only implementation.

How to Execute

1. Write a naive kernel that reads every operand from global memory.
2. Implement a tiled version using shared memory, loading sub-matrices (e.g., 32x32 tiles) cooperatively.
3. Handle tile boundaries correctly and synchronize with `__syncthreads()` both after loading a tile and before overwriting it.
4. Compare the memory throughput and execution time of both kernels with a profiler (Nsight Compute, or the legacy `nvprof` on GPUs that still support it), quantifying the speedup.
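Steps 2 and 3 might be sketched as follows for square row-major matrices. The helper `max_abs_error_tiled` and the small-integer test inputs are assumptions chosen so the comparison against a CPU reference is exact. The naive baseline kernel and the timing comparison are left out.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int TILE = 32;  // tile edge; 32x32 threads per block

// C = A * B for n x n row-major matrices, one output element per thread.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Cooperative load; out-of-range entries are padded with zeros.
        As[threadIdx.y][threadIdx.x] =
            (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical helper: returns the max absolute difference between the
// GPU result and a CPU reference for small integer-valued inputs.
float max_abs_error_tiled(int n) {
    std::vector<float> A(n * n), B(n * n), C(n * n), ref(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) { A[i] = float(i % 3); B[i] = float(i % 5); }
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                ref[i * n + j] += A[i * n + k] * B[k * n + j];

    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float max_err = 0.0f;
    for (int i = 0; i < n * n; ++i)
        max_err = std::max(max_err, std::fabs(C[i] - ref[i]));
    return max_err;
}
```

Testing with a size such as 100, which is not a multiple of 32, exercises the zero-padding path that a 4096x4096 run never touches.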

Intermediate

Project

Optimizing a 2D Stencil Computation Kernel

Scenario

Accelerate a 5-point stencil operation (common in image processing or physics simulations) on a 2D grid, focusing on reducing global memory bandwidth pressure and maximizing arithmetic intensity.

How to Execute

1. Implement a naive kernel with redundant global memory reads.
2. Apply shared memory tiling to load halos (neighboring elements) once per tile.
3. Experiment with thread coarsening, where each thread computes multiple output elements, to improve instruction-level parallelism.
4. Use Nsight Compute to check shared memory access patterns and eliminate bank conflicts: the shared memory table in its Memory Workload Analysis section reports them. (`shared_load_transactions_per_request` is the corresponding legacy `nvprof` metric.)
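The halo-loading idea in step 2 can be sketched as below. The stencil weights (a discrete Laplacian), the clamped boundary handling, the 16x16 tile size, and the helper `stencil_mismatches` are all assumptions made for this example. Thread coarsening is not shown.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int B = 16;  // tile edge (assumed); the block is B x B threads

// Reads in[y][x] with coordinates clamped to the grid (replicated edges).
__device__ float load_clamped(const float* in, int x, int y, int w, int h) {
    x = min(max(x, 0), w - 1);
    y = min(max(y, 0), h - 1);
    return in[y * w + x];
}

// 5-point Laplacian: out = N + S + E + W - 4 * center.
__global__ void laplacian_tiled(const float* in, float* out, int w, int h) {
    __shared__ float tile[B + 2][B + 2];  // tile plus a one-cell halo
    int gx = blockIdx.x * B + threadIdx.x;
    int gy = blockIdx.y * B + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

    tile[ly][lx] = load_clamped(in, gx, gy, w, h);
    // Edge threads also fetch the halo cell next to them. Corners are
    // not needed for a 5-point stencil.
    if (threadIdx.x == 0)     tile[ly][0]     = load_clamped(in, gx - 1, gy, w, h);
    if (threadIdx.x == B - 1) tile[ly][B + 1] = load_clamped(in, gx + 1, gy, w, h);
    if (threadIdx.y == 0)     tile[0][lx]     = load_clamped(in, gx, gy - 1, w, h);
    if (threadIdx.y == B - 1) tile[B + 1][lx] = load_clamped(in, gx, gy + 1, w, h);
    __syncthreads();

    if (gx < w && gy < h)
        out[gy * w + gx] = tile[ly][lx - 1] + tile[ly][lx + 1] +
                           tile[ly - 1][lx] + tile[ly + 1][lx] -
                           4.0f * tile[ly][lx];
}

// Hypothetical helper: counts elements that differ from a CPU reference.
int stencil_mismatches(int w, int h) {
    std::vector<float> in(w * h), out(w * h, -1.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) in[y * w + x] = float((x * x + y) % 7);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = size_t(w) * h * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(B, B), grid((w + B - 1) / B, (h + B - 1) / B);
    laplacian_tiled<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    auto at = [&](int x, int y) {
        x = std::min(std::max(x, 0), w - 1);
        y = std::min(std::max(y, 0), h - 1);
        return in[y * w + x];
    };
    int bad = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float ref = at(x - 1, y) + at(x + 1, y) + at(x, y - 1) +
                        at(x, y + 1) - 4.0f * at(x, y);
            if (out[y * w + x] != ref) ++bad;
        }
    return bad;
}
```

With this layout each input element is read from global memory roughly once per tile instead of up to five times, which is the bandwidth saving the project is after.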

Advanced

Project

Designing a Kernel-Fused Transformer Encoder Layer

Scenario

Develop a custom fused CUDA kernel for the core operations of a transformer encoder (LayerNorm, QKV projection, attention score, softmax) to minimize memory bandwidth usage and kernel launch latency, targeting A100 or H100 GPUs.

How to Execute

1. Profile a baseline PyTorch implementation using `torch.profiler` to identify the dominant memory-bound operations (e.g., softmax, residual add).
2. Design a fused kernel in CUDA C++ that combines 2-3 consecutive operations (e.g., residual add + layer norm) into one kernel, eliminating intermediate global memory writes.
3. Implement the kernel using Tensor Cores (via the WMMA API or CUTLASS) for the matrix multiplications, and optimize the softmax using warp-level reductions.
4. Integrate the kernel into a PyTorch model via a C++/CUDA extension, validating numerical correctness and benchmarking end-to-end throughput and latency against the baseline.
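The residual add + layer norm fusion named in step 2 could start from a sketch like this one, which assigns one warp to each row and never writes the intermediate sum `x + res` to global memory. The kernel and helper names, the one-warp-per-row layout, and the test values are assumptions. A tuned kernel would likely keep the row in registers or shared memory rather than re-reading it, and the Tensor Core and PyTorch integration steps are not covered.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Butterfly reduction: every lane ends up with the warp-wide sum.
__device__ float warp_sum(float v) {
    for (int o = 16; o > 0; o >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, o);
    return v;
}

// y = LayerNorm(x + res) * gamma + beta, one warp (32 threads) per row.
__global__ void add_layernorm(const float* x, const float* res,
                              const float* gamma, const float* beta,
                              float* y, int d, float eps) {
    const float* xr = x + blockIdx.x * d;
    const float* rr = res + blockIdx.x * d;
    float* yr = y + blockIdx.x * d;

    float s = 0.0f;
    for (int i = threadIdx.x; i < d; i += 32) s += xr[i] + rr[i];
    float mean = warp_sum(s) / d;

    float sq = 0.0f;
    for (int i = threadIdx.x; i < d; i += 32) {
        float v = xr[i] + rr[i] - mean;
        sq += v * v;
    }
    float inv_std = rsqrtf(warp_sum(sq) / d + eps);

    // Residual add, normalization, and affine transform in a single pass.
    for (int i = threadIdx.x; i < d; i += 32)
        yr[i] = (xr[i] + rr[i] - mean) * inv_std * gamma[i] + beta[i];
}

// Hypothetical helper: max absolute error against a double-precision CPU reference.
float add_layernorm_max_error(int rows, int d) {
    const float eps = 1e-5f;
    int n = rows * d;
    std::vector<float> x(n), r(n), g(d, 1.5f), b(d, 0.25f), y(n, 0.0f);
    for (int i = 0; i < n; ++i) { x[i] = 0.25f * (i % 13); r[i] = 0.5f * (i % 7); }

    float *dx, *dr, *dg, *db, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dr, n * sizeof(float));
    cudaMalloc(&dg, d * sizeof(float));
    cudaMalloc(&db, d * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dr, r.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dg, g.data(), d * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), d * sizeof(float), cudaMemcpyHostToDevice);
    add_layernorm<<<rows, 32>>>(dx, dr, dg, db, dy, d, eps);
    cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dr); cudaFree(dg); cudaFree(db); cudaFree(dy);

    float max_err = 0.0f;
    for (int row = 0; row < rows; ++row) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < d; ++i) mean += x[row * d + i] + r[row * d + i];
        mean /= d;
        for (int i = 0; i < d; ++i) {
            double v = x[row * d + i] + r[row * d + i] - mean;
            var += v * v;
        }
        var /= d;
        for (int i = 0; i < d; ++i) {
            double v = x[row * d + i] + r[row * d + i];
            double ref = (v - mean) / std::sqrt(var + eps) * g[i] + b[i];
            max_err = std::max(max_err, float(std::fabs(y[row * d + i] - ref)));
        }
    }
    return max_err;
}
```

Comparing against a higher-precision reference with a tolerance, rather than exact equality, mirrors the numerical validation that step 4 calls for.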

Tools & Frameworks

Profiling & Analysis Tools

NVIDIA Nsight Compute, NVIDIA Nsight Systems, nvprof (Legacy)

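Nsight Systems is for system-level profiling (API calls, kernel launches, CPU-GPU overlap); use it first to find which kernels or transfers dominate the timeline. Nsight Compute is for kernel-level analysis (memory transactions, stall reasons, SM utilization); use it to drill into a single kernel's bottlenecks. `nvprof` is the legacy command-line profiler and does not support GPUs of compute capability 8.0 and newer.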

CUDA Libraries & SDKs

cuBLAS, cuDNN, CUTLASS, Thrust, CUB

Use cuBLAS/cuDNN for optimized AI primitives (GEMM, convolutions) and as a performance baseline. CUTLASS provides templated C++ abstractions for writing custom, high-performance kernels using Tensor Cores. Thrust provides high-level, STL-style parallel algorithms; CUB provides lower-level warp-, block-, and device-wide primitives such as reductions and scans.
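For a sense of how little code the high-level route needs, here is a minimal Thrust sketch of a reduction and an inclusive scan. The two helper function names are hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>

// Hypothetical helper: sums 1 + 2 + ... + n on the GPU.
int thrust_sum_1_to_n(int n) {
    thrust::device_vector<int> v(n);
    thrust::sequence(v.begin(), v.end(), 1);       // fills 1, 2, ..., n
    return thrust::reduce(v.begin(), v.end(), 0);  // initial value 0, operator +
}

// Hypothetical helper: in-place inclusive prefix sum over n ones,
// returning the last element of the scan.
int thrust_last_prefix_sum(int n) {
    thrust::device_vector<int> v(n, 1);
    thrust::inclusive_scan(v.begin(), v.end(), v.begin());
    return v.back();  // copies one element back to the host
}
```

A hand-written reduction or scan is worth the effort mainly when it can be fused with neighboring work; otherwise these library calls make a reasonable baseline to measure against.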

Development & Debugging

CUDA-GDB, Compute Sanitizer, Nsight Eclipse/VS Code Extension

Use CUDA-GDB for kernel debugging. Use the Compute Sanitizer tools, such as `memcheck` (memory access errors and leaks) and `racecheck` (shared memory data races), to catch correctness bugs. The IDE extensions provide integrated build, debug, and profiling support.