Skill Guide

C/C++/CUDA for low-level optimization

The disciplined practice of squeezing maximum hardware performance from CPU and GPU architectures by writing code that directly manages memory hierarchy, instruction pipelines, and parallel execution units in C/C++ and NVIDIA's CUDA.

This skill directly reduces cloud computing costs and latency in performance-critical systems, translating to competitive advantage and tangible cost savings. It enables building proprietary, high-performance engines that are difficult to replicate, becoming a core business asset.

1 Careers

1 Categories

9.0 Avg Demand

20% Avg AI Risk

How to Learn C/C++/CUDA for low-level optimization

Focus on: 1) C++ memory model & RAII for manual resource control. 2) x86/ARM assembly basics to understand compiler output. 3) CUDA kernel launch syntax, thread hierarchy (grid, block, thread), and device memory types (global, shared).
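As an illustration of RAII for manual resource control, here is a minimal sketch of a move-only buffer that owns cache-line-aligned memory. The class name `AlignedBuffer` and the 64-byte alignment are assumptions made for this example; it relies on C++17 aligned `operator new`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>

// Hypothetical move-only owner of a 64-byte-aligned float array.
class AlignedBuffer {
public:
    static constexpr std::size_t kAlign = 64;  // assumed cache-line size

    explicit AlignedBuffer(std::size_t count)
        : data_(static_cast<float*>(
              ::operator new(count * sizeof(float), std::align_val_t{kAlign}))),
          count_(count) {}

    // The destructor releases the memory, so no caller has to remember to.
    ~AlignedBuffer() { release(); }

    // Copying is disabled to keep ownership unique.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    AlignedBuffer(AlignedBuffer&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}

    AlignedBuffer& operator=(AlignedBuffer&& other) noexcept {
        if (this != &other) {
            release();
            data_ = std::exchange(other.data_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    float* data() noexcept { return data_; }
    std::size_t size() const noexcept { return count_; }

    float& operator[](std::size_t i) {
        assert(i < count_);
        return data_[i];
    }

private:
    void release() noexcept {
        if (data_) ::operator delete(data_, std::align_val_t{kAlign});
        data_ = nullptr;
        count_ = 0;
    }

    float* data_;
    std::size_t count_;
};
```

The same ownership pattern could wrap a device allocation, with the allocation and free calls swapped for their CUDA runtime equivalents.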

Focus on: 1) Using profiling tools (NVIDIA Nsight Compute/Systems, Linux perf) to identify bottlenecks and classify them as memory-bound or compute-bound. 2) Applying specific optimization patterns: loop tiling, data prefetching, warp-level programming. 3) Writing cache-aware algorithms and using non-temporal (streaming) memory accesses for data that will not be reused. Measure first and avoid premature optimization.
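The memory-bound versus compute-bound distinction can be estimated on paper before profiling. The sketch below uses a roofline-style comparison: a kernel's arithmetic intensity (FLOP per byte moved) against the machine balance (peak FLOP/s divided by peak bytes/s). The function names are hypothetical, and the peak figures a caller passes in are assumptions to be replaced with numbers for the actual hardware.

```cpp
#include <cassert>
#include <string>

// Arithmetic intensity: floating-point operations per byte of memory traffic.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Roofline-style estimate: below the machine balance the kernel cannot be fed
// fast enough by memory, so it is classified as memory-bound.
std::string classify_bound(double flops, double bytes,
                           double peak_flops_per_s, double peak_bytes_per_s) {
    assert(peak_bytes_per_s > 0.0);
    const double machine_balance = peak_flops_per_s / peak_bytes_per_s;
    return arithmetic_intensity(flops, bytes) < machine_balance
               ? "memory-bound"
               : "compute-bound";
}
```

For example, float vector addition performs 1 FLOP per 12 bytes (two loads and one store), so on a hypothetical device with 10 TFLOP/s and 500 GB/s it lands far below the balance point of 20 FLOP/byte. A profiler should still confirm the estimate.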

Focus on: 1) Designing architecture-aware data structures for specific hardware (e.g., structure-of-arrays for GPU coalescing). 2) Implementing and optimizing custom CUDA kernels for novel algorithms (e.g., sparse data, reductions). 3) Using intrinsics and PTX assembly strategically, and understanding compiler optimizations and their limitations at the ISA level. 4) Mentoring others on profiling methodology.
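To make the structure-of-arrays idea concrete, here is a small host-side sketch that converts an array-of-structures layout into a structure-of-arrays layout. The particle types and field names are invented for illustration. In the SoA form each field is contiguous, so a loop (or a warp of GPU threads) that reads only `x` touches consecutive addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: the fields of one particle are adjacent in memory.
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure-of-arrays: each field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    out.mass.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
    }
    return out;
}

// Updates only the x field: a unit-stride pass over one array.
void shift_x(ParticlesSoA& p, float dx) {
    assert(p.x.size() == p.mass.size());
    for (float& v : p.x) v += dx;
}
```

With the AoS layout the same update would stride over 16 bytes per element and pull the unused `y`, `z`, and `mass` fields through the cache.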

Practice Projects

Beginner

Project

CUDA Vector Addition Optimization

Scenario

Accelerate the addition of two large vectors (1M+ elements) on the GPU, comparing naive and optimized kernel performance.

How to Execute

1. Implement a basic kernel with one thread per element. 2. Use `cudaMallocManaged` for simplicity. 3. Introduce grid/block sizing and measure latency with CUDA events. 4. Verify that consecutive threads access consecutive elements (coalesced access), contrast this with a deliberately strided access pattern, and compare throughput.
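The grid/block sizing in step 3 comes down to a ceiling division plus a bounds check. The sketch below emulates that launch geometry on the CPU in plain C++; it is not CUDA code, and the function names are hypothetical. The two nested loops stand in for `blockIdx.x` and `threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// CPU emulation of a one-thread-per-element launch.
void vector_add_emulated(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& c,
                         std::size_t block_size) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    const std::size_t grid = blocks_for(n, block_size);
    for (std::size_t block_idx = 0; block_idx < grid; ++block_idx) {
        for (std::size_t thread_idx = 0; thread_idx < block_size; ++thread_idx) {
            // Global index, as a kernel would compute it.
            const std::size_t i = block_idx * block_size + thread_idx;
            // The last block may extend past n, so guard the access.
            if (i < n) c[i] = a[i] + b[i];
        }
    }
}
```

Because consecutive `thread_idx` values map to consecutive `i`, neighbouring threads in a real kernel with this indexing would read neighbouring addresses.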

Intermediate

Project

Shared-Memory Matrix Multiplication

Scenario

Implement and optimize tiled matrix multiplication for matrices that don't fit in L2 cache, minimizing global memory accesses.

How to Execute

1. Write a naive global-memory kernel and profile it. 2. Implement a tiled version using CUDA shared memory. 3. Optimize tile dimensions for occupancy and bank conflict avoidance. 4. Use Nsight Compute to analyze memory throughput and compute utilization, iterating to find the optimal configuration.
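The tiling structure in step 2 can be prototyped on the CPU before writing the kernel. This is a host-side C++ sketch under simplifying assumptions (square row-major matrices, a hypothetical function name `matmul_tiled`). The three outer loops pick a tile, which is the unit a CUDA thread block would stage in shared memory; the three inner loops do the work inside that tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major n x n matrices, computed tile by tile.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n, std::size_t tile) {
    assert(tile > 0 && a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                // Clamp tile bounds so n need not be a multiple of tile.
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float aik = a[i * n + k];  // reused across j
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

A version like this also serves as a correctness reference when comparing GPU output, keeping in mind that a different summation order can change floating-point rounding.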

Advanced

Project

High-Performance Sparse Matrix-Vector Multiplication (SpMV)

Scenario

Develop a CUDA kernel for SpMV on a CSR (Compressed Sparse Row) matrix that outperforms library implementations for your specific sparsity pattern.

How to Execute

1. Analyze the matrix sparsity pattern (e.g., average non-zeros per row). 2. Implement and benchmark a basic row-based kernel. 3. Implement an advanced kernel using warp-level primitives (e.g., `__shfl_sync`) for intra-warp reduction. 4. Implement a two-phase kernel for irregular rows. 5. Profile and tune the trade-off between occupancy and memory latency hiding, potentially using CUDA graphs to reduce kernel launch overhead.
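A CPU reference is useful for step 1 and for validating every kernel variant afterwards. The following sketch assumes a conventional CSR layout (`row_ptr` of length rows + 1, plus `col_idx` and `values`); the struct and function names are hypothetical. The outer loop over rows is what a basic row-based kernel would assign to one thread each.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed Sparse Row storage; row r owns entries [row_ptr[r], row_ptr[r+1]).
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // length rows + 1
    std::vector<std::size_t> col_idx;  // column of each stored value
    std::vector<double> values;        // the non-zero values
};

// Average number of stored non-zeros per row.
double avg_nnz_per_row(const CsrMatrix& m) {
    assert(m.rows > 0 && m.row_ptr.size() == m.rows + 1);
    return static_cast<double>(m.values.size()) / static_cast<double>(m.rows);
}

// Reference y = M * x, one row at a time.
std::vector<double> spmv_csr(const CsrMatrix& m, const std::vector<double>& x) {
    assert(m.row_ptr.size() == m.rows + 1);
    assert(m.col_idx.size() == m.values.size());
    std::vector<double> y(m.rows, 0.0);
    for (std::size_t r = 0; r < m.rows; ++r) {
        double acc = 0.0;
        for (std::size_t k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k) {
            assert(m.col_idx[k] < x.size());
            acc += m.values[k] * x[m.col_idx[k]];
        }
        y[r] = acc;
    }
    return y;
}
```

Rows with very different lengths make the inner loop uneven, which is the imbalance the warp-level and two-phase variants aim to reduce.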

Tools & Frameworks

Compilers & Toolchains

GCC/Clang with -O3/-Ofast/-march=native
NVIDIA nvcc with compute capability flags
LLVM for custom optimization passes

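These are the primary code generators. Use compiler flags aggressively and inspect the assembly output (`-S`) to verify optimizations such as auto-vectorization (SIMD) and loop unrolling. Note that `-Ofast` enables `-ffast-math`, which relaxes IEEE 754 floating-point semantics, and that `-march=native` ties the binary to the build machine's instruction set.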
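A small loop like the one below is a convenient test case for inspecting auto-vectorization. This is only a sketch: `saxpy` is a conventional name rather than a library call here, and `__restrict` is a compiler extension (accepted by GCC, Clang, and MSVC) that promises the two arrays do not overlap, which removes one common obstacle to vectorization.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = alpha * x[i] + y[i]; the no-alias promise lets the compiler vectorize
// without emitting a runtime overlap check.
void saxpy(std::size_t n, float alpha,
           const float* __restrict x, float* __restrict y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Compiling with something like `g++ -O3 -march=native -S saxpy.cpp -o saxpy.s` and searching the output for packed instructions (on x86, mnemonics ending in `ps`) shows whether the loop was vectorized. GCC's `-fopt-info-vec` and Clang's `-Rpass=loop-vectorize` report the same decision without reading assembly.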

Profilers & Analyzers

NVIDIA Nsight Compute (kernel analysis)
NVIDIA Nsight Systems (system-wide trace)
Linux perf, Intel VTune
Valgrind (memcheck, cachegrind)

Essential for evidence-based optimization. Nsight Compute identifies GPU warp stalls, memory throughput, and instruction mix. perf and VTune analyze CPU cache misses, branch mispredictions, and instructions per cycle (IPC).

Libraries & Primitives

CUDA Thrust
CUB (CUDA Unbound)
Intel oneAPI (oneMKL, oneDNN)
ARM Compute Library

Use high-quality, vendor-optimized libraries for common patterns (sort, reduce) before writing custom kernels. CUB is the standard for device-wide primitives in CUDA.

Interview Questions

Answer Strategy

Tests the candidate's ability to interpret profiler data and reason about hardware. The scenario indicates a kernel bound by memory latency rather than memory throughput. The high L2 hit rate means accesses have good locality, but the latency is not being hidden.

Strategy: 1) Increase occupancy (if low) so that more warps are available to hide latency. 2) Introduce explicit data prefetching or restructure the computation so that several independent memory accesses are in flight. 3) Use asynchronous memory copies (`cp.async`, compute capability 8.0 and later) if available.

Sample Answer: 'The kernel is memory-latency-bound. High L2 hits are good, but the SM is idle waiting for data. I'd first check occupancy; if it's low, I'd reduce register usage or shared memory per block to launch more concurrent warps. If occupancy is already high, I'd restructure the algorithm to issue more independent arithmetic between memory loads to hide latency.'
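The idea of keeping independent work in flight can be shown with a CPU analogue. In the sketch below (function names are hypothetical), the first version forms a single dependency chain, so each addition waits for the previous one; the second keeps four independent accumulators so that several loads and additions can overlap. A GPU thread can apply the same restructuring to have more outstanding loads per warp.

```cpp
#include <cassert>
#include <cstddef>

// Single accumulator: every add depends on the result of the previous add.
float sum_single(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

// Four independent accumulators: the four chains do not depend on each other.
float sum_unrolled4(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }
    // Handle the remaining 0 to 3 elements.
    for (; i < n; ++i) acc0 += data[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The two versions add the values in a different order, so their results can differ in the last bits for general floating-point input. Whether the restructured form is actually faster is something to measure rather than assume.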

Answer Strategy

Tests systematic methodology and communication. The core competency is a data-driven, iterative optimization cycle. A professional response should outline: 1) Establishing a baseline and defining the performance target. 2) Profiling to identify the hottest code paths (80/20 rule). 3) Formulating and implementing a hypothesis (e.g., changing data layout). 4) Re-profiling to measure the effect and verify no regressions. 5) Documenting the change and impact. Avoid vague answers like 'I made the code faster'.
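The baseline-and-remeasure steps of that cycle can be supported by a very small timing helper. The sketch below is one possible shape, with hypothetical names; it reports the median of several runs because a single measurement is easily distorted by noise.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs fn `repeats` times and returns the median wall-clock time in microseconds.
template <typename Fn>
double median_time_us(Fn&& fn, std::size_t repeats) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(repeats);
    for (std::size_t r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    // Partial sort that places the middle sample at its sorted position.
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}

// Speedup of a candidate relative to the baseline (greater than 1 means faster).
double speedup(double baseline_us, double candidate_us) {
    assert(candidate_us > 0.0);
    return baseline_us / candidate_us;
}
```

Recording the baseline median, the candidate median, and the resulting speedup for each change gives the documented impact that step 5 asks for. For GPU kernels the timing itself would come from CUDA events or a profiler rather than host clocks.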