AI Latency Optimization Engineer
An AI Latency Optimization Engineer is a specialized performance engineer who minimizes inference latency and maximizes throughput…
Skill Guide
GPU Architecture & CUDA Programming covers how Graphics Processing Units are built for parallel processing and how to write software on NVIDIA's CUDA platform that uses their computational throughput for general-purpose computing (GPGPU).
Scenario
You are tasked with accelerating a simple vector addition operation (`C = A + B`) on the GPU, replacing a slow CPU loop.
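A minimal sketch of this scenario in CUDA, assuming single-precision inputs that fit in device memory. The names `vecAddKernel` and `vecAdd` are hypothetical, and the block size of 256 threads is an assumed starting point rather than a tuned value.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread adds one element; the bounds check covers the final partial block.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper: copies inputs to the device, launches the kernel,
// and copies the result back. Returns false if any CUDA call fails.
bool vecAdd(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c) {
    if (a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    c.resize(n);
    if (n == 0) return true;

    const size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;                         // assumed block size
        const int blocks = (n + threads - 1) / threads;  // round up to cover n
        vecAddKernel<<<blocks, threads>>>(dA, dB, dC, n);
        // The blocking device-to-host copy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

For an operation this simple, the host-to-device copies usually cost more than the addition itself, so the GPU version tends to pay off only when the data already lives on the device or is reused by later kernels.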
Scenario
Implement a tiled matrix multiplication kernel that uses shared memory to minimize global memory traffic, a core optimization for deep learning frameworks.
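One way this scenario might look in CUDA, assuming row-major single-precision matrices. The tile width `TILE = 16` and the names `tiledMatMulKernel` and `matMul` are illustrative assumptions. Each block stages one tile of A and one tile of B in shared memory, so every value loaded from global memory is reused `TILE` times.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile width; worth tuning per GPU

// Computes C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void tiledMatMulKernel(const float* A, const float* B, float* C,
                                  int M, int K, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        const int aCol = t * TILE + threadIdx.x;
        const int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next iteration overwrites
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host wrapper; returns false on bad sizes or any CUDA error.
bool matMul(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& C, int M, int K, int N) {
    if (M <= 0 || K <= 0 || N <= 0) return false;
    if (A.size() != size_t(M) * K || B.size() != size_t(K) * N) return false;
    C.assign(size_t(M) * N, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, A.size() * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&dB, B.size() * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&dC, C.size() * sizeof(float)) == cudaSuccess &&
              cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const dim3 threads(TILE, TILE);
        const dim3 blocks((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
        tiledMatMulKernel<<<blocks, threads>>>(dA, dB, dC, M, K, N);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

This sketch leaves out further optimizations such as register blocking, vectorized loads, and Tensor Core instructions, which production libraries rely on.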
Scenario
Build a high-performance parallel reduction (sum) that scales across multiple GPUs on a single node, handling load balancing and communication.
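A simplified sketch of this scenario, under several assumptions: the input is split statically into near-equal chunks (one per visible GPU), each GPU reduces its chunk to per-block partial sums, and the host combines the partials. The names `blockSumKernel` and `multiGpuSum` and the constants `THREADS` and `BLOCKS` are hypothetical. A production version would more likely use pinned memory with streams, peer-to-peer transfers, or a collective library such as NCCL for the communication step.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // assumed block size (power of two)
constexpr int BLOCKS = 64;    // assumed grid size per device

// Grid-stride accumulation followed by a shared-memory tree reduction.
// Writes one partial sum per block.
__global__ void blockSumKernel(const float* in, float* partial, size_t n) {
    __shared__ float smem[THREADS];
    float local = 0.0f;
    for (size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
         i < n; i += static_cast<size_t>(gridDim.x) * blockDim.x)
        local += in[i];
    smem[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = smem[0];
}

// Hypothetical host driver; returns false if no GPU is found or a call fails.
bool multiGpuSum(const std::vector<float>& data, double& result) {
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices < 1) return false;

    struct Work { float* in = nullptr; float* partial = nullptr; size_t count = 0; };
    std::vector<Work> work(devices);
    const size_t n = data.size();
    size_t offset = 0;
    bool ok = true;

    // Phase 1: give each device its chunk and launch. Kernel launches are
    // asynchronous, so earlier devices compute while later ones receive data.
    for (int d = 0; d < devices && ok; ++d) {
        Work& w = work[d];
        // Static load balancing: the first n % devices chunks get one extra element.
        w.count = n / devices + (static_cast<size_t>(d) < n % devices ? 1 : 0);
        if (w.count == 0) continue;
        ok = cudaSetDevice(d) == cudaSuccess &&
             cudaMalloc(&w.in, w.count * sizeof(float)) == cudaSuccess &&
             cudaMalloc(&w.partial, BLOCKS * sizeof(float)) == cudaSuccess &&
             cudaMemcpy(w.in, data.data() + offset, w.count * sizeof(float),
                        cudaMemcpyHostToDevice) == cudaSuccess;
        if (ok) {
            blockSumKernel<<<BLOCKS, THREADS>>>(w.in, w.partial, w.count);
            ok = cudaGetLastError() == cudaSuccess;
        }
        offset += w.count;
    }

    // Phase 2: gather per-block partials and finish the sum on the host in double.
    result = 0.0;
    for (int d = 0; d < devices; ++d) {
        Work& w = work[d];
        if (w.count == 0) continue;
        cudaSetDevice(d);
        float host[BLOCKS];
        if (ok && cudaMemcpy(host, w.partial, sizeof(host),
                             cudaMemcpyDeviceToHost) == cudaSuccess) {
            for (float p : host) result += p;
        } else {
            ok = false;
        }
        cudaFree(w.in);
        cudaFree(w.partial);
    }
    return ok;
}
```

The even split assumes the GPUs are roughly equal in speed; on a node with mixed GPUs, chunk sizes would need to be weighted by measured throughput.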
The CUDA Toolkit is the core SDK for compiling and debugging CUDA code. Nsight Systems provides system-level performance analysis (CPU/GPU interaction, kernel timelines). Nsight Compute is for in-depth kernel profiling (memory latency, occupancy, instruction throughput).
Use cuBLAS and cuDNN for production-grade, optimized implementations of standard operations. CUTLASS provides building blocks for writing high-performance custom kernels. Thrust offers a high-level, STL-like interface for common parallel patterns (sort, reduce, scan).
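As a small illustration of Thrust's STL-like interface, the sketch below sorts a set of integers on the GPU and sums them with `thrust::sort` and `thrust::reduce`. The wrapper name `sortAndSum` is hypothetical; the Thrust calls themselves ship with the CUDA Toolkit.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts `values` in place on the GPU and returns their sum.
int sortAndSum(std::vector<int>& values) {
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                   // parallel sort on the device
    int total = thrust::reduce(d.begin(), d.end(), 0);  // parallel sum, initial value 0

    // Copy the sorted data back to the host container.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

No kernel or launch configuration appears in this code; Thrust selects those internally, which is the main trade-off against a hand-written kernel.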