AI Computer Vision Engineer
AI Computer Vision Engineers design, build, and deploy intelligent systems that interpret and act on visual data, from medical imag…
Skill Guide
GPU programming is the engineering discipline of mapping algorithms onto the massively parallel, hierarchical hardware architecture of GPUs. It requires explicit management of thousands of concurrent threads, the memory hierarchy, and execution units to extract maximum computational throughput.
Scenario
Implement a high-performance matrix multiplication (C = A * B) for large matrices (4096x4096) that significantly outperforms a naive global-memory-only implementation.
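The usual route past a global-memory-only kernel is tiling: each thread block stages a small tile of A and B in shared memory and reuses it many times. The sketch below is a CPU-side Python model of that loop structure, not a CUDA kernel. The function names, the tile size, and the load counter are illustrative assumptions, included only to make the reduction in global-memory traffic visible.

```python
def matmul_naive(a, b, n):
    """Reference product; every multiply-add reads two values from 'global memory'."""
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                loads += 2  # one element of A, one element of B
            c[i][j] = acc
    return c, loads


def matmul_tiled(a, b, n, tile):
    """Model of a shared-memory tiled kernel; assumes n is a multiple of tile."""
    assert n % tile == 0
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for bi in range(0, n, tile):        # block row (blockIdx.y in a kernel)
        for bj in range(0, n, tile):    # block column (blockIdx.x in a kernel)
            acc = [[0.0] * tile for _ in range(tile)]
            for bk in range(0, n, tile):
                # Cooperative load: each tile element is fetched once per block.
                a_tile = [a[bi + r][bk:bk + tile] for r in range(tile)]
                b_tile = [b[bk + r][bj:bj + tile] for r in range(tile)]
                loads += 2 * tile * tile
                # A real kernel would synchronize the block here before computing.
                for r in range(tile):
                    for col in range(tile):
                        s = acc[r][col]
                        for k in range(tile):
                            s += a_tile[r][k] * b_tile[k][col]
                        acc[r][col] = s
            for r in range(tile):
                for col in range(tile):
                    c[bi + r][bj + col] = acc[r][col]
    return c, loads
```

In this model the naive version performs 2·n³ global loads and the tiled version 2·n³/tile, so traffic falls by a factor equal to the tile width. For n = 8 and tile = 4 that is 1024 loads versus 256.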
Scenario
Accelerate a 5-point stencil operation (common in image processing or physics simulations) on a 2D grid, focusing on reducing global memory bandwidth pressure and maximizing arithmetic intensity.
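To make "arithmetic intensity" concrete, here is a minimal Python sketch of a 5-point stencil with a rough FLOPs-per-byte estimate. The coefficients, the 4-byte value size, the count of 6 FLOPs per point, and the function names are all assumptions chosen for illustration. The tiled estimate assumes each block loads its tile plus a one-cell halo exactly once.

```python
FLOPS_PER_POINT = 6   # 3 adds for the neighbor sum, 2 multiplies, 1 final add
BYTES_PER_VALUE = 4   # assumes single-precision values


def stencil_5pt(grid, c_center=0.5, c_neighbor=0.125):
    """Apply one 5-point stencil sweep; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbors = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1])
            out[i][j] = c_center * grid[i][j] + c_neighbor * neighbors
    return out


def intensity_no_reuse():
    """FLOPs per byte when every point does 5 global loads and 1 store."""
    return FLOPS_PER_POINT / ((5 + 1) * BYTES_PER_VALUE)


def intensity_tiled(tile):
    """FLOPs per byte when a tile x tile block is staged once with its halo."""
    loads = tile * tile + 4 * tile   # interior plus four halo edges (corners unused)
    stores = tile * tile
    flops = FLOPS_PER_POINT * tile * tile
    return flops / ((loads + stores) * BYTES_PER_VALUE)
```

Under these assumptions the intensity rises from 0.25 FLOPs/byte with no reuse to about 0.67 for a 16x16 tile and about 0.71 for a 32x32 tile, approaching 0.75 as the halo overhead shrinks. The operation stays memory-bound either way, which is why reducing global-memory traffic is the main lever.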
Scenario
Develop a custom fused CUDA kernel for the core operations of a transformer encoder (LayerNorm, QKV projection, attention score, softmax) to minimize memory bandwidth usage and kernel launch latency, targeting A100 or H100 GPUs.
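One ingredient often used when fusing the softmax step is the "online" softmax recurrence, which tracks the running maximum and the running sum in a single pass so the scores need not be re-read from memory just to find the maximum. The scenario above does not prescribe this technique. The Python sketch below is a CPU-side illustration with hypothetical function names, and it assumes finite input scores.

```python
import math


def softmax_three_pass(scores):
    """Baseline: separate passes for the max, the exponent sum, and normalization."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_online(scores):
    """Single-pass max and sum (online softmax), then one normalization pass.

    Assumes every score is finite.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the sum accumulated so far to the new maximum, then add the new term.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(s - new_max))
        running_max = new_max
    return [math.exp(s - running_max) / running_sum for s in scores]
```

Both functions should agree to within floating-point rounding. In a fused kernel the same recurrence would let a block process attention scores tile by tile while they are still in registers or shared memory.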
Nsight Compute is for kernel-level analysis (memory transactions, stall reasons, SM utilization). Nsight Systems is for system-level profiling (API calls, kernel launches, CPU-GPU overlap). Use Nsight Compute to drill into a single kernel's bottlenecks.
Use cuBLAS/cuDNN for optimized AI primitives (GEMM, convolutions) as a baseline. CUTLASS provides templated C++ abstractions for writing custom, high-performance kernels using Tensor Cores. Thrust/CUB provide high-level parallel algorithms and device-wide primitives for reductions and scans.
Use CUDA-GDB for kernel debugging and Compute Sanitizer (`memcheck`, `racecheck`) to detect memory errors, race conditions, and leaks. IDE extensions provide integrated build, debug, and profiling support.