AI Model Compression Engineer
An AI Model Compression Engineer specializes in optimizing and shrinking large, computationally expensive machine learning models …
Skill Guide
The disciplined practice of extracting maximum performance from CPU and GPU hardware by writing C/C++ and NVIDIA CUDA code that directly manages the memory hierarchy, instruction pipelines, and parallel execution units.
Scenario
Accelerate the addition of two large vectors (1M+ elements) on the GPU, comparing naive and optimized kernel performance.
Scenario
Implement and optimize tiled matrix multiplication for matrices that don't fit in L2 cache, minimizing global memory accesses.
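The tiling idea in this scenario can be sketched on the CPU first, where it is easy to test against a reference. The following is a minimal sketch, not a tuned implementation: the function names `matmul_naive` and `matmul_tiled`, the square row-major layout, and the default tile size of 32 are all assumptions made for illustration. A CUDA version would typically stage each tile in shared memory rather than rely on the CPU cache, but the loop structure is analogous.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: C = A * B for row-major n x n matrices (hypothetical helper).
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}

// Tiled variant: works on tile x tile blocks so the block of B being read
// is reused many times while it is still in fast memory.
// The tile size is an assumed tuning knob; n need not be a multiple of it.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n, std::size_t tile = 32) {
    assert(a.size() == n * n && b.size() == n * n && tile > 0);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Multiply one block of A by one block of B into a block of C.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Keeping the naive version around gives a correctness oracle for every later optimization; the best tile size depends on the cache (or shared memory) capacity of the target and should be chosen by measurement.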
Scenario
Develop a CUDA kernel for sparse matrix-vector multiplication (SpMV) on a matrix stored in CSR (Compressed Sparse Row) format that outperforms library implementations for your specific sparsity pattern.
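Before writing a GPU kernel for this scenario, it may help to have a plain CPU reference to validate results against. The sketch below shows the standard CSR layout (row pointers, column indices, values) and a serial sparse matrix-vector product over it. The struct name `CsrMatrix`, its field names, and the function `spmv_csr` are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container: row r owns the entries in the half-open
// range [row_ptr[r], row_ptr[r + 1]) of col_idx and values.
struct CsrMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::size_t> row_ptr;  // length rows + 1
    std::vector<std::size_t> col_idx;  // length = number of nonzeros
    std::vector<double> values;        // length = number of nonzeros
};

// Serial reference: y = M * x. One output element per row, which is also
// the natural unit of work to hand to a GPU thread or warp.
std::vector<double> spmv_csr(const CsrMatrix& m, const std::vector<double>& x) {
    assert(m.row_ptr.size() == m.rows + 1);
    assert(m.col_idx.size() == m.values.size());
    assert(x.size() == m.cols);
    std::vector<double> y(m.rows, 0.0);
    for (std::size_t r = 0; r < m.rows; ++r) {
        double sum = 0.0;
        for (std::size_t p = m.row_ptr[r]; p < m.row_ptr[r + 1]; ++p) {
            sum += m.values[p] * x[m.col_idx[p]];  // indirect gather from x
        }
        y[r] = sum;
    }
    return y;
}
```

The indirect read `x[m.col_idx[p]]` is where the sparsity pattern matters: how the row lengths are distributed and how scattered the column indices are largely decide which GPU mapping (one thread per row, one warp per row, or something else) is worth trying.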
The primary code generators. Use compiler flags aggressively and inspect assembly output (`-S`) to verify optimizations like auto-vectorization (SIMD) and loop unrolling.
Essential for evidence-based optimization. Nsight Compute identifies GPU warp stalls, memory throughput, and instruction mix. Perf/VTune analyzes CPU cache misses, branch mispredictions, and IPC.
Use high-quality, vendor-optimized libraries for common patterns (sort, reduce) before writing custom kernels. CUB is the standard for device-wide primitives in CUDA.
Answer Strategy
Tests the candidate's ability to interpret profiler data and reason about hardware. The scenario indicates a kernel bound by memory latency rather than memory throughput: the high L2 hit rate shows that accesses have good locality, but the latency of those accesses is not being hidden.
Strategy:
1) Increase occupancy (if it is low) so that more warps are available to hide latency.
2) Introduce explicit data prefetching, or restructure the computation so that several independent memory accesses are in flight at once.
3) Use asynchronous memory copies (`cp.async`) if the target architecture supports them.
Sample Answer: 'The kernel is memory latency-bound. High L2 hits are good, but the SM is idle waiting for data. I'd first check occupancy; if it's low, I'd reduce register usage or shared memory per block to launch more concurrent warps. If occupancy is already high, I'd restructure the algorithm to have more independent arithmetic between memory loads to hide latency.'
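The "independent work between loads" idea from the strategy can be illustrated with a small CPU analogue. This is a sketch under stated assumptions, not the kernel from the interview scenario: the function names are hypothetical, and the choice of four accumulators is arbitrary. The first function forms one long dependency chain; the second splits the work into four chains that do not depend on each other, so loads and adds from different chains can overlap. The same restructuring is commonly applied per thread inside a GPU kernel.

```cpp
#include <cstddef>
#include <vector>

// One dependency chain: each add must wait for the previous add to finish.
double sum_single_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) {
        s += x;
    }
    return s;
}

// Four independent chains: the four loads and adds in each iteration have
// no data dependency on one another, so the hardware can overlap them.
double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        s0 += v[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the chain changes the order of floating-point additions, so results can differ in the last bits for general inputs; whether the restructured version is actually faster should be confirmed with a profiler rather than assumed.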
Answer Strategy
Tests systematic methodology and communication. The core competency is a data-driven, iterative optimization cycle. A professional response should outline:
1) Establishing a baseline and defining the performance target.
2) Profiling to identify the hottest code paths (80/20 rule).
3) Formulating and implementing a hypothesis (e.g., changing data layout).
4) Re-profiling to measure the effect and verify no regressions.
5) Documenting the change and impact.
Avoid vague answers like 'I made the code faster'.
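The baseline and re-measure steps of that cycle can be supported by a very small timing harness. The sketch below is one possible shape, assuming wall-clock timing of a host-side callable; the names `median_ms` and `speedup` and the default of nine repeats are hypothetical choices for this example. For GPU kernels, the callable would also need to wait for the device to finish (for instance with `cudaDeviceSynchronize` or CUDA events), otherwise only the launch overhead is measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs fn `repeats` times and returns the median wall-clock time in
// milliseconds. The median is less sensitive to one-off outliers than the mean.
template <typename Fn>
double median_ms(Fn&& fn, int repeats = 9) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Speedup of a candidate over the baseline; values above 1.0 mean faster.
double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

Recording the baseline median before any change and reporting the speedup after each change gives the concrete numbers that the documentation step asks for.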