AI Quantization Engineer
An AI Quantization Engineer specializes in compressing and optimizing large, computationally expensive AI models for efficient dep…
Skill Guide
The practice of writing or modifying code at the hardware instruction level (Assembly) or using Single Instruction, Multiple Data (SIMD) intrinsics to maximize computational throughput by exploiting CPU parallelism and minimizing latency.
Scenario
Given a raw 24-bit RGB image buffer, convert each pixel to grayscale using the standard luminosity formula (0.21*R + 0.72*G + 0.07*B).
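A scalar baseline is a useful first step for this scenario, since any SIMD version needs a reference to validate against. The sketch below assumes packed R, G, B byte order and replaces the floating-point weights with 8-bit fixed-point equivalents (54, 184, 18, which sum to 256); the function name `rgb_to_gray` and the rounding choice are illustrative assumptions, not part of the scenario.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Weights 0.21, 0.72, 0.07 scaled by 256 and rounded (assumed fixed-point form). */
enum { W_R = 54, W_G = 184, W_B = 18 };

/* Convert `pixels` packed RGB triplets in `rgb` into one gray byte each in `gray`. */
void rgb_to_gray(const uint8_t *restrict rgb, uint8_t *restrict gray, size_t pixels)
{
    assert(rgb && gray);
    for (size_t i = 0; i < pixels; ++i) {
        uint32_t r = rgb[3 * i + 0];
        uint32_t g = rgb[3 * i + 1];
        uint32_t b = rgb[3 * i + 2];
        /* Adding 128 rounds to nearest before the shift divides by 256. */
        gray[i] = (uint8_t)((W_R * r + W_G * g + W_B * b + 128) >> 8);
    }
}
```

Integer arithmetic keeps every pixel in 16 bits of headroom, which is also the form a hand-written SIMD version would likely use. The 3-byte stride is the hard part for vectorization, because the channels must be deinterleaved before they can be multiplied lane by lane.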
Scenario
Implement a single-precision floating-point matrix multiplication (C = A * B) for matrices of size 1024x1024, achieving at least 50% of theoretical peak FLOPS on your CPU.
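One common starting point for this scenario is cache blocking (tiling). The sketch below shows only the loop structure, assuming row-major square matrices and a hypothetical tile edge of 64; on its own it is unlikely to reach the 50% target, which typically also needs a register-blocked inner kernel written with FMA intrinsics.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 64 }; /* assumed tile edge; tune per cache level */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices. The caller zeroes C first. */
void sgemm_tiled(const float *A, const float *B, float *C, size_t n)
{
    assert(A && B && C);
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_sz(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_sz(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_sz(jj + TILE, n);
                /* Work on one tile of A, B, and C while they stay in cache. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        float a = A[i * n + k];
                        /* Unit stride over j, so the compiler can vectorize. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

An n x n multiplication performs 2n^3 floating-point operations, so dividing that count by the measured run time gives the achieved FLOPS to compare against the CPU's theoretical peak.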
Scenario
Design a library function (e.g., `fast_exp`) that runs optimally on any x86 CPU from Sandy Bridge to Sapphire Rapids, detecting AVX-512, AVX2, or SSE4.2 at runtime.
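A minimal sketch of the dispatch part of this scenario, assuming GCC or Clang, whose `__builtin_cpu_supports` built-in queries CPU features at runtime. The names `isa_level`, `detect_isa`, `select_kernel`, and `exp_scalar_placeholder` are hypothetical, and the scalar kernel is a deliberately crude stand-in; real SSE4.2, AVX2, and AVX-512 kernels would fill the empty table slots.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ISA_SCALAR = 0, ISA_SSE42, ISA_AVX2, ISA_AVX512, ISA_COUNT } isa_level;
typedef void (*exp_kernel)(const float *in, float *out, size_t n);

/* Report the widest supported level; other compilers or CPUs fall back to scalar. */
isa_level detect_isa(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return ISA_AVX512;
    if (__builtin_cpu_supports("avx2"))    return ISA_AVX2;
    if (__builtin_cpu_supports("sse4.2"))  return ISA_SSE42;
#endif
    return ISA_SCALAR;
}

/* Placeholder only: exp(x) ~ (1 + x/1024)^1024 via ten squarings. Low accuracy. */
void exp_scalar_placeholder(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float y = 1.0f + in[i] / 1024.0f;
        for (int s = 0; s < 10; ++s) y *= y;
        out[i] = y;
    }
}

/* Pick the widest kernel at or below the CPU's level that was compiled in. */
exp_kernel select_kernel(isa_level cpu, const exp_kernel table[ISA_COUNT])
{
    for (int level = (int)cpu; level > ISA_SCALAR; --level)
        if (table[level]) return table[level];
    return table[ISA_SCALAR];
}

/* Public entry point: resolve once, then call through the cached pointer. */
void fast_exp(const float *in, float *out, size_t n)
{
    static exp_kernel chosen = NULL; /* lazy init; not thread-safe in this sketch */
    if (!chosen) {
        const exp_kernel table[ISA_COUNT] = { exp_scalar_placeholder, NULL, NULL, NULL };
        chosen = select_kernel(detect_isa(), table);
    }
    assert(chosen);
    chosen(in, out, n);
}
```

Checking from widest to narrowest matters because a CPU with AVX-512 also reports AVX2 and SSE4.2. Caching the function pointer keeps the feature check off the hot path.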
GCC/Clang are the primary compilers for generating and analyzing assembly. Use flags like `-O3 -march=native -S` to inspect output. NASM/YASM/MASM are used for writing pure assembly modules linked with higher-level code.
Profilers such as perf and Intel VTune are essential for identifying hotspots, cache misses, branch mispredictions, and pipeline stalls. VTune is the industry standard for deep micro-architectural analysis. Use profiler output to guide optimization efforts rather than guessing where time is spent.
GDB is used for stepping through assembly one instruction at a time. Compiler Explorer is critical for instantly seeing how C/C++ intrinsics or code patterns translate to assembly across compilers and versions. Visualizers help you understand register usage and data movement.
Use oneMKL or SLEEF as reference implementations for highly optimized SIMD math. Agner Fog's optimization manuals are the definitive guide for instruction latencies, throughput, and optimization techniques.
Answer Strategy
Demonstrate a structured optimization methodology: Profile to confirm it's a hotspot -> Analyze compiler output -> Ensure data alignment -> Implement with AVX2 and the FMA intrinsic `_mm256_fmadd_ps` -> Use loop unrolling to maximize instruction-level parallelism -> Validate performance gain and correctness. Sample: 'First, I'd use perf to confirm the loop is a bottleneck. Then, I'd examine the compiler's auto-vectorized assembly. If it is suboptimal, I'd manually implement it with AVX2 intrinsics, loading aligned 256-bit chunks, using `_mm256_fmadd_ps` (which requires the FMA extension) for fused operations, and unrolling the loop by a factor of 4 or 8, with independent accumulators if the loop is a reduction, to hide FMA latency and keep the FMA units busy. I'd benchmark each step to measure the impact and check the results against the scalar version.'
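The intrinsics step of that answer can be sketched as follows, assuming the hot loop is a dot product (the question does not say which loop it is) and assuming GCC or Clang on x86. The names `dot`, `dot_avx2`, and `dot_scalar` are hypothetical. Unaligned loads are used so the sketch works with any buffer; with 32-byte-aligned data, `_mm256_load_ps` could be used instead.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Reference version, also used as the fallback and for validation. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#ifdef HAVE_X86_DISPATCH
/* Unrolled by 4 with independent accumulators so FMAs do not wait on each other. */
__attribute__((target("avx2,fma")))
static float dot_avx2(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    /* Combine the four accumulators, then sum the eight lanes. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) sum += lanes[k];
    /* Scalar tail for the last n % 32 elements. */
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
#endif

/* Use the AVX2+FMA path only when the running CPU supports both. */
float dot(const float *a, const float *b, size_t n)
{
    assert(a && b);
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return dot_avx2(a, b, n);
#endif
    return dot_scalar(a, b, n);
}
```

The SIMD version sums in a different order than the scalar one, so results on arbitrary data can differ in the last bits; validation should compare within a tolerance rather than for exact equality.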
Answer Strategy
Tests understanding of real-world constraints beyond raw compute. Look for mentions of memory bandwidth limits, data alignment issues, and the cost of data shuffling. Sample: 'A common pitfall is when the problem is memory-bandwidth bound, not compute-bound. For example, applying a simple SIMD operation to a massive array that doesn't fit in cache will be limited by DRAM bandwidth. I'd diagnose this using VTune to check for high "Memory Bound" metrics and cache miss rates. The fix would be to improve data locality (blocking) before or instead of adding more SIMD complexity.'
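The locality fix in that answer can be illustrated with a small sketch. It assumes two simple passes (scale, then add a bias) over one large array; the function names and the block size of 4096 floats (16 KiB, assumed to fit in L1 data cache) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK = 4096 }; /* floats per block; assumed to fit in L1 data cache */

static void scale(float *x, size_t n, float s)
{
    for (size_t i = 0; i < n; ++i) x[i] *= s;
}

static void add_bias(float *x, size_t n, float b)
{
    for (size_t i = 0; i < n; ++i) x[i] += b;
}

/* Unblocked: when x exceeds the cache, each pass streams the array from DRAM. */
void transform_two_pass(float *x, size_t n, float s, float b)
{
    assert(x);
    scale(x, n, s);
    add_bias(x, n, b);
}

/* Blocked: both passes touch a block while it is still cache-resident. */
void transform_blocked(float *x, size_t n, float s, float b)
{
    assert(x);
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? (n - start) : BLOCK;
        scale(x + start, len, s);
        add_bias(x + start, len, b);
    }
}
```

Both versions do the same arithmetic per element; the blocked one roughly halves DRAM traffic for arrays larger than the cache, which is the kind of gain that wider SIMD cannot deliver on a bandwidth-bound loop.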