AI Model Compression Engineer
An AI Model Compression Engineer specializes in optimizing and shrinking large, computationally expensive machine learning models …
Skill Guide
The practice of tailoring software, algorithms, and computational workloads to exploit the unique architectural strengths and mitigate the bottlenecks of specific processing units (CPU, GPU, NPU, DSP) to maximize performance, throughput, and power efficiency.
Scenario
You are tasked with comparing the performance of a large matrix multiplication (e.g., 4096x4096 FP32) on a multi-core CPU versus a discrete GPU to determine which is more cost-effective for a batch processing pipeline.
Scenario
You have a pre-trained image classification model (e.g., MobileNetV3) that must run in <5ms latency on a smartphone's NPU (e.g., Qualcomm Hexagon DSP, Apple ANE) for a real-time camera feature.
Scenario
Design and implement a system that ingests 100+ concurrent 1080p video streams, performs object detection (YOLOv5), and outputs metadata with <100ms end-to-end latency on a server with 2x Intel Xeon CPUs, 4x NVIDIA A100 GPUs, and a SmartNIC with offload capabilities.
Non-negotiable first step. Used to identify hotspots, memory bottlenecks, cache misses, and kernel occupancy on their respective architectures. The choice is dictated by the target hardware.
Pre-optimized, vendor-tuned building blocks. Use these for baseline high performance before writing custom kernels. They handle architecture-specific optimizations (e.g., AVX-512, Tensor Cores) automatically.
For writing portable code that can target CPUs, GPUs, and accelerators. Use when you need to support multiple hardware vendors (Intel, NVIDIA, AMD, ARM) with a single codebase, accepting potential performance trade-offs vs. native SDKs.
Answer Strategy
Demonstrate systematic profiling. Sample Answer: 'First, I'd check for a memory bandwidth bottleneck using Nsight Compute's memory chart. High occupancy with low throughput often indicates threads are stalled waiting for data, not computing. I'd validate by looking at L2 cache hit rates and global memory throughput. Second, I'd inspect kernel launch latency and grid occupancy; perhaps we have many small kernels creating overhead. I'd use Nsight Systems' trace to see the gap between kernel launches. Third, I'd check for warp divergence, where threads in a warp take different execution paths, serializing execution. I'd look at the warp state statistics in the profiler.'
Answer Strategy
Tests pragmatic decision-making and business impact awareness. Sample Answer: 'We were building a cross-platform inference engine. Our SYCL code was portable but 20% slower on NVIDIA GPUs than our CUDA baseline. The deadline was for a flagship product on NVIDIA hardware. The trade-off: I recommended shipping with the CUDA-optimized path for the v1 release, while maintaining the SYCL path in a separate branch. We documented the performance gap and the architectural reasons (specific memory coalescing patterns). This met the business performance target for launch, and we later invested in closing the SYCL gap for v2 to support Intel GPUs, which became a key sales differentiator.'
1 career found
Try a different search term.