AI Local LLM Engineer
An AI Local LLM Engineer specializes in deploying, optimizing, and maintaining large language models that run entirely on local or…
Skill Guide
The systematic profiling, analysis, and optimization of computational kernels and data movement across heterogeneous hardware accelerators (GPU, CPU SIMD, NPU) to maximize throughput, minimize latency, and achieve optimal performance-per-watt.
Scenario
A naive matrix transpose kernel is running 3x slower than the memory bandwidth limit suggests it should.
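One common cause of this gap is uncoalesced access: a naive transpose reads along rows but writes down columns, or the reverse. The sketch below is a simplified model, not a profiler measurement. It counts how many memory segments a single warp touches for a given access stride. The 128-byte segment size, the float32 element size, and the matrix width `n` are assumptions chosen for illustration.

```python
# Simplified model: count the distinct memory segments one warp touches.
# SEGMENT_BYTES and ELEM_BYTES are illustrative assumptions, not measured values.
WARP_SIZE = 32        # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128   # assumed memory transaction granularity
ELEM_BYTES = 4        # float32


def segments_touched(stride_elems: int, base_elem: int = 0) -> int:
    """Distinct segments hit when thread t accesses element base_elem + t * stride_elems."""
    segments = set()
    for t in range(WARP_SIZE):
        byte_addr = (base_elem + t * stride_elems) * ELEM_BYTES
        segments.add(byte_addr // SEGMENT_BYTES)
    return len(segments)


n = 1024  # hypothetical matrix width in elements
row_access = segments_touched(1)   # neighbouring threads touch neighbouring elements
col_access = segments_touched(n)   # neighbouring threads are a full row apart
print(f"row-wise: {row_access} segment(s), column-wise: {col_access} segment(s)")
```

In this model the column-wise side needs one segment per thread, which is why a tiled transpose usually stages data in shared memory so that both the global read and the global write run along rows.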
Scenario
Design a system to run object detection on a 4K video stream with <50ms latency on a system with an NVIDIA GPU, a modern x86 CPU, and an NPU.
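A reasonable first step for this scenario is a per-frame latency budget. The sketch below adds up stage times and checks them against the 50 ms target. Every stage time and the 12 GB/s effective transfer bandwidth are hypothetical placeholders, and the frame is assumed to be 3840x2160 8-bit RGB; real values would come from profiling the target system.

```python
# Back-of-the-envelope latency budget. All stage times and the bandwidth figure
# are hypothetical placeholders, not measurements.
FRAME_WIDTH, FRAME_HEIGHT, CHANNELS = 3840, 2160, 3  # assumed 4K UHD, 8-bit RGB
frame_bytes = FRAME_WIDTH * FRAME_HEIGHT * CHANNELS


def transfer_ms(nbytes: int, gb_per_s: float) -> float:
    """Time in milliseconds to move nbytes at an effective bandwidth of gb_per_s GB/s."""
    return nbytes / (gb_per_s * 1e9) * 1e3


def within_budget(stage_ms: dict, budget_ms: float = 50.0):
    """Return (total latency, whether it fits the budget) for sequential stages."""
    total = sum(stage_ms.values())
    return total, total <= budget_ms


stages = {
    "decode (CPU)": 8.0,                                  # hypothetical
    "host-to-device copy": transfer_ms(frame_bytes, 12.0),  # assumed 12 GB/s
    "preprocess + inference (GPU)": 25.0,                 # hypothetical
    "postprocess (CPU)": 5.0,                             # hypothetical
}
total_ms, ok = within_budget(stages)
print(f"total {total_ms:.2f} ms, within budget: {ok}")
```

The model treats the stages as strictly sequential, which bounds the latency of a single frame. Overlapping stages across frames improves throughput but does not shorten that per-frame path.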
Scenario
A GNN training framework suffers from OOM errors and severe memory fragmentation due to variable-sized graph tensors.
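One way to reduce fragmentation from variable-sized tensors is to round each request up to a small set of bucket sizes and reuse freed blocks from the matching bucket. The sketch below models only the bookkeeping, with integer block ids standing in for device memory. `BucketedPool`, `bucket_size`, and the 256-byte minimum bucket are hypothetical names and values, not part of any framework's API.

```python
from collections import defaultdict


def bucket_size(nbytes: int, min_bucket: int = 256) -> int:
    """Round a request up to the next power-of-two bucket (at least min_bucket)."""
    size = min_bucket
    while size < nbytes:
        size *= 2
    return size


class BucketedPool:
    """Bookkeeping-only sketch: block ids stand in for real device allocations."""

    def __init__(self):
        self._free = defaultdict(list)  # bucket size -> ids of free blocks
        self._next_id = 0
        self.fresh_allocations = 0      # how often the pool had to grow

    def alloc(self, nbytes: int):
        size = bucket_size(nbytes)
        if self._free[size]:
            return self._free[size].pop(), size  # reuse a freed block
        self.fresh_allocations += 1
        block_id = self._next_id
        self._next_id += 1
        return block_id, size

    def free(self, block_id: int, size: int) -> None:
        self._free[size].append(block_id)


pool = BucketedPool()
first = pool.alloc(1000)   # rounded up to the 1024-byte bucket
pool.free(*first)
second = pool.alloc(900)   # a different size, but the same bucket, so it is reused
print(first, second, pool.fresh_allocations)
```

The trade-off is internal waste: a request just over a power of two occupies nearly twice its size, so finer-grained buckets may be preferable when memory is tight.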
Used for timeline analysis, kernel-level metrics (stalls, occupancy, memory throughput), and system-wide bottleneck identification. The primary tool for moving from guesswork to data-driven optimization.
Highly optimized, vendor-provided implementations of common operations (GEMM, conv). Learning to correctly use and configure them is the first step to achieving near-peak performance.
For managing complex execution graphs, data transfers, and memory lifecycles to hide latency and maximize hardware utilization across multiple devices or processes.
Answer Strategy
The candidate must demonstrate a systematic, tool-driven approach. They should mention: 1) Checking memory transaction efficiency (coalescing), 2) Analyzing shared memory usage and bank conflicts, 3) Considering algorithmic changes (tiling, recomputation) to increase arithmetic intensity, 4) Verifying they aren't hitting a memory bandwidth limit. Sample Answer: 'I would first inspect the `Global Memory Load/Store Efficiency` metric. If it is below 100%, I'd restructure the memory access patterns. Next, I'd check for shared memory bank conflicts via the `Shared Memory Bank Conflict` metric. If both are optimal, I'd examine the kernel's arithmetic intensity, perhaps implementing loop tiling to reuse data from shared memory and thereby increase FLOPs per byte loaded from global memory.'
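The link between tiling and arithmetic intensity can be made concrete with a roofline-style estimate. The sketch below uses an idealized square matrix multiply in which each T x T output tile loads N/T tiles of A and of B, which gives T / 4 FLOPs per byte for float32. It ignores writes to C and cache effects. The peak compute and bandwidth figures are hypothetical, not taken from any specific device.

```python
def tiled_matmul_intensity(tile: int, elem_bytes: int = 4) -> float:
    """FLOPs per byte of global loads for an idealized tiled N x N matmul.

    FLOPs = 2 * N**3; loaded bytes = 2 * N**3 / tile * elem_bytes (A and B only),
    so the ratio is tile / elem_bytes and does not depend on N.
    """
    return tile / elem_bytes


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_gflops, intensity * bandwidth_gbs)


PEAK_GFLOPS = 10_000.0   # hypothetical compute ceiling
BANDWIDTH_GBS = 500.0    # hypothetical global-memory bandwidth

for tile in (1, 16, 32):
    ai = tiled_matmul_intensity(tile)
    bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"tile={tile:2d}  intensity={ai:5.2f} FLOP/byte  bound={bound:7.1f} GFLOP/s")
```

Under these assumed numbers the untiled case (tile of 1) is bound by bandwidth at a small fraction of peak, which is the situation point 4 of the strategy asks the candidate to rule in or out.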
Answer Strategy
This tests understanding of trade-offs beyond peak FLOPs. The answer should highlight latency, power, data movement cost, and hardware availability. Sample Answer: 'For a sub-2ms latency requirement on a single-image classification task in an edge device without a discrete GPU, I used AVX2. The key factors were: 1) Eliminating PCIe/host-device memory copy latency, 2) Lower power consumption, 3) Avoiding GPU kernel launch overhead for a very small workload. I optimized the model with quantization (INT8) and used Intel's OpenVINO to leverage VNNI instructions for maximum throughput.'
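The INT8 quantization mentioned in the sample answer can be illustrated with a minimal symmetric per-tensor scheme. This is a generic sketch of the idea, not the calibration procedure OpenVINO uses; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]   # toy values for illustration
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_error)
```

Each restored value differs from the original by at most about half a quantization step. That rounding error is the accuracy cost traded for smaller weights and integer arithmetic.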