Interview Prep
AI Model Compression Engineer Interview Questions
49 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A great answer covers reducing model size, compute, and/or memory requirements to enable deployment on resource-constrained devices, often with a trade-off against some accuracy.
A good answer defines quantization as reducing numerical precision (e.g., FP32 to INT8) and pruning as removing redundant weights or neurons from a network.
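A candidate could make both definitions concrete with a few lines of code. The sketch below is illustrative only: it assumes symmetric INT8 quantization with a single scale and simple magnitude-based pruning, and the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


def magnitude_prune(values, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(values) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = list(values)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.82, -0.03, 0.41, -1.27, 0.005, 0.66]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
pruned = magnitude_prune(weights, sparsity=0.5)
```

Quantization keeps every weight but stores it coarsely (each restored value is within half a quantization step of the original), whereas pruning keeps full precision but discards the smallest weights outright.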
The answer should describe ONNX as an open interchange format for ML models that enables framework interoperability and deployment on various runtimes and hardware.
Look for mentions of accuracy (e.g., top-1 accuracy), model size (MB), inference latency (ms), memory footprint, and possibly FLOPs or energy consumption.
A correct answer explains that PTQ converts a model's weights and activations to lower precision after training is complete, requiring little or no retraining, often using a calibration dataset.
Intermediate
9 questions
The answer should contrast removing individual weights (unstructured, leads to sparse matrices needing special libraries) versus removing entire filters/channels (structured, leads to smaller dense models that run efficiently on standard hardware).
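The contrast is easy to show on a toy weight matrix. This is a minimal sketch, assuming each row is one output channel and that channel importance is measured by L1 norm; the helper names and values are hypothetical.

```python
def unstructured_prune(matrix, threshold):
    """Zero individual weights below a magnitude threshold; the shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]


def structured_prune(matrix, keep):
    """Keep only the `keep` rows (output channels) with the largest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in matrix]
    top = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)[:keep]
    return [matrix[i] for i in sorted(top)]


W = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.6],
    [0.05, -0.04, 0.03],
    [-0.5, 0.4, 0.02],
]
sparse_W = unstructured_prune(W, threshold=0.1)  # still 4x3, but with zeros scattered inside
dense_W = structured_prune(W, keep=2)            # a genuinely smaller 2x3 dense matrix
```

The unstructured result only saves time or memory if the runtime exploits sparsity, while the structured result is an ordinary smaller matrix that any dense kernel handles.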
A strong response explains QAT simulates quantization during training to make the model robust, resulting in higher accuracy but more training cost. It's chosen when PTQ accuracy drops are unacceptable.
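To illustrate "simulating quantization during training", a candidate might sketch fake quantization on a single weight. The example below is a toy under stated assumptions: a hand-written gradient step, a deliberately coarse scale, and the straight-through estimator (treating the rounding step as identity in the backward pass). It is not how any particular framework implements QAT.

```python
def fake_quant(w, scale, qmin=-127, qmax=127):
    """Quantize then immediately dequantize, so training sees the rounding error."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_with_fake_quant(w, x, target, scale, lr=0.1, steps=50):
    """Fit one weight so that fake_quant(w) * x approaches target."""
    for _ in range(steps):
        pred = fake_quant(w, scale) * x
        # Straight-through estimator: treat d(fake_quant)/dw as 1.
        grad = 2 * (pred - target) * x
        w -= lr * grad
    return w


scale = 0.25  # coarse grid (0, 0.25, 0.5, ...) to make rounding visible
w = train_with_fake_quant(w=0.0, x=1.0, target=0.8, scale=scale)
deployed = fake_quant(w, scale)  # the value the quantized model would actually use
```

The latent weight stays in floating point during training, but the loss is always computed on the rounded value, so the optimizer settles on a grid point adjacent to the target.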
The candidate should describe training a smaller 'student' model to mimic the output (soft targets) or intermediate representations of a larger, pre-trained 'teacher' model to transfer knowledge efficiently.
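The soft-target idea can be sketched as a loss function. This is a minimal version, assuming a temperature-softened KL term blended with hard-label cross-entropy and the commonly used T² scaling; `alpha`, `temperature`, and the logits are hypothetical choices.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * soft-target KL term + (1 - alpha) * hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on softened distributions, scaled by T^2.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    soft = kl * temperature ** 2
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard


teacher = [4.0, 1.0, 0.5]
loss_matching = distillation_loss([4.0, 1.0, 0.5], teacher, label=0)
loss_mismatched = distillation_loss([0.5, 1.0, 4.0], teacher, label=0)
```

A student whose logits match the teacher's pays no soft-target penalty, so only the hard-label term remains; a student that disagrees is penalized by both terms.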
A great answer discusses that INT8 is often faster on CPUs and mobile NPUs, while GPUs may benefit more from FP16 or TensorFloat-32. It should mention operator and kernel support as a key factor.
The answer should state it's a representative subset of training/validation data used to determine the dynamic range of activations for setting quantization scales, crucial for accuracy.
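How a calibration set turns into quantization parameters can be sketched as follows. This assumes simple min/max range estimation and asymmetric unsigned 8-bit quantization; real toolchains often use percentile or entropy-based range estimates instead, and the batches here are hypothetical.

```python
def calibrate(activation_batches, qmin=0, qmax=255):
    """Derive an asymmetric (scale, zero_point) pair from observed min/max."""
    lo = min(min(batch) for batch in activation_batches)
    hi = max(max(batch) for batch in activation_batches)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize one activation, clipping values outside the calibrated range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


# Hypothetical activations recorded while running the calibration set.
calibration_batches = [[-1.0, 0.2, 3.0], [0.5, 4.1, -0.7], [2.2, -0.1, 1.9]]
scale, zero_point = calibrate(calibration_batches)
```

Any production activation outside the observed range is clipped, which is why an unrepresentative calibration set translates directly into accuracy loss.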
Look for specific mentions like torch.quantization, torch.ao.quantization, ONNX, ONNX Runtime with quantization tools, or Torch-TensorRT.
The candidate should explain fusing multiple operations (e.g., Conv + BatchNorm + ReLU) into a single kernel to reduce memory bandwidth and launch overhead, which frameworks and compilers like TensorRT do automatically.
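One piece of this fusion, folding BatchNorm into the preceding layer's weights, is pure arithmetic and can be shown directly. The sketch below uses a small linear layer in place of a convolution for brevity; all parameter values are hypothetical, and it does not reproduce any compiler's actual fusion pass.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BatchNorm parameters into a linear layer."""
    folded_w, folded_b = [], []
    for c, row in enumerate(weights):
        k = gamma[c] / math.sqrt(var[c] + eps)
        folded_w.append([w * k for w in row])
        folded_b.append((bias[c] - mean[c]) * k + beta[c])
    return folded_w, folded_b


def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


weights, bias = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [4.0, 0.25]
x = [1.0, 2.0]

unfused = batchnorm(linear(x, weights, bias), gamma, beta, mean, var)  # two passes
fw, fb = fold_batchnorm(weights, bias, gamma, beta, mean, var)
fused = linear(x, fw, fb)                                              # one pass
```

The fused layer produces the same output with one pass over the data instead of two, and a following ReLU can then be applied inside the same kernel.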
The answer should cover both static model size (weights, parameters) and dynamic memory usage (activations, temporary buffers), often measured using profilers or specialized scripts.
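A back-of-the-envelope estimate of both parts might look like the sketch below. It assumes a purely sequential network in which only one layer's input and output tensors are live at a time; the layer sizes are hypothetical, and a real profiler would also capture allocator and workspace overhead.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(param_counts, dtype):
    """Static footprint: total parameters times bytes per element."""
    return sum(param_counts) * BYTES_PER_ELEMENT[dtype]


def peak_activation_bytes(activation_sizes, dtype):
    """Rough dynamic peak: a layer's input and output coexist in memory."""
    pairs = zip(activation_sizes, activation_sizes[1:])
    return max(a + b for a, b in pairs) * BYTES_PER_ELEMENT[dtype]


params = [1_000_000, 250_000, 10_000]             # hypothetical per-layer parameter counts
activations = [150_528, 802_816, 200_704, 1_000]  # hypothetical tensor sizes between layers

fp32_weights = weight_bytes(params, "fp32")
int8_weights = weight_bytes(params, "int8")
fp32_peak = peak_activation_bytes(activations, "fp32")
```

In this example the activation peak is almost as large as the FP32 weights, which is why quantizing weights alone may not be enough to fit a memory budget.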
A good response includes hardware compatibility, lack of support for certain ops, thermal throttling, battery impact, and the need for thorough on-device testing.
Advanced
10 questions
The candidate should explain the hypothesis that dense networks contain sparse subnetworks ('winning tickets') that can be trained to comparable accuracy, opening up research into pruning at initialization.
A comprehensive answer would cover a multi-pronged strategy: aggressive quantization (e.g., 4-bit GPTQ or AWQ), structured pruning of attention heads/FFN layers, potentially distillation, and using efficient inference engines like llama.cpp or MLC-LLM.
The answer should describe assigning different bit-widths (e.g., 4-bit, 8-bit) to different layers based on sensitivity analysis. Methods include using Hessian-based metrics, Monte Carlo simulations, or simple accuracy profiling on a validation set.
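A simple way to turn sensitivity scores into a bit-width assignment is a greedy pass. The sketch below assumes per-layer sensitivities have already been measured (for example as accuracy drop at 4-bit) and that the goal is a total bit budget; the layer names, scores, and budget are hypothetical, and this heuristic stands in for the more principled methods named above.

```python
def assign_bit_widths(sensitivity, sizes, budget_bits):
    """Greedy: drop the least sensitive layers to 4-bit until the budget is met."""
    bits = {name: 8 for name in sensitivity}

    def total():
        return sum(bits[n] * sizes[n] for n in bits)

    # Visit layers from least to most sensitive.
    for name in sorted(sensitivity, key=sensitivity.get):
        if total() <= budget_bits:
            break
        bits[name] = 4
    return bits


# Hypothetical accuracy drop when each layer alone is quantized to 4-bit.
sensitivity = {"conv1": 0.9, "conv2": 0.1, "conv3": 0.3, "fc": 0.05}
sizes = {"conv1": 1_000, "conv2": 50_000, "conv3": 100_000, "fc": 200_000}
bits = assign_bit_widths(sensitivity, sizes, budget_bits=2_000_000)
```

Here the large, robust `fc` and `conv2` layers drop to 4-bit while the sensitive first layer stays at 8-bit, which matches the usual intuition about early layers.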
Look for an explanation that weight sharing (e.g., k-means clustering of weights) reduces entropy, allowing for further compression with Huffman coding. It's a complementary technique to quantization, often applied in tandem for high compression ratios.
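The entropy argument can be checked numerically. This sketch assumes a tiny 1-D k-means with linear centroid initialisation and uses Shannon entropy as a proxy for the average Huffman code length (entropy is the lower bound for such a symbol-by-symbol code); the weights are hypothetical.

```python
import math
from collections import Counter


def kmeans_1d(values, k, iters=20):
    """Cluster scalar weights into k shared values (assumes k >= 2)."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # linear initialisation
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
    return centroids, assign


def entropy_bits(symbols):
    """Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


weights = [0.11, 0.09, 0.10, -0.52, -0.48, 0.10, 0.12, 0.91, 0.08, -0.50]
centroids, indices = kmeans_1d(weights, k=3)
raw_entropy = entropy_bits(weights)
shared_entropy = entropy_bits(indices)
```

After clustering, the model stores a three-entry codebook plus a stream of cluster indices whose entropy is well below that of the raw weights, so an entropy coder has far more to gain.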
A detailed answer covers exporting to ONNX, parsing with TensorRT, handling unsupported ops (fallback or custom plugins), dynamic shapes, precision calibration for INT8, and building the engine. Pitfalls include ONNX export failures and accuracy drift.
The candidate should outline a system for iterative pruning/quantization, automated accuracy evaluation, hardware-in-the-loop profiling, and a search algorithm (e.g., Bayesian optimization) to find the Pareto-optimal trade-off point.
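The final selection step of such a system can be sketched independently of the search algorithm. The example assumes two objectives (higher accuracy, lower latency) and a list of already evaluated candidates; the names and numbers are hypothetical.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both accuracy and latency."""
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"]
            and o["latency_ms"] <= c["latency_ms"]
            and (o["accuracy"] > c["accuracy"] or o["latency_ms"] < c["latency_ms"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


candidates = [
    {"name": "fp32", "accuracy": 0.760, "latency_ms": 42.0},
    {"name": "int8_ptq", "accuracy": 0.752, "latency_ms": 15.0},
    {"name": "prune50", "accuracy": 0.741, "latency_ms": 28.0},
    {"name": "prune50_int8", "accuracy": 0.735, "latency_ms": 9.0},
]
front_names = [c["name"] for c in pareto_front(candidates)]
```

In this data `prune50` drops out because `int8_ptq` is both more accurate and faster; the remaining three are the trade-off points a stakeholder would choose between.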
A strong answer mentions NPUs/TPUs with specific dataflow architectures, compiler stacks (e.g., XLA, TVM), and the need for compression engineers to understand these architectures to design models that map well onto them.
The answer should clarify that this is a form of knowledge distillation specifically for transformers, often involving architectural changes (fewer layers) and a loss that combines soft-target distillation with the standard task loss and sometimes embedding loss.
The candidate should discuss techniques like padding to fixed buckets, using dynamic shape support in TensorRT/ONNX Runtime, and the overhead implications of each approach.
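Bucketed padding and its overhead can be illustrated in a few lines. This is a sketch assuming token-ID lists as inputs and a hypothetical set of sequence-length buckets; it says nothing about how a given runtime implements dynamic shapes.

```python
import bisect

BUCKETS = [32, 64, 128, 256]  # hypothetical sequence-length buckets


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that fits `length`."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens, pad_id=0):
    """Right-pad a token list up to its bucket size."""
    bucket = pick_bucket(len(tokens))
    return tokens + [pad_id] * (bucket - len(tokens))


def padding_overhead(lengths):
    """Fraction of computed positions that are padding."""
    padded = sum(pick_bucket(n) for n in lengths)
    return 1 - sum(lengths) / padded


padded = pad_to_bucket([7] * 70)       # a 70-token input lands in the 128 bucket
overhead = padding_overhead([33, 70])  # share of work spent on padding
```

Fewer buckets mean fewer engines or optimization profiles to build but more wasted compute per request; the overhead figure makes that trade-off measurable.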
A nuanced answer notes that while smaller models generally consume less energy per inference, the relationship isn't linear. It depends on memory access patterns, hardware utilization, and the specific energy profile of the compute units (e.g., SRAM vs. DRAM access costs).
Scenario-Based
10 questions
A great answer suggests the problem is likely domain shift: the calibration/validation data wasn't representative of real-world user data. The solution involves collecting a new calibration set from production data and re-calibrating/adjusting the quantization or fine-tuning the compressed model on this in-domain data.
The answer should advocate for PTQ first, due to its speed and minimal engineering effort, suitable for a tight deadline. The recommendation should be to try PTQ, set an accuracy threshold (e.g., <1% drop), and only pivot to QAT if PTQ fails to meet that threshold, acknowledging the longer timeline.
The candidate should explain that the entire pipeline must be constrained to 4-bit, requiring careful quantization-aware training from the start, and that model architecture choices might be limited by the NPU's operator set. Collaboration with the hardware team for kernel availability is crucial.
A systematic answer would involve profiling to find the bottleneck (memory-bound or compute-bound), then applying targeted optimizations: further pruning/compression of the bottleneck layers, exploring operator fusion, adjusting the thread pool configuration, or finally considering a slightly more aggressive but less accurate compression for the critical section.
The answer should outline analyzing the computational and memory profile of each modality's encoder and the fusion head. Typically, the vision encoder is the largest and most amenable to pruning, while the language model might be more sensitive. A staged approach, compressing each component separately with sensitivity analysis, is key.
The most likely cause is peak memory usage during inference, exceeding the available RAM. The fix involves profiling memory allocation over time, implementing memory-efficient attention, reducing batch size to 1, using memory-mapped formats, or further optimizing the graph to reduce intermediate buffer sizes.
The answer should include reverse-engineering their approach (model architecture, compression technique), benchmarking their model if possible, and systematically improving your own pipeline by adopting best practices like better graph optimizations, a more efficient operator library, or a more advanced quantization scheme like mixed-precision.
A strong response emphasizes automation: a CI/CD pipeline that takes a new base model, runs the full compression script (pruning, quantization, calibration), tests it against a regression suite, and deploys it. Versioning of the base model, compressed model, and calibration data is critical.
The candidate should use a relatable analogy, like shipping goods: 'A fully accurate model is like a perfect but huge crate that's too expensive to ship everywhere. We repackage the same product into smaller, lighter boxes (compressed models) so it can be delivered instantly to every customer's door (their phone), making the product actually usable.'
The answer must have two parts: ethically, flagging this as a fairness/accuracy disparity issue that needs stakeholder visibility. Technically, using techniques like per-channel quantization, adjusting calibration data to better represent minority classes, or applying targeted quantization-aware fine-tuning with a weighted loss function for that class.
AI Workflow & Tools
10 questions
A great answer details: 1) Export PyTorch model to ONNX, 2) Convert ONNX to a TF SavedModel using an ONNX-to-TensorFlow converter such as onnx-tf or onnx2tf (tf2onnx converts in the opposite direction), 3) Apply post-training quantization via the TFLite converter with a representative dataset, 4) Convert to a .tflite file, 5) Benchmark and validate on an Android emulator/device using the TFLite benchmark_model tool.
The candidate should describe logging key metrics (accuracy, model size, latency, FLOPs) for each run, using the comparison feature to plot trade-off curves, and tagging experiments with compression techniques (e.g., 'pruning_50%', 'qat_int8').
The answer should cover: 1) Export to ONNX, 2) Create a TensorRT builder/logger, 3) Build an engine with explicit precision (FP16/INT8) and optimization profiles, 4) For INT8, provide a calibration cache via a calibrator class, 5) Serialize and deserialize the engine for deployment, 6) Run inference using the TensorRT runtime API.
Look for a description of a pipeline that triggers on model file changes, runs a validation script on a fixed test set, runs a latency benchmark script on a reference device, compares results to stored baselines, and fails the build if thresholds are exceeded.
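The comparison step at the end of such a pipeline might be sketched as below. It assumes the validation and benchmark scripts have already produced a metrics dictionary; the keys, baselines, and tolerances are hypothetical, and a real CI job would exit non-zero when the returned list is non-empty.

```python
def check_regressions(metrics, baseline, tolerances):
    """Return human-readable failures; an empty list means the build passes."""
    failures = []
    # Accuracy must not fall by more than the allowed absolute drop.
    drop = baseline["accuracy"] - metrics["accuracy"]
    if drop > tolerances["max_accuracy_drop"]:
        failures.append(f"accuracy dropped by {drop:.3f}")
    # Latency must not exceed the baseline by more than the allowed fraction.
    limit = baseline["latency_ms"] * (1 + tolerances["max_latency_increase"])
    if metrics["latency_ms"] > limit:
        failures.append(f"latency {metrics['latency_ms']:.1f} ms exceeds {limit:.1f} ms")
    return failures


baseline = {"accuracy": 0.750, "latency_ms": 12.0}
tolerances = {"max_accuracy_drop": 0.01, "max_latency_increase": 0.10}

passing = check_regressions({"accuracy": 0.745, "latency_ms": 12.5}, baseline, tolerances)
failing = check_regressions({"accuracy": 0.730, "latency_ms": 14.0}, baseline, tolerances)
```

Keeping the thresholds in versioned configuration next to the baselines makes every build failure traceable to an explicit, reviewable limit.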
The answer should mention techniques like embedding pruning, quantization (INT8 or even binary embeddings), hash-based embeddings, and using dimensionality reduction. Tools might include custom PyTorch/TensorFlow scripts, built-in modules like `torch.nn.EmbeddingBag`, and profiling with torch.utils.benchmark.
A technical response should describe: defining the model graph (from ONNX), specifying the hardware target, using TVM's auto-scheduler (Ansor) or auto-tuning to find optimized operator implementations, and compiling to a deployable library for that target.
The candidate should explain setting up the profiler with activities (CPU, CUDA), running a few inference steps, and analyzing the trace in TensorBoard or Chrome trace viewer to look for high-latency kernels, frequent memory operations, or low GPU utilization.
The answer should indicate that per-channel quantization for weights (especially in conv layers) is generally more accurate, while per-tensor is simpler. Symmetric is often used for weights, asymmetric for activations. The choice is guided by layer sensitivity analysis and empirical testing on accuracy.
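Why per-channel scales tend to be more accurate shows up clearly when channels have very different ranges. The sketch below assumes symmetric INT8 quantization and treats each row as one output channel; the weights are hypothetical and chosen to exaggerate the effect.

```python
def symmetric_scale(values, qmax=127):
    """One scale covering the largest magnitude in `values` (assumed non-zero)."""
    return max(abs(v) for v in values) / qmax


def roundtrip_error(values, scale, qmax=127):
    """Total absolute error after quantize-dequantize with the given scale."""
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        total += abs(v - q * scale)
    return total


def per_tensor_error(weights):
    flat = [w for row in weights for w in row]
    return roundtrip_error(flat, symmetric_scale(flat)) / len(flat)


def per_channel_error(weights):
    n = sum(len(row) for row in weights)
    return sum(roundtrip_error(row, symmetric_scale(row)) for row in weights) / n


# Two output channels with very different dynamic ranges.
weights = [[0.9, -0.5, 0.7], [0.004, -0.002, 0.003]]
err_tensor = per_tensor_error(weights)
err_channel = per_channel_error(weights)
```

With a single tensor-wide scale the small channel collapses onto one or two integer levels, while a per-channel scale preserves it at the cost of storing one scale per channel.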
The answer should cover creating an inference session with options (e.g., execution providers for GPU/NNAPI), preprocessing input data to match the model's expected format, running inference, and post-processing the output. Error handling and memory management are key points.
A good answer mentions using a sensitivity metric like the Hessian (approximated by Fisher information) or a brute-force search over a subset of configurations, evaluating accuracy on a small calibration set for each candidate configuration to select the Pareto-optimal one.
Behavioral
5 questions
The answer should demonstrate a structured decision-making process: defining constraints, exploring options, involving stakeholders, and ultimately making a data-driven decision. The outcome should highlight learning and the impact of the chosen trade-off.
Look for a proactive approach: following key researchers on Twitter, reading arXiv papers, participating in conferences (NeurIPS, MLSys), taking online courses, and engaging with open-source communities on GitHub.
A strong response shows humility and analytical skills. For example, a pruning method that assumed uniform weight importance failed because the model had critical sparse structures. The learning was about the importance of layer-wise sensitivity analysis and domain-specific knowledge.
The candidate should use a simple analogy, like rounding a price to the nearest dollar for faster mental math: quantization rounds neural network weights to fewer decimal places (precision levels), making calculations faster at a small cost to exactness.
The answer should reflect a practical, product-focused mindset. It involves defining clear acceptance criteria upfront with stakeholders (e.g., <1% accuracy loss, <50ms latency, <15MB size) and using a combination of offline testing and controlled online A/B testing to validate.