Interview Prep
AI On-Device AI Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A strong answer covers latency benefits, data privacy advantages, offline capability, and the tradeoff in available compute and memory versus cloud GPUs.
Discuss reducing weight precision (e.g., FP32 to INT8), the resulting memory and latency savings, and calibration methods to preserve accuracy.
Cover CPU (NEON/SVE), GPU (Adreno/Mali/Apple GPU), NPU/DSP (Hexagon DSP, Apple Neural Engine, Samsung NPU), and when each is appropriate.
TensorFlow Lite, Core ML, ONNX Runtime Mobile, PyTorch Mobile, ExecuTorch, MediaPipe: any three with a brief note on platform fit.
Post-training quantization applies after training and is simpler but may lose more accuracy; QAT simulates quantization during training for better accuracy at the cost of training complexity.
Intermediate
10 questions
Cover the torch export → ONNX → TFLite conversion chain, operator coverage gaps, custom op registration, dynamic shape handling, and numerical accuracy validation.
Discuss operator support matrices, throughput per watt, memory bandwidth considerations, quantization requirements of the NPU, and fallback paths.
Cover teacher-student architecture, soft label training, temperature scaling, and how distillation preserves nuanced knowledge that hard labels miss.
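To make temperature scaling and soft labels concrete, here is a small sketch, assuming the common formulation in which the student matches the teacher's temperature-softened distribution through a KL-divergence term scaled by T². The helper names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher T gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.5, 0.2]
student = [2.0, 2.0, 0.5]
loss = distillation_loss(student, teacher)
```

In practice this term is usually combined with a standard hard-label loss; the weighting between the two is a tuning choice.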
Discuss eliminating intermediate memory writes by fusing consecutive ops (e.g., Conv + ReLU), reducing memory bandwidth bottleneck on edge devices.
Cover using hardware power monitors, tegrastats, INA219 sensors, measuring idle vs. active power, isolating inference from background processes, and reporting energy-per-inference.
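The energy-per-inference figure mentioned above can be derived from the measured quantities roughly as follows. This is a sketch that assumes average power readings are already available from whichever monitor is used; the function name and the sample numbers are hypothetical.

```python
def energy_per_inference_mj(active_power_w, idle_power_w, total_time_s, num_inferences):
    """Energy attributable to inference, in millijoules per inference.

    Subtracting idle power isolates the inference workload from the
    baseline draw of the device.
    """
    net_power_w = active_power_w - idle_power_w
    return net_power_w * total_time_s * 1000.0 / num_inferences


# Hypothetical measurement: 3.5 W during the run, 1.5 W idle,
# 200 inferences completed in 10 seconds.
energy_mj = energy_per_inference_mj(3.5, 1.5, 10.0, 200)
```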
Discuss sensitivity analysis per layer, keeping first/last layers at higher precision, and using mixed-precision tools like QAT or hardware-supported mixed-precision runtimes.
Cover writing custom delegates/kernels, operator fallback to CPU, graph partitioning, and evaluating alternative architectures to avoid the unsupported op entirely.
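Graph partitioning with CPU fallback can be illustrated on a simplified linear op sequence. This sketch assumes a plain list of op names and a set of accelerator-supported ops; real runtimes partition a full dataflow graph, and the names used here are illustrative.

```python
def partition_graph(ops, supported_on_accelerator):
    """Group consecutive ops into segments that run on the same target."""
    segments = []
    for op in ops:
        target = "accelerator" if op in supported_on_accelerator else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)  # extend the current segment
        else:
            segments.append((target, [op]))  # start a new segment
    return segments


ops = ["conv", "relu", "custom_nms", "conv", "softmax"]
segments = partition_graph(ops, {"conv", "relu", "softmax"})
num_transfers = len(segments) - 1  # each boundary implies a CPU/accelerator handoff
```

A single unsupported op in the middle of the graph splits it into three segments and adds two handoffs, which is why replacing the op or writing a custom kernel can pay off.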
Discuss representative data sampling for activation range estimation, calibration algorithms (min-max, entropy, percentile), and the impact of calibration data quality on accuracy.
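The difference between min-max and percentile calibration shows up as soon as the activations contain an outlier. The sketch below assumes a flat list of activation samples and a simple two-sided percentile clip; it is not the exact algorithm of any specific toolkit.

```python
def minmax_range(activations):
    """Use the full observed range; sensitive to single outliers."""
    return min(activations), max(activations)


def percentile_range(activations, pct=99.0):
    """Clip the range symmetrically in rank, discarding the extreme tails."""
    ordered = sorted(activations)
    n = len(ordered)
    tail = (100.0 - pct) / 100.0
    lo_idx = int(tail * (n - 1))
    hi_idx = int((1.0 - tail) * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Values spread over [-1, 1] plus one large outlier.
samples = [i / 500 - 1 for i in range(1001)] + [50.0]
full_lo, full_hi = minmax_range(samples)
clip_lo, clip_hi = percentile_range(samples, pct=99.0)
```

With min-max, the single outlier stretches the range to 50, so most of the INT8 codes are spent on values that almost never occur; the percentile range stays close to the bulk of the data.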
Cover KV-cache growth with sequence length, attention matrix scaling, memory fragmentation, and strategies like sliding window attention or memory-mapped weight files.
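KV-cache growth with sequence length can be estimated with a short formula. The sketch assumes a standard decoder that stores one key and one value tensor per layer; the model dimensions in the example are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Size of the KV cache: keys and values (factor 2) for every layer and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
cache_4k = kv_cache_bytes(32, 8, 128, seq_len=4096)
cache_8k = kv_cache_bytes(32, 8, 128, seq_len=8192)
cache_4k_mib = cache_4k / (1024 ** 2)
```

The cache grows linearly with sequence length, so doubling the context doubles the memory, which quickly competes with the weights for RAM on a phone.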
Discuss automated benchmark gates on latency, accuracy, memory footprint; testing on real-device farms or simulators; canary deployment with rollback triggers.
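An automated benchmark gate can be as simple as comparing measured metrics against thresholds and failing the pipeline on any violation. The metric names and limits below are hypothetical, and the sketch assumes the metrics were already collected from a device run.

```python
def check_gate(metrics, thresholds):
    """Return a list of human-readable violations; an empty list means the gate passes."""
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics[name]
        passed = value <= limit if direction == "max" else value >= limit
        if not passed:
            failures.append(f"{name}={value} violates {direction} limit {limit}")
    return failures


# Hypothetical thresholds: "max" means the metric must not exceed the limit.
THRESHOLDS = {
    "latency_ms_p95": ("max", 30.0),
    "top1_accuracy": ("min", 0.75),
    "peak_memory_mb": ("max", 120.0),
}

good_run = {"latency_ms_p95": 24.1, "top1_accuracy": 0.78, "peak_memory_mb": 96.0}
bad_run = {"latency_ms_p95": 41.0, "top1_accuracy": 0.78, "peak_memory_mb": 96.0}
```

A CI job would exit with a non-zero status when `check_gate` returns any failures.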
Advanced
10 questions
Cover weight-only quantization (INT4/GPTQ/AWQ), KV-cache management, speculative decoding for latency, memory-mapped model loading, and potential use of NPU for matmul layers.
Discuss TVM's Relay IR, BYOC (Bring Your Own Codegen), auto-scheduling (Ansor), target-agnostic graph optimizations, and comparison to XLA's HLO and Core ML's intermediate representation.
Cover the delegate API, node-level op support validation, memory planning across CPU-offloaded and accelerated nodes, kernel registration, and testing with the TFLite benchmark model.
Structured pruning removes entire channels or filters, yielding real speedup on standard hardware; unstructured pruning reaches higher sparsity but only speeds up inference on hardware or kernels with sparse compute support. The preference should be justified by the target hardware's capabilities.
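As a sketch of structured pruning, the example below ranks filters by L1 norm and drops the weakest ones entirely. L1-norm ranking is one common criterion, assumed here for illustration; each filter is represented as a flat list of weights.

```python
def prune_filters(filters, keep_ratio):
    """Keep the filters with the largest L1 norm; return kept filters and their indices."""
    norms = [sum(abs(w) for w in f) for f in filters]
    num_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:num_keep])  # preserve original filter order
    return [filters[i] for i in keep], keep


filters = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
kept_filters, kept_indices = prune_filters(filters, keep_ratio=0.5)
```

Because whole filters disappear, the layer becomes a smaller dense layer that any runtime can execute faster, with no sparse kernels required.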
Discuss fixed-function compute blocks, batch size requirements, supported op sets, precision support, memory bandwidth access patterns, and developer tooling differences.
Cover fine-tuning on device with limited data, speaker adaptation layers, differential privacy guarantees, federated learning aggregation, and constraints of on-device training compute.
Cover outlier handling in activations (LLM.int8()), grouping strategies, the role of calibration data for weight rounding, and why activation quantization degrades significantly in transformers.
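The effect of grouping strategies can be seen in a small sketch of symmetric INT4 weight quantization with one scale per group. Round-to-nearest is assumed here for simplicity; methods such as GPTQ or AWQ choose the rounding or scaling more carefully.

```python
def quantize_groupwise_int4(weights, group_size):
    """Symmetric INT4 quantization with a separate scale for each group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Reconstruct approximate weights from INT4 codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def abs_error(weights, group_size):
    codes, scales = quantize_groupwise_int4(weights, group_size)
    restored = dequantize_groupwise(codes, scales, group_size)
    return [abs(a - b) for a, b in zip(weights, restored)]


# Four small weights followed by four large ones.
weights = [0.01, -0.02, 0.03, 0.02, 5.0, -4.0, 3.0, 1.0]
error_grouped = abs_error(weights, group_size=4)
error_single_scale = abs_error(weights, group_size=8)
```

With a single scale, the large weights dictate the step size and the small weights all collapse to zero; smaller groups keep outliers from degrading their neighbors at the cost of storing more scales.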
Discuss model architecture selection (nano/small variants), TensorRT optimization with FP16, input resolution reduction, NMS optimization, and profiling with thermal throttling simulation.
Cover NCHW vs NHWC layout transformations, in-place operation planning, memory reuse graphs, tensor lifetime analysis, and how these reduce peak memory and improve cache utilization.
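Tensor lifetime analysis can be sketched by tracking which intermediate tensors are live at each execution step. The example assumes each tensor is described by a hypothetical `(size_bytes, first_use_step, last_use_step)` tuple and ignores fragmentation and alignment.

```python
def peak_memory_with_reuse(tensors, num_steps):
    """Peak of the total size of tensors that are live at the same step."""
    peak = 0
    for step in range(num_steps):
        live = sum(size for size, first, last in tensors if first <= step <= last)
        peak = max(peak, live)
    return peak


def memory_without_reuse(tensors):
    """Naive plan: every tensor gets its own buffer for the whole run."""
    return sum(size for size, _, _ in tensors)


# Three intermediate tensors in a small sequential graph.
tensors = [(100, 0, 1), (200, 1, 2), (100, 2, 3)]
peak = peak_memory_with_reuse(tensors, num_steps=4)
naive = memory_without_reuse(tensors)
```

The first and third tensors never overlap in time, so a planner can place them in the same buffer, lowering the peak below the naive total.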
Discuss binary diff on model files, delta quantization (only shipping layers that changed), chunked downloads with integrity verification, and rollback strategies if the update causes accuracy drops.
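Shipping only the layers that changed, with integrity verification and rollback, might look like the following sketch. It assumes the model is available as a mapping from layer name to serialized bytes and does not handle layers that were removed; all names are illustrative.

```python
import hashlib


def layer_digests(layers):
    """SHA-256 digest for each layer's serialized bytes."""
    return {name: hashlib.sha256(blob).hexdigest() for name, blob in layers.items()}


def build_delta(old_layers, new_layers):
    """Select only the layers whose content differs from the installed version."""
    old = layer_digests(old_layers)
    new = layer_digests(new_layers)
    return {name: new_layers[name] for name in new_layers if old.get(name) != new[name]}


def apply_delta(old_layers, delta, expected_digests):
    """Apply the delta; keep the old model if the result fails verification."""
    updated = {**old_layers, **delta}
    if layer_digests(updated) != expected_digests:
        return old_layers  # rollback: integrity check failed
    return updated


v1 = {"conv1": b"\x01\x02", "fc": b"\x03\x04"}
v2 = {"conv1": b"\x01\x02", "fc": b"\x09\x09"}
delta = build_delta(v1, v2)
installed = apply_delta(v1, delta, layer_digests(v2))
corrupted = apply_delta(v1, {"fc": b"\x00"}, layer_digests(v2))
```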
Scenario-Based
10 questions
Assess compression feasibility (quantization, pruning, distillation), set accuracy SLA, propose phased rollout (cloud fallback + progressive on-device), and communicate tradeoffs clearly.
Check calibration data representativeness, run accuracy benchmarks segmented by demographic, evaluate if quantization disproportionately affects underrepresented patterns, and re-calibrate with balanced data.
Check for operator fallback to CPU, memory transfer overhead between CPU and NPU, and suboptimal graph partitioning, and compare the vendor's cherry-picked benchmark model against your real workload.
Discuss a model conversion matrix per SoC, per-device runtime selection (TFLite NNAPI vs. vendor-specific), a unified benchmark gate, and a configuration-driven deployment system.
Consider incomplete model downloads, partial OTA update failures, first-run initialization requiring cloud resources, and graceful degradation strategies for missing components.
Discuss p95 vs average latency differences, cold-start vs warm-start behavior, device thermal state during benchmarks, and the importance of user-perceived latency metrics beyond raw throughput.
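The gap between p95 and average latency is easy to show on a sample with a slow tail. This sketch assumes the nearest-rank percentile definition and uses made-up latency samples.

```python
import statistics


def latency_summary(samples_ms):
    """Mean and nearest-rank p95 of a list of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return {"mean": statistics.mean(ordered), "p95": ordered[rank - 1]}


# Hypothetical run: 90 fast inferences and 10 slow ones (e.g. cold start or throttling).
samples = [10] * 90 + [100] * 10
summary = latency_summary(samples)
```

Here the mean looks acceptable while one request in ten takes ten times longer, which is what users actually notice.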
Discuss architectural constraints (tiny screen, minimal RAM), potential for distilled diffusion models, latent-space-only generation, offloading to the phone as a companion device, and setting realistic expectations.
Cover adding integration tests against known input/output pairs, containerizing the build environment, introducing the TFLite Support Library or ONNX Runtime as a migration path, and incremental refactoring.
Discuss federated learning for global model improvement, on-device fine-tuning for personalization, differential privacy for gradient protection, and auditable privacy guarantees.
Cover model versioning and traceability, locked inference pipelines, extensive validation datasets, deterministic execution, and documentation for regulatory submission.
AI Workflow & Tools
10 questions
Walk through Optimum's ORTQuantizer API, dynamic vs static quantization options, the calibration process, and using the ORTModelForSequenceClassification class for benchmarking.
Cover logging latency, memory, accuracy metrics per device, using W&B Tables for comparison, artifact versioning for model binaries, and sweep configurations for hyperparameter search.
Cover ONNX import, FP16/INT8 builder configurations, dynamic shape profiles, timing cache usage, and profiling with trtexec to validate latency targets.
Describe defining the search space for the target hardware, running the auto-scheduler with task extraction, tuning trials, and integrating the generated schedule into a TVM runtime library.
Cover torch.export → Core ML conversion API, ct.models.ModelConfiguration options, coremltools quantization utilities, and interpreting the performance report's latency and memory estimates.
Discuss self-hosted runners with USB-connected devices, containerized benchmark scripts, publishing results as PR comments, and gating merge on performance thresholds.
Cover MediaPipe Hands for landmark extraction, training a small classifier on landmark features, integrating as a MediaPipe task graph, and profiling the full pipeline end-to-end.
Discuss uploading models to AI Hub, selecting target SoCs, reviewing automated profiling reports, and iterating on optimization based on the platform-specific recommendations.
Cover torch.export graph capture, operator registry for custom kernels, backend delegation to NPU/GPU, and the ExecuTorch runtime initialization on Android/iOS.
Discuss the SmoothQuant algorithm for migrating quantization difficulty from activations to weights, the onnxruntime.quantization.quantize API with smooth_quant mode, and calibration dataset requirements.
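The core of SmoothQuant, migrating quantization difficulty from activations to weights, can be sketched in pure Python. The example assumes per-input-channel scales of the form s_j = max|X_j|^α / max|W_j|^(1-α) with α = 0.5 and nonzero channel maxima; it illustrates the math only and does not call the ONNX Runtime API.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists: (m x k) @ (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Divide activation channels by s and multiply matching weight rows by s."""
    channels = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(channels)]
    w_max = [max(abs(v) for v in w[j]) for j in range(channels)]
    s = [act_max[j] ** alpha / w_max[j] ** (1.0 - alpha) for j in range(channels)]
    x_smooth = [[row[j] / s[j] for j in range(channels)] for row in x]
    w_smooth = [[v * s[j] for v in w[j]] for j in range(channels)]
    return x_smooth, w_smooth, s


# Channel 1 of the activations carries outliers.
x = [[1.0, 100.0], [-2.0, 50.0]]
w = [[0.5, -0.25], [0.1, 0.2]]
x_smooth, w_smooth, scales = smooth(x, w)
y = matmul(x, w)
y_smooth = matmul(x_smooth, w_smooth)
```

The product is mathematically unchanged, but the activation outlier shrinks from 100 to about 4.5 while the weights grow by the same factor, so both tensors become easier to quantize.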
Behavioral
5 questions
Look for clear communication of constraints, data-driven justification, offering alternative solutions, and maintaining a collaborative rather than adversarial tone.
Assess accountability, root cause analysis rigor, improvements made to testing infrastructure, and whether the candidate treats production incidents as learning opportunities.
Look for systematic learning habits (reading papers, attending conferences, contributing to OSS) and the ability to evaluate new tools critically rather than chasing every trend.
Seek evidence of empathy for hardware constraints, willingness to adjust ML requirements, finding common ground on shared metrics like latency and power, and successful cross-functional outcomes.
Evaluate the candidate's ability to prioritize foundational skills (systems programming, hardware understanding) and provide a structured, encouraging mentorship approach.