Interview Prep

On-Device AI Engineer Interview Questions

50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.

Beginner: 5 | Intermediate: 10 | Advanced: 10 | Scenario-Based: 10 | AI Workflow & Tools: 10 | Behavioral: 5

Beginner

5 questions
What a great answer covers:

A strong answer covers latency benefits, data privacy advantages, offline capability, and the tradeoff in available compute and memory versus cloud GPUs.

What a great answer covers:

Discuss reducing weight precision (e.g., FP32 to INT8), the resulting memory and latency savings, and calibration methods to preserve accuracy.
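As a rough illustration of the FP32-to-INT8 idea, here is a minimal sketch of asymmetric (affine) quantization in plain Python. The function names and the sample weights are made up for this example; real runtimes use per-channel scales and calibrated ranges rather than a simple min/max over one list.

```python
def quantize_int8(values):
    """Asymmetric (affine) quantization of floats to the int8 range [-128, 127]."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0 or 1.0         # avoid a zero scale for constant input
    zero_point = round(-128 - lo / scale)    # int8 code that represents 0.0
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# Hypothetical weights, chosen only for illustration.
weights = [-1.0, -0.25, 0.0, 0.5, 1.55]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value now occupies one byte instead of four, and the reconstruction error stays within roughly one quantization step (`scale`), which is the accuracy cost that calibration tries to minimize.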

What a great answer covers:

Cover CPU (NEON/SVE), GPU (Adreno/Mali/Apple GPU), NPU/DSP (Hexagon DSP, Apple Neural Engine, Samsung NPU), and when each is appropriate.

What a great answer covers:

TensorFlow Lite, Core ML, ONNX Runtime Mobile, PyTorch Mobile, ExecuTorch, MediaPipe: name any three, with a brief note on platform fit.

What a great answer covers:

Post-training quantization applies after training and is simpler but may lose more accuracy; QAT simulates quantization during training for better accuracy at the cost of training complexity.

Intermediate

10 questions
What a great answer covers:

Cover the torch export → ONNX → TFLite conversion chain, operator coverage gaps, custom op registration, dynamic shape handling, and numerical accuracy validation.

What a great answer covers:

Discuss operator support matrices, throughput per watt, memory bandwidth considerations, quantization requirements of the NPU, and fallback paths.

What a great answer covers:

Cover teacher-student architecture, soft label training, temperature scaling, and how distillation preserves nuanced knowledge that hard labels miss.
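The effect of temperature scaling on soft labels can be sketched in a few lines of plain Python. This is an illustrative sketch, not a training recipe: the logits are invented, and the loss shown is the commonly used KL divergence between temperature-softened distributions scaled by T².

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical teacher logits for a 3-class problem.
teacher = [6.0, 2.0, 1.0]
hard = softmax(teacher, temperature=1.0)   # nearly one-hot
soft = softmax(teacher, temperature=4.0)   # exposes relative class similarity
```

At the higher temperature, the non-top classes receive noticeably more probability mass, which is the "nuanced knowledge" a student cannot learn from hard labels alone.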

What a great answer covers:

Discuss eliminating intermediate memory writes by fusing consecutive ops (e.g., Conv + ReLU), reducing memory bandwidth bottleneck on edge devices.
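To make the fusion idea concrete, the sketch below contrasts an unfused Conv + ReLU (which materializes the whole intermediate tensor) with a fused loop that applies the activation while each accumulator is still in hand. It uses a toy 1-D convolution in plain Python, so it only illustrates the memory-traffic pattern, not real kernel performance.

```python
def conv1d(x, kernel):
    """Valid (no padding) 1-D convolution, written as cross-correlation."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def relu(x):
    return [max(0.0, v) for v in x]


def conv1d_relu_unfused(x, kernel):
    intermediate = conv1d(x, kernel)   # full tensor written out to memory
    return relu(intermediate)          # then read back for the activation


def conv1d_relu_fused(x, kernel):
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        acc = sum(x[i + j] * kernel[j] for j in range(k))
        out.append(max(0.0, acc))      # activation applied before any store
    return out


x = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [1.0, 0.5]
unfused = conv1d_relu_unfused(x, kernel)
fused = conv1d_relu_fused(x, kernel)
```

Both paths produce the same output; the fused version simply never allocates the intermediate buffer, which is where the bandwidth saving on edge devices comes from.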

What a great answer covers:

Cover using hardware power monitors, tegrastats, INA219 sensors, measuring idle vs. active power, isolating inference from background processes, and reporting energy-per-inference.
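Energy-per-inference can be derived from the idle and active power readings with simple arithmetic. The sketch below assumes average power figures in watts taken from whatever monitor is available; the function name and the sample numbers are hypothetical.

```python
def energy_per_inference_mj(active_power_w, idle_power_w, window_s, inferences):
    """Energy attributable to inference, in millijoules per inference.

    Subtracting idle power isolates the inference workload from the
    baseline draw of the board and background processes.
    """
    if window_s <= 0 or inferences <= 0:
        raise ValueError("window_s and inferences must be positive")
    net_power_w = active_power_w - idle_power_w
    return net_power_w * window_s / inferences * 1000.0


# Hypothetical measurement: 3.2 W while running, 1.2 W idle,
# 400 inferences completed in a 10 second window.
energy_mj = energy_per_inference_mj(3.2, 1.2, window_s=10.0, inferences=400)
```

With these numbers the net draw is 2 W over 10 s, or 20 J across 400 inferences, which works out to about 50 mJ per inference.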

What a great answer covers:

Discuss sensitivity analysis per layer, keeping first/last layers at higher precision, and using mixed-precision tools like QAT or hardware-supported mixed-precision runtimes.

What a great answer covers:

Cover writing custom delegates/kernels, operator fallback to CPU, graph partitioning, and evaluating alternative architectures to avoid the unsupported op entirely.
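Graph partitioning with CPU fallback can be illustrated on a linear list of ops. This is a simplified sketch: real runtimes partition a graph rather than a list, and the op names and the supported set here are invented for the example.

```python
def partition(ops, supported):
    """Group consecutive ops into segments that run on the same target."""
    segments = []
    for op in ops:
        target = "accelerator" if op in supported else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)      # extend the current segment
        else:
            segments.append((target, [op])) # start a new segment
    return segments


# Hypothetical model and accelerator op support.
ops = ["conv", "relu", "custom_nms", "conv", "softmax"]
supported = {"conv", "relu", "softmax"}

segments = partition(ops, supported)
transitions = len(segments) - 1   # each one implies a CPU/accelerator handoff
```

A single unsupported op in the middle of the model splits it into three segments with two handoffs, which is why replacing the op or writing a custom kernel can matter more than the op's own cost.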

What a great answer covers:

Discuss representative data sampling for activation range estimation, calibration algorithms (min-max, entropy, percentile), and the impact of calibration data quality on accuracy.
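The difference between min-max and percentile calibration shows up as soon as the calibration data contains an outlier. The sketch below is a plain-Python illustration with synthetic activations; the helper names are hypothetical, and entropy-based calibration is not shown.

```python
def minmax_range(samples):
    """Use the full observed range, outliers included."""
    return min(samples), max(samples)


def percentile_range(samples, pct=99.0):
    """Clip the range at symmetric percentiles to ignore rare outliers."""
    ordered = sorted(samples)
    n = len(ordered)
    lo_idx = int((100.0 - pct) / 100.0 * (n - 1))
    hi_idx = round(pct / 100.0 * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


def int8_scale(lo, hi):
    """Step size when the range is spread over 256 int8 levels."""
    return (hi - lo) / 255.0


# Synthetic activations in [-1, 1] plus a single large outlier.
activations = [i / 100 for i in range(-100, 101)] + [50.0]

mm = minmax_range(activations)
pr = percentile_range(activations, pct=99.0)
scale_minmax = int8_scale(*mm)
scale_percentile = int8_scale(*pr)
```

With min-max, one outlier stretches the range to 50 and the step size grows by more than an order of magnitude, so typical activations collapse onto a handful of levels. Percentile calibration clips the outlier and keeps resolution where the data actually lives.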

What a great answer covers:

Cover KV-cache growth with sequence length, attention matrix scaling, memory fragmentation, and strategies like sliding window attention or memory-mapped weight files.
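KV-cache growth is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes one key and one value vector per token, per layer, per KV head; the model dimensions are hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, window=None):
    """Approximate KV-cache size for a single sequence.

    The leading factor of 2 accounts for storing both K and V.
    With a sliding window, only the most recent `window` tokens are kept.
    """
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)                  # 128 KiB
full = kv_cache_bytes(32, 8, 128, seq_len=4096)                    # 512 MiB
windowed = kv_cache_bytes(32, 8, 128, seq_len=4096, window=1024)   # 128 MiB
```

The cache grows linearly with sequence length, so a 4096-token context costs 512 MiB under these assumptions, while a 1024-token sliding window caps it at 128 MiB regardless of how long the conversation runs.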

What a great answer covers:

Discuss automated benchmark gates on latency, accuracy, memory footprint; testing on real-device farms or simulators; canary deployment with rollback triggers.
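An automated benchmark gate can be as simple as a function that compares a candidate model's metrics with a baseline and returns any violations. The metric names, limit names, and numbers below are hypothetical; a real gate would read them from device-farm results.

```python
def check_gates(baseline, candidate, limits):
    """Return a list of violations; an empty list means the candidate may ship."""
    violations = []

    latency_growth = ((candidate["latency_ms"] - baseline["latency_ms"])
                      / baseline["latency_ms"])
    if latency_growth > limits["max_latency_regression"]:
        violations.append(f"latency regressed by {latency_growth:.0%}")

    accuracy_drop = baseline["accuracy"] - candidate["accuracy"]
    if accuracy_drop > limits["max_accuracy_drop"]:
        violations.append(f"accuracy dropped by {accuracy_drop:.3f}")

    if candidate["memory_mb"] > limits["max_memory_mb"]:
        violations.append(f"memory {candidate['memory_mb']} MB exceeds budget")

    return violations


baseline = {"latency_ms": 20.0, "accuracy": 0.912, "memory_mb": 48}
good = {"latency_ms": 21.0, "accuracy": 0.910, "memory_mb": 50}
bad = {"latency_ms": 26.0, "accuracy": 0.880, "memory_mb": 70}
limits = {"max_latency_regression": 0.10, "max_accuracy_drop": 0.005,
          "max_memory_mb": 64}

good_result = check_gates(baseline, good, limits)
bad_result = check_gates(baseline, bad, limits)
```

The same check can back a rollback trigger during a canary rollout: if field metrics start producing violations, the deployment system reverts to the baseline model.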

Advanced

10 questions
What a great answer covers:

Cover weight-only quantization (INT4/GPTQ/AWQ), KV-cache management, speculative decoding for latency, memory-mapped model loading, and potential use of NPU for matmul layers.

What a great answer covers:

Discuss TVM's Relay IR, BYOC (Bring Your Own Codegen), auto-scheduling (Ansor), target-agnostic graph optimizations, and how these compare with XLA's HLO and Core ML's intermediate representation.

What a great answer covers:

Cover the delegate API, node-level op support validation, memory planning across CPU-offloaded and accelerated nodes, kernel registration, and testing with the TFLite benchmark model.

What a great answer covers:

Structured pruning removes entire channels or filters, yielding real speedups on standard hardware; unstructured pruning reaches higher sparsity but only speeds up inference on hardware or kernels with sparse-compute support. Justify the preference by the target hardware's capabilities.
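The contrast between the two pruning styles can be sketched on a tiny weight matrix. This is an illustrative sketch in plain Python: the weights are invented, channel importance is approximated by L1 norm (one common heuristic among several), and no fine-tuning step is shown.

```python
def prune_channels(filters, keep_ratio):
    """Structured pruning: keep the output channels with the largest L1 norm."""
    n_keep = max(1, int(len(filters) * keep_ratio))
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])          # preserve original channel order
    return [filters[i] for i in kept], kept


def prune_unstructured(filters, threshold):
    """Unstructured pruning: zero individual weights below a magnitude threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in f] for f in filters]


# Hypothetical layer with 4 output channels and 2 weights per channel.
filters = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.6]]

structured, kept = prune_channels(filters, keep_ratio=0.5)
unstructured = prune_unstructured(filters, threshold=0.1)
```

Structured pruning returns a physically smaller tensor (2 channels instead of 4), so a dense kernel does less work. Unstructured pruning keeps the original shape and only introduces zeros, which saves time only if the runtime can skip them.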

What a great answer covers:

Discuss fixed-function compute blocks, batch size requirements, supported op sets, precision support, memory bandwidth access patterns, and developer tooling differences.

What a great answer covers:

Cover fine-tuning on device with limited data, speaker adaptation layers, differential privacy guarantees, federated learning aggregation, and constraints of on-device training compute.

What a great answer covers:

Cover outlier handling in activations (LLM.int8()), grouping strategies, the role of calibration data for weight rounding, and why activation quantization degrades significantly in transformers.
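The reason grouping helps with outliers can be shown with a small quantize-then-dequantize experiment. The sketch below uses symmetric 4-bit rounding in plain Python with invented weights; it illustrates the grouping idea only and is not an implementation of LLM.int8(), GPTQ, or AWQ.

```python
def quantize_groups(weights, group_size, bits=4):
    """Symmetric fake-quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit symmetric
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        out.extend(round(w / scale) * scale for w in group)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weight row with one large outlier at the end.
weights = [0.1, -0.2, 0.15, 0.05, 0.12, -0.07, 0.3, 8.0]

per_tensor = quantize_groups(weights, group_size=len(weights))
grouped = quantize_groups(weights, group_size=4)

err_per_tensor = mean_abs_error(weights, per_tensor)
err_grouped = mean_abs_error(weights, grouped)
```

With a single scale, the outlier sets the step size for every weight and the small values round to zero. With groups of four, only the group that contains the outlier loses resolution, so the overall error drops.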

What a great answer covers:

Discuss model architecture selection (nano/small variants), TensorRT optimization with FP16, input resolution reduction, NMS optimization, and profiling with thermal throttling simulation.

What a great answer covers:

Cover NCHW vs NHWC layout transformations, in-place operation planning, memory reuse graphs, tensor lifetime analysis, and how these reduce peak memory and improve cache utilization.
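Tensor lifetime analysis boils down to asking which buffers are alive at each execution step. The sketch below computes peak memory from hypothetical (size, first use, last use) triples and compares it with the no-reuse total; it ignores alignment and fragmentation.

```python
def peak_memory(tensors, num_steps):
    """Peak bytes live at any step, given inclusive tensor lifetimes.

    tensors: list of (size_bytes, first_step, last_step).
    """
    return max(
        sum(size for size, first, last in tensors if first <= step <= last)
        for step in range(num_steps)
    )


# Hypothetical 4-step graph: each tensor is produced at `first`
# and consumed for the last time at `last`.
tensors = [
    (100, 0, 0),   # model input, only read by step 0
    (200, 0, 1),   # output of step 0, read by step 1
    (200, 1, 2),   # output of step 1, read by step 2
    (50, 2, 3),    # output of step 2, read by step 3
]

no_reuse_total = sum(size for size, _, _ in tensors)   # 550 bytes
peak = peak_memory(tensors, num_steps=4)               # 400 bytes
```

A planner that reuses buffers once their lifetime ends only needs the peak (400 bytes here) rather than the sum of all tensors (550 bytes).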

What a great answer covers:

Discuss binary diff on model files, delta quantization (only shipping layers that changed), chunked downloads with integrity verification, and rollback strategies if the update causes accuracy drops.
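A chunk-level delta update with integrity verification might look like the sketch below. It is deliberately simplified: it assumes the old and new model files have the same length and that unchanged bytes stay at the same offsets, and all function names are hypothetical. Production systems typically use a real binary-diff format and signed manifests.

```python
import hashlib


def chunk_digests(blob, chunk_size):
    """SHA-256 digest of every fixed-size chunk."""
    return [hashlib.sha256(blob[i:i + chunk_size]).hexdigest()
            for i in range(0, len(blob), chunk_size)]


def changed_chunks(old_blob, new_blob, chunk_size):
    """Indices of chunks that must be shipped to the device."""
    old = chunk_digests(old_blob, chunk_size)
    new = chunk_digests(new_blob, chunk_size)
    return [i for i, digest in enumerate(new) if i >= len(old) or old[i] != digest]


def apply_delta(old_blob, patches, chunk_size, expected_sha256):
    """Patch chunks in place; keep the old model if verification fails."""
    buf = bytearray(old_blob)
    for index, data in patches.items():
        start = index * chunk_size
        buf[start:start + len(data)] = data
    candidate = bytes(buf)
    if hashlib.sha256(candidate).hexdigest() != expected_sha256:
        return old_blob, False          # rollback: integrity check failed
    return candidate, True


old_model = bytes(range(16))
new_model = old_model[:5] + b"\xff" + old_model[6:]   # one byte differs

to_ship = changed_chunks(old_model, new_model, chunk_size=4)
patches = {i: new_model[i * 4:(i + 1) * 4] for i in to_ship}
target = hashlib.sha256(new_model).hexdigest()

updated, ok = apply_delta(old_model, patches, 4, target)
rolled_back, bad_ok = apply_delta(old_model, {1: b"\x00" * 4}, 4, target)
```

Only one of four chunks is shipped in this example, and a corrupted patch leaves the device on the previous model. An accuracy-based rollback would add a validation step after the hash check.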

Scenario-Based

10 questions
What a great answer covers:

Assess compression feasibility (quantization, pruning, distillation), set accuracy SLA, propose phased rollout (cloud fallback + progressive on-device), and communicate tradeoffs clearly.

What a great answer covers:

Check calibration data representativeness, run accuracy benchmarks segmented by demographic, evaluate if quantization disproportionately affects underrepresented patterns, and re-calibrate with balanced data.

What a great answer covers:

Check for operator fallback to CPU, memory transfer overhead between CPU and NPU, and suboptimal graph partitioning, and compare their cherry-picked benchmark model against your real workload.

What a great answer covers:

Discuss a model conversion matrix per SoC, per-device runtime selection (TFLite NNAPI vs. vendor-specific), a unified benchmark gate, and a configuration-driven deployment system.

What a great answer covers:

Consider incomplete model downloads, partial OTA update failures, first-run initialization that requires cloud resources, and graceful degradation strategies for missing components.

What a great answer covers:

Discuss p95 vs average latency differences, cold-start vs warm-start behavior, device thermal state during benchmarks, and the importance of user-perceived latency metrics beyond raw throughput.
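The gap between average and tail latency is easy to demonstrate with a handful of samples. The sketch below uses a nearest-rank percentile and synthetic timings (18 warm runs plus 2 cold starts); the numbers are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: warm runs and two cold starts.
latencies_ms = [20.0] * 18 + [400.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)   # 58.0
p50_ms = percentile(latencies_ms, 50)             # 20.0
p95_ms = percentile(latencies_ms, 95)             # 400.0
```

The mean (58 ms) describes no request that actually happened: most users see 20 ms, while one in ten sees 400 ms. Reporting p50 and p95 separately, and labelling cold versus warm runs, makes that visible.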

What a great answer covers:

Discuss architectural constraints (tiny screen, minimal RAM), the potential of distilled diffusion models, latent-space-only generation, offloading to a companion phone, and setting realistic expectations.

What a great answer covers:

Cover adding integration tests against known input/output pairs, containerizing the build environment, introducing the TFLite Support Library or ONNX Runtime as a migration path, and incremental refactoring.

What a great answer covers:

Discuss federated learning for global model improvement, on-device fine-tuning for personalization, differential privacy for gradient protection, and auditable privacy guarantees.
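The combination of federated averaging and gradient protection can be sketched as clip, average, then add noise. This is a toy illustration in plain Python: the updates are invented, and choosing `noise_std` to meet a specific (ε, δ) privacy budget is a separate accounting step that is not shown here.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * factor for u in update]


def private_fedavg(updates, max_norm, noise_std, rng):
    """Average clipped client updates and add Gaussian noise per coordinate."""
    clipped = [clip_update(u, max_norm) for u in updates]
    dim = len(clipped[0])
    mean = [sum(u[i] for u in clipped) / len(clipped) for i in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]


# Hypothetical updates from two devices; the first is unusually large.
updates = [[3.0, 4.0], [0.3, 0.4]]

clipped_first = clip_update(updates[0], max_norm=1.0)            # about [0.6, 0.8]
no_noise = private_fedavg(updates, 1.0, 0.0, random.Random(0))   # about [0.45, 0.6]
noisy = private_fedavg(updates, 1.0, 0.1, random.Random(0))
```

Clipping bounds how much any single device can move the global model, and the added noise masks individual contributions. Raw data never leaves the device; only the clipped update does.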

What a great answer covers:

Cover model versioning and traceability, locked inference pipelines, extensive validation datasets, deterministic execution, and documentation for regulatory submission.

AI Workflow & Tools

10 questions
What a great answer covers:

Walk through Optimum's ORTQuantizer API, dynamic vs static quantization options, the calibration process, and using the ORTModelForSequenceClassification class for benchmarking.

What a great answer covers:

Cover logging latency, memory, accuracy metrics per device, using W&B Tables for comparison, artifact versioning for model binaries, and sweep configurations for hyperparameter search.

What a great answer covers:

Cover ONNX import, FP16/INT8 builder configurations, dynamic shape profiles, timing cache usage, and profiling with trtexec to validate latency targets.

What a great answer covers:

Describe defining the search space for the target hardware, running the auto-scheduler with task extraction, tuning trials, and integrating the generated schedule into a TVM runtime library.

What a great answer covers:

Cover the torch.export → Core ML conversion API, ct.models.ModelConfiguration options, coremltools quantization utilities, and interpreting the performance report's latency and memory estimates.

What a great answer covers:

Discuss self-hosted runners with USB-connected devices, containerized benchmark scripts, publishing results as PR comments, and gating merge on performance thresholds.

What a great answer covers:

Cover MediaPipe Hands for landmark extraction, training a small classifier on landmark features, integrating as a MediaPipe task graph, and profiling the full pipeline end-to-end.

What a great answer covers:

Discuss uploading models to AI Hub, selecting target SoCs, reviewing automated profiling reports, and iterating on optimization based on the platform-specific recommendations.

What a great answer covers:

Cover torch.export graph capture, operator registry for custom kernels, backend delegation to NPU/GPU, and the ExecuTorch runtime initialization on Android/iOS.

What a great answer covers:

Discuss the SmoothQuant algorithm for migrating quantization difficulty from activations to weights, enabling it through the static quantization API in onnxruntime.quantization (the SmoothQuant extra option), and calibration dataset requirements.

Behavioral

5 questions
What a great answer covers:

Look for clear communication of constraints, data-driven justification, offering alternative solutions, and maintaining a collaborative rather than adversarial tone.

What a great answer covers:

Assess accountability, root cause analysis rigor, improvements made to testing infrastructure, and whether the candidate treats production incidents as learning opportunities.

What a great answer covers:

Look for systematic learning habits (reading papers, attending conferences, contributing to OSS) and the ability to evaluate new tools critically rather than chasing every trend.

What a great answer covers:

Seek evidence of empathy for hardware constraints, willingness to adjust ML requirements, finding common ground on shared metrics like latency and power, and successful cross-functional outcomes.

What a great answer covers:

Evaluate the candidate's ability to prioritize foundational skills (systems programming, hardware understanding) and provide a structured, encouraging mentorship approach.