AI On-Device AI Engineer
An AI On-Device AI Engineer specializes in deploying, optimizing, and running machine learning models on edge hardware such as smartphones…
Skill Guide
The process of transforming a machine learning model's computational graph from a training framework (like PyTorch or TensorFlow) into an optimized, deployable format for inference engines, involving graph restructuring techniques such as operator fusion and constant folding to maximize performance.
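Operator fusion can be illustrated with the arithmetic behind folding a batch-normalization layer into the layer that precedes it. The following is a minimal pure-Python sketch under stated assumptions, not any inference engine's actual implementation: the function names and the toy values are hypothetical, and a small linear layer stands in for a convolution.

```python
import math


def linear(x, weights, bias):
    """Apply y = W x + b, with one weight row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization, applied per channel."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding layer's weights and bias."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)            # per-channel multiplier
        folded_w.append([w * scale for w in row])  # scale every weight in the row
        folded_b.append((b - m) * scale + bt)      # shift absorbed into the bias
    return folded_w, folded_b


# Toy values, chosen only for illustration.
x = [1.0, -2.0, 0.5]
weights = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.3]
mean, var = [0.2, -0.1], [0.9, 1.6]

# Two ops (linear, then batchnorm) versus one fused op.
unfused = batchnorm(linear(x, weights, bias), gamma, beta, mean, var)
folded_w, folded_b = fold_batchnorm(weights, bias, gamma, beta, mean, var)
fused = linear(x, folded_w, folded_b)

max_abs_diff = max(abs(a - b) for a, b in zip(unfused, fused))
print(f"max |unfused - fused| = {max_abs_diff:.2e}")
```

The fused layer yields the same outputs up to floating-point rounding while removing one operator from the graph. Because the scale and shift depend only on constants, they are computed once at conversion time, which is also the idea behind constant folding.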
Scenario
Convert a pre-trained PyTorch ResNet-50 model to run on an NVIDIA Jetson Nano for a classification task.
Scenario
Deploy a YOLOv5 model on a mobile device using a framework like Core ML or TFLite, handling complex pre/post-processing and non-standard layers.
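Post-processing for detectors such as YOLOv5 typically includes non-maximum suppression (NMS), which often has to be reimplemented on the device when the exported graph stops at the raw predictions. Below is a minimal greedy NMS sketch in pure Python. The `(x1, y1, x2, y2)` box layout, the 0.45 threshold, and the sample boxes are assumptions for illustration, not values from any particular YOLOv5 export.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up data).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.75]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of roughly 0.82, so it is suppressed and only indices 0 and 2 survive. A production pipeline would usually run NMS per class and use a vectorized or runtime-provided implementation.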
Scenario
A research team has developed a novel sparse attention module for a transformer model that is not supported by any inference framework's default opset.
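ONNX is a widely used interchange format between training frameworks and inference runtimes. Create it with `torch.onnx.export` (PyTorch) or the `tf2onnx` converter (TensorFlow). For final deployment optimization, use the target runtime's own tooling: ONNX Runtime, the TFLite converter, or Core ML Tools.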
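As a rough illustration of the export step, the sketch below wraps `torch.onnx.export` for a torchvision ResNet-50. The function name, the tensor names `input` and `logits`, the 224x224 input size, and opset 13 are all assumptions chosen for the example; pick the opset that the target runtime actually supports.

```python
def export_resnet50_to_onnx(onnx_path="resnet50.onnx", opset_version=13):
    """Export a torchvision ResNet-50 to ONNX and return the output path.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Imported lazily so this file can be loaded on machines without PyTorch.
    import torch
    import torchvision

    # weights=None builds the architecture only; load a trained checkpoint here.
    model = torchvision.models.resnet50(weights=None)
    model.eval()  # use inference behaviour for BatchNorm before tracing

    # Example input: one 224x224 RGB image (assumed preprocessing size).
    dummy_input = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model,
        (dummy_input,),
        onnx_path,
        opset_version=opset_version,
        input_names=["input"],
        output_names=["logits"],
        # Mark the batch dimension as dynamic so other batch sizes are accepted.
        dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    )
    return onnx_path
```

Calling `export_resnet50_to_onnx()` writes `resnet50.onnx` to the working directory, and that file can then be passed to a runtime-specific tool for the final optimization step.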
TensorRT, OpenVINO, and TVM are the primary engines that perform graph optimization (operator fusion, constant folding). TensorRT is the standard choice for NVIDIA GPUs and edge devices, OpenVINO optimizes for Intel hardware, and TVM provides compiler-based optimization across diverse backends.
Netron visualizes graph structure, TensorBoard supports profiling, and Nsight provides low-level GPU kernel analysis. These tools are essential for identifying bottlenecks and verifying that optimizations were actually applied.
Answer Strategy
The strategy is to demonstrate end-to-end pipeline knowledge and debugging skill. Start with `torch.onnx.export` and its `opset_version` argument. Explain that a custom `torch.autograd.Function` has no ONNX mapping by default, so the exporter cannot translate it on its own. The solution is to supply a symbolic function, either as a `symbolic` static method on the Function or, for a registered custom operator, through `torch.onnx.register_custom_op_symbolic`, so that it maps to a standard ONNX op or a custom op. Pitfalls include shape mismatches, missing symbolic registrations, and TensorRT not supporting the target ONNX op, which then requires a custom plugin.
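The symbolic-function approach can be sketched as follows. This is a hedged example that assumes the TorchScript-based exporter (the newer dynamo-based exporter handles custom ops differently). The class names, the `edge_custom::SparseAttention` op name, and the softmax stand-in for the real computation are all hypothetical.

```python
def export_with_custom_symbolic(onnx_path="custom_op.onnx", opset_version=13):
    """Export a module whose custom autograd Function maps to a custom ONNX op.

    Hypothetical helper: all names below are illustrative assumptions.
    """
    # Imported lazily so this file can be loaded on machines without PyTorch.
    import torch

    class SparseAttentionFn(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            # Stand-in computation; a real module would implement sparse attention.
            return torch.softmax(x, dim=-1)

        @staticmethod
        def symbolic(g, x):
            # Emit a single node in a custom domain instead of tracing forward().
            # setType copies the input's type info so downstream shapes resolve.
            return g.op("edge_custom::SparseAttention", x).setType(x.type())

    class Wrapper(torch.nn.Module):
        def forward(self, x):
            return SparseAttentionFn.apply(x)

    torch.onnx.export(
        Wrapper().eval(),
        (torch.randn(1, 8, 16),),
        onnx_path,
        opset_version=opset_version,
    )
    return onnx_path
```

The exported graph then contains one `SparseAttention` node in the custom domain. The deployment runtime still needs an implementation for that node, for example a TensorRT plugin, which is the last pitfall named above.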
Answer Strategy
This tests structured problem-solving. 1. Visualize the graph in Netron to check whether the expected fusions (e.g., Conv+BN+ReLU) are present. 2. Use `trtexec --verbose` or TensorRT's `ILogger` to inspect optimization passes and warnings. 3. Profile with Nsight Systems to identify excessive kernel launches or memory copies. 4. Check the common causes: unsupported ops forcing a fallback to slower, unfused layer implementations, a suboptimal workspace size, or an INT8 calibration dataset that does not match the deployment data.
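Steps 2 and 3 can be scripted so that every run uses the same settings. The sketch below only assembles the command lines and does not execute them. The helper names and file names are hypothetical, and flag availability (for example `--dumpProfile`) should be checked against the installed TensorRT and Nsight Systems versions.

```python
import shlex


def trtexec_command(onnx_path, fp16=True, engine_path=None):
    """Build a trtexec invocation with verbose logging and per-layer timing."""
    cmd = ["trtexec", f"--onnx={onnx_path}", "--verbose", "--dumpProfile"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 kernels where the builder finds them faster
    if engine_path:
        cmd.append(f"--saveEngine={engine_path}")  # keep the built engine for reuse
    return cmd


def nsys_command(inner_cmd, report_name="trt_profile"):
    """Wrap a command in an Nsight Systems profiling session."""
    return ["nsys", "profile", f"--output={report_name}"] + list(inner_cmd)


# File names are placeholders for illustration.
build_cmd = trtexec_command("model.onnx", engine_path="model.engine")
profile_cmd = nsys_command(build_cmd)

print(shlex.join(build_cmd))
print(shlex.join(profile_cmd))
```

The printed strings can be pasted into a shell on the target device, or the lists can be passed to `subprocess.run` once the tools are confirmed to be on the `PATH`.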