AI IoT Agent Engineer
An AI IoT Agent Engineer designs, deploys, and orchestrates autonomous AI agents that perceive, reason about, and act upon data fr…
Skill Guide
The engineering discipline of compressing, converting, and optimizing deep learning models for inference on resource-constrained devices using quantization, pruning, and hardware-specific runtimes like ONNX Runtime and TensorRT.
Scenario
Deploy an image classification model to a Raspberry Pi 4 (ARM CPU) to classify objects from a USB camera feed in real time.
Scenario
Deploy a BERT-based sentiment analysis model on a Jetson Nano (NVIDIA GPU) for a kiosk application, targeting <100ms latency per inference.
Scenario
Build a system to automatically deploy an object detection model (YOLOv8) to a fleet containing NVIDIA Jetson AGX Orin (TensorRT), a Qualcomm-based Android phone (QNN), and an Intel CPU (OpenVINO).
ONNX is a widely supported intermediate format for moving models between training frameworks and inference runtimes. Use `torch.onnx.export` (PyTorch) or `tf2onnx` (TensorFlow/Keras) to create the .onnx file as an early step in most deployment pipelines.
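The PyTorch export step can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the tensor names `input` and `logits`, the dynamic batch axis, and opset 17 are assumptions chosen for the example, and the helper names `build_export_kwargs` and `export_to_onnx` are hypothetical. PyTorch is imported lazily, so only the export call itself needs it installed.

```python
from typing import Any, Dict


def build_export_kwargs(opset: int = 17) -> Dict[str, Any]:
    """Keyword arguments for torch.onnx.export with a variable batch size.

    The names "input" and "logits" are illustrative assumptions.
    """
    return {
        "input_names": ["input"],
        "output_names": ["logits"],
        # Mark dimension 0 as dynamic so the runtime accepts any batch size.
        "dynamic_axes": {"input": {0: "batch"}, "logits": {0: "batch"}},
        "opset_version": opset,
    }


def export_to_onnx(model: Any, example_input: Any, path: str, opset: int = 17) -> None:
    """Export a PyTorch module to an .onnx file (requires PyTorch)."""
    import torch  # lazy import: the helper above works without PyTorch

    model.eval()  # fix dropout and batch-norm behaviour before export
    with torch.no_grad():
        torch.onnx.export(
            model, (example_input,), path, **build_export_kwargs(opset)
        )
```

The opset should be one that the target runtime supports, so it is worth checking the runtime's compatibility notes before settling on a value.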
TensorRT is the premier optimizer for NVIDIA GPUs (desktop & Jetson). ONNX Runtime provides cross-platform deployment with various execution providers. OpenVINO targets Intel hardware. QNN targets Qualcomm SoCs. Choose based on target device silicon.
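For a mixed fleet, the choice of runtime by silicon can be expressed as a small lookup. The sketch below is illustrative only: the silicon labels (`nvidia_gpu`, `intel`, `qualcomm`) and the device names are hypothetical identifiers, and falling back to ONNX Runtime for anything unlisted is an assumption based on its cross-platform role.

```python
from typing import Dict

# Hypothetical silicon labels mapped to the runtimes named in the text.
RUNTIME_BY_SILICON: Dict[str, str] = {
    "nvidia_gpu": "TensorRT",
    "intel": "OpenVINO",
    "qualcomm": "QNN",
}


def choose_runtime(silicon: str) -> str:
    """Return the preferred runtime, defaulting to ONNX Runtime."""
    return RUNTIME_BY_SILICON.get(silicon.lower(), "ONNX Runtime")


def plan_fleet(fleet: Dict[str, str]) -> Dict[str, str]:
    """Map each device name to the runtime its model should be built for."""
    return {device: choose_runtime(silicon) for device, silicon in fleet.items()}


if __name__ == "__main__":
    fleet = {
        "jetson-agx-orin": "nvidia_gpu",
        "android-phone": "qualcomm",
        "intel-box": "intel",
        "raspberry-pi-4": "arm_cpu",
    }
    for device, runtime in plan_fleet(fleet).items():
        print(f"{device}: {runtime}")
```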
Use TensorRT's calibration for INT8 post-training quantization (PTQ) on NVIDIA GPUs. ONNX Runtime's quantization tooling handles PTQ for CPU and other backends. The native PyTorch and TensorFlow tools support quantization-aware training (QAT), which usually preserves accuracy better than PTQ but requires retraining.
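To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It shows only the underlying arithmetic and is not the algorithm TensorRT or ONNX Runtime uses: real calibrators choose the clipping range from calibration data, whereas this example simply assumes the range is the largest absolute value in the tensor.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127].

    Assumption: the clipping range is max(|x|); calibration tools
    typically pick a tighter range from representative data.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.3, 1.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, scale, worst)
```

The rounding error per value is at most half the scale, which is why a range stretched by a few outliers costs precision for every other value and why the choice of calibration data matters.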
nsys (NVIDIA Nsight Systems) profiles GPU kernel execution and host-device activity on NVIDIA devices. ONNX Runtime and TensorRT have built-in profiling to identify bottleneck layers. Use `trtexec` to build an engine and quickly benchmark its latency and throughput.
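Alongside these profilers, a simple end-to-end timing harness helps check a latency target such as the sub-100 ms goal in the kiosk scenario. The sketch below is a generic, assumed approach rather than part of any runtime's API: `benchmark` and `percentile` are hypothetical helpers, and `fn` stands in for whatever callable runs one inference.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def benchmark(fn: Callable[[], object], warmup: int = 10, runs: int = 100) -> Dict[str, float]:
    """Time fn() after a warm-up phase and report latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches, lazy initialisation, and clock scaling
    latencies_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs one inference.
    print(benchmark(lambda: sum(range(10_000))))
```

For GPU backends the callable should block until the result is available on the host, otherwise the timer only measures how long it takes to enqueue the work.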
Answer Strategy
Structure the answer as a systematic optimization pipeline: 1) Baseline measurement, 2) Model simplification, 3) Export and graph optimization, 4) Precision quantization, 5) Runtime optimization, 6) Validation. Sample Answer: "First, I'd profile the baseline FP32 model using `nsys` and `trtexec` to establish a latency baseline, and evaluate it on a validation set to establish an accuracy baseline. Then, I'd export to ONNX and use ONNX GraphSurgeon to remove unnecessary operations. Next, I'd build a TensorRT engine with FP16 precision, which typically costs little or no accuracy for vision models, and benchmark it. If more speed is needed, I'd use TensorRT's INT8 quantization with a calibration dataset drawn from the training distribution, validating on a hold-out set that accuracy stays within the 1% bound. Finally, I'd enable dynamic batching where the serving stack supports it and optimize the pre/post-processing pipelines around the TensorRT C++ API to avoid host-device sync bottlenecks."
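The decision logic in that pipeline (take the fastest precision that stays within the accuracy bound) can be sketched in a few lines. Everything here is illustrative: the `Candidate` type and `pick_candidate` function are hypothetical, and the latency and accuracy figures are made-up placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    precision: str     # e.g. "FP32", "FP16", "INT8"
    latency_ms: float  # measured on the target device
    accuracy: float    # measured on a hold-out set, in [0, 1]


def pick_candidate(
    baseline: Candidate,
    candidates: List[Candidate],
    max_accuracy_drop: float = 0.01,
) -> Candidate:
    """Return the fastest build whose accuracy drop is within the bound.

    The baseline is always eligible, so the function never returns a
    model that violates the accuracy requirement.
    """
    acceptable = [
        c for c in candidates
        if baseline.accuracy - c.accuracy <= max_accuracy_drop
    ]
    return min(acceptable + [baseline], key=lambda c: c.latency_ms)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    fp32 = Candidate("FP32", latency_ms=40.0, accuracy=0.760)
    fp16 = Candidate("FP16", latency_ms=22.0, accuracy=0.759)
    int8 = Candidate("INT8", latency_ms=12.0, accuracy=0.740)
    print(pick_candidate(fp32, [fp16, int8]).precision)
```

Keeping the accuracy gate in code, rather than as a manual check, makes it harder for a faster but out-of-bound engine to slip into a release.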
Answer Strategy
This question tests the candidate's systematic debugging methodology and understanding of the optimization stack. Sample Answer: "This is a classic precision or graph alteration issue. My first step is to isolate the problem: I would run inference on the same input tensor using both the ONNX Runtime CPU backend and TensorRT, comparing intermediate layer outputs to find where they first diverge. I'd use TensorRT's `IEngineInspector` to examine the built engine's layer precision and fusion, looking for layers unexpectedly running in lower precision. I'd also verify that the ONNX model uses an opset version and operators fully supported by TensorRT. If the issue persists, I'd rebuild the TensorRT engine with `--verbose` logging to check which precision each layer was assigned, which can reveal unsupported operations that fail silently."
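The layer-by-layer comparison in that answer can be sketched as below, assuming the intermediate outputs from both backends have already been captured as flat lists keyed by layer name. The helper names, the layer names, and the tolerance of 1e-3 are all hypothetical choices for illustration; an appropriate tolerance depends on the model and the precision in use.

```python
from typing import Dict, List, Optional


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergent_layer(
    reference: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    layer_order: List[str],
    atol: float = 1e-3,
) -> Optional[str]:
    """Return the first layer, in execution order, that exceeds atol."""
    for name in layer_order:
        if max_abs_diff(reference[name], candidate[name]) > atol:
            return name
    return None


if __name__ == "__main__":
    # Made-up activations standing in for captured layer outputs.
    ref = {"conv1": [0.1, 0.2], "relu1": [0.1, 0.2], "fc": [1.0, 2.0]}
    cand = {"conv1": [0.1, 0.2001], "relu1": [0.1, 0.25], "fc": [1.3, 2.0]}
    print(first_divergent_layer(ref, cand, ["conv1", "relu1", "fc"]))
```

Reporting the first divergent layer rather than the worst one matters because errors propagate: later layers often show larger differences that are only symptoms of an earlier mismatch.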