AI Spatial Computing Engineer
An AI Spatial Computing Engineer designs and builds intelligent systems that merge AI models with immersive 3D environments - powe…
Skill Guide
The systematic process of reducing the latency and computational cost of trained machine learning models for production deployment by converting them to optimized formats (ONNX, TensorRT, Core ML) and applying compression techniques like quantization.
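To make the quantization step concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The weight values and the function names `quantize_int8` and `dequantize` are made up for illustration. Real toolchains (TensorRT, Core ML, ONNX Runtime) apply their own calibration and per-channel schemes, so treat this as the underlying idea rather than any library's implementation.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.02, -1.27, 0.5, 0.999, -0.3]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The trade-off is visible in `max_error`: each weight takes a quarter of the FP32 storage, at the price of a rounding error of at most half a step (`scale / 2`).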
Scenario
Deploy a ResNet-50 model trained in PyTorch to run efficiently on an iPhone or Android device with minimal latency.
Scenario
Optimize a BERT model for a semantic search service to handle 500+ queries per second with P99 latency under 30ms on a single NVIDIA T4 GPU.
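The targets in this scenario can be turned into two quick checks: whether measured latencies meet the P99 budget, and how many requests are in flight on average at the target rate (Little's law). The sketch below uses invented latency samples and the hypothetical helper names `percentile` and `in_flight_requests`; it only illustrates the arithmetic and is not a load-testing tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def in_flight_requests(qps, latency_s):
    """Little's law: average concurrency = arrival rate x time spent in the system."""
    return qps * latency_s


# 100 hypothetical latency measurements in milliseconds.
latencies_ms = [12.0] * 95 + [22.0] * 4 + [41.0]

p99 = percentile(latencies_ms, 99)
meets_slo = p99 < 30.0

# At 500 QPS with every request finishing within 30 ms,
# about 15 requests are in flight on average.
concurrency_bound = in_flight_requests(500, 0.030)
print(p99, meets_slo, concurrency_bound)
```

The concurrency figure gives a rough ceiling on useful dynamic batch sizes: waiting to fill a batch much larger than the number of requests in flight would spend the latency budget on queueing.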
Scenario
Architect a unified inference gateway for a large e-commerce platform that must serve 10+ different models (recommendation, NLP, CV) across a heterogeneous fleet of GPUs (NVIDIA, AMD), CPUs, and edge devices.
Core tools for converting models to optimized, hardware-specific formats. TensorRT is critical for NVIDIA GPU inference, Core ML for Apple Silicon, and OpenVINO for Intel hardware.
Essential for identifying performance bottlenecks in the computation graph, memory access patterns, and kernel execution times. Profile both before and after optimization to quantify the gains.
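A before-and-after measurement can be as simple as the timing harness sketched below. The two "models" are hypothetical stand-in functions, not real inference calls, and `benchmark` is an illustrative helper name. On a GPU, the timed call would also have to wait for the device to finish its work, which this CPU-only sketch does not need to do.

```python
import statistics
import time


def benchmark(fn, warmup=5, iterations=50):
    """Time fn() repeatedly and summarize latency in milliseconds."""
    for _ in range(warmup):  # warm-up runs are discarded (caches, lazy initialization)
        fn()
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


def baseline_model():
    """Hypothetical stand-in for the unoptimized model's forward pass."""
    return sum(i * i for i in range(20000))


def optimized_model():
    """Hypothetical stand-in for the optimized model's forward pass."""
    return sum(i * i for i in range(2000))


before = benchmark(baseline_model)
after = benchmark(optimized_model)
speedup = before["median_ms"] / after["median_ms"]
print(before, after, speedup)
```

Reporting the median alongside the maximum helps separate a genuine speedup from a change in tail behavior, which matters when the target is a percentile latency.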
Production-grade serving solutions that manage model loading, request batching, versioning, and multi-GPU serving. Triton Inference Server is notable for serving multiple backends (for example TensorRT and ONNX Runtime) from a single server.
Answer Strategy
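Demonstrate a systematic debugging process. Start by confirming the accuracy drop on a representative evaluation set. Next, compare layer-by-layer outputs between the original and optimized models: trtexec's --profilingVerbosity=detailed exposes per-layer information about the built engine, and a comparison tool such as Polygraphy can diff intermediate outputs against the original model. Then force suspect layers back to higher precision (per-layer precision control) to isolate the layer causing the accuracy loss. Finally, if reduced precision remains the cause, keep the sensitive layers in FP32, or, when the target is INT8, consider Quantization-Aware Training (QAT) on the original model before conversion to TensorRT so that it becomes more robust to the lower-precision representation.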
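The isolation step amounts to walking the layers in execution order and stopping at the first one whose output diverges from the reference. Below is a minimal sketch of that comparison, assuming the per-layer outputs have already been dumped into dictionaries. The layer names, values, tolerance, and helper names are all hypothetical, and no TensorRT API is used.

```python
def max_relative_error(reference, candidate, eps=1e-6):
    """Largest element-wise relative difference between two flat output lists."""
    return max(abs(r - c) / (abs(r) + eps) for r, c in zip(reference, candidate))


def first_divergent_layer(ref_outputs, opt_outputs, layer_order, tolerance=1e-2):
    """Return (layer_name, error) for the first layer exceeding tolerance, else (None, 0.0)."""
    for name in layer_order:
        err = max_relative_error(ref_outputs[name], opt_outputs[name])
        if err > tolerance:
            return name, err
    return None, 0.0


# Hypothetical dumped activations, listed in execution order.
layer_order = ["conv1", "layernorm", "softmax"]
fp32_outputs = {
    "conv1": [0.5, -1.25, 2.0],
    "layernorm": [0.1, 0.2, -0.3],
    "softmax": [0.7, 0.2, 0.1],
}
fp16_outputs = {
    "conv1": [0.5, -1.25, 2.0],
    "layernorm": [0.1, 0.26, -0.3],
    "softmax": [0.6, 0.3, 0.1],
}

culprit, error = first_divergent_layer(fp32_outputs, fp16_outputs, layer_order)
print(culprit, error)
```

Stopping at the first divergent layer matters because errors propagate: every layer after the culprit will also look wrong, so the earliest mismatch is the one worth pinning to higher precision first.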
Answer Strategy
This question tests strategic thinking and cost-awareness. A strong answer covers: 1) profiling current costs (compute type, batch size, utilization), 2) evaluating optimization options (quantization, distillation, architecture search), 3) weighing operational factors (latency vs. throughput, accuracy trade-offs), and 4) framing the business case (projected savings, timeline, risks).
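The business case in step 4 usually reduces to simple arithmetic: how many instances are needed before and after optimization, and what that costs. Every number below (traffic, per-instance throughput, hourly price, utilization headroom) is a hypothetical placeholder, and `monthly_cost` is an illustrative helper, not a pricing model for any real cloud provider.

```python
import math


def monthly_cost(peak_qps, qps_per_instance, hourly_price,
                 hours_per_month=730, headroom=0.7):
    """Instances needed so each runs at no more than `headroom` utilization at peak."""
    instances = math.ceil(peak_qps / (qps_per_instance * headroom))
    return instances, instances * hourly_price * hours_per_month


# Hypothetical inputs: 2000 QPS at peak, 0.50 per instance-hour.
base_instances, base_cost = monthly_cost(2000, qps_per_instance=100, hourly_price=0.50)
# Assumed effect of optimization: 2.5x throughput per instance.
opt_instances, opt_cost = monthly_cost(2000, qps_per_instance=250, hourly_price=0.50)

savings = base_cost - opt_cost
print(base_instances, base_cost, opt_instances, opt_cost, savings)
```

Putting the projection in this form makes the assumptions explicit, so the engineering time for the optimization work can be weighed against a stated monthly saving rather than a vague promise of efficiency.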