AI Text-to-Speech Engineer
An AI Text-to-Speech (TTS) Engineer designs, trains, and deploys neural speech synthesis systems that convert text into natural, e…
Skill Guide
The process of converting and refining trained machine learning models to minimize latency, memory footprint, and computational cost for real-time serving in production environments.
Scenario
You have a trained ResNet-50 model in PyTorch that needs to be served in a web application. The goal is to reduce inference latency on CPU.
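Any latency work in this scenario needs a baseline to compare against. The following is a minimal sketch of a timing harness using only the standard library; `fake_inference` is a hypothetical stand-in for the real ResNet-50 forward pass, and the warm-up and run counts are arbitrary defaults, not recommendations from the guide.

```python
import statistics
import time


def benchmark(fn, warmup=10, runs=100):
    """Return p50/p95/mean latency in milliseconds for a zero-argument callable."""
    # Warm-up calls let caches, lazy initialization, and thread pools settle.
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "mean_ms": statistics.fmean(samples_ms),
    }


def fake_inference():
    # Hypothetical placeholder: swap in the real model call, e.g. model(batch).
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(benchmark(fake_inference))
```

Running the same harness before and after each change (export, quantization, thread tuning) keeps comparisons honest; tail latency (p95) often moves differently from the mean.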
Scenario
Deploy a BERT-based text classification model on an NVIDIA T4 GPU to handle 100+ requests per second with sub-50ms latency.
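The throughput and latency targets in this scenario can be turned into rough capacity numbers before any tuning. The sketch below applies Little's law and a simple batching bound; it assumes a single model instance that processes one batch at a time and a batch latency that does not depend on batch size, which is a simplification. The function names are illustrative.

```python
import math


def in_flight_requests(arrival_rate_rps, latency_ms):
    """Little's law: average concurrency = arrival rate x average time in system."""
    return arrival_rate_rps * latency_ms / 1000.0


def min_batch_size(arrival_rate_rps, batch_latency_ms):
    """Smallest batch size that lets one instance keep up with the arrival rate.

    Assumes every batch takes batch_latency_ms regardless of its size.
    """
    return math.ceil(arrival_rate_rps * batch_latency_ms / 1000.0)


if __name__ == "__main__":
    # 100 requests/s with a 50 ms budget means about 5 requests in flight.
    print(in_flight_requests(100, 50))
    # If one batch takes 20 ms, batches of at least 2 are needed to keep up.
    print(min_batch_size(100, 20))
```

These are lower bounds only: queueing delay and bursty traffic eat into the 50 ms budget, so measured headroom matters more than the arithmetic.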
Scenario
Create a production-ready Automatic Speech Recognition system that streams audio chunks from a microphone and returns partial transcripts in real time, using a streaming Conformer model.
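The control flow of chunked streaming can be sketched independently of the model. In the example below, `decode_chunk` is a hypothetical callback standing in for one step of a streaming Conformer: it takes a chunk and the carried-over state and returns new text plus updated state. The chunk size and the toy decoder are assumptions for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def chunk_stream(samples: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Group an incoming sample stream into fixed-size chunks."""
    buf: List[float] = []
    for sample in samples:
        buf.append(sample)
        if len(buf) == chunk_size:
            yield buf
            buf = []
    if buf:
        # Flush the final, possibly shorter, chunk at end of stream.
        yield buf


def stream_transcripts(
    samples: Iterable[float],
    chunk_size: int,
    decode_chunk: Callable[[List[float], object], Tuple[str, object]],
) -> Iterator[str]:
    """Yield a growing partial transcript after each processed chunk."""
    state = None  # Carried between chunks (e.g. encoder cache in a real model).
    partial = ""
    for chunk in chunk_stream(samples, chunk_size):
        text, state = decode_chunk(chunk, state)
        partial += text
        yield partial


def toy_decoder(chunk, state):
    # Hypothetical decoder: emits one character per chunk and counts chunks.
    count = (state or 0) + 1
    return "a", count


if __name__ == "__main__":
    for partial in stream_transcripts(range(5), 2, toy_decoder):
        print(partial)
```

Keeping state outside the decoder call is what lets each chunk be processed as soon as it arrives instead of waiting for the full utterance.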
ONNX provides the universal intermediate representation. ONNX Runtime is the cross-platform engine for executing and optimizing ONNX graphs. Export tools (`torch.onnx.export`, `tf2onnx`) convert native models to ONNX.
TensorRT performs graph optimization, layer fusion, and kernel auto-tuning for NVIDIA GPUs. It's the industry standard for maximizing GPU inference throughput and minimizing latency. OpenVINO is the Intel equivalent for CPU/VPU deployment.
Frameworks for applying post-training quantization (PTQ) or quantization-aware training (QAT), which reduce model size and speed up inference by replacing floating-point math with integer arithmetic. TensorRT's toolkit integrates tightly with its engine builder.
Triton is the leading solution for deploying multiple optimized models (TensorRT, ONNX, PyTorch) with advanced features like dynamic batching, concurrent model execution, and model pipelines. TF Serving and TorchServe are strong single-framework alternatives.
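The idea behind dynamic batching can be illustrated with a small simulation. This is a simplified model, not Triton's actual scheduler: a batch closes when it reaches a maximum size or when a new request arrives after the oldest queued request has waited longer than a delay limit. The function and parameter names are hypothetical.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_queue_delay_ms):
    """Group sorted arrival timestamps (ms) into batches.

    A batch is closed when it is full, or when the next request arrives
    more than max_queue_delay_ms after the oldest request in the batch.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        is_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_ms
        if is_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0, 1, 2, 10, 11, 30]
    print(dynamic_batches(arrivals, max_batch_size=4, max_queue_delay_ms=5))
    print(dynamic_batches(arrivals, max_batch_size=2, max_queue_delay_ms=5))
```

The trade-off is visible in the two parameters: a larger delay limit builds bigger batches and improves GPU utilization, but every request in the batch pays that wait as added latency.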
Essential tools for identifying bottlenecks. Nsight Systems provides a timeline view of GPU/CPU activity. Nsight Compute and TensorRT's profiler give kernel-level detail to diagnose slow layers and memory issues.
Answer Strategy
Structure the answer as a multi-phase pipeline: 1) Export to ONNX with dynamic shapes. 2) Optimize with TensorRT, enabling FP16 or INT8 (and discuss how INT8 calibration would be done). 3) Consider architectural changes like model pruning or distillation if latency targets aren't met. 4) Deploy via Triton with dynamic batching to maximize GPU utilization. Mention profiling at each step.
Answer Strategy
The core competency is systematic debugging and understanding quantization fundamentals. A strong answer covers: 1) Validate that the calibration dataset is representative of production inputs. 2) Prefer per-channel over per-tensor quantization, especially for weights. 3) Try QAT instead of PTQ. 4) Isolate the layers causing the most accuracy loss using sensitivity analysis. 5) Consider mixed precision, keeping sensitive layers in FP16/FP32.
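The per-channel versus per-tensor point can be checked numerically. The sketch below uses symmetric INT8 fake quantization (quantize, then dequantize) on a made-up two-channel weight matrix whose rows differ in magnitude by roughly 100x; the weights and helper names are illustrative assumptions, not values from a real model.

```python
def symmetric_scale(values):
    """Scale that maps the largest magnitude onto the INT8 limit 127."""
    peak = max(abs(v) for v in values)
    return peak / 127.0 if peak > 0 else 1.0


def quantize_dequantize(values, scale):
    """Round to the INT8 grid, clamp to [-127, 127], and map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(weights, scales):
    """Average reconstruction error with one scale per row (output channel)."""
    total, count = 0.0, 0
    for row, scale in zip(weights, scales):
        for original, restored in zip(row, quantize_dequantize(row, scale)):
            total += abs(original - restored)
            count += 1
    return total / count


# Hypothetical weights: one small-magnitude and one large-magnitude channel.
weights = [
    [0.010, -0.020, 0.015, 0.005],
    [2.000, -1.500, 0.750, -2.500],
]

# Per-tensor: a single scale shared by every row.
shared = symmetric_scale([v for row in weights for v in row])
per_tensor_err = mean_abs_error(weights, [shared] * len(weights))

# Per-channel: each row gets its own scale.
per_channel_err = mean_abs_error(weights, [symmetric_scale(row) for row in weights])

if __name__ == "__main__":
    print(f"per-tensor error:  {per_tensor_err:.6f}")
    print(f"per-channel error: {per_channel_err:.6f}")
```

With a shared scale, the large channel dictates the grid spacing and the small channel collapses onto a handful of levels, which is the usual reason per-channel weight quantization recovers accuracy.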