AI Distillation Engineer
An AI Distillation Engineer specializes in compressing large-scale foundation models into smaller, faster, and cheaper student mod…
Skill Guide
This skill covers converting trained machine learning models to the interoperable ONNX format and deploying them for high-throughput, low-latency inference, using NVIDIA TensorRT for GPU optimization or vLLM for efficient large language model (LLM) serving.
Scenario
Deploy a standard image classification model for a web service with strict latency requirements.
Scenario
Achieve maximum throughput for a sentiment analysis API serving a fine-tuned BERT model on an A10 GPU.
Scenario
Serve a 7B-parameter LLM for a high-traffic chatbot application, keeping response times stable under concurrent user load.
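For the LLM scenario, a quick sizing calculation shows why concurrent load is mainly a memory problem. The sketch below estimates KV cache size from model shape. The configuration (32 layers, 32 KV heads, head dimension 128, FP16) and the traffic figures are assumptions chosen for illustration, not values taken from any specific deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache needed to store one token of one sequence."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Assumed 7B-class configuration, FP16 cache (2 bytes per element).
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2
)

# Assumed traffic: 64 concurrent users, 1024 tokens of context each.
concurrent_users = 64
tokens_per_user = 1024
total_gib = per_token * concurrent_users * tokens_per_user / 2**30

print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"KV cache for all users: {total_gib:.1f} GiB")
```

Under these assumptions the cache alone can exceed the memory of a single GPU, which is why serving frameworks put so much effort into KV cache management.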
These tools cover the initial model translation and graph cleanup phase. torch.onnx.export is the primary export path for PyTorch models. ONNX Simplifier and ONNX GraphSurgeon handle post-export graph optimization and surgery.
TensorRT is the core SDK for optimizing and running GPU inference. trtexec is its command-line tool for building and benchmarking engines. Polygraphy is used for validating and debugging models. TensorRT-LLM is the specialized library for LLMs.
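As an illustration of the engine-build step, the sketch below assembles a trtexec command line with a dynamic-shape profile. The helper name build_trtexec_command, the file names, and the tensor name "input" are hypothetical. The flags shown (--onnx, --saveEngine, --minShapes, --optShapes, --maxShapes, --fp16) are standard trtexec options, but check them against your installed TensorRT version.

```python
import shlex


def build_trtexec_command(onnx_path, engine_path, input_name,
                          min_shape, opt_shape, max_shape, fp16=True):
    """Assemble a trtexec argument list for building an engine (hypothetical helper)."""

    def shape_arg(shape):
        # trtexec expects shapes written as <tensor name>:<d0>x<d1>x...
        return f"{input_name}:" + "x".join(str(d) for d in shape)

    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--minShapes={shape_arg(min_shape)}",
        f"--optShapes={shape_arg(opt_shape)}",
        f"--maxShapes={shape_arg(max_shape)}",
    ]
    if fp16:
        # Allow FP16 kernels; TensorRT still falls back to FP32 where needed.
        args.append("--fp16")
    return args


# Placeholder paths and shapes for an image classifier with a dynamic batch size.
command = build_trtexec_command(
    "model.onnx", "model.plan", "input",
    (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224),
)
print(shlex.join(command))
```

The printed string can be pasted into a shell, or the list can be passed to subprocess.run on a machine where trtexec is installed.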
vLLM delivers high LLM serving throughput through PagedAttention. Text Generation Inference (TGI) is a popular alternative. Triton Inference Server is the production-grade platform that can orchestrate TensorRT engines and other backends.
Nsight Systems provides system-wide GPU/CPU timeline analysis. Nsight Compute offers detailed kernel-level profiling. Both are essential for diagnosing bottlenecks in the inference pipeline.
Answer Strategy
Structure the answer in clear phases: 1) ONNX Export (handling dynamic axes, custom ops), 2) TensorRT Parsing & Optimization (layer fusion, precision selection, calibration for INT8), 3) Engine Build & Profiling (using trtexec, analyzing layer timings). Highlight challenges like unsupported operators and the performance-accuracy trade-off in quantization.
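To make the quantization trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization with simple max calibration. It is a simplification for illustration: TensorRT's own calibrators choose the dynamic range differently, and the activation values are made up.

```python
def calibrate_scale(calibration_values):
    """Max calibration: map the largest observed magnitude to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in q_values]


def max_roundtrip_error(values):
    """Largest absolute error after quantizing and dequantizing."""
    scale = calibrate_scale(values)
    restored = dequantize(quantize(values, scale), scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Made-up activations: a well-behaved tensor, then the same tensor with one outlier.
activations = [0.02, -0.5, 1.27, 0.003, -1.0, 0.75]
with_outlier = activations + [12.7]

error = max_roundtrip_error(activations)
outlier_error = max_roundtrip_error(with_outlier)
print(f"max error without outlier: {error:.4f}")
print(f"max error with outlier:    {outlier_error:.4f}")
```

A single outlier stretches the scale and costs resolution on every other value. This is the reason calibration data and range selection matter when weighing INT8 speed against accuracy.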
Answer Strategy
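The interviewer is testing your understanding of KV cache memory management. Explain the problem first: conventional serving reserves one contiguous key-value (KV) cache region per request, sized for the maximum sequence length, which wastes memory through fragmentation and over-reservation. Then describe PagedAttention's solution: the cache is split into fixed-size blocks that are allocated on demand and mapped through a block table, much like pages in operating-system virtual memory. Emphasize the outcome: higher GPU memory utilization, which allows larger batch sizes and therefore higher throughput.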
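The memory argument can be sketched with a toy allocator. The code below is not vLLM's implementation. It is a simplified model, with assumed sizes, that compares reserving a full maximum-length region per sequence against handing out fixed-size blocks on demand.

```python
def max_sequences_contiguous(total_slots, max_seq_len):
    """Contiguous scheme: each sequence reserves max_seq_len token slots up front."""
    return total_slots // max_seq_len


class PagedKVCache:
    """Toy block allocator: token slots are handed out in fixed-size blocks on demand."""

    def __init__(self, total_slots, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(total_slots // block_size))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Store one more token; allocate a new block only when the last one is full."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                return False  # out of memory: a real server would queue or preempt
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


# Assumed sizes: 8192 token slots, 2048-token maximum, 200 tokens actually used.
contiguous = max_sequences_contiguous(total_slots=8192, max_seq_len=2048)

cache = PagedKVCache(total_slots=8192, block_size=16)
admitted = 0
# Admit sequences one at a time until a 200-token sequence no longer fits.
while all(cache.append_token(admitted) for _ in range(200)):
    admitted += 1

print(f"contiguous reservation fits {contiguous} sequences")
print(f"paged allocation fits {admitted} sequences")
```

With short sequences, the paged scheme wastes at most one partially filled block per sequence, so many more requests fit in the same memory budget and the batch can grow accordingly.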