AI Platform Engineer
AI Platform Engineers design, build, and maintain the internal developer platforms and infrastructure that empower ML engineers an…
Skill Guide
The engineering discipline of deploying trained ML models into production environments and applying systematic techniques (quantization, batching, kernel optimization, hardware acceleration) to maximize throughput, minimize latency, and reduce cost per inference.
Scenario
Deploy a ResNet-50 model for image classification that must handle 100 requests per second with a P99 latency under 100ms on a single GPU.
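A back-of-the-envelope check can frame this scenario before any profiling. The sketch below assumes a hypothetical linear cost model (a fixed per-batch overhead plus a per-image cost) and evenly spaced arrivals; the 4 ms and 1.5 ms figures are illustrative placeholders, not measurements, and queueing behind an in-flight batch is ignored.

```python
def batch_latency_ms(batch_size, overhead_ms, per_image_ms):
    """Assumed linear cost model: fixed overhead plus per-image work."""
    return overhead_ms + per_image_ms * batch_size


def feasible_batch_sizes(candidates, target_rps, p99_budget_ms,
                         overhead_ms, per_image_ms):
    """Return batch sizes that sustain target_rps within the latency budget."""
    feasible = []
    for b in candidates:
        compute_ms = batch_latency_ms(b, overhead_ms, per_image_ms)
        throughput_rps = b / (compute_ms / 1000.0)
        # The first request in a batch waits for b - 1 more arrivals.
        fill_wait_ms = (b - 1) / target_rps * 1000.0
        worst_case_ms = fill_wait_ms + compute_ms
        if throughput_rps >= target_rps and worst_case_ms <= p99_budget_ms:
            feasible.append(b)
    return feasible


# Hypothetical numbers: 4 ms overhead per batch, 1.5 ms per image.
sizes = feasible_batch_sizes([1, 2, 4, 8, 16, 32], target_rps=100,
                             p99_budget_ms=100, overhead_ms=4.0,
                             per_image_ms=1.5)
print(sizes)
```

Under these assumptions, large batches fail on the time spent waiting for the batch to fill rather than on GPU compute, which is why a maximum queue delay matters as much as the batch size itself.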
Scenario
Deploy a 7B parameter LLM (e.g., Mistral) to serve a chatbot product with variable prompt lengths, optimizing for maximum tokens-per-second per GPU dollar.
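The target metric in this scenario, tokens per second per GPU dollar, reduces to simple arithmetic once throughput and hourly GPU price are known. The sketch below compares two hypothetical serving configurations; the throughput and price values are made-up placeholders for illustration, not benchmark results.

```python
def tokens_per_gpu_dollar(tokens_per_second, gpu_hourly_usd):
    """Tokens generated for each dollar of GPU time."""
    return tokens_per_second * 3600.0 / gpu_hourly_usd


def usd_per_million_tokens(tokens_per_second, gpu_hourly_usd):
    """Cost in USD to generate one million tokens."""
    return 1_000_000.0 / tokens_per_gpu_dollar(tokens_per_second, gpu_hourly_usd)


# Hypothetical (tokens/s, USD/hour) pairs for two configurations.
configs = {
    "baseline": (900.0, 2.00),
    "continuous_batching": (2400.0, 2.00),
}

costs = {name: usd_per_million_tokens(tps, price)
         for name, (tps, price) in configs.items()}
best = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:.3f} per million tokens")
print("cheapest:", best)
```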
Scenario
Design a system for a video analytics platform that performs: 1) object detection per frame, 2) OCR on detected text regions, and 3) sentiment analysis on extracted text, all within a 500ms end-to-end latency budget.
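For a sequential pipeline like this one, a first design step is to split the 500ms budget across stages and see which stage dominates. The sketch below uses hypothetical per-stage latency estimates, including an assumed allowance for data transfer and queueing. Adding per-stage figures is a planning heuristic, not an exact end-to-end percentile.

```python
def check_latency_budget(stage_ms, budget_ms):
    """Sum per-stage latency estimates and compare against the budget."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        "slowest_stage": max(stage_ms, key=stage_ms.get),
    }


# Hypothetical estimates in milliseconds; replace with measured values.
stages = {
    "object_detection": 120,
    "ocr": 200,
    "sentiment_analysis": 60,
    "transfer_and_queueing": 40,
}

report = check_latency_budget(stages, budget_ms=500)
print(report)
```

The slowest stage is the natural first target for optimization, since shaving it buys the most headroom for the rest of the pipeline.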
The core platforms for model deployment. Triton Inference Server is a general-purpose server that hosts models from multiple backends (such as TensorRT, ONNX Runtime, and PyTorch) side by side. vLLM and TensorRT-LLM are specialized for high-throughput LLM serving. Choose based on model type, the need for custom backends, and operational complexity.
Used to compile, optimize, and quantize models for specific hardware targets. TensorRT targets NVIDIA GPUs and builds engine files tuned to a particular GPU and precision. ONNX Runtime provides cross-platform acceleration through pluggable execution providers. These are often used *before* deploying to a serving framework.
Nsight and PyTorch Profiler are for deep kernel-level performance analysis. Triton Model Analyzer is purpose-built for finding the optimal configuration (batch size, instance count) for models on Triton. Prometheus/Grafana are for production monitoring of the metrics behind SLOs, such as latency, throughput, and queue depth.
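A latency SLO is usually stated as a percentile, so it helps to be precise about how that percentile is computed. Below is a minimal sketch using the nearest-rank method on a list of raw samples. Monitoring systems often estimate percentiles from histogram buckets instead, so their numbers can differ slightly. The sample data and the 100ms threshold are illustrative assumptions.

```python
def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile for an integer p in 0..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def meets_slo(samples, p, threshold_ms):
    """True when the p-th percentile latency is within the threshold."""
    return percentile_nearest_rank(samples, p) <= threshold_ms


# Illustrative latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p50 = percentile_nearest_rank(latencies_ms, 50)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(p50, p99, meets_slo(latencies_ms, 99, threshold_ms=100))
```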
Answer Strategy
The interviewer is testing systematic debugging and knowledge of the optimization stack. Structure your answer as a phased plan. **Sample Answer:** 'First, I would instrument the serving code with PyTorch Profiler to get a kernel-level timeline and identify whether the bottleneck is in preprocessing, model execution, or postprocessing. Assuming model execution on the GPU dominates, I would export the model to ONNX and build an optimized version with ONNX Runtime or TensorRT, running in FP16 and checking accuracy against the original. For deployment, I would move from Flask to a dedicated inference server such as Triton (or vLLM if the model is an LLM). With Triton, I would enable dynamic batching and use Model Analyzer to find the instance count and batch size that saturate the GPU without exhausting memory, maximizing throughput within our latency budget.'
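The first step of that plan, finding which phase dominates, can be illustrated without any ML framework. The sketch below times three stand-in phases with a wall-clock timer; `preprocess`, `infer`, and `postprocess` are hypothetical placeholders, and the sleep merely simulates model execution. GPU work is asynchronous, so a real measurement would need to synchronize the device before reading the clock, or rely on a profiler.

```python
import time


def time_phases(phases, payload):
    """Run each (name, fn) phase in order and record wall-clock milliseconds."""
    timings = {}
    for name, fn in phases:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return payload, timings


def preprocess(pixels):
    # Stand-in: scale 8-bit pixel values into [0, 1].
    return [v / 255.0 for v in pixels]


def infer(inputs):
    # Stand-in for model execution; the sleep simulates compute time.
    time.sleep(0.02)
    return sum(inputs)


def postprocess(score):
    return round(score, 3)


result, timings = time_phases(
    [("preprocessing", preprocess),
     ("inference", infer),
     ("postprocessing", postprocess)],
    list(range(256)),
)
bottleneck = max(timings, key=timings.get)
print(timings, "bottleneck:", bottleneck)
```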
Answer Strategy
This tests business acumen and technical pragmatism. Use the STAR method briefly. **Sample Answer:** 'In a real-time fraud detection system, our initial BERT-based model had a P99 latency of 250ms, which was too slow. I framed the decision as a business impact analysis: the cost of missing fraud vs. the cost of delayed transactions. I led an effort to distill the model into a smaller one and apply INT8 quantization, rigorously measuring the accuracy drop on a held-out set. We accepted a 0.5% relative accuracy drop because it reduced latency to 50ms and cut GPU costs by 60%, directly improving the ROI of the system and allowing us to scale to 10x more traffic.'
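The accuracy cost of INT8 quantization comes from rounding values onto a coarse integer grid. The sketch below shows symmetric per-tensor quantization on a toy list of weights and measures the round-trip error. It is an illustration only: the weights are made up, and production toolchains typically add calibration data and per-channel scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Toy weights chosen for illustration only.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, scale, max_error)
```

The round-trip error is bounded by half the scale, so a tensor with a few large outliers loses precision on all of its small values. That is one reason to measure accuracy on held-out data rather than assume quantization is free.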