AI Model Routing Engineer
An AI Model Routing Engineer designs and operates intelligent decision layers that dynamically direct user requests to the optimal…
Skill Guide
The systematic practice of identifying computational bottlenecks and implementing targeted optimizations to reduce latency and resource consumption across different inference execution patterns in ML systems.
Scenario
A containerized image classification model shows >5s cold-start latency on AWS Lambda. Need to reduce to <1s while maintaining accuracy.
Scenario
A credit card transaction scoring model processes 1000 TPS with p99 latency >500ms. Need to achieve <100ms while maintaining throughput.
Scenario
An e-commerce platform runs 50+ models (recommendations, search ranking, content generation) with heterogeneous latency requirements and resource constraints.
PyTorch Profiler for framework-level metrics, Nsight for GPU kernel analysis, OpenTelemetry for distributed tracing across services.
Triton for multi-framework serving with advanced batching, Ray Serve for Python-native scalable serving, BentoML for packaging and deployment.
ONNX Runtime for cross-framework optimization, TensorRT for NVIDIA GPU acceleration, DeepSpeed for large model inference.
Prometheus for metrics collection, Grafana for visualization, Evidently for model performance monitoring in production.
Answer Strategy
Use a structured framework: 1) Check infrastructure metrics (CPU/GPU utilization, memory), 2) Examine model-specific metrics (batch size distribution, queue depth), 3) Profile model execution with Nsight/PyTorch Profiler, 4) Check for data distribution shifts. Sample answer: 'I'd start by checking Grafana dashboards for resource utilization anomalies, then examine Triton's built-in metrics for queue backlog and batch timeouts. Next, I'd run targeted profiling with Nsight to identify if the regression is in preprocessing, model execution, or postprocessing. Finally, I'd check if recent training data shifts are causing computational spikes in certain model layers.'
Answer Strategy
Tests architectural thinking and understanding of trade-offs. Sample answer: 'I'd implement a dual-path architecture: real-time requests go through a low-latency path with model caching and optimized batching, while batch jobs use a high-throughput path with larger batch sizes and relaxed latency SLAs. The key is implementing intelligent routing based on request metadata and SLA requirements, with shared model artifacts to maintain consistency. I'd use Triton's model scheduling policies to implement priority queues, ensuring real-time requests get preferential access to GPU resources.'
1 career found
Try a different search term.