AI Caching Systems Engineer
An AI Caching Systems Engineer architects, implements, and optimizes sophisticated caching layers specifically for AI inference pi…
Skill Guide
The ability to deploy, configure, manage, and optimize machine learning models for high-throughput, low-latency inference in production using specialized serving platforms.
Scenario
You have a pre-trained image classification model (e.g., ResNet-50 in ONNX format) and need to expose it as a REST API endpoint for an internal demo.
Scenario
Deploy a user-facing text analysis pipeline: Model A (tokenizer/preprocessor), Model B (sentiment classifier), and Model C (entity extractor), where the final output requires aggregation. The goal is to maximize throughput under a 100ms latency SLA.
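One way to think about the latency budget in this scenario is sketched below. It assumes, beyond what the scenario states, that Model B and Model C each depend only on Model A's output, so they can run concurrently before aggregation. The three model functions are hypothetical stubs that sleep instead of calling a real serving endpoint, and the 100 ms budget is applied to the whole request.

```python
import asyncio
import time

# Hypothetical stand-ins for the three models. In a real deployment each
# would be a call to a serving endpoint; here they only sleep briefly.
async def tokenize(text):
    await asyncio.sleep(0.001)
    return text.lower().split()

async def classify_sentiment(tokens):
    await asyncio.sleep(0.005)
    return "positive" if "good" in tokens else "neutral"

async def extract_entities(tokens):
    await asyncio.sleep(0.005)
    return [t for t in tokens if t in {"triton", "vllm"}]

async def _pipeline(text):
    tokens = await tokenize(text)  # Model A must finish first
    # Assumption: B and C are independent of each other, so run them together.
    sentiment, entities = await asyncio.gather(
        classify_sentiment(tokens), extract_entities(tokens)
    )
    return {"sentiment": sentiment, "entities": entities}  # aggregation step

async def analyze(text, budget_s=0.100):
    """Run the whole pipeline under one end-to-end latency budget."""
    start = time.perf_counter()
    result = await asyncio.wait_for(_pipeline(text), timeout=budget_s)
    result["elapsed_s"] = time.perf_counter() - start
    return result

result = asyncio.run(analyze("Triton is good"))
print(result["sentiment"], result["entities"])
```

With B and C overlapped, the request cost is roughly A plus the slower of B and C rather than the sum of all three. A request that exceeds the budget raises `asyncio.TimeoutError`, which the caller can turn into a fallback response.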
Scenario
You must serve a 70B parameter LLM for an internal chat application with variable daily traffic (low at night, peak during business hours), optimizing for both cost and response time stability.
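A rough sizing calculation helps frame this scenario before any platform is chosen. The sketch below estimates the memory needed for the weights alone at different precisions and the minimum number of GPUs that implies. The 80 GB GPU size, the 90% usable fraction, and the KV-cache reserve are illustrative assumptions, not figures from the scenario.

```python
import math

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def min_gpus(weight_gb, gpu_memory_gb, usable_fraction=0.9, kv_cache_reserve_gb=0.0):
    """Smallest GPU count whose usable memory holds weights plus a KV-cache reserve."""
    usable_per_gpu = gpu_memory_gb * usable_fraction
    return math.ceil((weight_gb + kv_cache_reserve_gb) / usable_per_gpu)

PARAMS = 70e9
fp16_gb = weight_memory_gb(PARAMS, 2)    # 2 bytes per parameter
int4_gb = weight_memory_gb(PARAMS, 0.5)  # 4 bits per parameter

# Assumed hardware: 80 GB GPUs with 90% of memory usable.
print(fp16_gb, min_gpus(fp16_gb, 80))
print(fp16_gb, min_gpus(fp16_gb, 80, kv_cache_reserve_gb=40))
print(int4_gb, min_gpus(int4_gb, 80))
```

The calculation ignores activations and runtime overhead, so it gives a lower bound. It still shows why precision and KV-cache headroom drive the cost side of this scenario: the GPU count sets the price of each replica that must stay up during quiet hours.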
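Core serving platforms. Triton Inference Server serves models from multiple frameworks (TensorRT, ONNX, PyTorch, TensorFlow) behind a single server and supports multi-model pipelines. TF Serving is optimized for TensorFlow models. vLLM is a high-throughput LLM inference engine built around PagedAttention, which manages KV-cache memory in fixed-size blocks. Choose based on your model ecosystem and hardware.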
TensorRT compiles models for peak GPU performance. ONNX provides framework interoperability. Docker/K8s provide the standard containerized deployment and orchestration layer. Helm charts simplify complex deployments.
Prometheus scrapes server metrics (latency, throughput, GPU usage); Grafana visualizes them. Nsight Systems profiles the CPU/GPU timeline, including kernel launches and memory transfers. vLLM exposes detailed queue and scheduling metrics.
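As an illustration of what a scraped metrics payload looks like, the sketch below parses a small sample in the Prometheus text exposition format and derives average queue time per request. The metric names in the sample are hypothetical and are not the names any particular server exports. The parser is deliberately simplified: it skips comment lines and ignores samples that carry a trailing timestamp.

```python
import re

# Hypothetical payload; real servers export their own metric names.
SAMPLE = """\
# HELP demo_queue_duration_us Hypothetical cumulative queue time in microseconds
# TYPE demo_queue_duration_us counter
demo_queue_duration_us{model="resnet50"} 1200000
demo_request_success_total{model="resnet50"} 4000
demo_gpu_utilization{gpu="0"} 0.35
"""

# name, optional {labels}, then a single numeric value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Return {(metric_name, raw_label_string): value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplification: unsupported line shapes are ignored
        name, labels, value = match.groups()
        samples[(name, labels or "")] = float(value)
    return samples

def total(samples, name):
    """Sum a metric across all of its label sets."""
    return sum(v for (n, _), v in samples.items() if n == name)

metrics = parse_metrics(SAMPLE)
avg_queue_us = total(metrics, "demo_queue_duration_us") / total(
    metrics, "demo_request_success_total"
)
print(avg_queue_us)
```

In practice Prometheus does this scraping and the division would be a query over rate-of-change rather than raw counters; the sketch only shows how cumulative counters turn into a per-request figure.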
Answer Strategy
Use a structured diagnostic framework: 1) Resource & Status Check, 2) Bottleneck Identification, 3) Hypothesis Testing, 4) Mitigation & Monitoring. Sample answer: 'First, I'd check the server's resource utilization (GPU, memory, CPU) and logs for errors like OOM. Next, I'd examine the server's metrics endpoint for changes in queue latency and batch size. If the GPU is underutilized but the queue is growing, the bottleneck is likely upstream of model computation: CPU-bound preprocessing, data transfer, too few model instances, or a batching configuration that leaves the GPU waiting. If the GPU is saturated and compute time per request has risen, the model itself is the likely bottleneck, and I'd profile a single inference request with Nsight Systems to look for kernel inefficiencies. Based on the findings, I might try recompiling the model with TensorRT, increasing the max batch size if memory allows, or, if the regression followed a deployment, rolling back to the previous model version.'
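The bottleneck-identification step can be expressed as a small decision function. This is a minimal sketch, not a production rule set: the `Snapshot` fields and every threshold are hypothetical, and it treats an idle GPU with a growing queue as a sign that the GPU is being starved by something upstream rather than limited by model computation.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    gpu_util: float         # fraction of time the GPU is busy, 0.0 to 1.0
    queue_time_ms: float    # average time a request waits before execution
    compute_time_ms: float  # average model execution time per request
    oom_in_logs: bool       # whether out-of-memory errors appear in the logs

def triage(before, after, growth=1.5, low_gpu=0.5, high_gpu=0.9):
    """Classify a latency regression by comparing two snapshots.

    All thresholds are illustrative assumptions and need tuning per system.
    """
    if after.oom_in_logs:
        return "memory"  # step 1: resource and status check
    queue_grew = after.queue_time_ms > growth * before.queue_time_ms
    compute_grew = after.compute_time_ms > growth * before.compute_time_ms
    if queue_grew and after.gpu_util < low_gpu:
        return "upstream"  # GPU starved: preprocessing, I/O, or batching config
    if compute_grew or after.gpu_util >= high_gpu:
        return "compute"   # model execution is the limit; profile it
    if queue_grew:
        return "capacity"  # load outgrew the current instance count
    return "inconclusive"

baseline = Snapshot(gpu_util=0.8, queue_time_ms=2.0, compute_time_ms=10.0, oom_in_logs=False)
regressed = Snapshot(gpu_util=0.3, queue_time_ms=20.0, compute_time_ms=10.0, oom_in_logs=False)
print(triage(baseline, regressed))
```

Each label maps to a different mitigation: "upstream" points at CPU preprocessing or batching settings, "compute" at profiling and recompilation, "capacity" at adding instances, and "memory" at batch size or model footprint.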
Answer Strategy
This tests the ability to bridge the development/production gap and knowledge of optimization. The response should focus on a reproducible, optimized pipeline. Sample answer: 'My first step is to avoid serving the raw PyTorch script. I'd export the model to a standardized format like ONNX or TorchScript, which removes most of the Python interpreter overhead from the inference path. I'd then choose a serving framework, likely Triton if we need a flexible pipeline, or TorchServe if it's a pure PyTorch shop. The key optimization phase involves: 1) converting the model to TensorRT for maximum GPU performance, 2) configuring dynamic batching in the server, and 3) load testing with a tool such as Triton's `perf_analyzer` to find the batch size and instance count that meet our latency SLO.'
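The final load-testing step amounts to a constrained search: among the configurations tried, keep those that meet the latency SLO and pick the one with the highest throughput. The sketch below shows that selection. The `Run` fields, the 100 ms SLO, and all numbers in `RUNS` are made-up placeholders for illustration, not measurements from any real system or tool output.

```python
from typing import NamedTuple

class Run(NamedTuple):
    max_batch_size: int
    instance_count: int
    throughput_rps: float   # requests per second sustained in the test
    p99_latency_ms: float   # 99th-percentile latency observed in the test

def best_config(runs, slo_ms):
    """Return the highest-throughput run whose p99 latency meets the SLO, or None."""
    within_slo = [r for r in runs if r.p99_latency_ms <= slo_ms]
    if not within_slo:
        return None
    return max(within_slo, key=lambda r: r.throughput_rps)

# Placeholder results standing in for a sweep over batch size and instance count.
RUNS = [
    Run(8, 1, 900.0, 40.0),
    Run(16, 1, 1400.0, 70.0),
    Run(32, 1, 1800.0, 130.0),
    Run(16, 2, 2100.0, 95.0),
    Run(32, 2, 2600.0, 160.0),
]

choice = best_config(RUNS, slo_ms=100.0)
print(choice)
```

Selecting on p99 rather than mean latency is a design choice: larger batches usually raise throughput while stretching the tail, so the tail is what tends to break the SLO first.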