AI Caching Systems Engineer
An AI Caching Systems Engineer architects, implements, and optimizes sophisticated caching layers specifically for AI inference pi…
Skill Guide
A systematic understanding of how a trained ML model is loaded, executed, and served in production to generate predictions, together with the ability to identify the compute, memory, and I/O constraints that limit its performance, scalability, and cost-efficiency.
Scenario
You have a pre-trained ResNet-50 image classification model served through a Flask API. Users report high response times (~500 ms) during peak load.
Scenario
Your company needs to serve a 7B-parameter LLM for a chatbot product within a strict cost budget per 1,000 queries and a P99 latency SLA of 2 seconds.
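The two constraints in this scenario can both be checked with simple arithmetic. The sketch below is illustrative only: the helper names `percentile` and `cost_per_1000_queries` are hypothetical, the nearest-rank percentile is one of several common definitions, and the GPU price and throughput figures are placeholder assumptions rather than measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def cost_per_1000_queries(gpu_hourly_usd, queries_per_second):
    """Serving cost per 1,000 queries for one GPU at a sustained throughput."""
    queries_per_hour = queries_per_second * 3600
    return gpu_hourly_usd / queries_per_hour * 1000


if __name__ == "__main__":
    # Placeholder latencies in seconds; real values would come from load tests.
    latencies = [0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.4]
    p99 = percentile(latencies, 99)
    print("P99 latency (s):", p99, "meets 2 s SLA:", p99 <= 2.0)

    # Assumed figures: a GPU at $2.50/hour sustaining 5 queries/second.
    print("USD per 1,000 queries:", cost_per_1000_queries(2.50, 5))
```

Raising sustained queries per second per GPU (for example through batching or caching) lowers the cost figure directly, but it usually has to be weighed against the P99 check, since larger batches tend to increase tail latency.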
Scenario
A financial services company requires sub-50ms inference for fraud scoring on transactions globally, with 99.99% uptime and the ability to roll out new models with zero downtime. The system must handle 100k TPS.
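A rough sizing pass helps make these requirements concrete. The sketch below applies Little's law (requests in flight = arrival rate × latency) and converts the availability target into a yearly downtime budget. The function names are hypothetical, and the per-replica throughput and headroom values are assumptions to be replaced with benchmarked numbers.

```python
import math


def in_flight_requests(tps, latency_s):
    """Little's law: average concurrent requests = arrival rate * latency."""
    return tps * latency_s


def replicas_needed(tps, per_replica_tps, headroom=0.3):
    """Replica count when each replica is loaded only to (1 - headroom) of its capacity."""
    usable_tps = per_replica_tps * (1 - headroom)
    return math.ceil(tps / usable_tps)


def allowed_downtime_minutes_per_year(availability_pct):
    """Downtime budget implied by an availability target, for a 365-day year."""
    return (1 - availability_pct / 100) * 365 * 24 * 60


if __name__ == "__main__":
    # Figures from the scenario: 100k TPS, 50 ms latency bound, 99.99% uptime.
    print("Requests in flight:", in_flight_requests(100_000, 0.050))
    print("Downtime budget (min/year):", allowed_downtime_minutes_per_year(99.99))

    # Assumption: one replica sustains 2,000 TPS in a benchmark.
    print("Replicas needed:", replicas_needed(100_000, 2_000))
```

At 100k TPS and a 50 ms bound, roughly 5,000 requests are in flight at any moment, and 99.99% uptime leaves about 52.6 minutes of downtime per year, which is why zero-downtime rollout is listed as a hard requirement rather than a convenience.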
Use NVIDIA Nsight for GPU kernel-level analysis, PyTorch Profiler for operator-level timing, and Prometheus/Grafana for production monitoring of latency, throughput, and memory usage. Start with `cProfile` for quick Python-level bottlenecks.
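As a starting point for the `cProfile` step, the following is a minimal sketch using only the Python standard library. The functions `preprocess`, `predict`, and `handle_request` are hypothetical stand-ins for a real request handler; in practice the profiled call would be the actual preprocessing and model inference path.

```python
import cProfile
import io
import pstats


def preprocess(n):
    # Hypothetical stand-in for image decoding and resizing.
    return [i * i for i in range(n)]


def predict(batch):
    # Hypothetical stand-in for the model forward pass.
    return sum(batch) % 1000


def handle_request(n=50_000):
    # Hypothetical request handler: preprocess, then predict.
    return predict(preprocess(n))


def profile_handler(calls=20, top=5):
    """Profile repeated handler calls and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(calls):
        handle_request()
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_handler())
```

Sorting by cumulative time shows which stage of the handler dominates. If most of the time sits in Python-level preprocessing rather than in the model call, GPU-level tools such as Nsight are unlikely to be the right next step.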
TensorRT and ONNX Runtime handle model graph optimization and hardware-specific acceleration. NVIDIA Triton Inference Server is a widely used choice for high-performance, multi-framework model serving in production. TorchServe is the simpler option when the deployment is PyTorch-only.
Kubernetes provides the orchestration layer for scalable, resilient serving. KServe and Seldon Core add model-serving capabilities such as canary rollouts and A/B testing on top of Kubernetes. Managed cloud ML services take over infrastructure operations but can limit low-level control.