AI Infrastructure Engineer
AI Infrastructure Engineers design, build, and maintain the foundational systems that power machine learning workloads at scale - …
Skill Guide
ML model serving architectures are the infrastructure patterns and design principles for deploying trained machine learning models to handle inference requests, categorized by latency and throughput requirements into batch (offline), real-time (synchronous online), and streaming (asynchronous online) paradigms.
Scenario
You have a trained scikit-learn model for iris classification. You need to serve it so a web application can send flower measurements and get a species prediction instantly.
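A minimal sketch of the synchronous request/response path for this scenario, using only the Python standard library. The nearest-centroid classifier and its rough centroid values are a hypothetical stand-in for the trained scikit-learn model, and the /predict route, the JSON shape, and the function names are assumptions made for illustration. In practice the fitted estimator would be loaded once at startup and called in place of predict_species.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the trained model: rough per-species mean
# measurements (sepal length, sepal width, petal length, petal width) in cm.
CENTROIDS = {
    "setosa": (5.0, 3.4, 1.5, 0.2),
    "versicolor": (5.9, 2.8, 4.3, 1.3),
    "virginica": (6.6, 3.0, 5.6, 2.0),
}


def predict_species(features):
    """Return the species whose centroid is closest to the measurements."""
    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(CENTROIDS, key=lambda name: squared_distance(CENTROIDS[name]))


def handle_predict(body: bytes):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"features": [f1, f2, f3, f4]}'}
    if len(features) != 4:
        return 400, {"error": "exactly 4 measurements are required"}
    return 200, {"species": predict_species(features)}


class PredictHandler(BaseHTTPRequestHandler):
    """HTTP wrapper: POST /predict with a JSON body returns a JSON prediction."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000):
    """Block and serve prediction requests until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling serve() starts the endpoint. Keeping validation and prediction in handle_predict, separate from the HTTP handler, lets the same logic be unit-tested or moved behind a dedicated serving framework later.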
Scenario
A marketing team needs a weekly report of all new users scored with a propensity-to-churn model. The data lives in a data warehouse (e.g., BigQuery, Snowflake).
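A rough sketch of the offline pattern: read an extract, score it in bounded chunks, and write the results for the report. The in-memory CSV stands in for a warehouse export, and the feature names, weights, and logistic scoring function are invented for illustration and are not taken from any real churn model.

```python
import csv
import io
import math
from itertools import islice

# Hypothetical model: a logistic function over two made-up features.
WEIGHTS = {"days_since_login": 0.08, "support_tickets": 0.5}
BIAS = -3.0


def churn_propensity(row):
    """Return a score in (0, 1) for one user row."""
    z = BIAS + sum(WEIGHTS[name] * float(row[name]) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def chunks(rows, size):
    """Yield lists of up to `size` rows so memory use stays bounded."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def score_batch(reader, writer, chunk_size=1000):
    """Score every row chunk by chunk; return the number of rows written."""
    writer.writerow(["user_id", "churn_score"])
    total = 0
    for chunk in chunks(reader, chunk_size):
        for row in chunk:
            writer.writerow([row["user_id"], f"{churn_propensity(row):.4f}"])
        total += len(chunk)
    return total


# Stand-in for a warehouse extract of this week's new users.
extract = io.StringIO(
    "user_id,days_since_login,support_tickets\n"
    "u1,1,0\n"
    "u2,30,4\n"
)
output = io.StringIO()
scored = score_batch(csv.DictReader(extract), csv.writer(output), chunk_size=1)
```

A weekly job like this would typically be triggered by a scheduler, with the output written back to a warehouse table rather than to an in-memory buffer.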
Scenario
You are building a real-time fraud detection system for a fintech app. The model requires features computed from the last 5 minutes of user transaction history (streaming features) and static user attributes.
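One way to picture the feature side of this scenario is a per-user sliding window that is joined with static attributes at request time. The sketch below is an in-memory illustration only: the class, function, and feature names are hypothetical, timestamps are plain seconds, and events are assumed to arrive in time order for each user. A production system would keep this state in a stream processor or an online feature store.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # the "last 5 minutes" from the scenario


class TransactionWindow:
    """Keeps each user's recent transactions and derives window features."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        # user_id -> deque of (timestamp, amount), oldest first
        self._events = defaultdict(deque)

    def add(self, user_id, timestamp, amount):
        """Record one transaction (assumes in-order arrival per user)."""
        self._events[user_id].append((timestamp, amount))

    def features(self, user_id, now):
        """Evict expired events, then return count and sum for the window."""
        events = self._events[user_id]
        while events and events[0][0] <= now - self.window_seconds:
            events.popleft()
        return {
            "txn_count_5m": len(events),
            "txn_amount_5m": sum(amount for _, amount in events),
        }


def build_feature_vector(window, static_attributes, user_id, now):
    """Join streaming features with static user attributes at request time."""
    vector = dict(static_attributes.get(user_id, {}))
    vector.update(window.features(user_id, now))
    return vector
```

For example, after transactions at t=0, t=200, and t=400 seconds, a request at t=450 sees only the last two, because the first falls outside the 300-second window.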
Core software for loading models and exposing them via APIs (REST/gRPC). Triton targets performance-critical, multi-framework, GPU-optimized serving. Seldon and KServe are Kubernetes-native and handle orchestration. BentoML simplifies packaging and deployment.
Kubernetes is the de facto standard for scaling and managing containerized serving workloads. Airflow/Prefect orchestrate batch pipelines. Kafka/Pulsar are the backbone for streaming inference architectures, handling high-throughput event flows.
Feature stores (Feast, Tecton, Hopsworks) serve consistent features to both batch and real-time models, which reduces training-serving skew. MLflow manages model versions, packaging, and deployment.
Answer Strategy
The interviewer is testing the ability to translate business requirements (real-time, user-facing) into a technical architecture. Use the 'data path' framework: source -> processing -> serving. Sample answer: 'This requires a real-time online architecture. I'd propose: 1) A streaming pipeline (Kafka) to capture user click events. 2) A streaming feature store to compute short-term engagement features (e.g., last 5 viewed items). 3) The model, served via a low-latency framework like Triton, would call the feature store for each user's session. 4) The endpoint would be a gRPC service for minimal latency, deployed on a Kubernetes cluster with horizontal pod autoscaling to handle traffic spikes.'
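The horizontal pod autoscaling mentioned in the sample answer follows a proportional rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0. The function below is a simplified illustration of that arithmetic, not the autoscaler's actual code. The function name, parameter names, and replica bounds are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 pods averaging 90% CPU against a 60% target, the rule asks for 6 pods. At 62% the ratio is within tolerance and the count stays at 4.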
Answer Strategy
This tests the understanding of the batch-to-streaming continuum and incremental processing. The core competency is system evolution and cost management. Sample answer: 'First, I'd profile the current batch job to identify the bottleneck: is it model computation, data I/O, or feature engineering? For a move to hourly, a micro-batch approach using Spark Structured Streaming or AWS Batch with smaller time windows is often more cost-effective than full streaming. I would implement a change-data-capture (CDC) pattern to process only new or updated data each hour, drastically reducing compute. The key is to incrementally validate the new job's accuracy against the existing batch baseline before a full switch.'
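The incremental idea in the sample answer can be sketched with a timestamp watermark, a simplified stand-in for real change-data-capture, which would normally read a database change log. The row shape (id, updated_at) and both function names are hypothetical.

```python
def incremental_score(rows, last_watermark, score_fn):
    """Score only rows changed after the previous run's watermark.

    Returns (scores_by_id, new_watermark).
    """
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    scores = {r["id"]: score_fn(r) for r in changed}
    # Keep the old watermark when nothing changed in this run.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return scores, new_watermark


def max_divergence(new_scores, baseline_scores):
    """Largest absolute score difference on ids present in both outputs."""
    shared = new_scores.keys() & baseline_scores.keys()
    return max((abs(new_scores[k] - baseline_scores[k]) for k in shared), default=0.0)
```

Each hourly run passes in the watermark returned by the previous run, so unchanged rows are never rescored. Comparing the incremental output with the existing batch output via max_divergence is one simple way to run the validation step before switching over.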