Interview Prep
AI Model Routing Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A strong answer explains that different models have different strengths, costs, and latencies, so routing selects the best model per request to optimize quality, cost, and speed.
Expect candidates to mention OpenAI, Anthropic, and Google (or Cohere/Meta) and compare aspects like per-token pricing, context window limits, or streaming support.
A fallback chain is a sequence of backup models invoked when the primary model fails, is rate-limited, or exceeds latency thresholds - ensuring high availability.
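A candidate might illustrate the fallback chain with a few lines of code. The following is a minimal sketch, not a production client: `primary` and `backup` are hypothetical stand-ins for real provider calls, and real code would catch provider-specific error types rather than a bare `Exception`.

```python
from __future__ import annotations

from typing import Callable, Sequence


class AllModelsFailed(Exception):
    """Raised when every model in the chain has failed."""


def call_with_fallback(
    prompt: str, chain: Sequence[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, output) from the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any exception means "try the next model"
            errors.append(f"{name}: {exc}")
    raise AllModelsFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def primary(prompt: str) -> str:
    raise TimeoutError("rate limited")


def backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


result = call_with_fallback("hello", [("primary", primary), ("backup", backup)])
```

Here `result` names the model that actually served the request, which is worth logging so the fallback invocation rate can be monitored.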
Cold start refers to the latency spike when a model endpoint spins up from idle; routing logic must account for this by pre-warming endpoints or routing to always-on models.
Token pricing means you pay per input/output token; routing should send short, simple tasks to cheaper models and reserve expensive models for complex tasks where quality justifies cost.
Intermediate
10 questions
A great answer describes a weighted multi-objective function, e.g., Score = w1*(1/normalized_cost) + w2*(1/normalized_latency) + w3*(quality_score), with weights tunable per use case.
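The scoring function can be sketched as follows. This is one possible reading, under the assumption that cost and latency are normalized by dividing by the minimum across candidates, so the cheapest or fastest model gets 1.0 and the reciprocal stays in (0, 1]. The model names and numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    cost: float     # illustrative: dollars per 1K tokens, must be > 0
    latency: float  # illustrative: p95 seconds, must be > 0
    quality: float  # illustrative: 0..1 score from an internal benchmark


def score_models(models, w_cost, w_latency, w_quality):
    """Score = w1*(1/normalized_cost) + w2*(1/normalized_latency) + w3*quality."""
    min_cost = min(m.cost for m in models)
    min_latency = min(m.latency for m in models)
    scores = {}
    for m in models:
        normalized_cost = m.cost / min_cost            # cheapest model -> 1.0
        normalized_latency = m.latency / min_latency   # fastest model -> 1.0
        scores[m.name] = (
            w_cost * (1 / normalized_cost)
            + w_latency * (1 / normalized_latency)
            + w_quality * m.quality
        )
    return scores


models = [
    ModelStats("small", cost=1.0, latency=0.5, quality=0.6),
    ModelStats("large", cost=10.0, latency=2.0, quality=0.9),
]
cost_sensitive = score_models(models, w_cost=0.6, w_latency=0.2, w_quality=0.2)
quality_first = score_models(models, w_cost=0.05, w_latency=0.05, w_quality=0.9)
```

With the cost-heavy weights the small model scores highest; with the quality-heavy weights the large model does, which shows how the same function serves different use cases.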
The candidate should describe embedding reference examples for each intent, then at runtime embedding the query and using cosine similarity to find the nearest intent cluster for routing.
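A minimal sketch of this embedding-based lookup, assuming each intent is represented by the centroid of its reference embeddings. The 3-dimensional vectors stand in for real embeddings, and the intent and model names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Toy vectors standing in for embeddings of reference examples per intent.
INTENT_EXAMPLES = {
    "code": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "chitchat": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}
INTENT_TO_MODEL = {"code": "code-model", "chitchat": "cheap-model"}  # hypothetical names
CENTROIDS = {intent: centroid(vs) for intent, vs in INTENT_EXAMPLES.items()}


def route(query_embedding):
    """Return (intent, model) for the intent whose centroid is most similar to the query."""
    intent = max(CENTROIDS, key=lambda i: cosine(query_embedding, CENTROIDS[i]))
    return intent, INTENT_TO_MODEL[intent]
```

In practice a minimum-similarity cutoff with a default route is a sensible addition, so that queries far from every intent are not forced into the nearest one.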
Rule-based routing is transparent, deterministic, and easy to debug but brittle at scale; ML-based routing adapts to patterns but requires training data and is harder to audit. Use rules for safety-critical paths and ML for optimization.
Describe tracking error rates per endpoint, transitioning to 'open' state (skip endpoint) after threshold breaches, and periodically probing with half-open state to detect recovery.
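The closed, open, and half-open states can be sketched as below. This is a simplified illustration: it counts consecutive failures rather than a true error rate, and in the half-open state it lets every request through instead of a single probe. The threshold and recovery window are assumed values.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock          # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_seconds:
            return "half-open"      # recovery window elapsed: allow a probe
        return "open"

    def allow_request(self):
        """The router should skip this endpoint while the circuit is open."""
        return self.state != "open"

    def record_success(self):
        self.failures = 0
        self.opened_at = None       # a successful call closes the circuit

    def record_failure(self):
        if self.state == "half-open":
            self.opened_at = self.clock()  # probe failed: reopen and restart the window
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A router would keep one breaker per endpoint and call `allow_request()` before including that endpoint in the candidate set.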
A capability matrix maps each model to its strengths (code, reasoning, multilingual, vision, context length, safety features); maintain it through regular benchmarking, provider documentation review, and automated quality tests.
Discuss template abstraction layers that translate a canonical prompt format into provider-specific formats (system/user/assistant roles, message arrays, special tokens) before sending.
Expect: per-model latency (p50/p95/p99), cost per query, routing distribution, error rates, fallback invocation rate, quality scores, user satisfaction signals, and SLA compliance.
Describe splitting traffic by user/session hash, running both strategies in parallel, collecting quality and cost metrics per variant, and using statistical significance testing before rolling out the winner.
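Hash-based assignment, the first step in that answer, might look like this. It is a sketch under the assumption that a stable user or session ID is available; the experiment name is salted into the hash so that different experiments get independent splits.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control' for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

Because the assignment depends only on the inputs, a user sees the same routing strategy on every request without any stored state.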
Request-level routing selects one model for the entire request; step-level routing (e.g., in LangGraph) selects the optimal model for each reasoning step, tool call, or sub-task independently.
Discuss modeling average token counts per query, weighted cost by routing distribution across models, implementing per-user or per-tenant budget caps, and using cheaper models as default with escalation logic.
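The blended-cost part of that answer reduces to a weighted sum. The sketch below uses made-up prices and token counts purely for illustration; real figures come from provider price lists and production logs.

```python
from dataclasses import dataclass


@dataclass
class ModelPricing:
    input_per_1k: float   # dollars per 1K input tokens (illustrative)
    output_per_1k: float  # dollars per 1K output tokens (illustrative)


def blended_cost_per_query(share, pricing, avg_input_tokens, avg_output_tokens):
    """Expected cost of one query, weighted by the fraction of traffic each model receives."""
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    return sum(
        fraction * (
            avg_input_tokens / 1000 * pricing[model].input_per_1k
            + avg_output_tokens / 1000 * pricing[model].output_per_1k
        )
        for model, fraction in share.items()
    )


pricing = {
    "cheap": ModelPricing(input_per_1k=0.001, output_per_1k=0.002),
    "premium": ModelPricing(input_per_1k=0.01, output_per_1k=0.03),
}
per_query = blended_cost_per_query(
    {"cheap": 0.8, "premium": 0.2}, pricing, avg_input_tokens=500, avg_output_tokens=200
)
monthly = per_query * 1_000_000  # assumed volume of one million queries per month
```

With these assumed numbers the cheap model costs $0.0009 per query and the premium model $0.011, so an 80/20 split averages about $0.00292 per query, or roughly $2,920 per million queries.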
Advanced
10 questions
Describe using a confidence score or quality classifier on the cheap model's output, defining quality thresholds, and only escalating when the threshold isn't met - citing the FrugalGPT approach.
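A cascade in this spirit can be sketched in a few lines. This illustrates the general escalate-on-low-score pattern and is not the FrugalGPT implementation; the model stubs and the length-based `toy_score` are hypothetical placeholders for real clients and a trained quality classifier.

```python
def cascade(prompt, tiers, score, threshold):
    """Walk tiers from cheapest to most expensive; stop at the first output scoring >= threshold.

    Returns (model_name, output). If no tier passes, the last tier's output is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, call in tiers:
        output = call(prompt)
        if score(prompt, output) >= threshold:
            break
    return name, output


# Hypothetical stand-ins for a cheap and a strong model.
def cheap(prompt):
    return "short answer"


def strong(prompt):
    return "detailed answer with reasoning"


def toy_score(prompt, output):
    """Placeholder quality score in [0, 1]; a real system would use a classifier."""
    return min(len(output) / 20, 1.0)


TIERS = [("cheap", cheap), ("strong", strong)]
lenient = cascade("q", TIERS, toy_score, threshold=0.5)
strict = cascade("q", TIERS, toy_score, threshold=0.8)
```

With the lenient threshold the cheap output is accepted; with the strict one the request escalates, and the cost of the discarded cheap call should be counted in the budget.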
Discuss pre-computing routing decisions where possible, using fast classifiers (small models or heuristics) for routing, caching routing decisions for similar queries, and setting strict timeouts on routing logic.
Describe running golden test sets against production models on a schedule, monitoring quality score distributions over time, alerting on statistical deviations, and automatically deprioritizing models that drift.
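One simple way to alert on statistical deviations is a z-test of the recent mean against a baseline. The sketch below assumes quality scores are roughly comparable across runs and only flags degradation; the threshold of 3 standard errors is an arbitrary illustrative choice.

```python
import statistics


def has_degraded(baseline_scores, recent_scores, z_threshold=3.0):
    """Return True if the recent mean is far below the baseline mean.

    Uses the baseline standard deviation to estimate the standard error of the recent mean.
    """
    mu = statistics.mean(baseline_scores)
    sigma = statistics.stdev(baseline_scores)
    recent_mean = statistics.mean(recent_scores)
    if sigma == 0:
        return recent_mean < mu
    standard_error = sigma / len(recent_scores) ** 0.5
    z = (recent_mean - mu) / standard_error
    return z < -z_threshold
```

A model flagged this way could have its routing weight reduced automatically while someone reviews the golden-set failures.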
Discuss geo-aware routing tables, provider region mapping (Azure EU West, AWS Frankfurt), data classification layers, and policy enforcement gates that block routing to non-compliant endpoints.
Describe decomposing the request into sub-tasks, routing vision tasks to multi-modal models (GPT-4o, Gemini Pro Vision) and text tasks to the best text model, then orchestrating the pipeline with output passing.
Discuss pre-warming strategies, keeping minimum instances alive, using predictive scaling based on traffic patterns, routing latency-sensitive requests to always-on endpoints, and caching model weights.
Describe collecting per-query quality signals, joining them with routing decisions and model metadata, retraining the routing model periodically, and using multi-armed bandit approaches for exploration.
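The exploration side of that loop can be illustrated with an epsilon-greedy bandit, the simplest multi-armed bandit strategy. This is a sketch, assuming each routed query eventually yields a numeric reward such as a quality score; it ignores query context, which a production router would likely need.

```python
import random


class EpsilonGreedyRouter:
    def __init__(self, models, epsilon=0.1, rng=None):
        self.models = list(models)
        self.epsilon = epsilon                  # fraction of traffic used for exploration
        self.rng = rng or random.Random()
        self.counts = {m: 0 for m in self.models}
        self.mean_reward = {m: 0.0 for m in self.models}

    def choose(self):
        """Explore a random model with probability epsilon, otherwise exploit the best so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.mean_reward[m])

    def update(self, model, reward):
        """Fold one observed reward into the running mean for that model."""
        self.counts[model] += 1
        n = self.counts[model]
        self.mean_reward[model] += (reward - self.mean_reward[model]) / n
```

The feedback loop is then `model = router.choose()`, serve the query, score the result, and call `router.update(model, reward)`.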
Discuss evaluating retrieval confidence scores, routing to a smaller model when retrieved context is highly relevant and clear, and escalating to a larger reasoning model when context is ambiguous or conflicting.
Describe tracking per-provider rate limit windows, implementing request queuing and token bucket algorithms, distributing load across multiple API keys or regions, and using rate-limit-aware routing weights.
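The token bucket mentioned in that answer can be sketched as follows. The rate and capacity are assumed to mirror a provider's published limit, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A router could hold one bucket per provider key and treat a `False` result as a signal to queue the request or send it to another provider.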
Describe running a standardized benchmark suite, comparing quality/cost/latency against existing models, starting with shadow mode (run in parallel, don't serve), then canary rollout to a small traffic percentage.
Scenario-Based
10 questions
Expect: immediate fallback to alternative cheap models (Claude Haiku, Gemini Flash), update routing tables, notify stakeholders, benchmark alternatives, negotiate with OpenAI for migration timeline, and build model substitution agility into future architecture.
Discuss building a PII and medical entity classifier as a pre-routing gate, maintaining a compliance-approved model whitelist, logging audit trails, and ensuring the routing decision itself is deterministic and auditable.
Analyze routing distribution changes, check if a quality threshold change is causing more escalations to expensive models, examine token count drift in prompts, review if model pricing changed, and check for regression in the routing classifier.
Advocate for a structured evaluation: run internal benchmarks on your actual use cases (not just public benchmarks), test with production traffic in shadow mode, evaluate hosting costs and operational complexity, then canary rollout with quality monitoring.
Investigate routing logs to see if quality correlates with specific models, add output quality scoring to your monitoring, consider adding model attribution to responses, and tighten quality thresholds for routing decisions.
Calculate total daily budget ($1,000), profile conversation complexity distribution, route simple FAQ-style queries to the cheapest capable model, escalate complex/angry queries to a better model, and implement per-conversation cost tracking with hard caps.
Benchmark existing models on Japanese tasks, identify models with strong multilingual capabilities, consider adding Japanese-optimized models (e.g., Sarashina, Swallow), build a language detection pre-routing step, and maintain per-language model quality matrices.
Implement PII redaction before logging, store prompts in encrypted vaults with access controls, log only routing metadata (model selected, latency, cost, quality score) by default, and ensure compliance with GDPR/CCPA.
Discuss per-step routing within the agent loop, budget limits per agent invocation, latency budgets for multi-turn interactions, and the challenge of routing decisions that depend on intermediate reasoning outputs.
Describe implementing output schema validation, using structured output modes (JSON mode, function calling), normalizing outputs through a post-processing layer, and weighting models that reliably produce parseable output more heavily.
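Schema validation and reliability weighting can be combined in a small helper. The sketch below assumes a hypothetical two-field JSON schema and uses the raw parse success rate as the weight; a real system might use a schema library and smooth the rate over a time window.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"answer": str, "confidence": (int, float)}


def parse_output(raw):
    """Return the parsed dict if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            return None
    return data


class ParseReliability:
    """Track how often each model returns output that passes validation."""

    def __init__(self):
        self.attempts = {}
        self.successes = {}

    def record(self, model, raw):
        parsed = parse_output(raw)
        self.attempts[model] = self.attempts.get(model, 0) + 1
        if parsed is not None:
            self.successes[model] = self.successes.get(model, 0) + 1
        return parsed

    def weight(self, model):
        """Fraction of this model's outputs that parsed; 0.0 if it has never been tried."""
        n = self.attempts.get(model, 0)
        return self.successes.get(model, 0) / n if n else 0.0
```

The `weight` value can then be multiplied into whatever routing score is already in use, so models that frequently break the schema receive less traffic.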
AI Workflow & Tools
10 questions
Describe defining nodes for each model/tool, using a classifier node as the entry point, conditional edges based on query classification, and state management to track routing decisions and outputs.
Discuss provider API key management, model aliasing, fallback chain configuration, retry policies, load balancing strategies, and cost tracking integration.
Describe logging per-query metadata (model selected, latency, cost, input/output tokens, quality scores), creating custom dashboards, tracking routing experiments as W&B runs, and setting up alerting on quality degradation.
Discuss OpenRouter's model routing API, setting cost ceilings per request, specifying fallback model lists, using their latency and pricing metadata for real-time routing decisions, and monitoring via their dashboard.
Describe creating route definitions with reference utterances per intent, encoding them into Pinecone, then at runtime embedding incoming queries and performing nearest-neighbor lookup to select the route (and therefore the model).
Discuss Bedrock's unified API across models (Claude, Llama, Mistral, Titan), IAM-based access control, Bedrock agents for orchestration, CloudWatch monitoring integration, and cross-region inference for availability.
Describe maintaining a golden test set with expected outputs, running it against new model versions via CI/CD, comparing quality metrics (BLEU, exact match, LLM-as-judge scores), and blocking routing updates if quality drops below threshold.
Discuss KServe or vLLM deployments per model, a central routing service as ingress, HPA for auto-scaling per model based on queue depth, resource quotas per model, and health checks for routing decisions.
Describe caching model outputs keyed by normalized prompt hash, implementing semantic caching (embedding-based similarity lookup), setting TTL based on content volatility, and routing cache hits to skip model inference entirely.
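The exact-match part of that caching strategy can be sketched as below; semantic caching would replace the hash lookup with an embedding similarity search. Lowercasing and whitespace collapsing are assumed normalization rules and would be wrong for case-sensitive prompts such as code.

```python
import hashlib
import time


def cache_key(model: str, prompt: str) -> str:
    """Hash the model name plus a normalized prompt (lowercased, whitespace collapsed)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds      # pick shorter TTLs for volatile content
        self.clock = clock
        self._store = {}            # key -> (expires_at, output)

    def get(self, model, prompt):
        """Return the cached output, or None on a miss or an expired entry."""
        entry = self._store.get(cache_key(model, prompt))
        if entry is None:
            return None
        expires_at, output = entry
        if self.clock() >= expires_at:
            return None
        return output

    def put(self, model, prompt, output):
        self._store[cache_key(model, prompt)] = (self.clock() + self.ttl, output)
```

The router checks `get` first and only calls a model on a miss, so a cache hit skips inference entirely.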
Describe sending model outputs to a separate evaluation LLM with a rubric prompt, parsing quality scores, storing them alongside routing metadata, and using aggregated quality scores to update routing weights over time.
Behavioral
5 questions
Look for structured thinking about tradeoffs, stakeholder communication, data-driven decision making, and willingness to define measurable quality thresholds rather than relying on gut feeling.
Expect evidence of calm incident response, root cause analysis, post-mortem documentation, and concrete systemic improvements - not just the immediate fix.
Look for specific sources (arXiv, Twitter/X AI community, HuggingFace, conference talks), practical application of new knowledge, and a bias toward experimentation rather than just reading.
Look for ability to build a data-driven business case, communicate technical tradeoffs to non-technical stakeholders, and find creative compromises when appropriate.
Expect evidence of iterative development, comfort with uncertainty, building modular/pluggable architectures, defining clear interfaces, and using experimentation to reduce ambiguity rather than waiting for perfect requirements.