Skip to main content

Interview Prep

AI Model Routing Engineer Interview Questions

50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.

Beginner: 5Intermediate: 10Advanced: 10Scenario-Based: 10AI Workflow & Tools: 10Behavioral: 5

Beginner

5 questions
What a great answer covers:

A strong answer explains that different models have different strengths, costs, and latencies, so routing selects the best model per request to optimize quality, cost, and speed.

What a great answer covers:

Expect candidates to mention OpenAI, Anthropic, and Google (or Cohere/Meta) and compare aspects like per-token pricing, context window limits, or streaming support.

What a great answer covers:

A fallback chain is a sequence of backup models invoked when the primary model fails, is rate-limited, or exceeds latency thresholds - ensuring high availability.

What a great answer covers:

Cold start refers to the latency spike when a model endpoint spins up from idle; routing logic must account for this by pre-warming endpoints or routing to always-on models.

What a great answer covers:

Token pricing means you pay per input/output token; routing should send short, simple tasks to cheaper models and reserve expensive models for complex tasks where quality justifies cost.

Intermediate

10 questions
What a great answer covers:

A great answer describes a weighted multi-objective function, e.g., Score = w1*(1/normalized_cost) + w2*(1/normalized_latency) + w3*(quality_score), with weights tunable per use case.

What a great answer covers:

The candidate should describe embedding reference examples for each intent, then at runtime embedding the query and using cosine similarity to find the nearest intent cluster for routing.

What a great answer covers:

Rule-based is transparent, deterministic, and easy to debug but brittle at scale; ML-based adapts to patterns but requires training data and is harder to audit - use rules for safety-critical paths, ML for optimization.

What a great answer covers:

Describe tracking error rates per endpoint, transitioning to 'open' state (skip endpoint) after threshold breaches, and periodically probing with half-open state to detect recovery.

What a great answer covers:

A capability matrix maps each model to its strengths (code, reasoning, multilingual, vision, context length, safety features); maintain it through regular benchmarking, provider documentation review, and automated quality tests.

What a great answer covers:

Discuss template abstraction layers that translate a canonical prompt format into provider-specific formats (system/user/assistant roles, message arrays, special tokens) before sending.

What a great answer covers:

Expect: per-model latency (p50/p95/p99), cost per query, routing distribution, error rates, fallback invocation rate, quality scores, user satisfaction signals, and SLA compliance.

What a great answer covers:

Describe splitting traffic by user/session hash, running both strategies in parallel, collecting quality and cost metrics per variant, and using statistical significance testing before rolling out the winner.

What a great answer covers:

Request-level routing selects one model for the entire request; step-level routing (e.g., in LangGraph) selects the optimal model for each reasoning step, tool call, or sub-task independently.

What a great answer covers:

Discuss modeling average token counts per query, weighted cost by routing distribution across models, implementing per-user or per-tenant budget caps, and using cheaper models as default with escalation logic.

Advanced

10 questions
What a great answer covers:

Describe using a confidence score or quality classifier on the cheap model's output, defining quality thresholds, and only escalating when the threshold isn't met - citing the FrugalGPT approach.

What a great answer covers:

Discuss pre-computing routing decisions where possible, using fast classifiers (small models or heuristics) for routing, caching routing decisions for similar queries, and setting strict timeouts on routing logic.

What a great answer covers:

Describe running golden test sets against production models on a schedule, monitoring quality score distributions over time, alerting on statistical deviations, and automatically deprioritizing models that drift.

What a great answer covers:

Discuss geo-aware routing tables, provider region mapping (Azure EU West, AWS Frankfurt), data classification layers, and policy enforcement gates that block routing to non-compliant endpoints.

What a great answer covers:

Describe decomposing the request into sub-tasks, routing vision tasks to multi-modal models (GPT-4o, Gemini Pro Vision) and text tasks to the best text model, then orchestrating the pipeline with output passing.

What a great answer covers:

Discuss pre-warming strategies, keeping minimum instances alive, using predictive scaling based on traffic patterns, routing latency-sensitive requests to always-on endpoints, and caching model weights.

What a great answer covers:

Describe collecting per-query quality signals, joining them with routing decisions and model metadata, retraining the routing model periodically, and using multi-armed bandit approaches for exploration.

What a great answer covers:

Discuss evaluating retrieval confidence scores, routing to a smaller model when retrieved context is highly relevant and clear, and escalating to a larger reasoning model when context is ambiguous or conflicting.

What a great answer covers:

Describe tracking per-provider rate limit windows, implementing request queuing and token bucket algorithms, distributing load across multiple API keys or regions, and using rate-limit-aware routing weights.

What a great answer covers:

Describe running a standardized benchmark suite, comparing quality/cost/latency against existing models, starting with shadow mode (run in parallel, don't serve), then canary rollout to a small traffic percentage.

Scenario-Based

10 questions
What a great answer covers:

Expect: immediate fallback to alternative cheap models (Claude Haiku, Gemini Flash), update routing tables, notify stakeholders, benchmark alternatives, negotiate with OpenAI for migration timeline, and build model substitution agility into future architecture.

What a great answer covers:

Discuss building a PII and medical entity classifier as a pre-routing gate, maintaining a compliance-approved model whitelist, logging audit trails, and ensuring the routing decision itself is deterministic and auditable.

What a great answer covers:

Analyze routing distribution changes, check if a quality threshold change is causing more escalations to expensive models, examine token count drift in prompts, review if model pricing changed, and check for regression in the routing classifier.

What a great answer covers:

Advocate for a structured evaluation: run internal benchmarks on your actual use cases (not just public benchmarks), test with production traffic in shadow mode, evaluate hosting costs and operational complexity, then canary rollout with quality monitoring.

What a great answer covers:

Investigate routing logs to see if quality correlates with specific models, add output quality scoring to your monitoring, consider adding model attribution to responses, and tighten quality thresholds for routing decisions.

What a great answer covers:

Calculate total daily budget ($1,000), profile conversation complexity distribution, route simple FAQ-style queries to the cheapest capable model, escalate complex/angry queries to a better model, and implement per-conversation cost tracking with hard caps.

What a great answer covers:

Benchmark existing models on Japanese tasks, identify models with strong multilingual capabilities, consider adding Japanese-optimized models (e.g., Sarashina, Swallow), build a language detection pre-routing step, and maintain per-language model quality matrices.

What a great answer covers:

Implement PII redaction before logging, store prompts in encrypted vaults with access controls, log only routing metadata (model selected, latency, cost, quality score) by default, and ensure compliance with GDPR/CCPA.

What a great answer covers:

Discuss per-step routing within the agent loop, budget limits per agent invocation, latency budgets for multi-turn interactions, and the challenge of routing decisions that depend on intermediate reasoning outputs.

What a great answer covers:

Describe implementing output schema validation, using structured output modes (JSON mode, function calling), normalizing outputs through a post-processing layer, and weighting models that reliably produce parseable output more heavily.

AI Workflow & Tools

10 questions
What a great answer covers:

Describe defining nodes for each model/tool, using a classifier node as the entry point, conditional edges based on query classification, and state management to track routing decisions and outputs.

What a great answer covers:

Discuss provider API key management, model aliasing, fallback chain configuration, retry policies, load balancing strategies, and cost tracking integration.

What a great answer covers:

Describe logging per-query metadata (model selected, latency, cost, input/output tokens, quality scores), creating custom dashboards, tracking routing experiments as W&B runs, and setting up alerting on quality degradation.

What a great answer covers:

Discuss OpenRouter's model routing API, setting cost ceilings per request, specifying fallback model lists, using their latency and pricing metadata for real-time routing decisions, and monitoring via their dashboard.

What a great answer covers:

Describe creating route definitions with reference utterances per intent, encoding them into Pinecone, then at runtime embedding incoming queries and performing nearest-neighbor lookup to select the route (and therefore the model).

What a great answer covers:

Discuss Bedrock's unified API across models (Claude, Llama, Mistral, Titan), IAM-based access control, Bedrock agents for orchestration, CloudWatch monitoring integration, and cross-region inference for availability.

What a great answer covers:

Describe maintaining a golden test set with expected outputs, running it against new model versions via CI/CD, comparing quality metrics (BLEU, exact match, LLM-as-judge scores), and blocking routing updates if quality drops below threshold.

What a great answer covers:

Discuss KServe or vLLM deployments per model, a central routing service as ingress, HPA for auto-scaling per model based on queue depth, resource quotas per model, and health checks for routing decisions.

What a great answer covers:

Describe caching model outputs keyed by normalized prompt hash, implementing semantic caching (embedding-based similarity lookup), setting TTL based on content volatility, and routing cache hits to skip model inference entirely.

What a great answer covers:

Describe sending model outputs to a separate evaluation LLM with a rubric prompt, parsing quality scores, storing them alongside routing metadata, and using aggregated quality scores to update routing weights over time.

Behavioral

5 questions
What a great answer covers:

Look for structured thinking about tradeoffs, stakeholder communication, data-driven decision making, and willingness to define measurable quality thresholds rather than relying on gut feeling.

What a great answer covers:

Expect evidence of calm incident response, root cause analysis, post-mortem documentation, and concrete systemic improvements - not just the immediate fix.

What a great answer covers:

Look for specific sources (arXiv, Twitter/X AI community, HuggingFace, conference talks), practical application of new knowledge, and a bias toward experimentation rather than just reading.

What a great answer covers:

Look for ability to build a data-driven business case, communicate technical tradeoffs to non-technical stakeholders, and find creative compromises when appropriate.

What a great answer covers:

Expect evidence of iterative development, comfortable with uncertainty, building modular/pluggable architectures, defining clear interfaces, and using experimentation to reduce ambiguity rather than waiting for perfect requirements.