Interview Prep
AI Full Stack AI Developer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questionsA strong answer distinguishes traditional CRUD/database-backed apps from AI-native apps that orchestrate LLM calls, manage prompts, handle streaming responses, and deal with non-deterministic outputs.
The answer should cover secret management, environment variables, rate limiting, billing exposure, and the risk of key leakage in client-side code or version control.
A good response covers structured instructions, system/user/assistant roles, few-shot examples, and how prompt quality directly affects output reliability and user experience.
The answer should trace: frontend sends request → backend validates/authenticates → constructs prompt with context → calls LLM API → streams or returns response → frontend renders.
A good answer explains tokenization, context window limits, cost per token pricing, latency implications, and how to design UIs that communicate token usage to users.
Intermediate
10 questionsThe answer should cover document ingestion, chunking strategies, embedding generation, vector storage, retrieval with similarity search, context injection into prompts, and response generation with citations.
A strong response compares managed vs. self-hosted, scale and latency characteristics, operational complexity, cost models, metadata filtering capabilities, and team familiarity.
The answer should cover Server-Sent Events or ReadableStream, the OpenAI streaming SDK, React Server Components or client-side consumption, handling partial JSON, and UX considerations like cursor animation.
A good answer discusses JSON mode, function calling / structured outputs, Pydantic validation with retry logic, output parsing, and fallback handling for malformed responses.
The answer should cover tool/function definitions in the API, the model deciding when to call functions, executing the function server-side, returning results, and handling multi-turn tool use loops.
A strong response covers embedding user queries, comparing similarity against cached responses, setting similarity thresholds, cache invalidation strategies, and tools like GPT Cache or Redis with vector search.
The answer should discuss JWT/session auth, middleware-based rate limiting, per-user token budgets, usage tracking in a database, and returning appropriate 429 responses with usage information.
A good answer covers fixed-size chunking, recursive character splitting, semantic chunking, document-structure-aware splitting, overlap windows, and how chunk size affects retrieval quality and context utilization.
The answer should include faithfulness, answer relevancy, context precision, context recall, hallucination rate, and tools like RAGAS, DeepEval, or custom LLM-as-judge evaluation pipelines.
A strong response contrasts sequential LCEL chains with LangGraph's stateful, cyclic graph execution, explaining that LangGraph is better for complex agents with branching logic, loops, and human-in-the-loop patterns.
Advanced
10 questionsThe answer should cover agent state definitions, graph nodes for each agent role, conditional edges for routing, shared memory or scratchpad, error handling, and how to prevent infinite loops or deadlocks.
A comprehensive answer covers input sanitization, prompt delimiters and instruction hierarchy, output filtering, canary tokens, separate trust boundaries between user input and system instructions, and using moderation APIs.
The answer should discuss complexity classification (rule-based or smaller model classifier), confidence thresholds, fallback escalation, A/B testing routing logic, and tracking quality vs. cost tradeoffs.
A strong answer covers multi-modal model selection, frame sampling strategies, WebSocket streaming, latency budgets, combining vision and text context, graceful degradation under load, and cost management for vision tokens.
The answer should cover storing prompts as version-controlled code or in a prompt registry, automated evaluation against golden datasets, CI/CD integration with prompt diff reviews, canary rollouts, and rollback mechanisms.
A good response contrasts fine-tuning (for style, tone, domain patterns) with RAG (for factual, up-to-date knowledge), covers dataset preparation, training configuration, evaluation, and deployment of custom models.
The answer should cover structured logging of prompts/responses/metadata, token counting and cost attribution, latency percentile tracking, automated quality scoring (LLM-as-judge), user feedback collection, and dashboards with alerting.
A strong answer covers document chunking and indexing, on-demand retrieval based on user queries, map-reduce summarization patterns, sliding window approaches, and communicating limitations transparently to users.
The answer should discuss tenant-scoped vector namespaces or collections, filtered retrieval with tenant metadata, separate prompt templates, data residency requirements, and end-to-end encryption considerations.
A good response covers model selection (smaller/faster models), semantic caching, parallel tool execution, pre-computed embeddings, streaming for perceived latency, edge inference, and setting realistic user expectations.
Scenario-Based
10 questionsThe answer should cover immediate response (add disclaimers, flag for human review), short-term fixes (constrain outputs with system prompts, add retrieval guardrails), and long-term solutions (domain-specific evaluation, medical knowledge base, regulatory compliance review).
A strong answer covers checking retrieval quality (are the right chunks being retrieved?), examining embedding model relevance, reviewing chunking strategy, testing similarity thresholds, checking for stale data, and evaluating the prompt's use of retrieved context.
The answer should cover analyzing usage patterns, implementing semantic caching, routing simple queries to cheaper models, optimizing prompt lengths, batching non-real-time requests, setting per-user budgets, and exploring self-hosted open-source models for suitable workloads.
A good response covers on-premises or VPC-deployed models, data encryption at rest and in transit, tenant-isolated vector stores, audit logging, access controls, PII redaction, and compliance with legal industry data handling standards.
The answer should cover logging and tracing tool call decisions (LangSmith), improving tool descriptions and parameter schemas, adding few-shot examples of correct tool usage, implementing validation layers, and considering a routing classifier before the agent.
A strong answer covers short-term (conversation buffer), long-term (vector-stored user facts with retrieval), structured memory (user preference database), memory extraction prompts, privacy controls, and memory summarization to manage context window limits.
The answer should compare quality benchmarks, cost per token at scale, latency, hosting requirements, customization needs (fine-tuning), data privacy constraints, team operational capacity, and a hybrid strategy where each model handles different query types.
A good response covers immediate mitigation (enabling output moderation, tightening system prompts), investigation (analyzing attack patterns), prevention (input/output guardrails, canary detection, content classifiers), and long-term hardening (red-teaming, continuous adversarial testing).
The answer should cover evaluating multilingual model capabilities, adding language detection to route to appropriate models, testing prompt templates in target languages, adjusting retrieval for multilingual embeddings, and building language-specific evaluation datasets.
A strong answer covers understanding MCP specification, building an MCP server that exposes ERP tools and resources, configuring MCP clients in your agent framework, handling authentication and permissions, and testing end-to-end tool invocations with proper error handling.
AI Workflow & Tools
10 questionsThe answer should cover tracing the full execution tree in LangSmith, examining each LLM call's input/output, identifying where reasoning diverges, inspecting tool call decisions, comparing against successful runs, and using evaluation datasets for systematic testing.
A good answer covers storing prompts in version control, building evaluation datasets, running automated LLM-as-judge or deterministic tests in CI, generating quality score reports as PR checks, and gating deployment on passing thresholds.
The answer should cover browsing the MTEB leaderboard, selecting models based on dimension/latency/quality tradeoffs, evaluating on domain-specific retrieval benchmarks, fine-tuning if needed, and deploying via HuggingFace Inference Endpoints or self-hosted with TEI.
A strong response covers logging retrieval metrics, prompt versions, chunking parameters, embedding models, and end-to-end quality scores as W&B experiments, using sweeps for hyperparameter optimization, and comparing runs visually.
The answer should cover multi-stage Docker builds, GPU node pools with NVIDIA runtime, resource requests and limits, horizontal pod autoscaling based on queue depth, health checks for model warm-up, and cost management for GPU instances.
A good answer covers Bedrock's model marketplace (Claude, Llama, Titan), IAM-based access control, VPC integration for data privacy, Knowledge Bases for RAG, Guardrails for content filtering, and trade-offs around latency, flexibility, and vendor lock-in.
The answer should cover defining graph state with an approval flag, using interrupt_before or interrupt_after on the output node, implementing a REST endpoint to resume execution after human review, and persisting state in a checkpointer.
A strong response covers creating an embeddings table with a vector column, indexing with IVFFlat or HNSW, storing metadata for filtering, performing similarity search with cosine or inner product distance, and combining vector search with traditional SQL predicates.
The answer should cover using Streamlit for quick iteration and stakeholder demos, extracting the core logic into a backend API, then rebuilding the frontend in Next.js with proper auth, state management, error handling, and production UX patterns.
A good answer covers defining test cases with expected outputs or quality criteria, configuring LLM-as-judge evaluators, running evaluations in CI pipelines, tracking metrics over time, and using assertions for deterministic checks alongside semantic scoring.
Behavioral
5 questionsA strong answer demonstrates calm under pressure, systematic debugging, stakeholder communication, a practical interim fix, and a thoughtful post-mortem that led to lasting improvements.
The answer should cover specific information sources (research papers, Twitter/X, newsletters, Discord communities), a personal evaluation framework (maturity, team fit, maintenance burden), and examples of successfully introducing and deprecating tools.
A good response shows technical advocacy grounded in data or user research, empathy for the other perspective, constructive compromise, and an outcome-oriented approach rather than positional arguing.
The answer should cover setting expectations with concrete examples of failures, using demos to show both capabilities and edge cases, proposing phased rollouts with success metrics, and framing limitations as engineering challenges rather than dead ends.
A strong answer demonstrates structured learning (official docs first, then tutorials, then experimentation), knowing when to ask for help, building a minimal viable implementation before optimizing, and documenting learnings for the team.