RAG Engineer Interview Questions

50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.

Beginner: 5, Intermediate: 10, Advanced: 10, Scenario-Based: 10, AI Workflow & Tools: 10, Behavioral: 5

Beginner

5 questions
What a great answer covers:

A strong answer explains that RAG combines external knowledge retrieval with LLM generation to reduce hallucination, keep outputs current, and ground responses in verifiable sources.

What a great answer covers:

Should describe how text is mapped to a dense numerical vector via an embedding model, and how cosine similarity or dot product enables meaning-based (not keyword-based) retrieval.
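
As a minimal sketch of the similarity step, assuming toy three-dimensional vectors in place of real embedding-model output (the vectors and document names below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real system would get these from an embedding model.
query = [0.9, 0.1, 0.0]
docs = {
    "refund policy": [0.8, 0.2, 0.1],
    "office hours": [0.0, 0.1, 0.9],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
```

Ranking depends only on vector direction, which is why the match is meaning-based rather than tied to shared keywords.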

What a great answer covers:

Sparse is keyword-based and excels at exact matches; dense captures semantic meaning. Hybrid approaches often combine both for best results.

What a great answer covers:

Should explain that documents are split into smaller segments for embedding and retrieval, and that chunk size affects retrieval precision, context relevance, and LLM token usage.
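
One simple way to illustrate the chunk-size trade-off is fixed-size chunking with overlap. This is a sketch only: it counts whitespace-separated words rather than model tokens, and the function name and default sizes are assumptions, not a recommendation.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
small_chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

Smaller chunks give more precise matches but carry less surrounding context, while larger chunks do the opposite and use more of the LLM's token budget per retrieved result.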

What a great answer covers:

Examples include Pinecone (fully managed, serverless), Weaviate (built-in hybrid search with BM25), and Chroma (lightweight, open-source, developer-friendly). A good answer shows awareness of the ecosystem.

Intermediate

10 questions
What a great answer covers:

Should cover document parsing (PDF tables, footnotes), fine-grained chunking by legal sections, metadata filtering by jurisdiction/date, citation accuracy, and hallucination mitigation for compliance-critical output.

What a great answer covers:

Should mention recall@k, MRR, precision@k for retrieval; faithfulness, answer relevance, hallucination rate for generation; and describe building a ground-truth QA dataset from domain experts or synthetic generation.
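
The retrieval metrics named above can be sketched in a few lines. The function names and the shape of the evaluation data (lists of retrieved IDs, sets of relevant IDs) are assumptions chosen for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(runs)
```

A ground-truth dataset then reduces to a list of queries, each paired with the IDs of the chunks a domain expert judged relevant.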

What a great answer covers:

Should describe combining BM25 sparse scoring with dense vector similarity, using alpha/beta weights or RRF (Reciprocal Rank Fusion), and tuning based on benchmark results.
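
A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document IDs. The constant k=60 is a commonly used default, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores the sum of 1 / (k + rank) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["d1", "d2", "d3"]    # hypothetical sparse results
dense_ranking = ["d3", "d1", "d4"]   # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses ranks rather than raw scores, it avoids having to normalize BM25 scores against vector similarities, which is the main difficulty with a weighted alpha/beta combination.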

What a great answer covers:

Should explain that reranking applies a more computationally expensive model (cross-encoder) to the top-k retrieved chunks to reorder them by relevance, improving precision at the cost of latency.

What a great answer covers:

Should cover query rewriting (using an LLM to make standalone queries), conversation memory management, and context window budgeting between chat history and retrieved documents.

What a great answer covers:

Should discuss structure-aware parsing (tables parsed with Camelot/Tabula, code with AST-based splitting), semantic chunking for narrative, and metadata tagging per chunk type.

What a great answer covers:

Embedding models encode queries and documents independently (bi-encoder) for fast retrieval; reranking models (cross-encoder) process query-document pairs jointly for higher accuracy but slower speed. They serve different pipeline stages.

What a great answer covers:

Should describe metadata filtering at query time (e.g., tenant_id, permission_level fields), vector database pre-filtering capabilities, and ensuring embeddings never leak cross-tenant information.
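
The pre-filtering idea can be sketched with an in-memory list standing in for a vector database. The record layout, `tenant_search`, and the tenant names are hypothetical; a real vector store would apply the filter inside its own query API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tenant_search(index, query_vec, tenant_id, top_k=3):
    """Return the IDs of the top_k most similar records belonging to tenant_id only."""
    # Pre-filter: drop other tenants' records before any similarity scoring.
    allowed = [rec for rec in index if rec["tenant_id"] == tenant_id]
    allowed.sort(key=lambda rec: dot(query_vec, rec["vector"]), reverse=True)
    return [rec["id"] for rec in allowed[:top_k]]

index = [
    {"id": "a1", "tenant_id": "acme", "vector": [1.0, 0.0]},
    {"id": "a2", "tenant_id": "acme", "vector": [0.6, 0.8]},
    {"id": "g1", "tenant_id": "globex", "vector": [1.0, 0.0]},
]
results = tenant_search(index, [1.0, 0.0], tenant_id="acme")
```

Filtering before scoring, rather than after, means another tenant's record can never occupy a top-k slot or reach the prompt, even when it is the closest match.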

What a great answer covers:

Embedding drift occurs when the distribution of incoming queries or documents changes relative to the embedding model or indexed data. Monitor retrieval quality metrics over time, use drift detection on embedding distributions, and re-index periodically.

What a great answer covers:

Should discuss token counting (tiktoken), prioritization of retrieved chunks by relevance score, truncation strategies, and potentially using prompt compression techniques like LLMLingua.
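
A sketch of relevance-first packing under a token budget. The default counter below splits on whitespace purely so the example runs without dependencies; in practice a tokenizer such as tiktoken would be passed in as `count_tokens`. The chunk dictionary shape is an assumption.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-scoring chunks that still fit in max_tokens."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # skip chunks that do not fit; a smaller one may still fit later
        selected.append(chunk)
        used += cost
    return selected, used

chunks = [
    {"text": "a b c", "score": 0.9},
    {"text": "d e f g", "score": 0.8},
    {"text": "h i", "score": 0.5},
]
selected, used = pack_context(chunks, max_tokens=5)
```

Skipping an oversized chunk instead of stopping at it is one design choice; truncating that chunk to the remaining budget is the alternative the hint calls a truncation strategy.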

Advanced

10 questions
What a great answer covers:

HyDE generates a hypothetical answer using the LLM, embeds that answer, and uses it as the query vector. It can improve recall when query and document language differ significantly, but adds latency and can fail if the hypothetical answer is misleading.

What a great answer covers:

Should describe retrieval → relevance grading (LLM or classifier scores retrieved chunks) → if confidence is low, trigger web search or expanded retrieval → generate with reflection tokens or a critique step → refine or regenerate if faithfulness is low.

What a great answer covers:

Should cover vector database sharding and replication, approximate nearest neighbor index tuning (HNSW ef, IVF nprobe), tiered storage (hot/warm/cold), query result caching with semantic cache keys, and edge deployment considerations.

What a great answer covers:

RAG excels when knowledge changes frequently and traceability is needed; fine-tuning excels for style/format adaptation and reduced inference cost. RAG + fine-tuning combines both: fine-tune the model for better instruction following while keeping knowledge externalized.

What a great answer covers:

Should discuss source attribution and provenance tracking, presenting conflicting viewpoints with citations, confidence scoring per source, recency-based tiebreaking, and potentially a debate/verification step using multiple LLM calls.

What a great answer covers:

LLMs attend more strongly to the beginning and end of the context, losing information in the middle. Mitigate by placing the most relevant chunks at the start and end, reducing total chunks, using reranking, or applying map-reduce strategies.
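
Placing the most relevant chunks at the start and end can be sketched as an interleaving step applied after ranking. The function name is hypothetical, and the input is assumed to be sorted from most to least relevant.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Alternate chunks between the front and the back so the weakest ones land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]

# 1 is the most relevant chunk, 5 the least.
ordered = reorder_for_long_context([1, 2, 3, 4, 5])
```

With five chunks, the best one opens the context, the second best closes it, and the least relevant sits in the middle where attention is weakest.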

What a great answer covers:

Should describe an index-per-corpus architecture with a query router (LLM-based or classifier-based) that determines which retriever(s) to invoke, followed by result merging, deduplication, and unified reranking.

What a great answer covers:

Should cover retrieval metrics (recall@k, nDCG), generation metrics (faithfulness, relevance via RAGAS), operational metrics (p50/p95 latency, cost per query), and user-centric metrics (thumbs up/down, session-level task completion).

What a great answer covers:

Should discuss input sanitization of retrieved content, delimiters and structural separation between instructions and context, using the system prompt to instruct the LLM to ignore embedded commands, and content scanning before indexing.

What a great answer covers:

Should cover change data capture (CDC) from source systems, document fingerprinting for deduplication, vector namespace or metadata-based versioning, tombstone records for deletions, and re-embedding triggers when the embedding model is upgraded.
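
Document fingerprinting for incremental sync can be sketched with a content hash. The function names and the dictionary-based stores are assumptions; the normalization step (collapsing whitespace, lowercasing) is one possible choice, not a standard.

```python
import hashlib

def fingerprint(text):
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def plan_sync(source_docs, indexed_fingerprints):
    """Compare source documents with what is indexed.

    source_docs: {doc_id: text}; indexed_fingerprints: {doc_id: fingerprint}.
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [
        doc_id for doc_id, text in source_docs.items()
        if indexed_fingerprints.get(doc_id) != fingerprint(text)
    ]
    to_delete = [doc_id for doc_id in indexed_fingerprints if doc_id not in source_docs]
    return sorted(to_embed), sorted(to_delete)

indexed = {"policy": fingerprint("Refunds within 30 days."), "old": fingerprint("Retired page.")}
source = {"policy": "Refunds within 30 days.", "faq": "New FAQ entry."}
to_embed, to_delete = plan_sync(source, indexed)
```

Only new or changed documents are sent for embedding, and documents missing from the source are flagged for deletion, which is the role tombstone records play in a CDC pipeline.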

Scenario-Based

10 questions
What a great answer covers:

Should describe: checking retrieved chunks for the relevant query (did the right policy document get retrieved?), evaluating retrieval recall, examining the prompt template for context bleeding, checking if the policy document is stale, and reviewing the generation for hallucination vs. faithful-but-wrong retrieval.

What a great answer covers:

Should discuss multilingual embedding models (e.g., multilingual-e5-large, BGE-M3), cross-lingual retrieval, whether to translate queries or use multilingual embeddings directly, separate indices per language vs. shared multilingual index, and evaluation in each language.

What a great answer covers:

Should describe tracking which chunks were retrieved and used, passing source metadata (title, URL, page number) into the prompt, instructing the LLM to cite sources inline, and potentially validating that cited sources actually support the generated claims.

What a great answer covers:

Should cover ANN index parameter tuning (HNSW ef_construction, M), sharding across nodes, implementing a semantic cache for frequent queries, pre-filtering to reduce search space, quantization (PQ, SQ) to reduce memory, and potentially tiered retrieval (fast coarse retrieval + fine reranking).

What a great answer covers:

Should articulate: standalone LLMs have knowledge cutoffs, hallucinate on proprietary data, cannot reliably cite sources, cannot be updated without retraining, and may violate data residency requirements. RAG provides freshness, traceability, domain grounding, and cost efficiency for domain-specific applications.

What a great answer covers:

Should discuss code-aware parsing (AST-based chunking, function/class-level splitting), code-specific embedding models (CodeBERT, StarCoder embeddings), combining code with natural language comments/docstrings, and retrieval strategies that respect code structure (e.g., including imports and dependencies).

What a great answer covers:

Should cover prompt engineering (explicit instructions to use only the provided context), placing context before the question in the prompt, using stronger grounding instructions, setting temperature to 0, and potentially fine-tuning the LLM to better follow RAG-style instructions.

What a great answer covers:

Should describe confidence scoring on retrieval quality, a classification step that determines if retrieved context is sufficient to answer, threshold-based routing to a refusal message, and logging unanswered queries for knowledge base gap analysis.
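
Threshold-based routing to a refusal can be sketched as below. The threshold value, the refusal text, and the `route` function are illustrative assumptions; a real cutoff would be tuned on an evaluation set.

```python
REFUSAL = "I could not find enough information to answer that."

def route(query, retrieved, min_score=0.75, min_chunks=1, gap_log=None):
    """Decide whether to generate from retrieved chunks or refuse."""
    strong = [chunk for chunk in retrieved if chunk["score"] >= min_score]
    if len(strong) < min_chunks:
        if gap_log is not None:
            gap_log.append(query)  # record the query for knowledge base gap analysis
        return {"action": "refuse", "message": REFUSAL, "context": []}
    return {"action": "generate", "message": None, "context": strong}

gaps = []
weak = route("What is the pet policy?", [{"text": "parking rules", "score": 0.41}], gap_log=gaps)
good = route("What is the refund window?", [{"text": "30 days", "score": 0.88}], gap_log=gaps)
```

A raw similarity score is only a rough proxy for sufficiency, which is why the hint also mentions a separate classification step on the retrieved context.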

What a great answer covers:

Should cover running systems in parallel, indexing the same corpus into a vector store, building a hybrid retrieval layer that queries both, A/B testing retrieval quality, gradual traffic shifting, and maintaining the keyword system as a fallback during transition.

What a great answer covers:

Should discuss using cheaper/smaller embedding models, deduplication to avoid re-embedding similar content, incremental indexing (only embed new/changed docs), batching API calls, caching embeddings for repeated content, and evaluating if local embedding models (e.g., BGE-small) can replace API-based ones.

AI Workflow & Tools

10 questions
What a great answer covers:

Should describe using RunnablePassthrough for query, a retriever Runnable, a reranker Runnable, a prompt template Runnable, an LLM Runnable, and an output parser, chained with the | operator. Should show understanding of LCEL's composability, streaming, and fallback support.

What a great answer covers:

Should cover enabling tracing on the LangChain chain, capturing per-step latency, retrieved chunk IDs and scores, prompt templates and completions, token usage per step, and using LangSmith's evaluation UI to compare runs, identify bottlenecks, and track regression over time.

What a great answer covers:

Should describe using the KG index for relational queries (e.g., 'what are the dependencies of module X?') and the vector index for semantic similarity queries, combining results with a router query engine, and using LlamaIndex's SubQuestionQueryEngine for decomposing complex queries.

What a great answer covers:

Should describe a LangGraph StateGraph with nodes: retrieve → grade_documents → (if relevant: generate; if not relevant: web_search → generate) → grade_generation. Edges represent conditional routing based on document relevance and generation faithfulness. State holds query, documents, grades, and generation.

What a great answer covers:

Should describe generating a test dataset (question, ground_truth_answer, contexts), running the RAG pipeline on test questions, computing RAGAS metrics (faithfulness, answer_relevancy, context_precision, context_recall), integrating into CI/CD with pass/fail thresholds, and tracking metric trends over time.

What a great answer covers:

Should describe creating a collection with a vectorizer (e.g., text2vec-openai), querying with both nearText (semantic) and BM25, specifying the hybrid alpha parameter, and applying a reranking module (e.g., reranker-cohere) on results with the query for final reordering.

What a great answer covers:

Should describe uploading files to an Assistant, enabling file_search which auto-chunks and embeds, using threads for conversation state, and noting limitations: less control over chunking strategy, no hybrid search, limited retrieval customization, no access to raw embeddings, and vendor lock-in.

What a great answer covers:

Should describe deploying a model (e.g., Llama 3, Mistral) via Ollama/vLLM, integrating with LangChain's ChatOllama or vLLM wrapper, and discussing trade-offs: lower cost, data privacy, no internet dependency vs. lower quality on complex tasks, GPU infrastructure management, and slower iteration.

What a great answer covers:

Should describe embedding incoming queries, searching for near-duplicate queries in a cache index with a similarity threshold (e.g., cosine > 0.97), returning cached answers for hits, and invalidating cache entries when source documents are updated (event-based or TTL-based).
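
A minimal in-memory sketch of such a semantic cache, assuming query vectors are already computed elsewhere. The `SemanticCache` class and its linear scan are illustrative; a production cache would use a vector index, and TTL-based expiry is not shown.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class SemanticCache:
    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, answer, source_doc_ids)

    def put(self, query_vec, answer, source_doc_ids):
        self.entries.append((query_vec, answer, set(source_doc_ids)))

    def get(self, query_vec):
        """Return the cached answer of the most similar query above the threshold, else None."""
        best_answer, best_sim = None, self.threshold
        for vec, answer, _ in self.entries:
            sim = _cosine(query_vec, vec)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer

    def invalidate(self, doc_id):
        """Event-based invalidation: drop every entry that was built from doc_id."""
        self.entries = [entry for entry in self.entries if doc_id not in entry[2]]

cache = SemanticCache()
cache.put([1.0, 0.0], "Refunds are accepted within 30 days.", ["policy-doc"])
hit = cache.get([0.999, 0.01])   # near-duplicate query vector
miss = cache.get([0.0, 1.0])     # unrelated query vector
```

Storing the source document IDs with each entry is what makes event-based invalidation possible when a document is updated.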

What a great answer covers:

Should describe building a Haystack Pipeline with a Router component (LLM-based or classifier) that directs queries to different branches: document_store retriever for knowledge questions, SQL tool for data queries, and web search for out-of-scope queries, with result merging and response generation.

Behavioral

5 questions
What a great answer covers:

Look for structured storytelling: context (what was the system and constraint), conflict (quality vs. speed tension), action (specific trade-offs made, e.g., switched from cross-encoder reranking to top-k filtering), and result (measured impact on user experience and business metrics).

What a great answer covers:

Should demonstrate empathy, clear communication of technical constraints, setting expectations with concrete examples or demos, proposing incremental delivery milestones, and turning the conversation toward measurable outcomes rather than capabilities.

What a great answer covers:

Should describe specific habits (reading arXiv papers, following specific researchers, attending AI Engineer meetups, hands-on experimentation) and a concrete example showing the learn → apply → iterate cycle.

What a great answer covers:

Look for accountability, systematic root-cause analysis (not just 'the LLM hallucinated'), specific changes implemented (better evaluation, guardrails, retrieval improvements), and a process improvement to prevent recurrence.

What a great answer covers:

Should demonstrate data-informed but human-centered approach: showing the expert the actual retrieved chunks, understanding their definition of 'right,' discovering potential metric-evaluation misalignment, co-designing better test cases, and iterating together rather than defending metrics.