AI Context Engineering Specialist
An AI Context Engineering Specialist designs, orchestrates, and optimizes the information architecture that feeds large language m…
Skill Guide
The systematic application of frameworks like RAGAS and DeepEval to quantitatively assess the faithfulness, relevance, and accuracy of Large Language Model outputs and the retrieved context in Retrieval-Augmented Generation pipelines.
Scenario
You have a basic RAG chatbot that answers questions from a single PDF document about company HR policies.
Scenario
The product team wants to switch from a basic cosine-similarity vector retriever to a more advanced hybrid (vector + BM25) retriever for a customer support knowledge base.
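One way to ground this decision is to score both retrievers on the same labelled query set before switching. The sketch below is a minimal illustration, not a RAGAS or DeepEval API: `hit_rate_at_k`, the `Retriever` type alias, and the one-relevant-document-per-query assumption are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, Sequence

# Assumption: a retriever is any callable mapping a query to ranked document IDs.
Retriever = Callable[[str], Sequence[str]]


def hit_rate_at_k(retriever: Retriever, labelled: Dict[str, str], k: int = 3) -> float:
    """Fraction of queries whose known-relevant document appears in the top k results.

    `labelled` maps each query to the ID of its single relevant document
    (a simplifying assumption; real test sets may list several per query).
    """
    if not labelled:
        return 0.0
    hits = sum(
        1
        for query, relevant_id in labelled.items()
        if relevant_id in list(retriever(query))[:k]
    )
    return hits / len(labelled)
```

Running the same function over the vector retriever and the hybrid retriever gives two comparable numbers on identical queries, which is usually a more defensible basis for the switch than anecdotal examples.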
Scenario
User satisfaction scores for your production RAG system have dropped by 15% over the last sprint, with complaints about irrelevant answers.
Core tools for automated evaluation. RAGAS and DeepEval provide the metric calculations. LangSmith and Phoenix are observability platforms that integrate evaluation into logging, tracing, and monitoring workflows.
Frameworks used to build the RAG pipelines that you will evaluate. Proficiency in one is a prerequisite, as you need to instrument its components for evaluation.
Sheets is used for manually creating and versioning ground-truth datasets. MLflow and W&B log evaluation runs, parameters, and metrics, enabling collaboration and historical comparison.
Answer Strategy
Structure your answer around: 1) Dataset creation (ground-truth examples and production samples), 2) Metric selection (prioritize Faithfulness and Context Relevancy for the initial launch to catch hallucinations and poor retrieval), 3) Integration into CI/CD (e.g., a GitHub Actions job that runs the evaluation tests), 4) Alerting thresholds.
Sample Answer: 'I'd start by curating a test set from the product's source documents and likely user queries. For initial validation, I'd prioritize Faithfulness and Context Relevancy using RAGAS to ensure the system isn't hallucinating and is retrieving useful information. I'd integrate this as a gating step in the CI/CD pipeline using a script that fails the build if scores drop below a defined baseline, and set up monitoring in LangSmith for trend analysis.'
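The gating step in the sample answer can be sketched as a small script. This is a minimal illustration that assumes the evaluation scores have already been computed (for example by RAGAS) and collected into a plain dictionary. The names `BASELINES`, `find_regressions`, and `gate`, and the threshold values, are hypothetical.

```python
from typing import Dict, List

# Hypothetical baselines; in practice these would come from a previously accepted run.
BASELINES: Dict[str, float] = {"faithfulness": 0.85, "context_relevancy": 0.70}


def find_regressions(scores: Dict[str, float], baselines: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its baseline."""
    failures = []
    for metric, minimum in baselines.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from evaluation results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} is below baseline {minimum:.2f}")
    return failures


def gate(scores: Dict[str, float], baselines: Dict[str, float] = BASELINES) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = fail)."""
    failures = find_regressions(scores, baselines)
    for message in failures:
        print(f"FAIL {message}")
    return 1 if failures else 0
```

In a CI job, the script would end with something like `raise SystemExit(gate(scores))`, so that a non-zero exit code fails the build. Treating a missing metric as a failure is a deliberate choice here: a silently skipped evaluation should not pass the gate.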
Answer Strategy
This tests diagnostic reasoning. High Faithfulness means the answer is grounded in the retrieved context, while low Answer Relevancy means the answer doesn't address the user's original question. The issue is likely in the retrieval step or in the prompt that instructs the LLM.
Sample Answer: 'This pattern indicates the answer is factually correct based on the context but fails to address the user's intent. I would first examine the retrieved contexts: are they on-topic? If the contexts are irrelevant, the problem is the retriever. If the contexts are relevant but the answer is a non-sequitur, I'd inspect the generator's system prompt for instructions on how to synthesize and present information based on the query.'
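The decision path in that answer can be written down as a small triage function. This is a sketch under stated assumptions: the single `threshold` of 0.7 separating "high" from "low" scores, the `diagnose` function, and its return labels are hypothetical, and a context relevance score is assumed to be available as a proxy for "are the contexts on-topic?".

```python
def diagnose(
    faithfulness: float,
    answer_relevancy: float,
    context_relevancy: float,
    threshold: float = 0.7,  # assumed cut-off between "high" and "low" scores
) -> str:
    """Suggest where to look first for a grounded-but-off-target answer."""
    grounded_but_off_target = faithfulness >= threshold and answer_relevancy < threshold
    if not grounded_but_off_target:
        # This triage only covers the high-Faithfulness / low-Answer-Relevancy pattern.
        return "no_match"
    if context_relevancy < threshold:
        # The answer is faithful to contexts that were themselves off-topic.
        return "retriever"
    # Contexts were on-topic, so the synthesis instructions are the next suspect.
    return "generator_prompt"
```

For example, `diagnose(0.9, 0.4, 0.3)` points at the retriever, while `diagnose(0.9, 0.4, 0.8)` points at the generator's system prompt. The function only suggests where to look first; manual inspection of the retrieved contexts is still what confirms the cause.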