AI Hallucination Detection Specialist
An AI Hallucination Detection Specialist identifies, measures, and mitigates fabricated or factually incorrect outputs generated b…
Skill Guide
The systematic design of reproducible, scalable pipelines that programmatically test and score LLM outputs against defined metrics using specialized evaluation frameworks or custom code.
Scenario
You have a customer support chatbot prompt. You need to ensure that changes to the prompt don't break its ability to answer the top 10 most common questions correctly.
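A regression check for this scenario can be sketched as a small golden set that is re-run after every prompt change. The sketch below is illustrative only: `GoldenCase`, `run_regression`, `fake_chatbot`, and the two example questions are hypothetical names and data, and the substring check is a deliberately simple stand-in for whatever correctness metric the team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    question: str
    must_contain: List[str]  # phrases a correct answer is expected to include


def run_regression(answer_fn: Callable[[str], str], cases: List[GoldenCase]) -> List[str]:
    """Return the questions whose answers are missing an expected phrase."""
    failures = []
    for case in cases:
        answer = answer_fn(case.question).lower()
        if not all(phrase.lower() in answer for phrase in case.must_contain):
            failures.append(case.question)
    return failures


# Hypothetical golden set; a real one would cover all ten top questions.
GOLDEN_SET = [
    GoldenCase("How do I reset my password?", ["reset link"]),
    GoldenCase("What is your refund policy?", ["30 days"]),
]


def fake_chatbot(question: str) -> str:
    # Stand-in for the real prompt + model call (assumption for this sketch).
    canned = {
        "How do I reset my password?": "Click 'Forgot password' and we will email you a reset link.",
        "What is your refund policy?": "Refunds are available within 14 days of purchase.",
    }
    return canned.get(question, "")


failed = run_regression(fake_chatbot, GOLDEN_SET)
print(failed)
```

Here the refund answer no longer mentions the expected window, so that question is reported as a regression while the password question passes.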
Scenario
Your Retrieval-Augmented Generation (RAG) system is live, but you need to evaluate not just the final answer, but the quality of the retrieved context and the generation's faithfulness to that context.
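The two layers named in this scenario, retrieval quality and faithfulness to the retrieved context, can be scored separately. The sketch below uses crude token overlap as a proxy for both; it is not how Ragas or DeepEval compute their metrics, and the function names, the `0.6` threshold, and the sample texts are assumptions made for illustration.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_recall(retrieved: List[str], reference_answer: str) -> float:
    """Fraction of reference-answer tokens that appear in the retrieved chunks."""
    ref = tokens(reference_answer)
    ctx = set().union(*(tokens(chunk) for chunk in retrieved))
    return len(ref & ctx) / len(ref) if ref else 0.0


def faithfulness(answer: str, retrieved: List[str], threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose tokens are mostly covered by the context."""
    ctx = set().union(*(tokens(chunk) for chunk in retrieved))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        sent_tokens = tokens(sentence)
        if sent_tokens and len(sent_tokens & ctx) / len(sent_tokens) >= threshold:
            supported += 1
    return supported / len(sentences)


retrieved = [
    "Orders ship within 2 business days.",
    "Returns are accepted within 30 days.",
]
answer = "Orders ship within 2 business days. We also offer free gift wrapping."

recall_score = context_recall(retrieved, "Orders ship within 2 business days")
faith_score = faithfulness(answer, retrieved)
print(recall_score, faith_score)
```

In this example the retriever found everything the reference answer needs, but the second sentence of the generated answer has no support in the context, which is the pattern a faithfulness metric is meant to surface.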
Scenario
Your MLOps team requires that no prompt change or fine-tuned model can be merged into the main branch or deployed unless it passes a rigorous, automated evaluation suite with defined thresholds.
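A merge gate of this kind usually reduces to comparing aggregate eval scores against agreed thresholds and returning a non-zero exit code on any violation. The following is a minimal sketch under stated assumptions: the metric names, the threshold values, and the `candidate_scores` dictionary are hypothetical, and in a real pipeline the scores would come from the evaluation run rather than being hard-coded.

```python
from typing import Dict, List

# Hypothetical thresholds; real values would be agreed with the team.
MIN_SCORES = {"faithfulness": 0.90, "answer_relevancy": 0.85}
MAX_SCORES = {"toxicity_rate": 0.01}


def check_thresholds(scores: Dict[str, float]) -> List[str]:
    """Return a human-readable violation for every metric outside its bound."""
    violations = []
    for name, minimum in MIN_SCORES.items():
        value = scores.get(name)
        if value is None or value < minimum:
            violations.append(f"{name}={value} is below the minimum {minimum}")
    for name, maximum in MAX_SCORES.items():
        value = scores.get(name)
        if value is None or value > maximum:
            violations.append(f"{name}={value} is above the maximum {maximum}")
    return violations


def main(scores: Dict[str, float]) -> int:
    """Print violations and return the exit code a CI job would use."""
    violations = check_thresholds(scores)
    for violation in violations:
        print("FAIL:", violation)
    return 1 if violations else 0


# Example scores for a candidate prompt or model (made up for illustration).
candidate_scores = {"faithfulness": 0.93, "answer_relevancy": 0.80, "toxicity_rate": 0.0}
exit_code = main(candidate_scores)  # a CI script would pass this to sys.exit()
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when the eval run did not produce a score would defeat its purpose.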
**OpenAI Evals** is a Python framework for creating and running evaluations, with a focus on model-graded evals and a registry of existing evals. **promptfoo** is a fast, CLI-first tool for testing LLM prompts across multiple providers and models with a focus on speed and reliability. **DeepEval** and **Ragas** are specialized libraries for evaluating RAG pipelines and specific metrics like faithfulness and hallucination.
**LLM-as-a-Judge** uses a stronger model (e.g., GPT-4) to grade the outputs of a cheaper or faster model, enabling scalable, nuanced evaluation. **Human-in-the-loop (HITL) sampling** has people review a sample of outputs to validate the automated evals and to handle ambiguous edge cases. **Production A/B testing** measures the real-world impact of changes on business metrics, providing the ultimate ground truth.
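The LLM-as-a-Judge pattern can be sketched as three small pieces: a rubric prompt, a call to the judge model, and a strict parser for the reply. In the sketch below the judge is passed in as a plain callable so the example runs without any provider SDK; `RUBRIC`, `build_judge_prompt`, `parse_score`, and `stub_judge` are hypothetical names, and the stub's reply is invented for illustration.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; a real one would be tuned and validated against human labels.
RUBRIC = (
    "You are grading a customer support answer.\n"
    "Score factual accuracy against the reference from 1 (wrong) to 5 (fully correct).\n"
    "Reply with 'SCORE: <n>' followed by a one-sentence reason."
)


def build_judge_prompt(question: str, answer: str, reference: str) -> str:
    """Assemble the rubric and the item to grade into one prompt."""
    return (
        f"{RUBRIC}\n\nQuestion: {question}\n"
        f"Reference answer: {reference}\nCandidate answer: {answer}"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the judge did not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str, reference: str,
          call_judge: Callable[[str], str]) -> Optional[int]:
    """Grade one answer using whichever judge model `call_judge` wraps."""
    return parse_score(call_judge(build_judge_prompt(question, answer, reference)))


def stub_judge(prompt: str) -> str:
    # Stand-in for a call to the stronger model (assumption for this sketch).
    return "SCORE: 2 The candidate states a different refund window than the reference."


score = judge(
    "What is your refund policy?",
    "Refunds are available within 14 days.",
    "Refunds are available within 30 days.",
    stub_judge,
)
print(score)
```

Returning `None` for an unparseable reply, rather than defaulting to a score, keeps malformed judge outputs visible so they can be routed to the human review sample.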
Answer Strategy
Structure the answer around: 1) **Safety First**: Prioritize 'harmless' metrics (toxicity, bias, PII leakage) using dedicated classifiers and rule-based filters. 2) **Helpfulness Core**: Implement metrics like 'Answer Relevancy' and 'Task Completion' using LLM-as-a-Judge with a rubric. 3) **Infrastructure**: Propose a multi-stage pipeline in promptfoo: first a fast safety filter, then a more expensive quality scorer that runs only on outputs that pass the filter. Emphasize building a dataset of adversarial test cases for safety.
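The staging idea in point 3 can be illustrated independently of any tool. The sketch below is plain Python, not a promptfoo configuration: a cheap rule-based safety stage runs first, and the expensive quality scorer is only invoked for outputs that pass it. The regex, the blocklist, the `0.7` pass mark, and `stub_scorer` are all assumptions chosen for the example.

```python
import re
from typing import Callable, Dict, List

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKLIST = {"idiot", "stupid"}  # hypothetical; real filters would use classifiers too


def safety_filter(output: str) -> List[str]:
    """Cheap rule-based checks; returns a list of detected issues."""
    issues = []
    if EMAIL.search(output):
        issues.append("possible PII: email address")
    if set(re.findall(r"[a-z']+", output.lower())) & BLOCKLIST:
        issues.append("blocklisted term")
    return issues


def evaluate(output: str, quality_scorer: Callable[[str], float]) -> Dict:
    """Run the safety stage first and only pay for quality scoring if it passes."""
    issues = safety_filter(output)
    if issues:
        return {"passed": False, "stage": "safety", "issues": issues, "quality": None}
    quality = quality_scorer(output)
    return {"passed": quality >= 0.7, "stage": "quality", "issues": [], "quality": quality}


scorer_calls = []


def stub_scorer(output: str) -> float:
    # Stand-in for an LLM-as-a-Judge call; records each invocation.
    scorer_calls.append(output)
    return 0.9


unsafe = evaluate("Contact me at jane@example.com for a refund.", stub_scorer)
safe = evaluate("You can reset your password from the login page.", stub_scorer)
print(unsafe["stage"], safe["stage"], len(scorer_calls))
```

The unsafe output is rejected at the first stage without ever reaching the scorer, which is where the cost saving of a staged pipeline comes from.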
Answer Strategy
This question tests for **systematic thinking** and **practical impact**. Use the STAR method: **Situation**: Manual spot checks were passing. **Task**: Validate 1000+ responses. **Action**: Built an eval pipeline with a custom 'coherence' scorer, which found that the model was generating fluent but logically inconsistent answers in 8% of cases. **Result**: Caught the issue pre-launch, retrained the model, and improved the coherence score by 40%.