AI Hallucination Detection Specialist
An AI Hallucination Detection Specialist identifies, measures, and mitigates fabricated or factually incorrect outputs generated b…
Skill Guide
Red-teaming for generative AI is a structured adversarial testing methodology designed to proactively discover and document model failures, safety violations, and harmful outputs before deployment.
Scenario
You are given access to a public-facing chatbot API. Your goal is to determine if it will generate content violating its stated safety policies (e.g., creating harmful code, adult content).
Scenario
Your team has fine-tuned a model to answer questions about internal corporate financial documents. You must test its reliability and tendency to fabricate information (hallucinate).
Scenario
A financial advisory RAG system cites SEC filings. An adversary attempts to make it recommend a specific stock by poisoning its responses over multiple turns, exploiting the system's context window and retrieval logic.
Use PyRIT and Garak for structured, automated attack generation and vulnerability scanning against models. Use LangSmith to trace and analyze the internal decision logic of complex chains during red-team exercises. Use Promptfoo to define and run repeatable test suites against multiple model endpoints.
Use OWASP Top 10 to ensure comprehensive coverage of common application-layer vulnerabilities. Use NIST AI RMF and MITRE ATLAS to align red-teaming with organizational risk governance and to catalog adversary tactics, techniques, and procedures. Adapt STRIDE to model threats like spoofing model identity or tampering with training data.
Use adapted CVSS or internal tiers to standardize severity assessment of findings, enabling prioritized engineering fixes. The HARM taxonomy provides a consistent language for categorizing and discussing failure modes across teams.
Answer Strategy
The interviewer is testing structured problem-solving and threat modeling. Use a phased approach: 1) **Scoping**: Define prohibited outputs (e.g., real persons, copyrighted art styles, violent scenes) based on policy and law. 2) **Methodology**: Describe a mix of automated (using adversarial prompt libraries) and manual testing (creative artists and cultural experts probing edge cases). 3) **Execution**: Explain how you'd document failures with a consistent severity rubric. 4) **Reporting**: Emphasize translating findings into specific engineering tasks (e.g., 'strengthen NSFW filter for specific artist name triggers') and a risk assessment for legal/compliance.
Answer Strategy
This tests understanding of nuanced failures beyond simple block/allow. Categorize this as a **Circumvention via Indirect Prompting** and a failure of **Contextual Integrity**. The core competency is recognizing that safety filters can be brittle. Sample answer: 'I'd report this as a High-severity jailbreak. The failure is not in the refusal mechanism but in the model's inability to maintain its ethical stance within a different narrative frame. The fix likely requires alignment training to recognize harmful themes across all output formats, not just direct Q&A. I'd recommend a dedicated test suite for fictional and role-play scenarios.'
1 career found
Try a different search term.