AI Hallucination Detection Specialist
An AI Hallucination Detection Specialist identifies, measures, and mitigates fabricated or factually incorrect outputs generated b…
Skill Guide
The systematic practice of crafting inputs to probe, evaluate, and break language models by identifying their failure modes, safety boundaries, and performance limits.
Scenario
You are given access to a commercial API model (e.g., a provider's safety-tuned model). Your goal is to map its refusal boundaries for generating harmful content across five categories: violence, hate speech, illegal acts, self-harm, and sexual content.
Scenario
You are testing a model that uses a system prompt to enforce company policy (e.g., 'Never discuss competitor X'). A user attempts to override this via the user prompt. Your task is to design and execute an attack to make the model violate the system-level instruction.
Scenario
Your team is pre-launch for a customer-facing model. You must build a continuous adversarial testing suite that automatically runs nightly, flags regressions, and provides data to the fine-tuning team.
Use these to programmatically define, execute, and evaluate prompt-based adversarial attacks at scale. 'inspect-ai' is particularly robust for complex, multi-turn red-teaming evaluations. Integrate these into CI/CD pipelines for models.
Leverage these pre-compiled datasets of malicious prompts to stress-test model safety. They provide a standardized way to measure and compare model robustness against known attack vectors.
Apply these to structure your thinking. Use ATLAS/OWASP to ensure comprehensive attack coverage. Use FMEA to systematically analyze potential failure points in the model's response pipeline before they are exploited.
Answer Strategy
The candidate should demonstrate a structured, risk-based approach. They should mention categorization of attacks, use of existing benchmarks, and prioritization based on business impact. Sample Answer: 'I'd start by categorizing attacks using a framework like OWASP LLM Top 10. I'd prioritize vectors with high real-world likelihood and severe potential harm, such as direct prompt injection to extract system prompts or generate hateful content. My testing would combine manual creative red-teaming for novel attacks and automated sweeps using a framework like Garak against datasets like HarmBench to ensure baseline coverage. The goal is a measurable safety profile, not just anecdotes.'
Answer Strategy
The question tests systematic debugging and adversarial thinking under pressure. The candidate should outline a step-by-step forensic analysis. Sample Answer: 'First, I'd secure and reproduce the exact failing prompts and context from production logs. Next, I'd check for data poisoning or leakage in the fine-tuning dataset. I'd then conduct targeted adversarial probing around the failure domain-likely testing for indirect prompt injection via retrieved context or subtle keyword triggers that bypass safety layers. I'd also verify if the model's internal safety representations were degraded during fine-tuning by running a focused suite of safety benchmarks. The root cause is often in the fine-tuning data or a poorly guarded retrieval-augmented generation (RAG) component.'
1 career found
Try a different search term.