AI Responsible Disclosure Specialist
An AI Responsible Disclosure Specialist identifies, documents, and coordinates the ethical reporting of vulnerabilities, safety fa…
Skill Guide
The systematic practice of discovering and exploiting adversarial inputs (prompts) that cause large language models to bypass safety filters, reveal confidential training data, or execute unintended actions, forming the basis of offensive security for AI systems.
Scenario
You are given access to a commercial chatbot API (e.g., a well-aligned model like ChatGPT or Claude). Your goal is to make it generate harmful, unethical, or prohibited content it's designed to refuse.
Scenario
You are testing an AI assistant that summarizes web pages and answers questions about them. The assistant is integrated into a corporate knowledge base.
Scenario
You are red-teaming a customer service agent that uses an LLM with plugins: it can read emails (text), view product photos (image), and issue refunds via an API (tool use). Your objective is to trigger an unauthorized refund.
Use Garak for automated, library-driven fuzzing of models against known attack patterns. Rebuff can be integrated as a defensive layer to test your own mitigations. Adversarial datasets provide a baseline of known malicious prompts. Burp Suite is for manual, deep-dive HTTP-level analysis of LLM API traffic.
OWASP provides the industry-standard checklist for vulnerability classes. NIST offers a high-level framework for building organizational risk governance. PyRIT is a tool for security teams to proactively generate adversarial prompts and measure their model's resilience, enabling a 'red team by design' approach.
These platforms are used post-deployment to log all prompts and completions, allowing for the detection of exploitation attempts in production. They help identify novel attacks, measure the frequency of injection attempts, and trigger alerts for anomalous model behavior.
Answer Strategy
The interviewer is testing your ability to think systematically, consider the full attack surface, and prioritize. Structure your answer using the 'Attack Surface -> Threat Model -> Test Cases -> Mitigations' framework. Sample Answer: 'First, I'd map the attack surface: the model's input context, any tool calls it makes, and its output channels. Next, I'd build a threat model focusing on indirect prompt injection via malicious content in source documents and data exfiltration through the model's responses. My test plan would include: 1) Poisoning test documents with escalating payloads from simple to complex, 2) Testing for system prompt leakage, and 3) Attempting to make the model manipulate downstream systems (e.g., calendar invites). For each finding, I'd propose specific mitigations like input sanitization, instruction hierarchy, and output parsing.'
Answer Strategy
This is a behavioral question testing hands-on experience, creativity, and impact assessment. Use the STAR method (Situation, Task, Action, Result). Focus on the technical details of your discovery and its business/security implications. Sample Answer: 'In testing a legal document analyzer, I discovered an indirect injection via font color. The model processed visible black text, but the document contained hidden white text containing malicious instructions. By setting the font color to match the background, I bypassed a key safety filter that only scanned visible content. This made the model inject false citations into its summary. The impact was critical: it could have led to legal malpractice. I documented the technique, which led to the vendor implementing more robust HTML/CSS parsing and color contrast analysis in their preprocessing pipeline.'
1 career found
Try a different search term.