AI Trust & Safety Policy Specialist
An AI Trust & Safety Policy Specialist designs, implements, and enforces policies that govern responsible AI development and deplo…
Skill Guide
A systematic discipline for assessing the capabilities, limitations, safety, and alignment of large language models (LLMs) through rigorous, adversarial testing to uncover failure modes and vulnerabilities.
Scenario
You are given API access to a generic customer service chatbot. Your goal is to identify at least three distinct ways to make it break character, reveal its system prompt, or generate harmful content.
Scenario
You need to evaluate a newly fine-tuned model for a hiring assistant tool against a standardized bias benchmark (e.g., BBQ, WinoBias) before its internal pilot release.
Scenario
As the lead of an AI safety team, you must ensure that every model version update for your flagship product does not reintroduce known critical vulnerabilities and is tested against new attack vectors.
Use Evaluate/Harness for standardized benchmark execution (MMLU, HellaSwag). Use PyRIT and Garak for automated, multi-turn adversarial attack generation and vulnerability scanning, moving beyond static benchmarks.
ATLAS and OWASP provide standardized taxonomies of attack tactics. NIST AI RMF and STRIDE/DREAD provide the procedural frameworks for integrating evaluation into risk governance and systematic threat identification.
Answer Strategy
The interviewer is assessing domain-specific threat modeling and creative attack design. Structure your answer: 1) Threat Model: Define harmful advice (e.g., dangerous self-treatment, discouraging professional consultation). 2) Attack Vectors: Design multi-turn scenarios (e.g., empathetic patient persona, gradual escalation). 3) Execution: Plan to combine manual expert probing with automated template generation. 4) Measurement: Define success metrics (e.g., harmful output rate per attack type). Sample: 'I'd start by partnering with a medical SME to define a taxonomy of harmful advice. Then, I'd develop scenarios where the model is primed as a 'helpful medical assistant' and tested with emotionally charged, symptom-specific queries that edge toward dangerous recommendations. We'd measure the refusal rate and the safety of any generated advice against clinical guidelines.'
Answer Strategy
This tests the end-to-end incident handling process. Use STAR-L (Situation, Task, Action, Result, Learning). Emphasize: 1) Reproducibility and validation steps. 2) Clear severity classification. 3) Communication strategy to both technical and non-technical stakeholders. 4) The fix and post-mortem. Sample: 'During testing, I discovered the model could be manipulated via a specific Unicode sequence to bypass safety filters. I documented the minimal reproducible prompt and 10 variants to confirm it wasn't flaky. I classified it as 'Critical' per our risk matrix and convened a 30-minute war room with engineering, product, and legal. We implemented a short-term input sanitization rule and scheduled a longer-term fine-tuning fix. The post-mortem led to adding Unicode normalization to our standard preprocessing pipeline.'
1 career found
Try a different search term.