Skill Guide

Adversarial ML techniques (FGSM, PGD, C&W attacks, prompt injection, jailbreaking)

Adversarial ML techniques are methods for systematically finding or crafting inputs that cause machine learning models to produce incorrect, unintended, or malicious outputs, encompassing both classical gradient-based attacks on neural networks and prompt-based exploitation of large language models.

This skill is critical for building robust and trustworthy AI systems, directly mitigating financial, reputational, and regulatory risks. It enables proactive security hardening of models before deployment, which is now a key requirement in sectors like finance, healthcare, and autonomous systems.

1 Careers

1 Categories

9.2 Avg Demand

15% Avg AI Risk

How to Learn Adversarial ML techniques (FGSM, PGD, C&W attacks, prompt injection, jailbreaking)

1. **Foundational Math & ML**: Solidify understanding of gradients (Jacobian/Hessian matrices), loss functions, and backpropagation. 2. **Core Attack Taxonomy**: Memorize the key families-evasion (FGSM, PGD, C&W), data poisoning, and model inversion-for classical ML; and prompt injection, jailbreaking for LLMs. 3. **Hands-on Tooling**: Get comfortable running baseline attacks on pre-trained models using frameworks like CleverHans or the ART library on simple datasets (e.g., MNIST, CIFAR-10).

1. **From White-Box to Black-Box**: Move beyond knowing attacks to implementing transferability studies and query-based black-box attacks. 2. **Defense Implementation**: Practice implementing and evaluating common defenses like adversarial training, input preprocessing (JPEG compression, spatial smoothing), and certified defenses on a benchmark. 3. **Common Pitfall**: Avoid focusing solely on attack success rate; learn to measure perturbation magnitude (L∞, L2 norms) and computational cost to understand trade-offs.

1. **System-Level Integration**: Design and architect adversarial robustness pipelines integrated into MLOps, including monitoring for adversarial drift. 2. **LLM-Specific Red Teaming**: Develop comprehensive red teaming frameworks for large language models, covering not just prompt injection but also indirect injection, payload delivery, and multilingual attacks. 3. **Strategic Leadership**: Lead the creation of organizational security standards, threat models for AI assets, and mentor teams on secure ML development lifecycle (SMLD).

Practice Projects

Beginner

Project

FGSM Attack on a Pre-trained Image Classifier

Scenario

You have a pre-trained ResNet model on ImageNet. Your goal is to generate adversarial examples using the Fast Gradient Sign Method (FGSM) that cause misclassification with minimal visible perturbation.

How to Execute

1. Load the pre-trained model and a sample image. 2. Compute the loss and the gradient of the loss with respect to the input image. 3. Create the adversarial image by adding a small epsilon-scaled perturbation in the sign direction of the gradient. 4. Verify misclassification and visualize the original vs. adversarial image and the perturbation.

Intermediate

Project

Evaluating Robustness of a Custom Model with PGD and Defenses

Scenario

You've trained a custom CNN on a private dataset. You need to conduct a full robustness audit using Projected Gradient Descent (PGD) and then harden the model using adversarial training.

How to Execute

1. Implement a PGD attack with iterative steps and random restarts on your model's test set. 2. Calculate the model's robust accuracy under this attack. 3. Implement adversarial training by augmenting each training batch with PGD-generated adversarial examples. 4. Retrain the model and compare the clean accuracy vs. robust accuracy trade-off before and after defense.

Advanced

Project

Designing a Multi-Vector LLM Red Teaming Campaign

Scenario

Your organization is deploying a customer-facing LLM chatbot. You must design and execute a red teaming campaign that tests for prompt injection (direct & indirect), jailbreaking, and data extraction, then produce a remediation plan.

How to Execute

1. **Threat Modeling**: Map potential attack surfaces (user input, retrieved documents, tool APIs). 2. **Attack Generation**: Develop attack prompts for each vector (e.g., DAN-style jailbreaks, indirect injection via hidden text in documents). 3. **Execution & Monitoring**: Run the campaign against a staging model, logging all inputs/outputs and model behavior. 4. **Analysis & Hardening**: Classify failures, propose mitigations (input/output filtering, prompt hardening, system prompt isolation), and present a risk report to engineering and product leadership.

Tools & Frameworks

Software & Platforms

IBM Adversarial Robustness Toolbox (ART)CleverHans (TensorFlow)Foolbox (PyTorch)Microsoft CounterfitGarak (LLM Vulnerability Scanner)

ART is the industry-standard, comprehensive library for classical adversarial ML attacks and defenses. CleverHans and Foolbox are foundational for research. Counterfit and Garak are purpose-built for red teaming AI systems, with Garak specializing in LLMs.

Methodologies & Frameworks

MITRE ATLAS (Adversarial Threat Landscape for AI Systems)OWASP Top 10 for LLM ApplicationsNIST AI Risk Management Framework (AI RMF)

ATLAS provides a knowledge base of adversary tactics and techniques for AI. The OWASP Top 10 for LLMs is a critical checklist for securing LLM applications. NIST AI RMF offers high-level guidance on governing and managing AI risks, including adversarial robustness.

Interview Questions

Answer Strategy

Demonstrate depth by explaining the reformulation of the attack as an optimization problem with a modified loss function and box constraints, contrasting it with FGSM's one-step linear approximation. Emphasize its ability to find minimal perturbations but note its high computational cost due to iterative optimization and binary search for the constant 'c'. Sample: 'C&W frames the attack as an optimization problem minimizing perturbation magnitude subject to misclassification, using a custom loss function to bypass defensive distillation. Unlike FGSM's single gradient step, it uses iterative gradient descent with a binary search over a constant, yielding stronger, more targeted attacks at significantly higher computational cost.'

Answer Strategy

The interviewer is testing for a structured, defense-in-depth approach. Outline a phased strategy: 1) **Threat Assessment**: Classify attack types (role-play, prompt leakage, DoS). 2) **Pre-processing**: Implement input sanitization and jailbreak keyword/regex filters. 3) **Model-Level**: Use prompt hardening (e.g., system prompt isolation, instruction hierarchy) and fine-tuning on refusal datasets. 4) **Post-processing**: Add output classifiers to detect and block unsafe responses. 5) **Monitoring**: Log and analyze failed jailbreak attempts to iteratively improve defenses. Sample: 'My strategy is layered. First, I'd threat model potential jailbreak vectors. Then, I'd implement input filtering and robust system prompts that clearly define the model's boundaries. I'd augment this with a fine-tuned safety classifier on the output side. Critically, I'd establish a continuous monitoring loop to analyze attack attempts and update defenses accordingly.'