CS230 • Adversarial Attacks and Defenses
- Overview
- Adversarial Attacks
- Defenses to Adversarial Attacks
- Why Are Neural Networks Vulnerable to Adversarial Examples?
- Further Readings
- Citation
Overview
-
Neural networks are powerful but fragile. Online platforms increasingly rely on neural network–based classifiers to automatically flag and remove harmful content, such as violent or sexually explicit material. Yet, attackers can circumvent these filters by applying subtle, human-imperceptible perturbations to the input. This phenomenon is known as an adversarial attack.
-
The vulnerability of neural networks to adversarial perturbations was first identified in 2013 by Szegedy et al., who demonstrated that it is possible to craft fake datapoints that reliably fool state-of-the-art image classifiers (Szegedy et al., 2013). Follow-up work showed the same vulnerability in text (Papernot et al., 2017; Ebrahimi et al., 2018), speech (Carlini & Wagner, 2018), and even structured data (McDaniel et al., 2017).
-
Since then, increasingly powerful attack methods have emerged:
- Carlini & Wagner (C&W) attacks (Carlini & Wagner, 2017) showed how to design optimization-based attacks that evade many defenses.
- Projected Gradient Descent (PGD) (Madry et al., 2018) became the de facto “universal first-order adversary” and is now a benchmark attack in robustness research.
- AutoAttack (Croce & Hein, 2020) combined multiple strong attacks into a reliable evaluation suite that exposes overly optimistic defense claims.
- Beyond classification, adversarial attacks have expanded to large-scale generative models (e.g., GANs and diffusion models) and multimodal systems (Wei et al., 2022).
-
Adversarial examples reveal a fundamental paradox: the same linearity and gradient-based properties that make neural networks efficient to train also render them vulnerable to adversarial perturbations. Research in this field remains highly active, with directions ranging from robust training algorithms to certifiable defenses and even theoretical frameworks for understanding adversarial robustness.
-
Recent advances in defenses include:
- Adversarial training with PGD (Madry et al., 2018), still considered the gold standard for robustness, though computationally expensive.
- Randomized smoothing (Cohen et al., 2019), which provides probabilistic certified robustness guarantees under \(L_2\)-norm perturbations.
- Certified defenses using convex relaxations (Wong & Kolter, 2018) and interval bound propagation (Gowal et al., 2019).
- Robust pretraining and large-scale models (Xie et al., 2020), which suggest that scaling and data diversity can improve robustness.
-
As neural networks become integrated into critical applications—autonomous driving, healthcare diagnostics, security systems—the robustness of these models will determine whether AI can be trusted in high-stakes environments. Studying adversarial examples is therefore not just a scientific curiosity but a matter of ensuring the safety and reliability of AI systems.
-
The following figure shows how an artificially generated “fake” datapoint can be created to fool a classifier into outputting a chosen target label. Even though the image looks like random noise to a human observer, the classifier confidently mislabels it as belonging to a chosen target class.
-
More surprisingly, adversarial perturbations can be designed to be imperceptible. For example, consider an image of a cat. By adding small, carefully calculated pixel-level noise, an attacker can trick a classifier into predicting “iguana” while the perturbed image still looks like a perfectly normal cat to a human observer.
-
The following figure demonstrates this phenomenon: a subtle perturbation applied to a cat image leads the classifier to misclassify it as an iguana, even though the cat is still clearly visible to humans.
-
This has serious real-world implications:
- Autonomous vehicles rely on object detectors to identify pedestrians, vehicles, and traffic signs. An adversarially perturbed “STOP” sign might instead be classified as a “70 mph speed limit” sign, leading to catastrophic accidents.
- Face recognition systems deployed in security settings can be fooled. An unauthorized individual could manipulate their photo so that the system identifies them as an authorized user, granting access.
- Content moderation platforms that screen for prohibited imagery (e.g., violence, explicit content) can be bypassed. Perturbed images evade classifiers, resulting in harmful material slipping past automated filters.
-
Because of these risks, the study of adversarial examples has become a central research area in modern AI. The field continues to evolve, with ongoing debates about whether true robustness is achievable, or whether adversarial vulnerability is an inherent property of high-dimensional learning systems.
Adversarial Attacks
General Procedure
-
The process of crafting adversarial examples typically exploits the differentiable structure of neural networks. Consider a convolutional network pre-trained on the ImageNet dataset. The attacker’s objective is to start from a benign input image \(x\) and produce a perturbed image \(x_{adv}\) such that the model confidently misclassifies it as a chosen target class \(y_i\).
-
The basic procedure can be broken down into the following steps:
- Forward pass: Select an input \(x\) and compute the model’s prediction \(\hat{y}\).
-
Define a target loss: Construct a loss function that penalizes deviation from the chosen target class. A simple squared-error loss is:
\[\mathcal{L}(\hat{y}, y_i) = \frac{1}{2} \, || \hat{y} - y_i ||^2\]where \(\hat{y}\) is the prediction vector (e.g., class probabilities) and \(y_i\) is the one-hot vector for the target class.
-
Iterative optimization on the input: Keep the network weights frozen and update the input image using gradient descent:
\[x \leftarrow x - \alpha \frac{\partial \mathcal{L}}{\partial x}\]where \(\alpha\) is a step size controlling the perturbation strength.
-
After several iterations, this optimization produces an adversarial example \(x_{adv}\) that is highly likely to be classified as the target \(y_i\).
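As an illustration, the loop above can be written in a few lines of PyTorch. This is a minimal sketch, assuming a `model` that outputs class probabilities for a single-image batch with pixel values in \([0, 1]\); the step size and iteration count are placeholders.

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, x, target_class, alpha=0.01, num_steps=100):
    """Craft x_adv by gradient descent on the input while the weights stay frozen."""
    model.eval()
    with torch.no_grad():
        num_classes = model(x).shape[-1]
    y_target = F.one_hot(torch.tensor([target_class]), num_classes).float()

    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        y_hat = model(x_adv)                              # forward pass
        loss = 0.5 * torch.sum((y_hat - y_target) ** 2)   # squared-error loss toward the target
        loss.backward()                                   # gradient w.r.t. the input
        with torch.no_grad():
            x_adv -= alpha * x_adv.grad                   # update the image, not the weights
            x_adv.clamp_(0.0, 1.0)                        # keep pixels in a valid range
        x_adv.grad.zero_()

    return x_adv.detach()
```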
State-of-the-Art Attacks
-
While the above describes the original approach (Szegedy et al., 2013), more advanced attack algorithms have since been developed:
-
Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014): A single-step method that perturbs the input in the direction of the sign of the gradient:
\[x_{adv} = x + \varepsilon \cdot \text{sign}\left(\nabla_x \mathcal{L}(W, x, y)\right)\]It is computationally efficient and widely used both for attacks and adversarial training.
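A minimal sketch of FGSM in PyTorch, assuming a `model` that returns logits and using a cross-entropy loss in place of the generic \(\mathcal{L}\):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=8 / 255):
    """Single-step FGSM: step in the direction of the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in a valid range
```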
-
Projected Gradient Descent (PGD) (Madry et al., 2018): An iterative version of FGSM, considered the “universal first-order adversary”. It is stronger than FGSM because it applies multiple small updates while projecting back into an \(\varepsilon\)-ball around the input to ensure perturbations remain bounded.
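A sketch of PGD under an \(L_\infty\) budget is shown below; the budget \(\varepsilon\), step size, and step count are typical values rather than prescribed ones, and a cross-entropy loss again stands in for \(\mathcal{L}\).

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, epsilon=8 / 255, alpha=2 / 255, num_steps=10):
    """Iterative FGSM with projection back into the L-infinity epsilon-ball around x."""
    # Random start inside the epsilon-ball, as in Madry et al.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)

    for _ in range(num_steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()             # FGSM-style step
            x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon)  # project into the epsilon-ball
            x_adv = x_adv.clamp(0.0, 1.0)                         # keep pixels valid
    return x_adv.detach()
```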
-
Carlini & Wagner (C&W) attack (Carlini & Wagner, 2017): An optimization-based method that can bypass many defenses by carefully designing the loss function. It remains one of the most influential targeted attack frameworks.
-
AutoAttack (Croce & Hein, 2020): A parameter-free ensemble of strong attacks (including PGD and C&W variants). It has become the de facto benchmark for evaluating defenses, as it reliably exposes overly optimistic robustness claims.
-
Beyond classification: Modern attacks extend to object detection, segmentation, multimodal models (vision-language systems like CLIP), and diffusion models (Wei et al., 2022). These show that adversarial vulnerability is not limited to classifiers but is pervasive across AI modalities.
-
Ensuring Natural-Looking Perturbations
-
A key challenge is that unconstrained optimization can produce adversarial examples that look like random noise, since the input space is vast. For example, the number of possible 32×32 color images is roughly:
\[255^{32 \times 32 \times 3} \approx 10^{7400}\]which is astronomically larger than the set of realistic natural images.
-
To ensure the adversarial example resembles a natural image, the loss can be modified to include a perceptual similarity term:
\[\mathcal{L}(\hat{y}, y_i, x) = \frac{1}{2}||\hat{y} - y_i||^2 + \lambda \, ||x - x_j||^2\]where \(x_j\) is a chosen reference image (e.g., the original input), and \(\lambda\) is a hyperparameter controlling the trade-off between classification success and perceptual similarity.
-
The result is an adversarial example that looks like \(x_j\) to humans, but the classifier labels it as \(y_i\).
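A sketch of this combined objective is below; the balance weight `lam` is an illustrative hyperparameter, not a value from the text.

```python
import torch

def similarity_regularized_loss(y_hat, y_target, x, x_ref, lam=0.1):
    """Squared-error loss toward the target class plus a pixel-space similarity penalty.

    y_hat:    model prediction (e.g., class probabilities)
    y_target: one-hot vector for the chosen target class
    x_ref:    reference image the adversarial example should resemble
    """
    classification_term = 0.5 * torch.sum((y_hat - y_target) ** 2)
    similarity_term = torch.sum((x - x_ref) ** 2)
    return classification_term + lam * similarity_term
```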
Practical Refinements
-
In practice, researchers often apply additional tricks to keep perturbations imperceptible:
- Gradient clipping: Restricting update magnitudes prevents large pixel changes.
-
Norm-based constraints: Perturbations are often bounded by an \(L_p\)-norm, such as
\[||x_{adv} - x||_\infty \leq \varepsilon\]ensuring that each pixel is altered by at most \(\varepsilon\).
- Early stopping: Once the model confidently predicts the target label, the optimization halts to avoid introducing unnecessary distortion.
-
These refinements, combined with state-of-the-art attack methods like PGD and AutoAttack, make adversarial examples subtle yet devastatingly effective—posing a persistent challenge for defense research.
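As a rough sketch, two of the refinements above can be added to the targeted attack loop from earlier: the perturbation is projected into an \(L_\infty\) ball after every step, and the loop stops as soon as the target label is predicted (all hyperparameter values below are illustrative).

```python
import torch
import torch.nn.functional as F

def targeted_attack_refined(model, x, target_class, epsilon=8 / 255, alpha=1 / 255, max_steps=200):
    """Targeted attack with an L-infinity bound and early stopping (single-image batch)."""
    x_adv = x.clone().detach()
    for _ in range(max_steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        if logits.argmax(dim=-1).item() == target_class:
            break                                                 # early stopping: target reached
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                   # move toward the target class
            x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon)  # L-infinity constraint
            x_adv = x_adv.clamp(0.0, 1.0)                         # keep pixels valid
    return x_adv.detach()
```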
Defenses to Adversarial Attacks
- The discovery of adversarial examples raised urgent questions about how to safeguard machine learning models in real-world deployments. While numerous defense strategies have been proposed, no single method can guarantee robustness against all possible attacks. Nonetheless, defenses have evolved considerably, from heuristic approaches to certified robustness guarantees.
Types of Attacks
-
White-box attacks
- The attacker has full knowledge of the model’s architecture, parameters, and gradients.
- Enables exact gradient-based adversarial optimization.
- Example: PGD (Madry et al., 2018) is often evaluated in this setting.
-
Black-box attacks
- The attacker only has query access: they can submit inputs and observe outputs, but cannot see internals.
-
Gradients can be estimated coordinate-by-coordinate using finite differences (see the sketch after this list):
\[\frac{\partial \mathcal{L}}{\partial x_i} \approx \frac{\mathcal{L}(x + \varepsilon e_i) - \mathcal{L}(x)}{\varepsilon}\]where \(\mathcal{L}(\cdot)\) is a loss computed from the model's prediction and \(e_i\) is the \(i\)-th standard basis vector.
- Due to transferability, adversarial examples crafted for one model often fool another (Papernot et al., 2017).
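A minimal NumPy sketch of such coordinate-wise estimation, assuming only query access to a scalar loss `loss_fn(x)` derived from the black-box model's output (the function name is a placeholder):

```python
import numpy as np

def estimate_gradient(loss_fn, x, eps=1e-3):
    """Estimate the gradient of a black-box loss by finite differences.

    Requires one extra query per input coordinate, which is why practical
    black-box attacks rely on sampling-based or transfer-based estimates.
    """
    flat = x.reshape(-1)
    base = loss_fn(x)
    grad = np.zeros_like(flat)
    for i in range(flat.size):
        perturbed = flat.copy()
        perturbed[i] += eps
        grad[i] = (loss_fn(perturbed.reshape(x.shape)) - base) / eps
    return grad.reshape(x.shape)
```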
Early Defense Methods
-
SafetyNet (adversarial detection)
- Lu et al. (2017) introduced a detector “firewall” network that learns to identify adversarial inputs from hidden activations.
- Effective initially, but many adaptive attacks later bypassed detection.
-
Adversarial data augmentation
- Train on adversarially perturbed examples labeled correctly.
- Helps reshape decision boundaries but requires massive adversarial data generation.
-
Adversarial training (FGSM)
-
Goodfellow et al. (2014) proposed adding FGSM-generated adversarial examples into training:
\[\mathcal{L}_{new}(W, x, y) = \mathcal{L}(W, x, y) + \lambda \, \mathcal{L}(W, x_{adv}, y)\]
-
Improves robustness, but still fails against stronger iterative attacks (e.g., PGD, C&W).
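A sketch of one training step with this combined loss is shown below; `model`, `optimizer`, and the weight `lam` are placeholders, and cross-entropy stands in for \(\mathcal{L}\).

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=8 / 255, lam=0.5):
    """One step on the combined loss L(W, x, y) + lambda * L(W, x_adv, y)."""
    # Craft FGSM adversarial examples against the current weights.
    x_pert = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_pert), y).backward()
    x_adv = (x_pert + epsilon * x_pert.grad.sign()).clamp(0.0, 1.0).detach()

    # Clean loss plus weighted adversarial loss, then a normal optimizer step.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + lam * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```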
-
State-of-the-Art Defenses
-
Adversarial Training with PGD
- Madry et al. (2018) extended adversarial training using PGD-generated examples, establishing it as the “gold standard” defense.
- While effective, it is computationally expensive, often requiring several times the resources of standard training.
-
Randomized Smoothing
- Cohen et al. (2019) showed that adding Gaussian noise during inference transforms a classifier into a smoothed classifier that has certified robustness under \(L_2\)-norm perturbations.
- Scales well to large models and has become a practical way to provide probabilistic robustness guarantees.
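A simplified Monte-Carlo sketch of the smoothed prediction (a majority vote over noisy copies); the noise level and sample count are illustrative, and the statistical certification step of Cohen et al. is omitted.

```python
import torch

def smoothed_predict(model, x, sigma=0.25, num_samples=100):
    """Predict with the smoothed classifier: majority vote over Gaussian-noised copies of x."""
    with torch.no_grad():
        num_classes = model(x).shape[-1]
        votes = torch.zeros(num_classes)
        for _ in range(num_samples):
            noisy = x + sigma * torch.randn_like(x)        # Gaussian noise added at inference
            votes[model(noisy).argmax(dim=-1).item()] += 1
    return votes.argmax().item()
```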
-
Certified Defenses
- Convex relaxations (Wong & Kolter, 2018) approximate the adversarial optimization problem to provide formal guarantees of robustness within a perturbation radius.
- Interval Bound Propagation (IBP) (Gowal et al., 2019) propagates input intervals through the network to certify robustness.
- These methods trade off model capacity for provable guarantees.
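To give a flavor of the interval idea behind IBP, the sketch below propagates an \(L_\infty\) interval \([x - \varepsilon,\, x + \varepsilon]\) through a single linear layer followed by ReLU; it illustrates bound propagation only, not the full certified-training procedure of Gowal et al.

```python
import torch

def ibp_linear_relu(W, b, x, epsilon):
    """Propagate elementwise bounds through y = relu(W @ x + b) for inputs in [x - eps, x + eps]."""
    center = x
    radius = torch.full_like(x, epsilon)
    out_center = W @ center + b
    out_radius = W.abs() @ radius                   # worst-case spread of the affine map
    lower = torch.relu(out_center - out_radius)     # ReLU is monotone, so bounds pass through it
    upper = torch.relu(out_center + out_radius)
    return lower, upper
```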
-
Robust Pretraining and Large-Scale Defenses
- Large-scale adversarial pretraining on diverse datasets (Xie et al., 2020) shows improved robustness compared to training from scratch.
- Recent works combine self-supervised learning with adversarial training to improve robustness without prohibitive computational costs.
-
Defense Benchmarking
- To counter “broken defenses” (methods that appear robust but fail under stronger attacks), the community now relies on rigorous evaluation suites like AutoAttack (Croce & Hein, 2020) to stress-test defenses.
Computational Considerations
- Adversarial training remains the most reliable defense but is resource-intensive. Training robust ImageNet-scale models can be 10–30× more expensive than standard training.
- Certified defenses offer provable guarantees but currently lag behind in scalability and accuracy.
- Hybrid approaches — such as combining PGD training with randomized smoothing — are promising directions for future research.
Why Are Neural Networks Vulnerable to Adversarial Examples?
- The central puzzle of adversarial machine learning is why neural networks are so fragile. Even tiny, human-imperceptible perturbations can completely flip a model’s prediction. While early work attributed this to overfitting or complex non-linear boundaries, research has revealed deeper structural reasons tied to the geometry of high-dimensional spaces and the design of modern neural networks.
Example: Adversarial Attack on Logistic Regression
-
To illustrate, consider a simple logistic regression model:
\[\hat{y} = \sigma(Wx + b)\]where \(\sigma(\cdot)\) is the sigmoid function, \(x \in \mathbb{R}^{n \times 1}\) is the input, and \(W \in \mathbb{R}^{1 \times n}\) is the weight vector. For simplicity, let \(b = 0\) and \(n = 6\).
-
Suppose the weights \(W\) and the input \(x\) are such that the model outputs \(\hat{y} = \sigma(Wx) \approx 0.27\).
-
Thus, the model predicts class \(y=0\) with 73% confidence.
-
Now add a perturbation aligned with the sign of \(W\):
\[x_{adv} = x + \varepsilon \cdot \text{sign}(W)\]
- For \(\varepsilon = 0.4\), the output rises to \(\hat{y} \approx 0.69\).
- Now the model predicts class \(y=1\) with 69% confidence. A minuscule change flipped the decision.
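The sketch below reproduces this effect in NumPy with hypothetical weights and inputs, chosen so the confidences roughly match the ones quoted above; they are not necessarily the values used in the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 6-dimensional weights and input (illustrative values only).
W = np.array([1.0, -1.0, 0.5, -0.5, 1.0, -0.5])
x = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 1.0])

print("original prediction:", sigmoid(W @ x))         # ~0.27 -> class 0 with ~73% confidence

# Perturb every coordinate by epsilon in the direction of sign(W).
epsilon = 0.4
x_adv = x + epsilon * np.sign(W)

# Each coordinate shifts the logit by epsilon * |W_i|, so the total shift is epsilon * ||W||_1 = 1.8.
print("adversarial prediction:", sigmoid(W @ x_adv))  # ~0.69 -> class 1 with ~69% confidence
```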
Key Observations
- Why sign perturbations work: Adding perturbations aligned with \(\text{sign}(W)\) ensures each feature shift contributes positively toward flipping the decision boundary.
- Dimensionality amplifies vulnerability: a perturbation of size \(\varepsilon\) per coordinate shifts the output logit by \(\varepsilon \, ||W||_1\), so the larger the input dimension \(n\), the larger the cumulative effect of small perturbations on the output.
-
Generalization to neural networks: For deep models, the same principle applies, but gradients are taken with respect to the loss:
\[x_{adv} = x + \varepsilon \cdot \text{sign}\left(\nabla_x \mathcal{L}(W, x, y)\right)\]
- This generalization is known as the Fast Gradient Sign Method (FGSM), introduced by Goodfellow et al. (2014).
Why Linearity Makes Networks Fragile
- Early explanations for adversarial examples attributed them to overfitting or non-linear decision boundaries. However, Goodfellow and colleagues showed the opposite: adversarial vulnerability arises from the linear behavior of neural networks.
- Neural networks are intentionally designed to behave in largely linear ways: ReLU activations, LSTMs, and the near-linear regions of \(\tanh\) and sigmoid all favor linear behavior.
- Initialization methods like Xavier initialization and techniques like Batch Normalization encourage operation in near-linear regimes to stabilize training.
- This linearity makes models easy to optimize, but also easy to attack: gradients propagate efficiently, enabling adversaries to exploit them.
- The FGSM attack illustrates this tension clearly: the very property that makes models trainable (gradient-based optimization) is the same property that makes them vulnerable (gradient-based attacks).
Modern Theoretical Perspectives
-
Curvature of decision boundaries
- Moosavi-Dezfooli et al., 2016 showed that adversarial examples exploit the geometry of decision boundaries, which are often close to the data manifold in high dimensions.
-
Universal adversarial perturbations
- Moosavi-Dezfooli et al., 2017 demonstrated that a single perturbation vector can fool a model across many inputs, highlighting the global fragility of neural networks.
-
Robustness–accuracy trade-off
- Tsipras et al., 2019 argued that accuracy and robustness may be inherently at odds: features that maximize accuracy (highly predictive but brittle) may be non-robust, making adversarial vulnerability unavoidable under current training paradigms.
-
Non-robust features hypothesis
- Ilyas et al., 2019 suggested adversarial examples exploit “non-robust features”: patterns invisible to humans but predictive in training data. Neural networks rely on these features, which makes them inherently susceptible.
-
Scaling laws and large models
- Larger pretrained models (e.g., Vision Transformers, multimodal systems like CLIP) show somewhat improved robustness due to richer feature representations, but remain vulnerable to adaptive attacks (Wei et al., 2022).
-
Certified robustness perspective
- Formal frameworks (Wong & Kolter, 2018) aim to prove robustness within defined perturbation bounds, reframing the problem as one of guarantees rather than empirical resilience.
Takeaway
-
Adversarial examples are not accidents of poor training or data noise. They are a structural property of high-dimensional learning systems driven by:
- Linearity in input–output mappings.
- Reliance on predictive but non-robust features.
- Fragile geometric decision boundaries.
- Fundamental trade-offs between accuracy and robustness.
-
This makes adversarial robustness not merely a matter of patching models but a central challenge in the future of reliable AI.
Further Readings
- Adversarial robustness has matured from isolated attacks and defenses into a core research area of trustworthy AI, spanning attacks, defenses, theory, and applications to new modalities.
- For readers seeking deeper exploration, the following works provide structured entry points:
Foundational Work
- Szegedy et al. (2013) — introduced adversarial examples and showed that imperceptible perturbations can fool state-of-the-art classifiers across multiple modalities.
- Goodfellow et al. (2014), Explaining and Harnessing Adversarial Examples — proposed the Fast Gradient Sign Method (FGSM) and argued that vulnerability arises from linearity in high dimensions.
Core Attack and Defense Methods
-
Attacks:
- Carlini & Wagner (2017) introduced optimization-based C&W attacks that bypass many defenses.
- Madry et al. (2018) established PGD attacks as the “universal first-order adversary.”
- Croce & Hein (2020) introduced AutoAttack, a benchmark ensemble that exposes overly optimistic defense claims.
-
Defenses:
- Cohen et al. (2019) proposed randomized smoothing to provide probabilistic certified robustness.
- Wong & Kolter (2018) pioneered certified robustness with convex relaxations.
- Gowal et al. (2019) developed Interval Bound Propagation (IBP) for scalable certification.
- Xie et al. (2020) demonstrated adversarial pretraining and robustness gains in large-scale models.
Survey Papers
- Yuan et al. (2017) — a comprehensive early survey of adversarial examples, attack methods, and defenses.
- Akhtar & Mian (2018) — broad overview of visual adversarial attacks and defenses.
- Kurakin et al. (2018) — practical scenarios of adversarial attacks/defenses and their real-world implications.
- Dong et al. (2020) — survey of black-box adversarial attacks.
Domain-Specific Vulnerabilities
- Text: Ebrahimi et al. (2018) — adversarial perturbations in NLP.
- Speech: Carlini & Wagner (2018) — audio adversarial examples for speech recognition.
- Structured Data: McDaniel et al. (2017) — adversarial ML in tabular and structured domains.
- Vision & multimodal systems: Wei et al. (2022) — adversarial robustness of vision–language models (e.g., CLIP).
Modern Theoretical Perspectives
- Tsipras et al. (2019) — formalized the robustness–accuracy trade-off, suggesting adversarial examples may be unavoidable.
- Ilyas et al. (2019) — adversarial examples exploit non-robust features that are predictive but imperceptible to humans.
- Moosavi-Dezfooli et al. (2016, 2017) — adversarial vulnerability explained via geometry of decision boundaries and universal perturbations.
Emerging Directions
- Adversarial robustness in large pretrained models: Scaling trends suggest robustness improves slightly with larger architectures and datasets but remains a challenge (Xie et al., 2020).
- Robustness in generative models: Attacks on GANs and diffusion models raise new concerns (Wei et al., 2022).
- Adversarial robustness in LLMs: Recent studies highlight prompt-based adversarial attacks on large language models, raising questions about safety and reliability in real-world use (Zou et al., 2023).
- Benchmarking: AutoAttack and robust model leaderboards (e.g., RobustBench) are now essential for evaluating defenses fairly.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020AdversarialAttacks,
title = {Adversarial Attacks and Defenses},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS230: Deep Learning},
year = {2020},
note = {\url{https://aman.ai}}
}