Primers • Support Vector Machines (SVM)
- Overview
- Key Concepts
- Loss Function: Hinge Loss
- Kernel / Polynomial Trick
- The Kernel Trick in SVMs
- Common Kernel Functions
- Advantages of the Kernel Trick
- Choosing the Right Kernel
- Applications
- Pros and Cons
- Example: SVM in Classification
- References
- Citation
Overview
- Support Vector Machines (SVMs) are powerful supervised learning algorithms used for both classification and regression tasks. They are widely appreciated for their ability to handle high-dimensional data and for their robust performance, even on complex datasets.
- SVMs were first introduced by Vladimir Vapnik and Alexey Chervonenkis in 1963 as a binary classifier. The goal of SVM is to find the best hyperplane that separates data points of different classes in the feature space. For datasets that are not linearly separable, SVM employs kernel functions to map data into a higher-dimensional space where separation becomes feasible.
Key Concepts
Hyperplane
- A hyperplane is a decision boundary that separates data into different classes. In an \(n\)-dimensional space, the hyperplane is an \((n-1)\)-dimensional subspace.
Equation of a Hyperplane
\(\mathbf{w} \cdot \mathbf{x} + b = 0\)
- where:
- \(\mathbf{w}\): Weight vector (normal to the hyperplane)
- \(\mathbf{x}\): Feature vector
- \(b\): Bias term
- The goal is to maximize the margin between the hyperplane and the nearest data points of any class (support vectors).
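As a quick numerical illustration (the values of \(\mathbf{w}\), \(b\), and \(\mathbf{x}\) below are hypothetical), the sign of \(\mathbf{w} \cdot \mathbf{x} + b\) tells which side of the hyperplane a point falls on, and \(|\mathbf{w} \cdot \mathbf{x} + b| / \|\mathbf{w}\|\) gives its distance to the hyperplane:

```python
import numpy as np

# Hypothetical hyperplane parameters (for illustration only).
w = np.array([2.0, -1.0])   # weight vector, normal to the hyperplane
b = -0.5                    # bias term

x = np.array([2.0, 1.0])    # a feature vector to classify

score = np.dot(w, x) + b                    # value of w . x + b
predicted_class = np.sign(score)            # which side of the hyperplane x falls on
distance = abs(score) / np.linalg.norm(w)   # Euclidean distance from x to the hyperplane

print(predicted_class, distance)
```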
Margin
- The margin is the distance between the hyperplane and the closest data points (support vectors) of either class. A larger margin generally leads to better generalization.
Functional Margin
\(y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1\)
- where \(y_i \in \{-1, +1\}\): True label of the \(i\)-th data point.
Support Vectors
- Support vectors are the data points that lie closest to the hyperplane and influence its orientation and position. These points are critical in defining the margin.
Optimization Problem
- The optimization problem for a linear SVM is:
\[\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2\]
- Subject to:
\[y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad i = 1, \dots, N\]
- This is a convex optimization problem, solvable using techniques such as Lagrange multipliers and quadratic programming.
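In practice, this quadratic program is handled by off-the-shelf solvers. The sketch below is a minimal illustration using scikit-learn's SVC with a linear kernel (a library choice assumed here, not prescribed by the text) on a toy, hypothetical dataset; a very large \(C\) approximates the hard-margin problem, and the fitted model exposes the support vectors and the margin width \(2 / \|\mathbf{w}\|\):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (assumed for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # learned weight vector
b = clf.intercept_[0]              # learned bias term
margin = 2.0 / np.linalg.norm(w)   # geometric margin width, 2 / ||w||

print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b, "margin width =", margin)
```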
Loss Function: Hinge Loss
- The hinge loss function is used to penalize misclassified points and points within the margin:
\(L = \sum_{i=1}^N \max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x}_i + b))\)
- where \(N\) is the total number of samples.
- The objective function then becomes:
\[\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N \max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x}_i + b))\]
- where \(C\) is a regularization parameter controlling the trade-off between maximizing the margin and minimizing the classification error.
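The hinge loss and the soft-margin objective above can be evaluated directly; the sketch below is a minimal numpy illustration with hypothetical values for \(\mathbf{w}\), \(b\), \(C\), and the data:

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Sum of max(0, 1 - y_i (w . x_i + b)) over all samples."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).sum()

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * hinge loss."""
    return 0.5 * np.dot(w, w) + C * hinge_loss(w, b, X, y)

# Hypothetical parameters and data for illustration.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.5], [0.5, 2.0], [1.0, 1.2]])
y = np.array([+1, -1, +1])

print(hinge_loss(w, b, X, y), svm_objective(w, b, X, y, C=1.0))
```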
Kernel / Polynomial Trick
Conceptual Overview
- In classical machine learning, the goal of a classification task is to learn a function \(y = f(x; \theta)\), that is, to fit a model (or a curve) to the data in such a way that a decision boundary separates the classes — for instance, deciding whether to approve or reject loan applications.
- The figure below, adapted from Prithvi Da, visually summarizes different approaches to achieving separability in data.

Linear vs. Non-Linear Separability
- Fig A: When data points are linearly separable, the problem is straightforward — algorithms such as logistic regression or a linear Support Vector Machine (SVM) can effectively fit a linear decision boundary.
- Fig B: However, when the data points are not linearly separable, a simple linear boundary cannot adequately classify the samples. While neural networks are theoretically capable of learning any Borel measurable function, they may be excessive for small or tabular datasets.
- Fig C: Instead, a simpler and more elegant solution involves transforming the input space using a non-linear mapping. This transformation allows us to represent the same data in a higher-dimensional space where it becomes linearly separable. For example, transforming a line \(y = mx + c\) (a first-degree polynomial) into a parabola \(x^2 = 4ay\) (a second-degree polynomial) can make classification feasible in the higher-dimensional feature space.
- This process of implicitly applying a non-linear transformation via a function known as a kernel forms the foundation of what is called the kernel trick. In essence, learning a linear model in a higher-dimensional feature space is equivalent to learning a non-linear model in the original space. In other words, by judiciously transforming the representation of the problem — figuratively “bending” it — one can often achieve a more effective and computationally elegant solution. A minimal sketch of this idea appears below.
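To make the idea concrete, the sketch below uses an assumed toy 1-D dataset in which the positive class sits between two groups of negatives, so no single threshold on \(x\) separates the classes; after explicitly lifting each point to \((x, x^2)\), a linear classifier separates them perfectly:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 1-D data (assumed): class +1 sits between the two groups of class -1,
# so no single threshold on x can separate the classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Explicit non-linear lifting: map x -> (x, x^2).
X_lifted = np.column_stack([x, x ** 2])

# A linear classifier in the lifted space separates the data perfectly,
# which corresponds to a quadratic decision rule in the original space.
clf = LinearSVC(C=10.0).fit(X_lifted, y)
print("training accuracy after lifting:", clf.score(X_lifted, y))
```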
The Kernel Trick in SVMs
- The kernel trick is a cornerstone concept in SVMs that enables them to classify data that is not linearly separable in the original feature space. It achieves this by implicitly mapping data into a higher-dimensional feature space using kernel functions — without explicitly performing the transformation. This allows complex decision boundaries to be computed efficiently.
Why Use the Kernel Trick?
- For data that cannot be linearly separated in its original feature space, a simple hyperplane is insufficient. Instead of manually introducing additional dimensions, the kernel trick allows the SVM to compute the necessary separation in a higher-dimensional space using only inner products, greatly simplifying computation.
The Key Insight
- If \(\phi(\mathbf{x})\) represents the mapping from the input space to a higher-dimensional feature space, then the dot product in that space can be represented as a kernel function \(K\):
\[K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)\]
- This approach eliminates the need to explicitly compute \(\phi(\mathbf{x})\), making training and inference computationally efficient.
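As a small numerical check (the vectors and the feature map below are chosen for illustration), the degree-2 polynomial kernel \((\mathbf{x}^\top \mathbf{z})^2\) equals the dot product of the explicit map \(\phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\), so the lifted vectors never need to be formed when only inner products are required:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (no constant term)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with c = 0: (x . z)^2."""
    return np.dot(x, z) ** 2

# Hypothetical inputs for the check.
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # dot product in the lifted space
implicit = poly_kernel(x, z)        # same value computed via the kernel only

print(explicit, implicit)           # both equal (x . z)^2 = 16.0
```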
Common Kernel Functions
Linear Kernel
\[K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j\]
- Suitable for linearly separable data.
- No explicit transformation is performed.
Polynomial Kernel
\[K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + c)^d\]
- \(c\): A constant controlling flexibility.
- \(d\): The degree of the polynomial.
- Captures polynomial relationships between features and mirrors the principle illustrated in Fig C above, where a polynomial transformation enhances separability.
Radial Basis Function (RBF) or Gaussian Kernel
\[K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)\]
- \(\gamma\): Controls the influence of each data point.
- Maps data into an infinite-dimensional space.
- Highly effective for complex, non-linear relationships.
Sigmoid Kernel
\[K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\alpha \mathbf{x}_i^\top \mathbf{x}_j + c)\]
- Inspired by neural network activation functions.
- Parameters \(\alpha\) and \(c\) govern the decision boundary’s shape.
- Can approximate neural network behavior under specific parameter settings.
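To see these formulas in action, the sketch below (assuming scikit-learn is available) evaluates each kernel on a small hypothetical dataset using sklearn.metrics.pairwise; note that scikit-learn's polynomial and sigmoid kernels include a gamma scaling factor in addition to the constant term:

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel
)

# Small hypothetical dataset: 3 samples, 2 features.
X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.0]])

K_lin = linear_kernel(X)                            # x_i^T x_j
K_poly = polynomial_kernel(X, degree=2, coef0=1.0)  # (gamma * x_i^T x_j + c)^d
K_rbf = rbf_kernel(X, gamma=0.5)                    # exp(-gamma * ||x_i - x_j||^2)
K_sig = sigmoid_kernel(X, gamma=0.1, coef0=0.0)     # tanh(alpha * x_i^T x_j + c)

for name, K in [("linear", K_lin), ("poly", K_poly), ("rbf", K_rbf), ("sigmoid", K_sig)]:
    print(name, "\n", np.round(K, 3))
```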
Advantages of the Kernel Trick
- Efficiency: Avoids explicit computation of the high-dimensional feature space, thereby reducing computational overhead.
- Flexibility: Allows SVMs to handle both linear and highly non-linear datasets effectively.
- Scalability: Can manage very high-dimensional data without significantly increasing memory requirements.
Choosing the Right Kernel
- Linear Kernel: Best for linearly separable data or datasets with many features.
- Polynomial Kernel: Suitable when feature interactions up to a specific degree are significant.
- RBF Kernel: Often the default choice for non-linear datasets due to its flexibility.
- Sigmoid Kernel: Use cautiously; its performance can vary depending on data characteristics.
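In practice, the kernel and its hyperparameters are usually chosen by cross-validation rather than by rule of thumb. The sketch below is one possible setup using scikit-learn's GridSearchCV on a synthetic dataset standing in for real data:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic non-linear data standing in for a real dataset.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Feature scaling matters for RBF and sigmoid kernels, so wrap SVC in a pipeline.
pipe = make_pipeline(StandardScaler(), SVC())

# Candidate kernels and hyperparameters to compare via 5-fold cross-validation.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__degree": [2, 3], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__gamma": [0.1, 1, 10], "svc__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```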
Applications
- Classification Tasks:
- Handwriting recognition (e.g., MNIST dataset)
- Text categorization (e.g., spam filtering)
- Bioinformatics (e.g., cancer detection)
- Regression Tasks:
- Predicting real estate prices.
- Financial forecasting.
- Outlier Detection:
- Detecting anomalies in time series data.
- Fraud detection in financial systems.
Pros and Cons
Pros
- Effective in High Dimensions: SVM works well in datasets with a large number of features.
- Robust to Overfitting: Particularly effective in scenarios with a clear margin of separation.
- Versatility: Through the use of kernels, SVM can handle non-linear relationships.
- Sparse Solutions: The model depends only on the support vectors, reducing computation.
Cons
- Scalability: SVM is computationally expensive for large datasets due to quadratic programming.
- Choice of Kernel: Performance depends heavily on the choice of kernel function and its parameters.
- Interpretability: Non-linear SVM models are harder to interpret than linear models.
- Memory-Intensive: Storing support vectors for large datasets can be memory-intensive.
Example: SVM in Classification
Given Dataset
- Data: \(X = \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N \}\)
- Labels: \(y_i \in \{-1, +1\}\)
Model
- Using the RBF kernel: \(K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)\)
- Optimization: \(\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N \max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x}_i + b))\)
- After training, the prediction for a new point \(\mathbf{x}_{\text{new}}\) is:
\[f(\mathbf{x}_{\text{new}}) = \text{sign}\left(\sum_{i=1}^N \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_{\text{new}}) + b\right)\]
- where \(\alpha_i\) are the Lagrange multipliers obtained during optimization.
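A possible end-to-end version of this example with scikit-learn is sketched below on a synthetic dataset standing in for \(X\) and \(y\); the fitted SVC exposes the support vectors, and its dual_coef_ attribute stores the \(\alpha_i y_i\) terms appearing in the decision function above:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic non-linear data standing in for the dataset X with labels y in {-1, +1}.
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)
y = np.where(y == 0, -1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# RBF-kernel SVM; gamma and C are hyperparameters that would be tuned in practice.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)

# Predictions come from sign(sum_i alpha_i y_i K(x_i, x_new) + b);
# clf.dual_coef_ holds the alpha_i * y_i values for the support vectors.
x_new = X_test[:1]
print("decision value:", clf.decision_function(x_new))
print("predicted label:", clf.predict(x_new))
print("number of support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_test, y_test))
```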
References
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledSVM,
title = {Support Vector Machines},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}