Overview

  • A loss function, sometimes referred to as a cost function or objective function, is the cornerstone of nearly all machine learning algorithms. It provides a quantitative measure of how wrong a model’s predictions are compared to the true outcomes. Without a loss function, there is no way to guide the training process, since the model would not know how to adjust its parameters to improve performance.

  • In supervised learning, the goal is to learn a mapping from inputs to outputs. Suppose we are training a classifier on a dataset of images. Each training example consists of an input image, denoted by \(x_i\), and its corresponding ground-truth label, denoted by \(y_i\). The model, parameterized by weights \(W\) and biases \(b\), produces a vector of scores:

    \[s = f(x_i; W, b)\]
    • where \(s\) is a vector of raw scores (sometimes called logits), one for each possible class. The loss function then compares these predicted scores against the true label \(y_i\).
  • Formally, for a dataset with \(N\) examples, the overall loss is defined as the average of the individual per-example losses:

    \[L = \frac{1}{N} \sum_{i=1}^N L_i(f(x_i; W, b), y_i)\]
    • where:
      • \(L_i\) represents the loss for a single training example.
      • \(f(x_i; W, b)\) are the scores (or probabilities) the model produces for input \(x_i\).
      • The summation averages across all training samples, ensuring that the loss does not scale with dataset size.
  • This average loss, often called the empirical risk, is the value that optimization algorithms attempt to minimize during training (Vapnik, 1998).

Why is the loss function important?

  • Guides optimization: The loss function provides a direction for gradient-based optimization. Gradients are computed with respect to the loss, and these gradients tell us how to update the weights \(W\) and biases \(b\).
  • Encodes model goals: The choice of loss function reflects what we care about. For example, in regression tasks we often use Mean Squared Error (MSE) (Gauss, 1821), while for classification tasks we might use Cross-Entropy (Shannon, 1948) or Hinge loss (Cortes & Vapnik, 1995). Different loss functions induce different training dynamics and learning behaviors.
  • Balances accuracy vs. generalization: Loss functions can include not only data loss (how well we fit the training set) but also additional terms like regularization penalties (Tikhonov, 1963), which encourage simpler models that generalize better.

Binary Classifier (Logistic Regression)

  • Before diving into multiclass settings such as SVM or Softmax, it is useful to first study the binary classification problem, where there are only two possible outcomes (e.g., “cat” vs. “not cat”). Many of the principles in multiclass classification extend from this simpler case.

  • In binary classification, we typically assign labels as \(y_i \in \{0, 1\}\) or equivalently \(y_i \in \{-1, +1\}\). Given an input \(x_i\), the model predicts a scalar score \(s = f(x_i; W, b)\), which is then mapped to a probability that the input belongs to the positive class.

Logistic Regression Model

  • The most common binary classifier is Logistic Regression (Cox, 1958).
  • Logistic regression models the conditional probability of class membership using the sigmoid (logistic) function:

\[P(Y = 1 \mid X = x_i) = \sigma(s) = \frac{1}{1 + e^{-s}}\]

  • where,

    • \(s = W^T x_i + b\) is the linear score,
    • The sigmoid function squashes any real-valued score into the interval \((0, 1)\), making it interpretable as a probability.
  • If \(P(Y=1 \mid X=x_i) > 0.5\), the model predicts the positive class; otherwise, it predicts the negative class.
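
  • As a minimal NumPy sketch of this prediction rule (the function and variable names here are illustrative assumptions, not from the original notes):

```python
import numpy as np

def sigmoid(s):
    """Squash a raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def predict(x, W, b, threshold=0.5):
    """Logistic-regression prediction for a single input vector x."""
    s = W @ x + b                 # linear score s = W^T x + b
    p = sigmoid(s)                # P(Y = 1 | X = x)
    return int(p > threshold), p  # predicted class and its probability
```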

Loss Function: Binary Cross-Entropy

  • To train logistic regression, we minimize the Binary Cross-Entropy loss (also called Log Loss) (Shannon, 1948):

\[L_i = -\Big( y_i \log(\sigma(s)) + (1 - y_i)\log(1 - \sigma(s)) \Big)\]

  • Interpretation:

    • If \(y_i = 1\), the loss reduces to \(-\log(\sigma(s))\), penalizing the model if the predicted probability of the positive class is low.
    • If \(y_i = 0\), the loss reduces to \(-\log(1 - \sigma(s))\), penalizing the model if the predicted probability of the negative class is low.
    • The loss is minimized when the predicted probability matches the true label.
  • For a dataset of \(N\) examples, the total loss is the average:

\[L = \frac{1}{N} \sum_{i=1}^N L_i\]
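
  • As a rough sketch, this averaged Binary Cross-Entropy can be computed in NumPy as follows (the small epsilon guarding against \(\log(0)\) and all names are illustrative assumptions):

```python
import numpy as np

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Average log loss for predicted probabilities and 0/1 labels."""
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    losses = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return losses.mean()

# For the spam example below: sigma(2.0) ~= 0.88 and the true label is 1,
# so binary_cross_entropy(np.array([0.88]), np.array([1])) gives ~0.13.
```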

Geometric View

  • Logistic regression learns a linear decision boundary between the two classes:

\[W^T x + b = 0\]

  • The weights \(W\) define the orientation of the boundary, while \(b\) shifts it.
  • The sigmoid ensures that confidence grows smoothly as points move farther from the boundary.

Worked Example

  • Suppose we want to classify emails as spam (\(y=1\)) or not spam (\(y=0\)) using features like word frequency. If the score for an email is \(s=2.0\):

    • Predicted probability: \(\sigma(2.0) = \frac{1}{1+e^{-2}} \approx 0.88\)

    • If \(y=1\) (spam), loss = \(-\log(0.88) \approx 0.13\) (good prediction).

    • If \(y=0\) (not spam), loss = \(-\log(1 - 0.88) \approx 2.13\) (bad prediction).

Key Properties

  • Probabilistic interpretation: Logistic regression directly models \(P(Y \mid X)\), which is useful for applications requiring uncertainty estimates.
  • Convex optimization: The binary cross-entropy loss is convex in \(W, b\), ensuring global optima can be found efficiently with gradient descent.
  • Simplicity and interpretability: Each weight in \(W\) indicates how strongly a feature contributes to predicting the positive class.
  • Limitations: The linear decision boundary restricts its expressiveness; it cannot separate classes that are not linearly separable. Extensions like polynomial features or kernels can address this.

Softmax Classifier (Multinomial Logistic Regression)

  • The Softmax classifier offers a fundamentally different perspective from the hinge (SVM) loss discussed later in these notes. While SVM loss enforces a margin-based separation, the Softmax classifier interprets scores in a probabilistic framework (Bishop, 2006), providing outputs that can be directly interpreted as probabilities. This makes it especially attractive in tasks where calibrated confidence estimates are important.

  • The idea is to take the unbounded raw scores (logits) produced by the model, and transform them into a normalized probability distribution over all possible classes. This is achieved using the Softmax function.

The Softmax Function

  • Given a score vector \(s = f(x_i; W, b)\) for input \(x_i\), the probability that the input belongs to class \(k\) is:
\[P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}\]
  • The numerator exponentiates the raw score of class \(k\). Larger scores yield larger exponentials, making high-scoring classes more dominant.
  • The denominator sums over exponentials of all class scores, ensuring the output probabilities are normalized and sum to 1.
  • The transformation is smooth and differentiable, making it ideal for gradient-based optimization.
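
  • A common NumPy implementation subtracts the maximum score before exponentiating; this numerical-stability trick is an implementation detail not spelled out above and does not change the result:

```python
import numpy as np

def softmax(scores):
    """Convert a vector of raw class scores into a probability distribution."""
    shifted = scores - np.max(scores)  # stability: avoid overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()
```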

The figure contrasts the philosophies of SVM and Softmax. SVM focuses on ensuring a margin between classes, while Softmax converts scores into probabilities, aiming to maximize the likelihood of the correct class.

Cross-Entropy Loss

  • To train the Softmax classifier, we use the Cross-Entropy Loss, which has roots in information theory (Shannon, 1948). Cross-Entropy Loss compares the predicted probability distribution with the true distribution. For a single example, where the correct class is \(y_i\):
\[L_i = - \log \left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)\]
  • This can be rewritten as:
\[L_i = -s_{y_i} + \log \left( \sum_j e^{s_j} \right)\]
  • where:
    • The first term penalizes the model if the score of the correct class is low.
    • The second term ensures normalization and prevents trivial solutions where all scores go to negative infinity.
  • The loss is minimized when the predicted probability for the correct class approaches 1.

Worked Example

  • Using the same set of scores as in the Multiclass SVM worked example later in these notes (Cat, Frog, Car), we compute the Softmax losses; a short NumPy check follows the calculations:

    • Example 1: Cat (correct score: 3.2) \(L_i = -\log \left( \frac{e^{3.2}}{e^{3.2} + e^{5.1} + e^{-1.7}} \right) = -\log(0.129) = 2.04\)

    • Example 2: Frog (correct score: 1.3) \(L_i = -\log \left( \frac{e^{1.3}}{e^{4.9} + e^{1.3} + e^{2.0}} \right) = -\log(0.026) = 3.65\)

    • Example 3: Car (correct score: 2.2) \(L_i = -\log \left( \frac{e^{2.2}}{e^{-3.1} + e^{2.5} + e^{2.2}} \right) = -\log(0.424) = 0.86\)
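
  • The three calculations above can be verified with a minimal NumPy sketch (class order and scores as listed; results match up to rounding):

```python
import numpy as np

def cross_entropy(scores, correct_idx):
    """Negative log of the softmax probability of the correct class."""
    shifted = scores - np.max(scores)               # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[correct_idx])

print(cross_entropy(np.array([3.2, 5.1, -1.7]), 0))  # Cat  -> ~2.04
print(cross_entropy(np.array([4.9, 1.3, 2.0]), 1))   # Frog -> ~3.68
print(cross_entropy(np.array([-3.1, 2.5, 2.2]), 2))  # Car  -> ~0.86
```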

Key Properties

  • Probabilistic interpretation: Each score is converted into a probability, enabling uncertainty estimates.
  • Smooth gradients: Unlike hinge loss, which is piecewise linear, Softmax + Cross-Entropy provides smooth gradients everywhere, which often helps optimization.
  • Unbounded confidence drive: The loss decreases as the model pushes \(s_{y_i}\) further above others. Unlike hinge loss, there is no plateau — the model is always incentivized to increase its confidence in the correct class.
  • Common in deep learning: Cross-Entropy with Softmax is the standard choice for classification tasks in modern neural networks.

Multiclass SVM Loss (“Hinge Loss”)

  • The Multiclass Support Vector Machine (SVM) loss, commonly known as Hinge loss, is one of the earliest and most influential loss functions for classification tasks (Cortes & Vapnik, 1995). Its central idea is not just to predict the correct class with the highest score, but to do so with a sufficient safety margin. This makes the classifier more robust, since it requires the correct prediction to be confidently separated from incorrect ones.

  • At its core, the hinge loss encourages a decision boundary where the score for the true class is at least a margin \(\Delta\) larger than every incorrect class score. Unlike probabilistic methods (e.g., Softmax), the SVM approach is margin-based, which means it explicitly focuses on geometric separation between classes.

The name “hinge loss” comes from the piecewise linear hinge-shaped curve. The loss remains at zero when the correct class score is sufficiently larger than the others, but increases linearly if the margin is violated.

Formal Definition

  • Suppose the model produces a score vector \(s = f(x_i; W, b)\) for input \(x_i\). Let \(s_{y_i}\) denote the score of the true class and \(s_j\) denote the score for an incorrect class \(j\). The multiclass SVM loss for a single training example is:

    \[L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + \Delta)\]
    • where:
      • The term \(s_j - s_{y_i} + \Delta\) measures how much the incorrect class \(j\) violates the margin requirement.
      • If \(s_{y_i} \geq s_j + \Delta\), the margin is satisfied and the contribution to the loss for that class is zero.
      • If \(s_{y_i} < s_j + \Delta\), the margin is violated and the loss grows linearly with the violation.
  • The margin parameter \(\Delta > 0\) is a constant, usually set to 1. It does not drastically affect the final performance but ensures that predictions are not only correct, but confidently correct.
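
  • A vectorized NumPy sketch of this per-example loss with \(\Delta = 1\) (names are illustrative):

```python
import numpy as np

def multiclass_svm_loss(scores, correct_idx, delta=1.0):
    """Sum of margin violations over all incorrect classes."""
    margins = np.maximum(0, scores - scores[correct_idx] + delta)
    margins[correct_idx] = 0  # the correct class never contributes
    return margins.sum()

# Example 1 in the worked example below: Cat scores [3.2, 5.1, -1.7] -> 2.9
```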

Geometric Intuition

  • Hinge loss reflects the geometric principle of SVMs: maximize the margin between the decision boundary and data points. Intuitively, the classifier prefers solutions where data points are not just classified correctly, but are far away from the boundary that separates classes. This helps reduce sensitivity to noise or small perturbations in the data.

Worked Example

  • To see hinge loss in action, consider the example scores for three classes: Cat, Frog, and Car (with \(\Delta = 1\)). The calculations below show how only the classes that violate the margin contribute to the loss.

  • Example 1: Cat (correct score: 3.2)

    • Loss for Frog: \(\max(0, 5.1 - 3.2 + 1) = 2.9\)
    • Loss for Car: \(\max(0, -1.7 - 3.2 + 1) = 0\)
    • Total: \(L_i = 2.9\)
  • Example 2: Frog (correct score: 1.3)

    • Loss for Cat: \(\max(0, 4.9 - 1.3 + 1) = 4.6\)
    • Loss for Car: \(\max(0, 2.0 - 1.3 + 1) = 1.7\)
    • Total: \(L_i = 6.3\)
  • Example 3: Car (correct score: 2.2)

    • Loss for Cat: \(\max(0, -3.1 - 2.2 + 1) = 0\)
    • Loss for Frog: \(\max(0, 2.5 - 2.2 + 1) = 1.3\)
    • Total: \(L_i = 1.3\)

Key Properties

  • Zero-loss regions: Once the margin is satisfied, hinge loss is zero, meaning the classifier stops “caring” about further increasing \(s_{y_i}\).
  • Piecewise linearity: The loss increases linearly when the margin is violated, creating simple gradients for optimization.
  • Robustness: By demanding confident predictions, hinge loss reduces the risk of misclassification under small perturbations.

Comparing SVM and Softmax

  • Both the Multiclass SVM loss (hinge loss) and the Softmax loss (cross-entropy) are widely used for classification. At first glance, they may appear quite different: one is rooted in geometric margins, while the other is grounded in probability theory. However, both ultimately encourage the model to assign higher scores to the correct class and lower scores to incorrect ones. The key lies in how they enforce this preference.

Conceptual Difference

  • SVM Loss (Margin-Based Objective):
    • The multiclass hinge loss enforces that the correct class score \(s_{y_i}\) is at least \(\Delta\) higher than every incorrect score \(s_j\). Once this margin requirement is satisfied, the loss becomes zero. In other words, hinge loss focuses on relative separation and stops caring about absolute magnitudes beyond the margin.
  • Softmax Loss (Probability-Based Objective):
    • The cross-entropy loss never reaches zero unless the predicted probability of the correct class is exactly 1. It encourages the model to continually increase \(s_{y_i}\) relative to the others, making the predicted probability of the correct class approach 1. This leads to stronger incentives for confidence maximization.

Visual Analogy

  • Imagine a race between athletes (the classes).

  • The SVM says: “As long as my favorite athlete is ahead of everyone else by at least \(\Delta\) seconds, I am satisfied and will not push them to run faster.”
  • The Softmax says: “I always want my favorite athlete to be faster than everyone else — the bigger the gap, the happier I am. There is no upper limit to my encouragement.”

Practical Implications

  • SVM Loss:

    • Good at ensuring clear decision boundaries.
    • Robust once the margin is satisfied (no incentive to over-optimize).
    • Can be less smooth for optimization since its gradient is piecewise.
  • Softmax Loss:

    • Produces calibrated probabilities that can be interpreted as confidence scores.
    • Encourages continual improvement of correct class scores.
    • Smooth and differentiable everywhere, which generally helps optimization with gradient-based methods.

Performance in Practice

  • Interestingly, despite their theoretical differences, both SVM and Softmax classifiers often yield very similar performance on real-world tasks. This is because both losses push the model in the same general direction: to increase \(s_{y_i}\) relative to other scores.

  • However, Softmax loss has become the default choice in deep learning, mainly because:

    1. It provides probabilistic outputs.
    2. Its smooth gradient makes optimization easier.
    3. It integrates seamlessly into neural networks where probabilities are required (e.g., language models, image classifiers).
  • SVM losses, while historically important, are less common in modern deep learning pipelines but remain conceptually useful and are still applied in certain margin-based contexts.

L2 Regularization

  • In machine learning, there is always a tension between fitting the training data perfectly and generalizing to unseen data. A model with too much capacity can memorize the training set, achieving very low training loss but failing to perform well on new inputs — a phenomenon called overfitting. Regularization provides a systematic way to combat overfitting by penalizing overly complex models.

Regularized Loss Function

  • The general form of a regularized loss function is:
\[L = \frac{1}{N} \sum_{i=1}^N L_i(f(x_i; W, b), y_i) + \lambda R(W)\]
  • The data loss (first term) measures how well the model fits the training data.
  • The regularization loss (second term) penalizes the complexity of the model.
  • \(\lambda\) is a hyperparameter that controls the trade-off:

    • Large \(\lambda\) emphasizes regularization, possibly underfitting the data.
    • Small \(\lambda\) reduces the effect of regularization, risking overfitting.
  • This combined objective ensures that the model learns not only to minimize error but also to remain simple and robust.

L2 Regularization (Weight Decay)

  • One of the most widely used regularization techniques is L2 Regularization, also known as weight decay (Hoerl & Kennard, 1970). It penalizes the sum of squared weights:

    \[R(W) = \sum_{k,l,c} (W_{k,l}^c)^2\]
    • where the summation runs over all weights in the model. The effect is to discourage large weight values, nudging the optimizer toward smaller, more evenly distributed weights.

Why does this help?

  • Stability: Large weights can cause the model to react strongly to small changes in the input, making it sensitive and unstable. By penalizing large weights, L2 regularization promotes stability.
  • Smoother decision boundaries: With smaller weights, decision boundaries tend to be less sharp, reducing the risk of overfitting noise.
  • Optimization behavior: During gradient descent, the regularization term adds an additional component to the gradient that continuously pulls weights toward zero. This is why the method is nicknamed “weight decay.”

  • Mathematically, the weight update rule with L2 regularization becomes:

\[W_{new} = W_{current} - \eta \left( \frac{\partial L_{data}}{\partial W} + 2\lambda W \right)\]

  • The second term, \(2\lambda W\), is the gradient of the regularization penalty. It ensures that each step slightly shrinks the weights, even if the data loss gradient is zero.
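
  • As a one-line sketch of this update in NumPy, assuming `data_grad` holds the gradient of the data loss already computed by backpropagation (names are illustrative):

```python
import numpy as np

def sgd_step_with_weight_decay(W, data_grad, lr=0.1, lam=1e-4):
    """Gradient step on the data loss plus the 2*lambda*W decay term."""
    return W - lr * (data_grad + 2 * lam * W)

# Even if data_grad is zero, each step shrinks W by a factor of (1 - 2*lr*lam),
# which is why L2 regularization is nicknamed "weight decay".
```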

Intuition and Example

  • Imagine a neural network that learns to classify cats and dogs. Without regularization, the network might rely heavily on a single feature (say, the presence of whiskers) and assign it a very large weight. While this may work on training images, it might fail on test images with unusual lighting or partial occlusions.

  • L2 regularization discourages such “single-feature dominance” by spreading importance more evenly across features. Instead of relying only on whiskers, the model might also consider ear shape, fur texture, and overall silhouette. This redundancy improves robustness.

Key Properties

  • L2 regularization does not force weights to zero (unlike L1 regularization, which can yield sparse solutions). Instead, it shrinks them smoothly toward smaller values.
  • It is computationally cheap to implement, requiring only an additional summation during training.
  • It is almost universally applied in modern deep learning frameworks as the default regularization method.

Optimization

  • At the heart of machine learning is the process of optimization — the search for model parameters (weights \(W\) and biases \(b\)) that minimize the loss function. The central framework of optimization in machine learning, particularly Gradient Descent, was first formalized in applied mathematics (Cauchy, 1847), later adapted into modern machine learning (Robbins & Monro, 1951).
  • Without optimization, defining a loss function would be pointless, since the model would have no systematic way of improving its predictions.

The Goal of Optimization

  • Formally, we want to solve:

    \[W^{*}, b^{*} = \arg \min_{W, b} L(W, b)\]
    • where \(L(W, b)\) is the loss function over the dataset. The solution corresponds to the parameter values that minimize average loss across all training examples.
  • However, this is rarely a simple task. In deep learning, loss functions are typically non-convex and defined over extremely high-dimensional parameter spaces (millions of parameters for modern networks). This means we cannot solve them analytically; instead, we rely on iterative optimization algorithms.

The Valley Analogy

  • Optimization can be visualized as navigating a mountainous landscape. Each point in the landscape corresponds to a particular choice of parameters, and the altitude represents the loss. The goal is to find the lowest valley (minimum loss).

  • If we are at the top of a hill (poor parameters), the loss is high.
  • If we move in the right direction, the loss decreases.
  • Our challenge is to systematically take steps downhill until we approach a low-loss region.

  • The following figure illustrates how an optimization problem can be pictured as starting high above a valley and looking for a path down to the bottom. Here, optimizing a function means finding the weights and bias that correspond to the minimum of the loss function.

Optimization can be imagined as descending into a valley in the dark. The loss function provides the “altitude,” and the gradient tells us which direction slopes downward most steeply.

Gradients as a Guide

  • The key mathematical tool for optimization is the gradient of the loss function. The gradient is a vector of partial derivatives:
\[\nabla_W L(W) = \left[ \frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial W_2}, \dots \right]\]
  • It points in the direction of steepest ascent (uphill).
  • To minimize the loss, we move in the opposite direction of the gradient.

  • This forms the basis of Gradient Descent, the most widely used optimization algorithm.

Gradient Descent Update Rule

  • For a single weight parameter \(W\), the update rule is:

    \[W_{new} = W_{current} - \eta \frac{\partial L}{\partial W}\]
    • where,
      • \(\eta\) is the learning rate, a hyperparameter controlling the step size.
      • If \(\eta\) is too small, training will be slow and may get stuck in shallow regions.
      • If \(\eta\) is too large, the algorithm may overshoot the minimum and fail to converge.
  • Finding a good learning rate is often one of the most important tasks in practical deep learning.
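
  • As a toy illustration of this update rule, consider minimizing the one-dimensional loss \(L(W) = (W - 3)^2\) (a purely illustrative example, not from the notes):

```python
def gradient_descent_1d(lr=0.1, steps=50):
    """Minimize L(W) = (W - 3)^2 starting from W = 0."""
    W = 0.0
    for _ in range(steps):
        grad = 2 * (W - 3)  # dL/dW
        W = W - lr * grad   # step opposite to the gradient
    return W

print(gradient_descent_1d())  # converges toward the minimizer W* = 3
```

  • With \(\eta = 0.1\) the iterates converge smoothly; a much larger learning rate (e.g., \(\eta = 1.1\)) makes the same loop diverge, illustrating the overshooting problem described above.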

Challenges in Optimization

  • Local minima and saddle points: Non-convex loss landscapes may contain many low-lying regions. In practice, however, local minima are usually not a serious issue; saddle points (flat regions where gradients vanish) are more problematic.
  • Vanishing/exploding gradients: In very deep networks, gradients can shrink toward zero or grow uncontrollably, making optimization unstable. Special architectures and initialization strategies mitigate this.
  • High dimensionality: With millions of parameters, optimization must balance efficiency and stability, requiring advanced techniques like adaptive learning rates (Adam, RMSProp, etc.).

Optimization, together with well-defined loss functions, forms the engine of learning. While the loss tells us what to improve, optimization tells us how to improve it.

Gradient Descent Variants

  • Gradient Descent is the cornerstone of optimization in machine learning. While the basic idea — updating parameters in the opposite direction of the gradient — is simple, there are several practical ways to compute and apply gradients. These approaches trade off accuracy of gradient estimation against computational efficiency.

Batch Gradient Descent

  • In Batch Gradient Descent, the gradient is computed using the entire training dataset of size \(N\):
\[L(W) = \frac{1}{N} \sum_{i=1}^N L_i(x_i, y_i; W)\]
  • The update rule becomes:
\[W_{new} = W - \eta \nabla_W L(W)\]
  • Advantages:

    • Provides the exact gradient of the loss over the dataset.
    • Guarantees a stable and smooth path toward convergence.
  • Disadvantages:

    • Extremely expensive for large datasets (millions of examples).
    • Requires storing the entire dataset in memory, which is often impractical.
  • For this reason, pure batch gradient descent is rarely used in modern deep learning.

Stochastic Gradient Descent (SGD)

  • Stochastic Gradient Descent improves computational efficiency by computing the gradient using only one randomly chosen training example:
\[L_i(W) = L(x_i, y_i; W)\]
  • Update rule:
\[W_{new} = W - \eta \nabla_W L_i(W)\]
  • Advantages:

    • Very fast updates — only one example is needed at a time.
    • Introduces randomness, which helps escape saddle points and poor local minima.
  • Disadvantages:

    • The gradient is a noisy approximation of the true gradient.
    • The optimization path zig-zags heavily, which may slow convergence.
  • Despite its noisy updates, SGD has been incredibly influential because of its scalability to large datasets.

  • The figure below illustrates the calculation of partial derivatives to approximate the gradient. In practice, SGD computes gradients from only one or a few data points at a time, introducing randomness but keeping computations efficient.

Mini-Batch Gradient Descent

  • Mini-Batch Gradient Descent is the most widely used compromise between batch and stochastic approaches. Instead of using the full dataset or just a single example, it uses a small batch of size \(M\):
\[L(W) = \frac{1}{M} \sum_{i=1}^M L_i(x_i, y_i; W)\]
  • Update rule:
\[W_{new} = W - \eta \nabla_W L(W)\]
  • Advantages:

    • Balances stability (less noisy than SGD) with efficiency (faster than full batch).
    • Exploits hardware parallelism (GPUs and TPUs handle matrix operations on batches efficiently).
    • Improves generalization by adding a mild amount of stochasticity.
  • Disadvantages:

    • Still involves hyperparameter tuning for batch size and learning rate.
  • Typical mini-batch sizes range from 32 to 256, depending on memory capacity and the problem.
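
  • A schematic mini-batch loop in NumPy; `compute_grad` stands in for whatever gradient routine the model uses and is an assumption for illustration:

```python
import numpy as np

def minibatch_sgd(X, y, W, compute_grad, lr=0.01, batch_size=64, epochs=10):
    """Shuffle the data each epoch and update W one mini-batch at a time."""
    N = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(N)             # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            grad = compute_grad(X[idx], y[idx], W)  # gradient on this batch
            W = W - lr * grad
    return W
```

  • Setting `batch_size=1` recovers pure SGD, while `batch_size=N` recovers batch gradient descent.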

Comparing the Variants

  • Batch GD: Precise but impractical for very large datasets.
  • SGD: Very fast, noisy, but good for escaping bad regions.
  • Mini-Batch GD: The standard in deep learning — stable, efficient, and scalable.

  • In modern practice, when people say “SGD,” they usually mean Mini-Batch SGD, often enhanced with advanced techniques such as momentum, adaptive learning rates, or gradient clipping.

The Optimization Loop

  • Training a neural network is an iterative process that repeatedly adjusts parameters to minimize the loss. This process is structured as a loop, where each cycle gradually moves the model toward better performance. The combination of forward passes, loss computation, backward passes, and parameter updates forms the backbone of all modern deep learning training routines.

  • The optimization loop is where loss functions and optimization algorithms come together. The loss tells the model what direction to move in, while the optimization algorithm decides how to take the steps. Iterating this process across epochs enables the model to transform raw parameters into a solution that captures meaningful patterns in the data.

The General Training Procedure

  1. Initialize Parameters

    • Start with random (or carefully chosen) values for weights \(W\) and biases \(b\).
    • Initialization matters: poor initialization can lead to vanishing/exploding gradients or slow convergence.
  2. Loop Over Epochs: An epoch is one complete pass over the entire training dataset. Training typically involves many epochs.

    • Shuffle Data: Randomizing the order of training examples ensures that mini-batches are representative and prevents the model from overfitting to fixed sequences of data.

    • Loop Over Mini-Batches: For each mini-batch of size \(M\):

      • Forward Pass: Compute model scores \(s = f(x; W, b)\)
        • where \(x\) is the current batch input.
      • Loss Calculation: Evaluate the loss over the mini-batch \(L = \frac{1}{M} \sum_{i=1}^M L_i(f(x_i; W, b), y_i)\)

      • Backward Pass: Compute gradients of the loss w.r.t. parameters using backpropagation.
        • This provides \(\nabla_W L\) and \(\nabla_b L\).
      • Parameter Update: Adjust weights using an update rule such as gradient descent: \(W_{new} = W - \eta \nabla_W L\)
  3. Validation Step: After each epoch (or periodically), evaluate the model on a validation set (data not seen during training).

    • Helps monitor generalization.
    • Prevents overfitting by identifying when to stop training (early stopping).
  • The figure illustrates gradient descent in action. At each step, we compute the gradient of the loss at the current point and update parameters by moving in the opposite direction. The learning rate \(\alpha\) (or \(\eta\)) determines the step size.
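
  • The steps above can be tied together in a skeleton like the following; `forward`, `loss_fn`, `backward`, and `evaluate` are placeholder functions assumed for illustration, not an API defined in these notes:

```python
import numpy as np

def train(X, y, X_val, y_val, params, forward, loss_fn, backward, evaluate,
          lr=0.01, batch_size=64, epochs=20):
    """A sketch of the optimization loop: forward pass, loss, backward pass, update."""
    N = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(N)                       # shuffle data
        for start in range(0, N, batch_size):                 # loop over mini-batches
            idx = perm[start:start + batch_size]
            scores = forward(X[idx], params)                  # forward pass
            loss = loss_fn(scores, y[idx])                    # loss calculation
            grads = backward(scores, y[idx], X[idx], params)  # backward pass
            for k in params:                                  # parameter update
                params[k] -= lr * grads[k]
        val_acc = evaluate(X_val, y_val, params)              # validation step
        print(f"epoch {epoch}: loss {loss:.3f}, val acc {val_acc:.3f}")
    return params
```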

Key Considerations in the Loop

  • Learning Rate Scheduling: Keeping a fixed learning rate can be inefficient. Many training pipelines reduce the learning rate gradually (e.g., exponentially decay, cosine annealing, or step schedules).

  • Number of Epochs: Training too few epochs leads to underfitting; too many leads to overfitting. The right balance depends on validation performance.

  • Regularization During Training: Techniques like dropout, batch normalization, and weight decay are often integrated into the loop to improve generalization.

  • Checkpointing: Periodically saving model weights ensures progress is not lost and allows resuming training.

Image Features

  • Before the deep learning revolution, computer vision systems relied heavily on handcrafted features rather than learning directly from raw pixel data. The central idea was to transform raw pixel intensities into more structured and meaningful representations that a classifier (such as an SVM or logistic regression) could use effectively.
  • Ultimately, while handcrafted features like Color Histograms (Swain & Ballard, 1991), HOG (Dalal & Triggs, 2005), SIFT (Lowe, 2004), and Bag of Visual Words (Csurka et al., 2004) have dominated computer vision for decades, they were gradually replaced by deep neural networks, which can learn features automatically from data. Deep features are often more abstract, hierarchical, and task-specific, resulting in dramatically better performance.

Why Extract Features?

  • Raw images are simply grids of pixel values. For instance, a grayscale image of size \(100 \times 100\) contains 10,000 intensity values, while an RGB image triples this number. These values alone are not very informative for a linear classifier, since small changes in lighting, rotation, or viewpoint could drastically alter pixel values while the semantic meaning (e.g., a cat) remains the same.

  • Image features aimed to solve this by capturing invariant properties of images:

    • Edges and contours that describe shapes.
    • Color distributions that capture the overall appearance.
    • Texture patterns that characterize surfaces (e.g., fur, grass).
  • The figure illustrates how feature extraction transforms raw data into a feature space where classification becomes easier. On the left, raw pixel intensities may not be linearly separable. On the right, with meaningful features, a simple linear classifier can separate the classes effectively.

Examples of Handcrafted Features

  1. Color Histograms: Capture the distribution of colors across an image. They ignore spatial arrangement but provide a robust summary of appearance.

  2. Histogram of Oriented Gradients (HOG): Encodes edges and gradient directions, useful for recognizing shapes and outlines (e.g., pedestrians, animals).

  3. Bag of Visual Words (BoW): Treats local features as “visual words” and builds a histogram representation, analogous to how text documents are modeled in natural language processing.

Strengths of Handcrafted Features

  • Computationally efficient to calculate.
  • Often invariant to basic transformations like scaling, translation, or rotation.
  • Effective for traditional machine learning classifiers.

Limitations

  • Task-specific: Features had to be carefully designed for each application. A feature that works well for face detection might fail for vehicle recognition.
  • Limited expressiveness: Handcrafted features cannot easily capture higher-level semantic concepts (e.g., “this is a cat sitting on a chair”).
  • Engineering burden: Designing, tuning, and combining features required significant human expertise.

Color Histogram

  • A Color Histogram is one of the simplest yet most widely used feature descriptors in early computer vision. It captures the distribution of colors in an image, essentially summarizing which colors are present and in what proportion. Importantly, it disregards the spatial arrangement of colors — only their frequency matters.
  • While limited in expressive power, color histograms represented an important step in bridging raw pixel data and useful visual features. They are often combined with other descriptors (like texture or edges) to form richer representations.

How It Works

  1. Define Color Bins

    • The color range (e.g., 0–255 for each RGB channel) is divided into discrete bins.
    • For instance, splitting each channel into 16 bins results in a total of \(16 \times 3 = 48\) features.
  2. Count Pixels

    • For each pixel in the image, determine which bin its value falls into.
    • Increment the count for that bin.
  3. Normalize

    • Normalize the histogram so that it becomes invariant to image size.
    • This ensures a small thumbnail and a large high-resolution version of the same image yield comparable histograms.
  • The figure illustrates feature extraction using color histograms. Each pixel contributes to bins corresponding to Red, Green, and Blue intensities. The final histogram encodes the overall color distribution of the image.
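
  • A minimal sketch for an RGB image stored as an \(H \times W \times 3\) array, using 16 bins per channel (48 features in total, matching the example above):

```python
import numpy as np

def color_histogram(image, bins_per_channel=16):
    """Concatenated, normalized per-channel histograms of an RGB image."""
    features = []
    for c in range(3):  # R, G, B channels
        counts, _ = np.histogram(image[:, :, c],
                                 bins=bins_per_channel, range=(0, 256))
        features.append(counts)
    hist = np.concatenate(features).astype(float)
    return hist / hist.sum()  # normalization makes it invariant to image size
```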

Strengths

  • Simplicity: Very fast to compute and easy to implement.
  • Invariant to rotation and scaling: Since only color counts are considered, transformations like rotation do not change the histogram.
  • Effective for coarse tasks: Works well in distinguishing broad categories like “day vs. night,” “indoor vs. outdoor,” or identifying objects with distinctive color schemes (e.g., sky, vegetation, road).

Weaknesses

  • No spatial information: A picture of a blue sky and a blue car on a road could have similar color histograms, even though the content is very different.
  • Sensitive to illumination: Lighting changes can shift color distributions, making histograms less reliable under varied conditions.
  • Limited discriminative power: Different objects with similar color palettes (e.g., a red apple and a red car) may be indistinguishable using histograms alone.

Use Cases

  • Image retrieval: Searching databases for images with similar color distributions.
  • Scene classification: Differentiating between categories like “forest,” “desert,” or “ocean.”
  • Tracking in videos: Tracking a colored object (e.g., a red ball) by maintaining its histogram across frames.

Histogram of Oriented Gradients (HOG)

  • The Histogram of Oriented Gradients (HOG) is a more sophisticated feature descriptor than color histograms, designed to capture the shape and structure of objects within an image. Instead of focusing on colors, HOG encodes the distribution of edge directions and gradient intensities. This makes it especially effective for object detection, since the outline and silhouette of objects are often more discriminative than their color.
  • HOG represented a significant leap forward in feature engineering. By focusing on local edge structures, it provided classifiers with rich information about object geometry. However, like color histograms, HOG has largely been replaced by features automatically learned by deep convolutional networks — which can capture not only edges but also progressively higher-level concepts.

How It Works

  1. Divide the Image into Cells

    • The image is split into small regions, typically \(8 \times 8\) or \(16 \times 16\) pixels.
    • Each cell will produce its own local histogram of gradient directions.
  2. Compute Gradients

    • For every pixel in a cell, compute the horizontal and vertical gradients: \(G_x = I(x+1, y) - I(x-1, y)\) \(G_y = I(x, y+1) - I(x, y-1)\)

    • From these, calculate gradient magnitude and orientation: \(\text{magnitude} = \sqrt{G_x^2 + G_y^2}, \quad \text{orientation} = \arctan\left(\frac{G_y}{G_x}\right)\)

  3. Bin the Orientations

    • Divide orientations into bins (e.g., 9 bins for \(0^\circ\)–\(180^\circ\)).
    • Each pixel votes for a bin based on its gradient orientation, weighted by its gradient magnitude.
  4. Normalize Blocks

    • To reduce the effect of illumination or contrast changes, group neighboring cells into larger blocks (e.g., \(2 \times 2\) cells).
    • Normalize the histograms within each block.
  5. Concatenate Histograms

    • Finally, concatenate histograms from all blocks into a long feature vector that represents the entire image.
  • The figure illustrates how gradient orientations are extracted and binned into histograms. Unlike color histograms, which focus on pixel intensity distributions, HOG emphasizes local edge patterns that describe object shapes.
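
  • A simplified sketch of steps 2–3 for a single grayscale cell (block normalization and the full detector pipeline are omitted; this is a toy version for illustration):

```python
import numpy as np

def cell_hog(cell, n_bins=9):
    """Magnitude-weighted orientation histogram (0-180 degrees) for one cell."""
    cell = np.asarray(cell, dtype=float)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]             # horizontal gradients G_x
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]             # vertical gradients G_y
    magnitude = np.sqrt(gx**2 + gy**2)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180   # unsigned orientation
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0, 180),
                           weights=magnitude)            # each pixel votes
    return hist
```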

Strengths

  • Captures shape information: By encoding edges and gradients, HOG provides a strong description of object outlines and silhouettes.
  • Illumination robustness: Normalization makes the feature less sensitive to changes in lighting or contrast.
  • Proven effectiveness: HOG was famously used in the Dalal–Triggs pedestrian detector (2005), which set a new benchmark for detecting people in images.

Weaknesses

  • Scale sensitivity: Objects of different sizes may produce different HOG representations unless scale-invariant techniques are added.
  • Computational cost: More expensive to compute than simple color histograms, especially for high-resolution images.
  • No semantic understanding: Like other handcrafted features, HOG captures low-level patterns but not higher-level concepts.

Use Cases

  • Pedestrian detection: One of the most famous applications of HOG, enabling robust detection of people in surveillance and driving scenes.
  • Object detection in general: Applied to vehicles, animals, and other structured objects where shape matters more than color.
  • Texture recognition: Useful in differentiating surfaces with distinct edge distributions (e.g., bricks vs. grass).

Bag of Words for Visual Recognition

  • The Bag of Words (BoW) model is an approach that draws inspiration from natural language processing. In text analysis, a document can be represented as a histogram of word frequencies, disregarding the order of words but retaining their presence and relative importance. Similarly, in computer vision, the BoW model represents an image as a histogram of “visual words,” constructed from local feature descriptors.
  • This method was a major advance before deep learning, as it allowed large image collections to be represented and compared efficiently.
  • Although highly influential, BoW ultimately reached a performance ceiling due to its loss of spatial structure. Deep learning later surpassed it by automatically learning hierarchical features that preserved both local and global context. Nevertheless, BoW remains an important historical step in computer vision and is still used in some lightweight or resource-constrained applications.

The Process

  1. Feature Extraction

    • Local features (e.g., corners, edges, or distinctive patches) are extracted from each image.
    • Common algorithms include SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features), which generate descriptors for image keypoints.
  2. Visual Vocabulary Construction

    • Combine local descriptors from many training images.
    • Use clustering algorithms (commonly K-means) to group descriptors into clusters.
    • Each cluster center becomes a visual word in the vocabulary.
    • The vocabulary size \(K\) (typically hundreds or thousands) is a hyperparameter that controls granularity.
  3. Mapping Features to Words

    • For a new image, each extracted local descriptor is matched to the nearest cluster center (visual word).
    • This step “quantizes” continuous descriptors into discrete categories.
  4. Histogram Creation

    • Count the occurrences of each visual word in the image.
    • This count histogram becomes the image’s BoW feature vector.
  • The figure illustrates the Bag of Words pipeline. Local features are clustered into a “visual vocabulary,” and each image is then represented as a histogram of visual word occurrences. This transforms raw local descriptors into a fixed-length vector suitable for classifiers.
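
  • A schematic version of the pipeline using K-means from scikit-learn; `descriptor_sets` is assumed to hold local descriptors (e.g., SIFT) already extracted from the training images:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, K=500):
    """Cluster all local descriptors from the training set into K visual words."""
    all_descriptors = np.vstack(descriptor_sets)
    return KMeans(n_clusters=K, n_init=10).fit(all_descriptors)

def bow_histogram(descriptors, vocabulary):
    """Map each descriptor to its nearest visual word and count occurrences."""
    words = vocabulary.predict(descriptors)  # quantize descriptors to words
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()                 # fixed-length, normalized vector
```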

Strengths

  • Fixed-length representation: Regardless of image size or number of features, each image is represented as a \(K\)-dimensional vector.
  • Translation and rotation invariance: Since spatial arrangement is ignored, shifting or rotating objects does not significantly change the histogram.
  • Scalability: Enables efficient indexing and retrieval of large image databases.

Weaknesses

  • Loss of spatial information: The arrangement of features is discarded. For example, the histogram cannot distinguish between a face (eyes above nose above mouth) and a scrambled arrangement of the same parts.
  • Choice of vocabulary size: Too few clusters = overly coarse representation; too many clusters = overfitting and computational inefficiency.
  • Dependence on feature extractors: The quality of BoW representations relies heavily on the robustness of the underlying local descriptors (e.g., SIFT).

Use Cases

  • Image classification: BoW histograms can be fed into classifiers like SVMs for object recognition.
  • Image retrieval systems: Efficiently matching histograms enables searching for visually similar images in large datasets.
  • Scene recognition: Used for distinguishing categories like “beach,” “forest,” or “city” where global arrangements are less important than recurring local patterns.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020LossFunctions,
  title   = {Loss Functions},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}