Loss Functions

  • In the last section we talked about linear classifiers, but we did not discuss how to train the parameters of that model to get the best \(W\)'s and \(b\)'s.

  • We need a loss function to measure how good or bad our current parameters are at solving the task at hand. The loss per example can be written as,

    \[Loss_{i^{th} example} = L_i = L(f(X_i,W), Y_i)\]
  • The loss per batch is the average of the per-example losses,

    \[Loss_{batch} = \frac{1}{N} \sum_{i=1}^{N} L_i\]
  • We then look for the parameters that minimize the loss function. This is called optimization.

  • Loss function for a linear SVM classifier:

    • \(L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)\), where the sum runs over all classes except the true class \(y_i\), \(s_j\) is the score of class \(j\), and \(s_{y_i}\) is the score of the true class.
    • We call this the hinge loss.
    • The hinge loss is zero when the true-class score exceeds every other class score by at least the margin. Otherwise it returns a positive value that grows with the size of the margin violation.
    • Example: ![](assets/loss-functions/1.jpg)
      • Given this example, where the true class is cat, we compute the loss of this image as,
      \[\begin{aligned} L &= \max(0, 437.9 - (-96.8) + 1) + \max(0, 61.95 - (-96.8) + 1) \\ &= \max(0, 535.7) + \max(0, 159.75) \\ &= 695.45 \end{aligned}\]
      • The final loss is 695.45, which is large: the cat score (-96.8) is currently the lowest of all classes, whereas it should be the highest. We need to minimize this loss.
    • It is fine to set the margin to 1, although it is technically a hyperparameter too.
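
The hinge loss above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `svm_hinge_loss` and its arguments are hypothetical, and the scores are the three values from the worked example, with the cat (true class) placed at index 0.

```python
import numpy as np


def svm_hinge_loss(scores, true_class, margin=1.0):
    """Multiclass SVM (hinge) loss for a single example."""
    scores = np.asarray(scores, dtype=float)
    # Margin violation of every class relative to the true-class score.
    margins = np.maximum(0.0, scores - scores[true_class] + margin)
    # The true class itself does not contribute to the loss.
    margins[true_class] = 0.0
    return float(margins.sum())


# Scores from the worked example: cat (true class) first, then the other two classes.
example_scores = [-96.8, 437.9, 61.95]
loss = svm_hinge_loss(example_scores, true_class=0)
print(round(loss, 2))  # 695.45
```

Zeroing the true-class entry after the `np.maximum` call keeps the computation vectorized while still skipping the \(j = y_i\) term of the sum.
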
  • If the loss is zero for some parameters \(W\), are those parameters unique? No. Many different parameter settings give zero loss; for example, if \(W\) gives zero loss then so does \(2W\), because doubling the weights only widens the score differences.
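
A quick numerical check of this non-uniqueness, as a sketch under made-up assumptions: the weight matrix `W`, the input `x`, and the helper `hinge_loss_for_weights` are illustrative and do not come from the notes.

```python
import numpy as np


def hinge_loss_for_weights(W, x, true_class, margin=1.0):
    """Hinge loss of one example under a linear classifier with scores W @ x."""
    scores = W @ x
    margins = np.maximum(0.0, scores - scores[true_class] + margin)
    margins[true_class] = 0.0
    return float(margins.sum())


# Illustrative 3-class, 2-feature weights and one input (values are made up).
W = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
x = np.array([3.0, 1.0])

loss_w = hinge_loss_for_weights(W, x, true_class=0)       # scores: [6, 1, -3]
loss_2w = hinge_loss_for_weights(2 * W, x, true_class=0)  # scores: [12, 2, -6]
print(loss_w, loss_2w)  # 0.0 0.0
```

Both weight matrices satisfy every margin, so the loss alone cannot tell them apart.
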

  • You will sometimes hear about people using the squared hinge loss SVM (or L2-SVM) instead, which penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better.
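
To make the difference concrete, here is a small sketch that computes both variants for one example. The helper name `hinge_losses` and the scores are hypothetical, chosen only so that the two losses are easy to compare by hand.

```python
import numpy as np


def hinge_losses(scores, true_class, margin=1.0):
    """Return (hinge loss, squared hinge loss) for a single example."""
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[true_class] + margin)
    margins[true_class] = 0.0
    # Linear version sums the violations; squared version sums their squares.
    return float(margins.sum()), float((margins ** 2).sum())


# Made-up scores; class 0 is the true class, so the violations are 3.0 and 0.5.
linear, squared = hinge_losses([1.0, 3.0, 0.5], true_class=0)
print(linear, squared)  # 3.5 9.25
```

The large violation (3.0) dominates the squared loss, while the small one (0.5) is penalized less than in the linear version.
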

  • We add a regularization term to the loss function so that the learned model does not overfit the training data.

    • \[Loss = L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)\]
    • where \(R\) is the regularizer, and \(\lambda\) is the regularization strength (a hyperparameter).
  • There are different regularization techniques:

| Regularizer | Equation | Comments |
| --- | --- | --- |
| L2 | \(R(W) = \sum_{k}\sum_{l} W_{k,l}^2\) | Sum of all the squared entries of \(W\) |
| L1 | \(R(W) = \sum_{k}\sum_{l} \lvert W_{k,l} \rvert\) | Sum of the absolute values of all entries of \(W\) |
| Elastic Net (L1 + L2) | \(R(W) = \beta \sum_{k}\sum_{l} W_{k,l}^2 + \sum_{k}\sum_{l} \lvert W_{k,l} \rvert\) | Use \(\beta\) to control the L2 component |
| Dropout | N/A | No equation |

  • Regularization prefers smaller \(W's\) over big \(W's\).

  • L2 regularization is also called weight decay. Biases are usually not included in the regularization.
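
The regularizers above and the full regularized loss can be sketched as follows. All function names and numeric values here are hypothetical; `W` holds only weights, with biases left out of the penalty.

```python
import numpy as np


def l2_penalty(W):
    """Sum of squared weights."""
    return float(np.sum(W ** 2))


def l1_penalty(W):
    """Sum of absolute weights."""
    return float(np.sum(np.abs(W)))


def elastic_net_penalty(W, beta):
    """beta * L2 + L1, with beta controlling the L2 component."""
    return beta * l2_penalty(W) + l1_penalty(W)


def regularized_loss(per_example_losses, W, lam, penalty=l2_penalty):
    """Average data loss plus lam * R(W)."""
    data_loss = float(np.mean(per_example_losses))
    return data_loss + lam * penalty(W)


# Made-up weights and per-example losses, for illustration only.
W = np.array([[1.0, -2.0],
              [0.5, 0.0]])
l2 = l2_penalty(W)                       # 1 + 4 + 0.25 = 5.25
l1 = l1_penalty(W)                       # 1 + 2 + 0.5 = 3.5
elastic = elastic_net_penalty(W, 0.5)    # 0.5 * 5.25 + 3.5 = 6.125
total = regularized_loss([2.0, 4.0], W, lam=0.1)  # 3.0 + 0.1 * 5.25
```

Passing the penalty as a function makes it easy to swap L2 for L1 or an elastic-net variant without touching the data-loss term.
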

  • Softmax loss (generalization of logistic regression to more than two classes):

    • Softmax function:

      • \[A[L] = \frac{e^{score[L]}}{\sum_{j=1}^{NumOfClasses} e^{score[j]}}\]
      • The entries of the resulting vector are all positive and sum to 1, so they can be read as class probabilities.
    • Softmax loss:

      • \[Loss = -\log P(Y = y[i] \mid X = x[i])\]
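      • Interpreted as the negative log of the probability assigned to the correct class. We want that probability to be as close as possible to one; since \(\log\) is increasing, maximizing the probability is the same as minimizing its negative log, which is why we add the minus sign.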

      • Softmax loss is also called cross-entropy loss.
    • Consider this numerical problem when you are computing Softmax:

      • import numpy as np
        f = np.array([123, 456, 789]) # example with 3 classes, each having a large score
        p = np.exp(f) / np.sum(np.exp(f)) # Bad: np.exp(789) overflows to inf, so p contains nan
        # instead: first shift the values of f so that the highest number is 0:
        f -= np.max(f) # f becomes [-666, -333, 0]
        p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
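
The shift is safe because subtracting a constant \(c\) from every score multiplies the numerator and the denominator by the same factor \(e^{-c}\), which cancels. Putting the softmax function and the cross-entropy loss together, a minimal sketch could look like this; the function names `softmax` and `cross_entropy_loss` are hypothetical, and the scores reuse the example above.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax of a 1-D score vector."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)  # largest entry becomes 0
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)


def cross_entropy_loss(scores, true_class):
    """-log P(Y = true_class | X), computed in log space for stability."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)
    # -log(softmax) = log(sum(exp(shifted))) - shifted[true_class]
    return float(np.log(np.sum(np.exp(shifted))) - shifted[true_class])


probs = softmax([123, 456, 789])
loss_correct = cross_entropy_loss([123, 456, 789], true_class=2)
loss_wrong = cross_entropy_loss([123, 456, 789], true_class=0)
print(probs.sum(), loss_correct, loss_wrong)  # 1.0 0.0 666.0
```

Working in log space avoids taking the log of a probability that has underflowed toward zero, which is what would happen for the wrong class here.
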


If you found our work useful, please cite it as:

  title   = {Loss Functions},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
  year    = {2020},
  note    = {\url{https://aman.ai}}