• The partial derivatives of the multiclass hinge loss function (which the multiclass SVM uses) with respect to the weight vectors \(w_j\) are:
\[\text{for the correct class, i.e., } j = y_i: \quad \nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i\]
\[\text{for the incorrect classes, i.e., } j \neq y_i: \quad \nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\, x_i\]
  • Starting with the SVM loss function for a single datapoint:
\[L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]\]
  • Differentiating with respect to the weight vector of the correct class, \(w_{y_i}\): each term \(\max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta)\) contributes \(-x_i\) whenever its margin condition is violated and 0 otherwise, so we obtain:

    \[\boxed{\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i}\]
    • where \(\mathbb{1}(\cdot)\) is the indicator function, which is 1 if the condition inside is true and 0 otherwise.
  • Note that this is the gradient only with respect to the row of \(W\) that corresponds to the correct class.
  • Intuitively, this expression counts the number of classes that failed to meet the desired margin (and hence contributed to the loss), and the gradient is the data vector \(x_i\) scaled by the negative of that count.

  • For the other rows, where \(j \neq y_i\), the gradient is:
\[\boxed{\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i}\]
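  • To make both boxed expressions concrete, here is a minimal NumPy sketch for a single datapoint; the function name `svm_loss_and_grad` and the `(C, D)` layout of `W` (one row \(w_j\) per class) are assumptions for illustration, not part of the derivation above:

```python
import numpy as np

def svm_loss_and_grad(W, x, y, delta=1.0):
    """Multiclass SVM (hinge) loss and gradient for one datapoint.

    Assumes W has shape (C, D) with one row w_j per class, x has
    shape (D,), and y is the index of the correct class.
    """
    scores = W @ x                        # s_j = w_j^T x for every class j
    margins = scores - scores[y] + delta  # w_j^T x - w_{y_i}^T x + delta
    margins[y] = 0.0                      # the sum over j excludes j = y_i

    loss = np.sum(np.maximum(0.0, margins))

    # Indicator vector: which classes violated the desired margin.
    violated = (margins > 0).astype(float)

    grad = np.outer(violated, x)          # rows j != y_i: 1(margin > 0) * x_i
    grad[y] = -violated.sum() * x         # row y_i: -(violation count) * x_i
    return loss, grad
```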

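  • As a sanity check, the analytic gradient can be compared against a finite-difference estimate of the loss; the shapes below are arbitrary choices for the example:

```python
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))   # 3 classes, 5 features
x = rng.standard_normal(5)
loss, grad = svm_loss_and_grad(W, x, y=1)

# Perturb one weight and compare the loss change to the analytic entry.
h = 1e-6
W_pert = W.copy()
W_pert[0, 2] += h
loss_pert, _ = svm_loss_and_grad(W_pert, x, y=1)
print(grad[0, 2], (loss_pert - loss) / h)  # the two should nearly agree
```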