Primers • Partial Derivative for SVM/Hinge Loss
The partial derivative of the multiclass hinge loss (the loss that the multiclass SVM uses) with respect to the weights \(w_j\) can be derived as follows. Start with the SVM loss for a single datapoint:
\[L_i = \sum_{j\neq y_i} \max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta)\]
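As a concrete check of the formula above, here is a minimal NumPy sketch of the single-datapoint loss. The function name `svm_loss_single` and the toy shapes are my own for illustration; the document itself does not specify an implementation. It assumes the weights are stored as a matrix \(W\) with one row \(w_j\) per class, consistent with the notation used below.

```python
import numpy as np

def svm_loss_single(W, x, y, delta=1.0):
    """Multiclass hinge loss L_i for one datapoint.

    W: (C, D) weight matrix, one row w_j per class
    x: (D,) input vector
    y: index of the correct class
    """
    scores = W @ x                       # s_j = w_j^T x for each class j
    margins = scores - scores[y] + delta # w_j^T x - w_{y}^T x + delta
    margins[y] = 0.0                     # the j = y_i term is excluded from the sum
    return np.sum(np.maximum(0.0, margins))
```

Each class other than the correct one contributes to the loss only when its score comes within \(\Delta\) of the correct class's score, which is exactly the `max(0, ...)` clamp above.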
Differentiating the function to obtain the gradient with respect to the weights corresponding to the correct class \(w_{y_i}\), we obtain:
\[\boxed{\nabla_{w_{y_i}} L_i = -\left( \sum_{j\neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i}\] where \(\mathbb{1}(\cdot)\) is the indicator function that evaluates to \(1\) if the condition inside is true and \(0\) otherwise.
 Note that this is the gradient only with respect to the row of \(W\) that corresponds to the correct class.
To build some intuition for this expression: you are simply counting the number of classes that failed to meet the desired margin (and hence contributed to the loss), and the gradient is the data vector \(x_i\) scaled by this count, with a negative sign because increasing the correct class's score reduces the loss.
For the other rows, where \(j \neq y_i\), the gradient is:
\[\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i\]
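The two gradient cases above can be sketched together in a few lines of NumPy. This is a minimal illustration under the same assumptions as before (rows of `W` are the \(w_j\), `y` is the correct-class index); the helper name `svm_grad_single` is mine, not from the text.

```python
import numpy as np

def svm_grad_single(W, x, y, delta=1.0):
    """Analytic gradient of the single-datapoint hinge loss w.r.t. W."""
    scores = W @ x
    # Indicator: which classes violate the margin, w_j^T x - w_y^T x + delta > 0
    margin_open = scores - scores[y] + delta > 0
    margin_open[y] = False                # the j = y_i term is excluded
    grad = np.zeros_like(W)
    grad[margin_open] = x                 # rows j != y_i: indicator(...) * x_i
    grad[y] = -margin_open.sum() * x      # correct-class row: -(count) * x_i
    return grad
```

Note how the correct-class row accumulates \(-x_i\) once per margin-violating class, while each violating row \(j\) simply receives \(+x_i\), matching the two boxed expressions.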