The cross-entropy loss is a commonly used loss function in neural network training. One of the key pieces of building an autograd engine is writing functions that know how to compute derivatives of outputs w.r.t. inputs (see minitorch’s backward). There are already a number of good articles on the web on how to derive this for the cross-entropy loss; this one is just me working it out all over again and leaving a note to my future self.

Notations

This article considers the scenario where there are \(d\) output classes. A neural network (or some other model) generates \(d\)-dimensional logits at the output layer, on which the softmax activation is applied to generate probabilities.

In the derivations ahead, I will use the following notations at times to simplify writing. The softmax function is denoted by \(\mathbf{\sigma}(\mathbf{x}): \mathbb{R}^d \rightarrow \mathbb{R}^d\) for \(\mathbf{x} \in \mathbb{R}^d\). Softmax converts logits \(x_i\) to probabilities \(p_i\).

\[\begin{align*} \mathbf{\sigma}(\mathbf{x}) &= \{ p_i \}_{i=1}^{d} \\ p_i &= \dfrac{e^{x_i}}{\sum_{j=1}^{d}{e^{x_j}}} \end{align*}\]
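As a quick sanity check, here is a small worked example for \(d = 3\) and \(\mathbf{x} = (1, 2, 3)\), with values rounded to three decimals:

\[\begin{align*} \sum_{j=1}^{3} e^{x_j} &\approx 2.718 + 7.389 + 20.086 = 30.193 \\ \mathbf{\sigma}(\mathbf{x}) &\approx (0.090, 0.245, 0.665) \end{align*}\]

The outputs are non-negative and sum to one, as probabilities should.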

Cross Entropy

The cross-entropy loss for multi-class classification is formulated as follows, applying a few more operations on the output of softmax. Here \(\mathbf{y}\) is the one-hot target vector with \(y_c = 1\) for the true class \(c\) and \(y_i = 0\) otherwise, which is why only the true-class term survives:

\[\begin{align*} \mathrm{CE}(\mathbf{x}, \mathbf{y}) &= - \sum_{i=1}^{d}{y_i \log p_i} \\ &= -\left[ y_c \log p_c + \sum_{ i=1, i\neq c}^{d}{y_i \log p_i} \right]\\ & = -y_c \log p_c \\ &= - y_c \log \left(\dfrac{e^{x_c}}{\sum_{j=1}^{d}{e^{x_j}}}\right) \\ \end{align*}\]
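Continuing the small example from above with the true class \(c = 3\), i.e. \(\mathbf{y} = (0, 0, 1)\):

\[\begin{align*} \mathrm{CE}(\mathbf{x}, \mathbf{y}) = -\log p_3 \approx -\log(0.665) \approx 0.408 \end{align*}\]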

Derivative

The partial derivative w.r.t. a component of \(\mathbf{x}\), denoted \(x_k\), can be computed as below. \(p_c\) and \(y_c\) denote the predicted probability and the target value for the true class \(c\). By design \(y_c = 1\), since it corresponds to the true label/class, but it is kept as \(y_c\) for readability here.

\[\begin{align*} \dfrac{\partial\mathrm{CE}(\mathbf{x}, \mathbf{y})}{\partial{x_k}} &= - y_c \dfrac{\partial{\log p_c }}{\partial x_k} \\ &= - y_c \left(\dfrac{\partial{\log p_c }}{\partial p_c}\right) \left(\dfrac{\partial p_c }{ \partial x_k}\right) \\ &= - y_c \left(\dfrac{1}{p_c}\right) \left(\dfrac{\partial p_c}{\partial x_k}\right) \\ \end{align*}\]

Next is expanding \(\partial p_c/\partial x_k\). It is worth noting that \(p_c\) is an output of the softmax function, so we also happen to be computing the derivative of the softmax function on vector inputs here.

There are two cases for the derivative: \(k = c\) and \(k \neq c\). In the first case the numerator \(e^{x_c}\) depends on the variable \(x_k\), requiring the product/quotient rule; in the second the numerator can be treated as a constant.

For the case when \(k = c\), applying the product rule to \(p_c = e^{x_c} \cdot \left(\sum_{j=1}^{d}{e^{x_j}}\right)^{-1}\) gives:

\[\begin{align*} \dfrac{\partial p_c}{\partial x_c} &= e^{x_c} \dfrac{-1}{\left(\sum_{j=1}^{d}{e^{x_j}}\right)^2} \cdot e^{x_c} + e^{x_{c}} \left(\dfrac{1}{\sum_{j=1}^{d}{e^{x_j}}}\right) \\ &= \dfrac{e^{x_c}}{\left(\sum_{j=1}^{d}{e^{x_j}}\right)} \left[ 1 - \dfrac{e^{x_c}}{\left(\sum_{j=1}^{d}{e^{x_j}}\right)} \right] \\ &= p_c \cdot (1 - p_c) \\ \end{align*}\]

In the case when \(k \neq c\) we obtain:

\[\begin{align*} \dfrac{\partial p_c}{\partial x_k} &={e^{x_c}} \cdot \dfrac{-1}{\left(\sum_{j=1}^{d}{e^{x_j}}\right)^2} \cdot e^{x_k} \\ &= \dfrac{e^{x_c}}{\left(\sum_{j=1}^{d}{e^{x_j}}\right)} \dfrac{-e^{x_k}}{\left(\sum_{j=1}^{d}{e^{x_j}}\right)} \\ &= - p_c \cdot p_k \end{align*}\]

We can consolidate the result as:

\[\begin{align*} \dfrac{\partial p_c}{\partial{x_k}} &= \begin{cases} p_c (1 - p_c) & k = c \\ - p_c p_k & k \neq c \end{cases} \end{align*}\]

Substituting this back into the chain-rule expression \(-y_c \frac{1}{p_c} \frac{\partial p_c}{\partial x_k}\) with \(y_c = 1\), the \(p_c\) factors cancel and we get:

\[\begin{align*} \dfrac{\partial\mathrm{CE}(\mathbf{x}, \mathbf{y})}{\partial{x_k}} &= \begin{cases} (p_k - 1) & k = c \\ p_k & k \neq c \end{cases} \end{align*}\]

For computational convenience without branching, we can consolidate this further as:

\[\begin{align*} \dfrac{\partial\mathrm{CE}(\mathbf{x}, \mathbf{y})}{\partial{x_k}} &= p_k - \mathbb{I}[k=c] \end{align*}\]
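Plugging in the running example (\(\mathbf{p} \approx (0.090, 0.245, 0.665)\), true class \(c = 3\)), the per-component derivatives work out to:

\[\begin{align*} \left(\dfrac{\partial\mathrm{CE}}{\partial x_1}, \dfrac{\partial\mathrm{CE}}{\partial x_2}, \dfrac{\partial\mathrm{CE}}{\partial x_3}\right) \approx (0.090,\ 0.245,\ 0.665 - 1) = (0.090,\ 0.245,\ -0.335) \end{align*}\]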

Matrix and vector notations

Looking closely, one can observe that the derivatives above can be written in vector/matrix notation. The cross-entropy function maps a vector input to a scalar loss, i.e., \(\mathrm{CE}: \mathbb{R}^d \rightarrow \mathbb{R}\). Because of this, the derivative can be represented as a \(d \times 1\) matrix, which is really just a vector. In the literature this is often denoted by prefixing a \(\nabla\).

\[\begin{align*} \nabla_{\mathbf{x}}{\mathrm{CE}} &= \left[ p_k - o_k\right]_{k=1}^{d}, & o_k = \mathbb{I}[k = c] \end{align*}\]

The derivative of the softmax function w.r.t. its inputs forms a \(d \times d\) matrix, since softmax is a vector-valued function of a vector, i.e., \(\mathbf{\sigma}: \mathbb{R}^d \rightarrow \mathbb{R}^d\). The matrix of partial derivatives of each output component w.r.t. each input component forms the Jacobian, often denoted \(\mathbf{J}\). The Jacobian is not particularly pleasant to unroll here, but from the two indices in the case analysis above it should be evident that it forms a matrix.
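For reference, the two cases above can be folded into a single expression for the entries of this Jacobian; nothing in that derivation was specific to the true class \(c\), so the same form holds for any output component \(i\):

\[\begin{align*} \mathbf{J}_{ik} = \dfrac{\partial p_i}{\partial x_k} = p_i \left( \mathbb{I}[i = k] - p_k \right) \end{align*}\]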

This notation is what you’ll find in Eli Bendersky’s post and CS224n’s Vectorized Gradients, both of which I found useful. At the time of writing I just preferred to work things out by hand with more basic math (the kind I learned in school), owing to my lack of familiarity with the multivariate-calculus vocabulary used in places.

Numerical Stability

While implementing this, one has to be mindful of numerical stability. This is why some computations of the softmax function look slightly more complicated. Take a look at marian’s CrossEntropy backward, which computes the softmax probabilities for the derivative, for example.

The usual trick is to multiply the numerator and denominator in \(p_i\) by a constant, making the exponentials \(e^{x}\) more amenable to floating-point computation.

\[\begin{align*} p_i &= \dfrac{e^{x_i}}{\sum_{j=1}^{d}{e^{x_j}}} \\ &= \dfrac{e^{x_i}}{\sum_{j=1}^{d}{e^{x_j}}} \times \dfrac{e^{-M}}{e^{-M}} \\ &= \dfrac{e^{(x_i - M)}}{\sum_{j=1}^{d}{e^{(x_j - M)}}} \end{align*}\]

If we choose \(M\) to be the maximum among the \(x_i\), every exponent \(x_j - M\) is non-positive, so the exponentials stay in \((0, 1]\) and we avoid overflow:

\[\begin{align*} M = \max \{ x_i \} \end{align*}\]
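To see why this helps, consider logits like \(\mathbf{x} = (1000, 1001, 1002)\): \(e^{1000}\) already overflows double precision, but after subtracting \(M = 1002\) every term is well behaved:

\[\begin{align*} p_i = \dfrac{e^{(x_i - 1002)}}{e^{-2} + e^{-1} + e^{0}} \end{align*}\]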

My own reference implementation looks something like below:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>

// For each row in the batch, writes grad_out[j] = p_j - 1[j == label],
// where p_j is the softmax probability computed from the row's logits.
void cross_entropy_with_logits_grad(float *logits, int *labels,
                                    size_t batch_size, size_t num_classes,
                                    float *grad_out) {
  for (size_t i = 0; i < batch_size; i++) {
    size_t offset = i * num_classes;

    float *x = logits + offset;
    float *dce = grad_out + offset;
    auto label = static_cast<size_t>(labels[i]);

    // Find maximum among x-s
    float max_value = std::numeric_limits<float>::lowest();
    for (size_t j = 0; j < num_classes; j++) {
      max_value = std::max<float>(max_value, x[j]);
    }

    // Find sumexp after subtracting max-value.
    float sumexp = 0;
    for (size_t j = 0; j < num_classes; j++) {
      sumexp += std::exp(x[j] - max_value);
    }

    // Compute gradients using predicted probability.
    for (size_t j = 0; j < num_classes; j++) {
      float p = std::exp(x[j] - max_value) / sumexp;
      auto o = static_cast<float>(j == label);
      dce[j] = (p - o);
    }
  }
}
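A minimal driver to exercise this (assuming the function above is in the same translation unit), reusing the worked example from earlier with logits \((1, 2, 3)\) and true class index 2 in zero-based indexing, might look like the sketch below; it should print roughly 0.090 0.245 -0.335.

#include <cstdio>

int main() {
  float logits[] = {1.0f, 2.0f, 3.0f};
  int labels[] = {2};  // zero-based index of the true class
  float grad[3];

  cross_entropy_with_logits_grad(logits, labels, /*batch_size=*/1,
                                 /*num_classes=*/3, grad);

  for (int j = 0; j < 3; j++) {
    std::printf("%0.3f ", grad[j]);  // expect approx 0.090 0.245 -0.335
  }
  std::printf("\n");
  return 0;
}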

References

  1. CS224n: Gradient Notes
  2. Eli Bendersky: The Softmax function and its derivative
  3. MLDawn: Backprop, xent and softmax