The softmax function is widely used in machine learning, especially in logistic regression models and neural networks. In this post I compute the derivative of the softmax function, as well as the derivative of the cross entropy loss applied to it.

The definition of softmax function is:

$$

\sigma(z_j) = \frac{e^{z_j}}{e^{z_1} + e^{z_2} + \cdots + e^{z_n}}, j \in \{1, 2, \cdots, n\},

$$

Or, in summation form,

$$

\sigma(z_j) = \frac{e^{z_j}}{\sum_{i=1}^n e^{z_i}}, j \in \{1, 2, \cdots, n \}.

$$
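As a minimal sketch of the definition above, the snippet below evaluates softmax on a plain Python list. The function name `softmax` and the sample input are illustrative choices, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition itself (the common factor cancels in the ratio, so the result is the same).

```python
import math

def softmax(z):
    """Return [sigma(z_1), ..., sigma(z_n)] for a list of real numbers z."""
    # Assumption (not in the definition): shift by max(z) so that
    # math.exp never overflows. The factor e^{-max} cancels in the ratio.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)  # this is the denominator, the sum over all components
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # three positive values, largest for the largest input
print(sum(probs))  # approximately 1.0
```

Note that every output depends on every input through the shared denominator, which is exactly why the derivative has two cases.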

Computing the derivative of the softmax function is one of the most common tasks in machine learning. **But there is a pitfall**. Softmax is evaluated at one component \(z_j\) of the vector \(\mathbf{z}\), but its value depends on every component of \(\mathbf{z}\) through the denominator. When taking a partial derivative, the component we differentiate with respect to IS NOT necessarily the component at which softmax is evaluated! In mathematical terms, what we need to compute is \( \frac{\partial \sigma(z_j)}{\partial z_k} \), where \(j\) and \(k\) are not necessarily equal! Many beginners end up computing only \(\frac{\partial \sigma(z_j)}{\partial z_j}\), which is just a small portion of the problem.

Now let’s compute \( \frac{\partial \sigma(z_j)}{\partial z_k} \). We will use the quotient rule: \((f(x)/g(x))' = (f'(x)g(x) - f(x)g'(x)) / g(x)^2\). We also write \(\Sigma\) for \(\sum_{i=1}^n e^{z_i}\) for convenience.

When \(j = k\),

$$

\begin{align*}

\frac{\partial \sigma(z_j)}{\partial z_k}

&= \frac{\partial}{\partial z_j} \frac{e^{z_j}}{\Sigma} && \cdots \text {since } j = k \\

&= \frac{\frac{\partial e^{z_j}}{\partial z_j} \cdot \Sigma - e^{z_j} \cdot \frac{\partial \Sigma}{\partial z_j}}{\Sigma ^ 2} && \cdots \text{quotient rule} \\

&= \frac{e^{z_j} \cdot \Sigma - e^{z_j} \cdot e^{z_j}}{\Sigma ^ 2} \\

&= \frac{e^{z_j}}{\Sigma} - \Big( \frac{e^{z_j}}{\Sigma}\Big) ^ 2 \\

&= \sigma(z_j) - (\sigma(z_j))^2 \\ &= \sigma(z_j) (1 - \sigma(z_j)).

\end{align*}

$$

When \(j \neq k\),

$$

\begin{align*}

\frac{\partial \sigma(z_j)}{\partial z_k}

&= \frac{\partial}{\partial z_k} \frac{e^{z_j}}{\Sigma} \\

&= \frac{\frac{\partial e^{z_j}}{\partial z_k} \cdot \Sigma - e^{z_j} \cdot \frac{\partial \Sigma}{\partial z_k}}{\Sigma ^ 2} && \cdots \text{quotient rule} \\

&= \frac{0 \cdot \Sigma - e^{z_j} \cdot e^{z_k}}{\Sigma ^ 2} && \cdots e^{z_j} \text{ is constant with respect to }z_k \\

&= -\frac{e^{z_j}}{\Sigma} \cdot \frac{e^{z_k}}{\Sigma} \\

&= -\sigma(z_j) \sigma(z_k).

\end{align*}

$$

Thus, the derivative of softmax is:

$$

\frac{\partial \sigma(z_j)}{\partial z_k} = \begin{cases}

\sigma(z_j)(1-\sigma(z_j)), & \text{when } j = k, \\

-\sigma(z_j)\sigma(z_k), &\text{when }j \neq k.

\end{cases}

$$
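The two cases can also be written compactly as \( \sigma(z_j)(\delta_{jk} - \sigma(z_k)) \), where \(\delta_{jk}\) is 1 when \(j = k\) and 0 otherwise. As a rough sanity check, the sketch below builds the full matrix of partial derivatives from the piecewise formula and compares it with a central finite-difference approximation. The helper names (`softmax_jacobian`, `numeric_jacobian`), the step size `h`, and the test vector are all illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[j][k] = d sigma(z_j) / d z_k, from the two cases derived above."""
    s = softmax(z)
    n = len(z)
    return [[s[j] * (1.0 - s[j]) if j == k else -s[j] * s[k]
             for k in range(n)]
            for j in range(n)]

def numeric_jacobian(z, h=1e-6):
    """Central finite-difference approximation of the same matrix."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for k in range(n):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += h
        z_minus[k] -= h
        s_plus, s_minus = softmax(z_plus), softmax(z_minus)
        for j in range(n):
            J[j][k] = (s_plus[j] - s_minus[j]) / (2 * h)
    return J

z = [0.5, -1.0, 2.0]  # arbitrary test point
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)
max_err = max(abs(J[j][k] - J_num[j][k])
              for j in range(len(z)) for k in range(len(z)))
print(max_err)  # expected to be very small if the formula is right
```

The off-diagonal entries are the ones a beginner would miss by only computing the \(j = k\) case.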

## Cross Entropy with Softmax

Another common task in machine learning is to compute the derivative of the cross entropy loss applied to softmax outputs. The cross entropy can be written as:

$$

\text{CE} = \sum_{j=1}^n \big(- y_j \log \sigma(z_j) \big)

$$

In a classification problem, \(n\) is the number of classes, and \(y_j\) is the \(j\)-th component of the one-hot representation of the actual class. A one-hot vector has exactly one component equal to 1 and all other components equal to zero. In other words, there is only one 1 among \(y_1, y_2, \cdots, y_n\) and the others are all zero.

For example, for a 5-class classification problem, if a specific data point has class #4, then its class label \(\mathbf{y} = (0, 0, 0, 1, 0) \).
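To make the loss concrete, here is a small sketch that builds the one-hot label from this example and evaluates the cross entropy for some made-up scores. The helper names `one_hot` and `cross_entropy` and the values in `z` are hypothetical, chosen only for illustration.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, n):
    """Length-n vector with 1.0 at position index (0-based), 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def cross_entropy(y, z):
    """CE = sum_j ( -y_j * log sigma(z_j) )."""
    s = softmax(z)
    # Terms with y_j == 0 contribute nothing, so they are skipped.
    return -sum(yj * math.log(sj) for yj, sj in zip(y, s) if yj != 0.0)

y = one_hot(3, 5)               # class #4 in 1-based numbering
z = [0.1, 0.2, 0.3, 2.0, -1.0]  # made-up scores for the 5 classes
loss = cross_entropy(y, z)
print(y)     # [0.0, 0.0, 0.0, 1.0, 0.0]
print(loss)  # equals -log of the probability assigned to class #4
```

With a one-hot label the sum collapses to a single term, the negative log-probability of the true class.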

Now let us compute the derivative of cross entropy with softmax. We will use the chain rule: \((f(g(x)))' = f'(g(x)) \cdot g'(x)\).

$$

\begin{align*}

\frac{\partial}{\partial z_k}\text{CE} &= \frac{\partial}{\partial z_k}\sum_{j=1}^n \big(- y_j \log \sigma(z_j) \big) \\

&= -\sum_{j=1}^n y_j \frac{\partial}{\partial z_k} \log \sigma(z_j) && \cdots \text{addition rule, } -y_j \text{ is constant}\\

&= -\sum_{j=1}^n y_j \frac{1}{\sigma(z_j)} \cdot \frac{\partial}{\partial z_k} \sigma(z_j) && \cdots \text{chain rule}\\

&= -y_k \cdot \frac{\sigma(z_k)(1-\sigma(z_k))}{\sigma(z_k)} - \sum_{j \neq k} y_j \cdot \frac{-\sigma(z_j)\sigma(z_k)}{\sigma(z_j)} && \cdots \text{consider both }j = k \text{ and } j \neq k \\

&= -y_k \cdot (1-\sigma(z_k)) + \sum_{j \neq k} y_j \sigma(z_k) \\

&= -y_k + y_k \sigma(z_k) + \sum_{j \neq k} y_j \sigma(z_k) \\

&= -y_k + \sigma(z_k) \sum_j y_j.

\end{align*}

$$

Since \(\mathbf{y}\) is a one-hot encoding, \(\sum_j y_j = 1\). Thus the derivative of cross entropy with softmax is simply

$$

\frac{\partial}{\partial z_k}\text{CE} = \sigma(z_k) - y_k.

$$

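This result is very simple and very cheap to compute: the derivative with respect to each \(z_k\) is just the softmax output minus the corresponding label component.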
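The closed form can be checked numerically. The sketch below compares \(\sigma(z_k) - y_k\) with a central finite-difference estimate of \(\frac{\partial}{\partial z_k}\text{CE}\). The function names, the step size `h`, and the sample scores are assumptions made for this illustration.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, z):
    s = softmax(z)
    return -sum(yj * math.log(sj) for yj, sj in zip(y, s))

def ce_gradient(y, z):
    """Closed form from the derivation: dCE/dz_k = sigma(z_k) - y_k."""
    return [sk - yk for sk, yk in zip(softmax(z), y)]

def numeric_gradient(y, z, h=1e-6):
    """Central finite-difference estimate of dCE/dz_k for each k."""
    grad = []
    for k in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += h
        z_minus[k] -= h
        grad.append((cross_entropy(y, z_plus) - cross_entropy(y, z_minus)) / (2 * h))
    return grad

y = [0.0, 0.0, 0.0, 1.0, 0.0]   # one-hot label for class #4
z = [0.1, 0.2, 0.3, 2.0, -1.0]  # made-up scores
grad = ce_gradient(y, z)
grad_num = numeric_gradient(y, z)
max_diff = max(abs(a - b) for a, b in zip(grad, grad_num))
print(grad)
print(max_diff)  # expected to be very small if the derivation holds
```

Only the component for the true class is negative, so a gradient descent step raises that score and lowers all the others.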
