Cross-entropy is one of the most common loss functions. This post derives its gradient in detail, which can come up both in interviews at large tech companies and in engineering practice.
Forward Pass
Suppose the classification task has $V$ classes, the hidden layer outputs a $V$-dimensional vector $\mathbf{h}$, the one-hot label vector is $\mathbf{y}$, and the correct class is $k$. Then the (label-smoothed) cross-entropy loss can be defined as:
$$
\mathcal{L}(\mathbf p, \mathbf q) = -\sum_i \mathbf{p}_i \log(\mathbf{q}_i)
$$
where $\mathbf p = (1 - \alpha) \mathbf y + \frac{\alpha}{V}\cdot \mathbf 1$, $\mathbf q = \mathrm{Softmax}(\mathbf h)$, and $0\le \alpha\le 1$ is the label-smoothing parameter. The Softmax function should be familiar; concretely, $\mathbf{q}_i = \frac{e^{\mathbf{h}_i}}{\sum_{j}{e^{\mathbf{h}_j}}}$.
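To make the setup concrete, here is a minimal NumPy sketch of this forward pass for a single example (the function name `smoothed_cross_entropy` and the max-shift stabilization are illustrative choices, not part of the derivation):

```python
import numpy as np

def smoothed_cross_entropy(h, k, alpha):
    """Label-smoothed cross-entropy for one example.

    h: (V,) logit vector, k: index of the correct class, alpha: smoothing in [0, 1].
    Returns the scalar loss together with p and q.
    """
    V = h.shape[0]
    # q = Softmax(h); subtracting max(h) keeps exp() from overflowing
    q = np.exp(h - h.max())
    q /= q.sum()
    # p = (1 - alpha) * y + alpha / V * 1, with y the one-hot target
    p = np.full(V, alpha / V)
    p[k] += 1.0 - alpha
    return -np.sum(p * np.log(q)), p, q
```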
Backward Pass
The gradient of $\mathcal{L}$ with respect to $\mathbf{h}_i$ splits into two cases:
$$
\frac{\partial{\mathcal{L}}}{\partial{\mathbf{h}_i}} = \left\{
\begin{array}{ll}
\mathbf{q}_i -\frac{\alpha}{V} - 1 + \alpha & i = k \\
\mathbf{q}_i -\frac{\alpha}{V} & i\neq k
\end{array} \right.
$$
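Both branches differ only in the target: since $\mathbf{p}_k = 1 - \alpha + \frac{\alpha}{V}$ and $\mathbf{p}_i = \frac{\alpha}{V}$ for $i \neq k$, the two cases are exactly $\frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} = \mathbf{q}_i - \mathbf{p}_i$. Here is a quick finite-difference check of the two-case formula (a standalone sketch; the $V = 5$ setup and tolerance are arbitrary):

```python
import numpy as np

V, k, alpha, eps = 5, 2, 0.1, 1e-6

def loss(h):
    q = np.exp(h - h.max()); q /= q.sum()
    p = np.full(V, alpha / V); p[k] += 1.0 - alpha
    return -np.sum(p * np.log(q))

h = np.random.randn(V)
q = np.exp(h - h.max()); q /= q.sum()
analytic = q - alpha / V                      # the i != k branch
analytic[k] = q[k] - alpha / V - 1 + alpha    # the i == k branch
# central differences along each coordinate
numeric = np.array([(loss(h + eps * np.eye(V)[i]) - loss(h - eps * np.eye(V)[i])) / (2 * eps)
                    for i in range(V)])
assert np.allclose(analytic, numeric, atol=1e-6)
```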
Derivation
By the chain rule, and noting that $\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} = -\frac{\mathbf{p}_j}{\mathbf{q}_j}$:
$$
\frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} = \sum_{j}{\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} = -\sum_{j}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}}
$$
where $\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}$ is the gradient of the Softmax function itself (this derivation is straightforward and is given at the end of the post):
$$
\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} = \left\{
\begin{array}{ll}
\mathbf{q}_i (1 - \mathbf{q}_i) & j = i \\
-\mathbf{q}_i \mathbf{q}_j & j\neq i
\end{array} \right.
$$
We now work through the two cases:
- When $i = k$:
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} &= \sum_{j}{\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} = -\sum_{j}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
&= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot \frac{\partial{\mathbf{q}_k}}{\partial{\mathbf{h}_i}} -\sum_{j \neq k}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
&= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot \mathbf{q}_k (1 - \mathbf{q}_k) -\sum_{j \neq k}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot (-\mathbf{q}_i \mathbf{q}_j)} \\
&= \mathbf{p}_k (\mathbf{q}_k - 1) + \mathbf{q}_i\sum_{j \neq k}{\mathbf{p}_j} \\
&= (1 - \alpha + \frac{\alpha}{V})(\mathbf{q}_k - 1) + (V - 1) \cdot \frac{\alpha}{V}\cdot \mathbf{q}_i \\
&= \mathbf{q}_i -\frac{\alpha}{V} - 1 + \alpha
\end{aligned}
$$
- When $i \neq k$:
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} &= \sum_{j}{\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} = -\sum_{j}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
&= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot \frac{\partial{\mathbf{q}_k}}{\partial{\mathbf{h}_i}} -\frac{\mathbf{p}_i}{\mathbf{q}_i} \cdot \frac{\partial{\mathbf{q}_i}}{\partial{\mathbf{h}_i}} -\sum_{j \neq k, j \neq i}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
&= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot (-\mathbf{q}_k\mathbf{q}_i) -\frac{\mathbf{p}_i}{\mathbf{q}_i} \cdot \mathbf{q}_i (1 - \mathbf{q}_i) -\sum_{j \neq k, j \neq i}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot (-\mathbf{q}_i \mathbf{q}_j)} \\
&= \mathbf{p}_k \mathbf{q}_i - \mathbf{p}_i (1 - \mathbf{q}_i) + \mathbf{q}_i\sum_{j \neq k, j \neq i}{\mathbf{p}_j} \\
&= (1 - \alpha + \frac{\alpha}{V})\cdot\mathbf{q}_i - \frac{\alpha}{V} \cdot (1 - \mathbf{q}_i) + (V - 2) \cdot \frac{\alpha}{V}\cdot\mathbf{q}_i \\
&= \mathbf{q}_i -\frac{\alpha}{V}
\end{aligned}
$$
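So in both cases the gradient is simply $\frac{\partial \mathcal{L}}{\partial \mathbf{h}} = \mathbf{q} - \mathbf{p}$: the predicted distribution minus the smoothed target. A combined forward/backward sketch built on this identity (continuing the NumPy setup above; the function name is illustrative):

```python
def smoothed_ce_forward_backward(h, k, alpha):
    """Returns (loss, dL/dh) using the compact identity grad = q - p."""
    V = h.shape[0]
    q = np.exp(h - h.max()); q /= q.sum()
    p = np.full(V, alpha / V); p[k] += 1.0 - alpha
    return -np.sum(p * np.log(q)), q - p
```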
Softmax Gradient
Recall the form of the Softmax function:
$$
\mathbf{q}_j = \frac{e^{\mathbf{h}_j}}{\sum_{l}{e^{\mathbf{h}_l}}} = \frac{e^{\mathbf{h}_j}}{e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}}
$$
Here again there are two cases:
- When $j = i$:
$$
\begin{aligned}
\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} &= \frac{e^{\mathbf{h}_j}}{e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}} - \frac{e^{\mathbf{h}_j} \cdot e^{\mathbf{h}_j}}{\left(e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}\right)^2} \\
&= \mathbf{q}_j - \mathbf{q}_j\mathbf{q}_j \\
&= \mathbf{q}_j (1 - \mathbf{q}_j)
\end{aligned}
$$
- When $j \neq i$:
$$
\begin{aligned}
\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} &= - \frac{e^{\mathbf{h}_j} \cdot e^{\mathbf{h}_i}}{\left(e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}\right)^2} \\
&= -\mathbf{q}_j\mathbf{q}_i
\end{aligned}
$$
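The two cases combine into $\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} = \mathbf{q}_j(\delta_{ij} - \mathbf{q}_i)$, i.e. the Jacobian matrix $\mathrm{diag}(\mathbf{q}) - \mathbf{q}\mathbf{q}^\top$. A standalone numerical check (the $V = 4$ setup is arbitrary):

```python
import numpy as np

V, eps = 4, 1e-6
h = np.random.randn(V)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

q = softmax(h)
analytic = np.diag(q) - np.outer(q, q)  # entry [j, i] = q_j * (delta_ij - q_i)
# central differences: row i of the stacked array is dq/dh_i, so transpose
numeric = np.array([(softmax(h + eps * np.eye(V)[i]) - softmax(h - eps * np.eye(V)[i])) / (2 * eps)
                    for i in range(V)]).T
assert np.allclose(analytic, numeric, atol=1e-6)
```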