Deriving Backpropagation for "Cross-Entropy"

Cross-entropy (CrossEntropy) is one of the most common loss functions. This post derives its gradient in detail; the derivation comes up both in interviews at large tech companies and in engineering practice.

Forward Pass

Suppose the classification task has $V$ classes, the hidden-layer output is the $V$-dimensional vector $\mathbf{h}$, the one-hot label vector is $\mathbf{y}$, and the correct class is $k$. The cross-entropy loss (with label smoothing) is then defined as:
$$
\mathcal{L}(\mathbf p, \mathbf q) = -\sum_i \mathbf{p}_i \log(\mathbf{q}_i)
$$
where $\mathbf p = (1 - \alpha) \mathbf y + \frac{\alpha}{V}\cdot \mathbf 1$, $\mathbf q = \mathrm{Softmax}(\mathbf h)$, and $0\le \alpha\le 1$ is the smoothing parameter. The Softmax function should be familiar: $\mathbf{q}_i = \frac{e^{\mathbf{h}_i}}{\sum_{j}{e^{\mathbf{h}_j}}}$.
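
For concreteness, here is a minimal NumPy sketch of this forward pass (the function name `cross_entropy_with_smoothing` and the max-subtraction for numerical stability are my additions, not part of the original derivation):

```python
import numpy as np

def cross_entropy_with_smoothing(h, k, alpha=0.1):
    """Label-smoothed cross-entropy: L(p, q) = -sum_i p_i * log(q_i)."""
    V = h.shape[0]
    # q = Softmax(h); subtracting max(h) keeps exp() from overflowing
    q = np.exp(h - h.max())
    q /= q.sum()
    # p = (1 - alpha) * y + (alpha / V) * 1, where y is one-hot at index k
    p = np.full(V, alpha / V)
    p[k] += 1.0 - alpha
    return -np.sum(p * np.log(q)), p, q
```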

Backward Pass

The gradient of $\mathcal{L}$ with respect to $\mathbf{h}_i$ splits into two cases:
$$
\frac{\partial{\mathcal{L}}}{\partial{\mathbf{h}_i}} = \left\{
\begin{array}{ll}
\mathbf{q}_i -\frac{\alpha}{V} - 1 + \alpha & i = k \\
\mathbf{q}_i -\frac{\alpha}{V} & i\neq k
\end{array} \right.
$$
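
Note that the two cases are exactly the components of $\mathbf q - \mathbf p$ written out, so the gradient is only a couple of lines of code. A sketch (the helper name `grad_h` is mine):

```python
def grad_h(q, k, alpha):
    """Analytic gradient dL/dh_i, covering both cases above."""
    g = q - alpha / q.shape[0]   # case i != k: q_i - alpha/V
    g[k] += alpha - 1.0          # case i == k: extra (alpha - 1) term
    return g
```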

Derivation

By the chain rule, and using $\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} = -\frac{\mathbf{p}_j}{\mathbf{q}_j}$ from the definition of $\mathcal{L}$:
$$
\frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} = \sum_{j}{\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} = -\sum_{j}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}}
$$
where $\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}$ is the gradient of the Softmax function (this derivation is simple and is given at the end of the post):
$$
\frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} = \left\{
\begin{array}{ll}
\mathbf{q}_i (1 - \mathbf{q}_i) & j = i \\
-\mathbf{q}_i \mathbf{q}_j & j\neq i
\end{array} \right.
$$

We now discuss the two cases separately; a numerical sanity check of the final result follows the list.

  1. When $i = k$:
    $$
    \begin{aligned}
    \frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} &= \sum_{j}{\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} = -\sum_{j}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
    &= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot \frac{\partial{\mathbf{q}_k}}{\partial{\mathbf{h}_i}} -\sum_{j \neq k}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
    &= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot \mathbf{q}_k (1 - \mathbf{q}_k) -\sum_{j \neq k}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot (-\mathbf{q}_i \mathbf{q}_j)} \\
    &= \mathbf{p}_k (\mathbf{q}_k - 1) + \mathbf{q}_i\sum_{j \neq k}{\mathbf{p}_j} \\
    &= (1 - \alpha + \frac{\alpha}{V})(\mathbf{q}_k - 1) + (V - 1) \cdot \frac{\alpha}{V}\cdot \mathbf{q}_i \\
    &= \mathbf{q}_i -\frac{\alpha}{V} - 1 + \alpha
    \end{aligned}
    $$
  2. When $i \neq k$:
    $$
    \begin{aligned}
    \frac{\partial \mathcal{L}}{\partial \mathbf{h}_i} &= \sum_{j}{\frac{\partial{\mathcal{L}}}{\partial{\mathbf{q}_j}} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} = -\sum_{j}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
    &= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot \frac{\partial{\mathbf{q}_k}}{\partial{\mathbf{h}_i}} -\frac{\mathbf{p}_i}{\mathbf{q}_i} \cdot \frac{\partial{\mathbf{q}_i}}{\partial{\mathbf{h}_i}} -\sum_{j \neq k, j \neq i}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}}} \\
    &= -\frac{\mathbf{p}_k}{\mathbf{q}_k} \cdot (-\mathbf{q}_k\mathbf{q}_i) -\frac{\mathbf{p}_i}{\mathbf{q}_i} \cdot \mathbf{q}_i (1 - \mathbf{q}_i) -\sum_{j \neq k, j \neq i}{\frac{\mathbf{p}_j}{\mathbf{q}_j} \cdot (-\mathbf{q}_i \mathbf{q}_j)} \\
    &= \mathbf{p}_k \mathbf{q}_i - \mathbf{p}_i (1 - \mathbf{q}_i) + \mathbf{q}_i\sum_{j \neq k, j \neq i}{\mathbf{p}_j} \\
    &= (1 - \alpha + \frac{\alpha}{V})\cdot\mathbf{q}_i - \frac{\alpha}{V} \cdot (1 - \mathbf{q}_i) + (V - 2) \cdot \frac{\alpha}{V}\cdot\mathbf{q}_i \\
    &= \mathbf{q}_i -\frac{\alpha}{V}
    \end{aligned}
    $$
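
As promised above, here is a quick sanity check of the result against central finite differences, reusing the `cross_entropy_with_smoothing` and `grad_h` sketches from earlier (the test size, class index, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
V, k, alpha = 5, 2, 0.1
h = rng.normal(size=V)

_, p, q = cross_entropy_with_smoothing(h, k, alpha)
analytic = grad_h(q, k, alpha)

# Central finite differences on each coordinate of h
eps = 1e-6
numeric = np.zeros(V)
for i in range(V):
    e = np.zeros(V)
    e[i] = eps
    lp, _, _ = cross_entropy_with_smoothing(h + e, k, alpha)
    lm, _, _ = cross_entropy_with_smoothing(h - e, k, alpha)
    numeric[i] = (lp - lm) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # close to zero: formulas agree
print(np.max(np.abs(analytic - (q - p))))    # the compact form q - p matches too
```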

Softmax Gradient

Recall the form of the Softmax function, with the term for $\mathbf{h}_i$ split out of the denominator:
$$
\mathbf{q}_j = \frac{e^{\mathbf{h}_j}}{\sum_{l}{e^{\mathbf{h}_l}}} = \frac{e^{\mathbf{h}_j}}{e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}}
$$

Again there are two cases (a numerical check of the full Jacobian follows the list):

  1. When $j = i$, applying the quotient rule:
    $$
    \begin{aligned}
    \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} &= \frac{e^{\mathbf{h}_j}}{e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}} - \frac{e^{\mathbf{h}_j} \cdot e^{\mathbf{h}_j}}{\left(e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}\right)^2} \\
    &= \mathbf{q}_j - \mathbf{q}_j\mathbf{q}_j \\
    &= \mathbf{q}_j (1 - \mathbf{q}_j)
    \end{aligned}
    $$
  2. When $j \neq i$, only the denominator depends on $\mathbf{h}_i$:
    $$
    \begin{aligned}
    \frac{\partial{\mathbf{q}_j}}{\partial{\mathbf{h}_i}} &= - \frac{e^{\mathbf{h}_j} \cdot e^{\mathbf{h}_i}}{\left(e^{\mathbf{h}_i} + \sum_{l \neq i}{e^{\mathbf{h}_l}}\right)^2} \\
    &= -\mathbf{q}_j\mathbf{q}_i
    \end{aligned}
    $$
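
The two cases together give the Jacobian $\operatorname{diag}(\mathbf q) - \mathbf q\mathbf q^{\top}$. A small self-contained check against central finite differences (the `softmax` helper and test values are mine):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h = np.random.default_rng(1).normal(size=4)
q = softmax(h)

# Analytic Jacobian from the two cases: J[j, i] = q_j * (delta_ji - q_i)
J = np.diag(q) - np.outer(q, q)

# Central finite differences; row i approximates dq / dh_i
eps = 1e-6
rows = []
for i in range(4):
    e = np.zeros(4)
    e[i] = eps
    rows.append((softmax(h + e) - softmax(h - e)) / (2 * eps))
J_num = np.array(rows)  # equals J up to transpose, and J is symmetric

print(np.max(np.abs(J - J_num)))  # close to zero: the two cases check out
```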
