The Transformer is an architecture that differs from the RNN. The model still consists of an encoder and a decoder, but both drop the recurrence and are instead built by stacking feed-forward style layers.
Encoder:
The encoder is a stack of N identical layers. Each layer contains two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple fully connected feed-forward network.
Decoder:
The decoder is likewise a stack of N identical layers, but each layer contains three sub-layers: the first is a multi-head self-attention layer, the second is a multi-head context-attention layer, and the third is a simple fully connected feed-forward network.
The model architecture is shown below.
1 Modules
(1) Multi-head self-attention
Multi-head self-attention is an attention mechanism in which key = value = query = the hidden states.
In the encoder, multi-head self-attention uses key = value = query = the encoder hidden states.
In the decoder, multi-head self-attention uses key = value = query = the decoder hidden states.
Here we describe the output of self-attention, i.e. the case key = value = query = H.
H denotes the hidden states over all time steps, and $h_i$ is the hidden state of the i-th word:
\[H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix} \in R^{n \times \dim}, \qquad h_i \in R^{1 \times \dim}\]
The transpose of H is
\[H^T = [h_1^T, h_2^T, \ldots, h_n^T] \in R^{\dim \times n}\]
If we compute the self-attention for only a single word's hidden state $h_i$:
\[weight_{h_i} = \mathrm{softmax}\Big((h_i W_{query}^i) \cdot \big(W_{key}^i \cdot [h_1^T, h_2^T, \ldots, h_n^T]\big)\Big) = [weight_{i1}, weight_{i2}, \ldots, weight_{in}]\]
\[value = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix} W_{value}^i = \begin{bmatrix} h_1 W_{value}^i \\ h_2 W_{value}^i \\ \vdots \\ h_n W_{value}^i \end{bmatrix}\]
\[Attention_{h_i} = weight_{h_i} \cdot value = \sum_{k=1}^{n} weight_{ik}\,(h_k W_{value}^i)\]
Similarly, computing the self-attention of all hidden states $h_i\ (1 \le i \le n)$ at once:
\[weight = \mathrm{softmax}\left(\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix} W_{query}^i \cdot \big(W_{key}^i \cdot [h_1^T, h_2^T, \ldots, h_n^T]\big)\right)\]
\[= \mathrm{softmax}\left(\begin{bmatrix} h_1 W_{query}^i \\ h_2 W_{query}^i \\ \vdots \\ h_n W_{query}^i \end{bmatrix} \cdot \begin{bmatrix} W_{key}^i h_1^T & W_{key}^i h_2^T & \cdots & W_{key}^i h_n^T \end{bmatrix}\right)\]
\[= \mathrm{softmax}\left(\begin{bmatrix}
(h_1 W_{query}^i)(W_{key}^i h_1^T) & (h_1 W_{query}^i)(W_{key}^i h_2^T) & \cdots & (h_1 W_{query}^i)(W_{key}^i h_n^T) \\
(h_2 W_{query}^i)(W_{key}^i h_1^T) & (h_2 W_{query}^i)(W_{key}^i h_2^T) & \cdots & (h_2 W_{query}^i)(W_{key}^i h_n^T) \\
\vdots & \vdots & \ddots & \vdots \\
(h_n W_{query}^i)(W_{key}^i h_1^T) & (h_n W_{query}^i)(W_{key}^i h_2^T) & \cdots & (h_n W_{query}^i)(W_{key}^i h_n^T)
\end{bmatrix}\right)\]
where the softmax is applied to each row, so row i of $weight$ contains the attention weights of word i over all words.
\[\mathrm{sum}(weight * value) = \begin{bmatrix}
Weight_{11}(h_1 W_{value}^i) + Weight_{12}(h_2 W_{value}^i) + \cdots + Weight_{1n}(h_n W_{value}^i) \\
Weight_{21}(h_1 W_{value}^i) + Weight_{22}(h_2 W_{value}^i) + \cdots + Weight_{2n}(h_n W_{value}^i) \\
\vdots \\
Weight_{n1}(h_1 W_{value}^i) + Weight_{n2}(h_2 W_{value}^i) + \cdots + Weight_{nn}(h_n W_{value}^i)
\end{bmatrix}\]
\[= \begin{bmatrix}
\sum_{k=1}^{n} Weight_{1k}(h_k W_{value}^i) \\
\sum_{k=1}^{n} Weight_{2k}(h_k W_{value}^i) \\
\vdots \\
\sum_{k=1}^{n} Weight_{nk}(h_k W_{value}^i)
\end{bmatrix}\]
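As a sanity check, the matrix form above can be written directly in a few lines of PyTorch. This is a minimal single-head sketch without the $\sqrt{d_k}$ scaling introduced below; the sizes n, dim and d_k are illustrative values, not taken from the text.

import torch
import torch.nn.functional as F

n, dim, d_k = 5, 8, 8             # sequence length and dimensions (illustrative)
H = torch.randn(n, dim)           # hidden states, one row per word
W_query = torch.randn(dim, d_k)   # W_query^i
W_key = torch.randn(dim, d_k)     # W_key^i
W_value = torch.randn(dim, d_k)   # W_value^i

scores = (H @ W_query) @ (H @ W_key).t()   # scores[i, k] = (h_i W_query)(h_k W_key)^T
weight = F.softmax(scores, dim=-1)         # softmax applied row-wise, shape (n, n)
value = H @ W_value                        # (n, d_k)
attention = weight @ value                 # row i = sum_k weight_ik * (h_k W_value)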
So the final attention vector is $head_i$:
\[head_i = Attention(QW_{query}^i, KW_{key}^i, VW_{value}^i) = \mathrm{sum}(weight * value) = \begin{bmatrix}
\sum_{k=1}^{n} Weight_{1k}(h_k W_{value}^i) \\
\sum_{k=1}^{n} Weight_{2k}(h_k W_{value}^i) \\
\vdots \\
\sum_{k=1}^{n} Weight_{nk}(h_k W_{value}^i)
\end{bmatrix}\]
The argument of the softmax is additionally scaled by a factor $\sqrt{d_k}$:
\[head_i = Attention(QW_{query}^i, KW_{key}^i, VW_{value}^i) = \mathrm{softmax}\left(\frac{(QW_{query}^i)(KW_{key}^i)^T}{\sqrt{d_k}}\right) VW_{value}^i\]
\[= \mathrm{softmax}\left(\begin{bmatrix}
\frac{(h_1 W_{query}^i)(W_{key}^i h_1^T)}{\sqrt{d_k}} & \frac{(h_1 W_{query}^i)(W_{key}^i h_2^T)}{\sqrt{d_k}} & \cdots & \frac{(h_1 W_{query}^i)(W_{key}^i h_n^T)}{\sqrt{d_k}} \\
\frac{(h_2 W_{query}^i)(W_{key}^i h_1^T)}{\sqrt{d_k}} & \frac{(h_2 W_{query}^i)(W_{key}^i h_2^T)}{\sqrt{d_k}} & \cdots & \frac{(h_2 W_{query}^i)(W_{key}^i h_n^T)}{\sqrt{d_k}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{(h_n W_{query}^i)(W_{key}^i h_1^T)}{\sqrt{d_k}} & \frac{(h_n W_{query}^i)(W_{key}^i h_2^T)}{\sqrt{d_k}} & \cdots & \frac{(h_n W_{query}^i)(W_{key}^i h_n^T)}{\sqrt{d_k}}
\end{bmatrix}\right) VW_{value}^i\]
\[= \mathrm{sum}(weight_{\sqrt{d_k}} * value)\]
Note that $\sqrt{d_k}$ plays the role of the temperature parameter $\tau$ in the softmax:
\[p_i = \frac{e^{\mathrm{logits}_i / \tau}}{\sum_{j} e^{\mathrm{logits}_j / \tau}}\]
The larger $\tau$ is, the closer together the probabilities produced by the softmax become; the smaller $\tau$ is, the more they differ. As $\tau$ approaches 0, only the largest entry goes to 1 and all others go to almost 0:
\[\lim_{\tau \to 0} p_i = \begin{cases} 1 & \text{if } \mathrm{logits}_i = \max\limits_{1 \le k \le N} \mathrm{logits}_k \\ 0 & \text{otherwise} \end{cases}\]
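A quick numerical illustration of the temperature effect (the logits are arbitrary values chosen for this sketch):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
for tau in (10.0, 1.0, 0.1):
    print(tau, F.softmax(logits / tau, dim=-1))
# large tau -> nearly uniform probabilities;
# small tau -> almost all of the mass on the largest logit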
The multi-head attention output is obtained by concatenating the individual $head_i$ and passing the result through a linear layer:
\[MultiHead = Concat(head_1, head_2, \ldots, head_h)\,W^O\]
\[\text{where } head_i = Attention(QW_{query}^i, KW_{key}^i, VW_{value}^i) = \mathrm{softmax}\left(\frac{(QW_{query}^i)(KW_{key}^i)^T}{\sqrt{d_k}}\right) VW_{value}^i\]
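The multi-head computation can be sketched as follows (a minimal illustration with made-up sizes; the OpenNMT implementation shown later fuses the per-head projections into single Linear layers and reshapes the tensors instead of looping):

import torch
import torch.nn.functional as F

n, d_model, h = 5, 512, 8
d_k = d_model // h
Q = K = V = torch.randn(n, d_model)   # self-attention: query = key = value

heads = []
for i in range(h):
    Wq = torch.randn(d_model, d_k)    # W_query^i
    Wk = torch.randn(d_model, d_k)    # W_key^i
    Wv = torch.randn(d_model, d_k)    # W_value^i
    scores = (Q @ Wq) @ (K @ Wk).t() / d_k ** 0.5   # scaled dot-product scores, (n, n)
    weight = F.softmax(scores, dim=-1)              # row-wise softmax
    heads.append(weight @ (V @ Wv))                 # head_i, shape (n, d_k)

W_o = torch.randn(h * d_k, d_model)
multi_head = torch.cat(heads, dim=-1) @ W_o         # Concat(head_1, ..., head_h) W^O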
(2) LayerNorm + Position-wise Feed-Forward Networks
\[FFN(x) = \max (0,x{W_1} + {b_1}){W_2} + {b_2}\]
Note that the implementation differs slightly from the paper: LayerNorm is applied first, and then the FFN.
class PositionwiseFeedForward(nn.Module):
    """ A two-layer Feed-Forward-Network with residual layer norm.

    Args:
        d_model (int): the size of input for the first-layer of the FFN.
        d_ff (int): the hidden layer size of the second-layer of the FFN.
        dropout (float): dropout probability (0-1.0).
    """

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.layer_norm = onmt.modules.LayerNorm(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.relu = nn.ReLU()
        self.dropout_2 = nn.Dropout(dropout)

    def forward(self, x):
        """ Layer definition.

        Args:
            input: [ batch_size, input_len, model_dim ]

        Returns:
            output: [ batch_size, input_len, model_dim ]
        """
        inter = self.dropout_1(self.relu(self.w_1(self.layer_norm(x))))  # LayerNorm first, then FFN
        output = self.dropout_2(self.w_2(inter))
        return output + x  # residual connection
(3) Layer Normalization
\[x = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\]
where $x_1, x_2, x_3, \ldots, x_n$ are the different features of one sample $x$.
\[{{\hat x}_i} = \frac{{{x_i} - E(x)}}{{\sqrt {Var(x)} }}\]
\[\hat x = \begin{bmatrix} \hat x_1 & \hat x_2 & \cdots & \hat x_n \end{bmatrix}\]
The final $\hat x$ is the output of layer normalization; $\hat x$ has mean 0 and variance 1:
\[E(\hat x) = \frac{1}{n}\sum_{i=1}^{n}\hat x_i = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i - E(x)}{\sqrt{Var(x)}} = \frac{1}{n}\cdot\frac{(x_1 + x_2 + \cdots + x_n) - nE(x)}{\sqrt{Var(x)}} = 0\]
\[Var(\hat x) = \frac{1}{n-1}\sum_{i=1}^{n}\big(\hat x_i - E(\hat x)\big)^2 = \frac{1}{n-1}\sum_{i=1}^{n}\hat x_i^2 = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i - E(x))^2}{Var(x)} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - E(x))^2}{Var(x)} = \frac{Var(x)}{Var(x)} = 1\]
In practice, however, two learnable parameters w and bias are introduced. They are updated by backpropagation, with initial values $w_{initial}=1$ and $bias_{initial}=0$; $\varepsilon$ prevents the denominator from being 0:
\[{{\hat x}_i} = w*\frac{{{x_i} - E(x)}}{{\sqrt {Var(x) + \varepsilon } }} + bias\]
The code is as follows:
class LayerNorm(nn.Module):
    """ Layer Normalization class """

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # x = [[-0.0101, 1.4038, -0.0116, 1.4277],
        #      [ 1.2195, 0.7676,  0.0129, 1.4265]]
        mean = x.mean(-1, keepdim=True)
        # mean = [[0.7025],
        #         [0.8566]]
        std = x.std(-1, keepdim=True)
        # std = [[0.8237],
        #        [0.6262]]
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
        # self.a_2 = [1, 1, 1, 1], self.b_2 = [0, 0, 0, 0]
        # return [[-0.8651,  0.8515, -0.8668, 0.8804],
        #         [ 0.5795, -0.1422, -1.3475, 0.9101]]
(4) Embedding
Position embedding (positional vector):
\[PE_{pos,2i} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) = \sin(pos \cdot div\_term)\]
\[PE_{pos,2i+1} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) = \cos(pos \cdot div\_term)\]
\[div\_term = e^{\log\left(\frac{1}{10000^{2i/d_{model}}}\right)} = e^{-\frac{2i}{d_{model}}\log(10000)} = e^{2i \cdot \left(-\frac{\log(10000)}{d_{model}}\right)}\]
An example of computing the position embedding:
Input sentence $S=[w_1, w_2, \ldots, w_{max\_len}]$, where max_len is the sentence length; assume max_len = 3 and $d_{model} = 6$ (the dim used in the code below):
pe = torch.zeros(max_len, dim)
position = torch.arange(0, max_len).unsqueeze(1)
# position = [0, 1, 2] unsqueezed to shape (3, 1)
div_term = torch.exp((torch.arange(0, dim, 2, dtype=torch.float) *
                      -(math.log(10000.0) / dim)))
# torch.arange(0, dim, 2, dtype=torch.float) = [0, 2, 4]
# -(math.log(10000.0) / dim) = -1.5350567286626973
# [0, 2, 4] * -1.5350567286626973 = [-0.0000, -3.0701, -6.1402]
# div_term = exp([-0.0000, -3.0701, -6.1402]) = [1.0000, 0.0464, 0.0022]
pe[:, 0::2] = torch.sin(position.float() * div_term)
# pe = [[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#       [0.8415, 0.0000, 0.0464, 0.0000, 0.0022, 0.0000],
#       [0.9093, 0.0000, 0.0927, 0.0000, 0.0043, 0.0000]]
pe[:, 1::2] = torch.cos(position.float() * div_term)
# pe = [[0.0000,  1.0000, 0.0000, 1.0000, 0.0000, 1.0000],
#       [0.8415,  0.5403, 0.0464, 0.9989, 0.0022, 1.0000],
#       [0.9093, -0.4161, 0.0927, 0.9957, 0.0043, 1.0000]]
pe = pe.unsqueeze(1)
# pe.shape = [3, 1, 6]
With max_len = 20 and $d_{model} = 4$, the position embedding looks as in the figure below; one can observe that, within a single sequence, roughly only the first half of the dimensions at a position t actually helps to distinguish positions:
Token (semantic) embedding:
$x=[x_1, x_2, x_3, \ldots, x_n]$, where each $x_i$ is a one-hot row vector.
The semantic embedding is then $emb=[emb_1, emb_2, emb_3, \ldots, emb_n]$ with $emb_i = x_i W$. In the Transformer, the representation of a word is the sum of its semantic vector $emb_i$ and its position vector $pe_i$:
$emb^{final}_{i}=emb_{i}+pe_{i}$
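A minimal sketch of how the two parts are combined (hypothetical sizes, and a plain nn.Embedding standing in for the matrix W):

import math
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 1000, 6, 3      # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d_model)    # semantic embedding: emb_i = x_i W

# positional encoding, built as in the example above
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

tokens = torch.tensor([3, 17, 42])             # a toy 3-word sentence
emb_final = tok_emb(tokens) + pe               # emb_i^final = emb_i + pe_i, shape (3, 6)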
2 Encoder
(1) The encoder is a stack of several identical layers: $[input \rightarrow embedding \rightarrow self\text{-}attention \rightarrow Add\&Norm \rightarrow FFN \rightarrow Add\&Norm]$:
(2) The encoder's self-attention attends to both the preceding and the following words, whereas the decoder's self-attention attends only to the preceding words; the encoder's mask matrix is therefore all ones. The encoder's self-attention is illustrated in the figure below:
The code is as follows:
class TransformerEncoderLayer(nn.Module):
    """ A single layer of the transformer encoder.

    Args:
        d_model (int): the dimension of keys/values/queries in
            MultiHeadedAttention, also the input size of
            the first-layer of the PositionwiseFeedForward.
        heads (int): the number of heads for MultiHeadedAttention.
        d_ff (int): the second-layer of the PositionwiseFeedForward.
        dropout (float): dropout probability (0-1.0).
    """

    def __init__(self, d_model, heads, d_ff, dropout):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = onmt.modules.MultiHeadedAttention(
            heads, d_model, dropout=dropout)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.layer_norm = onmt.modules.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs, mask):
        """ Transformer Encoder Layer definition.

        Args:
            inputs (`FloatTensor`): `[batch_size x src_len x model_dim]`
            mask (`LongTensor`): `[batch_size x src_len x src_len]`

        Returns:
            (`FloatTensor`):
            * outputs `[batch_size x src_len x model_dim]`
        """
        input_norm = self.layer_norm(inputs)
        context, _ = self.self_attn(input_norm, input_norm, input_norm,
                                    mask=mask)
        out = self.dropout(context) + inputs   # residual connection
        return self.feed_forward(out)
3 Decoder
(1) When the decoder's self-attention layer computes self-attention, only the preceding words are available at prediction time. During training, therefore, attention is computed only between the current position and earlier positions, which is implemented with a mask: the Masked Multi-head Attention layer.
For example, when translating into "I have an app":
After the first word "I" has been produced, the self-attention for "I" is computed only with itself, Attention("I","I"), to predict the next word.
After "I have" has been produced, the self-attention of "have" with "have" and of "have" with "I" is computed, Attention("have","I") and Attention("have","have"), to predict the next word.
After "I have an" has been produced, the self-attention of "an" with "an", "an" with "have" and "an" with "I" is computed, Attention("an","an"), Attention("an","have") and Attention("an","I"), to predict the next word.
This can be illustrated by the figure below:

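The same triangular masking pattern can be reproduced with a tiny sketch (torch.triu here; the OpenNMT decoder layer shown later builds its mask buffer via _get_attn_subsequent_mask):

import torch

tgt_len = 4    # e.g. the four target positions of "I have an app"
# entries equal to 1 mark the strictly-future positions that must NOT be attended to
subsequent_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.uint8), diagonal=1)
print(subsequent_mask)
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [0, 0, 0, 0]], dtype=torch.uint8)
# Row t is zero only at positions <= t, so position t attends to itself and to
# earlier words, matching the Attention("an", ...) example above.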
The code for self-attention is as follows:
class MultiHeadedAttention(nn.Module):
    """
    Args:
        head_count (int): number of parallel heads
        model_dim (int): the dimension of keys/values/queries,
            must be divisible by head_count
        dropout (float): dropout parameter
    """

    def __init__(self, head_count, model_dim, dropout=0.1):
        assert model_dim % head_count == 0
        self.dim_per_head = model_dim // head_count
        self.model_dim = model_dim

        super(MultiHeadedAttention, self).__init__()
        self.head_count = head_count
        self.linear_keys = nn.Linear(model_dim, model_dim)
        self.linear_values = nn.Linear(model_dim, model_dim)
        self.linear_query = nn.Linear(model_dim, model_dim)
        self.softmax = nn.Softmax(dim=-1)
        self.dropout = nn.Dropout(dropout)
        self.final_linear = nn.Linear(model_dim, model_dim)

    def forward(self, key, value, query, mask=None,
                layer_cache=None, type=None):
        """ Compute the context vector and the attention vectors.

        Args:
            key (`FloatTensor`): set of `key_len`
                key vectors `[batch, key_len, dim]`
            value (`FloatTensor`): set of `key_len`
                value vectors `[batch, key_len, dim]`
            query (`FloatTensor`): set of `query_len`
                query vectors `[batch, query_len, dim]`
            mask: binary mask indicating which keys have
                non-zero attention `[batch, query_len, key_len]`
        Returns:
            (`FloatTensor`, `FloatTensor`):
            * output context vectors `[batch, query_len, dim]`
            * one of the attention vectors `[batch, query_len, key_len]`
        """
        batch_size = key.size(0)
        dim_per_head = self.dim_per_head
        head_count = self.head_count
        key_len = key.size(1)
        query_len = query.size(1)

        def shape(x):
            """ projection """
            return x.view(batch_size, -1, head_count, dim_per_head) \
                .transpose(1, 2)

        def unshape(x):
            """ compute context """
            return x.transpose(1, 2).contiguous() \
                .view(batch_size, -1, head_count * dim_per_head)

        # 1) Project key, value, and query.
        # (the layer_cache branch of the original OpenNMT code is omitted
        #  in this simplified version, so the projection is unconditional)
        key = self.linear_keys(key)        # [batch_size, key_len, dim]
        value = self.linear_values(value)  # [batch_size, key_len, dim]
        query = self.linear_query(query)   # [batch_size, query_len, dim]
        key = shape(key)                   # [batch_size, head_count, key_len, dim_per_head]
        value = shape(value)               # [batch_size, head_count, value_len, dim_per_head]
        query = shape(query)               # [batch_size, head_count, query_len, dim_per_head]

        key_len = key.size(2)
        query_len = query.size(2)

        # 2) Calculate and scale scores.
        query = query / math.sqrt(dim_per_head)
        scores = torch.matmul(query, key.transpose(2, 3))
        # query.shape = [batch_size, head_count, query_len, dim_per_head]
        # key.transpose(2, 3).shape = [batch_size, head_count, dim_per_head, key_len]
        # scores.shape = [batch_size, head_count, query_len, key_len]

        if mask is not None:
            mask = mask.unsqueeze(1).expand_as(scores)
            scores = scores.masked_fill(mask, -1e18)

        # 3) Apply attention dropout and compute context vectors.
        attn = self.softmax(scores)        # [batch_size, head_count, query_len, key_len]
        drop_attn = self.dropout(attn)
        context = unshape(torch.matmul(drop_attn, value))
        # torch.matmul(drop_attn, value).shape = [batch_size, head_count, query_len, dim_per_head]
        # context.shape = [batch_size, query_len, head_count * dim_per_head]

        output = self.final_linear(context)
        # also return one attention head, as expected by the encoder/decoder layers
        top_attn = attn[:, 0, :, :].contiguous()
        return output, top_attn
(2) The decoder structure is $[input \rightarrow embedding \rightarrow self\text{-}attention \rightarrow Add\&Norm \rightarrow context\text{-}attention \rightarrow FFN \rightarrow Add\&Norm]$:
class TransformerDecoderLayer(nn.Module):
    """
    Args:
        d_model (int): the dimension of keys/values/queries in
            MultiHeadedAttention, also the input size of
            the first-layer of the PositionwiseFeedForward.
        heads (int): the number of heads for MultiHeadedAttention.
        d_ff (int): the second-layer of the PositionwiseFeedForward.
        dropout (float): dropout probability (0-1.0).
        self_attn_type (string): type of self-attention scaled-dot, average
    """

    def __init__(self, d_model, heads, d_ff, dropout,
                 self_attn_type="scaled-dot"):
        super(TransformerDecoderLayer, self).__init__()

        self.self_attn_type = self_attn_type

        if self_attn_type == "scaled-dot":
            self.self_attn = onmt.modules.MultiHeadedAttention(
                heads, d_model, dropout=dropout)
        elif self_attn_type == "average":
            self.self_attn = onmt.modules.AverageAttention(
                d_model, dropout=dropout)

        self.context_attn = onmt.modules.MultiHeadedAttention(
            heads, d_model, dropout=dropout)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.layer_norm_1 = onmt.modules.LayerNorm(d_model)
        self.layer_norm_2 = onmt.modules.LayerNorm(d_model)
        self.dropout = dropout
        self.drop = nn.Dropout(dropout)
        mask = self._get_attn_subsequent_mask(MAX_SIZE)
        # Register self.mask as a buffer in TransformerDecoderLayer, so
        # it gets TransformerDecoderLayer's cuda behavior automatically.
        self.register_buffer('mask', mask)

    def forward(self, inputs, memory_bank, src_pad_mask, tgt_pad_mask,
                previous_input=None, layer_cache=None, step=None):
        """
        Args:
            inputs (`FloatTensor`): `[batch_size x 1 x model_dim]`
            memory_bank (`FloatTensor`): `[batch_size x src_len x model_dim]`
            src_pad_mask (`LongTensor`): `[batch_size x 1 x src_len]`
            tgt_pad_mask (`LongTensor`): `[batch_size x 1 x 1]`

        Returns:
            (`FloatTensor`, `FloatTensor`, `FloatTensor`):
            * output `[batch_size x 1 x model_dim]`
            * attn `[batch_size x 1 x src_len]`
            * all_input `[batch_size x current_step x model_dim]`
        """
        dec_mask = torch.gt(tgt_pad_mask +
                            self.mask[:, :tgt_pad_mask.size(1),
                                      :tgt_pad_mask.size(1)], 0)
        input_norm = self.layer_norm_1(inputs)
        all_input = input_norm
        if previous_input is not None:
            all_input = torch.cat((previous_input, input_norm), dim=1)
            dec_mask = None

        # masked multi-head self-attention over the target side
        if self.self_attn_type == "scaled-dot":
            query, attn = self.self_attn(all_input, all_input, input_norm,
                                         mask=dec_mask,
                                         layer_cache=layer_cache,
                                         type="self")
        elif self.self_attn_type == "average":
            query, attn = self.self_attn(input_norm, mask=dec_mask,
                                         layer_cache=layer_cache, step=step)

        query = self.drop(query) + inputs      # residual connection

        # context attention: queries come from the decoder,
        # keys/values come from the encoder memory bank
        query_norm = self.layer_norm_2(query)
        mid, attn = self.context_attn(memory_bank, memory_bank, query_norm,
                                      mask=src_pad_mask,
                                      layer_cache=layer_cache,
                                      type="context")
        output = self.feed_forward(self.drop(mid) + query)

        return output, attn, all_input
5 Label smoothing
The standard cross-entropy loss is:
\[\mathrm{loss} = -\sum_{k=1}^{K} true_k \log p(k|x)\]
\[p(k|x) = \mathrm{softmax}(\mathrm{logits}_k)\]
\[\mathrm{logits}_k = \sum_i w_{ik} z_i\]
The gradient is:
\[\Delta w_{ik} = \frac{\partial \mathrm{loss}}{\partial w_{ik}} = \frac{\partial \mathrm{loss}}{\partial \mathrm{logits}_k}\,\frac{\partial \mathrm{logits}_k}{\partial w_{ik}} = (y_k - label_k)\, z_i\]
where $y_k = p(k|x)$ is the predicted probability and $label_k$ is the target probability of class k.
This has a problem:
only the correct class contributes to the loss; the incorrect classes, whose probability in the labels is 0, contribute nothing, so optimization is pushed in a single direction and easily overfits.
Label smoothing is therefore proposed: make the probability of the correct class in the labels smaller than 1, and the probabilities of the incorrect classes larger than 0.
That is, the previous $label=[0,0,1,0,0]$ is smoothed with a fixed parameter $\alpha$: the position with probability 1 gives up this small amount of probability, and the positions labeled 0 share $\alpha$ equally, giving:
\[label^{new} = \begin{bmatrix} \frac{\alpha}{4} & \frac{\alpha}{4} & 1-\alpha & \frac{\alpha}{4} & \frac{\alpha}{4} \end{bmatrix}\]
The loss function becomes:
\[\mathrm{loss} = -\sum_{k=1}^{K} label_k^{new} \log p(k|x)\]
\[label_k^{new} = (1-\alpha)\,\delta_{k,y} + \frac{\alpha}{K} \qquad (\delta_{k,y} = 1 \ \text{if}\ k = y \ \text{else}\ 0)\]
\[\mathrm{loss} = -(1-\alpha)\sum_{k=1}^{K} \delta_{k,y} \log p(k|x) - \sum_{k=1}^{K} \frac{\alpha}{K} \log p(k|x)\]
\[\mathrm{loss} = (1-\alpha)\,CrossEntropy(label, p(k|x)) + \alpha\,CrossEntropy(u, p(k|x)), \qquad u_k = \frac{1}{K}\]
where $label$ is the original one-hot label and $u$ is the uniform distribution over the K classes.
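For concreteness, a tiny sketch of the smoothed target for K = 5 classes, correct class index 2 and $\alpha = 0.1$, spreading $\alpha$ over the 4 incorrect classes as in the example above:

import torch

K, correct, alpha = 5, 2, 0.1
label_new = torch.full((K,), alpha / (K - 1))   # alpha/4 on every incorrect class
label_new[correct] = 1.0 - alpha                # 1 - alpha on the correct class
print(label_new)   # tensor([0.0250, 0.0250, 0.9000, 0.0250, 0.0250])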
Introducing the relative entropy (KL divergence):
\[D_{KL}(Y||X) = \sum_i Y(i)\log\frac{Y(i)}{X(i)} = \sum_i \big[ Y(i)\log Y(i) - Y(i)\log X(i) \big]\]
In PyTorch, torch.nn.functional.kl_div is used to compute this relative entropy.
torch.nn.functional.kl_div(y, x), with $x=[x_1, x_2, \ldots, x_N]$ and $y=[y_1, y_2, \ldots, y_N]$, computes
$L = l_1 + l_2 + \cdots + l_N$, where $l_i = x_i(\log(x_i) - y_i)$.
Example: x = [3], y = [2], torch.nn.functional.kl_div(y, x) = 3(log 3 - 2) = -2.7042.
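A quick check of that example (note that in real use the first argument is expected to contain log-probabilities; plain numbers are used here only to mirror the formula):

import torch
import torch.nn.functional as F

x = torch.tensor([3.0])   # target
y = torch.tensor([2.0])   # input (normally log-probabilities)
print(F.kl_div(y, x))     # tensor(-2.7042) = 3 * (log(3) - 2)

The OpenNMT LabelSmoothingLoss below builds the smoothed target distribution and minimizes exactly this KL divergence.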
class LabelSmoothingLoss(nn.Module):
    """
    With label smoothing,
    KL-divergence between q_{smoothed ground truth prob.}(w)
    and p_{prob. computed by model}(w) is minimized.
    """

    def __init__(self, label_smoothing, tgt_vocab_size, ignore_index=-100):
        assert 0.0 < label_smoothing <= 1.0
        self.padding_idx = ignore_index
        super(LabelSmoothingLoss, self).__init__()

        smoothing_value = label_smoothing / (tgt_vocab_size - 2)
        one_hot = torch.full((tgt_vocab_size,), smoothing_value)
        one_hot[self.padding_idx] = 0
        self.register_buffer('one_hot', one_hot.unsqueeze(0))

        self.confidence = 1.0 - label_smoothing

    def forward(self, output, target):
        """
        output (FloatTensor): batch_size x n_classes
        target (LongTensor): batch_size
        """
        model_prob = self.one_hot.repeat(target.size(0), 1)
        model_prob.scatter_(1, target.unsqueeze(1), self.confidence)
        model_prob.masked_fill_((target == self.padding_idx).unsqueeze(1), 0)

        return F.kl_div(output, model_prob, size_average=False)
Appendix: RNMT+, combining the Transformer with the RNN (The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation)
(1) RNN: hard to train and comparatively weak in expressive power (trainability versus expressivity).
(2) Transformer: a strong feature extractor, but it has no memory mechanism, so position embeddings have to be introduced.