Basic Information
Author: Piji Li (李丕績), Tencent AI Lab
Model: Transformer + copy mechanism for abstractive summarization
Dataset: CNN/Daily Mail
Parameters
WARNING: IN DEBUGGING MODE
USE COPY MECHANISM
USE COVERAGE MECHANISM
USE AVG NLL as LOSS
USE LEARNABLE W2V EMBEDDING
RNN TYPE: transformer
idx_gpu: 0
norm_clip: 2 # gradient clipping by norm
dim_x: 512
dim_y: 512
len_x: 401
len_y: 101
num_x: 1
num_y: 1
hidden_size: 512
d_ff: 1024
num_heads: 8 # 8-head attention
dropout: 0.2
num_layers: 4
label_smoothing: 0.1
alpha: 0.9
beta: 5
batch_size: 5
testing_batch_size: 1
min_len_predict: 35
max_len_predict: 120
max_byte_predict: None
testing_print_size: 500
lr: 0.15
beam_size: 4
max_epoch: 50
print_time: 20 # number of times per epoch to print progress and save the model
save_epoch: 1
dict_size: 50003 # vocabulary size
pad_token_idx: 0
loading train set...
num_files = 13
num_batches = 3
Model Structure
Model(
(tok_embed): Embedding(50003, 512, padding_idx=0)
(pos_embed): LearnedPositionalEmbedding(
(weights): Embedding(1024, 512)
)
(enc_layers): ModuleList(
(0): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
(1): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
(2): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
(3): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
)
(dec_layers): ModuleList(
(0): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
(1): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
(2): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
(3): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
)
(attn_mask): SelfAttentionMask()
(emb_layer_norm): LayerNorm()
(word_prob): WordProbLayer(
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(proj): Linear(in_features=1536, out_features=50003, bias=True)
)
(smoothing): LabelSmoothing()
)
Model structure:
1. Embeddings: token embedding, positional embedding
2. Encoder: 4 blocks (8-head attention)
3. Decoder: 4 blocks (8-head attention)
4. Masked attention layer
5. Layer normalization
6. Word probability layer: maps the decoder states to a probability distribution over the vocabulary
Source Code Analysis
1. prepare_data.py
Processes the data. Taking the test set as an example, the result contains 11,489 article-summary pairs, organized as follows (an illustrative sketch is given below):
All pairs form a top-level list; each pair consists of two sub-lists, one for the article and one for the summary.
Each of the article and summary sub-lists in turn contains two elements: the tokenized sequence and the original, untokenized text.
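A minimal sketch of this nested-list layout; the example strings and variable names below are illustrative, not taken from the actual data files.

test_set = [
    [   # one article-summary pair
        [["police", "arrested", "a", "man", "..."], "Police arrested a man ..."],  # article: [tokens, raw text]
        [["man", "arrested", "..."], "Man arrested ..."],                          # summary: [tokens, raw text]
    ],
    # ... 11489 pairs in total in the test set
]
article_tokens = test_set[0][0][0]   # tokenized article of the first pair
summary_raw    = test_set[0][1][1]   # raw (untokenized) summary of the first pair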
2. model.py
Builds the Transformer summarizer model with the PyTorch framework:
2.1. __init__
Sets the model hyper-parameters. The most important ones:
the copy mechanism and the coverage mechanism are used;
NLL is used as the loss function;
d_ff size = 1024;
context size = 512;
hidden size = 512.
It also defines several frequently used components:
label smoothing
learnable token embedding and positional embedding
word probability layer
embedding layer normalization
2.2. structure of encoder & decoder
Both the encoder and the decoder contain several (4) basic modules. Each module is made up of the following sub-modules: a self-attention block, two fully connected layers, an attention layer-normalization layer, and a feed-forward layer-normalization layer.
The internal structure of the self-attention block is described under class MultiheadAttention (section 3.2).
2.2.1. embedding(transformer.py)
Token embeddings use nn.Embedding and are learned during training. Parameters: vocabulary size = 50003, embedding dim = 512.
Positional embeddings also use nn.Embedding and are likewise learned during training. Parameters: init_size = 1024 (the maximum position), embedding dim = 512; the weights are randomly initialized from a normal distribution.
The author also implements the class SinusoidalPositionalEmbedding, which turns the learnable positional embedding into a fixed one, i.e. the way positional information is injected in the original paper (see section 3.4). A small sketch of the two learnable tables follows.
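A minimal sketch of the two learnable embedding tables, using the hyper-parameters from the config above; the initialization std and the dummy input shapes are assumptions, not the repo's exact values.

import torch
import torch.nn as nn

vocab_size, embed_dim, max_pos, pad_idx = 50003, 512, 1024, 0
tok_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
pos_embed = nn.Embedding(max_pos, embed_dim)
nn.init.normal_(pos_embed.weight, mean=0.0, std=embed_dim ** -0.5)  # random normal initialization (std is an assumption)

tokens = torch.randint(1, vocab_size, (401, 2))        # seq_len x bsz, len_x = 401 as in the config
positions = torch.arange(tokens.size(0)).unsqueeze(1)  # seq_len x 1, broadcast over the batch
x = tok_embed(tokens) + pos_embed(positions)           # seq_len x bsz x embed_dim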
2.2.2. encoder & decoder
# encoder
self.enc_layers = nn.ModuleList()
for i in range(self.num_layers):
self.enc_layers.append(TransformerLayer(self.dim_x, self.d_ff, self.num_heads, self.dropout))
# decoder
self.dec_layers = nn.ModuleList()
for i in range(self.num_layers):
self.dec_layers.append(TransformerLayer(self.dim_x, self.d_ff, self.num_heads, self.dropout, with_external=True))
In PyTorch, nn.ModuleList() is used to stack layers; here num_layers = 4, so a stack of 4 Transformer layers forms the encoder (and likewise the decoder).
The only difference is how the basic TransformerLayer unit is initialized: the decoder layers are constructed with with_external=True.
2.3. encoding
Flow (a condensed sketch follows the list):
- Compute the input representation (token embedding + positional embedding)
- Layer normalization
- Dropout
- Padding mask
- Run the N-layer encoder stack (Transformer layers whose parameters are not shared). The output of each encoder layer is the input of the next; every layer receives the sequence representation x and the padding mask.
- Return the final encoded representation x
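A condensed sketch of this flow, written as a plain function; the argument names follow the description above, and the exact signatures in the repo may differ.

import torch
import torch.nn.functional as F

def encode(inp, tok_embed, pos_embed, emb_layer_norm, enc_layers,
           dropout=0.2, padding_idx=0, training=False):
    # inp: seq_len x bsz token ids
    positions = torch.arange(inp.size(0), device=inp.device).unsqueeze(1)
    x = tok_embed(inp) + pos_embed(positions)               # token + positional embedding
    x = emb_layer_norm(x)                                   # layer normalization
    x = F.dropout(x, p=dropout, training=training)
    padding_mask = inp.eq(padding_idx)                      # mask of pad positions
    for layer in enc_layers:                                # N stacked layers, no parameter sharing
        x, _, _ = layer(x, self_padding_mask=padding_mask)  # each output feeds the next layer
    return x, padding_mask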
2.4. decoding
Decoding covers two cases:
- coverage mechanism + copy mechanism
- coverage mechanism only
Flow (a condensed sketch follows the list):
- The first case takes two extra arguments, x_ext and max_ext_len: x_ext is the source sequence indexed in the extended vocabulary (source OOV words also receive ids), and max_ext_len is the number of OOV words.
- Compute the input representation (token embedding + positional embedding)
- Layer normalization
- Dropout
- Padding mask
- Run the N-layer decoder stack. The output of each decoder layer is the input of the next; every layer receives the sequence representation x, the padding mask, the self-attention mask, the external memories, and the external padding mask.
- Compute the word probability distribution from the final decoder states. There are two variants, implemented in WordProbLayer.
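A condensed sketch of the decoding flow, in the same style as the encoding sketch above; word_prob stands in for the WordProbLayer call, and its exact signature is an assumption.

import torch
import torch.nn.functional as F

def decode(y_inp, memory, memory_padding_mask,
           tok_embed, pos_embed, emb_layer_norm, dec_layers, attn_mask, word_prob,
           x_ext=None, max_ext_len=0, dropout=0.2, padding_idx=0, training=False):
    # y_inp: seq_len x bsz target-side token ids; memory: encoder outputs
    positions = torch.arange(y_inp.size(0), device=y_inp.device).unsqueeze(1)
    x = tok_embed(y_inp) + pos_embed(positions)         # token + positional embedding
    x = emb_layer_norm(x)                               # layer normalization
    x = F.dropout(x, p=dropout, training=training)
    padding_mask = y_inp.eq(padding_idx)                # padding mask
    causal_mask = attn_mask(y_inp.size(0))              # self-attention (causal) mask
    for layer in dec_layers:                            # N stacked decoder layers
        x, _, _ = layer(x,
                        self_padding_mask=padding_mask,
                        self_attn_mask=causal_mask,
                        external_memories=memory,
                        external_padding_mask=memory_padding_mask)
    # final decoder states -> word probability distribution (two variants, see WordProbLayer)
    return word_prob(x, memory, memory_padding_mask, x_ext, max_ext_len)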
2.5. word probability projection (WordProbLayer.py)
A conditional branch decides between two cases:
- Copy mechanism:
  - Use the external_attention module (class MultiheadAttention) to compute attention, with query = decoder final hidden states and key = value = encoder hidden states; it returns (attention output, attention weights).
  - Concatenate the decoder states, the decoder input embeddings, and the external attention output; apply a linear projection followed by a softmax to obtain pred.
  - If the source article contains OOV words (max_ext_len > 0), pred is concatenated with an all-zero tensor so that its last dimension equals the fixed vocabulary size + the number of OOVs.
  - Compute a gate: concatenate the decoder states, the decoder input embeddings, and the external attention output; apply a linear projection followed by a sigmoid.
  - Final distribution: pred = gate * pred + (1 - gate) * attention_weights, where the attention weights are added onto their extended-vocabulary ids with scatter_add_ (explained below).
- No copy: a fully connected layer performs a simple linear projection, followed by a softmax, to produce the distribution over the vocabulary.
Signature: scatter_add_(dim, indexTensor, otherTensor) → Tensor
Usage: selfTensor.scatter_add_(dim, indexTensor, otherTensor)
# adds every value of otherTensor into selfTensor at the positions specified by indexTensor
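A self-contained sketch of how the copy (attention) distribution can be merged into the extended-vocabulary distribution with scatter_add_; every shape and tensor below is an illustrative stand-in, not the repo's exact data.

import torch

seq_y, bsz, V, n_oov, seq_x = 3, 2, 10, 4, 7
pred = torch.softmax(torch.randn(seq_y, bsz, V), dim=-1)      # generation distribution over the fixed vocab
attn = torch.softmax(torch.randn(seq_y, bsz, seq_x), dim=-1)  # copy weights over the source positions
gate = torch.sigmoid(torch.randn(seq_y, bsz, 1))              # generate-vs-copy gate
x_ext = torch.randint(0, V + n_oov, (seq_y, bsz, seq_x))      # source ids in the extended vocabulary

# pad the generation distribution with zeros for the OOV slots
pred = torch.cat([pred, pred.new_zeros(seq_y, bsz, n_oov)], dim=-1)

# final distribution: gate * generate + (1 - gate) * copy, where the copy
# weights are added onto their extended-vocabulary ids via scatter_add_
final = gate * pred
final.scatter_add_(-1, x_ext, (1.0 - gate) * attn)
print(final.sum(-1))  # each position still sums to 1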
2.6. loss
2.6.1. label_smoothing_loss (label_smoothing.py)
def label_smotthing_loss(self, y_pred, y, y_mask, avg=True):
seq_len, bsz = y.size()
y_pred = T.log(y_pred.clamp(min=1e-8))
loss = self.smoothing(y_pred.view(seq_len * bsz, -1), y.view(seq_len * bsz, -1))
if avg:
return loss / T.sum(y_mask)
else:
return loss / bsz
The loss implementation uses the clamp function:
torch.clamp(input, min, max, out=None) → Tensor
It limits the values of the input tensor to the range [min, max]: values above max become max, values below min become min, and values in between are left unchanged.
Then y_pred (the predicted probability of each target word) and the ground-truth y are passed to the smoothing criterion; the author implements this as class LabelSmoothing.
The class is initialized with the label-smoothing factor and the index of the padding token in the vocabulary. The target is used to build the smoothed distribution (model prob),
and the KL divergence between model prob and y_pred is returned as the loss.
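A minimal sketch of a label-smoothing criterion of this kind, modelled on the description above; the class name, constructor arguments, and tensor shapes are assumptions, and the repo's LabelSmoothing may differ in its details.

import torch
import torch.nn as nn

class LabelSmoothingSketch(nn.Module):
    def __init__(self, vocab_size, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab_size) log-probabilities; target: (N, 1) gold token ids
        model_prob = log_probs.new_full(log_probs.size(),
                                        self.smoothing / (self.vocab_size - 2))
        model_prob.scatter_(1, target, 1.0 - self.smoothing)        # put most mass on the gold token
        model_prob[:, self.padding_idx] = 0.0                       # never predict padding
        model_prob.masked_fill_(target == self.padding_idx, 0.0)    # ignore padding targets entirely
        return self.criterion(log_probs, model_prob)                # KL(model_prob || exp(log_probs)), summed

In label_smotthing_loss above, the returned sum is then divided either by the number of non-padding tokens (avg=True) or by the batch size.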
2.6.2. negative_log_likelihood
def nll_loss(self, y_pred, y, y_mask, avg=True):
    # negative log-likelihood of the gold tokens, masked over padding positions
    cost = -T.log(T.gather(y_pred, 2, y.view(y.size(0), y.size(1), 1)))
    cost = cost.view(y.shape)
    y_mask = y_mask.view(y.shape)
    if avg:
        cost = T.sum(cost * y_mask, 0) / T.sum(y_mask, 0)  # per-sequence mean over non-pad tokens
    else:
        cost = T.sum(cost * y_mask, 0)                     # per-sequence sum
    cost = cost.view((y.size(1), -1))
    return T.mean(cost)                                    # average over the batch
3. transformer.py
3.1. class TransformerLayer
class TransformerLayer(nn.Module):
    def __init__(self, embed_dim, ff_embed_dim, num_heads, dropout, with_external=False, weights_dropout = True):
        super(TransformerLayer, self).__init__()
        self.self_attn = MultiheadAttention(embed_dim, num_heads, dropout, weights_dropout)
        self.fc1 = nn.Linear(embed_dim, ff_embed_dim)
        self.fc2 = nn.Linear(ff_embed_dim, embed_dim)
        self.attn_layer_norm = LayerNorm(embed_dim)
        self.ff_layer_norm = LayerNorm(embed_dim)
        self.with_external = with_external
        self.dropout = dropout
        if self.with_external:
            self.external_attn = MultiheadAttention(embed_dim, num_heads, dropout, weights_dropout)
            self.external_layer_norm = LayerNorm(embed_dim)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.normal_(self.fc1.weight, std=0.02)
        nn.init.normal_(self.fc2.weight, std=0.02)
        nn.init.constant_(self.fc1.bias, 0.)
        nn.init.constant_(self.fc2.bias, 0.)

    def forward(self, x, kv = None,
                self_padding_mask = None, self_attn_mask = None,
                external_memories = None, external_padding_mask=None,
                need_weights = False):
        # x: seq_len x bsz x embed_dim
        residual = x  # residual: in the add & norm step, the residual and the sub-layer output are summed before normalization
        if kv is None:
            x, self_attn = self.self_attn(query=x, key=x, value=x, key_padding_mask=self_padding_mask, attn_mask=self_attn_mask, need_weights = need_weights)
        else:
            x, self_attn = self.self_attn(query=x, key=kv, value=kv, key_padding_mask=self_padding_mask, attn_mask=self_attn_mask, need_weights = need_weights)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.attn_layer_norm(residual + x)  # sum the residual and the attention output, then normalize
        if self.with_external:
            residual = x
            x, external_attn = self.external_attn(query=x, key=external_memories, value=external_memories, key_padding_mask=external_padding_mask, need_weights = need_weights)
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = self.external_layer_norm(residual + x)
        else:
            external_attn = None
        residual = x
        x = gelu(self.fc1(x))  # first fully connected layer + Gaussian Error Linear Unit (GELU)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.fc2(x)  # second fully connected layer
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.ff_layer_norm(residual + x)
        return x, self_attn, external_attn
Basic components:
- Self-attention layer (class MultiheadAttention): the number of heads is specified when the Transformer layer is created.
- External attention layer (class MultiheadAttention): when with_external is True, the attention computation takes external inputs into account. That is, query, key, and value are no longer all the same; the key and value may come from the source side, which is what the decoder needs.
- Two fully connected layers, of shapes (embed_dim, ff_embed_dim) and (ff_embed_dim, embed_dim).
- Attention layer normalization + feed-forward layer normalization (class LayerNorm).
- Dropout.
- Parameter initialization: mainly for the weights and biases of the two fully connected layers.
Behavior (the forward function):
- Store the residual. In the add & norm step, the computed output and the residual are summed before the normalization is applied.
- Self-attention, using class MultiheadAttention; its forward function is given query, key, value, plus key_padding_mask and attn_mask.
- Dropout
- Add & attention-layer normalization
- If with_external is True, additionally:
  - external attention
  - dropout
  - add & external-layer normalization
- Store the residual again
- Fully connected layer fc1
- Gaussian Error Linear Unit (GELU) activation
- Dropout
- Fully connected layer fc2
- Dropout
- Add & feed-forward-layer normalization
The GELU activation function used here is defined as follows:
def gelu(x):
cdf = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
return cdf*x
3.2. class MultiheadAttention
Initializing MultiheadAttention requires (see the TransformerLayer constructor above):
- embed_dim (dim_x): 512
- number of attention heads: 8
- dropout (and weights_dropout)
head_dim, the dimensionality of each attention head, is computed as head_dim = embed_dim // num_heads; embed_dim must be divisible by num_heads.
The attention is the scaled dot-product attention from the paper; the scores are scaled by head_dim ** -0.5, i.e. divided by the square root of head_dim.
Input and output projections:
- in_proj_weight: (3*embed_dim, embed_dim)
- in_proj_bias: (3*embed_dim). It is defined as a single tensor: the first third belongs to the Query, the middle third to the Key, and the last third to the Value. A helper function _in_proj decides, based on its arguments, which of q/k/v need to be projected and slices out the corresponding pieces; the projection parameters always come from in_proj_weight and in_proj_bias.
- out_proj: (embed_dim, embed_dim)
Depending on the inputs, the corresponding input projections are applied to q, k, and v:
- If q, k, and v are identical, this is self-attention.
- If they are not all identical but k and v are the same, this is encoder-decoder attention.
- If q, k, and v are all different, this is general attention.
The attention weights are masked with masked_fill_: it takes a ByteTensor, and the positions where the mask is 1 are filled with a given value (here -inf, so that these positions become 0 after the softmax).
The attention weights are masked, then passed through a softmax, then dropout.
The attention output is the bmm (batch matrix multiply) of the attention weights with the values, wrapped in another dropout.
A final output projection yields the MultiheadAttention output. A condensed sketch of this computation follows.
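A condensed, runnable sketch of this computation for the self-attention case (q = k = v); in_proj_weight, in_proj_bias, and out_proj are passed in as stand-ins for the module's parameters, and the reshaping details may differ slightly from the repo.

import torch
import torch.nn as nn
import torch.nn.functional as F

def multihead_self_attention(x, in_proj_weight, in_proj_bias, out_proj,
                             num_heads, key_padding_mask=None,
                             dropout=0.1, training=False):
    # x: seq_len x bsz x embed_dim; key_padding_mask: bsz x seq_len (bool), True at pad positions
    seq_len, bsz, embed_dim = x.size()
    head_dim = embed_dim // num_heads            # embed_dim must be divisible by num_heads
    scaling = head_dim ** -0.5                   # 1 / sqrt(head_dim)

    # one big input projection, then split into q, k, v
    q, k, v = F.linear(x, in_proj_weight, in_proj_bias).chunk(3, dim=-1)
    q = q * scaling

    # reshape to (bsz * num_heads, seq_len, head_dim) for batched matrix multiplication
    def split_heads(t):
        return t.contiguous().view(seq_len, bsz * num_heads, head_dim).transpose(0, 1)
    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    attn_weights = torch.bmm(q, k.transpose(1, 2))          # scaled dot-product scores
    if key_padding_mask is not None:                        # mask padded key positions with -inf
        attn_weights = attn_weights.view(bsz, num_heads, seq_len, seq_len)
        attn_weights = attn_weights.masked_fill(
            key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))
        attn_weights = attn_weights.view(bsz * num_heads, seq_len, seq_len)
    attn_weights = F.softmax(attn_weights, dim=-1)
    attn_weights = F.dropout(attn_weights, p=dropout, training=training)

    attn = torch.bmm(attn_weights, v)                       # weighted sum of the values
    attn = attn.transpose(0, 1).contiguous().view(seq_len, bsz, embed_dim)
    return out_proj(attn)                                   # output projection

# illustrative usage with random parameters
embed_dim, num_heads, seq_len, bsz = 512, 8, 10, 2
x = torch.randn(seq_len, bsz, embed_dim)
w, b = torch.randn(3 * embed_dim, embed_dim), torch.zeros(3 * embed_dim)
out = multihead_self_attention(x, w, b, nn.Linear(embed_dim, embed_dim), num_heads)
print(out.shape)  # torch.Size([10, 2, 512])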
3.3. class LayerNorm
def forward(self, x):
u = x.mean(-1, keepdim=True)
s = (x - u).pow(2).mean(-1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.eps)
return self.weight * x + self.bias
Flow:
- Compute the mean over the last dimension
- Compute the variance
- Subtract the mean and divide by the standard deviation
- Apply the learned elementwise affine transform (scale by weight, shift by bias)
3.4. Positional Embedding
Two classes are defined:
- A learnable positional encoding: class LearnedPositionalEmbedding
- A fixed positional encoding given by sinusoids: class SinusoidalPositionalEmbedding
The latter matches the definition in the paper "Attention Is All You Need"; a sketch is given below.
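A sketch of the fixed sinusoidal table as defined in the paper; the repo's SinusoidalPositionalEmbedding builds an equivalent table, but the construction below follows the paper rather than the repo's exact code.

import math
import torch

def sinusoidal_positional_embedding(num_positions, embed_dim):
    # even dimensions get sine, odd dimensions get cosine, with wavelengths
    # forming a geometric progression from 2*pi to 10000*2*pi
    pe = torch.zeros(num_positions, embed_dim)
    position = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float)
                         * (-math.log(10000.0) / embed_dim))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (num_positions, embed_dim), fixed (not learned)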