Basic Information
Author: Piji Li (李丕績), Tencent AI Lab
Model: Transformer + copy mechanism for abstractive summarization
Dataset: CNN/Daily Mail
Parameters
WARNING: IN DEBUGGING MODE
USE COPY MECHANISM
USE COVERAGE MECHANISM
USE AVG NLL as LOSS
USE LEARNABLE W2V EMBEDDING
RNN TYPE: transformer
idx_gpu: 0
norm_clip: 2 # gradient clipping by norm
dim_x: 512
dim_y: 512
len_x: 401
len_y: 101
num_x: 1
num_y: 1
hidden_size: 512
d_ff: 1024
num_heads: 8 # 8-head attention
dropout: 0.2
num_layers: 4
label_smoothing: 0.1
alpha: 0.9
beta: 5
batch_size: 5
testing_batch_size: 1
min_len_predict: 35
max_len_predict: 120
max_byte_predict: None
testing_print_size: 500
lr: 0.15
beam_size: 4
max_epoch: 50
print_time: 20 # number of times per epoch to print progress and save the model
save_epoch: 1
dict_size: 50003 # vocabulary size
pad_token_idx: 0
loading train set...
num_files = 13
num_batches = 3
Model Structure
Model(
(tok_embed): Embedding(50003, 512, padding_idx=0)
(pos_embed): LearnedPositionalEmbedding(
(weights): Embedding(1024, 512)
)
(enc_layers): ModuleList(
(0): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
(1): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
(2): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
(3): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
)
)
(dec_layers): ModuleList(
(0): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
(1): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
(2): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
(3): TransformerLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(attn_layer_norm): LayerNorm()
(ff_layer_norm): LayerNorm()
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(external_layer_norm): LayerNorm()
)
)
(attn_mask): SelfAttentionMask()
(emb_layer_norm): LayerNorm()
(word_prob): WordProbLayer(
(external_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(proj): Linear(in_features=1536, out_features=50003, bias=True)
)
(smoothing): LabelSmoothing()
)
Model structure:
1. Embeddings: token embedding, positional embedding
2. Encoder: 4 blocks (8-head attention)
3. Decoder: 4 blocks (8-head attention)
4. Masked attention layer
5. Layer normalization
6. Word probability layer: maps the decoder states to a probability distribution over the vocabulary
Source Code Analysis
1. prepare_data.py
Processes the data. Taking the test set as an example, the result contains 11,489 article-summary pairs, organized as follows (an illustrative sketch is given below):
All pairs form a top-level list; each pair consists of two sub-lists, one for the article and one for the summary.
Each of the article and summary sub-lists in turn contains two elements: the tokenized sequence and the original, untokenized text.
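A minimal sketch of this nested-list layout; the example strings and variable names below are illustrative, not taken from the actual data files.

test_set = [
    [   # one article-summary pair
        [["police", "arrested", "a", "man", "..."], "Police arrested a man ..."],  # article: [tokens, raw text]
        [["man", "arrested", "..."], "Man arrested ..."],                          # summary: [tokens, raw text]
    ],
    # ... 11489 pairs in total in the test set
]
article_tokens = test_set[0][0][0]   # tokenized article of the first pair
summary_raw    = test_set[0][1][1]   # raw (untokenized) summary of the first pair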
2. model.py
Builds the Transformer summarizer model with the PyTorch framework:
2.1. __init__
Sets the model hyper-parameters. The most important ones:
the copy mechanism and the coverage mechanism are used;
NLL is used as the loss function;
d_ff size = 1024;
context size = 512;
hidden size = 512.
It also defines several frequently used components:
label smoothing
learnable token embedding and positional embedding
word probability layer
embedding layer normalization
2.2. structure of encoder & decoder
Both the encoder and the decoder contain several (4) basic modules. Each module is made up of the following sub-modules: a self-attention block, two fully connected layers, an attention layer-normalization layer, and a feed-forward layer-normalization layer.
The internal structure of the self-attention block is described under class MultiheadAttention (section 3.2).
2.2.1. embedding(transformer.py)
Token embeddings use nn.Embedding and are learned during training. Parameters: vocabulary size = 50003, embedding dim = 512.
Positional embeddings also use nn.Embedding and are likewise learned during training. Parameters: init_size = 1024 (the maximum position), embedding dim = 512; the weights are randomly initialized from a normal distribution.
The author also implements the class SinusoidalPositionalEmbedding, which turns the learnable positional embedding into a fixed one, i.e. the way positional information is injected in the original paper (see section 3.4). A small sketch of the two learnable tables follows.
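A minimal sketch of the two learnable embedding tables, using the hyper-parameters from the config above; the initialization std and the dummy input shapes are assumptions, not the repo's exact values.

import torch
import torch.nn as nn

vocab_size, embed_dim, max_pos, pad_idx = 50003, 512, 1024, 0
tok_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
pos_embed = nn.Embedding(max_pos, embed_dim)
nn.init.normal_(pos_embed.weight, mean=0.0, std=embed_dim ** -0.5)  # random normal initialization (std is an assumption)

tokens = torch.randint(1, vocab_size, (401, 2))        # seq_len x bsz, len_x = 401 as in the config
positions = torch.arange(tokens.size(0)).unsqueeze(1)  # seq_len x 1, broadcast over the batch
x = tok_embed(tokens) + pos_embed(positions)           # seq_len x bsz x embed_dim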
2.2.2. encoder & decoder
# encoder
self.enc_layers = nn.ModuleList()
for i in range(self.num_layers):
self.enc_layers.append(TransformerLayer(self.dim_x, self.d_ff, self.num_heads, self.dropout))
# decoder
self.dec_layers = nn.ModuleList()
for i in range(self.num_layers):
self.dec_layers.append(TransformerLayer(self.dim_x, self.d_ff, self.num_heads, self.dropout, with_external=True))
In PyTorch, nn.ModuleList() is used to stack layers; here num_layers = 4, so a stack of 4 Transformer layers forms the encoder (and likewise the decoder).
The only difference is how the basic TransformerLayer unit is initialized: the decoder layers are constructed with with_external=True.
2.3. encoding
Flow (a condensed sketch follows the list):
- Compute the input representation (token embedding + positional embedding)
- Layer normalization
- Dropout
- Padding mask
- Run the N-layer encoder stack (Transformer layers whose parameters are not shared). The output of each encoder layer is the input of the next; every layer receives the sequence representation x and the padding mask.
- Return the final encoded representation x
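A condensed sketch of this flow, written as a plain function; the argument names follow the description above, and the exact signatures in the repo may differ.

import torch
import torch.nn.functional as F

def encode(inp, tok_embed, pos_embed, emb_layer_norm, enc_layers,
           dropout=0.2, padding_idx=0, training=False):
    # inp: seq_len x bsz token ids
    positions = torch.arange(inp.size(0), device=inp.device).unsqueeze(1)
    x = tok_embed(inp) + pos_embed(positions)               # token + positional embedding
    x = emb_layer_norm(x)                                   # layer normalization
    x = F.dropout(x, p=dropout, training=training)
    padding_mask = inp.eq(padding_idx)                      # mask of pad positions
    for layer in enc_layers:                                # N stacked layers, no parameter sharing
        x, _, _ = layer(x, self_padding_mask=padding_mask)  # each output feeds the next layer
    return x, padding_mask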
2.4. decoding
Decoding covers two cases:
- coverage mechanism + copy mechanism
- coverage mechanism only
Flow (a condensed sketch follows the list):
- The first case takes two extra arguments, x_ext and max_ext_len: x_ext is the source sequence indexed in the extended vocabulary (source OOV words also receive ids), and max_ext_len is the number of OOV words.
- Compute the input representation (token embedding + positional embedding)
- Layer normalization
- Dropout
- Padding mask
- Run the N-layer decoder stack. The output of each decoder layer is the input of the next; every layer receives the sequence representation x, the padding mask, the self-attention mask, the external memories, and the external padding mask.
- Compute the word probability distribution from the final decoder states. There are two variants, implemented in WordProbLayer.
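A condensed sketch of the decoding flow, in the same style as the encoding sketch above; word_prob stands in for the WordProbLayer call, and its exact signature is an assumption.

import torch
import torch.nn.functional as F

def decode(y_inp, memory, memory_padding_mask,
           tok_embed, pos_embed, emb_layer_norm, dec_layers, attn_mask, word_prob,
           x_ext=None, max_ext_len=0, dropout=0.2, padding_idx=0, training=False):
    # y_inp: seq_len x bsz target-side token ids; memory: encoder outputs
    positions = torch.arange(y_inp.size(0), device=y_inp.device).unsqueeze(1)
    x = tok_embed(y_inp) + pos_embed(positions)         # token + positional embedding
    x = emb_layer_norm(x)                               # layer normalization
    x = F.dropout(x, p=dropout, training=training)
    padding_mask = y_inp.eq(padding_idx)                # padding mask
    causal_mask = attn_mask(y_inp.size(0))              # self-attention (causal) mask
    for layer in dec_layers:                            # N stacked decoder layers
        x, _, _ = layer(x,
                        self_padding_mask=padding_mask,
                        self_attn_mask=causal_mask,
                        external_memories=memory,
                        external_padding_mask=memory_padding_mask)
    # final decoder states -> word probability distribution (two variants, see WordProbLayer)
    return word_prob(x, memory, memory_padding_mask, x_ext, max_ext_len)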
2.5. word probability projection (WordProbLayer.py)
A conditional branch decides between two cases:
- Copy mechanism:
  - Use the external_attention module (class MultiheadAttention) to compute attention, with query = decoder final hidden states and key = value = encoder hidden states; it returns (attention output, attention weights).
  - Concatenate the decoder states, the decoder input embeddings, and the external attention output; apply a linear projection followed by a softmax to obtain pred.
  - If the source article contains OOV words (max_ext_len > 0), pred is concatenated with an all-zero tensor so that its last dimension equals the fixed vocabulary size + the number of OOVs.
  - Compute a gate: concatenate the decoder states, the decoder input embeddings, and the external attention output; apply a linear projection followed by a sigmoid.
  - Final distribution: pred = gate * pred + (1 - gate) * attention_weights, where the attention weights are added onto their extended-vocabulary ids with scatter_add_ (explained below).
- No copy: a fully connected layer performs a simple linear projection, followed by a softmax, to produce the distribution over the vocabulary.
Signature: scatter_add_(dim, indexTensor, otherTensor) → Tensor
Usage: selfTensor.scatter_add_(dim, indexTensor, otherTensor)
# adds every value of otherTensor into selfTensor at the positions specified by indexTensor
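A self-contained sketch of how the copy (attention) distribution can be merged into the extended-vocabulary distribution with scatter_add_; every shape and tensor below is an illustrative stand-in, not the repo's exact data.

import torch

seq_y, bsz, V, n_oov, seq_x = 3, 2, 10, 4, 7
pred = torch.softmax(torch.randn(seq_y, bsz, V), dim=-1)      # generation distribution over the fixed vocab
attn = torch.softmax(torch.randn(seq_y, bsz, seq_x), dim=-1)  # copy weights over the source positions
gate = torch.sigmoid(torch.randn(seq_y, bsz, 1))              # generate-vs-copy gate
x_ext = torch.randint(0, V + n_oov, (seq_y, bsz, seq_x))      # source ids in the extended vocabulary

# pad the generation distribution with zeros for the OOV slots
pred = torch.cat([pred, pred.new_zeros(seq_y, bsz, n_oov)], dim=-1)

# final distribution: gate * generate + (1 - gate) * copy, where the copy
# weights are added onto their extended-vocabulary ids via scatter_add_
final = gate * pred
final.scatter_add_(-1, x_ext, (1.0 - gate) * attn)
print(final.sum(-1))  # each position still sums to 1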
2.6. loss
2.6.1. label_smoothing_loss (label_smoothing.py)
def label_smotthing_loss(self, y_pred, y, y_mask, avg=True):
seq_len, bsz = y.size()
y_pred = T.log(y_pred.clamp(min=1e-8))
loss = self.smoothing(y_pred.view(seq_len * bsz, -1), y.view(seq_len * bsz, -1))
if avg:
return loss / T.sum(y_mask)
else:
return loss / bsz
The loss implementation uses the clamp function:
torch.clamp(input, min, max, out=None) → Tensor
It limits the values of the input tensor to the range [min, max]: values above max become max, values below min become min, and values in between are left unchanged.
Then y_pred (the predicted probability of each target word) and the ground-truth y are passed to the smoothing criterion; the author implements this as class LabelSmoothing.
The class is initialized with the label-smoothing factor and the index of the padding token in the vocabulary. The target is used to build the smoothed distribution (model prob),
and the KL divergence between model prob and y_pred is returned as the loss.
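A minimal sketch of a label-smoothing criterion of this kind, modelled on the description above; the class name, constructor arguments, and tensor shapes are assumptions, and the repo's LabelSmoothing may differ in its details.

import torch
import torch.nn as nn

class LabelSmoothingSketch(nn.Module):
    def __init__(self, vocab_size, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab_size) log-probabilities; target: (N, 1) gold token ids
        model_prob = log_probs.new_full(log_probs.size(),
                                        self.smoothing / (self.vocab_size - 2))
        model_prob.scatter_(1, target, 1.0 - self.smoothing)        # put most mass on the gold token
        model_prob[:, self.padding_idx] = 0.0                       # never predict padding
        model_prob.masked_fill_(target == self.padding_idx, 0.0)    # ignore padding targets entirely
        return self.criterion(log_probs, model_prob)                # KL(model_prob || exp(log_probs)), summed

In label_smotthing_loss above, the returned sum is then divided either by the number of non-padding tokens (avg=True) or by the batch size.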
2.6.2. negative_log_likelihood
def nll_loss(self, y_pred, y, y_mask, avg=True):
    # negative log-likelihood of the gold tokens, masked over padding positions
    cost = -T.log(T.gather(y_pred, 2, y.view(y.size(0), y.size(1), 1)))
    cost = cost.view(y.shape)
    y_mask = y_mask.view(y.shape)
    if avg:
        cost = T.sum(cost * y_mask, 0) / T.sum(y_mask, 0)  # per-sequence mean over non-pad tokens
    else:
        cost = T.sum(cost * y_mask, 0)                     # per-sequence sum
    cost = cost.view((y.size(1), -1))
    return T.mean(cost)                                    # average over the batch
3. transformer.py
3.1. class TransformerLayer
class TransformerLayer(nn.Module):
    def __init__(self, embed_dim, ff_embed_dim, num_heads, dropout, with_external=False, weights_dropout = True):
        super(TransformerLayer, self).__init__()
        self.self_attn = MultiheadAttention(embed_dim, num_heads, dropout, weights_dropout)
        self.fc1 = nn.Linear(embed_dim, ff_embed_dim)
        self.fc2 = nn.Linear(ff_embed_dim, embed_dim)
        self.attn_layer_norm = LayerNorm(embed_dim)
        self.ff_layer_norm = LayerNorm(embed_dim)
        self.with_external = with_external
        self.dropout = dropout
        if self.with_external:
            self.external_attn = MultiheadAttention(embed_dim, num_heads, dropout, weights_dropout)
            self.external_layer_norm = LayerNorm(embed_dim)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.normal_(self.fc1.weight, std=0.02)
        nn.init.normal_(self.fc2.weight, std=0.02)
        nn.init.constant_(self.fc1.bias, 0.)
        nn.init.constant_(self.fc2.bias, 0.)

    def forward(self, x, kv = None,
                self_padding_mask = None, self_attn_mask = None,
                external_memories = None, external_padding_mask=None,
                need_weights = False):
        # x: seq_len x bsz x embed_dim
        residual = x  # residual: in the add & norm step, the residual and the sub-layer output are summed before normalization
        if kv is None:
            x, self_attn = self.self_attn(query=x, key=x, value=x, key_padding_mask=self_padding_mask, attn_mask=self_attn_mask, need_weights = need_weights)
        else:
            x, self_attn = self.self_attn(query=x, key=kv, value=kv, key_padding_mask=self_padding_mask, attn_mask=self_attn_mask, need_weights = need_weights)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.attn_layer_norm(residual + x)  # sum the residual and the attention output, then normalize
        if self.with_external:
            residual = x
            x, external_attn = self.external_attn(query=x, key=external_memories, value=external_memories, key_padding_mask=external_padding_mask, need_weights = need_weights)
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = self.external_layer_norm(residual + x)
        else:
            external_attn = None
        residual = x
        x = gelu(self.fc1(x))  # first fully connected layer + Gaussian Error Linear Unit (GELU)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.fc2(x)  # second fully connected layer
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.ff_layer_norm(residual + x)
        return x, self_attn, external_attn
Basic components:
- Self-attention layer (class MultiheadAttention): the number of heads is specified when the Transformer layer is created.
- External attention layer (class MultiheadAttention): when with_external is True, the attention computation takes external inputs into account. That is, query, key, and value are no longer all the same; the key and value may come from the source side, which is what the decoder needs.
- Two fully connected layers, of shapes (embed_dim, ff_embed_dim) and (ff_embed_dim, embed_dim).
- Attention layer normalization + feed-forward layer normalization (class LayerNorm).
- Dropout.
- Parameter initialization: mainly for the weights and biases of the two fully connected layers.
Behavior (the forward function):
- Store the residual. In the add & norm step, the computed output and the residual are summed before the normalization is applied.
- Self-attention, using class MultiheadAttention; its forward function is given query, key, value, plus key_padding_mask and attn_mask.
- Dropout
- Add & attention-layer normalization
- If with_external is True, additionally:
  - external attention
  - dropout
  - add & external-layer normalization
- Store the residual again
- Fully connected layer fc1
- Gaussian Error Linear Unit (GELU) activation
- Dropout
- Fully connected layer fc2
- Dropout
- Add & feed-forward-layer normalization
The GELU activation function used here is defined as follows:
def gelu(x):
cdf = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
return cdf*x
3.2. class MultiheadAttention
Initializing MultiheadAttention requires (see the TransformerLayer constructor above):
- embed_dim (dim_x): 512
- number of attention heads: 8
- dropout (and weights_dropout)
head_dim, the dimensionality of each attention head, is computed as head_dim = embed_dim // num_heads; embed_dim must be divisible by num_heads.
The attention is the scaled dot-product attention from the paper; the scores are scaled by head_dim ** -0.5, i.e. divided by the square root of head_dim.
Input and output projections:
- in_proj_weight: (3*embed_dim, embed_dim)
- in_proj_bias: (3*embed_dim). It is defined as a single tensor: the first third belongs to the Query, the middle third to the Key, and the last third to the Value. A helper function _in_proj decides, based on its arguments, which of q/k/v need to be projected and slices out the corresponding pieces; the projection parameters always come from in_proj_weight and in_proj_bias.
- out_proj: (embed_dim, embed_dim)
Depending on the inputs, the corresponding input projections are applied to q, k, and v:
- If q, k, and v are identical, this is self-attention.
- If they are not all identical but k and v are the same, this is encoder-decoder attention.
- If q, k, and v are all different, this is general attention.
The attention weights are masked with masked_fill_: it takes a ByteTensor, and the positions where the mask is 1 are filled with a given value (here -inf, so that these positions become 0 after the softmax).
The attention weights are masked, then passed through a softmax, then dropout.
The attention output is the bmm (batch matrix multiply) of the attention weights with the values, wrapped in another dropout.
A final output projection yields the MultiheadAttention output. A condensed sketch of this computation follows.
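A condensed, runnable sketch of this computation for the self-attention case (q = k = v); in_proj_weight, in_proj_bias, and out_proj are passed in as stand-ins for the module's parameters, and the reshaping details may differ slightly from the repo.

import torch
import torch.nn as nn
import torch.nn.functional as F

def multihead_self_attention(x, in_proj_weight, in_proj_bias, out_proj,
                             num_heads, key_padding_mask=None,
                             dropout=0.1, training=False):
    # x: seq_len x bsz x embed_dim; key_padding_mask: bsz x seq_len (bool), True at pad positions
    seq_len, bsz, embed_dim = x.size()
    head_dim = embed_dim // num_heads            # embed_dim must be divisible by num_heads
    scaling = head_dim ** -0.5                   # 1 / sqrt(head_dim)

    # one big input projection, then split into q, k, v
    q, k, v = F.linear(x, in_proj_weight, in_proj_bias).chunk(3, dim=-1)
    q = q * scaling

    # reshape to (bsz * num_heads, seq_len, head_dim) for batched matrix multiplication
    def split_heads(t):
        return t.contiguous().view(seq_len, bsz * num_heads, head_dim).transpose(0, 1)
    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    attn_weights = torch.bmm(q, k.transpose(1, 2))          # scaled dot-product scores
    if key_padding_mask is not None:                        # mask padded key positions with -inf
        attn_weights = attn_weights.view(bsz, num_heads, seq_len, seq_len)
        attn_weights = attn_weights.masked_fill(
            key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))
        attn_weights = attn_weights.view(bsz * num_heads, seq_len, seq_len)
    attn_weights = F.softmax(attn_weights, dim=-1)
    attn_weights = F.dropout(attn_weights, p=dropout, training=training)

    attn = torch.bmm(attn_weights, v)                       # weighted sum of the values
    attn = attn.transpose(0, 1).contiguous().view(seq_len, bsz, embed_dim)
    return out_proj(attn)                                   # output projection

# illustrative usage with random parameters
embed_dim, num_heads, seq_len, bsz = 512, 8, 10, 2
x = torch.randn(seq_len, bsz, embed_dim)
w, b = torch.randn(3 * embed_dim, embed_dim), torch.zeros(3 * embed_dim)
out = multihead_self_attention(x, w, b, nn.Linear(embed_dim, embed_dim), num_heads)
print(out.shape)  # torch.Size([10, 2, 512])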
3.3. class LayerNorm
def forward(self, x):
u = x.mean(-1, keepdim=True)
s = (x - u).pow(2).mean(-1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.eps)
return self.weight * x + self.bias
Flow:
- Compute the mean over the last dimension
- Compute the variance
- Subtract the mean and divide by the standard deviation
- Apply the learned elementwise affine transform (scale by weight, shift by bias)
3.4. Positional Embedding
Two classes are defined:
- A learnable positional encoding: class LearnedPositionalEmbedding
- A fixed positional encoding given by sinusoids: class SinusoidalPositionalEmbedding
The latter matches the definition in the paper "Attention Is All You Need"; a sketch is given below.
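A sketch of the fixed sinusoidal table as defined in the paper; the repo's SinusoidalPositionalEmbedding builds an equivalent table, but the construction below follows the paper rather than the repo's exact code.

import math
import torch

def sinusoidal_positional_embedding(num_positions, embed_dim):
    # even dimensions get sine, odd dimensions get cosine, with wavelengths
    # forming a geometric progression from 2*pi to 10000*2*pi
    pe = torch.zeros(num_positions, embed_dim)
    position = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float)
                         * (-math.log(10000.0) / embed_dim))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (num_positions, embed_dim), fixed (not learned)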