NLP Study Notes (5): attention / self-attention / seq2seq / transformer

Reposted article; original dated 2019-08-27 11:52. Category: NLP

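I. Background:

1. RNN: handles sequential input, but its weakness is that it is hard to parallelize, because each step has to wait for the previous one.

(1) RNN (N vs N)	(2) RNN (N vs 1)

(3) RNN (1 vs N)	(4) RNN (N vs M), i.e. seq2seq

2. CNN: a CNN can replace the RNN and run in parallel; in the original figure, every yellow triangle can be computed in parallel. The problem is that long-range dependencies in the sequence are hard to capture. The remedy is to stack several CNN layers; in the figure the yellow and blue triangles are two such layers.

3. attention:

In the Encoder-Decoder structure, the encoder compresses the whole input sequence into a single semantic vector c before decoding. c must therefore hold all the information in the source sequence, and its fixed length becomes the bottleneck that limits model performance. In machine translation, for example, when the sentence is long a single c may not hold that much information, and translation accuracy drops.

The attention mechanism solves this by feeding a different c at every time step; the original figure here showed the encoder and decoder with attention.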
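To make "a different c at every time step" concrete, here is a minimal sketch, not the post's own code. It assumes dot-product scoring between the decoder state and the encoder hidden states; the names `encoder_states`, `context_vector` and the toy numbers are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(encoder_states, decoder_state):
    """Return (weights, c) for one decoder step, using dot-product scores."""
    scores = [sum(h_k * s_k for h_k, s_k in zip(h, decoder_state))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # c is the attention-weighted sum of the encoder hidden states.
    c = [sum(w * h[k] for w, h in zip(weights, encoder_states))
         for k in range(dim)]
    return weights, c

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy hidden states
w1, c1 = context_vector(encoder_states, [2.0, 0.0])     # decoder step 1
w2, c2 = context_vector(encoder_states, [0.0, 2.0])     # decoder step 2
```

The two decoder steps put their weight on different encoder states, so each step gets its own context vector instead of sharing one fixed c.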

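4. self-attention: its input and output are the same as an RNN's; only the middle differs. In the original figure, b1 to b4 are computed at the same time, whereas in an RNN b4 has to wait until b1 through b3 have been computed.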

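II. Attention

1. Why use an attention model?

The attention model helps with the poor quality of machine translation on long sentences. Self-attention (section III) additionally addresses the fact that RNNs are hard to parallelize.

3. Types of attention

Dot-product attention has the advantage of being fast and space-efficient.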

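III. self-attention

1. Computing self-attention (Attention Is All You Need)

Each query q attends to every key k, which gives the scores α_{1,1}, α_{1,2}, ...

Why divide by √d [d is the dimension of q and of k, which are equal]? The larger the dimension of q and k, the larger the dot product tends to be, so the score is divided by √d to keep it in a comparable range, which acts as a normalization.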
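The effect of the scaling can be checked numerically. The sketch below is an illustration under an assumption that is not in the post: the components of q and k are drawn independently from a standard normal distribution. The helper name `dot_product_std` is made up.

```python
import random
import statistics

def dot_product_std(d, scale, trials=2000, seed=0):
    """Standard deviation of scale * (q . k) for random q, k of dimension d."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        values.append(scale * sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(values)

# Without scaling, the spread of the scores grows with the dimension.
raw_std = {d: dot_product_std(d, 1.0) for d in (4, 64)}
# Dividing by sqrt(d) brings the spread back to roughly 1 for every d.
scaled_std = {d: dot_product_std(d, d ** -0.5) for d in (4, 64)}
```

Under this assumption the unscaled spread grows like √d, which is why the divisor is √d rather than d.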

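2. How self-attention is parallelized

Self-attention ends up as a series of matrix multiplications, so it can be computed in parallel.

Every α above can be computed in parallel.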
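As a rough NumPy sketch of that matrix form (not the post's TensorFlow code): all scores α are produced by one matrix product, and all outputs b by another. The sizes and the random weights are arbitrary choices for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence X of shape [n, d_in]."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # [n, n]: every alpha at once
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A, A @ V                                       # weights and outputs b^1..b^n

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 inputs a^1..a^4
W_q, W_k, W_v = (rng.normal(size=(8, 6)) for _ in range(3))
A, B = self_attention(X, W_q, W_k, W_v)
```

No output row depends on another output row, which is what lets b1 to b4 be computed at the same time.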

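3. Summary of the computation:

4. Types of self-attention

Multi-head: why? Because different heads can attend to different information; for example, the first head may focus on long-range information and the second on short-range information.

Concatenate b^{i,1} and b^{i,2} and multiply by W^O to project the result back down to b^i.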

IV. seq2seq

Traditional seq2seq: an RNN sits in the middle.

seq2seq with attention

V. Transformer

Detailed walkthrough: https://mp.weixin.qq.com/s/RLxWevVWHXgX-UcoxDS70w

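1. Overall architecture:

The Transformer follows the encoder-decoder structure; both the encoder and the decoder use stacked self-attention and point-wise, fully connected layers.

Encoder: the encoder is a stack of 6 identical layers, each with two sub-layers.

The first sub-layer is a multi-head self-attention mechanism.

The second sub-layer is a simple position-wise fully connected feed-forward network.

In between: each of the two sub-layers is wrapped in a residual connection, followed by layer normalization.

That is, the output of every sub-layer is LayerNorm(x + sublayer(x)).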
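A minimal NumPy sketch of LayerNorm(x + sublayer(x)), under simplifying assumptions: the learned gain and bias of layer normalization are left out, and the sub-layer is a toy linear map rather than attention or an FFN. The function names are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """LayerNorm(x + sublayer(x)); the sub-layer must keep the last dimension."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))             # [batch, seq_len, d_model]
W = rng.normal(size=(8, 8))
out = add_and_norm(x, lambda h: h @ W)     # toy sub-layer: a linear map
```

The addition only works because the sub-layer returns the same shape it receives, which is the reason for the equal-dimension requirement on sub-layer inputs and outputs.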

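The network input is used three times, as q, k and v; it is the sum of the word embedding and the position embedding. To make the residual connections straightforward, the output of each sub-layer must have the same dimension as its input.

Decoder:

Three sub-layers: (multi-head self-attention + multi-head attention + feed-forward)

The decoder is also a stack of N (N=6) identical layers; a decoder layer is an encoder layer with one extra Multi-Head Attention + Add&Norm inserted.

Input: the sum of the output (target) embedding and its position embedding is the decoder input.

MA-1: the input passes through a Multi-Head Attention + Add&Norm (MA-1); the output of MA-1 is the query (Q) input of the next Multi-Head Attention + Add&Norm (MA-2).

The Key and Value inputs of MA-2 come from the encoder. (The original note read the figure as encoder layer i feeding decoder layer i for i = 1, ..., 6; in the paper, every decoder layer takes its keys and values from the output of the final encoder layer.)

The output of MA-2 goes into a feed-forward layer (FF), a position-wise feed-forward network, followed by AN (Add&Norm). After the last decoder layer, a linear + softmax transformation gives the probabilities of the target output.
mask: the first multi-head attention sub-layer of the decoder needs masking, so that the prediction at position i depends only on outputs at positions less than i.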
　　　　

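2. Details and tricks

(1) Three uses

The Transformer uses multi-head attention in three different ways:
1. encoder-decoder attention: multi-head attention whose inputs are the encoder output and the decoder's self-attention output; the encoder output supplies the keys and values, and the decoder's self-attention output supplies the queries.

2. encoder self-attention: multi-head attention where Q, K and V all come from the same input (input embedding plus positional embedding).
3. decoder self-attention: in the decoder's self-attention layers, each position can attend only to positions up to and including itself.

(2) Positional encoding

The paper uses PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The reason is that sine and cosine are periodic: for any fixed offset k, the PE at position pos + k can be written as a linear function of the PE at position pos, which makes it easy for the model to learn relative positions between words.

The self-attention described above has a problem: it has no position information, because nearby and distant inputs are treated identically when computing α.

The positional encoding e^i is set by hand, not learned. It is added to a^i.

Why add it to a^i rather than concatenate? Appending a one-hot position vector p^i to the input x^i and multiplying by a weight matrix [W^I, W^P] gives W^I x^i + W^P p^i = a^i + e^i, so the addition is equivalent to a concatenation at the input.

Here W^P is not learned but computed by another method, as the original figure showed.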
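The claim that the encoding at pos + k is a linear function of the encoding at pos can be verified directly. The sketch below assumes the sinusoidal formula from the paper and an even d_model; the helper names `positional_encoding` and `shift` are made up for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position as a list of length d_model (even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Compute PE(pos + k) from PE(pos) with a rotation that depends only on k."""
    out = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(w * k) + c * math.sin(w * k))   # sin(a + b)
        out.append(c * math.cos(w * k) - s * math.sin(w * k))   # cos(a + b)
    return out

d_model = 8
shifted = shift(positional_encoding(5, d_model), 3, d_model)
direct = positional_encoding(8, d_model)
```

The rotation coefficients contain only k, never pos, so the same linear map moves any position forward by k steps.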

(3) Residual connections

Every sub-layer in every encoder layer has a residual connection; in theory this lets gradients propagate back well.

Author: 收到一只叮咚
Link: https://www.imooc.com/article/67493
Source: imooc (慕課網)

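(4) Layer Norm

Every sub-layer is also followed by a layer-normalization step [layer norm is commonly paired with RNNs]. It can speed up convergence.

Difference between Batch Norm and Layer Norm (upper-right corner of the original figure): Batch Norm normalizes each feature across the batch to mean 0 and sigma = 1; Layer Norm normalizes across the features of a single example, so it does not depend on the batch size.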
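In code, the difference comes down to the axis along which the statistics are taken. This is a simplified NumPy sketch that ignores the learned scale and shift and the running averages that real Batch Norm layers keep; the array sizes are arbitrary.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Zero-mean, unit-variance normalization along the given axis."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(16, 8))   # [batch, features]

batch_normed = normalize(x, axis=0)   # statistics per feature, across the batch
layer_normed = normalize(x, axis=1)   # statistics per example, across features

# Layer norm of one example is the same with or without the rest of the batch.
single = normalize(x[:1], axis=1)
```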

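(5) Position-wise feed-forward network

It uses two Dense layers; ReLU is applied after the first one, and the second is linear.

It can be viewed as two 1-D convolutions with kernel size 1. The hidden size goes 512 -> 2048 -> 512.
The position-wise feed-forward network is simply an MLP. For each d_model-dimensional vector x in the output of sub-layer 1, x W_1 + b_1 first maps it to a d_ff-dimensional x', and max(0, x') W_2 + b_2 then maps it back to d_model dimensions. A residual connection follows. The output size is still [sequence_length, d_model].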
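A small NumPy sketch of that formula, with the 512 -> 2048 -> 512 sizes mentioned above. The random weights and the sequence length are arbitrary, and the residual connection is left out.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # [seq_len, d_ff], ReLU
    return hidden @ W2 + b2                 # [seq_len, d_model], no activation

d_model, d_ff, seq_len = 512, 2048, 5
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = position_wise_ffn(x, W1, b1, W2, b2)
```

"Position-wise" means the same weights are applied to each position independently: feeding a single row of x gives the same result as the corresponding row of y.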

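(6) Masked: [decoder]

Note that the encoder uses self-attention, while the decoder uses masked self-attention.

"Masked" means that during language modelling (or tasks like translation) the model is not shown future information.

The mask covers the gray region above the diagonal so that those attention weights become 0 and the model cannot see future information. In practice the scores there are set to a very large negative number before the softmax, as the decoder code in section 3.2 does.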
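A minimal NumPy sketch of such a causal mask. The constant -1e9 stands in for "minus infinity" and is an arbitrary choice, as are the toy scores.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    n = scores.shape[-1]
    allowed = np.tril(np.ones((n, n), dtype=bool))   # lower triangle incl. diagonal
    masked = np.where(allowed, scores, -1e9)         # future positions: about -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.arange(9, dtype=float).reshape(3, 3)
weights = causal_softmax(scores)
```

After the softmax, everything above the diagonal is 0 and each row still sums to 1; the first row puts all its weight on position 0.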

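(7) Optimization

The model is trained with Adam, and the paper proposes a learning-rate schedule called warm-up: lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5)).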
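The warm-up schedule from the paper, lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)), is short enough to write out. The defaults d_model = 512 and warmup_steps = 4000 are the paper's values; the function name is made up.

```python
def warmup_learning_rate(step, d_model=512, warmup_steps=4000):
    """Learning rate at training step `step` (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear increase during warm-up, then decay proportional to 1 / sqrt(step).
rates = [warmup_learning_rate(s) for s in (1, 2000, 4000, 8000, 16000)]
```

The two arguments of min are equal exactly at step = warmup_steps, which is where the learning rate peaks.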

Author: 收到一只叮咚
Link: https://www.imooc.com/article/67493
Source: imooc (慕課網)

Further development: Universal Transformer

Applications: NLP; Self-Attention GAN (applied to images)

3. Hands-on

https://www.jianshu.com/p/2b0a5541a17c

3.1 encoder

(1) Input: encoder embedding plus position embedding

(2) Two kinds of attention

(3) Add & Normalize & FFN

3.2 decoder

(1) Input: decoder embedding plus position embedding

(2) Masked multi-head attention and encoder-decoder attention

(3) Add & Normalize & FFN & output

3.1 encoder

(1) Input: input embedding plus position embedding

Raw data: word2vec [embedding table] + input_sentence [x] + output_sentence [y] + position embedding (fixed)

① The input input_sentence [x] and word2vec [embedding table]

Suppose we have two training examples (input_sentence [x]):

[機、器、學、習] -> [ machine、learning]
[學、習、機、器] -> [learning、machine]

After conversion to ids, the encoder input becomes [[0,1,2,3],[2,3,0,1]].
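One simple way to arrive at those ids is to number the tokens in order of first appearance. The post does not say how its vocabulary was built, so the helper `build_vocab` below is an assumption for illustration.

```python
def build_vocab(sentences):
    """Assign ids to tokens in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab

source_sentences = [["機", "器", "學", "習"], ["學", "習", "機", "器"]]
vocab = build_vocab(source_sentences)
encoder_input = [[vocab[t] for t in s] for s in source_sentences]
```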

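Next, look the ids up in the Chinese embedding table (word2vec) to get the embeddings (chinese_embedding in the code below).

② The position embedding is set to fixed values. In practice it is computed with trigonometric functions, but fixed values are used here for convenience. Note that the position embedding is not trained:

③ Add the position_embedding offset to the input input_embedding; note that this is an element-wise addition of the two vectors:

④ output_sentence [y] is processed the same way as input_sentence

Code: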

import tensorflow as tf

chinese_embedding = tf.constant([[0.11,0.21,0.31,0.41],
                         [0.21,0.31,0.41,0.51],
                         [0.31,0.41,0.51,0.61],
                         [0.41,0.51,0.61,0.71]],dtype=tf.float32)


english_embedding = tf.constant([[0.51,0.61,0.71,0.81],
                         [0.52,0.62,0.72,0.82],
                         [0.53,0.63,0.73,0.83],
                         [0.54,0.64,0.74,0.84]],dtype=tf.float32)


position_encoding = tf.constant([[0.01,0.01,0.01,0.01],
                         [0.02,0.02,0.02,0.02],
                         [0.03,0.03,0.03,0.03],
                         [0.04,0.04,0.04,0.04]],dtype=tf.float32)

encoder_input = tf.constant([[0,1,2,3],[2,3,0,1]],dtype=tf.int32)


with tf.variable_scope("encoder_input"):
    encoder_embedding_input = tf.nn.embedding_lookup(chinese_embedding,encoder_input)
    encoder_embedding_input = encoder_embedding_input + position_encoding


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([encoder_embedding_input]))


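(2) attention

[scaled dot-product attention]

① Compute Q, K, V [embedding × three W matrices]

First, a word about Q, K and V. Suppose we want to compute the attention between machine and the four characters 機, 器, 學, 習, and take a weighted sum as the output. Then the Query is computed from the embedding of machine, and K and V are computed from the embeddings of the four characters 機, 器, 學, 習.

In the encoder's self-attention, the similarity is computed between the input and itself, so Q, K and V all come from the input embeddings, but we still keep them distinct.

Here Q, K and V are each produced by one fully connected layer; as before, the corresponding parameter matrices are written as constants.

Next, taking the first input as the example, multiply the embedding by the three W matrices:

② Compute the weights α [softmax(Q * K^T / sqrt(d_k))]

Compute how related Q and K are using the inner product, i.e. QK^T (in the original figure the V should read K, which does not affect the overall picture). The result is the attention map.

The relatedness of 機 with itself is 2.37 (before normalization), that of 機 with 器 is 3.26, and so on.

Then divide by a scaling factor and apply softmax. The scaling factor chosen here is 8, which equals sqrt(d_k) for the paper's d_k = 64 rather than for this toy example. Each row is then normalized with softmax (row-wise because the purpose of attention is to measure how related each Query is to all the Keys):

③ Weighted sum of V with α:

Finally we obtain the output embedding for every input embedding, i.e. a weighted sum of V based on the attention map. Taking the input 機 as the example, the output is the weighted sum of the four vectors in V:

Code (w_Q, w_K and w_V are the constant weight matrices from a figure that is not reproduced here; judging by w_Z and the decoder weights below, their shape is [4, 6]):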

with tf.variable_scope("encoder_scaled_dot_product_attention"):
    encoder_Q = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_Q)
    encoder_K = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_K)
    encoder_V = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_V)
    
#### ① compute Q, K, V (reshape back to [batch, seq_len, depth])
    encoder_Q = tf.reshape(encoder_Q,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    encoder_K = tf.reshape(encoder_K,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    encoder_V = tf.reshape(encoder_V,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
                          
    #### ② compute α = softmax(Q * K^T / scale); this example uses the fixed scale 8
    attention_map = tf.matmul(encoder_Q,tf.transpose(encoder_K,[0,2,1]))
    attention_map = attention_map / 8
    attention_map = tf.nn.softmax(attention_map)

#### ③ weighted sum α * V (added by the note-taker; batched matmul [2,4,4] x [2,4,6] -> [2,4,6])
    weightedSumV = tf.matmul(attention_map,encoder_V)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(attention_map))
    # print(sess.run(weightedSumV))


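[multi-head attention]

Multi-Head Attention runs the Scaled Dot-Product Attention process H times and then combines the outputs Z.

① Split into several heads (i.e. several Q, K, V) and run scaled dot-product attention on each

Suppose the Q, K and V computed above are split down the middle and used as the inputs of two heads:

Repeating the Scaled Dot-Product Attention process above gives the outputs of the two heads:

② Concatenate the results z of the Scaled Dot-Product Attention runs and multiply by W to reduce the dimension

Next, a weight matrix produces the final output.

So that the later Add operation is possible, the output length must match the input, i.e. the output embedding of every token must have length 4.

As before, the transformation matrix W is set to a constant:

Code: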

w_Z = tf.constant([[0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4]],dtype=tf.float32)


with tf.variable_scope("encoder_input"):
    encoder_embedding_input = tf.nn.embedding_lookup(chinese_embedding,encoder_input)
    encoder_embedding_input = encoder_embedding_input + position_encoding

with tf.variable_scope("encoder_multi_head_product_attention"):
    encoder_Q = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_Q)
    encoder_K = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_K)
    encoder_V = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_V)
    
### ① build Q, K, V
    encoder_Q = tf.reshape(encoder_Q,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    encoder_K = tf.reshape(encoder_K,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    encoder_V = tf.reshape(encoder_V,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
          
### ② split the last dimension into heads (Q, K, V)
    encoder_Q_split = tf.split(encoder_Q,2,axis=2)
    encoder_K_split = tf.split(encoder_K,2,axis=2)
    encoder_V_split = tf.split(encoder_V,2,axis=2)
    
    ### stack the heads along the first (batch) dimension for convenience
    encoder_Q_concat = tf.concat(encoder_Q_split,axis=0)
    encoder_K_concat = tf.concat(encoder_K_split,axis=0)
    encoder_V_concat = tf.concat(encoder_V_split,axis=0)
    
    ### compute the attention weights α
    attention_map = tf.matmul(encoder_Q_concat,tf.transpose(encoder_K_concat,[0,2,1]))
    attention_map = attention_map / 8
    attention_map = tf.nn.softmax(attention_map)
    
    ### weighted sum of V with α, giving Z
    weightedSumV = tf.matmul(attention_map,encoder_V_concat)
    
### ③ merge the Z of all heads
    outputs_z = tf.concat(tf.split(weightedSumV,2,axis=0),axis=2)
    
### ④ multiply the merged Z by W to reduce the dimension and get the final Z
    outputs = tf.matmul(tf.reshape(outputs_z,(-1,tf.shape(outputs_z)[2])),w_Z)
    outputs = tf.reshape(outputs,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    
import numpy as np
with tf.Session() as sess:
#     print(sess.run(encoder_Q))
#     print(sess.run(encoder_Q_split))
    #print(sess.run(weightedSumV))
    #print(sess.run(outputs_z))
    print(sess.run(outputs))


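A closer look at the split and concat functions

split takes three main arguments: the tensor to split, the number of tensors to split it into, and the dimension along which to split. So encoder_Q_split = tf.split(encoder_Q,2,axis=2) splits the tensor encoder_Q along axis=2 into two tensors of equal size. The lengths along axis=0 and axis=1 are unchanged, while the length along axis=2 is halved (the original post illustrated this with figures).

As the code shows, there are two split-and-concat rounds. The first splits Q, K and V into the different heads.

In other words, after the concat the Q of the different heads of one example are no longer adjacent; they alternate along the first dimension: [Head1-Q11,Head1-Q21,Head2-Q12,Head2-Q22]

The second round comes after the output Z of every head has been computed and restores the original layout with split and concat (shown in a figure in the original post).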
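The two rounds can be traced on a small array. This sketch uses NumPy's `np.split` and `np.concatenate`, which behave like `tf.split` and `tf.concat` for this equal-sized case; the shapes match the post's example (batch 2, sequence length 4, depth 6, 2 heads) and the values are just a counter so that rows are easy to follow.

```python
import numpy as np

batch, seq_len, d, heads = 2, 4, 6, 2
Q = np.arange(batch * seq_len * d, dtype=float).reshape(batch, seq_len, d)

# Round 1: split the last axis into heads, then stack the heads along axis 0.
Q_split = np.split(Q, heads, axis=2)            # 2 arrays of shape [2, 4, 3]
Q_concat = np.concatenate(Q_split, axis=0)      # [4, 4, 3]

# Round 2: split along axis 0 again, then join the heads along the last axis.
restored = np.concatenate(np.split(Q_concat, heads, axis=0), axis=2)
```

Along the first axis, `Q_concat` holds head 1 of example 1, head 1 of example 2, head 2 of example 1, head 2 of example 2, which is the alternating order described above.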

(3) Add & Normalize & FFN

First Add & Normalize:

Next comes an FFN; again assuming fixed parameters, the output is:

Second Add & Normalize

After one Encoder block we finally have the output for every input:

Code:

with tf.variable_scope("encoder_block"):
    encoder_Q = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_Q)
    encoder_K = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_K)
    encoder_V = tf.matmul(tf.reshape(encoder_embedding_input,(-1,tf.shape(encoder_embedding_input)[2])),w_V)
    
    encoder_Q = tf.reshape(encoder_Q,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    encoder_K = tf.reshape(encoder_K,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    encoder_V = tf.reshape(encoder_V,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
          
    encoder_Q_split = tf.split(encoder_Q,2,axis=2)
    encoder_K_split = tf.split(encoder_K,2,axis=2)
    encoder_V_split = tf.split(encoder_V,2,axis=2)
    
    encoder_Q_concat = tf.concat(encoder_Q_split,axis=0)
    encoder_K_concat = tf.concat(encoder_K_split,axis=0)
    encoder_V_concat = tf.concat(encoder_V_split,axis=0)
    
    attention_map = tf.matmul(encoder_Q_concat,tf.transpose(encoder_K_concat,[0,2,1]))
    attention_map = attention_map / 8
    attention_map = tf.nn.softmax(attention_map)
    
    # result of the multi-head attention
    weightedSumV = tf.matmul(attention_map,encoder_V_concat)
    
    outputs_z = tf.concat(tf.split(weightedSumV,2,axis=0),axis=2)
    
    # project the merged heads back to the dimension of encoder_embedding_input
    sa_outputs = tf.matmul(tf.reshape(outputs_z,(-1,tf.shape(outputs_z)[2])),w_Z)
    sa_outputs = tf.reshape(sa_outputs,(tf.shape(encoder_embedding_input)[0],tf.shape(encoder_embedding_input)[1],-1))
    
    ## first Add (residual connection)
    sa_outputs = sa_outputs + encoder_embedding_input
    
    # TODO: add the normalization step (layer normalization in the paper; omitted here)
    W_f = tf.constant([[0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4]])
    ##FFN
    ffn_outputs = tf.matmul(tf.reshape(sa_outputs,(-1,tf.shape(sa_outputs)[2])),W_f)
    ffn_outputs = tf.reshape(ffn_outputs,(tf.shape(sa_outputs)[0],tf.shape(sa_outputs)[1],-1))
    
    # second Add (residual connection)
    encoder_outputs = ffn_outputs + sa_outputs
    # TODO: add the normalization step (layer normalization in the paper; omitted here)

import numpy as np
with tf.Session() as sess:
#     print(sess.run(encoder_Q))
#     print(sess.run(encoder_Q_split))
    #print(sess.run(weightedSumV))
    #print(sess.run(outputs_z))
    #print(sess.run(sa_outputs))
    #print(sess.run(ffn_outputs))
    print(sess.run(encoder_outputs))

3.2 decoder

Compared with the encoder, the process here has 6 steps: masked multi-head self attention, Add & Normalize, encoder-decoder attention, Add & Normalize, Feed Forward Network, Add & Normalize.

We again focus on masked multi-head self attention and encoder-decoder attention.

(1) Decoder input

Overall input: [機、器、學、習] -> [ machine、learning]

So the input of the decoder stage is: [ machine、learning]

Code:

english_embedding = tf.constant([[0.51,0.61,0.71,0.81],
                         [0.61,0.71,0.81,0.91],
                         [0.71,0.81,0.91,1.01],
                         [0.81,0.91,1.01,1.11]],dtype=tf.float32)


position_encoding = tf.constant([[0.01,0.01,0.01,0.01],
                         [0.02,0.02,0.02,0.02],
                         [0.03,0.03,0.03,0.03],
                         [0.04,0.04,0.04,0.04]],dtype=tf.float32)

decoder_input = tf.constant([[1,2],[2,1]],dtype=tf.int32)

with tf.variable_scope("decoder_input"):
    decoder_embedding_input = tf.nn.embedding_lookup(english_embedding,decoder_input)
    decoder_embedding_input = decoder_embedding_input + position_encoding[0:tf.shape(decoder_embedding_input)[1]]

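(2) masked multi-head self attention

This process is essentially the same as multi-head self attention, except that when the decoder produces the output for each step it cannot see what comes later. For example, our first example is [機、器、學、習] -> [ machine、learning]; the decoder's two inputs are machine and learning. When machine is the input, the information of learning is not visible, so when computing the attention weights there is no weight between machine and learning. The original post first walked through this in a spreadsheet and then in code:

The attention weight matrices are:

Again with two heads, compute Q, K and V:

Compute the attention map of each of the two heads:

First the code for this part, the computation of the masked attention map:

The first two steps are the same as in the encoder; the only difference is that after computing the attention map [Q*K^T / sqrt(d_k)] the mask is added, then softmax is applied, and finally the result is multiplied by V.

First define the weight matrices, as constants just like in the encoder: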

w_Q_decoder_sa = tf.constant([[0.15,0.25,0.35,0.45,0.55,0.65],
                   [0.25,0.35,0.45,0.55,0.65,0.75],
                   [0.35,0.45,0.55,0.65,0.75,0.85],
                   [0.45,0.55,0.65,0.75,0.85,0.95]],dtype=tf.float32)

w_K_decoder_sa = tf.constant([[0.13,0.23,0.33,0.43,0.53,0.63],
                   [0.23,0.33,0.43,0.53,0.63,0.73],
                   [0.33,0.43,0.53,0.63,0.73,0.83],
                   [0.43,0.53,0.63,0.73,0.83,0.93]],dtype=tf.float32)

w_V_decoder_sa = tf.constant([[0.17,0.27,0.37,0.47,0.57,0.67],
                   [0.27,0.37,0.47,0.57,0.67,0.77],
                   [0.37,0.47,0.57,0.67,0.77,0.87],
                   [0.47,0.57,0.67,0.77,0.87,0.97]],dtype=tf.float32)

Next, compute the attention map before the mask is added:

with tf.variable_scope("decoder_sa_block"):
    decoder_Q = tf.matmul(tf.reshape(decoder_embedding_input,(-1,tf.shape(decoder_embedding_input)[2])),w_Q_decoder_sa)
    decoder_K = tf.matmul(tf.reshape(decoder_embedding_input,(-1,tf.shape(decoder_embedding_input)[2])),w_K_decoder_sa)
    decoder_V = tf.matmul(tf.reshape(decoder_embedding_input,(-1,tf.shape(decoder_embedding_input)[2])),w_V_decoder_sa)
    
    decoder_Q = tf.reshape(decoder_Q,(tf.shape(decoder_embedding_input)[0],tf.shape(decoder_embedding_input)[1],-1))
    decoder_K = tf.reshape(decoder_K,(tf.shape(decoder_embedding_input)[0],tf.shape(decoder_embedding_input)[1],-1))
    decoder_V = tf.reshape(decoder_V,(tf.shape(decoder_embedding_input)[0],tf.shape(decoder_embedding_input)[1],-1))
          
    decoder_Q_split = tf.split(decoder_Q,2,axis=2)
    decoder_K_split = tf.split(decoder_K,2,axis=2)
    decoder_V_split = tf.split(decoder_V,2,axis=2)
    
    decoder_Q_concat = tf.concat(decoder_Q_split,axis=0)
    decoder_K_concat = tf.concat(decoder_K_split,axis=0)
    decoder_V_concat = tf.concat(decoder_V_split,axis=0)
    
    decoder_sa_attention_map_raw = tf.matmul(decoder_Q_concat,tf.transpose(decoder_K_concat,[0,2,1]))
    decoder_sa_attention_map = decoder_sa_attention_map_raw / 8

Next, add the mask to the attention map:

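# All-ones matrix with the shape of one attention map.
diag_vals = tf.ones_like(decoder_sa_attention_map[0,:,:])
# Keep the lower triangle (1 = visible, 0 = future position). If this contrib class is
# unavailable in your TensorFlow 1.x version, tf.linalg.band_part(diag_vals, -1, 0)
# builds the same lower-triangular matrix.
tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()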
masks = tf.tile(tf.expand_dims(tril,0),[tf.shape(decoder_sa_attention_map)[0],1,1])
paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
decoder_sa_attention_map = tf.where(tf.equal(masks,0),paddings,decoder_sa_attention_map)
#softmax
decoder_sa_attention_map = tf.nn.softmax(decoder_sa_attention_map)

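Here we first build an all-ones matrix diag_vals of the same size as the attention map. tf.contrib.linalg.LinearOperatorTriL then sets the upper-triangular part to 0.

The matrix tril produced this way gives us the mask. Note, however, that the masked positions of the attention map must not be set to 0; they need a very large negative number, here -2^32 + 1. A softmax follows, and e^0 = 1, which is far from negligible, so a 0 would still receive noticeable weight.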
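A tiny pure-Python check of why 0 is the wrong fill value. The two scores are made-up numbers; the second one plays the role of a future position that should be hidden.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 0.8]                                   # second entry: future position
masked_with_zero = softmax([scores[0], 0.0])          # wrong: the 0 still gets weight
masked_with_big_negative = softmax([scores[0], -2.0 ** 32 + 1])
```

With 0 as the fill value the hidden position still receives more than a third of the weight; with -2^32 + 1 its weight underflows to exactly 0.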

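Finish the rest of the multi-head attention after the attention map (w_Z_decoder_sa is not defined in the original post; presumably it is a [6, 4] constant like w_Z):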

weightedSumV = tf.matmul(decoder_sa_attention_map,decoder_V_concat)
    
decoder_outputs_z = tf.concat(tf.split(weightedSumV,2,axis=0),axis=2)
    
decoder_sa_outputs = tf.matmul(tf.reshape(decoder_outputs_z,(-1,tf.shape(decoder_outputs_z)[2])),w_Z_decoder_sa)
    
decoder_sa_outputs = tf.reshape(decoder_sa_outputs,(tf.shape(decoder_embedding_input)[0],tf.shape(decoder_embedding_input)[1],-1))


with tf.Session() as sess:
    print(sess.run(decoder_sa_outputs))

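(3) encoder-decoder attention

Before the encoder-decoder attention there is another Add & Normalize step; as before, we skip Normalize and only do the Add:

Next comes the encoder-decoder attention itself. It works like multi-head attention, with one point to note: we want the attention between each decoder step's input and all outputs of the encoder stage, so Q is computed from the decoder-side embeddings, while K and V are computed from the encoder's output embeddings:

Next, compute the attention map. Note that the attention map here is 2 * 4: each row holds the attention scores between one decoder input and all encoder outputs. No mask is needed, because the decoder input may see all of the encoder's outputs. The original post showed the resulting attention map in a figure.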
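A shape-level NumPy sketch of this step for a single example and a single head. The random inputs and weights are placeholders, not the post's numbers, and the scaling uses sqrt(d_k) rather than the fixed 8 of the post's code.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(2, 4))    # 2 target positions, d_model = 4
encoder_outputs = rng.normal(size=(4, 4))   # 4 source positions, d_model = 4
W_q, W_k, W_v = (rng.normal(size=(4, 6)) for _ in range(3))

Q = decoder_states @ W_q                    # queries come from the decoder
K = encoder_outputs @ W_k                   # keys come from the encoder
V = encoder_outputs @ W_v                   # values come from the encoder
attention_map = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # [2, 4], no mask
outputs = attention_map @ V                 # [2, 6]
```

The number of rows follows the decoder length and the number of columns follows the encoder length, which is why the map is 2 * 4 here.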

Next, the output of the whole encoder-decoder stage is:

After that come Add & Normalize, Feed Forward Network and Add & Normalize, which are not described in detail here. Code for the encoder-decoder attention:

w_Q_decoder_sa2 = tf.constant([[0.2,0.3,0.4,0.5,0.6,0.7],
                   [0.3,0.4,0.5,0.6,0.7,0.8],
                   [0.4,0.5,0.6,0.7,0.8,0.9],
                   [0.5,0.6,0.7,0.8,0.9,1]],dtype=tf.float32)

w_K_decoder_sa2 = tf.constant([[0.18,0.28,0.38,0.48,0.58,0.68],
                   [0.28,0.38,0.48,0.58,0.68,0.78],
                   [0.38,0.48,0.58,0.68,0.78,0.88],
                   [0.48,0.58,0.68,0.78,0.88,0.98]],dtype=tf.float32)

w_V_decoder_sa2 = tf.constant([[0.22,0.32,0.42,0.52,0.62,0.72],
                   [0.32,0.42,0.52,0.62,0.72,0.82],
                   [0.42,0.52,0.62,0.72,0.82,0.92],
                   [0.52,0.62,0.72,0.82,0.92,1.02]],dtype=tf.float32)

w_Z_decoder_sa2 = tf.constant([[0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4],
                   [0.1,0.2,0.3,0.4]],dtype=tf.float32)


with tf.variable_scope("decoder_encoder_attention_block"):
    
    decoder_sa_outputs = decoder_sa_outputs + decoder_embedding_input
    
    encoder_decoder_Q = tf.matmul(tf.reshape(decoder_sa_outputs,(-1,tf.shape(decoder_sa_outputs)[2])),w_Q_decoder_sa2)
    encoder_decoder_K = tf.matmul(tf.reshape(encoder_outputs,(-1,tf.shape(encoder_outputs)[2])),w_K_decoder_sa2)
    encoder_decoder_V = tf.matmul(tf.reshape(encoder_outputs,(-1,tf.shape(encoder_outputs)[2])),w_V_decoder_sa2)
    
    encoder_decoder_Q = tf.reshape(encoder_decoder_Q,(tf.shape(decoder_embedding_input)[0],tf.shape(decoder_embedding_input)[1],-1))
    encoder_decoder_K = tf.reshape(encoder_decoder_K,(tf.shape(encoder_outputs)[0],tf.shape(encoder_outputs)[1],-1))
    encoder_decoder_V = tf.reshape(encoder_decoder_V,(tf.shape(encoder_outputs)[0],tf.shape(encoder_outputs)[1],-1))
          
    encoder_decoder_Q_split = tf.split(encoder_decoder_Q,2,axis=2)
    encoder_decoder_K_split = tf.split(encoder_decoder_K,2,axis=2)
    encoder_decoder_V_split = tf.split(encoder_decoder_V,2,axis=2)
    
    encoder_decoder_Q_concat = tf.concat(encoder_decoder_Q_split,axis=0)
    encoder_decoder_K_concat = tf.concat(encoder_decoder_K_split,axis=0)
    encoder_decoder_V_concat = tf.concat(encoder_decoder_V_split,axis=0)
    ## note: no mask is needed here
    encoder_decoder_attention_map_raw = tf.matmul(encoder_decoder_Q_concat,tf.transpose(encoder_decoder_K_concat,[0,2,1]))
    encoder_decoder_attention_map = encoder_decoder_attention_map_raw / 8
    
    encoder_decoder_attention_map = tf.nn.softmax(encoder_decoder_attention_map)
    
    weightedSumV = tf.matmul(encoder_decoder_attention_map,encoder_decoder_V_concat)
    
    encoder_decoder_outputs_z = tf.concat(tf.split(weightedSumV,2,axis=0),axis=2)
    
    encoder_decoder_outputs = tf.matmul(tf.reshape(encoder_decoder_outputs_z,(-1,tf.shape(encoder_decoder_outputs_z)[2])),w_Z_decoder_sa2)
    
    encoder_decoder_attention_outputs = tf.reshape(encoder_decoder_outputs,(tf.shape(decoder_embedding_input)[0],tf.shape(decoder_embedding_input)[1],-1))
    
    encoder_decoder_attention_outputs = encoder_decoder_attention_outputs + decoder_sa_outputs
    
    # TODO: add the normalization step (layer normalization in the paper; omitted here)
    W_f = tf.constant([[0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4]])
    
    decoder_ffn_outputs = tf.matmul(tf.reshape(encoder_decoder_attention_outputs,(-1,tf.shape(encoder_decoder_attention_outputs)[2])),W_f)
    decoder_ffn_outputs = tf.reshape(decoder_ffn_outputs,(tf.shape(encoder_decoder_attention_outputs)[0],tf.shape(encoder_decoder_attention_outputs)[1],-1))
    
    decoder_outputs = decoder_ffn_outputs + encoder_decoder_attention_outputs
    # TODO: add the normalization step (layer normalization in the paper; omitted here)

with tf.Session() as sess:
    print(sess.run(decoder_outputs))

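(4) Fully connected layer and final output

The last fully connected layer is simple: the decoder output goes through a fully connected layer and a softmax, which gives the probability of choosing each word, and the cross-entropy loss is computed from it:

Code: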

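# A Variable rather than a constant, so that the optimizer has a parameter to train
# (with constants only, minimize() finds no variables to optimize).
W_final = tf.Variable([[0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4],
                       [0.2,0.3,0.5,0.4]],dtype=tf.float32)

logits = tf.matmul(tf.reshape(decoder_outputs,(-1,tf.shape(decoder_outputs)[2])),W_final)
logits = tf.reshape(logits,(tf.shape(decoder_outputs)[0],tf.shape(decoder_outputs)[1],-1))

# Probabilities for inspection only. The loss op below applies softmax itself,
# so it must receive the raw logits (the original applied softmax twice).
probs = tf.nn.softmax(logits)

# As in the original post, the decoder input ids double as the labels.
y = tf.one_hot(decoder_input,depth=4)

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=y))

train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)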

References:

https://jalammar.github.io/

https://www.jianshu.com/p/3f2d4bc126e6

https://www.leiphone.com/news/201709/8tDpwklrKubaecTa.html

https://www.cnblogs.com/hellojamest/p/11128799.html

https://blog.csdn.net/longxinchen_ml/article/details/86533005

https://www.imooc.com/article/67493

Prof. Hung-yi Lee's (李宏毅) course

NLP study notes: http://www.shuang0420.com/categories/NLP/page/9/
