multi-head attention

This article is a repost (posted 2018-12-13 17:45).

■ Paper | Attention Is All You Need

■ Link | https://www.paperweekly.site/papers/224

■ Source code | https://github.com/Kyubyong/transformer

■ Paper | Weighted Transformer Network for Machine Translation

■ Link | https://www.paperweekly.site/papers/2013

■ Source code | https://github.com/JayParks/transformer

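Idea: discard the RNN and model sequences using attention alone.

A new network architecture: the Transformer, whose attention mechanism is called self-attention. The Transformer computes representations of its input and output without relying on an RNN, which is why the authors say that attention is all you need.

Model: it still consists of two stages, an encoder and a decoder. Both drop the RNN and are instead built from stacked self-attention and fully connected layers. The architecture is the one shown in Figure 1 of the paper.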

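The model contains three attention components: the encoder's self-attention, the decoder's self-attention, and the attention that connects the encoder and the decoder. All three blocks take the form of multi-head attention, and each takes three inputs: a query Q, a key K, and a value V. They differ only in what Q, K, and V are set to. The rest of this article focuses on the core module, multi-head attention.

Multi-head attention is built by combining several scaled dot-product attention units, the basic building block, in parallel.

Taken literally, scaled dot-product attention is dot-product attention with a scaling factor, so let us look at it first.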

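So what exactly are Q, K, and V? The attention in the encoder is called self-attention: as the name suggests, a sequence attends to itself. In the encoder stage of a traditional seq2seq model, once we have the hidden states for n time steps, we can take the hidden state h_i at each time step and compute attention against every hidden state h_j, j = 1, 2, …, n, which already has the flavor of self-attention. In the current model the RNN is gone, so the encoder has no hidden states. What, then, does self-attention operate on?

Suppose the input sequence has n words. We can first embed each word to obtain n embeddings, and then use the embeddings in place of hidden states for self-attention. So, in the first encoder layer, the matrix Q holds all of the word embeddings, and so do K and V.

Why is Q called the query? Each time, you take one word embedding and use it to "query" how well it matches every word embedding (that is, how large the attention weight is). You do this n times in total, once per word.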

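Let d_model denote the dimension of a word embedding. Q then has shape n × d_model, and so do K and V. If the embedding of the i-th word is v_i, then, following the paper's definition of scaled dot-product attention, the attention output for that word is:

$$\mathrm{Attention}(v_i, K, V) = \mathrm{softmax}\!\left(\frac{v_i K^{\top}}{\sqrt{d_{model}}}\right) V$$

That is essentially all there is to scaled dot-product attention. A traditional RNN-based encoder has an input and an output at every time step, whereas this encoder, having dropped the sequential RNN, can take in the entire sequence at once and run self-attention over it in a single pass.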
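The computation described above can be sketched in a few lines of plain Python. This is only a minimal illustration under simplifying assumptions, not the paper's or either repository's implementation: the function names `softmax` and `self_attention` are made up for this example, and Q, K, and V are all taken to be the raw word embeddings, with no learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Scaled dot-product self-attention with Q = K = V = embeddings.

    embeddings: list of n vectors, each of length d_model.
    Returns a list of n output vectors, each of length d_model.
    """
    d_model = len(embeddings[0])
    outputs = []
    # One "query" round per word, n rounds in total.
    for q in embeddings:
        # Scaled dot product of the query with every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model)
            for k in embeddings
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, embeddings))
            for c in range(d_model)
        ])
    return outputs

# Toy input: n = 3 words, d_model = 2 (values chosen arbitrarily).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(words):
    print(row)
```

The explicit loop mirrors the "one query per word" description; a real implementation would compute all n rounds at once as a single matrix product.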

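Once scaled dot-product attention is understood, multi-head attention is easy to follow, because it simply runs scaled dot-product attention several times and combines the results.

First apply linear transformations to Q, K, and V, then compute attention on the resulting Q', K', and V'. Repeat this h times, each time with its own transformations, concatenate the h results, and apply one final linear transformation. The result is the output of the multi-head attention block.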
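The steps just listed (project, attend, repeat h times, concatenate, project again) can be sketched as follows. This is a hedged illustration rather than the authors' code: all function names are hypothetical, the weight matrices are random instead of learned, and the per-head width d_k = d_model / h follows the choice made in the paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for matrices q, k, v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(q, k, v, heads, w_o):
    """heads: list of (w_q, w_k, w_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the h head outputs along the feature dimension.
    concat = []
    for i in range(len(q)):
        row = []
        for head in head_outputs:
            row.extend(head[i])
        concat.append(row)
    # Final linear transformation back to d_model.
    return matmul(concat, w_o)

rng = random.Random(0)
n, d_model, h = 3, 8, 2
d_k = d_model // h
x = random_matrix(n, d_model, rng)  # stand-in for n word embeddings
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)
out = multi_head_attention(x, x, x, heads, w_o)
print(len(out), len(out[0]))
```

Passing the same matrix as q, k, and v gives self-attention; the output keeps the input shape of n × d_model, so blocks of this kind can be stacked layer upon layer.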

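The above covers the encoder's self-attention. The encoder-decoder attention in the decoder works similarly: the queries come from the decoder while the keys and values come from the encoder, so each v_i in the decoder attends to the v_j in the encoder, a form of cross attention.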
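As a small sketch of this cross attention, the same scaled dot-product routine can be called with queries from one sequence and keys and values from another. The names `attention`, `decoder_states`, and `encoder_states` are illustrative assumptions, the vectors are arbitrary toy values, and the learned projections are again left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; queries may differ from keys/values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# m = 2 decoder positions attend over n = 3 encoder positions.
decoder_states = [[1.0, 0.0], [0.5, 0.5]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention(decoder_states, encoder_states, encoder_states)
print(len(context), len(context[0]))
```

The output has one row per decoder position, each row being a weighted mix of encoder vectors.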

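Self-attention in the decoder follows the same idea, with one caveat: when v_i attends to v_j in the decoder, some pairs are not allowed. In the encoder you can feed in all the words of the sequence at once, but in the decoder you cannot always do so. At inference time the decoder produces its output from left to right, so v_i has no chance to attend to v_j with j > i.

How is this reflected in the attention computation? The paper says it is enough to set score(v_i, v_j) = -∞ for those pairs. Why? Because exp(-∞) = 0, so the softmax weight on v_j becomes:

$$\alpha_{ij} = \frac{\exp(\mathrm{score}(v_i, v_j))}{\sum_{k} \exp(\mathrm{score}(v_i, v_k))} = 0 \quad \text{for } j > i$$

So v_j is masked out when computing the self-attention for v_i, which solves the problem.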
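A minimal sketch of this masking, assuming the same simplified setup with raw embeddings as queries and keys: scores for positions j > i are replaced by -∞ before the softmax, so their weights come out as exactly zero. The function name `causal_attention_weights` is hypothetical.

```python
import math

NEG_INF = float("-inf")

def softmax(xs):
    m = max(xs)
    # math.exp(-inf) is 0.0, so masked positions get zero weight.
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(embeddings):
    """Attention weights where position i only sees positions j <= i."""
    d_model = len(embeddings[0])
    weights = []
    for i, q in enumerate(embeddings):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model)
            if j <= i else NEG_INF  # mask out future positions
            for j, k in enumerate(embeddings)
        ]
        weights.append(softmax(scores))
    return weights

words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention_weights(words):
    print(row)
```

Each row of weights still sums to 1, and the entries above the diagonal are zero, which is the behavior the masking is meant to produce.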
