Self-Attention Explained

This article is a repost. Posted 2019-07-09 10:08. Category: Deep Learning

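For simple, stateless custom operations, you can probably use a layers.core.Lambda layer. For custom layers that contain trainable weights, however, you should implement the layer yourself.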

Below is the skeleton of a Keras layer as of Keras 2.0 (if you are on an older version, please upgrade). You only need to implement three methods:

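build(input_shape): this is where you define the weights. This method must set self.built = True, which can be done by calling super([Layer], self).build().
call(x): this is where the layer's logic lives. You only need to care about the first argument passed to call, the input tensor, unless you want your layer to support masking.
compute_output_shape(input_shape): if your layer changes the shape of its input, define the shape transformation here so that Keras can infer the shape of each layer automatically.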

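This article is mainly about self-attention. It builds on the attention mechanism, so if you are not familiar with attention, I suggest first reading my post on understanding the attention mechanism in depth. I will also give a brief explanation here.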

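In an encoder-decoder model, the decoder's input includes (note: includes) the encoder's output. Intuitively, though, a given output does not need all of the encoder's information, only part of it. That is the essence of attention. To see what this means, take an example: suppose we are doing machine translation and translating "I am a student" into the Chinese "我是一個學生". In an encoder-decoder model, when producing "學生" (student) we use "我", "是", "一個" and the encoder's output. In fact, we probably do not need the largely irrelevant "I am a"; the information in "student" alone is enough to produce "學生" (put differently, "I am a" matters less than "student"). This is where the attention mechanism comes in: it assigns a weight to each of "I", "am", "a" and "student". For example, give "I", "am" and "a" a weight of 0.1 each and give "student" the remaining 0.7; the importance of "student" is then clear. I will not go into how this is actually done here.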
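The weighting idea above can be sketched in a few lines of Python. This is only an illustration: the two-dimensional encoder vectors are made-up values, the helper name context_vector is hypothetical, and the weights 0.1/0.1/0.1/0.7 are the ones from the example rather than the output of a trained model.

```python
# Hypothetical 2-d encoder outputs for each source word (made-up values).
encoder_outputs = {
    "I": [1.0, 0.0],
    "am": [0.0, 1.0],
    "a": [1.0, 1.0],
    "student": [2.0, 3.0],
}

# Attention weights from the example in the text; they sum to 1.
weights = {"I": 0.1, "am": 0.1, "a": 0.1, "student": 0.7}


def context_vector(outputs, weights):
    """Return the weighted sum of the encoder outputs."""
    dim = len(next(iter(outputs.values())))
    context = [0.0] * dim
    for word, vector in outputs.items():
        for i, value in enumerate(vector):
            context[i] += weights[word] * value
    return context


context = context_vector(encoder_outputs, weights)
print(context)
```

Because "student" carries a weight of 0.7, the resulting context vector is dominated by the "student" encoder output, which is the behaviour the example describes.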

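2. Self-attention
Self-attention is clearly a kind of attention mechanism. The attention described above is the weight of the input with respect to the output; in the example above, it is the weight of "I am a student" with respect to "學生". Self-attention is the weight of a sequence with respect to itself, for example the weight of each word in "I am a student" with respect to "am", and with respect to "student". The point of doing this is to take full account of the semantic and syntactic relationships between the different words in a sentence.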

So how should these weights be computed? The figures I found elsewhere, together with my own understanding, are as follows:

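Note: q, k and v correspond to Q, K and V in the attention mechanism. They are obtained by multiplying the input word vector by W(Q), W(K) and W(V) respectively. Their main purpose is to compute the weights.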

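Note: take the dot product of q and k and then normalize to get the weight (the larger the product, the higher the similarity and the higher the weight). Multiply each of the two weights obtained by its corresponding v and add the results to get the output. The output for the other word is obtained in the same way.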
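The two notes above can be followed step by step in plain Python. This is a minimal sketch under stated assumptions: the two input embeddings and the identity projection matrices are made-up values, the function names are hypothetical, softmax is assumed as the normalization, and no scaling factor is applied.

```python
import math


def matvec(vector, matrix):
    """Multiply a row vector (length n) by an n x m matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(inputs, w_q, w_k, w_v):
    """Return (weights, outputs) for every word in `inputs`."""
    queries = [matvec(x, w_q) for x in inputs]
    keys = [matvec(x, w_k) for x in inputs]
    values = [matvec(x, w_v) for x in inputs]

    all_weights, outputs = [], []
    for q in queries:
        # Dot product of this word's query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        all_weights.append(weights)
        outputs.append(out)
    return all_weights, outputs


# Made-up 2-d embeddings for two words and identity projection matrices.
inputs = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, outputs = self_attention(inputs, identity, identity, identity)
print(weights)
print(outputs)
```

With these toy values each word attends most strongly to itself, because its query has the largest dot product with its own key.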

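The above computes the output one word at a time. Written in matrix form, it is Q multiplied by the transpose of K; normalizing the resulting matrix (with softmax) directly gives the weights.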
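Written out as a formula, the matrix form might look as follows. This assumes the usual scaled dot-product formulation: X denotes the matrix of input word vectors and d_k the dimension of the key vectors, and neither symbol appears in the text above. The division by the square root of d_k corresponds to the scaling step in the code.

```latex
\begin{aligned}
Q &= X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V} \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
```

The softmax is applied row by row, so each row of the weight matrix sums to 1 and holds one word's weights over all the words in the sentence.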

# Building the self-attention model:
from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd

from keras import backend as K
from keras.engine.topology import Layer


class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create the trainable weights for this layer.
        # inputs.shape = (batch_size, time_steps, seq_len)
        # kernel[0], kernel[1] and kernel[2] hold W(Q), W(K) and W(V).
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3, input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(Self_Attention, self).build(input_shape)  # Be sure to call this at the end

    def call(self, x):
        # Project the input into queries, keys and values.
        WQ = K.dot(x, self.kernel[0])
        WK = K.dot(x, self.kernel[1])
        WV = K.dot(x, self.kernel[2])

        print("WQ.shape", WQ.shape)
        print("K.permute_dimensions(WK, [0, 2, 1]).shape", K.permute_dimensions(WK, [0, 2, 1]).shape)

        # Q times the transpose of K, for every sample in the batch.
        QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
        QK = QK / (64**0.5)  # 64**0.5 is the scaling value; it differs from problem to problem
        QK = K.softmax(QK)

        print("QK.shape", QK.shape)

        # Weighted sum of the values.
        V = K.batch_dot(QK, WV)
        return V

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)

A simple test on IMDB in Keras (without masking):

from __future__ import print_function
from keras.preprocessing import sequence
from keras.datasets import imdb

max_features = 20000
maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

from keras.models import Model
from keras.layers import *

S_inputs = Input(shape=(None,), dtype='int32')
embeddings = Embedding(max_features, 128)(S_inputs)
# embeddings = Position_Embedding()(embeddings)  # Adding Position_Embedding slightly improves accuracy
# Self_Attention is the layer defined above; it takes the embeddings as its single input.
O_seq = Self_Attention(128)(embeddings)
O_seq = GlobalAveragePooling1D()(O_seq)
O_seq = Dropout(0.5)(O_seq)
outputs = Dense(1, activation='sigmoid')(O_seq)

model = Model(inputs=S_inputs, outputs=outputs)
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_data=(x_test, y_test))

Reference blogs:

https://blog.csdn.net/xiaosongshine/article/details/90600028

https://blog.csdn.net/cpluss/article/details/85330256
