Algorithm Exploration - Transformer - Attention Is All You Need


Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder, and the best-performing models also connect the encoder and decoder through an attention mechanism. The authors propose a new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. The model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU; on WMT 2014 English-to-French it establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training cost of the best models in the literature. The Transformer also generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent neural networks, in particular long short-term memory and gated recurrent neural networks, have been firmly established as the state-of-the-art approaches for sequence modeling and transduction problems such as language modeling and machine translation. Since then, numerous efforts have continued to push the boundaries of recurrent language models and encoder-decoder architectures.

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning positions with steps in computation time, they generate a sequence of hidden states h_t as a function of the previous hidden state h_{t-1} and the input at position t. This inherently sequential nature precludes parallelization within a training example, which becomes critical at longer sequence lengths, since memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, with the latter also improving model performance. The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing dependencies to be modeled without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

In this work the authors propose the Transformer, a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively

Most competitive neural sequence transduction models have an encoder-decoder structure. The encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn); given z, the decoder then generates an output sequence (y1, ..., ym) one symbol at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1 respectively.
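To make the auto-regressive property concrete, here is a minimal greedy decoding sketch. The model, encode and decode calls, and the bos_id/eos_id token ids, are hypothetical placeholders and not part of the code in this post:

import torch

def greedy_decode(model, src, max_len=50, bos_id=1, eos_id=2):
    """Minimal greedy decoding sketch: each step feeds back the symbols
    generated so far, so the prediction for position i only depends on positions < i."""
    memory = model.encode(src)                       # z = (z1, ..., zn)
    ys = torch.tensor([[bos_id]], dtype=torch.long)  # start with <bos>
    for _ in range(max_len):
        logits = model.decode(ys, memory)            # [1, cur_len, vocab]
        next_id = logits[:, -1].argmax(dim=-1)       # most likely next symbol
        ys = torch.cat([ys, next_id.unsqueeze(0)], dim=1)
        if next_id.item() == eos_id:
            break
    return ys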

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.

Encoder: the encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization; that is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
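The LayerNorm(x + Sublayer(x)) pattern can also be written as a small reusable wrapper. The sketch below is only illustrative (the code in this post instead folds the residual and LayerNorm directly into MultiHeadAttention and PositionalWiseFeedForward):

import torch.nn as nn

class SublayerConnection(nn.Module):
    """Illustrative residual + layer-norm wrapper: LayerNorm(x + Sublayer(x))."""

    def __init__(self, model_dim=512, dropout=0.0):
        super(SublayerConnection, self).__init__()
        self.layer_norm = nn.LayerNorm(model_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is any callable mapping [B, L, model_dim] -> [B, L, model_dim]
        return self.layer_norm(x + self.dropout(sublayer(x)))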

Encoder code

# imports shared by the code snippets in this post
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Encoder: a stack of EncoderLayer modules."""

    def __init__(self,
                 vocab_size,
                 max_seq_len,
                 num_layers=6,
                 model_dim=512,
                 num_heads=8,
                 ffn_dim=2048,
                 dropout=0.0):
        super(Encoder, self).__init__()

        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(model_dim, num_heads, ffn_dim, dropout)
             for _ in range(num_layers)])

        self.seq_embedding = nn.Embedding(vocab_size + 1, model_dim, padding_idx=0)
        self.pos_embedding = PositionalEncoding(model_dim, max_seq_len)

    def forward(self, inputs, inputs_len):
        output = self.seq_embedding(inputs)       # token embedding
        output += self.pos_embedding(inputs_len)  # add positional encoding

        # [B, L, L] mask marking the padded positions of the input
        self_attention_mask = padding_mask(inputs, inputs)

        attentions = []
        for encoder in self.encoder_layers:
            output, attention = encoder(output, self_attention_mask)
            attentions.append(attention)

        return output, attentions
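The padding_mask helper used above is not shown in this post. The following is a minimal sketch consistent with how it is called, assuming the shared imports above and that padding tokens have index 0 (matching padding_idx=0 in the embedding):

def padding_mask(seq_k, seq_q):
    """[B, L_q, L_k] mask that is 1 wherever the attended-to (key) position is padding (id 0)."""
    len_q = seq_q.size(1)
    # uint8 so it can be added to the uint8 sequence_mask in the Decoder below
    pad_mask = seq_k.eq(0).to(torch.uint8)                   # [B, L_k]
    pad_mask = pad_mask.unsqueeze(1).expand(-1, len_q, -1)   # [B, L_q, L_k]
    return pad_mask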


       

Code for the EncoderLayer used above

class EncoderLayer(nn.Module):
    """A single Encoder layer."""

    def __init__(self, model_dim=512, num_heads=8, ffn_dim=2048, dropout=0.0):
        super(EncoderLayer, self).__init__()

        self.attention = MultiHeadAttention(model_dim, num_heads, dropout)
        self.feed_forward = PositionalWiseFeedForward(model_dim, ffn_dim, dropout)

    def forward(self, inputs, attn_mask=None):

        # self-attention: queries, keys and values all come from the same inputs
        context, attention = self.attention(inputs, inputs, inputs, attn_mask)

        # position-wise feed-forward network
        output = self.feed_forward(context)

        return output, attention

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Decoder: the decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, residual connections are employed around each of the sub-layers, followed by layer normalization. The self-attention sub-layer in the decoder stack is also modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Decoder code

 

class Decoder(nn.Module):
    """Decoder: a stack of DecoderLayer modules."""

    def __init__(self,
                 vocab_size,
                 max_seq_len,
                 num_layers=6,
                 model_dim=512,
                 num_heads=8,
                 ffn_dim=2048,
                 dropout=0.0):
        super(Decoder, self).__init__()

        self.num_layers = num_layers

        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(model_dim, num_heads, ffn_dim, dropout)
             for _ in range(num_layers)])

        self.seq_embedding = nn.Embedding(vocab_size + 1, model_dim, padding_idx=0)
        self.pos_embedding = PositionalEncoding(model_dim, max_seq_len)

    def forward(self, inputs, inputs_len, enc_output, context_attn_mask=None):
        output = self.seq_embedding(inputs)       # token embedding
        output += self.pos_embedding(inputs_len)  # add positional encoding

        # mask out the padded positions of the decoder input
        self_attention_padding_mask = padding_mask(inputs, inputs)
        # subsequent-position mask: each position may only attend to earlier positions
        seq_mask = sequence_mask(inputs)
        # a position is masked if it is padding OR a subsequent position
        self_attn_mask = torch.gt((self_attention_padding_mask + seq_mask), 0)

        self_attentions = []
        context_attentions = []
        for decoder in self.decoder_layers:
            output, self_attn, context_attn = decoder(
                output, enc_output, self_attn_mask, context_attn_mask)
            self_attentions.append(self_attn)
            context_attentions.append(context_attn)

        return output, self_attentions, context_attentions

Code for the sequence_mask used above

def sequence_mask(seq):
    """[B, L, L] upper-triangular mask that hides subsequent positions."""
    batch_size, seq_len = seq.size()
    # 1 strictly above the main diagonal, 0 elsewhere, e.g. for L = 3:
    # [[0, 1, 1],
    #  [0, 0, 1],
    #  [0, 0, 0]]
    mask = torch.triu(torch.ones((seq_len, seq_len), dtype=torch.uint8),
                      diagonal=1)
    mask = mask.unsqueeze(0).expand(batch_size, -1, -1)  # [B, L, L]
    return mask
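A quick check of how the two masks combine in the Decoder above, using the padding_mask sketch from earlier (the token ids are made up for illustration; 0 is the padding index):

seq = torch.tensor([[5, 7, 0]])   # one sequence of length 3 whose last token is padding
combined = torch.gt(padding_mask(seq, seq) + sequence_mask(seq), 0)
# combined[0] ==
# [[False,  True,  True],
#  [False, False,  True],
#  [False, False,  True]]   # column 2 is always masked (padding), upper triangle is masked (future)
print(combined)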

 

 

Code for the DecoderLayer used above. Compared with the EncoderLayer, it adds a second multi-head attention step: masked self-attention over the decoder input (so each position cannot see subsequent positions), followed by encoder-decoder ("context") attention over the encoder output.

class DecoderLayer(nn.Module):
    """A single Decoder layer."""

    def __init__(self, model_dim, num_heads=8, ffn_dim=2048, dropout=0.0):
        super(DecoderLayer, self).__init__()

        # masked self-attention over the decoder input
        self.attention = MultiHeadAttention(model_dim, num_heads, dropout)
        # encoder-decoder ("context") attention over the encoder output
        self.context_attention = MultiHeadAttention(model_dim, num_heads, dropout)
        self.feed_forward = PositionalWiseFeedForward(model_dim, ffn_dim, dropout)

    def forward(self,
                dec_inputs,
                enc_outputs,
                self_attn_mask=None,
                context_attn_mask=None):
        # self attention: query, key and value are all the decoder inputs
        dec_output, self_attention = self.attention(
            dec_inputs, dec_inputs, dec_inputs, self_attn_mask)

        # context attention
        # query is the decoder's output so far, key and value are the encoder's outputs
        dec_output, context_attention = self.context_attention(
            enc_outputs, enc_outputs, dec_output, context_attn_mask)

        # position-wise feed-forward network
        dec_output = self.feed_forward(dec_output)

        return dec_output, self_attention, context_attention

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

 

 

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
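As a concrete illustration of "a weighted sum of the values", the snippet below scores one query against three keys and mixes the corresponding values (all numbers are made up for illustration):

import torch
import torch.nn.functional as F

q = torch.tensor([1.0, 0.0])                             # one query
K = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three keys
V = torch.tensor([[10.0], [20.0], [30.0]])               # their values

weights = F.softmax(q @ K.t(), dim=-1)   # compatibility of q with each key
output = weights @ V                     # weighted sum of the values
print(weights, output)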

 

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √ dk, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V . We compute the matrix of outputs as:

 

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

This particular attention is called "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension d_k and values of dimension d_v. The dot products of the query with all keys are computed, each is divided by sqrt(d_k), and a softmax function is applied to obtain the weights on the values. In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed into matrices K and V, and the matrix of outputs is computed as above.

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/sqrt(d_k). Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is identical to this algorithm, except for the scaling factor of 1/sqrt(d_k). Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk [3]. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(d_k).

While the two mechanisms perform similarly for small values of d_k, additive attention outperforms unscaled dot-product attention for larger values of d_k. The suspicion is that for large d_k the dot products grow large in magnitude, pushing the softmax function into regions where its gradients are extremely small; to counteract this effect, the dot products are scaled by 1/sqrt(d_k).
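A quick way to see the effect of the 1/sqrt(d_k) scaling is to compare the softmax of raw versus scaled dot products of random vectors. This is only an illustrative experiment; d_k = 512 and the ten keys are arbitrary choices:

import torch
import torch.nn.functional as F

d_k = 512
q, K = torch.randn(d_k), torch.randn(10, d_k)   # one query, ten keys

scores = q @ K.t()                               # raw dot products, std ~ sqrt(d_k)
print(F.softmax(scores, dim=-1))                 # typically near one-hot: tiny gradients
print(F.softmax(scores / d_k ** 0.5, dim=-1))    # scaled: a much softer distribution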

 

class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention mechanism."""

    def __init__(self, attention_dropout=0.0):
        super(ScaledDotProductAttention, self).__init__()
        self.dropout = nn.Dropout(attention_dropout)
        self.softmax = nn.Softmax(dim=2)

    def forward(self, q, k, v, scale=None, attn_mask=None):
        """
        Forward pass.
        Args:
            q: queries tensor of shape [B, L_q, D_q]
            k: keys tensor of shape [B, L_k, D_k]
            v: values tensor of shape [B, L_v, D_v] (in general the same length as k)
            scale: scaling factor, a float scalar
            attn_mask: masking tensor of shape [B, L_q, L_k]

        Returns:
            the context tensor and the attention tensor
        """
        attention = torch.bmm(q, k.transpose(1, 2))   # Q K^T
        if scale is not None:
            attention = attention * scale             # scale by 1/sqrt(d_k)
        if attn_mask is not None:
            # set masked positions to -inf so softmax assigns them zero weight
            attention = attention.masked_fill_(attn_mask.bool(), -np.inf)
        # softmax over the key dimension
        attention = self.softmax(attention)
        # dropout on the attention weights
        attention = self.dropout(attention)
        # weighted sum of the values
        context = torch.bmm(attention, v)
        return context, attention

 

3.2.2 Multi-Head Attention

Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Instead of performing a single attention function with d_model-dimensional keys, values and queries, it proves beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of the queries, keys and values the attention function is then performed in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Where the projections are parameter matrices W_i^Q ∈ R^(d_model×d_k), W_i^K ∈ R^(d_model×d_k), W_i^V ∈ R^(d_model×d_v) and W^O ∈ R^(h·d_v×d_model). In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

 

 

Here the projections are the parameter matrices W_i^Q, W_i^K, W_i^V and W^O given above. This work uses h = 8 parallel attention layers, or heads, with d_k = d_v = d_model/h = 64 for each. Because each head's dimension is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.

Multi-Head Attention code

 

class MultiHeadAttention(nn.Module):

    def __init__(self, model_dim=512, num_heads=8, dropout=0.0):
        super(MultiHeadAttention, self).__init__()

        self.dim_per_head = model_dim // num_heads
        self.num_heads = num_heads
        self.linear_k = nn.Linear(model_dim, self.dim_per_head * num_heads)
        self.linear_v = nn.Linear(model_dim, self.dim_per_head * num_heads)
        self.linear_q = nn.Linear(model_dim, self.dim_per_head * num_heads)

        self.dot_product_attention = ScaledDotProductAttention(dropout)
        self.linear_final = nn.Linear(model_dim, model_dim)
        self.dropout = nn.Dropout(dropout)

        # layer norm applied after multi-head attention
        self.layer_norm = nn.LayerNorm(model_dim)

    def forward(self, key, value, query, attn_mask=None):
        # keep the query for the residual connection
        residual = query
        dim_per_head = self.dim_per_head
        num_heads = self.num_heads
        batch_size = key.size(0)

        # linear projections
        key = self.linear_k(key)
        value = self.linear_v(value)
        query = self.linear_q(query)

        # split into heads: [B, L, model_dim] -> [B * h, L, dim_per_head]
        key = key.view(batch_size * num_heads, -1, dim_per_head)
        value = value.view(batch_size * num_heads, -1, dim_per_head)
        query = query.view(batch_size * num_heads, -1, dim_per_head)

        if attn_mask is not None:
            attn_mask = attn_mask.repeat(num_heads, 1, 1)

        # scaled dot-product attention
        scale = (key.size(-1)) ** -0.5
        context, attention = self.dot_product_attention(
            query, key, value, scale, attn_mask)

        # concatenate heads: [B * h, L, dim_per_head] -> [B, L, model_dim]
        context = context.view(batch_size, -1, dim_per_head * num_heads)

        # final linear projection
        output = self.linear_final(context)

        # dropout
        output = self.dropout(output)

        # add residual and norm layer
        output = self.layer_norm(residual + output)

        return output, attention
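A quick shape check of the module above, assuming the shared imports from the first snippet (batch size, sequence length and the random inputs are arbitrary choices for illustration):

x = torch.randn(2, 10, 512)                 # [batch, seq_len, model_dim]
mha = MultiHeadAttention(model_dim=512, num_heads=8)
out, attn = mha(x, x, x)                    # self-attention: key = value = query = x
print(out.shape, attn.shape)                # torch.Size([2, 10, 512]) torch.Size([16, 10, 10])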

 

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].

The Transformer uses multi-head attention in three different ways:

In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, while the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence, mimicking the typical encoder-decoder attention mechanisms in sequence-to-sequence models.

The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

The encoder contains self-attention layers. In a self-attention layer, all of the keys, values and queries come from the same place, in this case the output of the previous encoder layer, so each position in the encoder can attend to all positions in the previous layer of the encoder.

Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.

Similarly, the self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. Leftward information flow in the decoder must be prevented to preserve the auto-regressive property; this is implemented inside scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax that correspond to illegal connections. See Figure 2.

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

 

 

 

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of the two linear transformations above with a ReLU activation in between.

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality df f = 2048.

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.

class PositionalWiseFeedForward(nn.Module):
    """Position-wise feed-forward network, implemented as two kernel-size-1 convolutions."""

    def __init__(self, model_dim=512, ffn_dim=2048, dropout=0.0):
        super(PositionalWiseFeedForward, self).__init__()
        self.w1 = nn.Conv1d(model_dim, ffn_dim, 1)
        self.w2 = nn.Conv1d(ffn_dim, model_dim, 1)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(model_dim)

    def forward(self, x):
        output = x.transpose(1, 2)                # Conv1d expects [B, model_dim, L]
        output = self.w2(F.relu(self.w1(output)))
        output = self.dropout(output.transpose(1, 2))

        # add residual and norm layer
        output = self.layer_norm(x + output)
        return output

3.5 Positional Encoding

 Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

Since the model contains no recurrence and no convolution, some information about the relative or absolute position of the tokens in the sequence must be injected so that the model can make use of the order of the sequence. To this end, "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed.

In this work, we use sine and cosine functions of different frequencies:

 

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

In this work, sine and cosine functions of different frequencies are used, as given above.

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, P Epos+k can be represented as a linear function of P Epos.

 

Here pos is the position and i is the dimension; that is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000·2π. This function was chosen because it was hypothesized to let the model easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
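The PositionalEncoding module used by the Encoder and Decoder above is not shown in this post. Below is a minimal sketch consistent with how it is called (pos_embedding(inputs_len) with a batch of sequence lengths): it precomputes the sinusoid table into a frozen nn.Embedding, reserving index 0 for padding. This is one common way to implement it, not necessarily the exact version the original code used.

import numpy as np
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding, stored in a frozen embedding table."""

    def __init__(self, model_dim, max_seq_len):
        super(PositionalEncoding, self).__init__()
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        position_encoding = np.array([
            [pos / np.power(10000, 2.0 * (j // 2) / model_dim) for j in range(model_dim)]
            for pos in range(max_seq_len)])
        position_encoding[:, 0::2] = np.sin(position_encoding[:, 0::2])
        position_encoding[:, 1::2] = np.cos(position_encoding[:, 1::2])

        # row 0 is an all-zero vector used for padded positions
        pad_row = torch.zeros([1, model_dim])
        position_encoding = torch.cat(
            (pad_row, torch.from_numpy(position_encoding).float()))

        self.position_encoding = nn.Embedding(max_seq_len + 1, model_dim)
        self.position_encoding.weight = nn.Parameter(position_encoding,
                                                     requires_grad=False)

    def forward(self, input_len):
        """input_len: [B] tensor of sequence lengths; returns [B, max_len_in_batch, model_dim]."""
        max_len = int(torch.max(input_len))
        # positions 1..len for real tokens, 0 for the padded tail
        input_pos = torch.tensor(
            [list(range(1, int(l) + 1)) + [0] * (max_len - int(l)) for l in input_len],
            dtype=torch.long)
        return self.position_encoding(input_pos)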
