Attention and Self-Attention

Reposted article. 2019-12-10 22:09. Tags: NLP / DeepLearning

I. Attention

1. Background

Attention was first introduced in a paper from Bengio's group, "Neural Machine Translation by Jointly Learning to Align and Translate", published at ICLR 2015.

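An encoder-decoder model typically encodes the input sentence into a single fixed-size state and feeds that state to the decoder at every time step. This works poorly for long sentences: quality drops sharply as sentence length grows.

Motivation: address the poor translation quality of encoder-decoder models on long sentences.

Principle: mimic how people attend to things. Given an image or a sentence, focus selectively on the important parts.

Approach in the paper: when generating the current word, combine the previous decoder state with every input word and compute a weight for each one. Words generated this way are better targeted, and the benefit is most visible on long sentences.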

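2. Core algorithm

Let i be the current output position and j an input position. $s_{i-1}$ is the decoder hidden state just before output position i, and $h_{j}$ is the encoder hidden state at input position j. The weight of $h_{j}$ is obtained by combining $s_{i-1}$ with $h_{j}$ into a score and normalizing that score over all input hidden states. In the paper's notation:

$e_{ij} = a(s_{i-1}, h_j)$

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$

$c_i = \sum_{j} \alpha_{ij} h_j$

Here $a$ is a small learned alignment network, $\alpha_{ij}$ is the weight of input position j when producing output position i, and $c_i$ is the context vector passed to the decoder.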
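The weight computation described above can be sketched in NumPy. This is a minimal illustration, assuming the additive scoring function $v_a^T \tanh(W_a s_{i-1} + U_a h_j)$ from the paper; the function name `additive_attention`, the parameter names and the random values are placeholders chosen for this example, not taken from any released code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (weights over input positions, context vector).

    s_prev: (d_s,)    previous decoder state s_{i-1}
    H:      (T, d_h)  encoder hidden states h_1..h_T
    W_a: (d_a, d_s), U_a: (d_a, d_h), v_a: (d_a,)  hypothetical parameters
    """
    # Score every input position j against s_{i-1} in one shot.
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # (T,)
    alpha = softmax(scores)                              # (T,), sums to 1
    context = alpha @ H                                  # (d_h,) weighted sum of h_j
    return alpha, context

rng = np.random.default_rng(0)
T, d_s, d_h, d_a = 6, 4, 5, 3            # illustrative sizes only
s_prev = rng.normal(size=d_s)
H = rng.normal(size=(T, d_h))
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

alpha, context = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha.shape, context.shape)
```

The weights `alpha` form a distribution over the T input positions, so the context vector is a convex combination of the encoder states rather than a single fixed summary.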

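II. Self-attention

1. Background

Self-attention comes from the Google paper "Attention Is All You Need", published at NIPS 2017.

1) Motivation: the recurrent structure of an RNN prevents parallelization, and RNNs also handle long-range dependencies poorly.

2) Approach: multiply the matrices of word vectors to get a similarity between every pair of words, so distance within the sentence is no longer a constraint.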

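3) Advantages:

The attention computation can be parallelized: it is matrix multiplication between tensors, with no sequential dependency;
Every pair of words in a sentence can be compared directly, regardless of distance;
The multi-head mechanism lets each head attend to one slice of the representation dimensions, for example one slice might capture part of speech and another semantics;
The blocks can be stacked into a very deep network, making full use of what deep models can learn.

4) Overall architecture: [figure not included]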

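2. Self-attention

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The term $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ is a matrix of similarity weights, and multiplying it by $V$ blends those weights into the embeddings.

Every word position can interact directly with every other word in the sentence, regardless of direction or distance. In the example sentence from the original figure (not included here), every word is connected to every other word in the same sentence by an edge. The darker the edge, the stronger the correlation, and words with vague meaning tend to have darker edges, for example: law, application, missing, opinion.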
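A minimal NumPy sketch of the scaled dot-product formula, applied as self-attention to one sentence. The sizes and the random projection matrices `W_q`, `W_k`, `W_v` are assumptions made for illustration; in a trained model they are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D inputs (one sentence)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (T_q, T_k) word-to-word weights
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8                 # 4 words, illustrative sizes
X = rng.normal(size=(T, d_model))         # one embedding per word
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Self-attention: queries, keys and values all come from the same sentence X.
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.shape)
```

Row t of `weights` holds how strongly word t attends to every word in the sentence, and all rows are computed in one matrix product, which is where the parallelism comes from.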

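3. Multi-head attention

1) How multi-head is implemented

Each word vector is split into h chunks, and the attention similarity is computed between the i-th chunks of all words in the sentence. The original paper describes h separate linear projections. The later BERT code applies one projection and then splits the result into h parts, which gives the same result when the h projection matrices are stacked side by side.

How the original paper does it (quoted from the paper):

Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively.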
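The relationship between "project h times" and "project once, then split" can be checked numerically. The sketch below uses random matrices as stand-ins for learned weights, and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4                  # illustrative sizes
d_k = d_model // h

X = rng.normal(size=(T, d_model))
# One projection matrix per head (random here, learned in a real model).
W_heads = [rng.normal(size=(d_model, d_k)) for _ in range(h)]

# Paper-style description: project the input h times.
per_head = [X @ W for W in W_heads]               # h arrays of shape (T, d_k)

# Single projection with the h matrices stacked side by side, then split.
W_big = np.concatenate(W_heads, axis=1)           # (d_model, h * d_k)
split = np.split(X @ W_big, h, axis=1)            # h arrays of shape (T, d_k)

same = all(np.allclose(a, b) for a, b in zip(per_head, split))
print(same)
```

Both routes produce the same per-head tensors, so the single large projection is mainly an implementation convenience: one matrix multiplication instead of h small ones.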

2) BERT implementation: linear projection first, then reshape, then matrix multiplication

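  # Excerpt written against the TensorFlow 1.x API. The helpers reshape_to_matrix,
  # transpose_for_scores and create_initializer are defined elsewhere in the BERT code.
  # Scalar dimensions referenced here: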
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))
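To make the shape comments in the excerpt concrete, here is a NumPy sketch of the same reshape, transpose and matmul steps. The function `to_heads` is a hypothetical stand-in for what the excerpt calls `transpose_for_scores`, and the random arrays replace the outputs of the dense layers; none of this is BERT's actual code.

```python
import numpy as np

B, F, T, N, H = 2, 5, 7, 8, 64            # illustrative sizes, N * H = 512
rng = np.random.default_rng(0)

# Stand-ins for the dense-layer outputs: [B*F, N*H] and [B*T, N*H].
query_2d = rng.normal(size=(B * F, N * H))
key_2d = rng.normal(size=(B * T, N * H))

def to_heads(x_2d, batch, seq_len, num_heads, head_size):
    """[batch*seq_len, num_heads*head_size] -> [batch, num_heads, seq_len, head_size]."""
    x = x_2d.reshape(batch, seq_len, num_heads, head_size)
    return x.transpose(0, 2, 1, 3)

q = to_heads(query_2d, B, F, N, H)        # [B, N, F, H]
k = to_heads(key_2d, B, T, N, H)          # [B, N, T, H]

# Batched Q K^T over the last two axes, scaled by 1 / sqrt(H).
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(H)   # [B, N, F, T]
print(q.shape, k.shape, scores.shape)
```

Each of the N heads only ever sees its own H-wide slice of the projected vectors, so `scores[b, n]` is an F-by-T similarity matrix computed from that slice alone.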

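3) Why multi-head helps

Words are mapped to vectors in a high-dimensional space, and different subspaces can learn different features. The intuition is that nearby dimensions learn more similar things, so attending within each subspace is more reasonable than matching over the whole space at once. For example, with a word vector of size 512 and h=8, each group of 64 dimensions gets its own attention, so what is learned is more fine-grained.

4. Classification and generation tasks

1) Classification tasks

Classification uses only the encoder; the decoder is not needed. The tensor output by the encoder can be reshaped and passed through one fully connected layer to classify, or a weighted sum over the words can be computed first and then classified.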
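Both classification options can be sketched in NumPy as follows. The encoder output is replaced by a random tensor, and the weight matrices and the scoring vector `w_score` are hypothetical parameters introduced only for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
B, T, C, num_classes = 3, 6, 8, 4         # illustrative sizes
enc_out = rng.normal(size=(B, T, C))      # stand-in for the encoder output

# Option 1: reshape (flatten all positions), then one fully connected layer.
W_flat = rng.normal(size=(T * C, num_classes))
logits_flat = enc_out.reshape(B, T * C) @ W_flat           # (B, num_classes)

# Option 2: weighted sum over the words, then classify.
w_score = rng.normal(size=C)                               # hypothetical scoring vector
word_weights = softmax(enc_out @ w_score, axis=-1)         # (B, T), one weight per word
pooled = np.einsum("bt,btc->bc", word_weights, enc_out)    # (B, C)
W_pool = rng.normal(size=(C, num_classes))
probs = softmax(pooled @ W_pool)                           # (B, num_classes)
print(logits_flat.shape, probs.shape)
```

Option 1 ties the classifier to a fixed sequence length T, while the weighted sum in option 2 produces a fixed-size vector for any sentence length.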

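2) Generation tasks

Generation needs the decoder. The decoder must mask out the words after the current position i by setting their attention scores to negative infinity before the softmax, because later words are not visible when the current word is generated. The decoder starts only after the whole encoder has finished. In the decoder's encoder-decoder attention, K and V come from the encoder output and Q comes from the decoder's previous layer.

The corresponding multihead_attention code follows; the causality argument indicates whether to apply the mask.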

import tensorflow as tf  # TensorFlow 1.x (uses tf.layers and tf.contrib)

def multihead_attention(queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None):
    '''Applies multihead attention.

    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked.
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A 3d tensor with shape of (N, T_q, C)
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]

        # Linear projections
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)  # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)  # (N, T_k, C)
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)

        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)

        # Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))  # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)

        paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)
        # Causality = Future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
            # Build a lower-triangular matrix: ones on and below the diagonal, zeros above it
            tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)

            paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
            # Replace the upper-triangle entries of outputs with a very large negative number
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)
        # After the softmax, the masked (upper-triangle) entries of outputs are effectively 0
        outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)

        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Weighted sum
        outputs = tf.matmul(outputs, V_)  # ( h*N, T_q, C/h)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)

        # Residual connection
        outputs += queries

        # Normalize (normalize is a helper that is not shown here, presumably layer normalization)
        outputs = normalize(outputs)  # (N, T_q, C)

    return outputs
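The effect of the causality branch can be seen on a small example. The NumPy sketch below mirrors the idea (lower-triangular mask, large negative fill value, then softmax) without TensorFlow; the random `scores` matrix stands in for the scaled product of Q and K for a single head.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))          # stand-in for Q K^T / sqrt(d_k), one head

# 1 on and below the diagonal, 0 above: position i may look at positions <= i.
tril = np.tril(np.ones((T, T)))

# Same fill value as the code above: a very large negative number.
masked = np.where(tril == 0, -2.0 ** 32 + 1, scores)
weights = softmax(masked)

print(np.round(weights, 3))
```

Every row still sums to 1, but all weight on future positions is gone, so the first word can only attend to itself.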
