Google TensorFlow BERT code analysis

Reposted article; original published 2019-03-03 14:44. Category: Algorithms

I read through the BERT code with the help of a few blog posts and took these notes. The code is in bert_modeling.py.

The blog posts I referred to:

https://blog.csdn.net/weixin_39470744/article/details/84401339

https://www.jianshu.com/p/2a3872148766

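The code falls into three main parts:

1. Input processing: convert each word (each character, for Chinese) into its embedding, then add the positional embedding and the token type embedding.

The positional embedding carries position information, i.e. where the token sits in the sentence. The token type embedding indicates which sentence the token belongs to.

The output has shape [batch_size, seq_length, width], where width is the length of the embedding vector.

2. The encoder part uses a transformer to encode the sentence. The main structure comes from "Attention Is All You Need", with a few small differences from the paper.

3. The decoder part, which handles decoding.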

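First, the data processing part:

1. The input text for the BERT model is processed and wrapped in the InputExample class, which holds guid, text_a, text_b and label.

These fields are converted into the format below. A leading ## marks a WordPiece continuation piece, i.e. a sub-word that attaches to the token before it. [CLS] is always the first token; in classification tasks its output serves as the sentence vector.

[SEP] is the sentence separator. If there is only one sentence, text_b may be empty.
#  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
#  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1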
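
The layout above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name build_tokens_and_type_ids is made up for illustration, and it assumes the text has already been split into WordPiece tokens.

```python
def build_tokens_and_type_ids(tokens_a, tokens_b=None):
    """Assemble [CLS] A [SEP] (B [SEP]) and the matching segment ids."""
    # Segment 0 covers [CLS], sentence A and the first [SEP].
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b:
        # Segment 1 covers sentence B and its closing [SEP].
        tokens += list(tokens_b) + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


tokens, type_ids = build_tokens_and_type_ids(
    ["is", "this", "jack", "##son", "##ville", "?"],
    ["no", "it", "is", "not", "."],
)
print(tokens)
print(type_ids)
```

With text_b left out, every type id is 0 and the sequence ends after the first [SEP].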

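The input is capped at a maximum length, and shorter inputs are padded with 0, where 0 is a token id. Once that is done, the sequence of token ids is converted into a sequence of embedding vectors.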
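
A minimal sketch of the padding step, assuming the token ids are already available as a Python list. The function name pad_to_max_length and the example ids are hypothetical, and the simple tail truncation is a shortcut taken for brevity rather than what the real preprocessing does.

```python
def pad_to_max_length(input_ids, type_ids, max_seq_length):
    """Truncate or zero-pad ids to max_seq_length and build the input mask."""
    # Naive tail truncation is an assumption made to keep the sketch short.
    input_ids = list(input_ids)[:max_seq_length]
    type_ids = list(type_ids)[:max_seq_length]
    # 1 marks a real token, 0 marks padding.
    input_mask = [1] * len(input_ids)
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, type_ids + padding


ids, mask, segments = pad_to_max_length([7, 8, 9], [0, 0, 0], max_seq_length=5)
print(ids, mask, segments)
```

The mask built here is what later lets the attention layer tell real tokens from padding.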

The code that maps the token id sequence to the embedding sequence is as follows:

# Perform embedding lookup on the word ids.
(self.embedding_output, self.embedding_table) = embedding_lookup(
    input_ids=input_ids,
    vocab_size=config.vocab_size,
    embedding_size=config.hidden_size,
    initializer_range=config.initializer_range,
    word_embedding_name="word_embeddings",
    use_one_hot_embeddings=use_one_hot_embeddings)

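The code below adds positional embeddings and token type embeddings on top of the word embedding sequence. embedding_postprocessor covers both token_type_embedding and position_embedding, i.e. the Segment Embeddings and Position Embeddings in the figure.

The comments in this part of the code are very detailed, so see the source comments for how embedding_postprocessor is implemented. BERT's position embeddings are learned as parameters, whereas in the Transformer paper they are computed from a fixed formula.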

self.embedding_output = embedding_postprocessor(
    input_tensor=self.embedding_output,
    use_token_type=True,
    token_type_ids=token_type_ids,
    token_type_vocab_size=config.type_vocab_size,
    token_type_embedding_name="token_type_embeddings",
    use_position_embeddings=True,
    position_embedding_name="position_embeddings",
    initializer_range=config.initializer_range,
    max_position_embeddings=config.max_position_embeddings,
    dropout_prob=config.hidden_dropout_prob)


Note in particular that the final output also goes through layer norm and dropout: output = layer_norm_and_dropout(output, dropout_prob)
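
To make the post-processing concrete, here is a rough sketch for a single token in plain Python. It is a simplification under stated assumptions: the three input vectors are made-up numbers, the learned scale and shift of layer norm are left out, the epsilon value is assumed, and dropout is skipped as it would be at inference time.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def postprocess_embedding(word_emb, type_emb, pos_emb):
    """Sum the three embeddings element-wise, then apply layer norm."""
    summed = [w + t + p for w, t, p in zip(word_emb, type_emb, pos_emb)]
    return layer_norm(summed)


out = postprocess_embedding(
    [0.1, 0.2, 0.3, 0.4],  # word embedding (made-up values)
    [0.0, 0.1, 0.0, 0.1],  # token type embedding (made-up values)
    [0.5, 0.4, 0.3, 0.2],  # position embedding (made-up values)
)
print(out)
```

The key point is that the three embeddings share one width, so they can simply be added before normalization.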

2. Encoder code

The first step builds an attention_mask from the input:

attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

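This keeps attention away from the positions that input_mask marks as 0, which in practice are the padding tokens. Those positions should end up with very low scores when attention is computed.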
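
One way to picture this mask: every query position in a sequence receives the same row, namely that sequence's input_mask. The sketch below uses nested Python lists instead of tensors, and the function name create_attention_mask is a stand-in chosen for illustration.

```python
def create_attention_mask(input_mask):
    """Expand [batch, seq_len] 0/1 masks to [batch, seq_len, seq_len].

    Each "from" position gets a copy of the "to" mask, so a 0 column
    hides that position from every query in the same sequence.
    """
    return [[list(row) for _ in row] for row in input_mask]


attention_mask = create_attention_mask([[1, 1, 0]])
for row in attention_mask[0]:
    print(row)
```

Here the third token is padding, so the third column is 0 for all three query rows.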

Next comes transformer_model, which is essentially the transformer itself. For background, see "Attention Is All You Need". This blog post, which is a translation, is also well written: https://blog.csdn.net/yujianmin1990/article/details/85221271

self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,
    num_attention_heads=config.num_attention_heads,
    intermediate_size=config.intermediate_size,
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)


Now let's go through the transformer_model code in detail.
The function is defined as follows:

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):


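input_tensor: shape [batch_size, seq_length, hidden_size].
attention_mask: the mask mentioned earlier, used to suppress attention to padded and masked-out positions; shape [batch_size, seq_length, seq_length].

hidden_size: the size of the transformer's hidden layers.

num_hidden_layers: the number of transformer layers, i.e. the number of blocks. One block consists of an attention layer followed by a feed-forward layer, each with dropout, a residual connection and layer norm.

num_attention_heads: the number of heads in the transformer's multi-head attention; BERT sets it to 12.

intermediate_size: the size of the feed-forward intermediate layer.
Now for the code itself. It begins by checking that hidden_size is an integer multiple of num_attention_heads.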
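
The divisibility check matters because the hidden vector is split evenly across the heads. A small sketch of that check, with a hypothetical helper name attention_head_size and an error message of my own wording:

```python
def attention_head_size(hidden_size, num_attention_heads):
    """Return the per-head width, refusing sizes that do not split evenly."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "hidden_size (%d) is not a multiple of num_attention_heads (%d)"
            % (hidden_size, num_attention_heads))
    return hidden_size // num_attention_heads


# With the defaults from the signature above: 768 / 12 = 64 per head.
print(attention_head_size(768, 12))
```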

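The input is reshaped from 3D to 2D so that the tensor does not have to be reshaped back and forth during processing, which improves efficiency.
This step turns [batch_size, seq_len, width] into [batch_size*seq_len, width]: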
prev_output = reshape_to_matrix(input_tensor)
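
In terms of plain nested lists, the reshape amounts to dropping the batch boundary and stacking all token vectors into one matrix. The helper flatten_to_matrix below is a pure-Python stand-in written for this note, not the library function.

```python
def flatten_to_matrix(tensor_3d):
    """[batch, seq_len, width] nested lists -> [batch * seq_len, width]."""
    return [token_vec for sequence in tensor_3d for token_vec in sequence]


batch = [
    [[1, 2], [3, 4], [5, 6]],     # sequence 0: 3 tokens, width 2
    [[7, 8], [9, 10], [11, 12]],  # sequence 1
]
matrix = flatten_to_matrix(batch)
print(len(matrix), len(matrix[0]))
```

Row b * seq_len + f of the matrix is token f of sequence b, which is the indexing the later shape comments such as [B*F, N*H] rely on.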

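Next is the attention layer. It computes attention in general, and when the query and key come from the same tensor it is self-attention.
The first step is to compute query_layer, key_layer and value_layer.
Attention is abstracted here into three parts: query, key and value. Usually the key and value are the same, and the query determines how much the value of each key contributes.
For example, in an RNN-based seq2seq model (where both encoder and decoder are RNNs), the query is the decoder's output at the previous time step, and the keys and values are the encoder RNN's states at each time step.
In the computation, query_layer = W*query, and key and value are handled the same way.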

# `query_layer` = [B*F, N*H]
query_layer = tf.layers.dense(
    from_tensor_2d,
    num_attention_heads * size_per_head,
    activation=query_act,
    name="query",
    kernel_initializer=create_initializer(initializer_range))

# `key_layer` = [B*T, N*H]
key_layer = tf.layers.dense(
    to_tensor_2d,
    num_attention_heads * size_per_head,
    activation=key_act,
    name="key",
    kernel_initializer=create_initializer(initializer_range))

# `value_layer` = [B*T, N*H]
value_layer = tf.layers.dense(
    to_tensor_2d,
    num_attention_heads * size_per_head,
    activation=value_act,
    name="value",
    kernel_initializer=create_initializer(initializer_range))
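
The projections above come out as 2D matrices of shape [B*F, N*H]. Before per-head scores can be computed, each row has to be cut into N chunks of width H and regrouped as [B, N, F, H]. The sketch below does that regrouping on nested lists; split_heads is a hypothetical name and the tiny sizes are chosen only so the result is easy to read.

```python
def split_heads(matrix, batch_size, seq_length, num_heads, head_size):
    """Regroup a [B*F, N*H] matrix as nested lists of shape [B, N, F, H]."""
    result = []
    for b in range(batch_size):
        heads = []
        for n in range(num_heads):
            rows = []
            for f in range(seq_length):
                row = matrix[b * seq_length + f]
                # Head n owns columns [n*H, (n+1)*H) of every row.
                rows.append(row[n * head_size:(n + 1) * head_size])
            heads.append(rows)
        result.append(heads)
    return result


# B=1, F=2, N=2, H=2
projected = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
per_head = split_heads(projected, 1, 2, 2, 2)
print(per_head)
```

After this regrouping, each head sees its own [F, H] slice and can be scored independently.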

Then the attention scores are computed, in the same way as in the Transformer paper:

attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
attention_scores = tf.multiply(attention_scores,
                               1.0 / math.sqrt(float(size_per_head)))



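The way the tensor shapes change in this part of the code, and the way matrix multiplication is used, are quite clever. It is worth working through the shapes by hand; the code is very concise.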
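
For one head, the score computation reduces to a dot product between every query row and every key row, scaled by 1/sqrt(H). A plain-Python sketch with made-up numbers, where attention_scores is just an illustrative function name:

```python
import math


def attention_scores(query, key):
    """query: [F, H], key: [T, H] -> scaled dot-product scores [F, T]."""
    head_size = len(query[0])
    scale = 1.0 / math.sqrt(head_size)
    return [
        [scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in key]
        for q_row in query
    ]


query = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 2.0, 0.0, 0.0]]
key = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [1.0, 1.0, 0.0, 0.0]]
scores = attention_scores(query, key)
print(scores)
```

With F=2 queries and T=3 keys the result is a 2 x 3 table, one row of scores per query position.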

This part applies the attention mask. Positions that were masked out or padded get -10000 added to their score, and the scores are then turned into probabilities with softmax.

if attention_mask is not None:
  # `attention_mask` = [B, 1, F, T]
  attention_mask = tf.expand_dims(attention_mask, axis=[1])

  # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
  # masked positions, this operation will create a tensor which is 0.0 for
  # positions we want to attend and -10000.0 for masked positions.
  adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

  # Since we are adding it to the raw scores before the softmax, this is
  # effectively the same as removing these entirely.
  attention_scores += adder
# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)
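
The effect of the -10000 adder can be checked on a single row of scores. The sketch below is a plain-Python rendering of the same idea with made-up scores; masked_softmax is a name invented for this example.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over one row of scores; mask is 1 to attend, 0 to hide."""
    # Same trick as above: add 0 where mask is 1 and -10000 where it is 0.
    adjusted = [s + (1.0 - m) * -10000.0 for s, m in zip(scores, mask)]
    # Subtracting the max keeps exp() numerically stable.
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_softmax([0.5, 1.0, 0.2], [1, 1, 0])
print(probs)
```

The hidden position ends up with a probability that is zero for all practical purposes, and the remaining probabilities still sum to 1.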



Dropout is also applied to the attention probabilities:

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

Next, value_layer is multiplied by attention_probs.
The final output of attention_layer has shape

[B*F, N*H] or [B, F, N*H]

# Scalar dimensions referenced here:
#   B = batch size (number of sequences)
#   F = `from_tensor` sequence length
#   T = `to_tensor` sequence length
#   N = `num_attention_heads`
#   H = `size_per_head`

With the multi-head mechanism, once attention_layer has been computed for every head, the results are all concatenated together.

attention_output = tf.concat(attention_heads, axis=-1)

Note that the last dimension of attention_output here has the same size as that of layer_input.

attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input)
This adds the residual connection.
Then come two fully connected layers, followed at the end by dropout and layer_norm.

# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
  intermediate_output = tf.layers.dense(
      attention_output,
      intermediate_size,
      activation=intermediate_act_fn,
      kernel_initializer=create_initializer(initializer_range))

# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
  layer_output = tf.layers.dense(
      intermediate_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  layer_output = dropout(layer_output, hidden_dropout_prob)
  layer_output = layer_norm(layer_output + attention_output)
  prev_output = layer_output
  all_layer_outputs.append(layer_output)
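
The feed-forward step can be traced for a single token vector in plain Python. This is a sketch under several assumptions: the weights are tiny made-up matrices, biases and dropout are omitted, layer norm has no learned scale or shift, and gelu is written in its exact erf form, which may differ from the formulation used in the BERT source.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dense(vec, weights):
    """Multiply a vector by a weights matrix given as [in_dim][out_dim]."""
    out_dim = len(weights[0])
    return [sum(v * weights[i][j] for i, v in enumerate(vec))
            for j in range(out_dim)]


def layer_norm(vec, eps=1e-12):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def feed_forward(attention_output, w_in, w_out):
    """Up-project with GELU, down-project, add the residual, normalize."""
    intermediate = [gelu(x) for x in dense(attention_output, w_in)]
    projected = dense(intermediate, w_out)
    return layer_norm([p + a for p, a in zip(projected, attention_output)])


# hidden_size=2, intermediate_size=4, made-up weights
w_in = [[1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
layer_output = feed_forward([1.0, -1.0], w_in, w_out)
print(layer_output)
```

The output has the same width as the input, which is what allows the next block to take it as its own input_tensor.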
