Table of Contents
Preface
Source Code Walkthrough
Model Configuration Parameters
BertModel
word embedding
embedding_postprocessor
Transformer
self_attention
Model Usage
Preface
BERT is built mainly on the Transformer architecture (paper: Attention Is All You Need). It drops RNNs and other recurrent structures altogether and handles the sequence-to-sequence problem with attention alone, a nice example of "less is more". There are plenty of write-ups on this model online, but most of them repeat the same material. I recommend the Zhihu article explaining "Attention Is All You Need"; I think it introduces the Transformer very well.
The most headache-inducing part of the model is keeping track of tensor dimensions; once the dimensions are clear, the model itself is easy to understand, so in the source code I annotate the shape of the tensor after every operation.
Below I walk through how the BERT model in modeling.py is built. I still think reading the code and its comments is the fastest way to understand it, so if any of the official comments are hard to follow, look at the added inline comments and the shape annotations.
Source Code Walkthrough
Model Configuration Parameters
"attention_probs_dropout_prob": 0.1,  # dropout probability applied to the attention weights after the softmax
"hidden_act": "gelu",                 # activation function
"hidden_dropout_prob": 0.1,           # dropout probability for the hidden layers
"hidden_size": 768,                   # hidden size
"initializer_range": 0.02,            # stddev used for weight initialization
"intermediate_size": 3072,            # size of the feed-forward "up-projection" layer
"max_position_embeddings": 512,       # must be at least seq_length; used to build the position embedding table
"num_attention_heads": 12,            # number of attention heads in each hidden layer
"num_hidden_layers": 12,              # number of hidden (Transformer) layers
"type_vocab_size": 2,                 # number of segment_ids classes, i.e. [0, 1]
"vocab_size": 30522                   # number of tokens in the vocabulary
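As a quick illustration (a minimal sketch, not part of the walkthrough below), this is roughly how the config above is loaded with the BertConfig helper in modeling.py, assuming the JSON is saved as bert_config.json in the working directory:
import modeling

# Load the hyper-parameters listed above from the JSON file.
bert_config = modeling.BertConfig.from_json_file("bert_config.json")
print(bert_config.hidden_size)          # 768
print(bert_config.num_attention_heads)  # 12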
The input arguments input_ids, input_mask, and token_type_ids used below correspond to the input_ids, input_mask, and segment_ids produced in the previous post.
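For concreteness, here is a hypothetical example of what the three inputs might look like for one padded sentence pair (the token ids are made up for illustration; real ids come from the WordPiece vocabulary):
#   tokens:          [CLS]   my   dog    is  [SEP]  cute  [SEP] [PAD]
input_ids      = [[   101, 2026, 3899, 2003,  102, 10140,  102,    0]]  # [batch_size=1, seq_length=8]
input_mask     = [[     1,    1,    1,    1,    1,     1,    1,    0]]  # 1 = real token, 0 = padding
token_type_ids = [[     0,    0,    0,    0,    0,     1,    1,    0]]  # 0 = sentence A, 1 = sentence B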
BertModel
This part is the overall flow. The whole modeling script is more than 900 lines of code, so I will follow the flow chart step by step. The overall flow is: first run the embedding operations on input_ids and token_type_ids, then feed the embedding result into the Transformer, and finally obtain the encoded result.
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=True,
scope=None):
"""Constructor for BertModel.
Args:
config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
whether dropout will be applied.
input_ids: int32 Tensor of shape [batch_size, seq_length].
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
        it is much faster if this is True, on the CPU or GPU, it is faster if
this is False.
scope: (optional) variable scope. Defaults to "bert".
Raises:
ValueError: The config is invalid or one of the input tensor shapes
is invalid.
"""
config = copy.deepcopy(config)
if not is_training:
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
        # embedding_output: [batch_size, seq_length, embedding_size]; embedding_table: [vocab_size, embedding_size]
        (self.embedding_output, self.embedding_table) = embedding_lookup(  # word embedding lookup
input_ids=input_ids, #[batch_size,seq_length]
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(  # adds token_type_embedding and position_embedding -> [batch_size, seq_length, embedding_size]
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(
input_ids, input_mask)
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(  # returns a list of [batch_size, seq_length, hidden_size] tensors, one per layer
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
      self.sequence_output = self.all_encoder_layers[-1]  # take the output of the last layer
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)  # take the encoding of the first token ([CLS]) of each sequence, which carries information about the whole sequence -> [batch_size, hidden_size]
        self.pooled_output = tf.layers.dense(  # run it through a fully connected layer -> [batch_size, hidden_size]
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
word embedding
First, look at the word embedding part. It takes input_ids and, using a one-hot matrix as an intermediary (or tf.nn.embedding_lookup directly), returns the embedding result.
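Before the source itself, a minimal sketch (with a tiny made-up embedding table, not from the original code) of why the one-hot path and tf.nn.embedding_lookup() give the same result:
import tensorflow as tf

vocab_size, embedding_size = 5, 3
embedding_table = tf.constant([[10.0 * i + j for j in range(embedding_size)]
                               for i in range(vocab_size)])               # [5, 3]
input_ids = tf.constant([[2, 4, 0]])                                      # [batch_size=1, seq_length=3]

one_hot = tf.one_hot(tf.reshape(input_ids, [-1]), depth=vocab_size)       # [3, 5]
by_matmul = tf.reshape(tf.matmul(one_hot, embedding_table), [1, 3, embedding_size])
by_lookup = tf.nn.embedding_lookup(embedding_table, input_ids)            # [1, 3, 3]
# by_matmul and by_lookup hold the same values; the one-hot matmul form is simply faster on TPUs.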
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up word embeddings for an id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
for TPUs.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])  # add a trailing dimension -> [batch_size, seq_length, 1]
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
if use_one_hot_embeddings:
flat_input_ids = tf.reshape(input_ids, [-1]) #[batch_size*seq_length]
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) #[batch_size*seq_length,vocab_size]
output = tf.matmul(one_hot_input_ids, embedding_table) #[batch_size*seq_length,embedding_size]
else:
output = tf.nn.embedding_lookup(embedding_table, input_ids)
input_shape = get_shape_list(input_ids)
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size]) #[batch_size,seq_length,embedding_size]
return (output, embedding_table)
embedding_postprocessor
Next, look at embedding_postprocessor. It adds token_type_embedding and position_embedding, i.e. the Segment Embeddings and Position Embeddings in the figure.
Note that the Position Embeddings here differ from the original Transformer: in this code the position embeddings are learned parameters, while the traditional Transformer uses fixed sinusoidal values (shown below).
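For reference (this is not in modeling.py), the fixed encoding from "Attention Is All You Need" is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a small NumPy sketch:
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                          # [max_len, 1]
    i = np.arange(d_model)[None, :]                            # [1, d_model]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                       # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])                       # odd dimensions
    return pe                                                  # fixed values, never trained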
def embedding_postprocessor(input_tensor, #[batch_size,seq_length,embedding_size]
use_token_type=False,
token_type_ids=None, #[batch_size,seq_length]
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]
output = input_tensor
  if use_token_type:  # Segment Embeddings part
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
flat_token_type_ids = tf.reshape(token_type_ids, [-1]) #[batch_size*seq_length]
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)  # [batch_size*seq_length, 2]; token_type is only 0 or 1
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) #[batch_size*seq_length,embedding_size]
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width]) #[batch_size, seq_length, width=embedding_size]
output += token_type_embeddings #[batch_size, seq_length, embedding_size]
  if use_position_embeddings:  # Position Embeddings part
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)  # make sure seq_length <= max_position_embeddings
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
#
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
position_embeddings = tf.slice(full_position_embeddings, [0, 0], #[seq_length,embedding_size]
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width]) #[1,seq_length,embedding_size]
position_embeddings = tf.reshape(position_embeddings, #[1,seq_length,embedding_size]
position_broadcast_shape)
      output += position_embeddings  # [batch_size, seq_length, embedding_size] plus [1, seq_length, embedding_size]
      # The position embedding at a given position is identical for every example in the batch,
      # so this broadcast is equivalent to adding the same position_embeddings to each of the batch_size outputs.
output = layer_norm_and_dropout(output, dropout_prob)
return output
Transformer
After the embeddings, an attention_mask is constructed first. It expands the original input_mask from [batch_size, seq_length] to [batch_size, from_seq_length, to_seq_length], so that every from position has its own copy of the input_mask. Both are then passed into the transformer model.
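Roughly, create_attention_mask_from_input_mask does this with a broadcast (a simplified sketch with a toy mask, not a verbatim copy of the function):
import tensorflow as tf

batch_size, seq_length = 1, 4
input_mask = tf.constant([[1, 1, 1, 0]])                                       # last position is padding
to_mask = tf.cast(tf.reshape(input_mask, [batch_size, 1, seq_length]), tf.float32)
broadcast_ones = tf.ones([batch_size, seq_length, 1], dtype=tf.float32)
attention_mask = broadcast_ones * to_mask   # [1, 4, 4]: every "from" row is [1, 1, 1, 0]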
The overall architecture of the Transformer is shown in the figure.
Now let's look at transformer_model. It first applies multi-head attention to the embeddings, adds a residual connection and layer_norm, then passes the result through the feed-forward layers, followed by another residual connection and layer_norm.
One point worth noting: after multi-head attention, the code first applies a fully connected layer and only then the residual and layer_norm. That dense layer corresponds to the output projection that the original paper folds into its multi-head attention formulation. The code follows, with comments on the key parts.
def transformer_model(input_tensor,
                      attention_mask=None,  # [batch_size, from_seq_length, to_seq_length]
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
intermediate_act_fn=gelu,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = int(hidden_size / num_attention_heads)
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
if input_width != hidden_size:
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)  # per the official comment, flatten to 2-D up front to avoid repeated 3-D <-> 2-D reshapes, and restore 3-D at the end -> [batch_size*seq_length, hidden_size]
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
with tf.variable_scope("layer_%d" % layer_idx):
layer_input = prev_output
with tf.variable_scope("attention"):
attention_heads = []
with tf.variable_scope("self"):
          attention_head = attention_layer(  # self-attention, i.e. multi-head attention
from_tensor=layer_input, #[batch_size*seq_length,hidden_size]
to_tensor=layer_input, #[batch_size*seq_length,hidden_size]
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)
attention_output = None
if len(attention_heads) == 1:
attention_output = attention_heads[0]
else:
# In the case where we have other sequences, we just concatenate
# them to the self-attention head before the projection.
attention_output = tf.concat(attention_heads, axis=-1)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
          attention_output = tf.layers.dense(  # run the attention output through a fully connected layer
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)  # residual connection + layer_norm
      # Feed-forward: first project up to intermediate_size, then project back down to hidden_size
# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(  # up-projection
attention_output,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))
# Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):  # down-projection
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)  # residual connection + layer_norm
        prev_output = layer_output  # this layer's output becomes the next layer's input
        all_layer_outputs.append(layer_output)  # collect every layer's output
if do_return_all_layers:
final_outputs = []
for layer_output in all_layer_outputs:
final_output = reshape_from_matrix(layer_output, input_shape)
final_outputs.append(final_output)
return final_outputs
else:
final_output = reshape_from_matrix(prev_output, input_shape)
return final_output
self_attention
Next, the self_attention mechanism. It uses multiplicative (dot-product) attention: the sequence attends to itself, so every token picks up global semantic information. It also uses multi-head attention, i.e. hidden_size is split evenly into several heads, each head runs its own self-attention, and different heads learn the semantics of different subspaces.
The code follows, with comments on the key parts. The query, key and value are first reshaped to [batch_size, num_heads, seq_length, size_per_head]; the heads then go through scaled dot-product attention, and after the softmax the probabilities are multiplied by value. The result is returned as a tensor of shape [batch_size*seq_length, hidden_size] (when do_return_2d_tensor is True).
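As a warm-up, a tiny NumPy sketch (shapes and values are made up) of the per-head computation that attention_layer performs: scaled dot product, softmax, weighted sum of the values:
import numpy as np

seq_length, size_per_head = 4, 8
q = np.random.randn(seq_length, size_per_head)
k = np.random.randn(seq_length, size_per_head)
v = np.random.randn(seq_length, size_per_head)

scores = q @ k.T / np.sqrt(size_per_head)                  # [seq_length, seq_length]
scores -= scores.max(axis=-1, keepdims=True)               # stabilize the softmax
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = probs @ v                                        # [seq_length, size_per_head]
# attention_layer does this for all heads and batches at once via the [B, N, F/T, H] transposes below.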
def attention_layer(from_tensor,  # from_tensor and to_tensor are both the input embeddings [batch_size*seq_length, hidden_size]
to_tensor,
                    attention_mask=None,  # [batch_size, from_seq_length, to_seq_length]
num_attention_heads=1,
size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
This is an implementation of multi-headed attention based on "Attention
is all you Need". If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-width vector.
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
  In practice, the multi-headed attention is done with transposes and
reshapes rather than actual separate tensors.
Args:
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
size_per_head: int. Size of each attention head.
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
initializer_range: float. Range of the weight initializer.
do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
* from_seq_length, num_attention_heads * size_per_head]. If False, the
output will be of shape [batch_size, from_seq_length, num_attention_heads
* size_per_head].
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
to_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `to_tensor`.
Returns:
float Tensor of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
true, this will be of shape [batch_size * from_seq_length,
num_attention_heads * size_per_head]).
Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""
def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
seq_length, width):
output_tensor = tf.reshape(
input_tensor, [batch_size, seq_length, num_attention_heads, width])
output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
return output_tensor
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")
if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")
# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`
from_tensor_2d = reshape_to_matrix(from_tensor) #[batch_size*seq_length,hidden_size]
to_tensor_2d = reshape_to_matrix(to_tensor) #[batch_size*seq_length,hidden_size]
  # First project the inputs into query, key and value with dense layers; the activation is None because, as in the original paper, these are plain linear projections.
# `query_layer` = [B*F, N*H]
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act, #None
name="query",
      kernel_initializer=create_initializer(initializer_range))  # [batch_size*seq_length, hidden_size], where hidden_size = num_attention_heads * size_per_head
# `key_layer` = [B*T, N*H]
key_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=key_act, #None
name="key",
kernel_initializer=create_initializer(initializer_range))
# `value_layer` = [B*T, N*H]
value_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=value_act, #None
name="value",
kernel_initializer=create_initializer(initializer_range))
  # reshape to 4-D for the attention matrix multiplications
# `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,  # move num_attention_heads to the second dimension: each batch has N heads, each head covers F tokens, each token is an H-dimensional vector; different heads learn features of different subspaces
num_attention_heads, from_seq_length,
size_per_head)
# `key_layer` = [B, N, T, H]
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
to_seq_length, size_per_head)
# Take the dot product between "query" and "key" to get the raw
  # attention scores (multiplicative attention).
# `attention_scores` = [B, N, F, T]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
attention_scores = tf.multiply(attention_scores,
1.0 / math.sqrt(float(size_per_head)))
if attention_mask is not None:
# `attention_mask` = [B, 1, F, T]
attention_mask = tf.expand_dims(attention_mask, axis=[1])
    # This turns the padded positions at the end of each sequence into a large negative value, while positions holding real tokens get 0.
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
    # After the addition, positions with real tokens are unchanged (they add 0) and padded positions become a very negative number.
attention_scores += adder
# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
value_layer,
[batch_size, to_seq_length, num_attention_heads, size_per_head])
# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
# `context_layer` = [B, N, F, H]
  # multiply the attention probability matrix by value
context_layer = tf.matmul(attention_probs, value_layer)
# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
if do_return_2d_tensor:
    # return a 2-D result
# `context_layer` = [B*F, N*V]
context_layer = tf.reshape(
context_layer,
[batch_size * from_seq_length, num_attention_heads * size_per_head])
else:
# `context_layer` = [B, F, N*V]
context_layer = tf.reshape(
context_layer,
[batch_size, from_seq_length, num_attention_heads * size_per_head])
return context_layer
Model Usage
How do you use the model? The BertModel class provides two getters. get_pooled_output returns the encoding of the first token ([CLS]) of each example in the batch; BERT treats this token as a summary of the whole sequence, so it is suited to sentence-level classification tasks. get_sequence_output returns BERT's final encoder output, with shape [batch_size, seq_length, hidden_size]; intuitively, it is the final representation of every token in every example, suited to token-level / seq2seq-style tasks.
  def get_pooled_output(self):
    return self.pooled_output  # [batch_size, hidden_size]
def get_sequence_output(self):
"""Gets final hidden layer of encoder.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden of the transformer encoder.
"""
return self.sequence_output
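A hypothetical end-to-end usage sketch (the placeholder names and the sequence length of 128 are illustrative assumptions, not from the original post):
import tensorflow as tf
import modeling

bert_config = modeling.BertConfig.from_json_file("bert_config.json")

input_ids = tf.placeholder(tf.int32, shape=[None, 128], name="input_ids")
input_mask = tf.placeholder(tf.int32, shape=[None, 128], name="input_mask")
segment_ids = tf.placeholder(tf.int32, shape=[None, 128], name="segment_ids")

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)  # False is faster on CPU/GPU

pooled_output = model.get_pooled_output()       # [batch_size, hidden_size], for sentence-level tasks
sequence_output = model.get_sequence_output()   # [batch_size, seq_length, hidden_size], for token-level tasks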
The next post covers the training process. Two things have suddenly come up that I need to deal with, so it may be delayed for a few days.
---------------------
Author: 保持一份率性
Source: CSDN
Original: https://blog.csdn.net/weixin_39470744/article/details/84401339
Copyright notice: this is an original article by the blogger; please include a link to the original when reposting.