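Here we implement a Transformer model that translates Portuguese into English. The core idea of the Transformer is self-attention: the model builds a representation of a sentence by attending to different positions of the input sequence.
Some advantages of the Transformer:
It makes no assumptions about the temporal or spatial relationships in the data
Its computations can be parallelized
Distant tokens can influence each other directly, without passing through many time steps or many convolutional layers
It can learn long-range dependencies
Disadvantages of the Transformer:
For a time series, the output at each step is computed from the entire history rather than from only the current state and input, which may be less efficient
To capture temporal or spatial position information, an extra positional encoding is needed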
from __future__ import absolute_import, division, print_function, unicode_literals
# Install tfds first: pip install tfds-nightly==1.0.2.dev201904090105
import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow.keras.layers as layers
import time
import numpy as np
import matplotlib.pyplot as plt
print(tf.__version__)
2.0.0-alpha0
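1. Data input pipeline
We will use the Portuguese-English translation dataset.
The dataset contains roughly 50,000 training examples, 1,100 validation examples, and 2,000 test examples.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
Convert the data to subword tokens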
train_examples, val_examples = examples['train'], examples['validation']
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)
Tokenizer round-trip test
sample_str = 'hello world, tensorflow 2'
tokenized_str = tokenizer_en.encode(sample_str)
print(tokenized_str)
original_str = tokenizer_en.decode(tokenized_str)
print(original_str)
[3222, 439, 150, 7345, 1378, 2824, 2370, 7881]
hello world, tensorflow 2
Add start and end tokens
def encode(lang1, lang2):
    # start token id = vocab_size, end token id = vocab_size + 1
    lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
        lang1.numpy()) + [tokenizer_pt.vocab_size+1]
    lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
        lang2.numpy()) + [tokenizer_en.vocab_size+1]
    return lang1, lang2
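The start/end convention used by encode can be illustrated without the tokenizer. The following is a minimal sketch with a hypothetical toy vocabulary size of 10 (the real value comes from tokenizer.vocab_size); add_start_end is an illustrative helper, not part of the original code.

```python
# Hypothetical toy vocabulary size; the real value is tokenizer.vocab_size.
VOCAB_SIZE = 10

def add_start_end(token_ids, vocab_size=VOCAB_SIZE):
    """Wrap subword ids with a start id (vocab_size) and an end id (vocab_size + 1)."""
    return [vocab_size] + list(token_ids) + [vocab_size + 1]

wrapped = add_start_end([3, 7, 2])
print(wrapped)  # [10, 3, 7, 2, 11]
```

Because the two extra ids lie just past the tokenizer's own range, the embedding layers later need vocab_size + 2 entries.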
Filter out examples longer than 40 tokens
MAX_LENGTH = 40
def filter_long_sent(x, y, max_length=MAX_LENGTH):
    return tf.logical_and(tf.size(x) <= max_length,
                          tf.size(y) <= max_length)
Wrap the Python function as a TensorFlow op
def tf_encode(pt, en):
    return tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
Build the datasets
BUFFER_SIZE = 20000
BATCH_SIZE = 64
# Use .map() to run the encoding op on every example
train_dataset = train_examples.map(tf_encode)
# Drop examples that are too long
train_dataset = train_dataset.filter(filter_long_sent)
# Cache the dataset to speed up reading
train_dataset = train_dataset.cache()
# Shuffle, then batch; every sequence is padded to the maximum length of 40
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(
    BATCH_SIZE, padded_shapes=([40], [40]))
# Prefetch data
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
# Validation data
val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_long_sent).padded_batch(
    BATCH_SIZE, padded_shapes=([40], [40]))
pt_batch, en_batch = next(iter(train_dataset))
pt_batch, en_batch
(
array([[8214, 116, 84, ..., 0, 0, 0],
[8214, 7, 261, ..., 0, 0, 0],
[8214, 155, 39, ..., 0, 0, 0],
...,
[8214, 639, 590, ..., 0, 0, 0],
[8214, 204, 3441, ..., 0, 0, 0],
[8214, 27, 13, ..., 0, 0, 0]]),
array([[8087, 83, 145, ..., 0, 0, 0],
[8087, 4670, 1783, ..., 0, 0, 0],
[8087, 169, 56, ..., 0, 0, 0],
...,
[8087, 174, 79, ..., 0, 0, 0],
[8087, 11, 16, ..., 0, 0, 0],
[8087, 4, 12, ..., 0, 0, 0]]))
2. Positional encoding
The positional encoding vector is added to the word embedding. After this addition, embeddings of tokens at similar positions move closer together; the word embedding on its own does not encode where a token sits in the sentence.
The angle-based positional encoding is defined as:
$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i / d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i / d_{model}}\right)$$
def get_angles(pos, i, d_model):
    # i is the dimension index; i // 2 maps both 2i and 2i+1 of the formula to the same frequency
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates
def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # even indices (2i) use sin
    sines = np.sin(angle_rads[:, 0::2])
    # odd indices (2i+1) use cos
    cosines = np.cos(angle_rads[:, 1::2])
    # the sin and cos halves are concatenated rather than interleaved
    pos_encoding = np.concatenate([sines, cosines], axis=-1)
    pos_encoding = pos_encoding[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
Compute the positional encoding
pos_encoding = positional_encoding(50, 512)
print(pos_encoding.shape)
plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('Depth')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()  # the left half holds the original 2i (sin) features, the right half the 2i+1 (cos) features
(1, 50, 512)
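The formulas place sin on even columns and cos on odd columns, while positional_encoding concatenates all sin columns followed by all cos columns. The NumPy sketch below (function names are illustrative, and a small d_model of 8 is assumed for readability) suggests that the two layouts hold the same values in a different column order.

```python
import numpy as np

def angles(position, d_model):
    # Same angle computation as get_angles, written for a full (position, d_model) grid.
    pos = np.arange(position)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    return pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))

def pe_concatenated(position, d_model):
    # Layout used in this article: [all sin columns | all cos columns].
    a = angles(position, d_model)
    return np.concatenate([np.sin(a[:, 0::2]), np.cos(a[:, 1::2])], axis=-1)

def pe_interleaved(position, d_model):
    # Layout written in the formulas: sin on even columns, cos on odd columns.
    a = angles(position, d_model)
    pe = np.zeros_like(a)
    pe[:, 0::2] = np.sin(a[:, 0::2])
    pe[:, 1::2] = np.cos(a[:, 1::2])
    return pe

concat = pe_concatenated(50, 8)
inter = pe_interleaved(50, 8)
same_sin = np.allclose(concat[:, :4], inter[:, 0::2])
same_cos = np.allclose(concat[:, 4:], inter[:, 1::2])
print(same_sin, same_cos)
```

Since the embedding dimensions carry no inherent order before training, either layout should work as long as it is used consistently.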
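3. Masking
To keep padding tokens in the input from affecting the meaning of the sentence, padded positions must be masked out: positions whose token id is 0 (padding) get a mask value of 1, all other positions get 0.
def create_padding_mark(seq):
    # 1.0 where the token id is 0 (padding), 0.0 elsewhere
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add extra dimensions so the mask broadcasts against the attention logits
    return seq[:, np.newaxis, np.newaxis, :]  # (batch_size, 1, 1, seq_len)
# mask test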
create_padding_mark([[1,2,0,0,3],[3,4,5,0,0]])
array([[[[0., 0., 1., 1., 0.]]],
[[[0., 0., 0., 1., 1.]]]], dtype=float32)
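The two singleton axes exist so that the mask broadcasts over heads and query positions. A small NumPy sketch, assuming 2 heads and all-zero logits purely for illustration:

```python
import numpy as np

seq = np.array([[1, 2, 0, 0, 3],
                [3, 4, 5, 0, 0]])
# Same construction as create_padding_mark, in NumPy: shape (2, 1, 1, 5).
mask = (seq == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

num_heads, seq_len = 2, seq.shape[1]  # num_heads is an arbitrary choice here
logits = np.zeros((2, num_heads, seq_len, seq_len), dtype=np.float32)
# Broadcasting copies the mask across every head and every query position.
masked_logits = logits + mask * -1e9
print(masked_logits.shape)  # (2, 2, 5, 5)
```

Every key column that corresponds to a padded token receives a large negative logit, regardless of which head or query row is looking at it.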
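The look-ahead mask hides tokens that have not been predicted yet.
This means that to predict the third word, only the first and second words are used. To predict the fourth word, only the first, second, and third words are used, and so on.
def create_look_ahead_mark(size):
    # band_part(..., -1, 0) keeps the main diagonal and everything below it;
    # subtracting it from 1 leaves ones strictly above the diagonal,
    # i.e. the not-yet-predicted tokens at each step
    mark = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mark  # (seq_len, seq_len)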
temp = create_look_ahead_mark(3)
print(temp)
tf.Tensor(
[[0. 1. 1.]
[0. 0. 1.]
[0. 0. 0.]], shape=(3, 3), dtype=float32)
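The same matrix can be built in NumPy with np.triu, which may make the structure easier to see. This is an equivalent sketch, not the article's implementation; look_ahead_mask_np is an illustrative name.

```python
import numpy as np

def look_ahead_mask_np(size):
    # Ones strictly above the main diagonal mark future positions; zeros elsewhere.
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

m = look_ahead_mask_np(3)
print(m)
```

Row t of the mask belongs to the query at position t, and the ones in that row are the key positions after t that the query must not see.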
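4. Scaled dot-product attention
The attention computation takes three inputs: Q (query), K (key), and V (value). It is computed as:
$$Attention(Q, K, V) = softmax_k\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
The dot-product attention is scaled by the square root of the depth d_k. A larger depth makes the dot products larger in magnitude, which pushes the softmax into regions where its gradients are very small.
For example, suppose the entries of Q and K have mean 0 and variance 1. Each entry of their matrix product then has mean 0 and variance d_k. Dividing by the square root of d_k (rather than any other number) brings the product back to mean 0 and variance 1.
Here the mask is multiplied by -1e9 (standing in for negative infinity) and added to the scaled logits, so masked tokens receive a weight of approximately 0 after the softmax and do not affect the other tokens.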
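The variance argument can be checked numerically. The following is a minimal NumPy sketch; d_k = 64, the sample count, and the random seed are arbitrary assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 10000 independent query/key pairs with zero-mean, unit-variance entries.
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))

dots = np.sum(q * k, axis=-1)            # one dot product per pair
raw_var = dots.var()                     # expected to be close to d_k
scaled_var = (dots / np.sqrt(d_k)).var() # expected to be close to 1
print(raw_var, scaled_var)
```

The unscaled variance grows with d_k, while the scaled one stays near 1, which keeps the softmax inputs in a range where gradients remain usable.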
def scaled_dot_product_attention(q, k, v, mask):
    # multiply query and key to get the matching scores
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    # scale by sqrt(dk)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    # apply the mask: masked positions get a large negative logit
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # softmax over the key axis gives the attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    # weight the values by the attention weights
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights
Use attention to pick out the values to focus on
def print_out(q, k, v):
    temp_out, temp_att = scaled_dot_product_attention(
        q, k, v, None)
    print('attention weight:')
    print(temp_att)
    print('output:')
    print(temp_out)
Attention test
# print plain decimal values instead of scientific notation
np.set_printoptions(suppress=True)
temp_k = tf.constant([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]], dtype=tf.float32)  # (4, 3)
temp_v = tf.constant([[   1,0],
                      [  10,0],
                      [ 100,5],
                      [1000,6]], dtype=tf.float32)  # (4, 2)
# The query matches the 2nd key, so the 2nd value is returned
temp_q = tf.constant([[0,10,0]], dtype=tf.float32)
print_out(temp_q, temp_k, temp_v)
attention weight:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
output:
tf.Tensor([[10. 0.]], shape=(1, 2), dtype=float32)
# The query matches the repeated key (3rd and 4th), so their values are averaged
temp_q = tf.constant([[0,0,10]], dtype=tf.float32)
print_out(temp_q, temp_k, temp_v)
attention weight:
tf.Tensor([[0. 0. 0.5 0.5]], shape=(1, 4), dtype=float32)
output:
tf.Tensor([[550. 5.5]], shape=(1, 2), dtype=float32)
# The query matches the 1st and 2nd keys equally, so their values are averaged
temp_q = tf.constant([[10,10,0]], dtype=tf.float32)
print_out(temp_q, temp_k, temp_v)
attention weight:
tf.Tensor([[0.5 0.5 0. 0. ]], shape=(1, 4), dtype=float32)
output:
tf.Tensor([[5.5 0. ]], shape=(1, 2), dtype=float32)
# Pass all the queries together
temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32) # (3, 3)
print_out(temp_q, temp_k, temp_v)
attention weight:
tf.Tensor(
[[0. 0. 0.5 0.5]
[0. 1. 0. 0. ]
[0.5 0.5 0. 0. ]], shape=(3, 4), dtype=float32)
output:
tf.Tensor(
[[550. 5.5]
[ 10. 0. ]
[ 5.5 0. ]], shape=(3, 2), dtype=float32)
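The tests above all pass mask=None. To see what the mask changes, here is a NumPy re-implementation sketch of the same computation (attention_np and softmax are illustrative helpers, not the article's TensorFlow code) that hides the 4th key from the query [0, 0, 10].

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_np(q, k, v, mask=None):
    logits = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9  # masked keys get a huge negative logit
    weights = softmax(logits)
    return weights @ v, weights

k = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10], [0, 0, 10]], dtype=np.float32)
v = np.array([[1, 0], [10, 0], [100, 5], [1000, 6]], dtype=np.float32)
q = np.array([[0, 0, 10]], dtype=np.float32)
mask = np.array([[0, 0, 0, 1]], dtype=np.float32)  # hide the 4th key

out, w = attention_np(q, k, v, mask)
print(w)
print(out)
```

Without the mask this query splits its weight between the 3rd and 4th keys; with the 4th key hidden, essentially all of the weight moves to the 3rd key and the output is close to the 3rd value, [100, 5].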
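5. Multi-head attention
Multi-head attention consists of four parts:
Linear layers and splitting into heads
Scaled dot-product attention
Concatenation of heads
Final linear layer
Each multi-head attention block takes three inputs: Q (query), K (key), and V (value). They pass through the first linear layers and are split into multiple heads.
Note: the dot-product attention step needs the mask, and the per-head outputs need tf.transpose to rearrange their dimensions before they are merged.
Rather than a single attention head, Q, K, and V are split into multiple heads, because this lets the model jointly attend to information from different representation subspaces. After the split each head has a reduced dimensionality, so the total computational cost is the same as that of a single head with full dimensionality.
# Build the multi-head attention layer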
class MutilHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MutilHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        # d_model must be evenly divisible by the number of heads
        assert d_model % num_heads == 0
        # per-head dimension after the split
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)
    def split_heads(self, x, batch_size):
        # split the last dimension into heads and move the head axis in front of seq_len
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        # linear projections of q, k, v before splitting into heads
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)
        v = self.wv(v)
        # split into heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        # scaled dot-product attention
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)
        # move the head axis back behind seq_len
        scaled_attention = tf.transpose(scaled_attention, [0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        # merge the heads
        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))
        # final linear layer
        output = self.dense(concat_attention)
        return output, attention_weights
Testing multi-head attention
temp_mha = MutilHeadAttention(d_model=512, num_heads=8)
y = tf.random.uniform((1, 60, 512))
output, att = temp_mha(y, k=y, q=y, mask=None)
print(output.shape, att.shape)
(1, 60, 512) (1, 8, 60, 60)
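The reshape/transpose pair in split_heads, and its inverse in call, can be followed on a tiny array. The sketch below uses NumPy and arbitrary small sizes (batch 2, sequence length 5, d_model 8, 4 heads) chosen only for illustration.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads
x = np.arange(batch * seq_len * d_model, dtype=np.float32).reshape(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: move the head axis back and flatten it into d_model again.
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

print(heads.shape, merged.shape)
```

Head h of a token simply owns the slice [h * depth, (h + 1) * depth) of that token's d_model-dimensional vector, and merging restores the original layout exactly.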
Point-wise feed-forward network
def point_wise_feed_forward_network(d_model, diff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(diff, activation='relu'),  # (batch_size, seq_len, diff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])
sample_fnn = point_wise_feed_forward_network(512, 2048)
sample_fnn(tf.random.uniform((64, 50, 512))).shape
TensorShape([64, 50, 512])
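6. Encoder and decoder
The input passes through N encoder layers, which produce an output for every word/token in the sequence.
The decoder attends to the encoder output and to its own input (self-attention) to predict the next word.
Encoder layer
Each encoder layer contains the following sublayers:
Multi-head attention (with padding mask)
Point-wise feed-forward network
Each sublayer has a residual connection around it, followed by a layer normalization. Residual connections help avoid the vanishing-gradient problem in deep networks.
The output of each sublayer is LayerNorm(x + Sublayer(x)), where the normalization is applied over the d_model-dimensional vector (the last axis). The Transformer has N encoder layers in total.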
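The pattern LayerNorm(x + Sublayer(x)) can be sketched in a few lines of NumPy. This version omits the learnable gamma and beta parameters, and the lambda used as the sublayer is a hypothetical stand-in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each d_model-dimensional vector (last axis) to zero mean, unit std.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_with_residual(x, sublayer):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 16))                  # (batch, seq_len, d_model)
out = sublayer_with_residual(x, lambda t: 0.5 * t)   # stand-in sublayer
print(out.shape)
```

Because the statistics are taken over the last axis only, every token is normalized independently of the other tokens and of the other examples in the batch.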
class LayerNormalization(tf.keras.layers.Layer):
    def __init__(self, epsilon=1e-6, **kwargs):
        self.eps = epsilon
        super(LayerNormalization, self).__init__(**kwargs)
    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=tf.ones_initializer(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=tf.zeros_initializer(), trainable=True)
        super(LayerNormalization, self).build(input_shape)
    def call(self, x):
        mean = tf.keras.backend.mean(x, axis=-1, keepdims=True)
        std = tf.keras.backend.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
    def compute_output_shape(self, input_shape):
        return input_shape
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, n_heads, ddf, dropout_rate=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MutilHeadAttention(d_model, n_heads)
        self.ffn = point_wise_feed_forward_network(d_model, ddf)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
    def call(self, inputs, training, mask):
        # multi-head self-attention
        att_output, _ = self.mha(inputs, inputs, inputs, mask)
        att_output = self.dropout1(att_output, training=training)
        out1 = self.layernorm1(inputs + att_output)  # (batch_size, input_seq_len, d_model)
        # feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)
        return out2
Encoder layer test
sample_encoder_layer = EncoderLayer(512, 8, 2048)
sample_encoder_layer_output = sample_encoder_layer(
    tf.random.uniform((64, 43, 512)), False, None)
sample_encoder_layer_output.shape
TensorShape([64, 43, 512])
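Decoder layer
Each decoder layer contains the following sublayers:
Masked multi-head attention (with padding mask and look-ahead mask)
Multi-head attention (with padding mask): value and key come from the encoder output, query comes from the output of the masked multi-head attention sublayer
Point-wise feed-forward network
Each sublayer has a residual connection around it, followed by a layer normalization. Residual connections help avoid the vanishing-gradient problem in deep networks.
The output of each sublayer is LayerNorm(x + Sublayer(x)), where the normalization is applied over the d_model-dimensional vector (the last axis). The Transformer has N decoder layers in total.
When Q receives the output of the decoder's first attention block and K receives the encoder output, the attention weights express how much importance each decoder position gives to each encoder output position. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output.
Note: because padding sits at the end of each sequence, the look-ahead mask also hides padded positions from the tokens before them. During training the look-ahead mask and the target padding mask are combined with tf.maximum in create_mask.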
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, drop_rate=0.1):
        super(DecoderLayer, self).__init__()
        self.mha1 = MutilHeadAttention(d_model, num_heads)
        self.mha2 = MutilHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(drop_rate)
        self.dropout2 = layers.Dropout(drop_rate)
        self.dropout3 = layers.Dropout(drop_rate)
    def call(self, inputs, encode_out, training,
             look_ahead_mask, padding_mask):
        # masked multi-head self-attention
        att1, att_weight1 = self.mha1(inputs, inputs, inputs, look_ahead_mask)
        att1 = self.dropout1(att1, training=training)
        out1 = self.layernorm1(inputs + att1)
        # multi-head attention over the encoder output:
        # value and key come from the encoder, query is out1 from the sublayer above
        att2, att_weight2 = self.mha2(encode_out, encode_out, out1, padding_mask)
        att2 = self.dropout2(att2, training=training)
        out2 = self.layernorm2(out1 + att2)
        # feed-forward network
        ffn_out = self.ffn(out2)
        ffn_out = self.dropout3(ffn_out, training=training)
        out3 = self.layernorm3(out2 + ffn_out)
        return out3, att_weight1, att_weight2
Decoder layer test
sample_decoder_layer = DecoderLayer(512, 8, 2048)
sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
    False, None, None)
sample_decoder_layer_output.shape
TensorShape([64, 50, 512])
Encoder
The encoder consists of:
Input embedding
Positional encoding
N encoder layers
class Encoder(layers.Layer):
    def __init__(self, n_layers, d_model, n_heads, ddf,
                 input_vocab_size, max_seq_len, drop_rate=0.1):
        super(Encoder, self).__init__()
        self.n_layers = n_layers
        self.d_model = d_model
        self.embedding = layers.Embedding(input_vocab_size, d_model)
        self.pos_embedding = positional_encoding(max_seq_len, d_model)
        self.encode_layer = [EncoderLayer(d_model, n_heads, ddf, drop_rate)
                             for _ in range(n_layers)]
        self.dropout = layers.Dropout(drop_rate)
    def call(self, inputs, training, mark):
        seq_len = inputs.shape[1]
        word_emb = self.embedding(inputs)
        # scale the embeddings by sqrt(d_model) before adding the positional encoding
        word_emb *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        emb = word_emb + self.pos_embedding[:, :seq_len, :]
        x = self.dropout(emb, training=training)
        # stack the encoder layers
        for i in range(self.n_layers):
            x = self.encode_layer[i](x, training, mark)
        return x
Encoder test
sample_encoder = Encoder(2, 512, 8, 1024, 5000, 200)
sample_encoder_output = sample_encoder(tf.random.uniform((64, 120)),
                                       False, None)
sample_encoder_output.shape
TensorShape([64, 120, 512])
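Decoder
The decoder consists of: 1. the output embedding; 2. the positional encoding; 3. N decoder layers
The output embedding and the positional encoding are summed and fed into the decoder layers; the decoder's final output is then passed to a fully connected layer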
class Decoder(layers.Layer):
    def __init__(self, n_layers, d_model, n_heads, ddf,
                 target_vocab_size, max_seq_len, drop_rate=0.1):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.n_layers = n_layers
        self.embedding = layers.Embedding(target_vocab_size, d_model)
        self.pos_embedding = positional_encoding(max_seq_len, d_model)
        self.decoder_layers = [DecoderLayer(d_model, n_heads, ddf, drop_rate)
                               for _ in range(n_layers)]
        self.dropout = layers.Dropout(drop_rate)
    def call(self, inputs, encoder_out, training,
             look_ahead_mark, padding_mark):
        seq_len = tf.shape(inputs)[1]
        attention_weights = {}
        h = self.embedding(inputs)
        # scale the embeddings by sqrt(d_model) before adding the positional encoding
        h *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        h += self.pos_embedding[:, :seq_len, :]
        h = self.dropout(h, training=training)
        # stack the decoder layers and collect their attention weights
        for i in range(self.n_layers):
            h, att_w1, att_w2 = self.decoder_layers[i](h, encoder_out,
                                                       training, look_ahead_mark,
                                                       padding_mark)
            attention_weights['decoder_layer{}_att_w1'.format(i+1)] = att_w1
            attention_weights['decoder_layer{}_att_w2'.format(i+1)] = att_w2
        return h, attention_weights
Decoder test
sample_decoder = Decoder(2, 512, 8, 1024, 5000, 200)
sample_decoder_output, attn = sample_decoder(tf.random.uniform((64, 100)),
                                             sample_encoder_output, False,
                                             None, None)
sample_decoder_output.shape, attn['decoder_layer1_att_w1'].shape
(TensorShape([64, 100, 512]), TensorShape([64, 8, 100, 100]))
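Building the Transformer
The Transformer consists of the encoder, the decoder, and a final linear layer. The decoder output passes through the linear layer to produce the Transformer's output.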
class Transformer(tf.keras.Model):
    def __init__(self, n_layers, d_model, n_heads, diff,
                 input_vocab_size, target_vocab_size,
                 max_seq_len, drop_rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(n_layers, d_model, n_heads, diff,
                               input_vocab_size, max_seq_len, drop_rate)
        self.decoder = Decoder(n_layers, d_model, n_heads, diff,
                               target_vocab_size, max_seq_len, drop_rate)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)
    def call(self, inputs, targets, training, encode_padding_mask,
             look_ahead_mask, decode_padding_mask):
        encode_out = self.encoder(inputs, training, encode_padding_mask)
        # debug output: shape of the encoder output
        print(encode_out.shape)
        decode_out, att_weights = self.decoder(targets, encode_out, training,
                                               look_ahead_mask, decode_padding_mask)
        # debug output: shape of the decoder output
        print(decode_out.shape)
        final_out = self.final_layer(decode_out)
        return final_out, att_weights
Transformer test
sample_transformer = Transformer(
    n_layers=2, d_model=512, n_heads=8, diff=1024,
    input_vocab_size=8500, target_vocab_size=8000, max_seq_len=120
)
temp_input = tf.random.uniform((64, 62))
temp_target = tf.random.uniform((64, 26))
fn_out, _ = sample_transformer(temp_input, temp_target, training=False,
                               encode_padding_mask=None,
                               look_ahead_mask=None,
                               decode_padding_mask=None,
                               )
fn_out.shape
(64, 62, 512)
(64, 26, 512)
TensorShape([64, 26, 8000])
7. Experiment setup
Hyperparameters
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
input_vocab_size = tokenizer_pt.vocab_size + 2
target_vocab_size = tokenizer_en.vocab_size + 2
max_seq_len = 40
dropout_rate = 0.1
Optimizer
Adam optimizer with a custom learning-rate schedule:
$$lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5})$$
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
    def __call__(self, step):
        # the optimizer may pass an integer step; rsqrt needs a float
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
learing_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learing_rate, beta_1=0.9,
                                     beta_2=0.98, epsilon=1e-9)
temp_learing_rate = CustomSchedule(d_model)
plt.plot(temp_learing_rate(tf.range(40000, dtype=tf.float32)))
plt.xlabel('train step')
plt.ylabel('learning rate')
Text(0, 0.5, 'learning rate')
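The shape of the schedule can also be checked without TensorFlow. The plain-Python sketch below evaluates the same formula with d_model = 128 and warmup_steps = 4000 (the values used in this article); lrate is an illustrative helper name.

```python
def lrate(step, d_model=128, warmup_steps=4000):
    # Linear warm-up for the first warmup_steps steps, then decay proportional to step ** -0.5.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Step with the highest learning rate in the plotted range.
peak_step = max(range(1, 40000), key=lrate)
print(peak_step, lrate(peak_step))
```

The two arguments of min are equal exactly at step == warmup_steps, so the rate rises linearly until step 4000, peaks at roughly 1.4e-3 for these settings, and decays afterwards.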
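Loss and metrics
Because the target sequences are padded, it is important to apply a padding mask when computing the loss.
The mask is 0 at padded positions and 1 at non-padded positions.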
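loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')
def loss_fun(y_ture, y_pred):
    # True (1 after the cast) for real tokens, False (0) for padding
    mask = tf.math.logical_not(tf.math.equal(y_ture, 0))
    loss_ = loss_object(y_ture, y_pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    # zero out the loss at padded positions
    loss_ *= mask
    # note: the mean is taken over all positions, padded ones included
    return tf.reduce_mean(loss_)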
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
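loss_fun zeroes the loss at padded positions and then averages over every position, padded or not. The NumPy sketch below, using made-up per-token losses, contrasts that with dividing by the number of real tokens. The second variant is shown only for comparison and is not what the code above does.

```python
import numpy as np

per_token_loss = np.array([[2.0, 1.0, 3.0, 0.5]])  # hypothetical per-token losses
y_true = np.array([[5, 9, 0, 0]])                  # 0 marks padding

mask = (y_true != 0).astype(per_token_loss.dtype)  # 1 for real tokens, 0 for padding
masked = per_token_loss * mask

mean_all_positions = masked.mean()             # like tf.reduce_mean(loss_): 3.0 / 4
mean_real_tokens = masked.sum() / mask.sum()   # alternative: 3.0 / 2
print(mean_all_positions, mean_real_tokens)  # 0.75 1.5
```

With heavy padding the first form reports a smaller number for the same per-token error, which is worth keeping in mind when comparing loss values across batch layouts.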
8. Training and saving the model
transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, target_vocab_size,
                          max_seq_len, dropout_rate)
# Build the masks
def create_mask(inputs, targets):
    encode_padding_mask = create_padding_mark(inputs)
    # used in the decoder's second attention sublayer to mask the encoder output
    decode_padding_mask = create_padding_mark(inputs)
    # look-ahead mask: hides the words that have not been predicted yet
    look_ahead_mask = create_look_ahead_mark(tf.shape(targets)[1])
    # padding mask for the decoder's first attention sublayer
    decode_targets_padding_mask = create_padding_mark(targets)
    # combine both masks for the decoder's first attention sublayer
    combine_mask = tf.maximum(decode_targets_padding_mask, look_ahead_mask)
    return encode_padding_mask, combine_mask, decode_padding_mask
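How tf.maximum merges the two decoder masks can be seen on a single short sequence. This is a NumPy sketch with a hypothetical three-token target whose last position is padding.

```python
import numpy as np

targets = np.array([[7, 3, 0]])  # hypothetical ids; 0 is padding

padding = (targets == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]  # (1, 1, 1, 3)
look_ahead = np.triu(np.ones((3, 3), dtype=np.float32), k=1)               # (3, 3)

# Element-wise maximum after broadcasting: a position is masked if either mask hides it.
combined = np.maximum(padding, look_ahead)  # (1, 1, 3, 3)
print(combined[0, 0])
```

The last row shows the difference from the plain look-ahead mask: the final query may normally see every position, but the padded third key stays hidden.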
Create the checkpoint manager
checkpoint_path = './checkpoint/train'
ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)
# Checkpoint manager; keeps the 3 most recent checkpoints
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)
# Restore the latest checkpoint if one exists
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('latest checkpoint restored')
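The target is split into target_input and target_real.
target_input is passed to the decoder as input; target_real is the same sequence shifted left by one position, so at every position of target_input it holds the next token that should be predicted.
For example, for the sentence "SOS A lion in the jungle is sleeping EOS":
target_input = "SOS A lion in the jungle is sleeping"
target_real = "A lion in the jungle is sleeping EOS"
The Transformer is an autoregressive model: it makes predictions one part at a time and uses its output so far to decide what to do next.
During training, teacher forcing is used: regardless of what the model outputs at the current step, the correct output is passed to the next step.
At inference time, the next word is predicted from the model's own previous outputs.
To prevent the model from peeking at the expected output, the model uses a look-ahead mask.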
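The shift between target_input and target_real is just two slices of the same array. A NumPy sketch with hypothetical token ids (1 = SOS, 2 = EOS, 0 = padding; the real ids depend on the tokenizer):

```python
import numpy as np

# One padded target sequence: SOS, three word tokens, EOS, padding.
targets = np.array([[1, 11, 12, 13, 2, 0]])

tar_inp = targets[:, :-1]   # what the decoder sees
tar_real = targets[:, 1:]   # what the decoder should predict at each position

print(tar_inp)   # [[ 1 11 12 13  2]]
print(tar_real)  # [[11 12 13  2  0]]
```

Position j of tar_inp is paired with position j of tar_real, so the decoder reading SOS is trained to emit the first word, and so on; trailing padding in tar_real is what the loss mask later ignores.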
@tf.function
def train_step(inputs, targets):
    tar_inp = targets[:, :-1]
    tar_real = targets[:, 1:]
    # build the masks
    encode_padding_mask, combined_mask, decode_padding_mask = create_mask(inputs, tar_inp)
    with tf.GradientTape() as tape:
        predictions, _ = transformer(inputs, tar_inp,
                                     True,
                                     encode_padding_mask,
                                     combined_mask,
                                     decode_padding_mask)
        loss = loss_fun(tar_real, predictions)
    # compute the gradients
    gradients = tape.gradient(loss, transformer.trainable_variables)
    # apply the gradients to update the weights
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    # record loss and accuracy
    train_loss(loss)
    train_accuracy(tar_real, predictions)
Portuguese is the input language and English is the target language.
EPOCHS = 20
for epoch in range(EPOCHS):
    start = time.time()
    # reset the metrics
    train_loss.reset_states()
    train_accuracy.reset_states()
    # inputs: Portuguese, targets: English
    for batch, (inputs, targets) in enumerate(train_dataset):
        # one training step
        train_step(inputs, targets)
        if batch % 500 == 0:
            print('epoch {}, batch {}, loss:{:.4f}, acc:{:.4f}'.format(
                epoch+1, batch, train_loss.result(), train_accuracy.result()
            ))
    # save a checkpoint every 2 epochs
    if (epoch + 1) % 2 == 0:
        ckpt_save_path = ckpt_manager.save()
        print('epoch {}, save model at {}'.format(
            epoch+1, ckpt_save_path
        ))
    print('epoch {}, loss:{:.4f}, acc:{:.4f}'.format(
        epoch+1, train_loss.result(), train_accuracy.result()
    ))
    print('time in 1 epoch:{} secs\n'.format(time.time()-start))
(64, 40, 128)
(64, 39, 128)
(64, 40, 128)
(64, 39, 128)
epoch 1, batch 0, loss:4.0259, acc:0.0000
epoch 1, batch 500, loss:3.4436, acc:0.0340
(31, 40, 128)
(31, 39, 128)
epoch 1, loss:3.2112, acc:0.0481
time in 1 epoch:467.3876633644104 secs
…
epoch 20, batch 0, loss:0.5182, acc:0.3193
epoch 20, batch 500, loss:0.5374, acc:0.3263
epoch 20, save model at ./checkpoint/train/ckpt-10
epoch 20, loss:0.5344, acc:0.3257
time in 1 epoch:377.9467544555664 secs
