I have been working with deep learning for over a year now, and recently started R&D on the NLP side. This seemed like a good opportunity to write a hands-on course series on deep learning for NLP machine translation.
The series runs from theory and data processing through hands-on implementation and application deployment, and will include the following installments (updating):
- NLP Machine Translation Deep Learning in Practice · 0 (basic concepts)
- NLP Machine Translation Deep Learning in Practice · 1 (RNN base)
- NLP Machine Translation Deep Learning in Practice · 2 (RNN + Attention base)
- NLP Machine Translation Deep Learning in Practice · 3 (CNN base)
- NLP Machine Translation Deep Learning in Practice · 4 (Self-Attention base)
- NLP Machine Translation Deep Learning in Practice · 5 (application deployment)
This series draws on the blog at https://me.csdn.net/chinatelecom08
0. Project Background
In the previous article we briefly introduced NLP machine translation; this time we walk through an RNN-based translation model hands-on.
0.1 The RNN-based seq2seq translation model
The RNN-based seq2seq architecture consists of an encoder and a decoder; the decoder itself behaves differently at training time and at inference time, as shown in the two figures below.
The structure is quite simple (compared with the CNN-based and attention-based variants). Below we implement it in code to better understand how the model works internally.
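The key structural difference sits in the decoder: during training it is fed the ground-truth target sequence (teacher forcing), while at inference it feeds its own previous prediction back in, one token at a time, until it emits `<EOS>`. Here is a toy sketch of that inference loop; the `step` function and the lookup table are purely hypothetical stand-ins for one real decoder step:

```python
# Toy greedy-decoding loop illustrating the decoder's inference mode.
# `step` is a stand-in for one decoder step: it maps the previous token to
# the next one via a lookup table (a real decoder would run an LSTM cell).
NEXT = {"<GO>": "我", "我": "相信", "相信": "你", "你": "<EOS>"}

def step(token):
    return NEXT[token]

def greedy_decode(max_len=10):
    token, outputs = "<GO>", []
    while len(outputs) < max_len:
        token = step(token)          # feed the previous output back in
        if token == "<EOS>":
            break
        outputs.append(token)
    return outputs

print(greedy_decode())  # ['我', '相信', '你']
```

The real decoder built in section 2 does exactly this, except that `step` also threads the LSTM hidden states through each iteration.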
1. Data Preparation
1.1 Downloading the data
The site http://www.manythings.org/anki/ hosts translation datasets for many language pairs; this tutorial uses the Chinese-English dataset.
Training data download: http://www.manythings.org/anki/cmn-eng.zip
Unzip cmn-eng.zip to find cmn.txt, whose contents look like this:
```python
# ======== Read the raw data ========
with open('cmn.txt', 'r', encoding='utf-8') as f:
    data = f.read()
data = data.split('\n')
data = data[:100]
print(data[-5:])
```

['Tom died.\t湯姆去世了。', 'Tom quit.\t湯姆不干了。', 'Tom swam.\t湯姆游泳了。', 'Trust me.\t相信我。', 'Try hard.\t努力。']

Each translation pair sits on one line: English on the left, Chinese on the right, with a \t character separating the two.
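Given that format, each line splits on the tab into an (English, Chinese) pair. A minimal sketch, using a few sample lines in place of the real file:

```python
# Split tab-separated lines into (English, Chinese) pairs.
data = ['Tom died.\t湯姆去世了。', 'Trust me.\t相信我。', 'Try hard.\t努力。']

pairs = [line.split('\t') for line in data if line]
eng, chn = zip(*pairs)
print(eng[0], '->', chn[0])  # Tom died. -> 湯姆去世了。
```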
1.2 Data preprocessing
To train the network, the data must first be converted into a format the network can accept.
For this dataset that means mapping tokens to numbers (digitizing the sentences) and normalizing sentence lengths.
Digitizing sentences
For the preprocessing implementation you can also refer to my blog post on NLP named entity recognition (NER), an open-source hands-on tutorial.
English and Chinese are handled separately.
English processing
English words are already separated by spaces (contractions are treated as single words here), but punctuation is attached to the preceding word and needs special handling.
Here I use a simple trick: insert a space before each punctuation mark:
```python
def split_dot(strs, dots=", . ! ?"):
    # Insert a space before each punctuation mark so it becomes its own token.
    for d in dots.split(" "):
        strs = strs.replace(d, " " + d)
    return strs
```
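A quick check of `split_dot` on a sample sentence (the function is redefined here so the snippet runs standalone):

```python
def split_dot(strs, dots=", . ! ?"):
    # Insert a space before each punctuation mark.
    for d in dots.split(" "):
        strs = strs.replace(d, " " + d)
    return strs

print(split_dot("Wait! I can't stand it, Tom."))
# → Wait ! I can't stand it , Tom .
```

Note that apostrophes inside contractions are left alone, matching the decision above to treat contractions as single words.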

We then use this to build the English vocabulary (word-to-index and index-to-word dictionaries):
```python
def get_eng_dicts(datas):
    # Count word frequencies over the whole corpus.
    w_all_dict = {}
    for sample in datas:
        for token in sample.split(" "):
            if token not in w_all_dict.keys():
                w_all_dict[token] = 1
            else:
                w_all_dict[token] += 1

    # Keep the 7000-2 most frequent words, reserving slots for <UNK> and <PAD>.
    sort_w_list = sorted(w_all_dict.items(), key=lambda d: d[1], reverse=True)
    w_keys = [x for x, _ in sort_w_list[:7000-2]]
    w_keys.insert(0, "<PAD>")
    w_keys.insert(0, "<UNK>")

    w_dict = {x: i for i, x in enumerate(w_keys)}
    i_dict = {i: x for i, x in enumerate(w_keys)}
    return w_dict, i_dict
```
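The same frequency-then-truncate idea in compact form, using `collections.Counter` on a toy corpus. The important detail is the index layout: `<UNK>` lands at index 0 and `<PAD>` at index 1, which is why the `padding` helper further down fills sequences with 1:

```python
from collections import Counter

# Toy corpus standing in for df["eng"].
datas = ["Tom swam .", "Trust me .", "Tom quit ."]
counts = Counter(tok for s in datas for tok in s.split(" "))

# Special tokens first, then words by descending frequency.
w_keys = ["<UNK>", "<PAD>"] + [w for w, _ in counts.most_common(7000 - 2)]
w_dict = {w: i for i, w in enumerate(w_keys)}

print(w_dict["<UNK>"], w_dict["<PAD>"], w_dict["."])  # 0 1 2
```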

Chinese processing
The Chinese side mixes Traditional and Simplified characters, so it is best to normalize them to one form (reference link):
```python
# Install: pip install opencc-python-reimplemented
# t2s   - Traditional Chinese to Simplified Chinese
# s2t   - Simplified Chinese to Traditional Chinese
# mix2t - Mixed to Traditional Chinese
# mix2s - Mixed to Simplified Chinese
```

Usage, converting Traditional to Simplified:
```python
import opencc

cc = opencc.OpenCC('t2s')
s = cc.convert('這是什麼啊?')
print(s)  # 这是什么啊?
```

Then use jieba word segmentation to split each sentence into words and build the Chinese vocabulary:
```python
import jieba

def get_chn_dicts(datas):
    # Count word frequencies over jieba-segmented sentences.
    w_all_dict = {}
    for sample in datas:
        for token in jieba.cut(sample):
            if token not in w_all_dict.keys():
                w_all_dict[token] = 1
            else:
                w_all_dict[token] += 1

    # Keep the 10000-4 most frequent words, reserving four special tokens.
    sort_w_list = sorted(w_all_dict.items(), key=lambda d: d[1], reverse=True)
    w_keys = [x for x, _ in sort_w_list[:10000-4]]
    w_keys.insert(0, "<EOS>")
    w_keys.insert(0, "<GO>")
    w_keys.insert(0, "<PAD>")
    w_keys.insert(0, "<UNK>")

    w_dict = {x: i for i, x in enumerate(w_keys)}
    i_dict = {i: x for i, x in enumerate(w_keys)}
    return w_dict, i_dict
```

Next, look up token indices (mapping out-of-vocabulary words to <UNK>) and pad every sequence to a fixed length:
```python
def get_val(keys, dicts):
    # Look up a token's index, falling back to <UNK> for unknown tokens.
    if keys in dicts.keys():
        val = dicts[keys]
    else:
        keys = "<UNK>"
        val = dicts[keys]
    return val

def padding(lists, lens=LENS):
    # Pad with 1 (the <PAD> index) or truncate so every sequence has length `lens`.
    list_ret = []
    for l in lists:
        while len(l) < lens:
            l.append(1)
        if len(l) > lens:
            l = l[:lens]
        list_ret.append(l)
    return list_ret
```
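A quick check of `padding`'s behavior, redefined here with an explicit default length so the snippet runs standalone:

```python
def padding(lists, lens=4):
    # Pad with 1 (<PAD>) or truncate to exactly `lens` entries.
    list_ret = []
    for l in lists:
        while len(l) < lens:
            l.append(1)
        if len(l) > lens:
            l = l[:lens]
        list_ret.append(l)
    return list_ret

print(padding([[5, 6], [7, 8, 9, 10, 11]]))  # [[5, 6, 1, 1], [7, 8, 9, 10]]
```

Short sequences get 1s appended and long ones are cut, so the arrays below stack cleanly into fixed-shape NumPy matrices.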

Finally, run the whole preprocessing pipeline:
```python
if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict, id2eng = get_eng_dicts(df["eng"])
    chn_dict, id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    # Encoder input: English token ids.
    enc_in = [[get_val(e, eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    # Decoder input: <GO> + Chinese token ids + <EOS>.
    dec_in = [[get_val("<GO>", chn_dict)] + [get_val(e, chn_dict) for e in jieba.cut(eng)] + [get_val("<EOS>", chn_dict)] for eng in df["chn"]]
    # Decoder target: Chinese token ids + <EOS> (decoder input shifted left by one).
    dec_out = [[get_val(e, chn_dict) for e in jieba.cut(eng)] + [get_val("<EOS>", chn_dict)] for eng in df["chn"]]

    enc_in_ar = np.array(padding(enc_in, 32))
    dec_in_ar = np.array(padding(dec_in, 30))
    dec_out_ar = np.array(padding(dec_out, 30))
```
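The `dec_in`/`dec_out` construction above implements teacher forcing: the decoder input is the target sequence prefixed with `<GO>`, while the training target is the same sequence shifted left by one and ending in `<EOS>`, so at each step the decoder reads the true previous token and must predict the next one. A toy illustration with made-up token ids:

```python
# Hypothetical ids: <GO>=2, <EOS>=3; "我"=5, "相信"=6, "你"=7.
tokens = [5, 6, 7]
GO, EOS = 2, 3

dec_in = [GO] + tokens + [EOS]   # what the decoder reads
dec_out = tokens + [EOS]         # what it must predict at each step

# At step t the decoder reads dec_in[t] and must predict dec_out[t].
for t in range(len(dec_out)):
    print(dec_in[t], "->", dec_out[t])
```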

The output looks like this:
```
(TF_GPU) D:\Files\Prjs\Pythons\Kerases\MNT_RNN>C:/Datas/Apps/RJ/Miniconda3/envs/TF_GPU/python.exe d:/Files/Prjs/Pythons/Kerases/MNT_RNN/mian.py
Using TensorFlow backend.
      eng       chn
0     Hi .      嗨。
1     Hi .      你好。
2     Run .     你用跑的。
3     Wait !    等等!
4     Hello !   你好。
save csv
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiaos\AppData\Local\Temp\jieba.cache
Loading model cost 0.788 seconds.
Prefix dict has been built succesfully.
['<UNK>', '<PAD>', '.', 'I', 'to', 'the', 'you', 'a', '?', 'is', 'Tom', 'He', 'in', 'of', 'me', ',', 'was', 'for', 'have', 'The']
['<UNK>', '<PAD>', '<GO>', '<EOS>', '。', '我', '的', '了', '你', '他', '?', '在', '湯姆', '是', '她', '嗎', '我們', ',', '不', '很']
```

2. Building and Training the Model
2.1 Model and hyperparameters
The model uses two stacked LSTM layers:
```python
# ======= Model hyperparameters ========
EN_VOCAB_SIZE = 7000
CH_VOCAB_SIZE = 10000
HIDDEN_SIZE = 256
LEARNING_RATE = 0.001
BATCH_SIZE = 50
EPOCHS = 100

# ====================== Keras model ======================
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding, CuDNNLSTM
from keras.optimizers import Adam
import numpy as np

def get_model():
    # ============== encoder =============
    encoder_inputs = Input(shape=(None,))
    emb_inp = Embedding(output_dim=128, input_dim=EN_VOCAB_SIZE)(encoder_inputs)
    encoder_h1, encoder_state_h1, encoder_state_c1 = CuDNNLSTM(
        HIDDEN_SIZE, return_sequences=True, return_state=True)(emb_inp)
    encoder_h2, encoder_state_h2, encoder_state_c2 = CuDNNLSTM(
        HIDDEN_SIZE, return_state=True)(encoder_h1)

    # ============== decoder =============
    decoder_inputs = Input(shape=(None,))
    emb_target = Embedding(output_dim=128, input_dim=CH_VOCAB_SIZE)(decoder_inputs)
    lstm1 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)
    lstm2 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)
    decoder_dense = Dense(CH_VOCAB_SIZE, activation='softmax')

    decoder_h1, _, _ = lstm1(emb_target, initial_state=[encoder_state_h1, encoder_state_c1])
    decoder_h2, _, _ = lstm2(decoder_h1, initial_state=[encoder_state_h2, encoder_state_c2])
    decoder_outputs = decoder_dense(decoder_h2)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # The encoder model is the same as in training.
    encoder_model = Model(encoder_inputs,
                          [encoder_state_h1, encoder_state_c1,
                           encoder_state_h2, encoder_state_c2])

    # At inference time the decoder's initial states are passed in explicitly.
    decoder_state_input_h1 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_c1 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_h2 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_c2 = Input(shape=(HIDDEN_SIZE,))

    # Initialize the decoder LSTMs from the passed-in states.
    decoder_h1, state_h1, state_c1 = lstm1(
        emb_target, initial_state=[decoder_state_input_h1, decoder_state_input_c1])
    decoder_h2, state_h2, state_c2 = lstm2(
        decoder_h1, initial_state=[decoder_state_input_h2, decoder_state_input_c2])
    decoder_outputs = decoder_dense(decoder_h2)
    decoder_model = Model(
        [decoder_inputs,
         decoder_state_input_h1, decoder_state_input_c1,
         decoder_state_input_h2, decoder_state_input_c2],
        [decoder_outputs, state_h1, state_c1, state_h2, state_c2])

    return model, encoder_model, decoder_model
```

2.2 Model configuration and training
A custom accuracy metric is defined for progress monitoring, since Keras's built-in acc does not work with this sparse-target setup:
```python
import keras.backend as K
from keras.models import load_model

def my_acc(y_true, y_pred):
    # y_true holds sparse label ids with a trailing size-1 axis; compare them
    # with the argmax of the predicted distributions.
    acc = K.cast(K.equal(K.max(y_true, axis=-1),
                         K.cast(K.argmax(y_pred, axis=-1), K.floatx())),
                 K.floatx())
    return acc

Train = True

if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict, id2eng = get_eng_dicts(df["eng"])
    chn_dict, id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    enc_in = [[get_val(e, eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>", chn_dict)] + [get_val(e, chn_dict) for e in jieba.cut(eng)] + [get_val("<EOS>", chn_dict)] for eng in df["chn"]]
    dec_out = [[get_val(e, chn_dict) for e in jieba.cut(eng)] + [get_val("<EOS>", chn_dict)] for eng in df["chn"]]

    enc_in_ar = np.array(padding(enc_in, 32))
    dec_in_ar = np.array(padding(dec_in, 30))
    dec_out_ar = np.array(padding(dec_out, 30))
    # dec_out_ar = covt2oh(dec_out_ar)

    if Train:
        model, encoder_model, decoder_model = get_model()
        model.load_weights('e2c1.h5')  # resume from a checkpoint; comment out on the first run
        opt = Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.99, epsilon=1e-08)
        model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=[my_acc])
        model.summary()
        print(dec_out_ar.shape)
        model.fit([enc_in_ar, dec_in_ar], np.expand_dims(dec_out_ar, -1),
                  batch_size=50,
                  epochs=64,
                  initial_epoch=0,
                  validation_split=0.1)
        model.save('e2c1.h5')
        encoder_model.save("enc1.h5")
        decoder_model.save("dec1.h5")
```
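What `my_acc` computes, step by step: `K.max(y_true, axis=-1)` squeezes the trailing size-1 label axis down to the sparse label id, `K.argmax(y_pred, axis=-1)` picks the predicted token id, and the metric is their elementwise equality. The same logic in plain Python for a single timestep:

```python
def my_acc_step(y_true, y_pred):
    # Mirror of the Keras metric for one timestep: max over the trailing
    # axis squeezes the sparse label; argmax picks the predicted token id.
    label = max(y_true)
    pred = max(range(len(y_pred)), key=y_pred.__getitem__)
    return float(label == pred)

print(my_acc_step([2], [0.1, 0.2, 0.6, 0.1]))  # 1.0 (argmax is 2)
print(my_acc_step([3], [0.1, 0.2, 0.6, 0.1]))  # 0.0
```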

Training results after 64 epochs:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
input_2 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, None, 128) 896000 input_1[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding) (None, None, 128) 1280000 input_2[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM) [(None, None, 256), 395264 embedding_1[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_3 (CuDNNLSTM) [(None, None, 256), 395264 embedding_2[0][0]
cu_dnnlstm_1[0][1]
cu_dnnlstm_1[0][2]
__________________________________________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM) [(None, 256), (None, 526336 cu_dnnlstm_1[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_4 (CuDNNLSTM) [(None, None, 256), 526336 cu_dnnlstm_3[0][0]
cu_dnnlstm_2[0][1]
cu_dnnlstm_2[0][2]
__________________________________________________________________________________________________
dense_1 (Dense) (None, None, 10000) 2570000 cu_dnnlstm_4[0][0]
==================================================================================================
Non-trainable params: 0
__________________________________________________________________________________________________
...
...
19004/19004 [==============================] - 98s 5ms/step - loss: 0.1371 - my_acc: 0.9832 - val_loss: 2.7299 - val_my_acc: 0.7412
Epoch 58/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1234 - my_acc: 0.9851 - val_loss: 2.7378 - val_my_acc: 0.7410
Epoch 59/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1132 - my_acc: 0.9867 - val_loss: 2.7477 - val_my_acc: 0.7419
Epoch 60/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1050 - my_acc: 0.9879 - val_loss: 2.7660 - val_my_acc: 0.7426
Epoch 61/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0983 - my_acc: 0.9893 - val_loss: 2.7569 - val_my_acc: 0.7408
Epoch 62/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0933 - my_acc: 0.9903 - val_loss: 2.7775 - val_my_acc: 0.7414
Epoch 63/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0885 - my_acc: 0.9911 - val_loss: 2.7885 - val_my_acc: 0.7420
Epoch 64/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0845 - my_acc: 0.9920 - val_loss: 2.7914 - val_my_acc: 0.7423
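As a sanity check on the summary above: CuDNNLSTM keeps two bias vectors per gate (cuDNN uses separate input and recurrent biases), so a layer's parameter count is `4 * (input_dim * units + units * units + 2 * units)`. This reproduces the 395264 and 526336 figures in the summary table:

```python
def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and two biases.
    return 4 * (input_dim * units + units * units + 2 * units)

print(cudnn_lstm_params(128, 256))  # 395264 (layers fed by the 128-dim embeddings)
print(cudnn_lstm_params(256, 256))  # 526336 (second layers, fed 256-dim sequences)
```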

3. Applying the Model for Prediction
Pick a few samples from the training set for testing:
```python
Train = False

if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict, id2eng = get_eng_dicts(df["eng"])
    chn_dict, id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    enc_in = [[get_val(e, eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>", chn_dict)] + [get_val(e, chn_dict) for e in jieba.cut(eng)] + [get_val("<EOS>", chn_dict)] for eng in df["chn"]]
    dec_out = [[get_val(e, chn_dict) for e in jieba.cut(eng)] + [get_val("<EOS>", chn_dict)] for eng in df["chn"]]

    enc_in_ar = np.array(padding(enc_in, 32))
    dec_in_ar = np.array(padding(dec_in, 30))
    dec_out_ar = np.array(padding(dec_out, 30))
    # dec_out_ar = covt2oh(dec_out_ar)

    if Train:
        pass
    else:
        encoder_model = load_model("enc1.h5", custom_objects={"my_acc": my_acc})
        decoder_model = load_model("dec1.h5", custom_objects={"my_acc": my_acc})

        for k in range(16000 - 20, 16000):
            test_data = enc_in_ar[k:k + 1]
            # Encode the source sentence into both layers' (h, c) states.
            h1, c1, h2, c2 = encoder_model.predict(test_data)

            # Start decoding from the <GO> token.
            target_seq = np.zeros((1, 1))
            outputs = []
            target_seq[0, 0] = chn_dict["<GO>"]
            while True:
                output_tokens, h1, c1, h2, c2 = decoder_model.predict(
                    [target_seq, h1, c1, h2, c2])
                # Greedy decoding: take the most probable token.
                sampled_token_index = np.argmax(output_tokens[0, -1, :])
                outputs.append(sampled_token_index)
                # Feed the sampled token back in as the next decoder input.
                target_seq[0, 0] = sampled_token_index
                if sampled_token_index == chn_dict["<EOS>"] or len(outputs) > 28:
                    break

            print("> " + df["eng"][k])
            print("< " + ' '.join([id2chn[i] for i in outputs[:-1]]))
            print()
```

> I can understand you to some extent .
< 在 某種程度 上 我 能 了解 你 。
> I can't recall the last time we met .
< 我 想不起來 我們 上次 見面 的 情況 了 。
> I can't remember which is my racket .
< 我 不 記得 哪個 是 我 的 球拍 。
> I can't stand that noise any longer .
< 我 不能 再 忍受 那 噪音 了 。
> I can't stand this noise any longer .
< 我 無法 再 忍受 這個 噪音 了 。
> I caught the man stealing the money .
< 我 抓 到 了 這個 男人 正在 偷錢 。
> I could not afford to buy a bicycle .
< 我 買不起 自行車 。
> I couldn't answer all the questions .
< 我 不能 回答 所有 的 問題 。
> I couldn't think of anything to say .
< 我 想不到 要說 什么 話 。
> I cry every time I watch this movie .
< 我 每次 看 這部 電影 都 會 哭 。
> I did not participate in the dialog .
< 我 沒有 參與 對話 。
> I didn't really feel like going out .
< 我 不是 很想 出去 。
> I don't care a bit about the future .
< 我 不在乎 將來 。
