A TensorFlow-Based DeepQA Chatbot

Reposted from the original article. 2019-09-16 14:17. Machine learning / deep learning / NLP

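I recently went through the open-source DeepQA project, which is a good entry point for anyone who wants to understand how a chatbot is implemented. For this project I built my own corpus as the background dataset and used it to implement a Chinese chatbot.
Environment: Python 3.7 on Windows, with PyCharm as the IDE. The code uses tf.contrib, so it needs TensorFlow 1.x. Without further ado, the source code is here: https://github.com/chenjj9527/chatbot_Chinese.git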

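1. A Chinese chatbot

1.1 Building the model

Most chatbots use a seq2seq (encoder-decoder) architecture, more specifically one built from RNN or LSTM networks. The model-building function below uses TensorFlow to define the structure of the network. If you are not familiar with RNNs or LSTMs, see https://blog.csdn.net/zzz_cming/article/details/79235475 to learn how the RNN below reads a sentence and how its cell works.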

def get_model(feed_previous=False):
    """
    Build the model.
    """
    learning_rate = tf.Variable(float(init_learning_rate), trainable=False, dtype=tf.float32)
    learning_rate_decay_op = learning_rate.assign(learning_rate * 0.9)

    # one placeholder per time step, each holding a batch of token ids
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in range(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
    for i in range(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    for i in range(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None], name="weight{0}".format(i)))

    # targets are decoder_inputs shifted left by one time step
    targets = [decoder_inputs[i + 1] for i in range(output_seq_len)]

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # the returned state is not needed here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs[:output_seq_len],
        cell,
        num_encoder_symbols=num_encoder_symbols,
        num_decoder_symbols=num_decoder_symbols,
        embedding_size=size,
        output_projection=None,
        feed_previous=feed_previous,
        dtype=tf.float32)

    # weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    # gradient-descent optimizer
    opt = tf.train.GradientDescentOptimizer(learning_rate)
    # training op: minimize the loss
    update = opt.apply_gradients(opt.compute_gradients(loss))
    # saver used to persist the model
    saver = tf.train.Saver(tf.global_variables())

    return encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate
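The relation between decoder inputs and targets in get_model can be checked on plain Python lists. This is a minimal sketch with made-up token ids; `shift_targets` is a hypothetical helper used only for illustration and is not part of the project.

```python
GO_ID, EOS_ID, PAD_ID = 1, 2, 0


def shift_targets(decoder_inputs):
    """Return the targets: decoder_inputs shifted left by one time step.

    decoder_inputs has output_seq_len + 1 entries, so the result has
    output_seq_len entries, mirroring `targets` in get_model.
    """
    return [decoder_inputs[i + 1] for i in range(len(decoder_inputs) - 1)]


# One answer [11, 13] with output_seq_len = 5 (illustrative ids):
# GO + answer + EOS + padding, plus the one extra trailing slot.
decoder_inputs = [GO_ID, 11, 13, EOS_ID, PAD_ID, PAD_ID]
targets = shift_targets(decoder_inputs)
print(targets)  # [11, 13, 2, 0, 0]
```

At every step the decoder sees the token at position i and is asked to predict the token at position i + 1, which is why one extra decoder placeholder is declared.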

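1.2 Loading the training data

First, a look at the Q&A set I built: each line in question corresponds, in order, to one line in answer, and the two files together provide the question-answer pairs of the training set.
Note: question and answer must have the same number of lines and must not contain blank lines; otherwise you will get an error.
Note: be sure to extend the dataset to suit your needs.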
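Both constraints on the two files can be checked before training. The following is a minimal sketch, not part of the project; `check_parallel_corpus` is a hypothetical helper that works on lists of lines, so it can be tried without the real files.

```python
def check_parallel_corpus(questions, answers):
    """Return a list of problems found in two parallel lists of lines."""
    problems = []
    if len(questions) != len(answers):
        problems.append("line counts differ: %d questions, %d answers"
                        % (len(questions), len(answers)))
    for name, lines in (("question", questions), ("answer", answers)):
        for number, line in enumerate(lines, start=1):
            if not line.strip():
                problems.append("%s line %d is blank" % (name, number))
    return problems


print(check_parallel_corpus(["你好", "你是誰"], ["你好呀", ""]))
# ['answer line 2 is blank']
```

For the real files, the two lists could be obtained with `open('./samples/question', encoding='utf-8').read().splitlines()` and the same call for `./samples/answer`.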

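The code below reads the two files from their paths, applies the necessary preprocessing (described in section 1.3), and returns everything merged into a single train_set: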

def get_train_set():
    """
    Load the training question-answer set.
    """
    global num_encoder_symbols, num_decoder_symbols
    train_set = []
    with open('./samples/question', 'r', encoding='utf-8') as question_file:
        with open('./samples/answer', 'r', encoding='utf-8') as answer_file:
            while True:
                question = question_file.readline()
                answer = answer_file.readline()
                if question and answer:
                    # strip() removes leading and trailing whitespace
                    question = question.strip()
                    answer = answer.strip()

                    # convert the segmented words to ids
                    question_id_list = get_id_list_from(question)
                    answer_id_list = get_id_list_from(answer)
                    if len(question_id_list) > 0 and len(answer_id_list) > 0:
                        answer_id_list.append(EOS_ID)
                        train_set.append([question_id_list, answer_id_list])
                else:
                    break
    return train_set

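1.3 Building the sample data

If all the data went into one train_set without further processing, the program could not tell which parts are questions and which are answers, or where a question or an answer ends. The questions and answers therefore need some marking:
① The input and output lengths are fixed in advance, so the program can always fetch the data it needs by reading a fixed length.
② Inputs longer than the input length have to be truncated. To avoid truncation, increase the input sequence length.
③ Outputs shorter than the output sequence length are padded with 0 at the end (inputs are padded with 0 at the front), so that all inputs share one length and all outputs share one length.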
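The three rules can be sketched on plain lists. The helpers below are hypothetical and only illustrate the idea; in particular, truncating by keeping the first tokens is a choice made here for illustration.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2


def pad_encoder(ids, input_seq_len):
    """Left-pad with PAD_ID, keeping only the first input_seq_len ids."""
    ids = ids[:input_seq_len]
    return [PAD_ID] * (input_seq_len - len(ids)) + ids


def pad_decoder(ids, output_seq_len):
    """Prefix GO_ID, right-pad with PAD_ID, and cut to output_seq_len."""
    seq = ([GO_ID] + ids)[:output_seq_len]
    return seq + [PAD_ID] * (output_seq_len - len(seq))


print(pad_encoder([5, 7, 9], 5))               # [0, 0, 5, 7, 9]
print(pad_encoder([5, 7, 9, 11, 13, 15], 5))   # [5, 7, 9, 11, 13]
print(pad_decoder([11, 13, EOS_ID], 6))        # [1, 11, 13, 2, 0, 0]
```

Note that truncating an answer can cut off its EOS marker, which is one more reason to choose an output length that covers the longest answer in the corpus.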

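GO_ID = 1              # start-of-output marker
EOS_ID = 2             # end-of-sequence marker
PAD_ID = 0             # padding value
batch_num = 1000       # number of Q&A pairs used in each training step
input_seq_len = 25     # input sequence length
output_seq_len = 50    # output sequence length

The constants above define the input and output sequence lengths, the start and end markers, and the padding value. The function that builds the sample data follows: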

def get_samples(train_set, batch_num):
    """
    Build the sample data. train_set is the preprocessed Q&A set.
    batch_num: how many Q&A pairs from train_set take part in training

    # train_set = [[[5, 7, 9], [11, 13, 15, EOS_ID]], [[7, 9, 11], [13, 15, 17, EOS_ID]], [[15, 17, 19], [21, 23, 25, EOS_ID]]]
    """
    raw_encoder_input = []
    raw_decoder_input = []
    if batch_num >= len(train_set):
        batch_train_set = train_set
    else:
        # take a contiguous window of batch_num pairs at a random offset
        random_start = random.randint(0, len(train_set) - batch_num)
        batch_train_set = train_set[random_start:random_start + batch_num]

    # add the start marker and the padding
    for sample in batch_train_set:
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []

    # regroup by time step: one array of shape (batch,) per position
    for length_idx in range(input_seq_len):
        encoder_inputs.append(np.array([encoder_input[length_idx] for encoder_input in raw_encoder_input], dtype=np.int32))
    for length_idx in range(output_seq_len):
        decoder_inputs.append(np.array([decoder_input[length_idx] for decoder_input in raw_decoder_input], dtype=np.int32))
        # weight 0.0 on the last step and wherever the decoder input is padding
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 or decoder_input[length_idx] == PAD_ID else 1.0
            for decoder_input in raw_decoder_input
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights
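To see what get_samples produces, its steps can be replayed on a toy batch. This sketch assumes much shorter sequence lengths (4 and 5) and made-up token ids so the result fits on screen; it copies the logic of the function rather than calling it.

```python
import numpy as np

PAD_ID, GO_ID, EOS_ID = 0, 1, 2
input_seq_len, output_seq_len = 4, 5   # tiny lengths, for illustration only

train_set = [[[5, 7, 9], [11, 13, EOS_ID]],
             [[7, 9], [15, EOS_ID]]]

raw_enc = [[PAD_ID] * (input_seq_len - len(q)) + q for q, a in train_set]
raw_dec = [[GO_ID] + a + [PAD_ID] * (output_seq_len - len(a) - 1) for q, a in train_set]

# Regroup by time step: one array of shape (batch,) per position.
encoder_inputs = [np.array([row[t] for row in raw_enc], dtype=np.int32)
                  for t in range(input_seq_len)]
decoder_inputs = [np.array([row[t] for row in raw_dec], dtype=np.int32)
                  for t in range(output_seq_len)]
# Weight 0.0 on the last step and wherever this step's decoder input is padding.
target_weights = [np.array([0.0 if t == output_seq_len - 1 or row[t] == PAD_ID else 1.0
                            for row in raw_dec], dtype=np.float32)
                  for t in range(output_seq_len)]

# Transpose back to (batch, time) purely for readable printing.
print(np.stack(encoder_inputs).T)
print(np.stack(decoder_inputs).T)
print(np.stack(target_weights).T)
```

Each returned list has one entry per time step, which matches the one-placeholder-per-step layout that get_model declares.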

1.4 Training

Training starts a TensorFlow session, feeds data into the model, reads back the training loss, and finally saves the parameters.

def train():
    """
    Training loop.
    """
    train_set = get_train_set()
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate = get_model()

        sess.run(tf.global_variables_initializer())

        # run many iterations and print the loss every 100 steps;
        # training can be stopped with Ctrl+C once the loss looks good enough
        previous_losses = []
        for step in range(epochs):
            sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples(train_set, batch_num)
            input_feed = {}
            for l in range(input_seq_len):
                input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
            for l in range(output_seq_len):
                input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                input_feed[target_weights[l].name] = sample_target_weights[l]
            # the extra decoder slot only serves as the final target; feed zeros
            input_feed[decoder_inputs[output_seq_len].name] = np.zeros([len(sample_decoder_inputs[0])], dtype=np.int32)
            [loss_ret, _] = sess.run([loss, update], input_feed)
            # The original indentation was lost in this listing; the block below is
            # assumed to sit inside the every-100-steps branch, so that a checkpoint
            # exists whenever training is interrupted.
            if step % 100 == 0:
                print('step=', step, 'loss=', loss_ret, 'learning_rate=', learning_rate.eval())

                # decay the learning rate when the loss is worse than the last 5 logged losses
                if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
                    sess.run(learning_rate_decay_op)
                previous_losses.append(loss_ret)

                # save the model parameters
                saver.save(sess, './model/' + str(epochs) + '/demo_')
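The learning-rate rule inside train() is easy to miss: the rate is multiplied by 0.9 whenever a new loss is higher than every one of the five losses recorded before it. The sketch below replays that rule on a made-up list of loss values; `decayed_learning_rates` is a hypothetical helper, and the numbers are illustrative, not measured.

```python
def decayed_learning_rates(losses, init_learning_rate=0.5, decay=0.9, window=5):
    """Replay the decay rule from train() over a list of recorded losses."""
    learning_rate = init_learning_rate
    previous_losses = []
    history = []
    for loss in losses:
        # Decay when the new loss is worse than all of the last `window` losses.
        if len(previous_losses) > window and loss > max(previous_losses[-window:]):
            learning_rate *= decay
        previous_losses.append(loss)
        history.append(learning_rate)
    return history


losses = [5.0, 4.0, 3.0, 2.5, 2.2, 2.0, 4.5, 1.9]
print(decayed_learning_rates(losses))
```

With these values the rate stays at 0.5 for the first six entries and drops to 0.45 at the seventh, where 4.5 exceeds the maximum (4.0) of the five preceding losses.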

1.5 Prediction

Prediction loads the parameter files from the model folder and uses them to answer questions.

def predict():
    """
    Prediction loop.
    """
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate = get_model(feed_previous=True)
        saver.restore(sess, './model/' + str(epochs) + '/demo_')
        sys.stdout.write("you ask>> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            input_id_list = get_id_list_from(input_seq)
            if len(input_id_list):
                sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = seq_to_encoder(
                    ' '.join([str(v) for v in input_id_list]))

                input_feed = {}
                for l in range(input_seq_len):
                    input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
                for l in range(output_seq_len):
                    input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                    input_feed[target_weights[l].name] = sample_target_weights[l]
                # this slot only feeds the loss, so its value is unused when running `outputs`
                input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

                # run the prediction
                outputs_seq = sess.run(outputs, input_feed)
                # each output is num_decoder_symbols-dimensional; argmax picks the predicted id
                outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
                # drop everything from the end marker onwards
                if EOS_ID in outputs_seq:
                    outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
                # id2word returns None for reserved ids; skip those so join() cannot fail
                words = [wordToken.id2word(v) for v in outputs_seq]
                print("chatbot>>", " ".join(w for w in words if w is not None))
            else:
                print("WARN: none of the words are in the vocabulary")

            sys.stdout.write("you ask>>")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()
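The decoding step at the end of predict() can be tried in isolation with NumPy. This is a minimal sketch with made-up logits for a vocabulary of six ids; `decode_ids` is a hypothetical helper that repeats the argmax and EOS logic shown above.

```python
import numpy as np

EOS_ID = 2


def decode_ids(outputs_seq):
    """Greedy-decode a list of per-step logits arrays of shape (batch, vocab).

    Takes the argmax of the first batch row at every step and drops
    everything from the first EOS_ID onwards.
    """
    ids = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return ids


# Three time steps, batch of 1, vocabulary of 6 (made-up logits).
fake_outputs = [
    np.array([[0.1, 0.0, 0.2, 0.1, 0.9, 0.3]]),  # argmax -> 4
    np.array([[0.0, 0.1, 0.2, 0.1, 0.3, 0.8]]),  # argmax -> 5
    np.array([[0.0, 0.1, 0.9, 0.1, 0.3, 0.2]]),  # argmax -> 2 (EOS)
]
print(decode_ids(fake_outputs))  # [4, 5]
```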

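2. Running the source code

2.1 Training the model

Open demo_test.py and choose Run > Edit Configurations; the following window appears:

Enter train in the Parameters field shown above, confirm, and then run demo_test.py:

train

The training log appears in the console:

When training finishes, the generated model parameters can be found under the model folder, as shown below. Training is now complete.

2.2 Testing the model

Open demo_test.py and choose Run > Edit Configurations; the following window appears:

Replace train in the Parameters field with any other string, click OK, and run demo_test.py to enter the interactive chat mode shown below: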

3. Source listing

3.1 The `demo_test.py` file

# -*- coding: utf-8 -*-
# Requires TensorFlow 1.x (tf.contrib is not available in 2.x).
import sys
import random

import numpy as np
import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq
import jieba

import word_token

size = 8                  # LSTM cell size
GO_ID = 1                 # start-of-output marker
EOS_ID = 2                # end-of-sequence marker
PAD_ID = 0                # padding value
min_freq = 1              # words seen fewer than min_freq times are left out of the vocabulary
epochs = 2000             # number of training iterations
batch_num = 1000          # number of Q&A pairs used in each training step
input_seq_len = 25        # input sequence length
output_seq_len = 50       # output sequence length
init_learning_rate = 0.5  # initial learning rate

wordToken = word_token.WordToken()

# kept at module level so that num_encoder_symbols and num_decoder_symbols are computed from the data
max_token_id = wordToken.load_file_list(['./samples/question', './samples/answer'], min_freq)
num_encoder_symbols = max_token_id + 5
num_decoder_symbols = max_token_id + 5


def get_id_list_from(sentence):
    """
    Return the ids of the segmented words in a sentence.
    """
    sentence_id_list = []
    seg_list = jieba.cut(sentence)
    for word in seg_list:
        word_id = wordToken.word2id(word)
        if word_id:
            sentence_id_list.append(word_id)
    return sentence_id_list


def get_train_set():
    """
    Load the training question-answer set.
    """
    global num_encoder_symbols, num_decoder_symbols
    train_set = []
    with open('./samples/question', 'r', encoding='utf-8') as question_file:
        with open('./samples/answer', 'r', encoding='utf-8') as answer_file:
            while True:
                question = question_file.readline()
                answer = answer_file.readline()
                if question and answer:
                    # strip() removes leading and trailing whitespace
                    question = question.strip()
                    answer = answer.strip()

                    # convert the segmented words to ids
                    question_id_list = get_id_list_from(question)
                    answer_id_list = get_id_list_from(answer)
                    if len(question_id_list) > 0 and len(answer_id_list) > 0:
                        answer_id_list.append(EOS_ID)
                        train_set.append([question_id_list, answer_id_list])
                else:
                    break
    return train_set


def get_samples(train_set, batch_num):
    """
    Build the sample data. train_set is the preprocessed Q&A set.
    batch_num: how many Q&A pairs from train_set take part in training
    """
    raw_encoder_input = []
    raw_decoder_input = []
    if batch_num >= len(train_set):
        batch_train_set = train_set
    else:
        random_start = random.randint(0, len(train_set) - batch_num)
        batch_train_set = train_set[random_start:random_start + batch_num]

    # add the start marker and the padding
    for sample in batch_train_set:
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []

    for length_idx in range(input_seq_len):
        encoder_inputs.append(np.array([encoder_input[length_idx] for encoder_input in raw_encoder_input], dtype=np.int32))
    for length_idx in range(output_seq_len):
        decoder_inputs.append(np.array([decoder_input[length_idx] for decoder_input in raw_decoder_input], dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 or decoder_input[length_idx] == PAD_ID else 1.0
            for decoder_input in raw_decoder_input
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights


def seq_to_encoder(input_seq):
    """
    Convert a space-separated string of numeric ids into the encoder inputs,
    decoder inputs, and target weights used for prediction.
    """
    input_seq_array = [int(v) for v in input_seq.split()]
    encoder_input = [PAD_ID] * (input_seq_len - len(input_seq_array)) + input_seq_array
    decoder_input = [GO_ID] + [PAD_ID] * (output_seq_len - 1)
    encoder_inputs = [np.array([v], dtype=np.int32) for v in encoder_input]
    decoder_inputs = [np.array([v], dtype=np.int32) for v in decoder_input]
    target_weights = [np.array([1.0], dtype=np.float32)] * output_seq_len
    return encoder_inputs, decoder_inputs, target_weights


def get_model(feed_previous=False):
    """
    Build the model.
    """
    learning_rate = tf.Variable(float(init_learning_rate), trainable=False, dtype=tf.float32)
    learning_rate_decay_op = learning_rate.assign(learning_rate * 0.9)

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in range(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
    for i in range(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    for i in range(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None], name="weight{0}".format(i)))

    # targets are decoder_inputs shifted left by one time step
    targets = [decoder_inputs[i + 1] for i in range(output_seq_len)]

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # the returned state is not needed here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs[:output_seq_len],
        cell,
        num_encoder_symbols=num_encoder_symbols,
        num_decoder_symbols=num_decoder_symbols,
        embedding_size=size,
        output_projection=None,
        feed_previous=feed_previous,
        dtype=tf.float32)

    # weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    # gradient-descent optimizer
    opt = tf.train.GradientDescentOptimizer(learning_rate)
    # training op: minimize the loss
    update = opt.apply_gradients(opt.compute_gradients(loss))
    # saver used to persist the model
    saver = tf.train.Saver(tf.global_variables())

    return encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate


def train():
    """
    Training loop.
    """
    train_set = get_train_set()
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate = get_model()

        sess.run(tf.global_variables_initializer())

        # run many iterations and print the loss every 100 steps;
        # training can be stopped with Ctrl+C once the loss looks good enough
        previous_losses = []
        for step in range(epochs):
            sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples(train_set, batch_num)
            input_feed = {}
            for l in range(input_seq_len):
                input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
            for l in range(output_seq_len):
                input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                input_feed[target_weights[l].name] = sample_target_weights[l]
            input_feed[decoder_inputs[output_seq_len].name] = np.zeros([len(sample_decoder_inputs[0])], dtype=np.int32)
            [loss_ret, _] = sess.run([loss, update], input_feed)
            # The original indentation was lost in this listing; the block below is
            # assumed to sit inside the every-100-steps branch.
            if step % 100 == 0:
                print('step=', step, 'loss=', loss_ret, 'learning_rate=', learning_rate.eval())

                if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
                    sess.run(learning_rate_decay_op)
                previous_losses.append(loss_ret)

                # save the model parameters
                saver.save(sess, './model/' + str(epochs) + '/demo_')


def predict():
    """
    Prediction loop.
    """
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, learning_rate_decay_op, learning_rate = get_model(feed_previous=True)
        saver.restore(sess, './model/' + str(epochs) + '/demo_')
        sys.stdout.write("you ask>> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            input_id_list = get_id_list_from(input_seq)
            if len(input_id_list):
                sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = seq_to_encoder(
                    ' '.join([str(v) for v in input_id_list]))

                input_feed = {}
                for l in range(input_seq_len):
                    input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
                for l in range(output_seq_len):
                    input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                    input_feed[target_weights[l].name] = sample_target_weights[l]
                input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

                # run the prediction
                outputs_seq = sess.run(outputs, input_feed)
                # each output is num_decoder_symbols-dimensional; argmax picks the predicted id
                outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
                # drop everything from the end marker onwards
                if EOS_ID in outputs_seq:
                    outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
                # id2word returns None for reserved ids; skip those so join() cannot fail
                words = [wordToken.id2word(v) for v in outputs_seq]
                print("chatbot>>", " ".join(w for w in words if w is not None))
            else:
                print("WARN: none of the words are in the vocabulary")

            sys.stdout.write("you ask>>")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()


if __name__ == "__main__":
    # "train" as the first argument trains; anything else (or no argument) predicts
    if len(sys.argv) > 1 and sys.argv[1] == 'train':
        train()
    else:
        predict()

3.2 The `word_token.py` file

# -*- coding: utf-8 -*-
import sys

import jieba


class WordToken(object):
    def __init__(self):
        # smallest id handed out; lower ids are reserved for special markers
        self.START_ID = 4
        self.word2id_dict = {}
        self.id2word_dict = {}

    def load_file_list(self, file_list, min_freq):
        """
        Load the sample files, segment every line, count word frequencies,
        and number the words in order of decreasing frequency, storing the
        result in self.word2id_dict and self.id2word_dict.
        file_list = [question, answer]
        min_freq: minimum frequency; rarer words are left out of the vocabulary
        """
        words_count = {}
        for path in file_list:
            with open(path, 'r', encoding='utf-8') as file_object:
                for line in file_object.readlines():
                    line = line.strip()
                    seg_list = jieba.cut(line)
                    for word in seg_list:
                        if word in words_count:
                            words_count[word] = words_count[word] + 1
                        else:
                            words_count[word] = 1

        # sort by [count, word] in descending order
        sorted_list = [[v[1], v[0]] for v in words_count.items()]
        sorted_list.sort(reverse=True)
        for index, item in enumerate(sorted_list):
            word = item[1]
            if item[0] < min_freq:
                break
            self.word2id_dict[word] = self.START_ID + index
            self.id2word_dict[self.START_ID + index] = word
        return index

    def word2id(self, word):
        # make sure word is a string
        if not isinstance(word, str):
            print("Exception: error word not unicode")
            sys.exit(1)
        if word in self.word2id_dict:
            return self.word2id_dict[word]
        else:
            return None

    def id2word(self, id):
        id = int(id)
        if id in self.id2word_dict:
            return self.id2word_dict[id]
        else:
            return None
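The numbering scheme in load_file_list can be tried without jieba or any files. The sketch below assumes the lines are already segmented into word lists; `build_vocab` is a hypothetical stand-in that follows the same ordering rule and is not part of the project.

```python
from collections import Counter

START_ID = 4  # ids below this are reserved for special markers


def build_vocab(tokenized_lines, min_freq=1):
    """Map words to ids in order of decreasing frequency."""
    counts = Counter(word for line in tokenized_lines for word in line)
    # Sort by (count, word) descending, as WordToken.load_file_list does.
    ranked = sorted(((count, word) for word, count in counts.items()), reverse=True)
    word2id = {}
    for index, (count, word) in enumerate(ranked):
        if count < min_freq:
            break
        word2id[word] = START_ID + index
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


lines = [["你", "好"], ["你", "是", "誰"]]
word2id, id2word = build_vocab(lines)
print(word2id["你"])  # 4
```

The most frequent word receives START_ID, so ids 0 to 3 stay free for PAD_ID, GO_ID, EOS_ID, and one spare marker.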

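Next, a brief introduction to the open-source DeepQA project. It is documented in detail and has many features worth borrowing.

4. A brief introduction to the DeepQA project

DeepQA source code on GitHub: https://github.com/Conchylicultor/DeepQA

Download the source, unpack it, and create a PyCharm project for it.
This article only covers the DeepQA demo, not the website version; interested readers can explore the files under chatbot_website on their own.
After creating the project you get the layout shown in the figure below:

1. data folder: stores the corpus data. The DeepQA GitHub page describes this folder in detail. In brief, opening the data folder shows the contents in the upper-right figure:

① cornell holds the Cornell movie-dialog corpus, which is the default corpus, in .txt format;

② to use your own corpus, place it in the lightweight folder (more detailed steps for custom corpora follow below);

③ samples stores the .pkl files converted from the corpus .txt files; the .pkl files are what the program actually reads;

④ test contains samples.txt (the same name as the samples folder, but a different format), which stores the test corpus;

2. save folder: stores the model parameters produced by training, mainly in the .ckpt files; model_predictions.txt stores the batch-test output (details below);
3. main.py is the main entry point for both training and testing;
4. chatbot.py is the main program for parameters: it exposes the options used for tuning (details below);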

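5. Building a chatbot with the default Cornell movie-dialog corpus

5.1 Training the model

1. Download and unpack the DeepQA source, create a PyCharm project, and run main.py directly to start training. The run window looks like this:

2. Note the model parameters at lines 130 to 135 of chatbot.py: the number of epochs numEpochs, the checkpoint interval saveEvery, the batch size batchsize, the learning rate lr, and the dropout parameter. Tune them to suit your needs:

3. Once the corpus has been read, the .pkl files generated from the corpus .txt files can be found under data/samples:

4. Then the long training run begins. When it finishes, the generated model parameter files can be found under save/model. (There should be no model_predictions.txt at this point; it is only created by the batch test described below.)

5.2 Testing the model

5.2.1 Batch test: generating model_predictions.txt

Open data/cornell first; the two .txt files there are the Cornell training corpus. Then open data/test; the samples.txt file there is the test corpus. Finally open main.py and choose Run > Edit Configurations to get the following window:

Enter the following in Parameters:

--test

Confirm and run main.py. Once the success message appears in the console, model_predictions.txt can be found under save/model.

Open model_predictions.txt: it contains the answers that the model trained on the training corpus predicts for the sentences in data/test/samples.txt, as shown below:

5.2.2 Interactive test: chatting with the bot

The steps are the same as for the batch test: choose Run > Edit Configurations and change the Parameters to the following:

--test interactive

You can then chat with the bot in the console.

That completes a simple chatbot. The number of training iterations and the quality of the corpus directly affect how well the model performs.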

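6. Building a chatbot with your own corpus

Using your own corpus is also simple: prepare some dialog data and change a few parameters.

6.1 Preparing your own corpus

Create your training corpus under data/lightweight and your test corpus in data/test/samples.txt.

What follows is only a quick method. For the details, see the article "How to build your own corpus for the DeepQA project".

6.1.1 Training corpus

Under data/lightweight, create a text file named <name>.txt, replacing <name> with your own file name. Enter your corpus in the file: separate different conversations with ===, and within a conversation write the question and answer on consecutive lines.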
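The file layout described above can be illustrated with a small parser. This is a sketch under stated assumptions, not DeepQA's own loader: `parse_lightweight` is a hypothetical helper, and pairing every line with the line that follows it inside the same conversation is an assumption about how such a file is consumed.

```python
def parse_lightweight(text, separator="==="):
    """Split corpus text into conversations, then into (question, answer) pairs."""
    conversations, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(separator):
            # A separator closes the current conversation.
            if current:
                conversations.append(current)
            current = []
        elif line:
            current.append(line)
    if current:
        conversations.append(current)

    # Pair each line with the one that follows it inside the same conversation.
    pairs = []
    for lines in conversations:
        pairs.extend(zip(lines[:-1], lines[1:]))
    return pairs


sample = """你好
你好呀
===
你是誰
我是聊天機器人
很高興認識你
"""
for question, answer in parse_lightweight(sample):
    print(question, "->", answer)
```

No pair crosses a === line, which is the point of the separator: the last line of one conversation is never treated as the question for the first line of the next.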

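6.1.2 Test corpus

Re-enter your test sentences in samples.txt under data/test. The test corpus is only used by the batch test that generates model_predictions.txt. It also uses the consecutive question-answer layout, but the conversations do not need to be separated with ===.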

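6.2 Training on your own corpus

Before every retraining, check that the two .pkl files previously generated under data/samples have been deleted. I have not looked into this in depth, but the program appears to check for existing .pkl files: if they are not deleted beforehand, it loads them first, which means the new corpus does not take part in the new training run.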
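Clearing the cached files by hand is easy to forget, so it can be scripted. The following is a minimal sketch; `remove_stale_pkl` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real data/samples folder.

```python
import tempfile
from pathlib import Path


def remove_stale_pkl(samples_dir):
    """Delete every .pkl file directly under samples_dir; return the removed names."""
    removed = []
    for path in sorted(Path(samples_dir).glob("*.pkl")):
        path.unlink()
        removed.append(path.name)
    return removed


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dataset-old.pkl").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("keep me")
    print(remove_stale_pkl(tmp))  # ['dataset-old.pkl']
```

In a real checkout the call would be `remove_stale_pkl("data/samples")`, run from the project root before starting a new training run.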

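The steps for training on your own corpus are familiar by now: open main.py, choose Run > Edit Configurations, enter the following in Parameters, and run main.py. Make sure <name> matches your file name.

--corpus lightweight --datasetTag <name>

Once the corpus has been read successfully, the newly generated .pkl files can be found under data/samples, and when training finishes the new model parameter files can be found under save/model.

6.3 Testing the model trained on your own corpus

The batch test and the interactive test work exactly as described in section 5.2.
Batch test: choose Run > Edit Configurations, enter the following in Parameters, and run main.py. After the success message appears, model_predictions.txt can be found under save/model.

--test

Interactive test: choose Run > Edit Configurations, change the Parameters to the following, and run main.py.

--test interactive

You can then chat with the bot in the console. A low-quality corpus, too little dialog data, or too few training iterations will all lead to poor predictions in interactive mode.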

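With a model trained for 20,000 iterations, the interactive test answers correctly every time when the input is a question from my own corpus under data/lightweight (the question and its punctuation must match exactly; the language of the question does not matter). For questions that are not in the corpus, or Chinese questions that do not match a corpus entry exactly, it keeps giving wrong answers. The likely cause is that my corpus covers too few scenarios and too narrow a range of questions, so the model overfits and only produces the right answer for a complete, exact question. DeepQA targets English and tolerates some variation in English input, so its answers to English questions are reasonably reliable. When the training set is entirely Chinese, however, DeepQA has no word-segmentation step comparable to jieba, so the predicted answers are consistently unsatisfactory.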
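One possible workaround is to segment the Chinese text before it reaches a tokenizer designed for English. The sketch below assumes the downstream tokenizer splits on whitespace; `space_out_cjk` is a hypothetical helper that splits per character, which is a dependency-free stand-in for a real word segmenter such as jieba and not something DeepQA provides.

```python
def space_out_cjk(text):
    """Insert spaces around CJK ideographs so each becomes its own token."""
    pieces = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":   # CJK Unified Ideographs block
            pieces.append(" " + ch + " ")
        else:
            pieces.append(ch)
    # Collapse the runs of spaces introduced above.
    return " ".join("".join(pieces).split())


print(space_out_cjk("你好TensorFlow聊天"))  # 你 好 TensorFlow 聊 天
```

Character-level tokens lose word boundaries, so a proper segmenter is likely to work better; the point here is only that the corpus must be split into tokens before training.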
