Contents
- Overview
- Model structure
- Dataset
- Main code
1. Overview
This text-classification series will consist of roughly 8 articles. The code can be downloaded directly from GitHub and the training data from Baidu Netdisk; import the project into PyCharm and it is ready to run. The series covers text classification based on word2vec pre-trained embeddings as well as classification based on recent pre-trained models (ELMo, BERT, etc.). The full series:
- word2vec pre-trained word vectors
- textCNN model
- charCNN model
- Bi-LSTM model
- Bi-LSTM + Attention model
- Transformer model
- ELMo pre-trained model
- BERT pre-trained model
Model structure
The Bi-LSTM + Attention model comes from the paper Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification.
Bi-LSTM + Attention simply adds an attention layer on top of the Bi-LSTM model. In a plain Bi-LSTM we take the output vector of the last time step as the feature vector and feed it to a softmax classifier. With attention, a weight is first computed for each time step, and the weighted sum of the vectors of all time steps is used as the feature vector, which is then classified with softmax. In the experiments, adding attention does improve the results. The model structure is shown in the figure below:
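The attention layer in this structure can be written compactly as follows (these equations mirror both the paper and the implementation in section 3.3 below):

$$
M = \tanh(H), \qquad
\alpha = \operatorname{softmax}(w^{\top} M), \qquad
r = H \alpha^{\top}, \qquad
h^{*} = \tanh(r)
$$

Here $H$ is the matrix of Bi-LSTM outputs over all time steps (forward and backward outputs summed), $w$ is a trainable weight vector, $\alpha$ holds the per-time-step attention weights, and $h^{*}$ is the sentence representation fed to the softmax classifier.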
2. Dataset
The dataset is the IMDB movie review dataset. There are three data files under /data/rawData: unlabeledTrainData.tsv, labeledTrainData.tsv, and testData.tsv. Text classification itself requires labeled data (labeledTrainData), but when training the word2vec embedding model (unsupervised learning) the unlabeled data can be used as well.
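For a quick look at the labeled file, a minimal sketch with pandas, assuming the standard Kaggle layout of this file (tab-separated, with id, sentiment and review columns):

```python
import pandas as pd

# quoting=3 (QUOTE_NONE) avoids problems with unescaped quotes inside the reviews
df = pd.read_csv("../data/rawData/labeledTrainData.tsv", sep="\t", quoting=3)
print(df.shape)                        # expected (25000, 3)
print(df["sentiment"].value_counts())  # roughly balanced 0/1 labels
```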
Training data download: https://pan.baidu.com/s/1-XEwx1ai8kkGsMagIFKX_g  (extraction code: rtz8)
3. Main code
3.1 Training configuration: parameter_config.py
```python
# Author: yifan
# Kept in this standalone form so it can be pasted into Jupyter and used directly.

# 1 Training configuration
class TrainingConfig(object):
    epoches = 4
    evaluateEvery = 100
    checkpointEvery = 100
    learningRate = 0.001


class ModelConfig(object):
    embeddingSize = 200
    hiddenSizes = [256, 128]   # number of units in each LSTM layer
    dropoutKeepProb = 0.5
    l2RegLambda = 0.0


class Config(object):
    sequenceLength = 200       # roughly the mean sequence length of the dataset
    batchSize = 128
    dataSource = "../data/preProcess/labeledTrain.csv"
    stopWordSource = "../data/english"
    numClasses = 1             # 1 for binary classification, otherwise the number of classes
    rate = 0.8                 # proportion of the data used for training
    training = TrainingConfig()
    model = ModelConfig()


# instantiate the configuration object
config = Config()
```
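The other scripts consume this configuration through plain attribute access. A minimal check, assuming the file is importable as parameter_config:

```python
import parameter_config

config = parameter_config.Config()
print(config.sequenceLength)         # 200
print(config.model.hiddenSizes)      # [256, 128]
print(config.training.learningRate)  # 0.001
```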
3.2 Building the training data: get_train_data.py
```python
# Author: yifan
import json
from collections import Counter
import gensim
import pandas as pd
import numpy as np
import parameter_config


# 2 Data preprocessing class: builds the training and evaluation sets
class Dataset(object):
    def __init__(self, config):
        self.config = config
        self._dataSource = config.dataSource
        self._stopWordSource = config.stopWordSource
        self._sequenceLength = config.sequenceLength  # every input sequence is padded/truncated to this length
        self._embeddingSize = config.model.embeddingSize
        self._batchSize = config.batchSize
        self._rate = config.rate

        self._stopWordDict = {}

        self.trainReviews = []
        self.trainLabels = []

        self.evalReviews = []
        self.evalLabels = []

        self.wordEmbedding = None

        self.labelList = []

    def _readData(self, filePath):
        """Read the dataset from a csv file."""
        df = pd.read_csv(filePath)  # three columns: review, sentiment, rate
        if self.config.numClasses == 1:
            labels = df["sentiment"].tolist()  # the sentiment column: a 0/1 sequence of 25000 entries
        elif self.config.numClasses > 1:
            labels = df["rate"].tolist()  # not used in this binary-classification run
        review = df["review"].tolist()
        reviews = [line.strip().split() for line in review]  # split each review on whitespace
        return reviews, labels

    def _labelToIndex(self, labels, label2idx):
        """Convert labels to their index representation."""
        labelIds = [label2idx[label] for label in labels]
        # print(labels == labelIds)  # prints True: labels and indices are identical here
        return labelIds

    def _wordToIndex(self, reviews, word2idx):
        """Convert words to indices."""
        reviewIds = [[word2idx.get(item, word2idx["UNK"]) for item in review] for review in reviews]
        return reviewIds  # 25000 index sequences

    def _genTrainEvalData(self, x, y, word2idx, rate):
        """Split the data into training and evaluation sets."""
        reviews = []
        for review in x:  # sequenceLength is 200: longer reviews are truncated, shorter ones padded
            if len(review) >= self._sequenceLength:
                reviews.append(review[:self._sequenceLength])
            else:
                reviews.append(review + [word2idx["PAD"]] * (self._sequenceLength - len(review)))

        # split into train/eval according to rate
        trainIndex = int(len(x) * rate)

        trainReviews = np.asarray(reviews[:trainIndex], dtype="int64")
        trainLabels = np.array(y[:trainIndex], dtype="float32")

        evalReviews = np.asarray(reviews[trainIndex:], dtype="int64")
        evalLabels = np.array(y[trainIndex:], dtype="float32")

        return trainReviews, trainLabels, evalReviews, evalLabels

    def _getWordEmbedding(self, words):
        """Look up the pre-trained word2vec vector (200 dimensions) for each word in our vocabulary.
        "PAD" is prepended with an all-zero vector and "UNK" with a random vector.
        """
        wordVec = gensim.models.KeyedVectors.load_word2vec_format("../word2vec/word2Vec.bin", binary=True)
        vocab = []
        wordEmbedding = []

        # add "PAD" and "UNK"
        vocab.append("PAD")
        vocab.append("UNK")
        wordEmbedding.append(np.zeros(self._embeddingSize))        # embeddingSize is 200 here
        wordEmbedding.append(np.random.randn(self._embeddingSize))

        for word in words:
            try:
                vector = wordVec.wv[word]
                vocab.append(word)
                wordEmbedding.append(vector)
            except:
                print(word + " is not in the word2vec vocabulary")

        return vocab, np.array(wordEmbedding)

    def _genVocabulary(self, reviews, labels):
        """Build the embeddings and the word/label-to-index mappings (the full dataset can be used)."""
        allWords = [word for review in reviews for word in review]  # 5,738,236 words from the 25000 reviews

        subWords = [word for word in allWords if word not in self.stopWordDict]  # remove stop words
        wordCount = Counter(subWords)  # count word frequencies
        sortWordCount = sorted(wordCount.items(), key=lambda x: x[1], reverse=True)  # sort by frequency
        # len(sortWordCount) == 161330, e.g.
        # [('movie', 41104), ('film', 36981), ('one', 24966), ('like', 19490)]
        # [('daeseleires', 1), ('nice310', 1), ('shortsightedness', 1), ('unfairness', 1)]

        words = [item[0] for item in sortWordCount if item[1] >= 5]  # drop low-frequency words (count < 5)

        vocab, wordEmbedding = self._getWordEmbedding(words)
        self.wordEmbedding = wordEmbedding

        word2idx = dict(zip(vocab, list(range(len(vocab)))))  # e.g. {'I': 0, 'love': 1, 'yanzi': 2}

        uniqueLabel = list(set(labels))  # deduplicated labels, here just 0 and 1
        label2idx = dict(zip(uniqueLabel, list(range(len(uniqueLabel)))))  # here {0: 0, 1: 1}
        self.labelList = list(range(len(uniqueLabel)))

        # save the word-to-index mapping as json so it can be loaded directly at inference time
        with open("../data/wordJson/word2idx.json", "w", encoding="utf-8") as f:
            json.dump(word2idx, f)

        with open("../data/wordJson/label2idx.json", "w", encoding="utf-8") as f:
            json.dump(label2idx, f)

        return word2idx, label2idx

    def _readStopWord(self, stopWordPath):
        """Read the stop-word list."""
        with open(stopWordPath, "r") as f:
            stopWords = f.read()
            stopWordList = stopWords.splitlines()
            # store the stop words in a dict so lookups are fast
            self.stopWordDict = dict(zip(stopWordList, list(range(len(stopWordList)))))

    def dataGen(self):
        """Initialise the training and evaluation sets."""
        # initialise the stop words
        self._readStopWord(self._stopWordSource)

        # read the dataset
        reviews, labels = self._readData(self._dataSource)

        # build the word-to-index mapping and the embedding matrix
        word2idx, label2idx = self._genVocabulary(reviews, labels)

        # convert labels and sentences to indices
        labelIds = self._labelToIndex(labels, label2idx)
        reviewIds = self._wordToIndex(reviews, word2idx)

        # build the training and evaluation sets
        trainReviews, trainLabels, evalReviews, evalLabels = self._genTrainEvalData(
            reviewIds, labelIds, word2idx, self._rate)
        self.trainReviews = trainReviews
        self.trainLabels = trainLabels

        self.evalReviews = evalReviews
        self.evalLabels = evalLabels


# fetch the data from the previous modules
# config = parameter_config.Config()
# data = Dataset(config)
# data.dataGen()
```
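To sanity-check the preprocessing on its own, the commented-out lines at the end can be run as below; the expected shapes assume the 25,000-review file with the default rate=0.8 and sequenceLength=200 (loading word2Vec.bin takes a little while):

```python
import parameter_config
import get_train_data

config = parameter_config.Config()
data = get_train_data.Dataset(config)
data.dataGen()

print(data.trainReviews.shape)   # expected (20000, 200)
print(data.evalReviews.shape)    # expected (5000, 200)
print(data.wordEmbedding.shape)  # (vocab_size, 200), including PAD and UNK
```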
3.3 Model definition: mode_structure.py
```python
import tensorflow as tf
import parameter_config

config = parameter_config.Config()


# 3 Bi-LSTM + Attention model
class BiLSTMAttention(object):
    def __init__(self, config, wordEmbedding):
        # model inputs
        self.inputX = tf.placeholder(tf.int32, [None, config.sequenceLength], name="inputX")
        self.inputY = tf.placeholder(tf.int32, [None], name="inputY")
        self.dropoutKeepProb = tf.placeholder(tf.float32, name="dropoutKeepProb")

        # l2 loss accumulator
        l2Loss = tf.constant(0.0)

        # word embedding layer
        with tf.name_scope("embedding"):
            # initialise the embedding matrix with the pre-trained word vectors
            self.W = tf.Variable(tf.cast(wordEmbedding, dtype=tf.float32, name="word2vec"), name="W")
            # map the input word indices to vectors, shape [batch_size, sequence_length, embedding_size]
            self.embeddedWords = tf.nn.embedding_lookup(self.W, self.inputX)

        # two stacked bidirectional LSTM layers
        with tf.name_scope("Bi-LSTM"):
            for idx, hiddenSize in enumerate(config.model.hiddenSizes):
                with tf.name_scope("Bi-LSTM" + str(idx)):
                    # forward LSTM cell
                    lstmFwCell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                        output_keep_prob=self.dropoutKeepProb)
                    # backward LSTM cell
                    lstmBwCell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                        output_keep_prob=self.dropoutKeepProb)

                    # dynamic rnn handles variable sequence lengths; without explicit lengths the full sequence is used
                    # outputs_ is a tuple (output_fw, output_bw), each of shape [batch_size, max_time, hidden_size]
                    # self.current_state is the final state, a tuple (state_fw, state_bw), each an (h, c) pair
                    outputs_, self.current_state = tf.nn.bidirectional_dynamic_rnn(
                        lstmFwCell, lstmBwCell, self.embeddedWords, dtype=tf.float32,
                        scope="bi-lstm" + str(idx))

                    # concatenate the fw and bw outputs to [batch_size, time_step, hidden_size * 2]
                    # and feed the result into the next Bi-LSTM layer
                    self.embeddedWords = tf.concat(outputs_, 2)

        # split the output of the last Bi-LSTM layer back into the forward and backward parts
        outputs = tf.split(self.embeddedWords, 2, -1)

        # as in the Bi-LSTM + Attention paper, the forward and backward outputs are summed
        with tf.name_scope("Attention"):
            H = outputs[0] + outputs[1]

            # attention output
            output = self.attention(H)
            outputSize = config.model.hiddenSizes[-1]

        # fully connected output layer
        with tf.name_scope("output"):
            outputW = tf.get_variable(
                "outputW",
                shape=[outputSize, config.numClasses],
                initializer=tf.contrib.layers.xavier_initializer())
            outputB = tf.Variable(tf.constant(0.1, shape=[config.numClasses]), name="outputB")
            l2Loss += tf.nn.l2_loss(outputW)
            l2Loss += tf.nn.l2_loss(outputB)
            self.logits = tf.nn.xw_plus_b(output, outputW, outputB, name="logits")
            if config.numClasses == 1:
                self.predictions = tf.cast(tf.greater_equal(self.logits, 0.0), tf.float32, name="predictions")
            elif config.numClasses > 1:
                self.predictions = tf.argmax(self.logits, axis=-1, name="predictions")

        # loss: binary cross-entropy or sparse softmax cross-entropy
        with tf.name_scope("loss"):
            if config.numClasses == 1:
                losses = tf.nn.sigmoid_cross_entropy_with_logits(
                    logits=self.logits,
                    labels=tf.cast(tf.reshape(self.inputY, [-1, 1]), dtype=tf.float32))
            elif config.numClasses > 1:
                losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=self.inputY)

            self.loss = tf.reduce_mean(losses) + config.model.l2RegLambda * l2Loss

    def attention(self, H):
        """Use the attention mechanism to obtain the sentence representation."""
        # number of units in the last LSTM layer
        hiddenSize = config.model.hiddenSizes[-1]

        # trainable attention weight vector
        W = tf.Variable(tf.random_normal([hiddenSize], stddev=0.1))

        # nonlinear transformation of the Bi-LSTM output
        M = tf.tanh(H)

        # multiply M (shape [batch_size, time_step, hidden_size]) by W;
        # M is reshaped to [batch_size * time_step, hidden_size] first
        # newM has shape [batch_size * time_step, 1]: each time step is reduced to a single score
        newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1]))

        # reshape the scores to [batch_size, time_step]
        restoreM = tf.reshape(newM, [-1, config.sequenceLength])

        # normalise with softmax, shape [batch_size, time_step]
        self.alpha = tf.nn.softmax(restoreM)

        # weighted sum of H with the alpha weights, done as a single matrix multiplication
        r = tf.matmul(tf.transpose(H, [0, 2, 1]),
                      tf.reshape(self.alpha, [-1, config.sequenceLength, 1]))

        # squeeze the result to [batch_size, hidden_size]
        sequeezeR = tf.reshape(r, [-1, hiddenSize])

        sentenceRepren = tf.tanh(sequeezeR)

        # apply dropout to the attention output
        output = tf.nn.dropout(sentenceRepren, self.dropoutKeepProb)

        return output
```
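Before wiring the model into the training script, the graph can be built on its own. A minimal sketch, assuming TensorFlow 1.x and a toy random matrix standing in for the real data.wordEmbedding:

```python
import numpy as np
import tensorflow as tf
import parameter_config
import mode_structure

config = parameter_config.Config()

# hypothetical embedding matrix for a vocabulary of 100 words, only for graph construction
dummyEmbedding = np.random.randn(100, config.model.embeddingSize).astype(np.float32)

with tf.Graph().as_default():
    model = mode_structure.BiLSTMAttention(config, dummyEmbedding)
    print(model.inputX)       # expected shape (?, 200)
    print(model.logits)       # expected shape (?, 1) in the binary case
    print(model.predictions)  # logits thresholded at 0 in the binary case
```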
3.4 Model training: mode_trainning.py
```python
import os
import datetime
import numpy as np
import tensorflow as tf
import parameter_config
import get_train_data
import mode_structure

# data from the previous modules
config = parameter_config.Config()
data = get_train_data.Dataset(config)
data.dataGen()


# 4 batch generator
def nextBatch(x, y, batchSize):
    # yield batches of data with a generator
    perm = np.arange(len(x))  # [0 1 2 ... len(x) - 1]
    np.random.shuffle(perm)   # shuffle
    x = x[perm]
    y = y[perm]
    numBatches = len(x) // batchSize

    for i in range(numBatches):
        start = i * batchSize
        end = start + batchSize
        batchX = np.array(x[start: end], dtype="int64")
        batchY = np.array(y[start: end], dtype="float32")
        yield batchX, batchY


# 5 metric functions
"""
Performance metrics
"""
def mean(item: list) -> float:
    """
    Mean of the elements of a list
    :param item: list
    :return:
    """
    res = sum(item) / len(item) if len(item) > 0 else 0
    return res


def accuracy(pred_y, true_y):
    """
    Accuracy for binary and multi-class classification
    :param pred_y: predictions
    :param true_y: ground truth
    :return:
    """
    if isinstance(pred_y[0], list):
        pred_y = [item[0] for item in pred_y]
    corr = 0
    for i in range(len(pred_y)):
        if pred_y[i] == true_y[i]:
            corr += 1
    acc = corr / len(pred_y) if len(pred_y) > 0 else 0
    return acc


def binary_precision(pred_y, true_y, positive=1):
    """
    Precision for binary classification
    :param pred_y: predictions
    :param true_y: ground truth
    :param positive: index of the positive class
    :return:
    """
    corr = 0
    pred_corr = 0
    for i in range(len(pred_y)):
        if pred_y[i] == positive:
            pred_corr += 1
            if pred_y[i] == true_y[i]:
                corr += 1

    prec = corr / pred_corr if pred_corr > 0 else 0
    return prec


def binary_recall(pred_y, true_y, positive=1):
    """
    Recall for binary classification
    :param pred_y: predictions
    :param true_y: ground truth
    :param positive: index of the positive class
    :return:
    """
    corr = 0
    true_corr = 0
    for i in range(len(pred_y)):
        if true_y[i] == positive:
            true_corr += 1
            if pred_y[i] == true_y[i]:
                corr += 1

    rec = corr / true_corr if true_corr > 0 else 0
    return rec


def binary_f_beta(pred_y, true_y, beta=1.0, positive=1):
    """
    F-beta score for binary classification
    :param pred_y: predictions
    :param true_y: ground truth
    :param beta: beta value
    :param positive: index of the positive class
    :return:
    """
    precision = binary_precision(pred_y, true_y, positive)
    recall = binary_recall(pred_y, true_y, positive)
    try:
        f_b = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
    except:
        f_b = 0
    return f_b


def multi_precision(pred_y, true_y, labels):
    """
    Precision for multi-class classification
    :param pred_y: predictions
    :param true_y: ground truth
    :param labels: list of labels
    :return:
    """
    if isinstance(pred_y[0], list):
        pred_y = [item[0] for item in pred_y]

    precisions = [binary_precision(pred_y, true_y, label) for label in labels]
    prec = mean(precisions)
    return prec


def multi_recall(pred_y, true_y, labels):
    """
    Recall for multi-class classification
    :param pred_y: predictions
    :param true_y: ground truth
    :param labels: list of labels
    :return:
    """
    if isinstance(pred_y[0], list):
        pred_y = [item[0] for item in pred_y]

    recalls = [binary_recall(pred_y, true_y, label) for label in labels]
    rec = mean(recalls)
    return rec


def multi_f_beta(pred_y, true_y, labels, beta=1.0):
    """
    F-beta score for multi-class classification
    :param pred_y: predictions
    :param true_y: ground truth
    :param labels: list of labels
    :param beta: beta value
    :return:
    """
    if isinstance(pred_y[0], list):
        pred_y = [item[0] for item in pred_y]

    f_betas = [binary_f_beta(pred_y, true_y, beta, label) for label in labels]
    f_beta = mean(f_betas)
    return f_beta


def get_binary_metrics(pred_y, true_y, f_beta=1.0):
    """
    Metrics for binary classification
    :param pred_y:
    :param true_y:
    :param f_beta:
    :return:
    """
    acc = accuracy(pred_y, true_y)
    recall = binary_recall(pred_y, true_y)
    precision = binary_precision(pred_y, true_y)
    f_beta = binary_f_beta(pred_y, true_y, f_beta)
    return acc, recall, precision, f_beta


def get_multi_metrics(pred_y, true_y, labels, f_beta=1.0):
    """
    Metrics for multi-class classification
    :param pred_y:
    :param true_y:
    :param labels:
    :param f_beta:
    :return:
    """
    acc = accuracy(pred_y, true_y)
    recall = multi_recall(pred_y, true_y, labels)
    precision = multi_precision(pred_y, true_y, labels)
    f_beta = multi_f_beta(pred_y, true_y, labels, f_beta)
    return acc, recall, precision, f_beta


# 6 train the model
# training and evaluation data
trainReviews = data.trainReviews
trainLabels = data.trainLabels
evalReviews = data.evalReviews
evalLabels = data.evalLabels
wordEmbedding = data.wordEmbedding
labelList = data.labelList

# define the computation graph
with tf.Graph().as_default():

    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    session_conf.gpu_options.allow_growth = True
    session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # GPU memory fraction

    sess = tf.Session(config=session_conf)

    # define the session
    with sess.as_default():
        bilstmattention = mode_structure.BiLSTMAttention(config, wordEmbedding)
        globalStep = tf.Variable(0, name="globalStep", trainable=False)
        # optimizer with the configured learning rate
        optimizer = tf.train.AdamOptimizer(config.training.learningRate)
        # compute the gradients, obtaining (gradient, variable) pairs
        gradsAndVars = optimizer.compute_gradients(bilstmattention.loss)
        # apply the gradients to the variables to build the training op
        trainOp = optimizer.apply_gradients(gradsAndVars, global_step=globalStep)

        # summaries for TensorBoard
        gradSummaries = []
        for g, v in gradsAndVars:
            if g is not None:
                tf.summary.histogram("{}/grad/hist".format(v.name), g)
                tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))

        outDir = os.path.abspath(os.path.join(os.path.curdir, "summarys"))
        print("Writing to {}\n".format(outDir))

        lossSummary = tf.summary.scalar("loss", bilstmattention.loss)
        summaryOp = tf.summary.merge_all()

        trainSummaryDir = os.path.join(outDir, "train")
        trainSummaryWriter = tf.summary.FileWriter(trainSummaryDir, sess.graph)

        evalSummaryDir = os.path.join(outDir, "eval")
        evalSummaryWriter = tf.summary.FileWriter(evalSummaryDir, sess.graph)

        # saver for checkpoint files
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

        # one way of saving the model: export a SavedModel (pb file)
        savedModelPath = "../model/bilstm-atten/savedModel"
        if os.path.exists(savedModelPath):
            os.rmdir(savedModelPath)
        builder = tf.saved_model.builder.SavedModelBuilder(savedModelPath)

        sess.run(tf.global_variables_initializer())

        def trainStep(batchX, batchY):
            """
            One training step
            """
            feed_dict = {
                bilstmattention.inputX: batchX,
                bilstmattention.inputY: batchY,
                bilstmattention.dropoutKeepProb: config.model.dropoutKeepProb
            }
            _, summary, step, loss, predictions = sess.run(
                [trainOp, summaryOp, globalStep, bilstmattention.loss, bilstmattention.predictions],
                feed_dict)
            timeStr = datetime.datetime.now().isoformat()

            # note: get_*_metrics returns (acc, recall, precision, f_beta)
            if config.numClasses == 1:
                acc, recall, prec, f_beta = get_binary_metrics(pred_y=predictions, true_y=batchY)
            elif config.numClasses > 1:
                acc, recall, prec, f_beta = get_multi_metrics(pred_y=predictions, true_y=batchY,
                                                              labels=labelList)

            trainSummaryWriter.add_summary(summary, step)

            return loss, acc, prec, recall, f_beta

        def devStep(batchX, batchY):
            """
            One evaluation step
            """
            feed_dict = {
                bilstmattention.inputX: batchX,
                bilstmattention.inputY: batchY,
                bilstmattention.dropoutKeepProb: 1.0
            }
            summary, step, loss, predictions = sess.run(
                [summaryOp, globalStep, bilstmattention.loss, bilstmattention.predictions],
                feed_dict)

            # note: get_*_metrics returns (acc, recall, precision, f_beta); unpack in that order
            if config.numClasses == 1:
                acc, recall, precision, f_beta = get_binary_metrics(pred_y=predictions, true_y=batchY)
            elif config.numClasses > 1:
                acc, recall, precision, f_beta = get_multi_metrics(pred_y=predictions, true_y=batchY,
                                                                   labels=labelList)

            evalSummaryWriter.add_summary(summary, step)

            return loss, acc, precision, recall, f_beta

        for i in range(config.training.epoches):
            # train the model
            print("start training model")
            for batchTrain in nextBatch(trainReviews, trainLabels, config.batchSize):
                loss, acc, prec, recall, f_beta = trainStep(batchTrain[0], batchTrain[1])

                currentStep = tf.train.global_step(sess, globalStep)
                print("train: step: {}, loss: {}, acc: {}, recall: {}, precision: {}, f_beta: {}".format(
                    currentStep, loss, acc, recall, prec, f_beta))
                if currentStep % config.training.evaluateEvery == 0:
                    print("\nEvaluation:")

                    losses = []
                    accs = []
                    f_betas = []
                    precisions = []
                    recalls = []

                    for batchEval in nextBatch(evalReviews, evalLabels, config.batchSize):
                        loss, acc, precision, recall, f_beta = devStep(batchEval[0], batchEval[1])
                        losses.append(loss)
                        accs.append(acc)
                        f_betas.append(f_beta)
                        precisions.append(precision)
                        recalls.append(recall)

                    time_str = datetime.datetime.now().isoformat()
                    print("{}, step: {}, loss: {}, acc: {}, precision: {}, recall: {}, f_beta: {}".format(
                        time_str, currentStep, mean(losses), mean(accs), mean(precisions),
                        mean(recalls), mean(f_betas)))

                if currentStep % config.training.checkpointEvery == 0:
                    # the other way of saving the model: checkpoint files
                    path = saver.save(sess, "../model/bilstm-atten/model/my-model", global_step=currentStep)
                    print("Saved model checkpoint to {}\n".format(path))

        inputs = {"inputX": tf.saved_model.utils.build_tensor_info(bilstmattention.inputX),
                  "keepProb": tf.saved_model.utils.build_tensor_info(bilstmattention.dropoutKeepProb)}

        outputs = {"predictions": tf.saved_model.utils.build_tensor_info(bilstmattention.predictions)}

        prediction_signature = tf.saved_model.signature_def_utils.build_signature_def(
            inputs=inputs, outputs=outputs,
            method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
        legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
        builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
                                             signature_def_map={"predict": prediction_signature},
                                             legacy_init_op=legacy_init_op)

        builder.save()
```
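Running this script produces three kinds of output: TensorBoard summaries under ./summarys, checkpoint files under ../model/bilstm-atten/model/ (these are what predict.py in section 3.5 restores), and a SavedModel export under ../model/bilstm-atten/savedModel intended for serving.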
3.5 Prediction: predict.py
```python
# Author: yifan
import os
import csv
import time
import datetime
import random
import json
from collections import Counter
from math import sqrt

import gensim
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
import parameter_config

config = parameter_config.Config()

# 7 prediction code
x = "this movie is full of references like mad max ii the wild one and many others the ladybug´s face it´s a clear reference or tribute to peter lorre this movie is a masterpiece we´ll talk much more about in the future"
# x = "his movie is the same as the third level movie. There's no place to look good"
# x = "This film is not good"   # predicted as 0
# x = "This film is bad"        # predicted as 0

# note: the two mappings below must match the dictionaries used by the loaded model
with open("../data/wordJson/word2idx.json", "r", encoding="utf-8") as f:
    word2idx = json.load(f)

with open("../data/wordJson/label2idx.json", "r", encoding="utf-8") as f:
    label2idx = json.load(f)
idx2label = {value: key for key, value in label2idx.items()}

# convert the sentence to indices and pad/truncate to sequenceLength
xIds = [word2idx.get(item, word2idx["UNK"]) for item in x.split(" ")]
if len(xIds) >= config.sequenceLength:
    xIds = xIds[:config.sequenceLength]
else:
    xIds = xIds + [word2idx["PAD"]] * (config.sequenceLength - len(xIds))

graph = tf.Graph()
with graph.as_default():
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False, gpu_options=gpu_options)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # restore the latest checkpoint saved by the training script
        checkpoint_file = tf.train.latest_checkpoint("../model/bilstm-atten/model/")
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)

        # input tensors that have to be fed to the model
        inputX = graph.get_operation_by_name("inputX").outputs[0]
        dropoutKeepProb = graph.get_operation_by_name("dropoutKeepProb").outputs[0]

        # output tensor
        predictions = graph.get_tensor_by_name("output/predictions:0")

        pred = sess.run(predictions, feed_dict={inputX: [xIds], dropoutKeepProb: 1.0})[0]
        # print(pred)

pred = [idx2label[item] for item in pred]
print(pred)
```
Result
The full code is available at: https://github.com/yifanhunter/NLP_textClassifier