In this post we use an LSTM-based RNN to predict whether the sentiment of a movie review is positive or negative; the complete code is walked through below.
Overall process:
Because the vocabulary is large, one-hot encoding would be far too inefficient, so we first use word embeddings to reduce the dimensionality of the input data. The embeddings could be trained with word2vec, but here we simply create an embedding layer and let the network learn the embedding table itself.
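To see why this is efficient, note that looking up a word's embedding is mathematically the same as multiplying its one-hot vector by the embedding matrix, just without the wasted multiplications. A minimal NumPy sketch with toy sizes (the real vocabulary and embed_size are much larger):

import numpy as np

vocab_size, embed_size = 5, 3                 # toy sizes for illustration
embedding = np.random.uniform(-1, 1, (vocab_size, embed_size))

word_id = 2
one_hot = np.eye(vocab_size)[word_id]         # [0., 0., 1., 0., 0.]

via_matmul = one_hot @ embedding              # multiplying by a one-hot vector just selects row 2
via_lookup = embedding[word_id]               # an embedding lookup returns the same row directly

assert np.allclose(via_matmul, via_lookup)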
We look up the low-dimensional representation of each word in the training data from the embedding layer and pass it into the LSTM cells. This adds recurrent connections to the network, so the network can use the sequence information in the data. The LSTM output is then fed into a sigmoid layer; we use a sigmoid because we are trying to predict whether the text is positive or negative.
The network structure is shown below:
Because each review's label is "positive" or "negative", we only need the sigmoid layer's output at the last time step and can ignore all earlier outputs; the cost is computed from this final output and the training label.
Now let's look at the code.
Import libraries
import numpy as np
import tensorflow as tf
Read the data
with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()
Data preprocessing
from string import punctuation

# Remove punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
reviews = all_text.split('\n')
all_text = ' '.join(reviews)
words = all_text.split()
Encode the reviews and labels
# Build a dictionary that maps words to integers
from collections import Counter
counter = Counter(words)
vocab_sorted = sorted(counter, key=counter.get, reverse=True)
vocab_to_int = {word: num for num, word in enumerate(vocab_sorted, 1)}

# Convert the reviews to integers
reviews_ints = []
for review in reviews:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

# Convert the labels 'positive' and 'negative' to 1 and 0
labels = labels.split('\n')
labels = np.array([1 if each == 'positive' else 0 for each in labels])

# Remove zero-length reviews and their labels
non_zero_index = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
reviews_ints = [reviews_ints[ii] for ii in non_zero_index]
labels = np.array([labels[ii] for ii in non_zero_index])
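A quick sanity check here can catch encoding mistakes early; a sketch (the printed values depend on the actual dataset):

print("Unique words:", len(vocab_to_int))
print("Reviews:", len(reviews_ints), " Labels:", len(labels))
print("First review (first 10 word ids):", reviews_ints[0][:10])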
At this point, the reviews and labels have all been converted to integers, which are much easier to work with.
Now we create the features array that will be passed to the network; for convenience we fix the feature-vector length at 200. Reviews shorter than 200 words are left-padded with zeros: if a review is ['best', 'movie', 'ever'] (mapped to the integers [117, 18, 128]), the corresponding row will be [0, 0, 0, ..., 0, 117, 18, 128]. For reviews longer than 200 words, we use the first 200 words as the feature vector.
seq_len = 200
features = np.array([review[:seq_len] if len(review) > seq_len
                     else [0] * (seq_len - len(review)) + review
                     for review in reviews_ints])
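As a sanity check, every row of features should now be exactly seq_len long; for instance (the printed ids depend on the dataset):

assert features.shape[1] == seq_len
print(features[0, -10:])   # last 10 word ids of the first (left-padded) review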
Create the training, validation, and test sets
split_frac = 0.8
split_idx = int(len(features) * split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

val_idx = int(len(val_x) * 0.5)
val_x, test_x = val_x[:val_idx], val_x[val_idx:]
val_y, test_y = val_y[:val_idx], val_y[val_idx:]
Build the graph
First, define the hyperparameters:
lstm_size = 256
lstm_layers = 1
batch_size = 1000
learning_rate = 0.01
where:
- lstm_size: the number of units in the hidden layers inside the LSTM cell. An LSTM cell actually contains four different network layers, and this is the number of units in each of them.
- lstm_layers: the number of stacked LSTM layers
- batch_size: the number of reviews passed through the network in a single training step
- learning_rate: the learning rate
Define the placeholders and the embedding layer
n_words = len(vocab_to_int) + 1  # +1 because word ids start at 1, not 0

graph = tf.Graph()
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, (batch_size, seq_len), name='inputs')
    labels_ = tf.placeholder(tf.int32, (batch_size, 1), name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

embed_size = 300
with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)  # shape: (batch_size, seq_len, embed_size)
LSTM layers
In TensorFlow, tf.contrib.rnn.BasicLSTMCell makes it easy to create an LSTM cell; basic usage is covered in the official documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell
Calling tf.contrib.rnn.BasicLSTMCell(num_units) creates a cell whose hidden layers each contain num_units units.
Next, tf.contrib.rnn.DropoutWrapper can be used to add dropout to the LSTM cell.
For example, drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob) wraps the cell named cell with dropout on its outputs.
Stacking multiple LSTM layers usually gives the model better performance. How do we use multiple LSTM layers? TensorFlow makes this easy as well.
For example, cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers) creates a stack of lstm_layers LSTM layers, each with the same structure as drop (a basic LSTM cell wrapped with dropout).
with graph.as_default():
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    initial_state = cell.zero_state(batch_size, tf.float32)
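Note that [drop] * lstm_layers reuses the same cell object for every layer; this works in TensorFlow 1.0 but raises an error in later 1.x releases, which require a distinct cell per layer. A minimal sketch of the per-layer construction (same behavior, assuming the hyperparameters above):

def build_cell(lstm_size, keep_prob):
    # Each layer gets its own cell instance instead of sharing one
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

with graph.as_default():
    cell = tf.contrib.rnn.MultiRNNCell(
        [build_cell(lstm_size, keep_prob) for _ in range(lstm_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)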
Forward pass
For the forward pass, we use tf.nn.dynamic_rnn to run the LSTM layers.
Basic usage: outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state), where cell is the stacked LSTM cell defined above.
with graph.as_default():
outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
Output
outputs holds one output per time step, with shape (batch_size, seq_len, lstm_size). Since we only care about the final output, we take outputs[:, -1] and compute the cost from it.
with graph.as_default():
    # A single sigmoid unit on the last time step's output gives the prediction
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999).minimize(cost)
Computing accuracy
We compute the accuracy on the validation set.
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
Get training batches
Here we use a generator to produce the training batches.
def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
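As a quick usage check (a sketch; the expected shapes follow from batch_size = 1000 and seq_len = 200 above):

x, y = next(get_batches(train_x, train_y, batch_size))
print(x.shape, y.shape)   # expect (1000, 200) and (1000,)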
Training
epochs = 3

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            if iteration % 5 == 0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration % 25 == 0:
                # Evaluate on the validation set with dropout disabled (keep_prob = 1)
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration += 1
    saver.save(sess, "checkpoints/sentiment.ckpt")
Result:
Epoch: 0/3 Iteration: 5 Train loss: 0.352
Epoch: 0/3 Iteration: 10 Train loss: 0.252
Epoch: 0/3 Iteration: 15 Train loss: 0.235
Epoch: 0/3 Iteration: 20 Train loss: 0.197
Epoch: 0/3 Iteration: 25 Train loss: 0.186
Val acc: 0.720
Epoch: 0/3 Iteration: 30 Train loss: 0.228
Epoch: 0/3 Iteration: 35 Train loss: 0.204
Epoch: 0/3 Iteration: 40 Train loss: 0.199
Epoch: 1/3 Iteration: 45 Train loss: 0.179
Epoch: 1/3 Iteration: 50 Train loss: 0.105
Val acc: 0.846
Epoch: 1/3 Iteration: 55 Train loss: 0.078
Epoch: 1/3 Iteration: 60 Train loss: 0.028
Epoch: 1/3 Iteration: 65 Train loss: 0.015
Epoch: 1/3 Iteration: 70 Train loss: 0.010
Epoch: 1/3 Iteration: 75 Train loss: 0.008
Val acc: 0.506
Epoch: 1/3 Iteration: 80 Train loss: 0.008
Epoch: 2/3 Iteration: 85 Train loss: 0.429
Epoch: 2/3 Iteration: 90 Train loss: 0.223
Epoch: 2/3 Iteration: 95 Train loss: 0.156
Epoch: 2/3 Iteration: 100 Train loss: 0.138
Val acc: 0.534
Epoch: 2/3 Iteration: 105 Train loss: 0.114
Epoch: 2/3 Iteration: 110 Train loss: 0.097
Epoch: 2/3 Iteration: 115 Train loss: 0.035
Epoch: 2/3 Iteration: 120 Train loss: 0.032
Run on the test set
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))
Result: