A Summary of Text Topic Classification with Keras
1. Task Overview
The task is to predict the category of a piece of input text. Due to time constraints, experiments were run only on the 20 Newsgroups dataset.
Below is a brief introduction to the 20 Newsgroups dataset:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale
talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, alt.atheism, soc.religion.christian
2. Data Processing
2.1 Preprocessing
The usual processing order is: strip punctuation from the text, remove stopwords, and normalize case. Here I use the already-processed text provided at http://web.ist.utl.pt/~acardoso/datasets/:
- 1. all-terms: obtained from the original datasets by applying the following transformations:
  1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
  2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
  3. Turn all letters to lowercase.
  4. Substitute multiple SPACES by a single SPACE.
  5. The title/subject of each document is simply added in the beginning of the document's text.
- 2. no-short: obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
- 3. no-stop: obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
- 4. stemmed: obtained from the previous file, by applying Porter's Stemmer to the remaining words.
The category labels are then converted into integer labels.
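For reference, a minimal sketch of cleanup steps 1-4 above (my own illustration, not the author's code; the experiments simply use the already-processed files from the site):

import re

def clean_text(text):
    # Steps 1, 2 and 4 collapse into one regex: any run of non-letter
    # characters (tabs, newlines, punctuation, digits, ...) becomes one space.
    text = re.sub(r'[^a-zA-Z]+', ' ', text)
    # Step 3: lowercase everything.
    return text.lower().strip()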
2.2 Text to Features
Two different featurization methods are used, one for each of the two models (DNN and LSTM):
a) Words to word indices
A word list is collected from the training corpus and sorted by frequency in descending order, with indices starting from 0; every word in a sentence is then replaced by its index.
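A minimal sketch of this step, written as a hypothetical re-implementation of the LoadOriData.Process helper used in the DNN script below (the real module is not shown; the "<label>\t<text>" line format and the one-hot label encoding are assumptions based on the notes in section 3):

import numpy as np
from collections import Counter

def Process(path, nb_words=1000, nb_classes=20):
    # Assumed line format of the pre-processed files: "<label>\t<text>"
    labels, docs = [], []
    for line in open(path):
        label, text = line.strip().split('\t', 1)
        labels.append(label)
        docs.append(text.split())
    # Vocabulary sorted by descending frequency, indices starting at 0
    counts = Counter(w for doc in docs for w in doc)
    vocab = {w: i for i, (w, c) in enumerate(counts.most_common(nb_words))}
    X = [[vocab[w] for w in doc if w in vocab] for doc in docs]
    # One-hot (0/1) encoding of the class labels
    label_ids = {name: i for i, name in enumerate(sorted(set(labels)))}
    Y = np.zeros((len(labels), nb_classes))
    for row, name in enumerate(labels):
        Y[row, label_ids[name]] = 1
    return X, Y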
b) Words to word vectors
Google's word2vec tool is used to generate a word vector for each word from the training text. Note that the vocabulary this tool produces contains an entry named </s>, which comes from newline/carriage-return characters; simply ignore it. When an out-of-vocabulary word appears in the test corpus, its vector is filled with all zeros.
In this experiment the vector size is set to 48, i.e. each word is converted into a 48-dimensional word vector.
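A minimal sketch of turning the word2vec text output into the (samples, 100, 48) tensors fed to the GRU model below; the helper names and how train.pkl/test.pkl are actually pickled are assumptions:

import numpy as np

def load_vectors(path, dim=48):
    # word2vec text format: a "<vocab_size> <dim>" header, then one
    # "<word> v1 ... v48" line per word; skip the special </s> entry.
    vectors = {}
    with open(path) as f:
        f.readline()
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] == '</s>':
                continue
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype='float32')
    return vectors

def docs_to_tensor(docs, vectors, maxlen=100, dim=48):
    # Keep the first maxlen words of each document; out-of-vocabulary words
    # and padding positions stay as all-zero vectors.
    X = np.zeros((len(docs), maxlen, dim), dtype='float32')
    for i, doc in enumerate(docs):
        for t, word in enumerate(doc[:maxlen]):
            if word in vectors:
                X[i, t] = vectors[word]
    return X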
3. Experiments
The experiment code for the two models, the DNN and the LSTM variant GRU, is listed below.
a) DNN
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.preprocessing.text import Tokenizer
import pickle
import LoadOriData
batch_size = 32
maxlen = 100
max_features = 1000
print("Loading data...")
X_train, Y_train = LoadOriData.Process('20ng-train-stemmed.txt', nb_words=max_features)
X_test, Y_test = LoadOriData.Process('20ng-test-stemmed.txt', nb_words=max_features)
print(len(X_train), 'train sequences')
tokenizer = Tokenizer(nb_words=max_features)
X_train = tokenizer.sequences_to_matrix(X_train, mode="binary")
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
print('X_train shape:', X_train.shape)
#print('Y_train shape:', Y_train.shape)
print('Build model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_features,), activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")
json_string = model.to_json()
print(json_string)
f = open('20mlp_model.txt', 'w')
f.write(json_string)
f.close()
print("Train...")
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=5, show_accuracy=True)
model.save_weights('20mlp_weights.h5', overwrite=True)
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Notes:
- tokenizer.sequences_to_matrix converts each sequence of word indices into a 0/1 vector. This experiment uses max_features = 1000, i.e. only the presence or absence of each of the top-1000 words is recorded, so the input layer size is 1000.
- For convenience, the targets (category labels) were already converted to one-hot 0/1 vectors during preprocessing, so they need no further processing here. Keras also ships a utility for this, keras.utils.np_utils: if y_test holds integer class labels, Y_test = np_utils.to_categorical(y_test, nb_classes) produces the one-hot encoding.
- The DNN model structure is 1000*512*20 with 50% dropout. Note that the last layer uses a softmax activation, the loss is categorical_crossentropy (the cross-entropy for class prediction), and class_mode is set to categorical.
b) GRU
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
import pickle
import os
batch_size = 32
weights_file = '20lstm_weights.h5'
print("Loading data...")
f = open('train.pkl', 'rb')
X_train, Y_train = pickle.load(f)
f.close()
print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print('Build model...')
model = Sequential()
model.add(GRU(output_dim=128, input_dim=48, activation='tanh', inner_activation='hard_sigmoid', input_length=100)) # try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")
json_string = model.to_json()
print(json_string)
print("Train...")
if os.path.exists(weights_file):
    model.load_weights(weights_file)
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=4, show_accuracy=True)
model.save_weights(weights_file, overwrite=True)
f = open('test.pkl', 'rb')
X_test, Y_test = pickle.load(f)
f.close()
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Notes:
- train.pkl and test.pkl are generated from the word2vec output by keeping the first 100 words of each document to build X and Y. X uses 48-dimensional vectors with a sequence length of 100, and out-of-vocabulary words are filled with all zeros; Y is 20-dimensional (the number of classes), with the dimension of the true class set to 1 and all others set to 0.
- model.evaluate directly computes the loss and the accuracy (the latter is only meaningful for classification tasks).
- The LSTM weights consist of 12 parameters: U_c, U_f, U_i, U_o, W_c, W_f, W_i, W_o, b_c, b_f, b_i, b_o; use eval() to inspect their values. The GRU weights are U_h, U_r, U_z, W_h, W_r, W_z, b_h, b_r, b_z (see the sketch after these notes).
- For the LSTM and GRU formulas, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/ . Note that the W matrix in that article corresponds to Keras's U and W matrices combined; the article effectively merges the two terms of the formula into one.
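A small sketch of inspecting these parameters on the trained GRU layer (assuming the Keras 0.x attribute names listed above; eval() or get_value() turns the Theano shared variables into numpy arrays):

gru = model.layers[0]
# Update-gate parameters: input weights, recurrent weights and bias
print(gru.W_z.eval().shape)   # (48, 128)
print(gru.U_z.eval().shape)   # (128, 128)
print(gru.b_z.eval().shape)   # (128,)
# Or grab every parameter of the layer as a numpy array in one go
for w in gru.get_weights():
    print(w.shape)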
4. Results
4.1 DNN
After 5 epochs: 92% accuracy on the training set, 70% on the test set.
In [156]: run 20ng_mlp.py
Loading data...
11293 train sequences
X_train shape: (11293, 1000)
Build model...
{"layers": [{"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "input_shape": [1000], "init": "glorot_uniform", "activation": "tanh", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 512}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}
Train...
Epoch 1/5
11293/11293 [==============================] - 9s - loss: 1.4124 - acc: 0.6726
Epoch 2/5
11293/11293 [==============================] - 9s - loss: 0.6080 - acc: 0.8343
Epoch 3/5
11293/11293 [==============================] - 10s - loss: 0.4347 - acc: 0.8773
Epoch 4/5
11293/11293 [==============================] - 9s - loss: 0.3388 - acc: 0.9059
Epoch 5/5
11293/11293 [==============================] - 9s - loss: 0.2772 - acc: 0.9212
7528/7528 [==============================] - 1s
Test score: 1.12070538341
Test accuracy: 0.701248671626
4.2 GRU
Trained for 24 epochs in total (the first 20 epochs were run earlier; only the log of the last 4 epochs is shown here).
Training-set accuracy 86%, test-set accuracy 75%.
In [1]: run 20ng_lstm.py
Loading data...
X_train shape: (11293, 100, 48)
Y_train shape: (11293, 20)
Build model...
/usr/local/lib/python2.7/dist-packages/Theano-0.7.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
from scan_perform.scan_perform import *
{"layers": [{"truncate_gradient": -1, "name": "GRU", "inner_activation": "hard_sigmoid", "output_dim": 128, "input_shape": [100, 48], "init": "glorot_uniform", "inner_init": "orthogonal", "input_dim": 48, "return_sequences": false, "activation": "tanh", "input_length": 100}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}
Train...
Epoch 1/4
11293/11293 [==============================] - 124s - loss: 0.4315 - acc: 0.8681
Epoch 2/4
   56/11293 [=========================>....] - ETA: 15s - loss: 0.4299 - acc: 0.8693
Epoch 3/4
11293/11293 [==============================] - 118s - loss: 0.4081 - acc: 0.8756
Epoch 4/4
11293/11293 [==============================] - 130s - loss: 0.3950 - acc: 0.8837
7528/7528 [==============================] - 21s
Test score: 0.863923724031
Test accuracy: 0.758368756642
4.3 Summary
1. Each LSTM/GRU epoch takes roughly 11 times as long as a DNN epoch, and more epochs are needed, but the test-set accuracy beats the DNN's.
2. Increasing the LSTM/GRU window length (the sequence length) improves the per-epoch accuracy but lengthens the running time.
3. GRU performs better than LSTM: higher accuracy and faster training. Due to time constraints no logs were kept from those runs, but GRU consistently came out ahead of LSTM. For an introduction to GRU see http://colah.github.io/posts/2015-08-Understanding-LSTMs/
5. Miscellaneous
- How to inspect the output of an intermediate layer:
Taking the DNN model as an example:
model2 = Sequential()
model2.add(Dense(512, input_shape=(max_features,), activation='tanh', weights = model.layers[0].get_weights()))
model2.compile(loss='categorical_crossentropy', optimizer='adam', class_mode = "categorical")
Then TT = model2.predict(X_test, batch_size=...) gives the output after the first layer.
- Applying Dropout, taking a rate of 0.5 as an example (a toy sketch follows below):
a) During training, each neuron is disabled with probability 1/2.
b) At prediction time nothing is dropped; the output (W*X+b) is scaled by the keep probability 1 - p, which for p = 0.5 coincides with multiplying by the dropout rate.
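A toy numpy sketch of this behaviour (an illustration only, not Keras internals):

import numpy as np

p = 0.5                                 # dropout rate
h = np.array([1.0, 2.0, 3.0, 4.0])      # some layer activations

# Training: each unit is kept with probability 1 - p, dropped units become 0
mask = np.random.binomial(1, 1.0 - p, size=h.shape)
h_train = h * mask

# Prediction: nothing is dropped; activations are scaled by the keep
# probability 1 - p (with p = 0.5 this is simply multiplying by 0.5)
h_test = h * (1.0 - p)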
- The model structure can be saved with model.to_json() and then writing the resulting string to a file.
- The model weights can be saved with model.save_weights('20mlp_weights.h5', overwrite=True), which writes an HDF5 file (a small restore sketch follows below).
Alternatively, model.get_weights() returns the weights of every layer that actually has parameters (a Dropout layer has none). For the DNN, the result is an array of length 4: array[0] and array[1] are the W and b of the 1000*512 layer, and array[2] and array[3] are the W and b of the 512*20 layer.
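Putting the two together, a saved model can later be restored roughly like this (a sketch; it assumes your Keras version provides keras.models.model_from_json as the counterpart of to_json):

from keras.models import model_from_json

# Rebuild the architecture from the saved JSON string, then load the weights
model = model_from_json(open('20mlp_model.txt').read())
model.load_weights('20mlp_weights.h5')
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode='categorical')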