A Summary of Text Topic Classification with Keras
1. Task Overview
The task is to predict the category of a piece of input text. Due to time constraints, experiments were run only on the 20 Newsgroups dataset.
Below is a brief introduction to the 20 Newsgroups dataset:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale
talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, alt.atheism, soc.religion.christian
2. Data Processing
2.1 Preprocessing
The usual processing order is: strip punctuation from the text, remove stopwords, and normalize case. Here I use the already-processed text provided at http://web.ist.utl.pt/~acardoso/datasets/:
- 1. all-terms: obtained from the original datasets by applying the following transformations:
  1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
  2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
  3. Turn all letters to lowercase.
  4. Substitute multiple SPACES by a single SPACE.
  5. The title/subject of each document is simply added in the beginning of the document's text.
- 2. no-short: obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
- 3. no-stop: obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
- 4. stemmed: obtained from the previous file, by applying Porter's Stemmer to the remaining words.
The category labels are then converted into integer labels.
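For reference, a minimal sketch of cleanup steps 1-4 above (my own illustration, not the author's code; the experiments simply use the already-processed files from the site):

import re

def clean_text(text):
    # Steps 1, 2 and 4 collapse into one regex: any run of non-letter
    # characters (tabs, newlines, punctuation, digits, ...) becomes one space.
    text = re.sub(r'[^a-zA-Z]+', ' ', text)
    # Step 3: lowercase everything.
    return text.lower().strip()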
2.2 Text to Features
Two different featurization methods are used, one for each of the two models (DNN and LSTM):
a) Words to word indices
A word list is collected from the training corpus and sorted by frequency in descending order, with indices starting from 0; every word in a sentence is then replaced by its index.
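A minimal sketch of this step, written as a hypothetical re-implementation of the LoadOriData.Process helper used in the DNN script below (the real module is not shown; the "<label>\t<text>" line format and the one-hot label encoding are assumptions based on the notes in section 3):

import numpy as np
from collections import Counter

def Process(path, nb_words=1000, nb_classes=20):
    # Assumed line format of the pre-processed files: "<label>\t<text>"
    labels, docs = [], []
    for line in open(path):
        label, text = line.strip().split('\t', 1)
        labels.append(label)
        docs.append(text.split())
    # Vocabulary sorted by descending frequency, indices starting at 0
    counts = Counter(w for doc in docs for w in doc)
    vocab = {w: i for i, (w, c) in enumerate(counts.most_common(nb_words))}
    X = [[vocab[w] for w in doc if w in vocab] for doc in docs]
    # One-hot (0/1) encoding of the class labels
    label_ids = {name: i for i, name in enumerate(sorted(set(labels)))}
    Y = np.zeros((len(labels), nb_classes))
    for row, name in enumerate(labels):
        Y[row, label_ids[name]] = 1
    return X, Y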
b) Words to word vectors
Google's word2vec tool is used to generate a word vector for each word from the training text. Note that the vocabulary this tool produces contains an entry named </s>, which comes from newline/carriage-return characters; simply ignore it. When an out-of-vocabulary word appears in the test corpus, its vector is filled with all zeros.
In this experiment the vector size is set to 48, i.e. each word is converted into a 48-dimensional word vector.
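A minimal sketch of turning the word2vec text output into the (samples, 100, 48) tensors fed to the GRU model below; the helper names and how train.pkl/test.pkl are actually pickled are assumptions:

import numpy as np

def load_vectors(path, dim=48):
    # word2vec text format: a "<vocab_size> <dim>" header, then one
    # "<word> v1 ... v48" line per word; skip the special </s> entry.
    vectors = {}
    with open(path) as f:
        f.readline()
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] == '</s>':
                continue
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype='float32')
    return vectors

def docs_to_tensor(docs, vectors, maxlen=100, dim=48):
    # Keep the first maxlen words of each document; out-of-vocabulary words
    # and padding positions stay as all-zero vectors.
    X = np.zeros((len(docs), maxlen, dim), dtype='float32')
    for i, doc in enumerate(docs):
        for t, word in enumerate(doc[:maxlen]):
            if word in vectors:
                X[i, t] = vectors[word]
    return X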
3. Experiments
The experiment code for the two models, the DNN and the LSTM variant GRU, is listed below.
a) DNN
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.preprocessing.text import Tokenizer
import pickle
import LoadOriData
batch_size = 32
maxlen = 100
max_features = 1000
print("Loading data...")
X_train, Y_train = LoadOriData.Process('20ng-train-stemmed.txt', nb_words=max_features)
X_test, Y_test = LoadOriData.Process('20ng-test-stemmed.txt', nb_words=max_features)
print(len(X_train), 'train sequences')
tokenizer = Tokenizer(nb_words=max_features)
X_train = tokenizer.sequences_to_matrix(X_train, mode="binary")
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
print('X_train shape:', X_train.shape)
#print('Y_train shape:', Y_train.shape)
print('Build model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_features,), activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")
json_string = model.to_json()
print(json_string)
f = open('20mlp_model.txt', 'w')
f.write(json_string)
f.close()
print("Train...")
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=5, show_accuracy=True)
model.save_weights('20mlp_weights.h5', overwrite=True)
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Notes:
- tokenizer.sequences_to_matrix converts each sequence of word indices into a 0/1 vector. This experiment uses max_features = 1000, i.e. only the presence or absence of each of the top-1000 words is recorded, so the input layer size is 1000.
- For convenience, the targets (category labels) were already converted to one-hot 0/1 vectors during preprocessing, so they need no further processing here. Keras also ships a utility for this, keras.utils.np_utils: if y_test holds integer class labels, Y_test = np_utils.to_categorical(y_test, nb_classes) produces the one-hot encoding.
- The DNN model structure is 1000*512*20 with 50% dropout. Note that the last layer uses a softmax activation, the loss is categorical_crossentropy (the cross-entropy for class prediction), and class_mode is set to categorical.
b) GRU
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
import pickle
import os
batch_size = 32
weights_file = '20lstm_weights.h5'
print("Loading data...")
f = open('train.pkl', 'rb')
X_train, Y_train = pickle.load(f)
f.close()
print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print('Build model...')
model = Sequential()
model.add(GRU(output_dim=128, input_dim=48, activation='tanh', inner_activation='hard_sigmoid', input_length=100)) # try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")
json_string = model.to_json()
print(json_string)
print("Train...")
if os.path.exists(weights_file):
    model.load_weights(weights_file)
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=4, show_accuracy=True)
model.save_weights(weights_file, overwrite=True)
f = open('test.pkl', 'rb')
X_test, Y_test = pickle.load(f)
f.close()
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Notes:
- train.pkl and test.pkl are generated from the word2vec output by keeping the first 100 words of each document to build X and Y. X uses 48-dimensional vectors with a sequence length of 100, and out-of-vocabulary words are filled with all zeros; Y is 20-dimensional (the number of classes), with the dimension of the true class set to 1 and all others set to 0.
- model.evaluate directly computes the loss and the accuracy (the latter is only meaningful for classification tasks).
- The LSTM weights consist of 12 parameters: U_c, U_f, U_i, U_o, W_c, W_f, W_i, W_o, b_c, b_f, b_i, b_o; use eval() to inspect their values. The GRU weights are U_h, U_r, U_z, W_h, W_r, W_z, b_h, b_r, b_z (see the sketch after these notes).
- For the LSTM and GRU formulas, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/ . Note that the W matrix in that article corresponds to Keras's U and W matrices combined; the article effectively merges the two terms of the formula into one.
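A small sketch of inspecting these parameters on the trained GRU layer (assuming the Keras 0.x attribute names listed above; eval() or get_value() turns the Theano shared variables into numpy arrays):

gru = model.layers[0]
# Update-gate parameters: input weights, recurrent weights and bias
print(gru.W_z.eval().shape)   # (48, 128)
print(gru.U_z.eval().shape)   # (128, 128)
print(gru.b_z.eval().shape)   # (128,)
# Or grab every parameter of the layer as a numpy array in one go
for w in gru.get_weights():
    print(w.shape)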
4. Results
4.1 DNN
After 5 epochs: 92% accuracy on the training set, 70% on the test set.
In [156]: run 20ng_mlp.py
Loading data...
11293 train sequences
X_train shape: (11293, 1000)
Build model...
{"layers": [{"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "input_shape": [1000], "init": "glorot_uniform", "activation": "tanh", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 512}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}
Train...
Epoch 1/5
11293/11293 [==============================] - 9s - loss: 1.4124 - acc: 0.6726
Epoch 2/5
11293/11293 [==============================] - 9s - loss: 0.6080 - acc: 0.8343
Epoch 3/5
11293/11293 [==============================] - 10s - loss: 0.4347 - acc: 0.8773
Epoch 4/5
11293/11293 [==============================] - 9s - loss: 0.3388 - acc: 0.9059
Epoch 5/5
11293/11293 [==============================] - 9s - loss: 0.2772 - acc: 0.9212
7528/7528 [==============================] - 1s
Test score: 1.12070538341
Test accuracy: 0.701248671626
4.2 GRU
Trained for 24 epochs in total (the first 20 epochs were run earlier; only the log of the last 4 epochs is shown here).
Training-set accuracy 86%, test-set accuracy 75%.
In [1]: run 20ng_lstm.py
Loading data...
X_train shape: (11293, 100, 48)
Y_train shape: (11293, 20)
Build model...
/usr/local/lib/python2.7/dist-packages/Theano-0.7.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
from scan_perform.scan_perform import *
{"layers": [{"truncate_gradient": -1, "name": "GRU", "inner_activation": "hard_sigmoid", "output_dim": 128, "input_shape": [100, 48], "init": "glorot_uniform", "inner_init": "orthogonal", "input_dim": 48, "return_sequences": false, "activation": "tanh", "input_length": 100}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}
Train...
Epoch 1/4
11293/11293 [==============================] - 124s - loss: 0.4315 - acc: 0.8681
Epoch 2/4
   56/11293 [=========================>....] - ETA: 15s - loss: 0.4299 - acc: 0.8693
Epoch 3/4
11293/11293 [==============================] - 118s - loss: 0.4081 - acc: 0.8756
Epoch 4/4
11293/11293 [==============================] - 130s - loss: 0.3950 - acc: 0.8837
7528/7528 [==============================] - 21s
Test score: 0.863923724031
Test accuracy: 0.758368756642
4.3 Summary
1. Each LSTM/GRU epoch takes roughly 11 times as long as a DNN epoch, and more epochs are needed, but the test-set accuracy beats the DNN's.
2. Increasing the LSTM/GRU window length (the sequence length) improves the per-epoch accuracy but lengthens the running time.
3. GRU performs better than LSTM: higher accuracy and faster training. Due to time constraints no logs were kept from those runs, but GRU consistently came out ahead of LSTM. For an introduction to GRU see http://colah.github.io/posts/2015-08-Understanding-LSTMs/
5. Miscellaneous
- How to inspect the output of an intermediate layer:
Taking the DNN model as an example:
model2 = Sequential()
model2.add(Dense(512, input_shape=(max_features,), activation='tanh', weights = model.layers[0].get_weights()))
model2.compile(loss='categorical_crossentropy', optimizer='adam', class_mode = "categorical")
Then TT = model2.predict(X_test, batch_size=...) gives the output after the first layer.
- Applying Dropout, taking a rate of 0.5 as an example (a toy sketch follows below):
a) During training, each neuron is disabled with probability 1/2.
b) At prediction time nothing is dropped; the output (W*X+b) is scaled by the keep probability 1 - p, which for p = 0.5 coincides with multiplying by the dropout rate.
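A toy numpy sketch of this behaviour (an illustration only, not Keras internals):

import numpy as np

p = 0.5                                 # dropout rate
h = np.array([1.0, 2.0, 3.0, 4.0])      # some layer activations

# Training: each unit is kept with probability 1 - p, dropped units become 0
mask = np.random.binomial(1, 1.0 - p, size=h.shape)
h_train = h * mask

# Prediction: nothing is dropped; activations are scaled by the keep
# probability 1 - p (with p = 0.5 this is simply multiplying by 0.5)
h_test = h * (1.0 - p)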
- The model structure can be saved with model.to_json() and then writing the resulting string to a file.
- The model weights can be saved with model.save_weights('20mlp_weights.h5', overwrite=True), which writes an HDF5 file (a small restore sketch follows below).
Alternatively, model.get_weights() returns the weights of every layer that actually has parameters (a Dropout layer has none). For the DNN, the result is an array of length 4: array[0] and array[1] are the W and b of the 1000*512 layer, and array[2] and array[3] are the W and b of the 512*20 layer.
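Putting the two together, a saved model can later be restored roughly like this (a sketch; it assumes your Keras version provides keras.models.model_from_json as the counterpart of to_json):

from keras.models import model_from_json

# Rebuild the architecture from the saved JSON string, then load the weights
model = model_from_json(open('20mlp_model.txt').read())
model.load_weights('20mlp_weights.h5')
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode='categorical')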