IMDB Movie Review Sentiment Classification - N-Gram


catalogue

1. Dataset
2. Model Design
3. Training

 

1. Dataset

0x1: IMDB Movie Review Data

The dataset contains 25,000 movie reviews from IMDB, each labeled as either a positive or negative review.

from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb_full.pkl",
                                                      nb_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      test_split=0.1,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
 

1. path: if you already have the dataset locally (at '~/.keras/datasets/' + path), it is loaded from there; otherwise the data is downloaded to that directory.
2. nb_words: integer or None. Only the nb_words most frequent words are considered; any less frequent word is replaced by the oov_char value in the sequences.
3. skip_top: integer. The skip_top most frequent words are ignored and likewise replaced by the oov_char value.
4. maxlen: integer. Maximum sequence length; any sequence longer than this is truncated.
5. seed: integer. Random seed used to shuffle the data.
6. start_char: the index that marks the start of each sequence. Defaults to 1, because 0 is usually reserved for padding.
7. oov_char: words cut out by the nb_words or skip_top limits are replaced by this index.
8. index_from: integer. Real words (as opposed to special placeholders such as start_char) are indexed starting from this value.

The return value is two tuples, (X_train, y_train) and (X_test, y_test), where:

X_train and X_test: lists of sequences, where each sequence is a list of word indices. If nb_words is specified, the largest possible index in a sequence is nb_words - 1. If maxlen is specified, the maximum possible sequence length is maxlen.
y_train and y_test: the labels for the sequences, given as lists of 0/1 values.
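
As a quick sanity check, here is a minimal sketch (assuming a standard Keras install that can download the IMDB data with its default arguments; newer Keras versions call the cutoff argument num_words instead of nb_words) that loads the data and verifies the properties described above:

from keras.datasets import imdb

nb_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=nb_words)

print(len(X_train), 'train sequences,', len(X_test), 'test sequences')
# Every word index stays below the nb_words cutoff.
print('max index in X_train[0]:', max(X_train[0]))    # < nb_words
# Labels are binary: 0 = negative review, 1 = positive review.
print('label values:', sorted(set(y_train)))           # [0, 1]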

0x2: Data Preprocessing

The reviews have already been preprocessed into sequences of word indices. Each word's index is determined by its frequency rank in the dataset: the most frequent word is encoded as 1, the integer 3 encodes the 3rd most frequent word, and so on. For example:

X_train[0]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

y_train[0]
1

In other words, the corpus has been converted into sequences of frequency-rank indices.
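
To see the original text behind a sequence, the frequency-rank dictionary returned by imdb.get_word_index() can be inverted. This is a minimal sketch, assuming the data was loaded with the default offsets (start_char=1, oov_char=2, index_from=3):

from keras.datasets import imdb

(X_train, y_train), _ = imdb.load_data(nb_words=5000)

word_index = imdb.get_word_index()                      # word -> frequency rank
index_word = {rank: word for word, rank in word_index.items()}

# Sequence indices are shifted by index_from=3; 1 is the start marker, 2 is out-of-vocabulary.
decoded = ' '.join(index_word.get(i - 3, '?') for i in X_train[0])
print(decoded[:200])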

0x3: N-Gram Generation

Note that the Keras dataset is tokenized as 1-grams (unigrams) by default. If we want 2-gram or general n-gram features, we have to build them ourselves during preprocessing:

if ngram_range > 1:
    print('Adding {}-gram features'.format(ngram_range))
    # Create set of unique n-gram from the training set.
    ngram_set = set()
    for input_list in X_train:
        for i in range(2, ngram_range + 1):
            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)
    print('ngram_set.pop')
    print(ngram_set.pop())

The unigrams are combined pairwise (adjacent pairs) to form 2-grams, and each distinct 2-gram then needs its own index. These new indices start above max_features (the script uses start_index = max_features + 1), because the indices 0 through max_features - 1 have already been used for the unigrams.

Note, however, that the number of possible 2-gram combinations grows roughly quadratically with the vocabulary size compared with 1-grams, so the computational and memory requirements grow accordingly; see the toy sketch below.
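
Here is a minimal toy sketch of this indexing step, reusing the create_ngram_set and add_ngram helpers defined in the full script in section 3 (the sequence and the max_features value are made up for illustration):

# Toy illustration of bigram indexing; the numbers are made up.
max_features = 10                       # pretend the unigram indices are 0..9
X_train = [[1, 4, 9, 4, 1, 4]]          # one toy "review"

ngram_set = set()
for input_list in X_train:
    ngram_set.update(create_ngram_set(input_list, ngram_value=2))
# ngram_set is now {(1, 4), (4, 9), (9, 4), (4, 1)}

start_index = max_features + 1          # new indices start above the unigram range
token_indice = {ngram: idx + start_index for idx, ngram in enumerate(ngram_set)}

X_train = add_ngram(X_train, token_indice, ngram_range=2)
print(X_train[0])                       # unigrams followed by the appended bigram indices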

0x4: Padding Sequences - pad_sequences

After padding, the data is a 2D array with shape:

X_train shape: (2500, 400)
X_test shape: (2500, 400)

Here 2500 is the number of review samples in this particular run (the script below cuts the dataset down with truncate_rate), not max_features. 400 is maxlen: we treat 400 as the maximum review length, truncate anything longer, and pad every shorter sequence up to length 400.
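
The padding and truncation behavior itself is easy to see on a toy input. A minimal sketch of Keras' pad_sequences with its default pre-padding and pre-truncation (the sequences are made up):

from keras.preprocessing import sequence

seqs = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
padded = sequence.pad_sequences(seqs, maxlen=4)
print(padded)
# [[0 1 2 3]     <- short sequence zero-padded at the front (default padding='pre')
#  [6 7 8 9]]    <- long sequence truncated from the front (default truncating='pre')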

Relevant Link:

https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py

 

2. Model Design

0x1: Embedding Layer

model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))

The input dimension of the embedding is max_features, i.e. the vocabulary size: indices refer to the top max_features entries of the frequency-ranked word list (plus the n-gram indices added above). The output dimension is embedding_dims = 50.
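
A minimal sketch of the resulting tensor shapes, building just the embedding layer in isolation with the initial parameter values from the full script below (in the real run, max_features is later enlarged by the n-gram indices):

from keras.models import Sequential
from keras.layers import Embedding

max_features = 5000
embedding_dims = 50
maxlen = 400

m = Sequential()
m.add(Embedding(max_features, embedding_dims, input_length=maxlen))
# Input:  (batch_size, maxlen)                  integer word/n-gram indices
# Output: (batch_size, maxlen, embedding_dims)  one 50-dimensional vector per position
print(m.output_shape)                           # (None, 400, 50)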

0x2: Pooling Layer - Pooling Is a Good Low-Loss Dimensionality Reduction Method

After features have been obtained through convolution, the next step is to use them for classification. In principle one could connect all of the extracted features to a classifier, such as a softmax classifier, but the computational cost would be enormous. For example, for a 96x96 pixel image, suppose we have learned 400 features over 8x8 inputs. Each convolution produces a (96 - 8 + 1) * (96 - 8 + 1) = 7921-dimensional output, so with 400 features every example ends up with 7921 * 400 = 3,168,400 features. Training a classifier on more than 3 million input features is impractical and highly prone to over-fitting.
Pooling addresses this. It essentially takes the mean or the maximum over a local region of the feature map and uses that single value to represent the region: taking the mean gives mean pooling, taking the maximum gives max pooling.
Using image recognition as an example, the reason this works is that convolutional features exploit a kind of "stationarity" of images: a feature that is useful in one region of an image is very likely to be useful in another region as well. To describe a large image, a natural idea is therefore to aggregate statistics of the features at different locations, and the mean or the maximum is exactly such an aggregate statistic.
Moreover, if the pooling regions are chosen as contiguous areas of the image, and pooling is applied only to features produced by the same (replicated) hidden units, then the pooled units are translation invariant: even if the image undergoes a small translation, the (pooled) features stay the same.

model.add(GlobalAveragePooling1D())
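
In this model the pooling is one-dimensional and global: GlobalAveragePooling1D averages the embedding vectors over the time (word-position) axis, collapsing (batch, maxlen, embedding_dims) into (batch, embedding_dims). A minimal numpy sketch of the same reduction on made-up numbers:

import numpy as np

# A made-up batch: 2 "documents", 3 positions each, 4-dimensional embeddings.
x = np.arange(2 * 3 * 4, dtype='float32').reshape(2, 3, 4)

pooled = x.mean(axis=1)   # average over the time/position axis
print(pooled.shape)       # (2, 4) -- the same reduction GlobalAveragePooling1D performs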

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/embedding_layer/
http://www.cnblogs.com/bzjia-blog/p/3415790.html
https://keras-cn.readthedocs.io/en/latest/layers/pooling_layer/

 

3. Training

'''This example demonstrates the use of fasttext for text classification
Based on Joulin et al's paper:
Bags of Tricks for Efficient Text Classification
https://arxiv.org/abs/1607.01759
Results on IMDB datasets with uni and bi-gram embeddings:
    Uni-gram: 0.8813 test accuracy after 5 epochs. 8s/epoch on i7 cpu.
    Bi-gram : 0.9056 test accuracy after 5 epochs. 2s/epoch on GTX 980M gpu.
'''

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.datasets import imdb


def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2)
    {(4, 9), (4, 1), (1, 4), (9, 4)}
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=3)
    {(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))


def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of list (sequences) by appending n-grams values.
    Example: adding bi-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017}
    >>> add_ngram(sequences, token_indice, ngram_range=2)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]]
    Example: adding tri-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017, (7, 9, 2): 2018}
    >>> add_ngram(sequences, token_indice, ngram_range=3)
    [[1, 3, 4, 5, 1337], [1, 3, 7, 9, 2, 1337, 2018]]
    """
    new_sequences = []
    for input_list in sequences:
        new_list = input_list[:]
        for i in range(len(new_list) - ngram_range + 1):
            for ngram_value in range(2, ngram_range + 1):
                ngram = tuple(new_list[i:i + ngram_value])
                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)

    return new_sequences

# Set parameters:
# ngram_range = 2 will add bi-grams features
ngram_range = 2
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_epoch = 120

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(
    path="/home/zhenghan/keras/imdb_full.pkl",
    nb_words=max_features
)

# truncate the dataset
truncate_rate = 0.4
X_train = X_train[:int(len(X_train) * truncate_rate)]
y_train = y_train[:int(len(y_train) * truncate_rate)]
X_test = X_test[:int(len(X_test) * truncate_rate)]
y_test = y_test[:int(len(y_test) * truncate_rate)]


print("X_train[0]")
print(X_train[0])
print('y_train[0]')
print(y_train[0])


print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('Average train sequence length: {}'.format(np.mean(list(map(len, X_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, X_test)), dtype=int)))

if ngram_range > 1:
    print('Adding {}-gram features'.format(ngram_range))
    # Create set of unique n-gram from the training set.
    ngram_set = set()
    for input_list in X_train:
        for i in range(2, ngram_range + 1):
            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)
    print('ngram_set.pop')
    print(ngram_set.pop())

    # Dictionary mapping n-gram token to a unique integer.
    # Integer values are greater than max_features in order
    # to avoid collision with existing features.
    start_index = max_features + 1
    token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
    indice_token = {token_indice[k]: k for k in token_indice}

    # max_features is the highest integer that could be found in the dataset.
    max_features = np.max(list(indice_token.keys())) + 1
    print('max_features: ')
    print(max_features)

    # Augmenting X_train and X_test with n-grams features
    X_train = add_ngram(X_train, token_indice, ngram_range)
    X_test = add_ngram(X_test, token_indice, ngram_range)

    print("X_train[0]")
    print(X_train[0])
    print('X_test[0]')
    print(X_test[0])

    print('Average train sequence length: {}'.format(np.mean(list(map(len, X_train)), dtype=int)))
    print('Average test sequence length: {}'.format(np.mean(list(map(len, X_test)), dtype=int)))

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))

# we add a GlobalAveragePooling1D, which will average the embeddings
# of all words in the document
model.add(GlobalAveragePooling1D())

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
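
After fitting, the model can be evaluated and used for prediction in the usual way. A minimal usage sketch, continuing from the variables defined above:

# Score the held-out data and classify a few reviews.
loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print('test loss: {:.4f}, test accuracy: {:.4f}'.format(loss, acc))

probs = model.predict(X_test[:5])           # sigmoid outputs in [0, 1]
print((probs > 0.5).astype(int).ravel())    # 1 = predicted positive review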

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/other/datasets/
https://keras-cn.readthedocs.io/en/latest/preprocessing/sequence/

Copyright (c) 2017 LittleHann All rights reserved

