Keras: Processing Text Data with Deep Learning

This article is a repost. 2019-07-13 17:35. Category: Keras in Practice

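Deep learning for natural language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels. Deep learning models do not take raw text as input; they only work with numeric tensors, so the text must first be vectorized. [Figure: main workflow of text vectorization]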

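One-hot encoding associates each word with a unique integer index and then turns that index i into a binary vector of length N (N is the vocabulary size), in which only the i-th element is 1 and all others are 0.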
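The one-hot scheme described above can be sketched in a few lines of plain Python. The `one_hot` helper and the four-word vocabulary are hypothetical and exist only for illustration; indices start at 0 here for simplicity.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a single 1 at position index."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range")
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

# Hypothetical toy vocabulary: word -> unique integer index
word_index = {"the": 0, "cat": 1, "sat": 2, "mat": 3}

# Encode a short sentence word by word
encoded = [one_hot(word_index[w], len(word_index)) for w in "the cat sat".split()]
print(encoded)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
```

With a 10,000-word vocabulary every vector would have 10,000 entries and only one of them would be non-zero, which is why one-hot vectors are called high-dimensional and sparse.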

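Word embeddings are low-dimensional floating-point vectors that are learned from data.

One-hot: high-dimensional, sparse

Word embeddings: low-dimensional, dense

Here we focus on word embeddings. Environment: Keras and Jupyter Notebook.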

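Learning word embeddings with the Embedding layer

Use case: sentiment prediction on IMDB movie reviews

1. Prepare the data (built into Keras)

2. Restrict the movie reviews to the 10,000 most common words

3. Limit each review to 20 words

4. Turn the input integer sequences (a 2D integer tensor) into embedded sequences (a 3D float tensor), flatten that tensor to 2D, and train a single Dense layer on top for classification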
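The shape changes in the last step can be traced with a small sketch in plain Python. It is not the Keras implementation: the sizes are toy values rather than the 10,000 words and 20 positions used in this article, and the table is filled with random numbers where a real Embedding layer would hold learned weights.

```python
import random

vocab_size, embedding_dim, maxlen = 10, 3, 4   # toy sizes, chosen for illustration

random.seed(0)
# Embedding table: one row of embedding_dim floats per word index
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

# A batch of 2 integer sequences: shape (samples, maxlen)
batch = [[1, 5, 2, 0],
         [7, 7, 3, 9]]

# Embedding lookup: shape (samples, maxlen, embedding_dim)
embedded = [[embedding_table[i] for i in seq] for seq in batch]

# Flatten: shape (samples, maxlen * embedding_dim)
flattened = [[x for vec in seq for x in vec] for seq in embedded]

print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 4 3
print(len(flattened), len(flattened[0]))                      # 2 12
```

The lookup is just row indexing, so the same word index always maps to the same vector regardless of where it appears in the sequence.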

# Instantiate an Embedding layer
from keras.layers import Embedding

# Arguments: (number of possible tokens = maximum word index + 1, embedding dimensionality)
embedding_layer = Embedding(1000, 64)

Load the data and prepare it for the Embedding layer

from keras.datasets import imdb
from keras import preprocessing

max_features = 10000
maxlen = 20

# Load the data as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = max_features)

x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
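With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` pads short sequences with zeros at the front and keeps the last `maxlen` items of long ones. The following is a simplified pure-Python sketch of that default behaviour, not the Keras source; `pad_pre` is a hypothetical name and the sketch assumes `maxlen >= 1`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic pre-padding and pre-truncating for lists of integers (maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

print(pad_pre([[1, 2, 3, 4, 5, 6], [7, 8]], maxlen=4))
# [[3, 4, 5, 6], [0, 0, 7, 8]]
```

Under these defaults, `maxlen = 20` means that the last 20 words of each review are the ones the model sees.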

Use an Embedding layer and a classifier on the IMDB data

from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()

model.add(Embedding(10000, 8, input_length=maxlen))

# Flatten the 3D embedding tensor into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())

# Add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train, epochs=10, 
                   batch_size = 32, 
                   validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 10s 517us/step - loss: 0.6759 - acc: 0.6050 - val_loss: 0.6398 - val_acc: 0.6814

......

Epoch 10/10
20000/20000 [==============================] - 3s 127us/step - loss: 0.2839 - acc: 0.8860 - val_loss: 0.5303 - val_acc: 0.7466

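Validation accuracy is about 75%. Because we only flatten the embedded sequences and train a single Dense layer on top, the model treats each word in the input sequence separately and ignores the relationships between words and the structure of the sentence. A better approach is to add a recurrent layer or a 1D convolutional layer on top of the embedded sequences, so that features are learned from the sequence as a whole.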

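What if so little training data is available that suitable word embeddings cannot be learned from it? In that case, use pretrained word embeddings.

Using pretrained word embeddings

This time we do not use the pre-tokenized IMDB data built into Keras; instead we start from scratch with the raw text.

1. Download the raw text of the IMDB data

Download the raw IMDB dataset from https://mng.bz/0tIo and unzip it.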

import os

imdb_dir = 'F:/keras-dataset/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # errors='ignore' skips bytes that cannot be decoded
            with open(os.path.join(dir_name, fname), errors='ignore') as f:
                texts.append(f.read())
            # Label negative reviews 0 and positive reviews 1
            labels.append(0 if label_type == 'neg' else 1)

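2. Tokenize the raw IMDB text

Pretrained word embeddings are especially useful when little training data is available, so we train on only 200 samples.

# Tokenize the data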
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Cut reviews off after 100 words
maxlen = 100

# Train on 200 samples
training_samples = 200

# Validate on 10,000 samples
validation_samples = 10000

# Consider only the 10,000 most common words in the dataset
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

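# Split the data into a training set and a validation set, but shuffle it first,
# because the samples start out ordered (all negative reviews first, then all positive)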
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88583 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
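The output above reports 88,583 unique tokens even though `max_words` is 10,000. The reason is that `word_index` covers every word seen, while `num_words` only filters what `texts_to_sequences` emits. The sketch below imitates that behaviour in plain Python under simplifying assumptions: it lowercases and splits on whitespace only, whereas the real `Tokenizer` also strips punctuation. The helper names and the three toy reviews are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words, as Keras does."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

texts = ["the movie was great", "the movie was bad", "great great acting"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)

print(len(word_index))  # 6 unique tokens, regardless of num_words
print(sequences)        # [[2, 3, 1], [2, 3], [1, 1]]
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value.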

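3. Download the GloVe word embeddings

Download glove.6B.zip from https://nlp.stanford.edu/projects/glove/ and unzip it. It includes 100-dimensional vectors for 400,000 words (glove.6B.100d.txt).

Parse the unzipped file to build an index that maps each word to its vector representation.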

# Parse the GloVe word-embedding file
glove_dir = 'F:/keras-dataset'

embeddings_index = {}

f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), errors='ignore')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

# Build an embedding matrix that can be loaded into an Embedding layer
# Row i of the matrix holds the embedding_dim-dimensional vector for the word with index i in word_index
embedding_dim = 100
embedding_matrics = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            embedding_matrics[i] = embedding_vector

Found 399913 word vectors.
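The parsing and matrix-building steps can be checked on a toy example without downloading anything. The two lines below follow the GloVe text layout (a word followed by its vector components), but the numbers, the 3-dimensional size, and the word list are invented for illustration.

```python
# Two lines in GloVe text format; toy values, not real GloVe vectors
glove_lines = [
    "movie 0.1 0.2 0.3",
    "great 0.4 0.5 0.6",
]

# Map each word to its vector
embeddings_index = {}
for line in glove_lines:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Hypothetical tokenizer output: word -> index (index 0 is reserved)
word_index = {"great": 1, "movie": 2, "zzyzx": 3, "rare": 4}
max_words, embedding_dim = 4, 3

# Start from all zeros and copy in the vectors that exist
embedding_matrix = [[0.0] * embedding_dim for _ in range(max_words)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if i < max_words and vector is not None:
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row 0 stays all zeros because no word has index 0, row 3 stays all zeros because "zzyzx" has no pretrained vector, and "rare" is skipped because its index is not below `max_words`.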

4. Define the model

# Model definition
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_2 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
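The parameter counts in the two summaries can be reproduced by hand: an Embedding layer has one weight per (word, dimension) pair, and a Dense layer has one weight per (input, unit) pair plus one bias per unit. A small sketch, with hypothetical helper names:

```python
def embedding_params(vocab_size, dim):
    """One weight per word per embedding dimension."""
    return vocab_size * dim

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

# First model: Embedding(10000, 8) -> Flatten (20 * 8) -> Dense(1)
first = [embedding_params(10000, 8), dense_params(20 * 8, 1)]
print(first, sum(first))    # [80000, 161] 80161

# Second model: Embedding(10000, 100) -> Flatten (100 * 100) -> Dense(32) -> Dense(1)
second = [embedding_params(10000, 100), dense_params(100 * 100, 32), dense_params(32, 1)]
print(second, sum(second))  # [1000000, 320032, 33] 1320065
```

In the second model the Embedding layer alone accounts for about three quarters of the weights.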

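6. Load the GloVe embeddings into the model

The Embedding layer has a single weight matrix: a 2D float matrix in which row i is the word vector associated with index i. Load the prepared GloVe matrix into the Embedding layer, which is the first layer of the model.

# Load the pretrained word embeddings into the Embedding layer
model.layers[0].set_weights([embedding_matrics])

# Freeze the Embedding layer
model.layers[0].trainable = False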

7. Train and evaluate the model

# Train and evaluate the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))

model.save_weights('pre_trained_glove_model.h5')

Train on 200 samples, validate on 10000 samples
Epoch 1/10
200/200 [==============================] - 1s 4ms/step - loss: 0.9840 - acc: 0.5300 - val_loss: 0.6942 - val_acc: 0.4980

........

Epoch 10/10
200/200 [==============================] - 0s 2ms/step - loss: 0.0598 - acc: 1.0000 - val_loss: 0.8704 - val_acc: 0.5339

8. Plot the results

# Plot the curves
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

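The model starts overfitting very quickly because there are so few training samples, and the results are poor. Validation accuracy stays at roughly 53% to 56%.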
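Besides reading the plots, the point where overfitting sets in can be located directly from a dictionary shaped like `history.history`. The sketch below uses made-up numbers that only imitate the pattern (training accuracy climbing to 1.0 while validation accuracy peaks early); they are not the values from this run, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history, metric="val_acc"):
    """Return (1-based epoch, value) where the metric is highest."""
    values = history[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

# Illustrative numbers only, shaped like history.history
toy_history = {
    "acc":     [0.53, 0.70, 0.85, 0.95, 1.00],
    "val_acc": [0.50, 0.54, 0.56, 0.55, 0.53],
}

epoch, value = best_epoch(toy_history)
print(epoch, value)  # 3 0.56
```

Training beyond that epoch only improves the fit to the training set.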

9. Tokenize the test set and evaluate the model on it

# Tokenize the test-set data
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            # errors='ignore' skips bytes that cannot be decoded
            with open(os.path.join(dir_name, fname), errors='ignore') as f:
                texts.append(f.read())
            # Label negative reviews 0 and positive reviews 1
            labels.append(0 if label_type == 'neg' else 1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

# Evaluate the model on the test set
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

25000/25000 [==============================] - 1s 50us/step

[0.8740278043365478, 0.53072]

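Test accuracy is about 53%, only slightly above chance for a balanced two-class task. That is not surprising, since we trained on only 200 samples.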

Training the same model without pretrained word embeddings

# Train the same model without pretrained word embeddings
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

Validation accuracy is about 52%.
