Deep Learning with Python: Study Notes (5)

Reposted from the original article. 2018-11-19 21:50. Tags: reading notes / GloVe / deep learning for text and sequences / deep learning with python / deep learning / machine learning & deep learning / one-hot / Python

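This installment covers deep learning for text and sequences.

The two basic deep learning algorithms for processing sequences are recurrent neural networks (RNNs) and one-dimensional convolutional neural networks (1D convnets).
Like all other neural networks, deep learning models do not take raw text as input; they only work with numeric tensors. Vectorizing text is the process of turning text into numeric tensors, and it can be done in several ways: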

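Split the text into words and turn each word into a vector.
Split the text into characters and turn each character into a vector.
Extract n-grams of words or characters and turn each n-gram into a vector. An n-gram is a group of several consecutive words or characters; n-grams may overlap.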

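The units into which text is broken down (words, characters, or n-grams) are called tokens, and breaking text into tokens is called tokenization. Every text-vectorization process applies some tokenization scheme and then associates a numeric vector with each generated token. These vectors are packed into sequence tensors and fed into a deep neural network.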
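As a rough illustration of the tokenization step described above, here is a minimal pure-Python sketch. The helper names (word_tokens, char_tokens, build_index) and the choice to lowercase and drop punctuation are assumptions made for this example; they are not what Keras does internally.

```python
import re


def word_tokens(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_tokens(text):
    """Split text into character tokens (spaces and punctuation included)."""
    return list(text)


def build_index(tokens):
    """Give each distinct token an integer index, starting at 1.

    Index 0 is left unused so it can later serve as a padding value.
    """
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(index) + 1
    return index


sample = "The cat sat on the mat."
words = word_tokens(sample)
index = build_index(words)
ids = [index[w] for w in words]
print(words)
print(ids)
```

The integer list in ids is what a later step would turn into vectors, either by one-hot encoding or by an embedding lookup.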

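An n-gram is a group of N (or fewer) consecutive words extracted from a sentence. The same idea applies to characters instead of words.
The sentence "The cat sat on the mat" decomposes into the following set of 2-grams:
{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}
It decomposes into the following set of 3-grams:
{"The", "The cat", "cat", "cat sat", "The cat sat",
"sat", "sat on", "on", "cat sat on", "on the", "the",
"sat on the", "the mat", "mat", "on the mat"}
Such sets are called a bag-of-2-grams and a bag-of-3-grams, respectively. The term bag refers to the fact that we are dealing with a set of tokens. This family of tokenization methods is called bag-of-words. Because bag-of-words does not preserve order, it tends to be used in shallow language-processing models rather than in deep learning models.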
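The two bags listed above can be reproduced with a few lines of Python. This is only a sketch; the function name bag_of_ngrams is made up for this example, and it splits on whitespace without any further normalization (so "The" and "the" stay distinct, as in the sets above).

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all groups of 1 to n consecutive words."""
    words = sentence.split()
    bag = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence
        for start in range(len(words) - size + 1):
            bag.add(" ".join(words[start:start + size]))
    return bag


bag2 = bag_of_ngrams("The cat sat on the mat", 2)
bag3 = bag_of_ngrams("The cat sat on the mat", 3)
print(sorted(bag2))
print(len(bag2), len(bag3))
```

Because the result is a set, the original word order cannot be recovered from it, which is the limitation of bag-of-words noted above.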

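Ways of associating a vector with a token
There are two main approaches: one-hot encoding of tokens, and token embedding (typically used only for words, in which case it is called word embedding).

One-hot encoding is the most common and most basic way to turn a token into a vector.

It associates each word with a unique integer index, then turns that integer index i into a binary vector of length N (the size of the vocabulary); the vector is all zeros except for the i-th entry, which is 1. One-hot encoding can also be done at the character level.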
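To make the index-to-vector step concrete, here is a small word-level one-hot sketch in plain Python, without Keras. The function name one_hot_encode and the max_length cap are assumptions for illustration; words are split on whitespace only, so "mat." keeps its period.

```python
def one_hot_encode(samples, max_length=10):
    """Word-level one-hot encoding of a list of strings.

    Returns (token_index, results), where results[s][w] is the
    one-hot vector of word w in sample s.
    """
    # Assign a unique index to every distinct word; index 0 stays unused
    token_index = {}
    for sample in samples:
        for word in sample.split():
            if word not in token_index:
                token_index[word] = len(token_index) + 1

    vocab_size = len(token_index) + 1
    results = []
    for sample in samples:
        rows = []
        # Only the first max_length words of each sample are encoded
        for word in sample.split()[:max_length]:
            row = [0] * vocab_size
            row[token_index[word]] = 1
            rows.append(row)
        results.append(rows)
    return token_index, results


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index, results = one_hot_encode(samples)
print(token_index)
print(results[0][0])
```

Each vector is as long as the vocabulary plus one, which is why one-hot vectors become very large and sparse on realistic vocabularies.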

Keras one-hot encoding demo

from keras.preprocessing.text import Tokenizer


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Only consider the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index
tokenizer.fit_on_texts(samples)
# Retrieve the computed word index
word_index = tokenizer.word_index
print(word_index)
# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
print("Index sequences:", sequences)
# Map the index sequences back to text
text = tokenizer.sequences_to_texts(sequences)
print("Recovered text:", text)
# Get the binary representation: one vector of length num_words per sample
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# Count the entries that are set to 1
one_num = int(one_hot_results.sum())
print("Number of ones:", one_num)
print(one_hot_results)


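A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in the vocabulary is too large to handle explicitly.

It hashes words into vectors of fixed size, typically with a very lightweight hashing function.

The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data. The drawback is that hash collisions can occur: two different words may end up with the same hash.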
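A minimal sketch of the hashing trick follows. It uses MD5 from hashlib as the hash function, which is a choice made for this example only: Python's built-in hash() gives different values for strings from one run to the next, so it would not produce stable encodings. The names stable_hash and hashed_one_hot are hypothetical.

```python
import hashlib


def stable_hash(word):
    """Hash a word to an integer that is the same on every run."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashed_one_hot(sample, dimensionality=1000):
    """One-hot encode each word at position hash(word) % dimensionality.

    No word index is stored, so new words can be encoded on the fly.
    """
    rows = []
    for word in sample.split():
        row = [0] * dimensionality
        row[stable_hash(word) % dimensionality] = 1
        rows.append(row)
    return rows


rows = hashed_one_hot("The cat sat on the mat.")
print(len(rows), len(rows[0]))

# With only 2 slots and 3 distinct words, a collision is unavoidable
tiny = hashed_one_hot("cat dog mat", dimensionality=2)
slots = {row.index(1) for row in tiny}
print(len(slots))
```

The second part shows the drawback mentioned above: when the hashing space is small relative to the number of distinct words, different words necessarily share a slot.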

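Word embeddings
The vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional (the same dimensionality as the number of words in the vocabulary), whereas word embeddings are low-dimensional floating-point vectors. Unlike one-hot word vectors, word embeddings are learned from data. Common embedding dimensionalities are 256, 512, or 1,024 (when dealing with very large vocabularies), whereas one-hot word vectors typically have 20,000 dimensions or more. Word embeddings therefore pack more information into far fewer dimensions.

There are two ways to obtain word embeddings:

Learn word embeddings jointly with the main task (such as document classification or sentiment prediction). In this setup you start with random word vectors and then learn them in the same way you learn the weights of a neural network.
Load into your model word embeddings that were precomputed on a different machine learning task than the one you are trying to solve. These are called pretrained word embeddings.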

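Learning word embeddings with the Embedding layer
Word embeddings are meant to map human language into a geometric space: the geometric distance between any two word vectors should relate to the semantic distance between the two words. You may also want specific directions in the embedding space to be meaningful.
The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length). It can embed sequences of variable length, but all sequences in a batch must have the same length, so shorter sequences are padded with zeros and longer ones are truncated.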
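Conceptually, an Embedding layer behaves like a lookup table from integer indices to dense vectors. The sketch below imitates this in plain Python under stated assumptions: pad_to_length pads and truncates at the front, which matches the documented defaults of Keras pad_sequences, while make_embedding_table and embed are hypothetical helpers, and the random table stands in for weights that a real layer would learn during training.

```python
import random


def pad_to_length(sequences, maxlen, value=0):
    """Pad or truncate every sequence to maxlen, working from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a table of small random vectors, one row per word index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(batch, table):
    """Replace every integer index in the batch with its table row."""
    return [[table[i] for i in seq] for seq in batch]


batch = pad_to_length([[3, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
table = make_embedding_table(vocab_size=10, dim=8)
out = embed(batch, table)
print(batch)
print(len(out), len(out[0]), len(out[0][0]))
```

The input of shape (samples, sequence_length) becomes a nested structure of shape (samples, sequence_length, embedding dimension), which is why a Flatten or recurrent layer is usually needed before a Dense classifier.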

Simple demo

from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
import matplotlib.pyplot as plt


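# Number of words to consider as features
max_features = 10000
# Cut reviews off after this many words
maxlen = 20
# Load the data as lists of integers (the path points to a local copy)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features, path='E:\\study\\dataset\\imdb.npz')
# Turn the lists of integers into 2D integer tensors of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
model = Sequential()
# Embed each word index as an 8-dimensional vector;
# the layer outputs shape (samples, maxlen, 8)
model.add(Embedding(max_features, 8, input_length=maxlen))
# Flatten the 3D tensor into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())
# Add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)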


acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()


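When so little training data is available that you cannot learn an appropriate task-specific word embedding from your data alone, you can load embedding vectors from a precomputed embedding space instead of learning word embeddings jointly with the problem you want to solve. Many precomputed databases of word embeddings can be downloaded and used in a Keras Embedding layer. word2vec is one of them; another popular one is GloVe (Global Vectors for Word Representation).

The reasoning mirrors the use of pretrained convnets in image classification: you don't have enough data to learn truly powerful features on your own, but you expect the features you need to be fairly generic, such as common visual features or semantic features.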
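Pretrained embeddings such as GloVe are distributed as text files in which each line holds a word followed by its vector coefficients. The sketch below shows how such a file can be turned into an embedding matrix. The three-dimensional vectors are made-up toy values (real GloVe vectors have 50 to 300 dimensions), and load_vectors and build_matrix are hypothetical helper names.

```python
import io

# Toy stand-in for a GloVe file: "word coef1 coef2 coef3" per line.
# The numbers are invented for illustration only.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "mat 0.7 0.8 0.9\n"
)


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_matrix(word_index, vectors, max_words, dim):
    """Build a (max_words, dim) matrix whose row i is the vector of word i.

    Words without a pretrained vector keep an all-zero row, and words
    whose index is max_words or higher are skipped.
    """
    matrix = [[0.0] * dim for _ in range(max_words)]
    for word, i in word_index.items():
        if i < max_words and word in vectors:
            matrix[i] = vectors[word]
    return matrix


vectors = load_vectors(fake_glove)
word_index = {"the": 1, "cat": 2, "dog": 3, "mat": 4}
matrix = build_matrix(word_index, vectors, max_words=4, dim=3)
for row in matrix:
    print(row)
```

Row 0 stays all zeros because index 0 is reserved for padding rather than for a word; a matrix of this shape is what gets loaded into the Embedding layer as its weights.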

IMDB movie-review sentiment classification demo, using pretrained GloVe word embeddings

import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
import matplotlib.pyplot as plt


imdb_dir = 'E:\\study\\dataset\\aclImdb'
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []
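# Read the raw reviews; 'neg' is labeled 0 and 'pos' is labeled 1
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # Set the encoding explicitly so that reading does not
            # depend on the platform default
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)
# Tokenize the text of the raw IMDB data
maxlen = 100                # cut reviews off after 100 words
training_samples = 200      # train on 200 samples
validation_samples = 10000  # validate on 10,000 samples
max_words = 10000           # only consider the 10,000 most common words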
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# Shuffle the data: the samples are ordered (all negative reviews first, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

# Parse the GloVe word-embeddings file
glove_dir = 'E:\\study\\models\\glove.6B'

embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        # Each line is a word followed by its 100 coefficients
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# Prepare the GloVe word-embeddings matrix of shape (max_words, embedding_dim)
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        # Words not found in the embedding index stay all zeros
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# Model definition
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Load the pretrained word embeddings into the Embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
# Training and evaluation
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

# Tokenize the test-set data
test_dir = os.path.join(imdb_dir, 'test')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
# Evaluate the model on the test set
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

Downloading the data took too long, so I gave up, lol.
