《python深度學習》筆記---6.1-3、word embedding-使用預訓練的詞嵌入
一、總結
一句話總結:
【將文本轉換為能處理的格式】:將原始文本轉換為神經網絡能夠處理的格式。
【Keras 模型的 Embedding 層】:使用 Keras 模型的 Embedding 層來學習針對特定任務的標記嵌入。
【預訓練詞嵌入 提升 在小型自然語言處理問題】:使用預訓練詞嵌入在小型自然語言處理問題上獲得額外的性能提升
1、使用預訓練的詞嵌入的情況?
【可用的訓練數據很少】:有時可用的訓練數據很少,以至於只用手頭數據無法學習適合特定任務的詞嵌入。
2、在自然語言處理中使用預訓練的詞嵌入,其背后的原理與在圖像分類中使用預訓練的卷積神經網絡是一樣的?
【特征通用】:沒有足夠的數據來自己學習真正強大的特征,但你需要的特征應該是非常通用的,比如常見的視覺特征或語義特征。
3、預訓練的詞嵌入是怎么計算出來的?
【詞頻統計】:預訓練的詞嵌入通常是利用詞頻統計計算得出的(觀察哪些詞共同出現在句子或文檔中),用到 的技術很多,有些涉及神經網絡,有些則不涉及。
4、詞嵌入word2vec 算法 特點?
【其維度抓住了特定的語義屬性】:word2vec 算法 由 Google 的 Tomas Mikolov 於 2013 年開發,其維度抓住了特定的語義屬性,比如性別
二、word embedding-使用預訓練的詞嵌入
博客對應課程的視頻位置:
Using pre-trained word embeddings
Sometimes, you have so little training data available that could never use your data alone to learn an appropriate task-specific embedding of your vocabulary. What to do then?
Instead of learning word embeddings jointly with the problem you want to solve, you could be loading embedding vectors from a pre-computed embedding space known to be highly structured and to exhibit useful properties -- that captures generic aspects of language structure. The rationale behind using pre-trained word embeddings in natural language processing is very much the same as for using pre-trained convnets in image classification: we don't have enough data available to learn truly powerful features on our own, but we expect the features that we need to be fairly generic, i.e. common visual features or semantic features. In this case it makes sense to reuse features learned on a different problem.
Such word embeddings are generally computed using word occurrence statistics (observations about what words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started really taking off in research and industry applications after the release of one of the most famous and successful word embedding scheme: the Word2Vec algorithm, developed by Mikolov at Google in 2013. Word2Vec dimensions capture specific semantic properties, e.g. gender.
There are various pre-computed databases of word embeddings that can download and start using in a Keras Embedding
layer. Word2Vec is one of them. Another popular one is called "GloVe", developed by Stanford researchers in 2014. It stands for "Global Vectors for Word Representation", and it is an embedding technique based on factorizing a matrix of word co-occurrence statistics. Its developers have made available pre-computed embeddings for millions of English tokens, obtained from Wikipedia data or from Common Crawl data.
Let's take a look at how you can get started using GloVe embeddings in a Keras model. The same method will of course be valid for Word2Vec embeddings or any other word embedding database that you can download. We will also use this example to refresh the text tokenization techniques we introduced a few paragraphs ago: we will start from raw text, and work our way up.
Putting it all together: from raw text to word embeddings
We will be using a model similar to the one we just went over -- embedding sentences in sequences of vectors, flattening them and training a Dense
layer on top. But we will do it using pre-trained word embeddings, and instead of using the pre-tokenized IMDB data packaged in Keras, we will start from scratch, by downloading the original text data.
Download the IMDB data as raw text
First, head to http://ai.stanford.edu/~amaas/data/sentiment/
and download the raw IMDB dataset (if the URL isn't working anymore, just Google "IMDB dataset"). Uncompress it.
Now let's collect the individual training reviews into a list of strings, one string per review, and let's also collect the review labels (positive / negative) into a labels
list:
import os imdb_dir = '/home/ubuntu/data/aclImdb' train_dir = os.path.join(imdb_dir, 'train') labels = [] texts = [] for label_type in ['neg', 'pos']: dir_name = os.path.join(train_dir, label_type) for fname in os.listdir(dir_name): if fname[-4:] == '.txt': f = open(os.path.join(dir_name, fname)) texts.append(f.read()) f.close() if label_type == 'neg': labels.append(0) else: labels.append(1)
Tokenize the data
Let's vectorize the texts we collected, and prepare a training and validation split. We will merely be using the concepts we introduced earlier in this section.
Because pre-trained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise, task-specific embeddings are likely to outperform them), we will add the following twist: we restrict the training data to its first 200 samples. So we will be learning to classify movie reviews after looking at just 200 examples...
from keras.preprocessing.text import Tokenizer from keras.preprocessing.sequence import pad_sequences import numpy as np maxlen = 100 # We will cut reviews after 100 words training_samples = 200 # We will be training on 200 samples validation_samples = 10000 # We will be validating on 10000 samples max_words = 10000 # We will only consider the top 10,000 words in the dataset tokenizer = Tokenizer(num_words=max_words) tokenizer.fit_on_texts(texts) sequences = tokenizer.texts_to_sequences(texts) word_index = tokenizer.word_index print('Found %s unique tokens.' % len(word_index)) data = pad_sequences(sequences, maxlen=maxlen) labels = np.asarray(labels) print('Shape of data tensor:', data.shape) print('Shape of label tensor:', labels.shape) # Split the data into a training set and a validation set # But first, shuffle the data, since we started from data # where sample are ordered (all negative first, then all positive). indices = np.arange(data.shape[0]) np.random.shuffle(indices) data = data[indices] labels = labels[indices] x_train = data[:training_samples] y_train = labels[:training_samples] x_val = data[training_samples: training_samples + validation_samples] y_val = labels[training_samples: training_samples + validation_samples]
Download the GloVe word embeddings
Head to https://nlp.stanford.edu/projects/glove/
(where you can learn more about the GloVe algorithm), and download the pre-computed embeddings from 2014 English Wikipedia. It's a 822MB zip file named glove.6B.zip
, containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Un-zip it.
Pre-process the embeddings
Let's parse the un-zipped file (it's a txt
file) to build an index mapping words (as strings) to their vector representation (as number vectors).
glove_dir = '/home/ubuntu/data/' embeddings_index = {} f = open(os.path.join(glove_dir, 'glove.6B.100d.txt')) for line in f: values = line.split() word = values[0] coefs = np.asarray(values[1:], dtype='float32') embeddings_index[word] = coefs f.close() print('Found %s word vectors.' % len(embeddings_index))
Now let's build an embedding matrix that we will be able to load into an Embedding
layer. It must be a matrix of shape (max_words, embedding_dim)
, where each entry i
contains the embedding_dim
-dimensional vector for the word of index i
in our reference word index (built during tokenization). Note that the index 0
is not supposed to stand for any word or token -- it's a placeholder.
embedding_dim = 100 embedding_matrix = np.zeros((max_words, embedding_dim)) for word, i in word_index.items(): embedding_vector = embeddings_index.get(word) if i < max_words: if embedding_vector is not None: # Words not found in embedding index will be all-zeros. embedding_matrix[i] = embedding_vector
Define a model
We will be using the same model architecture as before:
from keras.models import Sequential from keras.layers import Embedding, Flatten, Dense model = Sequential() model.add(Embedding(max_words, embedding_dim, input_length=maxlen)) model.add(Flatten()) model.add(Dense(32, activation='relu')) model.add(Dense(1, activation='sigmoid')) model.summary()
Load the GloVe embeddings in the model
The Embedding
layer has a single weight matrix: a 2D float matrix where each entry i
is the word vector meant to be associated with index i
. Simple enough. Let's just load the GloVe matrix we prepared into our Embedding
layer, the first layer in our model:
model.layers[0].set_weights([embedding_matrix]) model.layers[0].trainable = False
Additionally, we freeze the embedding layer (we set its trainable
attribute to False
), following the same rationale as what you are already familiar with in the context of pre-trained convnet features: when parts of a model are pre-trained (like our Embedding
layer), and parts are randomly initialized (like our classifier), the pre-trained parts should not be updated during training to avoid forgetting what they already know. The large gradient update triggered by the randomly initialized layers would be very disruptive to the already learned features.
Train and evaluate
Let's compile our model and train it:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc']) history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val)) model.save_weights('pre_trained_glove_model.h5')
Let's plot its performance over time:
import matplotlib.pyplot as plt acc = history.history['acc'] val_acc = history.history['val_acc'] loss = history.history['loss'] val_loss = history.history['val_loss'] epochs = range(1, len(acc) + 1) plt.plot(epochs, acc, 'bo', label='Training acc') plt.plot(epochs, val_acc, 'b', label='Validation acc') plt.title('Training and validation accuracy') plt.legend() plt.figure() plt.plot(epochs, loss, 'bo', label='Training loss') plt.plot(epochs, val_loss, 'b', label='Validation loss') plt.title('Training and validation loss') plt.legend() plt.show()
The model quickly starts overfitting, unsurprisingly given the small number of training samples. Validation accuracy has high variance for the same reason, but seems to reach high 50s.
Note that your mileage may vary: since we have so few training samples, performance is heavily dependent on which exact 200 samples we picked, and we picked them at random. If it worked really poorly for you, try picking a different random set of 200 samples, just for the sake of the exercise (in real life you don't get to pick your training data).
We can also try to train the same model without loading the pre-trained word embeddings and without freezing the embedding layer. In that case, we would be learning a task-specific embedding of our input tokens, which is generally more powerful than pre-trained word embeddings when lots of data is available. However, in our case, we have only 200 training samples. Let's try it:
from keras.models import Sequential from keras.layers import Embedding, Flatten, Dense model = Sequential() model.add(Embedding(max_words, embedding_dim, input_length=maxlen)) model.add(Flatten()) model.add(Dense(32, activation='relu')) model.add(Dense(1, activation='sigmoid')) model.summary() model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc']) history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))
acc = history.history['acc'] val_acc = history.history['val_acc'] loss = history.history['loss'] val_loss = history.history['val_loss'] epochs = range(1, len(acc) + 1) plt.plot(epochs, acc, 'bo', label='Training acc') plt.plot(epochs, val_acc, 'b', label='Validation acc') plt.title('Training and validation accuracy') plt.legend() plt.figure() plt.plot(epochs, loss, 'bo', label='Training loss') plt.plot(epochs, val_loss, 'b', label='Validation loss') plt.title('Training and validation loss') plt.legend() plt.show()
Validation accuracy stalls in the low 50s. So in our case, pre-trained word embeddings does outperform jointly learned embeddings. If you increase the number of training samples, this will quickly stop being the case -- try it as an exercise.
Finally, let's evaluate the model on the test data. First, we will need to tokenize the test data:
test_dir = os.path.join(imdb_dir, 'test') labels = [] texts = [] for label_type in ['neg', 'pos']: dir_name = os.path.join(test_dir, label_type) for fname in sorted(os.listdir(dir_name)): if fname[-4:] == '.txt': f = open(os.path.join(dir_name, fname)) texts.append(f.read()) f.close() if label_type == 'neg': labels.append(0) else: labels.append(1) sequences = tokenizer.texts_to_sequences(texts) x_test = pad_sequences(sequences, maxlen=maxlen) y_test = np.asarray(labels)
And let's load and evaluate the first model:
model.load_weights('pre_trained_glove_model.h5') model.evaluate(x_test, y_test)
We get an appalling test accuracy of 54%. Working with just a handful of training samples is hard!