The simple demo's code lives at tensorflow/tensorflow/g3doc/tutorials/word2vec/word2vec_basic.py.
The idea behind the skip-gram model:
http://tensorflow.org/tutorials/word2vec/index.md
The slides from the cs224d course are also a useful reference.
The window is set to one word on each side.
In the skip-gram model, a single word predicts its surrounding words (the CBOW model is the reverse: it takes a set of context words as input and predicts one center word), e.g.:
quick -> the, quick -> brown
The skip-gram training objective (cost function) is

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$$

where the conditional probability corresponds to the full softmax (v are input-embedding vectors, u are output-embedding vectors):

$$p(w_O \mid w_I) = \frac{\exp\!\left(u_{w_O}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left(u_w^{\top} v_{w_I}\right)}$$

But this is far too slow: the softmax normalizes over the entire vocabulary, so every training step costs O(VocabularySize).
So instead we use NCE (noise-contrastive estimation), i.e. negative sampling: words are drawn at random to serve as negative examples, e.g. quick -> sheep, where sheep is the negative sample. Suppose we draw just one negative sample per pair.
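For reference, the per-pair training objective under negative sampling (the simplified NCE variant used in word2vec; notation follows the softmax above) is

$$\log \sigma\!\left(u_{w_O}^{\top} v_{w_I}\right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}\!\left[\log \sigma\!\left(-u_{w_k}^{\top} v_{w_I}\right)\right]$$

where \sigma is the sigmoid, P_n(w) is the noise distribution the negatives are drawn from, and K is the number of negative samples (K = 1 in the quick -> sheep example). Each step now touches only K + 1 words instead of the whole vocabulary.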
1. Input data: here, words that have already been tokenized (whitespace-separated).
2. Read the words into a list.
3. Count word frequencies. Position 0 is reserved for the unknown token, and the remaining positions are ordered by frequency from high to low. "Unknown" is determined by the preset dictionary size: with a size of, say, 50000, every word whose frequency rank falls beyond 50000 is treated as unknown. Then build the bidirectional index maps key->id and id->key (a sketch of this step follows the list).
4. Generate a training batch.
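A minimal sketch of step 3's dictionary building, assuming a list of word strings as input (equivalent in spirit to the tutorial's build_dataset; variable names here are illustrative):

import collections

def build_dataset(words, vocabulary_size=50000):
  # Slot 0 is reserved for the unknown token; the rest are the
  # (vocabulary_size - 1) most frequent words, high to low.
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = {}                      # key -> id
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data, unk_count = [], 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0                        # ranked beyond vocabulary_size -> UNK
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))  # id -> key
  return data, count, dictionary, reverse_dictionary

For step 4, the demo uses these settings: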
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
batch_size is how much data each SGD step consumes; embedding_size is the dimensionality of the word vectors; skip_window is the window size; num_skips = 2 caps how many (input, label) pairs are generated from a single input word.
The demo defaults to 2; setting it to 1 makes an instructive comparison.
With the default of 2:
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], '->', labels[i, 0])
  print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
3084 -> 5239
originated -> anarchism
3084 -> 12
originated -> as
12 -> 6
as -> a
12 -> 3084
as -> originated
6 -> 195
a -> term
6 -> 12
a -> as
195 -> 2
term -> of
195 -> 6
term -> a
3084 appears twice on the left-hand side (as the input word), consistent with a window of one word on each side.
When it is set to 1:
batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=1)
for i in range(8):
  print(batch[i], '->', labels[i, 0])
  print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
3084 -> 12
originated -> as
12 -> 3084
as -> originated
6 -> 12
a -> as
195 -> 2
term -> of
2 -> 3137
of -> abuse
3137 -> 46
abuse -> first
46 -> 59
first -> used
59 -> 156
Here 3084 appears only once on the left-hand side.
# Step 4: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [skip_window]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels
In other words, for each center word, num_skips words are drawn at random from within its window, producing a series of (input_id, output_id) pairs used as (batch_instance, label). These are all positive examples.
To prepare for training, the model has two embedding matrices:
the input embedding W
the output embedding W'
The code that follows is straightforward: TF's built-in nce_loss handles the rest, drawing fresh random negative samples each time it is evaluated. In the snippet below, embeddings plays the role of the input embedding W, while nce_weights and nce_biases form the output side W'.
num_sampled = 64  # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default():
  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Construct the variables.
  embeddings = tf.Variable(
      tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  nce_weights = tf.Variable(
      tf.truncated_normal([vocabulary_size, embedding_size],
                          stddev=1.0 / math.sqrt(embedding_size)))
  nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_inputs)

  # Compute the average NCE loss for the batch.
  # tf.nn.nce_loss automatically draws a new sample of the negative labels
  # each time we evaluate the loss.
  loss = tf.reduce_mean(
      tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
                     num_sampled, vocabulary_size))

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
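For completeness, a minimal sketch of the training loop that drives this graph (condensed from the tutorial's later steps; num_steps and the reporting interval are illustrative):

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()  # TF 0.x-era initializer
  average_loss = 0
  for step in range(100001):           # num_steps, illustrative
    batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    # One SGD step; nce_loss draws fresh negative samples internally.
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val
    if step > 0 and step % 2000 == 0:
      print('Average loss at step', step, ':', average_loss / 2000)
      average_loss = 0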
During training, similarities between word vectors are computed by a matrix multiplication of the L2-normalized embedding matrix (i.e. cosine similarity rather than raw Euclidean distance), and for a few high-frequency validation words the closest words are printed.
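A sketch of that similarity computation, mirroring the tutorial's graph-side code (these ops live inside the same with graph.as_default(): block; valid_dataset is the constant defined above):

# Cosine similarity: normalize the embeddings, then one matmul gives the
# similarity of each validation word to every word in the vocabulary.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

At report time, evaluating similarity yields a [num_validation_words, vocabulary_size] matrix; sorting each row in descending order gives that word's nearest neighbors.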
Finally, scikit-learn's TSNE module reduces the embeddings to 2 dimensions, and the result is plotted.
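A minimal sketch of that last step, assuming final_embeddings holds the normalized embeddings evaluated after training (the TSNE hyperparameters and the 500-word cutoff are illustrative):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, init='pca', n_iter=5000)
plot_only = 500  # only the most frequent words, for a readable plot
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])

plt.figure(figsize=(18, 18))
for i in range(plot_only):
  x, y = low_dim_embs[i, :]
  plt.scatter(x, y)
  # Label each point with the word recovered via the id -> key map.
  plt.annotate(reverse_dictionary[i], xy=(x, y), xytext=(5, 2),
               textcoords='offset points')
plt.savefig('tsne.png')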