4 Source-code analysis of the nce_loss negative-sampling loss function used by the word2vec skip-gram model

This article is a repost. 2018-07-09 16:55. Category: TensorFlow in practice

tf.nn.nce_loss is the function that implements negative sampling for the word2vec skip-gram model. Its source code is analyzed below.

1 Context code

loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

where

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

train_inputs holds the center words, and train_labels holds the context words that fall inside the sliding window around each center word in the corpus.

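As a result, train_inputs contains runs of n-1 consecutive identical elements (n being the sliding-window size), all referring to the same center word. This holds when every context word in the window is paired with the center word.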
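
The repetition of the center word can be seen in a small sketch. This is not the tutorial's generate_batch; skipgram_pairs is a hypothetical helper that simply pairs each center word with every word in its window, and the word ids are made up.

```python
def skipgram_pairs(word_ids, skip_window):
    """Return (inputs, labels) using every context word in the window."""
    inputs, labels = [], []
    for center in range(skip_window, len(word_ids) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # skip the center word itself
            inputs.append(word_ids[center])
            labels.append([word_ids[center + offset]])  # one label per row
    return inputs, labels


# Hypothetical word ids for a six-word sentence; window size n = 3.
inputs, labels = skipgram_pairs([10, 11, 12, 13, 14, 15], skip_window=1)
print(inputs)  # [11, 11, 12, 12, 13, 13, 14, 14]
print(labels)  # [[10], [12], [11], [13], [12], [14], [13], [15]]
```

With skip_window=1 the window size is n = 3, so each center word appears n-1 = 2 times in a row, and labels has the [batch_size, 1] layout that train_labels expects.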

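embeddings is the word-embedding matrix that stores the word vectors to be learned. It has one row per vocabulary word, and each row is the vector of one word.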

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

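nce_weights stores the output vectors v'_{w_O} and v'_{w_i} that appear in the standard negative-sampling objective:

$$ \log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^\top v_{w_I}) \right] $$

Here v_{w_I} is the vector of the center word, w_O is the true context word, and w_i are the k sampled negative words.

The sigmoid function has a symmetry property:

$$ \sigma(-x) = 1 - \sigma(x) $$

This is why no 1-XX term appears in the formula above. Writing it in the 1-XX form may be easier to understand.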
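
The sigmoid symmetry, sigma(-x) = 1 - sigma(x), can be checked numerically. The sketch below writes the negative-sampling objective both ways, once with sigma(-s) and once with the explicit 1 - sigma(s) form, and compares them. The function names and the scores are illustrative assumptions, not part of TensorFlow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neg_sampling_objective(true_score, sampled_scores):
    """log sigma(s_true) + sum_i log sigma(-s_i), to be maximized."""
    return math.log(sigmoid(true_score)) + sum(
        math.log(sigmoid(-s)) for s in sampled_scores)


def neg_sampling_objective_one_minus(true_score, sampled_scores):
    """The same objective written with the explicit 1 - sigma(s_i) form."""
    return math.log(sigmoid(true_score)) + sum(
        math.log(1.0 - sigmoid(s)) for s in sampled_scores)


# Hypothetical scores (dot products) for one true word and three negatives.
a = neg_sampling_objective(2.0, [-1.0, 0.5, -0.3])
b = neg_sampling_objective_one_minus(2.0, [-1.0, 0.5, -0.3])
print(abs(a - b) < 1e-12)
```

The two forms agree up to floating-point rounding, which is why the compact form with sigma(-s) can replace the 1 - sigma(s) form.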

The details are as follows:

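- train_inputs holds the ids of the center words. An id is a word's index in the vocabulary, usually assigned in order of decreasing word frequency.
- train_labels holds the ids of the words in the context of each center word. These all count as positive samples. Note that this differs from the usual machine-learning meaning of a positive sample: here it means the correct answer.
- embedding_lookup fetches the rows at the given indices. Indices start at 0.
- tf.truncated_normal outputs random values from a truncated normal distribution. The values follow a normal distribution with the given mean and standard deviation, except that any value more than 2 standard deviations from the mean is dropped and re-drawn. The standard deviation is the square root of the variance. The code above limits the standard deviation to keep the network's parameters from becoming too large. Why is embeddings not limited in the same way? Because at initialization the word vectors should keep some distance from one another; learning then pulls related words together so that their vectors become closer.
- Because this is a single-layer network, the parameters must be kept from growing too large, hence the division of the standard deviation by the square root of embedding_size. Deep networks also constrain their parameters through regularization, to counter overfitting and exploding gradients. There the stddev is usually not set by hand like this but chosen by a fan-in-based initializer such as Xavier/Glorot or He initialization, which applies the same 1/sqrt(fan_in) idea.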
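
The truncated-normal initialization described in the list can be sketched with rejection sampling. This is a plain-Python illustration of the behavior, not TensorFlow's implementation; truncated_normal is a hypothetical helper and embedding_size = 128 is an assumed value.

```python
import math
import random


def truncated_normal(mean, stddev, rng):
    """Draw from N(mean, stddev^2), re-drawing values beyond 2 stddev."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value


embedding_size = 128  # assumed size, for illustration only
stddev = 1.0 / math.sqrt(embedding_size)
rng = random.Random(0)

# One row of a weight matrix initialized like nce_weights.
row = [truncated_normal(0.0, stddev, rng) for _ in range(embedding_size)]
print(max(abs(v) for v in row) <= 2.0 * stddev)
```

Every value lies within 2 standard deviations of the mean by construction, so with stddev = 1/sqrt(128) no initial weight exceeds about 0.177 in magnitude.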

2 nce_loss source code

def nce_loss(weights,
             biases,
             labels,
             inputs,
             num_sampled,
             num_classes,
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss"):

  logits, labels = _compute_sampled_logits(
      weights=weights,
      biases=biases,
      labels=labels,
      inputs=inputs,
      num_sampled=num_sampled,
      num_classes=num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = sigmoid_cross_entropy_with_logits(
      labels=labels, logits=logits, name="sampled_losses")
  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)

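The core is the arguments passed to sigmoid_cross_entropy_with_logits. For a binary-classification network with a single output node, sigmoid_cross_entropy_with_logits is easiest to understand: logits has shape [batch_size, 1], and labels has the same shape, with each element being 0 or 1.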

Now look at sigmoid_cross_entropy_with_logits. Its return value is documented as:

  Returns:
    A `Tensor` of the same shape as `logits` with the componentwise
    logistic losses.

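In other words, when logits has shape [batch_size, 1], the returned tensor also has shape [batch_size, 1]. Each element is the loss computed by the following formula, where x is the logit and z is the label:

$$ \text{loss} = -z \log \sigma(x) - (1 - z) \log(1 - \sigma(x)) $$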
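
The element-wise logistic loss can be written out in a few lines. The sketch below computes it in the naive form, z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)), and in the numerically stable form given in the TensorFlow documentation, max(x, 0) - x * z + log(1 + exp(-|x|)). The function names and test values are illustrative.

```python
import math


def sigmoid_cross_entropy(x, z):
    """Naive form: z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return -z * math.log(s) - (1.0 - z) * math.log(1.0 - s)


def sigmoid_cross_entropy_stable(x, z):
    """Equivalent form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))


# x is a logit, z is its 0/1 label.
pairs = [(2.0, 1.0), (2.0, 0.0), (-1.5, 1.0), (0.0, 0.0)]
for x, z in pairs:
    print(x, z, sigmoid_cross_entropy(x, z), sigmoid_cross_entropy_stable(x, z))
```

Both forms give the same value for moderate logits; the second avoids overflow in exp for large |x|.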

With negative sampling, however, the logits passed in do not have shape [batch_size, 1] but `[batch_size, num_true + num_sampled]`. The key is the output of _compute_sampled_logits, which is documented as follows:

  Returns:
    out_logits: `Tensor` object with shape
        `[batch_size, num_true + num_sampled]`, for passing to either
        `nn.sigmoid_cross_entropy_with_logits` (NCE) or
        `nn.softmax_cross_entropy_with_logits` (sampled softmax).
    out_labels: A Tensor object with the same shape as `out_logits`.
  """

Its input arguments are documented as:

    labels: A `Tensor` of type `int64` and shape `[batch_size,
        num_true]`. The target classes.  Note that this format differs from
        the `labels` argument of `nn.softmax_cross_entropy_with_logits`.
    inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward
        activations of the input network.
    weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
        objects whose concatenation along dimension 0 has shape
        `[num_classes, dim]`.  The (possibly-partitioned) class embeddings.

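So what does _compute_sampled_logits do? For each sample it computes a row of num_true + num_sampled values, so the whole output has shape [batch_size, num_true + num_sampled]. Each element has the same meaning as an element of logits before: it is an output value. The function also returns labels with the same shape [batch_size, num_true + num_sampled]. With num_true=1, each row of labels contains exactly one 1, in column 0. Now look again at the negative-sampling objective:

$$ \log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^\top v_{w_I}) \right] $$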

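In fact, the elements of out_logits where the label is 0 are the scores {v'_{w_i}}^\top v_{w_I} of the sampled words, and the element where the label is 1 is the score {v'_{w_O}}^\top v_{w_I} of the true word. In the implementation each score also has the word's bias added and, because subtract_log_q=True, the log of the word's expected sampling count subtracted.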
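
How one row of out_logits and out_labels is put together can be sketched as follows, for num_true = 1. This is a simplified plain-Python illustration, not the TensorFlow code: sampled_logit_row is a hypothetical helper, and the weights, biases, and expected counts are made-up values standing in for nce_weights, nce_biases, and the sampler's expected counts.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sampled_logit_row(h, weights, biases, true_id, sampled_ids, expected_count):
    """Build one row [true logit, sampled logits...] and its labels."""
    def logit(word_id):
        # w . h + b - log Q(w): the last term mirrors subtract_log_q=True.
        return (dot(weights[word_id], h) + biases[word_id]
                - math.log(expected_count[word_id]))

    logits = [logit(true_id)] + [logit(i) for i in sampled_ids]
    labels = [1.0] + [0.0] * len(sampled_ids)  # the true word sits in column 0
    return logits, labels


# Hypothetical 4-word vocabulary with 2-dimensional vectors.
weights = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5]]
biases = [0.0, 0.0, 0.0, 0.0]
expected_count = [0.5, 0.25, 0.15, 0.10]
h = [1.0, 2.0]  # embedding of the center word

logits, labels = sampled_logit_row(h, weights, biases, 1, [0, 3], expected_count)
print(logits, labels)
```

The row has num_true + num_sampled = 3 entries, and only the first label is 1.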

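These are then passed to sigmoid_cross_entropy_with_logits, which again applies the following formula at every element position (x is the logit, z is the label):

$$ \text{loss} = -z \log \sigma(x) - (1 - z) \log(1 - \sigma(x)) $$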

So the call to sigmoid_cross_entropy_with_logits inside nce_loss returns a tensor of shape [batch_size, num_true + num_sampled], in which every element is a loss computed with the formula above.

The last step of nce_loss is _sum_rows:

def _sum_rows(x):
  """Returns a vector summing up each row of the matrix x."""
  # _sum_rows(x) is equivalent to math_ops.reduce_sum(x, 1) when x is
  # a matrix.  The gradient of _sum_rows(x) is more efficient than
  # reduce_sum(x, 1)'s gradient in today's implementation. Therefore,
  # we use _sum_rows(x) in the nce_loss() computation since the loss
  # is mostly used for training.
  cols = array_ops.shape(x)[1]
  ones_shape = array_ops.stack([cols, 1])
  ones = array_ops.ones(ones_shape, x.dtype)
  return array_ops.reshape(math_ops.matmul(x, ones), [-1])

Finally, applying reduce_mean to the result of nce_loss gives the average loss over a batch.
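
The last two steps, summing each row and then averaging over the batch, can be sketched in plain Python. The logits and labels below are made-up values with num_true = 1 and num_sampled = 2, and nce_loss_rows is a hypothetical stand-in for the combination of sigmoid_cross_entropy_with_logits and _sum_rows.

```python
import math


def sigmoid_cross_entropy(x, z):
    """Element-wise logistic loss in its numerically stable form."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))


def nce_loss_rows(out_logits, out_labels):
    """Per-example loss: sum the element-wise losses along each row."""
    return [sum(sigmoid_cross_entropy(x, z) for x, z in zip(logit_row, label_row))
            for logit_row, label_row in zip(out_logits, out_labels)]


# Hypothetical batch of 2 examples: column 0 is the true word.
out_logits = [[2.0, -1.0, 0.5], [0.3, -2.0, -0.7]]
out_labels = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

per_example = nce_loss_rows(out_logits, out_labels)  # one value per example
loss = sum(per_example) / len(per_example)           # plays the role of reduce_mean
print(per_example, loss)
```

per_example has one entry per batch element, matching the vector that nce_loss returns, and loss is the scalar that gets minimized.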

How _compute_sampled_logits draws the samples and computes the logits is not covered here; it matches the theory as described in the literature.

We change the following code in _compute_sampled_logits

    # Construct output logits and labels. The true labels/logits start at col 0.
    out_logits = array_ops.concat([true_logits, sampled_logits], 1)

    # true_logits is a float tensor, ones_like(true_logits) is a float
    # tensor of ones. We then divide by num_true to ensure the per-example
    # labels sum to 1.0, i.e. form a proper probability distribution.
    out_labels = array_ops.concat([
        array_ops.ones_like(true_logits) / num_true,
        array_ops.zeros_like(sampled_logits)
    ], 1)

to

    out_logits = array_ops.concat([true_logits, sampled_logits], 1,
                                  name="xiaojie_logits")

    # true_logits is a float tensor, ones_like(true_logits) is a float
    # tensor of ones. We then divide by num_true to ensure the per-example
    # labels sum to 1.0, i.e. form a proper probability distribution.
    out_labels = array_ops.concat([
        array_ops.ones_like(true_logits) / num_true,
        array_ops.zeros_like(sampled_logits)
    ], 1, name="xiaojie_labels")

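Because this code sits inside
  with ops.name_scope(name, "compute_sampled_logits",
                      weights + [biases, inputs, labels]):
it lives under the name scope given by name, and name is "nce_loss" here. The two tensors can therefore be looked up as "nce_loss/xiaojie_logits:0" and "nce_loss/xiaojie_labels:0".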

We add the following code to the training loop of the word2vec program:
  for step in range(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
    print("xiaojie Debug:")
    # Look up the two renamed tensors in the graph and evaluate them.
    xiaojie_logits = session.graph.get_tensor_by_name("nce_loss/xiaojie_logits:0")
    xiaojie_labels = session.graph.get_tensor_by_name("nce_loss/xiaojie_labels:0")
    xiaojie_logits_value, xiaojie_labels_value = session.run(
        [xiaojie_logits, xiaojie_labels], feed_dict=feed_dict)
    print(xiaojie_logits_value, xiaojie_labels_value)

The printed output shows exactly what gets passed to sigmoid_cross_entropy_with_logits.
