CS224d Study Summary: Part 1

Reposted article. Originally published 2015-08-23 13:18. Tags: deep learning, notes, NLP, stanford, cs224, CS224D notes

Please credit the source when reposting: http://www.cnblogs.com/NeighborhoodGuo/p/4751759.html

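I got a great deal out of Stanford's CS224d open course. The schedule is tight, the material is rich, and it moves from the basics to advanced topics step by step. It gave me an overview of the various neural network models and a fairly deep understanding of the principles behind NLP.

The course has three parts. Part 1 covers the basic principles of NLP and the fundamentals of DL. The DL fundamentals are also covered in UFLDL, but UFLDL mostly explains them through image-processing applications, while CS224d focuses on NLP applications. Part 2 concentrates on several DL models that work well for NLP. Part 3 covers one more DL model that works well for NLP; most of the remaining sessions were guest lectures by engineers from well-known companies on practical problems of applying DL to NLP, which were very advanced.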

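Part 1 Summary

Word representations

As everyone knows, the basic unit of language is the sentence, and the basic unit of a sentence is the word. So one of the most fundamental questions in NLP is: how do we represent a word?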

One-hot vector

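The simplest representation is the one-hot vector: for a vocabulary of size |V|, build a |V| × 1 vector in which exactly one position is 1 and all the others are 0; each such vector stands for one word. The drawbacks are obvious: 1. The vector length depends on the vocabulary size, so every new word means the vector has to grow, which is a hassle (impossible to keep up to date). 2. It is too subjective. 3. With this many words, it takes human labor to create and adapt, which is scary to even think about. 4. Worst of all, it is very hard to compute the similarity between words.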
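To make the last drawback concrete, here is a minimal sketch, assuming a hypothetical three-word vocabulary that is not from the course. Any two distinct one-hot vectors are orthogonal, so their dot product is always zero no matter how related the words are.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a |V|-dimensional vector with a single 1 at `index`."""
    vec = np.zeros(vocab_size)
    vec[index] = 1.0
    return vec

# Hypothetical toy vocabulary, for illustration only.
vocab = ["hotel", "motel", "banana"]
word_to_id = {word: i for i, word in enumerate(vocab)}

hotel = one_hot(word_to_id["hotel"], len(vocab))
motel = one_hot(word_to_id["motel"], len(vocab))

# Distinct one-hot vectors are orthogonal, so the dot product carries
# no information about how related the two words are.
similarity = float(hotel @ motel)
```

Adding a new word to `vocab` would also change the length of every vector, which is the "impossible to keep up to date" problem.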

Co-occurrence matrix

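A co-occurrence matrix (CM) is another way to represent words. Each row corresponds to a word ID, and each column also corresponds to a word ID. The value at a given position is determined by the "neighbor" relation. What is a neighbor? Hold on, that comes next. Each entry of the CM counts how many times another word appears within a given neighborhood of a word. The CM has serious problems too: first, the matrix grows as the vocabulary grows; second, the dimensionality is too high (equal to the vocabulary size), so it needs a lot of storage; third, it is too sparse, which hurts later classification. What to do? No worries, there are always more solutions than problems. Enter SVD! SVD is a powerful technique from linear algebra that can capture most of the useful information in a matrix using far fewer dimensions.

There are two main ways to define the neighborhood: word-document and word-window. Word-document builds the CM using each document as the basic unit; word-window builds the CM using the words within a fixed distance of each word as the basic unit.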
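The word-window variant followed by SVD can be sketched as below. This is only an illustration under stated assumptions: the three-sentence corpus, the window size of 1, and the choice of k = 2 dimensions are all made up for the example.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=1):
    """Count how often each word appears within `window` positions of every other word."""
    vocab = sorted({w for s in sentences for w in s})
    word_to_id = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[word_to_id[word], word_to_id[sentence[ctx_pos]]] += 1
    return counts, vocab

# Hypothetical toy corpus.
corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
counts, vocab = cooccurrence_matrix(corpus, window=1)

# Truncated SVD: keep only the k directions with the largest singular values.
U, S, Vt = np.linalg.svd(counts)
k = 2
word_vectors = U[:, :k] * S[:k]  # one k-dimensional row per word
```

Each row of `word_vectors` is a dense, low-dimensional stand-in for the corresponding sparse row of `counts`.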

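Improvements to the CM: some words contribute nothing to understanding meaning, and we can simply drop them. These are mostly function words (such as the, a, an, of). Also, so far the value at each position was just a raw count of neighbors, but reality is surely different: the closer a word is to a given word, the stronger the correlation. So the CM should be built with a ramped window, which multiplies each count by a weight that shrinks with distance (the weight is a bell-shaped function).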
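Both improvements can be sketched together. The Gaussian weight and its width `sigma` are assumptions chosen to get a bell shape; the course does not prescribe a specific function. As a simplification, distances are measured after the function words have been removed.

```python
import numpy as np

STOP_WORDS = {"the", "a", "an", "of"}  # the function words named in the text

def ramp_weight(distance, sigma=2.0):
    """Bell-shaped (Gaussian) weight; sigma is an assumed width."""
    return float(np.exp(-(distance ** 2) / (2.0 * sigma ** 2)))

def weighted_cooccurrence(tokens, window=3):
    """Build a symmetric CM where nearer neighbors contribute more."""
    tokens = [t for t in tokens if t not in STOP_WORDS]
    vocab = sorted(set(tokens))
    ids = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        for offset in range(1, window + 1):
            if pos + offset < len(tokens):
                weight = ramp_weight(offset)
                i, j = ids[word], ids[tokens[pos + offset]]
                counts[i, j] += weight
                counts[j, i] += weight
    return counts, vocab

# Hypothetical example sentence.
counts, vocab = weighted_cooccurrence("the cat sat on the mat of a house".split(), window=3)
```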

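word2vec

word2vec differs from the previous methods. Building a CM is a cast-once, done-in-one-shot process (- -), whereas this model is built up gradually, one word at a time. First set the initial parameters, then use them to predict the context of each incoming word. If the predicted context is close to the ground truth, leave the parameters alone; if not, penalize them. (A bit like training a puppy: feed it when it gets things right, give it a smack when it doesn't - -)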
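For reference, the standard skip-gram formulation from the word2vec literature can be written as follows. The notation is an assumption made here: T is the corpus length, c the window size, v_w the input vector of word w, and u_w its output vector.

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)

p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{|V|} \exp(u_w^{\top} v_c)}
```

The first expression is the average log-probability of the neighbors given each center word, which training maximizes; the second defines that probability as a softmax over the whole vocabulary.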

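The core idea of word2vec is the two formulas above. Of course there may be small changes in the details to improve performance and accuracy, but that is a story for later.

Look at the first formula: the core idea is to maximize the probability of the observed neighbors appearing around the center word, ideally making it 1.

The second formula is how the probability at the far right of the first formula is computed. Each word has two representations: an input word vector and an output word vector.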

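Best buddies: Continuous Bag of Words Model (CBOW) and Skip-Gram Model

CBOW and skip-gram are very similar. The difference is that CBOW predicts the center word from its neighbors, while skip-gram predicts the neighbor words from the center word.

The exact formulas are at the end of lecture note 1 from the course.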
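The difference between the two models shows up clearly in the training examples each one consumes. The sketch below is illustrative only; the helper names and the sample sentence are hypothetical.

```python
def window_examples(tokens, C):
    """Yield (center, context) for every position; context holds at most 2*C words."""
    for pos, center in enumerate(tokens):
        left = tokens[max(0, pos - C):pos]
        right = tokens[pos + 1:pos + 1 + C]
        yield center, left + right

def skipgram_pairs(tokens, C):
    """Skip-gram: the center word is the input, each neighbor is a separate target."""
    return [(center, ctx)
            for center, context in window_examples(tokens, C)
            for ctx in context]

def cbow_pairs(tokens, C):
    """CBOW: the whole context is the input, the center word is the target."""
    return [(context, center) for center, context in window_examples(tokens, C)]

sentence = "the quick brown fox".split()
sg = skipgram_pairs(sentence, C=1)  # (input word, target word) pairs
cb = cbow_pairs(sentence, C=1)      # (input context list, target word) pairs
```

Skip-gram produces one example per (center, neighbor) pair, while CBOW produces one example per position.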

Negative Sampling

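Both CBOW and skip-gram use softmax in their computation. The denominator of softmax sums over all possible outputs, and here the number of words is very large, so the computation is expensive. Negative sampling is used to optimize the algorithms above.

P_n(w) is the negative sampling distribution. The goal is to make the score of the correct word large and the scores of other, randomly sampled words paired with the center word small.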
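In the usual formulation, the negative sampling cost for one (center, target) pair looks like this. The symbols are assumptions made here: v_c is the center word's input vector, u_o the target word's output vector, and w_1, ..., w_K are K words drawn from P_n(w).

```latex
J_{\text{neg}}(o, v_c) = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^{\top} v_c), \qquad w_k \sim P_n(w)
```

Because \sigma(-x) = 1 - \sigma(x), the second term can equally be written with \log(1 - \sigma(u_{w_k}^{\top} v_c)). Only K + 1 dot products are needed per pair instead of |V|. A choice of P_n(w) commonly cited from the word2vec paper is the unigram distribution raised to the 3/4 power.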

GloVe

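From the discussion above, we know there are two ways to define a word's neighborhood: word-document and word-window.

Word-document works well for extracting a word's topic (good semantic performance).

Word-window works well for extracting a word's grammatical role (good syntactic performance).

GloVe aims to combine the two, so that the word representation performs well both semantically and syntactically.

These two papers explain it in detail: http://nlp.stanford.edu/pubs/glove.pdf http://www.aclweb.org/anthology/P12-1092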
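As a rough sketch of what the GloVe paper optimizes: it fits word vectors to the logarithm of global co-occurrence counts with a weighted least-squares loss. The weighting function and its defaults (x_max = 100, alpha = 0.75) follow the paper; the function names, the tiny matrix X, and the random initialization are assumptions for illustration, and no training loop is shown.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs and caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares loss over the non-zero entries of the co-occurrence matrix X."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

# Hypothetical 3-word co-occurrence matrix and randomly initialized parameters.
rng = np.random.default_rng(0)
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
V, D = X.shape[0], 4
W, W_ctx = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, b_ctx = np.zeros(V), np.zeros(V)
loss = glove_loss(W, W_ctx, b, b_ctx, X)
```

Training would minimize `loss` with respect to `W`, `W_ctx`, `b`, and `b_ctx`, for example by gradient descent.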

Code implementing CBOW, skip-gram, and negative sampling:

# Implement your skip-gram and CBOW models here
import random

import numpy as np

# Interface to the dataset for negative sampling
dataset = type('dummy', (), {})()

def dummySampleTokenIdx():
    return random.randint(0, 4)

def getRandomContext(C):
    tokens = ["a", "b", "c", "d", "e"]
    return tokens[random.randint(0, 4)], [tokens[random.randint(0, 4)] for i in range(2 * C)]

dataset.sampleTokenIdx = dummySampleTokenIdx
dataset.getRandomContext = getRandomContext

def softmaxCostAndGradient(predicted, target, outputVectors):
    r""" Softmax cost function for word2vec models

    Implement the cost and gradients for one predicted word vector and one
    target word vector as a building block for word2vec models, assuming the
    softmax prediction function and cross entropy loss.

    Inputs:
      - predicted: numpy ndarray, predicted word vector (\hat{r} in the
            written component)
      - target: integer, the index of the target word
      - outputVectors: "output" vectors for all tokens
    Outputs:
      - cost: cross entropy cost for the softmax word prediction
      - gradPred: the gradient with respect to the predicted word vector
      - grad: the gradient with respect to all the other word vectors
    """
    ### YOUR CODE HERE
    # Score u_w . r for every word w. Subtracting the max keeps exp() from
    # overflowing and does not change the softmax.
    scores = np.dot(outputVectors, predicted)
    probs = np.exp(scores - np.max(scores))
    probs /= np.sum(probs)

    cost = -np.log(probs[target])

    # d(cost)/d(scores) = probs - one_hot(target)
    delta = probs.copy()
    delta[target] -= 1.0

    gradPred = np.dot(outputVectors.T, delta)  # shape (D,)
    grad = np.outer(delta, predicted)          # shape (V, D)
    ### END YOUR CODE

    return cost, gradPred, grad

def sigmoid(x):
    """ Logistic function; assumed to match the sigmoid written earlier in the assignment """
    return 1.0 / (1.0 + np.exp(-x))

def negSamplingCostAndGradient(predicted, target, outputVectors, K=10):
    """ Negative sampling cost function for word2vec models

    Implement the cost and gradients for one predicted word vector and one
    target word vector as a building block for word2vec models, using the
    negative sampling technique. K is the sample size.
    dataset.sampleTokenIdx() samples a random word index.
    Input/Output Specifications: same as softmaxCostAndGradient
    """
    ### YOUR CODE HERE
    # A sampled index may coincide with target; it is then also counted as a negative.
    neg_indexes = [dataset.sampleTokenIdx() for k in range(K)]

    r_W = np.dot(predicted, outputVectors.T)
    sigmoid_all = sigmoid(r_W)

    cost = -np.log(sigmoid_all[target]) - np.sum(np.log(1 - sigmoid_all[neg_indexes]))

    gradPred = -outputVectors[target, :] * (1 - sigmoid_all[target])
    gradPred += np.dot(sigmoid_all[neg_indexes], outputVectors[neg_indexes, :])

    grad = np.zeros(np.shape(outputVectors))
    grad[target, :] = -predicted * (1 - sigmoid_all[target])

    # Loop instead of fancy-index assignment so repeated samples accumulate.
    for neg_index in neg_indexes:
        grad[neg_index, :] += predicted * sigmoid_all[neg_index]
    ### END YOUR CODE

    return cost, gradPred, grad

def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors, word2vecCostAndGradient = softmaxCostAndGradient):
    """ Skip-gram model in word2vec

    Inputs:
      - currentWord: a string of the current center word
      - C: integer, context size
      - contextWords: list of no more than 2*C strings, the context words
      - tokens: a dictionary that maps words to their indices in the word
            vector list
      - inputVectors: "input" word vectors for all tokens
      - outputVectors: "output" word vectors for all tokens
      - word2vecCostAndGradient: the cost and gradient function for a
            prediction vector given the target word vectors, could be one
            of the two cost functions implemented above
    Outputs:
      - cost: the cost function value for the skip-gram model
      - grad: the gradient with respect to the word vectors
    """
    ### YOUR CODE HERE
    # inputVectors and outputVectors are V x D; gradIn and gradOut have the same shapes.
    cost = 0.0
    center = tokens[currentWord]
    predicted = inputVectors[center]
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    for contextWord in contextWords:
        target = tokens[contextWord]
        contextCost, contextGradPred, contextGrad = word2vecCostAndGradient(predicted, target, outputVectors)
        cost += contextCost
        # Only the center word's input vector receives a gradient.
        gradIn[center, :] += contextGradPred
        gradOut += contextGrad
    ### END YOUR CODE

    return cost, gradIn, gradOut

def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors, word2vecCostAndGradient = softmaxCostAndGradient):
    """ CBOW model in word2vec

    Implement the continuous bag-of-words model in this function.
    Input/Output specifications: same as the skip-gram model
    """
    ### YOUR CODE HERE
    target = tokens[currentWord]
    # Integer indices of the context words; contextWords may hold fewer than 2*C words.
    context_indices = [tokens[word] for word in contextWords]

    # The prediction vector is the mean of the context words' input vectors.
    h = np.mean(inputVectors[context_indices], axis=0)

    cost, gradH, gradOut = word2vecCostAndGradient(h, target, outputVectors)

    # Each context word contributed 1/len(context) of h, so it receives that
    # share of the gradient; repeated words accumulate.
    gradIn = np.zeros(inputVectors.shape)
    for context_index in context_indices:
        gradIn[context_index] += gradH / len(context_indices)
    ### END YOUR CODE

    return cost, gradIn, gradOut

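Evaluating word vectors

Intrinsic evaluation is a quick, simple way to evaluate a VSM (vector space model). Instead of evaluating inside a full system, it evaluates on a single subtask. It is fast and helps you understand the system, but it does not tell you whether the vectors will also perform well in a real system. The first kind of intrinsic evaluation is syntactic, which has few problems. The second is semantic, which suffers from polysemy and from outdated corpus data; both affect the results. GloVe word vectors have the best intrinsic evaluation results of any model so far. Asymmetric context, which uses only the words in the window to the left, performs poorly. More training time and more data help the evaluation results considerably.

Extrinsic evaluation puts the VSM into a real task. It takes longer, and if performance is poor it is unclear whether the problem lies in the VSM, in other modules, or in their interaction. There is a simple way to check whether the VSM is responsible: replace this subsystem with another one, and if accuracy improves, keep the replacement!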

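Polysemy

What if a word has many meanings? If you simply treat it as a single mean vector, you are effectively adding the vectors of the different meanings together, which is clearly inaccurate. The Notes explain the solution in detail; it is quoted here: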

1. Gather fixed size context windows of all occurrences of the word (for instance, 5 before and 5 after)
2. Each context is represented by a weighted average of the context words’ vectors (using idf-weighting)
3. Apply spherical k-means to cluster these context representations.
4. Finally, each word occurrence is re-labeled to its associated cluster and is used to train the word representation for that cluster.

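In short: use k-means to cluster similar contexts, assign each centroid its own copy of the word, relabel each context with that copy, and finally train with the usual method from before. This solves the polysemy problem.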
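Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the method from the Notes in full: the context representations are made-up 2-D vectors standing in for the idf-weighted averages of step 2, the word "bank" is a hypothetical example, and the spherical k-means here is a bare-bones version that assigns points by cosine similarity.

```python
import numpy as np

def normalize_rows(M):
    """Scale every row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, 1e-12)

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster rows of X by cosine similarity; returns (labels, unit-length centroids)."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(X)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # On unit vectors, the largest dot product is the largest cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.sum(axis=0)
        centroids = normalize_rows(centroids)
    return labels, centroids

# Hypothetical context representations for six occurrences of one ambiguous word.
contexts = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                     [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]])
labels, centroids = spherical_kmeans(contexts, k=2)

# Step 4: relabel each occurrence with its cluster, e.g. "bank_0" or "bank_1".
relabelled = ["bank_%d" % label for label in labels]
```

The relabelled tokens would then be fed to the ordinary training procedure, so each sense ends up with its own vector.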