CS224n學習筆記(一)

本文轉載自查看原文 2019-04-28 17:04 872

How do we have usable meaning in a computer?

Represents the words as discrete symbols, (離散型變量)

Use the one-hot vector to represent the word in sentence, (Traditional way, we can use Distributional semantics)

Distributional semantics: A word's meaning is giving by words that frequently appear close-by.

when a word w appear in a text, its context words is the set of words that appears nearby. We can use many contexts of w to build up a representation of w.

Word Vector:We build a dense(稠密的) vector for each word, chosen so that it's similar to vector of words that appear in similar contexts, it's a probability vector .

所有的句子都表示 banking的意思。

\[bank =\left( \begin{array}{r}{0.286} \\ {0.792} \\ {-0.177} \\ {-0.107} \\ {0.109} \\ {-0.549} \\ {0.371}\end{array}\right) \]

單詞向量有時候又叫做word embedding 或者 word representation。上面是分布式 distributed representation。不僅僅是一個簡單的位置向量，每一個單詞都有一個 distributed representation，就構成了向量空間。

Word2vec是一個框架用於學習單詞向量。

主要思想：

有一大簇的文本
每一個單詞在一個確定的詞典中可以用向量表示
Go through文本中的每一個位置，這個文本有一個中心單詞 C 以及 context(“outside 單詞”) O。我們要觀察這個單詞周圍的單詞。
使用單詞 C與單詞 O的單詞向量的相似性來計算給定的 O，是C的概率
調整這個單詞向量，直到最大化這個概率

使用中間的單詞可以預測出周圍的單詞

例如對於下面這個例子來說：

我們需要計算的是 $P\left(w_{t+j} | w_{t}\right)$。

我們希望的是通過周圍的單詞預測出中間的單詞 ‘into’ 。所以我們可以改變這些預測，如果我們從貝葉斯的角度看的話，那么就是我們先計算出當遇到 'into' 的時候，周圍每個單詞的概率作為先驗，然后當遇到周圍的單詞的時候，用后驗就可以算出單詞是 ‘into’的概率。

對於 Word2vec的目標函數函數

對於文本中的每一個位置，$t=1, \dots, T$來說預測用一個確定大小為 m的窗口預測 context words，對於中心單詞 $w_{j}$。目標函數就是

\[LikeLihhod =L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m \atop j \neq 0} P\left(w_{t+j} | w_{t} ; \theta\right) \]

m 表示窗口的大小，$L(\theta)$ 是關於最優化 $\theta$的函數，那么 $J(\theta)$就是 negative log 似然函數。$\theta$ 恰恰就是單詞向量

\[J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m \atop j \neq 0} \log P\left(w_{t+j} | w_{t} ; \theta\right) \]

上面的 $J(\theta)$ 是損失函數，也就是目標函數。最小化目標函數等價於最大化預測的准確率。

這個 $\theta$ is actually going to be the vector representation. Use that word to predict what the other words occur

那么問題來了，怎么計算 P函數。

對於每個word，我們有兩種表現方法，分別是：

$v_{w}$ when $w$ is a center word
$u_{w}$ when $w$ is a context word

那么對於中心單詞 c 於 context word o來說：

\[P(o | c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \]

對於前面的例子來說就是：$P\left(u_{\text {problems}} | v_{i n t o}\right)$ short for $\mathrm{P}\left(problems|into; u_{\text {problems}}, v_{i n t o}, \theta\right)$

上面公式的解釋，我們使用的是 $u_{o}^{T} v_{c}$來表示權重的大小。也就是兩個單詞間的關聯程度，$u_{o}^{T} v_{c}$越大越相關，（從向量的角度解釋）。分母是所有的情況。然后是一個 $Softmax$函數：

\[\operatorname{softmax}\left(x_{i}\right)=\frac{\exp \left(x_{i}\right)}{\sum_{j=1}^{n} \exp \left(x_{j}\right)}=p_{i} \]

這個比較容易理解。這個公式也用於后面的CBOW 以及 skip-gram計算損失函數時候。

梯度下降解最小化損失函數

我們的參數只有一個 $\theta$，但是請記住 $\theta$ 包含中心向量與 context word向量，所以是兩截長。這里我們只是用 $\theta$ 來模糊的表示位置參數，那么 $\theta$ 到底是一個什么樣的參數呢？在 CBOW 與 skip-gram中可以理解為兩個矩陣。這個后面再說。所以用梯度下降求解損失函數也放在后面。

\[\theta=\left[ \begin{array}{l}{v_{\text {aardvark}}} \\ {v_{a}} \\ {\vdots} \\ {v_{z e b r a}} \\ {u_{\text {aardvark}}} \\ {u_{a}} \\ {\vdots} \\ {u_{z e b r a}}\end{array}\right] \in \mathbb{R}^{2 d V} \]

基於 SVD的方法獲得單詞向量

在具體使用word2vec之前，我們先講一下使用 SVD(奇異值分解)的方法，這個比較簡單，但是這個方法十分有用，潛在語義索引(LSI) 就可以使用這種方法。we first loop over(遍歷) a massive dataset and accumulate word co-occurrence counts in some form of a matrix X。然后對 X矩陣使用 SVD（奇異值分解），獲得 $US V^{T}$。我們使用 U的行向量作為詞典中所有單詞的單詞向量。

Word-Document Matrixs

構造 X的矩陣的方法如下，遍歷所有的文件(billions of)，word i 每出現在文檔 j 中一次，那么 $X_{i j}$就+ 1. 可以看出 $\left(\mathbb{R}^{|V| \times M}\right)$這個矩陣超級大。

Window based Co-occurrence Matrix

這種方法記錄的不是單詞與文本之間的聯系，而是單詞於單詞之間，共同出現的情況下的聯系：

比如說圖中的例子：

Applying SVD to the cooccurrence matrix

對上面的關系矩陣使用 SVD變換，我們的做法就是提取了前 k個特征向量，用來表示單詞向量。

\[\frac{\sum_{i=1}^{k} \sigma_{i}}{\sum_{i=1}^{|V|} \sigma_{i}} \]

下面的圖可以看到具體的方法，奇異值分解后得到三個矩陣，對中間的 S矩陣取前 k個對角線的數值。

所以前面的 U矩陣的列向量只有 k個，這樣就降低了單詞向量的維度。這個方法還面臨這許多問題。

基於迭代的方法 - Word2vec

兩個算法：CBOW 和 skip-gram

兩種訓練方法：negative sampling and hierarchical softmax

Language Models

首先需要理解，我們是將一個概率賦給一個序列的單詞。從數學上表示，$P\left(w_{1}, w_{2}, \cdots, w_{n}\right)$。如果我們假設每個單詞的出現是完全獨立的，那么：

\[P\left(w_{1}, w_{2}, \cdots, w_{n}\right)=\prod_{i=1}^{n} P\left(w_{i}\right) \]

這個假設顯然不合理，因為大多數情況下，句子中，單詞間有一定的關系，比如說，‘read’ 后面往往是 ‘book’ 或者是 ‘maganize’ 等。所以我們將這個句子的概率表示成，句子中出現的成對的單詞，表示成這個單詞與他下一個單詞為一對，可以寫成這樣：

\[P\left(w_{1}, w_{2}, \cdots, w_{n}\right)=\prod_{i=2}^{n} P\left(w_{i} | w_{i-1}\right) \]

以上就是我們表示一個序列的概率的例子。

Continuous Bag of Words Model (CBOW)

通過上下文預測中間單詞。

對於詞袋模型，我們輸入，(上下文的one-hot表示的單詞向量)用 $x^{(c)}$來表示。輸出用 $y^{(c)}$來表示，我們訓練的時候，輸出只有一個 $y$ 表示中心詞。下面定義未知數：

首先定義兩個矩陣：

\[\mathcal{V} \in \mathbb{R}^{n \times|V|} \text { and } \mathcal{U} \in \mathbb{R}^{|V| \times n} \]

這里面 n 定義了單詞向量空間的大小， $\mathcal{V}$是輸入矩陣，$\mathcal{V}$ 的第 i 列是對於輸入單詞 $w_{i}$ 所表示的 n 維向量。我們表示成 $n \times 1$ 向量 $v_{i}$。同理對於輸出，$\mathcal{U}$ 是輸出矩陣，輸出矩陣 $\mathcal{U}$ 的第 j 行是一個 n 維的單詞向量作為單詞 $w_{i}$ 的輸出向量。我們表示成 $u_{j}$。為什么可以這么說呢？從線性代數的角度來說，我們最開始的輸入是 one-hot編碼的 $x^{(c)}$，我們可以理解為矩陣 $\mathcal{V}$ 對向量 $x^{(c)}$ 的線性變換。 $v_{i}$ 完全由參數 $i$ 與矩陣 $\mathcal{V}$ 決定，所以我們可以直接當成是輸入，這里巧妙的就是，我們最后優化的，也是這個 $v_{i}$ 。

下面是具體的步驟，

對於輸入大小為 m 的上下文，我們產生一個one-hot 編碼為

\[\left(x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)} \in \mathbb{R}^{|V|}\right) \]
這個上下文的單詞向量就是$(v_{c-m}=\mathcal{V} x^{(c-m)}, v_{c-m+1}=\mathcal{V} x^{(c-m+1)}, \ldots, v_{c+m}=\mathcal{V} x^{(c+m)} \in \mathbb{R}^{n} )$
將這些向量求平均值得到：

\[\hat{v}=\frac{v_{c-m}+v_{c-m+1}+\ldots+v_{c+m}}{2 m} \in \mathbb{R}^{n} \]
產生一個用來表示分數的向量 $z=\mathcal{U} \hat{v} \in \mathbb{R}^{|V|}$，而實際上打分的是矩陣 $\mathcal{U}$ 。由於相似矢量的點積更高，那么我們最大化點積的話，相似的單詞就會靠的越近。
將分數通過 softmax函數：$\hat{y}=\operatorname{softmax}(z) \in \mathbb{R}^{|V|}$
我們希望得分最高的是中心詞，訓練的中心詞我們用 $y \in \mathbb{R}|V|$表示，而我們預測的結果用 $\hat{y} \in \mathbb{R}|V|$來表示。

現在我們要得到矩陣 $\mathcal{V}$ 和 $\mathcal{U}$ ，那么我們就要一個目標函數，然后最小化損失函數就可以了。我們使用信息學中的cross-entropy(交叉熵)損失函數，$H(\hat{y}, y)$。從離散的角度，我們可以寫成下面的公式，$y$ 向量的每一個位置都要最為離散的偏差算進來。

\[H(\hat{y}, y)=-\sum_{j=1}^{|V|} y_{j} \log \left(\hat{y}_{j}\right) \]

但是 $y$ 是一個one-hot編碼，所以 $H(\hat{y}, y)=-y_{i} \log \left(\hat{y}_{i}\right)$。最理想的情況下 $\hat{y}_{i}$ = 1，與 $y_{i}$ 是相同的。但是預測的會有偏差。如果 $\hat{y}_{i} < 1 $那么 $H(\hat{y}, y)$ > 0。因此最優化，我們寫成下式：

\[最小化 J = \]

\[\begin{array}{l}{=-\log P\left(w_{c} | w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\right)} \\ {=-\log P\left(u_{c} | \hat{v}\right)} \\ {=-\log \frac{\exp \left(u_{c}^{T} \hat{v}\right)}{\sum_{j=1}^{|V|} \exp \left(u_{j}^{T} \hat{v}\right)}} \\ {=-u_{c}^{T} \hat{v}+\log \sum_{j=1}^{|V|} \exp \left(u_{j}^{T} \hat{v}\right)}\end{array} \]

然后使用隨機梯度下降更新 $u_{c}$ and $v_{j}$。上述式子在理解的過程中，關鍵的地方是要知道還未經過 $softmax$ 函數的時候，我們用 $u_{c}$來表示 $w_{c}$。而輸入使用的是 $\mathcal{V}$ 的列向量表示的，處理之后，用 $\hat{v}$來表示。

Skip-Gram Model

該算法與 CBOW的思路相反。是通過中心詞預測上下文的單詞向量：我們輸入用 $x$ 表示，輸出用 $y^{(j)}$ 來表示，我們定義相同的矩陣 $\mathcal{V}$ 和 $\mathcal{U}$ 。該算法也分成六部完成：

中心單詞的 one-hot編碼，用 $x \in \mathbb{R}^{|V|}$來表示。
我們產生單詞向量 $v_{c}=V x \in\mathbb{R}^{n}$。這一步使用了矩陣 $\mathcal{V}$ 。
然后使用 $z=\mathcal{U} v_{c}$ 得到，$z$ 是一個矩陣，用來表示每個預測的 $y$。
然后使用 softmax函數，$\hat{y}=\operatorname{softmax}(z)$。然后假設 $\hat{y}_{c-m}, \dots, \hat{y}_{c-1}, \hat{y}_{c+1}, \dots, \hat{y}_{c+m}$ 是觀察到每一個文本向量的概率。
我們希望的是我們產生的概率 $\hat{y}_{c-m}, \dots, \hat{y}_{c-1}, \hat{y}_{c+1}, \dots, \hat{y}_{c+m}$ 與實際的概率 $y^{(c-m)}, \ldots, y^{(c-1)}, y^{(c+1)}, \ldots, y^{(c+m)}$相等。實際的概率就是每一個單詞的 one-hot編碼。

接下來就是定義目標函數以及最小化損失函數：$J$ =

\[\begin{aligned} J &=-\sum_{j=0, j \neq m}^{2 m} \log P\left(u_{c-m+j} | v_{c}\right) \\ &=\sum_{j=0, j \neq m}^{2 m} H\left(\hat{y}, y_{c-m+j}\right) \end{aligned} \]

\[\begin{array}{l}{=-\log P\left(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} | w_{c}\right)} \\ {=-\log \prod_{j=0, j \neq m}^{2 m} P\left(w_{c-m+j} | w_{c}\right)} \\ {=-\log \prod_{j=0, j \neq m}^{2 m} P\left(u_{c-m+j} | v_{c}\right)} \\ {=-\log \prod_{j=0, j \neq m}^{2 m} \frac{\exp \left(u_{c-m+j}^{T} v_{c}\right)}{\sum_{k=1}^{|V|} \exp \left(u_{k}^{T} v_{c}\right)}} \\ {=-\log \prod_{j=0, j \neq m}^{2 m} u_{c-m+j}^{T} v_{c}+2 m \log \sum_{k=1}^{|V|} \exp \left(u_{k}^{T} v_{c}\right)}\end{array} \]

上面的損失函數就比較好理解了。

Negative Sampling

前面計算面臨的問題，在 $softmax$ 階段，計算量與單詞向量的長度成 $O(|V|)$關系，而 $softmax$ 計算公式也很復雜

\[\sigma(\mathbf{z})_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}} \quad \text { for } j=1, \ldots, K \]

所以計算太復雜了，為了簡化計算，我們能否不遍歷語料庫呢？

什么是Negative Sampling(負采樣)

對於一個樣本，在之前的樣本中，我們都是將與正確的結果作為訓練結果拿去訓練的。對於負采樣，在語言模型中，假設中心詞為 $w$, 他周圍的上下文共有 $2m$ 個單詞，記為 $context(w)$ ,那個 $w$ 與 $context(w)$ 的對應就是一個正例，而負例（可以理解為錯誤的例子）就是我們取 $n$ 個與 $w$ 不同的單詞 $w_{i}, i=1,2, \ldots n$, 這些 $w_{i}$ 與 $context(w)$ 就構成了負樣本，然后利用分類的思想，這就相當於一個二元分類問題，通過最優化這個分類問題求解模型的參數。

對於一個樣本是正樣本，我們用邏輯回歸的模型計算是正樣本的概率，這里我們用 $v_{c}^{T} v_{w}$ 表示的是單詞向量 $v_{c}$ 與文本向量 $v_{w}$ 的點積，點積越大，越相關，那么 $P = 1$ 的概率就越大：

\[P(D=1 | w, c, \theta)=\sigma\left(v_{c}^{T} v_{w}\right)=\frac{1}{1+e^{\left(-v_{c}^{T} v_{w}\right)}} \]

Negative Sampling下的損失函數

假設我們的參數還用 $\theta$ 來表示的話，分別用 $D$ 與 $D'$ 表示正負樣本集合

\[\theta=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 | w, c, \theta) \prod_{(w, c) \in D'} P(D=0 | w, c, \theta) \]

\[=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 | w, c, \theta) \prod_{(w, c) \in D'}(1-P(D=1 | w, c, \theta)) \]

\[\begin{array}{l}{=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log P(D=1 | w, c, \theta)+\sum_{(w, c) \in D'} \log (1-P(D=1 | w, c, \theta))} \\ {=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}+\sum_{(w, c) \in \tilde{D}} \log \left(1-\frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}\right)}\end{array} \\ =\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}+\sum_{(w, c) \in D'} \log \left(\frac{1}{1+\exp \left(u_{w}^{T} v_{c}\right)}\right) \]

在上面的式子中，我們又將參數 $\theta$ 替換成了矩陣 $\mathcal{V}$ 和 $\mathcal{U}$ 列向量與行向量，分別表示輸入與輸出。

最大化似然函數，等價於最小化下面的目標函數：

\[J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}-\sum_{(w, c) \in D'} \log \left(\frac{1}{1+\exp \left(u_{w}^{T} v_{c}\right)}\right) \]

那么對於 skip-gram算法來說

對於每一個中心詞 C 來說：(這里沒有遍歷每一個中心詞)

\[J = -\log \sigma\left(u_{c-m+j}^{T} \cdot v_{c}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot v_{c}\right) \]

對於CBOW算法，$\hat{v}=\frac{v_{c-m}+v_{c-m+1}+\ldots+v_{c+m}}{2 m}$ 還是用這個公式，所以損失函數就可以寫成：

\[J = -\log \sigma\left(u_{c}^{T} \cdot \hat{v}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot \hat{v}\right) \]

Hierarchical Softmax

分層 $Softmax$ 在用於非連續的單詞向量的時候，表現比較好。而負采樣用於連續的單詞向量的時候表現比較好。分層 $Softmax$ 使用的是哈夫曼樹。

我們使用葉子結點代表每一個單詞，單詞的概率定位為從根節點隨機走到該節點的概率。由於是二叉樹，我們結算復雜度為 $O(\log (|V|))$。這樣就降低了模型的復雜度，而我們要學習的參數，就是除了葉節點之外的權重。我們用 $L(w)$ 表示從根結點到葉結點的長度，比如圖中的 $L(w2)$ = 3. 而我們用 $n(w, i)$ 表示路徑上的第 $i$ 個結點，對應於向量中的參數就是 $v_{n(w, i)}$。對於結點 $n$ 我們用 $\operatorname{ch}(n)$ 表示 $n$ 的孩子結點，那么葉結點是 $w$的概率就是：

\[P\left(w | w_{i}\right)=\prod_{j=1}^{L(w)-1} \sigma\left([n(w, j+1)=\operatorname{ch}(n(w, j))] \cdot v_{n(w, j)}^{T} v_{w_{i}}\right) \]

其中

\[[x]=\left\{\begin{array}{l}{1 \text { if } x \text { is true }} \\ {-1 \text { otherwise }}\end{array}\right. \]

上面的式子，對於某個結點 $i$ ,如果 $n(w, i+1)$ 就是 $n(w, i)$ 的孩子的節點的的話，那么 $[x] = 1$，否則 $[x] = -1$ ,這里的 +1 或者 -1 僅僅是用來表示葉結點的方向。對於上圖中的例子來說就是：

\[\begin{aligned} P\left(w_{2} | w_{i}\right) &=p\left(n\left(w_{2}, 1\right), \text { left }\right) \cdot p\left(n\left(w_{2}, 2\right), \text { left }\right) \cdot p\left(n\left(w_{2}, 3\right), \text { right) }\right.\\ &=\sigma\left(v_{n\left(w_{2}, 1\right)}^{T} v_{w_{i}}\right) \cdot \sigma\left(v_{n\left(w_{2}, 2\right)}^{T} v_{w_{i}}\right) \cdot \sigma\left(-v_{n\left(w_{2}, 3\right)}^{T} v_{w_{i}}\right) \end{aligned} \]

而且我們可以保證：

\[\sum_{w=1}^{|V|} P\left(w | w_{i}\right)=1 \]

可以看出，對於每一次更新語料庫，不需要重新計算全部，而是計算，新的 word 對應的 Path 中的參數就可以了。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 【NLP CS224N筆記】Lecture 3 GloVe： Global Vectors for Word Representation 斯坦福CS224n課程作業【CS224n-2019學習筆記】Lecture 1: Introduction and Word Vectors [NLP] cs224n-2019 Assignment 1 Exploring Word Vectors CS224d 干貨（視頻，筆記，論文） (Stanford CS224d) Deep Learning and NLP課程筆記（一）：Deep NLP 【圖機器學習】cs224w Lecture 7 - 節點的表示 CS231n 斯坦福深度視覺識別課學習筆記(完結) CS231n 2017 學習筆記01——KNN（K-Nearest Neighbors） (Stanford CS224d) Deep Learning and NLP課程筆記（三）：GloVe與模型的評估