What is TF-IDF?
Computing the feature vector (that is, computing the weight of each term)
Constructing the document model
Here we use the vector space model to represent document content numerically: in the vector space model, each document is expressed as a vector.
A document is represented by the feature vector (T1, W1; T2, W2; T3, W3; …; Tn, Wn), where:
- Ti is a term
- Wi is the weight (importance) of term Ti in the document
That is, a document is viewed as a set of mutually independent terms.
Treat T1, T2, …, Tn as the axes of an n-dimensional coordinate system.
Each term is assigned a weight Wi according to its importance, which serves as the coordinate value along the corresponding axis.
The weight Wi is expressed in terms of word frequency, which comes in two forms: absolute and relative.
- Absolute word frequency: the text is represented by the raw frequency with which each word occurs in it.
- Relative word frequency: a normalized word frequency; the most widely used scheme today is TF*IDF (Term Frequency * Inverse Document Frequency).
Once the documents are quantified this way, it is easy to see that D1 is more similar to Q, because the angle between D1 and Q is smaller; we can express this with the cosine of that angle. Let's analyze the example:
There are three documents: D1, D2, and Q. A total of three distinct terms appear across them, which we denote T1, T2, and T3.
In document D1, term T1 has weight 2, T2 has weight 3, and T3 has weight 5.
In document D2, T1 has weight 0, T2 has weight 7, and T3 has weight 1.
In document Q, T1 has weight 0, T2 has weight 0, and T3 has weight 2.
|    | D1 | D2 | Q |
|----|----|----|---|
| T1 | 2  | 0  | 0 |
| T2 | 3  | 7  | 0 |
| T3 | 5  | 1  | 2 |
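As a quick sketch of the comparison (plain Python; the helper name `cosine` is my own), we can check that D1 really is closer to Q than D2 is:

```python
import math

def cosine(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

D1 = [2, 3, 5]   # weights of T1, T2, T3 in document D1
D2 = [0, 7, 1]
Q  = [0, 0, 2]

print(round(cosine(D1, Q), 3))  # 0.811
print(round(cosine(D2, Q), 3))  # 0.141
```

A larger cosine means a smaller angle, so D1 is indeed the better match for Q.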
Next, let's look at the TF*IDF formula:
tf: tf(d, t) is the number of occurrences of term t in document d.
idf: idf(t) = log(N / df(t)), where:
- df(t) is the number of documents in the collection in which term t appears
- N is the total number of documents
For a term t and a document d, the weight of t in that document d is computed as:

w(d, t) = tf(d, t) × idf(t) = tf(d, t) × log(N / df(t))
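Here is a minimal sketch of this weighting in Python (the function names `idf` and `weight` and the toy corpus are my own illustration):

```python
import math

def idf(term, docs):
    """idf(t) = log(N / df(t)): N documents in total, df(t) of them contain t."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def weight(term, doc, docs):
    """w(d, t) = tf(d, t) * idf(t)."""
    return doc.count(term) * idf(term, docs)

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]

print(weight("the", docs[0], docs))             # 0.0 -- "the" is in every document
print(round(weight("cat", docs[0], docs), 3))   # log(3/2) ≈ 0.405
```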
With this, the feature vector (T1, W1; T2, W2; T3, W3; …; Tn, Wn) can be computed!
Pretty simple, isn't it?
Thinking a step further:
If a term t appears in almost every document, then df(t) approaches N, so idf(t) = log(N / df(t)) approaches 0. In that case w(d, t) also approaches 0, the term's weight in the text becomes very small, and the term does little to distinguish one document from another.
Stop words: in English, words such as a, of, is, in, … are called stop words. They are almost useless for distinguishing documents, yet they appear in nearly every document, and this is exactly where idf shows its value.
We usually select a specified number of terms, ranked by their w(d, t) values, as the text's feature items, and use them to generate the text's feature vector (with stop words removed).
On the one hand, this algorithm highlights the words in a document that users care about; on the other hand, it eliminates the influence of terms that occur frequently in the text but are unrelated to its meaning.
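A minimal sketch of this selection step in Python (the stop-word list, the function name `feature_vector`, and the toy corpus are made up for illustration):

```python
import math

STOP_WORDS = {"a", "of", "is", "in", "the", "and"}

def feature_vector(doc, docs, k=3):
    """Return the k non-stop-word terms of doc with the highest w(d, t)."""
    n = len(docs)
    weights = {}
    for t in set(doc) - STOP_WORDS:
        df = sum(1 for d in docs if t in d)
        weights[t] = doc.count(t) * math.log(n / df)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ran", "in", "the", "park"],
    ["a", "cat", "and", "a", "dog"],
]

# Stop words are skipped outright; "cat" (present in 2 of 3 docs) has the
# lowest idf of the remaining terms, so it falls outside the top 3.
print(sorted(t for t, _ in feature_vector(docs[0], docs)))
```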
Similarity between texts:
The usual approach based on the vector space model, as mentioned above, is to compute the cosine value. Concretely, for two document vectors x and y:

cos(x, y) = (x · y) / (|x| |y|)
If you don't know how to compute the inner product, go Baidu it yourself…
To improve efficiency, first compute x' = x/|x| and y' = y/|y|. When computing similarities between many pairs of documents, normalizing each document vector to unit length up front reduces the amount of computation, because the cosine then becomes a plain inner product.
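A small sketch of this trick (plain Python; `normalize` is my own helper), reusing the D1/D2/Q weights from the example above:

```python
import math

def normalize(v):
    """Scale v to unit length, so cosine reduces to a plain inner product."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

vectors = [[2, 3, 5], [0, 7, 1], [0, 0, 2]]   # D1, D2, Q
unit = [normalize(v) for v in vectors]        # one normalization pass per doc

# Every pairwise similarity is now just a dot product -- no square roots
# inside the O(n^2) comparison loop.
sim_d1_q = sum(a * b for a, b in zip(unit[0], unit[2]))
print(round(sim_d1_q, 3))  # 0.811, same as cos(D1, Q)
```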
OK, that's it for TF*IDF for now.
Summary:
We can use the TF*IDF value as the weight of each feature-vector component, and then determine similarity by computing the cosine between feature vectors.