Text similarity analysis in Python with gensim

This article is a repost (original link below). 2017-05-21 18:47

http://blog.csdn.net/chencheng126/article/details/50070021

Based on the post by the blogger linked above.

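Principle

1. The need to compute text similarity originated with search engines.

A search engine has to compute the similarity between the user's query and each of the many crawled web pages, so that the most similar pages are returned to the user first.

2. The main algorithm used is tf-idf

tf: term frequency

idf: inverse document frequency

The main idea: if a word or phrase occurs frequently in one article but rarely in other articles, it is considered to discriminate well between categories and is suitable for classification.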

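Step 1: tokenize the text of every web page into a bag of words.

Step 2: count the total number of web pages (documents), M.

Step 3: count the number of words N in the first page, count how many times n the first word of that page occurs in it, then find the number of documents m in which that word appears. The tf-idf of the word is then: n/N * 1/(m/M) (other normalized formulas exist; this is the most basic and intuitive one).

Step 4: repeat step 3 to compute the tf-idf value of every word in one page.

Step 5: repeat step 4 to compute the tf-idf value of every word in every page.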
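The five steps above can be sketched in plain Python. This is only an illustration of the basic formula n/N * 1/(m/M), not what gensim does internally: the function name `tfidf_per_document` and the toy documents are made up for the example, and m is assumed to be the number of documents that contain the word.

```python
from collections import Counter


def tfidf_per_document(docs):
    """Apply n/N * M/m to every word of every tokenized document.

    docs: list of token lists (one bag of words per page).
    Returns one {word: weight} dict per document.
    """
    M = len(docs)  # total number of documents

    # m for each word: in how many documents it appears (assumed reading of m)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        N = len(tokens)           # number of words in this document
        counts = Counter(tokens)  # n for each word in this document
        weights.append({w: (n / N) * (M / doc_freq[w]) for w, n in counts.items()})
    return weights


# Hypothetical, already tokenized pages
docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "cherry cherry date".split(),
]
weights = tfidf_per_document(docs)
```

Here "apple" gets the highest weight in the first page because it is frequent there and appears in no other page. gensim's `TfidfModel` uses a logarithmic idf and normalizes each vector by default, so its numbers will differ from this sketch.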

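3. Processing the user query

Step 1: tokenize the user query.

Step 2: using the statistics of the page collection (the documents), compute the tf-idf value of every word in the user query.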
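A minimal sketch of these two steps, under the same basic formula: the query is weighted with the document counts of the page collection, not with counts from the query itself. The helper `query_tfidf` and the sample data are hypothetical, and skipping words that never occur in the collection is an assumption of this sketch.

```python
from collections import Counter


def query_tfidf(query_tokens, docs):
    """Weight query words with n/N * M/m, where M and m come from docs."""
    M = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    N = len(query_tokens)
    # Words unseen in the collection have no document frequency: skip them.
    counts = Counter(w for w in query_tokens if w in doc_freq)
    return {w: (n / N) * (M / doc_freq[w]) for w, n in counts.items()}


docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "cherry cherry date".split(),
]
query_vec = query_tfidf("cherry date mango".split(), docs)
```

The word "mango" is dropped because it appears in no page, and "date" outweighs "cherry" because it occurs in fewer pages.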

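4. Computing the similarity

Cosine similarity is used to measure the angle between the user query vector and each web page vector. The smaller the angle (the closer the cosine is to 1), the more similar they are.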
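As an illustration, cosine similarity between sparse vectors stored as {word: weight} dicts could be computed as follows. The function name `cosine` and the weights are invented for the example; returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention of this sketch: empty vectors match nothing
    return dot / (norm_u * norm_v)


# Hypothetical tf-idf weights for one query and three pages
query = {"cherry": 0.5, "date": 1.0}
pages = [
    {"apple": 2.0, "banana": 0.5},
    {"banana": 0.75, "cherry": 0.75},
    {"cherry": 1.0, "date": 1.0},
]
scores = [cosine(query, page) for page in pages]
```

The third page shares both query words and scores highest, while the first page shares none and scores 0.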

# coding=utf-8

# import warnings
# warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import logging
from gensim import corpora, models, similarities

datapath = 'D:/hellowxc/python/testres0519.txt'
querypath = 'D:/hellowxc/python/queryres0519.txt'
storepath = 'D:/hellowxc/python/store0519.txt'


def similarity(datapath, querypath, storepath):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    class MyCorpus(object):
        # Stream the corpus: one whitespace-tokenized document per line.
        def __iter__(self):
            with open(datapath) as f:
                for line in f:
                    yield line.split()

    Corp = MyCorpus()
    # Map every distinct token to an integer id.
    dictionary = corpora.Dictionary(Corp)
    # Turn each document into a sparse bag-of-words vector.
    corpus = [dictionary.doc2bow(text) for text in Corp]

    # Fit the tf-idf model on the corpus and reweight the corpus with it.
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    # The query is the first line of the query file.
    with open(querypath, 'r') as q_file:
        query = q_file.readline()
    vec_bow = dictionary.doc2bow(query.split())
    vec_tfidf = tfidf[vec_bow]

    # Index the corpus; num_features is passed explicitly as the vocabulary size.
    index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
    # Cosine similarity of the query against every document, in corpus order.
    sims = index[vec_tfidf]

    # Write one score per line.
    with open(storepath, 'w') as sim_file:
        for score in sims:
            sim_file.write(str(score) + '\n')


similarity(datapath, querypath, storepath)
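The script stores one score per line in the same order as the documents in the data file, so line i belongs to document i. To return the most similar documents first, as a search engine would, the scores still have to be sorted. A possible sketch, where the helper `top_matches` and the sample lines are hypothetical:

```python
def top_matches(score_lines, k=3):
    """Return the k best (document_index, score) pairs, highest score first."""
    scores = [float(s) for s in score_lines if s.strip()]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Example lines in the format the script writes (made-up values)
lines = ["0.0\n", "0.3162\n", "0.9487\n"]
best = top_matches(lines, k=2)
```

Since a file object iterates over its lines, the file written to storepath could be passed in directly once it has been opened for reading.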