1. Collect the corpus
- Write your own crawler to collect data from web pages (a minimal sketch follows this list).
- Or use data that others have already prepared, such as the Sogou news corpus: http://www.sogou.com/labs/dl/ca.html
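As an illustration of the crawler option, a minimal sketch might look like the following; the requests and beautifulsoup4 packages and the placeholder URL are assumptions for illustration only, not part of the original setup.

# Minimal crawler sketch (assumes the requests and beautifulsoup4 packages;
# the start URL below is a placeholder, not from the original post).
import io
import requests
from bs4 import BeautifulSoup

def fetch_text(url):
    # Download one page and return its visible text.
    resp = requests.get(url, timeout=10)
    resp.encoding = resp.apparent_encoding  # many Chinese news pages are gbk
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator="\n")

if __name__ == "__main__":
    text = fetch_text("http://example.com/news/1.html")  # placeholder URL
    with io.open("corpus.txt", "a", encoding="utf-8") as out:
        out.write(text)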
2. Clean and segment the corpus
- We only need the values inside the content tags, so a simple command keeps those lines and strips away everything that is not content:
cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
- Word segmentation can be done with jieba:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Segment corpus.txt with jieba and write the space-separated result
# to resultbig.txt. (Python 2 script, as in the original post.)
import jieba

def cut_words(sentence):
    # Segment one line and join the tokens with spaces.
    return " ".join(jieba.cut(sentence)).encode('utf-8')

f = open("corpus.txt")
target = open("resultbig.txt", 'a+')
print 'open files'

# Read the corpus in chunks (readlines with a size hint reads roughly
# that many bytes of whole lines at a time).
lines = f.readlines(100000)
num = 0
while lines:
    num += 1
    after_cut = map(cut_words, lines)
    target.writelines(after_cut)
    print 'saved %s00000 articles' % num
    lines = f.readlines(100000)

f.close()
target.close()
3. Run word2vec to output a vector for each word
- Train the vectors with:
./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
The output is vectors.bin.
- Then, with the distance command, we can find the words closest to each word:
./distance vectors.bin
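If you would rather inspect the trained vectors from Python, the binary file can also be loaded with gensim; this is an optional aside, and gensim is assumed to be installed here rather than being part of the original toolchain.

# Sketch: load vectors.bin (trained with -binary 1) using gensim and
# query the nearest neighbours of a word. gensim is an extra dependency.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# "科技" is just an example query word; any word from the corpus works.
for word, similarity in vectors.most_similar(u"科技", topn=10):
    print("%s %.4f" % (word, similarity))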
4. Now that we are familiar with the steps above, we move on to clustering the keywords:
- A single command is enough:
./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
- Then sort by class id with another command:
sort classes.txt -k 2 -n > classes.sorted.txt
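Each line of classes.sorted.txt is a word followed by its class id. A small script like the sketch below (the file name is taken from the command above, but the grouping and output format are assumptions) makes it easier to read each cluster as a group:

# Sketch: group the words in classes.sorted.txt by class id and print
# the first few words of every cluster.
from collections import defaultdict

clusters = defaultdict(list)
with open("classes.sorted.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        word, class_id = parts
        clusters[class_id].append(word)

for class_id in sorted(clusters, key=int):
    print("%s: %s" % (class_id, " ".join(clusters[class_id][:20])))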