Clustering Keywords with word2vec


1. Collect a corpus (the example below uses the Sogou news data, news_tensite_xml.dat)

2. Clean and segment the corpus

  • We only need the values inside the <content> tags. The data is GBK-encoded, so convert it to UTF-8 and keep just the content lines with one simple command (a sketch for stripping the leftover tags follows it):
        cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>"  > corpus.txt  
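
  • Note that grep keeps the <content> and </content> markup on each matching line. Here is a minimal Python sketch for stripping it before segmentation (the corpus.clean.txt name is just a suggestion; if you use it, point the script below at that file instead of corpus.txt):

    # Remove the <content> tags that survive the grep above.
    with open("corpus.txt", encoding="utf-8") as src, \
         open("corpus.clean.txt", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.replace("<content>", "").replace("</content>", ""))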

  • Segmentation can be done with jieba:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import jieba

    def cut_words(sentence):
        # Space-separate the tokens; word2vec expects space-delimited text.
        return " ".join(jieba.cut(sentence.strip())) + "\n"

    # Other jieba APIs that could be used here instead:
    #   jieba.cut_for_search(s), jieba.posseg.cut(s),
    #   jieba.analyse.extract_tags(s, withWeight=True)

    with open("corpus.txt", encoding="utf-8") as f, \
         open("resultbig.txt", "a", encoding="utf-8") as target:
        print("opened files")
        num = 0
        # readlines(100000) reads roughly 100000 bytes' worth of lines per
        # batch, so a large corpus is processed without loading it all at once.
        lines = f.readlines(100000)
        while lines:
            num += 1
            target.writelines(map(cut_words, lines))
            print("saved batch %d" % num)
            lines = f.readlines(100000)

3. Run word2vec to output a vector for each word

  • ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 

    This writes vectors.bin. Here -cbow 0 selects the skip-gram model, -size 200 gives 200-dimensional vectors, -window 5 sets the context window, -negative 0 -hs 1 trains with hierarchical softmax instead of negative sampling, -sample 1e-3 subsamples very frequent words, and -binary 1 writes the output in binary format.

  • Then the distance tool lets us query the words closest to any given word (a Python alternative is sketched just below):
    ./distance vectors.bin
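
  • If you would rather query the vectors from Python, gensim (an extra dependency, not part of the original pipeline) can load the binary file written with -binary 1. A minimal sketch, where "新聞" is only a placeholder query word:

    from gensim.models import KeyedVectors

    # Load the binary vectors produced by word2vec -binary 1.
    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # Print the 10 words closest to the query word.
    for word, score in vectors.most_similar("新聞", topn=10):
        print(word, score)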

4. With the steps above under our belt, we can cluster the keywords:

  • A single command is enough; -classes 500 tells word2vec to run K-means on the learned vectors and write each word with its class id instead of the raw vectors:
    ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500  

  • Then sort the output by class with another command (a sketch for reading the result back in Python follows):

    sort classes.txt -k 2 -n > classes.sorted.txt 
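
  • Each line of classes.txt is a word followed by its class id, so the sorted file can be folded into clusters with a few lines of Python. A sketch (not from the original post):

    from collections import defaultdict

    # Group words by their K-means class id.
    clusters = defaultdict(list)
    with open("classes.sorted.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rsplit(None, 1)  # each line is "word class_id"
            if len(parts) == 2:
                word, cls = parts
                clusters[int(cls)].append(word)

    # Inspect one cluster, e.g. the first 20 words of class 0.
    print(clusters[0][:20])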