NLP (2): Extracting High-Frequency Words with jieba

Reposted article. 2020-03-10 20:29. Category: Natural Language Processing

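High-frequency word extraction is based on term frequency (TF). High-frequency words are words that occur often in a document and also carry useful meaning.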

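So the work breaks down into four steps: load the data, remove stop words, count word occurrences with a dictionary, and output the top 10 high-frequency words.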

import glob
import random
import jieba

def getContent(path):
    with open(path, encoding='utf-8', errors='ignore') as f:
        content = ''
        for line in f:
            # Strip surrounding whitespace; blank lines become empty strings.
            line = line.strip()
            content += line
        return content
    
def get_TF(words, topK=10):
    tf_dic = {}
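    # For each word in words, increment its count in tf_dic (starting from 0 if unseen).
    for w in words:
        tf_dic[w] = tf_dic.get(w, 0) + 1
    # Sort the dictionary items by count in descending order and keep the first topK.
    return sorted(tf_dic.items(), key=lambda x: x[1], reverse=True)[:topK]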

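def stop_words(path):
    # Load the stop words into a set for fast membership tests.
    with open(path, encoding='utf-8') as f:
        return {l.strip() for l in f}

# Segment the text and drop stop words; path is the location of your stop-word list.
def cut(content, path):
    stops = stop_words(path)  # read the file once, not once per token
    split_words = [x for x in jieba.cut(content) if x not in stops]
    return split_words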


def main():
    files=glob.glob('C:/Users/Administrator/Desktop/stop_words/news/*.txt')
    corpus=[getContent(x) for x in files]
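    # randrange excludes the upper bound; randint(0, len(corpus)) could return
    # len(corpus) and raise IndexError.
    sample_inx=random.randrange(len(corpus))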
    split_words=cut(corpus[sample_inx],'C:/Users/Administrator/Desktop/stop_words/stop_words.txt')
    print('Sample text: '+corpus[sample_inx])
    print('Segmented sample: '+'/'.join(split_words))
    print('Top 10 words in the sample: '+str(get_TF(split_words)))
    
main()

Running the script prints the sampled text, its segmentation, and its top 10 words with their counts.

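Two things to note about this code. First, when you copy and paste news articles into the txt files, save them as UTF-8; the code reflects this by passing encoding='utf-8' to open. Second, the result is a list of tuples, each made up of a word and its count.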
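To make the shape of that result concrete, here is a minimal sketch that counts an already segmented token list. It uses collections.Counter from the standard library in place of the hand-written dictionary, which is a substitution and not the article's own code. The function name top_k, the token list, and the two stop words are made up for illustration.

```python
from collections import Counter

def top_k(tokens, stopwords, k=10):
    """Return the k most frequent non-stop-word tokens as (word, count) tuples."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(k)

# Hypothetical pre-segmented text and stop words, for illustration only.
tokens = ['球隊', '的', '比賽', '球隊', '了', '比賽', '球隊', '冠軍']
stopwords = {'的', '了'}

result = top_k(tokens, stopwords, k=2)
print(result)  # [('球隊', 3), ('比賽', 2)]
```

Each tuple pairs a word with its raw count, the same structure that get_TF returns.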

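By default, jieba segments text with its general-purpose dictionary before the high-frequency words are counted. For text from a specific domain, such as medicine, entertainment, sports, or science, a dictionary of that domain's own terms is needed. jieba lets us load a custom dictionary, as follows: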

jieba.load_userdict('./data/user_dict.utf8')
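The file passed to load_userdict is a UTF-8 text file with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces and in that order. The sketch below writes such a file and wraps the loading step. The helper names write_user_dict and cut_with_user_dict, the medical terms, and the temporary path are all assumptions chosen for illustration.

```python
import os
import tempfile

def write_user_dict(entries, path):
    """Write entries as lines of 'word [freq] [tag]', separated by spaces."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, freq, tag in entries:
            fields = [word]
            if freq is not None:
                fields.append(str(freq))
            if tag is not None:
                fields.append(tag)
            f.write(' '.join(fields) + '\n')

def cut_with_user_dict(text, dict_path):
    """Segment text after registering the custom dictionary with jieba."""
    import jieba  # third-party package; imported here so write_user_dict has no dependency
    jieba.load_userdict(dict_path)
    return list(jieba.cut(text))

# Hypothetical domain terms: (word, frequency or None, part-of-speech tag or None).
entries = [('心肌梗塞', 10, 'n'), ('冠狀動脈', None, 'n'), ('支架手術', None, None)]
dict_path = os.path.join(tempfile.mkdtemp(), 'user_dict.utf8')
write_user_dict(entries, dict_path)

with open(dict_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)  # ['心肌梗塞 10 n', '冠狀動脈 n', '支架手術']
```

With jieba installed, calling cut_with_user_dict on a sentence and dict_path would segment it with these terms registered. The exact split still depends on jieba's frequency model, so no output is shown here.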
