Download a long Chinese article.
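A minimal sketch of this step, assuming the requests library and a placeholder URL (the original does not give the source address, so the URL below is hypothetical):

import requests

url = 'http://example.com/long-chinese-article.html'  # hypothetical placeholder URL
resp = requests.get(url)
resp.encoding = 'utf-8'  # make sure the Chinese text decodes correctly

# Save the downloaded text for the later steps
with open('gzccnews.txt', 'w', encoding='utf-8') as f:
    f.write(resp.text)

In practice the page's HTML tags would still need to be stripped (for example with BeautifulSoup) before segmentation.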
Read the text to be analyzed from the file.

news = open('gzccnews.txt', 'r', encoding='utf-8').read()  # read() so news holds the text, not a file handle
Install jieba and use it for Chinese word segmentation.
pip install jieba
import jieba
words = jieba.lcut(news)  # lcut already returns a list, so no extra list() wrapper is needed
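As a quick illustration of what the segmenter produces, the example sentence from jieba's own documentation:

import jieba

print(jieba.lcut('我来到北京清华大学'))
# ['我', '来到', '北京', '清华大学']

jieba.cut returns a generator, while jieba.lcut returns the same tokens as a list, which is why lcut needs no extra list() wrapper.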
Generate the word-frequency counts.
Sort them by frequency.
Exclude grammatical function words: pronouns, articles, conjunctions.
Output the 20 most frequent words (TOP 20).
import jieba

article = open('test.txt', 'r', encoding='utf-8').read()  # encoding added so the Chinese text reads correctly

# Punctuation and function words to exclude from the counts
dele = {'。', '!', '?', '的', '“', '”', '(', ')', ' ', '》', '《', ','}

jieba.add_word('大數據')  # keep this term as a single token

words = list(jieba.cut(article))

# Count each distinct multi-character word
articleDict = {}
articleSet = set(words) - dele
for w in articleSet:
    if len(w) > 1:
        articleDict[w] = words.count(w)

# Sort by frequency, descending
articlelist = sorted(articleDict.items(), key=lambda x: x[1], reverse=True)

# Print the 20 most frequent words, as the step above requires
for i in range(20):
    print(articlelist[i])
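The same four steps (count, filter, sort, print the top 20) can also be written with collections.Counter from the standard library; a minimal alternative sketch under the same assumptions (the same test.txt and the same exclusion set):

import jieba
from collections import Counter

article = open('test.txt', 'r', encoding='utf-8').read()
dele = {'。', '!', '?', '的', '“', '”', '(', ')', ' ', '》', '《', ','}

# Count only multi-character tokens outside the exclusion set
counts = Counter(w for w in jieba.cut(article) if len(w) > 1 and w not in dele)

# most_common already sorts by frequency, descending
for word, freq in counts.most_common(20):
    print(word, freq)

Counter avoids the quadratic words.count(w) loop in the version above, since every token is tallied in a single pass over the text.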
Screenshot of the run: