Python: Word Frequency Counts and Word Clouds for Chinese Text Files


1. Word frequency count:

import jieba

# Read the whole text as UTF-8; the with block closes the file afterwards
with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort by count, highest first, and print the top 15
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))

The output is:

曹操 946
孔明 737
將軍 622
玄德 585
卻說 534
關公 509
荊州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
張飛 358
如此 320
不能 318
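
The counting loop above can also be written with `collections.Counter` from the standard library. This is a minimal sketch, not the original author's code: `top_words` is a hypothetical helper name, and the token list is a small made-up sample standing in for the output of `jieba.lcut(txt)`.

```python
from collections import Counter


def top_words(tokens, n=15, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    # Same filter as the loop above: drop single-character tokens
    counter = Counter(t for t in tokens if len(t) >= min_len)
    # most_common sorts by count, highest first
    return counter.most_common(n)


# Made-up sample tokens, standing in for jieba.lcut(txt)
sample = ["曹操", "曰", "孔明", "曹操", "，", "將軍", "孔明", "曹操"]
for word, count in top_words(sample, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

Slicing is handled by `most_common(n)`, so a text with fewer than `n` distinct words simply yields a shorter list.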

As a further refinement, I only want to count how often each character appears. The code is as follows:

import jieba

with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# Characters whose appearances should be reported
names = {'曹操','孔明','劉備','關羽','張飛','呂布','趙雲','孫權','周瑜','袁紹','黃忠','魏延'}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single characters, then fold aliases and titles into one name
    if len(word) == 1:
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Only names that rank among the 40 most frequent words are printed
for word, count in items[:40]:
    if word in names:
        print("{0:<10}{1:>5}".format(word, count))

Running it gives:

曹操 1358
孔明 1265
劉備 1251
關羽 783
張飛 358
呂布 300
趙雲 278
孫權 257
周瑜 217
袁紹 191
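
The chain of `elif` branches in the listing above can also be expressed as a lookup table, which makes adding an alias a one-line change. This is a sketch under assumptions: `ALIASES` and `count_characters` are hypothetical names, the alias pairs are copied from the listing, and the token list is a made-up sample rather than real `jieba.lcut` output. Unlike the listing, which only prints names that land among the 40 most frequent words, this version reports every name in `names` that occurs at all.

```python
# Alias -> canonical name, copied from the branches in the listing above
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}


def count_characters(tokens, names):
    """Count appearances of each name in names, folding aliases together."""
    counts = {}
    for word in tokens:
        canonical = ALIASES.get(word, word)  # unchanged if not an alias
        if canonical in names:
            counts[canonical] = counts.get(canonical, 0) + 1
    # Highest count first
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up sample tokens, standing in for jieba.lcut(txt)
sample = ["玄德", "曰", "丞相", "曹操", "孟德", "關公", "不可", "玄德曰"]
names = {"曹操", "孔明", "劉備", "關羽", "張飛"}
for name, count in count_characters(sample, names):
    print("{0:<10}{1:>5}".format(name, count))
```

Note that mapping a title such as 丞相 to a single person is a simplification inherited from the listing; whether it is accurate depends on the passage.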

Going one step further, generate a word cloud image:

import os

import jieba
import wordcloud


def getText(file):
    with open(file, 'r', encoding='UTF-8') as f:
        txt = f.read()
    # WordCloud does not segment Chinese itself, so return the jieba
    # tokens joined with spaces instead of the raw text
    return " ".join(jieba.lcut(txt))


filename = input()  # file name without the .txt extension
txt = getText(filename + '.txt')
wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
wordclouds.to_file('{}.png'.format(filename))

# Open the image with the default viewer (this form only works on Windows)
os.system('{}.png'.format(filename))

The character names can be refined further by merging aliases; see the code in the second listing.

By default, the wordcloud library renders Chinese text as garbled characters. For a fix, see https://blog.csdn.net/Dick633/article/details/80261233
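
The usual cause of the garbling is that the library's default font has no Chinese glyphs, and `WordCloud` accepts a `font_path` argument naming a font file to use instead. The sketch below is an illustration, not necessarily the fix described in the linked post: `join_tokens` and `build_cloud` are hypothetical helpers, and the font path in the usage comment is a placeholder to replace with a Chinese-capable font on your own machine.

```python
def join_tokens(tokens):
    """Join segmented words with spaces, dropping whitespace-only tokens."""
    return " ".join(t for t in tokens if t.strip())


def build_cloud(tokens, font_path, out_file):
    """Render tokens to the image out_file using the font at font_path."""
    # Imported lazily so join_tokens works without wordcloud installed
    import wordcloud

    cloud = wordcloud.WordCloud(
        font_path=font_path,  # must point at a font that contains Chinese glyphs
        width=1000, height=800, margin=2,
    ).generate(join_tokens(tokens))
    cloud.to_file(out_file)


# Usage (placeholder font path, replace with a real font file on your system):
# build_cloud(jieba.lcut(txt), "path/to/chinese-font.ttf", "threekingdoms3.png")
```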

 

Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003

