1. Word frequency count:
import jieba

# Read the whole novel into one string
with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort by count, highest first, and print the top 15
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))
The result is:
曹操 946
孔明 737
將軍 622
玄德 585
卻說 534
關公 509
荊州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
張飛 358
如此 320
不能 318
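The counting loop can also be written with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the code that produced the numbers above; `top_words` and the sample token list are made up for illustration, with the list standing in for the output of `jieba.lcut(txt)`.

```python
from collections import Counter


def top_words(tokens, n=15):
    """Return the n most frequent tokens that are longer than one character."""
    counts = Counter(t for t in tokens if len(t) > 1)
    return counts.most_common(n)


# Hypothetical token list standing in for jieba.lcut(txt)
sample = ["曹操", "曰", "孔明", "曹操", "，", "玄德", "孔明", "曹操"]
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`most_common` does the sorting and slicing in one call, so the explicit `items.sort(...)` step is no longer needed.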
As a further improvement, I only want the appearance counts of the characters. The code is as follows:
import jieba

with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# Characters whose counts should be printed
names = {'曹操', '孔明', '劉備', '關羽', '張飛', '呂布', '趙雲', '孫權', '周瑜', '袁紹', '黃忠', '魏延'}

words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge aliases and "<name>曰" tokens into one canonical name
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Only the 40 most frequent words are scanned, so rarer names are not printed
for word, count in items[:40]:
    if word in names:
        print("{0:<10}{1:>5}".format(word, count))
The output is:
曹操 1358
孔明 1265
劉備 1251
關羽 783
張飛 358
呂布 300
趙雲 278
孫權 257
周瑜 217
袁紹 191
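The alias handling can also be kept in a lookup table instead of an `elif` chain, which makes it easier to add more aliases later. The sketch below is one possible arrangement: `ALIASES` holds the same pairs as the listing above, while `count_names` and the sample token list are hypothetical names introduced here. Unlike the listing, it filters by name before ranking, so names outside the top 40 words would be printed as well.

```python
# Alias table with the same pairs as the elif chain; extend it as needed
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(tokens, names):
    """Count tokens after alias merging, keeping only the given names."""
    counts = {}
    for token in tokens:
        name = ALIASES.get(token, token)  # fall back to the token itself
        if name in names:
            counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)


# Hypothetical token list standing in for jieba.lcut(txt)
sample = ["玄德", "曰", "孔明曰", "丞相", "曹操", "玄德曰", "劉備", "荊州"]
for name, count in count_names(sample, {"曹操", "孔明", "劉備"}):
    print("{0:<10}{1:>5}".format(name, count))
```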
Going one step further, draw a word cloud:
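import os

import jieba
import wordcloud


def getText(file):
    # Read the file and return its words separated by spaces,
    # so that WordCloud sees the jieba segmentation instead of raw text
    with open(file, 'r', encoding='UTF-8') as f:
        txt = f.read()
    return ' '.join(jieba.lcut(txt))


filename = input()  # file name without the .txt extension
txt = getText(filename + '.txt')
# For Chinese text, also pass font_path=<path to a CJK font> to WordCloud,
# otherwise the characters may not render correctly
wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
wordclouds.to_file('{}.png'.format(filename))

# Open the image with the default viewer (this form works on Windows)
os.system('{}.png'.format(filename))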
The names can be cleaned up further by merging aliases, as in the second code listing.
By default the wordcloud library renders Chinese text as garbled characters. For a fix, see https://blog.csdn.net/Dick633/article/details/80261233
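A common way to handle this is to hand `WordCloud` a font that contains Chinese glyphs through its `font_path` argument. The sketch below is a rough illustration rather than the method from the linked post: the font locations in `CANDIDATE_FONTS` are guesses that will differ between machines, and `pick_font` and `make_cloud` are hypothetical helper names.

```python
import os

# Hypothetical font locations; adjust to wherever a CJK font lives on your machine
CANDIDATE_FONTS = [
    "C:/Windows/Fonts/msyh.ttc",
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]


def pick_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists, or None if none does."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


def make_cloud(text, out_png, font_path):
    """Render space-separated words to a PNG using the given font."""
    import wordcloud  # imported here so pick_font works without the package

    cloud = wordcloud.WordCloud(font_path=font_path, width=1000, height=800, margin=2)
    cloud.generate(text).to_file(out_png)
```

A call might look like `make_cloud(' '.join(jieba.lcut(txt)), 'out.png', pick_font())`. If `pick_font` returns `None`, `WordCloud` falls back to its bundled default font, and Chinese characters will most likely still come out wrong.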
Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003