Python Day 17: Text Word-Frequency Statistics

Reposted article, 2019-12-29 01:43

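Text word-frequency statistics
I. Overview
1. Requirement: given a text, which words appear in it, and which appear most often?
2. English and Chinese texts need different handling: English words are separated by spaces, while Chinese text must first be segmented into words.
II. "HAMLET"
1. Noise handling: lowercase the text and strip punctuation so that only words remain.
2. Extract the words: split() on whitespace to get a list.
3. Count each word's frequency with a dictionary.
4. Sort by count (the sort key) in descending order with the list sort() method.
5. Print the results.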
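The difference comes down to word boundaries. A minimal illustration, using made-up sample strings rather than the actual source texts: whitespace splitting works for English but returns a Chinese sentence as a single item, which is why the later sections rely on a segmentation library.

```python
# Sample strings are illustrative only, not taken from the source files.
english = "to be or not to be"
chinese = "孔明曰主公勿憂"

# English: whitespace marks word boundaries, so split() yields words.
english_words = english.split()
print(english_words)

# Chinese: there is no whitespace, so split() returns the whole string
# as one item and a separate word-segmentation step is needed.
chinese_words = chinese.split()
print(chinese_words)
```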
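The five steps above can be sketched on an in-memory string before touching a file. This is one possible variant, not the article's own listing: it swaps in `collections.Counter` and `str.translate` from the standard library in place of a hand-built dictionary and a `replace` loop, and the function name `word_frequencies` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter
import string


def word_frequencies(text, top_n=5):
    """Return the top_n (word, count) pairs for an English text."""
    # Step 1: noise handling. Lowercase, then map every ASCII
    # punctuation mark to a space so words are not glued together.
    text = text.lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Step 2: split on whitespace to get a list of words.
    words = text.split()
    # Step 3: count occurrences (Counter is a dict subclass).
    counts = Counter(words)
    # Step 4: most_common() returns pairs sorted by descending count.
    return counts.most_common(top_n)


# Step 5: print the result in two aligned columns.
sample = "To be, or not to be: that is the question."
for word, count in word_frequencies(sample, top_n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter.most_common` does the counting-and-sorting work of steps 3 and 4 in one call; the explicit dictionary version used in the article makes the same logic visible step by step.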

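Hamlet
def gettext():
    # Read the whole play and lowercase it so "The" and "the" count as one word.
    with open("hamlet.txt", 'r') as f:
        text = f.read().lower()
    # Replace punctuation with spaces so neighbouring words are not merged.
    for ch in '!"#$%^&*()_+-,./:;<>=?@{}[]\\~\'':
        text = text.replace(ch, ' ')
    return text

hamlettext = gettext()
words = hamlettext.split()
# Map each word to the number of times it occurs.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
# Sort the (word, count) pairs by count, largest first.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the 20 most frequent words.
for word, count in items[:20]:
    print("{0:<10}{1:>5}".format(word, count))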
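
The format string in the print call controls the column layout. A small sketch of what the two fields do, using made-up pairs that are not real counts from the play:

```python
# Hypothetical (word, count) pairs, only to show the column layout.
rows = [("alpha", 12), ("be", 345)]

lines = []
for word, count in rows:
    # {0:<10} left-aligns the word in a 10-character column;
    # {1:>5} right-aligns the count in a 5-character column.
    lines.append("{0:<10}{1:>5}".format(word, count))

for line in lines:
    print(line)
```

Every line comes out 15 characters wide, so the counts line up on the right regardless of word length (as long as no word exceeds 10 characters).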

III. Counting character appearances in Romance of the Three Kingdoms (《三國演義》)
1. First version

# Romance of the Three Kingdoms
# First, get the words; second, count how often each word appears; third, print the top 10
import jieba
with open('三國演義.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# jieba.lcut() segments the text and returns a list of words.
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens (punctuation, particles and the like).
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, times in items[:10]:
    print('{0:<10}{1:>5}'.format(word, times))

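Problems found:
孔明 (Kongming) and 孔明曰 ("Kongming said") should be counted as the same person.
荊州 (Jingzhou) and similar words are not personal names.
Improvements:
Remove words that are not personal names from the results.
When building the dictionary of word counts, treat 孔明 and 孔明曰 as the same word.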
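One way to apply both improvements is to drive them from data: a dictionary of aliases and a set of excluded words. The sketch below is only an illustration under stated assumptions. The names `ALIASES`, `EXCLUDES` and `count_names` are hypothetical, the alias entries are examples, and a hand-written token list stands in for the output of the segmenter.

```python
# Illustrative alias table and exclusion set (hypothetical names);
# extend both as more cases turn up in the output.
ALIASES = {"孔明曰": "孔明", "諸葛亮": "孔明"}
EXCLUDES = {"荊州"}


def count_names(words):
    """Count words, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue                      # skip single-character tokens
        name = ALIASES.get(word, word)    # fall back to the word itself
        if name in EXCLUDES:
            continue                      # not a personal name
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hand-written tokens standing in for real segmenter output.
tokens = ["孔明", "曰", "孔明曰", "荊州", "諸葛亮", "曹操"]
name_counts = count_names(tokens)
print(name_counts)
```

Skipping excluded words while counting means they never enter the dictionary, so nothing has to be deleted afterwards.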

2. Second version

import jieba
with open('D:/pythonfiles/三國演義.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# Frequent words that are not personal names.
excludes = {'將軍','卻說','荊州','二人','不可','不能','如此','如何','軍士','商議','左右','軍馬','次日','引兵','大喜','天下','東吳','於是','今日','不敢'}
words = jieba.lcut(txt)
counts = {}
# Count words, merging the different ways of referring to the same character.
for word in words:
    if len(word) == 1:
        continue
    elif word == '諸葛亮' or word == '孔明曰':
        reword = '孔明'
    elif word == '玄德' or word == '玄德曰':
        reword = '劉備'
    elif word == '關公' or word == '雲長':
        reword = '關羽'
    elif word == '孟德' or word == '丞相':
        reword = '曹操'
    else:
        reword = word
    counts[reword] = counts.get(reword, 0) + 1
# pop() with a default avoids a KeyError if an excluded word never occurred.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:20]:
    print('{:<10}{:>5}'.format(word, count))

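The same kinds of problems remain in the output: some entries are still not names, and some characters are still split across several aliases. Repeating the same two fixes, extending the alias rules and the exclusion list, refines the result further.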
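The refinement loop is easier to run if exclusions are applied when the ranking is produced instead of by deleting keys. A rough sketch of that idea follows; the helper name `top_candidates` and the counts are made up for illustration and are not results from the novel.

```python
def top_candidates(counts, excludes, n=20):
    """Return the n most frequent words that are not in excludes."""
    kept = [(w, c) for w, c in counts.items() if w not in excludes]
    kept.sort(key=lambda x: x[1], reverse=True)
    return kept[:n]


# Made-up counts standing in for the dictionary built from the text.
sample_counts = {"曹操": 9, "卻說": 7, "孔明": 6, "商議": 4, "劉備": 3}
sample_excludes = {"卻說"}

first_pass = top_candidates(sample_counts, sample_excludes, n=3)
print(first_pass)

# 商議 is not a name but still shows up, so exclude it and rank again.
sample_excludes.add("商議")
second_pass = top_candidates(sample_counts, sample_excludes, n=3)
print(second_pass)
```

Each pass shows which non-names are still in the top of the list, so the exclusion set can grow step by step until only character names remain.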

3. Romance of the Three Kingdoms: top 20

import jieba
with open('三國演義.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}

def countword(a):
    # Increment the count for word a (counts is mutated, not rebound,
    # so no global declaration is needed).
    counts[a] = counts.get(a, 0) + 1

for word in words:
    if len(word) == 1:
        continue
    # Map aliases and titles to one canonical name per character.
    elif word == '孔明曰' or word == '諸葛亮':
        word = '孔明'
    elif word == '玄德' or word == '玄德曰' or word == '主公' or word == '先主':
        word = '劉備'
    elif word == '丞相' or word == '孟德':
        word = '曹操'
    elif word == '關公' or word == '雲長':
        word = '關羽'
    elif word == '都督':
        word = '周瑜'
    elif word == '后主':
        word = '劉禪'
    countword(word)

# Frequent words that are not personal names.
excludes = ['二人','不可','荊州','卻說','不能','將軍','如此','軍士','如何','商議','左右','次日','引兵','大喜','天下','東吳','軍馬','於是',
            '今日','不敢','陛下','魏兵','人馬','一人','不知','漢中','眾將','只見','蜀兵','大叫','上馬','此人','太守','天子','背后','后人','城中',
            '一面','何不','忽報','大軍','先生','何故','然后','先鋒','夫人','不如','趕來','原來','令人','江東','徐州','正是','忽然','下馬','喊聲',
            '成都','因此','百姓','未知','大敗','一軍','大事','之后','不見','起兵','接應','軍中','進兵','引軍','大驚','可以']
# pop() with a default avoids a KeyError if an excluded word never occurred.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for item, times in items[:20]:
    print('{0:<10}{1:>5}'.format(item, times))
