【Python】詞頻統計


  • 需求:一篇文章,出現了哪些詞?哪些詞出現得最多?

英文文本詞頻統計

英文文本:Hamlet 分析詞頻

統計英文詞頻分為兩步:

  • 文本去噪及歸一化
  • 使用字典表達詞頻

代碼:

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   #將文本中特殊字符替換為空格
    return txt
 
hamletTxt = getText()
words  = hamletTxt.split()
counts = {}
for word in words:           
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

運行結果:

the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436

中文文本詞頻統計

中文文本:《三國演義》分析人物

統計中文詞頻分為兩步:

  • 中文文本分詞
  • 使用字典表達詞頻
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

運行結果:

曹操      953
孔明  836
將軍  772
卻說  656
玄德  585
關公  510
丞相  491
二人  469
不可  440
荊州  425
玄德曰     390
孔明曰     390
不能  384
如此  378
張飛  358

能很明顯的看到有一些不相關或重復的信息

優化版本

統計中文詞頻分為三步:

  • 中文文本分詞
  • 使用字典表達詞頻
  • 擴展程序解決問題

我們將不相關或重復的信息放在 excludes 集合里面進行排除。

#CalThreeKingdomsV2.py
import jieba
excludes = {"將軍","卻說","荊州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

考研英語詞頻統計

將詞頻統計應用到考研英語中,我們可以統計出出現次數較多的關鍵單詞。
文本鏈接: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA 密碼: fw3r

# CalHamletV1.py
def getText():
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   #將文本中特殊字符替換為空格
    return txt

pyTxt = getText()   #獲得沒有任何標點的txt文件
words  = pyTxt.split()  #獲得單詞
counts = {} #字典,鍵值對
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",\
            "was", "are", "have", "were", "had", "that", "for", "it",\
            "on", "be", "as", "with", "by", "not", "their", "they",\
            "from", "more", "but", "or", "you", "at", "has", "we", "an",\
            "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",\
            "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", \
            "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", \
            "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", \
            "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", \
            "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", \
            "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", \
            "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", \
            "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考點分析】", "67", "here", "68",  "71", "72", "69", "73", "74", "選項a", "ourselves", "teachers", "helps", "參考范文", "gdp", "yourself", "gone", "150"}
for word in words:
    if word not in excludes:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

x = len(counts)
print(x)

r = 0

next = eval(input("1繼續"))

while next == 1:
    r += 100
    for i in range(r, r+100):
        word, count = items[i]
        print ("\"{}\"".format(word), end = ", ")
    next = eval(input("1繼續"))


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM