英文文本詞頻統計

統計英文詞頻分為兩步：

文本去噪及歸一化
使用字典表達詞頻

代碼：

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   #將文本中特殊字符替換為空格
    return txt
 
hamletTxt = getText()
words  = hamletTxt.split()
counts = {}
for word in words:           
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

運行結果：

the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436

中文文本詞頻統計

中文文本：《三國演義》分析人物

統計中文詞頻分為兩步：

中文文本分詞
使用字典表達詞頻

#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

運行結果：

曹操      953
孔明  836
將軍  772
卻說  656
玄德  585
關公  510
丞相  491
二人  469
不可  440
荊州  425
玄德曰     390
孔明曰     390
不能  384
如此  378
張飛  358

能很明顯的看到有一些不相關或重復的信息

優化版本

統計中文詞頻分為三步：

中文文本分詞
使用字典表達詞頻
擴展程序解決問題

我們將不相關或重復的信息放在 excludes 集合里面進行排除。

#CalThreeKingdomsV2.py
import jieba
excludes = {"將軍","卻說","荊州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

考研英語詞頻統計

將詞頻統計應用到考研英語中，我們可以統計出出現次數較多的關鍵單詞。
文本鏈接: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA 密碼: fw3r

# CalHamletV1.py
def getText():
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   #將文本中特殊字符替換為空格
    return txt

pyTxt = getText()   #獲得沒有任何標點的txt文件
words  = pyTxt.split()  #獲得單詞
counts = {} #字典，鍵值對
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",\
            "was", "are", "have", "were", "had", "that", "for", "it",\
            "on", "be", "as", "with", "by", "not", "their", "they",\
            "from", "more", "but", "or", "you", "at", "has", "we", "an",\
            "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",\
            "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", \
            "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", \
            "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", \
            "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", \
            "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", \
            "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", \
            "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", \
            "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考點分析】", "67", "here", "68",  "71", "72", "69", "73", "74", "選項a", "ourselves", "teachers", "helps", "參考范文", "gdp", "yourself", "gone", "150"}
for word in words:
    if word not in excludes:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

x = len(counts)
print(x)

r = 0

next = eval(input("1繼續"))

while next == 1:
    r += 100
    for i in range(r, r+100):
        word, count = items[i]
        print ("\"{}\"".format(word), end = ", ")
    next = eval(input("1繼續"))

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 詞頻統計（python）用Python來進行詞頻統計 Python詞頻統計【Python】文本詞頻統計 Python 詞頻統計 python統計詞頻 Python 分詞並統計詞頻 Python 英文詞頻統計 python進行分詞及統計詞頻 Python統計詞頻的幾種方式