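Exercise: count the word frequencies in a piece of English text and return the 5 most frequent words with their counts (implemented in Python).
Convert everything to lowercase with lower() first, then identify the words.
How do we identify words?
1 Split the string using non-letter characters as delimiters (to avoid handling each special character separately, replace them all with '-')
2 Split with a regular expression
3 Iterate over the string and build each word character by character
4 Match words with a regular expression
How do we count?
Put each word from the word list and its count into a dict, then sort by count.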
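As a minimal sketch of the counting step on a toy list (the helper names `count_words` and `top_n` are illustrative and do not appear in the original code), a single pass with `dict.get` builds the counts. Sorting by descending count and then alphabetically is an added assumption that makes ties deterministic:

```python
def count_words(words):
    """Build a {word: count} dict in one pass over the list."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_n(counts, n=5):
    """Return the n most frequent (word, count) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


sample = ['the', 'master', 'host', 'the', 'slaves', 'the', 'master']
print(top_n(count_words(sample), 2))  # [('the', 3), ('master', 2)]
```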
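'''
dinghanhua
2018-11-11
Exercise: given a piece of English text, count the frequency of each word
and return the 5 most frequent words with their counts
'''

import re

# Raw string, so the backslash in the sample text is kept literally
art = r' If we want to" run Locust \ / distributed on multiple machines we would also have to specify the master host when starting the slaves (this is not needed when running Locust distributed on a single machine, since the master host defaults to 127.0.0.1):'

'''
How do we identify words?
1 Split the string using non-letter characters as delimiters
2 Iterate over the string and build each word
3 Match words with a regular expression
How do we count?
Put each word from the word list and its count into a dict, then sort
'''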
word_dict = {}  # counts, as word: count
word_list = []  # holds all the words
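# Replace every non-letter character with one common character; after split() only words remain
pattern = r'[^a-z]+'
art_new = re.sub(pattern, '-', art.lower())  # replace every run of non-letters with '-'
word_list = art_new.split('-')  # split the lowercased text into words
wordlist = list(filter(lambda x: x != '', word_list))  # drop empty strings
print('All words:', wordlist)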
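# Split with a regular expression
pattern = r'[^a-z]+'  # non-letters
# re.split leaves empty strings at the start and end of the list, so filter them out
word_list = [w for w in re.split(pattern, art.lower()) if w]
print(word_list)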
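# Iterate over the string and append each word to the list (not recommended)
word = ''
word_list2 = []
for letter in art.lower():
    if letter.isalpha():  # a letter: append it to the current word
        word += letter
    else:
        if word != '':
            word_list2.append(word)  # not a letter: store the current word if it is non-empty
        word = ''  # reset the current word
if word != '':
    word_list2.append(word)  # store the last word when the text ends with a letter
print(word_list2)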
# Match words with a regular expression
pattern = r'[a-z]+'
word_list3 = re.findall(pattern, art.lower())
print(word_list3)
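To compare the splitting and matching approaches side by side, here is a small sketch on a shortened sample string (the function names `words_by_split` and `words_by_findall` are illustrative). It shows that `re.split` leaves empty strings at the ends when the text starts or ends with non-letters, while `re.findall` does not:

```python
import re


def words_by_split(text):
    """Split on runs of non-letters, then drop the empty strings."""
    return [w for w in re.split(r'[^a-z]+', text.lower()) if w]


def words_by_findall(text):
    """Match runs of letters directly; no empty strings are produced."""
    return re.findall(r'[a-z]+', text.lower())


sample = ' If we want to" run Locust (127.0.0.1):'
print(re.split(r'[^a-z]+', sample.lower()))
# ['', 'if', 'we', 'want', 'to', 'run', 'locust', '']
print(words_by_findall(sample))
# ['if', 'we', 'want', 'to', 'run', 'locust']
print(words_by_split(sample) == words_by_findall(sample))  # True
```

Since `findall` needs no cleanup step, it is probably the simplest of the four approaches for this exercise.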
Finally, the counting code:
# Count
for word in set(word_list):
    word_dict[word] = word_list.count(word)  # key = word, value = its count in the list
# Take the five most frequent
print(sorted(word_dict.items(), key=lambda x: x[1], reverse=True)[0:5])  # sort the dict by value, descending, and take the first 5
# Another way to build the dict
word_dict = dict.fromkeys(word_list)  # first create the dict keys from the list
for word in word_dict.keys():
    word_dict[word] = word_list.count(word)
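Calling `list.count` inside a loop rescans the whole list once per distinct word. As an alternative that is not in the original code, the standard library's `collections.Counter` counts in a single pass, and its `most_common(n)` method returns the top entries directly. The function name `top_words` below is illustrative:

```python
import re
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    words = re.findall(r'[a-z]+', text.lower())
    return Counter(words).most_common(n)


sample = 'the master host, the slaves; the master'
print(top_words(sample, 2))  # [('the', 3), ('master', 2)]
```

Calling `top_words(art)` on the sample text above would give the five most frequent words. Words with equal counts are returned in the order they were first seen.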
the end!