Two Ways to Implement Word Frequency Counting

Reposted article. 2020-08-25 17:17. Tags: Pytorch / tensorflow2 / tensorflow1

Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))

Example:

from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
c = Counter(colors)
print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

most_common(k) returns the k most frequent elements with their counts, ordered from most to least frequent.
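
To make the one-liner from Method 1 concrete, here is a minimal sketch. The token list `text` and the value of `MAX_VOCAB_SIZE` are made up for illustration. The original does not say why one slot is subtracted; a common reason, assumed here, is to leave room for an unknown-word token.

```python
from collections import Counter

# Hypothetical inputs, chosen only for illustration.
text = ['red', 'blue', 'red', 'green', 'blue', 'blue']
MAX_VOCAB_SIZE = 3

# most_common(k) returns a list of (element, count) pairs, highest count first.
top = Counter(text).most_common(MAX_VOCAB_SIZE - 1)
print(top)    # [('blue', 3), ('red', 2)]

# Wrapping the pairs in dict() gives a word -> frequency mapping.
vocab = dict(top)
print(vocab)  # {'blue': 3, 'red': 2}
```

Because 'green' is the least frequent word, it falls outside the top MAX_VOCAB_SIZE-1 entries and is dropped from the vocabulary.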

Method 2:

def generate_vocab_file(input_seg_file, output_vocab_file):
    # Each input line is expected to be "label<TAB>space-separated words".
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    word_dict = {}
    for line in lines:
        label, content = line.strip('\r\n').split('\t')
        for word in content.split():
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # [(word, frequency), ...], sorted by frequency, highest first
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # <UNK> is written first, with an artificially large count.
        f.write('<UNK>\t10000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))
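
The original does not show how the vocabulary file is used afterwards. The following is a minimal sketch of one possible way to read it back, assuming the tab-separated "word, frequency" layout written by Method 2. The helper `load_vocab` and its `min_freq` parameter are hypothetical names introduced for illustration only.

```python
import os
import tempfile

def load_vocab(vocab_file, min_freq=1):
    """Map each word to an integer id, in file order (hypothetical helper)."""
    word_to_id = {}
    with open(vocab_file, 'r', encoding='UTF-8') as f:
        for line in f:
            word, freq = line.rstrip('\r\n').split('\t')
            # Skip words that occur fewer than min_freq times.
            if int(freq) < min_freq:
                continue
            word_to_id[word] = len(word_to_id)
    return word_to_id

# Write a tiny vocabulary file in the same layout, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'vocab.txt')
    with open(path, 'w', encoding='UTF-8') as f:
        f.write('<UNK>\t10000000\nblue\t3\nred\t2\ngreen\t1\n')
    word_to_id = load_vocab(path, min_freq=2)

print(word_to_id)  # {'<UNK>': 0, 'blue': 1, 'red': 2}
```

Since <UNK> sits on the first line with a very large count, it survives any frequency threshold and receives id 0 in this sketch.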

A similar implementation:

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
result = {}
for color in colors:
    # First occurrence: start the count at 1; otherwise increment it.
    if result.get(color) is None:
        result[color] = 1
    else:
        result[color] += 1
print(result)  # {'red': 2, 'blue': 3, 'green': 1}
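
The existence check in the manual loop can also be avoided. As an alternative that the original does not use, this sketch counts with `collections.defaultdict` and confirms that the result matches `Counter`.

```python
from collections import Counter, defaultdict

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# defaultdict(int) supplies 0 for a missing key, so no existence check is needed.
result = defaultdict(int)
for color in colors:
    result[color] += 1

print(dict(result))  # {'red': 2, 'blue': 3, 'green': 1}

# The manual loop and Counter agree on the counts.
assert dict(result) == dict(Counter(colors))
```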
