Word frequency counting: two implementations


Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))

Example:

from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
c = Counter(colors)
print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

most_common(k): returns the k most frequent elements with their counts, ordered from most to least frequent.

Method 2:
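To make the Method 1 one-liner concrete, here is a minimal sketch of capping a vocabulary with most_common. The sample token list and the value of MAX_VOCAB_SIZE are assumptions chosen for illustration. The original does not say what the slot freed by the -1 is for; reserving it for a special token such as <UNK> is a common reason, but that is a guess.

```python
from collections import Counter

# Hypothetical inputs, chosen only for illustration.
MAX_VOCAB_SIZE = 3
text = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# Keep the (MAX_VOCAB_SIZE - 1) most frequent tokens with their counts.
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))
print(vocab)  # {'blue': 3, 'red': 2}
```

Note that if text is a plain string rather than a list of tokens, Counter counts individual characters, which suits character-level vocabularies but not word-level ones.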

def generate_vocab_file(input_seg_file, output_vocab_file):
    # Each input line is expected to be: label<TAB>space-separated words
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    word_dict = {}
    for line in lines:
        label, content = line.strip('\r\n').split('\t')
        for word in content.split():
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # Sort into [(word, frequency), ...], most frequent first
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # Placeholder for unknown words, written first with an arbitrarily large count
        f.write('<UNK>\t10000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))
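The counting and sorting steps of Method 2 can also be expressed with Counter. The following is only a sketch, not the original author's code: the helper name count_words and the sample lines are hypothetical, and it works on an in-memory list of lines instead of files so that it can be run as-is.

```python
from collections import Counter

def count_words(lines):
    """Count words in 'label<TAB>content' lines (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        # Split off the label; the content is whitespace-separated words.
        label, content = line.strip('\r\n').split('\t')
        counter.update(content.split())
    # List of (word, frequency) pairs, most frequent first.
    return counter.most_common()

# Hypothetical sample data in the same format as the segmented input file.
sample = ['sports\tthe game the team\n', 'tech\tthe chip\n']
pairs = count_words(sample)
print(pairs[0])  # ('the', 3)
```

Calling most_common() without an argument returns every element, so it replaces the explicit sorted(...) call.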

A similar implementation:

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}
for color in colors:
    # First occurrence: start the count at 1; otherwise increment it
    if color not in result:
        result[color] = 1
    else:
        result[color] += 1

print(result)  # {'red': 2, 'blue': 3, 'green': 1}
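The if/else branch can be folded into one line with dict.get and a default value. This variant is a sketch rather than part of the original post; it also checks that the manual count agrees with Counter on the same data.

```python
from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}
for color in colors:
    # dict.get returns the default 0 when the key is not present yet.
    result[color] = result.get(color, 0) + 1

# Compare against Counter, converted to a plain dict.
matches = (result == dict(Counter(colors)))
print(matches)  # True
```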

