Word frequency counting: two implementations


Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))

Example:

from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
c = Counter(colors)
print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

most_common(k): returns the k most frequent items as (item, count) pairs, highest count first.
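
As a minimal sketch of how the one-liner in method 1 behaves, the snippet below applies most_common to the same colors list. MAX_VOCAB_SIZE is set to 3 purely for illustration, and using the slot freed by the "-1" for an <UNK> entry is an assumption; the original line only shows the subtraction.

```python
from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
MAX_VOCAB_SIZE = 3  # illustrative value, not from the original

counts = Counter(colors)

# The two most frequent items, highest count first
top2 = counts.most_common(2)
print(top2)  # [('blue', 3), ('red', 2)]

# Keep MAX_VOCAB_SIZE - 1 words, as in method 1
vocab = dict(counts.most_common(MAX_VOCAB_SIZE - 1))

# Assumption: the remaining slot holds an unknown-word token whose count
# is the total frequency of all the words that were dropped
vocab['<UNK>'] = len(colors) - sum(vocab.values())
print(vocab)  # {'blue': 3, 'red': 2, '<UNK>': 1}
```

Here 'green' falls outside the top two, so its single occurrence is what ends up counted under <UNK>.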

Method 2:

def generate_vocab_file(input_seg_file, output_vocab_file):
    # Each input line is expected to be "label<TAB>space-separated words"
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    word_dict = {}
    for line in lines:
        label, content = line.strip('\r\n').split('\t')
        for word in content.split():
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # [(word, frequency), ...], highest frequency first
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # Placeholder entry for unknown words, written first with a large count
        f.write('<UNK>\t10000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))
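
To see what the output file looks like, here is a rough sketch that runs the same logic on a tiny temporary input. The name generate_vocab_file_counter, the two sample lines, and the file names are all made up for illustration; the variant also swaps the hand-written dict for Counter, which is a substitution and not what the function above does.

```python
import os
import tempfile
from collections import Counter

def generate_vocab_file_counter(input_seg_file, output_vocab_file):
    """Hypothetical Counter-based variant of generate_vocab_file."""
    counts = Counter()
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        for line in f:
            # Each line is assumed to be "label<TAB>space-separated words"
            label, content = line.strip('\r\n').split('\t')
            counts.update(content.split())
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        f.write('<UNK>\t10000000\n')
        # most_common() with no argument returns every word, sorted by count
        for word, freq in counts.most_common():
            f.write('%s\t%d\n' % (word, freq))

# Made-up two-line input, purely for illustration
tmp_dir = tempfile.mkdtemp()
seg_path = os.path.join(tmp_dir, 'seg.txt')
vocab_path = os.path.join(tmp_dir, 'vocab.txt')
with open(seg_path, 'w', encoding='UTF-8') as f:
    f.write('sports\tthe team won the game\n')
    f.write('tech\tthe phone is new\n')

generate_vocab_file_counter(seg_path, vocab_path)
with open(vocab_path, 'r', encoding='UTF-8') as f:
    vocab_lines = f.read().splitlines()
print(vocab_lines[:2])  # ['<UNK>\t10000000', 'the\t3']
```

Both versions raise a ValueError on a line that does not contain exactly one tab, so a real input file may need empty or malformed lines filtered out first.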

A similar implementation:

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
result = {}
for color in colors:
    # First occurrence: start the count at 1; otherwise increment it
    if result.get(color) is None:
        result[color] = 1
    else:
        result[color] += 1
print(result)  # {'red': 2, 'blue': 3, 'green': 1}
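
The loop can be shortened by giving dict.get a default value, and the result can be checked against Counter to confirm that the two methods agree. This is only a sketch; count_words is a hypothetical helper name, not something from the original code.

```python
from collections import Counter

def count_words(words):
    """Count occurrences with a plain dict (hypothetical helper)."""
    result = {}
    for w in words:
        # get() returns 0 for a word that has not been seen yet
        result[w] = result.get(w, 0) + 1
    return result

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

manual = count_words(colors)
print(manual)  # {'red': 2, 'blue': 3, 'green': 1}

# The hand-written loop and Counter produce the same counts
print(manual == dict(Counter(colors)))  # True

# Sort by frequency, highest first, as method 2 does before writing the file
ranked = sorted(manual.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('blue', 3), ('red', 2), ('green', 1)]
```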

