Chinese Word Segmentation and Word Frequency Counting on Hadoop: A Hands-On Exercise

Reposted article, 2012-12-16 19:47

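First, some recommended background reading: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's piece on ranking the popularity of character names in martial-arts novels is a lot of fun, so here I copy the approach and try it out myself.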

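The differences from that article are:

  0) It uses Hadoop Streaming; here I use the MapReduce framework directly.

  1) A different Chinese word segmentation method: here I use IKAnalyzer, whose home page is http://code.google.com/p/ik-analyzer/.

  2) The material here is "The Legend of the Condor Heroes" (《射雕英雄傳》). Ha, there has to be some change.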
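
For readers who have not met dictionary-based segmentation before, here is a minimal sketch of the general idea, using plain forward maximum matching. It is not IKAnalyzer's algorithm, and the class name `ToySegmenter`, the mini-dictionary, and the sample sentence are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dictionary-based segmenter (forward maximum matching).
// NOT IKAnalyzer's algorithm; it only illustrates what "segmentation" means.
class ToySegmenter {

    // Walks the text left to right and, at each position, takes the longest
    // dictionary word of at most maxWordLen characters. If no dictionary word
    // starts there, the single character is emitted as its own token.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dictionary and sentence, invented for this example.
        Set<String> dict = new HashSet<>(Arrays.asList("郭靖", "黃蓉", "洪七公", "師父"));
        System.out.println(segment("郭靖和黃蓉的師父是洪七公", dict, 3));
    }
}
```

A real analyzer ships a large dictionary and handles ambiguity, digits, and Latin text, which is why the job below delegates the work to IKAnalyzer instead of rolling its own.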

0) Start from the WordCount source code and modify its Map so that it uses IKAnalyzer's segmentation inside the Map.

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.ByteArrayInputStream;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {
    
      public static class TokenizerMapper 
           extends Mapper<Object, Text, Text, IntWritable>{
        
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
          
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            
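            // Text.getBytes() exposes the backing array, which is only valid up to getLength(),
            // so wrap just that range and decode it explicitly as UTF-8 (Text's encoding).
            InputStream ip = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
            Reader read = new InputStreamReader(ip, "UTF-8");
            // true selects IKAnalyzer's smart segmentation mode
            IKSegmenter iks = new IKSegmenter(read, true);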
            Lexeme t;
            while ((t = iks.next()) != null)
            {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
      }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(ChineseWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
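
One detail in the map method deserves a note. As far as I understand Hadoop's `Text`, `getBytes()` returns the internal buffer, and only the first `getLength()` bytes belong to the current record; the same `Text` object is typically reused from one input line to the next. The sketch below reproduces that situation with plain byte arrays (no Hadoop classes, so the class name `BufferDecodeDemo` and the sample strings are my own) to show what can leak through when the whole buffer is decoded.

```java
import java.nio.charset.StandardCharsets;

// Simulates a reused byte buffer, similar in spirit to how a Hadoop Text
// object keeps its backing array across records. Plain JDK only.
class BufferDecodeDemo {

    // Decodes only the first `length` bytes of `buf` as UTF-8.
    static String decode(byte[] buf, int length) {
        return new String(buf, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A longer record was stored first, then a shorter one overwrote its start.
        byte[] longLine = "郭靖黃蓉".getBytes(StandardCharsets.UTF_8);
        byte[] shortLine = "楊康".getBytes(StandardCharsets.UTF_8);
        byte[] buf = longLine.clone();
        System.arraycopy(shortLine, 0, buf, 0, shortLine.length);
        int validLength = shortLine.length;

        // Decoding the whole buffer drags in the stale tail of the previous record.
        System.out.println(decode(buf, buf.length));
        // Decoding only the valid range yields just the current record.
        System.out.println(decode(buf, validLength));
    }
}
```

In the mapper, the equivalent precaution is to limit the stream to `value.getLength()` bytes, or simply to segment `value.toString()`; naming UTF-8 explicitly also avoids depending on the platform default charset.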

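1) So, that's done, and it runs fine in the local plugin-based simulation environment. Package it up (with the segmentation jar bundled in) and throw it onto the cluster.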

hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0

...mapping reducing...

hadoop fs -ls ./out0
hadoop fs -get out0/part-r-00000 words.txt

2) Post-processing the data:

2.1) Sorting the data

head words.txt
tail words.txt


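# First try: sorts on the count column, but lexicographically
sort -k2 words.txt > 0.txt
head 0.txt
tail 0.txt
# Second try: reversed, but still lexicographic
sort -k2r words.txt > 0.txt
head 0.txt
tail 0.txt
# Third try: numeric and descending, so the most frequent words come first
sort -k2rn words.txt > 0.txt
head -n 50 0.txt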

2.2) Extracting the targets (keep only words of at least two characters)

awk '{if(length($1)>=2) print $0}' 0.txt >1.txt

2.3) Presenting the results

head -n 50 1.txt | sed = | sed 'N;s/\n//'

1郭靖   6427
2黃蓉   4621
3歐陽   1660
4甚么   1430
5說道   1287
6洪七公 1225
7笑道   1214
8自己   1193
9一個   1160
10師父  1080
11黃葯師        1059
12心中  1046
13兩人  1016
14武功  950
15咱們  925
16一聲  912
17只見  827
18他們  782
19心想  780
20周伯通        771
21功夫  758
22不知  755
23歐陽克        752
24聽得  741
25丘處機        732
26當下  668
27爹爹  664
28只是  657
29知道  654
30這時  639
31之中  621
32梅超風        586
33身子  552
34都是  540
35不是  534
36如此  531
37柯鎮惡        528
38到了  523
39不敢  522
40裘千仞        521
41楊康  520
42你們  509
43這一  495
44卻是  478
45眾人  476
46二人  475
47鐵木真        469
48怎么  464
49左手  452
50地下  448

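Quite a few of the words that are not character names are interesting too, for example: 5 說道 ("said"), 7 笑道 ("said with a laugh"), 12 心中 ("in one's heart"), 17 只見 ("only to see"), 22 不知 ("not knowing"), 30 這時 ("at this moment"), 49 左手 ("left hand").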
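
For anyone who would rather stay in Java than chain `sort`, `awk`, and `sed`, steps 2.1 to 2.3 can be sketched roughly as follows. This is only an illustration under my own assumptions: the class name `TopWords` is made up, each input line is assumed to be `word<TAB>count` as in `words.txt`, and the sample entry for 的 carries an invented count so the length filter has something to drop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the post-processing pipeline: filter short words, sort by count
// descending, and number the top entries.
class TopWords {

    // Expects lines of the form "word<whitespace>count". Keeps words with at
    // least two characters, sorts by count (highest first), and returns the
    // first `limit` rows formatted as "rank word count".
    static List<String> topWords(List<String> lines, int limit) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].length() >= 2) {
                rows.add(parts);
            }
        }
        // Numeric descending order, the counterpart of `sort -k2rn`.
        rows.sort((a, b) -> Integer.compare(Integer.parseInt(b[1]), Integer.parseInt(a[1])));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows.size() && i < limit; i++) {
            out.add((i + 1) + " " + rows.get(i)[0] + " " + rows.get(i)[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three counts taken from the table above; the count for 的 is invented.
        List<String> sample = Arrays.asList("黃蓉\t4621", "的\t9999", "郭靖\t6427", "歐陽\t1660");
        for (String row : topWords(sample, 2)) {
            System.out.println(row);
        }
    }
}
```

Note that `String.length()` counts UTF-16 code units, which matches the character count for the common CJK characters here, while `awk`'s `length` may count bytes or characters depending on the awk implementation and locale.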
