First, a pointer to the reference material: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's piece there, which counts how often character names appear in a wuxia novel, is great fun, so here is a rough imitation of the same exercise.
The differences from that post are:
0) That post uses Hadoop Streaming; this one uses the Java MapReduce framework.
1) A different Chinese word-segmentation tool: IKAnalyzer, whose homepage is http://code.google.com/p/ik-analyzer/.
2) The corpus here is 《射雕英雄傳》 (The Legend of the Condor Heroes). A little variety never hurts.
0) Start from the WordCount example source and modify its Mapper so that the map step tokenizes each input line with IKAnalyzer.
import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use toString() rather than getBytes(): Text.getBytes() returns the
            // internal buffer, which may hold stale bytes beyond getLength().
            IKSegmenter iks = new IKSegmenter(new StringReader(value.toString()), true);
            Lexeme t;
            // Emit (token, 1) for every lexeme IKAnalyzer produces.
            while ((t = iks.next()) != null) {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
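Before packaging, it can be worth sanity-checking the tokenizer on its own, outside Hadoop. The sketch below is mine (the class name and sample sentence are arbitrary, not from the original post); it relies only on the same IKSegmenter constructor and next()/getLexemeText() calls already used in the mapper above.

import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Quick local smoke test of IKAnalyzer segmentation, no Hadoop involved.
public class SegmenterSmokeTest {
    public static void main(String[] args) throws Exception {
        String sample = "郭靖和黃蓉在桃花島上見到了黃葯師";  // any line of the novel will do
        IKSegmenter iks = new IKSegmenter(new StringReader(sample), true);
        Lexeme t;
        while ((t = iks.next()) != null) {
            System.out.println(t.getLexemeText());  // one token per line
        }
    }
}

Running this with the IKAnalyzer jar on the classpath prints one token per line; if names such as 郭靖 come out as single tokens here, they will be counted as single terms by the job as well.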
1) With that, the code is done and runs fine in the local plugin-simulated environment. Package it into a jar together with the IKAnalyzer segmentation jar (for example, by placing that jar in a lib/ directory inside the job jar, which Hadoop adds to the task classpath) and submit it to the cluster:
hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0
...mapping reducing...
hadoop fs -ls ./out0
hadoop fs -get ./out0/part-r-00000 words.txt
2) Post-processing the output:
2.1) Sorting the data. The goal is to sort on the count column (field 2); after a couple of attempts, the reverse numeric sort (-k2rn) is the one that puts the most frequent terms first:
head words.txt
tail words.txt
sort -k2 words.txt > 0.txt
head 0.txt
tail 0.txt
sort -k2r words.txt > 0.txt
head 0.txt
tail 0.txt
sort -k2rn words.txt > 0.txt
head -n 50 0.txt
2.2) Extracting the target terms: keep only tokens that are at least two characters long, which drops most of the single-character noise.
awk '{if(length($1)>=2) print $0}' 0.txt >1.txt
2.3) Presenting the results: number the top 50 lines (the first sed = prints each line's number on its own line, and the second sed joins it back onto the data line):
head -n 50 1.txt | sed = | sed 'N;s/\n//'
1郭靖 6427
2黃蓉 4621
3歐陽 1660
4甚么 1430
5說道 1287
6洪七公 1225
7笑道 1214
8自己 1193
9一個 1160
10師父 1080
11黃葯師 1059
12心中 1046
13兩人 1016
14武功 950
15咱們 925
16一聲 912
17只見 827
18他們 782
19心想 780
20周伯通 771
21功夫 758
22不知 755
23歐陽克 752
24聽得 741
25丘處機 732
26當下 668
27爹爹 664
28只是 657
29知道 654
30這時 639
31之中 621
32梅超風 586
33身子 552
34都是 540
35不是 534
36如此 531
37柯鎮惡 528
38到了 523
39不敢 522
40裘千仞 521
41楊康 520
42你們 509
43這一 495
44卻是 478
45眾人 476
46二人 475
47鐵木真 469
48怎么 464
49左手 452
50地下 448
Among the terms that are not character names, quite a few are interesting in their own right, for example: #5 說道 ("said"), #7 笑道 ("said with a laugh"), #12 心中 ("in one's heart"), #17 只見 ("only saw"), #22 不知 ("did not know"), #30 這時 ("at this moment"), and #49 左手 ("left hand").