Hadoop: A Few Basic MapReduce2 Examples

Reposted from the original article, 2015-05-30 13:19.

1) WordCount

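There is not much to add here, since WordCount is covered everywhere. Two detailed walkthroughs are available online:

http://www.sxt.cn/u/235/blog/5809

http://www.cnblogs.com/zhanghuijunjava/archive/2013/04/27/3036549.html

Both are well written, and their diagrams are especially clear.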

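2) Deduplication (Distinct)

This is similar to select distinct(x) from table in a database, and it is even simpler than WordCount. Suppose we want to deduplicate the lines of an input file (the same file is also the input for the later examples).

Almost nothing needs to be done: in the map phase, emit each line's value as the key, and in the reduce phase write each key back out.

Note: the code uses a custom class, HDFSUtil, which can be found in the article "hadoop: HDFS API examples".

How it works: after the map phase and before reduce starts, the shuffle-and-sort step groups records by key, so identical keys are merged and reduce sees each distinct key exactly once. The duplicates disappear as a side effect. (The combiner set in the code is an optional optimization that drops duplicates early on the map side.)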

package yjmyzz.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import yjmyzz.util.HDFSUtil;

import java.io.IOException;

public class RemoveDup {

    // Emits each input line as the key; the value carries no information.
    public static class RemoveDupMapper
            extends Mapper<Object, Text, Text, NullWritable> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    // Called once per distinct key, so writing the key once drops the duplicates.
    public static class RemoveDupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        public void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: RemoveDup <in> [<in>...] <out>");
            System.exit(2);
        }

        // Delete the output directory (optional; avoids the "output directory already exists" error on reruns)
        HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);

        Job job = Job.getInstance(conf, "RemoveDup");
        job.setJarByClass(RemoveDup.class);
        job.setMapperClass(RemoveDupMapper.class);
        job.setCombinerClass(RemoveDupReducer.class);
        job.setReducerClass(RemoveDupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Every argument except the last is an input path; the last is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Output: each distinct line of the input, written exactly once.
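The grouping step that makes deduplication work can be imitated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: `DistinctSketch` and its methods are hypothetical names, the sample values are made up, and a `TreeMap` stands in for the framework's shuffle-and-sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of map -> shuffle/sort -> reduce for deduplication.
// No Hadoop classes are used; DistinctSketch and its methods are hypothetical names.
class DistinctSketch {

    // "Map" + "shuffle": each line becomes a key; a sorted map groups equal keys together.
    static Map<String, Integer> shuffle(List<String> lines) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.merge(line, 1, Integer::sum); // the value only counts duplicates
        }
        return grouped;
    }

    // "Reduce": visited once per distinct key, so emitting the key alone removes duplicates.
    static List<String> distinct(List<String> lines) {
        return new ArrayList<>(shuffle(lines).keySet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("2", "8", "8", "3", "2"); // made-up sample lines
        System.out.println(distinct(input)); // [2, 3, 8]
    }
}
```

The result comes out sorted because the grouping structure is sorted, which mirrors how reducer input keys arrive in key order.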

3) Record count (Count)

This differs slightly from WordCount: the effect is similar to select count(*) from table. The code is very simple as well; WordCount only needs a small change.

package yjmyzz.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import yjmyzz.util.HDFSUtil;

import java.io.IOException;

public class RowCount {

    // Emits ("count", 1) for every input row, whatever the row contains.
    public static class RowCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final static Text countKey = new Text("count");

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            context.write(countKey, one);
        }
    }

    // Sums the ones (or the partial sums produced by the combiner) for the single "count" key.
    public static class RowCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: RowCount <in> [<in>...] <out>");
            System.exit(2);
        }
        // Delete the output directory (optional)
        HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);

        Job job = Job.getInstance(conf, "RowCount");
        job.setJarByClass(RowCount.class);
        job.setMapperClass(RowCountMapper.class);
        job.setCombinerClass(RowCountReducer.class);
        job.setReducerClass(RowCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Output: count 11

Note: to output only the number, without the "count" key, the job can be refined as follows:

package yjmyzz.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import yjmyzz.util.HDFSUtil;

import java.io.IOException;

public class RowCount2 {

    // Counts rows locally and emits a single partial count when the task ends.
    public static class RowCount2Mapper
            extends Mapper<LongWritable, Text, LongWritable, NullWritable> {

        private long count = 0;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            count += 1;
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(count), NullWritable.get());
        }
    }

    // Adds up the partial counts and emits the total once, in cleanup.
    public static class RowCount2Reducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {

        private long count = 0;

        @Override
        public void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Equal partial counts from different tasks share one key, so add the key once per value.
            for (NullWritable ignored : values) {
                count += key.get();
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(count), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: RowCount2 <in> [<in>...] <out>");
            System.exit(2);
        }

        // Delete the output directory (optional; avoids the "output directory already exists" error on reruns)
        HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);

        Job job = Job.getInstance(conf, "RowCount2");
        job.setJarByClass(RowCount2.class);
        job.setMapperClass(RowCount2Mapper.class);
        job.setCombinerClass(RowCount2Reducer.class);
        job.setReducerClass(RowCount2Reducer.class);
        // A single reducer is required for the output to be one global number.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);

        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

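This way the output contains only the number 11.

Note: here context.write(xxx) must be placed in the cleanup method. Both the Mapper and the Reducer base classes provide it, and the framework calls it once after all map (or reduce) calls of a task have finished. Try moving context.write(xxx) into map and reduce instead: the result will contain multiple lines rather than the single expected number.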
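The effect of writing in cleanup rather than in map can be seen in a small plain-Java imitation of the task lifecycle. This is a sketch under stated assumptions: `LifecycleSketch` and `CountingTask` are hypothetical names, the `emitted` list stands in for `context.write`, and the `run` method only mimics the order in which the framework drives a task (one map call per record, then cleanup).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java imitation of the per-task lifecycle: one map() call per record, then cleanup().
// LifecycleSketch and CountingTask are hypothetical names; no Hadoop classes are involved.
class LifecycleSketch {

    static class CountingTask {
        private long count = 0;
        final List<Long> emitted = new ArrayList<>(); // stands in for context.write(...)

        void map(String line) {
            count += 1; // accumulate only; nothing is emitted per record
        }

        void cleanup() {
            emitted.add(count); // one record for the whole task
        }

        // Mimics the order in which the framework drives a task.
        List<Long> run(List<String> split) {
            for (String line : split) {
                map(line);
            }
            cleanup();
            return emitted;
        }
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("a", "b", "c"); // made-up input split
        System.out.println(new CountingTask().run(split)); // [3]
    }
}
```

Moving the `emitted.add` call into `map` would produce one record per input line, which is the multi-line result described above.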

4) Maximum (Max)

package yjmyzz.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import yjmyzz.util.HDFSUtil;

import java.io.IOException;

public class Max {

    // Tracks the largest value seen by this task and emits it once, in cleanup.
    public static class MaxMapper
            extends Mapper<LongWritable, Text, LongWritable, NullWritable> {

        private long max = Long.MIN_VALUE;

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is expected to hold exactly one integer.
            max = Math.max(Long.parseLong(value.toString().trim()), max);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(max), NullWritable.get());
        }
    }

    // Takes the maximum of the per-task maxima; duplicates of a key do not matter for max.
    public static class MaxReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {

        private long max = Long.MIN_VALUE;

        @Override
        public void reduce(LongWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            max = Math.max(max, key.get());
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(max), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: Max <in> [<in>...] <out>");
            System.exit(2);
        }

        // Delete the output directory (optional; avoids the "output directory already exists" error on reruns)
        HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);

        Job job = Job.getInstance(conf, "Max");
        job.setJarByClass(Max.class);
        job.setMapperClass(MaxMapper.class);
        job.setCombinerClass(MaxReducer.class);
        job.setReducerClass(MaxReducer.class);
        // A single reducer is required for the output to be one global number.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);

        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Output: 8

If the RowCount2 code above makes sense, this one needs little further explanation.
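One detail that may be worth spelling out is why the same logic can serve as both combiner and reducer: max is associative and commutative, and Long.MIN_VALUE acts as its identity, so a task that sees no rows contributes nothing. The plain-Java sketch below illustrates this with made-up splits; `MaxSketch`, `localMax`, and `globalMax` are hypothetical names, and no Hadoop classes are used.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of merging per-task maxima.
// MaxSketch and its methods are hypothetical names; no Hadoop classes are used.
class MaxSketch {

    // Per-task maximum; an empty split yields Long.MIN_VALUE, the identity for max.
    static long localMax(List<Long> split) {
        long max = Long.MIN_VALUE;
        for (long v : split) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Merging partial maxima uses the same operation as computing them.
    static long globalMax(List<List<Long>> splits) {
        long max = Long.MIN_VALUE;
        for (List<Long> split : splits) {
            max = Math.max(max, localMax(split));
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up splits, including an empty one.
        List<List<Long>> splits = Arrays.asList(
                Arrays.asList(3L, 8L),
                Collections.<Long>emptyList(),
                Arrays.asList(5L));
        System.out.println(globalMax(splits)); // 8
    }
}
```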

5) Sum

package yjmyzz.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import yjmyzz.util.HDFSUtil;

import java.io.IOException;

public class Sum {

    // Adds up the values seen by this task and emits the partial sum once, in cleanup.
    public static class SumMapper
            extends Mapper<LongWritable, Text, LongWritable, NullWritable> {

        private long sum = 0;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is expected to hold exactly one integer.
            sum += Long.parseLong(value.toString().trim());
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(sum), NullWritable.get());
        }
    }

    // Adds up the partial sums and emits the total once, in cleanup.
    public static class SumReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {

        private long sum = 0;

        @Override
        public void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Equal partial sums from different tasks share one key, so add the key once per value.
            for (NullWritable ignored : values) {
                sum += key.get();
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(sum), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: Sum <in> [<in>...] <out>");
            System.exit(2);
        }

        // Delete the output directory (optional; avoids the "output directory already exists" error on reruns)
        HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);

        Job job = Job.getInstance(conf, "Sum");
        job.setJarByClass(Sum.class);
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        // A single reducer is required for the output to be one global number.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);

        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Output: 43

Sum works on exactly the same principle as Max, so no further explanation is needed. It again relies on the cleanup method.
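There is one subtlety when a partial sum travels as the key. If two map tasks happen to emit the same partial sum, the shuffle groups them under a single key with several values, so a reduce step has to count that key once per value. Max is immune to this, but Sum and Count are not. The plain-Java sketch below shows the difference on made-up numbers; `PartialSumSketch` and its methods are hypothetical names, and a `TreeMap` stands in for the shuffle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of a reduce step whose keys are partial sums.
// PartialSumSketch and its methods are hypothetical names; no Hadoop classes are used.
class PartialSumSketch {

    // "Shuffle": group the partial sums; the value records how many tasks emitted each one.
    static Map<Long, Integer> shuffle(List<Long> partialSums) {
        Map<Long, Integer> grouped = new TreeMap<>();
        for (Long partial : partialSums) {
            grouped.merge(partial, 1, Integer::sum);
        }
        return grouped;
    }

    // Adds each key once, ignoring how many tasks produced it (loses duplicates).
    static long sumKeysOnce(List<Long> partialSums) {
        long sum = 0;
        for (long key : shuffle(partialSums).keySet()) {
            sum += key;
        }
        return sum;
    }

    // Adds each key once per grouped value, so every task's contribution is kept.
    static long sumPerValue(List<Long> partialSums) {
        long sum = 0;
        for (Map.Entry<Long, Integer> e : shuffle(partialSums).entrySet()) {
            sum += e.getKey() * e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> partials = Arrays.asList(10L, 10L, 5L); // two tasks happen to emit 10
        System.out.println(sumKeysOnce(partials)); // 15
        System.out.println(sumPerValue(partials)); // 25
    }
}
```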

6) Average (Avg)

  1 package yjmyzz.mr;
  2 
  3 import org.apache.hadoop.conf.Configuration;
  4 import org.apache.hadoop.fs.Path;
  5 import org.apache.hadoop.io.*;
  6 import org.apache.hadoop.mapreduce.Job;
  7 import org.apache.hadoop.mapreduce.Mapper;
  8 import org.apache.hadoop.mapreduce.Reducer;
  9 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 10 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 11 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 12 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 13 import org.apache.hadoop.util.GenericOptionsParser;
 14 import yjmyzz.util.HDFSUtil;
 15 
 16 import java.io.IOException;
 17 
 18 
 19 public class Average {
 20     // Each task emits one {sum, count} pair from cleanup: sum is the key, count is the value.
 21     public static class AvgMapper
 22             extends Mapper<LongWritable, Text, LongWritable, LongWritable> {
 23 
 24         public long sum = 0;
 25         public long count = 0;
 26 
 27         public void map(LongWritable key, Text value, Context context)
 28                 throws IOException, InterruptedException {
 29             sum += Long.parseLong(value.toString().trim());
 30             count += 1;
 31         }
 32 
 33         protected void cleanup(Context context) throws IOException, InterruptedException {
 34             context.write(new LongWritable(sum), new LongWritable(count));
 35         }
 36 
 37     }
 38 
 39     public static class AvgCombiner extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
 40 
 41         public long sum = 0;
 42         public long count = 0;
 43         // Equal sums from different tasks share one key, so add the key once per value.
 44         public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
 45                 throws IOException, InterruptedException {
 46             for (LongWritable v : values) {
 47                 sum += key.get();
 48                 count += v.get();
 49             }
 50         }
 51 
 52         protected void cleanup(Context context) throws IOException, InterruptedException {
 53             context.write(new LongWritable(sum), new LongWritable(count));
 54         }
 55 
 56     }
 57 
 58     public static class AvgReducer extends Reducer<LongWritable, LongWritable, DoubleWritable, NullWritable> {
 59 
 60         public long sum = 0;
 61         public long count = 0;
 62         // Same merge as the combiner; the division happens only once, in cleanup.
 63         public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
 64                 throws IOException, InterruptedException {
 65             for (LongWritable v : values) {
 66                 sum += key.get();
 67                 count += v.get();
 68             }
 69         }
 70 
 71 
 72         protected void cleanup(Context context) throws IOException, InterruptedException {
 73             context.write(new DoubleWritable((double) sum / count), NullWritable.get());
 74         }
 75 
 76     }
 77 
 78     public static void main(String[] args) throws Exception {
 79         Configuration conf = new Configuration();
 80         String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
 81         if (otherArgs.length < 2) {
 82             System.err.println("Usage: Avg <in> [<in>...] <out>");
 83             System.exit(2);
 84         }
 85 
 86         // Delete the output directory (optional; avoids the "output directory already exists" error on reruns)
 87         HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);
 88 
 89         Job job = Job.getInstance(conf, "Avg");
 90         job.setJarByClass(Average.class);
 91         job.setMapperClass(AvgMapper.class);
 92         job.setCombinerClass(AvgCombiner.class);
 93         job.setReducerClass(AvgReducer.class);
 94 
 95         // Note: the Mapper and Reducer output key/value types differ, so the Mapper types must be set separately
 96         job.setMapOutputKeyClass(LongWritable.class);
 97         job.setMapOutputValueClass(LongWritable.class);
 98 
 99 
100         job.setOutputKeyClass(DoubleWritable.class);
101         job.setOutputValueClass(NullWritable.class);
102 
103         for (int i = 0; i < otherArgs.length - 1; ++i) {
104             FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
105         }
106         FileOutputFormat.setOutputPath(job,
107                 new Path(otherArgs[otherArgs.length - 1]));
108         System.exit(job.waitForCompletion(true) ? 0 : 1);
109     }
110 
111 
112 }

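Output: 3.909090909090909

This one is slightly more involved. Since average = sum / count, it simply combines the earlier Count and Sum examples. The idea is to use sum as the key and count as the value of the intermediate key-value pairs, producing {sum, count} records, and then to compute sum / count in the final cleanup to obtain the average. One point needs attention: the Mapper and Reducer output {key, value} types differ, so lines 96-101 set the map and reduce output types separately. Without lines 96-97, lines 100-101 would by default set the output types of the Mapper, Combiner, and Reducer to the same types.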
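Why the job carries {sum, count} pairs all the way to the final cleanup, instead of letting each task compute its own average, can be illustrated in plain Java. The sketch below uses made-up partial results; `AverageSketch`, `Partial`, and the two method names are hypothetical, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of why the job carries (sum, count) pairs instead of per-task averages.
// AverageSketch and Partial are hypothetical names; no Hadoop classes are used.
class AverageSketch {

    // One task's partial result.
    static final class Partial {
        final long sum;
        final long count;

        Partial(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Correct: add up sums and counts separately, divide once at the end.
    static double mergeThenDivide(List<Partial> partials) {
        long sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return (double) sum / count;
    }

    // Incorrect in general: the mean of per-task means ignores how many rows each task saw.
    static double averageOfAverages(List<Partial> partials) {
        double total = 0;
        for (Partial p : partials) {
            total += (double) p.sum / p.count;
        }
        return total / partials.size();
    }

    public static void main(String[] args) {
        // Made-up split: one task saw 1, 2, 3 (sum 6); the other saw only 10.
        List<Partial> partials = Arrays.asList(new Partial(6, 3), new Partial(10, 1));
        System.out.println(mergeThenDivide(partials));   // 4.0
        System.out.println(averageOfAverages(partials)); // 6.0
    }
}
```

The true mean of 1, 2, 3, 10 is 4.0, which only the merge-then-divide version reproduces.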

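7) Improved WordCount (sorted by descending frequency)

The official WordCount example only counts how often each word occurs and does not sort by frequency. The code below adds that ordering.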

package yjmyzz.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import yjmyzz.util.HDFSUtil;

import java.io.IOException;
import java.util.Comparator;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCount2 {

    // Same as the stock WordCount mapper: emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Holds the results keyed by count. A TreeMap sorts keys in ascending order by default,
        // so a Comparator is supplied to get descending order.
        private TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>(new Comparator<Integer>() {
            @Override
            public int compare(Integer x, Integer y) {
                return y.compareTo(x);
            }
        });

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Store the reduced result in the TreeMap instead of writing it to the context.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            if (treeMap.containsKey(sum)) {
                // Words with the same count are joined with commas.
                String value = treeMap.get(sum) + "," + key.toString();
                treeMap.put(sum, value);
            } else {
                treeMap.put(sum, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Write the TreeMap contents to the context as (words, count), highest count first.
            for (Map.Entry<Integer, String> entry : treeMap.entrySet()) {
                context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount2 <in> [<in>...] <out>");
            System.exit(2);
        }
        // Delete the output directory
        HDFSUtil.deleteFile(conf, otherArgs[otherArgs.length - 1]);
        Job job = Job.getInstance(conf, "word count2");
        job.setJarByClass(WordCount2.class);
        job.setMapperClass(TokenizerMapper.class);
        // No combiner: IntSumReducer emits comma-joined word lists from cleanup,
        // and those must not be fed back into reduce as if they were single words.
        job.setReducerClass(IntSumReducer.class);
        // A single reducer is required for one globally ordered output.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

How it works: cleanup is used once more, and the ordering comes from TreeMap, a data structure that keeps its keys sorted.
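The TreeMap technique can be tried on its own, outside Hadoop. The following plain-Java sketch counts words and then indexes them by count in descending order, joining ties with commas. `FrequencySortSketch` and `byFrequency` are hypothetical names, and `Comparator.reverseOrder()` is used here as a shorter stand-in for a hand-written descending comparator.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of ordering words by descending frequency with a TreeMap.
// FrequencySortSketch and byFrequency are hypothetical names; no Hadoop classes are used.
class FrequencySortSketch {

    static TreeMap<Integer, String> byFrequency(List<String> words) {
        // Step 1: count each word; a TreeMap keeps the words sorted, like reducer input keys.
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Step 2: index the words by count, highest count first; ties are joined with commas.
        TreeMap<Integer, String> ranked = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            ranked.merge(e.getValue(), e.getKey(), (a, b) -> a + "," + b);
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "I will be the watcher I will be the guard dog".split(" "));
        for (Map.Entry<Integer, String> e : byFrequency(words).entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Because everything is held in one in-memory map, this approach suits vocabularies that fit comfortably in a single reducer's memory.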

To make the result easier to see, the input is the first verse of the theme song from the film Big Hero 6:

They say we are what we are
But we do not have to be
I am  bad behavior but I do it in the best way
I will be the watcher
Of the eternal flame
I will be the guard dog
of all your fever dreams

The original WordCount produces the following result:

But	1
I	4
Of	1
They	1
all	1
am	1
are	2
bad	1
be	3
behavior	1
best	1
but	1
do	2
dog	1
dreams	1
eternal	1
fever	1
flame	1
guard	1
have	1
in	1
it	1
not	1
of	1
say	1
the	4
to	1
watcher	1
way	1
we	3
what	1
will	2
your	1

The improved WordCount2 produces the following result:

I,the	4
be,we	3
are,do,will	2
But,Of,They,all,am,bad,behavior,best,but,dog,dreams,eternal,fever,flame,guard,have,in,it,not,of,say,to,watcher,way,what,your	1
