Solving the Top K Problem with Hadoop MapReduce

This article is a repost (posted 2012-12-04 18:17, tagged hadoop).

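The top K solution that turns up in web searches leaves, in my view, a few things unexplained. Since we want the top K, we should state explicitly that the number of reduce tasks is one.

Otherwise I really cannot see how the result could be the top K. A side note: each map task runs as its own process. There are as many sets of intermediate output as there are map tasks, and as many final output files as there are reduce tasks. With that in mind the reasoning is straightforward: the top K we want is the global top K records, so no matter how many map tasks run, exactly one reduce task must aggregate their data and output the top K.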

The approach and the code follow.

1. Mappers

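Use the default number of mappers: each input split is processed by one mapper.

In each map task we find the top K records of that task's input split. A TreeMap holds the current top K because it is easy to update. For every record the map function sees, we try to add it to the TreeMap. (I first thought of the TreeMap as a max-heap. It is actually a sorted map backed by a red-black tree, and because we evict the smallest key whenever the size exceeds K, it behaves here like a bounded min-heap.) Once every record has been offered, the TreeMap holds the local top K values for the split. One thing to note: a typical mapper calls context.write (or output.collect in the old API) once for each record it processes. Here we do not. We write only after all the data in the input split has been processed, so the context.write calls go in cleanup, a method that runs once after the mapper task has finished processing its records.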
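The bookkeeping described above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class name `LocalTopK` and the methods `offer` and `drain` are hypothetical stand-ins for what `map` and `cleanup` would do, and records are plain strings rather than `Text` objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-alone sketch of the per-split bookkeeping; not Hadoop code.
class LocalTopK {
    private final int k;
    // Sorted ascending by score, so firstKey() is always the smallest score kept.
    private final TreeMap<Integer, String> best = new TreeMap<>();

    LocalTopK(int k) {
        this.k = k;
    }

    // Plays the role of map(): remember the record, then evict the smallest if over K.
    void offer(int score, String record) {
        best.put(score, record); // an equal score overwrites the earlier record
        if (best.size() > k) {
            best.remove(best.firstKey());
        }
    }

    // Plays the role of cleanup(): emit the survivors once, in ascending score order.
    List<String> drain() {
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        LocalTopK top = new LocalTopK(3);
        for (int s : new int[] {5, 1, 9, 3, 7}) {
            top.offer(s, "record-" + s);
        }
        System.out.println(top.drain()); // [record-5, record-7, record-9]
    }
}
```

Nothing is emitted while records are being offered; the output happens once at the end, which is the reason the real mapper writes from cleanup.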

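2. Reducers

As explained above, there is only one reducer. It aggregates the mappers' output once more and selects the top K from it, which gives us the result we want. Note that we are using NullWritable as the key here. The reason is that we want the outputs from all of the mappers to be grouped under a single key in the reducer, so that one reduce call sees every local top K record.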
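To see why one final aggregation step is enough, here is a small plain-Java simulation of the two stages. It is a sketch, not Hadoop code: the class `GlobalTopKMerge` and its methods are hypothetical, each inner list stands in for one input split, and the values are assumed to be distinct integers (a TreeMap would silently drop duplicates).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical simulation: several "mappers" feed a single "reducer".
class GlobalTopKMerge {
    // Keep the k largest values of one list; used by both stages.
    static List<Integer> topK(List<Integer> values, int k) {
        TreeMap<Integer, Integer> best = new TreeMap<>();
        for (int v : values) {
            best.put(v, v);
            if (best.size() > k) {
                best.remove(best.firstKey()); // evict the current smallest
            }
        }
        return new ArrayList<>(best.keySet()); // ascending order
    }

    // Stage 1: local top k per split. Stage 2: one pass over all survivors.
    static List<Integer> globalTopK(List<List<Integer>> splits, int k) {
        List<Integer> survivors = new ArrayList<>();
        for (List<Integer> split : splits) {
            survivors.addAll(topK(split, k));
        }
        return topK(survivors, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(4, 18, 7, 1),
                Arrays.asList(25, 3, 9),
                Arrays.asList(12, 30, 2, 6));
        System.out.println(globalTopK(splits, 3)); // [18, 25, 30]
    }
}
```

Any record in the global top K is necessarily in the top K of its own split, so discarding everything else in stage 1 loses nothing. The second stage has to see all survivors together, which is what the single reducer provides.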

package seven.ili.patent;

/**
 * Author: Isaac Li
 * Date: 12/4/12
 */

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.TreeMap;

// Use MapReduce to find the K largest records in a massive data set
public class Top_k_new extends Configured implements Tool {

    public static class MapClass extends Mapper<LongWritable, Text, NullWritable, Text> {
        public static final int K = 100;
        // Sorted ascending by score; records with equal scores overwrite one another
        private TreeMap<Integer, Text> fatcats = new TreeMap<Integer, Text>();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The score is the 9th comma-separated field of the input line
            String[] str = value.toString().split(",", -2);
            int temp = Integer.parseInt(str[8]);
            // Hadoop reuses the Text instance between calls, so store a copy
            fatcats.put(temp, new Text(value));
            if (fatcats.size() > K)
                fatcats.remove(fatcats.firstKey()); // evict the smallest score
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the local top K once, after the whole split has been processed
            for (Text text : fatcats.values()) {
                context.write(NullWritable.get(), text);
            }
        }
    }

    public static class Reduce extends Reducer<NullWritable, Text, NullWritable, Text> {
        public static final int K = 100;
        private TreeMap<Integer, Text> fatcats = new TreeMap<Integer, Text>();

        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text val : values) {
                // The mapper emits the input line unchanged, so parse it the same way
                String[] v = val.toString().split(",", -2);
                int weight = Integer.parseInt(v[8]);
                // The iterator reuses the Text instance as well, so store a copy
                fatcats.put(weight, new Text(val));
                if (fatcats.size() > K)
                    fatcats.remove(fatcats.firstKey());
            }
            for (Text text : fatcats.values())
                context.write(NullWritable.get(), text);
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "TopKNum");
        job.setJarByClass(Top_k_new.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MapClass.class);
        // job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        // State explicitly that a single reduce task produces the global top K
        job.setNumReduceTasks(1);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // Return the status and let main() decide the exit code
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Top_k_new(), args);
        System.exit(res);
    }

}
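One caveat about keying a TreeMap by the score, as the listing above does: two records with the same score share a key, so the later one replaces the earlier one and the result can hold fewer than K genuine top records. A possible alternative, which is not part of the original solution, is a bounded min-heap built on `java.util.PriorityQueue`. The sketch below is plain Java with hypothetical names (`HeapTopK`, `Scored`), and it assumes records are simple score and line pairs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical alternative to the TreeMap: a bounded min-heap that keeps ties.
class HeapTopK {
    // Hypothetical record type: a score plus the original line.
    static final class Scored {
        final int score;
        final String line;

        Scored(int score, String line) {
            this.score = score;
            this.line = line;
        }
    }

    // Returns the lines of the k highest-scoring records, highest score first.
    static List<String> topK(List<Scored> input, int k) {
        Comparator<Scored> byScore = Comparator.comparingInt((Scored s) -> s.score);
        // Min-heap on score: the head is the weakest record currently kept.
        PriorityQueue<Scored> heap = new PriorityQueue<>(byScore);
        for (Scored s : input) {
            heap.offer(s);
            if (heap.size() > k) {
                heap.poll(); // drop the weakest; equal scores are separate entries
            }
        }
        List<Scored> kept = new ArrayList<>(heap);
        kept.sort(byScore.reversed());
        List<String> lines = new ArrayList<>();
        for (Scored s : kept) {
            lines.add(s.line);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Scored> input = new ArrayList<>();
        input.add(new Scored(7, "a"));
        input.add(new Scored(7, "b"));
        input.add(new Scored(3, "c"));
        input.add(new Scored(9, "d"));
        // Keeps "d" plus both records scored 7; a TreeMap keyed by score would lose one.
        System.out.println(topK(input, 3));
    }
}
```

The relative order of records with equal scores is not defined in this sketch, so only the set of returned records should be relied on.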

Reference: http://www.greenplum.com/blog/topics/hadoop/how-hadoop-mapreduce-can-transform-how-you-build-top-ten-lists
