MapReduce Example: Secondary Sort

Reposted article, 2018-10-09 17:11, Big Data

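Principle

In the Map phase, the InputFormat set with job.setInputFormatClass divides the input data set into small blocks called splits, and it also provides a RecordReader implementation. This experiment uses TextInputFormat, whose RecordReader uses the byte offset of each line as the key and the text of that line as the value. That is why the custom Map takes <LongWritable, Text> as input. The framework then calls the map method of the custom Map once for every <LongWritable, Text> pair. Note that the output must match the output types declared by the custom Map, <IntPair, IntWritable>. The result is a list of <IntPair, IntWritable> pairs.

At the end of the map phase, the class set with job.setPartitionerClass partitions this list, and each partition is mapped to one reducer. Within each partition, the pairs are sorted by the key comparator class set with job.setSortComparatorClass. This is already a two-level ordering. If no key comparator class has been set with job.setSortComparatorClass, the compareTo method implemented by the key is used instead. This experiment relies on the compareTo method implemented by IntPair.

In the Reduce phase, once the reducer has received all the map output assigned to it, it again sorts all pairs with the key comparator class set with job.setSortComparatorClass. It then builds a value iterator for each key. This is where grouping comes in, using the grouping comparator class set with job.setGroupingComparatorClass. Whenever this comparator reports two keys as equal, they belong to the same group and their values go into the same value iterator; the key of that iterator is the first key of the group. Finally, the Reducer's reduce method is called once for each (key, value iterator) pair. As before, the input and output types must match those declared by the custom Reducer.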
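The sort-then-group flow described above can be sketched locally without Hadoop. The class below is only an illustration under stated assumptions: ShuffleSketch, SORT, and run are hypothetical names, records are plain int arrays of the form {first, second}, a single reducer is assumed, and the ordering (first field descending, second field ascending) is the one this experiment's IntPair key uses. It is written in Java 7 style to match the JDK listed in the environment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hadoop-free simulation of "sort by composite key, then group by first field".
class ShuffleSketch {
    // Ordering applied between map and reduce: first descending, then second ascending.
    static final Comparator<int[]> SORT = new Comparator<int[]>() {
        @Override
        public int compare(int[] a, int[] b) {
            if (a[0] != b[0]) {
                return Integer.compare(b[0], a[0]);
            }
            return Integer.compare(a[1], b[1]);
        }
    };

    // Sorts the records, then starts a new group (marked by a separator line)
    // every time the first field changes, like one reduce() call per group.
    static List<String> run(List<int[]> records) {
        List<int[]> sorted = new ArrayList<int[]>(records);
        Collections.sort(sorted, SORT);
        List<String> out = new ArrayList<String>();
        boolean started = false;
        int currentFirst = 0;
        for (int[] r : sorted) {
            if (!started || r[0] != currentFirst) {
                out.add("----");
                currentFirst = r[0];
                started = true;
            }
            out.add(r[0] + "\t" + r[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
            new int[] {100, 1010037}, new int[] {100, 1010102},
            new int[] {97, 1010152}, new int[] {96, 1010178},
            new int[] {104, 1010280}, new int[] {103, 1010320},
            new int[] {104, 1010510}, new int[] {96, 1010603},
            new int[] {97, 1010637});
        for (String line : run(records)) {
            System.out.println(line);
        }
    }
}
```

In the real job the framework performs these steps across the network and on serialized keys; the sketch only mirrors the order in which they happen.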

Environment

Linux Ubuntu 14.04

jdk-7u75-linux-x64

hadoop-2.6.0-cdh5.4.5

hadoop-2.6.0-eclipse-cdh5.4.5.jar

eclipse-java-juno-SR2-linux-gtk-x86_64

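Task

On an e-commerce site, users who browse product pages generate access logs that record which products they visited. The goods_visit2 table has two fields (goods_id, click_num) and contains the following data: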

goods_id click_num
1010037 100
1010102 100
1010152 97
1010178 96
1010280 104
1010320 103
1010510 104
1010603 96
1010637 97

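Write MapReduce code that sorts the products by click count (click_num) in descending order, then by goods_id in ascending order, and outputs all products.

The expected output is:

click_num goods_id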
------------------------------------------------
104 1010280
104 1010510
------------------------------------------------
103 1010320
------------------------------------------------
100 1010037
100 1010102
------------------------------------------------
97 1010152
97 1010637
------------------------------------------------
96 1010178
96 1010603

Steps

1. Change to the /apps/hadoop/sbin directory and start Hadoop.

cd /apps/hadoop/sbin
./start-all.sh

2. Create the /data/mapreduce8 directory on the local Linux file system.

mkdir -p /data/mapreduce8

3. Change to the /data/mapreduce8 directory and use wget to download the text file goods_visit2 from http://192.168.1.100:60000/allfiles/mapreduce8/goods_visit2.

cd /data/mapreduce8
wget http://192.168.1.100:60000/allfiles/mapreduce8/goods_visit2

Then, in the same directory, use wget to download the dependency package used by the project from http://192.168.1.100:60000/allfiles/mapreduce8/hadoop2lib.tar.gz.

wget http://192.168.1.100:60000/allfiles/mapreduce8/hadoop2lib.tar.gz

Extract hadoop2lib.tar.gz into the current directory.

tar zxvf hadoop2lib.tar.gz

4. Create the /mymapreduce8/in directory on HDFS, then upload the goods_visit2 file from the local /data/mapreduce8 directory into /mymapreduce8/in on HDFS.

hadoop fs -mkdir -p /mymapreduce8/in
hadoop fs -put /data/mapreduce8/goods_visit2 /mymapreduce8/in

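5. Create a new Java Project named mapreduce8.

Under the mapreduce8 project, create a package named mapreduce.

In the mapreduce package, create a class named SecondarySort.

6. Add the jar files the project depends on. Right-click mapreduce8 and create a folder named hadoop2lib to hold them.

Copy the jar files from the hadoop2lib directory under /data/mapreduce8 into the hadoop2lib directory of the mapreduce8 project in Eclipse.

Select all the jar files in the hadoop2lib directory and add them to the Build Path.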

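7. Write the Java code and describe its design.

Secondary sort: in MapReduce every key is compared and ordered, and this happens in two stages. Keys are first assigned to partitions by the partitioner, and are then sorted by value within each partition. This example also needs two levels of comparison: records are ordered by the first field, and records with the same first field are ordered by the second field. To achieve this we build a composite key class, IntPair, with two fields. The partitioner looks only at the first field, so all keys with the same first field reach the same reducer, and the comparison inside each partition orders the keys by the first field and then by the second. The Java code has five parts: the custom key, the partitioner class, the grouping comparator class, the map part, and the reduce part.

Custom key code: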

public static class IntPair implements WritableComparable<IntPair>
{
    int first;  // first field (click_num)
    int second; // second field (goods_id)

    public void set(int left, int right)
    {
        first = left;
        second = right;
    }

    public int getFirst()
    {
        return first;
    }

    public int getSecond()
    {
        return second;
    }

    // Deserialization: rebuild the IntPair from the binary stream
    @Override
    public void readFields(DataInput in) throws IOException
    {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialization: write the IntPair as binary to the stream
    @Override
    public void write(DataOutput out) throws IOException
    {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Key comparison: first descending, then second ascending
    @Override
    public int compareTo(IntPair o)
    {
        if (first != o.first)
        {
            // A larger first sorts earlier
            return first < o.first ? 1 : -1;
        }
        else if (second != o.second)
        {
            // A smaller second sorts earlier
            return second < o.second ? -1 : 1;
        }
        else
        {
            return 0;
        }
    }

    @Override
    public int hashCode()
    {
        return first * 157 + second;
    }

    @Override
    public boolean equals(Object right)
    {
        if (right == null)
            return false;
        if (this == right)
            return true;
        if (right instanceof IntPair)
        {
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        }
        else
        {
            return false;
        }
    }
}

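Every custom key should implement the WritableComparable interface, because a key must be both serializable and comparable, and it must implement the interface's methods. The class contains the following methods: (1) deserialization, which rebuilds an IntPair from the binary stream: public void readFields(DataInput in) throws IOException; (2) serialization, which writes the IntPair as binary to the stream: public void write(DataOutput out) throws IOException; (3) key comparison: public int compareTo(IntPair o). In addition, a newly defined key class should override two more methods: public int hashCode() and public boolean equals(Object right).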
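To make the serialization contract concrete, the following sketch writes two ints and reads them back with the JDK's DataOutputStream and DataInputStream, which implement the same DataOutput and DataInput interfaces that write and readFields receive. It is an illustration only: IntPairRoundTrip, serialize, and deserialize are hypothetical names, and the pair is represented as a plain int array so the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Round trip of (first, second) through the same calls IntPair uses:
// writeInt/writeInt on the way out, readInt/readInt on the way in.
class IntPairRoundTrip {
    static byte[] serialize(int first, int second) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(first);
            out.writeInt(second);
        } catch (IOException e) {
            // In-memory streams are not expected to fail; rethrow unchecked.
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    static int[] deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            // Fields must be read in exactly the order they were written.
            int first = in.readInt();
            int second = in.readInt();
            return new int[] {first, second};
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(104, 1010280);
        int[] pair = deserialize(bytes);
        System.out.println(bytes.length + " bytes, first=" + pair[0] + ", second=" + pair[1]);
    }
}
```

The point to take away is that readFields must read the fields in the same order and with the same types that write used; each int occupies four bytes, so one key serializes to eight bytes.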

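Partitioner class code:

public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>
{
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions)
    {
        // Use only the first field so equal click counts share a reducer.
        // Masking the sign bit keeps the result non-negative even when the
        // product overflows (Math.abs(Integer.MIN_VALUE) is still negative).
        return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}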

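The key is partitioned by the first field of the custom key: first is multiplied by 127, the product is made non-negative, and the result is taken modulo numPartitions. Because only first is used, all records with the same click count go to the same reducer. This is the first level of the ordering: records are divided by partition. Note that this hash does not order the partitions relative to each other, so with more than one reducer the output is sorted within each output file only.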
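The partition arithmetic can be checked in isolation. The sketch below compares two ways of making the product non-negative: Math.abs and a sign-bit mask. PartitionSketch, withAbs, and withMask are hypothetical names used only for this illustration. For realistic click counts both give the same partition; they differ only in the corner case where the int product equals Integer.MIN_VALUE, which a click count would not normally reach.

```java
// Compares Math.abs with a sign-bit mask for computing a partition index.
class PartitionSketch {
    // Product made non-negative with Math.abs.
    static int withAbs(int first, int numPartitions) {
        return Math.abs(first * 127) % numPartitions;
    }

    // Product made non-negative by clearing the sign bit.
    static int withMask(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Ordinary value: both variants agree.
        System.out.println(withAbs(104, 3));   // 2
        System.out.println(withMask(104, 3));  // 2

        // Corner case: Integer.MIN_VALUE * 127 wraps back to Integer.MIN_VALUE,
        // and Math.abs(Integer.MIN_VALUE) is still negative.
        System.out.println(withAbs(Integer.MIN_VALUE, 3));   // -2, not a valid partition
        System.out.println(withMask(Integer.MIN_VALUE, 3));  // 0
    }
}
```

A negative partition index would make the job fail, which is why the mask form is the more defensive choice even though the sample data never triggers the difference.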

Grouping comparator class code:

public static class GroupingComparator extends WritableComparator
{
    protected GroupingComparator()
    {
        // true: create key instances so compare() receives deserialized IntPair objects
        super(IntPair.class, true);
    }

    // Compare two WritableComparables by their first field only
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        int l = ip1.getFirst();
        int r = ip2.getFirst();
        return l == r ? 0 : (l < r ? -1 : 1);
    }
}

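This is the grouping comparator class. In the reduce phase, when the value iterator for a key is built, all keys with the same first field belong to the same group and share one value iterator. Because this class is a comparator, it must extend WritableComparator.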
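One way to see what the grouping comparator changes is to count how many groups, and therefore how many reduce calls, a sorted key sequence produces under different comparators. The sketch below is a Hadoop-free illustration: GroupingSketch, BY_BOTH_FIELDS, BY_FIRST_ONLY, and countGroups are hypothetical names, and keys are plain int arrays of the form {first, second}.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Counts runs of adjacent keys that a comparator treats as equal.
class GroupingSketch {
    // Same ordering as the sort: first descending, then second ascending.
    static final Comparator<int[]> BY_BOTH_FIELDS = new Comparator<int[]>() {
        @Override
        public int compare(int[] a, int[] b) {
            if (a[0] != b[0]) {
                return Integer.compare(b[0], a[0]);
            }
            return Integer.compare(a[1], b[1]);
        }
    };

    // Grouping rule: keys are equal when their first fields are equal.
    static final Comparator<int[]> BY_FIRST_ONLY = new Comparator<int[]>() {
        @Override
        public int compare(int[] a, int[] b) {
            return Integer.compare(a[0], b[0]);
        }
    };

    // A new group starts whenever the comparator says the key differs from the previous one.
    static int countGroups(List<int[]> sortedKeys, Comparator<int[]> grouping) {
        int groups = 0;
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups++;
            }
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> sortedKeys = Arrays.asList(
            new int[] {104, 1010280}, new int[] {104, 1010510},
            new int[] {103, 1010320},
            new int[] {100, 1010037}, new int[] {100, 1010102},
            new int[] {97, 1010152}, new int[] {97, 1010637},
            new int[] {96, 1010178}, new int[] {96, 1010603});
        System.out.println(countGroups(sortedKeys, BY_BOTH_FIELDS)); // 9
        System.out.println(countGroups(sortedKeys, BY_FIRST_ONLY));  // 5
    }
}
```

Grouping only needs the comparator's answer to "equal or not" for adjacent keys, so the direction of the grouping comparator does not have to match the descending sort order.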

Map code:

public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
{
    // Custom map; the key and value objects are reused across calls
    private final IntPair intkey = new IntPair();
    private final IntWritable intvalue = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        // Each input line has the form "goods_id click_num"
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        int left = 0;  // goods_id
        int right = 0; // click_num
        if (tokenizer.hasMoreTokens())
        {
            left = Integer.parseInt(tokenizer.nextToken());
            if (tokenizer.hasMoreTokens())
            {
                right = Integer.parseInt(tokenizer.nextToken());
            }
            // Key is (click_num, goods_id); value is goods_id
            intkey.set(right, left);
            intvalue.set(left);
            context.write(intkey, intvalue);
        }
    }
}

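In the map phase, the InputFormat set with job.setInputFormatClass divides the input data set into splits and provides a RecordReader implementation. This example uses TextInputFormat, whose RecordReader uses the byte offset of each line (not its line number) as the key and the text of that line as the value. That is why the custom Map takes <LongWritable, Text> as input. The map method is called once for each <LongWritable, Text> pair, and its output must match the declared output types, <IntPair, IntWritable>. The result is a list of <IntPair, IntWritable> pairs. At the end of the map phase, the class set with job.setPartitionerClass partitions this list, with each partition mapped to one reducer, and each partition is sorted by the key comparator class set with job.setSortComparatorClass. This is already a two-level ordering. If no key comparator class is set with job.setSortComparatorClass, the key's own compareTo method is used. This example uses the compareTo method implemented by IntPair.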
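The mapper above assumes that every line holds two integers; Integer.parseInt would throw on a header row or any other malformed line. Whether the downloaded goods_visit2 file contains such lines is not stated, so the following is only a defensive sketch. LineParser and parse are hypothetical names, and the helper returns null for lines it cannot parse so that a mapper could skip them.

```java
import java.util.StringTokenizer;

// Parses "goods_id click_num" lines and rejects anything else.
class LineParser {
    // Returns {goodsId, clickNum}, or null when the line does not start with two integers.
    static int[] parse(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        if (tokenizer.countTokens() < 2) {
            return null;
        }
        try {
            int goodsId = Integer.parseInt(tokenizer.nextToken());
            int clickNum = Integer.parseInt(tokenizer.nextToken());
            return new int[] {goodsId, clickNum};
        } catch (NumberFormatException e) {
            // Non-numeric tokens, for example a header such as "goods_id click_num".
            return null;
        }
    }

    public static void main(String[] args) {
        int[] record = parse("1010037 100");
        System.out.println(record[0] + " " + record[1]);
        System.out.println(parse("goods_id click_num") == null);
    }
}
```

Inside a map method, a null result would simply mean returning without calling context.write for that line.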

Reduce code:

public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>
{
    private final Text left = new Text();
    private static final Text SEPARATOR = new Text("------------------------------------------------");

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        // Called once per group (same click_num): write a separator line first
        context.write(SEPARATOR, null);
        left.set(Integer.toString(key.getFirst()));
        System.out.println(left);
        // The values (goods_id) arrive in ascending order within the group
        for (IntWritable val : values)
        {
            context.write(left, val);
        }
    }
}

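In the reduce phase, once the reducer has received all the map output assigned to it, it again sorts all pairs with the key comparator class set with job.setSortComparatorClass. It then builds a value iterator for each key, using the grouping comparator class set with job.setGroupingComparatorClass. Whenever this comparator reports two keys as equal, they belong to the same group and their values go into the same value iterator; the key of that iterator is the first key of the group. Finally, the Reducer's reduce method is called once for each (key, value iterator) pair. The input and output types must match those declared by the custom Reducer.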

Complete code:

package mapreduce;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SecondarySort
{
    // Composite key: first = click_num, second = goods_id
    public static class IntPair implements WritableComparable<IntPair>
    {
        int first;
        int second;

        public void set(int left, int right)
        {
            first = left;
            second = right;
        }

        public int getFirst()
        {
            return first;
        }

        public int getSecond()
        {
            return second;
        }

        // Deserialization
        @Override
        public void readFields(DataInput in) throws IOException
        {
            first = in.readInt();
            second = in.readInt();
        }

        // Serialization
        @Override
        public void write(DataOutput out) throws IOException
        {
            out.writeInt(first);
            out.writeInt(second);
        }

        // first descending, then second ascending
        @Override
        public int compareTo(IntPair o)
        {
            if (first != o.first)
            {
                return first < o.first ? 1 : -1;
            }
            else if (second != o.second)
            {
                return second < o.second ? -1 : 1;
            }
            else
            {
                return 0;
            }
        }

        @Override
        public int hashCode()
        {
            return first * 157 + second;
        }

        @Override
        public boolean equals(Object right)
        {
            if (right == null)
                return false;
            if (this == right)
                return true;
            if (right instanceof IntPair)
            {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            }
            else
            {
                return false;
            }
        }
    }

    // Partitions on the first field only, so equal click counts share a reducer
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>
    {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions)
        {
            // Masking the sign bit keeps the result non-negative even on overflow
            return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Groups keys by the first field only
    public static class GroupingComparator extends WritableComparator
    {
        protected GroupingComparator()
        {
            super(IntPair.class, true);
        }

        // Compare two WritableComparables.
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable w1, WritableComparable w2)
        {
            IntPair ip1 = (IntPair) w1;
            IntPair ip2 = (IntPair) w2;
            int l = ip1.getFirst();
            int r = ip2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
    }

    public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
    {
        private final IntPair intkey = new IntPair();
        private final IntWritable intvalue = new IntWritable();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            int left = 0;  // goods_id
            int right = 0; // click_num
            if (tokenizer.hasMoreTokens())
            {
                left = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens())
                {
                    right = Integer.parseInt(tokenizer.nextToken());
                }
                // Key is (click_num, goods_id); value is goods_id
                intkey.set(right, left);
                intvalue.set(left);
                context.write(intkey, intvalue);
            }
        }
    }

    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>
    {
        private final Text left = new Text();
        private static final Text SEPARATOR = new Text("------------------------------------------------");

        @Override
        public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
        {
            // One call per click_num group
            context.write(SEPARATOR, null);
            left.set(Integer.toString(key.getFirst()));
            System.out.println(left);
            for (IntWritable val : values)
            {
                context.write(left, val);
            }
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
    {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf, "secondarysort");
        job.setJarByClass(SecondarySort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setPartitionerClass(FirstPartitioner.class);
        job.setGroupingComparatorClass(GroupingComparator.class);
        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Input and output paths are hard-coded for this experiment
        String[] otherArgs = new String[2];
        otherArgs[0] = "hdfs://localhost:9000/mymapreduce8/in/goods_visit2";
        otherArgs[1] = "hdfs://localhost:9000/mymapreduce8/out";
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

8. In the SecondarySort class file, right-click and choose Run As => Run on Hadoop.

9. When the job has finished, open a terminal and view the result on HDFS in the output path specified in the Java code.

hadoop fs -ls /mymapreduce8/out
hadoop fs -cat /mymapreduce8/out/part-r-00000
