Hadoop Study Notes: Custom Partitioners


 

As shown in the figure above: with three ReducerTasks, the processed data is stored in three separate output files.
 
By default, numReduceTasks is 1, which is why every previous experiment wrote its output to a single file. By defining a custom MyPartitioner class, records can be split by category before the reduce phase, so that each reducer (and therefore each output file) aggregates one category. Partitioner is the abstract base class here: any custom partitioner must extend it. HashPartitioner is MapReduce's default partitioner; it computes the target reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
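For reference, the sketch below shows that default formula written out as a Partitioner implementation. It is an illustrative reconstruction of the standard HashPartitioner behaviour, not code from this experiment; the class name HashLikePartitioner is made up.

import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch: how the default hash-based partitioner picks a reducer.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
	@Override
	public int getPartition(K key, V value, int numReduceTasks) {
		// Mask off the sign bit so the hash is non-negative, then take the
		// remainder over the number of reducers to get the target partition.
		return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}
}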
Experiment: building on the custom sort from the previous post, sort squares and rectangles separately, i.e. configure two ReduceTasks and route the records with a custom MyPartitioner.
package com.nwpulisz;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Reducer;
public class SelfDefineSort {
	/**
	 * @param args
	 * @author nwpulisz
	 * @date 2016.4.1
	 */
	static final String INPUT_PATH="hdfs://192.168.255.132:9000/input";
	static final String OUTPUT_PATH="hdfs://192.168.255.132:9000/output";
	
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Path outPut_path= new Path(OUTPUT_PATH);
		Job job = new Job(conf, "SelfDefineSort");
		
		// If the output path already exists, delete it before the job runs
		FileSystem fileSystem = FileSystem.get(new URI(OUTPUT_PATH), conf);
		if(fileSystem.exists(outPut_path))
		{
			fileSystem.delete(outPut_path,true);
		}
		
		job.setJarByClass(RectangleWritable.class); // setJarByClass must not be omitted, otherwise the job jar cannot be located at runtime and an error is raised.
													// From the source javadoc: "Set the Jar by finding where a given class came from."
		
		FileInputFormat.setInputPaths(job, INPUT_PATH);
		FileOutputFormat.setOutputPath(job, outPut_path);
		
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		
		job.setMapOutputKeyClass(RectangleWritable.class);
		job.setMapOutputValueClass(NullWritable.class);
		
		job.setPartitionerClass(MyPartitioner.class); // use the custom MyPartitioner to route each record to the right reducer
		job.setNumReduceTasks(2); // set the number of ReduceTasks to 2
		
		job.setOutputKeyClass(IntWritable.class);
		job.setOutputValueClass(IntWritable.class);
		job.waitForCompletion(true);
	}
	
	static class MyMapper extends Mapper<LongWritable, Text, RectangleWritable, NullWritable>{
		protected void map(LongWritable k1, Text v1, 
                Context context) throws IOException, InterruptedException {
			// each input line holds two tab-separated integers: length and width
			String[] splits = v1.toString().split("\t");
			RectangleWritable k2 = new RectangleWritable(Integer.parseInt(splits[0]),
					Integer.parseInt(splits[1]));
			
			context.write(k2,NullWritable.get());
		}
	}
	
	static class MyReducer extends Reducer<RectangleWritable, NullWritable,
					IntWritable, IntWritable>{
		protected void reduce(RectangleWritable k2,
				Iterable<NullWritable> v2s,
				Context context)
				throws IOException, InterruptedException {
			// emit one record per rectangle: length as the key, width as the value
			context.write(new IntWritable(k2.getLength()), new IntWritable(k2.getWidth()));
		}
		
	}
	
}
class MyPartitioner extends Partitioner<RectangleWritable, NullWritable>{
	@Override
	public int getPartition(RectangleWritable k2, NullWritable v2, int numPartitions) {
		// squares (length == width) go to partition 0, rectangles to partition 1
		if (k2.getLength() == k2.getWidth()) {
			return 0;  
		}else {
			return 1;
		}
	}
	 
}

  

The RectangleWritable class used here is the same one defined in the previous post.
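Because RectangleWritable is only defined in the previous post, the sketch below shows the assumed shape of that class: a WritableComparable carrying length and width, serialized in write()/readFields(). The compareTo ordering shown here (by area) is an assumption; the actual sort criterion comes from the previous post.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical sketch of RectangleWritable, assumed to match the previous post.
public class RectangleWritable implements WritableComparable<RectangleWritable> {
	private int length;
	private int width;

	public RectangleWritable() {} // no-arg constructor required by Hadoop serialization

	public RectangleWritable(int length, int width) {
		this.length = length;
		this.width = width;
	}

	public int getLength() { return length; }
	public int getWidth()  { return width;  }

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(length);
		out.writeInt(width);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		length = in.readInt();
		width = in.readInt();
	}

	@Override
	public int compareTo(RectangleWritable other) {
		// Assumed ordering: by area, ascending; the previous post defines the real rule.
		int thisArea = length * width;
		int otherArea = other.getLength() * other.getWidth();
		return thisArea < otherArea ? -1 : (thisArea == otherArea ? 0 : 1);
	}
}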
Running this code directly from Eclipse reports an error, as shown in the figure below:

This is probably caused by a Hadoop version mismatch, so the source files need to be packaged into a jar and run on the Hadoop server. The jar contains:
Run it on Hadoop with: hadoop jar SelfDefinePartitioner.jar (the jar name is arbitrary).
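Once the job completes, the two partitions can be checked with standard HDFS commands (assuming the /output path configured above; part-r-00000 and part-r-00001 are the default reducer output file names):

hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000    # partition 0: squares (length == width)
hadoop fs -cat /output/part-r-00001    # partition 1: rectangles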
The run results are shown in the figures below (job startup and final output).

 

 





