When reposting, please credit the source: http://www.cnblogs.com/zhengrunjian/p/4527269.html
1. As input
When a compressed file is used as MapReduce input, MapReduce automatically finds the matching codec from the file extension and decompresses it. As long as the compressed file carries the extension of its compression format (for example .lzo, .gz, or .bz2), Hadoop picks the corresponding decoder by that extension. Hadoop's support for the common compression formats is summarized below:

Format  | Codec class                                | Extension | Splittable
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec | .deflate  | No
gzip    | org.apache.hadoop.io.compress.GzipCodec    | .gz       | No
bzip2   | org.apache.hadoop.io.compress.BZip2Codec   | .bz2      | Yes
LZO     | com.hadoop.compression.lzo.LzoCodec        | .lzo      | No (yes with an index)
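To make the extension-to-codec mapping concrete, here is a minimal plain-Java sketch of the lookup idea. The table below is hand-written for illustration only; in a real job Hadoop's CompressionCodecFactory builds it from the configured codecs and you would call its getCodec(path) method instead:

```java
import java.util.HashMap;
import java.util.Map;

public class CodecLookup {
    // Hand-written extension-to-codec table, for illustration only;
    // Hadoop derives the real table from the registered codec classes.
    private static final Map<String, String> EXTENSION_TO_CODEC = new HashMap<>();
    static {
        EXTENSION_TO_CODEC.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        EXTENSION_TO_CODEC.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        EXTENSION_TO_CODEC.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    // Returns the codec class name for a path, or null when no known
    // extension matches (the file is then treated as uncompressed).
    public static String codecFor(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return EXTENSION_TO_CODEC.get(path.substring(dot));
    }

    public static void main(String[] args) {
        System.out.println(codecFor("input/data.gz"));  // the gzip codec class name
        System.out.println(codecFor("input/data"));     // null: no extension, no codec
    }
}
```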
If the compressed file has no extension, you need to specify the input format explicitly when submitting the MapReduce job:
```shell
hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
    -file /usr/home/hadoop/hello/mapper.py -mapper /usr/home/hadoop/hello/mapper.py \
    -file /usr/home/hadoop/hello/reducer.py -reducer /usr/home/hadoop/hello/reducer.py \
    -input lzotest -output result4 \
    -jobconf mapred.reduce.tasks=1 \
    -inputformat org.apache.hadoop.mapred.LzoTextInputFormat
```
2. As output
When the MapReduce output should be compressed, it is enough to set mapred.output.compress to true and mapred.output.compression.codec to the class name of the codec you want. You can also set both properties in code, through the static methods of FileOutputFormat. Let's look at the code:
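Equivalently, the two properties can be set in the job configuration (for example in mapred-site.xml, or via -D on the command line). A minimal fragment, using the old 0.20-era property names this article is based on:

```xml
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```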
```java
package com.sweetop.styhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-27
 * Time: 7:48 PM
 */
public class MaxTemperatureWithCompression {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: MaxTemperature <input path> <out path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperatureWithCompression.class);
        job.setJobName("Max Temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Compress the job output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
```shell
~/hadoop/bin/hadoop com.sweetop.styhadoop.MaxTemperatureWithCompression input/data.gz output/
```
Every part of the output gets compressed; here there is only one part. Let's look at the compressed output:
```shell
[hadoop@namenode test]$ hadoop fs -get output/part-r-00000.gz .
[hadoop@namenode test]$ ls
1901  1902  ch2  ch3  ch4  data.gz  news.gz  news.txt  part-r-00000.gz
[hadoop@namenode test]$ gunzip -c part-r-00000.gz
1901	317
1902	244
```
For SequenceFile output there is one more knob, the compression type; it can of course also be set in code, by calling SequenceFileOutputFormat's setOutputCompressionType method:
```java
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
```
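For SequenceFile output, RECORD compresses each record's value individually, while BLOCK compresses batches of records together and usually achieves a much better ratio. The same choice can be made in configuration; a minimal fragment, assuming the old 0.20-era property name used throughout this article:

```xml
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```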
3. Compressing map output
Even when the MapReduce input and output are both uncompressed files, you can still compress the intermediate output of the map tasks. Since that output is written to disk and transferred over the network to the reduce nodes, compressing it can improve performance a lot. Again, this only takes two properties; here is how to set them in code:
```java
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Job job = new Job(conf);
```
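The two properties above use the old, pre-Hadoop-2 names. The same setting as a configuration fragment, with the renamed Hadoop 2.x equivalents noted in a comment:

```xml
<!-- Old names, as used in this article; on Hadoop 2.x the equivalents are
     mapreduce.map.output.compress and mapreduce.map.output.compress.codec. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```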
Adapted from: http://blog.csdn.net/lastsweetop/article/details/9187721