File compression offers two major benefits: first, it reduces the disk space needed to store files; second, it speeds up data transfer over the network and to and from disk. Both benefits matter a great deal when working with big data.
Below is an example of compressing a file with the gzip codec: the file /user/hadoop/aa.txt is compressed to /user/hadoop/text.gz.
 1 package com.hdfs;
 2 
 3 import java.io.IOException;
 4 import java.io.InputStream;
 5 import java.io.OutputStream;
 6 import java.net.URI;
 7 
 8 import org.apache.hadoop.conf.Configuration;
 9 import org.apache.hadoop.fs.FSDataInputStream;
10 import org.apache.hadoop.fs.FSDataOutputStream;
11 import org.apache.hadoop.fs.FileSystem;
12 import org.apache.hadoop.fs.Path;
13 import org.apache.hadoop.io.IOUtils;
14 import org.apache.hadoop.io.compress.CompressionCodec;
15 import org.apache.hadoop.io.compress.CompressionCodecFactory;
16 import org.apache.hadoop.io.compress.CompressionInputStream;
17 import org.apache.hadoop.io.compress.CompressionOutputStream;
18 import org.apache.hadoop.util.ReflectionUtils;
19 
20 public class CodecTest {
21     // compress a file
22     public static void compress(String codecClassName) throws Exception {
23         Class<?> codecClass = Class.forName(codecClassName);
24         Configuration conf = new Configuration();
25         FileSystem fs = FileSystem.get(conf);
26         CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
27         // path for the compressed output file
28         FSDataOutputStream outputStream = fs.create(new Path("/user/hadoop/text.gz"));
29         // path of the file to be compressed
30         FSDataInputStream in = fs.open(new Path("/user/hadoop/aa.txt"));
31         // create a compression output stream
32         CompressionOutputStream out = codec.createOutputStream(outputStream);
33         IOUtils.copyBytes(in, out, conf);
34         IOUtils.closeStream(in);
35         IOUtils.closeStream(out);
36     }
37 
38     // decompress (the fileName parameter is unused; the path is hardcoded for this example)
39     public static void uncompress(String fileName) throws Exception {
40         Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
41         Configuration conf = new Configuration();
42         FileSystem fs = FileSystem.get(conf);
43         CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
44         FSDataInputStream inputStream = fs.open(new Path("/user/hadoop/text.gz"));
45         // decompress the data in the text.gz file and print it to the console
46         InputStream in = codec.createInputStream(inputStream);
47         IOUtils.copyBytes(in, System.out, conf);
48         IOUtils.closeStream(in);
49     }
50 
51     // infer the codec from the file extension and use it to decompress the file
52     public static void uncompress1(String uri) throws IOException {
53         Configuration conf = new Configuration();
54         FileSystem fs = FileSystem.get(URI.create(uri), conf);
55 
56         Path inputPath = new Path(uri);
57         CompressionCodecFactory factory = new CompressionCodecFactory(conf);
58         CompressionCodec codec = factory.getCodec(inputPath);
59         if (codec == null) {
60             System.out.println("no codec found for " + uri);
61             System.exit(1);
62         }
63         String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
64         InputStream in = null;
65         OutputStream out = null;
66         try {
67             in = codec.createInputStream(fs.open(inputPath));
68             out = fs.create(new Path(outputUri));
69             IOUtils.copyBytes(in, out, conf);
70         } finally {
71             IOUtils.closeStream(out);
72             IOUtils.closeStream(in);
73         }
74     }
75 
76     public static void main(String[] args) throws Exception {
77         //compress("org.apache.hadoop.io.compress.GzipCodec");
78         //uncompress("text");
79         uncompress1("hdfs://master:9000/user/hadoop/text.gz");
80     }
81 
82 }
First run line 77 to compress the file, then run line 78 to decompress it. Since line 78 decompresses to standard output, running it prints the contents of /user/hadoop/aa.txt to the console. Running line 79 instead decompresses the file to /user/hadoop/text: it picks the decompression codec based on the extension of /user/hadoop/text.gz, and the output path is simply the input path with the extension removed.
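As a side note, when compressing many files it is common to reuse compressors through Hadoop's CodecPool rather than allocating a fresh one for every call. Below is a minimal sketch of that pattern; the class name PooledCompressExample and the output path /user/hadoop/aa.txt.gz are illustrative assumptions, not part of the original example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledCompressExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Borrow a compressor from the shared pool instead of creating a new one.
        Compressor compressor = CodecPool.getCompressor(codec);
        try {
            FSDataInputStream in = fs.open(new Path("/user/hadoop/aa.txt"));
            FSDataOutputStream rawOut = fs.create(new Path("/user/hadoop/aa.txt.gz"));
            // Wrap the raw output stream with the pooled compressor.
            CompressionOutputStream out = codec.createOutputStream(rawOut, compressor);
            IOUtils.copyBytes(in, out, conf); // copies the data and closes both streams
        } finally {
            // Return the compressor so other callers can reuse it.
            CodecPool.returnCompressor(compressor);
        }
    }
}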
After compressing the file, run ./hadoop fs -ls /user/hadoop/ to view the file listing:
1 [hadoop@master bin]$ ./hadoop fs -ls /user/hadoop/
2 Found 7 items
3 -rw-r--r--   3 hadoop supergroup   76805248 2013-06-17 23:55 /user/hadoop/aa.mp4
4 -rw-r--r--   3 hadoop supergroup        520 2013-06-17 22:29 /user/hadoop/aa.txt
5 drwxr-xr-x   - hadoop supergroup          0 2013-06-16 17:19 /user/hadoop/input
6 drwxr-xr-x   - hadoop supergroup          0 2013-06-16 19:32 /user/hadoop/output
7 drwxr-xr-x   - hadoop supergroup          0 2013-06-18 17:08 /user/hadoop/test
8 drwxr-xr-x   - hadoop supergroup          0 2013-06-18 19:45 /user/hadoop/test1
9 -rw-r--r--   3 hadoop supergroup         46 2013-06-19 20:09 /user/hadoop/text.gz
Line 4 shows the file before compression, 520 bytes. Line 9 shows the compressed file, only 46 bytes. This demonstrates the two benefits of compression described at the start.
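The sizes can also be checked programmatically rather than through the shell, using FileSystem.getFileStatus. The sketch below is illustrative and assumes the same paths as above; the class name SizeCheck is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // getFileStatus(path).getLen() returns the file length in bytes.
        for (String p : new String[]{"/user/hadoop/aa.txt", "/user/hadoop/text.gz"}) {
            FileStatus st = fs.getFileStatus(new Path(p));
            System.out.println(p + ": " + st.getLen() + " bytes");
        }
    }
}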