Hadoop SequenceFile: Data Structure Overview, Reading and Writing

Reposted article. 2016-06-04 22:21. Tag: Hadoop

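Some applications need a specialized data structure for storing and reading back their data. This article looks at why the SequenceFile format is a good fit for that.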

Hadoop SequenceFile

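Hadoop's SequenceFile format stores data as binary key-value pairs in a persistent, append-only structure: once a record has been written it is not modified in place. HDFS and MapReduce jobs can also read data more efficiently when it is stored as SequenceFiles.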

The SequenceFile format

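A SequenceFile consists of a header followed by one or more records. The first three bytes are the magic bytes SEQ, and they are followed by a single byte holding the version number. The header also contains the key class name, the value class name, compression details, user-defined metadata, and the sync marker. Sync markers let a reader synchronize to a record boundary from any position in the file.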
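As a rough illustration of the start of the header, the sketch below checks a byte array for the SEQ magic and returns the version byte that follows it. The class and method names are hypothetical and are not part of the Hadoop API; the rest of the header (class names, compression details, metadata, sync marker) is not parsed here.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper with a hypothetical name; not part of the Hadoop API.
class SeqHeaderCheck {

  /** Returns the version byte if the data starts with the "SEQ" magic, or -1 otherwise. */
  static int versionOf(byte[] head) {
    if (head == null || head.length < 4) {
      return -1; // too short to hold the magic plus the version byte
    }
    if (head[0] != 'S' || head[1] != 'E' || head[2] != 'Q') {
      return -1; // magic bytes do not match
    }
    return head[3] & 0xFF; // fourth byte: format version
  }

  public static void main(String[] args) {
    byte[] looksLikeSeq = {'S', 'E', 'Q', 6, 0, 0};
    byte[] somethingElse = "TXT1".getBytes(StandardCharsets.US_ASCII);
    System.out.println(versionOf(looksLikeSeq));
    System.out.println(versionOf(somethingElse));
  }
}
```

In practice the first bytes would come from the file itself, for example the first four bytes read from an input stream opened on it.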

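The layout of the records depends on whether compression is enabled. Without compression, each key and value is serialized with the configured Serialization and written to the SequenceFile as is. With record compression the layout is almost identical, except that the value bytes are compressed; keys are not compressed.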
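To make the uncompressed record layout more concrete, here is a simplified sketch that frames one record as record length, key length, key bytes, value bytes. It assumes the key and value have already been serialized to raw bytes, so the sizes it produces are not meant to match what Hadoop writes for IntWritable and Text. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Simplified illustration of record framing; hypothetical class, not Hadoop code.
class RecordLayoutSketch {

  /** Encodes one record as: record length, key length, key bytes, value bytes. */
  static byte[] encode(byte[] key, byte[] value) {
    ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
    buf.putInt(key.length + value.length); // record length: key bytes plus value bytes
    buf.putInt(key.length);                // key length, so the reader can split key from value
    buf.put(key);
    buf.put(value);
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] key = ByteBuffer.allocate(4).putInt(100).array(); // 4-byte big-endian int
    byte[] value = "One, two, buckle my shoe".getBytes(StandardCharsets.UTF_8);
    byte[] record = encode(key, value);
    System.out.println("encoded record is " + record.length + " bytes");
  }
}
```

With record compression, the same framing would apply but the value bytes would be the compressed form.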

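With block compression, multiple records are compressed together. Records are added to a block until it reaches a minimum size in bytes, which is set by the io.seqfile.compress.blocksize configuration property (1,000,000 bytes by default).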
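The buffering idea behind block compression can be sketched as follows: records accumulate in memory and are compressed as one unit once the buffer reaches a minimum size. This is only an illustration using java.util.zip.Deflater on concatenated records; it does not reproduce the actual on-disk block layout of a SequenceFile, and the class name and the tiny threshold are assumptions for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Illustration of "buffer until a minimum size, then compress"; hypothetical class.
class BlockBufferSketch {
  private final int minBlockSize; // plays the role of io.seqfile.compress.blocksize
  private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
  private final List<byte[]> blocks = new ArrayList<>();

  BlockBufferSketch(int minBlockSize) {
    this.minBlockSize = minBlockSize;
  }

  /** Buffers a record and emits a compressed block once the threshold is reached. */
  void append(byte[] record) {
    pending.write(record, 0, record.length);
    if (pending.size() >= minBlockSize) {
      flush();
    }
  }

  /** Compresses whatever is buffered, if anything, into a new block. */
  void flush() {
    if (pending.size() == 0) {
      return;
    }
    blocks.add(deflate(pending.toByteArray()));
    pending.reset();
  }

  List<byte[]> blocks() {
    return blocks;
  }

  static byte[] deflate(byte[] raw) {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(chunk);
      out.write(chunk, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  public static void main(String[] args) {
    BlockBufferSketch buffer = new BlockBufferSketch(100); // tiny threshold for the demo
    for (int i = 0; i < 10; i++) {
      buffer.append(new byte[30]);
    }
    buffer.flush(); // emit the final, partially filled block
    System.out.println(buffer.blocks().size() + " blocks written");
  }
}
```

Because the threshold is a minimum, a block can end up somewhat larger than the configured size: the record that crosses the threshold is still included in it.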

MapFile: a SequenceFile with an index

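A MapFile is a sorted SequenceFile with an index that permits lookups by key. The advantage of this format is that the index, which maps keys to their positions in the SequenceFile, is loaded into memory, and because the keys are sorted, lookups are much faster than a full scan.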
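The lookup idea can be sketched without Hadoop: keep the data sorted by key, hold a sparse in-memory index of every Nth key and its position, and at lookup time jump to the nearest indexed key at or below the target and scan forward from there. The class below is a hypothetical illustration over in-memory arrays, and the index interval of 10 is an arbitrary choice for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration of a sparse index over sorted data; hypothetical class, not Hadoop's MapFile.
class SparseIndexSketch {
  private final int[] keys;      // sorted keys, standing in for the data file
  private final String[] values; // values[i] belongs to keys[i]
  private final TreeMap<Integer, Integer> index = new TreeMap<>(); // key -> slot

  SparseIndexSketch(int[] sortedKeys, String[] values, int interval) {
    this.keys = sortedKeys;
    this.values = values;
    for (int i = 0; i < sortedKeys.length; i += interval) {
      index.put(sortedKeys[i], i); // only every interval-th key is indexed
    }
  }

  /** Returns the value for key, or null if the key is absent. */
  String get(int key) {
    Map.Entry<Integer, Integer> start = index.floorEntry(key);
    if (start == null) {
      return null; // key is smaller than every stored key
    }
    // Scan forward from the indexed slot; sorted order lets us stop early.
    for (int i = start.getValue(); i < keys.length && keys[i] <= key; i++) {
      if (keys[i] == key) {
        return values[i];
      }
    }
    return null;
  }

  public static void main(String[] args) {
    int[] keys = new int[100];
    String[] values = new String[100];
    for (int i = 0; i < 100; i++) {
      keys[i] = i + 1;
      values[i] = "value-" + (i + 1);
    }
    SparseIndexSketch map = new SparseIndexSketch(keys, values, 10);
    System.out.println(map.get(37));
  }
}
```

A sparse index trades a short forward scan for a much smaller memory footprint than indexing every key.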

Writing a SequenceFile:

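The program below writes 100 records to a sequence file. The keys are IntWritable objects counting down from 100 to 1, and the values are Text objects. Before each key-value pair is appended, the current position in the file (writer.getLength()) is printed.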

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.net.URI;



public class SequenceFileWriteDemo {
  
  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };
  
  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
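      // Note: in Hadoop 2.x and later this overload is deprecated in favor of
      // createWriter(Configuration, SequenceFile.Writer.Option...).
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());

      for (int i = 0; i < 100; i++) {
        key.set(100 - i);                  // keys count down from 100 to 1
        value.set(DATA[i % DATA.length]);  // cycle through the sample lines
        // Print the current file position before the record is appended.
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);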
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Reading a SequenceFile:

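First create a SequenceFile.Reader instance, then iterate over the records by calling next() repeatedly. How the key and value objects are populated depends on the serialization in use; the example below assumes Writable types.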

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;
import java.net.URI;



public class SequenceFileReadDemo {
  
  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
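      // Note: in Hadoop 2.x and later this constructor is deprecated in favor of
      // new SequenceFile.Reader(Configuration, SequenceFile.Reader.Option...).
      reader = new SequenceFile.Reader(fs, path, conf);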
      Writable key = (Writable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {
        // Mark with an asterisk any record that was preceded by a sync marker.
        String syncSeen = reader.syncSeen() ? "*" : "";
        System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
        position = reader.getPosition(); // beginning of next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
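The asterisks printed by the read program mark positions where a sync marker was seen. To illustrate why such markers allow reading from an arbitrary offset, the sketch below scans a plain byte array for the next occurrence of a marker and returns the position just after it. It is a simplified stand-in, not Hadoop code: the class name and the 4-byte marker are assumptions made for the example. On a real file, SequenceFile.Reader offers sync(long position), which positions the reader at the next sync point after the given position.

```java
// Illustration of scanning forward to the next sync marker; hypothetical class.
class SyncScanSketch {

  /** Returns the index just past the first marker found at or after from, or -1 if none. */
  static int nextSync(byte[] data, byte[] marker, int from) {
    for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
      int matched = 0;
      while (matched < marker.length && data[i + matched] == marker[matched]) {
        matched++;
      }
      if (matched == marker.length) {
        return i + marker.length; // the next record starts right after the marker
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] marker = {0x7F, 0x00, 0x7F, 0x01}; // made-up marker for the demo
    byte[] data = {
      'r', 'e', 'c', '1', 0x7F, 0x00, 0x7F, 0x01,
      'r', 'e', 'c', '2', 0x7F, 0x00, 0x7F, 0x01,
      'r', 'e', 'c', '3'
    };
    // Start in the middle of the data and find the next safe place to resume reading.
    System.out.println(nextSync(data, marker, 5));
  }
}
```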

Reference: Hadoop: The Definitive Guide, 4th Edition
