Writable Types (Text, BytesWritable, NullWritable, ObjectWritable, GenericWritable, ArrayWritable, MapWritable, SortedMapWritable) [repost]

This article is a repost. Posted 2014-04-15 09:22. Category: MapReduce詳解 (MapReduce in Detail).

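Java primitive types

Every Java primitive type except char has a corresponding Writable class, and each of these classes exposes get() and set() methods for reading and updating the wrapped value.

IntWritable and LongWritable also have variable-length counterparts, VIntWritable and VLongWritable.

Choosing between fixed-length and variable-length encoding is similar to choosing between CHAR and VARCHAR in a database.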
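To make the space trade-off concrete, here is a standard-library-only sketch modeled on the encoding described in the Javadoc of Hadoop's WritableUtils.writeVLong: values from -112 to 127 take one byte, and other values take a marker byte followed by the magnitude bytes. The class name VLongSketch and its methods are hypothetical, and this is an illustration rather than Hadoop's own code.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: VLongSketch is a hypothetical name, not a Hadoop class.
final class VLongSketch {

    // Values in [-112, 127] take one byte. Other values take a marker byte
    // (encoding sign and byte count) followed by the magnitude, high byte first.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);
            return out.toByteArray();
        }
        int marker = -112;
        if (value < 0) {
            value = ~value;   // one's complement, so the magnitude is non-negative
            marker = -120;
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            marker--;         // one step per magnitude byte
        }
        out.write(marker);
        int count = (marker < -120) ? -(marker + 120) : -(marker + 112);
        for (int i = count - 1; i >= 0; i--) {
            out.write((int) (value >>> (8 * i)) & 0xFF);
        }
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode(1)));    // 01
        System.out.println(toHex(encode(163)));  // 8fa3
    }
}
```

A fixed-length int always takes 4 bytes (163 would be 000000a3), so the variable-length form pays off when most values are small.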

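Text type

Text stores its length as a variable-length int, so a Text value can hold at most 2 GB.

Text uses standard UTF-8 encoding, which makes it easy to exchange data with other text tools. The trade-off is that Text behaves quite differently from Java's String.

Differences in indexing

Text's charAt() returns an int, the Unicode code point of the character that starts at the given byte offset, rather than a char (a UTF-16 code unit) as String's charAt() does.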

@Test
public void testTextIndex() {
    Text text = new Text("hadoop");
    Assert.assertEquals(6, text.getLength());
    Assert.assertEquals(6, text.getBytes().length);
    Assert.assertEquals((int) 'd', text.charAt(2));
    // charAt() returns -1 for a position that is out of range
    Assert.assertEquals("Out of bounds", -1, text.charAt(100));
}

Text also has a find() method, which is similar to String's indexOf():

@Test
public void testTextFind() {
    Text text = new Text("hadoop");
    Assert.assertEquals("Find a substring", 2, text.find("do"));
    Assert.assertEquals("Find first 'o'", 3, text.find("o"));
    Assert.assertEquals("Find 'o' from position 4 or later", 4, text.find("o", 4));
    Assert.assertEquals("No match", -1, text.find("pig"));
}

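Differences with Unicode

Once characters need more than one byte in UTF-8, the difference between Text and String becomes clearer: String counts in Java chars (UTF-16 code units), while Text counts in bytes.

Consider four Unicode characters whose UTF-8 encodings take 1 to 4 bytes: U+0041, U+00DF, U+6771 and U+10400.

The first three each occupy a single Java char. U+10400 occupies two chars in a Java String (the surrogate pair \uD801\uDC00).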
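The byte and char counts for these four code points can be checked with the standard library alone. The sketch below is only an illustration; the class name Utf8Widths and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: Utf8Widths is a hypothetical helper, not a Hadoop class.
final class Utf8Widths {

    // Number of bytes the code point occupies when encoded as UTF-8.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Java chars (UTF-16 code units) the code point occupies.
    static int javaChars(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d char(s)%n",
                    cp, utf8Bytes(cp), javaChars(cp));
        }
    }
}
```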

The following tests show how String and Text differ:

@Test
public void string() throws UnsupportedEncodingException {
    String str = "\u0041\u00DF\u6771\uD801\uDC00";
    Assert.assertEquals(5, str.length());
    Assert.assertEquals(10, str.getBytes("UTF-8").length);
    Assert.assertEquals(0, str.indexOf("\u0041"));
    Assert.assertEquals(1, str.indexOf("\u00DF"));
    Assert.assertEquals(2, str.indexOf("\u6771"));
    Assert.assertEquals(3, str.indexOf("\uD801\uDC00"));
    Assert.assertEquals('\u0041', str.charAt(0));
    Assert.assertEquals('\u00DF', str.charAt(1));
    Assert.assertEquals('\u6771', str.charAt(2));
    Assert.assertEquals('\uD801', str.charAt(3));
    Assert.assertEquals('\uDC00', str.charAt(4));
    Assert.assertEquals(0x0041, str.codePointAt(0));
    Assert.assertEquals(0x00DF, str.codePointAt(1));
    Assert.assertEquals(0x6771, str.codePointAt(2));
    Assert.assertEquals(0x10400, str.codePointAt(3));
}

@Test
public void text() {
    Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    Assert.assertEquals(10, text.getLength());
    // find() and charAt() work with byte offsets, not char offsets
    Assert.assertEquals(0, text.find("\u0041"));
    Assert.assertEquals(1, text.find("\u00DF"));
    Assert.assertEquals(3, text.find("\u6771"));
    Assert.assertEquals(6, text.find("\uD801\uDC00"));
    Assert.assertEquals(0x0041, text.charAt(0));
    Assert.assertEquals(0x00DF, text.charAt(1));
    Assert.assertEquals(0x6771, text.charAt(3));
    Assert.assertEquals(0x10400, text.charAt(6));
}

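The comparison makes the differences clear:

1. String's length() returns the number of chars, while Text's getLength() returns the number of bytes.

2. String's indexOf() returns an offset in chars, while Text's find() returns an offset in bytes.

3. String's charAt() returns a Java char, which is not necessarily a whole Unicode character.

4. String's codePointAt() is the closest match to Text's charAt(), but the former takes a char offset and the latter takes a byte offset.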
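The relationship between the two kinds of offset can be sketched in plain Java: the byte offset of a position is the UTF-8 length of everything before it. OffsetSketch and charIndexToUtf8Offset are hypothetical names, and the sketch assumes the char index does not fall in the middle of a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: OffsetSketch is a hypothetical helper, not a Hadoop class.
final class OffsetSketch {

    // Converts a char index in a String to the byte offset of the same
    // position in the String's UTF-8 encoding.
    // Assumption: charIndex does not split a surrogate pair.
    static int charIndexToUtf8Offset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String str = "\u0041\u00DF\u6771\uD801\uDC00";
        for (int i : new int[] {0, 1, 2, 3}) {
            System.out.println(i + " -> " + charIndexToUtf8Offset(str, i));
        }
    }
}
```

For the sample string, char indexes 0, 1, 2 and 3 map to byte offsets 0, 1, 3 and 6, which are the values that Text's find() returns in the tests.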

Iterating over Text

Iterating over the Unicode characters in a Text is fairly involved, because each character may occupy a different number of bytes, so the index cannot simply be incremented by one. First wrap the Text's bytes in a ByteBuffer, then call the static method Text.bytesToCodePoint() repeatedly; each call returns the next code point and advances the buffer's position. For example:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;

import java.nio.ByteBuffer;

/**
 * User: lastsweetop
 * Date: 13-7-9
 * Time: 5:00 PM
 */
public class TextIterator {
    public static void main(String[] args) {
        Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        // Wrap only the valid bytes: the backing array may be longer than getLength()
        ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
        int cp;
        while (buffer.hasRemaining() && (cp = Text.bytesToCodePoint(buffer)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
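To show why the step size varies, here is a standard-library-only sketch that walks a UTF-8 byte array by hand: the lead byte of each sequence says how many bytes the character occupies. Utf8Walker is a hypothetical name, the sketch assumes well-formed UTF-8, and it is not a copy of what Text.bytesToCodePoint() does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: Utf8Walker is a hypothetical helper, not a Hadoop class.
final class Utf8Walker {

    // Decodes the first `length` bytes of `utf8` into code points.
    // Assumption: the input is well-formed UTF-8 (continuation bytes are not validated).
    static List<Integer> codePoints(byte[] utf8, int length) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < length) {
            int lead = utf8[i] & 0xFF;
            int size;
            int cp;
            if (lead < 0x80) {
                size = 1; cp = lead;
            } else if ((lead & 0xE0) == 0xC0) {
                size = 2; cp = lead & 0x1F;
            } else if ((lead & 0xF0) == 0xE0) {
                size = 3; cp = lead & 0x0F;
            } else if ((lead & 0xF8) == 0xF0) {
                size = 4; cp = lead & 0x07;
            } else {
                throw new IllegalArgumentException("Not a UTF-8 lead byte at offset " + i);
            }
            if (i + size > length) {
                throw new IllegalArgumentException("Truncated sequence at offset " + i);
            }
            for (int k = 1; k < size; k++) {
                cp = (cp << 6) | (utf8[i + k] & 0x3F);  // each continuation byte adds 6 bits
            }
            result.add(cp);
            i += size;  // advance by the width of this character, not by 1
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "\u0041\u00DF\u6771\uD801\uDC00".getBytes(StandardCharsets.UTF_8);
        for (int cp : codePoints(data, data.length)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```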

Modifying Text

Like every other Writable except NullWritable (which cannot be changed), Text is mutable. You can reuse a Text instance by calling one of its set() methods.

@Test
public void testTextMutability() {
    Text text = new Text("hadoop");
    text.set("pig");
    Assert.assertEquals(3, text.getLength());
    Assert.assertEquals(3, text.getBytes().length);
}

Note that in some cases the byte array returned by getBytes() is longer than the length reported by getLength(). Whenever you call getBytes(), also call getLength() so that you know how many bytes of the array are valid.

@Test
public void testTextMutability2() {
    Text text = new Text("hadoop");
    text.set(new Text("pig"));
    Assert.assertEquals(3, text.getLength());
    // The backing array is not shrunk, so it still has room for "hadoop"
    Assert.assertEquals(6, text.getBytes().length);
}
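The same effect can be reproduced with a simplified model of a reusable buffer. ReusableBuffer below is a hypothetical class written for illustration, under the assumption that the backing array only grows and is never shrunk; it is not Text's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a simplified model of a reusable byte buffer.
final class ReusableBuffer {
    private byte[] bytes = new byte[0];
    private int length;

    // Copies the data in, growing the backing array only when it is too small.
    void set(byte[] data) {
        if (data.length > bytes.length) {
            bytes = new byte[data.length];
        }
        System.arraycopy(data, 0, bytes, 0, data.length);
        length = data.length;
    }

    // Returns the backing array, which may contain stale bytes past getLength().
    byte[] getBytes() {
        return bytes;
    }

    int getLength() {
        return length;
    }

    // Returns a copy that contains only the valid bytes.
    byte[] copyValidBytes() {
        return Arrays.copyOf(bytes, length);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        buf.set("hadoop".getBytes(StandardCharsets.US_ASCII));
        buf.set("pig".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(buf.getBytes(), StandardCharsets.US_ASCII));       // pigoop
        System.out.println(new String(buf.copyValidBytes(), StandardCharsets.US_ASCII)); // pig
    }
}
```

Reading the whole backing array yields stale bytes from the previous value, which is why the valid length has to be consulted.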

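BytesWritable type

BytesWritable wraps an array of binary data. Its serialized form starts with a 4-byte integer giving the length of the array (unlike Text, which starts with a variable-length int), followed by the bytes themselves. For example:

@Test
public void testBytesWritableSerializedFormat() throws IOException {
    BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
    // SerializeUtils is not shown in this article; it is assumed to write
    // the Writable into a byte array and return it
    byte[] bytes = SerializeUtils.serialize(bytesWritable);
    Assert.assertEquals("000000020305", StringUtils.byteToHexString(bytes));
}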
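The layout described above (a 4-byte length, then the payload) can be reproduced with java.nio.ByteBuffer alone. LengthPrefixed is a hypothetical name; the sketch only mirrors the byte layout and does not use or reimplement BytesWritable.

```java
import java.nio.ByteBuffer;

// Illustrative only: LengthPrefixed is a hypothetical helper, not a Hadoop class.
final class LengthPrefixed {

    // Writes a 4-byte big-endian length followed by the payload bytes.
    static byte[] encode(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode(new byte[] {3, 5})));  // 000000020305
    }
}
```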

Like Text, BytesWritable can be modified through set(). getLength() returns the actual size of the data, whereas the array returned by getBytes() may be longer.

@Test
public void testBytesWritableCapacity() {
    BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
    bytesWritable.setCapacity(11);
    bytesWritable.setSize(4);
    Assert.assertEquals(4, bytesWritable.getLength());
    Assert.assertEquals(11, bytesWritable.getBytes().length);
}

NullWritable type

NullWritable is a special Writable: its serialized form is empty (zero bytes), so it acts purely as a placeholder. In MapReduce, declare a key or value as NullWritable when you do not need it.

package com.sweetop.styhadoop;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-16
 * Time: 9:23 PM
 */
public class TestNullWritable {
    public static void main(String[] args) throws IOException {
        NullWritable nullWritable = NullWritable.get();
        // Prints an empty line, because NullWritable writes no bytes
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(nullWritable)));
    }
}

ObjectWritable type

ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, and arrays of these types. It is useful when a field can hold values of more than one type. Its drawback is size: even a single letter takes a lot of space, because the type of the wrapped value is stored along with it. For example:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-17
 * Time: 9:14 AM
 */
public class TestObjectWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041");
        ObjectWritable objectWritable = new ObjectWritable(text);
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(objectWritable)));
    }
}

The value is a single letter, yet the serialized result is:

00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141
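The hex output can be read as a 2-byte length (0x0019 = 25) followed by the class name org.apache.hadoop.io.Text, then the same pair again, then the two bytes of the Text itself (01 41). The sketch below rebuilds that byte sequence with the standard library as a way to check this reading. ClassNameOverhead and its methods are hypothetical names, and the layout is inferred from the output shown here rather than taken from ObjectWritable's source.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: rebuilds the byte layout inferred from the hex output.
final class ClassNameOverhead {

    // Writes a 2-byte big-endian length followed by the UTF-8 bytes of the name.
    static void writeName(ByteArrayOutputStream out, String name) {
        byte[] b = name.getBytes(StandardCharsets.UTF_8);
        out.write(b.length >>> 8);
        out.write(b.length);
        out.write(b, 0, b.length);
    }

    // Class name twice (as seen in the output), then the payload itself.
    static byte[] encode(String className, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeName(out, className);
        writeName(out, className);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] textA = {0x01, 0x41};  // length byte + 'A', as Text serializes it
        byte[] all = encode("org.apache.hadoop.io.Text", textA);
        System.out.println(toHex(all));
        System.out.println(all.length + " bytes in total for a " + textA.length + "-byte payload");
    }
}
```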

That is a lot of wasted space. The set of possible types is usually small and known in advance, and for that case there is a more compact alternative, described in the next section.

GenericWritable type

To use GenericWritable, extend it and override getTypes() to list the types it should support:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class MyWritable extends GenericWritable {
    // Types this wrapper can hold; a value is identified by its index in this array
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] CLASSES =
            (Class<? extends Writable>[]) new Class[]{
                Text.class
            };

    MyWritable(Writable writable) {
        set(writable);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}

Then print the serialized results:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-17
 * Time: 9:51 AM
 */
public class TestGenericWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041\u0071");
        MyWritable myWritable = new MyWritable(text);
        // First line: the Text alone; second line: the Text wrapped in MyWritable
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(text)));
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(myWritable)));
    }
}

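The output is:

024171
00024171

GenericWritable only prepends the index of the value's type in the array returned by getTypes(), which saves a lot of space compared with ObjectWritable. GenericWritable is therefore the recommended choice.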
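The idea of storing a one-byte index into a fixed type table can be sketched without Hadoop. TaggedValue, its TYPES table and its payload formats are assumptions made for illustration (strings are limited to 127 bytes so that the length fits in one byte); this is not GenericWritable's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: a tagged encoding with a fixed, hypothetical type table.
final class TaggedValue {
    // The position in this array is the one-byte tag written before the payload.
    static final Class<?>[] TYPES = {String.class, Integer.class};

    static byte[] encode(Object value) {
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i].isInstance(value)) {
                byte[] payload = payloadOf(value);
                return ByteBuffer.allocate(1 + payload.length)
                        .put((byte) i)
                        .put(payload)
                        .array();
            }
        }
        throw new IllegalArgumentException("Unregistered type: " + value);
    }

    // String: one length byte + UTF-8 bytes (at most 127 bytes).
    // Integer: 4 bytes, big-endian.
    private static byte[] payloadOf(Object value) {
        if (value instanceof String) {
            byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("String too long for this sketch");
            }
            return ByteBuffer.allocate(1 + utf8.length).put((byte) utf8.length).put(utf8).array();
        }
        return ByteBuffer.allocate(4).putInt((Integer) value).array();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode("Aq")));  // 00024171
    }
}
```

The tag costs a single byte regardless of how long the class name is, which is where the saving over a name-based scheme comes from.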

Collection Writable types

ArrayWritable and TwoDArrayWritable

ArrayWritable and TwoDArrayWritable are the Writable types for arrays and two-dimensional arrays. The element type can be specified in two ways: pass it to the constructor, or subclass ArrayWritable. The same applies to TwoDArrayWritable.

package com.sweetop.styhadoop;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * User: lastsweetop
 * Date: 13-7-17
 * Time: 11:14 AM
 */
public class TestArrayWritable {
    public static void main(String[] args) throws IOException {
        // The element type is given in the constructor
        ArrayWritable arrayWritable = new ArrayWritable(Text.class);
        arrayWritable.set(new Writable[]{new Text("\u0071"), new Text("\u0041")});
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(arrayWritable)));
    }
}

The output is:

0000000201710141

As the output shows, ArrayWritable starts with a 4-byte integer giving the number of elements, followed by the elements one after another.
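That layout (a 4-byte element count, then each element in turn) can be rebuilt with the standard library. CountPrefixedArray is a hypothetical name, and each element is written as one length byte plus its UTF-8 bytes, which matches how the short Text values appear in the output above; strings over 127 bytes are outside the scope of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: CountPrefixedArray is a hypothetical helper, not a Hadoop class.
final class CountPrefixedArray {

    // Writes a 4-byte big-endian element count, then each element as
    // one length byte followed by its UTF-8 bytes.
    static byte[] encode(String... elements) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] count = ByteBuffer.allocate(4).putInt(elements.length).array();
        out.write(count, 0, count.length);
        for (String element : elements) {
            byte[] utf8 = element.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("Element too long for this sketch");
            }
            out.write(utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode("q", "A")));  // 0000000201710141
    }
}
```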

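ArrayPrimitiveWritable is similar, but it wraps arrays of Java primitives and does not require subclassing to set the element type.

MapWritable and SortedMapWritable

MapWritable corresponds to Map and SortedMapWritable to SortedMap. Their serialized form stores the number of entries as a 4-byte integer. Each key and each value is then preceded by one byte holding the index of its type (much like GenericWritable, which is why the number of distinct types is limited to 127). Entries are written one after another, key first and then value.

These two Writables are used throughout Hadoop and will come up often later, so no example is given here.

Note that there are no Writable types for sets or lists, but both can be emulated. Use a MapWritable in place of a set and a SortedMapWritable in place of a sorted set, with every value set to NullWritable, which takes no space. A list of same-typed elements can be stored in an ArrayWritable. For a list with mixed types, wrap the elements in a GenericWritable subclass and store those in an ArrayWritable. Alternatively, a MapWritable can represent a list by using the index as the key and the list element as the value.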
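The set-from-map idea is the same one used in plain Java collections, so it can be sketched without Hadoop. MapBackedSet is a hypothetical class in which a shared placeholder object plays the role that NullWritable plays in the text; it is an analogy, not Hadoop code.

```java
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: a sorted set emulated by a sorted map with placeholder values.
final class MapBackedSet {
    // A single shared placeholder, analogous to the NullWritable singleton.
    private static final Object PLACEHOLDER = new Object();

    private final SortedMap<String, Object> map = new TreeMap<>();

    // Returns true if the element was not already present.
    boolean add(String element) {
        return map.put(element, PLACEHOLDER) == null;
    }

    boolean contains(String element) {
        return map.containsKey(element);
    }

    // The keys of the map are the elements of the set, in sorted order.
    Set<String> elements() {
        return map.keySet();
    }

    public static void main(String[] args) {
        MapBackedSet set = new MapBackedSet();
        set.add("pig");
        set.add("hadoop");
        set.add("pig");
        System.out.println(set.elements());  // [hadoop, pig]
    }
}
```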
