1: How protobuf encodes primitive data types
    public enum FieldType {
      DOUBLE  (JavaType.DOUBLE     , WIRETYPE_FIXED64         ),
      FLOAT   (JavaType.FLOAT      , WIRETYPE_FIXED32         ),
      INT64   (JavaType.LONG       , WIRETYPE_VARINT          ),
      UINT64  (JavaType.LONG       , WIRETYPE_VARINT          ),
      INT32   (JavaType.INT        , WIRETYPE_VARINT          ),
      FIXED64 (JavaType.LONG       , WIRETYPE_FIXED64         ),
      FIXED32 (JavaType.INT        , WIRETYPE_FIXED32         ),
      BOOL    (JavaType.BOOLEAN    , WIRETYPE_VARINT          ),
      STRING  (JavaType.STRING     , WIRETYPE_LENGTH_DELIMITED) {
        public boolean isPackable() { return false; }
      },
      GROUP   (JavaType.MESSAGE    , WIRETYPE_START_GROUP     ) {
        public boolean isPackable() { return false; }
      },
      MESSAGE (JavaType.MESSAGE    , WIRETYPE_LENGTH_DELIMITED) {
        public boolean isPackable() { return false; }
      },
      BYTES   (JavaType.BYTE_STRING, WIRETYPE_LENGTH_DELIMITED) {
        public boolean isPackable() { return false; }
      },
      UINT32  (JavaType.INT        , WIRETYPE_VARINT          ),
      ENUM    (JavaType.ENUM       , WIRETYPE_VARINT          ),
      SFIXED32(JavaType.INT        , WIRETYPE_FIXED32         ),
      SFIXED64(JavaType.LONG       , WIRETYPE_FIXED64         ),
      SINT32  (JavaType.INT        , WIRETYPE_VARINT          ),
      SINT64  (JavaType.LONG       , WIRETYPE_VARINT          );
      // (the rest of the enum is not shown in this excerpt)
The matching read logic for primitive fields:
    static Object readPrimitiveField(
        CodedInputStream input,
        FieldType type,
        Utf8Validation utf8Validation) throws IOException {
      switch (type) {
        case DOUBLE  : return input.readDouble  ();
        case FLOAT   : return input.readFloat   ();
        case INT64   : return input.readInt64   ();
        case UINT64  : return input.readUInt64  ();
        case INT32   : return input.readInt32   ();
        case FIXED64 : return input.readFixed64 ();
        case FIXED32 : return input.readFixed32 ();
        case BOOL    : return input.readBool    ();
        case BYTES   : return input.readBytes   ();
        case UINT32  : return input.readUInt32  ();
        case SFIXED32: return input.readSFixed32();
        case SFIXED64: return input.readSFixed64();
        case SINT32  : return input.readSInt32  ();
        case SINT64  : return input.readSInt64  ();
        // (the remaining cases are not shown in this excerpt)
Default values of the Java types that MessageLite fields map to:
    public enum JavaType {
      INT(0),
      LONG(0L),
      FLOAT(0F),
      DOUBLE(0D),
      BOOLEAN(false),
      STRING(""),
      BYTE_STRING(ByteString.EMPTY),
      ENUM(null),
      MESSAGE(null);
      // (the rest of the enum is not shown in this excerpt)
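When choosing among these types in Java, most of them are clearly distinct. The real choice is among int32, uint32, sint32 and fixed32 (and their 64-bit counterparts), because Java represents all of them as int (or long). Only sint32/sint64 use ZigZag encoding to deal with the sign bit; the other types write the raw bits. The generated code contains no validation logic, so nothing stops a caller from passing a negative number to a uint field, for example. In terms of encoding efficiency: fixed32 always takes 4 bytes, so it beats int32 when values are often 2^28 or larger (a varint then needs 5 bytes); sint32 encodes negative numbers far more compactly than int32; and uint32 is meant for fields whose value is never negative.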
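To make the size trade-off concrete, here is a minimal sketch that computes how many bytes each 32-bit integer type would need for a given value, excluding the tag. The class `IntFieldSizeSketch` and its method names are made up for illustration and are not part of protobuf; the rules they follow are the ones described in this article.

```java
final class IntFieldSizeSketch {
    // Bytes needed to varint-encode a value treated as an unsigned 64-bit number.
    static int varintSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    // int32: the int is widened to long, so negative values are sign-extended to 64 bits.
    static int int32Size(int v) {
        return varintSize((long) v);
    }

    // uint32: the 32 bits are treated as unsigned, no sign extension.
    static int uint32Size(int v) {
        return varintSize(v & 0xFFFFFFFFL);
    }

    // sint32: ZigZag first, then varint of the unsigned result.
    static int sint32Size(int v) {
        int zigzag = (v << 1) ^ (v >> 31);
        return varintSize(zigzag & 0xFFFFFFFFL);
    }

    // fixed32 / sfixed32: always 4 bytes, whatever the value.
    static int fixed32Size(int v) {
        return 4;
    }
}
```

With these rules, -1 costs 10 bytes as int32 but 1 byte as sint32, and any value of 2^28 or more costs 5 bytes as a varint versus 4 bytes as fixed32.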
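In the implementation, protobuf uses CodedOutputStream for serialization and CodedInputStream for deserialization. Both provide write/read methods for primitive types and for Message types. Each write method takes a fieldNumber and a value, and first writes a tag built from the fieldNumber and the WireType. (The WireType is included so that a parser that meets a field it does not recognize can still tell how that field's value is laid out, which is why these few wire types are enough.) The tag itself is a variable-length int (varint). In a varint, each byte carries 7 bits of payload, lowest group first; a most significant bit (msb) of 1 means the next byte still belongs to the current value, and an msb of 0 marks the last byte of the value.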
The size of a varint32 is computed as follows:
    /**
     * Compute the number of bytes that would be needed to encode a varint.
     * {@code value} is treated as unsigned, so it won't be sign-extended if
     * negative.
     */
    public static int computeRawVarint32Size(final int value) {
      if ((value & (0xffffffff <<  7)) == 0) return 1;
      if ((value & (0xffffffff << 14)) == 0) return 2;
      if ((value & (0xffffffff << 21)) == 0) return 3;
      if ((value & (0xffffffff << 28)) == 0) return 4;
      return 5;
    }
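The varint layout described above can be sketched without the protobuf library. The class `VarintSketch` below is a hypothetical, stand-alone illustration (not protobuf's own code): it emits 7 bits per byte, lowest group first, and sets the msb on every byte except the last.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {
    // Encode a 32-bit value (treated as unsigned) as a varint.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // 7 payload bits, msb = 1: more bytes follow
            value >>>= 7;
        }
        out.write(value);                     // last byte, msb = 0
        return out.toByteArray();
    }

    // Decode a varint of at most 5 bytes starting at index 0.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;                        // msb = 0 marks the end of the value
            }
            shift += 7;
        }
        return result;
    }
}
```

For example, 300 encodes to the two bytes AC 02, and -1 (treated as unsigned) takes the full 5 bytes.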
The wire types are:
    public static final int WIRETYPE_VARINT           = 0;
    public static final int WIRETYPE_FIXED64          = 1;
    public static final int WIRETYPE_LENGTH_DELIMITED = 2;
    public static final int WIRETYPE_START_GROUP      = 3;
    public static final int WIRETYPE_END_GROUP        = 4;
    public static final int WIRETYPE_FIXED32          = 5;
The wire type occupies the lower 3 bits of the tag:
    static final int TAG_TYPE_BITS = 3;
    static final int TAG_TYPE_MASK = (1 << TAG_TYPE_BITS) - 1;

    /** Given a tag value, determines the wire type (the lower 3 bits). */
    static int getTagWireType(final int tag) {
      return tag & TAG_TYPE_MASK;
    }

    /** Given a tag value, determines the field number (the upper 29 bits). */
    public static int getTagFieldNumber(final int tag) {
      return tag >>> TAG_TYPE_BITS;
    }

    /** Makes a tag value given a field number and wire type. */
    static int makeTag(final int fieldNumber, final int wireType) {
      return (fieldNumber << TAG_TYPE_BITS) | wireType;
    }
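A short worked example may help here. The sketch below copies the three tag helpers into a stand-alone class (`TagSketch` is a name invented for this illustration) so the arithmetic can be checked without the library.

```java
final class TagSketch {
    static final int TAG_TYPE_BITS = 3;
    static final int TAG_TYPE_MASK = (1 << TAG_TYPE_BITS) - 1;

    // Field number in the upper bits, wire type in the lower 3 bits.
    static int makeTag(int fieldNumber, int wireType) {
        return (fieldNumber << TAG_TYPE_BITS) | wireType;
    }

    static int fieldNumber(int tag) {
        return tag >>> TAG_TYPE_BITS;
    }

    static int wireType(int tag) {
        return tag & TAG_TYPE_MASK;
    }
}
```

Field 1 with WIRETYPE_VARINT (0) gives the tag 0x08, and field 2 with WIRETYPE_LENGTH_DELIMITED (2) gives 0x12. Field 16 already produces the value 128, which no longer fits in a single varint byte, so field numbers 1 through 15 are the ones whose tag costs one byte.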
    /** Write a {@code double} field, including tag, to the stream. */
    public void writeDouble(final int fieldNumber, final double value)
                            throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_FIXED64);
      writeDoubleNoTag(value);
    }

    /** Write a {@code float} field, including tag, to the stream. */
    public void writeFloat(final int fieldNumber, final float value)
                           throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_FIXED32);
      writeFloatNoTag(value);
    }

    /** Write a {@code uint64} field, including tag, to the stream. */
    public void writeUInt64(final int fieldNumber, final long value)
                            throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_VARINT);
      writeUInt64NoTag(value);
    }

    /** Write an {@code int64} field, including tag, to the stream. */
    public void writeInt64(final int fieldNumber, final long value)
                           throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_VARINT);
      writeInt64NoTag(value);
    }

    /** Write an {@code int32} field, including tag, to the stream. */
    public void writeInt32(final int fieldNumber, final int value)
                           throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_VARINT);
      writeInt32NoTag(value);
    }

    /** Write a {@code fixed64} field, including tag, to the stream. */
    public void writeFixed64(final int fieldNumber, final long value)
                             throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_FIXED64);
      writeFixed64NoTag(value);
    }

    /** Write a {@code fixed32} field, including tag, to the stream. */
    public void writeFixed32(final int fieldNumber, final int value)
                             throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_FIXED32);
      writeFixed32NoTag(value);
    }

    /** Write a {@code bool} field, including tag, to the stream. */
    public void writeBool(final int fieldNumber, final boolean value)
                          throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_VARINT);
      writeBoolNoTag(value);
    }

    /** Write a {@code string} field, including tag, to the stream. */
    public void writeString(final int fieldNumber, final String value)
                            throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_LENGTH_DELIMITED);
      writeStringNoTag(value);
    }
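After the tag, the field value is written, with a different encoding for each field type:
1. For int32/int64, a value greater than or equal to 0 is written directly as a varint. A negative value is sign-extended and written as a 64-bit varint, so a negative number always takes 10 bytes; this is why int32/int64 are so inefficient for negative numbers. (A varint32 takes at most 5 bytes, which is enough for 32 bits even after the 5 continuation bits are removed, so why sign-extend to 64 bits and spend 10 bytes? Apart from making parsing in CodedInputStream simpler, I could not see a reason on my own. The usual explanation is wire compatibility: int32 and int64 share one encoding, so a negative int32 has to read back as the same negative value when the field is parsed as int64.)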
    /** Write an {@code int32} field to the stream. */
    public void writeInt32NoTag(final int value) throws IOException {
      if (value >= 0) {
        writeRawVarint32(value);
      } else {
        // Must sign-extend.
        writeRawVarint64(value);
      }
    }
The sign-extended value is written as a 64-bit varint:
    public void writeRawVarint64(long value) throws IOException {
      while (true) {
        if ((value & ~0x7FL) == 0) {
          writeRawByte((int)value);
          return;
        } else {
          writeRawByte(((int)value & 0x7F) | 0x80);
          value >>>= 7;
        }
      }
    }
This is where the 10 bytes come from:
    public static int computeInt32SizeNoTag(final int value) {
      if (value >= 0) {
        return computeRawVarint32Size(value);
      } else {
        // Must sign-extend.
        return 10;
      }
    }
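The following stand-alone sketch shows what the sign extension produces on the wire and why it keeps int32 and int64 interchangeable. `NegativeInt32Sketch` and its methods are illustrative names, not protobuf API; the write loop mirrors writeRawVarint64 above, relying on the fact that widening an int to long in Java sign-extends it.

```java
import java.io.ByteArrayOutputStream;

final class NegativeInt32Sketch {
    // Same result as writeInt32NoTag: for non-negative values the widened long
    // encodes exactly like the 32-bit varint; negatives are sign-extended.
    static byte[] writeInt32NoTag(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long v = value; // implicit widening sign-extends negative values
        while ((v & ~0x7FL) != 0) {
            out.write(((int) v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Read the bytes back as a 64-bit varint, as an int64 reader would.
    static long readVarint64(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }
}
```

Writing -1 yields nine 0xFF bytes followed by 0x01. Read back as a 64-bit value it is still -1, and truncating that to 32 bits is also -1, so the value survives whether the reader treats the field as int32 or int64.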
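2. For uint32/uint64, the value is also written as a varint, and negative inputs are not checked.
    public void writeUInt32NoTag(final int value) throws IOException {
      writeRawVarint32(value);
    }
The method simply calls the varint32 writer and never checks that value is non-negative.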
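3. For sint32/sint64, the value is first ZigZag-encoded, and the encoded value is then written as a varint. ZigZag maps signed integers onto unsigned ones by interleaving them: a non-negative n becomes 2n and a negative n becomes -2n - 1, so 0 encodes to 0, -1 to 1, 1 to 2, -2 to 3, and so on. Negative numbers of small magnitude therefore still encode compactly.
    public void writeSInt32NoTag(final int value) throws IOException {
      writeRawVarint32(encodeZigZag32(value));
    }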
For reference, the 32-bit and 64-bit ZigZag encoders:
    /**
     * Encode a ZigZag-encoded 32-bit value.  ZigZag encodes signed integers
     * into values that can be efficiently encoded with varint.  (Otherwise,
     * negative values must be sign-extended to 64 bits to be varint encoded,
     * thus always taking 10 bytes on the wire.)
     *
     * @param n A signed 32-bit integer.
     * @return An unsigned 32-bit integer, stored in a signed int because
     *         Java has no explicit unsigned support.
     */
    public static int encodeZigZag32(final int n) {
      // Note:  the right-shift must be arithmetic
      return (n << 1) ^ (n >> 31);
    }

    /**
     * Encode a ZigZag-encoded 64-bit value.  ZigZag encodes signed integers
     * into values that can be efficiently encoded with varint.  (Otherwise,
     * negative values must be sign-extended to 64 bits to be varint encoded,
     * thus always taking 10 bytes on the wire.)
     *
     * @param n A signed 64-bit integer.
     * @return An unsigned 64-bit integer, stored in a signed int because
     *         Java has no explicit unsigned support.
     */
    public static long encodeZigZag64(final long n) {
      // Note:  the right-shift must be arithmetic
      return (n << 1) ^ (n >> 63);
    }
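The reader has to undo this mapping. The sketch below pairs the 32-bit encoder with an inverse; `ZigZagSketch` is a name chosen for this example, and `decode32` is written here from the definition of the mapping rather than copied from the library.

```java
final class ZigZagSketch {
    // Same expression as encodeZigZag32: the arithmetic shift spreads the sign bit.
    static int encode32(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Inverse mapping: the lowest bit carries the sign, the remaining bits the magnitude.
    static int decode32(int n) {
        return (n >>> 1) ^ -(n & 1);
    }
}
```

The round trip holds across the whole int range, including Integer.MIN_VALUE and Integer.MAX_VALUE.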
4. For fixed32/sfixed32/fixed64/sfixed64, the value is written directly as a fixed-length, little-endian value.
Taking fixed32 as the example:
    /** Write a {@code fixed32} field to the stream. */
    public void writeFixed32NoTag(final int value) throws IOException {
      writeRawLittleEndian32(value);
    }

    public void writeRawLittleEndian32(final int value) throws IOException {
      writeRawByte((value      ) & 0xFF);
      writeRawByte((value >>  8) & 0xFF);
      writeRawByte((value >> 16) & 0xFF);
      writeRawByte((value >> 24) & 0xFF);
    }
The other fixed-width types work the same way.
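As a cross-check on the byte order, the same layout can be produced with the JDK's ByteBuffer. This is only an equivalent sketch for illustration (`Fixed32Sketch` is a made-up name); protobuf itself writes the bytes one at a time as shown above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Fixed32Sketch {
    // Four bytes, least significant byte first.
    static byte[] encodeFixed32(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    static int decodeFixed32(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```

For 0x12345678 the bytes on the wire are 78 56 34 12.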
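5. For double, the value is first converted to its long bit pattern and then written as 8 fixed-length bytes in little-endian order.
6. For float, the value is first converted to its int bit pattern and then written as 4 fixed-length bytes in little-endian order.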
    /** Write a {@code double} field, including tag, to the stream. */
    public void writeDouble(final int fieldNumber, final double value)
                            throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_FIXED64);
      writeDoubleNoTag(value);
    }

    /** Write a {@code float} field, including tag, to the stream. */
    public void writeFloat(final int fieldNumber, final float value)
                           throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_FIXED32);
      writeFloatNoTag(value);
    }

    /** Write a {@code double} field to the stream. */
    public void writeDoubleNoTag(final double value) throws IOException {
      writeRawLittleEndian64(Double.doubleToRawLongBits(value));
    }

    /** Write a {@code float} field to the stream. */
    public void writeFloatNoTag(final float value) throws IOException {
      writeRawLittleEndian32(Float.floatToRawIntBits(value));
    }
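To see what "convert to the bit pattern, then write little-endian" means in bytes, here is a small stand-alone sketch. `FloatingPointSketch` is a hypothetical helper class; only Double.doubleToRawLongBits and Float.floatToRawIntBits are taken from the code above.

```java
final class FloatingPointSketch {
    // double -> IEEE 754 bit pattern -> 8 bytes, least significant byte first.
    static byte[] encodeDouble(double value) {
        long bits = Double.doubleToRawLongBits(value);
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (bits >>> (8 * i));
        }
        return out;
    }

    // float -> IEEE 754 bit pattern -> 4 bytes, least significant byte first.
    static byte[] encodeFloat(float value) {
        int bits = Float.floatToRawIntBits(value);
        byte[] out = new byte[4];
        for (int i = 0; i < 4; i++) {
            out[i] = (byte) (bits >>> (8 * i));
        }
        return out;
    }
}
```

The double 1.0 has the bit pattern 0x3FF0000000000000, so its payload is 00 00 00 00 00 00 F0 3F; the float 1.0f (0x3F800000) becomes 00 00 80 3F.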
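7. For bool, a single byte holding 0 or 1 is written.
    /** Write a {@code bool} field, including tag, to the stream. */
    public void writeBool(final int fieldNumber, final boolean value)
                          throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_VARINT);
      writeBoolNoTag(value);
    }

    /** Write a {@code bool} field to the stream. */
    public void writeBoolNoTag(final boolean value) throws IOException {
      writeRawByte(value ? 1 : 0);
    }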
8. For string, the value is encoded as UTF-8 to obtain a byte array; the length of the array is written as a varint, followed by the bytes themselves.
Tag | msgByteSize | msgByte
    public void writeStringNoTag(final String value) throws IOException {
      // Unfortunately there does not appear to be any way to tell Java to encode
      // UTF-8 directly into our buffer, so we have to let it create its own byte
      // array and then copy.
      final byte[] bytes = value.getBytes(Internal.UTF_8);
      writeRawVarint32(bytes.length);
      writeRawBytes(bytes);
    }
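Putting tag, length and payload together, a complete string field can be sketched as below. `StringFieldSketch` and its helpers are illustrative names, not protobuf API, and StandardCharsets.UTF_8 stands in for the library's Internal.UTF_8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class StringFieldSketch {
    static final int WIRETYPE_LENGTH_DELIMITED = 2;

    static void writeVarint32(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Tag | msgByteSize | msgByte
    static byte[] writeString(int fieldNumber, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        writeVarint32(out, (fieldNumber << 3) | WIRETYPE_LENGTH_DELIMITED);
        writeVarint32(out, bytes.length); // length in bytes, not in chars
        out.write(bytes, 0, bytes.length);
        return out.toByteArray();
    }
}
```

Field 2 with the value "testing" comes out as 12 07 74 65 73 74 69 6E 67: tag 0x12, length 7, then the seven UTF-8 bytes. Note that the length counts bytes, so a one-character string such as "\u00e9" has length 2.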
9. For bytes (ByteString), the length is written as a varint, followed by the whole byte array.
Tag | msgByteSize | msgByte
    public void writeBytesNoTag(final ByteString value) throws IOException {
      writeRawVarint32(value.size());
      writeRawBytes(value);
    }
10. For enum types (wire type WIRETYPE_VARINT), the numeric value given to the enum constant in its definition is written using the int32 encoding. (For this reason negative values are not recommended for enum constants: the int32 encoding is very inefficient for negative numbers.)
    /**
     * Write an enum field, including tag, to the stream.  Caller is responsible
     * for converting the enum value to its numeric value.
     */
    public void writeEnum(final int fieldNumber, final int value)
                          throws IOException {
      writeTag(fieldNumber, WireFormat.WIRETYPE_VARINT);
      writeEnumNoTag(value);
    }

    public void writeEnumNoTag(final int value) throws IOException {
      writeInt32NoTag(value);
    }
11. For an embedded Message (wire type WIRETYPE_LENGTH_DELIMITED), the byte length of the serialized Message is written first, followed by the serialized Message itself.
Tag | msgByteSize | msgByte
    public void writeMessageNoTag(final MessageLite value) throws IOException {
      writeRawVarint32(value.getSerializedSize());
      value.writeTo(this);
    }
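Encoding repeated fields
For a repeated field there are, in general, two possible encodings:
1. Every item writes its own tag and then its data. For primitive types:
Tag | Data | Tag | Data | …
And for message types:
Tag | Length | Data | Tag | Length | Data | …
2. Write the tag once, then a count, then count items, each consisting of length|data. That is:
Tag | Count | Length | Data | Length | Data | …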
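From the point of view of encoding efficiency, the second form looks more compact to me, yet protobuf uses the first, for reasons I do not know. One reason I can think of is that in the first form every item is relatively independent, so the receiver can parse each item as soon as it arrives instead of waiting for the whole repeated field. For primitive types protobuf originally used the first form as well; that turned out to be too inefficient, so [packed = true] can be added to switch to a third form (a variant of the second that is even more compact for primitive types):
3. Write the tag once, then the total byte size of the field, then the data of each item. That is:
Tag | dataByteSize | Data | Data | …
At present protobuf supports packed only for repeated fields of primitive types (string and bytes are excluded, as isPackable() in FieldType shows), so adding packed to a non-repeated field, or to a repeated field of a non-primitive type, makes the compiler report an error when it compiles the .proto file.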
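The difference between the first and third forms is easy to see in bytes. The sketch below encodes a repeated varint field both ways; `RepeatedFieldSketch` is a hypothetical class written for this comparison, and it assumes non-negative values so that a plain 32-bit varint is enough.

```java
import java.io.ByteArrayOutputStream;

final class RepeatedFieldSketch {
    static void writeVarint32(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Form 1: Tag | Data | Tag | Data | ...  (wire type 0 = varint on every item)
    static byte[] unpacked(int fieldNumber, int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            writeVarint32(out, (fieldNumber << 3) | 0);
            writeVarint32(out, v);
        }
        return out.toByteArray();
    }

    // Form 3: Tag | dataByteSize | Data | Data | ...  (wire type 2 = length-delimited)
    static byte[] packed(int fieldNumber, int[] values) {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int v : values) {
            writeVarint32(payload, v);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint32(out, (fieldNumber << 3) | 2);
        writeVarint32(out, payload.size());
        out.write(payload.toByteArray(), 0, payload.size());
        return out.toByteArray();
    }
}
```

For field 4 with the values 3, 270 and 86942, the packed form is 22 06 03 8E 02 9E A7 05 (8 bytes), while the unpacked form is 20 03 20 8E 02 20 9E A7 05 (9 bytes). The saving grows with the number of items, because the packed form pays for the tag only once.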
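Encoding unknown fields
Protobuf keeps all unrecognized fields in an UnknownFieldSet. Every Message class generated by the protobuf compiler, as well as GeneratedMessage.Builder, holds an UnknownFieldSet field named unknownFields. It can be populated from a CodedInputStream (through the mergeFieldFrom() method of UnknownFieldSet.Builder) or set by the user through the Builder. During serialization, the writeTo() method of UnknownFieldSet writes its contents to the CodedOutputStream.
As the name suggests, UnknownFieldSet is a collection of unknown fields. Internally it is a Map from field number to Field, where a Field represents one unknown field. Because the value could be anything, a Field holds a List for each of the five kinds of value. A Field is not validated, so several unknown fields with the same field number are allowed, and they may hold values of any kind. UnknownFieldSet follows the MessageLite programming model: it implements the MessageLite interface and defines a Builder class implementing MessageLite.Builder, which is used to build an UnknownFieldSet by hand or from a CodedInputStream. Field itself does not implement MessageLite, but it does implement some of that interface's methods, such as writeTo() and getSerializedSize(), so that it can serialize itself to a CodedOutputStream, and it defines a Field.Builder class for building Field instances.
When a Message is serialized (in its writeTo() implementation), the UnknownFieldSet defined in the Message is written to the CodedOutputStream after all recognized fields and extension fields. When a Message is parsed from a CodedInputStream, every unknown field is stored in this UnknownFieldSet.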
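What makes unknown fields workable at all is the wire type in the tag: it tells a parser how long a value is even when the field number means nothing to it. The sketch below illustrates that idea only. `SkipFieldSketch` is a hypothetical helper and is not how UnknownFieldSet is implemented; it merely steps over one field, and it deliberately leaves groups unhandled.

```java
final class SkipFieldSketch {
    // Returns the index of the first byte after the field whose tag starts at pos.
    static int skipField(byte[] buf, int pos) {
        int tag = 0;
        int shift = 0;
        while (true) {                       // read the tag varint
            byte b = buf[pos++];
            tag |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        switch (tag & 7) {                   // the lower 3 bits are the wire type
            case 0:                          // varint: ends at the first byte with msb = 0
                while ((buf[pos++] & 0x80) != 0) { }
                return pos;
            case 1:                          // fixed64
                return pos + 8;
            case 2: {                        // length-delimited: varint length, then payload
                int len = 0;
                int s = 0;
                while (true) {
                    byte b = buf[pos++];
                    len |= (b & 0x7F) << s;
                    if ((b & 0x80) == 0) {
                        break;
                    }
                    s += 7;
                }
                return pos + len;
            }
            case 5:                          // fixed32
                return pos + 4;
            default:
                throw new IllegalArgumentException(
                    "wire type not handled in this sketch: " + (tag & 7));
        }
    }
}
```

A real parser would keep the skipped bytes (grouped by field number) rather than discard them, which is the role UnknownFieldSet plays.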
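Encoding extension fields
Framework code often has to be extensible. In Java this is usually solved by defining a parent class or an interface; if the framework also constructs the instances, reflection or an exposed Factory class does the job. For serialization, however, such a dynamic plugin mechanism is hard to provide. Protobuf nevertheless offers a reasonably acceptable mechanism (the syntax is a little odd, but it works): a message declares the range of field numbers it allows to be extended, and users then extend that message definition with the extend keyword (see the relevant chapter for details). In the implementation, every message type that supports extensions inherits from ExtendableMessage (which itself inherits from GeneratedMessage) and implements the ExtendableMessageOrBuilder interface, while its Builder class inherits from ExtendableBuilder and also implements ExtendableMessageOrBuilder.
Both ExtendableMessage and ExtendableBuilder hold a field of type FieldSet<FieldDescriptor> that stores all extension field values of the message. FieldSet keeps a Map from FieldDescriptor to its Object value, but ExtendableMessage and ExtendableBuilder identify an extension field by a GeneratedExtension. The reason is that a GeneratedExtension contains not only the FieldDescriptor describing the extension field but also the field's type, default value and similar information. The protobuf compiler generates a GeneratedExtension instance for each extension field for users to work with.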