Using Encoding
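Encoding is fairly simple to use. If all you need is to convert between bytes and characters, GetBytes() and GetChars(), together with their overloads, will cover practically everything.
GetByteCount() and its overloads return the exact number of bytes a string occupies once it is encoded.
GetCharCount() and its overloads return the exact number of characters a byte array yields once it is decoded.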
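As a minimal sketch of how these four methods fit together (my own illustration, written as C# 9 top-level statements; the sample string and variable names are assumptions), the following sizes each buffer exactly with the Count methods and then round-trips a string through UTF-8:

```csharp
using System;
using System.Text;

// "test" + U+6211 (a CJK character) + U+03BA (Greek kappa), written with escapes
// so the sample does not depend on the source file's own encoding.
string text = "test\u6211\u03BA";
Encoding utf8 = Encoding.UTF8;

// Size the byte buffer exactly, then encode into it.
int byteCount = utf8.GetByteCount(text);
byte[] bytes = new byte[byteCount];
utf8.GetBytes(text, 0, text.Length, bytes, 0);

// Size the char buffer exactly, then decode back into it.
int charCount = utf8.GetCharCount(bytes);
char[] chars = new char[charCount];
utf8.GetChars(bytes, 0, bytes.Length, chars, 0);

string roundTrip = new string(chars);
Console.WriteLine("{0} chars -> {1} bytes -> {2} chars", text.Length, byteCount, charCount);
Console.WriteLine(roundTrip == text);
```

In UTF-8 the four ASCII letters take one byte each, the CJK character three and the Greek letter two, so the byte count is larger than the character count.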
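Pay attention to these two methods: int GetMaxByteCount(int charCount); int GetMaxCharCount(int byteCount);
GetMaxByteCount does not behave the way you might expect. You would think it returns charCount for a single-byte encoding and charCount*2 for a double-byte encoding, but it actually returns charCount+1 and (charCount+1)*2.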
    Console.WriteLine("The max byte count is {0}.", Encoding.Unicode.GetMaxByteCount(10));
    Console.WriteLine("The max byte count is {0}.", Encoding.ASCII.GetMaxByteCount(10));
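The results above are 22 and 11 respectively, not 20 and 10. I found the reason in an English blog post, quoted below. My English is not great, and I did not really understand what a high surrogate and a low surrogate are: http://blogs.msdn.com/b/shawnste/archive/2005/03/02/383903.aspx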
For example, Encoding.GetEncoding(1252).GetMaxByteCount(1) returns 2. 1252 is a single byte code page (encoding), so generally one would expect that GetMaxByteCount(n) would return n, but it doesn't, it usually returns n+1.
One reason for this oddity is that an Encoder could store a high surrogate on one call to GetBytes(), hoping that the next call is a low surrogate. This allows the fallback mechanism to provide a fallback for a complete surrogate pair, even if that pair is split between calls to GetBytes(). If the fallback returns a ? for each surrogate half, or if the next call doesn't have a surrogate, then 2 characters could be output for that surrogate pair. So in this case, calling Encoder.GetBytes() with a high surrogate would return 0 bytes and then following that with another call with only the low surrogate would return 2 bytes.
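To make the quoted explanation more concrete, here is a small sketch of my own (C# 9 top-level statements; the choice of U+1F600 and the variable names are assumptions, not from the quoted post). A character outside the Basic Multilingual Plane is stored in UTF-16 as two chars, a high surrogate followed by a low surrogate, and the sketch feeds the two halves to an ASCII Encoder in separate calls:

```csharp
using System;
using System.Text;

// U+1F600 is outside the BMP, so UTF-16 stores it as two chars:
// a high surrogate (0xD83D) followed by a low surrogate (0xDE00).
char[] high = { '\uD83D' };
char[] low = { '\uDE00' };

Encoder encoder = Encoding.ASCII.GetEncoder();

// Sized for ONE char; the extra slot in GetMaxByteCount is what leaves
// room for output that belongs to a surrogate held over from an earlier call.
byte[] buffer = new byte[Encoding.ASCII.GetMaxByteCount(1)];

// flush: false says more input is coming, so the encoder may hold on to
// the lone high surrogate instead of emitting anything for it.
int first = encoder.GetBytes(high, 0, 1, buffer, 0, false);

// The low surrogate completes the pair; the fallback output for the whole
// pair is written on this call.
int second = encoder.GetBytes(low, 0, 1, buffer, 0, true);

Console.WriteLine("first call:  {0} byte(s)", first);
Console.WriteLine("second call: {0} byte(s)", second);
```

According to the quoted post, the first call should report 0 bytes and the second 2 bytes, even though each call was given only a single char. That is the situation GetMaxByteCount(n) has to budget for.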
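The code below is a simple application of Encoding. Print the results and compare them with what the previous post covered; you should get something out of it.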
    // Prints the encoding's name, the encoded bytes, and the string decoded back from them.
    static void Output(Encoding encoding, string t)
    {
        Console.WriteLine(encoding.ToString());
        byte[] buffer = encoding.GetBytes(t);
        foreach (byte b in buffer)
        {
            Console.Write(b + "-");
        }
        string s = encoding.GetString(buffer);
        Console.WriteLine(s);
    }
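    string strTest = "test我鎔a有κ";
    Console.WriteLine(strTest);
    // On .NET Core / .NET 5+, gb18030 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    Output(Encoding.GetEncoding("gb18030"), strTest);
    Output(Encoding.Default, strTest);
    Output(Encoding.UTF32, strTest);
    Output(Encoding.UTF8, strTest);
    Output(Encoding.Unicode, strTest);
    Output(Encoding.ASCII, strTest);
    Output(Encoding.UTF7, strTest); // Encoding.UTF7 is marked obsolete from .NET 5 onward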
About the BOM
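BOM stands for Byte Order Mark. It is a short byte sequence at the very start of a text that identifies which Unicode encoding, and which byte order, the text uses. For example, when Notepad opens a text file that contains a BOM, it can tell which encoding was used and decode accordingly, so the text opens correctly without garbled characters. Without a BOM, Notepad has traditionally fallen back to ANSI, which may produce garbled text. You can call GetPreamble() on an Encoding to check whether that encoding has a BOM. Currently only the following five encodings in the CLR have one.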
UTF-8: EF BB BF
UTF-16 big endian: FE FF
UTF-16 little endian: FF FE
UTF-32 big endian: 00 00 FE FF
UTF-32 little endian: FF FE 00 00
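The list above can be reproduced with GetPreamble(). This is a small sketch of my own (C# 9 top-level statements; the candidates array is an assumption for illustration) that prints each encoding's preamble in hex, with Encoding.ASCII added as an example of an encoding that has none:

```csharp
using System;
using System.Text;

Encoding[] candidates =
{
    Encoding.UTF8,
    Encoding.BigEndianUnicode,      // UTF-16 big endian
    Encoding.Unicode,               // UTF-16 little endian
    new UTF32Encoding(true, true),  // UTF-32 big endian, BOM enabled
    Encoding.UTF32,                 // UTF-32 little endian
    Encoding.ASCII                  // no BOM: GetPreamble() returns an empty array
};

foreach (Encoding encoding in candidates)
{
    byte[] preamble = encoding.GetPreamble();
    string hex = preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
    Console.WriteLine("{0}: {1}", encoding.WebName, hex);
}
```

There is no static property for big-endian UTF-32, which is why that one is constructed explicitly.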
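The encodings returned by the static properties Encoding.Unicode, Encoding.UTF8 and Encoding.UTF32 all emit a BOM by default. If you do not want a BOM when writing text (for example an XML file, where some consumers cannot handle the BOM and show garbage), you have to construct the encoding instances yourself: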
    Encoding encodingUTF16 = new UnicodeEncoding(false, false); // the second argument (byteOrderMark) must be false
    Encoding encodingUTF8 = new UTF8Encoding(false);
    Encoding encodingUTF32 = new UTF32Encoding(false, false);   // the second argument (byteOrderMark) must be false
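To see the difference in the bytes actually written, here is a minimal sketch (my own, in C# 9 top-level statements; the helper name WriteXml and the tiny XML fragment are hypothetical) that writes the same text through a StreamWriter with and without a BOM:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: writes a tiny XML fragment with the given encoding
// and returns the raw bytes that ended up in the stream.
static byte[] WriteXml(Encoding encoding)
{
    var stream = new MemoryStream();
    using (var writer = new StreamWriter(stream, encoding))
    {
        writer.Write("<a/>");
    }
    return stream.ToArray();   // ToArray still works after the stream is closed
}

byte[] withBom = WriteXml(Encoding.UTF8);
byte[] withoutBom = WriteXml(new UTF8Encoding(false));

Console.WriteLine(BitConverter.ToString(withBom));
Console.WriteLine(BitConverter.ToString(withoutBom));
```

With Encoding.UTF8 the output starts with EF BB BF followed by the four bytes of the text; with new UTF8Encoding(false) only the four bytes of the text are written.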
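For how reading and writing text interacts with the BOM, see this post on cnblogs. It covers the topic in detail, so I will not repeat it here: .NET(C#):字符編碼(Encoding)和字節順序標記(BOM) (".NET (C#): Character Encoding and the Byte Order Mark (BOM)")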
Determining the encoding of a text
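Given a text whose encoding we do not know, how do we choose the Encoding to decode it with? The answer is to use the BOM to tell which Unicode encoding it is. Without a BOM it is hard to say, and it depends on where the file came from. The usual choice is Encoding.Default, which on .NET Framework returns the system's ANSI code page and therefore varies with the machine's current settings (on .NET Core and later it is always UTF-8). If the file comes from a friend abroad, you are better off decoding it as UTF-8. The code below cannot guarantee a correct result when the file has no BOM, so keep that firmly in mind if you use it in your own project.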
    // Requires: using System.IO; using System.Linq; using System.Text;

    /// <summary>
    /// Returns the encoding of a text file, judged by its BOM (byte order mark).
    /// Returns Encoding.Default if no Unicode BOM is found.
    /// </summary>
    /// <param name="fileName">Path of the file to inspect.</param>
    /// <returns>The detected encoding, or Encoding.Default.</returns>
    public static Encoding GetFileEncoding(string fileName)
    {
        // UTF-32 LE must be tested before UTF-16 LE, because its BOM
        // (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
        Encoding[] unicodeEncodings =
        {
            Encoding.UTF32,
            new UTF32Encoding(true, true),
            Encoding.UTF8,
            Encoding.BigEndianUnicode,
            Encoding.Unicode
        };

        // Read the first four bytes once; no BOM is longer than that.
        // The using block closes the file even if reading throws.
        byte[] head = new byte[4];
        int read;
        using (FileStream fs = File.OpenRead(fileName))
        {
            read = fs.Read(head, 0, head.Length);
        }

        foreach (Encoding encoding in unicodeEncodings)
        {
            byte[] preamble = encoding.GetPreamble();
            // SequenceEqual compares contents; Array.Equals would only compare references.
            if (read >= preamble.Length && head.Take(preamble.Length).SequenceEqual(preamble))
            {
                return encoding;
            }
        }

        return Encoding.Default;
    }
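As an alternative sketch, StreamReader can do the BOM check itself when its detectEncodingFromByteOrderMarks argument is true. The helper name DetectByBom and the sample byte arrays below are my own assumptions for illustration (C# 9 top-level statements), and the same caveat applies: without a BOM you simply get back whatever fallback you passed in.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: lets StreamReader inspect the leading bytes and
// reports the encoding it settled on.
static Encoding DetectByBom(Stream stream, Encoding fallback)
{
    using (var reader = new StreamReader(stream, fallback, true))
    {
        reader.Peek();                  // CurrentEncoding is only updated after the first read
        return reader.CurrentEncoding;
    }
}

byte[] utf16Le = { 0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00 };   // UTF-16 LE BOM + "hi"
byte[] noBom = { 0x68, 0x69 };                             // "hi" with no BOM

Encoding detected = DetectByBom(new MemoryStream(utf16Le), Encoding.UTF8);
Encoding fellBack = DetectByBom(new MemoryStream(noBom), Encoding.ASCII);

Console.WriteLine(detected.WebName);
Console.WriteLine(fellBack.WebName);
```

For a file on disk you would pass File.OpenRead(path) instead of a MemoryStream.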
To be continued...
The next part will mainly cover Encoder and Decoder.
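By the way, a question: the post looked quite nice while I was editing it, so why has so much of the formatting disappeared in the preview? It looks awful.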