Using Encoding

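Using Encoding is fairly simple. If all you need is to convert between bytes and characters, GetBytes() and GetChars() and their overloads will cover essentially everything you need.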

GetByteCount() and its overloads return the exact number of bytes a string will occupy once it is encoded.

GetCharCount() and its overloads return the number of characters a byte array will produce once it is decoded.
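As a small illustration of the two counting methods, here is a minimal sketch. The class name CountDemo and the sample string are my own choices for the example, not something from the original post. It compares each count with the length of the array that is actually produced.

```csharp
using System;
using System.Text;

public static class CountDemo
{
    public static void Main()
    {
        // 5 chars: four ASCII letters plus one CJK character (U+6211).
        string text = "test\u6211";

        // GetByteCount reports how many bytes GetBytes will produce.
        int utf8Count = Encoding.UTF8.GetByteCount(text);
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Console.WriteLine("UTF-8:  GetByteCount = {0}, bytes.Length = {1}",
            utf8Count, utf8Bytes.Length);

        int utf16Count = Encoding.Unicode.GetByteCount(text);
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        Console.WriteLine("UTF-16: GetByteCount = {0}, bytes.Length = {1}",
            utf16Count, utf16Bytes.Length);

        // GetCharCount reports how many chars GetChars will produce.
        int charCount = Encoding.UTF8.GetCharCount(utf8Bytes);
        char[] chars = Encoding.UTF8.GetChars(utf8Bytes);
        Console.WriteLine("UTF-8:  GetCharCount = {0}, chars.Length = {1}",
            charCount, chars.Length);
    }
}
```

The same string takes 7 bytes in UTF-8 and 10 bytes in UTF-16, which is why the count has to be asked of the specific Encoding rather than derived from the string length.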

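Pay attention to these two methods: int GetMaxByteCount(int charCount); int GetMaxCharCount(int byteCount);

They do not behave the way you might expect. You might think a single-byte encoding returns charCount and a double-byte encoding returns charCount*2, but what you actually get is charCount+1 and (charCount+1)*2.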

            Console.WriteLine("The max byte count is {0}.", Encoding.Unicode.GetMaxByteCount(10));
            Console.WriteLine("The max byte count is {0}.", Encoding.ASCII.GetMaxByteCount(10));

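The results above are 22 and 11, not 20 and 10. I found the reason in an English blog post, although I did not fully understand its terms "high surrogate" and "low surrogate" at the time (they are the two UTF-16 code units that together encode one character outside the Basic Multilingual Plane): http://blogs.msdn.com/b/shawnste/archive/2005/03/02/383903.aspx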

For example, Encoding.GetEncoding(1252).GetMaxByteCount(1) returns 2. 1252 is a single byte code page (encoding), so generally one would expect that GetMaxByteCount(n) would return n, but it doesn't, it usually returns n+1.

One reason for this oddity is that an Encoder could store a high surrogate on one call to GetBytes(), hoping that the next call is a low surrogate. This allows the fallback mechanism to provide a fallback for a complete surrogate pair, even if that pair is split between calls to GetBytes(). If the fallback returns a ? for each surrogate half, or if the next call doesn't have a surrogate, then 2 characters could be output for that surrogate pair. So in this case, calling Encoder.GetBytes() with a high surrogate would return 0 bytes and then following that with another call with only the low surrogate would return 2 bytes.
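In practice the extra slot only matters when you size a buffer. The sketch below is an illustration under my own naming (MaxByteCountDemo and EncodeWithMaxBuffer are hypothetical helpers, not part of the framework). It allocates the worst-case size from GetMaxByteCount and then keeps only the bytes GetBytes reports as written.

```csharp
using System;
using System.Text;

public static class MaxByteCountDemo
{
    // Encodes text into a worst-case buffer and returns only the bytes used.
    public static byte[] EncodeWithMaxBuffer(Encoding encoding, string text)
    {
        // Upper bound; usually larger than what is really needed.
        byte[] buffer = new byte[encoding.GetMaxByteCount(text.Length)];

        // This overload returns the number of bytes actually written.
        int written = encoding.GetBytes(text, 0, text.Length, buffer, 0);

        byte[] result = new byte[written];
        Array.Copy(buffer, result, written);
        return result;
    }

    public static void Main()
    {
        string text = "0123456789";   // 10 chars
        foreach (Encoding encoding in new Encoding[] { Encoding.Unicode, Encoding.ASCII })
        {
            Console.WriteLine("{0}: max = {1}, exact = {2}, written = {3}",
                encoding.WebName,
                encoding.GetMaxByteCount(text.Length),
                encoding.GetByteCount(text),
                EncodeWithMaxBuffer(encoding, text).Length);
        }
    }
}
```

GetMaxByteCount is cheap because it never looks at the data, while GetByteCount has to scan the string. Treat the former as an upper bound only, never as the length of the encoded result.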

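The code below is a simple use of Encoding. Print the results and compare them with what the previous post covered; you should learn something from it.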

        static void Output(Encoding encoding,string t)
        {
            Console.WriteLine(encoding.ToString());
            byte[] buffer = encoding.GetBytes(t);
            foreach (byte b in buffer)
            {
                Console.Write(b + "-");
            }
            string s = encoding.GetString(buffer);
            Console.WriteLine(s);
        }

            string strTest = "test我鎔a有κ";
            Console.WriteLine(strTest);
            Output(Encoding.GetEncoding("gb18030"), strTest);
            Output(Encoding.Default, strTest);
            Output(Encoding.UTF32, strTest);
            Output(Encoding.UTF8, strTest);
            Output(Encoding.Unicode, strTest);
            Output(Encoding.ASCII, strTest);
            Output(Encoding.UTF7, strTest);

About the BOM

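BOM stands for Byte Order Mark. It is a short byte sequence at the start of a text that marks the byte order and, in practice, identifies which Unicode encoding the text uses. For example, when Notepad opens a file that contains a BOM, it can tell which encoding was used, decode accordingly, and show the text without garbling. Without a BOM, Notepad has traditionally fallen back to ANSI, which can produce garbled text. You can call GetPreamble() on an Encoding to check whether it has a BOM. Currently only the following five Encodings in the CLR have one.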

UTF-8: EF BB BF

UTF-16 big endian: FE FF

UTF-16 little endian: FF FE

UTF-32 big endian: 00 00 FE FF

UTF-32 little endian: FF FE 00 00
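The list above can be reproduced by asking each Encoding for its preamble. This is a minimal sketch; PreambleDemo and DescribePreamble are names I made up for the example.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats an encoding's BOM as hex (e.g. "EF-BB-BF"),
    // or "(none)" when GetPreamble() returns an empty array.
    public static string DescribePreamble(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
    }

    public static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.UTF8,
            Encoding.BigEndianUnicode,       // UTF-16 big endian
            Encoding.Unicode,                // UTF-16 little endian
            new UTF32Encoding(true, true),   // UTF-32 big endian, with BOM
            Encoding.UTF32,                  // UTF-32 little endian
            Encoding.ASCII                   // has no BOM
        };

        foreach (Encoding encoding in encodings)
        {
            Console.WriteLine("{0,-10} {1}", encoding.WebName, DescribePreamble(encoding));
        }
    }
}
```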

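The Encodings returned by the static properties Encoding.Unicode, Encoding.UTF8 and Encoding.UTF32 all carry a BOM by default. If you are writing text where a BOM is unwanted (for example an XML file, where some consumers show garbage characters when a BOM is present), you have to construct the encoding instances yourself: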

Encoding encodingUTF16=new UnicodeEncoding(false, false);// the second argument (byteOrderMark) must be false

Encoding encodingUTF8=new UTF8Encoding(false);

Encoding encodingUTF32=new UTF32Encoding(false,false);// the second argument (byteOrderMark) must be false
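To see the difference these constructors make, here is a minimal sketch that writes the same text through a StreamWriter into memory and dumps the raw bytes. BomWriteDemo and WriteToBytes are hypothetical names for this example. Note that the BOM is emitted by the writer, which calls GetPreamble(), and not by GetBytes().

```csharp
using System;
using System.IO;
using System.Text;

public static class BomWriteDemo
{
    // Writes text through a StreamWriter into memory and returns the raw bytes.
    public static byte[] WriteToBytes(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }   // disposing the writer flushes it and also closes the stream

            // ToArray() still works on a closed MemoryStream.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // Static property: the writer emits the BOM first.
        Console.WriteLine(BitConverter.ToString(WriteToBytes("a", Encoding.UTF8)));

        // Instance constructed without a BOM: only the character itself.
        Console.WriteLine(BitConverter.ToString(WriteToBytes("a", new UTF8Encoding(false))));

        // GetBytes never includes the BOM, whichever instance is used.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("a")));
    }
}
```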

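For how reading and writing text interacts with the BOM, see this post on cnblogs, which covers it in detail, so I won't repeat it here: .NET(C#)：字符編碼(Encoding)和字節順序標記(BOM)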

Determining the encoding of a text file

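Given a text file whose encoding we don't know, how do we choose an Encoding to decode it? The answer is to use the BOM to work out which Unicode encoding it is. Without a BOM it is hard to say, and the right choice depends on where the file came from. The usual fallback is Encoding.Default, which returns different values depending on the current settings of your machine (on .NET Framework it is the system ANSI code page; on .NET Core and later it is always UTF-8). If the file comes from a friend abroad, you are better off decoding it as UTF-8. The code below cannot guarantee a correct result when the file has no BOM, so be very careful about that if you use it in your own project.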

        // Requires: using System.IO; using System.Text;

        /// <summary>
        /// Returns the encoding of a text file based on its BOM (byte order mark).
        /// Returns Encoding.Default if no Unicode BOM is found.
        /// </summary>
        /// <param name="fileName">Path of the file to inspect.</param>
        /// <returns>The encoding whose preamble matches the start of the file.</returns>
        public static Encoding GetFileEncoding(string fileName)
        {
            // UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
            // begins with the UTF-16 LE BOM (FF FE).
            Encoding[] unicodeEncodings =
            {
                Encoding.UTF32,                 // FF FE 00 00
                new UTF32Encoding(true, true),  // 00 00 FE FF
                Encoding.UTF8,                  // EF BB BF
                Encoding.BigEndianUnicode,      // FE FF
                Encoding.Unicode                // FF FE
            };

            // The using statement closes the stream even if an IOException propagates.
            using (FileStream fs = File.OpenRead(fileName))
            {
                foreach (Encoding candidate in unicodeEncodings)
                {
                    fs.Position = 0;
                    byte[] preamble = candidate.GetPreamble();
                    bool preamblesAreEqual = true;
                    // ReadByte() returns -1 at end of file, which never equals a preamble byte.
                    for (int j = 0; preamblesAreEqual && j < preamble.Length; j++)
                    {
                        preamblesAreEqual = preamble[j] == fs.ReadByte();
                    }
                    if (preamblesAreEqual)
                    {
                        return candidate;
                    }
                }
            }

            // No BOM found: this is only a guess and may be wrong.
            return Encoding.Default;
        }
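As an alternative to comparing preambles by hand, StreamReader can do the BOM check itself when its detectEncodingFromByteOrderMarks argument is true. The sketch below is a minimal illustration; StreamReaderDetectDemo and DetectWithStreamReader are names invented for the example. It has the same limitation as the code above: when the file has no BOM, you simply get back the fallback encoding you passed in.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderDetectDemo
{
    // Lets StreamReader inspect the BOM; returns the fallback when there is none.
    public static Encoding DetectWithStreamReader(string fileName, Encoding fallback)
    {
        using (var reader = new StreamReader(fileName, fallback, true))
        {
            // Detection only happens on the first read, so force one.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // File.WriteAllText emits the encoding's preamble (FF FE here).
            File.WriteAllText(path, "hello", Encoding.Unicode);
            Console.WriteLine(DetectWithStreamReader(path, Encoding.Default).WebName);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```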

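To be continued...

The next part mainly covers Encoder and Decoder.

By the way, the article looked quite nice in the blog editor, so why did so much of the formatting disappear in the preview? It looks awful.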
