RapidJSON Code Analysis (3): Unicode Encoding and Decoding

Reposted article; posted 2015-06-03 17:34.

8.1 Character Encoding

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

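RapidJSON aims to support the commonly used UTF encodings as far as possible. In a little over 400 lines of code it implements five Unicode encoders/decoders, plus an ASCII encoding. This article briefly describes how they are implemented.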

(The illustration is The Tower of Babel by Pieter Bruegel the Elder.)

A Review of Unicode, UTF, and C++

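Unicode is a standard for handling most of the world's writing systems. Before Unicode, each language used its own encodings: English mainly used ASCII, Chinese mainly used GB 2312 and Big5, Japanese mainly used JIS, and so on. This caused many inconveniences; for example, it was hard to mix text from several languages in a single message.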

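Unicode defines the Universal Coded Character Set (UCS), in which every character maps to an integer code point in the range 0 to 0x10FFFF. There are different ways to store these code points, known as Unicode Transformation Formats (UTF). The UTFs in common use today are UTF-8, UTF-16, and UTF-32. Each UTF stores a code point as one or more code units: the code unit of UTF-8 is an 8-bit byte, that of UTF-16 is 16 bits, and that of UTF-32 is 32 bits. UTF-32 is fixed-length, whereas UTF-8 and UTF-16 are variable-length encodings.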

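UTF-8 has become the most popular format on the Internet today, for several reasons:

Its code unit is the byte, so it has no endianness problem.
Every ASCII character takes only one byte to store.
If a program already stores characters as bytes, in theory it can handle UTF-8 data without special changes.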

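So why do we still need special handling when JSON is processed as UTF-8? Because a JSON string can contain escape sequences of the form \uXXXX. For example, the JSON ["\u20AC"] is an array containing one string which, once unescaped, is the euro sign "€". In JSON, each \uXXXX escape denotes one UTF-16 code unit. JSON also supports UTF-16 surrogate pairs; for example, the G clef (U+1D11E) can be written as "\uD834\uDD1E". So even for UTF-8 JSON, we need to do decoding and encoding work when parsing JSON strings.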
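The arithmetic behind a surrogate pair can be sketched as follows. This is a minimal illustration; `CombineSurrogates` is a hypothetical helper name and is not part of RapidJSON.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not RapidJSON API): combine the two UTF-16 code units
// of a "\uXXXX\uXXXX" escape pair into a single Unicode code point.
inline std::uint32_t CombineSurrogates(std::uint16_t high, std::uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);  // high (lead) surrogate
    assert(low  >= 0xDC00 && low  <= 0xDFFF);  // low (trail) surrogate
    // Each surrogate carries 10 bits; the pair encodes (code point - 0x10000).
    return ((static_cast<std::uint32_t>(high - 0xD800) << 10)
            | static_cast<std::uint32_t>(low - 0xDC00)) + 0x10000;
}
```

For the G clef, the pair 0xD834, 0xDD1E carries the 10-bit values 0x34 and 0x11E, which combine to 0xD11E; adding 0x10000 gives U+1D11E.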

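Although Unicode dates from the 1990s, C++ only gained reasonably good support for it in C++11. To support C++03, RapidJSON has to implement its own set of encoders/decoders.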

Encoding

RapidJSON's Encoding concept looks like this (not real C++ code):

concept Encoding {
    typename Ch;    //! Type of character. A "character" is actually a code unit in unicode's definition.

    enum { supportUnicode = 1 }; // or 0 if not supporting unicode

    //! \brief Encode a Unicode codepoint to an output stream.
    //! \param os Output stream.
    //! \param codepoint A Unicode codepoint, ranging from 0x0 to 0x10FFFF inclusive.
    template<typename OutputStream>
    static void Encode(OutputStream& os, unsigned codepoint);

    //! \brief Decode a Unicode codepoint from an input stream.
    //! \param is Input stream.
    //! \param codepoint Output of the unicode codepoint.
    //! \return true if a valid codepoint can be decoded from the stream.
    template <typename InputStream>
    static bool Decode(InputStream& is, unsigned* codepoint);

    //! \brief Validate one Unicode codepoint from an encoded stream.
    //! \param is Input stream to obtain codepoint.
    //! \param os Output for copying one codepoint.
    //! \return true if it is valid.
    //! \note This function only validates and copies the codepoint without actually decoding it.
    template <typename InputStream, typename OutputStream>
    static bool Validate(InputStream& is, OutputStream& os);

    // The following functions deal with byte streams.

    //! Take a character from input byte stream, skip BOM if it exists.
    template <typename InputByteStream>
    static Ch TakeBOM(InputByteStream& is);

    //! Take a character from input byte stream.
    template <typename InputByteStream>
    static Ch Take(InputByteStream& is);

    //! Put BOM to output byte stream.
    template <typename OutputByteStream>
    static void PutBOM(OutputByteStream& os);

    //! Put a character to output byte stream.
    template <typename OutputByteStream>
    static void Put(OutputByteStream& os, Ch c);
};

Because C++ can use different types as the character type, such as char, wchar_t, char16_t (C++11), and char32_t (C++11), a class that implements this Encoding concept must define a Ch type.

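The most important functions here are Encode() and Decode(): the former encodes a code point to an output stream, the latter decodes a code point from an input stream. Validate() only checks that the encoding is correct and copies it to the target stream, without decoding. For example, the UTF-16 encoder/decoder is implemented as: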

template<typename CharType = wchar_t>
struct UTF16 {
    typedef CharType Ch;
    RAPIDJSON_STATIC_ASSERT(sizeof(Ch) >= 2);

    enum { supportUnicode = 1 };

    template<typename OutputStream>
    static void Encode(OutputStream& os, unsigned codepoint) {
        RAPIDJSON_STATIC_ASSERT(sizeof(typename OutputStream::Ch) >= 2);
        if (codepoint <= 0xFFFF) {
            RAPIDJSON_ASSERT(codepoint < 0xD800 || codepoint > 0xDFFF); // Code point itself cannot be surrogate pair 
            os.Put(static_cast<typename OutputStream::Ch>(codepoint));
        }
        else {
            RAPIDJSON_ASSERT(codepoint <= 0x10FFFF);
            unsigned v = codepoint - 0x10000;
            os.Put(static_cast<typename OutputStream::Ch>((v >> 10) | 0xD800));
            os.Put(static_cast<typename OutputStream::Ch>((v & 0x3FF) | 0xDC00));
        }
    }

    template <typename InputStream>
    static bool Decode(InputStream& is, unsigned* codepoint) {
        RAPIDJSON_STATIC_ASSERT(sizeof(typename InputStream::Ch) >= 2);
        typename InputStream::Ch c = is.Take();
        if (c < 0xD800 || c > 0xDFFF) {
            *codepoint = c;
            return true;
        }
        else if (c <= 0xDBFF) {
            *codepoint = (c & 0x3FF) << 10;
            c = is.Take();
            *codepoint |= (c & 0x3FF);
            *codepoint += 0x10000;
            return c >= 0xDC00 && c <= 0xDFFF;
        }
        return false;
    }

    // ...
};
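To try the same surrogate logic outside RapidJSON, here is a minimal self-contained sketch. The stream types `U16Out` and `U16In` and the free functions `EncodeUtf16` and `DecodeUtf16` are hypothetical names made up for this example; they only mimic the Put()/Take() interface used above. Unlike the listing, this version checks the low surrogate before combining.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal streams, only for this sketch.
struct U16Out {
    typedef char16_t Ch;
    std::vector<Ch> buf;
    void Put(Ch c) { buf.push_back(c); }
};

struct U16In {
    typedef char16_t Ch;
    const std::vector<Ch>& buf;
    std::size_t pos;
    explicit U16In(const std::vector<Ch>& b) : buf(b), pos(0) {}
    Ch Take() { return pos < buf.size() ? buf[pos++] : Ch(0); }  // 0 at end of input
};

// Encode one code point as one or two UTF-16 code units.
inline void EncodeUtf16(U16Out& os, unsigned cp) {
    assert(cp <= 0x10FFFF);
    if (cp <= 0xFFFF) {
        assert(cp < 0xD800 || cp > 0xDFFF);  // a lone surrogate is not a code point
        os.Put(static_cast<U16Out::Ch>(cp));
    } else {
        unsigned v = cp - 0x10000;                               // 20 bits remain
        os.Put(static_cast<U16Out::Ch>((v >> 10) | 0xD800));    // high 10 bits
        os.Put(static_cast<U16Out::Ch>((v & 0x3FF) | 0xDC00));  // low 10 bits
    }
}

// Decode one code point; returns false on an unpaired surrogate.
inline bool DecodeUtf16(U16In& is, unsigned* cp) {
    unsigned c = is.Take();
    if (c < 0xD800 || c > 0xDFFF) { *cp = c; return true; }
    if (c > 0xDBFF) return false;                 // low surrogate without a high one
    unsigned low = is.Take();
    if (low < 0xDC00 || low > 0xDFFF) return false;
    *cp = (((c & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
    return true;
}
```

Encoding U+1D11E into a `U16Out` yields the two code units 0xD834 and 0xDD1E, and decoding them from a `U16In` over the same buffer returns the original code point.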

Transcoding

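RapidJSON's parser can read JSON in one encoding and transcode it into another. For example, we can parse a UTF-8 JSON file into a UTF-16 DOM. A class that does this transcoding can be written as: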

template<typename SourceEncoding, typename TargetEncoding>
struct Transcoder {
    //! Take one Unicode codepoint from source encoding, convert it to target encoding and put it to the output stream.
    template<typename InputStream, typename OutputStream>
    RAPIDJSON_FORCEINLINE static bool Transcode(InputStream& is, OutputStream& os) {
        unsigned codepoint;
        if (!SourceEncoding::Decode(is, &codepoint))
            return false;
        TargetEncoding::Encode(os, codepoint);
        return true;
    }

    // ...
};

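This code is very simple: it decodes one code point from the input stream and, if decoding succeeds, encodes it and writes it to the output stream. But if the source and target encodings are the same, isn't that wasted work? C++ partial template specialization lets us avoid it: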

//! Specialization of Transcoder with same source and target encoding.
template<typename Encoding>
struct Transcoder<Encoding, Encoding> {
    template<typename InputStream, typename OutputStream>
    RAPIDJSON_FORCEINLINE static bool Transcode(InputStream& is, OutputStream& os) {
        os.Put(is.Take());  // Just copy one code unit. This semantic is different from primary template class.
        return true;
    }

    // ...
};

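So when no transcoding is needed, only a single code unit is copied. Zero overhead! This is why Transcoder is used for encoding conversion both when parsing and when generating JSON.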
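The effect of the specialization can be observed with toy encodings. The following is a sketch under stated assumptions: `ToyASCII`, `ToyUTF32`, `ToyTranscoder`, `InStream`, and `OutStream` are hypothetical stand-ins written for this example, not RapidJSON classes. It also shows the semantic difference mentioned in the comment above: the same-encoding path copies a code unit without validating it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical minimal streams over std::basic_string.
template <typename CharT>
struct InStream {
    typedef CharT Ch;
    const std::basic_string<CharT>& s;
    std::size_t pos;
    explicit InStream(const std::basic_string<CharT>& str) : s(str), pos(0) {}
    Ch Take() { return pos < s.size() ? s[pos++] : Ch(0); }
};

template <typename CharT>
struct OutStream {
    typedef CharT Ch;
    std::basic_string<CharT> s;
    void Put(Ch c) { s.push_back(c); }
};

// Toy encodings: just enough of the Encoding concept for this demo.
struct ToyASCII {
    typedef char Ch;
    template <typename OS>
    static void Encode(OS& os, unsigned cp) { os.Put(static_cast<typename OS::Ch>(cp & 0x7F)); }
    template <typename IS>
    static bool Decode(IS& is, unsigned* cp) {
        unsigned char c = static_cast<unsigned char>(is.Take());
        *cp = c;
        return c <= 0x7F;  // bytes above 0x7F are not ASCII
    }
};

struct ToyUTF32 {
    typedef char32_t Ch;
    template <typename OS>
    static void Encode(OS& os, unsigned cp) { os.Put(static_cast<typename OS::Ch>(cp)); }
    template <typename IS>
    static bool Decode(IS& is, unsigned* cp) {
        *cp = static_cast<unsigned>(is.Take());
        return *cp <= 0x10FFFF;
    }
};

// Primary template: decode one code point, then re-encode it.
template <typename Source, typename Target>
struct ToyTranscoder {
    template <typename IS, typename OS>
    static bool Transcode(IS& is, OS& os) {
        unsigned cp;
        if (!Source::Decode(is, &cp)) return false;
        Target::Encode(os, cp);
        return true;
    }
};

// Partial specialization: same encoding on both sides, copy one code unit.
template <typename Encoding>
struct ToyTranscoder<Encoding, Encoding> {
    template <typename IS, typename OS>
    static bool Transcode(IS& is, OS& os) {
        os.Put(is.Take());  // no decoding, hence no validation either
        return true;
    }
};
```

With the byte 0x80 as input, `ToyTranscoder<ToyASCII, ToyUTF32>::Transcode` returns false because decoding fails, while `ToyTranscoder<ToyASCII, ToyASCII>::Transcode` copies the byte through unchanged. The compiler picks the specialization at compile time, so the copy path contains no decode or encode call at all.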

UTF-8 Decoding and the DFA

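In UTF-8, a code point may be encoded as 1 to 4 code units (bytes), so decoding it is more involved. RapidJSON follows Hoehrmann's implementation and decodes with a deterministic finite automaton (DFA). The UTF-8 decoding process can be represented by the following DFA:

(Figure: DFA of the UTF-8 decoding process)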

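In it, each transition represents a range of code units (bytes) encountered in the input stream. The diagram omits the invalid ranges, all of which lead to an error state.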
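The byte ranges that such a DFA accepts can be sketched without a transition table. The function below is a simplified illustration, not RapidJSON's or Hoehrmann's table-driven code; `DecodeUtf8` is a hypothetical name. It tracks the same information a DFA state would: how many continuation bytes remain and which range the next byte must fall in.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decode one code point from p[0..n).
// Returns the number of bytes consumed, or 0 if the input is malformed or
// truncated (in which case *cp may hold a partial value).
inline std::size_t DecodeUtf8(const unsigned char* p, std::size_t n, std::uint32_t* cp) {
    if (n == 0) return 0;
    const unsigned char b0 = p[0];
    if (b0 <= 0x7F) { *cp = b0; return 1; }      // single-byte (ASCII)

    std::size_t len;                              // total sequence length
    unsigned char lo = 0x80, hi = 0xBF;           // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2; *cp = b0 & 0x1F;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3; *cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;                // reject overlong 3-byte forms
        if (b0 == 0xED) hi = 0x9F;                // reject surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4; *cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;                // reject overlong 4-byte forms
        if (b0 == 0xF4) hi = 0x8F;                // reject values above U+10FFFF
    } else {
        return 0;                                 // 80..C1 and F5..FF never start a sequence
    }
    if (n < len) return 0;                        // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = p[i];
        if (b < lo || b > hi) return 0;           // transition to the error state
        *cp = (*cp << 6) | (b & 0x3F);            // append 6 payload bits
        lo = 0x80; hi = 0xBF;                     // later bytes use the plain range
    }
    return len;
}
```

The special second-byte ranges after E0, ED, F0, and F4 are what make the automaton more than a simple length counter: they exclude overlong encodings, surrogates, and code points beyond 0x10FFFF.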

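I originally wanted to analyze the "optimizations" in RapidJSON's implementation in detail in this article. However, the test results obtained on Windows a few years ago and the recent results on a Mac differ greatly, so that discussion will have to wait for further analysis.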

AutoUTF

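Sometimes we cannot decide at compile time which encoding a JSON document uses, yet all of the implementations above are bound at compile time through template types. So RapidJSON later added an encoding type that is bound dynamically at run time, called AutoUTF. It is called "auto" because it can also detect the byte-order mark (BOM): if the input stream has a BOM, the appropriate decoder is selected automatically. Because the binding happens at run time, however, one more level of indirection is required. RapidJSON uses an array of function pointers for this indirection layer.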
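The two ideas, BOM detection and dispatch through an array of function pointers, can be sketched as follows. This is an illustration only: the enum `ToyUTFType` and the functions `DetectBOM` and `TakeUnit` are hypothetical names, and the code does not reproduce RapidJSON's actual AutoUTF classes.

```cpp
#include <cstddef>

// Hypothetical run-time encoding tag; the values index the table below.
enum ToyUTFType { kToyUTF8 = 0, kToyUTF16LE, kToyUTF16BE, kToyUTF32LE, kToyUTF32BE };

// Detect a BOM at the start of p[0..n). Sets *bomSize to the number of bytes
// to skip and returns the detected type, or `fallback` if there is no BOM.
inline ToyUTFType DetectBOM(const unsigned char* p, std::size_t n,
                            ToyUTFType fallback, std::size_t* bomSize) {
    // UTF-32 LE must be tested before UTF-16 LE: both begin with FF FE.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) { *bomSize = 4; return kToyUTF32LE; }
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) { *bomSize = 4; return kToyUTF32BE; }
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) { *bomSize = 3; return kToyUTF8; }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) { *bomSize = 2; return kToyUTF16LE; }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) { *bomSize = 2; return kToyUTF16BE; }
    *bomSize = 0;
    return fallback;
}

// One "read a code unit" function per encoding, all with the same signature.
typedef unsigned (*TakeFunc)(const unsigned char* p);
inline unsigned TakeU8(const unsigned char* p)    { return p[0]; }
inline unsigned TakeU16LE(const unsigned char* p) { return p[0] | (static_cast<unsigned>(p[1]) << 8); }
inline unsigned TakeU16BE(const unsigned char* p) { return p[1] | (static_cast<unsigned>(p[0]) << 8); }
inline unsigned TakeU32LE(const unsigned char* p) {
    return p[0] | (static_cast<unsigned>(p[1]) << 8) | (static_cast<unsigned>(p[2]) << 16) | (static_cast<unsigned>(p[3]) << 24);
}
inline unsigned TakeU32BE(const unsigned char* p) {
    return p[3] | (static_cast<unsigned>(p[2]) << 8) | (static_cast<unsigned>(p[1]) << 16) | (static_cast<unsigned>(p[0]) << 24);
}

// Run-time binding: index an array of function pointers by the detected type.
inline unsigned TakeUnit(ToyUTFType type, const unsigned char* p) {
    static const TakeFunc table[] = { TakeU8, TakeU16LE, TakeU16BE, TakeU32LE, TakeU32BE };
    return table[type](p);
}
```

The cost of the indirection is one indexed load and one indirect call per operation, which is the extra layer the compile-time template binding avoids.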

ASCII

A user asked for a way to write every non-ASCII character in the \uXXXX escaped form when writing JSON. The solution was to add the ASCII template class:

template<typename CharType = char>
struct ASCII {
    typedef CharType Ch;

    enum { supportUnicode = 0 };

    // ...

    template <typename InputStream>
    static bool Decode(InputStream& is, unsigned* codepoint) {
        unsigned char c = static_cast<unsigned char>(is.Take());
        *codepoint = c;
        return c <= 0x7F;
    }

    // ...
};

By checking supportUnicode, the writer can decide whether to escape when writing JSON. In addition, Decode() checks whether the input falls outside the ASCII range.
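What a writer might emit when the target encoding does not support Unicode can be sketched like this. `AppendEscaped` is a hypothetical helper, not RapidJSON's Writer; it only handles the non-ASCII case and leaves out the escaping of quotes, backslashes, and control characters that a real JSON writer also needs.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: append one code point to a JSON string body, writing
// everything outside ASCII as \uXXXX (a surrogate pair above U+FFFF).
inline void AppendEscaped(std::string& out, unsigned cp) {
    assert(cp <= 0x10FFFF);
    char buf[16];  // "\uXXXX\uXXXX" needs 12 characters plus the terminator
    if (cp <= 0x7F) {
        out.push_back(static_cast<char>(cp));           // ASCII passes through
    } else if (cp <= 0xFFFF) {
        std::snprintf(buf, sizeof buf, "\\u%04X", cp);  // one UTF-16 code unit
        out += buf;
    } else {
        const unsigned v = cp - 0x10000;                // split into a surrogate pair
        std::snprintf(buf, sizeof buf, "\\u%04X\\u%04X",
                      (v >> 10) | 0xD800, (v & 0x3FF) | 0xDC00);
        out += buf;
    }
}
```

Appending 'A', U+20AC, and U+1D11E in turn produces the text A\u20AC\uD834\uDD1E, which is plain ASCII and decodes back to the original three characters.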

Summary

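RapidJSON provides built-in Unicode support, including the various UTF formats and transcoding between them. This is something few other JSON libraries do. RapidJSON also handles encoding at the level of input and output streams, which avoids reading the whole JSON document, transcoding it, and only then starting to parse. This design saves memory and should also perform better.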

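Recently, while developing the JSON Schema feature planned for the next version of RapidJSON, I implemented a regular expression engine. The engine also builds on this Encoding framework and gets Unicode support easily; for example, it can match directly against a UTF-8 input stream.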
