The Byte Order Mark Problem in UTF-8 Encoded Files

Reposted article. 2011-12-24 19:13. Category: Basic Knowledge / Encoding

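A couple of days ago a colleague handed me a SQL Server database script file to run, and it failed with a syntax error, yet copying the file's contents into SQL Server Management Studio and running them there worked fine. It was baffling. After a long investigation I found the cause: the BOM (Byte Order Mark) at the start of the UTF-8 encoded file.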

The following is excerpted from Wikipedia:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.^[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

The Unicode Standard does permit the BOM in UTF-8,^[2] but does not require or recommend its use.^[3] Byte order has no meaning in UTF-8^[4] so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.

Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default [citation needed].

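Because Unicode can be encoded in 16-bit or 32-bit units, a computer handling such text needs to know the byte order, and the BOM exists to mark the byte order of a byte stream. Byte order has no meaning in UTF-8, however, so the BOM carries no byte-order information there; at most it identifies the stream as UTF-8. The Unicode Standard nevertheless permits a BOM in UTF-8. When present, it sits at the very start of the file as the three bytes 0xEF, 0xBB, 0xBF.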
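The byte sequences described above can be checked directly. The following is a small Python sketch (Python and the variable names are my choice for illustration, not part of the original article) that encodes U+FEFF in several encodings and shows what the UTF-8 form looks like when misread as CP1252:

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the byte order mark code point

# In UTF-16 the byte order matters, so the BOM's bytes differ by endianness.
utf16_be = BOM.encode("utf-16-be")
utf16_le = BOM.encode("utf-16-le")

# In UTF-8 there is only one possible byte sequence for U+FEFF.
utf8 = BOM.encode("utf-8")

# A viewer that assumes CP1252 renders those three bytes as visible junk.
mojibake = utf8.decode("cp1252")

print(utf16_be.hex(), utf16_le.hex(), utf8.hex())  # feff fffe efbbbf
print(mojibake)                                    # ï»¿
```

The standard library exposes the same constant as codecs.BOM_UTF8, which is convenient when testing whether a file starts with a BOM.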

Using a BOM in UTF-8 is not recommended, yet many Windows programs, Windows Notepad among them, save UTF-8 files with one (that is, they prepend the three bytes 0xEF 0xBB 0xBF to the file).

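For that reason, avoid editing UTF-8 files with Notepad or similar tools. The saved file is still UTF-8, but it is no longer byte-for-byte the UTF-8 file it was before, and whatever consumes the file may then fail on the encoding, exactly as described at the top of this article.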
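To illustrate why a consumer can choke on such a file, here is a minimal Python sketch. The script text "SELECT 1;" is a made-up stand-in, and this shows only how the BOM survives decoding, not how any particular SQL tool parses its input:

```python
import codecs

# Hypothetical script contents, as saved by an editor that prepends a BOM.
raw = codecs.BOM_UTF8 + "SELECT 1;".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as an invisible leading U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
clean = raw.decode("utf-8-sig")

print(repr(plain[0]))              # '\ufeff'
print(plain.startswith("SELECT"))  # False
print(clean.startswith("SELECT"))  # True
```

A parser that receives the first form sees an unexpected character before the first keyword, which is one plausible way to end up with a syntax error on a file that looks correct in an editor.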

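To remove the BOM from a UTF-8 file: in Notepad++, open the Encoding menu and choose Encode in UTF-8 without BOM. Alternatively, delete the first three bytes of the file with any hex editor. Simpler still, clear the BOM flag in Vim (:set nobomb) and save the file.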
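The hex-editor approach can also be scripted. The following Python sketch does the same thing programmatically; the function name strip_utf8_bom is hypothetical, and it assumes the file is small enough to read into memory:

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`, if there is one.

    Returns True when the file was rewritten, False when it had no BOM.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do; leave the file untouched
    # Drop exactly the three BOM bytes and write the rest back unchanged.
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes rather than decoded text keeps the rest of the file byte-for-byte identical, so line endings and any other content are not altered.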
