Why Java Garbles UTF-8 Files That Carry a BOM, and How to Fix It

Reposted article. Originally published 2016-12-29 16:50 under Java SE.

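Cause:

When Windows Notepad saves a txt file as UTF-8, it automatically prepends a byte order mark (BOM) to the very beginning of the file. For UTF-8 the BOM is exactly three bytes: EF BB BF.

Java does not strip the BOM when it decodes such a file, so the first line it reads starts with three extra bytes that are not part of the content (they decode to the single character U+FEFF). This can break otherwise correct programs, for example when the first line is compared against an expected string.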
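The effect can be reproduced without a real file. The following is a minimal sketch, assuming the file content is the UTF-8 BOM followed by the text "key"; the class name `BomDemo` and the helper `readFirstLine` are made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomDemo {
    /** Decodes the bytes as UTF-8 and returns the first line as Java sees it. */
    static String readFirstLine(byte[] data) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample content: UTF-8 BOM (EF BB BF) followed by "key".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'k', 'e', 'y'};
        String line = readFirstLine(data);
        System.out.println(line.length());              // 4, not 3
        System.out.println(line.charAt(0) == '\uFEFF'); // true: the BOM survived decoding
        System.out.println(line.equals("key"));         // false
    }
}
```

The extra character is invisible when printed, which is why the problem usually shows up as a failed comparison or lookup rather than as obviously wrong text.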

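Solutions:

The following fixes, found online, do work:

1. Use UltraEdit to save the txt file again as UTF-8 without BOM.

2. Open the txt file in Notepad++, choose "Format --> Encode in UTF-8 without BOM", and save the file.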
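When there are many files, the same "save without BOM" step can be scripted instead of done by hand in an editor. This is only a sketch under the assumption that the files are small enough to load into memory; `BomStripper`, `stripUtf8Bom`, and `rewriteWithoutBom` are hypothetical names, not part of any library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BomStripper {
    /** Returns the data without a leading UTF-8 BOM, or the same array if there is none. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Rewrites the file in place only when a BOM was actually found. */
    static void rewriteWithoutBom(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] stripped = stripUtf8Bom(original);
        if (stripped != original) {
            Files.write(file, stripped);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temporary file containing BOM + "hi".
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
        rewriteWithoutBom(tmp);
        System.out.println(Files.size(tmp)); // 2: only "hi" remains
        Files.delete(tmp);
    }
}
```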

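Drawbacks:

These fixes have a weakness, though: they put the burden on whoever produces the files. If the files come from many different people, all sorts of problems are bound to appear. Ultimately this comes down to what is arguably a JDK bug: its UTF-8 decoder does not skip a leading BOM.

Is there a way to solve this once and for all? Yes: handle it in our own program. Follow along!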

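Final solution:

(1) Add the following utility class to the project. Note that it is not part of the JDK's standard library; it is a workaround class that has to be copied into your own source tree: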

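import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * Wraps an InputStream, detects a leading Unicode BOM, and skips it.
 */
public class UnicodeInputStream extends InputStream {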
       PushbackInputStream internalIn;
       boolean             isInited = false;
       String              defaultEnc;
       String              encoding;

       private static final int BOM_SIZE = 4;

       public UnicodeInputStream(InputStream in, String defaultEnc) {
           internalIn = new PushbackInputStream(in, BOM_SIZE);
           this.defaultEnc = defaultEnc;
       }

       public String getDefaultEncoding() {
          return defaultEnc;
       }

       public String getEncoding() {
          if (!isInited) {
             try {
                init();
             } catch (IOException ex) {
                // Chain the underlying IOException as the cause.
                throw new IllegalStateException("Init method failed.", ex);
             }
          }
          return encoding;
       }

       /**
        * Read-ahead four bytes and check for BOM marks. Extra bytes are
        * unread back to the stream, only BOM bytes are skipped.
        */
       protected void init() throws IOException {
          if (isInited) return;

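          byte bom[] = new byte[BOM_SIZE];
          int n = 0, unread;
          // A single read() may return fewer bytes than requested, so keep
          // reading until the buffer is full or the stream ends.
          while (n < bom.length) {
             int r = internalIn.read(bom, n, bom.length - n);
             if (r < 0) break;
             n += r;
          }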

          // Compare only the bytes that were actually read (n), so a short
          // stream cannot be mistaken for a longer BOM.
          if ( (n >= 4) && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
                      (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
             encoding = "UTF-32BE";
             unread = n - 4;
          } else if ( (n >= 4) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                      (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
             encoding = "UTF-32LE";
             unread = n - 4;
          } else if ( (n >= 3) && (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                      (bom[2] == (byte)0xBF) ) {
             encoding = "UTF-8";
             unread = n - 3;
          } else if ( (n >= 2) && (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
             encoding = "UTF-16BE";
             unread = n - 2;
          } else if ( (n >= 2) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
             encoding = "UTF-16LE";
             unread = n - 2;
          } else {
             // Unicode BOM mark not found, unread all bytes
             encoding = defaultEnc;
             unread = n;
          }
          //System.out.println("read=" + n + ", unread=" + unread);

          if (unread > 0) internalIn.unread(bom, (n - unread), unread);

          isInited = true;
       }

       public void close() throws IOException {
          // Nothing is left to detect once the stream is closed.
          isInited = true;
          internalIn.close();
       }

       public int read() throws IOException {
          // Make sure a BOM is skipped even if getEncoding() was never called.
          init();
          return internalIn.read();
       }
   }

(2) Read the file with code like the following. In my case the file is a remote file on a server; for a local file, use the File class instead.

       import java.io.BufferedReader;
       import java.io.InputStreamReader;
       import java.net.URL;

       URL url = new URL("http://****/***/test.txt");
       // File f = new File("test.txt");

       String enc = null; // encoding to fall back on when no BOM is found; null means the platform default
       // For a local file, replace url.openStream() with new FileInputStream(f).
       UnicodeInputStream uin = new UnicodeInputStream(url.openStream(), enc);
       enc = uin.getEncoding(); // check for and skip any BOM bytes
       InputStreamReader in;
       if (enc == null) {
          in = new InputStreamReader(uin);
       } else {
          in = new InputStreamReader(uin, enc);
       }
       BufferedReader reader = new BufferedReader(in);
       // BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("D:/tags.txt"), "utf-8"));
       String tmp = reader.readLine();

Read this way, the result comes out correctly. If you run into any problems, feel free to leave a comment!
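If the files are known to be UTF-8 and only the BOM is the problem, a lighter alternative to wrapping the stream is to decode as usual and drop a leading U+FEFF from the first line. This is a sketch rather than part of the original approach; `FirstLineBom` and `stripLeadingBom` are hypothetical names chosen for the example.

```java
public class FirstLineBom {
    /** Removes one leading U+FEFF (the decoded BOM) if present; otherwise returns the input unchanged. */
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        // Assumed first line as decoded from a UTF-8 file that starts with a BOM.
        String firstLine = "\uFEFFkey=value";
        System.out.println(stripLeadingBom(firstLine)); // key=value
    }
}
```

This only covers the UTF-8 case. Unlike the stream wrapper, it cannot tell which encoding the file uses, so it is not a replacement when UTF-16 or UTF-32 input is possible.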
