Automatically detecting a file's character encoding in Java


Mozilla has an automatic charset detection algorithm written in C++, and someone on SourceForge has ported it to Java.

Home page: http://jchardet.sourceforge.net/

jchardet is a java port of the source from mozilla's automatic charset detection algorithm.
The original author is Frank Tang. What is available here is the java port of that code.
The original source in C++ can be found from http://lxr.mozilla.org/mozilla/source/intl/chardet/
More information can be found at http://www.mozilla.org/projects/intl/chardet.html

Now for the moment of truth:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

public class FileCharsetDetector {
    private boolean found = false;
    private String encoding = null;

    public static void main(String[] argv) throws Exception {
        File file1 = new File("C:\\test1.txt");
        
        System.out.println("File encoding: " + new FileCharsetDetector().guessFileEncoding(file1));
    }

    /**
     * Guesses the encoding of the given file.
     *
     * @param file
     *            the File to inspect
     * @return the file's encoding, or null if none could be determined
     * @throws FileNotFoundException
     * @throws IOException
     */
    public String guessFileEncoding(File file) throws FileNotFoundException, IOException {
        return guessFileEncoding(file, new nsDetector());
    }

    /**
     * Guesses the encoding of the given file, using a language hint.
     *
     * @param file
     *            the File to inspect
     * @param languageHint
     *            language hint code (see nsPSMDetector); valid values:
     *            <pre>
     *             1 : Japanese
     *             2 : Chinese
     *             3 : Simplified Chinese
     *             4 : Traditional Chinese
     *             5 : Korean
     *             6 : Don't know (default)
     *            </pre>
     * @return the file's encoding, e.g. UTF-8, GBK or GB2312; when the detector is not
     *         certain, a comma-separated list of probable encodings; null if none
     * @throws FileNotFoundException
     * @throws IOException
     */
    public String guessFileEncoding(File file, int languageHint) throws FileNotFoundException, IOException {
        return guessFileEncoding(file, new nsDetector(languageHint));
    }

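    /**
     * Guesses the encoding of the given file with the supplied detector.
     *
     * @param file
     *            the File to inspect
     * @param det
     *            the nsDetector that the file's bytes are fed to
     * @return the detected encoding, a comma-separated list of probable encodings,
     *         or null if none
     * @throws FileNotFoundException
     * @throws IOException
     */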
    private String guessFileEncoding(File file, nsDetector det) throws FileNotFoundException, IOException {
        // Reset state so the same instance can be reused for several files.
        found = false;
        encoding = null;

        // Set an observer...
        // Notify() will be called when a matching charset is found.
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                encoding = charset;
                found = true;
            }
        });

        boolean done = false;
        boolean isAscii = true;

        // try-with-resources closes the stream even if read() throws.
        try (BufferedInputStream imp = new BufferedInputStream(new FileInputStream(file))) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Keep checking for pure ASCII until the first non-ASCII chunk;
                // a single ASCII chunk does not prove the whole file is ASCII.
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // DoIt if non-ascii and not done yet.
                if (!isAscii) {
                    done = det.DoIt(buf, len, false);
                    if (done) {
                        break;
                    }
                }
            }
        }
        det.DataEnd();

        if (isAscii) {
            encoding = "ASCII";
            found = true;
        }

        if (!found) {
            String[] prob = det.getProbableCharsets();
            if (prob.length == 0) {
                return null;
            }
            // No definite match: return the probable charsets as a comma-separated list.
            // (Returning only the first one, prob[0], would also be reasonable.)
            StringBuilder sb = new StringBuilder(prob[0]);
            for (int i = 1; i < prob.length; i++) {
                sb.append(',').append(prob[i]);
            }
            encoding = sb.toString();
        }
        return encoding;
    }
}

The above is a demo for guessing a file's encoding. In my own tests the results were fairly reliable.
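Because guessFileEncoding may return either a single name or a comma-separated list of candidates (or null), the caller still has to turn that string into something usable. The following is only a sketch of one way to do it with the JDK alone: the class name DetectedCharsets and the method resolve are hypothetical helpers, not part of jchardet, and the sample guess strings in main are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper (not part of jchardet): maps a detector guess to a Charset. */
final class DetectedCharsets {
    private DetectedCharsets() {
    }

    /**
     * Returns the first candidate in the guess that this JVM supports.
     *
     * @param guess    detector result such as "UTF-8" or "GB2312,Big5"; may be null
     * @param fallback charset to use when no candidate is usable
     */
    static Charset resolve(String guess, Charset fallback) {
        if (guess == null) {
            return fallback;
        }
        for (String name : guess.split(",")) {
            String candidate = name.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            // The demo above reports plain ASCII files as "ASCII".
            if ("ASCII".equalsIgnoreCase(candidate)) {
                return StandardCharsets.US_ASCII;
            }
            try {
                return Charset.forName(candidate);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Not a charset this JVM knows; try the next candidate.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Illustrative guess strings only; real values come from the detector.
        System.out.println(resolve("UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(resolve("no-such-charset,UTF-16", StandardCharsets.ISO_8859_1));
        System.out.println(resolve(null, StandardCharsets.UTF_8));
    }
}
```

The resulting Charset can then be passed to, for example, new InputStreamReader(new FileInputStream(file), charset) to read the file as text. Falling back to a caller-chosen charset rather than throwing is a design choice of this sketch; stricter callers may prefer to fail when nothing matches.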

The home page mentioned above also has an HtmlCharsetDetector demo; take a look if you're interested.
