Fixing garbled text in web crawlers with juniversalchardet


 

 

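Web crawlers often run into garbled text (mojibake). The simplest remedy is to read the encoding from the HTTP response headers. But if the site's response carries no encoding information, or the information is wrong, the text the crawler fetches will most likely be garbled.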
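As a rough illustration of the header-based approach, the sketch below pulls the charset parameter out of a Content-Type header value. The class and method names (ContentTypeCharset, charsetOf) are made up for this example, and the parsing is deliberately simplified; real HTTP clients ship their own, more complete parsers.

```java
final class ContentTypeCharset {

    /**
     * Returns the value of the charset parameter in a Content-Type header
     * value such as "text/html; charset=GBK", or null if there is none.
     * Hypothetical helper for illustration only.
     */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK"));
        System.out.println(charsetOf("text/html"));
    }
}
```

When charsetOf returns null, the header says nothing about the encoding, which is exactly the case where content-based detection is needed.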

A better solution is to detect the page's encoding automatically from the page content itself, as Mozilla's Firefox does with its universalchardet encoding detector.

juniversalchardet is the Java port of universalchardet. Its source code is available at https://github.com/thkoch2001/juniversalchardet

Automatic encoding detection relies mainly on statistical methods. For the underlying principles, see http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
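To get a feel for why bytes alone carry evidence about their encoding, here is a small sketch that uses only the JDK. It is not how juniversalchardet works internally; it shows just one simple signal, namely that some byte sequences are illegal in a given encoding and therefore rule it out. Statistical methods go further and rank the candidate encodings that remain. The class and method names (StrictDecodeCheck, isWellFormed) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeCheck {

    /** Returns true if data decodes in the given charset without any error. */
    static boolean isWellFormed(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters encoded as UTF-8.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters as they would appear in a GBK-encoded page.
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        System.out.println(isWellFormed(utf8, StandardCharsets.UTF_8));
        System.out.println(isWellFormed(gbk, StandardCharsets.UTF_8));
    }
}
```

A check like this can only exclude candidates. Many single-byte encodings accept almost any byte sequence, which is why a detector also needs statistics about which characters and character pairs are likely.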

The following key code shows how to use it with httpclient, a library commonly used in Java crawlers:

 
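// myStream: InputStream of the HTTP response body obtained from httpclient.
// buffer: assumed to be httpclient's org.apache.http.util.ByteArrayBuffer.
UniversalDetector encDetector = new UniversalDetector(null);
ByteArrayBuffer buffer = new ByteArrayBuffer(4096);
byte[] tmp = new byte[4096];
int l;
while ((l = myStream.read(tmp)) != -1) {
    buffer.append(tmp, 0, l);              // keep every byte for decoding later
    if (!encDetector.isDone()) {
        encDetector.handleData(tmp, 0, l); // feed the detector until it has decided
    }
}
encDetector.dataEnd();
String encoding = encDetector.getDetectedCharset(); // null if nothing was detected
encDetector.reset();                                // reset so the instance can be reused
if (encoding != null) {
    return new String(buffer.toByteArray(), encoding);
}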

  

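myStream.read(tmp) reads the stream obtained from httpclient. While reading the stream, we feed the same bytes to juniversalchardet so that it can detect the encoding. If the bytes match the characteristics of a known encoding, encDetector.getDetectedCharset() returns its name at the end; otherwise it returns null. Detection is then finished, and the String constructor builds the string using the detected encoding.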
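What to do when getDetectedCharset() returns null is left open above. One possible approach, sketched here with hypothetical names (BodyDecoder, decode) and the JDK only, is to decode with a caller-supplied fallback, for example the charset announced in the HTTP header or UTF-8. Preferring the detected name over the fallback is an assumption made for this sketch, not something the library prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BodyDecoder {

    /**
     * Decodes body with the detected charset name if there is one,
     * otherwise with the given fallback charset.
     * Charset.forName throws UnsupportedCharsetException when the JVM
     * does not know the detected name; this sketch lets that propagate.
     */
    static String decode(byte[] body, String detected, Charset fallback) {
        Charset charset = (detected != null) ? Charset.forName(detected) : fallback;
        return new String(body, charset);
    }

    public static void main(String[] args) {
        byte[] body = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // Detection succeeded: the detected name wins over the fallback.
        System.out.println(decode(body, "UTF-8", StandardCharsets.ISO_8859_1));
        // Detection failed (null): the fallback is used.
        System.out.println(decode(body, null, StandardCharsets.UTF_8));
    }
}
```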



Maven dependency (see http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3):

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

  

 

The original project page, https://code.google.com/archive/p/juniversalchardet/, describes the library as follows.

 

Java port of universalchardet

1. What is it?

juniversalchardet is a Java port of 'universalchardet', which is Mozilla's encoding detector library.

The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

2. Encodings that can be detected

  • Chinese

    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-2312 [1]
  • Cyrillic

    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MACCYRILLIC
    • IBM866
    • IBM855
  • Greek

    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew

    • ISO-8859-8
    • WINDOWS-1255
  • Japanese

    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean

    • ISO-2022-KR
    • EUC-KR
  • Unicode

    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 [1] / X-ISO-10646-UCS-4-2143 [1]
  • Others

    • WINDOWS-1252

[1] Currently not supported by Java
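Because some names the detector can report are not supported by Java, it may be worth checking a detected name before decoding with it. The sketch below uses only the JDK's Charset.isSupported; the class and method names (CharsetSupport, isUsable) are hypothetical, and which names are actually supported depends on the JVM it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetSupport {

    /** Returns true if this JVM can decode text in the named charset. */
    static boolean isUsable(String name) {
        if (name == null) {
            return false; // the detector found nothing
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name is not even syntactically legal
        }
    }

    public static void main(String[] args) {
        // A few of the names from the list above; results depend on the JVM.
        String[] names = {"UTF-8", "BIG5", "HZ-GB-2312", "X-ISO-10646-UCS-4-3412"};
        for (String name : names) {
            System.out.println(name + " usable: " + isUsable(name));
        }
    }
}
```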

3. How to use it

  1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
  2. Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
  3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
  4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
  5. Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

Sample Code

```java
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws java.io.IOException {
    byte[] buf = new byte[4096];
    String fileName = args[0];
    java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

    // (1)
    UniversalDetector detector = new UniversalDetector(null);

    // (2)
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
      detector.handleData(buf, 0, nread);
    }
    // (3)
    detector.dataEnd();

    // (4)
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5)
    detector.reset();
  }
}
```

4. Related Works

jchardet

  • http://jchardet.sourceforge.net/ jchardet is another Java port of Mozilla's encoding detection library. The main difference between jchardet and juniversalchardet is the module each is based on: jchardet is based on the 'chardet' module, which has existed for a long time, while juniversalchardet is based on the newer 'universalchardet' module, which generally gives more accurate detection results.

5. License

The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.

 

