While using Apache Tika to parse web pages I ran into mojibake (garbled-character) problems, so I later took a closer look at the Apache Tika source code.
Let's first look at the UML model of the interfaces and classes involved in Tika's encoding detection.
Below is the encoding detection interface, EncodingDetector.java:
public interface EncodingDetector {

    /**
     * Detects the character encoding of the given text document, or
     * <code>null</code> if the encoding of the document can not be detected.
     * <p>
     * If the document input stream is not available, then the first
     * argument may be <code>null</code>. Otherwise the detector may
     * read bytes from the start of the stream to help in encoding detection.
     * The given stream is guaranteed to support the
     * {@link InputStream#markSupported() mark feature} and the detector
     * is expected to {@link InputStream#mark(int) mark} the stream before
     * reading any bytes from it, and to {@link InputStream#reset() reset}
     * the stream before returning. The stream must not be closed by the
     * detector.
     * <p>
     * The given input metadata is only read, not modified, by the detector.
     *
     * @param input text document input stream, or <code>null</code>
     * @param metadata input metadata for the document
     * @return detected character encoding, or <code>null</code>
     * @throws IOException if the document input stream could not be read
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;

}
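The Javadoc contract above (mark the stream before reading, reset it before returning, never close it) can be sketched with plain JDK streams. The helper below is an illustrative sketch of that contract, not Tika code:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MarkResetDemo {

    // Peek at up to `limit` bytes without consuming the stream, the way the
    // EncodingDetector contract requires: mark first, reset before returning.
    public static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buffer = new byte[limit];
        int n = 0;
        int m = in.read(buffer);
        while (m != -1 && n < buffer.length) {
            n += m;
            m = in.read(buffer, n, buffer.length - n);
        }
        in.reset(); // downstream consumers still see the whole stream
        return Arrays.copyOf(buffer, n);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream("hello world".getBytes()));
        System.out.println(new String(peek(in, 5))); // hello
        System.out.println((char) in.read());        // h (stream was reset)
    }
}
```

This is why detect() can safely sniff the head of the document while the parser that runs afterwards still reads the file from the beginning.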
The EncodingDetector interface has three implementations: HtmlEncodingDetector, UniversalEncodingDetector, and Icu4jEncodingDetector.
Their names largely reveal what each one does or which component it wraps. Tika's default web-page encoding detection is flawed: when the charset declared inside a page's html meta element is wrong, parsing the page produces mojibake.
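To see why a wrong declared charset turns into mojibake, here is a minimal JDK-only sketch; the GBK/ISO-8859-1 pairing is just an illustrative assumption, not taken from Tika:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Decode raw bytes under a given charset; used to show what happens
    // when the charset is wrong.
    public static String decodeAs(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // Suppose the page is really encoded in GBK...
        String original = "编码";
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        // ...but a wrong <meta> tag makes the parser decode it as ISO-8859-1.
        String garbled = decodeAs(gbkBytes, StandardCharsets.ISO_8859_1);

        System.out.println(garbled);                  // mojibake
        System.out.println(original.equals(garbled)); // false
    }
}
```

Since HtmlEncodingDetector trusts whatever the meta tag declares, a page whose declaration does not match its actual bytes ends up in exactly this situation.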
The source code of the HtmlEncodingDetector class is as follows:
public class HtmlEncodingDetector implements EncodingDetector {

    // TIKA-357 - use bigger buffer for meta tag sniffing (was 4K)
    private static final int META_TAG_BUFFER_SIZE = 8192;

    private static final Pattern HTTP_EQUIV_PATTERN = Pattern.compile(
            "(?is)<meta\\s+http-equiv\\s*=\\s*['\\\"]\\s*"
            + "Content-Type['\\\"]\\s+content\\s*=\\s*['\\\"]"
            + "([^'\\\"]+)['\\\"]");

    private static final Pattern META_CHARSET_PATTERN = Pattern.compile(
            "(?is)<meta\\s+charset\\s*=\\s*['\\\"]([^'\\\"]+)['\\\"]");

    private static final Charset ASCII = Charset.forName("US-ASCII");

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        // Read enough of the text stream to capture possible meta tags
        input.mark(META_TAG_BUFFER_SIZE);
        byte[] buffer = new byte[META_TAG_BUFFER_SIZE];
        int n = 0;
        int m = input.read(buffer);
        while (m != -1 && n < buffer.length) {
            n += m;
            m = input.read(buffer, n, buffer.length - n);
        }
        input.reset();

        // Interpret the head as ASCII and try to spot a meta tag with
        // a possible character encoding hint
        String charset = null;
        String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
        Matcher equiv = HTTP_EQUIV_PATTERN.matcher(head);
        if (equiv.find()) {
            MediaType type = MediaType.parse(equiv.group(1));
            if (type != null) {
                charset = type.getParameters().get("charset");
            }
        }
        if (charset == null) {
            // TIKA-892: HTML5 meta charset tag
            Matcher meta = META_CHARSET_PATTERN.matcher(head);
            if (meta.find()) {
                charset = meta.group(1);
            }
        }

        if (charset != null) {
            try {
                return CharsetUtils.forName(charset);
            } catch (Exception e) {
                // ignore
            }
        }

        return null;
    }

}
To get the correct encoding in such cases, you need to override the
public Charset detect(InputStream input, Metadata metadata) method.
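One possible shape for such an override is to verify the declared charset against the actual bytes before trusting it. The sketch below is a hypothetical, JDK-only illustration of that idea; the method name and fallback strategy are my own assumptions, not Tika's implementation:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetValidator {

    // Hypothetical helper: return the declared charset only if the head of
    // the document actually decodes cleanly under it; otherwise fall back.
    public static Charset detectWithFallback(byte[] head, String declared,
                                             Charset fallback) {
        try {
            Charset cs = Charset.forName(declared);
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            decoder.decode(ByteBuffer.wrap(head)); // throws on malformed input
            return cs;
        } catch (Exception e) {
            return fallback; // declared charset was wrong or unknown
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "编码测试".getBytes(StandardCharsets.UTF_8);
        // The page declares US-ASCII, but the bytes are really UTF-8,
        // so the strict decode fails and we fall back.
        Charset result = detectWithFallback(utf8, "US-ASCII",
                StandardCharsets.UTF_8);
        System.out.println(result.name()); // UTF-8
    }
}
```

A real override would combine this kind of validation with a statistical detector (such as the two classes discussed next) as the fallback.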
Next, let's analyze another important implementation class, UniversalEncodingDetector. From its name we can guess that it uses the juniversalchardet component. Its code is as follows:
public class UniversalEncodingDetector implements EncodingDetector {

    private static final int BUFSIZE = 1024;

    private static final int LOOKAHEAD = 16 * BUFSIZE;

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        input.mark(LOOKAHEAD);
        try {
            UniversalEncodingListener listener =
                    new UniversalEncodingListener(metadata);

            byte[] b = new byte[BUFSIZE];
            int n = 0;
            int m = input.read(b);
            while (m != -1 && n < LOOKAHEAD && !listener.isDone()) {
                n += m;
                listener.handleData(b, 0, m);
                m = input.read(b, 0, Math.min(b.length, LOOKAHEAD - n));
            }

            return listener.dataEnd();
        } catch (IOException e) {
            throw e;
        } catch (Exception e) {
            // if juniversalchardet is not available
            return null;
        } finally {
            input.reset();
        }
    }

}
Encoding detection here is done through the UniversalEncodingListener class, which implements the CharsetListener interface. That interface comes from the juniversalchardet component; its code is as follows:
package org.mozilla.universalchardet;

public interface CharsetListener {
    public void report(String charset);
}
The only method the interface requires implementors to provide is void report(String charset).
The UniversalEncodingListener class holds a private member:
private final UniversalDetector detector = new UniversalDetector(this);
Here this refers to the listener itself, so we can infer that the detector object passes the detected encoding name back by calling the void report(String charset) method of the CharsetListener interface.
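This callback pattern can be sketched independently of juniversalchardet; all class names below are illustrative only, a minimal re-creation of the listener/detector relationship:

```java
// Hypothetical minimal version of the juniversalchardet callback pattern.
interface Listener {
    void report(String charset);
}

class TinyDetector {
    private final Listener listener;

    TinyDetector(Listener listener) {
        this.listener = listener;
    }

    // When detection finishes, the result is pushed back through the
    // callback rather than returned directly.
    void dataEnd(String detected) {
        listener.report(detected);
    }
}

public class CallbackDemo {
    static String received;

    public static void main(String[] args) {
        // A listener can pass itself (or a lambda) into the detector,
        // exactly as UniversalEncodingListener passes `this`.
        TinyDetector detector = new TinyDetector(name -> received = name);
        detector.dataEnd("UTF-8");
        System.out.println("charset = " + received); // charset = UTF-8
    }
}
```

UniversalEncodingListener plays both roles at once: it feeds bytes into the detector and receives the result back through report().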
The UniversalDetector class ships with its own test code:
public static void main(String[] args) throws Exception {
    if (args.length != 1) {
        System.out.println("USAGE: java UniversalDetector filename");
        return;
    }

    UniversalDetector detector = new UniversalDetector(
            new CharsetListener() {
                public void report(String name) {
                    System.out.println("charset = " + name);
                }
            });

    byte[] buf = new byte[4096];
    java.io.FileInputStream fis = new java.io.FileInputStream(args[0]);

    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    detector.dataEnd();
}
If our program uses the UniversalEncodingDetector class to detect a file's encoding, what does the code look like? Here is how to call it:
public static void main(String[] args) throws IOException, TikaException {
    File file = new File("[file path]");
    InputStream stream = null;
    try {
        stream = new FileInputStream(file);
        EncodingDetector detector = new UniversalEncodingDetector();
        Charset charset = detector.detect(
                new BufferedInputStream(stream), new Metadata());
        // detect() may return null when no encoding could be identified
        System.out.println("Encoding: "
                + (charset != null ? charset.name() : "unknown"));
    } finally {
        if (stream != null) {
            stream.close();
        }
    }
}
The third class is Icu4jEncodingDetector; as the name suggests, it uses IBM's ICU4J component. Its code is as follows:
public class Icu4jEncodingDetector implements EncodingDetector {

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        CharsetDetector detector = new CharsetDetector();

        String incomingCharset = metadata.get(Metadata.CONTENT_ENCODING);
        String incomingType = metadata.get(Metadata.CONTENT_TYPE);
        if (incomingCharset == null && incomingType != null) {
            // TIKA-341: Use charset in content-type
            MediaType mt = MediaType.parse(incomingType);
            if (mt != null) {
                incomingCharset = mt.getParameters().get("charset");
            }
        }

        if (incomingCharset != null) {
            detector.setDeclaredEncoding(CharsetUtils.clean(incomingCharset));
        }

        // TIKA-341 without enabling input filtering (stripping of tags)
        // short HTML tests don't work well
        detector.enableInputFilter(true);

        detector.setText(input);

        for (CharsetMatch match : detector.detectAll()) {
            try {
                return CharsetUtils.forName(match.getName());
            } catch (Exception e) {
                // ignore
            }
        }

        return null;
    }

}
The key class here is CharsetDetector; we will not analyze it further for now.
Below is a repost of a blog article from the web:
《使用ICU4J探測文檔編碼》 ("Detecting Document Encodings with ICU4J")
http://blog.csdn.net/cnhome/article/details/6973343
I have used this library before and it works well. It is based on a paper by Chinese researchers, and as far as I recall it first appeared in some open-source browser.
Encoding detection for web page source generally takes one of two approaches. The first analyzes meta information in the page source, such as contentType, to obtain the encoding; however, some pages carry no encoding information in their contentType at all, in which case the second approach is needed: statistical and heuristic detection over the page source itself. ICU4J, provided by IBM, is a library based on the second approach.
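As a toy illustration of the second, heuristic approach: strict UTF-8 validation is one of the simplest statistical checks a detector can apply. This JDK-only sketch is not how ICU4J works internally, just an illustration of the idea:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Heuristic {

    // Toy heuristic: if the bytes decode cleanly as strict UTF-8, guess UTF-8.
    // Real detectors (ICU4J, juniversalchardet) combine many such checks
    // with byte-frequency statistics over candidate encodings.
    public static boolean looksLikeUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "编码".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xC3, (byte) 0x28}; // broken UTF-8 sequence
        System.out.println(looksLikeUtf8(valid));   // true
        System.out.println(looksLikeUtf8(invalid)); // false
    }
}
```

Because valid UTF-8 has a rigid byte structure, random non-UTF-8 multibyte text rarely passes this check, which is what makes it a useful heuristic signal.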
The following example demonstrates a simple detection process.
package com.huilan.dig.contoller;

import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * Uses the ICU4J package to detect a document's encoding.
 */
public class EncodeDetector {

    /**
     * Detects the encoding.
     * @throws IOException
     * @throws Exception
     */
    public static String getEncode(byte[] data, String url) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();
        String encoding = match.getName();
        System.out.println("The Content in " + match.getName());
        CharsetMatch[] matches = detector.detectAll();
        System.out.println("All possibilities");
        for (CharsetMatch m : matches) {
            System.out.println("CharsetName:" + m.getName()
                    + " Confidence:" + m.getConfidence());
        }
        return encoding;
    }

    public static String getEncode(InputStream data, String url)
            throws IOException {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();
        String encoding = match.getName();
        System.out.println("The Content in " + match.getName());
        CharsetMatch[] matches = detector.detectAll();
        System.out.println("All possibilities");
        for (CharsetMatch m : matches) {
            System.out.println("CharsetName:" + m.getName()
                    + " Confidence:" + m.getConfidence());
        }
        return encoding;
    }

}