Apache Tika Source Code Study (Part 1)


Because I ran into garbled text when parsing web page files with Apache Tika, I later took a careful look at the Apache Tika source code.

Let's start with a quick look at the UML model of the interfaces and classes involved in Tika's encoding detection.

Below is the encoding detection interface, EncodingDetector.java:

public interface EncodingDetector {

    /**
     * Detects the character encoding of the given text document, or
     * <code>null</code> if the encoding of the document can not be detected.
     * <p>
     * If the document input stream is not available, then the first
     * argument may be <code>null</code>. Otherwise the detector may
     * read bytes from the start of the stream to help in encoding detection.
     * The given stream is guaranteed to support the
     * {@link InputStream#markSupported() mark feature} and the detector
     * is expected to {@link InputStream#mark(int) mark} the stream before
     * reading any bytes from it, and to {@link InputStream#reset() reset}
     * the stream before returning. The stream must not be closed by the
     * detector.
     * <p>
     * The given input metadata is only read, not modified, by the detector.
     *
     * @param input text document input stream, or <code>null</code>
     * @param metadata input metadata for the document
     * @return detected character encoding, or <code>null</code>
     * @throws IOException if the document input stream could not be read
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;

}
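To make the contract described in the Javadoc concrete, here is a minimal sketch of a custom detector (my own illustration, not part of Tika) that only sniffs a UTF-8 byte order mark. The package names in the imports assume the Tika 1.x layout (org.apache.tika.detect, org.apache.tika.metadata). Note how it marks the stream before reading, resets it before returning, and never closes it:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;

// Hypothetical example: a detector that recognises only a UTF-8 BOM but
// follows the mark/reset contract required by EncodingDetector.
public class BomSniffingDetector implements EncodingDetector {

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;                   // the stream may legally be null
        }
        input.mark(3);                     // mark before reading any bytes
        try {
            byte[] bom = new byte[3];
            int n = input.read(bom);       // a single read is enough for a sketch
            if (n == 3 && (bom[0] & 0xFF) == 0xEF
                       && (bom[1] & 0xFF) == 0xBB
                       && (bom[2] & 0xFF) == 0xBF) {
                return Charset.forName("UTF-8");
            }
            return null;                   // no opinion, let other detectors decide
        } finally {
            input.reset();                 // reset before returning, never close
        }
    }
}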

The EncodingDetector interface has three implementations: HtmlEncodingDetector, UniversalEncodingDetector, and Icu4jEncodingDetector.

Their names largely reveal what they do and which components they rely on. Tika's default web page encoding detection has a flaw: when parsing an HTML file whose meta element declares the wrong encoding, the parsed text comes out garbled.

The source code of the HtmlEncodingDetector class is as follows:

public class HtmlEncodingDetector implements EncodingDetector {

    // TIKA-357 - use bigger buffer for meta tag sniffing (was 4K)
    private static final int META_TAG_BUFFER_SIZE = 8192;

    private static final Pattern HTTP_EQUIV_PATTERN = Pattern.compile(
            "(?is)<meta\\s+http-equiv\\s*=\\s*['\\\"]\\s*"
            + "Content-Type['\\\"]\\s+content\\s*=\\s*['\\\"]"
            + "([^'\\\"]+)['\\\"]");

    private static final Pattern META_CHARSET_PATTERN = Pattern.compile(
            "(?is)<meta\\s+charset\\s*=\\s*['\\\"]([^'\\\"]+)['\\\"]");

    private static final Charset ASCII = Charset.forName("US-ASCII");

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        // Read enough of the text stream to capture possible meta tags
        input.mark(META_TAG_BUFFER_SIZE);
        byte[] buffer = new byte[META_TAG_BUFFER_SIZE];
        int n = 0;
        int m = input.read(buffer);
        while (m != -1 && n < buffer.length) {
            n += m;
            m = input.read(buffer, n, buffer.length - n);
        }
        input.reset();

        // Interpret the head as ASCII and try to spot a meta tag with
        // a possible character encoding hint
        String charset = null;
        String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();

        Matcher equiv = HTTP_EQUIV_PATTERN.matcher(head);
        if (equiv.find()) {
            MediaType type = MediaType.parse(equiv.group(1));
            if (type != null) {
                charset = type.getParameters().get("charset");
            }
        }
        if (charset == null) {
            // TIKA-892: HTML5 meta charset tag
            Matcher meta = META_CHARSET_PATTERN.matcher(head);
            if (meta.find()) {
                charset = meta.group(1);
            }
        }

        if (charset != null) {
            try {
                return CharsetUtils.forName(charset);
            } catch (Exception e) {
                // ignore
            }
        }

        return null;
    }

}
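The flaw mentioned earlier is easy to reproduce: the detector returns whatever charset the meta tag declares without looking at the actual bytes. The following small test is my own sketch (class and package names assumed from the Tika 1.x layout, org.apache.tika.parser.html); the body is GBK-encoded, but because the meta tag wrongly declares ISO-8859-1, that is what comes back:

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class WrongMetaCharsetDemo {
    public static void main(String[] args) throws Exception {
        // The body is encoded in GBK, but the meta tag (wrongly) declares ISO-8859-1
        String html = "<html><head>"
                + "<meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>"
                + "</head><body>中文内容</body></html>";
        byte[] bytes = html.getBytes("GBK");

        Charset charset = new HtmlEncodingDetector().detect(
                new BufferedInputStream(new ByteArrayInputStream(bytes)),
                new Metadata());

        // Prints ISO-8859-1: the declared (wrong) charset is returned as-is
        System.out.println(charset);
    }
}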

If we want the correct encoding in such cases, we need to rewrite the

public Charset detect(InputStream input, Metadata metadata)

method, for example along the lines of the sketch below.
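One possible rewrite, sketched here under the assumption that the Tika 1.x packages org.apache.tika.parser.html and org.apache.tika.parser.txt are on the classpath, is to stop trusting the meta tag blindly: ask a content-based detector (here ICU4J) first and fall back to the meta declaration only when it gives no answer. This is my own illustration, not the approach Tika itself takes:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;

// Hypothetical detector, not part of Tika: instead of blindly trusting the
// charset declared in the HTML meta tag, it asks ICU4J to analyse the actual
// bytes first and only falls back to the meta declaration when ICU4J gives
// no answer. The caller must pass a stream that supports mark/reset.
public class ContentFirstHtmlEncodingDetector implements EncodingDetector {

    private final EncodingDetector icu4j = new Icu4jEncodingDetector();
    private final EncodingDetector html = new HtmlEncodingDetector();

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        Charset detected = icu4j.detect(input, metadata);   // content-based guess
        if (detected != null) {
            return detected;
        }
        return html.detect(input, metadata);                // meta tag as a fallback
    }
}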

Next let's look at another important implementation class, UniversalEncodingDetector. From its name we can guess that it relies on the juniversalchardet component. Its code is as follows:

public class UniversalEncodingDetector implements EncodingDetector {

    private static final int BUFSIZE = 1024;

    private static final int LOOKAHEAD = 16 * BUFSIZE;

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        input.mark(LOOKAHEAD);
        try {
            UniversalEncodingListener listener =
                    new UniversalEncodingListener(metadata);

            byte[] b = new byte[BUFSIZE];
            int n = 0;
            int m = input.read(b);
            while (m != -1 && n < LOOKAHEAD && !listener.isDone()) {
                n += m;
                listener.handleData(b, 0, m);
                m = input.read(b, 0, Math.min(b.length, LOOKAHEAD - n));
            }

            return listener.dataEnd();
        } catch (IOException e) {
            throw e;
        } catch (Exception e) { // if juniversalchardet is not available
            return null;
        } finally {
            input.reset();
        }
    }

}

Here the actual encoding detection is delegated to the UniversalEncodingListener class, which implements the CharsetListener interface. That interface belongs to the juniversalchardet component; its code is as follows:

package org.mozilla.universalchardet;

public interface CharsetListener
{
    public void report(String charset);
}

The only method this interface asks implementors to provide is void report(String charset).

The UniversalEncodingListener class holds a private member:

 private final UniversalDetector detector = new UniversalDetector(this);

Here this refers to the listener itself, so we can infer that the detector object passes the detected charset name back by calling the CharsetListener interface's void report(String charset) method.
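As a small illustration of that callback flow (my own sketch, not Tika's UniversalEncodingListener), a listener only has to remember the name handed to report() and can turn it into a Charset afterwards:

import java.nio.charset.Charset;

import org.mozilla.universalchardet.CharsetListener;
import org.mozilla.universalchardet.UniversalDetector;

// Hypothetical listener: UniversalDetector calls report(...) back on it
// once a charset has been determined (typically when dataEnd() is called).
public class RecordingListener implements CharsetListener {

    private String reportedName;

    public void report(String charset) {
        this.reportedName = charset;          // callback from UniversalDetector
    }

    public Charset getCharset() {
        return reportedName == null ? null : Charset.forName(reportedName);
    }

    public static void main(String[] args) throws Exception {
        RecordingListener listener = new RecordingListener();
        UniversalDetector detector = new UniversalDetector(listener);

        // Feed the detector some UTF-8 encoded sample text
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sample.append("编码检测示例 encoding detection sample ");
        }
        byte[] data = sample.toString().getBytes("UTF-8");
        detector.handleData(data, 0, data.length);
        detector.dataEnd();                   // triggers report(...) if a charset was found

        System.out.println(listener.getCharset());   // expected: UTF-8 (may be null for ambiguous input)
    }
}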

We can also look at the test code that ships inside the UniversalDetector class itself:

    public static void main(String[] args) throws Exception
    {
        if (args.length != 1) {
            System.out.println("USAGE: java UniversalDetector filename");
            return;
        }

        UniversalDetector detector = new UniversalDetector(
                new CharsetListener() {
                    public void report(String name)
                    {
                        System.out.println("charset = " + name);
                    }
                }
                );
        
        byte[] buf = new byte[4096];
        java.io.FileInputStream fis = new java.io.FileInputStream(args[0]);
        
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        detector.dataEnd();
    }

If our own program wants to use the UniversalEncodingDetector class to detect a file's encoding, how do we write that code? Here is one way to call it:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.EncodingDetector;     // package names assume the Tika 1.x layout
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

public static void main(String[] args) throws IOException {
    File file = new File("[file path]");
    InputStream stream = null;
    try {
        stream = new FileInputStream(file);
        EncodingDetector detector = new UniversalEncodingDetector();
        // detect() expects a stream that supports mark/reset, hence the BufferedInputStream
        Charset charset = detector.detect(new BufferedInputStream(stream), new Metadata());
        System.out.println("Encoding: " + (charset == null ? "unknown" : charset.name()));
    } finally {
        if (stream != null) {
            stream.close();
        }
    }
}

The third implementation class is Icu4jEncodingDetector. As the name suggests, it is based on IBM's ICU4J component. Its code is as follows:

public class Icu4jEncodingDetector implements EncodingDetector {

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        CharsetDetector detector = new CharsetDetector();

        String incomingCharset = metadata.get(Metadata.CONTENT_ENCODING);
        String incomingType = metadata.get(Metadata.CONTENT_TYPE);
        if (incomingCharset == null && incomingType != null) {
            // TIKA-341: Use charset in content-type
            MediaType mt = MediaType.parse(incomingType);
            if (mt != null) {
                incomingCharset = mt.getParameters().get("charset");
            }
        }

        if (incomingCharset != null) {
            detector.setDeclaredEncoding(CharsetUtils.clean(incomingCharset));
        }

        // TIKA-341 without enabling input filtering (stripping of tags)
        // short HTML tests don't work well
        detector.enableInputFilter(true);

        detector.setText(input);

        for (CharsetMatch match : detector.detectAll()) {
            try {
                return CharsetUtils.forName(match.getName());
            } catch (Exception e) {
                // ignore
            }
        }

        return null;
    }

}
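A direct call looks much like the UniversalEncodingDetector example above. The sketch below is my own (the package name is assumed to be the Tika 1.x org.apache.tika.parser.txt) and also shows how a charset hint placed in the Content-Type metadata is picked up as the declared encoding, the TIKA-341 behaviour in the code above:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;

public class Icu4jDetectorDemo {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        // Optional hint: the charset parameter becomes the "declared encoding"
        metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=GBK");

        // The stream must support mark/reset, hence the BufferedInputStream
        InputStream stream = new BufferedInputStream(new FileInputStream(args[0]));
        try {
            Charset charset = new Icu4jEncodingDetector().detect(stream, metadata);
            System.out.println("Detected encoding: " + charset);   // may be null
        } finally {
            stream.close();
        }
    }
}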

 

The key class here is ICU4J's CharsetDetector; I will not analyze it any further for now.

Below is a repost of a blog article from the web:

"Detecting Document Encoding with ICU4J"

http://blog.csdn.net/cnhome/article/details/6973343

I have used this library before and found it quite good. The technique behind it was described in a paper by Chinese researchers, and if I remember correctly it first appeared in an open-source browser.

 

There are generally two ways to detect the encoding of a web page's source. The first is to parse the meta information in the page source, such as the contentType, and read the encoding from it; however, some pages' contentType carries no encoding information at all, in which case a second approach is needed. The second approach applies statistical and heuristic analysis to the page source to infer the encoding. ICU4J, provided by IBM, is a library built on this second approach.

The following example demonstrates a simple detection process.

package com.huilan.dig.contoller;

import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * This class uses the ICU4J package to detect a document's encoding.
 */
public class EncodeDetector {

    /**
     * Detect the encoding of a byte array (the url parameter is unused).
     */
    public static String getEncode(byte[] data, String url) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();
        String encoding = match.getName();
        System.out.println("The Content in " + match.getName());
        CharsetMatch[] matches = detector.detectAll();
        System.out.println("All possibilities");
        for (CharsetMatch m : matches) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:"
                    + m.getConfidence());
        }
        return encoding;
    }

    /**
     * Detect the encoding of an input stream (the url parameter is unused).
     * @throws IOException if the stream cannot be read
     */
    public static String getEncode(InputStream data, String url) throws IOException {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();
        String encoding = match.getName();
        System.out.println("The Content in " + match.getName());
        CharsetMatch[] matches = detector.detectAll();
        System.out.println("All possibilities");
        for (CharsetMatch m : matches) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:"
                    + m.getConfidence());
        }
        return encoding;
    }

}
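For completeness, here is one way the reposted class might be called (my own sketch, assuming the demo sits in the same package as EncodeDetector). Note that ICU4J's CharsetDetector.setText(InputStream) needs a stream that supports mark(), so the FileInputStream is wrapped in a BufferedInputStream:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class EncodeDetectorDemo {
    public static void main(String[] args) throws Exception {
        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        try {
            // The second (url) argument is not actually used by getEncode
            String encoding = EncodeDetector.getEncode(in, null);
            System.out.println("Best match: " + encoding);
        } finally {
            in.close();
        }
    }
}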

 

