HTMLParser1.6 源代碼閱讀


看到博客園的大牛們都喜歡發系列的文章,我也發一篇。不過我不打算寫什么spring hibernate配置什么的,我只想寫寫自己閱讀別人代碼的一些筆記。歡迎大家拍磚。

 

從開始進行閱讀,第一個包是:org.htmlparser.里面的類包括

Attribute.java

Node.java

NodeFactory.java

NodeFilter.java

Parser.java

PrototypicalNodeFactory.java

Remark.java

Tag.java

Text.java

可以看出都是針對基本數據結構的類。

一個一個進行分析,Attribute.java是記錄網頁元素的屬性的,

                                <a href="EditPosts.aspx" id="TabPosts">隨筆</a>

這個是博客園的,那Attribute應當可以記錄 href=“MySubscibe.aspx"這樣的元素。看他的構造方法:

public Attribute (String name, String assignment, String value, char quote)
    {
        setName (name);
        setAssignment (assignment);
        if (0 == quote)
            setRawValue (value);
        else
        {
            setValue (value);
            setQuote (quote);
        }
    }

看見分為是否含有引號的情況。setName,和setValue沒有什么可以看的,如果不含引號的情況,value怎么設置

/**
     * Set the value of the attribute and the quote character.
     * If the value is pure whitespace, assign it 'as is' and reset the
     * quote character. If not, check for leading and trailing double or
     * single quotes, and if found use this as the quote character and
     * the inner contents of <code>value</code> as the real value.
     * Otherwise, examine the string to determine if quotes are needed
     * and an appropriate quote character if so. This may involve changing
     * double quotes within the string to character references.
     * @param value The new value.
     * @see #getRawValue
     * @see #getRawValue(StringBuffer)
     */
    public void setRawValue (String value)
    {
        char ch;
        boolean needed;
        boolean singleq;
        boolean doubleq;
        String ref;
        StringBuffer buffer;
        char quote;

        quote = 0;
        if ((null != value) && (0 != value.trim ().length ()))
        {
            if (value.startsWith ("'") && value.endsWith ("'")
                && (2 <= value.length ()))
            {
                quote = '\'';
                value = value.substring (1, value.length () - 1);
            }
            else if (value.startsWith ("\"") && value.endsWith ("\"")
                && (2 <= value.length ()))
            {
                quote = '"';
                value = value.substring (1, value.length () - 1);
            }
            else
            {
                // first determine if there's whitespace in the value
                // and while we're at it find a suitable quote character
                needed = false;
                singleq = true;
                doubleq = true;
                for (int i = 0; i < value.length (); i++)
                {
                    ch = value.charAt (i);
                    if ('\'' == ch)
                    {
                        singleq  = false;
                        needed = true;
                    }
                    else if ('"' == ch)
                    {
                        doubleq = false;
                        needed = true;
                    }
                    else if (!('-' == ch) && !('.' == ch) && !('_' == ch)
                       && !(':' == ch) && !Character.isLetterOrDigit (ch))
                    {
                        needed = true;
                    }
                }

                // now apply quoting
                if (needed)
                {
                    if (doubleq)
                        quote = '"';
                    else if (singleq)
                        quote = '\'';
                    else
                    {
                        // uh-oh, we need to convert some quotes into character
                        // references, so convert all double quotes into &#34;
                        quote = '"';
                        ref = "&quot;"; // Translate.encode (quote);
                        // JDK 1.4: value = value.replaceAll ("\"", ref);
                        buffer = new StringBuffer (
                                value.length() * (ref.length () - 1));
                        for (int i = 0; i < value.length (); i++)
                        {
                            ch = value.charAt (i);
                            if (quote == ch)
                                buffer.append (ref);
                            else
                                buffer.append (ch);
                        }
                        value = buffer.toString ();
                    }
                }
            }
        }
        setValue (value);
        setQuote (quote);
    }

如果沒有設置分割字符的話,需要進行判斷,首先判斷value的字符是哪種?單引號,雙引號,還是其他。如果是單引號開頭,單引號結尾,那么分割是單引號。如果是雙引號,就是雙引號。如果不是這樣,可能需要修復,如果里面有單引號,那么我們用雙引號進行包裝,里面含有雙引號,我們用單引號進行包裝。或者其中含有一些特別的字符(不是數字,字符,-,_,:),我們需要用引號引用起來。這樣屬性就可以保存下來。

屬性暫時分析到這里,有興趣的可以自己閱讀剩下的部分。

Node.java

其實htmlParser就是一個詞法語法分析器,學過自動機的同學應當對此很熟悉,(ps,本人自動機掛掉了。。)而HTML的元素有三種類型,text,Tag,Remark(remark是不是這種<!--ddd-->,這個不確定)。 我們進行語法解析的時候肯定要返回相應的node,那這個node應當設計成抽象類或者接口,的確,也是這樣設計的。看代碼

package org.htmlparser;

import org.htmlparser.lexer.Page;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

public interface Node
    extends
        Cloneable
{
    
    String toPlainTextString ();  
    String toHtml ();
    String toHtml (boolean verbatim);
    String toString ();    
    void collectInto (NodeList list, NodeFilter filter);   
    int getStartPosition ();    
    void setStartPosition (int position);   
    int getEndPosition ();   
    void setEndPosition (int position);   
    Page getPage ();
    void setPage (Page page); 
    void accept (NodeVisitor visitor);   
    Node getParent ();
    void setParent (Node node);
    NodeList getChildren ();  
    void setChildren (NodeList children);
    Node getFirstChild ();
    Node getLastChild (); 
    Node getPreviousSibling ();      
    Node getNextSibling ();     
    String getText ();  
    void setText (String text); 
    void doSemanticAction ()
        throws
            ParserException;  
    Object clone ()
        throws
            CloneNotSupportedException;
}

這里光看這個也沒辦法領略精髓,大致就是一個開始克隆,可以轉換成html元素,轉換成string的類型,並且可以迭代的一系列方法。不過下一步我們應當從lexer中尋找相關答案,保持我們的閱讀順序,我們繼續進行下一個java分析

NodeFactory

先看代碼

package org.htmlparser;

import java.util.Vector;

import org.htmlparser.lexer.Page;
import org.htmlparser.util.ParserException;

public interface NodeFactory
{
   
    Text createStringNode (Page page, int start, int end)
        throws
            ParserException;
 
    Remark createRemarkNode (Page page, int start, int end)
        throws
            ParserException;
   
    Tag createTagNode (Page page, int start, int end, Vector attributes)
        throws
            ParserException;
}

就是創建上面三種HTML頁面的基本元素。

NodeFilter

package org.htmlparser;

import java.io.Serializable;

/**
 * Implement this interface to select particular nodes.
 */
public interface NodeFilter
    extends
        Serializable,
        Cloneable
{

    boolean accept (Node node);
}

進行一個Node的合理性驗證

Parser

終於看到重點了

package org.htmlparser;

import java.io.Serializable;
import java.net.HttpURLConnection;
import java.net.URLConnection;

import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.http.ConnectionMonitor;
import org.htmlparser.http.HttpHeader;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.DefaultParserFeedback;
import org.htmlparser.util.IteratorImpl;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.ParserFeedback;
import org.htmlparser.visitors.NodeVisitor;


public class Parser
    implements
        Serializable,
        ConnectionMonitor
{
   
    public static final double
    VERSION_NUMBER = 1.6
    ;

   
    public static final String
    VERSION_TYPE = "Release Build"
    ;

   
    public static final String
    VERSION_DATE = "Jun 10, 2006"
    ;

   
    public static final String VERSION_STRING =
            "" + VERSION_NUMBER
            + " (" + VERSION_TYPE + " " + VERSION_DATE + ")";

  
    protected ParserFeedback mFeedback;

    protected Lexer mLexer;

    public static final ParserFeedback DEVNULL =
        new DefaultParserFeedback (DefaultParserFeedback.QUIET);

    public static final ParserFeedback STDOUT = new DefaultParserFeedback ();

    static
    {
        getConnectionManager ().getDefaultRequestProperties ().put (
            "User-Agent", "HTMLParser/" + getVersionNumber ());
    
    }


    public static String getVersion ()
    {
        return (VERSION_STRING);
    }

    public static double getVersionNumber ()
    {
        return (VERSION_NUMBER);
    }

    public static ConnectionManager getConnectionManager ()
    {
        return (Page.getConnectionManager ());
    }

    public static void setConnectionManager (ConnectionManager manager)
    {
        Page.setConnectionManager (manager);
    }

    public static Parser createParser (String html, String charset)
    {
        Parser ret;

        if (null == html)
            throw new IllegalArgumentException ("html cannot be null");
        ret = new Parser (new Lexer (new Page (html, charset)));

        return (ret);
    }


    public Parser ()
    {
        this (new Lexer (new Page ("")), DEVNULL);
    }

    public Parser (Lexer lexer, ParserFeedback fb)
    {
        setFeedback (fb);
        setLexer (lexer);
        setNodeFactory (new PrototypicalNodeFactory ());
    }

    public Parser (URLConnection connection, ParserFeedback fb)
        throws
            ParserException
    {
        this (new Lexer (connection), fb);
    }

    public Parser (String resource, ParserFeedback feedback)
        throws
            ParserException
    {
        setFeedback (feedback);
        setResource (resource);
        setNodeFactory (new PrototypicalNodeFactory ());
    }

    public Parser (String resource) throws ParserException
    {
        this (resource, STDOUT);
    }

    public Parser (Lexer lexer)
    {
        this (lexer, STDOUT);
    }

    public Parser (URLConnection connection) throws ParserException
    {
        this (connection, STDOUT);
    }

    public void setResource (String resource)
        throws
            ParserException
    {
        int length;
        boolean html;
        char ch;

        if (null == resource)
            throw new IllegalArgumentException ("resource cannot be null");
        length = resource.length ();
        html = false;
        for (int i = 0; i < length; i++)
        {
            ch = resource.charAt (i);
            if (!Character.isWhitespace (ch))
            {
                if ('<' == ch)
                    html = true;
                break;
            }
        }
        if (html)
            setLexer (new Lexer (new Page (resource)));
        else
            setLexer (new Lexer (getConnectionManager ().openConnection (resource)));
    }

    public void setConnection (URLConnection connection)
        throws
            ParserException
    {
        if (null == connection)
            throw new IllegalArgumentException ("connection cannot be null");
        setLexer (new Lexer (connection));
    }

    public URLConnection getConnection ()
    {
        return (getLexer ().getPage ().getConnection ());
    }


    public void setURL (String url)
        throws
            ParserException
    {
        if ((null != url) && !"".equals (url))
            setConnection (getConnectionManager ().openConnection (url));
    }

    public String getURL ()
    {
        return (getLexer ().getPage ().getUrl ());
    }

    public void setEncoding (String encoding)
        throws
            ParserException
    {
        getLexer ().getPage ().setEncoding (encoding);
    }

    public String getEncoding ()
    {
        return (getLexer ().getPage ().getEncoding ());
    }


    public void setLexer (Lexer lexer)
    {
        NodeFactory factory;
        String type;

        if (null == lexer)
            throw new IllegalArgumentException ("lexer cannot be null");
        // move a node factory that's been set to the new lexer
        factory = null;
        if (null != getLexer ())
            factory = getLexer ().getNodeFactory ();
        if (null != factory)
            lexer.setNodeFactory (factory);
        mLexer = lexer;
        // warn about content that's not likely text
        type = mLexer.getPage ().getContentType ();
        if (type != null && !type.startsWith ("text"))
            getFeedback ().warning (
                "URL "
                + mLexer.getPage ().getUrl ()
                + " does not contain text");
    }

    public Lexer getLexer ()
    {
        return (mLexer);
    }

    public NodeFactory getNodeFactory ()
    {
        return (getLexer ().getNodeFactory ());
    }

    public void setNodeFactory (NodeFactory factory)
    {
        if (null == factory)
            throw new IllegalArgumentException ("node factory cannot be null");
        getLexer ().setNodeFactory (factory);
    }

 
    public void setFeedback (ParserFeedback fb)
    {
        if (null == fb)
            mFeedback = DEVNULL;
        else
            mFeedback = fb;
    }

    public ParserFeedback getFeedback()
    {
        return (mFeedback);
    }

    public void reset ()
    {
        getLexer ().reset ();
    }

 
    public NodeIterator elements () throws ParserException
    {
        return (new IteratorImpl (getLexer (), getFeedback ()));
    }
   
    public NodeList parse (NodeFilter filter) throws ParserException
    {
        NodeIterator e;
        Node node;
        NodeList ret;

        ret = new NodeList ();
        for (e = elements (); e.hasMoreNodes (); )
        {
            node = e.nextNode ();
            if (null != filter)
                node.collectInto (ret, filter);
            else
                ret.add (node);
        }

        return (ret);
    }

    public void visitAllNodesWith (NodeVisitor visitor) throws ParserException
    {
        Node node;
        visitor.beginParsing();
        for (NodeIterator e = elements(); e.hasMoreNodes(); )
        {
            node = e.nextNode();
            node.accept(visitor);
        }
        visitor.finishedParsing();
    }

    public void setInputHTML (String inputHTML)
        throws
            ParserException
    {
        if (null == inputHTML)
            throw new IllegalArgumentException ("html cannot be null");
        if (!"".equals (inputHTML))
            setLexer (new Lexer (new Page (inputHTML)));
    }

    public NodeList extractAllNodesThatMatch (NodeFilter filter)
        throws
            ParserException
    {
        NodeIterator e;
        NodeList ret;

        ret = new NodeList ();
        for (e = elements (); e.hasMoreNodes (); )
            e.nextNode ().collectInto (ret, filter);

        return (ret);
    }


    public void preConnect (HttpURLConnection connection)
        throws
            ParserException
    {
        getFeedback ().info (HttpHeader.getRequestHeader (connection));
    }


    public void postConnect (HttpURLConnection connection)
        throws
            ParserException
    {
        getFeedback ().info (HttpHeader.getResponseHeader (connection));
    }

    public static void main (String [] args)
    {
        Parser parser;
        NodeFilter filter;

        if (args.length < 1 || args[0].equals ("-help"))
        {
            System.out.println ("HTML Parser v" + getVersion () + "\n");
            System.out.println ();
            System.out.println ("Syntax : java -jar htmlparser.jar"
                    + " <file/page> [type]");
            System.out.println ("   <file/page> the URL or file to be parsed");
            System.out.println ("   type the node type, for example:");
            System.out.println ("     A - Show only the link tags");
            System.out.println ("     IMG - Show only the image tags");
            System.out.println ("     TITLE - Show only the title tag");
            System.out.println ();
            System.out.println ("Example : java -jar htmlparser.jar"
                    + " http://www.yahoo.com");
            System.out.println ();
        }
        else
            try
            {
                parser = new Parser ();
                if (1 < args.length)
                    filter = new TagNameFilter (args[1]);
                else
                {
                    filter = null;
                    // for a simple dump, use more verbose settings
                    parser.setFeedback (Parser.STDOUT);
                    getConnectionManager ().setMonitor (parser);
                }
                getConnectionManager ().setRedirectionProcessingEnabled (true);
                getConnectionManager ().setCookieProcessingEnabled (true);
                parser.setResource (args[0]);
                System.out.println (parser.parse (filter));
            }
            catch (ParserException e)
            {
                e.printStackTrace ();
            }
    }
}

首先我們要看看構造方法:通常,我們使用HTMLParser的時候是這樣new的 Parser a = Parser.createParser(content,"UTF-8");而這個方法就是通過new Parser(New Lexer(new Page(html,charset)));方法創建一個Parser,也就是核心是Page對象,Lexer對象和Parser對象。

看其他的構造方法,空構造方法,我們略去。

public Parser(Lexer lexer,ParserFeedback fb),里面無非就是設置詞法解析器,設置異常信息接收器,和設置一個nodeFactory。

public Parser (URLConnection connection, ParserFeedback fb)網頁
public Parser (String resource, ParserFeedback feedback) 對內容進行解析

public Parser (String resource)

public Parser (Lexer lexer)

public Parser (URLConnection connection)

無非就是用默認的fd,之后會對FeedBack類進行相關說明。

到這里,parser完成的無非就是對給定的url,或者content,或者urlconnection對象,創建相應的詞法分析器,feedback,和根據條件創建相應的Page對象。同時,我們看到Nodefactory實際上的設置是在Lexer中進行的。這里先不管。

這里有幾個我們經常用的方法

public NodeList parse (NodeFilter filter) throws ParserException 根據過濾條件返回相應的nodelist

public void visitAllNodesWith (NodeVisitor visitor) throws ParserException  運用迭代器的方式進行遍歷,這里涉及了NodeVisitor,IteratorImpl 這里還沒有看。暫時不知道為什么這樣設計。

public void setInputHTML (String inputHTML) 可以對html元素進行解析

public NodeList extractAllNodesThatMatch (NodeFilter filter) 根據過濾條件返回滿足條件的NodeList

最后是測試方法。parser類讀完了,其實parser類就是一個大的入口,將與頁面相關的信息傳遞給parser類,parser調用其他類對其進行解析,返回nodelist。nodelist里面有若干node,而實例化的node里面存有我們相用的各種信息,到這里沒有看到詞法分析器的影子。。

 

PrototypicalNodeFactory

對Text,Remark,Tag進行標准化的約束。這個類暫時不做過多介紹,等用到的時候進行相關解釋。

Remark

Text

Tag 這三個接口類,實現node,為具體的Node提供接口。

 

至此,org.htmlParser  閱讀完畢,看見還是比較簡單的。為什么我要閱讀這個代碼,一時項目中用到了這個HTMLParser,因為不放心,想知道內部是如何實現的。二是,我自動機以前掛掉過,我想自己實現一個Parser。三是,我的時間和充裕。

 

下一章節,我們將介紹另一個核心包:org.htmlparser.lexer 。同時,我們要對html頁面標准進行學習,不然如何自己實現一個Parser呢?

 

歡迎批評指正。


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM