Java - Parsing Crawled Content with XPath

Reposted article. 2014-10-24 20:43. Category: java / Java Tips

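When it comes to crawling and parsing content, there are plenty of options.
Many people, for instance, feel that Jsoup solves every problem:
HTTP requests, DOM manipulation, and CSS query selectors are all very convenient with it.

The catch is the selector: a single selector expression can only ever return nodes.
If I want a piece of text or the value of a node's attribute, I have to fetch it from the returned Element object in a second step.
As it happened, I received an interesting requirement: describe the content to extract with a single expression, and use it to pull the title, link, and similar fields of every item on a news page.

XPath fits this perfectly. For example: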

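import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

static void crawlByXPath(String url, String xpathExp) throws IOException, ParserConfigurationException, SAXException, XPathExpressionException {
    // Fetch the page with a GET request and keep the raw HTML.
    String html = Jsoup.connect(url).get().html();

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // parse(String) treats its argument as a URI, so wrap the markup in an InputSource.
    Document document = builder.parse(new InputSource(new StringReader(html)));

    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();

    XPathExpression expression = xPath.compile(xpathExp);
    // Evaluate against the parsed document, not the raw string, and print the result.
    System.out.println(expression.evaluate(document));
}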
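
To show what a single expression returns once parsing succeeds, here is a minimal sketch that uses only the JDK. The class name SingleExpressionSketch, the helper select, and the SAMPLE markup are made up for illustration; the markup is hand-written, well-formed XML standing in for a news page, not output from a real site.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SingleExpressionSketch {

    // Hypothetical, hand-written markup standing in for a news page.
    static final String SAMPLE =
        "<html><body>"
        + "<h2><a href=\"/daily/1\">First story</a></h2>"
        + "<h2><a href=\"/daily/2\">Second story</a></h2>"
        + "</body></html>";

    // Evaluates one XPath expression and returns the value of every matched
    // attribute or text node.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() holds the content of attribute and text nodes.
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Rethrow unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//h2/a/@href"));   // the links
        System.out.println(select(SAMPLE, "//h2/a/text()"));  // the titles
    }
}
```

The expression alone decides whether links or titles come back; no second lookup on an element object is needed.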

　　
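Unfortunately, almost no real website gets through the DocumentBuilder.parse call.
XPath needs a well-formed DOM to work on, and the XML parser rejects the loosely written HTML that most sites serve.
The HTML therefore has to be cleaned first, so I added this dependency: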

    <dependency>
        <groupId>net.sourceforge.htmlcleaner</groupId>
        <artifactId>htmlcleaner</artifactId>
        <version>2.9</version>
    </dependency>
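
Why a cleaning step is needed can be checked with the JDK alone. The sketch below is an illustration, not part of the original code: the class StrictParserSketch and its parses helper are hypothetical names, and the two markup strings are hand-written. It reports whether the standard XML parser accepts a fragment.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParserSketch {

    // Returns true when the JDK XML parser accepts the markup as well-formed.
    static boolean parses(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler so failures are not echoed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An unclosed <br>, common in everyday HTML, is rejected.
        System.out.println(parses("<p>line one<br>line two</p>"));
        // The self-closed form is well-formed and accepted.
        System.out.println(parses("<p>line one<br/>line two</p>"));
    }
}
```

A cleaner such as HtmlCleaner exists to turn the first kind of input into the second before the DOM is built.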

　
HtmlCleaner solves this problem, and it supports XPath itself.
A single call to HtmlCleaner.clean is all it takes:

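import java.io.IOException;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.jsoup.Jsoup;

public static void main(String[] args) throws IOException, XPatherException {
    String url = "http://zhidao.baidu.com/daily";
    // Fetch the page with a GET request.
    String contents = Jsoup.connect(url).get().html();

    // Clean the raw HTML into a well-formed node tree.
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(contents);
    // HtmlCleaner's built-in XPath: the href of every link inside an h2.
    String xpath = "//h2/a/@href";
    Object[] objects = tn.evaluateXPath(xpath);
    // Print the number of matches.
    System.out.println(objects.length);
}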

　
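But HtmlCleaner raised a new problem: when I wrote the expression as "//h2/a[contains(@href,'daily')]/@href", it reported that the contains function is not supported.
javax.xml.xpath, on the other hand, does support XPath functions, which leaves the question of how to combine the two.
HtmlCleaner provides DomSerializer, which converts a TagNode into an org.w3c.dom.Document, for example: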

Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);

　
With that, each library can play to its strengths:

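import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public static void main(String[] args) throws IOException, ParserConfigurationException, XPathExpressionException {
    String url = "http://zhidao.baidu.com/daily";
    String exp = "//h2/a[contains(@href,'daily')]/@href";

    // Let a failed request propagate instead of continuing with a null page.
    String html = Jsoup.connect(url).get().body().html();

    // Clean the HTML, then convert the TagNode tree into a W3C DOM.
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(html);
    Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);

    // Evaluate with javax.xml.xpath, which supports functions such as contains().
    XPath xPath = XPathFactory.newInstance().newXPath();
    Object result = xPath.evaluate(exp, dom, XPathConstants.NODESET);
    if (result instanceof NodeList) {
        NodeList nodeList = (NodeList) result;
        System.out.println(nodeList.getLength());
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            // Attribute and text nodes carry a node value; elements fall back to their text content.
            System.out.println(node.getNodeValue() == null ? node.getTextContent() : node.getNodeValue());
        }
    }
}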
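
The original requirement asked for both the title and the link of each news item. One way to get them as pairs, sketched here under assumptions, is to select the matching anchors first and then evaluate short relative expressions against each one. The class NewsItemsSketch, the helper titlesAndLinks, and the SAMPLE markup are hypothetical; the sample is already well-formed, so the HtmlCleaner step is skipped and only the JDK is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NewsItemsSketch {

    // Hypothetical, hand-written markup; the middle link should be filtered out.
    static final String SAMPLE =
        "<html><body>"
        + "<h2><a href=\"/daily/1\">First story</a></h2>"
        + "<h2><a href=\"/about\">About us</a></h2>"
        + "<h2><a href=\"/daily/2\">Second story</a></h2>"
        + "</body></html>";

    // Returns "title -> link" for every h2 link whose href contains "daily".
    static List<String> titlesAndLinks(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList anchors = (NodeList) xPath.evaluate(
                    "//h2/a[contains(@href,'daily')]", doc, XPathConstants.NODESET);
            List<String> items = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Node anchor = anchors.item(i);
                // Relative expressions use the current anchor as the context node.
                String title = xPath.evaluate("normalize-space(.)", anchor);
                String link = xPath.evaluate("@href", anchor);
                items.add(title + " -> " + link);
            }
            return items;
        } catch (Exception e) {
            // Rethrow unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        titlesAndLinks(SAMPLE).forEach(System.out::println);
    }
}
```

For a live page, the Document produced by DomSerializer.createDOM could be passed to the same XPath calls in place of the parsed sample.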
