How to Scrape Web Page Content with Jsoup

Reposted article. 2020-04-25 16:46 1811 API testing

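Preface:

This post is long overdue; I really have been getting lazier. A while ago I built a feature with jsoup, and it struck me that locating elements with it feels much like working with Selenium's WebDriver. Since I have some time today, here is the write-up.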

Scenario:

Scrape the title, link, publication time, and view count of each post in the article list of the cnblogs blog https://www.cnblogs.com/longronglang


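Approach:

1. Add the jar dependencies

2. Use HttpClient to set parameters and a proxy, establish the connection, and fetch the HTML document (the response body)

3. Parse the response body into a jsoup Document object

4. Locate elements with jQuery-style lookups, just as in web UI automation, and read their text and attributes

For detailed usage, see the official site: https://jsoup.org/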

Implementation:

1. Add the dependencies

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>

2. Use HttpClient to set parameters and a proxy, establish the connection, and fetch the HTML document (the response body)

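        String requestUrl = "https://www.cnblogs.com/longronglang/";
        HttpClient client = new HttpClient();
        HttpClientParams clientParams = client.getParams();
        // Default charset used when none is specified explicitly
        clientParams.setContentCharset("UTF-8");
        GetMethod method = new GetMethod(requestUrl);
        // Send the request first; the body is only available afterwards (throws IOException)
        client.executeMethod(method);
        String response = method.getResponseBodyAsString();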
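
The snippet above does not show the proxy or timeout settings that step 2 mentions. As a rough sketch of those settings, here is the same fetch written with the JDK's built-in java.net.http.HttpClient (Java 11+) instead of commons-httpclient. This is a swapped-in alternative, not what the original post uses; the class name FetchSketch, the timeout values, and the User-Agent string are all made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class FetchSketch {

    // Builds a client with a connect timeout; pass proxyHost = null for a direct connection.
    static HttpClient buildClient(String proxyHost, int proxyPort) {
        HttpClient.Builder builder = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))          // assumed value
                .followRedirects(HttpClient.Redirect.NORMAL);
        if (proxyHost != null) {
            builder.proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)));
        }
        return builder.build();
    }

    // Builds a GET request with a per-request timeout and a browser-like User-Agent.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))                  // assumed value
                .header("User-Agent", "Mozilla/5.0 (jsoup-demo)") // hypothetical UA string
                .GET()
                .build();
    }

    // Sends the request and returns the body, failing on any non-200 status.
    static String fetch(HttpClient client, String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status: " + response.statusCode());
        }
        return response.body();
    }
}
```

Under these assumptions, a call such as FetchSketch.fetch(FetchSketch.buildClient(null, 0), requestUrl) would return the HTML string that is handed to jsoup in the next step.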

3. Parse the response body into a jsoup Document object

  Document document = Jsoup.parse(response);
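
One detail worth knowing at this step: Jsoup.parse also has an overload that takes a base URI, which lets jsoup resolve relative links into absolute ones. The sketch below illustrates that overload; the class name ParseSketch and the sample link path are hypothetical, and whether the real page uses relative links is not something the original post states.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseSketch {

    // Parses the HTML against a base URI and returns the absolute URL
    // of the first link, or an empty string if the page has no links.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Element link = document.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical markup with a relative href
        String html = "<a class=\"postTitle2\" href=\"/longronglang/p/12345.html\">Demo post</a>";
        System.out.println(firstAbsoluteLink(html, "https://www.cnblogs.com/"));
    }
}
```

Without a base URI, absUrl("href") has nothing to resolve against and returns an empty string for relative links, so passing the page URL here is a cheap safeguard.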

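4. Locate elements with jQuery-style lookups, just as in web UI automation, and read their text and attributes

This step is the core idea, so it is worth a close look at the page structure, as shown in the figure below:

As the figure shows, each article title sits in an a tag whose class attribute is postTitle2, so that is the element to locate. To get the title elements, the code can be written as: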

 Elements postItems = document.getElementsByClass("postTitle2");

Likewise, the elements holding the publication time and view count can be obtained with:

 Elements readcontexts = document.getElementsByClass("postDesc");
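
The two lookups above use getElementsByClass. For the jQuery-style lookups mentioned in the step title, jsoup also offers select, which takes a CSS selector. Here is a minimal sketch of the same lookups written that way; the class name SelectSketch and the sample HTML are invented for the example and only loosely imitate the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectSketch {

    // Returns the text of every element matching the CSS selector.
    static List<String> texts(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element element : document.select(cssQuery)) {
            result.add(element.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified markup
        String html = "<div>"
                + "<a class=\"postTitle2\" href=\"/a\">First post</a>"
                + "<div class=\"postDesc\">post info</div>"
                + "</div>";
        // "a.postTitle2" means: a tags whose class list contains postTitle2
        System.out.println(texts(html, "a.postTitle2"));
        System.out.println(texts(html, "div.postDesc"));
    }
}
```

The selector form is handy when the class name alone is not specific enough, since it can also constrain the tag name or the parent element.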

Finally, here is the complete code:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.io.IOException;

public class JsoupTest {

    @Test
    public void test() {
        String requestUrl = "https://www.cnblogs.com/longronglang/";
        HttpClient client = new HttpClient();
        HttpClientParams clientParams = client.getParams();
        clientParams.setContentCharset("UTF-8");
        GetMethod method = new GetMethod(requestUrl);
        try {
            // Send the GET request; the return value is the HTTP status code
            int code = client.executeMethod(method);
            if (code == HttpStatus.SC_OK) {
                String response = method.getResponseBodyAsString();
                Document document = Jsoup.parse(response);
                // Title links and the description blocks that accompany them
                Elements postItems = document.getElementsByClass("postTitle2");
                Elements readcontexts = document.getElementsByClass("postDesc");
                // Guard against the two lists having different lengths
                int count = Math.min(postItems.size(), readcontexts.size());
                for (int i = 0; i < count; i++) {
                    System.out.println("Title: " + postItems.get(i).text());
                    System.out.println("URL: " + postItems.get(i).attr("href"));
                    System.out.println("Post info: " + readcontexts.get(i).text());
                }
            } else {
                System.out.println("Status is not 200; the page may require login or authorization, or the request was redirected.");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Return the connection to the connection manager
            method.releaseConnection();
        }
    }
}
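
The code above prints the whole postDesc text, while the stated goal was the publication time and view count specifically. A small sketch of pulling those two values out with regular expressions follows. It assumes the text contains a timestamp shaped like yyyy-MM-dd HH:mm and that the first number in parentheses is the view count; both are assumptions about the page, and the class name PostDescParser and the sample string are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostDescParser {

    // Assumed timestamp shape: 2020-01-02 03:04
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}");

    // Assumption: the first "(digits)" group in the text is the view count
    private static final Pattern PAREN_NUMBER = Pattern.compile("\\((\\d+)\\)");

    // Returns the first timestamp found, or an empty string if there is none.
    static String publishedAt(String desc) {
        Matcher matcher = TIMESTAMP.matcher(desc);
        return matcher.find() ? matcher.group() : "";
    }

    // Returns the first parenthesized number, or -1 if there is none.
    static int viewCount(String desc) {
        Matcher matcher = PAREN_NUMBER.matcher(desc);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Hypothetical postDesc text
        String sample = "posted @ 2020-01-02 03:04 author views(123) comments(4)";
        System.out.println(publishedAt(sample)); // prints 2020-01-02 03:04
        System.out.println(viewCount(sample));   // prints 123
    }
}
```

In the loop above, these helpers could be applied to readcontexts.get(i).text(); if the real text is laid out differently, the patterns would need adjusting.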

Running the test prints the title, URL, and post info of each article in the list.

That completes a simple crawler. It is only meant as a starting point; interested readers are welcome to extend it on their own.

If a relationship is unequal from the very start, it is better to end it early, for the good of both sides.
