Fetching web page content with Jsoup (and fixing garbled Chinese text)

This article is a repost. 2020-11-23 18:16

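1. Fetch the page content from a URL, decode it with the right charset so Chinese text is not garbled, and try up to 3 times if the request fails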

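import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

private static Document getPageContent(String urlStr) {
        for (int i = 1; i <= 3; i++) {
            try {
                URL url = new URL(urlStr);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // GET is the default; "POST" is also possible. The method name must be
                // upper case: setRequestMethod rejects "get" with a ProtocolException.
                connection.setRequestMethod("GET");
                // Whether caching is allowed; the default is true.
                connection.setUseCaches(false);
                // For POST, enable output. doOutput defaults to false, doInput to true.
                // connection.setDoOutput(true);
                // connection.setDoInput(true);
                // Set request headers
                connection.addRequestProperty("Connection", "close");
                // Connect timeout (milliseconds)
                connection.setConnectTimeout(8000);
                // Read timeout (milliseconds)
                connection.setReadTimeout(8000);
                // Set cookies
                // connection.addRequestProperty("Cookie", "your cookies");
                // Base URI (scheme + host), used by Jsoup to resolve relative links
                int index = urlStr.indexOf("://") + 3;
                String baseUri = urlStr.substring(0, index) + url.getHost();
                // Send the request and decode the body as GBK; decoding a GBK page
                // with the wrong charset is what produces garbled Chinese text
                try (InputStream in = connection.getInputStream()) {
                    // Jsoup.parse never returns null, so no null check is needed
                    return Jsoup.parse(in, "GBK", baseUri);
                }
            } catch (Exception e) {
                e.printStackTrace();
                if (i < 3) {
                    // Wait 3 seconds before the next attempt
                    try {
                        Thread.sleep(3 * 1000);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return null;
                    }
                }
            }
        }
        return null;
    }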
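
The method above hard-codes GBK, which only gives correct text when the page really is served in GBK (or the compatible GB2312). As a rough sketch, the charset could instead be read from the response's Content-Type header, with GBK kept as a fallback. The class name CharsetSniffer and the helper charsetFromContentType are hypothetical names introduced here for illustration; they are not part of Jsoup or of the original article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetSniffer {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=GBK", or the fallback when the header is
    // missing, has no charset parameter, or names an unsupported charset.
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: ignore it and use the fallback
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(charsetFromContentType("text/html; charset=UTF-8", "GBK"));
        System.out.println(charsetFromContentType("text/html", "GBK"));

        // Why the charset matters: the same bytes decode differently.
        // "\u4e2d\u6587" is the two-character string meaning "Chinese".
        byte[] gbkBytes = "\u4e2d\u6587".getBytes("GBK");
        System.out.println(new String(gbkBytes, "GBK"));                   // original text
        System.out.println(new String(gbkBytes, StandardCharsets.UTF_8)); // garbled text
    }
}
```

Inside getPageContent, the header value is available as connection.getContentType() (it may be null), and the returned name could be passed as the second argument of Jsoup.parse in place of the literal "GBK". Pages that declare their charset only in a meta tag would still need the fallback.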

2. Parse the page and select elements in several ways

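    // Additional imports needed by main:
    // import org.jsoup.nodes.Element;
    // import org.jsoup.select.Elements;

    public static void main(String[] args) {
        String urlStr = "http://test.cn/a.html"; // URL of a static page
        Document doc = getPageContent(urlStr);
        if (doc != null) {
            // 1. Look up an element by id (returns null if there is no match)
            Element e1 = doc.getElementById("id");
            // 2. Get elements by tag name
            Elements e2 = doc.getElementsByTag("p");
            // 3. Get elements by class; first() returns null if nothing matches
            Element e3 = doc.getElementsByClass("class_p").first();
            // 4. Get elements by attribute value
            Element e4 = doc.getElementsByAttributeValue("href", "http://test.cn").first();
            // 5. Use a CSS selector: p elements that are direct children of div.writing
            Elements e5 = doc.select("div.writing>p");
            // p elements at any depth under div.writing
            Elements es = doc.select("div.writing p");
            // select() never returns null; an empty result simply skips the loop
            for (Element p : es) {
                String pStr = p.text().trim();
                System.out.println(pStr);
            }
        }
    }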
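
In Jsoup's CSS selector syntax, `>` matches direct children only, while a space matches descendants at any depth. The following sketch runs both forms against a small inline HTML snippet. The markup and the class name SelectorDemo are made up for illustration, and the example assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorDemo {

    // Made-up markup: one <p> is a direct child of div.writing,
    // the other is nested one level deeper inside a <section>.
    static final String HTML =
          "<div class=\"writing\">"
        + "<p>direct child</p>"
        + "<section><p>nested</p></section>"
        + "</div>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Child combinator: only <p> elements directly under div.writing
        Elements children = doc.select("div.writing > p");
        // Descendant combinator: every <p> anywhere under div.writing
        Elements descendants = doc.select("div.writing p");

        System.out.println(children.size());    // 1
        System.out.println(descendants.size()); // 2
        for (Element p : descendants) {
            System.out.println(p.text());
        }

        // Lookups that match nothing return null (single element)
        // or an empty Elements collection, never throw.
        System.out.println(doc.getElementById("missing") == null);        // true
        System.out.println(doc.getElementsByClass("missing").isEmpty());  // true
    }
}
```

Testing selectors against a short string like this is usually quicker than fetching the real page on every run.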
