HttpClients+Jsoup抓取筆趣閣小說，並保存到本地TXT文件

本文轉載自查看原文 2018-10-10 17:13 1363 spider

前言

　　首先先介紹一下Jsoup：（摘自官網）

　　jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

　　Jsoup俗稱“大殺器”，具體的使用大家可以看 jsoup中文文檔

代碼編寫

　　首先maven引包：

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.4</version>
</dependency>

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.9</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>

　　封裝幾個方法（思路大多都在注解里面，相信大家都看得懂）：

　　/**
     * 創建.txt文件
     *
     * @param fileName 文件名（小說名）
     * @return File對象
     */
    public static File createFile(String fileName) {
        //獲取桌面路徑
        String comPath = FileSystemView.getFileSystemView().getHomeDirectory().getPath();
        //創建空白文件夾：networkNovel
        File file = new File(comPath + "\\networkNovel\\" + fileName + ".txt");
        try {
            //獲取父目錄
            File fileParent = file.getParentFile();
            if (!fileParent.exists()) {
                fileParent.mkdirs();
            }
            //創建文件
            if (!file.exists()) {
                file.createNewFile();
            }
        } catch (Exception e) {
            file = null;
            System.err.println("新建文件操作出錯");
            e.printStackTrace();
        }
        return file;
    }

    /**
     * 字符流寫入文件
     *
     * @param file  file對象
     * @param value 要寫入的數據
     */
    public static void fileWriter(File file, String value) {
        //字符流
        try {
            FileWriter resultFile = new FileWriter(file, true);//true,則追加寫入
            PrintWriter myFile = new PrintWriter(resultFile);
            //寫入
            myFile.println(value);
            myFile.println("\n");

            myFile.close();
            resultFile.close();
        } catch (Exception e) {
            System.err.println("寫入操作出錯");
            e.printStackTrace();
        }
    }

    /**
     * 采集當前url完整response實體.toString()
     *
     * @param url url
     * @return response實體.toString()
     */
    public static String gather(String url,String refererUrl) {
        String result = null;
        try {
            //創建httpclient對象 (這里設置成全局變量，相對於同一個請求session、cookie會跟着攜帶過去)
            CloseableHttpClient httpClient = HttpClients.createDefault();
            //創建get方式請求對象
            HttpGet httpGet = new HttpGet(url);
            httpGet.addHeader("Content-type", "application/json");
            //包裝一下
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
            httpGet.addHeader("Referer", refererUrl);
            httpGet.addHeader("Connection", "keep-alive");

            //通過請求對象獲取響應對象
            CloseableHttpResponse response = httpClient.execute(httpGet);
            //獲取結果實體
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                result = EntityUtils.toString(response.getEntity(), "GBK");
            }

            //釋放鏈接
            response.close();
        }
        //這里還可以捕獲超時異常，重新連接抓取
        catch (Exception e) {
            result = null;
            System.err.println("采集操作出錯");
            e.printStackTrace();
        }
        return result;
    }

    /**
     * 使用jsoup處理html字符串，根據規則，得到當前章節名以及完整內容跟下一章的鏈接地址
     * 每個站點的代碼風格都不一樣，所以規則要根據不同的站點去修改
　　　* 比如這里的文章內容直接用一個div包起來，而有些站點是每個段落用p標簽包起來
     * @param html html字符串
     * @return Map<String,String>
     */
    public static Map<String, String> processor(String html) {
        HashMap<String, String> map = new HashMap<>();
        String chapterName;//章節名
        String chapter = null;//完整章節（包括章節名）
        String next = null;//下一章鏈接地址
        try {
            //解析html格式的字符串成一個Document
            Document doc = Jsoup.parse(html);

            //章節名稱
            Elements bookname = doc.select("div.bookname > h1");
            chapterName = bookname.text().trim();
            chapter = chapterName +"\n";

            //文章內容
            Elements content = doc.select("div#content");
            String replaceText = content.text().replace(" ", "\n");
            chapter = chapter + replaceText;

            //下一章
            Elements nextText = doc.select("a:matches((?i)下一章)");
            if (nextText.size() > 0) {
                next = nextText.attr("href");
            }

            map.put("chapterName", chapterName);//章節名稱
            map.put("chapter", chapter);//完整章節內容
            map.put("next", next);//下一章鏈接地址
        } catch (Exception e) {
            map = null;
            System.err.println("處理數據操作出錯");
            e.printStackTrace();
        }
        return map;
    }

    /**
     * 遞歸寫入完整的一本書
     * @param file file
     * @param baseUrl 基礎url
     * @param url 當前url
     * @param refererUrl refererUrl
     */
    public static void mergeBook(File file, String baseUrl, String url, String refererUrl) {
        String html = gather(baseUrl + url,baseUrl +refererUrl);
        Map<String, String> map = processor(html);
        //追加寫入
        fileWriter(file, map.get("chapter"));
        System.out.println(map.get("chapterName") + " --100%");
        if (!StringUtils.isEmpty(map.get("next"))) {
　　　　　　　//遞歸
            mergeBook(file, baseUrl, map.get("next"),url);
        }
    }

　　main測試：

　　public static void main(String[] args) {
        //需要提供的條件：站點；小說名；第一章的鏈接；refererUrl
        String baseUrl = "http://www.biquge.com.tw";
        File file = createFile("斗破蒼穹");
        mergeBook(file, baseUrl, "/1_1999/1179371.html","/1_1999/");
    }

效果

　　給大家看一下我之前爬取的數據，多開幾個進程，掛機爬，差不多七個G，七百八十多部小說

　　代碼開源

　　代碼已經開源、托管到我的GitHub、碼雲：

　　GitHub：https://github.com/huanzi-qch/spider

　　碼雲：https://gitee.com/huanzi-qch/spider

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Python爬蟲練習:抓取筆趣閣小說(一) Jsoup-基於Java實現網絡爬蟲-爬取筆趣閣小說 python爬取筆趣閣小說 Python 爬取筆趣閣小說 Java 實現 HttpClients+jsoup，Jsoup，htmlunit，Headless Chrome 爬蟲抓取數據 python爬去筆趣閣完整一本小說爬蟲大作業之爬取筆趣閣小說 c#爬取筆趣閣小說（附源碼）【爬蟲】對新筆趣閣小說進行爬取，保存和下載筆趣閣爬蟲