Beginner scraping example: crawling the Biqukan novel site with the requests library

Reposted article. Originally published 2018-10-29 15:53. Tag: Python web scraping

This is a practice exercise based on a w3cschool tutorial: scraping the Biqukan novel site (http://www.biqukan.com/) to download every chapter of 《凡人修仙傳仙界篇》.

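1. Request the target URL with requests, using the get method.

2. Parse the returned page with BeautifulSoup, using the BeautifulSoup constructor.

3. Extract the novel content we need, using methods such as find and find_all.

4. Format the data, mostly with Python dict and list operations.

5. Save the result to a txt file, which involves simple file operations such as open and write.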
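
Steps 2 to 4 can be tried offline before touching the real site. The sketch below is only an illustration: the HTML string is made up to imitate a chapter index (the real page's markup may differ), and the dict keys `name` and `value` simply mirror the ones used in the script.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates a chapter index; the real page may look different.
html = """
<div class="listmain">
  <dl>
    <dd><a href="/1_1680/1.html">Chapter 1</a></dd>
    <dd><a href="/1_1680/2.html">Chapter 2</a></dd>
  </dl>
</div>
"""

# Step 2: parse the page text
soup = BeautifulSoup(html, 'html.parser')

# Step 3: find the container by class, then collect every <a> tag inside it
listmain = soup.find(class_='listmain')
links = listmain.find_all('a')

# Step 4: turn each link into a small dict holding the title and absolute URL
chapters = [
    {'name': a.get_text(), 'value': 'https://www.biqukan.com' + a.get('href')}
    for a in links
]

for chapter in chapters:
    print(chapter['name'], chapter['value'])
```

`find` returns the first match (or `None` when nothing matches), while `find_all` returns a list, which is why the script calls `find` for the container and `find_all` for the links.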

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import os


class NovelSpider:
    def __init__(self):
        self.start_url = 'https://www.biqukan.com/1_1680/'

    def get_novel(self):
        # Request the start URL (the chapter index page)
        response = requests.get(self.start_url)
        # The lxml parser caused problems here; it took a long time to track down
        soup = BeautifulSoup(response.text, 'html.parser')
        # print(response.text)
        div_chapter = soup.find(class_="listmain")
        # print(div_chapter)
        # Select all <a> tags; they hold every chapter title and URL
        chapter_list = div_chapter.find_all('a')
        # Drop the first 12 links, which duplicate later chapters (see the page's HTML)
        chapter_list = chapter_list[12:]
        # print(chapter_list)
        chapter = []
        # Total number of chapters, used to show download progress
        chapter_num = len(chapter_list)
        # Counter of chapters downloaded so far
        count = 0
        # Visit and download each chapter in turn
        print('Starting download of 《凡人修仙傳仙界篇》:')
        for cl in chapter_list:
            chapter_dict = {}
            # Chapter title
            chapter_name = cl.get_text()
            chapter_dict['name'] = chapter_name
            # Chapter URL (the href is site-relative, so prepend the host)
            chapter_url = cl.get('href')
            chapter_dict['value'] = 'https://www.biqukan.com' + chapter_url
            if chapter_dict not in chapter:
                chapter.append(chapter_dict)
            # Call download_novel() to download this chapter
            self.download_novel(chapter_dict)
            # Increment the counter only after the chapter is saved, then report progress
            count += 1
            print(f"Downloaded: {count}/{chapter_num}")

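    def parse_novel(self, url):
        # The chapter text is loaded dynamically, so the page is opened with PhantomJS.
        # Note: PhantomJS support was removed in Selenium 4, so this call needs Selenium 3.
        browser = webdriver.PhantomJS(executable_path=r'F:\Spider\novelSpider\phantomjs.exe')
        try:
            browser.get(url)
            soup = BeautifulSoup(browser.page_source, 'html.parser')
            find_txt = soup.find(class_='showtxt')
            # print(type(find_txt.get_text()))
            return find_txt.get_text()
        finally:
            # Always shut the browser down so PhantomJS processes do not pile up
            browser.quit()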

    def download_novel(self, data):  # data is a dict: {name: chapter title, value: chapter URL}
        filename = data['name']
        url = data['value']
        # Fetch the chapter page and get its text back as a str
        txt = self.parse_novel(url)

        # Download directory
        path = r"F:\Spider\novelSpider"
        # Create the directory if it does not exist yet
        os.makedirs(path, exist_ok=True)

        # Append the chapter to the txt file; the with block closes the file for us
        with open(os.path.join(path, '凡人修仙傳仙界篇.txt'), 'a', encoding='utf-8') as f:
            # print(f'Downloading--{filename}')
            f.write(f'{filename}\n\n')
            f.write(txt)
            # Chapter separator
            f.write('\n======\n\n')


if __name__ == '__main__':
    ns = NovelSpider()
    ns.get_novel()
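
The save step can also be exercised on its own. The following sketch writes two made-up chapters to a temporary directory using the same append mode and separator as the script; the helper name `append_chapter` and the file name `novel.txt` are hypothetical and only serve the example.

```python
import os
import tempfile


def append_chapter(path, title, text):
    """Append one chapter to the file at path, followed by a separator line."""
    # Mode 'a' adds to the end of the file and creates it if it is missing
    with open(path, 'a', encoding='utf-8') as f:
        f.write(f'{title}\n\n')
        f.write(text)
        f.write('\n======\n\n')


# Write into a fresh temporary directory so nothing on disk is overwritten
out_dir = tempfile.mkdtemp()
out_file = os.path.join(out_dir, 'novel.txt')

append_chapter(out_file, 'Chapter 1', 'Some text.')
append_chapter(out_file, 'Chapter 2', 'More text.')

# Read the file back to confirm both chapters were appended in order
with open(out_file, encoding='utf-8') as f:
    content = f.read()
print(content)
```

Because the file is opened in append mode, running the spider twice adds every chapter a second time; deleting the old file first avoids duplicates.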

The download is extremely slow. Most of the time seems to go to the PhantomJS page loads, so this still needs more study and improvement!
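
One direction worth testing is to skip the headless browser. The sketch below assumes, without having verified it, that the chapter text is already present in the HTML returned by a plain HTTP request; if the site really fills it in with JavaScript, this will return an empty string and a browser is still needed. The helper names `extract_chapter_text` and `fetch_chapter` are hypothetical.

```python
import requests
from bs4 import BeautifulSoup


def extract_chapter_text(html):
    """Return the text of the element with class 'showtxt', or '' if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.find(class_='showtxt')
    return node.get_text() if node is not None else ''


def fetch_chapter(session, url, timeout=10):
    """Fetch one chapter page through a shared requests.Session and extract its text."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    # Assumption: let requests guess the encoding from the response body
    response.encoding = response.apparent_encoding
    return extract_chapter_text(response.text)


# Offline check of the parsing step with a made-up page
sample = '<html><body><div class="showtxt">Chapter body text.</div></body></html>'
print(extract_chapter_text(sample))

# Usage against the real site (not run here):
# with requests.Session() as session:
#     text = fetch_chapter(session, 'https://www.biqukan.com/1_1680/')
```

A single `requests.Session` reuses the underlying connection across chapters, whereas the script starts a new PhantomJS process for every chapter, which is a plausible source of the slowdown.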
