Scraping a novel from Biquge (筆趣閣) with Python

Reposted article · 2019-02-09 23:02 · python

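This afternoon I picked up my phone and happened to notice the Qidian novel app I had left sitting in a corner. It reminded me that I hadn't read a novel in a long time. I had been reading 《伏天氏》, the new work by 凈無痕, and had topped up Qidian coins to read roughly two hundred chapters. It is now past 800 chapters, and I'm a little reluctant to keep paying for Qidian coins...

Then I remembered that back when I was teaching myself web scraping, I had tested scraping novels from Biquge, so...

Time to scrape 《伏天氏》 once more...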

Structure analysis:

1. Table-of-contents page:

https://www.qu.la/book/2125/

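The whole chapter list sits inside a box (a div) whose id is list, so an XPath expression can select that part directly. Then save each chapter title and URL for later use:

Key statements: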

    # XPath filter (Selenium 4 API; needs: from selenium.webdriver.common.by import By)
    results = driver.find_elements(By.XPATH, "//div[contains(@id,'list')]//dl//dd//a")
    for result in results:
        res_url = result.get_attribute('href')  # url
        res_tit = result.text   # chapter title
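
To try the selection logic without launching a browser, here is a minimal sketch that runs a similar path against a hard-coded snippet using only the standard library. The sample markup and the helper name `parse_toc` are assumptions made for illustration, not the site's real HTML. `xml.etree.ElementTree` supports only a subset of XPath (no `contains()`), so the sketch matches the id exactly.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified table-of-contents markup (not copied from the site).
SAMPLE_TOC = """
<html><body>
  <div id="list">
    <dl>
      <dt>Chapter list</dt>
      <dd><a href="/book/2125/1.html">Chapter 1</a></dd>
      <dd><a href="/book/2125/2.html">Chapter 2</a></dd>
    </dl>
  </div>
</body></html>
"""


def parse_toc(markup, base_url):
    """Return (title, absolute_url) pairs for every link under the div with id 'list'."""
    root = ET.fromstring(markup)
    links = root.findall(".//div[@id='list']/dl/dd/a")
    return [(a.text.strip(), urljoin(base_url, a.get("href"))) for a in links]


chapters = parse_toc(SAMPLE_TOC, "https://www.qu.la/book/2125/")
for title, url in chapters:
    print(title, url)
```

Selenium's `get_attribute('href')` already returns absolute URLs, which is why the sketch resolves the relative links with `urljoin`.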

2. Chapter page

Picking one at random to analyze:

https://www.qu.la/book/2125/10580853.html

OK, the chapter text likewise sits in a box whose id is content, so XPath again.

Key statements:

    driver.get(url)
    # Selenium 4 API; needs: from selenium.webdriver.common.by import By
    content = driver.find_element(By.XPATH, "//div[contains(@id,'content')]").text
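
Reading `.text` on the content box gathers all the text inside it, including the pieces separated by line-break tags. As a rough offline illustration, the sketch below does something similar with the standard library on a made-up snippet. The sample markup and the helper name `extract_text` are assumptions, and the whitespace handling is only an approximation of what Selenium returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified chapter markup (not copied from the site).
SAMPLE_CHAPTER = """
<html><body>
  <div id="content">
    First paragraph.<br/><br/>
    Second paragraph.
  </div>
</body></html>
"""


def extract_text(markup):
    """Collect the text inside the div with id 'content', one non-empty line per piece."""
    root = ET.fromstring(markup)
    box = root.find(".//div[@id='content']")
    if box is None:
        raise ValueError("no element with id='content' found")
    # itertext() walks the element's own text and the text that follows each child tag.
    lines = (piece.strip() for piece in box.itertext())
    return "\n".join(line for line in lines if line)


chapter_text = extract_text(SAMPLE_CHAPTER)
print(chapter_text)
```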

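3. Use Chrome's browser driver in headless mode, scrape the chapter page for each entry in the table of contents, and write it to a text file. Done.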
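
One detail worth watching in the "write it to a text file" step: chapter titles are used as file names, and a title containing characters such as `?` or `/` cannot be used as a file name on Windows. A minimal sketch of one way to handle that follows. The helper names `safe_filename` and `save_chapter` are hypothetical and are not part of the original script.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title):
    """Replace characters that Windows forbids in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or "untitled"


def save_chapter(out_dir, title, content):
    """Write one chapter to <out_dir>/<title>.txt as UTF-8 and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (safe_filename(title) + ".txt")
    path.write_text(content, encoding="utf-8")
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(tmp, "Chapter 1: What? A/B", "chapter body")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```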

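The implementation is below (the formatting may be a bit ugly... I haven't built good habits yet and am fixing that slowly... don't copy me):

(Last time I scraped something from Maoyan, I didn't set any interval between requests. The requests came too fast and my IP got restricted. Proxies are a hassle and I don't really know how to use them, so I'm playing it safe here and requesting the next chapter once per second. The goal is to avoid paying, not to be fast...)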

    # Scrape the Biquge novel 《伏天氏》
    # -*- coding: utf-8 -*-
    import os
    import re
    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    # URL of the table-of-contents page
    root_url = 'https://www.qu.la/book/2125/'

    # Directory the chapter files are written to (raw string, so backslashes stay literal)
    save_dir = r'G:\python 資源\python project\小說爬取(伏天氏)'


    # Collect the chapter titles and the chapter page links
    def get_catalogue():
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome(options=chrome_options)
        driver.get(root_url)

        # Chapter list: title -> url
        catalogue = {}

        # XPath filter
        results = driver.find_elements(By.XPATH, "//div[contains(@id,'list')]//dl//dd//a")
        for result in results:
            res_url = result.get_attribute('href')  # url
            res_tit = result.text   # chapter title

            # Skip entries whose title contains '月票' (monthly ticket) or '通知' (notice)
            flag = not re.search('月票|通知', res_tit)

            # Store in the dict, skipping links that were already collected
            if flag and res_url not in catalogue.values():
                catalogue[res_tit] = res_url

        driver.quit()
        # Print the total number of chapters
        # print(len(catalogue))
        return catalogue


    # Download and save the chapter text
    def download(catalogue):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome(options=chrome_options)
        count = 0   # counter

        for title, url in catalogue.items():   # chapter title, chapter page
            count = count + 1
            print('title = ' + title + ' url = ' + url)
            try:
                driver.get(url)
                content = driver.find_element(By.XPATH, "//div[contains(@id,'content')]").text
                with open(os.path.join(save_dir, title + '.txt'), 'wt', encoding='utf-8') as file:
                    file.write(content)
                print('Chapter ' + str(count) + ' written successfully')
            except IOError:
                print('Chapter ' + str(count) + ' failed to write')

            # Sleep for 1 s before requesting the next chapter
            time.sleep(1)
        driver.quit()


    if __name__ == '__main__':
        catalogues = get_catalogue()
        download(catalogues)
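
The one-second pause between requests could also be combined with a simple retry, so that a single failed page doesn't leave a gap in the book. The following is only a sketch of that idea, not what the script above does. The function name `fetch_all` and its parameters are hypothetical, and `fetch` stands in for whatever actually loads a page (for example a wrapper around `driver.get`).

```python
import time


def fetch_all(urls, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests and retrying failures."""
    results = {}
    for url in urls:
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == retries:
                    results[url] = None  # give up on this URL
                else:
                    sleep(delay * (attempt + 1))  # wait longer before each retry
        sleep(delay)  # fixed pause before moving on to the next URL
    return results


# Demo with a fake fetch function that fails once for "b"; pauses are recorded, not slept.
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url == "b" and calls[url] == 1:
        raise IOError("simulated failure")
    return "page:" + url


pauses = []
pages = fetch_all(["a", "b"], fake_fetch, delay=1.0, sleep=pauses.append)
print(pages, pauses)
```

Passing `sleep` in as a parameter keeps the demo instant; in real use the default `time.sleep` would apply.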

Result: [screenshot omitted]
