Web Scraping in Practice: Scraping Free Novels

Reposted article · 2020-08-24 12:39 · Tag: web scraping

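1. A hands-on scraping project: downloading a novel. It handles free chapters only (VIP chapters require a paid, logged-in account; the method differs and will be covered later).

This tutorial is for learning purposes only. If it infringes on anything, please leave a comment to get in touch.

Target site: Qidian (起點中文網), free chapters of 盜墓筆記 (Daomu Biji)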

　　https://book.qidian.com/info/68223#Catalog

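2. Page structure analysis

Inspecting the page shows that each volume heading sits inside a div element, and that whether the volume is free is recorded in the class attribute of a span that is a grandchild of that div (class='free' or class='vip').

So, to extract the chapters of the free volumes, we first have to check the span element.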
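The selection logic described above can be tried offline before touching the live site. The sketch below runs it against a small hand-written HTML snippet; the markup, the `example.com` links, and the function name `free_chapter_links` are assumptions made up for illustration, and the real page is more deeply nested.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the structure described above.
SAMPLE_HTML = """
<div class="volume">
  <h3><span class="free">Free</span> Volume One</h3>
  <ul>
    <li><a href="//example.com/chapter/1">Chapter 1</a></li>
    <li><a href="//example.com/chapter/2">Chapter 2</a></li>
  </ul>
</div>
<div class="volume">
  <h3><span class="vip">VIP</span> Volume Two</h3>
  <ul>
    <li><a href="//example.com/chapter/3">Chapter 3</a></li>
  </ul>
</div>
"""


def free_chapter_links(html):
    """Return (title, href) pairs for chapters in volumes marked as free."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for span in soup.select('div.volume span.free'):
        # The span is a grandchild of div.volume, so climb two levels
        # to reach the container that also holds the chapter list.
        volume = span.parent.parent
        for a in volume.select('li a'):
            links.append((a.text.strip(), a['href']))
    return links


if __name__ == '__main__':
    for title, href in free_chapter_links(SAMPLE_HTML):
        print(title, href)
```

Only the two chapters of the volume tagged `free` are returned; the `vip` volume is never visited because no `span.free` sits inside it.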

3. Complete code

#!/usr/bin/env python
#-*- coding:utf-8 -*-

'''Scrape the free chapters of the novel 盜墓筆記 (Daomu Biji).
'''


import requests
from bs4 import BeautifulSoup


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

class Story(object):
    
    def __init__(self,url):
        self.url = url
    
    
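    def get_html(self, url):
        """Fetch a page and return its HTML text, or None on any failure."""
        try:
            # A timeout keeps one stalled request from hanging the whole crawl.
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            print('unexpected status', response.status_code, url)
            return None
        except requests.RequestException as e:
            print('request failed:', e)
            return None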
    
    
    def get_soup(self, html):
        # Treat a failed download (None) as an empty document,
        # so that callers simply get no matches from select().
        return BeautifulSoup(html or '', 'html.parser')


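    def start(self):
        html = self.get_html(self.url)
        if html is None:
            return  # the catalog page could not be downloaded
        soup = self.get_soup(html)

        # Each span.free marks a volume whose chapters are free to read.
        for free in soup.select('div.volume span.free'):
            # The span is a grandchild of div.volume, so climb two levels
            # to reach the container that also holds the chapter links.
            chapters = free.parent.parent.select('li a')
            for chapter in chapters:
                title = chapter.text.strip().replace(' ', '_')
                href = 'https:' + chapter['href']  # hrefs here omit the scheme

                page = self.get_html(href)
                if page is None:
                    continue  # skip chapters that failed to download
                body = self.get_soup(page).select('div.read-content')
                if not body:
                    continue  # page layout differs from what we expect
                content = body[0].text.strip().replace('\u3000', ' ')
                print('\033[1;34mScraping:  {}\033[0m'.format(title))
                with open(title + '.txt', 'w', encoding='utf-8') as fw:
                    fw.write(content)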
    
    
    
if __name__ == '__main__':

    url = 'https://book.qidian.com/info/68223#Catalog'

    gg = Story(url)
    gg.start()
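Two details of the script are worth a closer look: it builds chapter URLs by prepending 'https:' to each href, and it uses the chapter title directly as a filename. The sketch below shows one possible way to make both steps more robust using only the standard library. The helper names `absolute_url` and `safe_filename`, the `read.example.com` link in the usage lines, and the assumption that hrefs may be protocol-relative or site-relative are illustrative and not part of the original script.

```python
import re
from urllib.parse import urljoin

CATALOG_URL = 'https://book.qidian.com/info/68223#Catalog'


def absolute_url(href, base=CATALOG_URL):
    """Resolve a relative or protocol-relative href against the catalog URL."""
    return urljoin(base, href)


def safe_filename(title, max_length=100):
    """Replace whitespace and characters that are invalid in filenames."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title.strip())
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or 'untitled'


if __name__ == '__main__':
    print(absolute_url('//read.example.com/chapter/abc'))
    print(safe_filename('Chapter 1: What? / Why'))
```

`urljoin` takes the scheme from the base URL when the href starts with `//`, so it covers the same case as the string concatenation while also handling hrefs that are already absolute.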
