Web Scraping in Practice: Crawling a Novel from Qidian (起點中文網)


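  First, open Qidian (起點中文網) at https://www.qidian.com/

  The goal of this exercise is to crawl a novel titled 《大千界域》. It is meant for learning and discussion only; to support the author, please subscribe and read the novel on Qidian.

  We start by locating the novel's chapter catalog page: https://book.qidian.com/info/3144877#Catalog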

  

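  Inspecting the page (right-click, Inspect) shows that every chapter corresponds to its own URL link. So the plan is simple: fetch this page's HTML, extract all chapter URLs into a list with tools such as BeautifulSoup and re, then visit them one by one to collect the content of every chapter.

  Following that plan, I fetched the page HTML with the code below and printed it to the console. The returned HTML turned out to be incomplete: the part of the body that contains the chapter links was missing, and filling in the headers did not help. Qidian's anti-scraping measures seem to be doing their job, so this route is a dead end and we need another one.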

import requests

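def get():
    # The '#Catalog' fragment is handled by the browser and is not sent to the server
    url = 'https://book.qidian.com/info/3144877#Catalog'
    req = requests.get(url, timeout=10)
    # Print the raw HTML so it can be checked for chapter links
    print(req.text)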

if __name__ == '__main__':
    get()
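
One way to confirm that a fetched page lacks the chapter links is to count the matching anchors in the HTML. The sketch below uses only the standard library's `html.parser` (swapped in for BeautifulSoup so it runs without extra packages). The sample HTML strings and the assumption that chapter links contain `/chapter/` are hypothetical and are not taken from Qidian's actual markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL contains a given marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.marker in href:
            self.links.append(href)


def find_chapter_links(html, marker='/chapter/'):
    # marker is an assumed URL pattern for chapter pages (hypothetical)
    collector = LinkCollector(marker)
    collector.feed(html)
    return collector.links


# Hypothetical samples: a fully rendered page vs. the skeleton a plain GET might return
rendered = '<body><ul><li><a href="//read.example.com/chapter/1">Ch. 1</a></li></ul></body>'
skeleton = '<body><div id="catalog"></div></body>'

print(len(find_chapter_links(rendered)))
print(len(find_chapter_links(skeleton)))
```

An empty result for a page that shows chapters in the browser is a hint that the list is filled in by JavaScript after the initial load.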

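  Since this page is loaded dynamically, it probably uses AJAX to exchange data with the backend and then renders the result into the page. We only need to intercept that request and grab the data it carries.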

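Open https://book.qidian.com/info/3144877#Catalog, then right-click and choose Inspect again. Because we are looking for a data exchange, go to the Network tab and filter by XHR so that only XHR requests are captured. One request, with the URL https://book.qidian.com/ajax/book/category?_csrfToken=1iiVodIPe2qL9Z53jFDIcXlmVghqnB6jSwPP5XKF&bookId=3144877, returns a JSON object containing every volume id and chapter id. This is the data we are looking for.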
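
The two `_csrfToken` values that appear in this article differ, which suggests the token is tied to a browser session, so it may be convenient to build the request URL from the token instead of hard-coding it. A minimal sketch, assuming the token is copied from the `_csrfToken` cookie shown in the browser's developer tools; `token_from_cookie` and `category_url` are hypothetical helper names, not part of the original code.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode


def token_from_cookie(cookie_header):
    # Parse a raw Cookie header string and return the _csrfToken value (or None)
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get('_csrfToken')
    return morsel.value if morsel is not None else None


def category_url(book_id, token):
    # Build the catalog AJAX URL seen in the Network tab
    query = urlencode({'_csrfToken': token, 'bookId': book_id})
    return 'https://book.qidian.com/ajax/book/category?' + query


# Shortened cookie string for illustration; a real one carries more entries
cookie_header = '_csrfToken=BXnzDKmnJamNAgLu4O3GknYVL2YuNX5EE86tTBAm; qdgd=1'
token = token_from_cookie(cookie_header)
print(category_url(3144877, token))
```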

  The following code retrieves that JSON object:

import requests
import random

def random_user_agent():
    # Pool of desktop Chrome User-Agent strings; one is picked at random per call
    user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/44.0.2403.155 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36']
    return random.choice(user_agents)

def getJson():
    url = 'https://book.qidian.com/ajax/book/category?_csrfToken=BXnzDKmnJamNAgLu4O3GknYVL2YuNX5EE86tTBAm&bookId=3144877'
    headers = {'User-Agent': random_user_agent(),
               'Referer': 'https://book.qidian.com/info/3144877',
               'Cookie': '_csrfToken=BXnzDKmnJamNAgLu4O3GknYVL2YuNX5EE86tTBAm; newstatisticUUID=1564467217_1193332262; qdrs=0%7C3%7C0%7C0%7C1; showSectionCommentGuide=1; qdgd=1; lrbc=1013637116%7C436231358%7C0%2C1003541158%7C309402995%7C0; rcr=1013637116%2C1003541158; bc=1003541158%2C1013637116; e1=%7B%22pid%22%3A%22qd_P_limitfree%22%2C%22eid%22%3A%22qd_E01%22%2C%22l1%22%3A4%7D; e2=%7B%22pid%22%3A%22qd_P_free%22%2C%22eid%22%3A%22qd_A18%22%2C%22l1%22%3A3%7D'
    }
    # Pass the dict as headers= (params= would append it to the query string instead)
    res = requests.get(url=url, headers=headers, timeout=10)
    json_str = res.text
    print(json_str)

if __name__ == '__main__':
    getJson()
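
Once the JSON text is available, the volume ids can be read directly from the parsed structure. The sketch below assumes the layout used by the complete code in this article (`data`, then `vs`, with `vId` and `cCnt` for each volume). The sample payload is made up for illustration: only the first volume id comes from the article, while the second id and both chapter counts are invented. `extract_volumes` is a hypothetical helper name.

```python
import json

# Made-up payload that mimics the assumed layout; real responses carry more fields
sample = '{"data": {"vs": [{"vId": 8478272, "cCnt": 3}, {"vId": 8478273, "cCnt": 2}]}}'


def extract_volumes(json_str):
    # Return (volume id, chapter count) pairs from the catalog JSON text
    volumes = json.loads(json_str)['data']['vs']
    return [(str(v['vId']), int(v['cCnt'])) for v in volumes]


for vol_id, chapter_count in extract_volumes(sample):
    print(vol_id, chapter_count)
```

Reading the keys from the parsed dict avoids depending on how the JSON happens to be serialized, which a regular expression over the raw text would.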

  

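  On the novel's catalog page I noticed a read-by-volume option (分卷閱讀). Clicking it opens a page that contains every chapter of that volume. Each of these URLs starts with https://read.qidian.com/hankread/3144877/, and only the volume id changes. For example, the first volume, 初來乍到, has the id 8478272, so the whole first volume can be read at https://read.qidian.com/hankread/3144877/8478272. All we need, then, is to pull every volume id out of the JSON object above, and we can crawl the entire book!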
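
The URL pattern described above can be sketched as a small helper. `volume_urls` and the two constants are hypothetical names introduced for illustration; the base URL and the sample id are the ones quoted in the text.

```python
BOOK_ID = '3144877'
VOLUME_BASE = 'https://read.qidian.com/hankread/' + BOOK_ID + '/'


def volume_urls(volume_ids):
    # One reading URL per volume id, in the same order as the input
    return [VOLUME_BASE + str(vol_id) for vol_id in volume_ids]


print(volume_urls(['8478272']))
```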

  Crawl result: [screenshot]

  The complete code is as follows:

import requests
import re
from bs4 import BeautifulSoup
from requests.exceptions import ReadTimeout, RequestException
import random
import json
import time
import os
import sys

def random_user_agent():
    # Pool of desktop Chrome User-Agent strings; one is picked at random per call
    user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/44.0.2403.155 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
                   'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36']
    return random.choice(user_agents)

def getJson():
    url = 'https://book.qidian.com/ajax/book/category?_csrfToken=BXnzDKmnJamNAgLu4O3GknYVL2YuNX5EE86tTBAm&bookId=3144877'
    headers = {'User-Agent': random_user_agent(),
               'Referer': 'https://book.qidian.com/info/3144877',
               'Cookie': '_csrfToken=BXnzDKmnJamNAgLu4O3GknYVL2YuNX5EE86tTBAm; newstatisticUUID=1564467217_1193332262; qdrs=0%7C3%7C0%7C0%7C1; showSectionCommentGuide=1; qdgd=1; lrbc=1013637116%7C436231358%7C0%2C1003541158%7C309402995%7C0; rcr=1013637116%2C1003541158; bc=1003541158%2C1013637116; e1=%7B%22pid%22%3A%22qd_P_limitfree%22%2C%22eid%22%3A%22qd_E01%22%2C%22l1%22%3A4%7D; e2=%7B%22pid%22%3A%22qd_P_free%22%2C%22eid%22%3A%22qd_A18%22%2C%22l1%22%3A3%7D'
    }
    try:
        # Pass the dict as headers= (params= would append it to the query string instead)
        res = requests.get(url=url, headers=headers, timeout=10)
        if res.status_code == 200:
            volumes = json.loads(res.text)['data']['vs']
            response = {
                'VolumeId_List': [],
                'VolumeNum_List': []
            }
            for volume in volumes:
                # Read the fields from the parsed JSON; keep them as strings for URL building
                response['VolumeId_List'].append(str(volume['vId']))
                response['VolumeNum_List'].append(str(volume['cCnt']))
            return response
        else:
            print('No response')
            return None
    except ReadTimeout:
        print("ReadTimeout!")
        return None
    except RequestException:
        print("Error while requesting the page!")
        return None
    except (ValueError, KeyError):
        # The body was not JSON or did not have the expected data/vs layout
        print("Unexpected response format!")
        return None

def getPage(VolId_List, VolNum_List):
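    '''
    Visit each volume page by its volume id and pass the page HTML on for parsing
    :param VolId_List: list of volume ids
    :param VolNum_List: list with the number of chapters in each volume
    :return: None
    '''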
    size = len(VolId_List)
    for i in range(size):
        path = 'C://Users//49881//Projects//PycharmProjects//Spider2起點中文網//大千界域//卷' + str(i + 1)
        mkdir(path)
        url = 'https://read.qidian.com/hankread/3144877/'+VolId_List[i]
        print('\nCurrent URL: '+url)
        headers = {
            'User-Agent': random_user_agent(),
            'Referer': 'https://book.qidian.com/info/3144877',
            'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_hankRead%22%2C%22eid%22%3A%22%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_P_hankRead%22%2C%22eid%22%3A%22%22%2C%22l1%22%3A2%7D; _csrfToken=BXnzDKmnJamNAgLu4O3GknYVL2YuNX5EE86tTBAm; newstatisticUUID=1564467217_1193332262; qdrs=0%7C3%7C0%7C0%7C1; showSectionCommentGuide=1; qdgd=1; e1=%7B%22pid%22%3A%22qd_P_limitfree%22%2C%22eid%22%3A%22qd_E01%22%2C%22l1%22%3A4%7D; e2=%7B%22pid%22%3A%22qd_P_free%22%2C%22eid%22%3A%22qd_A18%22%2C%22l1%22%3A3%7D; rcr=3144877%2C1013637116%2C1003541158; lrbc=3144877%7C52472447%7C0%2C1013637116%7C436231358%7C0%2C1003541158%7C309402995%7C0; bc=3144877'
        }
        try:
            # Pass the dict as headers= (params= would append it to the query string instead)
            res = requests.get(url=url, headers=headers, timeout=10)
            if res.status_code == 200:
                print('Started crawling volume '+str(i+1)+':')
                parsePage(res.text, url, path, int(VolNum_List[i]))
            else:
                print('No response')
                return None
        except ReadTimeout:
            print("ReadTimeout!")
            return None
        except RequestException:
            print("Error while requesting the page!")
            return None
        time.sleep(3)

def parsePage(html, url, path, chapNum):
    '''
    Parse a volume page and write each chapter to a txt file in the volume directory
    :param html: HTML of the volume page
    :param url: URL that was requested
    :param path: path of the volume directory
    :param chapNum: number of chapters in the volume (used for the progress display)
    :return: None
    '''
    if html is None:
        print('The page at '+url+' is empty')
        return
    soup = BeautifulSoup(html, 'lxml')
    ChapInfoList = soup.find_all('div', attrs={'class': 'main-text-wrap'})
    total = max(chapNum, 1)  # guard against division by zero in the progress display
    alreadySpiderNum = 0
    for i, chapInfo in enumerate(ChapInfoList):
        sys.stdout.write('\rCrawled {0:.2f}%'.format(alreadySpiderNum / total * 100))
        sys.stdout.flush()
        time.sleep(0.5)
        # Each element is already a Tag, so it can be searched without re-parsing
        ChapName = chapInfo.find('h3', attrs={'class': 'j_chapterName'}).span.string
        # Remove characters that are not allowed in Windows file names
        ChapName = re.sub(r'[\\/:*?"<>|]', '', ChapName)
        # '無題' means "untitled"; prefix the chapter number so file names stay unique
        if ChapName == '無題':
            ChapName = '第'+str(i+1)+'章 無題'
        filename = path+'//'+ChapName+'.txt'
        readContent = chapInfo.find('div', attrs={'class': 'read-content j_readContent'}).find_all('p')
        for item in readContent:
            # Write the paragraph text, one paragraph per line
            save2file(filename, item.get_text())
        alreadySpiderNum += 1
    sys.stdout.write('\rCrawled {0:.2f}%'.format(alreadySpiderNum / total * 100))


def save2file(filename, content):
    # Append one paragraph per line; the with-block closes the file automatically
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(content+'\n')

def mkdir(path):
    '''
    Create the directory for a volume
    :param path: directory to create
    :return: None
    '''
    if not os.path.exists(path):
        os.makedirs(path)
    else:
        print('Path '+path+' already exists')

def main():
    response = getJson()
    if response is not None:
        VolId_List = response['VolumeId_List']
        VolNum_List = response['VolumeNum_List']
        getPage(VolId_List, VolNum_List)
        # Only report completion when the catalog could be fetched
        print("Finished crawling the novel!")
    else:
        print('Unable to crawl this novel!')

if __name__ == '__main__':
    main()
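
The file-name handling in parsePage can also be isolated into a small helper, which makes it easy to test without touching the site. A minimal sketch: `safe_chapter_filename` is a hypothetical helper name, the sample titles are invented, and the numbered fallback for empty titles is an addition that the original code does not have.

```python
import re

# Characters that Windows does not allow in file names
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_chapter_filename(title, index):
    # Strip illegal characters; fall back to a numbered name if nothing is left
    cleaned = ILLEGAL_CHARS.sub('', title).strip()
    if not cleaned:
        cleaned = 'chapter_{0}'.format(index)
    return cleaned + '.txt'


print(safe_chapter_filename('Chapter 1: What?', 1))  # Chapter 1 What.txt
print(safe_chapter_filename('???', 2))               # chapter_2.txt
```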

  

