A Complete Tutorial on Scraping a Novel Website with Python (Full Code Included)

Reposted article. 2019-08-12 17:56. Tags: novel scraping / python / scrapy / 章羲

This tutorial scrapes website data with Python, using the Scrapy crawling framework.

1. Install the Scrapy framework

Run on the command line:

pip install scrapy

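If Scrapy's dependencies conflict with other Python packages you already have installed, installing inside a Virtualenv is recommended.

Once installation finishes, pick any folder and create the crawler project in it:

scrapy startproject your_project_name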

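Project directory layout

The crawling rules go in the spiders directory.

items.py: the data fields to scrape

pipelines.py: saves the scraped data

settings.py: configuration

middlewares.py: downloader middlewares

Below is the source code for scraping a novel website.

First, define the data to scrape in items.py: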

# 2019-08-12 17:41:08
# author zhangxi<1638844034@qq.com>
import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here like:
    i = scrapy.Field()                  # position of the chapter in the chapter list
    book_name = scrapy.Field()          # book title
    book_img = scrapy.Field()           # cover image URL
    book_author = scrapy.Field()        # author line
    book_last_chapter = scrapy.Field()  # latest chapter line
    book_last_time = scrapy.Field()     # last update line
    book_list_name = scrapy.Field()     # chapter title
    book_content = scrapy.Field()       # chapter text

Write the scraping rules (a spider file in the spiders directory):

# 2019-08-12 17:41:08
# author zhangxi<1638844034@qq.com>
import scrapy

from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = [
        'http://www.xbiquge.la/xiaoshuodaquan/'
    ]

    def parse(self, response):
        # Index page: request the page of every book in the first novel list
        book_entries = response.css('.novellist:first-child>ul>li')
        for entry in book_entries:
            book_url = entry.css('a::attr(href)').extract_first()
            if book_url:
                yield scrapy.Request(response.urljoin(book_url), callback=self.parse_book)

    def parse_book(self, response):
        # Book page: collect the book metadata, then request every chapter
        book_info = {
            'book_name': response.css('#info>h1::text').extract_first(),
            'book_img': response.css('#fmimg>img::attr(src)').extract_first(),
            'book_author': response.css('#info p:nth-child(2)::text').extract_first(),
            'book_last_chapter': response.css('#info p:last-child::text').extract_first(),
            'book_last_time': response.css('#info p:nth-last-child(2)::text').extract_first(),
        }
        chapter_urls = response.css('#list>dl>dd>a::attr(href)').extract()
        for i, chapter_url in enumerate(chapter_urls, start=1):
            # Record the order in which chapters are listed, so the data can be saved in order
            yield scrapy.Request(
                response.urljoin(chapter_url),
                meta={**book_info, 'i': i},
                callback=self.parse_chapter,
            )

    def parse_chapter(self, response):
        # Chapter page: build one item per chapter
        self.log(response.meta['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        item['i'] = response.meta['i']
        item['book_name'] = response.meta['book_name']
        item['book_img'] = response.meta['book_img']
        item['book_author'] = response.meta['book_author']
        item['book_last_chapter'] = response.meta['book_last_chapter']
        item['book_last_time'] = response.meta['book_last_time']
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
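The chapter links found on a book page are paths rather than full addresses, so each one has to be turned into an absolute URL before it can be requested. The snippet below is a minimal sketch of that step using the standard library's urljoin; Scrapy's response.urljoin works along the same lines with the response URL as the base. The book page URL and the hrefs are made-up examples, and resolve_all is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Hypothetical book page URL and chapter hrefs, for illustration only.
BOOK_PAGE = "http://www.xbiquge.la/10/10489/"
HREFS = [
    "/10/10489/4534454.html",            # root-relative path
    "4534455.html",                      # relative to the book page
    "http://www.xbiquge.la/1/1/2.html",  # already absolute
]


def resolve_all(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == "__main__":
    for url in resolve_all(BOOK_PAGE, HREFS):
        print(url)
```

Resolving against the page URL handles all three kinds of link, whereas prepending the site root by hand only works for root-relative paths.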

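Save the data (pipelines.py). The pipeline only runs once it is enabled under ITEM_PIPELINES in settings.py.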

import os


class BookspiderPipeline:
    def process_item(self, item, spider):
        # Root output folder; change it to a location that suits your machine
        curPath = 'E:/小說/'
        # One sub-folder per book
        tempPath = str(item['book_name'])
        targetPath = curPath + tempPath
        os.makedirs(targetPath, exist_ok=True)
        # File name = chapter sequence number + chapter title
        book_list_name = str(item['i']) + str(item['book_list_name'])
        filename_path = targetPath + '/' + book_list_name + '.txt'
        print('------------')
        print(filename_path)
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
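The pipeline builds each file name from the sequence number and the chapter title. Chapter titles can contain characters that Windows does not accept in file names (for example ? or :), and unpadded numbers sort as 1, 10, 2 in a file browser. The following is a minimal sketch of one way to guard against both; chapter_filename and its width parameter are hypothetical names, not part of the original project.

```python
import re

# Characters that Windows does not allow in file names.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def chapter_filename(index, title, width=5):
    """Build a sortable, filesystem-safe file name for one chapter.

    index: position of the chapter in the chapter list (the item's 'i' field)
    title: chapter title as scraped; may be None if the selector matched nothing
    width: number of digits the index is zero-padded to (an assumed default)
    """
    safe_title = _ILLEGAL_CHARS.sub("_", title or "untitled").strip()
    return f"{index:0{width}d}_{safe_title}.txt"


if __name__ == "__main__":
    print(chapter_filename(3, "Chapter 3: What?"))
```

Inside process_item this could replace the manual concatenation, for example chapter_filename(item['i'], item['book_list_name']).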

Run

scrapy crawl BookSpider

That completes the scraping of a novel site.
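Scrapy downloads chapters concurrently, so the chapter files are written in no particular order; the sequence number at the start of each file name is what restores the reading order. As a rough sketch of how that number can be used afterwards, the snippet below merges one book's chapter files into a single text file. It assumes every file name begins with the sequence number, as the pipeline produces (a chapter title that itself starts with a digit would confuse it); merge_chapters and leading_number are hypothetical helper names.

```python
import re
from pathlib import Path


def leading_number(path):
    """Return the integer prefix of a file name, or -1 if there is none."""
    match = re.match(r"\d+", path.name)
    return int(match.group()) if match else -1


def merge_chapters(book_dir, output_file):
    """Concatenate every chapter .txt in book_dir into output_file, in numeric order.

    output_file should live outside book_dir so it is not picked up on a later run.
    Returns the chapter file names in the order they were written.
    """
    chapters = sorted(Path(book_dir).glob("*.txt"), key=leading_number)
    with open(output_file, "w", encoding="utf-8") as out:
        for chapter in chapters:
            out.write(chapter.stem + "\n")  # file name as a chapter heading
            out.write(chapter.read_text(encoding="utf-8") + "\n\n")
    return [chapter.name for chapter in chapters]
```

Sorting on the parsed integer rather than on the raw name keeps chapter 2 ahead of chapter 10 even when the numbers are not zero-padded.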

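While writing the rules, it is worth using

scrapy shell url_of_the_page_to_scrape

and then calling response.css('') with your selector to check that the rule is correct.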

Source code for this tutorial on GitHub: https://github.com/zhangxi-key/py-book.git
