Crawling multiple URLs with Scrapy


 I. Single-page crawling

  1. Create the project
    scrapy startproject qiubaiByPages
  2. Create the spider file
    scrapy genspider qiubai www.qiushibaike.com/text
  3. Define the item template (items.py)
    import scrapy

    class QiubaibypagesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        author = scrapy.Field()
        content = scrapy.Field()
  4. Write the parsing code
    # -*- coding: utf-8 -*-
    import scrapy
    from qiubaiByPages.items import QiubaibypagesItem
    
    class QiubaiSpider(scrapy.Spider):
        name = 'qiubai'
        # Domain only: an entry that contains a path never matches a request's host
        allowed_domains = ['www.qiushibaike.com']
        start_urls = ['http://www.qiushibaike.com/text/']
    
        def parse(self, response):
            div_list = response.xpath('//div[@id="content-left"]/div')
            for div in div_list:
                author = div.xpath("./div/a[2]/h2/text()").extract_first()
                content = div.xpath("./a/div/span/text()").extract_first()

                # Create an item object and store the parsed data in it
                item = QiubaibypagesItem()
                item["author"] = author
                item["content"] = content
                yield item
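  5. Write the persistence logic (pipelines.py)
    class QiubaibypagesPipeline(object):
        fp = None

        def open_spider(self, spider):
            # Called once when the spider starts: open the output file
            print("Spider started")
            self.fp = open("./qiubaipages.txt", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # extract_first() returns None when nothing matched, so fall back to ""
            author = item["author"] or ""
            content = item["content"] or ""
            # Write one record per line
            self.fp.write(author.strip() + ":" + content.strip() + "\n")
            return item

        def close_spider(self, spider):
            # Called once when the spider closes: release the file handle
            self.fp.close()
            print("Spider finished")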
  6. Edit settings.py: disable robots.txt compliance (ROBOTSTXT_OBEY), add a User-Agent header, and register the pipeline in ITEM_PIPELINES
  7. Start the spider
    scrapy crawl qiubai --nolog

  8. Wait for the crawl to finish
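
Step 6 names three settings changes without showing them. A minimal sketch of the relevant lines in settings.py might look like the following. The User-Agent string and the priority value 300 are illustrative assumptions, and the pipeline path assumes the project and class names used in the steps above.

```python
# settings.py (only the lines touched in step 6)

# Do not fetch or obey robots.txt before crawling
ROBOTSTXT_OBEY = False

# Illustrative browser-like User-Agent (assumption); any realistic value works
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0 Safari/537.36"
)

# Enable the pipeline from step 5; pipelines with lower numbers run first
ITEM_PIPELINES = {
    "qiubaiByPages.pipelines.QiubaibypagesPipeline": 300,
}
```

Without the ITEM_PIPELINES entry, the spider still yields items but the pipeline is never called, so qiubaipages.txt is not written.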

 II. Multi-page crawling

 Sending requests manually

# -*- coding: utf-8 -*-
import scrapy
from qiubaiByPages.items import QiubaibypagesItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # Domain only: with a path in this list, the follow-up page requests would be filtered as offsite
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    # Generic URL template for the paginated listing
    url = "https://www.qiushibaike.com/text/page/%d/"
    pageNum=1

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath("./div/a[2]/h2/text()").extract_first()
            content = div.xpath("./a/div/span/text()").extract_first()

            # Create an item object and store the parsed data in it
            item = QiubaibypagesItem()
            item["author"] = author
            item["content"] = content
            yield item
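        print("Crawled page %d" % self.pageNum)
        # Send the next request manually; 13 is the number of the last page,
        # so stop once that page has been parsed
        if self.pageNum < 13:
            self.pageNum += 1
            new_url = self.url % self.pageNum
            # callback: the method that parses the page this request returns
            yield scrapy.Request(url=new_url, callback=self.parse)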
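
The paging rule in the spider can be checked without running a crawl. The sketch below pulls it into plain functions. `next_page_url`, `all_page_urls`, `URL_TEMPLATE`, and `LAST_PAGE` are hypothetical names introduced here for illustration, and it assumes, as the spider's comment states, that 13 is the last page, so no URL beyond page 13 should be produced.

```python
# URL template taken from the spider above
URL_TEMPLATE = "https://www.qiushibaike.com/text/page/%d/"
# Last page number, as stated in the spider's comment (assumed still accurate)
LAST_PAGE = 13


def next_page_url(current_page, last_page=LAST_PAGE):
    """Return the URL of the page after current_page, or None on the last page."""
    if current_page >= last_page:
        return None
    return URL_TEMPLATE % (current_page + 1)


def all_page_urls(last_page=LAST_PAGE):
    """List the paginated URLs from page 2 up to last_page, in crawl order."""
    urls = []
    page = 1  # page 1 is the start URL, so the first follow-up is page 2
    while True:
        url = next_page_url(page, last_page)
        if url is None:
            break
        urls.append(url)
        page += 1
    return urls


if __name__ == "__main__":
    for url in all_page_urls():
        print(url)
```

Inside `parse`, the spider would yield `scrapy.Request(url, callback=self.parse)` whenever such a helper returns a URL and stop when it returns None.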