Scrapy Notes: Using rules in CrawlSpider

Reposted article. Originally published 2017-05-03 16:34. Tags: Scrapy, python

Using the scrapy.spiders.crawl.CrawlSpider class

This class is well suited to crawling many pages of a site in bulk. Compared with the Spider class, CrawlSpider relies mainly on rules to extract links.

    rules = (
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+/')), callback="parse_item1"),
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/.+')), callback="parse_item2"),
    )

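If you have worked with Django, you will notice that these rules closely resemble Django's URL routing. CrawlSpider uses its rules attribute to extract URLs from the text of each response and then creates new requests for them automatically. Unlike Spider, CrawlSpider already implements parse itself, which is why the examples in the official Scrapy documentation do not override it. For the same reason, you should not name one of your own callbacks parse.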
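To see how the two rules above divide links between them, here is a minimal sketch using only the standard library. It is a model, not Scrapy code. It assumes two behaviours as I understand them from the Scrapy documentation: an allow pattern is searched for anywhere in the absolute URL, and when several rules match the same link, only the first one defined is used. The function name pick_callback and the sample URLs are made up for illustration.

```python
import re

# (pattern, callback name) pairs, in the same order as the rules tuple above.
RULES = [
    (re.compile(r'https://movie.douban.com/subject/\d+/'), "parse_item1"),
    (re.compile(r'https://movie.douban.com/subject/.+'), "parse_item2"),
]


def pick_callback(url):
    """Return the callback of the first rule whose pattern is found in url."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return callback
    return None  # no rule matches, so the link would be ignored


# Hypothetical sample links.
for url in (
    "https://movie.douban.com/subject/1292052/",
    "https://movie.douban.com/subject/1292052/comments",
    "https://movie.douban.com/subject/abc",
    "https://movie.douban.com/tag/",
):
    print(url, "->", pick_callback(url))
```

Because the second pattern also matches every URL that the first one matches, parse_item2 in this model only ever receives the subject links that the first rule did not claim.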

Scrapy does all of this automatically. The process is as follows:

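When you run scrapy crawl spidername, Scrapy builds a Request for each entry in start_urls and sends it, then calls parse on each response. During parsing, the rules are applied to the HTML (or XML) text to extract matching links, and each extracted link becomes a new Request. This cycle repeats until the responses contain no more matching links, or the scheduler has no Request objects left. Only then does the crawl stop.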
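The cycle described above can be sketched as a toy loop over an in-memory site. This is a simplified model under stated assumptions, not how Scrapy is implemented: the SITE dictionary, the ALLOW pattern and the crawl function are hypothetical, a deque stands in for the scheduler, a set stands in for the duplicate filter, and the real scheduler may visit pages in a different order.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "/tag/": ["/subject/1/", "/subject/2/", "/about"],
    "/subject/1/": ["/subject/2/", "/subject/3/"],
    "/subject/2/": ["/subject/1/"],
    "/subject/3/": [],
    "/about": ["/subject/4/"],
}
ALLOW = re.compile(r"/subject/\d+/")  # stands in for one rule's allow pattern


def crawl(start_urls):
    """Fetch pages until no pending request is left; return the visit order."""
    queue = deque(start_urls)  # plays the role of the scheduler
    seen = set(start_urls)     # plays the role of the duplicate filter
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)             # "download" the page
        for link in SITE.get(url, []):  # links found in the response
            if ALLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)      # a new request for a matching link
    return visited


print(crawl(["/tag/"]))
# → ['/tag/', '/subject/1/', '/subject/2/', '/subject/3/']
```

The page /about is never requested because no rule matches it, so /subject/4/, which is linked only from there, is never discovered. The loop ends once the queue is empty.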

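If a rule does not specify a callback, its responses are handled only by CrawlSpider's built-in parsing, which at most extracts further links and yields no items. If a callback is specified, the response is passed to that custom parsing function.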

If the start URLs need to be parsed differently, you can override another CrawlSpider method, parse_start_url(self, response), which handles the responses for the URLs in start_urls. This is optional.

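A Rule's follow parameter specifies whether links should be followed from each response obtained through that rule. If follow is omitted, it defaults to True when the rule has no callback and to False when it has one.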
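The Scrapy documentation states that follow defaults to True when a rule has no callback and to False otherwise. The sketch below only models that default in plain Python so the three common cases can be checked at a glance. resolve_follow is a hypothetical helper, not part of Scrapy.

```python
def resolve_follow(callback=None, follow=None):
    """Model the documented default of Rule's follow parameter."""
    if follow is not None:
        return follow        # an explicit value always wins
    return callback is None  # default: follow only when there is no callback


print(resolve_follow())                                     # → True
print(resolve_follow(callback="parse_item"))                # → False
print(resolve_follow(callback="parse_item", follow=True))   # → True
```

For a rule such as Rule(LinkExtractor(...), callback="parse_item"), this means that pages handled by parse_item are not searched for further links unless follow=True is passed explicitly.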

Reference: http://scrapy-chs.readthedocs.io/zh_CN/stable/topics/spiders.html#crawling-rules

#!/usr/bin/python
# -*- coding: utf-8 -*-

import scrapy
from tutorial01.items import MovieItem
from scrapy.spiders.crawl import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class DoubanmoviesSpider(CrawlSpider):
    name = "doubanmovies"
    allowed_domains = ["douban.com"]
    start_urls = ['https://movie.douban.com/tag/']
    # HTTP basic authentication uses the http_user and http_pass attributes:
    # http_user = 'username'
    # http_pass = 'password'

    # URLs matching the regular expression are extracted from each response
    # and requested; the resulting responses are parsed by the callback.
    rules = (
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+/')), callback="parse_item"),
        # Rule(LinkExtractor(allow=(r'https://movie.douban.com/tag/\[wW]+'))),  # extract http links from the page
    )

    def parse_item(self, response):
        # extract_first() returns None instead of raising IndexError
        # when the XPath matches nothing.
        movie = MovieItem()
        movie['name'] = response.xpath('//*[@id="content"]/h1/span[1]/text()').extract_first()
        movie['director'] = '/'.join(response.xpath('//a[@rel="v:directedBy"]/text()').extract())
        movie['writer'] = '/'.join(response.xpath('//*[@id="info"]/span[2]/span[2]/a/text()').extract())
        movie['url'] = response.url
        movie['score'] = response.xpath('//*[@class="ll rating_num"]/text()').extract_first()
        movie['collections'] = response.xpath('//span[@property="v:votes"]/text()').extract_first()  # number of ratings
        movie['pub_date'] = response.xpath('//span[@property="v:initialReleaseDate"]/text()').extract_first()
        movie['actor'] = '/'.join(response.css('span.actor span.attrs').xpath('.//a[@href]/text()').extract())
        movie['classification'] = '/'.join(response.xpath('//span[@property="v:genre"]/text()').extract())
        print('movie:%s  |url:%s' % (movie['name'], movie['url']))
        return movie

    def parse_start_url(self, response):
        urls = response.xpath('//div[@class="article"]//a/@href').extract()
        for url in urls:
            if 'https' not in url:  # keep only relative links; skip the rest
                url = response.urljoin(url)  # complete it into an absolute URL
                print(url)
                print('*' * 30)
                # No callback is given, so CrawlSpider's own parse handles the response.
                yield scrapy.Request(url)
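The link completion step in parse_start_url can be tried outside Scrapy with the standard library. This sketch assumes that response.urljoin resolves a relative link against the URL of the page it was found on, much as urllib.parse.urljoin does. The page URL, the sample hrefs and the is_relative helper are hypothetical. Checking the parsed scheme is shown as a stricter alternative to the 'https' not in url test, which would also skip a relative link that merely contains the text "https".

```python
from urllib.parse import urljoin, urlparse

page_url = "https://movie.douban.com/tag/"  # page the links were found on
hrefs = [                                   # hypothetical href values
    "/tag/2017",
    "?view=cloud",
    "https://movie.douban.com/subject/1292052/",
]


def is_relative(href):
    """True when href carries no scheme of its own, such as 'https:'."""
    return urlparse(href).scheme == ""


for href in hrefs:
    if is_relative(href):
        # Resolve the relative link against the page URL.
        print(urljoin(page_url, href))
# → https://movie.douban.com/tag/2017
# → https://movie.douban.com/tag/?view=cloud
```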
