Notes:
This article follows the dmoz spider example from the official Scrapy documentation.
That example is fairly old, however, and the page structure of dmoz.org has changed since it was written, so I have adjusted the XPath expressions accordingly.
Overview:
This article presents three introductory Scrapy use cases:
- crawling a single page
- starting from an index page, crawling every page it links to
- crawling the first page, then following its link to the next page, and so on until the last page
Scenarios two and three can both be considered "following links".
The defining trait of link following is that the parse function must end by yielding an instance of the Request class with a callback function.
Environment used for this article: Windows 7 (64-bit) + Python 3.5 (64-bit) + Scrapy 1.2
Scenario 1
Description:
Crawl the content of a single page.
Sample code:
import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = div.xpath('a/div/text()').extract_first().strip()
            item['link'] = div.xpath('a/@href').extract_first()
            # note: the trailing space in "site-descr " matches the actual class attribute on dmoz.org
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield item
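One pitfall worth noting: extract_first() returns None when the XPath matches nothing, so chaining .strip() onto it raises AttributeError on pages whose structure differs slightly. A defensive pattern (the default= keyword of extract_first, or a tiny helper) avoids that:

```python
def first_or_empty(value):
    """Coalesce None (no XPath match) to '' before stripping whitespace.

    Equivalent in effect to Scrapy's extract_first(default='') followed
    by .strip().
    """
    return (value or '').strip()

# With a match, behaves like plain .strip()
print(first_or_empty('  Python Books  '))  # → Python Books
# With no match (None), returns '' instead of raising AttributeError
print(first_or_empty(None))  # → (empty string)
```

In the spider itself the same effect can be had with div.xpath('...').extract_first(default='').strip().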
Scenario 2
Description:
- ① Enter the index page and extract the links.
- ② Then crawl the content of the pages those links point to.
The callback of the yield scrapy.Request in ① points to ②.
From the official docs:
...extract the links for the pages you are interested, follow them and then extract the data you want for all of them.
Sample code:
import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/'  # this is the index page
    ]

    def parse(self, response):
        for a in response.xpath('//section[@id="subcategories-section"]//div[@class="cat-item"]/a'):
            url = response.urljoin(a.xpath('@href').extract_first().split('/')[-2])
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = div.xpath('a/div/text()').extract_first().strip()
            item['link'] = div.xpath('a/@href').extract_first()
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield item
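The url line above deserves a closer look: split('/')[-2] grabs the last path segment of an href like /Computers/.../Python/Books/, and response.urljoin resolves that relative segment against the current page's URL (internally it is the standard library's urllib.parse.urljoin applied to response.url). A quick stand-alone illustration of the same two steps:

```python
from urllib.parse import urljoin

# A sample subcategory href as it appears on the dmoz index page
href = '/Computers/Programming/Languages/Python/Books/'

# split('/')[-2] picks the final path segment ('' trails after the last '/')
slug = href.split('/')[-2]
print(slug)  # → Books

# response.urljoin(slug) resolves it against the index page's own URL
base = 'http://www.dmoz.org/Computers/Programming/Languages/Python/'
url = urljoin(base, slug)
print(url)  # → http://www.dmoz.org/Computers/Programming/Languages/Python/Books
```

Passing the full href to response.urljoin would work just as well; the slug trick merely rebuilds the child URL relative to the index page.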
Scenario 3
Description:
- ① Enter a page, crawl its content, and extract the link to the next page.
- ② Then crawl the content of the page that next-page link points to.
The callback of the yield scrapy.Request in ① points back to ① itself.
From the official docs:
A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then yields a Request with the same callback for it
Sample code:
import scrapy

from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Notes:
Scenario 3 has not been tested!
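The "same callback" pattern of scenario 3 can be sketched without Scrapy at all. Below is a pure-Python simulation against a hypothetical in-memory "site" (the PAGES dict and its URLs are invented for illustration): the parse function yields the items on the current page, then follows the next-page link by calling itself, exactly as the Scrapy callback re-schedules itself via scrapy.Request:

```python
from urllib.parse import urljoin

# Hypothetical site: each page has some items and an optional link to the next page
PAGES = {
    'http://www.example.com/1.html': {'items': ['a'], 'next': '/2.html'},
    'http://www.example.com/2.html': {'items': ['b'], 'next': '/3.html'},
    'http://www.example.com/3.html': {'items': ['c'], 'next': None},
}

def parse(url):
    """Mimic the Scrapy callback: yield this page's items, then follow
    the next-page link with the same callback (here: a recursive call)."""
    page = PAGES[url]
    for title in page['items']:
        yield {'title': title}
    if page['next']:
        # In Scrapy this would be: yield scrapy.Request(next_url, callback=self.parse)
        yield from parse(urljoin(url, page['next']))

collected = [item['title'] for item in parse('http://www.example.com/1.html')]
print(collected)  # → ['a', 'b', 'c']
```

One caveat about the real spider above: //a/@href often yields relative URLs, and scrapy.Request expects absolute ones, so in practice you would wrap each extracted href in response.urljoin(url) before yielding the Request.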