Background:
When we crawl a site we usually scrape specific content out of specific tags, but a site's index page typically links out to a large number of item or detail pages. Picking content out of just one fixed block is inefficient, and most sites present their information through a fixed template anyway. This is exactly where LinkExtractor shines for whole-site crawling: by setting parameters such as XPath or CSS expressions you can collect every link you care about across the whole site, instead of only the links under one fixed tag.
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # only extract links that sit inside the <ul class="cont_xiaoqu"> list items
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        print(links)
links is a list.
Let's iterate over it:
for link in links:
    print(link)
links holds the URLs we want to extract, so how do we actually get at them?
Inside the for loop, link.url and link.text give us the URL and the anchor text directly:
for link in links:
    print(link.url, link.text)
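Each element of links is in fact a scrapy.link.Link object, so besides url and text it also carries fragment and nofollow, and the URL can be fed straight back into the crawl. A minimal sketch of that, assuming a hypothetical parse_detail callback that is not part of the original example:

import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsLinksSpider(scrapy.Spider):
    # hypothetical spider name, used only for this sketch
    name = "weids_links"
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        extractor = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        for link in extractor.extract_links(response):
            # each extracted item is a Link object with url, text, fragment and nofollow
            print(link.url, link.text, link.nofollow)
            # hand the URL back to Scrapy so the detail page itself gets crawled;
            # parse_detail is a hypothetical callback, not defined in the original post
            yield scrapy.Request(link.url, callback=self.parse_detail)

    def parse_detail(self, response):
        # placeholder: parse the detail page here
        print(response.url)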
And restrict_xpaths is not the only extraction option LinkExtractor has; it accepts quite a few other parameters.
>allow: accepts a regular expression or a list of regular expressions and extracts only the links whose absolute URL matches; if the parameter is left empty, every link is extracted.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # only keep links whose absolute URL matches this pattern
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(allow=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
>deny: accepts a regular expression or a list of regular expressions and does the opposite of allow, excluding links whose absolute URL matches; in other words, anything the pattern matches is not extracted.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # drop every link whose absolute URL matches this pattern
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(deny=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
>allow_domains: accepts a domain or a list of domains and extracts only the links pointing at those domains.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # only extract links that point at one of these domains
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(allow_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
>deny_domains: the opposite of allow_domains; accepts a domain or a list of domains to reject, extracting every link except those pointing at the denied domains.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # extract everything except links pointing at these domains
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(deny_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
>restrict_xpaths: what the very first example used; accepts an XPath expression or a list of XPath expressions and extracts only the links found inside the regions those expressions select.
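Because restrict_xpaths also accepts a list, you can pull links from several regions of the page in one pass. A minimal sketch, where the second selector (//div[@class="hot_links"]) is made up purely for illustration:

from scrapy.linkextractors import LinkExtractor


def extract_region_links(response):
    # restrict extraction to every region matched by either XPath expression;
    # the second selector is a made-up example, not taken from the real page
    extractor = LinkExtractor(restrict_xpaths=[
        '//ul[@class="cont_xiaoqu"]/li',
        '//div[@class="hot_links"]',
    ])
    return extractor.extract_links(response)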
>restrict_css: works the same way but takes a CSS selector or a list of CSS selectors; it comes up just as often as restrict_xpaths, so it is worth mastering both. Personally I prefer XPath.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # same region as the first example, selected with CSS instead of XPath
        link = LinkExtractor(restrict_css='ul.cont_xiaoqu > li')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
>tags: accepts a tag name (string) or a list of tag names and extracts links only from those tags; the default is tags=('a', 'area').
>attrs: accepts an attribute name (string) or a list of attribute names and extracts links only from those attributes; the default is attrs=('href',). With the extraction set up as in the example below, the href value of every <a> tag on the page is extracted.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ['http://www.gaosiedu.com/gsschool/']

    def parse(self, response):
        # look at <a> tags and pull URLs out of their href attribute
        link = LinkExtractor(tags='a', attrs='href')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
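tags and attrs together also let you pull URLs out of elements other than ordinary links, for example <img> sources; note that LinkExtractor filters common file extensions (including images) by default, so deny_extensions has to be relaxed for this to return anything. A sketch under those assumptions, not something used in the original post:

from scrapy.linkextractors import LinkExtractor

# collect <img src="..."> URLs; deny_extensions=[] switches off the default
# extension filter, which would otherwise drop .jpg/.png/.gif links
img_extractor = LinkExtractor(tags=['img'], attrs=['src'], deny_extensions=[])


def extract_image_links(response):
    return img_extractor.extract_links(response)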
The examples above are only quick experiments with LinkExtractor on its own; in real projects its basic usage is in combination with CrawlSpider and Rule, like this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GaosieduSpider(CrawlSpider):
    name = "gaosiedu"
    allowed_domains = ["www.gaosiedu.com"]
    start_urls = ['http://www.gaosiedu.com/']
    restrict_xpath = '//ul[@class="schoolList clearfix"]'
    allow = r'/gsschool/.+\.shtml'  # defined here but not used by the Rule below

    rules = (
        Rule(LinkExtractor(restrict_xpaths=restrict_xpath),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        school_name = response.xpath('//div[@class="area_nav"]//h3/text()').extract_first()
        print(school_name)
Briefly: where we previously inherited from scrapy.Spider, here we inherit from CrawlSpider instead (CrawlSpider itself subclasses scrapy.Spider). rules is the required attribute name, not an arbitrary word, and it holds one or more Rule objects; writing it as a tuple or a list is the conventional style. Each Rule takes a LinkExtractor, which has already been covered above; for the remaining Rule parameters, see my other post: http://www.cnblogs.com/lei0213/p/7976280.html
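As a side note, the allow pattern defined in the spider above is never wired into its Rule; a minimal sketch of a variant spider (the name gaosiedu_allow is hypothetical) where that pattern drives the link extraction instead of an XPath region:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GaosieduAllowSpider(CrawlSpider):
    # hypothetical variant of the spider above, driven by the allow pattern
    name = "gaosiedu_allow"
    allowed_domains = ["www.gaosiedu.com"]
    start_urls = ['http://www.gaosiedu.com/']

    rules = (
        # follow every school detail page matching the pattern and parse it;
        # follow=True keeps extracting links from the pages that were followed
        Rule(LinkExtractor(allow=r'/gsschool/.+\.shtml'),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        school_name = response.xpath('//div[@class="area_nav"]//h3/text()').extract_first()
        print(school_name)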