Scrapy is a fun crawling framework. The basic usage: feed it a bunch of start URLs, let the spider GET those pages, then parse each page and pull out whatever you are interested in.
Using it feels a lot like Django: there are settings, there are Fields, and it auto-generates a bunch of scaffolding for you.
Usage: scrapy-admin.py startproject abc generates a project. Try it and you will see what gets created.
Create a new .py file inside the spiders package and write your custom spider class there.
The custom spider class must define the attributes domain_name (renamed to name in later Scrapy releases) and start_urls, plus an instance method parse(self, response).
It gets instantiated when Scrapy looks up our spiders and is picked up automatically by the Scrapy engine.
How a spider runs:
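To make that concrete, here is a minimal skeleton of such a spider file; the file name and class name are placeholders I made up, and it follows the attribute names used later in this post:

# abc/spiders/example_spider.py -- hypothetical file inside the generated project
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"                          # unique spider name (older releases called it domain_name)
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # inspect response.body here and return Items and/or Requests
        return []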
-
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests. The key to this first step is start_requests(): by default it builds the initial Requests from start_urls, with parse as the callback.
-
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback. Inside the parse callback you can return Requests, Items, or a generator that yields them. The URLs eventually get handed to the downloader, and an endless stream of URLs and Items flows out of this loop.
-
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data. You can use whatever selector you like; Scrapy does not care how you produce the Items, it just happens to ship an XPath-based selector. I have seen people use lxml; I prefer BeautifulSoup myself, even though it is the slowest of the bunch.
-
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports. In the end the Items are handed to the Item Pipeline, where you can process them however you like: save them to a database, write them to a file, and so on (a pipeline sketch follows this list).
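To make that last step concrete, here is a minimal sketch of an Item Pipeline stage that just appends scraped items to a text file. The class name and output path are made up, it assumes the items carry a "content" field like the spider below, and it only runs if registered in settings.py (ITEM_PIPELINES):

# pipelines.py -- hypothetical pipeline stage
class WriteToFilePipeline(object):

    def process_item(self, item, spider):
        # append the scraped text to a file, one item per line
        with open("items.txt", "a") as f:
            f.write(item["content"].encode("utf-8") + "\n")
        # always return the item so later pipeline stages still see it
        return item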
Here is the spider I used this month to crawl qiushibaike.com:

from scrapy.spider import BaseSpider
import random, uuid
from BeautifulSoup import BeautifulSoup as BS
from scrapy.selector import HtmlXPathSelector

from tutorial.items import TutorialItem

def getname():
    # .hex is an attribute of a uuid object, not a method
    return uuid.uuid1().hex

class JKSpider(BaseSpider):
    name = 'joke'
    allowed_domains = ["qiushibaike.com"]
    start_urls = [
        "http://www.qiushibaike.com/month?slow",
    ]

    def parse(self, response):
        root = BS(response.body)   # also parsed with BeautifulSoup, though only the XPath selector below is used
        items = []
        x = HtmlXPathSelector(response)

        # the joke text lives in <div class="content" title="..."> elements
        y = x.select("//div[@class='content' and @title]/text()").extract()
        for i in y:
            item = TutorialItem()
            item["content"] = i
            items.append(item)

        return items
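The spider imports TutorialItem from tutorial/items.py, which I did not paste; it presumably looks something like this (a single content field). The whole thing is then run with scrapy crawl joke.

# tutorial/items.py -- assumed definition matching the spider above
from scrapy.item import Item, Field

class TutorialItem(Item):
    content = Field()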
Scrapy comes with some useful generic spiders that you can use, to subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing a XML/CSV feed.
Scrapy ships with several generic spiders that are handy to subclass, e.g. crawling a whole site by following links, crawling from a Sitemap, or pulling URLs out of an XML/CSV feed (see the CrawlSpider sketch below).
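A whole-site crawl, for example, is usually done by subclassing CrawlSpider with link-extraction rules. A rough sketch, assuming the contrib import paths of this Scrapy generation and with made-up domain and callback names:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SiteSpider(CrawlSpider):
    name = "example_site"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # follow every internal link and hand each fetched page to parse_page
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        # extract whatever you need from hxs and return Items here
        return []

Note that with CrawlSpider you write your own callback (parse_page here) instead of overriding parse, which CrawlSpider uses internally.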
class scrapy.spider.BaseSpider
This is the simplest spider, and the one from which every other spider must inherit from (either the ones that come bundled with Scrapy, or the ones that you write yourself). It doesn’t provide any special functionality. It just requests the given start_urls/start_requests, and calls the spider’s method parse for each of the resulting responses.
This is the base class that every spider inherits from. It has no special functionality; it just requests start_urls/start_requests and uses parse as the default callback.
- name
-
A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.
If the spider scrapes a single domain, a common practice is to name the spider after the domain, or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.
name must be unique, so naming the spider after the domain it crawls is a good habit; domains are about as unique as it gets.
- allowed_domains
-
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.
- URLs that do not belong to these domains will not be followed, provided OffsiteMiddleware is enabled.
- start_urls
-
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
- The list of start URLs; nothing more to say.
- start_requests()
-
This method must return an iterable with the first Requests to crawl for this spider.
In plain words: it must return an iterable that produces the Requests. This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it's safe to implement it as a generator.
This method is called when no particular URLs are specified (which I take to mean passing URLs as arguments when launching a crawl from the command line). If specific start URLs are given, make_requests_from_url() is used to build the Requests instead. This method is only called once.
The default implementation uses make_requests_from_url() to generate Requests for each url in start_urls.
If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:
By default it calls make_requests_from_url() to generate a Request for each URL in start_urls. Override it if you want to customize how the initial Requests are generated, e.g. to log in first:

# requires: from scrapy.http import FormRequest
def start_requests(self):
    return [FormRequest("http://www.example.com/login",
                        formdata={'user': 'john', 'pass': 'secret'},
                        callback=self.logged_in)]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for
    # each of them, with another callback
    pass
That way you can scrape data that is only visible after logging in.
- make_requests_from_url(url)
-
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert urls to requests.
Unless overridden, this method returns Requests with the parse() method as their callback function, and with dont_filter parameter enabled (see Request class for more info).
This is what was just described: it turns a URL into a Request, and the resulting Request gets parse() as its callback.
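For instance, a sketch of overriding it to attach a custom header to every start request; the header value is just an example:

# requires: from scrapy.http import Request
def make_requests_from_url(self, url):
    # same defaults as the stock version, plus a custom User-Agent header
    return Request(url, headers={"User-Agent": "my-crawler/0.1"}, dont_filter=True)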
- parse(response)
-
This is the default callback used by Scrapy to process downloaded responses, when their requests don’t specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Requests callbacks have the same requirements as the BaseSpider class.
This method, as well as any other Request callback, must return an iterable of Request and/or Item objects.
Parameters:
response (scrapy.http.Response) – the response to parse
- This method must return an iterable of Requests and/or Items.
- log(message[, level, component])
-
Log a message using the scrapy.log.msg() function, automatically populating the spider argument with the name of this spider. For more information see Logging.
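A quick sketch of calling it from a callback, assuming the usual level constants from scrapy.log:

# requires: from scrapy import log
def parse(self, response):
    self.log("crawled %s" % response.url, level=log.INFO)
    return []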
Example:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for h3 in hxs.select('//h3').extract():
            yield MyItem(title=h3)

        for url in hxs.select('//a/@href').extract():
            yield Request(url, callback=self.parse)
