Scrapy Crawler Usage Guide: A Complete Tutorial

Reposted article. 2016-12-21 19:41. Tags: crawler / python / Python & Machine Learning / multithreading / scrapy

scrapy note

command

Global commands:

startproject: creates a Scrapy project named project_name inside a project_name directory.

scrapy startproject myproject

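settings: prints Scrapy settings; inside a project it shows the project's values, otherwise the Scrapy defaults.
runspider: runs a spider written in a standalone Python file, without creating a project.
shell: starts the Scrapy shell for the given URL, or an empty shell if no URL is given.
fetch: downloads the given URL with the Scrapy downloader and writes the content to standard output.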

scrapy fetch --nolog --headers http://www.example.com/

view: opens the given URL in a browser, rendered the way a Scrapy spider receives it.

scrapy view http://www.example.com/some/page.html

version: prints the Scrapy version.

Project-only commands:

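crawl: starts crawling with a spider.
scrapy crawl myspider
check: runs contract checks.
scrapy check -l
list: lists all available spiders in the current project, one per line.
edit: opens the given spider in the editor defined by the EDITOR environment variable or setting.
parse: fetches the given URL and parses it with the spider that handles it. If you pass --callback, that spider method is used; otherwise parse is used. Supported options: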

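--spider=SPIDER: bypass spider autodetection and force a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: don't colorize the output with pygments
--depth or -d: how many levels of links to follow (default: 1)
--verbose or -v: show detailed information for each depth level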
scrapy parse http://www.example.com/ -c parse_item

genspider: creates a new spider in the current project.

scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic example example.com

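deploy: deploys the project to a Scrapyd service (in more recent Scrapy releases this is done with scrapyd-deploy from the scrapyd-client package).
bench: runs a quick benchmark test.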

Using selectors

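from scrapy.http import HtmlResponse
from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
# Build a selector directly from a text string
Selector(text=body).xpath('//span/text()').extract()

# Or build it from a response object (str bodies need an explicit encoding)
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').extract()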

Scrapy provides two convenient shortcuts: response.xpath() and response.css()

>>> response.xpath('//base/@href').extract()
>>> response.css('base::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()

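Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: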

links = response.xpath('//a[contains(@href, "image")]')
for index, link in enumerate(links):
    # Query each link selector relative to itself
    args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    print('Link number %d points to url %s and image %s' % args)

Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of strings, so you cannot build nested .re() calls.

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

Working with relative XPaths

>>> divs = response.xpath('//div')
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside the <div> elements
...     print(p.extract())
>>> for p in divs.xpath('p'):  # extracts only the direct <p> children
...     print(p.extract())

For example, when XPath's starts-with() or contains() are not enough, the test() function (available under the re: prefix) can be very useful.

>>> sel.xpath('//li//@href').extract()
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
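To see which class values the pattern in the query above accepts, it can be tried on its own with Python's standard re module. This is only a sketch of the matching rule, not of Scrapy's internals, and the sample class strings are made up.

```python
import re

# Made-up class attribute values to test against the pattern
classes = ["item-0", "item-1", "item-10", "active item-2", "item-inactive"]

# Same pattern as in the XPath query: "item-" plus one digit at the end
pattern = re.compile(r"item-\d$")

matching = [c for c in classes if pattern.search(c)]
print(matching)  # ['item-0', 'item-1', 'active item-2']
```

Note that "item-10" is rejected because the pattern allows exactly one digit before the end of the string, something contains() or starts-with() cannot express.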

XPATH TIPS

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.
Beware of the difference between //node[1] and (//node)[1]
When selecting by class, be as specific as necessary
When querying by class, consider using CSS
Learn to use all the different axes
Useful trick to get text content

Item Loaders

populate items

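from scrapy.loader import ItemLoader
from myproject.items import Product  # assumes the project defines a Product item

def parse(self, response):
    # Collect values for each field, possibly from several page locations
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()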

Item Pipeline

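Typical uses of item pipelines:
Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing scraped items in a database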

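Writing your own item pipeline

Every item pipeline component must implement the process_item(self, item, spider) method. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by later pipeline components.
Parameters:

item (Item object) – the scraped item
spider (Spider object) – the spider that scraped the item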

Write items to MongoDB

import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        # insert_one() replaces the deprecated Collection.insert()
        self.db[collection_name].insert_one(dict(item))
        return item

To activate an item pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

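The integer assigned to each class determines the order in which they run: items pass through the pipelines from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range.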
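The ordering rule can be illustrated with plain Python. This is only a sketch of the rule, not Scrapy's own loading code; the dictionary is deliberately written out of order to show that the numbers, not the position, decide.

```python
# Same classes as in the setting above, listed out of order on purpose
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}

# Sort the class paths by their integer value, lowest first
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

Here PricePipeline (300) sees each item before JsonWriterPipeline (800), so the writer receives items that have already been validated.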

Practical experience

Running multiple spiders in the same process

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

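Avoiding getting banned

Use a pool of user agents and rotate among them. The pool should contain the user agents of common browsers (a quick web search turns up plenty).
Disable cookies (see COOKIES_ENABLED); some sites use cookies to spot crawler behavior.
Set a download delay (2 seconds or higher). See the DOWNLOAD_DELAY setting.
If feasible, use Google cache to fetch the data instead of hitting the site directly.
Use a pool of IPs, for example the free Tor project or a paid service such as ProxyMesh.
Use a highly distributed downloader that circumvents bans for you, so you only need to focus on parsing pages. One example is Crawlera.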
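
Settings worth tuning for broad crawls:

Increase concurrency: CONCURRENT_REQUESTS = 100
Disable cookies: COOKIES_ENABLED = False
Disable retries: RETRY_ENABLED = False
Reduce the download timeout: DOWNLOAD_TIMEOUT = 15
Disable redirects: REDIRECT_ENABLED = False
Enable crawling of "Ajax Crawlable Pages": AJAXCRAWL_ENABLED = True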
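The user-agent pool mentioned among the ban-avoidance tips could be sketched as a small downloader middleware. The class name, the module path used below, and the two user-agent strings are placeholders made up for this example; a real pool would hold current browser strings.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a user agent at random per request."""

    # Hypothetical pool; replace with real browser user-agent strings
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request reaches the downloader
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request
```

Assuming the class lives in myproject/middlewares.py, it would be enabled through the DOWNLOADER_MIDDLEWARES setting, for instance with the entry 'myproject.middlewares.RandomUserAgentMiddleware': 400.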

Useful Firefox add-ons for scraping

Firebug
XPather
XPath Checker
Tamper Data
Firecookie
Auto-throttling: AUTOTHROTTLE_ENABLED = True

other

Scrapyd
Spider middleware
Downloader middleware
Built-in settings reference
Requests and Responses
Scrapy tutorial
