Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.

When you’re ready to start a project, you can start with the tutorial.

Pick a website

So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information.

Let’s say we want to extract the URL, name, description and size of all torrent files added today in the Mininova site.

The list of all torrents added today can be found on this page: http://www.mininova.org/today

Define the data you want to scrape

The first thing is to define the data we want to scrape. In Scrapy, this is done through Scrapy Items (Torrent files, in this case).

This would be our Item:

from scrapy.item import Item, Field

class Torrent(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()

Write a Spider to extract the data

The next thing is to write a Spider which defines the start URL (http://www.mininova.org/today), the rules for following links and the rules for extracting the data from pages.

If we take a look at that page content we’ll see that all torrent URLs are like http://www.mininova.org/tor/NUMBER where NUMBER is an integer. We’ll use that to construct the regular expression for the links to follow: /tor/\d+.
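As a quick sanity check (not part of the original example), the pattern can be tried against sample links with Python's standard re module:

import re

link_pattern = re.compile(r"/tor/\d+")

# A torrent detail URL matches the pattern, so it would be followed.
print(link_pattern.search("http://www.mininova.org/tor/2657665"))  # prints a match object
# The listing page itself does not match, so it would not be treated as an item page.
print(link_pattern.search("http://www.mininova.org/today"))        # prints None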

We’ll use XPath for selecting the data to extract from the web page HTML source. Let’s take one of those torrent pages: http://www.mininova.org/tor/2657665

And look at the page HTML source to construct the XPath to select the data we want, which is: torrent name, description and size.

By looking at the page HTML source we can see that the file name is contained inside an <h1> tag:

<h1>Home[2009][Eng]XviD-ovd</h1>

An XPath expression to extract the name could be:

//h1/text()

And the description is contained inside a <div> tag with id="description":
<h2>Description:</h2>

<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate.

...

An XPath expression to select the description could be:
//div[@id='description']

Finally, the file size is contained in the second <p> tag inside the <div> tag with id="specifications":
<div id="specifications">

<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>

<p>
<strong>Total size:</strong>
699.79&nbsp;megabyte</p>

An XPath expression to select the file size could be:
//div[@id='specifications']/p[2]/text()[2]

For more information about XPath see the XPath reference.
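A convenient way to try these XPath expressions before putting them in a spider is the interactive Scrapy shell mentioned later in this document. The session below is only a sketch: it assumes an old (0.x) Scrapy release, where the shell exposes an HtmlXPathSelector named hxs for the fetched page, and it uses the torrent page above as the example.

scrapy shell http://www.mininova.org/tor/2657665

# inside the shell, hxs is the selector for the downloaded response
hxs.select("//h1/text()").extract()
hxs.select("//div[@id='description']").extract()
hxs.select("//div[@id='specifications']/p[2]/text()[2]").extract()

Each expression returns a list of results, which matches what the spider stores later in this document.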

Finally, here’s the spider code:

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = Torrent()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

For brevity’s sake, we intentionally left out the import statements. The Torrent item is defined above.

Run the spider to extract the data

Finally, we’ll run the spider to crawl the site and produce an output file scraped_data.json with the scraped data in JSON format:

scrapy crawl mininova.org -o scraped_data.json -t json

This uses feed exports to generate the JSON file. You can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example).
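For example, the same crawl can write CSV instead of JSON just by changing the feed options, using the same 0.x-style command line shown above:

scrapy crawl mininova.org -o scraped_data.csv -t csv

The feed format and output location can also be configured project-wide through the FEED_FORMAT and FEED_URI settings instead of command-line options.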

You can also write an item pipeline to store the items in a database very easily.
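As an illustration only (this pipeline is not part of the original example), a minimal item pipeline that writes each scraped torrent into a local SQLite database could look roughly like this; the class name and database file name are made up for the sketch:

import sqlite3

class SQLiteTorrentPipeline(object):
    """Illustrative sketch: store every scraped torrent item in a SQLite database."""

    def open_spider(self, spider):
        # Called when the spider starts: open the database and make sure the table exists.
        self.conn = sqlite3.connect('torrents.db')
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS torrents (url TEXT, name TEXT, description TEXT, size TEXT)")

    def process_item(self, item, spider):
        # Selector results are lists, so join them into plain strings before storing.
        self.conn.execute(
            "INSERT INTO torrents VALUES (?, ?, ?, ?)",
            (item['url'],
             " ".join(item['name']),
             " ".join(item['description']),
             " ".join(item['size'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

To activate such a pipeline, its path would be added to the project’s ITEM_PIPELINES setting.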

Review scraped data

If you check the scraped_data.json file after the process finishes, you’ll see the scraped items there:

[{"url": "http://www.mininova.org/tor/2657665", "name": ["Home[2009][Eng]XviD-ovd"], "description": ["HOME - a documentary film by ..."], "size": ["699.69 megabyte"]},
# ... other items ...
]

You’ll notice that all field values (except for the url which was assigned directly) are actually lists. This is because the selectors return lists. You may want to store single values, or perform some additional parsing/cleansing to the values. That’s what Item Loaders are for.
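For example, the parse_torrent callback shown earlier could be rewritten with an Item Loader so that every field comes out as a single cleaned value instead of a list. This is only a sketch assuming the 0.x-era scrapy.contrib.loader API; TorrentLoader is a made-up name:

from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst

class TorrentLoader(XPathItemLoader):
    # Keep only the first value of each extracted list for every field.
    default_output_processor = TakeFirst()

# Drop-in replacement for the parse_torrent method of the spider above
# (Torrent is the Item defined earlier; other imports omitted as before).
def parse_torrent(self, response):
    l = TorrentLoader(item=Torrent(), response=response)
    l.add_value('url', response.url)
    l.add_xpath('name', "//h1/text()")
    l.add_xpath('description', "//div[@id='description']")
    l.add_xpath('size', "//div[@id='specifications']/p[2]/text()[2]")
    return l.load_item()

With TakeFirst in place, each field in the JSON output would be a plain string rather than a one-element list.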

 

What else?

You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

  • Built-in support for selecting and extracting data from HTML and XML sources
  • Built-in support for cleaning and sanitizing the scraped data using a collection of reusable filters (called Item Loaders) shared between all the spiders
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
  • A media pipeline for automatically downloading images (or any other media) associated with the scraped items
  • Support for extending Scrapy by plugging your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines)
  • Wide range of built-in middlewares and extensions for:
    • cookies and session handling
    • HTTP compression
    • HTTP authentication
    • HTTP cache
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations
  • Support for creating spiders based on pre-defined templates, to speed up spider creation and make their code more consistent on large projects. See the genspider command for more details (a usage sketch follows this list)
  • Extensible stats collection for multiple spider metrics, useful for monitoring the performance of your spiders and detecting when they get broken
  • An interactive shell console for trying XPaths, very useful for writing and debugging your spiders
  • A system service designed to ease the deployment and running of your spiders in production
  • A built-in web service for monitoring and controlling your bot
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • Logging facility that you can hook on to for catching errors during the scraping process
  • Support for crawling based on URLs discovered through Sitemaps
  • A caching DNS resolver
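As referenced in the templates bullet above, creating a spider from a template is a one-line command. This is a sketch assuming the standard genspider options of the Scrapy versions this document targets; the spider name is a placeholder:

# list the spider templates that ship with Scrapy
scrapy genspider -l

# create a CrawlSpider-based spider named "somespider" for the given domain
scrapy genspider -t crawl somespider mininova.org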

What’s next?

The next obvious steps are for you to download Scrapy, read the tutorial and join the community. Thanks for your interest!

T:\mininova\mininova\items.py (source code)

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class MininovaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    name = Field()
    description = Field()
    size = Field()
        

T:\mininova\mininova\spiders\spider_mininova.py (source code)

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule   
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mininova.items import MininovaItem

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    #start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_item')]

    # def parse_item(self, response):
        # filename = response.url.split("/")[-1] + ".html"
        # open(filename, 'wb').write(response.body)

    
    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        item = MininovaItem()
        item['url'] = response.url
        #item['name'] = x.select('''//*[@id="content"]/h1''').extract()
        item['name'] = x.select("//h1/text()").extract()
        #item['description'] = x.select("//div[@id='description']").extract()
        item['description'] = x.select('''//*[@id="specifications"]/p[7]/text()''').extract() #download
        #item['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        item['size'] = x.select('''//*[@id="specifications"]/p[3]/text()''').extract()
        return item

 

 

