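Exploring Scrapy

1 What is Scrapy?
2 How to learn it
3 Problems I ran into
4 Afterword

1 What is Scrapy?

Scrapy is a web crawling framework for Python. What is a crawler, and what can it do? Search Baidu. For interesting things people have done with crawlers, browse the Zhihu threads on web crawlers, or read the description on the official Scrapy site.
(Version: Scrapy 0.24.6)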

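2 How to learn it

2.1 Read the manual

The official Scrapy site provides a getting-started manual, available as a PDF and online.

2.2 Installation

It is covered in the manual; work through it yourself.

2.3 Getting started

The manual
For beginners, the manual already gives detailed examples. Most articles online are essentially excerpts from the manual, translated. I have read the manual from start to finish several times, and more than 80% of my questions were answered there. So I recommend reading it all the way through.

XPath, lxml and BeautifulSoup
Refer to the material on w3c, then try fetching a few pages and extracting data from them yourself. Below is a piece of code I wrote while learning lxml; it is a handy way to get familiar with XPath at the start.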

from lxml import etree
from StringIO import StringIO

xmlfile="""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="ch">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>
"""

f=StringIO(xmlfile)
tree=etree.parse(f)

tree.xpath('//title')[0].tag
tree.xpath('//title')[0].text

##  closure for query
def QResF(tree):
    def QResFunction(query):
        res=tree.xpath(query)
        try:
            reslsT=[q.tag for q in res]
            resls=[q.text for q in res]
            print zip(reslsT,resls)
        except AttributeError:
            resls='\n'.join([q for q in res])
            print resls
        return len(res)
    return QResFunction

Q=QResF(tree)

Q('//title')
Q('//author')
Q('/bookstore/book/title')
Q('/bookstore/book[price > 25]')
Q('//book/title[@*]')
Q('//@*')
Q('//*')
Q('//@lang')
Q('//book/title| //book/price')
Q('//bookstore/child::year')
Q('//nothing')
Q('/book/child::year')
Q('//book/child::*')
Q('//book/title[@lang]')

Q('//text()')

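Learn some of Python's networking libraries
Such as urllib and urllib2. Even a basic understanding helps a lot, for example how to add parameters to a URL and how to encode Chinese parameters.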
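
As a minimal sketch of those two tasks, the snippet below uses Python 3's urllib.parse, which took over the URL helpers of the Python 2 urllib module. The endpoint and the parameter names are made up for illustration.

```python
from urllib.parse import quote, urlencode

# Hypothetical search endpoint and parameters, used only for illustration.
base_url = "http://www.example.com/search"
params = {"q": "爬蟲", "page": 2}

# urlencode() percent-encodes every value (as UTF-8 by default) and joins
# the pairs with "&".
query = urlencode(params)
full_url = base_url + "?" + query
print(full_url)

# quote() percent-encodes a single component, e.g. one Chinese keyword.
keyword = quote("爬蟲")
print(keyword)
```

The server decodes the percent-encoded bytes back into the original text, so the Chinese keyword survives the round trip.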

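Learn Twisted
Scrapy is built on top of it. It is worth knowing about; I went through the basic tutorial and have some understanding of it, but you rarely need it directly.

Learn JavaScript
You should at least be able to read JS files, because most pages fetch their data asynchronously. In other words, your first request for a page often does not return the content you want, and you then have to analyze the JS code behind the page to find out how to get the data. There are also plugins that simulate browser behavior, but I find them too complicated and heavyweight; analyzing the backend requests directly is the simplest and fastest way (how well that goes depends, of course, on your patience and skill at reading code).

Learn the HTTP protocol
HTTP is an important and fundamental protocol on the Internet; once you understand it, working with crawlers becomes much easier. I recommend HTTP: The Definitive Guide (in English; English material is easy to find and more direct. If you get stuck, open the Youdao dictionary, or try the dictionary I wrote, which is on my GitHub and is implemented with lxml + XPath).

Write your own crawler
Search online for other people's crawler projects. Most of them use scraping Zhihu posts as the example, and some scrape pictures of pretty girls (hehe). Read their source code and see how experienced people do it.

Keep rereading the manual
The basic principles in the manual are the foundation for solving problems.

Study the source code of Scrapy itself
This is a must when you have nowhere else to turn. Reading the source also helps you understand the manual. I have tried reading the manual side by side with the source (I have not studied it thoroughly, only parts of it), and it helped my understanding of the manual a lot (examples follow below).

Do small projects that interest you
Practice, and deepen your understanding of the concepts (for example, scrape posts from a forum or write a dictionary). At this stage you will naturally run into plenty of problems. And getting something working is quite satisfying.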

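2.4 Some tools

The manual mentions a few that are worth a look. Many browsers have a built-in network monitor. In Google Chrome, for example, right-clicking a page offers Inspect Element; once it is open you can monitor the current page, see the requests it sends, look up XPath expressions, edit elements, study the page structure, and so on. On Firefox you can install Firebug. On Chrome I recommend the Postman extension, which can simulate the requests a browser sends.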

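3 Problems I ran into

Below are some problems I hit while exploring. Most of them can be solved by reading the manual carefully, but the answers are not that easy to notice.

3.1 Linking a request to its response

When the item built from a response needs data from the request that produced it, you can pass that data through the request's meta attribute. From the manual: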

class scrapy.http.Request(url [ , callback, method='GET', headers, body, cookies, meta,
encoding='utf-8', priority=0, dont_filter=False, errback ] )
...
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually
populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this
dict depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and
can also be accessed, in your spider, from the response.meta attribute.

As the last sentence says, the request's meta can also be read from the response through response.meta.

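3.2 How to POST data

Look again at the parameters of the Request class: there is a method parameter whose default is GET, so POST should be available too. Where does the data go? From the HTTP protocol we know that POST data is sent in the body. With that, you can construct a POST request. A more convenient way is to build a FormRequest; the manual has an example.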

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
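
To see what "the data goes in the body" means at the HTTP level, here is a small sketch using only Python 3's standard library rather than Scrapy. It builds the same form submission by hand without sending it; the explicit Content-Type header is my addition for illustration.

```python
import urllib.request
from urllib.parse import urlencode

# Same form fields as in the FormRequest example.
form = {"name": "John Doe", "age": "27"}

# A form POST carries the url-encoded fields as bytes in the request body.
body = urlencode(form).encode("ascii")

request = urllib.request.Request(
    "http://www.example.com/post/action",
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# Nothing is sent here; we only inspect the request object we built.
print(request.get_method())
print(request.data)
```

FormRequest saves you from doing this encoding yourself, which is why it is the more convenient route.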

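3.3 Requests filtered out by Scrapy

The problem was this: I send a request and get back some initial data, which contains both the header description for the following pages and the data of the current page. I wanted to process the data of every page in one function, so in effect I had to send the request for the current page twice. But Scrapy automatically filters out duplicate requests, so the second one was dropped. The solution can again be found in the manual.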

class scrapy.http.Request(url [ , callback, method='GET', headers, body, cookies, meta,
                          encoding='utf-8', priority=0, dont_filter=False, errback ] )

Notice the dont_filter parameter: setting it tells Scrapy not to filter out that request. The designers of Scrapy clearly thought things through.
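
The idea behind the flag can be pictured with a toy model. The class below is not Scrapy's scheduler, only a simplified sketch I wrote to show how a "seen" set drops repeated URLs and how a dont_filter flag bypasses it; all names in it are hypothetical.

```python
class ToyScheduler:
    """Toy model of duplicate filtering; not Scrapy's real implementation."""

    def __init__(self):
        self.seen = set()   # URLs that have already been accepted
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Queue the URL and return True, or return False if it was dropped."""
        if not dont_filter:
            if url in self.seen:
                return False        # duplicate: silently dropped
            self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
url = "http://www.example.com/list?page=1"

first = scheduler.enqueue(url)                     # new URL, accepted
second = scheduler.enqueue(url)                    # same URL again, dropped
third = scheduler.enqueue(url, dont_filter=True)   # filter bypassed, accepted

print(first, second, third)
```

In a real spider the equivalent step is simply passing dont_filter=True when constructing the second Request for the same page.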

3.4 What is a Scrapy item?

3.2.2 Item Fields
Field objects are used to specify metadata for each field. For example, the serializer function for the
last_updated field illustrated in the example above.
You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects.
For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects
could be used by a different components, and only those components know about it. You can also define and use any
other Field key in your project too, for your own needs. The main goal of Field objects is to provide a way to
define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain
field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by
each component.
It’s important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead,
they can be accessed through the Item.fields attribute.
And that’s all you need to know about declaring items.

That long passage left me in a fog, so I went straight to the source code of item.

class BaseItem(object_ref):
    """Base class for all scraped items."""
    pass


class Field(dict):
    """Container of field metadata"""


class ItemMeta(type):

    def __new__(mcs, class_name, bases, attrs):
        fields = {}
        new_attrs = {}
        for n, v in attrs.iteritems():
            if isinstance(v, Field):
                fields[n] = v
            else:
                new_attrs[n] = v

        cls = super(ItemMeta, mcs).__new__(mcs, class_name, bases, new_attrs)
        cls.fields = cls.fields.copy()
        cls.fields.update(fields)
        return cls


class DictItem(DictMixin, BaseItem):

    fields = {}

    def __init__(self, *args, **kwargs):
        self._values = {}
        if args or kwargs:  # avoid creating dict for most common case
            for k, v in dict(*args, **kwargs).iteritems():
                self[k] = v

    ## ...

class Item(DictItem):

    __metaclass__ = ItemMeta

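Suddenly everything became clear: the metaclass moves every class attribute that is a Field into the class-level fields dict and leaves it out of the class attributes, which is exactly what the last paragraph of the manual excerpt describes.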
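
To watch the mechanism in isolation, here is a stripped-down re-creation in Python 3 syntax. It is a sketch modelled on the excerpt above, not Scrapy's actual code: the Product class is a made-up example, and the __setitem__ check is my own simplification of the part elided in the excerpt.

```python
class Field(dict):
    """Container of field metadata (just a dict subclass)."""


class ItemMeta(type):
    def __new__(mcs, class_name, bases, attrs):
        fields = {}
        new_attrs = {}
        # Split the class body: Field objects go to `fields`,
        # everything else stays a normal class attribute.
        for name, value in attrs.items():
            if isinstance(value, Field):
                fields[name] = value
            else:
                new_attrs[name] = value
        cls = super().__new__(mcs, class_name, bases, new_attrs)
        # Copy inherited fields so a subclass never mutates its parent's dict.
        cls.fields = dict(getattr(cls, "fields", {}))
        cls.fields.update(fields)
        return cls


class Item(metaclass=ItemMeta):
    def __init__(self, **kwargs):
        self._values = {}
        for key, value in kwargs.items():
            self[key] = value

    def __setitem__(self, key, value):
        # Simplification: only declared fields may be assigned.
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class Product(Item):  # hypothetical item used only for this sketch
    name = Field()
    price = Field(serializer=str)


product = Product(name="Desktop PC", price=1000)
print(sorted(Product.fields))     # the declared field names
print(hasattr(Product, "name"))   # the Field is no longer a class attribute
print(product["name"])
```

The field declarations end up as metadata in Product.fields, while the values of a particular item live in its own _values dict.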

3.5 Displaying Chinese text

Scrapy saves data as UTF-8 by default. However, if you export from the command line like this

scrapy crawl dmoz -o items.json

the Chinese text comes out as Unicode escape sequences (\uXXXX) rather than readable characters,

"\u5c97\u4f4d\u804c\u8d23\uff1a"

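I tried exporting through a pipeline instead, which fixes the Chinese output. However, it still does not help with nested dicts or lists. On the other hand, writing directly to a database has no problem with Chinese.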
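
The escaped output matches what Python's json module produces by default, so the sketch below reproduces the effect with the standard library alone. Whether Scrapy's exporter exposes the same switch in your version is an assumption you would need to check; the item contents are made up, reusing the string shown above.

```python
import json

# The same text as in the exported file above, written with escapes so this
# source stays ASCII-only.
title = "\u5c97\u4f4d\u804c\u8d23\uff1a"
item = {"title": title, "tags": [title, {"nested": title}]}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are, nested values included.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data; only the written text differs.
same_data = json.loads(escaped) == json.loads(readable) == item
print(same_data)
```

Either way the file is valid JSON and no data is lost; the escapes only make it harder for a human to read. In a custom pipeline, writing json.dumps(..., ensure_ascii=False) to a file opened with UTF-8 encoding is one way to get readable output.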

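3.6 Complex start_urls

The problem: the request URLs have to be built from a configuration file rather than taken directly from start_urls. You can handle this by overriding the start_requests method in your spider. The manual mentions it, and the source code makes it very clear.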

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
    ## ...

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)
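
A sketch of the URL-building half of that override, in Python 3 and without importing Scrapy so that it runs on its own. The configuration format, its keys and the build_start_urls helper are all hypothetical; the commented method at the end shows roughly where the helper would plug into a spider.

```python
import json
from urllib.parse import urlencode

# Hypothetical configuration; in practice this would be read from a file.
CONFIG_TEXT = """
{
    "base_url": "http://www.example.com/search",
    "keywords": ["scrapy", "lxml"],
    "pages": 2
}
"""


def build_start_urls(config):
    """Return one URL per (keyword, page) combination in the config."""
    urls = []
    for keyword in config["keywords"]:
        for page in range(1, config["pages"] + 1):
            query = urlencode({"q": keyword, "page": page})
            urls.append(config["base_url"] + "?" + query)
    return urls


config = json.loads(CONFIG_TEXT)
urls = build_start_urls(config)
for url in urls:
    print(url)

# Inside a spider, the overriding method would then look roughly like this
# (left as a comment so the sketch runs without Scrapy installed):
#
#     def start_requests(self):
#         for url in build_start_urls(config):
#             yield Request(url, callback=self.parse)
```

Keeping the URL construction in a plain function makes it easy to check the generated addresses before any request is sent.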

3.7 A small crawler without a project

For a small, simple crawler, a single file is enough. Scrapy already provides a command for this:

runspider
    Syntax: scrapy runspider <spider_file.py>
    Requires project: no
Run a spider self-contained in a Python file, without having to create a project

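3.8 Other

I ran into many other kinds of problems in practice. Given the limits of space and time, I will share the specific problems later, once I have found solutions.

4 Afterword

I came across Scrapy half a year ago and have been exploring it in my spare time ever since. I have written a few simple projects for practice and now understand Scrapy reasonably well (though plenty of open questions remain), so I wrote this article, first to summarize the process for myself and second to share some of my ways of exploring. I believe beginners like me will run into similar problems. In fact, most problems can be solved with the manual that comes with Scrapy. So my advice is still to read the manual, and to read it all the way through. Then learn the networking fundamentals. Then search online for answers. Finally, read the source code.

(Thanks for reading. Corrections and discussion are welcome.)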