Original author: sharpstill (博客園); please credit the source when reposting.
Scrapy is a mature crawling framework that fetches web pages and extracts structured data from them; many companies already run it in production. For more background, see the official website: www.scrapy.org.
We will install it step by step following the official installation guide, mainly the instructions at http://doc.scrapy.org/en/latest/intro/install.html:
- Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyopenssl (for HTTPS support. Optional, but highly recommended)
Preparation
Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4.tar.gz
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841
Installation and Configuration
1. Install zlib
First check whether zlib is already installed on your system; it is a data-compression library that the Scrapy stack depends on. On my RHEL 5 system, check with rpm, then build from source if a newer version is needed:
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
On my CentOS box the equivalent was simply yum search zlib to find the package, then yum install zlib-devel.
2. Install Python
My system came with Python 2.4; following the official requirements and recommendations, I chose Python-2.7.2, which can be downloaded from:
http://www.python.org/download/ (may require a proxy)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source code, compiled it, and installed it as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
By default, Python is installed under /usr/local/lib/python2.7.
If Python was not previously installed on your system, you can now verify it from the command line:
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
If another Python version is already present (2.4 on my system), create a symbolic link so that python resolves to the new build:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
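To confirm that python now resolves to the new interpreter, a short standard-library check can be run. This is just a sanity-check sketch; the version bounds come from the requirements list above (Python 2.5-2.7, no 3.x):

```python
# sanity check: report the running interpreter and confirm its version
# falls in Scrapy 0.14's supported range (2.5 <= version < 3.0)
import sys

def version_ok(info=sys.version_info):
    """Return True if the interpreter version is within [2.5, 3.0)."""
    return (2, 5) <= (info[0], info[1]) < (3, 0)

print(sys.executable)  # path of the interpreter actually running
print(sys.version)     # its full version string
```

If this prints the old 2.4 interpreter's path, the symlink step above did not take effect.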
3. Install setuptools
This step installs a tool for managing Python modules; skip it if setuptools is already installed. Otherwise, the following commands will install it:
- wget http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
- chmod +x setuptools-0.6c11-py2.7.egg
- ./setuptools-0.6c11-py2.7.egg
Note that the unpacked Python-2.7.2 source tree also contains a setup.py script, which can be used to build and install some of Python's bundled modules; run it with:
- [root@localhost Python-2.7.2]# python setup.py install
4. Install zope.interface
Download it from:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
Install it as follows:
- [root@localhost scrapy]# tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]# cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]# python setup.py build
- [root@localhost zope.interface-3.8.0]# python setup.py install
5. Install Twisted
Download it from:
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
Install it as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
6. Install w3lib
Download it from:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Install it as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
7. Install libxml2
Download it from:
http://download.chinaunix.net/download.php?id=28497&ResourceID=6095
http://download.chinaunix.net/down.php?id=28497&ResourceID=6095&site=1
Alternatively, the matching release can be found at http://xmlsoft.org.
Install it as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
8. Install pyOpenSSL
This step is optional. The package can be downloaded from:
https://launchpad.net/pyopenssl
Installing pyOpenSSL on CentOS kept failing for me, until I googled my way to the workaround at https://bugs.launchpad.net/pyopenssl/+bug/845445. In the end a single command solved it: easy_install http://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.12.tar.gz
If easy_install still fails at this step, OpenSSL's development headers are missing; on CentOS run yum install openssl-devel and then retry easy_install.
9. Install Scrapy
Download it from:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
Install it as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
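Before the command-line verification below, a small script can confirm that every dependency installed above actually imports under the new interpreter. This is a sketch, not part of the official guide; note that import names can differ from package names (pyOpenSSL imports as OpenSSL, Twisted as twisted):

```python
# report which of the modules installed in the steps above fail to import
import importlib

DEPS = ["zope.interface", "twisted", "w3lib", "lxml", "OpenSSL", "scrapy"]

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    gone = missing_modules(DEPS)
    if gone:
        print("missing: " + ", ".join(gone))
    else:
        print("all dependencies import cleanly")
```

Any name reported as missing points back at the corresponding install step.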
Verifying the Installation
With the steps above complete, Scrapy is installed. We can verify it from the command line:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
- Usage:
- scrapy <command> [options] [args]
- Available commands:
- fetch Fetch a URL using the Scrapy downloader
- runspider Run a self-contained spider (without creating a project)
- settings Get settings values
- shell Interactive scraping console
- startproject Create new project
- version Print Scrapy version
- view Open URL in browser, as seen by Scrapy
- Use "scrapy <command> -h" to see more info about a command
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
- scrapy fetch [options] <url>
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
- Options
- =======
- --help, -h show this help message and exit
- --spider=SPIDER use this spider
- --headers print response HTTP headers instead of body
- Global Options
- --------------
- --logfile=FILE log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
- log level (default: DEBUG)
- --nolog disable logging completely
- --profile=FILE write python cProfile stats to FILE
- --lsprof=FILE write lsprof profiling stats to FILE
- --pidfile=FILE write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
- set/override setting (may be repeated)
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 2011-12-05 23:40:05+0800 [default] INFO: Spider opened
- 2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
- 2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
- 'downloader/request_count': 1,
- 'downloader/request_method_count/GET': 1,
- 'downloader/response_bytes': 22676,
- 'downloader/response_count': 1,
- 'downloader/response_status_count/200': 1,
- 'finish_reason': 'finished',
- 'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
- 'scheduler/memory_enqueued': 1,
- 'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
- 2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
From here, you can follow the official tutorial to put the Scrapy framework to work: http://doc.scrapy.org/en/latest/intro/tutorial.html.
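The tutorial's spiders crawl pages and pull structured data out of them. As a rough, framework-free illustration of that extraction step (plain standard-library code, not Scrapy's API), here is a minimal link extractor you could point at the install.html saved earlier:

```python
# extract href values from anchor tags using only the stdlib HTML parser;
# a sketch of the kind of extraction Scrapy's selectors automate
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # collect the href attribute of every <a> tag encountered
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    # e.g. extract_links(open("install.html").read()) on the page fetched above
    print(extract_links('<a href="http://scrapy.org/">Scrapy</a>'))
```

Scrapy replaces this kind of hand-rolled parsing with XPath selectors and item pipelines, which is what the tutorial walks through.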
(1) On Ubuntu
Install libxml2 with: $ sudo apt-get install libxml2 libxml2-dev
(2) On CentOS
yum install libxml2
yum install libxslt-devel
(you can run yum search first to confirm the package names)
