Original author: sharpstill (博客園); please credit the source when reposting.
Scrapy is a mature crawling framework that fetches web pages and extracts structured data from them; many companies already run it in production. For more background, see the official website: www.scrapy.org.
We will install it step by step following the official installation guide, mainly the instructions at http://doc.scrapy.org/en/latest/intro/install.html:
- Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyopenssl (for HTTPS support. Optional, but highly recommended)
Preparation
Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4.tar.gz
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841
Installation and Configuration
1. Install zlib
First check whether zlib is already installed on your system; it is a data-compression library that the Scrapy stack depends on. On my RHEL 5 system, check with rpm, then build from source if a newer version is needed:
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
On my CentOS box the equivalent was simply yum search zlib to find the package, then yum install zlib-devel.
2. Install Python
My system came with Python 2.4; following the official requirements and recommendations, I chose Python-2.7.2, which can be downloaded from:
http://www.python.org/download/ (may require a proxy)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source code, compiled it, and installed it as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
By default, Python is installed under /usr/local/lib/python2.7.
If Python was not previously installed on your system, you can now verify it from the command line:
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
If another Python version is already present (2.4 on my system), create a symbolic link so that python resolves to the new build:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
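To confirm that python now resolves to the new interpreter, a short standard-library check can be run. This is just a sanity-check sketch; the version bounds come from the requirements list above (Python 2.5-2.7, no 3.x):

```python
# sanity check: report the running interpreter and confirm its version
# falls in Scrapy 0.14's supported range (2.5 <= version < 3.0)
import sys

def version_ok(info=sys.version_info):
    """Return True if the interpreter version is within [2.5, 3.0)."""
    return (2, 5) <= (info[0], info[1]) < (3, 0)

print(sys.executable)  # path of the interpreter actually running
print(sys.version)     # its full version string
```

If this prints the old 2.4 interpreter's path, the symlink step above did not take effect.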
3. Install setuptools
This step installs a tool for managing Python modules; skip it if setuptools is already installed. Otherwise, the following commands will install it:
- wget http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
- chmod +x setuptools-0.6c11-py2.7.egg
- ./setuptools-0.6c11-py2.7.egg
Note that the unpacked Python-2.7.2 source tree also contains a setup.py script, which can be used to build and install some of Python's bundled modules; run it with:
- [root@localhost Python-2.7.2]# python setup.py install
4. Install zope.interface
Download it from:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
Install it as follows:
- [root@localhost scrapy]# tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]# cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]# python setup.py build
- [root@localhost zope.interface-3.8.0]# python setup.py install
5. Install Twisted
Download it from:
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
Install it as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
6. Install w3lib
Download it from:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Install it as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
7. Install libxml2
Download it from:
http://download.chinaunix.net/download.php?id=28497&ResourceID=6095
http://download.chinaunix.net/down.php?id=28497&ResourceID=6095&site=1
Alternatively, the matching release can be found at http://xmlsoft.org.
Install it as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
8. Install pyOpenSSL
This step is optional. The package can be downloaded from:
https://launchpad.net/pyopenssl
Installing pyOpenSSL on CentOS kept failing for me, until I googled my way to the workaround at https://bugs.launchpad.net/pyopenssl/+bug/845445. In the end a single command solved it: easy_install http://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.12.tar.gz
If easy_install still fails at this step, OpenSSL's development headers are missing; on CentOS run yum install openssl-devel and then retry easy_install.
9. Install Scrapy
Download it from:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
Install it as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
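Before the command-line verification below, a small script can confirm that every dependency installed above actually imports under the new interpreter. This is a sketch, not part of the official guide; note that import names can differ from package names (pyOpenSSL imports as OpenSSL, Twisted as twisted):

```python
# report which of the modules installed in the steps above fail to import
import importlib

DEPS = ["zope.interface", "twisted", "w3lib", "lxml", "OpenSSL", "scrapy"]

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    gone = missing_modules(DEPS)
    if gone:
        print("missing: " + ", ".join(gone))
    else:
        print("all dependencies import cleanly")
```

Any name reported as missing points back at the corresponding install step.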
Verifying the Installation
With the steps above complete, Scrapy is installed. We can verify it from the command line:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
- Usage:
- scrapy <command> [options] [args]
- Available commands:
- fetch Fetch a URL using the Scrapy downloader
- runspider Run a self-contained spider (without creating a project)
- settings Get settings values
- shell Interactive scraping console
- startproject Create new project
- version Print Scrapy version
- view Open URL in browser, as seen by Scrapy
- Use "scrapy <command> -h" to see more info about a command
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
- scrapy fetch [options] <url>
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
- Options
- =======
- --help, -h show this help message and exit
- --spider=SPIDER use this spider
- --headers print response HTTP headers instead of body
- Global Options
- --------------
- --logfile=FILE log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
- log level (default: DEBUG)
- --nolog disable logging completely
- --profile=FILE write python cProfile stats to FILE
- --lsprof=FILE write lsprof profiling stats to FILE
- --pidfile=FILE write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
- set/override setting (may be repeated)
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 2011-12-05 23:40:05+0800 [default] INFO: Spider opened
- 2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
- 2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
- 'downloader/request_count': 1,
- 'downloader/request_method_count/GET': 1,
- 'downloader/response_bytes': 22676,
- 'downloader/response_count': 1,
- 'downloader/response_status_count/200': 1,
- 'finish_reason': 'finished',
- 'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
- 'scheduler/memory_enqueued': 1,
- 'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
- 2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
From here, you can follow the official tutorial to put the Scrapy framework to work: http://doc.scrapy.org/en/latest/intro/tutorial.html.
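The tutorial's spiders crawl pages and pull structured data out of them. As a rough, framework-free illustration of that extraction step (plain standard-library code, not Scrapy's API), here is a minimal link extractor you could point at the install.html saved earlier:

```python
# extract href values from anchor tags using only the stdlib HTML parser;
# a sketch of the kind of extraction Scrapy's selectors automate
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # collect the href attribute of every <a> tag encountered
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    # e.g. extract_links(open("install.html").read()) on the page fetched above
    print(extract_links('<a href="http://scrapy.org/">Scrapy</a>'))
```

Scrapy replaces this kind of hand-rolled parsing with XPath selectors and item pipelines, which is what the tutorial walks through.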
(1) On Ubuntu
Install libxml2 with: $ sudo apt-get install libxml2 libxml2-dev
(2) On CentOS
yum install libxml2
yum install libxslt-devel
(you can run yum search first to confirm the package names)
