Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Installing Scrapy is a bit fiddly, but if you follow the steps below you should have it up and running without much trouble.
Install Python 2.7
Scrapy 1.0.3 currently supports only Python 2.7.
# wget https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz
[root@rocket software]# tar -zxvf Python-2.7.6.tgz # extract the source
[root@rocket software]# cd Python-2.7.6
[root@rocket software]# mkdir /usr/local/python27 # create the install directory
[root@rocket software]# ./configure --prefix=/usr/local/python27
[root@rocket software]# make
[root@rocket software]# make install
# The currently installed Python is 2.6 and needs to be replaced with 2.7
[root@rocket software]# mv /usr/bin/python /usr/bin/python2.6.6
[root@rocket software]# ln -s /usr/local/python27/bin/python /usr/bin/python
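At this point the default interpreter should be 2.7; a quick check (output shown for the 2.7.6 build installed above):
[root@rocket software]# python -V
Python 2.7.6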
Note that because the system originally shipped with Python 2.6 and we have just made 2.7 the default, yum will break: it only works with the system's Python 2.6.
yum therefore needs to be pointed back at the Python 2.6 interpreter.
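One common way to do that (assuming the stock CentOS layout, where yum's launcher script is /usr/bin/yum and its first line is #!/usr/bin/python) is to rewrite the shebang so it uses the interpreter we renamed above:
[root@rocket software]# sed -i '1s|^#!/usr/bin/python$|#!/usr/bin/python2.6.6|' /usr/bin/yum
[root@rocket software]# head -1 /usr/bin/yum
#!/usr/bin/python2.6.6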
Install setuptools
Go to the official page, download the package, and extract it:
https://pypi.python.org/pypi/setuptools#downloads
[root@rocket software]# cd setuptools-18.1
[root@rocket setuptools-18.1]# python setup.py install
Install pip
Go to the official page, download the package, and extract it:
https://pypi.python.org/pypi/pip#downloads
[root@rocket software]# cd pip-7.1.2
[root@rocket pip-7.1.2]# python setup.py install
Install Twisted
Download the package from the official site and extract it:
wget https://pypi.python.org/packages/source/T/Twisted/Twisted-15.4.0.tar.bz2
[root@rocket software]# cd Twisted-15.4.0
[root@rocket Twisted-15.4.0]# python setup.py install
Install Scrapy
pip install scrapy
A few problems came up along the way:
1. While installing modules, pip warned: InsecurePlatformWarning: A true SSLContext object is not available.
yum install python-devel libffi-devel openssl-devel
pip install pyopenssl ndg-httpsclient pyasn1
After that, the warning no longer appears when running pip.
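As a quick sanity check, the three packages just installed should import cleanly (OpenSSL, ndg.httpsclient and pyasn1 are their import names; no output means success):
[root@rocket software]# python -c "import OpenSSL, ndg.httpsclient, pyasn1"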
2. Installing lxml failed
The fix is to install the libxslt development package first:
yum install libxslt-devel
Confirm that the libxml2 packages are in place:
[root@rocket software]# rpm -qa | grep libxml
libxml2-devel-2.7.6-20.el6.x86_64
libxml2-python-2.7.6-20.el6.x86_64
libxml2-2.7.6-20.el6.x86_64
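With these development headers installed, lxml should now build. A minimal check (pip compiles lxml from source, which can take a minute; no output from the import means it works):
[root@rocket software]# pip install lxml
[root@rocket software]# python -c "import lxml.etree"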
3. Installing cffi failed
[root@rocket software]# yum -y install libffi-devel
[root@rocket software]# rpm -qa | grep libffi
libffi-3.0.5-3.2.el6.x86_64
libffi-devel-3.0.5-3.2.el6.x86_64
4. The OpenSSL-dependent build failed
[root@rocket software]# yum -y install openssl-devel
[root@rocket software]# rpm -qa | grep openssl
openssl-devel-1.0.1e-42.el6.x86_64
openssl-1.0.1e-42.el6.x86_64
After resolving the problems above, re-run:
pip install scrapy
and this time it installs cleanly.
Confirm the installation:
[root@rocket Twisted-15.4.0]# python
Python 2.7.6 (default, Oct 27 2015, 01:21:45)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
No error is raised, so Scrapy is installed correctly.
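You can also ask the command-line tool for its version (1.0.3 here, matching the release this walkthrough targets):
[root@rocket Twisted-15.4.0]# scrapy version
Scrapy 1.0.3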
Start the first Scrapy project
For a detailed introduction, see
http://scrapy-chs.readthedocs.org/zh_CN/latest/intro/overview.html
[root@rocket scrapy]# scrapy startproject mininova
It complained when first run: note that scrapy crawl must be executed from the mininova project's top-level directory (the one containing scrapy.cfg), otherwise it reports an error.
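For reference, startproject generates a skeleton roughly like the following (layout as of Scrapy 1.0; only items.py and the spiders/ directory are edited below):
mininova/
    scrapy.cfg
    mininova/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py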
Write items.py:
import scrapy

class MininovaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    size = scrapy.Field()
Write spiders/mininova_spiders.py:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from mininova.items import MininovaItem

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = MininovaItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
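Before running the full crawl, the XPath expressions can be exercised interactively with scrapy shell; the URL below is just the spider's start page, and any /tor/... detail page can be substituted to test the item fields:
[root@rocket mininova]# scrapy shell http://www.mininova.org/today
>>> response.xpath("//h1/text()").extract()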
Run it:
[root@rocket mininova]# pwd
/home/demo/scrapy/mininova
[root@rocket mininova]# scrapy crawl mininova -o scraped_data.json
The crawl failed because this hand-built Python was compiled without sqlite support (the _sqlite3 extension was missing). Install the sqlite-devel package, then recompile and reinstall Python:
[root@rocket software]# yum install sqlite-devel
[root@rocket software]# cd Python-2.7.6 # back into the Python source tree
[root@rocket software]# ./configure --prefix=/usr/local/python27
[root@rocket software]# make
[root@rocket software]# make install
The sqlite3 extension module can now be found:
[root@rocket software]# cd /usr/local/python27/lib/python2.7/lib-dynload/
[root@rocket lib-dynload]# ll|grep sql
-rwxr-xr-x. 1 root root 240971 Oct 28 01:17 _sqlite3.so
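A quick import check confirms the rebuilt interpreter picks it up (no output means the module loaded):
[root@rocket lib-dynload]# python -c "import sqlite3"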
[root@rocket mininova]# scrapy crawl mininova -o scraped_data.json
It finally runs.
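As a final check, the exported feed can be loaded back in Python. Scrapy's -o JSON exporter writes a single JSON array of item dicts, so a minimal sketch for inspecting it (field names and counts depend on what was actually crawled) is:
import json

with open('scraped_data.json') as f:
    items = json.load(f)  # one list of item dicts, as written by the -o JSON exporter

print 'scraped %d items' % len(items)
if items:
    print 'fields:', sorted(items[0].keys())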
Next we will take a closer look at how Scrapy works and walk through more practical examples.