1. Fetching an HTML page with urllib2
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib2

response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html
Those few lines are enough to pull down the HTML page; what comes next is parsing the HTML.
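The parsing step could start with nothing more than the standard library. The sketch below is only an illustration of that next step (the "collect every link" goal and the LinkCollector class are mine, not from the original code), assuming Python 2 to match urllib2:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Minimal parsing sketch: collect the href of every <a> tag on the page.
import urllib2
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    """Remembers the href attribute of every <a> tag it sees."""
    def __init__(self):
        HTMLParser.__init__(self)  # old-style class, so no super()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

html = urllib2.urlopen('http://www.baidu.com').read()
parser = LinkCollector()
parser.feed(html)  # depending on the page you may need to decode() first
print '\n'.join(parser.links)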
That is the theory; in practice things go wrong immediately. Baidu does not block robots, so its page can be fetched normally, but a site such as https://b.ishadow.tech/ does block crawlers, and simply faking browser header fields is not enough.
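For reference, the header-spoofing attempt mentioned above looks roughly like this. It is only a sketch: the User-Agent string is an example value, and for this site the request still fails, as described.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Sketch of faking a browser User-Agent with urllib2 (Python 2).
# The User-Agent value is an example, not the one originally used.
import socket
import urllib2

req = urllib2.Request(
    'https://b.ishadow.tech/',
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/56.0 Safari/537.36'})
try:
    html = urllib2.urlopen(req, timeout=30).read()
    print html[:200]
except (urllib2.URLError, socket.timeout) as e:
    # Against this site the request is refused or times out anyway.
    print 'request failed:', e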
I then wanted to try an off-the-shelf crawler from GitHub, and found https://github.com/scrapy/scrapy, which looked pretty capable.
I installed it following the README, the installation failed, and I went back and read the docs more carefully.
pip install scrapy
The official docs recommend installing it inside a Python virtual environment; the description reads roughly as follows (https://doc.scrapy.org/en/latest/intro/install.html#intro-using-virtualenv):
Using a virtual environment (recommended)

TL;DR: We recommend installing Scrapy inside a virtual environment on all platforms.

Python packages can be installed either globally (a.k.a system wide), or in user-space. We do not recommend installing scrapy system wide.

Instead, we recommend that you install scrapy within a so-called “virtual environment” (virtualenv). Virtualenvs allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).
So I decided to set up a Python virtual environment. The command:
$ sudo pip install virtualenv
Check the basic usage:
$ virtualenv -h
Usage: virtualenv [OPTIONS] DEST_DIR
All it takes is virtualenv plus a destination directory.
So create a new virtual environment:
$ virtualenv e27
New python executable in ~/e27/bin/python
Installing setuptools, pip, wheel...done.
Activate the environment:
$ source ./bin/activate
Note that once the environment is active the shell prompt is prefixed with its name, like this:
➜ e27 source ./bin/activate
(e27) ➜ e27
To leave the environment:
$ deactivate
Now for the real work: install the crawler (https://github.com/scrapy/scrapy) inside the virtual environment:
pip install scrapy
The installation finished after about three minutes, whereas installing it straight into the global environment had, oddly enough, failed earlier. On success the shell output ends with:
......
Successfully built lxml PyDispatcher Twisted pycparser
Installing collected packages: lxml, PyDispatcher, zope.interface, constantly, incremental, attrs, Automat, Twisted, ipaddress, asn1crypto, enum34, idna, pycparser, cffi, cryptography, pyOpenSSL, queuelib, w3lib, cssselect, parsel, pyasn1, pyasn1-modules, service-identity, scrapy
Successfully installed Automat-0.5.0 PyDispatcher-2.0.5 Twisted-17.1.0 asn1crypto-0.22.0 attrs-16.3.0 cffi-1.10.0 constantly-15.1.0 cryptography-1.8.1 cssselect-1.0.1 enum34-1.1.6 idna-2.5 incremental-16.10.1 ipaddress-1.0.18 lxml-3.7.3 parsel-1.1.0 pyOpenSSL-16.2.0 pyasn1-0.2.3 pyasn1-modules-0.0.8 pycparser-2.17 queuelib-1.4.2 scrapy-1.3.3 service-identity-16.0.0 w3lib-1.17.0 zope.interface-4.3.3
With scrapy installed, try a simple link:
(e27) ➜ e27 scrapy shell 'http://quotes.toscrape.com/page/1/'
which produces a pile of output like this:
2017-03-30 22:08:42 [scrapy.core.engine] INFO: Spider opened
2017-03-30 22:08:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1100bab50>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x1100baad0>
[s]   spider     <DefaultSpider 'default' at 0x11037ebd0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
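From here you can poke at the response straight from the shell prompt. A quick sketch of what that might look like (the selectors assume the quotes.toscrape.com markup used in the Scrapy tutorial; extract()/extract_first() are the selector methods in this Scrapy version):

# typed at the scrapy shell prompt
response.css('title::text').extract_first()           # page title
response.css('div.quote span.text::text').extract()   # list of quote strings
response.css('small.author::text').extract()          # list of author names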
This proves it works. Now try the link https://b.ishadow.tech/:
(e27) ➜ e27 scrapy shell 'https://b.ishadow.tech/'
The result:
2017-03-30 22:10:21 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-30 22:10:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-30 22:10:21 [scrapy.core.engine] INFO: Spider opened
2017-03-30 22:11:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://b.ishadow.tech/> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-03-30 22:12:51 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://b.ishadow.tech/> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2017-03-30 22:14:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://b.ishadow.tech/> (failed 3 times): TCP connection timed out: 60: Operation timed out.
Traceback (most recent call last):
The crawl timed out: it looks like the request was recognized as a bot and refused (the site was perfectly reachable from a browser at the time). Well played! By now you have probably guessed my real purpose; if not, open the link I was trying to crawl and you will see. 😁
Even though the crawl did not succeed, this crawler is well worth studying when there is time; docs: https://doc.scrapy.org/en/latest/intro/overview.html
Next up: figuring out, bit by bit, how to get past the blocking.
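The obvious first knobs live in a Scrapy project's settings.py. The sketch below is just a list of common first attempts (a realistic User-Agent, slower and politer requests), not a verified way past this particular site's blocking:

# settings.py -- common first attempts against bot detection (sketch, unverified
# against b.ishadow.tech). The User-Agent string is an example value.
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0 Safari/537.36')
ROBOTSTXT_OBEY = False              # stop honouring robots.txt (mind the ethics)
DOWNLOAD_DELAY = 3                  # slow down so the traffic looks less bot-like
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt its request rate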
