Python Web Data Collection: Using PhantomJS to Scrape Taobao/Tmall Product Content
1. Introduction
I have recently been studying the Scrapy crawler framework and trying to use it to write a small program that collects information from web pages. I ran into quite a few small problems along the way, so feedback and corrections are very welcome.
This article describes how to combine Scrapy with PhantomJS to collect product content from Tmall. It defines a custom entry in DOWNLOADER_MIDDLEWARES to fetch dynamic pages whose content is rendered by JavaScript. Having read a fair amount of material on DOWNLOADER_MIDDLEWARES, my summary is: it is simple to use, but because the browser call runs synchronously inside the middleware it blocks the framework, so performance suffers. Some sources mention that a custom DOWNLOADER_HANDLER, or scrapyjs, can avoid blocking the framework; interested readers can look into those on their own, as I will not cover them here.
2. Implementation
2.1 Environment Requirements
Prepare the Python development and runtime environment with the following steps:
- Python: download from the official site, install it, and set up the environment variables (this article uses Python 3.5.1)
- lxml: download the .whl file matching your Python version from the official repository, then run "pip install <path to .whl file>" on the command line
- Scrapy: run "pip install Scrapy" on the command line; see the article "Scrapy的第一次運行測試" for details
- selenium: run "pip install selenium" on the command line
- PhantomJS: download from the official site
The steps above show two installation styles: 1) installing a wheel package downloaded to the local machine; 2) letting the Python package manager download and install remotely. Note: the package version must match your Python version.
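For example, on a 32-bit Windows build of Python 3.5 the two styles could look like this (the lxml wheel filename is an assumption; substitute the file you actually downloaded):

pip install lxml-3.6.0-cp35-cp35m-win32.whl
pip install Scrapy
pip install selenium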
2.2 Development and Testing
First, find the page to collect. For this article I picked a Tmall product page: https://world.tmall.com/item/526449276263.htm.
Then start writing the code. Unless noted otherwise, the commands below are executed in a command-line window.
1) Create the Scrapy crawler project tmSpider
E:\python-3.5.1>scrapy startproject tmSpider
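For reference, scrapy startproject generates a project skeleton roughly like the following (the exact file list varies slightly between Scrapy versions):

tmSpider/
    scrapy.cfg
    tmSpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py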
2) Edit the settings.py configuration
- Set ROBOTSTXT_OBEY to False;
- Disable Scrapy's default downloader middleware;
- Register the custom DOWNLOADER_MIDDLEWARES entry.
The configuration looks like this:
# settings.py -- as described above
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'tmSpider.middlewares.middleware.CustomMiddlewares': 543,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
3) Create a middlewares folder in the project directory, then create a middleware.py file inside it with the following code:
# -*- coding: utf-8 -*-
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse
import tmSpider.middlewares.downloader as downloader

class CustomMiddlewares(object):
    def process_request(self, request, spider):
        # Fetch the page with PhantomJS instead of Scrapy's own downloader;
        # returning an HtmlResponse here short-circuits the download chain.
        url = str(request.url)
        dl = downloader.CustomDownloader()
        content = dl.VisitPersonPage(url)
        return HtmlResponse(url, status=200, body=content)

    def process_response(self, request, response, spider):
        # IgnoreRequest must be raised, not returned
        if len(response.body) == 100:
            raise IgnoreRequest("body length == 100")
        else:
            return response
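Note that process_request above creates a new CustomDownloader, and therefore spawns a new PhantomJS process, for every single request. A minimal sketch of a variation that reuses one browser instance for the whole crawl (my own suggestion, not part of the original code):

class CustomMiddlewares(object):
    def __init__(self):
        # One shared PhantomJS-backed downloader for all requests
        self.dl = downloader.CustomDownloader()

    def process_request(self, request, spider):
        url = str(request.url)
        content = self.dl.VisitPersonPage(url)
        return HtmlResponse(url, status=200, body=content)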
4) Write a page content downloader based on selenium and PhantomJS. In the same middlewares folder created in the previous step, create downloader.py with the following code:
# -*- coding: utf-8 -*-
import time
from selenium import webdriver
import selenium.webdriver.support.ui as ui

class CustomDownloader(object):
    def __init__(self):
        # use any browser you wish
        cap = webdriver.DesiredCapabilities.PHANTOMJS
        cap["phantomjs.page.settings.resourceTimeout"] = 1000
        cap["phantomjs.page.settings.loadImages"] = True
        cap["phantomjs.page.settings.disk-cache"] = True
        cap["phantomjs.page.customHeaders.Cookie"] = 'SINAGLOBAL=3955422793326.2764.1451802953297; '
        self.driver = webdriver.PhantomJS(executable_path='F:/phantomjs/bin/phantomjs.exe', desired_capabilities=cap)
        wait = ui.WebDriverWait(self.driver, 10)

    def VisitPersonPage(self, url):
        print('Loading the page .....')
        self.driver.get(url)
        time.sleep(1)
        # Scroll to the bottom so the lazy-loaded product details render
        js = "var q=document.documentElement.scrollTop=10000"
        self.driver.execute_script(js)
        time.sleep(5)
        content = self.driver.page_source.encode('gbk', 'ignore')
        print('Page loaded .....')
        return content

    def __del__(self):
        self.driver.quit()
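The fixed time.sleep() calls are fragile: a slow page may need more time, a fast one wastes it. Since WebDriverWait is already imported as ui, an explicit wait on a concrete element is usually more robust. A sketch, assuming a hypothetical #description container (inspect the real Tmall markup before using it):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def VisitPersonPage(self, url):
    self.driver.get(url)
    self.driver.execute_script("var q=document.documentElement.scrollTop=10000")
    # Block until the (hypothetical) detail container appears, up to 10 seconds
    ui.WebDriverWait(self.driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#description'))
    )
    return self.driver.page_source.encode('gbk', 'ignore')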
5) Create the spider module
In the project directory E:\python-3.5.1\tmSpider, run:
E:\python-3.5.1\tmSpider>scrapy genspider tmall 'tmall.com'
After this runs, a tmall.py file is generated automatically under E:\python-3.5.1\tmSpider\tmSpider\spiders. The parse function in that file processes the page content returned by the Scrapy downloader. There are two ways to extract the page information:
- Use XPath or regular expressions to pull the required fields out of response.body (a hedged sketch of this route follows the spider code below), or
- Use a content extractor obtained through the GooSeeker API to convert all fields in one step, without hand-writing an XPath for each field (on how to obtain an extractor, see "python使用xslt提取網頁數據"). The code is as follows:
# -*- coding: utf-8 -*-
import time
import scrapy
import tmSpider.gooseeker.gsextractor as gsextractor

class TmallSpider(scrapy.Spider):
    name = "tmall"
    allowed_domains = ["tmall.com"]
    start_urls = (
        'https://world.tmall.com/item/526449276263.htm',
    )

    # Return the current Unix timestamp as a string (integer part only)
    def getTime(self):
        current_time = str(time.time())
        m = current_time.find('.')
        return current_time[0:m]

    def parse(self, response):
        html = response.body
        print("----------------------------------------------------------------------------")
        extra = gsextractor.GsExtractor()
        extra.setXsltFromAPI("0a3898683f265e7b28991e0615228baa", "淘寶天貓_商品詳情30474", "tmall", "list")
        result = extra.extract(html)
        print(str(result).encode('gbk', 'ignore').decode('gbk'))
        #file_name = 'F:/temp/淘寶天貓_商品詳情30474_' + self.getTime() + '.xml'
        #open(file_name, "wb").write(result)
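For comparison, the XPath route mentioned in option 1 could look roughly like this inside parse(); the selectors are placeholders and must be adapted to Tmall's actual markup:

def parse(self, response):
    # Hypothetical selectors -- inspect the real page before relying on them
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//span[contains(@class, "price")]/text()').extract_first()
    print(title, price)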
6) Run the spider
In the project directory E:\python-3.5.1\tmSpider, run:
E:\python-3.5.1\tmSpider>scrapy crawl tmall
The spider runs and prints the extraction results to the console.
Note that the command above starts only one spider at a time. What if you want to launch several at once? You need a custom launcher module: create runcrawl.py under spiders with the following code:
# -*- coding: utf-8 -*-
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from tmall import TmallSpider
...
spider = TmallSpider(domain='tmall.com')
runner = CrawlerRunner()
runner.crawl(spider)
...
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
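The elisions above stand for project-specific setup. Filling them in, a self-contained launcher for two spiders might look like this (OtherSpider is a made-up name standing in for your second spider class):

# -*- coding: utf-8 -*-
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from tmall import TmallSpider
# from other import OtherSpider  # hypothetical second spider

runner = CrawlerRunner()
runner.crawl(TmallSpider)
# runner.crawl(OtherSpider)
d = runner.join()  # fires when all scheduled crawls finish
d.addBoth(lambda _: reactor.stop())
reactor.run()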
Run the runcrawl.py file to start the spider and print the results.
3. Outlook
After implementing the crawler by calling PhantomJS from a custom DOWNLOADER_MIDDLEWARES entry, I wrestled with the framework-blocking problem for a long time, looking for a way to solve it. Next I plan to study scrapyjs, Splash, and other ways of driving a browser to see whether they resolve it effectively.
4. Related Documents
5. GooSeeker Open-Source Code Download
1. GooSeeker open-source Python web crawler GitHub repository
6. Document Revision History
1. 2016-07-04: V1.0
Original article: https://segmentfault.com/a/1190000005866893