1. Overview
In the previous article (https://www.cnblogs.com/xiao987334176/p/13656055.html)
we covered how to use Splash to scrape JavaScript-rendered pages.
This post is a hands-on follow-up: crawling ice cream listings on JD.com (search keyword: 冰淇淋).
Environment
Splash server
OS: CentOS 7.6
Docker version: 19.03.12
IP address: 192.168.0.10
Note: the Splash service runs in a docker container
Development machine
OS: Windows 10
Python version: 3.7.9
IP address: 192.168.0.9
Note: PyCharm is used as the local development tool.
How to set up and use Splash was covered in the previous article and is not repeated here.
2. Analyzing the Page
Open JD.com and search for the keyword 冰淇淋 (ice cream). As you drag the scrollbar down, more and more products get loaded, which tells us that part of the page is loaded via Ajax.
Note: every product entry sits inside a <div class="gl-i-wrap"></div>.
Let's fetch the page in a scrapy shell:
scrapy shell "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"
Output:
...
[s]   view(response)    View response in a browser
>>>
Note: do not put the URL in single quotes, or the command will fail (the Windows command line does not treat single quotes as quoting characters).
Next, query the response with a CSS selector:
>>> response.css('div.gl-i-wrap')
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' ' ), ' gl-i-wrap ')]" ...
A long list of Selector objects comes back.
Count the number of products:
>>> len(response.css('div.gl-i-wrap'))
30
Only 30 ice cream entries are returned, yet we clearly saw 60 on the page. Why?
Answer: the page initially holds only 30 entries; when we drag the scrollbar, JavaScript runs and fires an Ajax request, and the browser renders the other 30 entries once the data arrives.
We can confirm this in the Network tab of the developer tools.
So the plan is: use JavaScript to simulate the user scrolling to the bottom of the page, and run that script through the JS execution service provided by the execute endpoint. Let's try it.
Note: <div id="footer-2017"></div> is the footer of the page, so scrolling it into view is enough to reach the bottom.
First, simulate the user behavior.
In the browser console, run:
e = document.getElementById("footer-2017")
e.scrollIntoView(true)
The effect: the page scrolls straight to the bottom.
Parameter explanation:
scrollIntoView is an API for scrolling the page (or a container); only the boolean form of its parameter enjoys wide support (Firefox 36+, etc.).
When called with true, the page (or container) scrolls so that the element's top is aligned with the top of the viewport (or container); with false, the element's bottom is aligned with the bottom of the viewport instead.
Using SplashRequest
When we fetched the page with a plain Request above, we only got 30 entries, because the rest of the page is loaded dynamically.
So here we send the request with scrapy_splash's SplashRequest and the execute endpoint to work around that.
Open the dynamic_page spider project from the previous article in PyCharm and open the Terminal.
Run dir to confirm the current directory is dynamic_page:
(crawler) E:\python_script\爬蟲\dynamic_page>dir
 驅動器 E 中的卷是 file
 卷的序列號是 1607-A400

 E:\python_script\爬蟲\dynamic_page 的目錄

2020/09/12  10:37    <DIR>          .
2020/09/12  10:37    <DIR>          ..
2020/09/12  10:20               211 bin.py
2020/09/12  14:30                 0 dynamicpage_pipline.json
2020/09/12  10:36    <DIR>          dynamic_page
2020/09/12  10:33                 0 result.csv
2020/09/12  10:18               267 scrapy.cfg
               4 個文件            478 字節
               3 個目錄 260,445,159,424 可用字節
Next, open a scrapy shell by running:
scrapy shell
Output:
...
[s]   view(response)    View response in a browser
>>>
Finally, paste in the following code:
from scrapy_splash import SplashRequest  # send the request with SplashRequest

url = "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"
lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''
# fetch again; this time the JS is rendered through the Splash service's execute endpoint on port 8050 and the full page comes back
fetch(SplashRequest(url, endpoint='execute', args={'lua_source': lua}))
len(response.css('div.gl-i-wrap'))
The result:
[s]   view(response)    View response in a browser
>>> from scrapy_splash import SplashRequest  # send the request with SplashRequest
>>> url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'
>>> lua = '''
... function main(splash)
...     splash:go(splash.args.url)
...     splash:wait(3)
...     splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
...     splash:wait(3)
...     return splash:html()
... end
... '''
>>> fetch(SplashRequest(url, endpoint='execute', args={'lua_source': lua}))  # the JS is now rendered through the Splash service's execute endpoint and the result comes back successfully
2020-09-12 14:30:54 [scrapy.core.engine] INFO: Spider opened
2020-09-12 14:30:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.jd.com/error.aspx> from <GET https://search.jd.com/robots.txt>
2020-09-12 14:30:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jd.com/error.aspx> (referer: None)
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
...
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 408 without any user agent to enforce it on.
2020-09-12 14:30:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.0.10:8050/robots.txt> (referer: None)
2020-09-12 14:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8 via http://192.168.0.10:8050/execute> (referer: None)
>>> len(response.css('div.gl-i-wrap'))
60
Note: fetch() is only available inside the scrapy shell, so this snippet cannot be run as a normal Python script.
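If you do want to test the same Lua script outside the scrapy shell, one option is to call Splash's HTTP execute endpoint directly. Below is a minimal sketch, assuming the requests and parsel packages are installed and the Splash address from the environment above (192.168.0.10:8050):
# A sketch for testing outside the scrapy shell: POST the Lua script to Splash's /execute HTTP API.
# Assumes the requests and parsel packages and the Splash server at 192.168.0.10:8050.
import requests
from parsel import Selector

lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''

url = "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"

# POST the script to the execute endpoint; the rendered HTML is returned as the response body
resp = requests.post("http://192.168.0.10:8050/execute",
                     json={"lua_source": lua, "url": url},
                     timeout=90)

print(len(Selector(text=resp.text).css("div.gl-i-wrap")))  # expect 60 after the scroll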
With that confirmed, the remaining task is extracting the content, so let's put the whole project together, testing along the way with the scrapy shell.
3. Code Implementation
Create the project
Any empty directory will do for this project.
Open PyCharm, open the Terminal, and run the following commands:
scrapy startproject ice_cream
cd ice_cream
scrapy genspider jd search.jd.com
In the same directory as scrapy.cfg, create bin.py to launch the Scrapy project:
# bin.py, created in the project root
from scrapy.cmdline import execute

# the third argument is the spider name
execute(['scrapy', 'crawl', 'jd', "--nolog"])
The resulting project tree looks like this:
./
├── bin.py
├── ice_cream
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── jd.py
└── scrapy.cfg
Edit settings.py
For the settings.py changes, refer to the scrapy-splash project on GitHub (https://github.com/scrapy-plugins/scrapy-splash), which documents them in detail.
Append the following to the end of the file:
# Splash server address
SPLASH_URL = 'http://192.168.0.10:8050'

# Enable the two download middlewares and adjust the order of HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Splash-aware deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Support for cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

ITEM_PIPELINES = {
    'ice_cream.pipelines.IceCreamPipeline': 100,
}
Note: change the Splash server address to match your own environment; nothing else needs to be modified.
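Optionally, you can quickly confirm that the Splash service is reachable from the development machine before moving on. A minimal sketch using Splash's render.html endpoint (the requests package is assumed; adjust the address to match your SPLASH_URL):
# Quick optional check that the Splash service is reachable (a sketch, not part of the project code)
import requests

resp = requests.get("http://192.168.0.10:8050/render.html",
                    params={"url": "https://www.jd.com", "wait": 2},
                    timeout=60)
print(resp.status_code)  # 200 means Splash rendered the page successfully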
Edit jd.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ice_cream.items import IceCreamItem

# custom Lua script: scroll the footer into view so the lazy-loaded products are rendered
lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8']
    base_url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'

    # per-spider configuration; note: the variable name must be custom_settings
    # REQUEST_HEADERS is a custom key that is read back below via self.settings.get()
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def parse(self, response):
        # the total page count (about 100 pages) can be read from the pager
        page_num = int(response.css('span.fp-text i::text').extract_first())
        # 100 pages is too many, so cap it at 2 here
        page_num = 2
        # print("page_num", page_num)
        for i in range(page_num):
            # JD numbers its visible pages 1, 3, 5, ... (a pattern observed in the URLs)
            url = '%s&page=%s' % (self.base_url, 2 * i + 1)
            # print("url", url)
            yield SplashRequest(url,
                                headers=self.settings.get('REQUEST_HEADERS'),
                                endpoint='execute',
                                args={'lua_source': lua},
                                callback=self.parse_item)

    def parse_item(self, response):
        # page-parsing callback
        for sel in response.css('div.gl-i-wrap'):
            # create an item object for each product to store its information
            item = IceCreamItem()
            name = sel.css('div.p-name em').extract_first()
            price = sel.css('div.p-price i::text').extract_first()
            # print("name", name)
            # print("price", price)
            item['name'] = name
            item['price'] = price
            yield item
            # yield {
            #     'name': name,
            #     'price': price,
            # }
Edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IceCreamItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # these fields correspond one-to-one with what jd.py populates
    name = scrapy.Field()
    price = scrapy.Field()
Edit pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class IceCreamPipeline(object):

    def __init__(self):
        # in Python 3 the file must be opened with 'wb' to write the encoded JSON bytes
        self.f = open("ice_cream_pipline.json", 'wb')

    def process_item(self, item, spider):
        # serialize the item data and append a newline
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content.encode('utf-8'))
        return item

    def close_spider(self, spider):
        # close the file
        self.f.close()
Run bin.py and wait about a minute; the file ice_cream_pipline.json will be generated.
Open the JSON file; the contents look like this:
{"name": "<em><span class=\"p-tag\" style=\"background-color:#c81623\">京東超市</span>\t\n明治(meiji)草莓白巧克力雪糕 245g(6支)彩盒 <font class=\"skcolor_ljg\">冰淇淋</font></em>", "price": "46.80"},
{"name": "<em><span class=\"p-tag\" style=\"background-color:#c81623\">京東超市</span>\t\n伊利 巧樂茲香草巧克力口味脆皮甜筒雪糕<font class=\"skcolor_ljg\">冰淇淋</font>冰激凌冷飲 73g*6/盒</em>", "price": "32.80"},
...
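Note that because the pipeline appends a comma and a newline after every object, the file is not a valid JSON document on its own. A minimal sketch of one way to load it back into Python (assuming the file sits in the current directory):
# Load the generated file back into Python (a sketch; the raw file is not valid JSON as written)
import json

with open("ice_cream_pipline.json", encoding="utf-8") as f:
    text = f.read().rstrip().rstrip(",")  # drop the trailing comma after the last object

items = json.loads("[" + text + "]")      # wrapping in brackets yields a valid JSON array
print(len(items), items[0]["price"])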
Reference for this article:
https://www.cnblogs.com/518894-lu/p/9067208.html