Using scrapy-splash to simulate clicks on JD.com and scrape the data

Reposted article · 2019-06-10 16:23 · Tags: splash / scrapy

This is my first blog post, so please point out anything that could be better. Let's improve together!

Introduction to scrapy-splash

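The scrapy-splash module is built on Splash, a JavaScript rendering service. Splash is a lightweight browser with an HTTP API, implemented in Python on top of Twisted and Qt. Twisted's Qt reactor makes the service fully asynchronous, so it can take advantage of WebKit's concurrency. Splash's main features are: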

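Process multiple web pages in parallel
Return the HTML result and/or a rendered screenshot
Turn off image loading or use Adblock Plus rules to make rendering faster
Run JavaScript in the page to process its content
Write Lua scripts
Develop Splash Lua scripts in Splash-Jupyter Notebooks
Get detailed rendering information in HAR format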
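
Because these features are exposed over HTTP, a page can be rendered without Scrapy at all. The following is a minimal sketch that only builds the request URL. It assumes a Splash instance at http://localhost:8050 (an assumption; substitute your own address), and `build_render_url` is a hypothetical helper name. The `render.html` endpoint and its `url`, `wait`, and `images` parameters belong to Splash's HTTP API.

```python
from urllib.parse import urlencode


def build_render_url(splash_url, target_url, wait=0.5, images=False):
    """Return the render.html URL that asks Splash to render target_url."""
    params = {
        "url": target_url,       # page that Splash should load
        "wait": wait,            # seconds to wait after the page has loaded
        "images": int(images),   # 0 disables image loading
    }
    return f"{splash_url.rstrip('/')}/render.html?{urlencode(params)}"


# The address below is an assumed local Splash instance.
print(build_render_url("http://localhost:8050", "https://search.jd.com"))
```

Fetching the printed URL with any HTTP client should return the rendered HTML, provided Splash is running at that address.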

Reference: https://www.cnblogs.com/jclian91/p/8590617.html

Preparation

The Scrapy framework
Splash: Windows users install Docker inside a virtual machine; Linux users install Docker directly
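
Once Docker is installed, Splash is usually started with `docker run -p 8050:8050 scrapinghub/splash`. Before running the spider it can help to confirm that the service is reachable. The sketch below is one way to do that, using Splash's `_ping` endpoint. The helper name `splash_is_up` and the default address are assumptions for illustration.

```python
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=3.0):
    """Return True if Splash answers its _ping endpoint with HTTP 200."""
    ping_url = f"{base_url.rstrip('/')}/_ping"
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, DNS failures and HTTP errors.
        return False


if __name__ == "__main__":
    # Assumed address; with a Docker VM, use the VM's IP instead.
    print(splash_is_up("http://localhost:8050"))
```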

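Page analysis

First open https://search.jd.com/ and search for the books you want; this article uses python3.7 books as the example.

After clicking search, you can see that JD loads the book data with JavaScript, and scrolling down loads more books (the data can also be fetched through JD's API).

The first step is to simulate the search. Inspecting the page shows the search box (#keyword) and the search button (.input_submit) used in the scripts.

The second step is to simulate scrolling down. The element at the bottom of the page (class bottom-search) is chosen as the scroll target.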

Start scraping

The Lua script that simulates the click; the page it returns is also used to read the number of pages:

function main(splash, args)
  -- Skip images to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  -- Type the keyword into the search box
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  -- Click the search button
  local button = splash:select('.input_submit')
  button:click()
  splash:wait(2)
  -- Scroll the bottom element into view so the remaining results are loaded
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
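
The script above hard-codes the search keyword. Splash hands any extra request arguments to the Lua `main` function through `args`, so the keyword could instead be supplied from Python. The following is a sketch of that idea, not the original author's code: `build_search_args`, `SEARCH_SCRIPT`, and the argument name `keyword` are hypothetical, and the scroll step is left out for brevity.

```python
# Lua source kept as a Python string; args.keyword is filled in by Splash
# from the request arguments (the name "keyword" is an assumption).
SEARCH_SCRIPT = """
function main(splash, args)
  splash.images_enabled = false
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select('#keyword')
  input:send_text(args.keyword)   -- supplied from Python, not hard-coded
  splash:wait(0.5)
  splash:select('.input_submit'):click()
  splash:wait(2)
  return splash:html()
end
"""


def build_search_args(keyword):
    """Build the args dict for SplashRequest(..., endpoint='execute', args=...)."""
    return {"lua_source": SEARCH_SCRIPT, "keyword": keyword}


args = build_search_args("python3.7")
print(sorted(args))  # ['keyword', 'lua_source']
```

In a spider this dict would be passed as `SplashRequest(url, callback=self.parse, endpoint='execute', args=build_search_args('python3.7'))`, which lets one script serve any search term.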


Likewise, the script that simulates scrolling down:

function main(splash, args)
  -- Skip images to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  -- Scroll the bottom element into view so the remaining results are loaded
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end


Pick the elements you want to extract by inspecting the page. The full source:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ..items import JdsplashItem

# First request: type the keyword, click search, then scroll to the bottom
# so that the lazily loaded results are rendered as well.
lua_script = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  local button = splash:select('.input_submit')
  button:click()
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''

# Later pages: open the search URL directly and scroll to the bottom.
lua_script2 = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''

# Search URL for later pages; the two %d slots are the "page" and "s" parameters.
SEARCH_URL = 'https://search.jd.com/Search?keyword=python3.7&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0'


class JdBookSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com']

    def start_requests(self):
        # Open the search page and run the search through Splash
        for each in self.start_urls:
            yield SplashRequest(each, callback=self.parse, endpoint='execute',
                                args={'lua_source': lua_script})

    def parse(self, response):
        # Items on the first result page
        yield from self.s_parse(response)
        # The second-to-last link in the pager holds the total page count
        page_num = response.xpath("//span[@class= 'p-num']/a[last()-1]/text()").get()
        if page_num is None:
            self.logger.warning('Page count not found on %s', response.url)
            return
        # Continue from the second page
        for each_page in range(1, int(page_num)):
            yield SplashRequest(SEARCH_URL % (each_page * 2 + 1, each_page * 60),
                                callback=self.s_parse, endpoint='execute',
                                args={'lua_source': lua_script2})

    def s_parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        # The XPath function string(arg) returns the string value of its argument,
        # which may be a number, a boolean or a node-set. This is perhaps where
        # XPath is a little more refined than CSS.
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        # Alternative: response.css('div.gl-i-wrap div.p-commit').xpath('string(.//strong)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        # zip() stops at the shortest list, so a product that lacks one field
        # shifts the fields of the products after it.
        for each in zip(name, price, comment, publishstore, href):
            item = JdsplashItem()  # a fresh item for every product
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
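
The pagination arithmetic in the spider can be checked on its own, without Scrapy or Splash. The sketch below reproduces the same formula; `build_page_urls` is a hypothetical helper, and the URL template is copied from the spider. Reading the formula, the `page` parameter appears to advance by two for each visible result page and `s` by 60, but that is an inference from the code rather than documented JD behavior.

```python
SEARCH_URL = (
    "https://search.jd.com/Search?keyword=python3.7&enc=utf-8"
    "&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0"
)


def build_page_urls(page_num):
    """Return the URLs for result pages 2..page_num, using the spider's formula."""
    return [
        SEARCH_URL % (each_page * 2 + 1, each_page * 60)
        for each_page in range(1, page_num)
    ]


urls = build_page_urls(3)
print(len(urls))  # 2
for url in urls:
    print(url)
```

With three result pages, two further requests are generated, carrying `page=3&s=60` and `page=5&s=120`.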


Configuration of each file:

items.py:

import scrapy


class JdsplashItem(scrapy.Item):
    name = scrapy.Field()     # book title
    price = scrapy.Field()    # listed price
    p_store = scrapy.Field()  # shop name
    comment = scrapy.Field()  # comment count text
    href = scrapy.Field()     # link to the product page

settings.py:

# Address of the Splash server (here a Docker VM; use your own host and port)
SPLASH_URL = 'http://192.168.99.100:8050'
# Enable the two Splash downloader middlewares and adjust the order of HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
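
The scrapy-splash README lists a few more settings alongside the ones above. They are not part of the original post; the fragment below is a sketch of what could be appended to settings.py, and whether each one is needed depends on the project (the cache storage only matters when Scrapy's HTTP cache is enabled).

```python
# Additional settings recommended by the scrapy-splash README
# (not in the original post; add them to settings.py if needed).

# Deduplicate large Splash arguments such as lua_source between requests
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Duplicate filter that takes the Splash arguments into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Cache storage that takes the Splash arguments into account
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```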

Finally, run the spider, for example with scrapy crawl jd, and the book data is scraped.
