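This is my first blog post. If anything is poorly written, please point it out so we can all improve together!
Introduction to scrapy-splash
The scrapy-splash module is built on Splash, a JavaScript rendering service. Splash is a lightweight browser that exposes an HTTP API; it is implemented in Python using Twisted and QT. The (Twisted) QT reactor makes the service asynchronous, which lets it take advantage of WebKit's concurrency. Splash's features are:
- Process multiple web pages in parallel
- Get HTML results and/or render pages to images
- Turn off image loading or use Adblock Plus rules to speed up rendering
- Process page content with JavaScript
- Use Lua scripts
- Develop Splash Lua scripts in Splash-Jupyter Notebooks
- Get detailed rendering information in HAR format
Reference: https://www.cnblogs.com/jclian91/p/8590617.html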
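To make the "HTTP API" part concrete, here is a minimal sketch of how a request to Splash's render.html endpoint can be put together. The helper name build_render_url is made up for illustration, and the base address is an assumption borrowed from the SPLASH_URL in the settings further down; only the url, wait and images query parameters are used.

```python
from urllib.parse import urlencode

# Assumption: a Splash instance listens at this address (same value as the
# SPLASH_URL in settings.py later in this post); change it for your setup.
SPLASH_BASE = "http://192.168.99.100:8050"


def build_render_url(page_url, wait=2.0, images=False, base=SPLASH_BASE):
    """Build a GET URL for Splash's render.html endpoint (hypothetical helper)."""
    params = {
        "url": page_url,               # page that Splash should load and render
        "wait": wait,                  # seconds to wait after the page has loaded
        "images": 1 if images else 0,  # 0 turns image loading off
    }
    return f"{base}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    # Opening the printed URL in a browser (or with curl) should return the
    # rendered HTML, provided Splash is actually running at SPLASH_BASE.
    print(build_render_url("https://search.jd.com"))
```

scrapy-splash wraps this kind of call for you, so the spider below never builds these URLs by hand.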
Preparation
- The Scrapy framework
- Splash: Windows users install Docker through a virtual machine, Linux users install Docker directly
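Once Docker is installed, Splash is usually started from its Docker image (for example with docker run -p 8050:8050 scrapinghub/splash). Before running the spider it can help to check that the service answers. The sketch below queries Splash's _ping endpoint using only the standard library; the helper names ping_url and splash_is_up are made up for illustration, and the address is an assumption that depends on your Docker setup.

```python
import json
from urllib.request import urlopen


def ping_url(base):
    """Return the _ping URL for a Splash base address (hypothetical helper)."""
    return base.rstrip("/") + "/_ping"


def splash_is_up(base, timeout=3.0):
    """Return True if Splash answers _ping with status "ok", else False."""
    try:
        with urlopen(ping_url(base), timeout=timeout) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError is a
        # subclass); ValueError covers a body that is not valid JSON.
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


if __name__ == "__main__":
    # Assumption: this address matches the SPLASH_URL used in settings.py.
    print(splash_is_up("http://192.168.99.100:8050"))
```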
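Page analysis
First, go to https://search.jd.com/ and search for the books you want; this post uses python3.7 books as the example.
After clicking search, you can see that JD loads the book data with JavaScript: scrolling down the page loads more books. (The data can also be fetched through JD's API.)
The first step is to simulate the search. Inspecting the page gives the search box (#keyword) and the search button (.input_submit):
The second step is to simulate scrolling down. The element at the bottom of the page (class bottom-search) is used as the scroll target: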
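Start crawling
The Lua script that simulates the search click; the page it returns is also used to read the number of result pages: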

```lua
function main(splash, args)
  -- Skip image loading to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  -- Type the keyword into the search box
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  -- Click the search button
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  -- Scroll the bottom element into view so the remaining books get loaded
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
```
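The script above hardcodes the search keyword. As a possible variation (not what the spider in this post does), the keyword can be passed in from Python: extra keys in the args dict of a SplashRequest are visible to the Lua script as fields of args. The names LUA_SEARCH and make_search_args are made up for this sketch, and the Lua part is shortened to the search step only.

```python
# Assumption: the extra "keyword" key in the args dict reaches the Lua script
# as args.keyword, the same way args.url does.
LUA_SEARCH = """
function main(splash, args)
  splash.images_enabled = false
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text(args.keyword)
  splash:wait(0.5)
  splash:select(".input_submit"):click()
  splash:wait(2)
  return splash:html()
end
"""


def make_search_args(keyword):
    """Return the args dict for a SplashRequest (hypothetical helper)."""
    return {"lua_source": LUA_SEARCH, "keyword": keyword}


if __name__ == "__main__":
    print(sorted(make_search_args("python3.7")))
```

In a spider this would be used as SplashRequest(url, callback=self.parse, endpoint='execute', args=make_search_args('python3.7')).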
Likewise, the script that simulates scrolling down:

```lua
function main(splash, args)
  -- Skip image loading to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  -- Scroll the bottom element into view so the remaining books get loaded
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
```
Pick the elements you want to extract and find their selectors by inspecting the page. Here is the full source:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ..items import JdsplashItem


# Runs the search on the landing page, then scrolls to load all books
lua_script = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''

# Opens a result page directly, then scrolls to load all books
lua_script2 = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''


class JdBookSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com']

    def start_requests(self):
        # Open the search page and run the search
        for each in self.start_urls:
            yield SplashRequest(each, callback=self.parse, endpoint='execute',
                                args={'lua_source': lua_script})

    def parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        page_num = response.xpath("//span[@class= 'p-num']/a[last()-1]/text()").get()
        # This uses the XPath function string(arg), which returns the string value
        # of its argument. The argument can be a number, a boolean or a node-set.
        # Perhaps this is where XPath is more refined than CSS.
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        # comment = response.css('div.gl-i-wrap div.p-commit').xpath('string(.//strong)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            # Create a fresh item per book so yielded items do not share state
            item = JdsplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
        # Stop here if the pager was not found (int(None) would raise)
        if page_num is None:
            return
        # Continue from the second page
        url = 'https://search.jd.com/Search?keyword=python3.7&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0'
        for each_page in range(1, int(page_num)):
            yield SplashRequest(url % (each_page * 2 + 1, each_page * 60),
                                callback=self.s_parse, endpoint='execute',
                                args={'lua_source': lua_script2})

    def s_parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            # Create a fresh item per book so yielded items do not share state
            item = JdsplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
```
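The follow-up requests in parse build their page and s query parameters from the loop index. The sketch below repeats just that arithmetic so it can be checked in isolation. The helper names are made up, and the page count of 4 in the usage line is an arbitrary example value, not a number taken from JD.

```python
# Same URL template as in the spider; page and s are filled in per request.
SEARCH_URL = ('https://search.jd.com/Search?keyword=python3.7&enc=utf-8'
              '&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0')


def follow_up_params(page_num):
    """Return the (page, s) pairs the spider requests after the first page."""
    return [(n * 2 + 1, n * 60) for n in range(1, page_num)]


def follow_up_urls(page_num):
    """Return the full URLs for those (page, s) pairs."""
    return [SEARCH_URL % pair for pair in follow_up_params(page_num)]


if __name__ == "__main__":
    # Example value only: pretend the pager reported 4 result pages.
    for url in follow_up_urls(4):
        print(url)
```

With a page count of 4 the loop issues three requests, with page values 3, 5, 7 and s values 60, 120, 180; the first result page is the one already handled by parse.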
Configuration of each file:
items.py:
```python
import scrapy


class JdsplashItem(scrapy.Item):
    name = scrapy.Field()     # book title
    price = scrapy.Field()    # listed price
    p_store = scrapy.Field()  # store (publisher) name
    comment = scrapy.Field()  # comment count
    href = scrapy.Field()     # link to the book page
```
settings.py:
```python
# Splash server address
SPLASH_URL = 'http://192.168.99.100:8050'
# Enable the two Splash downloader middlewares and adjust the order of
# HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```
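The settings above are the ones used in this post. The scrapy-splash README lists a few more settings alongside them; they are shown here as an optional addition to settings.py, not as something this project needed in order to run.

```python
# Optional additions to settings.py, following the scrapy-splash README.

# Saves space by not storing duplicate Splash arguments in the request queue
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Duplicate filter that takes the Splash arguments of a request into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Only relevant if Scrapy's HTTP cache is enabled
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```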
Finally, run the code, and you can see that the book data has been scraped.