These days, to speed up page loading, large parts of a page are generated with JavaScript. For a Scrapy crawler this is a real problem: Scrapy has no JavaScript engine, so it only fetches the static page, and content generated dynamically by JS cannot be obtained.
Solutions:
1. Use a third-party middleware that provides a JS rendering service, such as scrapy-splash.
2. Use WebKit, or a library built on top of WebKit.
Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API, implemented in Python using Twisted and QT. Twisted and QT give the service asynchronous processing capability, so it can take advantage of WebKit's concurrency.
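As a quick illustration of that HTTP API (assuming a Splash instance is already reachable at localhost:8050; the URL is the JD product page used later in this post), a single page can be rendered with:

$ curl 'http://localhost:8050/render.html?url=https://item.jd.com/2600240.html&wait=0.5'

The render.html endpoint returns the page's HTML after the JavaScript has run; the wait argument gives the page a moment to finish rendering.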
Here is how to use scrapy-splash:
1. Install the scrapy-splash library with pip:
pip install scrapy-splash
2. Install Docker
scrapy-splash talks to the Splash HTTP API, so it needs a running Splash instance. Splash is usually run inside Docker, so Docker has to be installed first; see http://www.cnblogs.com/shaosks/p/6932319.html for the details.
3. Start Docker
After installation, start Docker. A successful install leaves a "Docker Quickstart Terminal" icon; double-click it to launch Docker.

4. Pull the image:
$ docker pull scrapinghub/splash
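Optionally, you can confirm the image was downloaded:

$ docker images scrapinghub/splash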

With that, the image is in place and Splash is ready to run.
5. Run the scrapinghub/splash service with Docker:
$ docker run -p 8050:8050 scrapinghub/splash
The first start is fairly slow while it loads everything it needs. If the service is started again while an instance is already running, an error message appears instead. When that happens, close the current window, kill the leftover processes in the task manager, then reopen the Docker Quickstart Terminal and enter: docker run -p 8050:8050 scrapinghub/splash
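To confirm that the service is reachable from the host before wiring it into Scrapy (the address below is the Docker Toolbox VM IP used in settings.py later; with a native Docker install it would be http://localhost:8050), you can hit it once with curl:

$ curl http://192.168.99.100:8050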

6. Configure the Splash service (all of the following goes in settings.py; the snippet under each step is taken from the full settings.py listed in section 11):
1) Add the Splash server address:
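SPLASH_URL = 'http://192.168.99.100:8050'

(This is the Docker Toolbox VM address; point it at wherever your Splash container is actually listening.)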

2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
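DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}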

3) Enable SplashDeduplicateArgsMiddleware in SPIDER_MIDDLEWARES:
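SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}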

4) Set a custom DUPEFILTER_CLASS:
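DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'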

5) Set a custom cache storage backend:
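HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'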

7. Scraping
This example scrapes the details of a phone product on JD.com: https://item.jd.com/2600240.html
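The key point (taken from the full spider in section 11) is that every request is wrapped in a SplashRequest, so the page is rendered by Splash before it is parsed:

def start_requests(self):
    for url in self.start_urls:
        # wait gives the page time to finish its JS rendering
        yield SplashRequest(url, self.parse, args={'wait': '0.5'})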
The fields to extract, together with the XPath expression used for each, are listed below.
1. JD price
Extraction code: prices = site.xpath('//span[@class="p-price"]/span/text()')
2. Promotions
Extraction code: cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
3. Value-added services
Extraction code: value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
4. Weight
Extraction code: quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
5. Colour options
Extraction code: colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
6. Version options
Extraction code: versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
7. Purchase method
Extraction code: buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
8. Bundles
Extraction code: suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
9. Value-added protection
Extraction code: vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
10. Baitiao (JD credit) installments
Extraction code: stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
8. Start the Splash service
Before scraping, the Splash service has to be running: click the "Docker Quickstart Terminal" icon and run docker run -p 8050:8050 scrapinghub/splash

9. Run the spider: scrapy crawl scrapy_splash

10. Scraped data
Each scraped item is written by the pipeline as one JSON line to spider.txt; the spider also redirects its console output to output.txt.


11. Full source code
1) SplashSpider
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from scrapy_splash import SplashMiddleware
from scrapy.http import Request, HtmlResponse
from scrapy.selector import Selector
from splash_test.items import SplashTestItem
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = open('output.txt', 'w')


class SplashSpider(Spider):
    name = 'scrapy_splash'
    start_urls = [
        'https://item.jd.com/2600240.html'
    ]

    # each request has to be wrapped in a SplashRequest
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                self.parse,
                                args={'wait': '0.5'},
                                # endpoint='render.json'
                                )

    def parse(self, response):
        # This post only scrapes a single JD link, a product page whose price is
        # generated by ajax. To keep crawling, use a CrawlSpider, or uncomment
        # the commented-out lines below.
        site = Selector(response)
        it_list = []
        it = SplashTestItem()

        # JD price
        # prices = site.xpath('//span[@class="price J-p-2600240"]/text()')
        # it['price'] = prices[0].extract()
        # print 'JD price: ' + it['price']
        prices = site.xpath('//span[@class="p-price"]/span/text()')
        it['price'] = prices[0].extract() + prices[1].extract()
        print 'JD price: ' + it['price']

        # promotions
        cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
        strcx = ''
        for cx in cxs:
            strcx += str(cx.extract()) + ' '
        it['promotion'] = strcx
        print 'Promotions: %s ' % strcx

        # value-added services
        value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
        strValueAdd = ''
        for va in value_addeds:
            strValueAdd += str(va.extract()) + ' '
        print 'Value-added services: %s ' % strValueAdd
        it['value_add'] = strValueAdd

        # weight
        quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
        print 'Weight: %s ' % str(quality[0].extract())
        it['quality'] = quality[0].extract()

        # colour options
        colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
        strcolor = ''
        for color in colors:
            strcolor += str(color.extract()) + ' '
        print 'Colour options: %s ' % strcolor
        it['color'] = strcolor

        # version options
        versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
        strversion = ''
        for ver in versions:
            strversion += str(ver.extract()) + ' '
        print 'Version options: %s ' % strversion
        it['version'] = strversion

        # purchase method
        buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
        print 'Purchase method: %s ' % str(buy_style[0].extract())
        it['buy_style'] = buy_style[0].extract()

        # bundles
        suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
        strsuit = ''
        for tz in suits:
            strsuit += str(tz.extract()) + ' '
        print 'Bundles: %s ' % strsuit
        it['suit'] = strsuit

        # value-added protection
        vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
        strvaps = ''
        for vap in vaps:
            strvaps += str(vap.extract()) + ' '
        print 'Value-added protection: %s ' % strvaps
        it['value_add_protection'] = strvaps

        # Baitiao installments
        stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
        strstaging = ''
        for st in stagings:
            ststr = str(st.extract())
            strstaging += ststr.strip() + ' '
        print 'Baitiao installments: %s ' % strstaging
        it['staging'] = strstaging

        it_list.append(it)
        return it_list
2) SplashTestItem
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SplashTestItem(scrapy.Item):
    # unit price
    price = scrapy.Field()
    # description = Field()
    # promotions
    promotion = scrapy.Field()
    # value-added services
    value_add = scrapy.Field()
    # weight
    quality = scrapy.Field()
    # colour options
    color = scrapy.Field()
    # version options
    version = scrapy.Field()
    # purchase method
    buy_style = scrapy.Field()
    # bundles
    suit = scrapy.Field()
    # value-added protection
    value_add_protection = scrapy.Field()
    # Baitiao installments
    staging = scrapy.Field()
    # post_view_count = scrapy.Field()
    # post_comment_count = scrapy.Field()
    # url = scrapy.Field()
3) SplashTestPipeline
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class SplashTestPipeline(object):
    def __init__(self):
        # self.file = open('data.json', 'wb')
        self.file = codecs.open(
            'spider.txt', 'w', encoding='utf-8')
        # self.file = codecs.open(
        #     'spider.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()
4) settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for splash_test project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

ITEM_PIPELINES = {
    'splash_test.pipelines.SplashTestPipeline': 300
}

BOT_NAME = 'splash_test'

SPIDER_MODULES = ['splash_test.spiders']
NEWSPIDER_MODULE = 'splash_test.spiders'

SPLASH_URL = 'http://192.168.99.100:8050'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'splash_test (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'splash_test.middlewares.SplashTestSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'splash_test.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'splash_test.pipelines.SplashTestPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
