1. Integrating Scrapy with Selenium
Scrapy fetches pages in much the same way as the requests library: it directly simulates HTTP requests, so Scrapy on its own cannot scrape pages rendered dynamically by JavaScript. In earlier posts we covered two ways to scrape JavaScript-rendered pages. One is to analyze the Ajax requests and scrape the underlying API; Scrapy can use that approach as well. The other is to drive a real browser with Selenium, in which case we do not need to care about the requests happening behind the page or analyze the rendering process, only the final rendered result: what you can see, you can scrape. So if Scrapy can be hooked up to Selenium, Scrapy can handle scraping virtually any website.
1.1 Creating the Project
First, create a new project named scrapyseleniumtest.
scrapy startproject scrapyseleniumtest
Then generate a new Spider.
scrapy genspider jd www.jd.com
Set ROBOTSTXT_OBEY to False.
ROBOTSTXT_OBEY = False
1.2 Defining the Item
We will not use an Item here; the spider will yield plain dicts instead.
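If you do prefer a typed Item over plain dicts, a minimal sketch matching the fields this spider will eventually yield (dp, title, price, comment, url, type) could look like the following. It is optional and not used in the rest of this post.

from scrapy import Item, Field


class ProductItem(Item):
    # Optional sketch: one Field per value yielded by the spider below.
    dp = Field()       # shop / vendor name
    title = Field()    # product title
    price = Field()    # price text
    comment = Field()  # comment count text
    url = Field()      # product detail page URL
    type = Field()     # source tag, e.g. 'JINGDONG'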
Let us start with an initial implementation of the Spider's start_requests() method.
# -*- coding: utf-8 -*-
from scrapy import Request, Spider
from urllib.parse import quote
from bs4 import BeautifulSoup


class JdSpider(Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    base_url = 'https://search.jd.com/Search?keyword='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                # dont_filter=True: do not deduplicate; every page uses the same URL
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)
We first define a base_url, the URL of the product list. Appending a search keyword to it gives the JD search result page (the product list) for that keyword.
The keywords are given by KEYWORDS, defined as a list, and the maximum page number by MAX_PAGE. Both are defined in settings.py.
KEYWORDS = ['iPad']
MAX_PAGE = 2
In start_requests() we iterate over the keywords and over the page numbers, constructing and yielding a Request for each combination. Since the search URL is identical for every page, the page number is passed through the meta parameter, and dont_filter is set so the duplicate filter does not drop the repeated URLs. When the crawler starts, it therefore generates one request per page of the product list for every keyword.
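To make that concrete, here is a small standalone illustration (not part of the project code) of the two requests generated for KEYWORDS = ['iPad'] and MAX_PAGE = 2: same URL, different meta['page'].

from scrapy import Request
from urllib.parse import quote

# Illustration only: mimic what start_requests() yields for KEYWORDS = ['iPad'], MAX_PAGE = 2.
base_url = 'https://search.jd.com/Search?keyword='
for page in range(1, 3):
    r = Request(url=base_url + quote('iPad'), meta={'page': page}, dont_filter=True)
    print(r.url, r.meta)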
1.3 Integrating Selenium
Next we need to handle the downloading of these requests. This time we fetch the pages with Selenium, implemented as a Downloader Middleware. In the middleware we drive Selenium, take the rendered page source, construct an HtmlResponse object, and return it straight to the Spider for parsing and data extraction; the downloader itself never downloads the page.
class SeleniumMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def process_request(self, request, spider):
        '''
        Drive Selenium in this downloader middleware: render the page, grab its
        source, build an HtmlResponse object and return it directly to the Spider
        for parsing and data extraction. The downloader never downloads the page itself.
        :param request:
        :param spider:
        :return:
        '''
        self.logger.debug('Chrome is starting')
        page = request.meta.get('page', 1)
        self.wait = WebDriverWait(self.browser, self.timeout)
        try:
            self.browser.get(request.url)
            if page > 1:
                # Locate the "jump to page" input box; presence_of_element_located
                # waits until it has been loaded into the DOM.
                input = self.wait.until(EC.presence_of_element_located(
                    (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
                # Locate the confirm button; element_to_be_clickable waits until it can be clicked.
                submit = self.wait.until(EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a')))
                input.clear()
                input.send_keys(page)
                submit.click()  # click the button to jump to the requested page
                time.sleep(5)
                # text_to_be_present_in_element waits until the pager highlights the requested page number.
                self.wait.until(EC.text_to_be_present_in_element(
                    (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page)))
            # Wait for the product list #J_goodsList to be rendered before returning the page source.
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)
In __init__() we initialize a few objects, including the WebDriverWait object, and set the browser window size and the page load timeout. In process_request() we read the page number to crawl from the Request's meta attribute. For pages beyond the first, the page-jump input box is assigned to input and the confirm button to submit; we type the page number into the box, click the button, and wait for the page to load. Finally we return an HtmlResponse directly to the Spider for parsing. The page never goes through the downloader: we construct the Response subclass HtmlResponse ourselves. (When a downloader middleware's process_request() returns a Response object, the process_request() methods of lower-priority middlewares are no longer executed; Scrapy switches to calling process_response() methods instead. Since this example defines no other process_response(), the result is handed straight to the Spider for parsing.)
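One judgment call above is closing the browser in __del__, which runs at an unpredictable time. A common alternative, shown here only as a sketch and not part of the original project, is to connect the cleanup to Scrapy's spider_closed signal in from_crawler; process_request would stay the same as in SeleniumMiddleware.

from logging import getLogger

from scrapy import signals
from selenium import webdriver


class SeleniumMiddlewareWithSignal(object):
    """Sketch of an alternative lifecycle: quit the browser on spider_closed
    instead of relying on __del__ (hypothetical variant, assuming the same
    SELENIUM_TIMEOUT setting as above)."""

    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_page_load_timeout(self.timeout)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))
        # Call spider_closed() when the spider finishes, so the browser always exits cleanly.
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self):
        self.browser.quit()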
1.4 Parsing the Page
The Response object is passed back to the callback defined in the Spider, so the next step is to implement that callback and parse the page.
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        lis = soup.find_all(name='li', class_="gl-item")
        for li in lis:
            proc_dict = {}
            dp = li.find(name='span', class_="J_im_icon")
            if dp:
                proc_dict['dp'] = dp.get_text().strip()
            else:
                continue
            id = li.attrs['data-sku']
            title = li.find(name='div', class_="p-name p-name-type-2")
            proc_dict['title'] = title.get_text().strip()
            price = li.find(name='strong', class_="J_" + id)
            proc_dict['price'] = price.get_text()
            comment = li.find(name='a', id="J_comment_" + id)
            proc_dict['comment'] = comment.get_text() + '條評論'
            url = 'https://item.jd.com/' + id + '.html'
            proc_dict['url'] = url
            proc_dict['type'] = 'JINGDONG'
            yield proc_dict
Here we parse with BeautifulSoup: we match all product entries, iterate over the results, and pick out each product's fields in turn.
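To show what shape of markup the selectors above expect, here is a standalone sketch that runs the same extraction against a hypothetical, heavily simplified li.gl-item; the markup and values are made up for illustration and do not reflect real JD pages.

from bs4 import BeautifulSoup

# Hypothetical markup in the shape parse() expects (values are illustrative only).
html = '''
<ul>
  <li class="gl-item" data-sku="10000000">
    <span class="J_im_icon">Example Store</span>
    <div class="p-name p-name-type-2">Example iPad 64GB</div>
    <strong class="J_10000000">3999.00</strong>
    <a id="J_comment_10000000">1000+</a>
  </li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
li = soup.find(name='li', class_='gl-item')
sku = li.attrs['data-sku']
item = {
    'dp': li.find(name='span', class_='J_im_icon').get_text().strip(),
    'title': li.find(name='div', class_='p-name p-name-type-2').get_text().strip(),
    'price': li.find(name='strong', class_='J_' + sku).get_text(),
    'comment': li.find(name='a', id='J_comment_' + sku).get_text(),
    'url': 'https://item.jd.com/' + sku + '.html',
    'type': 'JINGDONG',
}
print(item)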
1.5 Storing the Results
Once the page data has been extracted, items are sent to the item pipeline for processing, cleaning, and storage, so we need to define an item pipeline. Here we store the data in a MongoDB database.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    # from_crawler is a class method, marked with @classmethod. It is a form of
    # dependency injection: its argument is the crawler, through which every entry
    # of the global configuration in settings.py can be read.
    # Its purpose here is to pull the MongoDB settings out of settings.py.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
1.6 Configuring settings.py
Configure the settings the project relies on in settings.py. This project uses KEYWORDS, MAX_PAGE, SELENIUM_TIMEOUT (the page load timeout), MONGO_URL, MONGO_DB, and COLLECTION.
KEYWORDS = ['iPad']
MAX_PAGE = 2
MONGO_URL = 'localhost'
MONGO_DB = 'test'
COLLECTION = 'ProductItem'
SELENIUM_TIMEOUT = 30
Also update the configuration to enable the downloader middleware and the item pipeline.
DOWNLOADER_MIDDLEWARES = {
    'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}
ITEM_PIPELINES = {
    'scrapyseleniumtest.pipelines.MongoPipeline': 300,
}
1.7 Execution Results
With all the code and configuration in place, run the project.
scrapy crawl jd
After the run, check the data in MongoDB: the crawl completed successfully and the items have been stored.
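A quick way to verify the stored data, assuming the MongoDB settings above ('localhost', 'test', 'ProductItem'), is a small pymongo check like this sketch:

import pymongo

# Verification sketch: count the stored products and print one sample document.
client = pymongo.MongoClient('localhost')
collection = client['test']['ProductItem']
print(collection.count_documents({}))
print(collection.find_one())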
1.8 Complete Code
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


class ProductItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # dp = Field()
    # title = Field()
    # price = Field()
    # comment = Field()
    # url = Field()
    # type = Field()
    pass
jd.py:
# -*- coding: utf-8 -*-
from scrapy import Request, Spider
from urllib.parse import quote
from bs4 import BeautifulSoup


class JdSpider(Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    base_url = 'https://search.jd.com/Search?keyword='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                # dont_filter=True: do not deduplicate; every page uses the same URL
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        lis = soup.find_all(name='li', class_="gl-item")
        for li in lis:
            proc_dict = {}
            dp = li.find(name='span', class_="J_im_icon")
            if dp:
                proc_dict['dp'] = dp.get_text().strip()
            else:
                continue
            id = li.attrs['data-sku']
            title = li.find(name='div', class_="p-name p-name-type-2")
            proc_dict['title'] = title.get_text().strip()
            price = li.find(name='strong', class_="J_" + id)
            proc_dict['price'] = price.get_text()
            comment = li.find(name='a', id="J_comment_" + id)
            proc_dict['comment'] = comment.get_text() + '條評論'
            url = 'https://item.jd.com/' + id + '.html'
            proc_dict['url'] = url
            proc_dict['type'] = 'JINGDONG'
            yield proc_dict
middlewares.py:
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import urlencode
from scrapy.http import HtmlResponse
from logging import getLogger
from selenium.common.exceptions import TimeoutException
import time


class ScrapyseleniumtestSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class SeleniumMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def process_request(self, request, spider):
        '''
        Drive Selenium in this downloader middleware: render the page, grab its
        source, build an HtmlResponse object and return it directly to the Spider
        for parsing and data extraction. The downloader never downloads the page itself.
        :param request:
        :param spider:
        :return:
        '''
        self.logger.debug('Chrome is starting')
        page = request.meta.get('page', 1)
        self.wait = WebDriverWait(self.browser, self.timeout)
        try:
            self.browser.get(request.url)
            if page > 1:
                # Locate the "jump to page" input box; presence_of_element_located
                # waits until it has been loaded into the DOM.
                input = self.wait.until(EC.presence_of_element_located(
                    (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
                # Locate the confirm button; element_to_be_clickable waits until it can be clicked.
                submit = self.wait.until(EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a')))
                input.clear()
                input.send_keys(page)
                submit.click()  # click the button to jump to the requested page
                time.sleep(5)
                # text_to_be_present_in_element waits until the pager highlights the requested page number.
                self.wait.until(EC.text_to_be_present_in_element(
                    (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page)))
            # Wait for the product list #J_goodsList to be rendered before returning the page source.
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    # from_crawler is a class method, marked with @classmethod. It is a form of
    # dependency injection: its argument is the crawler, through which every entry
    # of the global configuration in settings.py can be read.
    # Its purpose here is to pull the MongoDB settings out of settings.py.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for scrapyseleniumtest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyseleniumtest'

SPIDER_MODULES = ['scrapyseleniumtest.spiders']
NEWSPIDER_MODULE = 'scrapyseleniumtest.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapyseleniumtest (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapyseleniumtest.middlewares.ScrapyseleniumtestSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapyseleniumtest.middlewares.ScrapyseleniumtestDownloaderMiddleware': 543,
#}
DOWNLOADER_MIDDLEWARES = {
    'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapyseleniumtest.pipelines.ScrapyseleniumtestPipeline': 300,
#}
ITEM_PIPELINES = {
    'scrapyseleniumtest.pipelines.MongoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

KEYWORDS = ['iPad']
MAX_PAGE = 2

MONGO_URL = 'localhost'
MONGO_DB = 'test'
COLLECTION = 'ProductItem'

SELENIUM_TIMEOUT = 30