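Integrating Scrapy with Pyppeteer
1. Direct integration with Pyppeteer
1. Pyppeteer runs asynchronously on top of asyncio.
2. Scrapy has supported asyncio since version 2.0: an asyncio Future object is converted into a Twisted Deferred object.
How to convert a Future object into a Deferred object in Scrapy: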
import asyncio
from twisted.internet.defer import Deferred

def as_deferred(f):
    # Wrap the coroutine in a Future, then convert it into a Twisted Deferred
    return Deferred.fromFuture(asyncio.ensure_future(f))
# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
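To see what `asyncio.ensure_future` contributes before Twisted gets involved, here is a minimal standard-library sketch. It only shows that a coroutine is wrapped into a Task, which is a Future, the kind of object `Deferred.fromFuture` expects. `fetch_title` is a hypothetical stand-in for a Pyppeteer call, and Twisted is deliberately left out so the snippet runs on its own.

```python
import asyncio


async def fetch_title():
    # Hypothetical stand-in for an awaitable Pyppeteer call such as page.content().
    await asyncio.sleep(0)
    return "example title"


async def main():
    # ensure_future schedules the coroutine on the running loop and returns a Task.
    future = asyncio.ensure_future(fetch_title())
    # Task subclasses Future, so it is the kind of object Deferred.fromFuture accepts.
    is_future = isinstance(future, asyncio.Future)
    result = await future
    return is_future, result


is_future, result = asyncio.run(main())
print(is_future, result)
```

In the middleware, the same Future is handed to `Deferred.fromFuture` instead of being awaited directly, which is what lets Scrapy's Twisted-based engine wait for it.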
Integration implementation
# middlewares.py
from pyppeteer import launch
from scrapy.http import HtmlResponse
import asyncio
import logging
from twisted.internet.defer import Deferred

# Raise the log level so the console is not flooded with log output
logging.getLogger('websocket').setLevel('INFO')
logging.getLogger('pyppeteer').setLevel('INFO')

def as_deferred(f):
    # Convert a coroutine into a Twisted Deferred via an asyncio Future
    return Deferred.fromFuture(asyncio.ensure_future(f))
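class PyppeteerMiddleware(object):

    async def _process_request(self, request, spider):
        # Launch a browser and render the requested URL
        browser = await launch(headless=False)
        page = await browser.newPage()
        pyppeteer_response = await page.goto(request.url)
        # Fixed wait so that JavaScript-rendered content can finish loading
        await asyncio.sleep(5)
        html = await page.content()
        # The body is already decoded, so drop the encoding header
        pyppeteer_response.headers.pop('content-encoding', None)
        pyppeteer_response.headers.pop('Content-Encoding', None)
        response = HtmlResponse(
            page.url,
            status=pyppeteer_response.status,
            headers=pyppeteer_response.headers,
            body=str.encode(html),
            encoding='utf-8',
            request=request
        )
        # Close the browser so that each request does not leak a Chromium process
        await browser.close()
        return response

    def process_request(self, request, spider):
        # Scrapy expects a Deferred here, so convert the coroutine
        return as_deferred(self._process_request(request, spider))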
# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'pro.middlewares.PyppeteerMiddleware': 543,
}
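Problems with the direct integration
1. Pyppeteer cannot be configured through settings, for example the headless option.
2. No exception handling is implemented, for example for TimeoutError or PageError.
3. No wait condition is specified for page loading; there is only a fixed 5-second sleep.
4. Features such as setting cookies, executing JavaScript, and taking screenshots are missing.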
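Problems 2 and 3 are related: a fixed `asyncio.sleep(5)` neither bounds how long a page may take nor reports a failure. A minimal standard-library sketch of the alternative is shown here, where `asyncio.wait_for` bounds an awaitable and raises a timeout error the caller can handle. `load_page` and the delay values are hypothetical stand-ins for `page.goto`; real Pyppeteer calls raise their own error classes from `pyppeteer.errors`, which would need to be caught as well.

```python
import asyncio


async def load_page(delay):
    # Hypothetical stand-in for `await page.goto(url)`; sleeps instead of loading.
    await asyncio.sleep(delay)
    return "<html>ok</html>"


async def fetch(delay, timeout):
    try:
        # Bound the load time instead of sleeping for a fixed interval.
        return await asyncio.wait_for(load_page(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # Report the failure to the caller instead of hanging or crashing.
        return None


fast = asyncio.run(fetch(delay=0.01, timeout=1.0))
slow = asyncio.run(fetch(delay=1.0, timeout=0.05))
print(fast, slow)
```

Returning `None` is only one possible design; a downloader middleware might instead re-raise or schedule a retry.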
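Measures for improving the integration
1. Pyppeteer can be initialized and configured through settings.py and the Request object.
2. Implement a retry mechanism for exceptions.
3. Wait for a specific node to appear while the page loads.
4. Cookies, JavaScript execution, screenshots, proxies, and similar features can be set through configuration.
5. Requests can be sent through a PyppeteerRequest object that accepts multiple extension parameters.
6. Add WebDriver anti-detection.
7. Set the Twisted reactor automatically, so TWISTED_REACTOR no longer has to be declared in settings.py.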
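Measure 2, retrying on exceptions, can be sketched independently of Scrapy. The helper below is an illustration only and is not how GerapyPyppeteer implements it: `retry_async`, `flaky_render`, and the retry count are hypothetical names and values.

```python
import asyncio


async def retry_async(make_coro, retries=3, delay=0.0):
    """Await make_coro() up to `retries` times, re-raising the last error."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return await make_coro()
        except Exception as exc:  # narrow this to the rendering errors you expect
            last_error = exc
            if attempt < retries:
                await asyncio.sleep(delay)
    raise last_error


attempts = []


async def flaky_render():
    # Hypothetical renderer that fails twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("page did not load in time")
    return "<html>rendered</html>"


html = asyncio.run(retry_async(flaky_render, retries=3))
print(html, len(attempts))
```

A factory (`make_coro`) is passed rather than a coroutine object because a coroutine can only be awaited once, so every attempt needs a fresh one.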
2. Optimized integration with Pyppeteer
This is implemented with the GerapyPyppeteer package.
pip install gerapy-pyppeteer
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
CONCURRENT_REQUESTS = 3
GERAPY_PYPPETEER_HEADLESS = False  # defaults to True
# WebDriver anti-detection mode (False turns it off)
GERAPY_PYPPETEER_PRETEND = False
# Download timeout
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
# Browser window size
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700
# Other launch arguments
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSION = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True
# Resource types to skip while loading; valid values:
# document/stylesheet/script/image/media/font/texttrack/xhr/fetch/eventsource/websocket/manifest/other
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['image', 'font']
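Because the ignore list is made of plain strings, a misspelled type is easy to miss. The following is a small sanity check, assuming the valid names are exactly those listed in the settings comment above. `unknown_resource_types` is a hypothetical helper and is not part of GerapyPyppeteer.

```python
# Resource type names copied from the settings comment above.
KNOWN_RESOURCE_TYPES = frozenset(
    "document stylesheet script image media font texttrack xhr fetch "
    "eventsource websocket manifest other".split()
)


def unknown_resource_types(configured):
    """Return the configured names that are not known resource types."""
    return [name for name in configured if name not in KNOWN_RESOURCE_TYPES]


print(unknown_resource_types(["image", "font"]))   # no unknown names
print(unknown_resource_types(["image", "fonts"]))  # "fonts" is flagged
```

Running such a check against the configured list before starting a crawl would surface a typo that might otherwise be ignored silently.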
Usage
from gerapy_pyppeteer import PyppeteerRequest

# Inside a spider method: wait until the '.item .name' nodes appear
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name')
# Screenshot parameter: screenshot={'type': 'png', 'fullPage': True}