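Integrating Scrapy with Pyppeteer
1. Direct integration with Pyppeteer
1. Pyppeteer has to run asynchronously on top of asyncio
2. Scrapy has supported asyncio since version 2.0, by converting Future objects into Twisted Deferred objects
How a Future is converted into a Deferred in Scrapy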
import asyncio
from twisted.internet.defer import Deferred

def as_deferred(f):
    # Wrap the coroutine in an asyncio Future, then expose it as a Twisted Deferred
    return Deferred.fromFuture(asyncio.ensure_future(f))
# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
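The asyncio half of that conversion can be tried without Twisted. The following is a minimal sketch, assuming a stand-in coroutine named fetch_html (hypothetical, not part of Pyppeteer or Scrapy): asyncio.ensure_future turns the coroutine into a Future, which is the kind of object Deferred.fromFuture accepts.

```python
import asyncio


async def fetch_html():
    # Hypothetical stand-in for an awaitable Pyppeteer call such as page.content()
    await asyncio.sleep(0.01)
    return '<html>demo</html>'


def run_demo():
    loop = asyncio.new_event_loop()
    try:
        # ensure_future schedules the coroutine and returns a Task, a Future subclass
        future = asyncio.ensure_future(fetch_html(), loop=loop)
        seen = []
        # A done-callback receives the finished Future
        future.add_done_callback(lambda f: seen.append(f.result()))
        loop.run_until_complete(future)
        # Give any callback that is still pending one more loop iteration
        loop.run_until_complete(asyncio.sleep(0))
        return isinstance(future, asyncio.Future), seen
    finally:
        loop.close()
```

Calling run_demo() returns whether the object is a Future together with the results collected by the callback. In the real integration the callback side is handled by Twisted, which is why the asyncio-based reactor has to be selected in settings.py.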
Implementation
# middlewares.py
import asyncio
import logging

from pyppeteer import launch
from scrapy.http import HtmlResponse
from twisted.internet.defer import Deferred

# Raise the log level so the console is not flooded with debug output
logging.getLogger('websockets').setLevel('INFO')
logging.getLogger('pyppeteer').setLevel('INFO')


def as_deferred(f):
    # Wrap the coroutine in an asyncio Future, then expose it as a Twisted Deferred
    return Deferred.fromFuture(asyncio.ensure_future(f))


class PyppeteerMiddleware(object):
    async def _process_request(self, request, spider):
        # A new browser is launched for every request: simple, but slow
        browser = await launch(headless=False)
        page = await browser.newPage()
        pyppeteer_response = await page.goto(request.url)
        # Fixed wait to give the page time to render
        await asyncio.sleep(5)
        html = await page.content()
        # The body is already decoded HTML, so drop the encoding header
        pyppeteer_response.headers.pop('content-encoding', None)
        pyppeteer_response.headers.pop('Content-Encoding', None)
        response = HtmlResponse(
            page.url,
            status=pyppeteer_response.status,
            headers=pyppeteer_response.headers,
            body=str.encode(html),
            encoding='utf-8',
            request=request
        )
        # Release the page and the browser once the HTML has been captured
        await page.close()
        await browser.close()
        return response

    def process_request(self, request, spider):
        return as_deferred(self._process_request(request, spider))
# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'pro.middlewares.PyppeteerMiddleware': 543,
}
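The middleware removes the content-encoding header before building the HtmlResponse. A plausible reason is that page.content() returns HTML that is already decoded, so a leftover "gzip" header could lead Scrapy's compression handling to try to decompress plain text. The helper below is a minimal sketch of the same cleanup done case-insensitively; strip_content_encoding is a hypothetical name, not a Scrapy or Pyppeteer function.

```python
def strip_content_encoding(headers):
    """Return a copy of headers without any Content-Encoding entry, whatever its casing."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != 'content-encoding'
    }
```

Because it compares lowercased names, one call covers both spellings that the middleware pops separately.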
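Problems with direct integration
1. Pyppeteer cannot be configured through settings, for example headless mode
2. Exceptions such as TimeoutError and PageError are not handled
3. No wait condition is specified for page loading
4. Features such as setting cookies, executing JavaScript, and taking screenshots are missing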
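Improvements
1. Pyppeteer can be initialized from settings.py and from the Request object
2. Failed requests are retried when an exception occurs
3. Loading can wait until a specific node appears
4. Cookies, JavaScript execution, screenshots, proxies and similar features can be set through configuration
5. Requests can be sent with a PyppeteerRequest object that accepts several extension parameters
6. WebDriver anti-detection is added
7. The Twisted reactor is set up automatically, so TWISTED_REACTOR no longer has to be declared in settings.py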
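The retry idea in the list above can be sketched with plain asyncio. This is only an illustration under stated assumptions: fetch_with_retry and flaky_fetch are hypothetical names, RuntimeError stands in for Pyppeteer's own error types, and the actual retry logic inside any given package may differ.

```python
import asyncio


async def fetch_with_retry(fetch, retries=3, timeout=1.0):
    """Call fetch(attempt) until it succeeds or the retries are used up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # wait_for cancels the call and raises a timeout error if it takes too long
            return await asyncio.wait_for(fetch(attempt), timeout=timeout)
        except (asyncio.TimeoutError, RuntimeError) as exc:
            last_error = exc
    raise last_error


async def flaky_fetch(attempt):
    # Hypothetical stand-in for page.goto(): fails twice, then succeeds
    if attempt < 3:
        raise RuntimeError(f'attempt {attempt} failed')
    return '<html>ok</html>'


def demo():
    return asyncio.run(fetch_with_retry(flaky_fetch, retries=3, timeout=0.5))
```

Bounding each attempt with a timeout and capping the number of attempts keeps one slow page from stalling the whole crawl.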
2. Optimized integration with Pyppeteer
Implemented with the GerapyPyppeteer package
pip install gerapy-pyppeteer
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
CONCURRENT_REQUESTS = 3
GERAPY_PYPPETEER_HEADLESS = False  # defaults to True
# WebDriver anti-detection mode; False turns it off
GERAPY_PYPPETEER_PRETEND = False
# Download timeout
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
# Browser window size
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700
# Other launch parameters
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True
# Resource types to skip while loading; valid values:
# document/stylesheet/script/image/media/font/texttrack/xhr/fetch/eventsource/websocket/manifest/other
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['image', 'font']
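A misspelled resource type would most likely match nothing and go unnoticed, so a small check against the list in the comment above can be useful. The sketch below is a hypothetical helper written for these notes; it is not part of gerapy-pyppeteer.

```python
# Resource types taken from the comment in settings.py
KNOWN_RESOURCE_TYPES = {
    'document', 'stylesheet', 'script', 'image', 'media', 'font', 'texttrack',
    'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', 'other',
}


def unknown_resource_types(configured):
    """Return the configured entries that are not known resource types, sorted."""
    return sorted(set(configured) - KNOWN_RESOURCE_TYPES)
```

An empty result means every configured entry is a recognized type; anything returned is worth rechecking for a typo.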
Usage
from gerapy_pyppeteer import PyppeteerRequest
# Inside a spider method; wait_for takes a CSS selector to wait for
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name')
# Screenshot parameter: screenshot={'type': 'png', 'fullPage': True}