Integrating Scrapy with Pyppeteer

1. Integrating Pyppeteer directly

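1. Pyppeteer runs asynchronously on top of asyncio.
2. Scrapy has supported asyncio since version 2.0; it bridges the two by converting asyncio Future objects into Twisted Deferred objects.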

Converting a Future into a Deferred in Scrapy

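import asyncio
from twisted.internet.defer import Deferred


def as_deferred(f):
    # Wrap the coroutine in a Future, then adapt it to a Twisted Deferred
    return Deferred.fromFuture(asyncio.ensure_future(f))


# settings.py
# Use the asyncio-based reactor so Twisted runs on the asyncio event loop
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'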
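
To make the Future side of this conversion concrete, here is a minimal standard-library sketch. It only shows that `asyncio.ensure_future` turns a coroutine into a `Task` (a `Future` subclass) whose completion can be observed through a done-callback, which is the kind of hook a bridge such as `Deferred.fromFuture` can attach to. The coroutine `fetch_title` and its return value are invented for illustration, and no Twisted reactor is involved.

```python
import asyncio


async def fetch_title(url):
    # Hypothetical coroutine standing in for a Pyppeteer call
    await asyncio.sleep(0)
    return f"title of {url}"


async def main():
    results = []
    # ensure_future schedules the coroutine and returns a Task (a Future)
    future = asyncio.ensure_future(fetch_title("https://example.com"))
    # The done-callback is called once the result is available
    future.add_done_callback(lambda fut: results.append(fut.result()))
    await future
    # Yield once more so any pending callbacks have run before returning
    await asyncio.sleep(0)
    return isinstance(future, asyncio.Future), results


is_future, results = asyncio.run(main())
```

In the real middleware the callback is not written by hand; `as_deferred` hands the Future to Twisted, and Scrapy waits on the resulting Deferred.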

Implementation

# middlewares.py
from pyppeteer import launch
from scrapy.http import HtmlResponse
import asyncio
import logging
from twisted.internet.defer import Deferred


# Raise the log level so the console is not flooded with debug output
logging.getLogger('websockets').setLevel('INFO')
logging.getLogger('pyppeteer').setLevel('INFO')


def as_deferred(f):
    return Deferred.fromFuture(asyncio.ensure_future(f))


class PyppeteerMiddleware(object):
    async def _process_request(self, request, spider):
        # Launch a browser for this request and render the page
        browser = await launch(headless=False)
        page = await browser.newPage()
        pyppeteer_response = await page.goto(request.url)
        # Fixed wait to give the page time to finish rendering
        await asyncio.sleep(5)
        html = await page.content()
        # The body is already decoded, so drop the Content-Encoding header
        pyppeteer_response.headers.pop('content-encoding', None)
        pyppeteer_response.headers.pop('Content-Encoding', None)
        response = HtmlResponse(
            page.url,
            status=pyppeteer_response.status,
            headers=pyppeteer_response.headers,
            body=str.encode(html),
            encoding='utf-8',
            request=request
        )
        # Release the page and browser so Chromium processes do not pile up
        await page.close()
        await browser.close()
        return response

    def process_request(self, request, spider):
        return as_deferred(self._process_request(request, spider))

# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'pro.middlewares.PyppeteerMiddleware': 543,
}

Problems with the direct integration

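1. Pyppeteer cannot be configured through settings, for example the headless option.
2. Exceptions such as TimeoutError and PageError are not handled.
3. No page-load wait time can be specified; the code uses a fixed 5-second sleep.
4. Features such as setting cookies, executing JavaScript, and taking screenshots are missing.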
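
Problems 2 and 3 in the list above can be illustrated with a small standard-library sketch: each attempt is bounded by a timeout via `asyncio.wait_for`, and failed attempts are retried. The helper `with_retry`, the coroutine `flaky_render`, and the use of `RuntimeError` as a stand-in for a browser-level error are all hypothetical; this is not how any particular library implements retries.

```python
import asyncio


async def with_retry(make_coro, retries=3, timeout=1.0):
    """Await make_coro() up to `retries` times, bounding each attempt by `timeout`."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return await asyncio.wait_for(make_coro(), timeout=timeout)
        except (asyncio.TimeoutError, RuntimeError) as exc:
            # RuntimeError stands in for a browser-level error in this sketch
            last_error = exc
    # Every attempt failed: re-raise the last error seen
    raise last_error


attempts = []


async def flaky_render():
    # Hypothetical page load that fails twice, then succeeds
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("page crashed")
    return "<html></html>"


html = asyncio.run(with_retry(flaky_render, retries=3, timeout=0.5))
```

Passing a factory (`make_coro`) rather than a coroutine object matters here, because a coroutine can be awaited only once and each retry needs a fresh one.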

Improvements

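1. Pyppeteer can be configured through settings.py and through the Request object.
2. Failed requests are retried.
3. Loading can wait until a specific node appears.
4. Cookies, JavaScript execution, screenshots, proxies, and similar features can be set through configuration.
5. Requests are sent through a PyppeteerRequest object, which accepts several extension parameters.
6. WebDriver anti-detection is added.
7. The Twisted reactor is set up automatically, so TWISTED_REACTOR no longer needs to be declared in settings.py.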
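
The first point, configuration through both project settings and the individual request, amounts to a layered merge in which later sources win. The following is a minimal sketch of that idea only; the option names in `DEFAULTS` and the function `resolve_options` are assumptions made for illustration and do not mirror any library's internals.

```python
# Hypothetical option names and defaults, used only for this sketch
DEFAULTS = {"headless": True, "timeout": 30, "wait_for": None}


def resolve_options(settings, request_meta):
    """Merge defaults, project settings, and per-request overrides (later wins)."""
    options = dict(DEFAULTS)
    for source in (settings, request_meta):
        for key, value in source.items():
            if key not in DEFAULTS:
                # Reject misspelled keys instead of silently ignoring them
                raise KeyError(f"unknown option: {key}")
            options[key] = value
    return options


project_settings = {"headless": False}
per_request = {"wait_for": ".item", "timeout": 10}
options = resolve_options(project_settings, per_request)
```

Rejecting unknown keys is a design choice worth considering for this kind of configuration, since a mistyped setting name otherwise fails silently and the default stays in effect.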

2. Improved integration with Pyppeteer

These improvements are available in the GerapyPyppeteer package.

pip install gerapy-pyppeteer

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
CONCURRENT_REQUESTS = 3
GERAPY_PYPPETEER_HEADLESS = False  # defaults to True
# WebDriver anti-detection mode (enabled by default); False turns it off
GERAPY_PYPPETEER_PRETEND = False
# Download timeout in seconds
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
# Browser window size
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700

# Other launch arguments
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True

# Resource types to skip while loading; valid values:
# document/stylesheet/script/image/media/font/texttrack/xhr/fetch/eventsource/websocket/manifest/other
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['image', 'font']
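
Because the ignore list is matched against a fixed set of resource type names, a misspelled entry such as 'images' would presumably never match anything. The sketch below validates a configured list against the names given in the comment above and shows the abort decision as a plain membership test. The functions `validate_ignored_types` and `should_abort` are hypothetical helpers, not part of GerapyPyppeteer.

```python
# Resource type names taken from the settings comment above
RESOURCE_TYPES = frozenset({
    "document", "stylesheet", "script", "image", "media", "font",
    "texttrack", "xhr", "fetch", "eventsource", "websocket",
    "manifest", "other",
})


def validate_ignored_types(ignored):
    """Return the ignore list as a frozenset, rejecting unknown type names."""
    unknown = sorted(set(ignored) - RESOURCE_TYPES)
    if unknown:
        raise ValueError(f"unknown resource types: {unknown}")
    return frozenset(ignored)


def should_abort(resource_type, ignored):
    """Decide whether a request for this resource type should be skipped."""
    return resource_type in ignored


ignored = validate_ignored_types(["image", "font"])
skip_image = should_abort("image", ignored)
skip_script = should_abort("script", ignored)
```

With this list, image and font requests would be skipped while scripts still load, which keeps JavaScript-rendered content intact.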

Usage

from gerapy_pyppeteer import PyppeteerRequest

# Inside a spider method: wait until a node matching '.item .name' appears
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name')
# Screenshot option: screenshot={'type': 'png', 'fullPage': True}

