Integrating Scrapy with Pyppeteer

1. Integrating Pyppeteer directly

1. Pyppeteer runs asynchronously on top of asyncio.
2. Scrapy has supported asyncio since version 2.0, by converting asyncio Future objects into Twisted Deferred objects.

How a Future object is converted into a Deferred object in Scrapy

import asyncio
from twisted.internet.defer import Deferred


def as_deferred(f):
    # Wrap a coroutine or Future in a Twisted Deferred so Scrapy can wait on it
    return Deferred.fromFuture(asyncio.ensure_future(f))


# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
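
To see what `asyncio.ensure_future` contributes to the helper above, the following stdlib-only sketch wraps a coroutine in a Future and attaches a completion callback, which is the kind of hook a Future-to-Deferred bridge can build on. The coroutine `fetch_title` and its return value are placeholders invented for illustration; neither Twisted nor Scrapy is involved here.

```python
import asyncio


async def fetch_title():
    # Hypothetical stand-in for an async browser call.
    await asyncio.sleep(0.01)
    return "example title"


async def main():
    results = []
    # ensure_future schedules the coroutine on the running loop and returns
    # a Task, which is a Future subclass.
    future = asyncio.ensure_future(fetch_title())
    # A done-callback fires once the Future holds a result or an exception.
    future.add_done_callback(lambda fut: results.append(fut.result()))
    await future
    # Done-callbacks are scheduled on the loop, so yield once to be safe.
    await asyncio.sleep(0)
    return isinstance(future, asyncio.Future), results


is_future, collected = asyncio.run(main())
print(is_future, collected)
```

The point of the sketch is only that a coroutine becomes an object with a completion callback; the adapter then needs to forward that result or exception to the other framework.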

Implementation

# middlewares.py
import asyncio
import logging

from pyppeteer import launch
from scrapy.http import HtmlResponse
from twisted.internet.defer import Deferred

# Raise the log level so the console is not flooded with debug output
logging.getLogger('websockets').setLevel('INFO')
logging.getLogger('pyppeteer').setLevel('INFO')


def as_deferred(f):
    return Deferred.fromFuture(asyncio.ensure_future(f))


class PyppeteerMiddleware(object):
    async def _process_request(self, request, spider):
        # Launch a browser for this request and render the page
        browser = await launch(headless=False)
        page = await browser.newPage()
        pyppeteer_response = await page.goto(request.url)
        # Fixed wait to give JavaScript time to render
        await asyncio.sleep(5)
        html = await page.content()
        # The body is already decoded, so drop any content-encoding header
        pyppeteer_response.headers.pop('content-encoding', None)
        pyppeteer_response.headers.pop('Content-Encoding', None)
        response = HtmlResponse(
            page.url,
            status=pyppeteer_response.status,
            headers=pyppeteer_response.headers,
            body=str.encode(html),
            encoding='utf-8',
            request=request
        )
        # Release the page and browser so processes do not pile up
        await page.close()
        await browser.close()
        return response

    def process_request(self, request, spider):
        return as_deferred(self._process_request(request, spider))

# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'pro.middlewares.PyppeteerMiddleware': 543,
}
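
The two `pop` calls in the middleware drop the content-encoding header because `page.content()` returns HTML that is already decoded; leaving a `gzip` header in place would presumably lead Scrapy to try decompressing the body again. If header casing cannot be relied on, a case-insensitive filter may be more robust. The helper name `strip_encoding_headers` is hypothetical, and the sketch works on plain dicts only.

```python
def strip_encoding_headers(headers):
    """Return a copy of headers without any Content-Encoding entry, whatever its casing."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "content-encoding"
    }


raw_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Content-Encoding": "gzip",
    "content-encoding": "br",
}
clean_headers = strip_encoding_headers(raw_headers)
print(clean_headers)
```

Returning a copy rather than mutating the input keeps the browser's original response headers available for logging or debugging.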

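Problems with the direct integration

1. Pyppeteer cannot be configured through settings, for example the headless option.
2. There is no exception handling, for example for TimeoutError or PageError.
3. No page-load wait is specified; the code simply sleeps for a fixed time.
4. Features such as setting cookies, executing JavaScript, and taking screenshots are missing.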
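
Point 3 refers to the fixed `asyncio.sleep(5)` in the middleware, which always waits five seconds whether or not the page is ready. One alternative is to poll for a condition with an upper bound on the wait. The sketch below uses only asyncio; `wait_until_ready` and the simulated readiness flag are invented for illustration. With a real page, Pyppeteer's `page.waitForSelector` would likely be the more natural tool.

```python
import asyncio


async def wait_until_ready(is_ready, timeout=5.0, interval=0.05):
    """Poll is_ready() until it is true; raise asyncio.TimeoutError after `timeout` seconds."""

    async def poll():
        while not is_ready():
            await asyncio.sleep(interval)

    await asyncio.wait_for(poll(), timeout=timeout)


async def demo():
    loop = asyncio.get_running_loop()
    state = {"loaded": False}
    # Simulate a page that finishes rendering after 0.1 seconds.
    loop.call_later(0.1, state.update, {"loaded": True})
    await wait_until_ready(lambda: state["loaded"], timeout=2.0)

    # A condition that never becomes true hits the timeout instead of hanging.
    try:
        await wait_until_ready(lambda: False, timeout=0.2)
    except asyncio.TimeoutError:
        return state["loaded"], "timed out"
    return state["loaded"], "no timeout"


demo_result = asyncio.run(demo())
print(demo_result)
```

Compared with a fixed sleep, this returns as soon as the condition holds and fails loudly when it never does.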

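Improvements

1. Pyppeteer can be configured at initialization through settings.py and the Request object.
2. A retry mechanism handles exceptions.
3. Loading can wait until a specified node appears.
4. Cookies, JavaScript execution, screenshots, proxies, and similar features can be set through configuration.
5. Requests can be sent through a PyppeteerRequest object that accepts several extension parameters.
6. WebDriver anti-detection is added.
7. The Twisted reactor is set up automatically, so TWISTED_REACTOR no longer has to be declared in settings.py.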
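
Point 2 above, retrying on failure, can be illustrated independently of any browser. The following is a minimal sketch and not GerapyPyppeteer's actual implementation: `fetch_with_retry`, `flaky_fetch`, the retry count, and the exception types caught are all assumptions made for the example.

```python
import asyncio


async def fetch_with_retry(fetch, url, retries=3, timeout=1.0):
    """Call fetch(url) up to `retries` times, bounding each attempt by `timeout` seconds."""
    last_error = None
    for _ in range(retries):
        try:
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Remember the failure and try again until attempts run out.
            last_error = exc
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


attempts = []


async def flaky_fetch(url):
    # Hypothetical page loader that fails twice before succeeding.
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("simulated navigation failure")
    return f"<html>{url}</html>"


html = asyncio.run(fetch_with_retry(flaky_fetch, "https://example.com"))
print(len(attempts), html)
```

In a real middleware the caught exceptions would be the browser library's timeout and page errors, and the retry count would come from configuration.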

2. An improved integration with Pyppeteer

This is done with the GerapyPyppeteer package.

pip install gerapy-pyppeteer

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}
CONCURRENT_REQUESTS = 3
GERAPY_PYPPETEER_HEADLESS = False  # defaults to True
# WebDriver anti-detection mode (set to True to enable)
GERAPY_PYPPETEER_PRETEND = False
# Timeout
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
# Browser window size
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700

# Other launch options
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True

# Resource types whose loading should be skipped
# document/stylesheet/script/image/media/font/texttrack/xhr/fetch/eventsource/websocket/manifest/other
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['image', 'font']
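
Because a misspelled resource type is unlikely to match anything, it can help to check the configured list against the names in the comment above before starting a crawl. This is a stdlib-only sketch; `invalid_resource_types` is a hypothetical helper and not part of GerapyPyppeteer.

```python
VALID_RESOURCE_TYPES = {
    "document", "stylesheet", "script", "image", "media", "font", "texttrack",
    "xhr", "fetch", "eventsource", "websocket", "manifest", "other",
}


def invalid_resource_types(configured):
    """Return the configured entries that are not known resource type names."""
    return [name for name in configured if name not in VALID_RESOURCE_TYPES]


print(invalid_resource_types(["image", "font"]))
print(invalid_resource_types(["images", "font"]))
```

The second call reports the plural `images`, which is not one of the listed type names.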

Usage

from gerapy_pyppeteer import PyppeteerRequest

# wait_for takes a CSS selector; the request waits until a matching node appears
yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name')
# Screenshot parameter: screenshot={'type': 'png', 'fullPage': True}

