Python Web Scraping: Integrating Scrapy with Selenium (a BOSS直聘 Case Study)


Overview

This article covers the Scrapy architecture diagram, its components, and its workflow, and then walks through a case study: a BOSS直聘 job-listings crawler that integrates Selenium.

Architecture Diagram

(Diagram omitted: the standard Scrapy architecture figure, with the engine at the center exchanging requests and responses with the scheduler, the downloader via the downloader middlewares, the spiders via the spider middlewares, and the item pipelines.)

Components

Scrapy Engine

The engine controls the data flow between all components of the system and triggers events when certain actions occur.

Scheduler

The scheduler receives requests from the engine and enqueues them, feeding them back to the engine later when the engine asks for the next request.

Downloader

The downloader fetches web pages and hands them to the engine, which in turn passes them on to the spiders.

Spiders

Spiders are classes written by Scrapy users to parse responses and extract items (the scraped data) or additional URLs to follow. Each spider handles one specific site (or a small set of sites); the crawlers built with Scrapy in the earlier articles of this series were all implemented in this component. See the Spiders documentation for details.
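
To make this concrete, here is a minimal spider sketch; the site and selectors are Scrapy's public demo (quotes.toscrape.com), not part of this article's case study:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name used to launch the spider
    start_urls = ["https://quotes.toscrape.com/"]  # Scrapy's demo site

    def parse(self, response):
        # Called with each downloaded response; yield items or more requests
        for text in response.css("div.quote span.text::text").getall():
            yield {"text": text}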

Downloader Middlewares

Downloader middlewares are specific hooks that sit between the engine and the downloader, processing requests on their way to the downloader and responses on their way back to the engine. They provide a simple mechanism for extending Scrapy by plugging in custom code. See the Downloader Middleware documentation for details.

Spider Middlewares

Spider middlewares are specific hooks that sit between the engine and the spiders, processing spider input (responses) and output (items and requests). They provide a simple mechanism for extending Scrapy by plugging in custom code.

Item Pipeline

The item pipeline processes items once the spiders have extracted them. Typical tasks are cleansing, validation, and persistence (for example, storing items in a database). See the Item Pipeline documentation for details.
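
For example, a validation pipeline might drop items that lack a required field. A minimal sketch (the field name 'position' is borrowed from the case study below purely for illustration):

from scrapy.exceptions import DropItem


class ValidationPipeline(object):
    """Drop items missing a required field; pass the rest along."""

    def process_item(self, item, spider):
        if not item.get('position'):
            raise DropItem("missing required field: position")
        return item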

Workflow

The data flow in Scrapy is controlled by the execution engine and proceeds as follows (a bare skeleton of the middleware hooks named in these steps appears after the list):

  1. The engine gets the initial requests to crawl from the spider.
  2. The engine schedules those requests in the scheduler and asks it for the next request to crawl.
  3. The scheduler returns the next request to the engine.
  4. The engine sends the request to the downloader through the downloader middlewares; each middleware's process_request() is called on the way.
  5. Once the page finishes downloading, the downloader generates a Response for it and sends it back to the engine through the downloader middlewares; each middleware's process_response() is called on the way.
  6. The engine receives the Response from the downloader and sends it to the spider for processing through the spider middlewares; process_spider_input() is called on the way.
  7. The spider processes the Response and returns the scraped items plus any new (follow-up) Requests to the engine through the spider middlewares; process_spider_output() is called on the way.
  8. The engine sends the scraped items to the item pipelines and the returned Requests to the scheduler, then asks the scheduler for possible next requests to crawl.
  9. The process repeats (from step 2) until there are no more requests in the scheduler.
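
For orientation, the hooks named above have the following shape; this is only a bare skeleton, and the real implementations appear in the case study below:

class SomeDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Step 4: runs as each request heads to the downloader.
        # Return None to continue normally, or a Response to skip the download.
        return None

    def process_response(self, request, response, spider):
        # Step 5: runs as each response heads back to the engine
        return response


class SomeSpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # Step 6: runs before the response reaches the spider; None means "continue"
        return None

    def process_spider_output(self, response, result, spider):
        # Step 7: runs over whatever the spider yielded (items and requests)
        return result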

Case Study: BOSS直聘

  • Define the Item: items.py
# -*- coding: utf-8 -*-

import scrapy


# items.py: subclass scrapy.Item
class Boss(scrapy.Item):
    """
    Fields to scrape and their types.
    """

    position = scrapy.Field(serializer=str)     # job title
    salary = scrapy.Field(serializer=str)       # salary
    addr = scrapy.Field(serializer=str)         # work location
    years = scrapy.Field(serializer=str)        # years of experience required
    education = scrapy.Field(serializer=str)    # education requirement
    company = scrapy.Field(serializer=str)      # hiring company
    industry = scrapy.Field(serializer=str)     # industry
    nature = scrapy.Field(serializer=str)       # company type: listed or not
    scale = scrapy.Field(serializer=str)        # company size: headcount
    publisher = scrapy.Field(serializer=str)    # recruiter
    publisherPosition = scrapy.Field(serializer=str)        # recruiter's job title
    publishDateDesc = scrapy.Field(serializer=str)          # publish date
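
Item objects behave like dicts, which is how the spider below fills them in. A quick usage sketch (values are illustrative, and Boss is assumed importable from the items.py above):

boss = Boss()
boss['position'] = 'Big Data Engineer'  # subscript assignment, like a dict
print(boss['position'])                 # subscript access
print(dict(boss))                       # items convert cleanly to plain dicts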

  • Define the Scrapy spider: myspider.py
# -*- coding: utf-8 -*-
import scrapy

from spider.items import Boss


class BossSpider(scrapy.Spider):
    name = "boss"

    # restrict crawling to this domain
    allowed_domains = ["www.zhipin.com"]

    def start_requests(self):
        """
        Seed the crawl with the first URL, i.e. page 1 of the BOSS直聘 listings.
        """
        urls = [
            'https://www.zhipin.com/c101210100/h_101210100/?page=1&ka=page-1',
        ]

        # Every yield passes through the downloader middlewares, here
        # myDownloadMiddleware.SeleniumMiddleware, where Selenium fetches
        # the dynamically rendered job listings.
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """
        初始化Item:Boss
        :param response:
        :return:
        """
        boss = Boss()

        # 利用xpath篩選想要爬取的數據
        for box in response.xpath('//div[@class="job-primary"]'):
            boss['position'] = box.xpath('.//div[@class="job-title"]/text()').extract()[0]
            boss['salary'] = box.xpath('.//span[@class="red"]/text()').extract()[0]
            boss['addr'] = box.xpath('.//p[1]/text()').extract()[0]
            boss['years'] = box.xpath('.//p[1]/text()').extract()[1]
            boss['education'] = box.xpath('.//p[1]/text()').extract()[2]
            boss['company'] = box.xpath('.//div[@class="info-company"]//a/text()').extract()[0]
            boss['industry'] = box.xpath('.//p[1]//text()').extract()[3]
            boss['nature'] = box.xpath('.//p[1]//text()').extract()[4]
            boss['scale'] = box.xpath('.//p[1]//text()').extract()[5]
            boss['publisher'] = box.xpath('.//div[@class="info-publis"]//h3/text()').extract()[0]
            boss['publisherPosition'] = box.xpath('.//div[@class="info-publis"]//h3/text()').extract()[1]
            boss['publishDateDesc'] = box.xpath('.//div[@class="info-publis"]//p/text()').extract()[0]

            # Hand the Boss item to the spider middlewares for cleaning
            # (dropping empties, de-duplication) and similar work. Every yield
            # passes through the SpiderMiddleware, here mySpiderMiddleware.MyFirstSpiderMiddleware
            yield boss

        # pagination: follow the "next" page link
        url = response.xpath('//div[@class="page"]//a[@class="next"]/@href').extract()
        if url:
            page = 'https://www.zhipin.com' + url[0]
            yield scrapy.Request(page, callback=self.parse)
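
One caveat: extract()[0] raises IndexError whenever an XPath matches nothing, so a layout change aborts the whole item. Recent Scrapy versions offer .get() and .getall(), which degrade more gracefully; a sketch of the same lookups inside the loop above:

# .get() returns the first match or a default instead of raising
boss['position'] = box.xpath('.//div[@class="job-title"]/text()').get(default='')
# .getall() returns every matching text node as a list
details = box.xpath('.//p[1]//text()').getall()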

  • Define the downloader middleware (DownloaderMiddleware): myDownloadMiddleware.py
# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options


class SeleniumMiddleware(object):
    """
        下載器中間件
    """

    @classmethod
    def process_request(cls, request, spider):
        if spider.name == 'boss':
            if request.url == 'https://www.zhipin.com/c101210100/h_101210100/?page=1&ka=page-1':
                options = Options()
                options.add_argument('-headless')
                
                # geckodriver must be downloaded manually;
                # 'options=' replaces the 'firefox_options=' keyword deprecated since Selenium 3.8
                driver = Firefox(executable_path='/ddhome/bin/geckodriver', options=options)
                driver.get(request.url)

                searchText = driver.find_element_by_xpath('//div[@class="search-form-con"]//input[1]')
                searchText.send_keys("大數據研發工程師")  # plain str; the Python 2-only unicode() call is unnecessary on Python 3
                searchBtn = driver.find_element_by_xpath('//div[@class="search-form "]//button[@class="btn btn-search"]')
                searchBtn.click()

                html = driver.page_source
                driver.quit()

                # Build an HtmlResponse and return it: this short-circuits the
                # download, and the engine hands the response straight to the spider
                return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8')
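
The middleware above targets the Selenium 3.x API. Selenium 4 deprecates and later removes executable_path, firefox_options, and the find_element_by_* helpers, so under Selenium 4 the equivalent calls would look roughly like this (same driver path as above, assumed to fit your environment):

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

options = Options()
options.add_argument('-headless')
driver = Firefox(service=Service('/ddhome/bin/geckodriver'), options=options)
driver.get('https://www.zhipin.com/c101210100/h_101210100/?page=1&ka=page-1')
search_text = driver.find_element(By.XPATH, '//div[@class="search-form-con"]//input[1]')
search_text.send_keys('大數據研發工程師')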

  • Define the spider middleware (SpiderMiddleware): mySpiderMiddleware.py
# -*- coding: utf-8 -*-

import logging

logger = logging.getLogger(__name__)


class MyFirstSpiderMiddleware(object):

    @staticmethod
    def process_start_requests(start_requests, spider):
        """
        第一次發送請求前調用,之后不再調用
        :param start_requests:
        :param spider:
        :return:
        """
        logging.debug("#### 2222222 start_requests %s , spider %s ####" % (start_requests, spider))
        last_request = []
        for one_request in start_requests:
            logging.debug("#### one_request %s , spider %s ####" % (one_request, spider))
            last_request.append(one_request)
        logging.debug("#### last_request %s ####" % last_request)

        return last_request

    @staticmethod
    def process_spider_input(response, spider):
        logging.debug("#### 33333 response %s , spider %s ####" % (response, spider))
        return

    @staticmethod
    def process_spider_output(response, result, spider):
        logging.debug("#### 44444 response %s , result %s , spider %s ####" % (response, result, spider))
        return result
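
The spider's comments describe this middleware as the place for cleaning (dropping empties, de-duplicating), yet the class above only logs. A hypothetical sketch of what such cleaning could look like:

import scrapy


class CleaningSpiderMiddleware(object):
    """Hypothetical variant: drop empty and duplicate items."""

    def __init__(self):
        self.seen = set()

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, scrapy.Request):
                yield element            # follow-up requests pass through untouched
                continue
            key = (element.get('position'), element.get('company'))
            if element.get('position') and key not in self.seen:
                self.seen.add(key)       # remember (position, company) pairs
                yield element            # keep first occurrence, drop the rest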

  • Define the pipelines (Pipeline): pipelines.py
# -*- coding: utf-8 -*-

import json
import codecs
from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter was removed in modern Scrapy
from scrapy import signals
import os


class CSVPipeline(object):
    """
        導出CSV格式
    """
    def __init__(self):
        self.file = {}
        self.csvpath = os.path.dirname(__file__) + '/spiders/output'
        self.exporter = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        """
        當蜘蛛啟動時自動執行
        :param spider:
        :return:
        """
        f = open('%s/%s_items.csv' % (self.csvpath, spider.name), 'a')  # r只讀, w可寫, a追加
        self.file[spider] = f
        self.exporter = CsvItemExporter(f)
        self.exporter.fields_to_export = spider.settings['FIELDS_TO_EXPORT']
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        """
        Runs once for each item the spider yields.
        :param item:
        :param spider:
        :return:
        """
        self.exporter.export_item(item)
        return item

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        f = self.file.pop(spider)
        f.close()


class JSONPipeline(object):
    """
        Export scraped items as JSON lines.
    """
    def __init__(self):
        self.file = None
        self.jsonpath = os.path.dirname(__file__) + '/spiders/output'

    def open_spider(self, spider):
        # Open once per crawl rather than once per item
        self.file = codecs.open('%s/%s_items.json' % (self.jsonpath, spider.name), 'a', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item  # pass the item on so later pipelines can see it

    def close_spider(self, spider):
        # close_spider (not spider_closed) is the hook Scrapy calls on pipelines
        self.file.close()
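
With the item, spider, middlewares, and pipelines in place, the crawl is launched from the project root with the command "scrapy crawl boss" (the spider name defined in myspider.py). Note that both pipelines write into spiders/output and neither creates that directory, so it must exist before the first run.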

  • settings.py configuration


BOT_NAME = 'spider'

SPIDER_MODULES = ['spider.spiders']
NEWSPIDER_MODULE = 'spider.spiders'

FEED_EXPORT_ENCODING = 'utf-8'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'spider.middlewares.mySpiderMiddleware.MyFirstSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'spider.middlewares.myDownloadMiddleware.SeleniumMiddleware': 542,   # key: middleware class path, value: its order
    # 'spider.middlewares.myDownloadMiddleware.PhantomJSMiddleware': 543,  # referenced in the original but never defined; enable only if you implement it
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # None disables a built-in middleware
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'spider.pipelines.CSVPipeline': 300,
    'spider.pipelines.JSONPipeline': 301
}

FEED_EXPORTERS = {
    'csv': 'spider.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}

CSV_DELIMITER = ','

FIELDS_TO_EXPORT = [
    'position',
    'salary',
    'addr',
    'years',
    'education',
    'company',
    'industry',
    'nature',
    'scale',
    'publisher',
    'publisherPosition',
    'publishDateDesc'
]
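
The FEED_EXPORTERS entry points at a spider/spiders/csv_item_exporter.py that the original article never shows. A plausible sketch, assuming it only feeds CSV_DELIMITER and FIELDS_TO_EXPORT into the stock CsvItemExporter:

# csv_item_exporter.py -- a reconstruction under assumptions, not the author's file
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        settings = get_project_settings()
        kwargs.setdefault('delimiter', settings.get('CSV_DELIMITER', ','))
        kwargs.setdefault('fields_to_export', settings.get('FIELDS_TO_EXPORT'))
        super(MyProjectCsvItemExporter, self).__init__(*args, **kwargs)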
