Distributed Crawlers

This article is a repost. 2020-03-09 18:02, 老男孩Python

Theory
scrapy-redis architecture
Installing and using scrapy-redis
Distributed crawling example

Theory

Most of the crawlers we play with run on our own machine.

Earlier, to make crawlers more efficient, we covered multiprocessing.

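What does "distributed" mean?

When you build a website and want other people to visit it, you deploy it to a server.

As the number of users grows, a single server is no longer enough, so the site gets deployed to several servers.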

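This arrangement is usually called a cluster: the whole site, with all of its features, is deployed to several servers at once, and nginx is typically used for load balancing.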

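However, some features see very little concurrent traffic, back-office administration for example.

So people asked: why not split the site into separate pieces, so that each module is responsible for one specific function, such as a login module or a content management module?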

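Then, at deployment time, only the modules with heavy concurrent traffic need to go onto multiple servers. Coupling drops considerably, and low-traffic modules no longer waste so many resources.

Of course, the modules now have to communicate with each other, which means they must be coordinated. A message queue is the usual tool for this.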

This is what is meant by "distributed".
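
To make the message-queue idea concrete, here is a minimal single-process sketch built on Python's standard `queue` and `threading` modules. The module names (`login_module`, `content_module`) and the message format are invented for illustration; a real deployment would place a network message broker between separate servers rather than an in-process queue.

```python
import queue
import threading


def login_module(outbox: queue.Queue) -> None:
    """Hypothetical producer: announces logins instead of calling other modules directly."""
    for user in ("alice", "bob"):
        outbox.put({"event": "login", "user": user})
    outbox.put(None)  # sentinel: no more messages


def content_module(inbox: queue.Queue, handled: list) -> None:
    """Hypothetical consumer: reacts to messages without knowing who sent them."""
    while True:
        message = inbox.get()
        if message is None:
            break
        handled.append(f"prepared home page for {message['user']}")


def run_demo() -> list:
    bus = queue.Queue()  # the only thing the two modules share
    handled = []
    consumer = threading.Thread(target=content_module, args=(bus, handled))
    consumer.start()
    login_module(bus)
    consumer.join()
    return handled
```

Because the two functions only share the queue, either one could be moved to another machine (or run in several copies) once the in-process queue is swapped for a networked one.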

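When the amount of data is small, we usually just run the crawler on our own computer.

That is the so-called single-machine crawler.

A distributed crawler, put simply, deploys the crawler's key components across multiple machines in the distributed fashion just described, and the machines then crawl together.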

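So how do we connect the crawlers to each other? We can use a Redis message queue for scheduling.

We have talked about Redis before. It is an in-memory cache database with fast reads and writes, it offers data structures similar to Python's list and set, and it can persist in-memory data to disk, so it performs very well.

For the Scrapy framework there is a component, scrapy-redis, built specifically to schedule crawlers through Redis.

It puts the request URLs into a Redis queue; the spider module then extracts structured data from the responses and stores it in Redis.

Of course, a distributed crawler is sometimes also combined with a database cluster for the crawled data.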
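
The two Redis structures mentioned above, a list used as a queue and a set used to remember what has been seen, can be sketched without a running server. `InMemoryRedis` below is a stand-in written for this example; it only imitates the `lpush`, `rpop`, and `sadd` commands (redis-py exposes methods with the same names). The key names `demo:queue` and `demo:seen` are made up.

```python
from collections import deque


class InMemoryRedis:
    """Tiny stand-in for a Redis client, for illustration only."""

    def __init__(self):
        self._lists = {}
        self._sets = {}

    def lpush(self, key, value):
        # Push onto the left end of the list; return the new length
        self._lists.setdefault(key, deque()).appendleft(value)
        return len(self._lists[key])

    def rpop(self, key):
        # Pop from the right end, or return None when the list is empty
        items = self._lists.get(key)
        return items.pop() if items else None

    def sadd(self, key, member):
        # Return 1 if the member was newly added, 0 if it was already present
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def enqueue_once(client, url):
    """Queue a URL unless some worker has already queued it."""
    if client.sadd("demo:seen", url):
        client.lpush("demo:queue", url)
        return True
    return False


def crawl_order(urls):
    """Return the order in which a worker would receive the given URLs."""
    client = InMemoryRedis()
    for url in urls:
        enqueue_once(client, url)
    order = []
    while True:
        url = client.rpop("demo:queue")
        if url is None:
            break
        order.append(url)
    return order
```

Pushing on the left and popping on the right gives first-in, first-out behaviour. With a real Redis server every crawler process talks to the same list and set, which is what lets several machines share one crawl.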

scrapy-redis architecture

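• Scheduler

The scrapy-redis scheduler relies on the uniqueness of Redis set members to implement a duplication filter (the DupeFilter set stores the fingerprints of requests that have already been seen).
When a spider generates a new request, the request's fingerprint is checked against the DupeFilter set in Redis, and requests that are not duplicates are pushed onto the request queue in Redis.
Each time, the scheduler pops one request from the Redis request queue according to priority and sends it on for processing (it is downloaded, and the response is handed to the spider).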
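
The scheduler behaviour described above can be sketched in plain Python. This is not the scrapy-redis implementation: the fingerprint here is simplified to a SHA-1 over method and URL (an assumption; the real fingerprint covers more of the request), an in-process `set` and heap stand in for the Redis structures, and `SketchScheduler` is a made-up name. The method names mirror Scrapy's scheduler interface (`enqueue_request`, `next_request`).

```python
import hashlib
import heapq
import itertools


def fingerprint(method: str, url: str) -> str:
    """Simplified request fingerprint: SHA-1 over the method and URL."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SketchScheduler:
    def __init__(self):
        self._seen = set()                 # plays the role of the DupeFilter set
        self._heap = []                    # plays the role of the request queue
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def enqueue_request(self, url, priority=0, method="GET"):
        """Queue the request unless its fingerprint was seen before."""
        fp = fingerprint(method, url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        # heapq is a min-heap, so negate the priority to pop the highest first
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_request(self):
        """Pop the highest-priority URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Moving `_seen` and `_heap` into Redis is what turns this single-process scheduler into one that many crawler processes can share.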

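• Item Pipeline

Items scraped by the spider are passed to the scrapy-redis Item Pipeline, which stores them in an items queue in Redis. Items can then be pulled from that queue easily, which makes it possible to run a cluster of item-processing workers.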
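
A separate item-processing worker could look roughly like the sketch below. It assumes the items were stored as JSON strings in a Redis list, and it is written against any client object that has an `lpop` method (redis-py's client does); `drain_items` and its parameters are names chosen for this example.

```python
import json


def drain_items(client, key, handle, max_items=100):
    """Pop up to max_items JSON-encoded items from a Redis list and pass each to handle()."""
    processed = 0
    while processed < max_items:
        raw = client.lpop(key)
        if raw is None:  # the list is empty
            break
        if isinstance(raw, bytes):  # clients may return bytes instead of str
            raw = raw.decode("utf-8")
        handle(json.loads(raw))
        processed += 1
    return processed
```

With redis-py this might be called as `drain_items(redis.Redis(host="127.0.0.1", port=6379), "<spider name>:items", print)`, where the key follows the `<spider name>:items` pattern. Several such workers can run at once, since each `lpop` hands an item to exactly one of them.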

Installing and using scrapy-redis

Installing scrapy-redis

Scrapy is already installed from earlier, so only scrapy-redis needs to be installed here.

pip install scrapy-redis

Adapting the scrapy-redis example

First get the scrapy-redis example from GitHub, then copy its example-project directory to a location of your choice.

git clone https://github.com/rolando/scrapy-redis.git
cp -r scrapy-redis/example-project ./scrapy-youyuan

Alternatively, download the whole project as scrapy-redis-master.zip, unzip it, and then run:

cp -r scrapy-redis-master/example-project/ ./redis-youyuan
cd redis-youyuan/

Use tree to view the project directory.

Editing settings.py

Note: Chinese comments in settings.py can raise an error; write them in English instead.

# Use the scrapy-redis Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the queues used by scrapy-redis in Redis, which allows a crawl to be paused and resumed
SCHEDULER_PERSIST = True

# Queue class used to order the requests to crawl; the default orders by priority
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Optional: first-in, first-out ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
# Optional: last-in, first-out ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'

# Only effective with SpiderQueue or SpiderStack: maximum idle time before the spider closes
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Add RedisPipeline so that items are stored in Redis
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Redis connection parameters
# REDIS_PASS is a Redis password setting I added myself; it needs a small change to the scrapy-redis source to support connecting to Redis with a password
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Custom redis client parameters (i.e.: socket timeout, etc.)
REDIS_PARAMS  = {}
#REDIS_URL = 'redis://user:pass@hostname:9001'
#REDIS_PARAMS['password'] = 'itcast.cn'
LOG_LEVEL = 'DEBUG'

# The class used to detect and filter duplicate requests. The scrapy-redis
# filter keeps request fingerprints in a Redis set shared by all workers;
# Scrapy's default, scrapy.dupefilters.RFPDupeFilter, only deduplicates
# within a single process, so it is not suitable for a distributed crawl.
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# By default, RFPDupeFilter only logs the first duplicate request. Setting DUPEFILTER_DEBUG to True will make it log all duplicate requests.
DUPEFILTER_DEBUG = True

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, sdch',
}
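
The three `SCHEDULER_QUEUE_CLASS` options in the settings above differ only in the order in which queued requests come back out. The sketch below reproduces the three orderings with standard-library containers; it does not use scrapy-redis, and the function names and the `(url, priority)` tuples are made up for the example. Breaking priority ties by insertion order is a choice made in this sketch.

```python
import heapq
from collections import deque


def fifo_order(requests):
    """First in, first out: requests leave in arrival order."""
    q = deque(url for url, _priority in requests)
    return [q.popleft() for _ in range(len(q))]


def lifo_order(requests):
    """Last in, first out: the newest request leaves first."""
    stack = [url for url, _priority in requests]
    return [stack.pop() for _ in range(len(stack))]


def priority_order(requests):
    """Highest priority first; ties keep arrival order."""
    heap = [(-priority, index, url) for index, (url, priority) in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

For the input `[("/a", 0), ("/b", 5), ("/c", 0)]`, the FIFO order is `/a, /b, /c`, the LIFO order is `/c, /b, /a`, and the priority order is `/b, /a, /c`. Roughly speaking, FIFO crawls breadth-first and LIFO depth-first.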

View pipelines.py

from datetime import datetime

class ExamplePipeline(object):
    def process_item(self, item, spider):
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item

Workflow

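    - Concept: several computers can be combined into a distributed cluster that runs the same program and jointly crawls the same set of network resources.
    - Plain Scrapy cannot do distributed crawling on its own:
        - the scheduler cannot be shared
        - the pipeline cannot be shared
    - Distributed crawling is implemented with Scrapy plus Redis (the scrapy and scrapy-redis components)
    - What the scrapy-redis component does:
        - provides a pipeline and a scheduler that can be shared
    - Environment setup:
        - pip install scrapy-redis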
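    - Coding workflow:
        1. Create the project
        2. cd proName
        3. Create a CrawlSpider-based spider file
        4. Modify the spider class:
            - Import: from scrapy_redis.spiders import RedisCrawlSpider
            - Change the spider's parent class to RedisCrawlSpider
            - Delete allowed_domains and start_urls
            - Add a new attribute: redis_key = 'xxxx', the name of the shared scheduler queue
        5. Edit settings.py
            - Specify the pipeline
                ITEM_PIPELINES = {
                        'scrapy_redis.pipelines.RedisPipeline': 400
                    }
            - Specify the scheduler
                # Dedup container class: stores request fingerprints in a Redis set so that request deduplication is persistent
                DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
                # Use the scheduler provided by scrapy-redis
                SCHEDULER = "scrapy_redis.scheduler.Scheduler"
                # Whether the scheduler persists its state: when the crawl ends, should the request queue and the fingerprint set in Redis be cleared? True keeps the data; False clears it
                SCHEDULER_PERSIST = True
            - Specify the Redis database
                REDIS_HOST = 'IP address of the Redis server'
                REDIS_PORT = 6379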
        6. Configure Redis (redis.windows.conf)
            - Disable the default bind
                - line 56: #bind 127.0.0.1
            - Disable protected mode
                - line 75: protected-mode no
        7. Start the Redis server (with the configuration file) and the client
            - redis-server.exe redis.windows.conf
            - redis-cli
        8. Run the project
            - scrapy runspider spider.py
        9. Push the start URL onto the shared scheduler queue (sun)
            - in redis-cli: lpush sun http://www.xxx.com/
        10. In Redis:
            - xxx:items holds the scraped data
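
Step 9 can also be done from Python instead of redis-cli. The helper below is a sketch written against any client object with an `lpush` method (redis-py's client has one); `seed_start_urls` is a name invented here. It also checks that each URL is absolute, because Scrapy rejects request URLs that have no scheme.

```python
from urllib.parse import urlparse


def seed_start_urls(client, redis_key, urls):
    """LPUSH each absolute http(s) URL onto redis_key; return the URLs that were rejected."""
    rejected = []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            client.lpush(redis_key, url)
        else:
            rejected.append(url)  # e.g. 'www.xxx.com' has no scheme
    return rejected
```

With redis-py this might be called as `seed_start_urls(redis.Redis(host="127.0.0.1", port=6379), "sun", ["http://www.xxx.com/"])`, where the host is whatever `REDIS_HOST` points to.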

Distributed crawling example

Spider

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
# Project package name, matching SPIDER_MODULES in settings.py
from fbs_obj.items import FbsproItem


class FbsSpider(RedisCrawlSpider):
    name = 'fbs_obj'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'sun'  # name of the shared scheduler queue
    # Follow pagination links whose URL contains type=4&page=<number>
    link = LinkExtractor(allow=r'type=4&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each matching table row holds one entry
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/@title').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()

            item = FbsproItem()
            item['title'] = title
            item['status'] = status
            print(title)
            yield item
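
The `allow` pattern given to `LinkExtractor` is a regular expression that is searched for in each candidate URL. Its effect can be checked in isolation with the standard `re` module; the URLs in the usage note are hypothetical, and `would_follow` is a helper name made up for this sketch.

```python
import re

# Same pattern as the spider's LinkExtractor(allow=...)
PAGE_LINK = re.compile(r'type=4&page=\d+')


def would_follow(url: str) -> bool:
    """Return True if the pattern occurs anywhere in the URL."""
    return PAGE_LINK.search(url) is not None
```

For example, `would_follow("http://www.xxx.com/list?type=4&page=2")` is true, while a URL with `type=3&page=2` or with an empty `page=` is not matched and would not be followed by the rule.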

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for fbsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbs_obj'

SPIDER_MODULES = ['fbs_obj.spiders']
NEWSPIDER_MODULE = 'fbs_obj.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'fbsPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 2

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'fbsPro.pipelines.FbsproPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

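# Specify the pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Specify the scheduler
# Dedup container class: stores request fingerprints in a Redis set so that request deduplication is persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether the scheduler persists its state: when the crawl ends, should the request queue and the fingerprint set in Redis be cleared? True keeps the data; False clears it
SCHEDULER_PERSIST = True

# Specify the Redis server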
REDIS_HOST = '192.168.16.119'
REDIS_PORT = 6379

items.py

import scrapy

class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    status = scrapy.Field()
