Implementing a Distributed Crawler with Redis


Redis distributed crawler

Concept: the same spider program runs on multiple machines, and together they crawl one site's data.
Native Scrapy cannot run as a distributed crawler on its own, for the following reasons:

  • The scheduler cannot be shared across machines
  • The item pipeline cannot be shared across machines

The scrapy-redis component: a set of components developed specifically for Scrapy that lets it run as a distributed crawler. Install it with pip install scrapy-redis

Workflow of a distributed crawl:

1 Edit the Redis configuration file

  •  Comment out bind 127.0.0.1 so other machines can connect
  •  Set protected-mode no to turn off protected mode

2 Start the Redis server using that configuration file
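For example, assuming the edited configuration file is saved as redis.conf in the current directory (the file path is an assumption):

redis-server ./redis.conf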

3 After creating the Scrapy project, create a spider file based on CrawlSpider
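A sketch of the commands, assuming the project is named redisPro and the spider qiubai (both names are inferred from the code later in this article):

scrapy startproject redisPro
cd redisPro
scrapy genspider -t crawl qiubai www.qiushibaike.com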

4 Import the RedisCrawlSpider class: from scrapy_redis.spiders import RedisCrawlSpider

5 Replace start_urls with redis_key = 'xxx' (the name of the scheduler queue in Redis)

6 Write the parsing code

7 Point the project's item pipeline and scheduler at the ones provided by scrapy-redis

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Use the dedupe filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing/resuming (keep the request queue and dedupe set in Redis)
SCHEDULER_PERSIST = True

8 Configure the Redis server address, port, and password

# If the Redis server is not on the local machine, add the following settings
REDIS_HOST = '192.168.0.108'
REDIS_PORT = 6379
REDIS_PARAMS = {"password": "123456"}

9 Run the spider file

scrapy runspider qiubai.py

10 Push a start URL into the scheduler queue (run this from a Redis client): lpush <value of redis_key> <start url>

lpush qiubaispider https://www.qiushibaike.com/pic/

Implementation code (imports added for completeness; the redisPro package name in the item import is an assumption based on the item class name):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from redisPro.items import RedisproItem  # assuming the project package is named redisPro


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.qiushibaike.com/pic']
    # start_urls = ['http://www.qiushibaike.com/pic/']
    redis_key = 'qiubaispider'  # serves the same purpose as start_urls
    link = LinkExtractor(allow=r'/pic/page/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('start crawling')
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list:
            print(div)
            img_url = "http://" + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url
            yield item
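For reference, a minimal sketch of what items.py needs to contain for the spider above (only the img_url field is taken from the code; everything else about the file is assumed):

import scrapy


class RedisproItem(scrapy.Item):
    # URL of the picture extracted in parse_item
    img_url = scrapy.Field()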

Distributed crawler based on RedisSpider

Case requirement: crawl text-based news data from four sections of news.163.com (Domestic, International, Military, Aviation).

  • 1 Import the webdriver class in the spider file
  • 2 Instantiate a browser object in the spider class's constructor
  • 3 Close the browser in the spider class's closed method
  • 4 Implement the browser automation in the download middleware's process_response method

wangyi.py:

 

# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from wanyiPro.items import WanyiproItem
from scrapy_redis.spiders import RedisSpider


class WangyiSpider(RedisSpider):
    name = 'wangyi'
    # allowed_domains = ['news.163.com']
    # start_urls = ['https://news.163.com/']
    redis_key = "wangyi"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Instantiate a browser object; a raw string keeps the Windows path intact
        self.bro = webdriver.Chrome(executable_path=r'G:\myprogram\路飛學城\第七模塊\wanyiPro\chromedriver.exe')

    # The browser must be closed only after the entire crawl has finished
    def closed(self, spider):
        print('spider finished')
        self.bro.quit()

    def parse(self, response):
        lis = response.xpath('//div[@class="ns_area list"]/ul/li')
        indexs = [3, 4, 6, 7]
        li_list = []  # holds the li tags of the four sections: Domestic, International, Military, Aviation
        for index in indexs:
            li_list.append(lis[index])
        # Get the link and title of each of the four sections

        for li in li_list:
            url = li.xpath('./a/@href').extract_first()
            title = li.xpath('./a/text()').extract_first()
            # print(url + ":" + title)
            # Request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
            yield scrapy.Request(url=url, callback=self.parseSecond, meta={'title': title})

    def parseSecond(self, response):
        div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
        for div in div_list:
            head = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
            url = div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
            img_url = div.xpath('./a/img/@src').extract_first()
            tag_list = div.xpath('.//div[@class="news_tag"]//text()').extract()
            tags = []
            for t in tag_list:
                t = t.strip('\n \t')
                tags.append(t)
            tag = "".join(tags)
            # Get the title passed along through meta
            title = response.meta['title']
            print(head + ":" + url + ":" + img_url)
            # Instantiate an item object and store the parsed values in it
            item = WanyiproItem()
            item['head'] = head
            item['url'] = url
            item['imgUrl'] = img_url
            item['tag'] = tag
            item['title'] = title
            # Request the article url to parse the full news content
            yield scrapy.Request(url=url, callback=self.getContent, meta={'item': item})

    def getContent(self, response):
        # Get the item passed along via meta
        item = response.meta['item']
        # Parse the news content stored on the current page
        content_list = response.xpath('//div[@class="post_text"]/p/text()').extract()
        content = "".join(content_list)
        item['content'] = content
        yield item
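For reference, a minimal sketch of the WanyiproItem class used above; the field names come from the spider code, the rest of items.py is assumed:

import scrapy


class WanyiproItem(scrapy.Item):
    head = scrapy.Field()     # article headline
    url = scrapy.Field()      # article url
    imgUrl = scrapy.Field()   # thumbnail url
    tag = scrapy.Field()      # concatenated keywords
    title = scrapy.Field()    # section title (Domestic / International / ...)
    content = scrapy.Field()  # full article text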

 

middlewares.py:

import time

from scrapy import signals
from scrapy.http import HtmlResponse


class WanyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Intercept the response object that the downloader passes to the spider
        # request: the request object that produced this response
        # response: the intercepted response object
        # spider: the instance of the spider class defined in the spider file
        print(request.url + " -- downloader middleware")
        # Tamper with the page data stored in the response object
        if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/world/', 'http://war.163.com/',
                           'http://news.163.com/air/']:
            spider.bro.get(url=request.url)
            js = 'window.scrollTo(0,document.body.scrollHeight)'
            spider.bro.execute_script(js)
            time.sleep(2)  # give the browser some time to load the dynamic data
            # The page source now includes the dynamically loaded news data
            page_text = spider.bro.page_source
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
        else:
            return response

UA pool and proxy IP pool:

from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

# UA pool (wrapped in its own downloader middleware class)
# UserAgentMiddleware is imported above
class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Pick a random UA value from the list
        ua = random.choice(user_agent_list)
        # Write the chosen UA into the intercepted request's headers
        request.headers.setdefault('User-Agent', ua)


# Proxy IPs available for selection
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]

# Swap in a proxy IP for every intercepted request
class Proxy(object):
    def process_request(self, request, spider):
        # Check the scheme of the intercepted request's url (http or https)
        # request.url looks like: http://www.xxx.com
        h = request.url.split(':')[0]  # the request's scheme
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip
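These middlewares only take effect after they are registered in settings.py. A sketch, assuming all three classes live in wanyiPro/middlewares.py (the priority numbers are illustrative):

DOWNLOADER_MIDDLEWARES = {
    'wanyiPro.middlewares.RandomUserAgent': 542,
    'wanyiPro.middlewares.Proxy': 543,
    'wanyiPro.middlewares.WanyiproDownloaderMiddleware': 544,
}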

Steps to implement a distributed crawler based on RedisSpider

1 Import the class: from scrapy_redis.spiders import RedisSpider
2 Change the spider class's parent to RedisSpider
3 Comment out the start_urls list and add a redis_key attribute (the name of the scheduler queue)
4 Edit the Redis configuration file:

  • Comment out bind 127.0.0.1
  • Set protected-mode no to turn off protected mode

5 Configure Redis in settings.py

REDIS_HOST = '192.168.0.108'
REDIS_PORT = 6379
REDIS_PARAMS = {"password": "123456"}

# Use the dedupe filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing/resuming (keep the request queue and dedupe set in Redis)
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

6 Run the spider file

scrapy runspider wangyi.py

7 Push a start URL into the scheduler queue

lpush wangyi https://news.163.com/
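Because RedisPipeline serializes every item into Redis, the crawl results can be read back from any machine. A minimal sketch using redis-py, assuming the default scrapy-redis item key pattern '<spider name>:items' and the connection settings shown above:

import json
import redis

# Connect to the same Redis server the spiders write to
conn = redis.Redis(host='192.168.0.108', port=6379, password='123456')

# RedisPipeline stores serialized items in the list 'wangyi:items' by default
for raw in conn.lrange('wangyi:items', 0, -1):
    item = json.loads(raw)
    print(item.get('head'), item.get('url'))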

 

