The CrawlSpider class in Scrapy

Reposted article, 2018-05-13 13:45. Tags: web crawler / CrawlSpider / scrapy

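Understanding CrawlSpider

CrawlSpider is the spider most commonly used for crawling ordinary websites. It defines a set of rules (Rule objects) that provide a convenient mechanism for following links. It may not suit your particular target site, but it works in most cases. You can therefore take it as a starting point and override some of its methods as needed, or implement your own spider.

Official documentation: http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/spiders.html#crawlspider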

Using CrawlSpider

Basic usage

Create the spider file: scrapy genspider -t crawl "spider_name" "url"

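The command adds a new spider file to the project's spiders folder. The generated file defines a CrawlSpider subclass with name, allowed_domains, start_urls and a rules tuple.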

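CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines rules (Rule objects) that provide a convenient mechanism for extracting and following links, so it collects links from the pages it has already crawled and keeps crawling them.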
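
To make the difference concrete, here is a toy model of rule-based link following that uses only the standard library. It is not Scrapy's implementation: the SITE dictionary, the RULES list and the crawl function are hypothetical names invented for this sketch, and pages are looked up in memory instead of being downloaded.

```python
import re
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/": ["/list?start=0", "/about"],
    "/list?start=0": ["/list?start=10", "/detail?id=1"],
    "/list?start=10": ["/list?start=0", "/detail?id=2"],
    "/detail?id=1": ["/detail?id=3"],
    "/detail?id=2": [],
    "/about": ["/detail?id=4"],
}

# Each rule: (regex a link must match, callback name, follow links on that page?)
RULES = [
    (re.compile(r"start=\d+"), "parse_list", True),
    (re.compile(r"detail\?id=\d+"), "parse_detail", False),
]


def crawl(start_urls):
    """Return (url, callback) pairs in the order the pages would be handled."""
    handled = []
    seen = set(start_urls)
    # Start pages are only used to extract links, so they carry no callback here.
    queue = deque((url, None, True) for url in start_urls)
    while queue:
        url, callback, follow = queue.popleft()
        if callback is not None:
            handled.append((url, callback))
        if not follow:
            continue  # links on this page are ignored
        for link in SITE.get(url, []):
            if link in seen:
                continue  # skip links that were already scheduled
            for pattern, rule_callback, rule_follow in RULES:
                if pattern.search(link):
                    seen.add(link)
                    queue.append((link, rule_callback, rule_follow))
                    break  # the first matching rule wins
    return handled


for url, callback in crawl(["/"]):
    print(callback, url)
```

In this sketch /about is never visited because no rule matches it, and /detail?id=3 is never visited because the detail rule does not follow links. A plain Spider, by contrast, would stop after fetching "/".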

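Methods and attributes

name: defines the name of the spider.

allowed_domains: the list of domains the spider is allowed to crawl.

start_urls: the initial list of URLs. When no specific URLs are given, the spider starts crawling from this list.

start_requests(self): returns an iterable containing the first requests the spider will crawl.

parse(self, response): the default callback for Request objects. It processes the returned response and generates Items or further Request objects.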
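
The way these members fit together can be sketched with a toy class that only mirrors their names. This is a standard-library stand-in, not Scrapy code: MiniSpider, the example.com URLs and the is_allowed helper are assumptions made for illustration, and the domain check reflects my understanding that a host passes when it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


class MiniSpider:
    """Toy stand-in for a spider; it only mirrors the attribute and method names."""

    name = "demo"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/jobs", "https://example.com/news"]

    def start_requests(self):
        # One "request" per start URL, paired with the default callback.
        for url in self.start_urls:
            yield url, self.parse

    def parse(self, response):
        # Default callback: turn a response into items (here, a plain dict).
        yield {"spider": self.name, "length": len(response)}

    def is_allowed(self, url):
        # Hypothetical helper: accept the allowed domains and their subdomains.
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in self.allowed_domains)


spider = MiniSpider()
for url, callback in spider.start_requests():
    fake_body = "<html>demo</html>"  # stands in for a downloaded page
    print(url, list(callback(fake_body)))

print(spider.is_allowed("https://hr.example.com/position"))
print(spider.is_allowed("https://example.org/"))
```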

Scraping data with CrawlSpider

Write a CrawlSpider that scrapes Tencent's job postings. For the detailed analysis of the pages, see:

http://www.cnblogs.com/pythoner6833/p/9018782.html

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tencent2.items import Tencent2Item, DetailsItem


class Tencent2Spider(CrawlSpider):

    # Spider name
    name = 'Tencent2'
    # Domains the spider is allowed to crawl
    allowed_domains = ['hr.tencent.com']
    # URLs the crawl starts from
    start_urls = ['https://hr.tencent.com/position.php?']

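    # The rules attribute
    rules = (

        # Define a rule that picks up the URLs we want
        # allow: a regular expression; every <a> link on the page whose URL matches it is extracted
        # callback: the method that parses the responses of the matched links
        # follow: whether to keep extracting links from the pages reached through this rule
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_tencent', follow=True),

        # Rule for the detail pages; links found on a detail page are not followed
        Rule(LinkExtractor(allow=r'position_detail\.php\?id=\d+'), callback='parse_detail', follow=False),
    )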

    def parse_tencent(self, response):
        # Select the table rows that hold the job postings
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')

        # Iterate over the rows and build one item per posting
        for node in node_list:
            # Create the item and fill in its fields
            item = Tencent2Item()

            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['position_type'] = node.xpath('./td[2]/text()').extract_first()
            item['wanted_number'] = node.xpath('./td[3]/text()').extract_first()
            item['work_location'] = node.xpath('./td[4]/text()').extract_first()
            item['publish_time'] = node.xpath('./td[5]/text()').extract_first()

            yield item

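    def parse_detail(self, response):
        """
        Parse the data on a detail page.
        :param response: the response for a position_detail.php page
        :return: yields one DetailsItem
        """
        item = DetailsItem()
        # The detail page holds two <ul class="squareli"> lists:
        # the first lists the job duties, the second the required skills
        ul_list = response.xpath('//ul[@class="squareli"]')
        # Fall back to an empty string when a list is missing, instead of raising IndexError
        item['work_duties'] = ''.join(ul_list[0].xpath('./li/text()').extract()) if len(ul_list) > 0 else ''
        item['work_skills'] = ''.join(ul_list[1].xpath('./li/text()').extract()) if len(ul_list) > 1 else ''
        yield item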
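
To see which links each rule would pick up, the two allow patterns can be tried out with the standard re module. This is only a sketch: it assumes the patterns are applied as unanchored searches against the full link URL, the sample URLs are made up for illustration, and classify is a hypothetical helper, not part of Scrapy.

```python
import re

# The two patterns passed to LinkExtractor(allow=...) in the spider above.
LIST_PAGE = re.compile(r'start=\d+')
DETAIL_PAGE = re.compile(r'position_detail\.php\?id=\d+')

# Hypothetical URLs, for illustration only.
samples = [
    'https://hr.tencent.com/position.php?&start=10#a',
    'https://hr.tencent.com/position_detail.php?id=12345&keywords=',
    'https://hr.tencent.com/position_detail.php?keywords=&id=12345',
    'https://hr.tencent.com/about.php',
]


def classify(url):
    """Return the callback name of the first rule whose pattern matches, or None."""
    if LIST_PAGE.search(url):
        return 'parse_tencent'
    if DETAIL_PAGE.search(url):
        return 'parse_detail'
    return None


for url in samples:
    print(classify(url), url)
```

The third sample is not matched because the detail pattern requires id= to come directly after the question mark. If the site can place other query parameters first, the pattern would need to be loosened.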

The remaining parts, including items.py and the pipelines.py that saves the data, are written as explained in the article linked above.
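
For readers without access to that article, the following is a minimal sketch of what a saving pipeline could look like. It is not the author's pipelines.py: the class name JsonLinesPipeline, the output file name and the item_type field are assumptions. A pipeline is an ordinary Python class, so the sketch needs nothing beyond the standard library and is exercised here with a plain dict in place of a real item.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: write every item as one line of JSON."""

    def __init__(self, path='tencent.jsonl'):  # the file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Record the item class so list rows and detail rows can be told apart.
        record = {'item_type': type(item).__name__, **dict(item)}
        self.file.write(json.dumps(record, ensure_ascii=False) + '\n')
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Small demonstration that writes to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'position_name': 'Engineer'}, spider=None)
pipeline.close_spider(spider=None)
with open(demo_path, encoding='utf-8') as fh:
    print(fh.read())
```

In a real project such a class would be enabled through the ITEM_PIPELINES setting in settings.py, for example under a path like tencent2.pipelines.JsonLinesPipeline, where the module path is guessed from the import at the top of the spider.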
