Scrapy Crawler Series, Part 2: Paginated Crawling and Basic Logging

Reposted article, 2019-03-28 21:48, tagged python3

Covers: how to crawl across multiple pages, how to send requests, and basic use of logging

Target site: Tencent's social recruitment site

Full code: https://files.cnblogs.com/files/bookwed/tencent.zip

Main code:

job.py

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem
import logging  # standard-library logging module

logger = logging.getLogger(__name__)

class JobSpider(scrapy.Spider):
    """Spider for job postings."""
    name = 'job'
    allowed_domains = ["tencent.com"]
    offset = 0
    baseUrl = "https://hr.tencent.com/position.php?start={}"
    start_urls = [baseUrl.format(offset)]

    def parse(self, response):
        # Equivalent single expression: //tr[@class="even" or @class="odd"]
        # xpath() returns a list of Selector objects
        job_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for job in job_list:
            item = TencentItem()
            # extract() returns a list of all matched strings
            # extract_first() returns the first matched string, or None if nothing matched,
            # so it replaces extract()[0] without a separate emptiness check
            item["name"] = job.xpath("./td[1]/a/text()").extract_first()
            item["url"] = job.xpath("./td[1]/a/@href").extract_first()
            item["type"] = job.xpath("./td[2]/text()").extract_first()
            item["people_number"] = job.xpath("./td[3]/text()").extract_first()
            item["place"] = job.xpath("./td[4]/text()").extract_first()
            item["publish_time"] = job.xpath("./td[5]/text()").extract_first()
            # Logging option 1: module-level call on the root logger
            # logging.warning(item)
            # Logging option 2 (recommended): the named logger shows which module logged the message
            logger.warning(item)
            # Why yield? It turns parse() into a generator: items are produced one at a time
            # rather than collected first, so memory usage does not spike
            yield item

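        # Option 1: build the next page URL from the offset
        # if self.offset < 3090:
        #     self.offset += 10
        #     url = self.baseUrl.format(self.offset)
        #     yield scrapy.Request(url, callback=self.parse)

        # yield response.follow(next_page, self.parse)

        # Option 2: take the next-page link from the response and request it,
        # repeating until no further link is left
        is_last_page = len(response.xpath("//a[@class='noactive' and @id='next']")) > 0
        next_href = response.xpath("//a[@id='next']/@href").extract_first()
        if not is_last_page and next_href:
            # yield response.follow(next_href, callback=self.parse)
            # callback names the method that parses the response: reuse parse when the page
            # is handled like the first one, otherwise define a new method and pass it here
            yield scrapy.Request(
                response.urljoin(next_href),  # resolve the relative href against the current page URL
                callback=self.parse,
                # meta={"item": item}    # meta passes data between parse callbacks
                # dont_filter=True    # do not filter out duplicate requests
            )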

    def parse1(self, response):
        # Example callback: reads the item handed over via Request(meta={"item": item})
        item = response.meta["item"]
        print(item)
        print("*"*30)
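The comments in parse() recommend the named logger over calling logging.warning() directly. The following is a minimal standalone sketch of that difference, using only the standard library. The logger name "tencent.spiders.job" is an assumption standing in for what __name__ would be inside the spider module, and the format string is chosen for the demo rather than taken from the project.

```python
import io
import logging

# Hypothetical logger name, standing in for __name__ inside the spider module
logger = logging.getLogger("tencent.spiders.job")


def demo() -> str:
    """Log once via the root logger and once via the named logger; return the captured text."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    # %(name)s prints the name of the logger that emitted the record
    handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))

    root = logging.getLogger()
    root.addHandler(handler)
    try:
        logging.warning("logged through the root logger")   # option 1
        logger.warning("logged through the named logger")   # option 2
    finally:
        root.removeHandler(handler)
    return buffer.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

With option 1 the record carries the name "root", so the output does not say where the message came from. With option 2 the record carries the module-style name, which is what makes the source of each message visible in the log.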

pipelines.py

import json


class TencentPipeline(object):
    # Optional: initialize parameters and other state
    def __init__(self):
        self.f = open('tencent_job.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
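        # item (Item object): the scraped item
        # spider (Spider object): the spider that scraped the item; spider.name gives its name
        # Note: every line ends with ",", so the file as a whole is not a valid JSON document
        content = json.dumps(dict(item), ensure_ascii=False)+",\n"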
        self.f.write(content)
        return item

    def open_spider(self, spider):
        # Optional: called when the spider is opened
        pass

    def close_spider(self, spider):
        # Optional: called when the spider is closed
        self.f.close()
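The pipeline above appends "," and a newline after each item, so the resulting file cannot be loaded with a single json.load() call. One possible alternative, sketched here as an assumption rather than as the author's method, is the JSON Lines layout: one complete JSON object per line, no trailing comma. The class name JsonLinesPipeline, the file name tencent_job.jl, and the helpers read_items and demo are hypothetical. The sketch does not import Scrapy and is driven by hand so it can run on its own.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of TencentPipeline that writes one JSON object per line."""

    def __init__(self, path="tencent_job.jl"):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Open the file when the spider starts rather than at construction time
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # No trailing comma: every line is a complete JSON document on its own
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


def read_items(path):
    """Parse a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo():
    """Drive the pipeline by hand with two made-up items and read them back."""
    path = os.path.join(tempfile.mkdtemp(), "jobs.jl")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    for item in ({"name": "Engineer", "place": "Shenzhen"},
                 {"name": "Designer", "place": "Beijing"}):
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return read_items(path)
```

Because each line parses independently, the file can be read line by line without loading everything at once, and a crawl that stops halfway still leaves a readable file.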
