Scraping intern job listings from shixi.com


  Use the Scrapy framework to scrape the first 3 pages of "big data intern" (大數據實習生) listings on shixi.com. The fields to collect are: position title, internship city, company, salary, education requirement, publish time, and job description. The start URL is https://www.shixi.com/search/index?key=大數據

Creating the project

Open a command prompt and create a Scrapy project with the following commands:

scrapy startproject shixi
cd shixi
scrapy genspider bigdata www.shixi.com
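
These commands should leave a project layout roughly like the sketch below (Scrapy's default template; only the files edited in this article are annotated):

shixi/
    scrapy.cfg              # deploy configuration
    shixi/
        __init__.py
        items.py            # Item definition (edited below)
        middlewares.py
        pipelines.py        # MySQL pipeline (edited below)
        settings.py         # project settings (edited below)
        spiders/
            __init__.py
            bigdata.py      # spider skeleton created by genspider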

Then open the project in PyCharm.

Constructing the requests

In settings.py, register the item pipeline and set the MySQL parameters by appending the following code at the end (ROBOTSTXT_OBEY is turned off and a download delay is added as well):

MAX_PAGE = 3  # number of search result pages to crawl

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'shixi.pipelines.MySQLPipeline': 300,
}
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'spiders'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 5  # wait 5 seconds between requests to keep the crawl polite
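
To confirm the custom settings are picked up, they can be queried from the project directory with Scrapy's settings command, for example:

scrapy settings --get MAX_PAGE
scrapy settings --get MYSQL_HOST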

Extracting the data

Define the Item in items.py:

import scrapy
from scrapy import Field


class ShixiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    table = "bigdataPos"  # target table name, read by the MySQL pipeline (a plain class attribute, not a Field)

    pos = Field()      # position title
    city = Field()     # internship city
    enter = Field()    # company
    salary = Field()   # salary
    acade = Field()    # education requirement
    time = Field()     # publish time
    jobshow = Field()  # job description

 

In bigdata.py under the spiders directory, rewrite the parse() method and add a start_requests() method (note the plural: Scrapy only calls a method with exactly that name).

Also note that start_urls defines the initial request and should be changed to the first page we want to crawl, https://www.shixi.com/search/index?key=大數據; because start_requests() is overridden below, it is what actually generates the paged requests.

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from shixi.items import ShixiItem


class BigdataSpider(scrapy.Spider):
    name = 'bigdata'
    allowed_domains = ['www.shixi.com']  # a domain, not a URL; requests outside this domain are filtered out
    start_urls = ['https://www.shixi.com/search/index?key=大數據']  # default start URLs; superseded by start_requests() below


    def parse(self, response):
        jobs = response.css(".left_list.clearfix .job-pannel-list")
        for job in jobs:
            item = ShixiItem()
            item['pos'] = job.css("div.job-pannel-one > dl > dt > a::text").extract_first().strip()
            item['city'] = job.css(
                ".job-pannel-two > div.company-info > span:nth-child(1) > a::text").extract_first().strip()
            item['enter'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").extract_first().strip()
            item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").extract_first().strip().replace(
                ' ', '')
            item['acade'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").extract_first().strip()
            item['time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").extract_first().strip()
            detail_url = job.css(".job-pannel-one > dl > dt > a::attr(href)").extract_first()

            # follow the detail page and carry the partially filled item along via meta
            url = response.urljoin(detail_url)
            yield scrapy.Request(url=url, callback=self.parse2, meta={'item': item})


    def parse2(self, response):
        item = response.meta['item']
        # 'describe' is a MySQL keyword and cannot be used as a column name, hence 'jobshow'
        item['jobshow'] = response.css("div.work.padding_left_30 > div.work_b::text").extract_first().strip()
        yield item

    def start_requests(self):
        # must be named start_requests (plural), otherwise Scrapy never calls it
        base_url = "https://www.shixi.com/search/index?key=大數據&page={}"
        for page in range(1, self.settings.get("MAX_PAGE") + 1):
            url = base_url.format(page)
            yield Request(url, self.parse)
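
Before running the full crawl, the CSS selectors above can be tested interactively with Scrapy's shell; a rough session (the exact output depends on the live page) looks like this:

scrapy shell "https://www.shixi.com/search/index?key=大數據"
>>> jobs = response.css(".left_list.clearfix .job-pannel-list")
>>> len(jobs)
>>> jobs[0].css("div.job-pannel-one > dl > dt > a::text").extract_first()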

Storing the data

Start MySQL (here via phpStudy), open Navicat, and create a table named bigdataPos in the spiders database; the table name must match the table attribute defined in items.py.
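
If you prefer to create the table from code rather than in Navicat, a minimal sketch with pymysql is shown below; the column types are assumptions (everything stored as text), only the table and column names must match the item definition.

import pymysql

# Assumed schema: all columns stored as text; adjust types and lengths as needed.
db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='spiders', port=3306, charset='utf8')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS bigdataPos (
        pos     VARCHAR(255),  -- position title
        city    VARCHAR(255),  -- internship city
        enter   VARCHAR(255),  -- company
        salary  VARCHAR(255),  -- salary
        acade   VARCHAR(255),  -- education requirement
        time    VARCHAR(255),  -- publish time
        jobshow TEXT           -- job description
    ) DEFAULT CHARSET=utf8
""")
db.close()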

 

Next, implement MySQLPipeline in pipelines.py:

import pymysql

class MySQLPipeline:
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection parameters defined in settings.py
        return cls(
            host=crawler.settings.get("MYSQL_HOST"),
            database=crawler.settings.get("MYSQL_DATABASE"),
            user=crawler.settings.get("MYSQL_USER"),
            password=crawler.settings.get("MYSQL_PASSWORD"),
            port=crawler.settings.get("MYSQL_PORT")
        )

    def open_spider(self, spider):
        # connect once when the spider starts; keyword arguments are required by newer PyMySQL versions
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # build the INSERT dynamically from the item's fields; the table name comes from item.table
        data = dict(item)
        keys = ", ".join(data.keys())
        values = ", ".join(["%s"] * len(data))
        sql = "insert into %s (%s) values (%s)" % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
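
For a fully populated ShixiItem, the statement process_item executes should look roughly like this (the column order follows the order in which the spider set the fields):

insert into bigdataPos (pos, city, enter, salary, acade, time, jobshow) values (%s, %s, %s, %s, %s, %s, %s)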

Running the spider

From the shixi project directory, run:

scrapy crawl bigdata

The scraped items appear in the crawl output, and the records are also written to the database successfully.
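
To also dump the scraped items to a file for quick inspection, Scrapy's built-in feed export can be used alongside the pipeline, for example:

scrapy crawl bigdata -o bigdata.csv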

