Use the Scrapy framework to crawl the first 3 pages of "大數據實習生" (big data intern) listings from the internship site shixi.com. The fields to scrape are: position name, internship city, company, salary, education requirement, publish time, and job description. The start URL is: https://www.shixi.com/search/index?key=大數據
Create the project
Open cmd and create a Scrapy project with the following commands:
scrapy startproject shixi
cd shixi
scrapy genspider bigdata www.shixi.com
Open the project in PyCharm.
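After these commands the project should have roughly the layout below (generated by scrapy startproject and genspider; the exact listing may vary slightly with your Scrapy version). The files edited in the rest of this walkthrough are settings.py, items.py, pipelines.py, and spiders/bigdata.py.

shixi/
├── scrapy.cfg
└── shixi/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── bigdata.py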
Build the requests
In settings.py, set the MySQL connection parameters, the page limit, and a few crawl options by appending the following code:
MAX_PAGE = 3

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 5

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'shixi.pipelines.MySQLPipeline': 300,
}

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'spiders'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
Extract the information
Define the Item in items.py:
import scrapy
from scrapy import Field


class ShixiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    table = "bigdataPos"  # table name used by the MySQL pipeline
    pos = Field()      # position name
    city = Field()     # internship city
    enter = Field()    # company
    salary = Field()   # salary
    acade = Field()    # education requirement
    time = Field()     # publish time
    jobshow = Field()  # job description
In bigdata.py under the spiders directory, modify the parse() method and add a start_requests() method. Note that start_urls defines the initial request and must be changed to the first page we want to crawl, https://www.shixi.com/search/index?key=大數據. Also note the method must be named start_requests() (with an s); otherwise Scrapy never calls it and only the first page gets crawled. When start_requests() is defined it takes precedence over start_urls and here it generates one request per page, from 1 to MAX_PAGE.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from shixi.items import ShixiItem


class BigdataSpider(scrapy.Spider):
    name = 'bigdata'
    # Note: allowed_domains takes domain names, not URLs; requests outside these domains are filtered out
    allowed_domains = ['www.shixi.com']
    # Initial URL list; superseded by start_requests() below, which generates one request per page
    start_urls = ['https://www.shixi.com/search/index?key=大數據']

    def start_requests(self):
        base_url = "https://www.shixi.com/search/index?key=大數據&page={}"
        for page in range(1, self.settings.get("MAX_PAGE") + 1):
            url = base_url.format(page)
            yield Request(url, self.parse)

    def parse(self, response):
        jobs = response.css(".left_list.clearfix .job-pannel-list")
        for job in jobs:
            item = ShixiItem()
            item['pos'] = job.css("div.job-pannel-one > dl > dt > a::text").extract_first().strip()
            item['city'] = job.css(
                ".job-pannel-two > div.company-info > span:nth-child(1) > a::text").extract_first().strip()
            item['enter'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").extract_first().strip()
            item['salary'] = job.css(
                ".job-pannel-two > div.company-info > div::text").extract_first().strip().replace(' ', '')
            item['acade'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").extract_first().strip()
            item['time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").extract_first().strip()
            # Follow the link to the detail page to pick up the job description
            next_url = job.css(".job-pannel-one > dl > dt > a::attr(href)").extract_first()
            url = response.urljoin(next_url)
            # Pass the partially filled item to the detail-page callback through meta
            yield scrapy.Request(url=url, callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        # 'describe' cannot be used as a column name (it is a MySQL keyword), hence the field name jobshow
        item['jobshow'] = response.css("div.work.padding_left_30 > div.work_b::text").extract_first().strip()
        yield item
Store the information
Start MySQL from phpStudy, then open Navicat and create a table named bigdataPos. Note that the table name must match the table attribute defined in items.py.
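If you prefer to create the table from code rather than clicking through Navicat, a minimal sketch with pymysql is shown below. The column names follow the ShixiItem fields; the VARCHAR/TEXT column types are an assumption, adjust them to your data as needed.

import pymysql

# Connection parameters match the MYSQL_* values in settings.py (adjust if yours differ)
db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='spiders', port=3306, charset='utf8')
cursor = db.cursor()

# Column names mirror the fields defined in items.py; the types are assumptions
cursor.execute("""
CREATE TABLE IF NOT EXISTS bigdataPos (
    pos VARCHAR(255),
    city VARCHAR(100),
    enter VARCHAR(255),
    salary VARCHAR(100),
    acade VARCHAR(100),
    time VARCHAR(100),
    jobshow TEXT
) DEFAULT CHARSET=utf8
""")
db.commit()
db.close()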
Next, implement MySQLPipeline in pipelines.py:
import pymysql


class MySQLPipeline:
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py
        return cls(
            host=crawler.settings.get("MYSQL_HOST"),
            database=crawler.settings.get("MYSQL_DATABASE"),
            user=crawler.settings.get("MYSQL_USER"),
            password=crawler.settings.get("MYSQL_PASSWORD"),
            port=crawler.settings.get("MYSQL_PORT")
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ", ".join(data.keys())
        values = ", ".join(["%s"] * len(data))
        # Build the INSERT statement dynamically from the item's fields
        sql = "insert into %s (%s) values (%s)" % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
Run the crawler
Run the following in cmd from the shixi project directory:
scrapy crawl bigdata
While the spider runs, the scraped items are printed in the console, and the records are also written to the database successfully.
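To confirm the rows landed in MySQL without opening Navicat, a quick check can be run with pymysql. This is a minimal sketch assuming the connection settings shown earlier (localhost, the spiders database, root/123456) and the bigdataPos table from items.py.

import pymysql

# Connection parameters match the MYSQL_* values in settings.py (adjust if yours differ)
db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='spiders', port=3306, charset='utf8')
cursor = db.cursor()

# Count the stored rows
cursor.execute("SELECT COUNT(*) FROM bigdataPos")
print("rows:", cursor.fetchone()[0])

# Print a few sample records
cursor.execute("SELECT pos, city, salary FROM bigdataPos LIMIT 5")
for row in cursor.fetchall():
    print(row)

db.close()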