Web Scraping in Practice (3): Scraping Lagou with Python

Reposted article. 2019-06-09 15:27. Tags: web crawler / selenium / Python

0. Preface
1. Initialization
2. Crawling the data
3. Saving the data
4. Data visualization
5. Wrapping up

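0. Preface

Lately I have been struggling to choose a career direction (a chronic case of indecision ＞﹏＜), so I wanted a picture of the job market for different positions.

A small crawler is a good fit for this: it can scrape job postings from Lagou (拉勾網) and plot them, so the picture is clear at a glance.

The overall approach is to use selenium to simulate browser behavior. The steps are:

Initialization
Crawl the data, which has two parts: scraping the data on a page, and turning to the next page
Save the data to a file
Visualize the data

The overall code structure is as follows: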

class Lagou:
    # Initialization
    def init(self):
        pass

    # Scrape the data on the current page
    def parse_page(self):
        pass

    # Turn to the next page
    def turn_page(self):
        pass

    # Crawl the data by calling parse_page and turn_page
    def crawl(self):
        pass

    # Save the data to a file
    def save(self):
        pass
    
    # Data visualization
    def draw(self):
        pass

if __name__ == '__main__':
    obj = Lagou()
    obj.init()
    obj.crawl()
    obj.save()
    obj.draw()

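Now let's walk through each stage of the crawler in detail.

1. Initialization

Initialization has to cover four things:

Prepare the global variables
Start the browser
Open the start URL
Set the cookie

(1) Prepare the global variables

"Global variables" here means variables that are needed throughout the crawl (in the code they are instance attributes on self). We define two:

data: stores the scraped data
isEnd: flags whether the crawl has finished

(2) Start the browser

There are broadly two ways to start the browser: a normal start and a headless start.

With a normal start the whole crawl is visible in a browser window, which makes errors easier to spot while debugging: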

from selenium import webdriver
self.browser = webdriver.Chrome()

A headless start skips rendering a visible window, which usually speeds up the crawl, so it is generally used for the real run:

from selenium import webdriver
opt = webdriver.chrome.options.Options()
# set_headless() and the chrome_options keyword are gone from current Selenium releases
opt.add_argument('--headless')
self.browser = webdriver.Chrome(options = opt)

(3) Open the start URL

First, open the Lagou home page (URL: https://www.lagou.com/).

Typing "python" into the search box and searching takes us to the following URL:

https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=

Searching for "爬蟲" (crawler) instead takes us to:

https://www.lagou.com/jobs/list_爬蟲?labelWords=&fromSearch=true&suginput=

The pattern is easy to see. Generalizing the URL gives the following template, which is our start URL:

https://www.lagou.com/jobs/list_{position}?labelWords=&fromSearch=true&suginput=

Here the position parameter is whatever we typed into the search box; it must be URL-encoded.
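As a rough illustration of the URL encoding step, the sketch below fills the template with a percent-encoded keyword. The helper name build_start_url and the constant START_URL are made up for this example, and passing safe='' (so that a '/' in the keyword is escaped too) is an assumption that goes slightly beyond the plain quote() call used later in the article.

```python
from urllib.parse import quote, unquote

# Template taken from the article; START_URL and build_start_url are hypothetical names
START_URL = 'https://www.lagou.com/jobs/list_{position}?labelWords=&fromSearch=true&suginput='

def build_start_url(position):
    """Return the search URL for a position, percent-encoding the keyword."""
    # quote() encodes non-ASCII text as UTF-8 percent escapes; safe='' also escapes '/'
    return START_URL.format(position=quote(position, safe=''))

print(build_start_url('python'))
print(build_start_url('爬蟲'))
```

ASCII keywords pass through unchanged, while Chinese keywords become %XX sequences that unquote() turns back into the original text.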

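(4) Set the cookie

Lagou limits how many pages a visitor who is not logged in can view, so after a certain number of pages the site redirects to the login page.

At that point the crawler stops working (I was stuck here for a long time before finding the cause).

To get around this, we can use a cookie to simulate a logged-in session.

For convenience, copy the cookie by hand from a logged-in browser session and add it to browser.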
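A cookie copied from the browser is usually one long string of name=value pairs separated by semicolons. The sketch below shows one way to split it into the name/value dicts that selenium's add_cookie() accepts. The function name parse_cookie_string and the sample cookie are invented for illustration; only the parsing is shown, so no browser is needed to try it.

```python
def parse_cookie_string(raw):
    """Turn 'k1=v1; k2=v2' into a list of {'name': ..., 'value': ...} dicts."""
    cookies = []
    for part in raw.split(';'):
        part = part.strip()
        if '=' not in part:
            continue  # skip empty fragments, e.g. after a trailing ';'
        # Split on the first '=' only, because cookie values may contain '='
        name, value = part.split('=', 1)
        cookies.append({'name': name.strip(), 'value': value.strip()})
    return cookies

# The cookie names and values below are made up for illustration
sample = 'user_trace_token=abc123; JSESSIONID=XYZ==; '
for cookie in parse_cookie_string(sample):
    print(cookie)
```

Each resulting dict can then be passed to browser.add_cookie(), once a page on the target domain has been opened.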

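(5) Complete code for initialization

# Initialization
def init(self):
    # Prepare the global variables
    self.data = list()
    self.isEnd = False
    # Start and configure the browser
    opt = webdriver.chrome.options.Options()
    opt.add_argument('--headless')  # replaces set_headless(), which current Selenium no longer has
    self.browser = webdriver.Chrome(options = opt)  # replaces the old chrome_options keyword
    self.wait = WebDriverWait(self.browser,10)
    # Open the start URL
    self.position = input('Enter a position: ')
    self.browser.get('https://www.lagou.com/jobs/list_' + urllib.parse.quote(self.position) + '?labelWords=&fromSearch=true&suginput=')
    # Set the cookie (cookies can only be added for the domain of the page that is open)
    cookie = input('Enter the cookie string: ')
    for item in cookie.split(';'):
        if '=' not in item:
            continue  # skip empty fragments such as a trailing ';'
        k,v = item.strip().split('=',1)  # split on the first '=' only; values may contain '='
        self.browser.add_cookie({'name':k,'value':v})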

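2. Crawling the data

This part has two jobs:

Scrape the data on the current page
Turn to the next page

(1) Scrape the page data

The start page already contains the job information we need, which can be matched with XPath:

Link: //a[@class="position_link"]
Position: //a[@class="position_link"]/h3
City: //a[@class="position_link"]/span/em
Monthly salary, experience and education: //div[@class="p_bot"]/div[@class="li_b_l"]
Company name: //div[@class="company_name"]/a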

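Here we use try - except - else to handle exceptions and keep the program robust: the except branches deal with failed lookups, and the else branch runs only when every lookup succeeded.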
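The "monthly salary, experience and education" element holds all three values in one string, so the text has to be split after it is scraped. The sketch below shows one way to do that. The function name split_job_info is hypothetical, and the sample text format ("salary experience / education") is an assumption inferred from how the article's code indexes the string, not something confirmed against the live page.

```python
def split_job_info(text):
    """Split '<salary> <experience> / <education>' into its three fields."""
    # Everything before the first '/' is salary plus experience; the rest is education
    left, _, education = text.partition('/')
    # split(None, 1) splits once on whitespace: salary first, the remainder is experience
    salary, experience = left.split(None, 1)
    return salary, experience.strip(), education.strip()

# Sample text in the assumed page format
print(split_job_info('10k-20k 經驗3-5年 / 本科'))
```

Unpacking raises ValueError when the text does not contain both a salary and an experience token, which makes unexpected formats easy to notice.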

(2) Turn the page

We turn pages by simulating a click on the "next page" button.

Again, try - except - else is used to handle exceptions.
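To make the try - except - else flow concrete, here is a small sketch of a bounded retry helper, which is one possible alternative to retrying by recursion. It is not the article's code: StaleElement is a stand-in class used so the example runs without a browser (with selenium it would be StaleElementReferenceException), and the names retry and flaky are invented for illustration.

```python
import time

class StaleElement(Exception):
    """Stand-in for selenium's StaleElementReferenceException (assumption for this sketch)."""

def retry(action, attempts=3, delay=0.0):
    """Call action() until it succeeds, retrying on StaleElement up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            result = action()
        except StaleElement:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)  # give the page time to settle before retrying
        else:
            # Runs only when action() raised nothing
            return result

# Fake action that fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise StaleElement()
    return 'ok'

print(retry(flaky))
```

Capping the number of attempts keeps a page that never settles from retrying forever.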

(3) Complete code for crawling

# Scrape the data on the current page
def parse_page(self):
    try:
        # Links
        link = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]')))
        link = [item.get_attribute('href') for item in link]
        # Positions
        position = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/h3')))
        position = [item.text for item in position]
        # Cities
        city = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/span/em')))
        city = [item.text for item in city]
        # Monthly salary, experience and education (one element holds all three)
        ms_we_eb = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="p_bot"]/div[@class="li_b_l"]')))
        monthly_salary = [item.text.split('/')[0].strip().split(' ')[0] for item in ms_we_eb]
        working_experience = [item.text.split('/')[0].strip().split(' ')[1] for item in ms_we_eb]
        educational_background = [item.text.split('/')[1].strip() for item in ms_we_eb]
        # Company names
        company_name = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="company_name"]/a')))
        company_name = [item.text for item in company_name]
    except TimeoutException:
        self.isEnd = True
    except StaleElementReferenceException:
        time.sleep(3)
        self.parse_page()
    else:
        temp = list(map(lambda a,b,c,d,e,f,g: {'link':a,'position':b,'city':c,'monthly_salary':d,'working_experience':e,'educational_background':f,'company_name':g}, link, position, city, monthly_salary, working_experience, educational_background, company_name))
        self.data.extend(temp)

# Turn to the next page
def turn_page(self):
    try:
        pager_next = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'pager_next')))
    except TimeoutException:
        self.isEnd = True
    else:
        pager_next.click()
        time.sleep(3)

# Crawl the data by calling parse_page and turn_page
def crawl(self):
    count = 0
    while not self.isEnd:
        count += 1
        print('Crawling page ' + str(count) + ' ...')
        self.parse_page()
        self.turn_page()
    print('Crawl finished')

3. Saving the data

Next, we store the data in a JSON file:

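# Save the data to a file
def save(self):
    with open('lagou.json','w',encoding='utf-8') as f:
        # Dump the whole list in one call; dumping item by item writes
        # concatenated objects, which is not a valid JSON document
        json.dump(self.data,f,ensure_ascii=False)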

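Two things to note here:

When calling open(), pass encoding='utf-8' so the Chinese text is written as UTF-8 regardless of the platform default
When calling dump(), pass ensure_ascii=False so Chinese characters are written as-is instead of as \uXXXX escapes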
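The effect of both parameters is easy to check in isolation. The sketch below uses one made-up record and a temporary directory (both assumptions for the example) to compare the two ensure_ascii settings, then writes the list as a single JSON document and reads it back.

```python
import json
import os
import tempfile

records = [{'position': 'Python 工程師', 'city': '北京'}]  # made-up sample data

# ensure_ascii=True (the default) escapes every non-ASCII character
print(json.dumps(records[0]))
# ensure_ascii=False keeps the characters readable
print(json.dumps(records[0], ensure_ascii=False))

# Write the whole list as one JSON document, then read it back
path = os.path.join(tempfile.mkdtemp(), 'lagou.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False)
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded == records)
```

Writing the list in one dump() call is what lets json.load() read the file back in a single step.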

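4. Data visualization

Visualization shows the relationships in the data more directly. From the scraped data we can draw the following 4 bar charts:

Work experience vs. number of postings
Work experience vs. average monthly salary
Education vs. number of postings
Education vs. average monthly salary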
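The numbers behind these charts are per-category counts and average salaries. The sketch below shows one way to compute them without any plotting. The function names salary_midpoint and average_by and the sample records are invented for illustration, and the salary format ('10k-20k') is an assumption. It strips the trailing 'k' and multiplies by 1000, a small variation on replacing 'k' with '000', so that a value such as '1.5k' would also parse.

```python
from collections import defaultdict

def salary_midpoint(text):
    """Convert a range such as '10k-20k' to its midpoint in yuan; None if unparseable."""
    try:
        bounds = [float(part.lower().rstrip('k')) * 1000 for part in text.split('-')]
    except ValueError:
        return None
    return sum(bounds) / len(bounds)

def average_by(records, key):
    """Return {category: (count, average salary)} for postings with a parseable salary."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        mid = salary_midpoint(record['monthly_salary'])
        if mid is None:
            continue  # leave postings without a numeric salary out of the averages
        totals[record[key]] += mid
        counts[record[key]] += 1
    return {k: (counts[k], totals[k] / counts[k]) for k in counts}

# Made-up sample records
sample = [
    {'monthly_salary': '10k-20k', 'educational_background': '本科'},
    {'monthly_salary': '20k-30k', 'educational_background': '本科'},
    {'monthly_salary': '面議', 'educational_background': '碩士'},
]
print(average_by(sample, 'educational_background'))
```

Using defaultdict means a category the code has not seen before is simply added, rather than raising KeyError.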

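We use the matplotlib library for this. One thing to watch out for is that Chinese characters do not render with the default font; the following statement fixes it, assuming the SimHei font is installed: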

plt.rcParams['font.sans-serif'] = ['SimHei']

# Data visualization
def draw(self):
    count_we = {'經驗不限':0,'經驗應屆畢業生':0,'經驗1年以下':0,'經驗1-3年':0,'經驗3-5年':0,'經驗5-10年':0}
    total_we = {'經驗不限':0,'經驗應屆畢業生':0,'經驗1年以下':0,'經驗1-3年':0,'經驗3-5年':0,'經驗5-10年':0}
    count_eb = {'不限':0,'大專':0,'本科':0,'碩士':0,'博士':0}
    total_eb = {'不限':0,'大專':0,'本科':0,'碩士':0,'博士':0}
    for item in self.data:
        count_we[item['working_experience']] += 1
        count_eb[item['educational_background']] += 1
        try:
            li = [float(temp.replace('k','000')) for temp in item['monthly_salary'].split('-')]
            total_we[item['working_experience']] += sum(li) / len(li)
            total_eb[item['educational_background']] += sum(li) / len(li)
        except ValueError:  # salary is not a numeric range; leave it out of the averages
            count_we[item['working_experience']] -= 1
            count_eb[item['educational_background']] -= 1
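    # Use a font that can render the Chinese category labels
    plt.rcParams['font.sans-serif'] = ['SimHei']
    # Work experience vs. number of postings
    plt.title(self.position)
    plt.xlabel('Work experience')
    plt.ylabel('Number of postings')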
    x = ['經驗不限','經驗應屆畢業生','經驗1-3年','經驗3-5年','經驗5-10年']
    y = [count_we[item] for item in x]
    plt.bar(x,y)
    plt.show()
    # Work experience vs. average monthly salary
    plt.title(self.position)
    plt.xlabel('Work experience')
    plt.ylabel('Average monthly salary')
    x = list()
    y = list()
    for item in ['經驗不限','經驗應屆畢業生','經驗1-3年','經驗3-5年','經驗5-10年']:
        if count_we[item] != 0:
            x.append(item)
            y.append(total_we[item]/count_we[item])
    plt.bar(x,y)
    plt.show()
    # Education vs. number of postings
    plt.title(self.position)
    plt.xlabel('Education')
    plt.ylabel('Number of postings')
    x = ['不限','大專','本科','碩士','博士']
    y = [count_eb[item] for item in x]
    plt.bar(x,y)
    plt.show()
    # Education vs. average monthly salary
    plt.title(self.position)
    plt.xlabel('Education')
    plt.ylabel('Average monthly salary')
    x = list()
    y = list()
    for item in ['不限','大專','本科','碩士','博士']:
        if count_eb[item] != 0:
            x.append(item)
            y.append(total_eb[item]/count_eb[item])
    plt.bar(x,y)
    plt.show()

5. Wrapping up

(1) Complete code

That completes the analysis of the whole crawler. The complete code is as follows:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
import urllib.parse
import time
import json
import matplotlib.pyplot as plt

class Lagou:
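    # Initialization
    def init(self):
        self.data = list()
        self.isEnd = False
        opt = webdriver.chrome.options.Options()
        opt.add_argument('--headless')  # replaces set_headless(), which current Selenium no longer has
        self.browser = webdriver.Chrome(options = opt)  # replaces the old chrome_options keyword
        self.wait = WebDriverWait(self.browser,10)
        self.position = input('Enter a position: ')
        self.browser.get('https://www.lagou.com/jobs/list_' + urllib.parse.quote(self.position) + '?labelWords=&fromSearch=true&suginput=')
        cookie = input('Enter the cookie string: ')
        for item in cookie.split(';'):
            if '=' not in item:
                continue  # skip empty fragments such as a trailing ';'
            k,v = item.strip().split('=',1)  # split on the first '=' only; values may contain '='
            self.browser.add_cookie({'name':k,'value':v})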

    # Scrape the data on the current page
    def parse_page(self):
        try:
            link = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]')))
            link = [item.get_attribute('href') for item in link]
            position = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/h3')))
            position = [item.text for item in position]
            city = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/span/em')))
            city = [item.text for item in city]
            ms_we_eb = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="p_bot"]/div[@class="li_b_l"]')))
            monthly_salary = [item.text.split('/')[0].strip().split(' ')[0] for item in ms_we_eb]
            working_experience = [item.text.split('/')[0].strip().split(' ')[1] for item in ms_we_eb]
            educational_background = [item.text.split('/')[1].strip() for item in ms_we_eb]
            company_name = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="company_name"]/a')))
            company_name = [item.text for item in company_name]
        except TimeoutException:
            self.isEnd = True
        except StaleElementReferenceException:
            time.sleep(3)
            self.parse_page()
        else:
            temp = list(map(lambda a,b,c,d,e,f,g: {'link':a,'position':b,'city':c,'monthly_salary':d,'working_experience':e,'educational_background':f,'company_name':g}, link, position, city, monthly_salary, working_experience, educational_background, company_name))
            self.data.extend(temp)

    # Turn to the next page
    def turn_page(self):
        try:
            pager_next = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'pager_next')))
        except TimeoutException:
            self.isEnd = True
        else:
            pager_next.click()
            time.sleep(3)

    # Crawl the data by calling parse_page and turn_page
    def crawl(self):
        count = 0
        while not self.isEnd:
            count += 1
            print('Crawling page ' + str(count) + ' ...')
            self.parse_page()
            self.turn_page()
        print('Crawl finished')

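    # Save the data to a file
    def save(self):
        with open('lagou.json','w',encoding='utf-8') as f:
            # Dump the whole list in one call; dumping item by item writes
            # concatenated objects, which is not a valid JSON document
            json.dump(self.data,f,ensure_ascii=False)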
    
    # Data visualization
    def draw(self):
        count_we = {'經驗不限':0,'經驗應屆畢業生':0,'經驗1年以下':0,'經驗1-3年':0,'經驗3-5年':0,'經驗5-10年':0}
        total_we = {'經驗不限':0,'經驗應屆畢業生':0,'經驗1年以下':0,'經驗1-3年':0,'經驗3-5年':0,'經驗5-10年':0}
        count_eb = {'不限':0,'大專':0,'本科':0,'碩士':0,'博士':0}
        total_eb = {'不限':0,'大專':0,'本科':0,'碩士':0,'博士':0}
        for item in self.data:
            count_we[item['working_experience']] += 1
            count_eb[item['educational_background']] += 1
            try:
                li = [float(temp.replace('k','000')) for temp in item['monthly_salary'].split('-')]
                total_we[item['working_experience']] += sum(li) / len(li)
                total_eb[item['educational_background']] += sum(li) / len(li)
            except ValueError:  # salary is not a numeric range; leave it out of the averages
                count_we[item['working_experience']] -= 1
                count_eb[item['educational_background']] -= 1
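        # Use a font that can render the Chinese category labels
        plt.rcParams['font.sans-serif'] = ['SimHei']
        # Work experience vs. number of postings
        plt.title(self.position)
        plt.xlabel('Work experience')
        plt.ylabel('Number of postings')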
        x = ['經驗不限','經驗應屆畢業生','經驗1-3年','經驗3-5年','經驗5-10年']
        y = [count_we[item] for item in x]
        plt.bar(x,y)
        plt.show()
        # Work experience vs. average monthly salary
        plt.title(self.position)
        plt.xlabel('Work experience')
        plt.ylabel('Average monthly salary')
        x = list()
        y = list()
        for item in ['經驗不限','經驗應屆畢業生','經驗1-3年','經驗3-5年','經驗5-10年']:
            if count_we[item] != 0:
                x.append(item)
                y.append(total_we[item]/count_we[item])
        plt.bar(x,y)
        plt.show()
        # Education vs. number of postings
        plt.title(self.position)
        plt.xlabel('Education')
        plt.ylabel('Number of postings')
        x = ['不限','大專','本科','碩士','博士']
        y = [count_eb[item] for item in x]
        plt.bar(x,y)
        plt.show()
        # Education vs. average monthly salary
        plt.title(self.position)
        plt.xlabel('Education')
        plt.ylabel('Average monthly salary')
        x = list()
        y = list()
        for item in ['不限','大專','本科','碩士','博士']:
            if count_eb[item] != 0:
                x.append(item)
                y.append(total_eb[item]/count_eb[item])
        plt.bar(x,y)
        plt.show()

if __name__ == '__main__':
    obj = Lagou()
    obj.init()
    obj.crawl()
    obj.save()
    obj.draw()