Table of Contents
5.2 Complete code for crawling the job-detail pages
5.3 Complete code for analysing the job data and producing pie and bar charts
5.4 Word cloud generated from the job-description TXT file
5.5 Complete code for merging multiple CSV files into one Excel workbook
6.4 Word cloud obtained from the job-description TXT for the keyword "C語言"
6.5 Excel workbook obtained by merging the CSVs for the keyword "C語言"
1. Features
1.1 Data crawling: incrementally crawls recruitment data from 51job (前程無憂). Given a job title or company name, it outputs the number of postings, average salary, skill requirements, job duties, and so on.
1.2 Data analysis and visualisation: simple statistics on work city, education requirement, experience requirement and salary, written out as CSV and TXT files. Bar and pie charts are drawn from the CSV data and saved as local images, and a word cloud is generated from the job-requirement text.
1.3 Data storage: crawled data is saved as CSV in a new folder named after the search keyword; all analysis CSVs are then merged into one Excel workbook, and the main data is also written to a MySQL database.
2. Libraries used
# Libraries for the search-page crawler
import requests             # simulate POST/GET requests
from xpinyin import Pinyin  # convert Chinese to pinyin
import pprint               # pretty-printing module
import os                   # file and directory operations
import re                   # regular expressions (built in)
import json
import csv                  # CSV output
import pymysql              # MySQL database
import time                 # time utilities

# Libraries for the job-detail page crawler
import requests             # simulate POST/GET requests
import time
import parsel               # parsing module with CSS and XPath selectors
import pandas as pd         # data structures and analysis tools

# Libraries for the word-cloud analysis
import re                        # regular expressions
import matplotlib.pyplot as plt  # plotting
import collections               # word-frequency counting
import numpy as np               # numpy data processing
from PIL import Image            # image processing
import jieba                     # jieba Chinese word segmentation
import wordcloud                 # word-cloud rendering

# Libraries for plotting
import pandas as pd
import matplotlib.pyplot as plt
3. Design logic
3.1 Crawl the pages in a loop and analyse them
3.2 Analyse the job data and output charts
3.3 Merge the CSVs into one Excel workbook
4. Code walkthrough
4.1 Crawling and parsing the pages
4.1.1 Define the URL function that concatenates and returns the URL
This function builds the URL automatically from the search keyword data and the page number i. The target site is 51job; for example, page 1 of a search produces:
https://search.51job.com/list/000000,000000,0000,00,9,99,+,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=
(the + marks where the keyword is inserted). The URL is concatenated from the following pieces:
url1: https://search.51job.com/list/000000,000000,0000,00,9,99
data: the search keyword, e.g. 大數據 or python
url2: ,2,  (a request constant; left unchanged)
url_yema: the requested page number
url3: .html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=  (the remaining filters such as work city and salary range; left unchanged)
The code is as follows:
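The url_get function, as it appears in the complete listing in §5.1:

def url_get(data, i):  # build the 51job search URL for keyword `data`, page `i`
    url_1 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
    url_2 = ',2,'
    url_yema = str(i)  # page number
    url_3 = '.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&' \
            'jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    return url_1 + data + url_2 + url_yema + url_3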
4.1.2 Define data and the lists and variables needed for the analysis
The code is as follows:
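Condensed from the complete listing in §5.1 (the original writes each assignment on its own line):

data = str(input("請選擇爬取職位或者公司: "))
xinzi_1, xinzi_2 = [], []          # salary figures (lower and upper bound strings)
xinzi_3, xinzi_4 = [], []          # unit factors: 10000/1000 (萬/千) and 1/12 (月/年)
citf_l, exp_l, edu_l = [], [], []  # city, experience and education lists
count_people = 0                   # valid postings crawled
count_xinzi = 0                    # valid salary entries
xinzi_count1 = xinzi_count2 = float(0)  # salary accumulators
citf_d, exp_d, edu_d = {}, {}, {}  # tally dictionaries
work_gs = []                       # company names with valid salaries
work_href = []                     # detail-page URLs for the second crawl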
4.1.3 Create the data folder alongside the script
First check whether the folder already exists and create it if not. Note that path here is actually an absolute path, so adjust it to your own location.
The code is as follows:
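From §5.1 (the absolute path is the author's; replace it with your own):

import os

path = "D:/學習/東軟/大三上期/腳本語言開發(陳漢斌)/yunxingDaiMa/招聘/" + data
if not os.path.exists(path):  # create the folder only if it does not exist yet
    os.mkdir(path)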
4.1.4 Open the CSV file and write the header row
Here ANSI is the encoding Excel expects when opening a CSV, while UTF-8 is used for the TXT output. If the encodings are unfamiliar, see: https://www.runoob.com/python/python-func-open.html
The code is as follows:
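From §5.1:

import csv

f_file = data + "/" + "招聘_" + data + ".csv"
f = open(f_file, mode='w+', encoding='ANSI', newline='')  # ANSI so Excel reads the CSV correctly
csv_writer = csv.DictWriter(f, ['公司名稱', '職位名稱', '城市地區', '經驗要求', '學歷要求',
                                '招聘人數', '職位福利', '職位薪資', '職位發布日期', '職位詳情網頁'],
                            dialect='excel')
csv_writer.writeheader()  # write the header row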
4.1.5 Build the database table name from the input data
A MySQL table name must be ASCII, not Chinese, so the is_Chinese function checks whether the input data contains Chinese. If it does, the name is built from the pinyin initials; otherwise the name is simply 'job_zhiwei_' + data. For example, 大數據 yields job_zhiwei_DSJ, while python yields job_zhiwei_python. Inputs containing characters such as / * | \ raise an error and also return no search results. C++ is a valid search keyword on the site but not a valid table name; as an exercise you could map C++ to C__ to keep such inputs usable.
With the input 阿里巴巴, the resulting table name looks like this:

This function is adapted from: https://blog.csdn.net/Kobe123brant/article/details/110326353
The code is as follows:
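From §5.1:

from xpinyin import Pinyin

def is_Chinese(word):  # True if any character falls in the CJK unified range
    for ch in word:
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False

if is_Chinese(data):
    s = Pinyin().get_pinyin(data).split('-')
    result = ''.join([i[0].upper() for i in s])  # initials, e.g. 大數據 -> DSJ
    a = 'job_zhiwei_' + result
else:
    a = 'job_zhiwei_' + data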
4.1.6 Connect to the database and create the table schema
The code creates a different table name for each input data. host is the local database, usually 127.0.0.1 or localhost; port is the port; database selects the database; charset='utf8' sets the character encoding to UTF-8.
The table schema after creation looks like this:
The code is as follows:
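From §5.1 (a is the table name built in §4.1.5; the column list matches the CSV header above):

import pymysql

conn = pymysql.connect(host="localhost", port=3306, user="root", password="",
                       database="new_wen", charset='utf8')
cursor = conn.cursor()  # database cursor
sql_create = """create table %s (
    id int(4) NOT NULL AUTO_INCREMENT,
    company_name varchar(100) not null,
    job_name varchar(100) not null,
    job_area varchar(50) not null,
    job_exp varchar(20) not null,
    job_edu varchar(20) not null,
    recruiting_numbers varchar(20) not null,
    job_weal varchar(200) null,
    job_pay varchar(50) null,
    job_release_date varchar(20) not null,
    job_webpage varchar(500) not null,
    PRIMARY KEY (id)
)""" % (a)
cursor.execute(sql_create)  # create the table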
4.1.7 Crawl and parse the data
The raw data, response.text, is the text body of the HTTP response.
response.text looks like this:

Parsing response.text directly with re yields html_data, a long string containing exactly the recruitment data we need.
It looks like this:

The string is then converted into a JSON dictionary and parsed as key-value pairs. Each company's posting is one index entry, extracted in a loop. pprint is used during testing to pretty-print the parsed JSON in an indented, readable form.
json_data looks like this:
index is one element of the parsed JSON, taken by the for loop.
A single index looks like this:

Finally, each index is parsed into the dictionary dit, which is written to the CSV file and the database; the valid fields are also appended to the salary, region, education and experience lists for the later analysis, giving us the cleaned, usable data.
dit looks like this:
The reference code is as follows:
for i in range(1, 21):  # range(1, 31) would loop over pages 1-30
    url = url_get(data, i)  # build the page URL
    # one cookie is used for 8 pages
    if i % 8 == 1:
        cookie_i1 += 1
        print("cookie={}".format(cookie_i1))
    if cookie_i1 == 2:
        break
    else:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.20 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate, br',
            'Cookie': cookie_j1[cookie_i1]
        }
        time.sleep(5)
        response = session.get(url=url, headers=headers)  # send the GET request
        print("********************正在爬取第{}頁********************".format(i))
        # for a dynamic page, response.json() would return the JSON directly;
        # response.content gives the binary body (images, video, audio)
        # print(response.text)  # the raw response text
        html_data = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', response.text)[0]
        json_data = json.loads(html_data)  # string -> JSON dictionary
        # pprint.pprint(json_data)  # pretty-print while debugging
        search_result = json_data['engine_jds']  # the list of postings
        for index in search_result:
            # pprint.pprint(index)
            company_name = index["company_name"]  # company name
            title = index["job_name"]             # job title
            info_list = index["attribute_text"]   # basics, e.g. ['成都', '3-4年經驗', '本科', '招1人']
            if len(info_list) == 4:  # keep only postings with complete basic info
                citf = info_list[0]    # city
                exp = info_list[1]     # experience requirement
                edu = info_list[2]     # education requirement
                people = info_list[3]  # number of openings
                I = info_list[0][0:2]  # region prefix
                citf_l = citf_l + [I]
                exp_l = exp_l + [exp]
                edu_l = edu_l + [edu]
                count_people = count_people + 1  # count valid postings
                jobwelf = index["jobwelf"]            # benefits
                money = index["providesalary_text"]   # salary, e.g. 4-4.5千/月, 1.5-2.8萬/月, 15-20萬/年
                updatedate = index["updatedate"]      # posting date
                job_href = index["job_href"]          # detail-page URL
                work_href = work_href + [job_href]
                # record dict (keys must be tuples, numbers or strings)
                dit = {
                    '公司名稱': company_name, '職位名稱': title,
                    '城市地區': citf, '經驗要求': exp,
                    '學歷要求': edu, '招聘人數': people,
                    '職位福利': jobwelf, '職位薪資': money,
                    '職位發布日期': updatedate, '職位詳情網頁': job_href,
                }
                csv_writer.writerow(dit)
                sql_insert = "insert into %s (company_name,job_name,job_area,job_exp,job_edu," \
                             "recruiting_numbers,job_weal,job_pay,job_release_date,job_webpage) " \
                             "values('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')" % (a,
                             company_name, title, citf, exp, edu, people, jobwelf, money, updatedate, job_href)
                cursor.execute(sql_insert)
                conn.commit()  # commit, otherwise nothing is inserted
                print(dit)
4.2 Data storage
The data is stored both as CSV files and in MySQL.
CSV output for the keyword "C語言":

Database contents for the keyword "阿里巴巴":

The storage code is part of the crawl loop in §4.1.7: csv_writer.writerow(dit) appends each record to the CSV, and cursor.execute(sql_insert) followed by conn.commit() inserts it into MySQL.
4.3 Data analysis
4.3.1 Salary analysis
(1) Split the salary strings into four lists
Several salary formats interfere with the calculation: open ranges such as 1.5千以下/月 or 100萬以上/年 are hard to compute with, and empty salaries, hourly wages and daily wages are out of scope, so all of these are excluded. This code runs inside the loop, only for postings whose basic salary information is complete.
The four lists produced by the split look like this:
The reference code is as follows:
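A hypothetical parse_salary helper that isolates the splitting logic from the crawl loop in §5.1 (the original inlines this code rather than defining a function):

def parse_salary(money):
    """Split a salary string such as '1.5-2.8萬/月' or '15-20萬/年'.
    Returns (low, high, unit, period) or None for excluded entries."""
    # exclude empty salaries, open ranges like 1.5千以下/月 or 100萬以上/年,
    # and daily/hourly wages
    if money == '' or money[money.find("/") - 2:-3:1] == "以" \
            or money[-1:] == "天" or money[-1:] == "時":
        return None
    a1 = money.find("-")
    a2 = money.find("/")
    low = money[0:a1]        # first figure, e.g. '1.5'
    high = money[a1 + 1:-3]  # second figure, e.g. '2.8'
    unit = 10000.0 if money[a2 - 1:-1] == "萬/" else 1000.0  # 萬 vs 千
    period = 12.0 if money[-1:] == "年" else 1.0             # per year vs per month
    return low, high, unit, period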
(2) Average salary analysis
From the four lists produced by the split and count_xinzi (the number of valid salary entries), the two ends of the average salary range are computed.
The computed average salary looks like this:

The reference code is as follows:
# average salary analysis
xinzi_5 = []  # per-posting salaries in K, written to a CSV for further analysis
xinzi_6 = []
num_list1 = [float(i) for i in xinzi_1]  # convert the salary strings to floats
num_list2 = [float(i) for i in xinzi_2]
for i in range(count_xinzi):
    j1 = num_list1[i] * xinzi_3[i] / xinzi_4[i]
    j2 = num_list2[i] * xinzi_3[i] / xinzi_4[i]
    xinzi_5 += [round(j1 / 1000, 1)]  # express in K with one decimal, e.g. 6.2K -> 6.2
    xinzi_6 += [round(j2 / 1000, 1)]
    xinzi_count1 = xinzi_count1 + j1
    xinzi_count2 = xinzi_count2 + j2
average_wage1 = xinzi_count1 / count_xinzi
average_wage2 = xinzi_count2 / count_xinzi
f_fxinzi = data + "/" + data + "_" + "薪資" + "分析表.csv"
f = open(f_fxinzi, mode='w+', encoding='ANSI', newline='')
csv_writer = csv.DictWriter(f, ['公司', '薪資數據1', '薪資數據2', '中位薪資'], dialect='excel')
csv_writer.writeheader()  # write the header row
for i in range(count_xinzi):
    xinzi_zw = round((xinzi_5[i] + xinzi_6[i]) / 2, 1)
    dit3 = {
        '公司': work_gs[i],
        '薪資數據1': xinzi_5[i],
        '薪資數據2': xinzi_6[i],
        '中位薪資': xinzi_zw,
    }
    csv_writer.writerow(dit3)
# print the computed averages
print("本次爬取有效職位共{}個,爬取有效薪資的數據為{}個,該職位或該公司的平均薪資為{}-{}元/月".
      format(count_people, count_xinzi, round(average_wage1, 0), round(average_wage2, 0)))
4.3.2 Analysing region, experience and education requirements
(1) Define the dictionary-analysis function:
The function takes five parameters, l, d, k, v and z: the list, the dictionary, the headers of the first and second CSV columns, and the variable z used in the CSV file name. It first opens the file and writes the header, then tallies the previously filled list into a dictionary, and finally iterates over the dictionary in descending order of count, writing each pair to the CSV. After the definition, dict_analyse is called once per metric to write the analysis to CSV.
(2) Call dict_analyse
After tallying the lists into dictionaries, the results look like this:
Experience requirements:
Education requirements:

Region distribution:

The function-definition reference code is as follows:
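As defined in §5.1 (it relies on the global keyword data and the module-level csv import):

def dict_analyse(l, d, k, v, z):  # l: list, d: dict, k/v: column headers, z: file-name part
    f_dqfb_file = data + "/" + data + "_" + z + "分析表.csv"
    f = open(f_dqfb_file, mode='w+', encoding='ANSI', newline='')
    csv_writer = csv.DictWriter(f, [k, v], dialect='excel')
    csv_writer.writeheader()
    for i in l:  # tally the list into the dict
        d[i] = d.get(i, 0) + 1
    for i, j in sorted(d.items(), key=lambda x: x[1], reverse=True):  # descending by count
        csv_writer.writerow({k: i, v: j})

dict_analyse(citf_l, citf_d, k="招聘地區", v="招聘數量", z="地區")
dict_analyse(exp_l, exp_d, k="經驗要求", v="數量", z="經驗")
dict_analyse(edu_l, edu_d, k="學歷要求", v="數量", z="學歷")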
4.4 Drawing bar and pie charts from the saved analysis CSVs
Results for the keyword "數據爬取":
Output files:

Salary pie chart:

Region, education and experience pie charts:

Region, education and experience bar charts:

The complete reference code for this sub-project is as follows:
import pandas as pd
import matplotlib.pyplot as plt
# from 前程無憂_單網頁爬取 import data

# draw bar and pie charts from the analysis CSVs
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False

def ShowWorkArea(data, i, k, x_axle, y_axle):
    f = data + '/' + data + '_' + k + '分析表.csv'
    work = pd.read_csv(f, encoding='gbk', nrows=i)  # i = number of rows to plot
    # print(work.values)
    list1 = []
    list2 = []
    for o in work.values:
        list1 += [o[0]]
        list2 += [o[1]]
    # pie chart: data, labels, percentages with one decimal place
    plt.pie(x=list2, labels=list1, autopct='%1.1f%%')
    plt.title(data + '_' + k + "分析扇形圖")
    # absolute save path; change it to your own location
    plt.savefig("D:/學習/東軟/大三上期/腳本語言開發(陳漢斌)/yunxingDaiMa/招聘/" + data + '/' + data + '_' + k + "分析扇形圖.png")
    plt.show()
    # bar chart: x axis is the category, y axis the count, sorted descending
    work.sort_values(by=y_axle, inplace=True, ascending=False)
    work.plot.bar(x=x_axle, y=y_axle, color='red')  # uniform red bars
    plt.xticks(rotation=360)  # keep the x labels horizontal (360° = no rotation)
    plt.title(data + '_' + k + "分析柱形圖")
    plt.savefig("D:/學習/東軟/大三上期/腳本語言開發(陳漢斌)/yunxingDaiMa/招聘/" + data + '/' + data + '_' + k + "分析柱形圖.png")
    plt.show()

def xinzi_analyze(data, k):
    f = data + '/' + data + '_' + k + '分析表.csv'
    work = pd.read_csv(f, encoding='gbk')
    salaries = []
    for o in work.values:
        salaries += [o[3]]  # the median-salary column
    d = (max(salaries) - min(salaries)) / 10  # bin width for 10 groups
    ls = [0] * 10  # frequency of each group
    # ls2 holds the 11 bin edges; adjacent pairs form the 10 labels in ls3
    ls2 = [min(salaries) + d * i for i in range(11)]
    ls3 = []
    for i in range(10):
        ls3 += [str(round(ls2[i], 1)) + "k-" + str(round(ls2[i + 1], 1)) + "k"]
    for v in salaries:  # count each value into its bin
        for i in range(10):
            if v <= min(salaries) + d * (i + 1):
                ls[i] += 1
                break
    plt.pie(x=ls, labels=ls3, autopct='%1.1f%%')
    plt.axis("equal")  # draw a true circle instead of the default ellipse
    plt.title(data + '_薪資分布扇形圖')
    plt.savefig("D:/學習/東軟/大三上期/腳本語言開發(陳漢斌)/yunxingDaiMa/招聘/" + data + '/' + data + "_薪資分析扇形圖.png")
    plt.show()

data = str(input("請輸入查詢的關鍵詞:"))
# data = '人工智能'
# i = int(input("請輸入查詢的數量:"))
ShowWorkArea(data, 10, k="地區", x_axle="招聘地區", y_axle="招聘數量")  # region distribution
ShowWorkArea(data, 10, k="經驗", x_axle="經驗要求", y_axle="數量")      # experience distribution
ShowWorkArea(data, 10, k="學歷", x_axle="學歷要求", y_axle="數量")      # education distribution
xinzi_analyze(data, k="薪資")
print("職位數據分析_畫圖運行成功!")
4.5 Merging the CSV files in one folder into a single Excel workbook
Merged result for the keyword "工程管理":
The complete code for this sub-project is listed in §5.5; a sketch of the approach follows.
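A minimal sketch of the merging step, not the original listing: it assumes pandas with openpyxl installed and that every analysis CSV sits in the keyword-named folder; the output file name and the sheet-naming rule are illustrative.

import os
import pandas as pd

data = "工程管理"  # keyword folder, for illustration
with pd.ExcelWriter(data + "/" + data + "_匯總.xlsx", engine="openpyxl") as writer:
    for name in os.listdir(data):
        if name.endswith(".csv"):
            df = pd.read_csv(os.path.join(data, name), encoding="gbk")
            # one sheet per CSV; Excel caps sheet names at 31 characters
            df.to_excel(writer, sheet_name=name[:-4][:31], index=False)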
4.6 Crawling job requirements and duties from the collected URLs
Each detail-page URL collected earlier is requested again, and the job requirements, duties and related details are extracted from the detail page.
Note: because the site has anti-scraping measures, this part of the code is not yet fully polished. Cookies must be added by hand, and a single run can fetch roughly 6-15 detail pages; this will be improved later.
Job requirements and duties crawled for the keyword "工程管理":
The complete reference code for this sub-project is identical to the listing in §5.2 below.
5. Complete code
5.1 Complete code for the main-page crawler
The code is as follows:
import requests             # simulate POST/GET requests
from xpinyin import Pinyin  # convert Chinese to pinyin
import pprint               # pretty-printing module (used while debugging)
import os                   # file and directory operations
import re                   # regular expressions (built in)
import json
import csv                  # CSV output
import pymysql              # MySQL database
import time                 # time utilities

session = requests.session()  # share one session across all requests

def url_get(data, i):  # build the search URL
    url_1 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
    url_2 = ',2,'
    url_yema = str(i)
    url_3 = '.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&' \
            'jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    url = url_1 + data + url_2 + url_yema + url_3
    return url

# main script
data = str(input("請選擇爬取職位或者公司: "))
xinzi_1 = []  # salary list: first figure
xinzi_2 = []  # salary list: second figure
xinzi_3 = []  # salary list: unit factor, 10000 (萬) or 1000 (千)
xinzi_4 = []  # salary list: period factor, 1 (per month) or 12 (per year)
citf_l = []   # city list
exp_l = []    # experience requirements
edu_l = []    # education requirements
count_people = 0  # total valid postings crawled
count_xinzi = 0   # total valid salary entries
xinzi_count1 = float(0)  # accumulator for the lower figures (e.g. the 0.6 in 0.6-1萬/月)
xinzi_count2 = float(0)  # accumulator for the upper figures (e.g. the 1 in 0.6-1萬/月)
citf_d = {}  # region tally dictionary
exp_d = {}   # experience tally dictionary
edu_d = {}   # education tally dictionary
work_gs = []
work_href = []  # detail-page URLs for the second crawl
f_file = data + "/" + "招聘_" + data + ".csv"  # output CSV named after the keyword

# create the data folder
path = "D:/學習/東軟/大三上期/腳本語言開發(陳漢斌)/yunxingDaiMa/招聘/" + data
if not os.path.exists(path):
    os.mkdir(path)

f = open(f_file, mode='w+', encoding='ANSI', newline='')
# ANSI is the CSV encoding Excel expects; UTF-8 is used for the TXT output
# see https://www.runoob.com/python/python-func-open.html
csv_writer = csv.DictWriter(f, ['公司名稱', '職位名稱', '城市地區', '經驗要求', '學歷要求',
                                '招聘人數', '職位福利', '職位薪資', '職位發布日期', '職位詳情網頁'],
                            dialect='excel')
csv_writer.writeheader()  # write the header row

# If the input is Chinese, the table name a is 'job_zhiwei_' + pinyin initials,
# otherwise 'job_zhiwei_' + data.
# Reference: https://blog.csdn.net/Kobe123brant/article/details/110326353
# Delete the code below if you do not need the database.
def is_Chinese(word):
    for ch in word:
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False

if is_Chinese(data):
    s = Pinyin().get_pinyin(data).split('-')
    result = ''.join([i[0].upper() for i in s])  # initials, e.g. 大數據 -> DSJ
    a = 'job_zhiwei_' + result
else:
    a = 'job_zhiwei_' + data
# print(a)  # the table name

# connect to the database
conn = pymysql.connect(host="localhost", port=3306, user="root", password="",
                       database="new_wen", charset='utf8')
cursor = conn.cursor()  # database cursor
# create table job_zhiwei_<data> (id, company name, job name, area, experience,
# education, openings, benefits, pay, release date, detail URL)
sql_create = """create table %s (
    id int(4) NOT NULL AUTO_INCREMENT,
    company_name varchar(100) not null,
    job_name varchar(100) not null,
    job_area varchar(50) not null,
    job_exp varchar(20) not null,
    job_edu varchar(20) not null,
    recruiting_numbers varchar(20) not null,
    job_weal varchar(200) null,
    job_pay varchar(50) null,
    job_release_date varchar(20) not null,
    job_webpage varchar(500) not null,
    PRIMARY KEY (id)
)""" % (a)
cursor.execute(sql_create)  # create the table

cookie_i1 = -1
# five cookie strings must be pasted in by hand; the original post hard-coded the
# author's own (now expired) session cookies here — copy fresh ones from a
# logged-in browser session
cookie_j1 = [
    '<cookie string 1>',
    '<cookie string 2>',
    '',
    '',
    ''
]

for i in range(1, 21):  # range(1, 31) would loop over pages 1-30
    url = url_get(data, i)  # build the page URL
    # one cookie is used for 8 pages
    if i % 8 == 1:
        cookie_i1 += 1
        print("cookie={}".format(cookie_i1))
    if cookie_i1 == 2:
        break
    else:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.20 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate, br',
            'Cookie': cookie_j1[cookie_i1]
        }
        time.sleep(5)
        response = session.get(url=url, headers=headers)  # send the GET request
        print("********************正在爬取第{}頁********************".format(i))
        # print(response.text)  # the raw response text
        html_data = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', response.text)[0]
        json_data = json.loads(html_data)  # string -> JSON dictionary
        # pprint.pprint(json_data)  # pretty-print while debugging
        search_result = json_data['engine_jds']  # the list of postings
        for index in search_result:
            # pprint.pprint(index)
            company_name = index["company_name"]  # company name
            title = index["job_name"]             # job title
            info_list = index["attribute_text"]   # basics, e.g. ['成都', '3-4年經驗', '本科', '招1人']
            if len(info_list) == 4:  # keep only postings with complete basic info
                citf = info_list[0]    # city
                exp = info_list[1]     # experience requirement
                edu = info_list[2]     # education requirement
                people = info_list[3]  # number of openings
                I = info_list[0][0:2]  # region prefix
                citf_l = citf_l + [I]
                exp_l = exp_l + [exp]
                edu_l = edu_l + [edu]
                count_people = count_people + 1  # count valid postings
                jobwelf = index["jobwelf"]            # benefits
                money = index["providesalary_text"]   # salary, e.g. 4-4.5千/月, 1.5-2.8萬/月, 15-20萬/年
                updatedate = index["updatedate"]      # posting date
                job_href = index["job_href"]          # detail-page URL
                work_href = work_href + [job_href]
                # record dict (keys must be tuples, numbers or strings)
                dit = {
                    '公司名稱': company_name, '職位名稱': title,
                    '城市地區': citf, '經驗要求': exp,
                    '學歷要求': edu, '招聘人數': people,
                    '職位福利': jobwelf, '職位薪資': money,
                    '職位發布日期': updatedate, '職位詳情網頁': job_href,
                }
                csv_writer.writerow(dit)
                sql_insert = "insert into %s (company_name,job_name,job_area,job_exp,job_edu," \
                             "recruiting_numbers,job_weal,job_pay,job_release_date,job_webpage) " \
                             "values('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')" % (a,
                             company_name, title, citf, exp, edu, people, jobwelf, money, updatedate, job_href)
                cursor.execute(sql_insert)
                conn.commit()  # commit, otherwise nothing is inserted
                print(dit)
                ### salary analysis
                if money == '' or money[money.find("/") - 2:-3:1] == "以" \
                        or [money[-1:]] == ["天"] or [money[-1:]] == ["時"]:
                    pass  # skip open ranges like 1.5千以下/月 or 100萬以上/年, and daily/hourly pay
                else:
                    count_xinzi += 1
                    a1 = money.find("-")
                    a2 = money.find("/")
                    work_gs += [company_name]
                    xinzi_1 = xinzi_1 + [money[0:a1]]       # first figure
                    xinzi_2 = xinzi_2 + [money[a1 + 1:-3]]  # second figure
                    if [money[a2 - 1:-1]] == ["萬/"]:  # unit 萬 vs 千
                        d1 = float(10000)
                    else:
                        d1 = float(1000)
                    if [money[-1:]] == ["月"]:  # per month vs per year
                        d2 = float(1)
                    elif [money[-1:]] == ["年"]:
                        d2 = float(12)
                    xinzi_3 = xinzi_3 + [d1]  # 10000 for 萬, 1000 for 千
                    xinzi_4 = xinzi_4 + [d2]  # 1 for month, 12 for year
            # break  # single-pass debugging

cursor.close()  # close the database connection
conn.close()

def dict_analyse(l, d, k, v, z):  # tally list l into dict d and write it to a CSV
    f_dqfb_file = data + "/" + data + "_" + z + "分析表.csv"
    f = open(f_dqfb_file, mode='w+', encoding='ANSI', newline='')  # ANSI for Excel, UTF-8 for TXT
    csv_writer = csv.DictWriter(f, [k, v], dialect='excel')
    csv_writer.writeheader()
    for i in l:  # count occurrences; the result is a dict
        d[i] = d.get(i, 0) + 1
    # print(d)
    for i, j in sorted(d.items(), key=lambda x: x[1], reverse=True):  # descending by count
        dit2 = {k: i, v: j}
        csv_writer.writerow(dit2)

dict_analyse(citf_l, citf_d, k="招聘地區", v="招聘數量", z="地區")
dict_analyse(exp_l, exp_d, k="經驗要求", v="數量", z="經驗")
dict_analyse(edu_l, edu_d, k="學歷要求", v="數量", z="學歷")

# average salary analysis
xinzi_5 = []  # per-posting salaries in K, written to a CSV for further analysis
xinzi_6 = []
num_list1 = [float(i) for i in xinzi_1]  # convert the salary strings to floats
num_list2 = [float(i) for i in xinzi_2]
for i in range(count_xinzi):
    j1 = num_list1[i] * xinzi_3[i] / xinzi_4[i]
    j2 = num_list2[i] * xinzi_3[i] / xinzi_4[i]
    xinzi_5 += [round(j1 / 1000, 1)]  # express in K with one decimal, e.g. 6.2K -> 6.2
    xinzi_6 += [round(j2 / 1000, 1)]
    xinzi_count1 = xinzi_count1 + j1
    xinzi_count2 = xinzi_count2 + j2
average_wage1 = xinzi_count1 / count_xinzi
average_wage2 = xinzi_count2 / count_xinzi
f_fxinzi = data + "/" + data + "_" + "薪資" + "分析表.csv"
f = open(f_fxinzi, mode='w+', encoding='ANSI', newline='')
csv_writer = csv.DictWriter(f, ['公司', '薪資數據1', '薪資數據2', '中位薪資'], dialect='excel')
csv_writer.writeheader()  # write the header row
for i in range(count_xinzi):
    xinzi_zw = round((xinzi_5[i] + xinzi_6[i]) / 2, 1)
    dit3 = {
        '公司': work_gs[i],
        '薪資數據1': xinzi_5[i],
        '薪資數據2': xinzi_6[i],
        '中位薪資': xinzi_zw,
    }
    csv_writer.writerow(dit3)
# print the computed averages
print("本次爬取有效職位共{}個,爬取有效薪資的數據為{}個,該職位或該公司的平均薪資為{}-{}元/月".
      format(count_people, count_xinzi, round(average_wage1, 0), round(average_wage2, 0)))
5.2 Complete code for crawling the job-detail pages
The code is as follows:
import requests      # simulate POST/GET requests
import time
import parsel        # parsing module with CSS and XPath selectors
import pandas as pd  # data structures and analysis tools

session = requests.session()  # share one session across all requests
data = str(input("請輸入爬取的關鍵詞:"))
f = open(data + "/" + "招聘_" + data + ".txt", "w+", encoding="utf-8")
f2 = data + '/招聘_' + data + '.csv'
work = pd.read_csv(f2, encoding='gbk')
work_href = []
for o in work.values:
    work_href += [o[9]]  # build the list of detail-page URLs
print("有效網頁有{}個".format(len(work_href)))

count_i = 1
cookie_i = -1
# cookies must be pasted in by hand; the original post hard-coded the author's
# own (now expired) session cookies here — copy fresh ones from a logged-in
# browser session
cookie_j2 = [
    '<cookie string 1>',
    '<cookie string 2>',
    '<cookie string 3>',
    '<cookie string 4>',
    '<cookie string 5>'
]

for i in work_href:  # one cookie works for at most ~15 requests
    headers2 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400',
        'Accept-Encoding': 'gzip, deflate, br',
        'Cookie': cookie_j2[cookie_i]
    }
    if count_i % 8 == 1:  # count_i % 8 cycles 1,2,...,7,0, so each cookie serves 8 pages
        cookie_i += 1
    if cookie_i == 1:  # stop after the first cookie's batch of postings is written to the TXT
        break
    else:
        print("cookie_i={}".format(cookie_i))
        print('正在爬取第{}頁,爬取網址:{}'.format(count_i, i))  # current URL in the loop
        count_i += 1
        time.sleep(5)
        response = session.get(url=i, headers=headers2)  # send the GET request
        html = response.text.encode('iso-8859-1').decode('gbk')
        # print(html)
        selector = parsel.Selector(html)
        str_list = selector.xpath('//div[@class="bmsg job_msg inbox"]//p/text()').getall()  # job-description paragraphs
        str_info = '\n'.join(str_list)  # join the list into one newline-separated string
        print(str_info)
        print()
        f.write(str_info)
        f.write('\r\n')
print("工作職責寫入成功")
6. Results
6.1 CSVs saved for the keyword "C語言"
6.1.1 Main data CSV
Result:
6.1.2 Region analysis CSV
Result:
6.1.3 Experience-requirement CSV
Result:

6.1.4 Education-requirement CSV
Result:

6.2 Database contents for the keyword "C語言"
Result:
6.3 Analysis charts for the keyword "C語言"
Result:


6.4 Word cloud from the job-description TXT for the keyword "C語言"

6.5 Excel workbook from the merged CSVs for the keyword "C語言"
Result:












