一、Topic Background
Why was this topic chosen? What are the expected goals of the data analysis? (10 points)
A web crawler scrapes listing data from the 51job site (前程無憂), and the scraped data is then cleaned to extract the usable fields, which are analyzed along several dimensions. The analysis covers the companies registered on the site and the full range of human-resource services it offers job seekers, including recruitment, job search, and training, so that the data becomes clear and intuitive.
二、Themed Web Crawler Design Plan (10 points)
1. Crawler name: "51job web crawler with data cleaning and analysis".
2. Content to be scraped and data characteristics:
The crawler obtains data by analyzing the structure of the site's pages. Since 51job is a recruitment and job-search site, the scraped content covers company name, job title, salary, work-experience requirement, education requirement, city, number of openings, company size, and so on. After cleaning, the multi-dimensional data contains no duplicate rows and no null values, which makes it more reliable.
3. Overview of the crawler design:
The implementation takes several steps: fetch the page resource with requests, setting request headers so the site does not identify the client as a crawler; parse the page with lxml's etree; locate the target data; and save it to a CSV file, as sketched below.
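A minimal sketch of that fetch, parse, and save pipeline; the URL and the XPath here are placeholders, not the ones the actual crawler in section 四 uses:

import csv
import requests
from lxml import etree

# Fetch -> parse -> save skeleton; the real crawler adds paging,
# query parameters, and JSON extraction on top of this.
headers = {'User-Agent': 'Mozilla/5.0'}               # pose as a browser
r = requests.get('https://example.com', headers=headers, timeout=30)
html = etree.HTML(r.text)                             # parse the HTML string
rows = html.xpath('//title/text()')                   # placeholder XPath
with open('out.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([[x] for x in rows])      # one CSV row per match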
三、Structural Analysis of the Target Pages (10 points)
Data source: https://search.51job.com
Relevant page source:
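On the search result pages, the job list is embedded as JSON inside a <script type="text/javascript"> tag, assigned to window.__SEARCH_RESULT__; the crawler extracts that assignment instead of scraping visible HTML elements. A small illustration of the extraction on a simplified stand-in page (the sample HTML below is made up; the real payload carries many more fields):

import json
from lxml import etree

# Simplified stand-in for a 51job search result page: the data the
# crawler needs lives in a JS assignment, not in visible HTML.
sample = '''<html><body><script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_search_result":
  [{"company_name": "某公司", "job_name": "Python開發"}]}
</script></body></html>'''

html = etree.HTML(sample)
raw = html.xpath('//script[@type="text/javascript"]/text()')[0]
payload = raw.replace('window.__SEARCH_RESULT__ =', '').strip()
jobs = json.loads(payload)['engine_search_result']
print(jobs[0]['company_name'], jobs[0]['job_name'])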
四、Crawler Program Design (10 points)
1. Data scraping
import csv
import json
import time
from urllib.parse import urlencode

import pandas as pd
import requests
from lxml import etree

# Create the CSV file and write the header row (gbk so Excel on Chinese Windows opens it cleanly)
with open('qcwy.csv', 'a+', encoding='gbk', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['公司', '崗位', '薪資', '福利', '工作經驗', '學歷', '城市', '招聘人數', '公司規模', '公司方向'])

# Loop over result pages (pages 1 through 9)
for page in range(1, 10):
    try:
        url0 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html?'.format(page)
        # Request headers, so the site does not flag the client as a crawler
        headers = {
            'Connection': 'keep-alive',
            'Host': 'search.51job.com',
            # 'Cookie': '...',  # long personal login cookie omitted here
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
        }
        # Fixed query parameters appended to the URL
        params = {
            'lang': 'c',
            'postchannel': '0000',
            'workyear': '99',
            'cotype': '99',
            'degreefrom': '99',
            'jobterm': '99',
            'companysize': '99',
            'ord_field': '0',
            'dibiaoid': '0',
            'line': '',
            'welfare': '',
        }
        # Build the final URL
        url = url0 + urlencode(params)
        print(url)
        # Send the request; raise an error if it takes longer than 30 seconds
        r = requests.get(url, headers=headers, timeout=30)
        # Parse the response text into an HTML tree
        html = etree.HTML(r.text)
        # Locate the <script> tag that carries the search results
        nr = html.xpath('//script[@type="text/javascript"]/text()')[0] \
                 .replace('\n', '').replace('\t', '') \
                 .replace('window.__SEARCH_RESULT__ = ', '')
        # Parse the string as JSON
        datas = json.loads(nr)['engine_search_result']
        # Loop over the job entries and pull out each field
        for sjs in datas:
            # attribute_text holds city / experience / education / headcount; experience or education may be missing
            if len(sjs['attribute_text']) == 4:
                workyear = sjs['attribute_text'][1]
                education = sjs['attribute_text'][2]
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
            else:
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
                test = sjs['attribute_text'][1]
                # Decide whether the remaining field is experience or education
                if '經驗' in test:
                    workyear = test
                    education = '無'
                else:
                    education = test
                    workyear = '無'
            company_name = sjs['company_name']
            job_name = sjs['job_name']
            providesalary_text = sjs['providesalary_text'].replace('\\', '')
            jobwelf = sjs['jobwelf'].replace('\\', '')
            companysize_text = sjs['companysize_text'].replace('\\', '')
            companyind_text = sjs['companyind_text'].replace('\\', '')
            # Fill empty fields with '無'
            if not providesalary_text:
                providesalary_text = '無'
            if not jobwelf:
                jobwelf = '無'
            if not companysize_text:
                companysize_text = '無'
            if not companyind_text:
                companyind_text = '無'
            # Append the record to the CSV file
            with open('qcwy.csv', 'a+', encoding='gbk', newline='') as file:
                writer = csv.writer(file)
                writer.writerow([company_name, job_name, providesalary_text, jobwelf, workyear,
                                 education, city, renshu, companysize_text, companyind_text])
            print(company_name, job_name, providesalary_text, jobwelf, workyear,
                  education, city, renshu, companysize_text, companyind_text)
    # Report any error and move on to the next page
    except Exception as e:
        print(e)
    time.sleep(1)  # pause briefly between pages
    # break

# Read the CSV back and convert it to Excel (to_excel assumes openpyxl is installed)
datas = pd.read_csv('qcwy.csv', encoding='gbk')
datas.to_excel('qcwy.xlsx', index=False)
2. Data cleaning and processing
import pandas as pd

qcwy = pd.DataFrame(pd.read_csv('qcwy.csv', encoding='gbk'))
qcwy.head()

# Drop the 福利 (benefits) column
qcwy.drop('福利', axis=1, inplace=True)
qcwy.head()

# Flag duplicate rows
qcwy.duplicated()

# Drop duplicate rows
qcwy = qcwy.drop_duplicates()
qcwy.head()
qcwy['公司'].isnull().value_counts()
qcwy['崗位'].isnull().value_counts()
qcwy['薪資'].isnull().value_counts()
qcwy['工作經驗'].isnull().value_counts()
qcwy['學歷'].isnull().value_counts()
qcwy['城市'].isnull().value_counts()
qcwy['招聘人數'].isnull().value_counts()
qcwy['公司規模'].isnull().value_counts()
qcwy['公司方向'].isnull().value_counts()
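The nine calls above check the columns for nulls one at a time; a more concise equivalent (a usage note, not part of the original notebook) is:

qcwy.isnull().sum()  # null count for every column in one call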
qcwy['薪資'] = qcwy['薪資'].map(str.strip)   # strip whitespace on both sides
qcwy['薪資'] = qcwy['薪資'].map(str.lstrip)  # strip leading whitespace
qcwy['薪資'] = qcwy['薪資'].map(str.rstrip)  # strip trailing whitespace

(str.strip already removes whitespace on both sides, so the lstrip and rstrip passes are redundant safeguards.)

qcwy.describe()  # summary statistics of the cleaned data
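The two columns plotted below, 招聘人數 and 公司規模, come off the page as text such as 「招2人」 and 「50-150人」. Assuming that raw format (an assumption; skip this step if the columns are already numeric), a numeric version can be derived before plotting:

import re
import pandas as pd

def first_number(text):
    # Pull the first integer out of strings like 「招2人」 or 「50-150人」; NA if none.
    m = re.search(r'\d+', str(text))
    return int(m.group()) if m else pd.NA

qcwy['招聘人數'] = qcwy['招聘人數'].map(first_number)
qcwy['公司規模'] = qcwy['公司規模'].map(first_number)
qcwy = qcwy.dropna(subset=['招聘人數', '公司規模'])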
3. Data visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pie chart of education requirements
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
gw_score = qcwy['學歷'].value_counts()  # count each education level
plt.title("學歷占比圖")  # chart title
plt.pie(gw_score.values, labels=gw_score.index, autopct='%1.1f%%')  # draw the pie
# autopct sets the label format inside the pie; % is Python's string-formatting operator
plt.show()
qcwy = pd.DataFrame(pd.read_csv('qcwy.csv', encoding='gbk'))  # reload the CSV (note: this re-reads the raw, uncleaned file)
sns.distplot(qcwy['招聘人數'])  # distribution of the number of openings

(sns.distplot is deprecated in recent seaborn releases; sns.histplot is the modern equivalent.)

sns.regplot(x='招聘人數', y='公司規模', data=qcwy)  # scatter with fitted regression line
from scipy.optimize import leastsq

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
# Variables to fit
gsgm = qcwy.loc[:, '公司規模']  # company size
zprs = qcwy.loc[:, '招聘人數']  # number of openings

# Quadratic model and its residual function
def func(params, x):
    a, b, c = params
    return a * x * x + b * x + c

def error_func(params, x, y):
    return func(params, x) - y

def main():
    plt.figure(figsize=(8, 6))
    P0 = [1, 9.0, 1]  # initial guess for a, b, c
    Para = leastsq(error_func, P0, args=(gsgm, zprs))
    a, b, c = Para[0]
    print("a=", a, "b=", b, "c=", c)
    # Plot the sample data
    plt.scatter(gsgm, zprs, color="green", label="樣本數據", linewidth=2)
    x = np.linspace(10, 1000, 400)
    y = a * x * x + b * x + c
    # Fitted curve
    plt.plot(x, y, color="red", label="擬合曲線", linewidth=2)
    # Axis labels
    plt.xlabel('公司規模')
    plt.ylabel('招聘人數')
    # Title
    plt.title("公司規模與招聘人數回歸方程")
    plt.grid()
    plt.legend()
    plt.show()

main()
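As a quick sanity check on the leastsq fit, the same quadratic coefficients can be obtained with numpy.polyfit; this is an equivalent alternative, not part of the original program:

coeffs = np.polyfit(gsgm, zprs, deg=2)  # returns [a, b, c], highest degree first
print("a=", coeffs[0], "b=", coeffs[1], "c=", coeffs[2])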
五、Appendix: Complete Program Code
import csv
import json
import time
from urllib.parse import urlencode

import numpy as np
import pandas as pd
import requests
import seaborn as sns
import matplotlib.pyplot as plt
from lxml import etree
from scipy.optimize import leastsq

# ----- 1. Data scraping -----
with open('qcwy.csv', 'a+', encoding='gbk', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['公司', '崗位', '薪資', '福利', '工作經驗', '學歷', '城市', '招聘人數', '公司規模', '公司方向'])

for page in range(1, 10):  # pages 1 through 9
    try:
        url0 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html?'.format(page)
        headers = {
            'Connection': 'keep-alive',
            'Host': 'search.51job.com',
            # 'Cookie': '...',  # long personal login cookie omitted here
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
        }
        params = {
            'lang': 'c', 'postchannel': '0000', 'workyear': '99', 'cotype': '99',
            'degreefrom': '99', 'jobterm': '99', 'companysize': '99',
            'ord_field': '0', 'dibiaoid': '0', 'line': '', 'welfare': '',
        }
        url = url0 + urlencode(params)
        print(url)
        r = requests.get(url, headers=headers, timeout=30)
        html = etree.HTML(r.text)
        # The job list is embedded as JSON in a <script> tag
        nr = html.xpath('//script[@type="text/javascript"]/text()')[0] \
                 .replace('\n', '').replace('\t', '') \
                 .replace('window.__SEARCH_RESULT__ = ', '')
        datas = json.loads(nr)['engine_search_result']
        for sjs in datas:
            if len(sjs['attribute_text']) == 4:
                workyear = sjs['attribute_text'][1]
                education = sjs['attribute_text'][2]
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
            else:
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
                test = sjs['attribute_text'][1]
                if '經驗' in test:  # is the remaining field experience or education?
                    workyear = test
                    education = '無'
                else:
                    education = test
                    workyear = '無'
            company_name = sjs['company_name']
            job_name = sjs['job_name']
            # Empty fields fall back to '無'
            providesalary_text = sjs['providesalary_text'].replace('\\', '') or '無'
            jobwelf = sjs['jobwelf'].replace('\\', '') or '無'
            companysize_text = sjs['companysize_text'].replace('\\', '') or '無'
            companyind_text = sjs['companyind_text'].replace('\\', '') or '無'
            with open('qcwy.csv', 'a+', encoding='gbk', newline='') as file:
                csv.writer(file).writerow([company_name, job_name, providesalary_text, jobwelf, workyear,
                                           education, city, renshu, companysize_text, companyind_text])
            print(company_name, job_name, providesalary_text, jobwelf, workyear,
                  education, city, renshu, companysize_text, companyind_text)
    except Exception as e:
        print(e)
    time.sleep(1)
    # break

datas = pd.read_csv('qcwy.csv', encoding='gbk')  # read back; to_excel assumes openpyxl
datas.to_excel('qcwy.xlsx', index=False)

# ----- 2. Data cleaning -----
qcwy = pd.DataFrame(pd.read_csv('qcwy.csv', encoding='gbk'))
qcwy.head()

qcwy.drop('福利', axis=1, inplace=True)  # drop the benefits column
qcwy.head()

qcwy.duplicated()              # flag duplicate rows
qcwy = qcwy.drop_duplicates()  # drop duplicate rows
qcwy.head()

# Count null values column by column
for col in ['公司', '崗位', '薪資', '工作經驗', '學歷', '城市', '招聘人數', '公司規模', '公司方向']:
    print(qcwy[col].isnull().value_counts())

qcwy['薪資'] = qcwy['薪資'].map(str.strip)   # strip whitespace on both sides
qcwy['薪資'] = qcwy['薪資'].map(str.lstrip)  # strip leading whitespace
qcwy['薪資'] = qcwy['薪資'].map(str.rstrip)  # strip trailing whitespace

qcwy.describe()

# ----- 3. Data visualization -----
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly

# Pie chart of education requirements
gw_score = qcwy['學歷'].value_counts()
plt.title("學歷占比圖")
plt.pie(gw_score.values, labels=gw_score.index, autopct='%1.1f%%')
plt.show()

qcwy = pd.DataFrame(pd.read_csv('qcwy.csv', encoding='gbk'))
sns.distplot(qcwy['招聘人數'])

sns.regplot(x='招聘人數', y='公司規模', data=qcwy)

# Quadratic least-squares fit: company size vs. number of openings
gsgm = qcwy.loc[:, '公司規模']
zprs = qcwy.loc[:, '招聘人數']

def func(params, x):
    a, b, c = params
    return a * x * x + b * x + c

def error_func(params, x, y):
    return func(params, x) - y

def main():
    plt.figure(figsize=(8, 6))
    P0 = [1, 9.0, 1]  # initial guess for a, b, c
    Para = leastsq(error_func, P0, args=(gsgm, zprs))
    a, b, c = Para[0]
    print("a=", a, "b=", b, "c=", c)
    plt.scatter(gsgm, zprs, color="green", label="樣本數據", linewidth=2)
    x = np.linspace(10, 1000, 400)
    y = a * x * x + b * x + c
    plt.plot(x, y, color="red", label="擬合曲線", linewidth=2)
    plt.xlabel('公司規模')
    plt.ylabel('招聘人數')
    plt.title("公司規模與招聘人數回歸方程")
    plt.grid()
    plt.legend()
    plt.show()

main()
六、Summary
1. What conclusions can be drawn from the analysis and visualization of the data?
(1) The visualization shows that company size and the number of people a company recruits are positively related.
(2) The regression analysis shows that larger companies generally recruit more people, and those data points cluster together.
(3) After data cleaning and visualization, the data of interest can be grasped quickly and intuitively.
2. What was gained in completing this project, and what improvements could be made?
During this crawler project I consulted many references, such as GitHub and similar open-source communities, and followed some good authors early on to learn the whole crawler-writing process from start to finish. I also ran into plenty of problems: pages with title navigation bars or with ads could not be scraped normally. Along the way I learned how to handle some sites' anti-crawling measures. Many difficulties remained while writing my own code, and I resolved them gradually by consulting references and textbooks, which also deepened my understanding of Python. Python still has many unexplored areas and room for creativity, and it has sparked a strong interest in me; there is much left to learn. Data visualization, data cleaning, and data analysis are only the tip of the iceberg, so I will keep studying and strive for better results.