Python Web Scraping: Crawling and Visualizing Job Data from 51job (前程無憂)


I. Topic Background

Why was this topic chosen, and what are the expected goals of the data analysis? (10 points)

A web crawler is used to collect job-posting data from 51job. The scraped data is then cleaned to remove unusable records, and the usable fields are analyzed along several dimensions. 51job hosts company listings and offers job seekers a full range of human-resources services, including recruiting, job search, and training; the goal is to present this data in a clear, intuitive form.

II. Design Plan for the Topic-Focused Web Crawler (10 points)

1. Crawler name: "51job crawler with data cleaning and analysis".

2. Content to crawl and characteristics of the data:

The crawler analyzes the structure of the site's pages to extract data. Since 51job is a recruiting and job-search site, the scraped fields include company name, job title, salary range, required work experience, education, city, number of openings, and company size. After cleaning, this multi-dimensional data contains no duplicate rows and no missing values, which makes it more reliable.

3. Overview of the crawler design:

The crawler is implemented in several steps:

Fetch the target pages with requests, setting request headers so the site does not flag the client as a crawler; parse the response with lxml's etree; locate and extract the target data; and save it to a CSV file. A minimal sketch of this pipeline follows.
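A condensed sketch of the pipeline (the full program, with pagination and error handling, is in Section IV; the output file name qcwy_sample.csv is chosen here purely for illustration):

import csv
import json
import requests
from lxml import etree

url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html'
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers, timeout=30)
html = etree.HTML(r.text)
# The search results are embedded as JSON in a <script> tag
raw = html.xpath('//script[@type="text/javascript"]/text()')[0].replace('\n', '').replace('\t', '')
data = json.loads(raw.replace('window.__SEARCH_RESULT__ = ', ''))

with open('qcwy_sample.csv', 'w', newline='', encoding='gbk') as f:
    writer = csv.writer(f)
    for job in data['engine_search_result']:
        writer.writerow([job['company_name'], job['job_name']])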

III. Structural Analysis of the Target Pages (10 points)

Data source: https://search.51job.com

[Screenshots omitted: the 51job search-results pages and the relevant page source were shown here.]

IV. Crawler Program Design (10 points)

1. Data Crawling

import requests
import time
import csv
import json
import pandas as pd
from lxml import etree
from urllib.parse import urlencode

# Create the CSV file and write the header row
file = open('qcwy.csv', 'w', newline='', encoding='gbk')
writer = csv.writer(file)
writer.writerow(['公司', '崗位', '薪資', '福利', '工作經驗', '學歷', '城市', '招聘人數', '公司規模', '公司方向'])
file.close()

# Loop over result pages (pages 1 through 9)
for page in range(1, 10):
    try:
        url0 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html?'.format(page)
        # Request headers, so the site does not flag the client as a crawler
        headers = {
            'Connection': 'keep-alive',
            'Host': 'search.51job.com',
            # 'Cookie': '...',  # personal session cookie omitted; not required for this request
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
        }
        # Fixed query parameters appended to the URL
        params = {
            'lang': 'c',
            'postchannel': '0000',
            'workyear': '99',
            'cotype': '99',
            'degreefrom': '99',
            'jobterm': '99',
            'companysize': '99',
            'ord_field': '0',
            'dibiaoid': '0',
            'line': '',
            'welfare': '',
        }
        # Build the final URL
        url = url0 + urlencode(params)
        print(url)
        # Send the request with a 30-second timeout
        r = requests.get(url, headers=headers, timeout=30)
        # Parse the response text into an HTML tree
        html = etree.HTML(r.text)
        # Locate the <script> tag that embeds the search results as JSON
        nr = html.xpath('//script[@type="text/javascript"]/text()')[0].replace('\n', '').replace('\t', '').replace('window.__SEARCH_RESULT__ = ', '')
        # Parse the string as JSON
        datas = json.loads(nr)['engine_search_result']
        # Iterate over the postings and extract the fields
        for sjs in datas:
            # attribute_text holds city / experience / education / headcount,
            # but experience or education may be missing
            if len(sjs['attribute_text']) == 4:
                workyear = sjs['attribute_text'][1]
                education = sjs['attribute_text'][2]
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
            else:
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
                test = sjs['attribute_text'][1]
                # Decide whether the middle field is experience or education
                if '經驗' in test:
                    workyear = test
                    education = ''
                else:
                    education = test
                    workyear = ''
            company_name = sjs['company_name']
            job_name = sjs['job_name']
            providesalary_text = sjs['providesalary_text'].replace('\\', '')
            jobwelf = sjs['jobwelf'].replace('\\', '')
            companysize_text = sjs['companysize_text'].replace('\\', '')
            companyind_text = sjs['companyind_text'].replace('\\', '')
            # Append one row per posting (missing fields stay as empty strings)
            file = open('qcwy.csv', 'a+', newline='', encoding='gbk')
            writer = csv.writer(file)
            writer.writerow([company_name, job_name, providesalary_text, jobwelf, workyear, education, city, renshu, companysize_text, companyind_text])
            file.close()
            print(company_name, job_name, providesalary_text, jobwelf, workyear, education, city, renshu, companysize_text, companyind_text)
    # On any error, print it and pause briefly before the next page
    except Exception as e:
        print(e)
        time.sleep(1)

# Read the CSV back and export it to Excel (requires openpyxl)
datas = pd.read_csv('qcwy.csv', encoding='gbk')
datas.to_excel('qcwy.xlsx', index=False)
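The string replace above assumes the script text begins with exactly window.__SEARCH_RESULT__ = . A slightly more defensive variant (a sketch, not part of the original program; extract_search_result is a hypothetical helper) pulls the JSON object out with a regular expression, so surrounding whitespace or a trailing semicolon will not break parsing:

import json
import re

def extract_search_result(script_text):
    # Find the JSON object assigned to window.__SEARCH_RESULT__;
    # returns None if the assignment is not present in the script text.
    m = re.search(r'window\.__SEARCH_RESULT__\s*=\s*(\{.*\})', script_text, re.S)
    return json.loads(m.group(1)) if m else None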


2. Data Cleaning and Processing


import pandas as pd

# Load the scraped data (read_csv already returns a DataFrame)
qcwy = pd.read_csv('qcwy.csv', encoding='gbk')
qcwy.head()

# Drop the 福利 (benefits) column, which is not needed for the analysis
qcwy.drop('福利', axis=1, inplace=True)
qcwy.head()


# Check for duplicate rows
qcwy.duplicated()

# Remove the duplicates
qcwy = qcwy.drop_duplicates()
qcwy.head()


# Count missing values in each column
qcwy['公司'].isnull().value_counts()
qcwy['崗位'].isnull().value_counts()
qcwy['薪資'].isnull().value_counts()
qcwy['工作經驗'].isnull().value_counts()
qcwy['學歷'].isnull().value_counts()
qcwy['城市'].isnull().value_counts()
qcwy['招聘人數'].isnull().value_counts()
qcwy['公司規模'].isnull().value_counts()
qcwy['公司方向'].isnull().value_counts()

# Strip whitespace around the salary values (str.strip already removes
# whitespace on both sides, so separate lstrip/rstrip passes are redundant)
qcwy['薪資'] = qcwy['薪資'].str.strip()

qcwy.describe()
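Before the distribution plot and the regression below can run, the text-valued columns must be turned into numbers: 招聘人數 holds strings such as 「招3人」 and 公司規模 strings such as 「150-500人」. A sketch of one possible conversion (the regular expression and the choice of keeping the first number of a range are assumptions, not part of the original program):

import re

def first_number(text):
    # Return the first integer found in a string such as '招3人' or
    # '150-500人'; None when there is no digit (e.g. '招若干人').
    m = re.search(r'\d+', str(text))
    return int(m.group()) if m else None

qcwy['招聘人數'] = qcwy['招聘人數'].map(first_number)
qcwy['公司規模'] = qcwy['公司規模'].map(first_number)
qcwy = qcwy.dropna(subset=['招聘人數', '公司規模'])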


3. Data Visualization

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pie chart of education requirements
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
gw_score = qcwy['學歷'].value_counts()  # count postings per education level
plt.title("學歷占比圖")  # chart title
plt.pie(gw_score.values, labels=gw_score.index, autopct='%1.1f%%')  # draw the pie
# autopct formats the text inside each wedge; % is Python's string-format operator
plt.show()

# Distribution of the number of openings per posting (assumes 招聘人數 has
# been converted to numbers, e.g. with the sketch above; seaborn's distplot
# is deprecated in newer versions in favour of histplot)
sns.distplot(qcwy['招聘人數'])

# Scatter plot with a fitted regression line (both columns must be numeric)
sns.regplot(x='招聘人數', y='公司規模', data=qcwy)


import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly

# Variables to fit (assumes both columns have been converted to numbers)
gsgm = qcwy.loc[:, '公司規模']
zprs = qcwy.loc[:, '招聘人數']

# Quadratic model and its residual function
def func(params, x):
    a, b, c = params
    return a * x * x + b * x + c

def error_func(params, x, y):
    return func(params, x) - y

def main():
    plt.figure(figsize=(8, 6))
    P0 = [1, 9.0, 1]  # initial guess for a, b, c
    Para = leastsq(error_func, P0, args=(gsgm, zprs))
    a, b, c = Para[0]
    print("a=", a, "b=", b, "c=", c)
    # Sample points
    plt.scatter(gsgm, zprs, color="green", label="樣本數據", linewidth=2)
    # Fitted curve
    x = np.linspace(10, 1000, 400)
    y = a * x * x + b * x + c
    plt.plot(x, y, color="red", label="擬合曲線", linewidth=2)
    # Axis labels and title
    plt.xlabel('公司規模')
    plt.ylabel('招聘人數')
    plt.title("公司規模與招聘人數回歸方程")
    plt.grid()
    plt.legend()
    plt.show()

main()
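For a plain quadratic fit, numpy's polyfit produces the same coefficients with far less code; a possible alternative to the leastsq version above (a sketch, not what the original program used):

import numpy as np

# Fit y = a*x^2 + b*x + c by least squares; polyfit returns the
# coefficients from the highest degree down
a, b, c = np.polyfit(gsgm, zprs, deg=2)
print("a=", a, "b=", b, "c=", c)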


V. Appendix: Complete Program Code

import requests
import time
import csv
import json
import pandas as pd
from lxml import etree
from urllib.parse import urlencode

# ---- Crawling ----

# Create the CSV file and write the header row
file = open('qcwy.csv', 'w', newline='', encoding='gbk')
writer = csv.writer(file)
writer.writerow(['公司', '崗位', '薪資', '福利', '工作經驗', '學歷', '城市', '招聘人數', '公司規模', '公司方向'])
file.close()

# Loop over result pages (pages 1 through 9)
for page in range(1, 10):
    try:
        url0 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html?'.format(page)
        # Request headers, so the site does not flag the client as a crawler
        headers = {
            'Connection': 'keep-alive',
            'Host': 'search.51job.com',
            # 'Cookie': '...',  # personal session cookie omitted; not required for this request
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
        }
        # Fixed query parameters appended to the URL
        params = {
            'lang': 'c',
            'postchannel': '0000',
            'workyear': '99',
            'cotype': '99',
            'degreefrom': '99',
            'jobterm': '99',
            'companysize': '99',
            'ord_field': '0',
            'dibiaoid': '0',
            'line': '',
            'welfare': '',
        }
        url = url0 + urlencode(params)
        print(url)
        # Send the request with a 30-second timeout
        r = requests.get(url, headers=headers, timeout=30)
        html = etree.HTML(r.text)
        # Locate the <script> tag that embeds the search results as JSON
        nr = html.xpath('//script[@type="text/javascript"]/text()')[0].replace('\n', '').replace('\t', '').replace('window.__SEARCH_RESULT__ = ', '')
        datas = json.loads(nr)['engine_search_result']
        for sjs in datas:
            # attribute_text holds city / experience / education / headcount,
            # but experience or education may be missing
            if len(sjs['attribute_text']) == 4:
                workyear = sjs['attribute_text'][1]
                education = sjs['attribute_text'][2]
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
            else:
                city = sjs['attribute_text'][0]
                renshu = sjs['attribute_text'][-1]
                test = sjs['attribute_text'][1]
                # Decide whether the middle field is experience or education
                if '經驗' in test:
                    workyear = test
                    education = ''
                else:
                    education = test
                    workyear = ''
            company_name = sjs['company_name']
            job_name = sjs['job_name']
            providesalary_text = sjs['providesalary_text'].replace('\\', '')
            jobwelf = sjs['jobwelf'].replace('\\', '')
            companysize_text = sjs['companysize_text'].replace('\\', '')
            companyind_text = sjs['companyind_text'].replace('\\', '')
            # Append one row per posting (missing fields stay as empty strings)
            file = open('qcwy.csv', 'a+', newline='', encoding='gbk')
            writer = csv.writer(file)
            writer.writerow([company_name, job_name, providesalary_text, jobwelf, workyear, education, city, renshu, companysize_text, companyind_text])
            file.close()
            print(company_name, job_name, providesalary_text, jobwelf, workyear, education, city, renshu, companysize_text, companyind_text)
    # On any error, print it and pause briefly before the next page
    except Exception as e:
        print(e)
        time.sleep(1)

# Read the CSV back and export it to Excel (requires openpyxl)
datas = pd.read_csv('qcwy.csv', encoding='gbk')
datas.to_excel('qcwy.xlsx', index=False)

# ---- Cleaning ----

qcwy = pd.read_csv('qcwy.csv', encoding='gbk')
qcwy.head()

# Drop the 福利 (benefits) column, which is not needed for the analysis
qcwy.drop('福利', axis=1, inplace=True)
qcwy.head()

# Check for and remove duplicate rows
qcwy.duplicated()
qcwy = qcwy.drop_duplicates()
qcwy.head()

# Count missing values in each column
qcwy['公司'].isnull().value_counts()
qcwy['崗位'].isnull().value_counts()
qcwy['薪資'].isnull().value_counts()
qcwy['工作經驗'].isnull().value_counts()
qcwy['學歷'].isnull().value_counts()
qcwy['城市'].isnull().value_counts()
qcwy['招聘人數'].isnull().value_counts()
qcwy['公司規模'].isnull().value_counts()
qcwy['公司方向'].isnull().value_counts()

# Strip whitespace around the salary values
qcwy['薪資'] = qcwy['薪資'].str.strip()

qcwy.describe()

# ---- Visualization ----

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pie chart of education requirements
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
gw_score = qcwy['學歷'].value_counts()  # count postings per education level
plt.title("學歷占比圖")
plt.pie(gw_score.values, labels=gw_score.index, autopct='%1.1f%%')
plt.show()

# Distribution of the number of openings (assumes 招聘人數 is numeric)
sns.distplot(qcwy['招聘人數'])

# Scatter plot with regression line (both columns must be numeric)
sns.regplot(x='招聘人數', y='公司規模', data=qcwy)

# Quadratic least-squares fit of openings against company size
from scipy.optimize import leastsq

gsgm = qcwy.loc[:, '公司規模']
zprs = qcwy.loc[:, '招聘人數']

def func(params, x):
    a, b, c = params
    return a * x * x + b * x + c

def error_func(params, x, y):
    return func(params, x) - y

def main():
    plt.figure(figsize=(8, 6))
    P0 = [1, 9.0, 1]  # initial guess for a, b, c
    Para = leastsq(error_func, P0, args=(gsgm, zprs))
    a, b, c = Para[0]
    print("a=", a, "b=", b, "c=", c)
    # Sample points
    plt.scatter(gsgm, zprs, color="green", label="樣本數據", linewidth=2)
    # Fitted curve
    x = np.linspace(10, 1000, 400)
    y = a * x * x + b * x + c
    plt.plot(x, y, color="red", label="擬合曲線", linewidth=2)
    # Axis labels and title
    plt.xlabel('公司規模')
    plt.ylabel('招聘人數')
    plt.title("公司規模與招聘人數回歸方程")
    plt.grid()
    plt.legend()
    plt.show()

main()

VI. Summary

1. What conclusions can be drawn from the analysis and visualization of the data?

(1) The visualization indicates that company size and the number of people a company recruits are positively correlated.

(2) The regression analysis shows that larger companies generally recruit more people, and the data points cluster accordingly.

(3) After cleaning and visualization, the data of interest can be grasped quickly and intuitively.

2. What was gained from completing this project, and what could be improved?

While building this crawler I consulted many resources, including open-source communities such as GitHub. By following the work of experienced authors, I walked through the entire process of writing a crawler from start to finish. I also ran into plenty of problems: pages with navigation bars in the header or with embedded ads could not be scraped cleanly, and I had to learn how to cope with various anti-crawling measures. Many difficulties that came up while writing the code were gradually resolved by consulting references and textbooks, and along the way I gained a deeper understanding of Python. There is still a great deal left to learn; data visualization, cleaning, and analysis are only the tip of the iceberg, so I intend to keep studying and improve on this work.
