Python Advanced Application Design Assignment Requirements

Implement a topic-focused web crawler in Python and complete the following:
(Note: one topic per student, topic self-chosen; all design content and source code must be submitted to the Cnblogs platform)

I. Topic-Focused Web Crawler Design Plan (15 points)
1. Crawler name

  (1) Crawl job postings for specified cities from the 51job recruitment site.

  (2) Analyze the crawled job postings.

2. Content to crawl and data characteristics

  (1) Crawl postings for the chosen cities and positions (e.g. company name, salary, required experience).

  (2) Analyze the number of openings for each position.

3. Design overview (approach and technical difficulties)

Approach:

  Fetch the pages with the requests library.

  Store the data in a database, then export it to a CSV file.

  Visualize the data with the matplotlib module.

Difficulties:

  Analyzing the structure of the target pages.

  Format conversions are easy to overlook.

II. Structural Analysis of the Topic Pages (15 points)
1. Page structure

  The page consists of a top bar containing the search box, a job-listing area with its filter conditions, and a footer with site information.


2. HTML page parsing

  Inspecting the HTML in Chrome DevTools shows that the information we need lives in the dw_table section, stored as individual "boxes"; those boxes are what we want to crawl.
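The extraction pattern can be sketched on a toy snippet. The HTML below is a made-up miniature of the listing markup, not copied from 51job; the XPath is the same one the crawler's parse() method uses:

```python
from lxml import etree

# A made-up miniature of the 51job listing markup: each "box" is a
# div.el holding a p.t1 with the link to the job-detail sub-page
html_text = """
<div class="dw_wp">
  <div class="el">
    <p class="t1 "><a target="_blank" href="https://jobs.51job.com/beijing/1.html">Java工程師</a></p>
  </div>
  <div class="el">
    <p class="t1 "><a target="_blank" href="https://jobs.51job.com/shanghai/2.html">Python工程師</a></p>
  </div>
</div>
"""

html = etree.HTML(html_text)
# Same XPath as parse(): collect every detail-page link in the list
hrefs = html.xpath("//p[@class='t1 ']//a[@target='_blank']/@href")
print(hrefs)
```

Note that the class attribute is matched literally, trailing space included ('t1 '), exactly as it appears in the site's markup.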



3. Node (tag) lookup and traversal
(draw the node tree where necessary)

<html> → <div class="dw_wp"> → <div class="el"> → <p class="t1">
                                                  <p class="t2">
                                                  <p class="t3">
                                                  <p class="t4">
                                                  <p class="t5">

Convert the sub-page format in advance, then store the data that matches what we need.

url = "https://search.51job.com/list/{},000000,0000,00,9,99,{},2,{}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
# City codes to crawl, stored in a list for direct reuse below
city_list = ["010000", "020000", "030200", "040000", "080200"]
city_dict = {"010000": "北京", "020000": "上海", "030200": "廣州", "040000": "深圳", "080200": "杭州"}  # code-to-name mapping
job_list = ["Java", "PHP", "C++", "數據挖掘", "Hadoop", "Python", "前端", "C"]
# Record the start time
time_start = time.time()
try:
    for citys in city_list:
        for job in job_list:
            for i in range(1, 31):  # loop over the result pages
                print("Saving {} {} jobs, page {}".format(city_dict[citys], job, i))
                response = self.response_handler(url.format(citys, job, i))
                hrefs = self.parse(response)
                for href in hrefs:
                    item = self.sub_page(href)
                    # Skip to the next link if nothing was extracted
                    if item is None:
                        continue
                    # Connect to the database
                    conn = self.connnect()
                    cur = conn.cursor()
                    # Insert the record
                    sql = "insert into tbl_qcwy_data(post,min_wages,max_wages,experience,education,city,company,scale,nature) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                    # Execute the SQL statement with the item fields as parameters
                    cur.execute(sql, (
                        item['post'], item['min_wages'], item['max_wages'], item['experience'], item['education'],
                        item['city'], item['company'], item['scale'], item['nature']))
                    # Commit the transaction so the insert takes effect
                    conn.commit()
    cur.close()
    conn.close()


III. Web Crawler Program Design (60 points)
The crawler program must include each of the following parts, with source code, reasonably detailed comments, and a screenshot of the output after each part.
1. Data crawling and collection

    # Process a job-detail sub-page
    def sub_page(self, href):
        try:
            response = self.response_handler(href)
            resp = response.text  # get the page body
            html = etree.HTML(resp)  # parse the page content
            div = html.xpath("//div[@class='cn']")  # select the info container
            if len(div) > 0:
                div = div[0]
            else:
                return None
            wages = div.xpath(".//strong/text()")
            if len(wages) > 0:
                wages = str(wages[0])
                # The unit characters in this check were lost when the post was
                # published; "千", "天" and "元" are restored here as a best guess
                # (skip salaries quoted in thousands, per day, or in plain yuan)
                if wages.endswith("千") or wages.endswith("天") or wages.find("元") != -1:
                    return None
                min_wage = wages.split("-")[0]
                min_wage = float(min_wage) * 10000
                max_wage = wages.split("-")[1]
                # "万" restored as a best guess, for a salary like "1-1.5万/月"
                if max_wage.find("万") != -1:
                    i = max_wage.index("万")
                    max_wage = float(max_wage[0:i]) * 10000
                else:
                    return None
            else:
                return None
            # Strip the tags and keep only the text
            title = div.xpath(".//p[@class='msg ltype']/text()")
            city = re.sub("\\n|\\t|\\r|\\xa0", "", title[0])
            experience = re.sub("\\n|\\t|\\r|\\xa0", "", title[1])
            education = re.sub("\\n|\\t|\\r|\\xa0", "", title[2])
            post = div.xpath(".//h1/@title")[0]
            company = div.xpath(".//p[@class='cname']/a/@title")[0]
            scale = html.xpath("//div[@class='com_tag']/p/@title")[1]
            nature = html.xpath("//div[@class='com_tag']/p/@title")[0]
            # Pack the fields into a dict
            item = {"min_wages": min_wage, "max_wages": max_wage, "experience": experience, "education": education,
                    "city": city, "post": post, "company": company, "scale": scale, "nature": nature}
            return item
        except Exception:
            return None

    # Crawler entry point
    def main(self):
        url = "https://search.51job.com/list/{},000000,0000,00,9,99,{},2,{}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
        # City codes to crawl, stored in a list for direct reuse below
        city_list = ["010000", "020000", "030200", "040000", "080200"]
        city_dict = {"010000": "北京", "020000": "上海", "030200": "廣州", "040000": "深圳", "080200": "杭州"}  # code-to-name mapping
        job_list = ["Java", "PHP", "C++", "數據挖掘", "Hadoop", "Python", "前端", "C"]
        # Record the start time
        time_start = time.time()
        try:
            for citys in city_list:
                for job in job_list:
                    for i in range(1, 31):  # loop over the result pages
                        print("Saving {} {} jobs, page {}".format(city_dict[citys], job, i))
                        response = self.response_handler(url.format(citys, job, i))
                        hrefs = self.parse(response)
                        for href in hrefs:
                            item = self.sub_page(href)
                            # Skip to the next link if nothing was extracted
                            if item is None:
                                continue
                            # Connect to the database
                            conn = self.connnect()
                            cur = conn.cursor()
                            # Insert the record
                            sql = "insert into tbl_qcwy_data(post,min_wages,max_wages,experience,education,city,company,scale,nature) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                            # Execute the SQL statement with the item fields as parameters
                            cur.execute(sql, (
                                item['post'], item['min_wages'], item['max_wages'], item['experience'], item['education'],
                                item['city'], item['company'], item['scale'], item['nature']))
                            # Commit the transaction so the insert takes effect
                            conn.commit()
            cur.close()
            conn.close()
            # Record the end time
            time_end = time.time()
            # Report the total running time
            print("Total time: {}".format(time_end - time_start))
        except Exception:
            print("Exception occurred")
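The salary parsing inside sub_page can be illustrated with a small standalone sketch. The helper name, the skip rules, and the sample strings are illustrative (the unit characters in the original check were lost in publishing): a string like "1-1.5万/月" is split on "-" into a lower and upper bound, and both are scaled by 10,000 yuan.

```python
def parse_wages(wages):
    """Parse a 51job salary string like '1-1.5万/月' into (min, max) in yuan.

    Returns None for formats the crawler skips, e.g. '8千/月' (thousands
    per month) or '150元/天' (yuan per day) -- an assumed reconstruction
    of the original filtering logic.
    """
    if "千" in wages or "天" in wages:
        return None
    if "-" not in wages or "万" not in wages:
        return None
    low, high = wages.split("-")
    min_wage = float(low) * 10000
    # Keep only the digits before the 万 unit, then scale
    max_wage = float(high[:high.index("万")]) * 10000
    return min_wage, max_wage

print(parse_wages("1-1.5万/月"))  # (10000.0, 15000.0)
print(parse_wages("150元/天"))    # None
```

A job advertised at "1-1.5万/月" is therefore stored as min_wages=10000, max_wages=15000, matching the columns in tbl_qcwy_data.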


2. Data cleaning and processing

Because the data was already filtered before being stored, we can proceed directly to the visualizations.
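Even so, a light cleaning pass after loading the CSV costs little. A minimal sketch with pandas; the column names follow the recruit_data.csv schema used below, and the sample rows are made up:

```python
import pandas as pd

# Made-up sample rows matching the recruit_data.csv columns
df = pd.DataFrame({
    "post": ["Java開發工程師", "大數據分析師", None],
    "city": ["北京", None, "上海"],
    "min_wages": [10000.0, 8000.0, 12000.0],
    "max_wages": [15000.0, 12000.0, 20000.0],
})

# Drop rows missing a city or post, then sanity-check the wage bounds
clean = df.dropna(subset=["city", "post"])
clean = clean[clean["min_wages"] <= clean["max_wages"]]
print(len(clean))  # 1
```

Only the first sample row survives both checks; the same two filters applied to the real export would catch any rows that slipped past the crawler's own screening.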



3. Text analysis (optional): jieba word segmentation, wordcloud visualization
4. Data analysis and visualization
(e.g. bar charts, histograms, scatter plots, box plots, distribution plots, regression analysis)

Within the filtered cities and positions, query again to obtain more precise information.

# Make Chinese characters render correctly
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)

# engine='python' allows the file name to contain Chinese characters
df = pd.read_csv('recruit_data.csv', encoding='utf-8', engine='python')
# Inspect rows with a missing city
df['city'].isnull().value_counts()
# Inspect the data for outliers
df.describe()
# Job titles
job_list = ['JAVA開發工程師', '大數據架構師', '大數據開發工程師', '大數據分析師', '大數據算法工程師']
# Number of openings; the first entry counts every case variant of the Java title
job_number = [len(df[df.post.str.contains('java開發工程師|Java開發工程師|JAVA開發工程師', na=False)])]

for i in range(len(job_list)):
    if i != 0:
        job_number.append(len(df[df.post.str.contains(job_list[i], na=False)]))

plt.figure(figsize=(12, 6))

plt.bar(job_list, job_number, width=0.5, label='崗位數量')
plt.title('大數據相關招聘職位數量')
plt.legend()
plt.xlabel(' ' * 68 + '職位名稱' + '\n' * 5, labelpad=10, ha='left', va='center')
plt.ylabel('崗位數量/個' + '\n' * 16, rotation=0)
# Annotate each bar with its count
for a, b in zip(job_list, job_number):
    plt.text(a, b + 0.1, '%.0f' % b, ha='center', va='bottom', fontsize=13)
plt.show()



5. Data persistence

Export the data from the database to a CSV file.
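The post only shows a screenshot for this step. A minimal sketch of a database-to-CSV export, using sqlite3 in place of MySQL so the snippet is self-contained; the table name tbl_qcwy_data and its columns follow the INSERT statement above, and the sample row is made up:

```python
import csv
import sqlite3

columns = ["post", "min_wages", "max_wages", "experience", "education",
           "city", "company", "scale", "nature"]

# In-memory stand-in for the MySQL `recruit` database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table tbl_qcwy_data({})".format(",".join(columns)))
cur.execute("insert into tbl_qcwy_data values (?,?,?,?,?,?,?,?,?)",
            ("Java開發工程師", 10000, 15000, "3-4年經驗", "本科",
             "北京", "某公司", "150-500人", "民營公司"))
conn.commit()

# Dump every row to recruit_data.csv with a header line
with open("recruit_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for row in cur.execute("select {} from tbl_qcwy_data".format(",".join(columns))):
        writer.writerow(row)
conn.close()
```

With pymysql the only changes would be the connect() call and using %s placeholders, as in the crawler above; the CSV-writing part is identical.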



6. Complete program code

# Imports
import requests
import re
from lxml import etree
import pymysql
import time


# Crawl the target site while mimicking a browser
class crawl():
    # Request headers identifying the client
    def __init__(self):
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/71.0.3578.98 Safari/537.36",
            "Host": "search.51job.com",
            "Referer": "https://www.51job.com/",
            "Upgrade-Insecure-Requests": "1",
            "Cookie": "guid=667fe7f59e43e367a18efcf690e3d464; "
                      "slife=lowbrowser%3Dnot%26%7C%26lastlogindate%3D20190412%26%7C%26; "
                      "adv=adsnew%3D0%26%7C%26adsnum%3D2004282%26%7C%26adsresume%3D1%26%7C%26adsfrom%3Dhttps%253A"
                      "%252F%252Fsp0.baidu.com%252F9q9JcDHa2gU2pMbgoY3K%252Fadrc.php%253Ft"
                      "%253D06KL00c00fDewkY0gPN900uiAsa4XQKT00000c6R7dC00000v47CS_"
                      ".THLZ_Q5n1VeHksK85HRsnj0snjckgv99UdqsusK15HbLPvFBPWcYnj0snAPWnAn0IHYYn1"
                      "-Dn1DzrDujn1bLfHn3fbPawbc3P10vnRfzf1DLPfK95gTqFhdWpyfqn1ckPHb4nH63nzusThqbpyfqnHm0uHdCIZwsT1CEQLwzmyP-QWRkphqBQhPEUiqYTh7Wui4spZ0Omyw1UMNV5HcsnjfzrjchmyGs5y7cRWKWiDYvHZb4IAD1RgNrNDukmWFFINbzrgwnndF8HjPrUAFHrgI-Uj94HRw7PDkVpjKBNLTEyh42IhFRny-uNvkou79aPBuo5HnvuH0dPA79njfdrynzuywbPycYPvmdPvPBPWDsuj6z0APzm1YdPWT1Ps%2526tpl%253Dtpl_11534_19347_15370%2526l%253D1511462024%2526attach%253Dlocation%25253D%252526linkName%25253D%252525E6%252525A0%25252587%252525E5%25252587%25252586%252525E5%252525A4%252525B4%252525E9%25252583%252525A8-%252525E6%252525A0%25252587%252525E9%252525A2%25252598-%252525E4%252525B8%252525BB%252525E6%252525A0%25252587%252525E9%252525A2%25252598%252526linkText%25253D%252525E3%25252580%25252590%252525E5%25252589%2525258D%252525E7%252525A8%2525258B%252525E6%25252597%252525A0%252525E5%252525BF%252525A751Job%252525E3%25252580%25252591-%25252520%252525E5%252525A5%252525BD%252525E5%252525B7%252525A5%252525E4%252525BD%2525259C%252525E5%252525B0%252525BD%252525E5%2525259C%252525A8%252525E5%25252589%2525258D%252525E7%252525A8%2525258B%252525E6%25252597%252525A0%252525E5%252525BF%252525A7%2521%252526xp%25253Did%2528%25252522m3215991883_canvas%25252522%2529%2525252FDIV%2525255B1%2525255D%2525252FDIV%2525255B1%2525255D%2525252FDIV%2525255B1%2525255D%2525252FDIV%2525255B1%2525255D%2525252FDIV%2525255B1%2525255D%2525252FH2%2525255B1%2525255D%2525252FA%2525255B1%2525255D%252526linkType%25253D%252526checksum%25253D23%2526ie%253Dutf-8%2526f%253D8%2526srcqid%253D3015640922443247634%2526tn%253D50000021_hao_pg%2526wd%253D%2525E5%252589%25258D%2525E7%2525A8%25258B%2525E6%252597%2525A0%2525E5%2525BF%2525A7%2526oq%253D%2525E5%252589%25258D%2525E7%2525A8%25258B%2525E6%252597%2525A0%2525E5%2525BF%2525A7%2526rqlang%253Dcn%2526sc%253DUWd1pgw-pA7EnHc1FMfqnHRdPH0vrHDsnWb4PauW5y99U1Dznzu9m1Y1nWm3P1R4PWTz%2526ssl_sample%253Ds_102%2526H123Tmp%253Dnunew7; track=registertype%3D1; "
                      "51job=cuid%3D155474095%26%7C%26cusername%3Dphone_15592190359_201904126833%26%7C%26cpassword%3D%26%7C%26cemail%3D%26%7C%26cemailstatus%3D0%26%7C%26cnickname%3D%26%7C%26ccry%3D.0Kv7MYCl6oSc%26%7C%26cconfirmkey%3D%25241%2524pje29p9d%2524ZK3qb5Vg5.sXKVb3%252Fic9y%252F%26%7C%26cenglish%3D0%26%7C%26to%3Db9207cfd84ebc727453bdbb9b672c3a45cb0791d%26%7C%26; nsearch=jobarea%3D%26%7C%26ord_field%3D%26%7C%26recentSearch0%3D%26%7C%26recentSearch1%3D%26%7C%26recentSearch2%3D%26%7C%26recentSearch3%3D%26%7C%26recentSearch4%3D%26%7C%26collapse_expansion%3D; search=jobarea%7E%60200300%7C%21ord_field%7E%600%7C%21recentSearch0%7E%601%A1%FB%A1%FA200300%2C00%A1%FB%A1%FA000000%A1%FB%A1%FA0000%A1%FB%A1%FA00%A1%FB%A1%FA9%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA%B1%B1%BE%A9%A1%FB%A1%FA2%A1%FB%A1%FA%A1%FB%A1%FA-1%A1%FB%A1%FA1555069299%A1%FB%A1%FA0%A1%FB%A1%FA%A1%FB%A1%FA%7C%21",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        }
    # Initialize the MySQL connection
    def connnect(self):
        conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='root',
            db='recruit',
            charset='utf8'
        )
        return conn

    # Fetch the response returned by the server
    def response_handler(self, url):
        response = requests.get(url=url, headers=self.headers)
        return response

    # Convert the HTML into a structure that is easy to process
    def parse(self, response):
        resp = response.content.decode("gbk")
        html = etree.HTML(resp)
        href = html.xpath("//p[@class='t1 ']//a[@target='_blank']/@href")
        return href
    # Process a job-detail sub-page
    def sub_page(self, href):
        try:
            response = self.response_handler(href)
            resp = response.text  # get the page body
            html = etree.HTML(resp)  # parse the page content
            div = html.xpath("//div[@class='cn']")  # select the info container
            if len(div) > 0:
                div = div[0]
            else:
                return None
            wages = div.xpath(".//strong/text()")
            if len(wages) > 0:
                wages = str(wages[0])
                # The unit characters in this check were lost when the post was
                # published; "千", "天" and "元" are restored here as a best guess
                # (skip salaries quoted in thousands, per day, or in plain yuan)
                if wages.endswith("千") or wages.endswith("天") or wages.find("元") != -1:
                    return None
                min_wage = wages.split("-")[0]
                min_wage = float(min_wage) * 10000
                max_wage = wages.split("-")[1]
                # "万" restored as a best guess, for a salary like "1-1.5万/月"
                if max_wage.find("万") != -1:
                    i = max_wage.index("万")
                    max_wage = float(max_wage[0:i]) * 10000
                else:
                    return None
            else:
                return None
            # Strip the tags and keep only the text
            title = div.xpath(".//p[@class='msg ltype']/text()")
            city = re.sub("\\n|\\t|\\r|\\xa0", "", title[0])
            experience = re.sub("\\n|\\t|\\r|\\xa0", "", title[1])
            education = re.sub("\\n|\\t|\\r|\\xa0", "", title[2])
            post = div.xpath(".//h1/@title")[0]
            company = div.xpath(".//p[@class='cname']/a/@title")[0]
            scale = html.xpath("//div[@class='com_tag']/p/@title")[1]
            nature = html.xpath("//div[@class='com_tag']/p/@title")[0]
            # Pack the fields into a dict
            item = {"min_wages": min_wage, "max_wages": max_wage, "experience": experience, "education": education,
                    "city": city, "post": post, "company": company, "scale": scale, "nature": nature}
            return item
        except Exception:
            return None
    # Crawler entry point
    def main(self):
        url = "https://search.51job.com/list/{},000000,0000,00,9,99,{},2,{}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
        # City codes to crawl, stored in a list for direct reuse below
        city_list = ["010000", "020000", "030200", "040000", "080200"]
        city_dict = {"010000": "北京", "020000": "上海", "030200": "廣州", "040000": "深圳", "080200": "杭州"}  # code-to-name mapping
        job_list = ["Java", "PHP", "C++", "數據挖掘", "Hadoop", "Python", "前端", "C"]
        # Record the start time
        time_start = time.time()
        try:
            for citys in city_list:
                for job in job_list:
                    for i in range(1, 31):  # loop over the result pages
                        print("Saving {} {} jobs, page {}".format(city_dict[citys], job, i))
                        response = self.response_handler(url.format(citys, job, i))
                        hrefs = self.parse(response)
                        for href in hrefs:
                            item = self.sub_page(href)
                            # Skip to the next link if nothing was extracted
                            if item is None:
                                continue
                            # Connect to the database
                            conn = self.connnect()
                            cur = conn.cursor()
                            # Insert the record
                            sql = "insert into tbl_qcwy_data(post,min_wages,max_wages,experience,education,city,company,scale,nature) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                            # Execute the SQL statement with the item fields as parameters
                            cur.execute(sql, (
                                item['post'], item['min_wages'], item['max_wages'], item['experience'], item['education'],
                                item['city'], item['company'], item['scale'], item['nature']))
                            # Commit the transaction so the insert takes effect
                            conn.commit()
            cur.close()
            conn.close()
            # Record the end time
            time_end = time.time()
            # Report the total running time
            print("Total time: {}".format(time_end - time_start))
        except Exception:
            print("Exception occurred")


if __name__ == '__main__':
    c = crawl()
    c.main()
# coding=utf-8
"""
@author : tongwei
@Date  : 2019/5/22
@File  : 2_draw.py
"""

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# Count the number of postings for each big-data-related job title


# Make Chinese characters render correctly
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)

# engine='python' allows the file name to contain Chinese characters
df = pd.read_csv('recruit_data.csv', encoding='utf-8', engine='python')
# Inspect rows with a missing city
df['city'].isnull().value_counts()
# Inspect the data for outliers
df.describe()
# Job titles
job_list = ['JAVA開發工程師', '大數據架構師', '大數據開發工程師', '大數據分析師', '大數據算法工程師']
# Number of openings; the first entry counts every case variant of the Java title
job_number = [len(df[df.post.str.contains('java開發工程師|Java開發工程師|JAVA開發工程師', na=False)])]

for i in range(len(job_list)):
    if i != 0:
        job_number.append(len(df[df.post.str.contains(job_list[i], na=False)]))

plt.figure(figsize=(12, 6))

plt.bar(job_list, job_number, width=0.5, label='崗位數量')
plt.title('大數據相關招聘職位數量')
plt.legend()
plt.xlabel(' ' * 68 + '職位名稱' + '\n' * 5, labelpad=10, ha='left', va='center')
plt.ylabel('崗位數量/個' + '\n' * 16, rotation=0)
# Annotate each bar with its count
for a, b in zip(job_list, job_number):
    plt.text(a, b + 0.1, '%.0f' % b, ha='center', va='bottom', fontsize=13)
plt.show()


IV. Conclusion (10 points)
1. What conclusions can be drawn from the analysis and visualization of the topic data?

In first-tier cities, demand for Java development engineers remains comparatively high.

Learning Java well is therefore helpful for finding a job.
2. Briefly summarize how this programming task went.

I have not been working with Python for long, and even as a two-person collaboration this task was fairly hard to complete. While writing the program I found that making it run is not just a matter of applying the methods we learned; a program can keep being hardened to meet real needs, and a good program must be easily extensible.

Applying crawling end to end gave me a better understanding of its benefits, helped me tackle the difficulties I ran into in a more targeted way, and showed how big data reflects the needs of society.

