Advanced Python Application Design: Assignment Requirements
Implement a topic-focused web crawler in Python and complete the following:
(Note: one topic per student, subject self-selected; all design material and source code must be submitted to the 博客园 (cnblogs) platform.)
I. Topic-Focused Web Crawler Design Plan (15 points)
1. Name of the topic-focused web crawler
(1) Crawl job postings for specified cities from the 51job recruitment site.
(2) Analyze the crawled job postings.
2. Content to crawl and data characteristics
(1) Crawl postings for the target cities and related positions (e.g., company name, salary, required work experience).
(2) Analyze the number of openings for each position.
3. Design overview (implementation approach and technical difficulties)
Approach:
Use the requests library to fetch the data.
Store the data in a database and export it to a CSV file.
Use the matplotlib module for data visualization.
Difficulties:
Working out the format of the information on the target pages.
Encoding and format conversion are easy to overlook (see the sketch below).
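For example, the 51job list pages are served GBK-encoded rather than UTF-8, so decoding with the wrong codec garbles every Chinese field. A minimal sketch of this pitfall, using one concrete instance of the search URL template the crawler below builds (the headers here are a simplified stand-in for the full set in the complete code):

import requests
from lxml import etree

# Simplified request headers; the complete program sends a much fuller set
headers = {"user-agent": "Mozilla/5.0"}

# One concrete instance of the search URL template used by the crawler
url = "https://search.51job.com/list/010000,000000,0000,00,9,99,Python,2,1.html"
response = requests.get(url, headers=headers)

# The page is GBK-encoded: relying on response.text lets requests guess the
# codec and can mangle the Chinese text, so decode the raw bytes explicitly
html = etree.HTML(response.content.decode("gbk"))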
II. Structural Analysis of the Topic Pages (15 points)
1. Structural features of the topic pages
The page consists broadly of a top header bar containing the search box, a job-listing area with its filter conditions, and a footer with site information.
2. HTML page parsing
Inspecting the HTML in Chrome's developer tools shows that the information we need sits inside the dw_table section, stored as a series of "boxes"; those boxes are exactly what we want to crawl.
3. Node (tag) lookup and traversal
(draw the node tree structure where necessary)
<html>
 └─ <div class="dw_wp">
     └─ <div class="el">
         ├─ <p class="t1">
         ├─ <p class="t2">
         ├─ <p class="t3">
         ├─ <p class="t4">
         └─ <p class="t5">
Each sub-page is first converted into a parseable structure, and only the records that match the data we need are stored. A short node-lookup sketch is given below, followed by the main loop that crawls and stores the data.
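The lookup sketch matches the node tree above. The XPath expression is the one the complete program's parse() method uses; html is assumed to be an lxml tree built from a GBK-decoded list page, as in the fetch sketch earlier:

# Each <div class="el"> row is one posting; <p class="t1 "> holds the link
# to the detail page (note the trailing space in the class attribute)
hrefs = html.xpath("//p[@class='t1 ']//a[@target='_blank']/@href")
for href in hrefs:
    print(href)  # detail-page URL, parsed later by sub_page()

The main loop that drives the crawl and stores each parsed record looks like this: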
url = "https://search.51job.com/list/{},000000,0000,00,9,99,{},2,{}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
# City codes to crawl, kept in a list for direct reuse below
city_list = ["010000", "020000", "030200", "040000", "080200"]
# Map each city code to its display name
city_dict = {"010000": "北京", "020000": "上海", "030200": "广州", "040000": "深圳", "080200": "杭州"}
job_list = ["Java", "PHP", "C++", "数据挖掘", "Hadoop", "Python", "前端", "C"]
# Record the start time
time_start = time.time()
try:
    for citys in city_list:
        for job in job_list:
            for i in range(1, 31):  # loop over the result pages and store the data
                print("Saving {} {} postings, page {}".format(city_dict[citys], job, i))
                response = self.response_handler(url.format(citys, job, i))
                hrefs = self.parse(response)
                for href in hrefs:
                    item = self.sub_page(href)
                    # If the sub-page yielded nothing, skip to the next link
                    if item is None:
                        continue
                    # Connect to the database
                    conn = self.connect()
                    cur = conn.cursor()
                    # Parameterized insert for one record
                    sql = "insert into tbl_qcwy_data(post,min_wages,max_wages,experience,education,city,company,scale,nature) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                    # Execute the statement with the parsed fields
                    cur.execute(sql, (
                        item['post'], item['min_wages'], item['max_wages'], item['experience'], item['education'],
                        item['city'], item['company'], item['scale'], item['nature']))
                    # Commit the transaction so the insert takes effect
                    conn.commit()
                    cur.close()
                    conn.close()
III. Web Crawler Program Design (60 points)
The crawler program must include each of the following parts, with source code and reasonably detailed comments, and a screenshot of the output after each part.
1. Data crawling and collection
def sub_page(self, href):
    try:
        response = self.response_handler(href)
        resp = response.text  # page source of the detail page
        html = etree.HTML(resp)  # parse the page content
        div = html.xpath("//div[@class='cn']")  # container holding the job details
        if len(div) > 0:
            div = div[0]
        else:
            return None
        wages = div.xpath(".//strong/text()")
        if len(wages) > 0:
            wages = str(wages[0])
            # Keep only salaries quoted as a "min-max万" range; skip postings
            # quoted per day (元), per year (年) or in thousands (千)
            if wages.endswith("元") or wages.endswith("年") or wages.find("千") != -1:
                return None
            min_wage = wages.split("-")[0]
            min_wage = float(min_wage) * 10000
            max_wage = wages.split("-")[1]
            if max_wage.find("万") != -1:
                i = max_wage.index("万")
                max_wage = float(max_wage[0:i]) * 10000
            else:
                return None
        else:
            return None
        # Strip markup and whitespace to get plain-text fields
        title = div.xpath(".//p[@class='msg ltype']/text()")
        city = re.sub("\\n|\\t|\\r|\\xa0", "", title[0])
        experience = re.sub("\\n|\\t|\\r|\\xa0", "", title[1])
        education = re.sub("\\n|\\t|\\r|\\xa0", "", title[2])
        post = div.xpath(".//h1/@title")[0]
        company = div.xpath(".//p[@class='cname']/a/@title")[0]
        scale = html.xpath("//div[@class='com_tag']/p/@title")[1]
        nature = html.xpath("//div[@class='com_tag']/p/@title")[0]
        # Collect the fields into a dict
        item = {"min_wages": min_wage, "max_wages": max_wage, "experience": experience, "education": education,
                "city": city, "post": post, "company": company, "scale": scale, "nature": nature}
        return item
    except Exception:
        return None

# Main crawler routine
def main(self):
    url = "https://search.51job.com/list/{},000000,0000,00,9,99,{},2,{}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
    # City codes to crawl, kept in a list for direct reuse below
    city_list = ["010000", "020000", "030200", "040000", "080200"]
    # Map each city code to its display name
    city_dict = {"010000": "北京", "020000": "上海", "030200": "广州", "040000": "深圳", "080200": "杭州"}
    job_list = ["Java", "PHP", "C++", "数据挖掘", "Hadoop", "Python", "前端", "C"]
    # Record the start time
    time_start = time.time()
    try:
        for citys in city_list:
            for job in job_list:
                for i in range(1, 31):  # loop over the result pages and store the data
                    print("Saving {} {} postings, page {}".format(city_dict[citys], job, i))
                    response = self.response_handler(url.format(citys, job, i))
                    hrefs = self.parse(response)
                    for href in hrefs:
                        item = self.sub_page(href)
                        # If the sub-page yielded nothing, skip to the next link
                        if item is None:
                            continue
                        # Connect to the database
                        conn = self.connect()
                        cur = conn.cursor()
                        # Parameterized insert for one record
                        sql = "insert into tbl_qcwy_data(post,min_wages,max_wages,experience,education,city,company,scale,nature) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                        cur.execute(sql, (
                            item['post'], item['min_wages'], item['max_wages'], item['experience'], item['education'],
                            item['city'], item['company'], item['scale'], item['nature']))
                        # Commit the transaction so the insert takes effect
                        conn.commit()
                        cur.close()
                        conn.close()
        # Record the end time
        time_end = time.time()
        # Report the total running time
        print("Total time: {}s".format(time_end - time_start))
    except Exception as e:
        print("Exception:", e)
2. Data cleaning and processing
Because the data is filtered before it is stored (sub_page() drops records with missing or non-standard fields), little further cleaning is needed and we can move straight on to the visualizations.
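Even so, a few quick pandas checks on the exported CSV help confirm the data really is clean; a minimal sketch, assuming the recruit_data.csv file exported in part 5:

import pandas as pd

df = pd.read_csv('recruit_data.csv', encoding='utf-8', engine='python')

# Count missing city values; after the pre-storage filtering there should be none
print(df['city'].isnull().value_counts())

# Summary statistics of the salary columns, useful for spotting outliers
print(df.describe())

# Defensive: drop any rows that somehow still have missing fields
df = df.dropna()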
3. Text analysis (optional): jieba word segmentation, wordcloud visualization
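This optional step was not implemented in the project; as a reference, a minimal sketch of what it could look like on the crawled post column (it assumes the jieba and wordcloud packages are installed, a local simhei.ttf font file for Chinese glyphs, and the same recruit_data.csv as above):

import jieba
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv('recruit_data.csv', encoding='utf-8', engine='python')

# Segment every job title into words, joined into one whitespace-separated string
text = " ".join(jieba.cut(" ".join(df['post'].dropna())))

# font_path must point to a font containing Chinese glyphs, e.g. SimHei
wc = WordCloud(font_path='simhei.ttf', width=800, height=400,
               background_color='white').generate(text)
wc.to_file('post_wordcloud.png')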
4. Data analysis and visualization
(e.g., bar charts, histograms, scatter plots, box plots, distribution plots, regression analysis)
Within the already-filtered cities and positions, we query again to extract more precise information, for example the number of postings per big-data-related job title:
# Make Chinese text render correctly
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)

# engine='python' allows the CSV filename to contain Chinese characters
df = pd.read_csv('recruit_data.csv', encoding='utf-8', engine='python')
# Check for missing city values
df['city'].isnull().value_counts()
# Summary statistics for spotting outliers
df.describe()
# Job titles to count
job_list = ['JAVA开发工程师', '大数据架构师', '大数据开发工程师', '大数据分析师', '大数据算法工程师']
# Posting counts; the Java title is matched against its common capitalizations
job_number = [len(df[df.post.str.contains('java开发工程师|Java开发工程师|JAVA开发工程师', na=False)])]

for job in job_list[1:]:
    job_number.append(len(df[df.post.str.contains(job, na=False)]))

plt.figure(figsize=(12, 6))

plt.bar(job_list, job_number, width=0.5, label='岗位数量')
plt.title('大数据相关招聘职位数量')
plt.legend()
plt.xlabel('职位名称')
plt.ylabel('岗位数量/个')
# Annotate each bar with its count
for a, b in zip(job_list, job_number):
    plt.text(a, b + 0.1, '%.0f' % b, ha='center', va='bottom', fontsize=13)
plt.show()
5. Data persistence
The crawled records are persisted in MySQL during the crawl and then exported from the database to a CSV file.
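The export itself is not shown in the source; a minimal sketch of how it could be done with pandas, reusing the crawler's pymysql connection settings and table name (the output filename is the one the plotting script reads):

import pandas as pd
import pymysql

# Same connection settings as the crawler's connect() method
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='root', db='recruit', charset='utf8')

# Read the whole crawl table and write it out as UTF-8 CSV
df = pd.read_sql("select * from tbl_qcwy_data", conn)
df.to_csv('recruit_data.csv', index=False, encoding='utf-8')
conn.close()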
6. Complete program code
# coding=utf-8
# Import packages
import requests
import re
from lxml import etree
import pymysql
import time


# Crawler that mimics a browser when visiting the target site
class crawl():
    # Request headers that identify us as a regular browser
    def __init__(self):
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/71.0.3578.98 Safari/537.36",
            "Host": "search.51job.com",
            "Referer": "https://www.51job.com/",
            "Upgrade-Insecure-Requests": "1",
            # Session cookie copied from the browser; substitute your own
            # (the original multi-kilobyte cookie string is abbreviated here)
            "Cookie": "guid=667fe7f59e43e367a18efcf690e3d464; ...",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        }

    # Initialize the MySQL connection
    def connect(self):
        conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='root',
            db='recruit',
            charset='utf8'
        )
        return conn

    # Fetch the server's response for a URL
    def response_handler(self, url):
        response = requests.get(url=url, headers=self.headers)
        return response

    # Convert the HTML into an easy-to-handle structure and extract the detail links
    def parse(self, response):
        resp = response.content.decode("gbk")  # list pages are GBK-encoded
        html = etree.HTML(resp)
        href = html.xpath("//p[@class='t1 ']//a[@target='_blank']/@href")
        return href

    # Parse one job-detail sub-page
    def sub_page(self, href):
        try:
            response = self.response_handler(href)
            resp = response.text  # page source of the detail page
            html = etree.HTML(resp)  # parse the page content
            div = html.xpath("//div[@class='cn']")  # container holding the job details
            if len(div) > 0:
                div = div[0]
            else:
                return None
            wages = div.xpath(".//strong/text()")
            if len(wages) > 0:
                wages = str(wages[0])
                # Keep only salaries quoted as a "min-max万" range; skip postings
                # quoted per day (元), per year (年) or in thousands (千)
                if wages.endswith("元") or wages.endswith("年") or wages.find("千") != -1:
                    return None
                min_wage = wages.split("-")[0]
                min_wage = float(min_wage) * 10000
                max_wage = wages.split("-")[1]
                if max_wage.find("万") != -1:
                    i = max_wage.index("万")
                    max_wage = float(max_wage[0:i]) * 10000
                else:
                    return None
            else:
                return None
            # Strip markup and whitespace to get plain-text fields
            title = div.xpath(".//p[@class='msg ltype']/text()")
            city = re.sub("\\n|\\t|\\r|\\xa0", "", title[0])
            experience = re.sub("\\n|\\t|\\r|\\xa0", "", title[1])
            education = re.sub("\\n|\\t|\\r|\\xa0", "", title[2])
            post = div.xpath(".//h1/@title")[0]
            company = div.xpath(".//p[@class='cname']/a/@title")[0]
            scale = html.xpath("//div[@class='com_tag']/p/@title")[1]
            nature = html.xpath("//div[@class='com_tag']/p/@title")[0]
            # Collect the fields into a dict
            item = {"min_wages": min_wage, "max_wages": max_wage, "experience": experience, "education": education,
                    "city": city, "post": post, "company": company, "scale": scale, "nature": nature}
            return item
        except Exception:
            return None

    # Main crawler routine
    def main(self):
        url = "https://search.51job.com/list/{},000000,0000,00,9,99,{},2,{}.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
        # City codes to crawl, kept in a list for direct reuse below
        city_list = ["010000", "020000", "030200", "040000", "080200"]
        # Map each city code to its display name
        city_dict = {"010000": "北京", "020000": "上海", "030200": "广州", "040000": "深圳", "080200": "杭州"}
        job_list = ["Java", "PHP", "C++", "数据挖掘", "Hadoop", "Python", "前端", "C"]
        # Record the start time
        time_start = time.time()
        try:
            for citys in city_list:
                for job in job_list:
                    for i in range(1, 31):  # loop over the result pages and store the data
                        print("Saving {} {} postings, page {}".format(city_dict[citys], job, i))
                        response = self.response_handler(url.format(citys, job, i))
                        hrefs = self.parse(response)
                        for href in hrefs:
                            item = self.sub_page(href)
                            # If the sub-page yielded nothing, skip to the next link
                            if item is None:
                                continue
                            # Connect to the database
                            conn = self.connect()
                            cur = conn.cursor()
                            # Parameterized insert for one record
                            sql = "insert into tbl_qcwy_data(post,min_wages,max_wages,experience,education,city,company,scale,nature) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                            cur.execute(sql, (
                                item['post'], item['min_wages'], item['max_wages'], item['experience'], item['education'],
                                item['city'], item['company'], item['scale'], item['nature']))
                            # Commit the transaction so the insert takes effect
                            conn.commit()
                            cur.close()
                            conn.close()
            # Record the end time
            time_end = time.time()
            # Report the total running time
            print("Total time: {}s".format(time_end - time_start))
        except Exception as e:
            print("Exception:", e)


if __name__ == '__main__':
    c = crawl()
    c.main()
# coding=utf-8
"""
@author : tongwei
@Date   : 2019/5/22
@File   : 2_draw.py
"""

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# Count the number of postings for each (big-data-related) job title

# Make Chinese text render correctly
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)

# engine='python' allows the CSV filename to contain Chinese characters
df = pd.read_csv('recruit_data.csv', encoding='utf-8', engine='python')
# Check for missing city values
df['city'].isnull().value_counts()
# Summary statistics for spotting outliers
df.describe()
# Job titles to count
job_list = ['JAVA开发工程师', '大数据架构师', '大数据开发工程师', '大数据分析师', '大数据算法工程师']
# Posting counts; the Java title is matched against its common capitalizations
job_number = [len(df[df.post.str.contains('java开发工程师|Java开发工程师|JAVA开发工程师', na=False)])]

for job in job_list[1:]:
    job_number.append(len(df[df.post.str.contains(job, na=False)]))

plt.figure(figsize=(12, 6))

plt.bar(job_list, job_number, width=0.5, label='岗位数量')
plt.title('大数据相关招聘职位数量')
plt.legend()
plt.xlabel('职位名称')
plt.ylabel('岗位数量/个')
# Annotate each bar with its count
for a, b in zip(job_list, job_number):
    plt.text(a, b + 0.1, '%.0f' % b, ha='center', va='bottom', fontsize=13)
plt.show()
IV. Conclusions (10 points)
1. What conclusions can be drawn from the analysis and visualization of the topic data?
Among the first-tier cities, demand for Java developers is comparatively high.
A solid command of Java is therefore a real advantage when looking for a job.
2. Give a brief summary of how this programming assignment was completed.
I have not been working with Python for long, and even with two of us collaborating the task was still fairly hard to finish. While writing the program I found that making it run is not just a matter of applying the methods we had learned; the code has to be hardened step by step against real requirements, and a good program must be easy to extend.
Applying crawling techniques end to end gave me a better sense of what crawlers are good for, made me more focused in solving the difficulties I ran into, and showed how big data can reveal what society demands.