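1. lxml is a Python library (a binding for the C libraries libxml2 and libxslt) that processes XML and HTML quickly and flexibly. It supports XPath (XML Path Language). Here we use lxml's etree module to scrape information from a website.
2. Beautiful Soup is a Python library for extracting data from HTML or XML files. It supports the HTML parser in the Python standard library as well as third-party parsers such as lxml. Beautiful Soup itself does not use XPath syntax: elements are located with its own search methods (find, find_all) or with CSS selectors (select).
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.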
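As a small illustration of the encoding behaviour just described, the sketch below feeds Beautiful Soup a GBK-encoded byte string and checks that text comes back as a Unicode str while serialized output is UTF-8 bytes. The sample markup and the choice of the built-in "html.parser" are assumptions made for this example only; they are not part of the scraper that follows.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny document encoded as GBK rather than UTF-8.
raw = "<p>深圳</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the input bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()     # Unicode str, whatever the input encoding was
out_bytes = soup.p.encode()  # serialized as UTF-8 bytes by default

print(type(text).__name__, text)
print(out_bytes == "<p>深圳</p>".encode("utf-8"))
```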
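The page we scrape is the Tencent recruitment site: https://hr.tencent.com/position.php?&start=10#a
For each posting we want the job title, job category, number of openings, work location, and publish date.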
I. Scraping with etree
1. Import the libraries
from lxml import etree
from urllib import request  # worth reading up on how urllib differs from requests
import json
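In Python 3 the request module lives inside the urllib package (urllib.request). The json module is used to save the output as a JSON file.
2. Fetch the page and open the JSON output file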
response = request.urlopen('https://hr.tencent.com/position.php?&start=10#a')  # fetch the page
resHtml = response.read()  # raw bytes of the HTML
output = open('tencent1.json', 'wb+')  # open in binary mode; records are written to this JSON file
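If the file is opened with plain 'w' (text mode), writing to it raises an error:
write() argument must be str, not bytes
The code below writes UTF-8 encoded bytes, so the file has to be opened in binary mode: change the mode to 'wb+'.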
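The difference between the two modes can be sketched as follows. This is only an illustration under assumed names: the temporary file path and the sample record are made up for the demo. It reproduces the error, then shows two ways that work: binary mode with bytes (the approach used in this post) and text mode with an explicit encoding and a str.

```python
import json
import os
import tempfile

# Hypothetical sample record and temporary path, used only for this demo.
item = {"name": "23677-互娛服務采購經理", "recruitNumber": "1"}
line = json.dumps(item, ensure_ascii=False) + "\n"
path = os.path.join(tempfile.mkdtemp(), "tencent_demo.json")

# Text mode + bytes: this is the combination that fails.
error_message = ""
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(line.encode("utf-8"))
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Option 1: binary mode, the caller encodes the str to bytes.
with open(path, "wb") as f:
    f.write(line.encode("utf-8"))

# Option 2: text mode with an explicit encoding, the caller writes str.
with open(path, "a", encoding="utf-8") as f:
    f.write(line)

# Both writes produce the same line, so the file now holds two equal records.
with open(path, encoding="utf-8") as f:
    records = [json.loads(row) for row in f]
print(len(records))
```

Either option is fine as long as the mode and the type passed to write() agree.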
3. Select the tags we need
html = etree.HTML(resHtml)
# Select every tr whose class is exactly "odd" or "even"; | takes the union of the two node sets
result = html.xpath('//tr[@class="odd"] | //tr[@class="even"]')
for site in result:
    item = {}
Each record is stored as a dict, so start by defining an empty one.
    name = site.xpath('./td[1]/a')[0].text
    detailLink = site.xpath('./td[1]/a')[0].attrib['href']
    catalog = site.xpath('./td[2]')[0].text
    recruitNumber = site.xpath('./td[3]')[0].text
    workLocation = site.xpath('./td[4]')[0].text
    publishTime = site.xpath('./td[5]')[0].text
These XPath expressions pick out the fields we need from each row.
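To try the row and cell expressions without hitting the network, they can be run against a small inline table. The markup below is a made-up stand-in for the real page (the real page's structure is only inferred from the XPath used in this post), and the job names are placeholders.

```python
from lxml import etree

# Hypothetical sample mimicking the table layout assumed by the XPath above.
sample = """
<table>
  <tr class="h"><td>Title</td><td>Category</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Job A</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-10-16</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Job B</a></td><td>Design</td><td>1</td><td>Shanghai</td><td>2018-10-16</td></tr>
</table>
"""

html = etree.HTML(sample)

# The union (|) matches rows of either class; the header row is skipped.
rows = html.xpath('//tr[@class="odd"] | //tr[@class="even"]')

records = []
for row in rows:
    link = row.xpath('./td[1]/a')[0]  # first cell holds the link
    records.append({
        "name": link.text,
        "detailLink": link.attrib["href"],
        "catalog": row.xpath('./td[2]')[0].text,
        "workLocation": row.xpath('./td[4]')[0].text,
    })

for record in records:
    print(record)
```

Note that td[1] is 1-based: XPath positions start at 1, unlike Python list indexes.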
4. Format the output
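    print(type(name))
    print(name, detailLink, catalog, recruitNumber, workLocation, publishTime)
    item['name'] = name
    item['detailLink'] = detailLink
    item['catalog'] = catalog
    item['recruitNumber'] = recruitNumber
    item['publishTime'] = publishTime
    # note: workLocation is extracted and printed but never stored in item,
    # so it does not appear in the JSON output below

    # ensure_ascii=False keeps Chinese characters readable instead of \u escapes
    line = json.dumps(item, ensure_ascii=False) + '\n'
    print(line)
    output.write(line.encode('utf-8'))  # encode to bytes for the binary-mode file

output.close()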
Running the script prints the following:
<class 'str'>
23677-互娛服務采購經理 position_detail.php?id=44802&keywords=&tid=0&lid=0 職能類 1 深圳 2018-10-16
{"catalog": "職能類", "name": "23677-互娛服務采購經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}

<class 'str'>
22989-騰訊雲塊存儲底層開發工程師(深圳) position_detail.php?id=44803&keywords=&tid=0&lid=0 技術類 2 深圳 2018-10-16
{"catalog": "技術類", "name": "22989-騰訊雲塊存儲底層開發工程師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0"}

<class 'str'>
24549-渠道管理經理(政策管理方向-上海) position_detail.php?id=44804&keywords=&tid=0&lid=0 市場類 1 上海 2018-10-16
{"catalog": "市場類", "name": "24549-渠道管理經理(政策管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0"}

<class 'str'>
24549-渠道管理經理(ROC管理方向-上海) position_detail.php?id=44805&keywords=&tid=0&lid=0 市場類 1 上海 2018-10-16
{"catalog": "市場類", "name": "24549-渠道管理經理(ROC管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0"}

<class 'str'>
24549-廣告營銷業務分析師(上海) position_detail.php?id=44806&keywords=&tid=0&lid=0 市場類 1 上海 2018-10-16
{"catalog": "市場類", "name": "24549-廣告營銷業務分析師(上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0"}

<class 'str'>
28297-RPG手游—市場和平台渠道推廣(深圳) position_detail.php?id=44809&keywords=&tid=0&lid=0 產品/項目類 1 深圳 2018-10-16
{"catalog": "產品/項目類", "name": "28297-RPG手游—市場和平台渠道推廣(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0"}

<class 'str'>
21309-在線教育-運營視覺設計師(深圳) position_detail.php?id=44800&keywords=&tid=0&lid=0 設計類 2 深圳 2018-10-16
{"catalog": "設計類", "name": "21309-在線教育-運營視覺設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0"}

<class 'str'>
21309-在線教育-UI設計師(深圳) position_detail.php?id=44801&keywords=&tid=0&lid=0 設計類 2 深圳 2018-10-16
{"catalog": "設計類", "name": "21309-在線教育-UI設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0"}

<class 'str'>
22989-數據庫高級產品運營經理 position_detail.php?id=44795&keywords=&tid=0&lid=0 產品/項目類 1 北京 2018-10-16
{"catalog": "產品/項目類", "name": "22989-數據庫高級產品運營經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0"}

<class 'str'>
27087-海外區域中心空間運營經理(深圳) position_detail.php?id=44797&keywords=&tid=0&lid=0 市場類 1 深圳 2018-10-16
{"catalog": "市場類", "name": "27087-海外區域中心空間運營經理(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0"}
The exported JSON file looks like this (one JSON object per line):
{"catalog": "職能類", "name": "23677-互娛服務采購經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}
{"catalog": "技術類", "name": "22989-騰訊雲塊存儲底層開發工程師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "24549-渠道管理經理(政策管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "24549-渠道管理經理(ROC管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "24549-廣告營銷業務分析師(上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0"}
{"catalog": "產品/項目類", "name": "28297-RPG手游—市場和平台渠道推廣(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0"}
{"catalog": "設計類", "name": "21309-在線教育-運營視覺設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0"}
{"catalog": "設計類", "name": "21309-在線教育-UI設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0"}
{"catalog": "產品/項目類", "name": "22989-數據庫高級產品運營經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "27087-海外區域中心空間運營經理(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0"}
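The detailLink values in the export are relative links. If absolute URLs are wanted, one possible approach is to resolve each link against the page URL with urllib.parse.urljoin from the standard library. The sketch below parses one record copied from the export; the detailUrl key is a hypothetical extra field introduced only for this illustration.

```python
import json
from urllib.parse import urljoin

PAGE_URL = "https://hr.tencent.com/position.php?&start=10#a"

# One record copied from the exported file.
line = (
    '{"catalog": "職能類", "name": "23677-互娛服務采購經理", '
    '"recruitNumber": "1", "publishTime": "2018-10-16", '
    '"detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}'
)

record = json.loads(line)

# urljoin replaces the last path segment of the base URL with the relative link.
# detailUrl is a hypothetical field name, not part of the original script.
record["detailUrl"] = urljoin(PAGE_URL, record["detailLink"])
print(record["detailUrl"])
```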
II. Scraping with Beautiful Soup
1. Import the libraries
from bs4 import BeautifulSoup
from urllib import request
import json
2. Fetch the page and open the JSON output file
response = request.urlopen('https://hr.tencent.com/position.php?&start=10#a')
resHtml = response.read()
output = open('tencent2.json', 'wb+')
3. Select the tags we need
html = BeautifulSoup(resHtml, 'lxml')  # parse with the lxml parser
result = html.select('tr[class="even"]')
result2 = html.select('tr[class="odd"]')
result += result2  # all "even" rows first, then all "odd" rows
print(len(result))

for site in result:
    item = {}

    name = site.select('td a')[0].get_text()
    # A Tag corresponds to an HTML tag; its two main attributes are name and attrs
    detailLink = site.select('td a')[0].attrs['href']
    catalog = site.select('td')[1].get_text()
    recruitNumber = site.select('td')[2].get_text()
    workLocation = site.select('td')[3].get_text()
    publishTime = site.select('td')[4].get_text()
4. Format the output
    item['name'] = name
    item['detailLink'] = detailLink
    item['catalog'] = catalog
    item['recruitNumber'] = recruitNumber
    item['workLocation'] = workLocation
    item['publishTime'] = publishTime

    line = json.dumps(item, ensure_ascii=False)
    print(line)

    # append a newline so each record lands on its own line in the file
    output.write((line + '\n').encode('utf-8'))

output.close()
Running the script prints the following:
10
{"detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0", "catalog": "職能類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "23677-互娛服務采購經理", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-渠道管理經理(政策管理方向-上海)", "workLocation": "上海"}
{"detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-廣告營銷業務分析師(上海)", "workLocation": "上海"}
{"detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0", "catalog": "設計類", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "21309-在線教育-運營視覺設計師(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0", "catalog": "產品/項目類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "22989-數據庫高級產品運營經理", "workLocation": "北京"}
{"detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0", "catalog": "技術類", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "22989-騰訊雲塊存儲底層開發工程師(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-渠道管理經理(ROC管理方向-上海)", "workLocation": "上海"}
{"detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0", "catalog": "產品/項目類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "28297-RPG手游—市場和平台渠道推廣(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0", "catalog": "設計類", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "21309-在線教育-UI設計師(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "27087-海外區域中心空間運營經理(深圳)", "workLocation": "深圳"}
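Because the script concatenates two separate select() results, the run above lists every "even" row before any "odd" row. If page order matters, one alternative (not what the script above does) is a single CSS selector group, which returns matches in document order. The sketch below uses made-up markup and the built-in "html.parser" instead of lxml, purely to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with alternating row classes, as on the real page.
sample = """
<table>
  <tr class="h"><td>Title</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Job A</a></td><td>Tech</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Job B</a></td><td>Design</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=3">Job C</a></td><td>Market</td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# A comma-separated selector group matches rows of either class in page order.
rows = soup.select("tr.even, tr.odd")

records = []
for row in rows:
    cells = row.select("td")
    link = cells[0].select_one("a")
    records.append({
        "name": link.get_text(),
        "detailLink": link.attrs["href"],
        "catalog": cells[1].get_text(),
    })

print([record["name"] for record in records])
```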
Those are two ways to scrape information from the site. Personally, I find Beautiful Soup the more convenient of the two.