Using etree and Beautiful Soup


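1. lxml is a Python library, built on the C libraries libxml2 and libxslt, that processes XML and HTML quickly and flexibly and supports XPath (XML Path Language). Here we use its etree module to scrape information from a website.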

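2. Beautiful Soup is a Python library for extracting data from HTML or XML files. It supports the HTML parser in the Python standard library as well as third-party parsers such as lxml. Note that Beautiful Soup itself does not support XPath; elements are located with its own search methods (find, find_all) or with CSS selectors (select).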

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
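To make the encoding behaviour concrete, here is a minimal sketch. The GBK-encoded fragment is a made-up sample, not data from the site, and the `from_encoding` hint is passed only so that the example does not depend on encoding detection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny HTML fragment encoded as GBK bytes,
# standing in for a page served in a non-UTF-8 encoding.
raw = "<p>深圳</p>".encode("gbk")

# "html.parser" is the parser from the standard library; passing "lxml"
# instead would select the third-party lxml parser if it is installed.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()             # decoded to a Python str (Unicode)
utf8_bytes = soup.p.encode("utf-8")  # serialized back out as UTF-8 bytes

print(type(text), text)
print(utf8_bytes)
```

Whatever encoding the input arrived in, the text is handled as `str` inside the program and only becomes bytes again on output.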

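The page we scrape is a listing page of the Tencent recruitment site: https://hr.tencent.com/position.php?&start=10#a

For each posting we want the job title, job category, number of openings, work location, and publication date.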
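The URL carries a `start` query parameter and a `#a` fragment. The sketch below takes the URL apart and builds URLs for further pages. It assumes that `start` is a zero-based row offset and that each page holds 10 rows; both are guesses based on the URL and the ten results shown later, and `page_url` is a hypothetical helper, not part of the original script.

```python
from urllib.parse import urlencode, urldefrag

BASE = "https://hr.tencent.com/position.php"

def page_url(page, page_size=10):
    """Return the listing URL for a zero-based page number (assumed scheme)."""
    return BASE + "?" + urlencode({"start": page * page_size})

# The fragment (#a) is used by the browser only and is not sent to the server.
url, fragment = urldefrag("https://hr.tencent.com/position.php?&start=10#a")

print(url, fragment)
print([page_url(p) for p in range(3)])
```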

I. Scraping with etree

1. Import the libraries

from lxml import etree
from urllib import request  # TODO: look further into how urllib differs from requests
import json

 

In Python 3 we fetch the page with the request module of the urllib package and save the output to a JSON file.
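As a side note, `urllib.request` also lets you set headers and a timeout. The following is a sketch only: the User-Agent string is an arbitrary placeholder, `build_request` and `fetch` are hypothetical helper names, and whether the site needs any of this is not something the original post states.

```python
from urllib import request

def build_request(url):
    """Wrap url in a Request that carries a (placeholder) User-Agent header."""
    return request.Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})

def fetch(url, timeout=10):
    """Download url and return the raw response body as bytes."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network; only fetch() does.
req = build_request("https://hr.tencent.com/position.php?&start=10#a")
print(req.host, req.get_header("User-agent"))
```

`fetch` returns bytes, which matches what `response.read()` produces in the script below.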

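2. Fetch the page and open the JSON output file

response = request.urlopen('https://hr.tencent.com/position.php?&start=10#a')  # request the listing page
resHtml = response.read()  # raw page content as bytes
output = open('tencent1.json', 'wb+')  # open in binary mode to write the JSON output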

 

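If the file is opened with plain w (text mode), the write call later in the script fails with:

TypeError: write() argument must be str, not bytes

The script writes line.encode('utf-8'), which is a bytes object, so the file has to be opened in binary mode; here we use wb+.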
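The error quoted above can be reproduced in isolation, and there are two ways round it. This sketch writes to a temporary file (the path and the sample record are placeholders) rather than to tencent1.json.

```python
import json
import os
import tempfile

line = json.dumps({"name": "example"}, ensure_ascii=False) + "\n"
path = os.path.join(tempfile.mkdtemp(), "demo.json")

# Text mode rejects bytes: this reproduces the error quoted above.
error_message = ""
with open(path, "w", encoding="utf-8") as f:
    try:
        f.write(line.encode("utf-8"))
    except TypeError as exc:
        error_message = str(exc)

# Fix 1 (used in this article): binary mode plus an explicit encode().
with open(path, "wb") as f:
    f.write(line.encode("utf-8"))

# Fix 2: text mode with an explicit encoding, writing str directly.
with open(path, "a", encoding="utf-8") as f:
    f.write(line)

with open(path, encoding="utf-8") as f:
    lines = f.readlines()

print(error_message)
print(lines)
```

Both fixes put the same UTF-8 bytes on disk; the second one simply lets the file object do the encoding.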

3. Select the tags we need

 

html = etree.HTML(resHtml)
# select every tr whose class is exactly "odd" or "even"; | is the XPath union operator
result = html.xpath('//tr[@class="odd"] | //tr[@class="even"]')
for site in result:
    item = {}

 

Each record is stored as a dictionary so that it can be serialized to JSON later, so we start from an empty dict.

    name = site.xpath('./td[1]/a')[0].text
    detailLink = site.xpath('./td[1]/a')[0].attrib['href']
    catalog = site.xpath('./td[2]')[0].text
    recruitNumber = site.xpath('./td[3]')[0].text
    workLocation = site.xpath('./td[4]')[0].text
    publishTime = site.xpath('./td[5]')[0].text

 

These relative XPath expressions pick out the fields we need from each row.
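The same selection can be tried offline. The sketch below runs it against a small hand-written table; the sample HTML only imitates the structure the XPath expressions expect and is not copied from the real page.

```python
from lxml import etree

# Hypothetical sample that mimics the listing table's structure.
SAMPLE = """
<table>
  <tr class="h"><td>Title</td></tr>
  <tr class="even"><td><a href="detail.php?id=1">Job A</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-10-16</td></tr>
  <tr class="odd"><td><a href="detail.php?id=2">Job B</a></td><td>Design</td><td>1</td><td>Beijing</td><td>2018-10-16</td></tr>
</table>
"""

html = etree.HTML(SAMPLE)
# The union returns matching rows in document order; the header row is skipped.
rows = html.xpath('//tr[@class="odd"] | //tr[@class="even"]')

jobs = []
for row in rows:
    link = row.xpath('./td[1]/a')[0]   # XPath positions are 1-based
    jobs.append({
        "name": link.text,
        "detailLink": link.get("href"),
        "catalog": row.xpath('./td[2]')[0].text,
        "workLocation": row.xpath('./td[4]')[0].text,
    })

print(jobs)
```

Note that `.text` only returns the text before an element's first child, so it suits simple cells like these.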

4. Format the output

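    print(type(name))
    print(name, detailLink, catalog, recruitNumber, workLocation, publishTime)
    item['name'] = name
    item['detailLink'] = detailLink
    item['catalog'] = catalog
    item['recruitNumber'] = recruitNumber
    # Note: workLocation is printed above but never stored in item, so it is
    # missing from the JSON below; add item['workLocation'] = workLocation to keep it.
    item['publishTime'] = publishTime

    # ensure_ascii=False keeps the Chinese text readable instead of \uXXXX escapes
    line = json.dumps(item, ensure_ascii=False) + '\n'
    print(line)
    output.write(line.encode('utf-8'))  # encode to UTF-8 bytes for the binary-mode file

output.close()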

 

Running the script produces the following output:

<class 'str'>
23677-互娛服務采購經理 position_detail.php?id=44802&keywords=&tid=0&lid=0 職能類 1 深圳 2018-10-16
{"catalog": "職能類", "name": "23677-互娛服務采購經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}

<class 'str'>
22989-騰訊雲塊存儲底層開發工程師(深圳) position_detail.php?id=44803&keywords=&tid=0&lid=0 技術類 2 深圳 2018-10-16
{"catalog": "技術類", "name": "22989-騰訊雲塊存儲底層開發工程師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0"}

<class 'str'>
24549-渠道管理經理(政策管理方向-上海) position_detail.php?id=44804&keywords=&tid=0&lid=0 市場類 1 上海 2018-10-16
{"catalog": "市場類", "name": "24549-渠道管理經理(政策管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0"}

<class 'str'>
24549-渠道管理經理(ROC管理方向-上海) position_detail.php?id=44805&keywords=&tid=0&lid=0 市場類 1 上海 2018-10-16
{"catalog": "市場類", "name": "24549-渠道管理經理(ROC管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0"}

<class 'str'>
24549-廣告營銷業務分析師(上海) position_detail.php?id=44806&keywords=&tid=0&lid=0 市場類 1 上海 2018-10-16
{"catalog": "市場類", "name": "24549-廣告營銷業務分析師(上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0"}

<class 'str'>
28297-RPG手游—市場和平台渠道推廣(深圳) position_detail.php?id=44809&keywords=&tid=0&lid=0 產品/項目類 1 深圳 2018-10-16
{"catalog": "產品/項目類", "name": "28297-RPG手游—市場和平台渠道推廣(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0"}

<class 'str'>
21309-在線教育-運營視覺設計師(深圳) position_detail.php?id=44800&keywords=&tid=0&lid=0 設計類 2 深圳 2018-10-16
{"catalog": "設計類", "name": "21309-在線教育-運營視覺設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0"}

<class 'str'>
21309-在線教育-UI設計師(深圳) position_detail.php?id=44801&keywords=&tid=0&lid=0 設計類 2 深圳 2018-10-16
{"catalog": "設計類", "name": "21309-在線教育-UI設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0"}

<class 'str'>
22989-數據庫高級產品運營經理 position_detail.php?id=44795&keywords=&tid=0&lid=0 產品/項目類 1 北京 2018-10-16
{"catalog": "產品/項目類", "name": "22989-數據庫高級產品運營經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0"}

<class 'str'>
27087-海外區域中心空間運營經理(深圳) position_detail.php?id=44797&keywords=&tid=0&lid=0 市場類 1 深圳 2018-10-16
{"catalog": "市場類", "name": "27087-海外區域中心空間運營經理(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0"}

 

The exported JSON file looks like this, with one JSON object per line:

{"catalog": "職能類", "name": "23677-互娛服務采購經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0"}
{"catalog": "技術類", "name": "22989-騰訊雲塊存儲底層開發工程師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "24549-渠道管理經理(政策管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "24549-渠道管理經理(ROC管理方向-上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "24549-廣告營銷業務分析師(上海)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0"}
{"catalog": "產品/項目類", "name": "28297-RPG手游—市場和平台渠道推廣(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0"}
{"catalog": "設計類", "name": "21309-在線教育-運營視覺設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0"}
{"catalog": "設計類", "name": "21309-在線教育-UI設計師(深圳)", "recruitNumber": "2", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0"}
{"catalog": "產品/項目類", "name": "22989-數據庫高級產品運營經理", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0"}
{"catalog": "市場類", "name": "27087-海外區域中心空間運營經理(深圳)", "recruitNumber": "1", "publishTime": "2018-10-16", "detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0"}
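Because the file holds one object per line rather than a single JSON document, it has to be read back line by line. The sketch below shows that, together with the effect of `ensure_ascii`; the record is a made-up sample and `read_json_lines` is a hypothetical helper.

```python
import json

record = {"name": "深圳", "recruitNumber": "1"}

escaped = json.dumps(record)                       # default: non-ASCII escaped as \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as they are

def read_json_lines(text):
    """Parse text holding one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = read_json_lines(readable + "\n" + readable + "\n")

print(escaped)
print(readable)
print(records)
```

Calling `json.load` on the whole file would fail once it contains more than one record, since several top-level objects in a row are not valid JSON.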

 

 

II. Scraping with Beautiful Soup

1. Import the libraries

from bs4 import BeautifulSoup
from urllib import request
import json

2. Fetch the page and open the JSON output file

 

response = request.urlopen('https://hr.tencent.com/position.php?&start=10#a')
resHtml = response.read()
output = open('tencent2.json', 'wb+')

 

3. Select the tags we need

html = BeautifulSoup(resHtml, 'lxml')  # parse with the lxml parser
result = html.select('tr[class="even"]')
result2 = html.select('tr[class="odd"]')
result += result2  # all "even" rows first, then all "odd" rows
print(len(result))

for site in result:
    item = {}

    name = site.select('td a')[0].get_text()
    # a Tag object represents one HTML tag; its two key attributes are name and attrs
    detailLink = site.select('td a')[0].attrs['href']
    catalog = site.select('td')[1].get_text()
    recruitNumber = site.select('td')[2].get_text()
    workLocation = site.select('td')[3].get_text()
    publishTime = site.select('td')[4].get_text()

 

4. Format the output

    item['name'] = name
    item['detailLink'] = detailLink
    item['catalog'] = catalog
    item['recruitNumber'] = recruitNumber
    item['workLocation'] = workLocation
    item['publishTime'] = publishTime

    line = json.dumps(item, ensure_ascii=False)
    print(line)

    # append a newline so that each record lands on its own line in the file
    output.write((line + '\n').encode('utf-8'))

output.close()

 

Running the script produces the following output:

10
{"detailLink": "position_detail.php?id=44802&keywords=&tid=0&lid=0", "catalog": "職能類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "23677-互娛服務采購經理", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44804&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-渠道管理經理(政策管理方向-上海)", "workLocation": "上海"}
{"detailLink": "position_detail.php?id=44806&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-廣告營銷業務分析師(上海)", "workLocation": "上海"}
{"detailLink": "position_detail.php?id=44800&keywords=&tid=0&lid=0", "catalog": "設計類", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "21309-在線教育-運營視覺設計師(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44795&keywords=&tid=0&lid=0", "catalog": "產品/項目類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "22989-數據庫高級產品運營經理", "workLocation": "北京"}
{"detailLink": "position_detail.php?id=44803&keywords=&tid=0&lid=0", "catalog": "技術類", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "22989-騰訊雲塊存儲底層開發工程師(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44805&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "24549-渠道管理經理(ROC管理方向-上海)", "workLocation": "上海"}
{"detailLink": "position_detail.php?id=44809&keywords=&tid=0&lid=0", "catalog": "產品/項目類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "28297-RPG手游—市場和平台渠道推廣(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44801&keywords=&tid=0&lid=0", "catalog": "設計類", "publishTime": "2018-10-16", "recruitNumber": "2", "name": "21309-在線教育-UI設計師(深圳)", "workLocation": "深圳"}
{"detailLink": "position_detail.php?id=44797&keywords=&tid=0&lid=0", "catalog": "市場類", "publishTime": "2018-10-16", "recruitNumber": "1", "name": "27087-海外區域中心空間運營經理(深圳)", "workLocation": "深圳"}
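In this output the rows do not follow page order: every "even" row is listed before any "odd" row, because the two select() results were concatenated. If page order matters, one possible alternative is a single CSS selector group. The sketch below compares the two on a hand-written sample table (hypothetical data), using the standard-library parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the alternating row classes.
SAMPLE = """
<table>
  <tr class="even"><td><a href="detail.php?id=1">Job A</a></td><td>Tech</td></tr>
  <tr class="odd"><td><a href="detail.php?id=2">Job B</a></td><td>Design</td></tr>
  <tr class="even"><td><a href="detail.php?id=3">Job C</a></td><td>Market</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Two separate selects, concatenated: all "even" rows come first.
grouped = soup.select('tr[class="even"]') + soup.select('tr[class="odd"]')

# One selector group: rows come back in document order.
ordered = soup.select("tr.even, tr.odd")

def names(rows):
    """Return the link text of the first cell of each row."""
    return [row.select("td a")[0].get_text(strip=True) for row in rows]

print(names(grouped))
print(names(ordered))
```

The XPath union used in the etree version already returns rows in document order, which is why its output is ordered differently from this one.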

 

Those are two ways to scrape the site. Personally, I find Beautiful Soup the more convenient of the two.

 

