A Summary of Common Methods in the BeautifulSoup Module

Reposted article, 2017-09-05 18:13. Tag: python

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document.

import requests
from bs4 import BeautifulSoup

url = "http://desk.zol.com.cn/"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")

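I. Parsers:
1.BeautifulSoup(markup, "html.parser") (Python's built-in HTML parser, no extra install)
2.BeautifulSoup(markup, "lxml") (lxml's HTML parser, requires the lxml package)
3.BeautifulSoup(markup, "xml") (lxml's XML parser, requires the lxml package)
4.BeautifulSoup(markup, "html5lib") (requires the html5lib package)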

II. Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
　　Tag , NavigableString , BeautifulSoup , Comment .
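
As a rough illustration of the four object kinds, the sketch below parses a tiny hand-written snippet (the markup is made up for this example) and prints the type of each node:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up markup: one tag holding a string and a comment
html = "<p class='intro'>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p                    # a Tag
text, comment = p.contents    # a NavigableString and a Comment

for obj in (soup, p, text, comment):
    print(type(obj).__name__, repr(obj))
```

Note that Comment is a subclass of NavigableString, so an isinstance check against NavigableString also matches comments.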

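1. Tag:
Any tag that appears in the HTML document can be accessed as soup.<tag>.
When the document contains several <tag> elements, soup.<tag> returns the first one.
For example,

      soup.a --> returns the first <a> tag;
      soup.a.name --> returns the name of the <a> tag;
      soup.a.parent.name --> returns the name of the tag one level above <a>;
      soup.a.parent.parent.name --> returns the name of the tag two levels above <a>;

      soup.a.attrs --> returns all attributes of the <a> tag as a dict;
      soup.a.attrs['class'] --> returns the class attribute of the <a> tag (a list, because class can hold several values);

      soup.a.string --> returns the text inside the <a> tag (the content between <a> and </a>); it only works when the tag contains a single string, and returns None when the tag has several children.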
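
A minimal sketch of these attributes on a small hand-written document (the markup, class names, and hrefs are assumptions made up for the example):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p><a class="nav main" href="/first">First</a></p>
<p><a href="/second">Second <b>link</b></a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a                       # soup.<tag> returns the first match
print(first.name)                    # tag name
print(first.parent.name)             # name of the enclosing tag
print(first.parent.parent.name)      # two levels up
print(first.attrs)                   # all attributes as a dict
print(first.attrs["class"])          # class is returned as a list
print(first.string)                  # the single string inside the tag

second = soup.find_all("a")[1]
print(second.string)                 # None: this <a> has two children
```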

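soup.get_text() --> returns all the text under a tag, including the text of its descendants. soup.get_text(" ", strip=True) strips whitespace from each piece of text and joins the pieces with a space, which removes blank lines;

soup.strings --> if a tag contains several strings, loop over .strings to get each of them;

soup.stripped_strings --> the strings produced by soup.strings may contain many spaces or blank lines; .stripped_strings removes the surplus whitespace and skips strings that are empty after stripping;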

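III. Traversing HTML content with the bs4 library
      soup.contents --> a list of the top-level nodes of the document;
      soup.a.contents --> a list of all direct children of the <a> tag;
      soup.a.children --> similar to contents, but an iterator for looping over the direct children;
      soup.a.descendants --> an iterator for looping over all descendants (children, grandchildren, and so on);
Note: the BeautifulSoup object itself always contains child nodes; the <html> tag is a child of the BeautifulSoup object.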
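
The following sketch contrasts direct children with descendants on a small made-up list. Strings are nodes too, so the example filters on Tag when it only wants tag names:

```python
from bs4 import BeautifulSoup, Tag

html = "<ul><li>one</li><li><b>two</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

direct = ul.contents                              # list of direct children
child_names = [c.name for c in ul.children]       # same nodes, as an iterator
all_nodes = list(ul.descendants)                  # tags and strings, depth-first
tag_names = [d.name for d in all_nodes if isinstance(d, Tag)]
top_level = [c.name for c in soup.contents]       # children of the soup object

print(direct)
print(child_names)
print(tag_names)
print(top_level)
```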

      soup.prettify() --> displays the HTML in a more readable form: prettify() puts each tag and each string on its own line (it inserts '\n') and indents nested tags.
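
A short sketch of prettify() on a one-line snippet (made-up markup); the exact whitespace of the result can differ slightly between bs4 versions, so no output is listed here:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")

# prettify() returns a string with one tag or string per line
pretty = soup.prettify()
print(pretty)
```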

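IV. Extracting information
      soup.find_all(name,attrs,recursive,string,limit,**kwargs)
      　　name: filter on the tag name;
      　　attrs: filter on attribute values;
      　　recursive: whether to search all descendants, True by default (False searches direct children only);
      　　string: filter on the string content between <>...</>;
      　　limit: stop after this many matches.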

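For example, soup.find_all('a')
      soup.find_all(['a','b'])

Note: find_all() also accepts regular expressions. The call below matches every tag whose name starts with "a" (a, abbr, address, ...):
      soup.find_all(re.compile(r'^a'))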
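
The filters above can be tried out on a small hand-written document. The markup and the class name "ext" are assumptions made up for this sketch:

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
  <a class="ext" href="http://example.com/1">one</a>
  <b>bold</b>
  <abbr title="x">abbr</abbr>
  <p><a href="/2">two</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = [t.name for t in soup.find_all("a")]                 # one tag name
by_list = [t.name for t in soup.find_all(["a", "b"])]          # any name in the list
by_regex = [t.name for t in soup.find_all(re.compile(r"^a"))]  # names starting with "a"
by_attr = soup.find_all("a", attrs={"class": "ext"})           # filter on an attribute
direct_only = soup.div.find_all("a", recursive=False)          # direct children only
by_string = soup.find_all(string="two")                        # match string content

print(by_name, by_list, by_regex)
print(by_attr, direct_only, by_string)
```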

Example 1:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.163hnzk.com/index_pc.php")
html = response.content
soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")
spans = soup.find_all(name='span', attrs={'class': 'newstitle'})

# Collect the href of the <a> inside each matching <span>
hrefs = []
for span in spans:
    hrefs.append(span.a.attrs['href'])

for url in hrefs:
    # The URL contains characters that are not allowed in file names,
    # so only the part after '?' is used as the file name
    with open(r"E:\%s" % url.split('?')[1], "wb") as f:
        # The file is opened in 'wb' mode, so write .content (bytes); with 'w' use .text
        f.write(requests.get("http://www.163hnzk.com/" + url).content)

Example 2:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# request() parses one results page and returns the extracted records
def request(number):
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'}
    html = requests.get("https://hr.tencent.com/position.php?&start=" + str(number), headers=header).text
    # html is already decoded text, so from_encoding is not needed here
    soup = BeautifulSoup(html, 'html.parser')
    # The data rows carry the CSS class "even" or "odd"
    evens = soup.find_all(name='tr', class_='even')
    odds = soup.find_all(name='tr', class_='odd')
    trs = evens + odds
    rows = []
    for tr in trs:
        dct = {}
        dct["Position"] = tr.select('td a')[0].get_text()
        dct["Category"] = tr.select('td')[1].get_text()
        dct["Headcount"] = tr.select('td')[2].get_text()
        dct["Location"] = tr.select('td')[3].get_text()
        dct["Published"] = tr.select('td')[4].get_text()
        dct["Link"] = 'http://hr.tencent.com/' + tr.select('td a')[0].attrs['href']
        rows.append(dct)
    return rows

# Save the records as a CSV file with pandas
def read_write(lst):
    # A list of dicts can be passed directly to DataFrame; the dict keys become the column names
    dataframe = pd.DataFrame(lst)
    dataframe.to_csv(r'E:\zhaopin.csv', index=False, encoding='utf-8')

if __name__ == "__main__":
    # lst holds the scraped records
    lst = []
    # Only scrape the first 5 pages (the start offset grows by 10 per page)
    for number in range(0, 50, 10):
        lst.extend(request(number))
    read_write(lst)
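
The row extraction in this example can be tried without network access. The sketch below runs the same select() calls on a hypothetical table; the sample markup, the parse_rows helper, and the dictionary keys are assumptions for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the row layout the scraper expects
SAMPLE = """
<table>
  <tr class="head"><td>Title</td></tr>
  <tr class="even"><td><a href="detail.php?id=1">Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2017-09-01</td></tr>
  <tr class="odd"><td><a href="detail.php?id=2">Designer</a></td><td>Design</td><td>1</td><td>Beijing</td><td>2017-09-02</td></tr>
</table>
"""

def parse_rows(html):
    """Return one dict per <tr> whose class is 'even' or 'odd'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # A list as the class_ filter matches rows carrying either class
    for tr in soup.find_all("tr", class_=["even", "odd"]):
        cells = tr.select("td")        # all cells of this row
        link = tr.select("td a")[0]    # first link inside a cell
        rows.append({
            "title": link.get_text(),
            "category": cells[1].get_text(),
            "location": cells[3].get_text(),
            "link": link.attrs["href"],
        })
    return rows

rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Unlike concatenating the even and odd result lists, filtering on both classes in one call keeps the rows in document order.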



See also: the official Beautiful Soup documentation
