Scraping Zhihu Explore with Beautiful Soup [method selector: find_all] [CSS selector: select]


Using Beautiful Soup

When parsing, Beautiful Soup actually relies on a parser. Besides the HTML parser in the Python standard library, it also supports several third-party parsers (such as lxml).

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor fault tolerance in versions before Python 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python dependency |
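To see these differences in practice, here is a minimal sketch (assuming lxml and html5lib are installed alongside the standard library) that feeds the same broken markup to each parser:

```python
from bs4 import BeautifulSoup

# Deliberately malformed HTML: unclosed tags.
html = "<p>Hello<li>World"

# Each parser repairs the markup differently; html5lib builds the full
# <html><head><body> skeleton, just as a browser would.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print(parser, "->", soup)
```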

1. The lxml parser can handle both HTML and XML, it is fast, and it is fault-tolerant, so we use it first.

The username appears in two different places in the page markup: some entries put it in a tag with class author-link (1), others in a tag with class name (2), so the code checks for author-link first:

```python
if item.find_all(class_='author-link'):
    author = item.find_all(class_='author-link')[0].string
else:
    author = item.find_all(class_='name')[0].string
```

In addition, there are many other query methods whose usage is exactly the same as the find_all() and find() methods described above; only the query scope differs. They are briefly summarized below, with a sketch after the list.

find_parents() and find_parent(): the former returns all ancestor nodes; the latter returns the direct parent node.

find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes; the latter returns the first following sibling.

find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes; the latter returns the first preceding sibling.

find_all_next() and find_next(): the former returns all matching nodes after the current node; the latter returns the first matching node after it.

find_all_previous() and find_previous(): the former returns all matching nodes before the current node; the latter returns the first matching node before it.
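A minimal sketch of these scoped queries, using hypothetical markup shaped like one feed entry:

```python
from bs4 import BeautifulSoup

html = """
<div class="feed">
  <h2>Question</h2>
  <p class="author">Alice</p>
  <p class="content">Answer text</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")
author = soup.find(class_="author")

print(author.find_parent("div"))           # the enclosing <div class="feed">
print(author.find_next_sibling("p"))       # <p class="content">Answer text</p>
print(author.find_previous_sibling("h2"))  # <h2>Question</h2>
print(author.find_all_next("p"))           # every <p> after this node
```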


The bio can be read either as an attribute value or as text:

```python
q = item.find_all(class_='bio')[0].string           # bio as visible text
q = item.find_all(class_='bio')[0].attrs['title']   # bio from the title attribute
```
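A small sketch of the difference (the markup here is hypothetical; presumably the page truncates the visible bio while the title attribute holds the full text, which is why the second form is preferred):

```python
from bs4 import BeautifulSoup

html = '<span class="bio" title="Software engineer at Example">Software eng…</span>'
tag = BeautifulSoup(html, 'lxml').find(class_='bio')

print(tag.string)          # visible text: 'Software eng…'
print(tag.attrs['title'])  # attribute value: 'Software engineer at Example'
```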

```python
import requests
import json
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Each feed entry carries both classes: "explore-feed feed-item".
items = soup.find_all(class_='explore-feed feed-item')
for item in items:
    question = item.find_all('h2')[0].string
    # The username sits under 'author-link' when present, otherwise under 'name'.
    if item.find_all(class_='author-link'):
        author = item.find_all(class_='author-link')[0].string
    else:
        author = item.find_all(class_='name')[0].string
    answer = item.find_all(class_='content')[0].string
    # Read the full bio from the tag's title attribute.
    q = item.find_all(class_='bio')[0].attrs['title']

    explore = {
        "question": question,
        "author": author,
        "answer": answer,
        "q": q,
    }

    # Append one JSON object per line; utf-8 keeps non-ASCII text intact.
    with open("explore.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(explore, ensure_ascii=False) + "\n")
```
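Because each run appends one JSON object per line, the output file is in JSON Lines format. A minimal sketch for reading the records back:

```python
import json

with open("explore.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for r in records:
    print(r["question"], "-", r["author"])
```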


Alternatively, iterate over the bio tags and read the attribute with get(), which returns None instead of raising KeyError when the attribute is missing:

```python
for t in item.find_all(class_='bio'):
    q = t.get('title')
```
```python
import requests
import json
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

items = soup.find_all(class_='explore-feed feed-item')
for item in items:
    question = item.find_all('h2')[0].string
    if item.find_all(class_='author-link'):
        author = item.find_all(class_='author-link')[0].string
    else:
        author = item.find_all(class_='name')[0].string
    answer = item.find_all(class_='content')[0].string
    # get() returns None if the 'title' attribute is absent.
    for t in item.find_all(class_='bio'):
        q = t.get('title')
    print(q)

    explore = {
        "question": question,
        "author": author,
        "answer": answer,
        "q": q,
    }

    with open("explore.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(explore, ensure_ascii=False) + "\n")
```


2. Using the HTML parser from the Python standard library


soup = BeautifulSoup(r.text, 'html.parser')
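Since html.parser ships with Python, no third-party package is needed; it is slower than lxml but has no external dependencies. The rest of the scraping code works unchanged. A minimal sketch:

```python
from bs4 import BeautifulSoup

# No extra install required: html.parser is part of the standard library.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
print(soup.p.string)  # Hello
```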

3. Beautiful Soup also provides another kind of selector: CSS selectors.

To use CSS selectors, just call the select() method and pass in the appropriate CSS selector.
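Before the full scraper, a standalone sketch of select() on hypothetical markup; descendant selectors work just as they do in CSS:

```python
from bs4 import BeautifulSoup

html = """
<div class="explore-tab">
  <div class="feed-item"><h2>First question</h2></div>
  <div class="feed-item"><h2>Second question</h2></div>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# '.explore-tab .feed-item' matches any .feed-item nested under .explore-tab.
for item in soup.select(".explore-tab .feed-item"):
    print(item.select("h2")[0].string)
```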

```python
import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# '.explore-tab .feed-item' selects feed items nested under the explore tab.
items = soup.select('.explore-tab .feed-item')
for item in items:
    question = item.select('h2')[0].string
    if item.select('.author-link'):
        author = item.select('.author-link')[0].string
    else:
        author = item.select('.name')[0].string
    answer = item.select('.content')[0].string
    if item.select('.bio'):
        q = item.select('.bio')[0].string
    else:
        q = None  # avoid carrying the previous item's bio forward

    explore = {
        "question": question,
        "author": author,
        "answer": answer,
        "q": q,
    }

    with open("explore.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(explore, ensure_ascii=False) + "\n")
```

Besides the string attribute, there is another way to extract text: the get_text() method.
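The practical difference: .string returns None as soon as a tag has more than one child, while get_text() concatenates all descendant text. A quick sketch:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<div>Hello <b>world</b></div>", "lxml").div

print(tag.get_text())  # 'Hello world' - all descendant text, joined
print(tag.string)      # None - the <div> has multiple children
```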

```python
import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

items = soup.select('.explore-tab .feed-item')
for item in items:
    # get_text() works even when a tag has nested children and
    # .string would return None.
    question = item.select('h2')[0].get_text()
    if item.select('.author-link'):
        author = item.select('.author-link')[0].get_text()
    else:
        author = item.select('.name')[0].get_text()
    answer = item.select('.content')[0].get_text()
    if item.select('.bio'):
        q = item.select('.bio')[0].attrs['title']
    else:
        q = None

    explore = {
        "question": question,
        "author": author,
        "answer": answer,
        "q": q,
    }

    with open("explore.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(explore, ensure_ascii=False) + "\n")
```
