Python Web Scraping with BeautifulSoup and requests

Reposted article, 2020-04-25 21:06. Tags: web scraping / Python documentation / BeautifulSoup / requests / Python

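Python has many packages for web scraping, and they can be combined. Personally, though, I currently find BeautifulSoup the more convenient and cleaner-looking choice.

This article covers only static pages; cookies, sessions, and the like are not covered for now.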

Scraping the Weibo hot-search list with Python

Weibo hot search: https://s.weibo.com/top/summary

Weibo hot-search list: https://s.weibo.com/top/summary?cate=realtimehot

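1. The requests library: more concise than the urllib2 module. requests supports HTTP keep-alive and connection pooling, cookie-based session persistence, file uploads, automatic decoding of response content, internationalized URLs, and automatic encoding of POST data.

requests.get(): sends a GET request to the target site and returns a requests.Response object.

Other methods include requests.post(), requests.put(), requests.delete(), requests.head(), and requests.options().

2. The BeautifulSoup parsing library: parses the page obtained by requests. It offers three kinds of selectors: method selectors (find, find_all, and so on), CSS selectors, and node selectors.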

3. The example itself:

(1). First import the required libraries:

import requests
from bs4 import BeautifulSoup

(2). Fetch the page from its URL with requests:

url = 'https://s.weibo.com/top/summary'   # Weibo hot search
web_html = requests.get(url)  # accepts further keyword arguments, e.g. params as a dict

# params is encoded into the query string, giving https://s.weibo.com/top/summary?cate=realtimehot;
# several entries are joined with '&': ...cate=realtimehot&param_2=value_2
web_html_2 = requests.get(url, params={'cate': 'realtimehot'})
bs4_2 = BeautifulSoup(web_html_2.content, 'lxml')
print(bs4_2.prettify())
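
To check how params becomes a query string without sending anything over the network, one option is to prepare the request and inspect its URL. This is only a sketch: param_2 and value_2 are the placeholder names from the comment above, not real Weibo parameters.

```python
import requests

url = 'https://s.weibo.com/top/summary'

# Build the request but do not send it; prepare() applies the same
# URL encoding that requests.get() would use.
req = requests.Request('GET', url, params={'cate': 'realtimehot', 'param_2': 'value_2'})
prepared = req.prepare()

print(prepared.url)
```

Because nothing is sent, this is a cheap way to debug parameter encoding before hitting the real site.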

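The resulting web_html exposes the status code, headers, encoding, URL, and more:

#print(web_html.status_code)  # status code, e.g. 200
#print(web_html.headers)      # response headers
#print(web_html.content)   # raw bytes; non-ASCII characters appear as escape sequences
#print(web_html.text)      # body decoded to str
#print(web_html.url)       # final URL
#print(web_html.encoding)  # encoding used to decode .text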
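
Real requests can fail or come back with a badly guessed encoding, so it can help to wrap the call. A minimal sketch, assuming a helper named fetch_html, a made-up User-Agent string, and a 10-second timeout, none of which come from the original article:

```python
import requests


def fetch_html(url, params=None, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    # Hypothetical User-Agent; some sites reject the default one.
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; demo-scraper)'}
    try:
        resp = requests.get(url, params=params, headers=headers, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.RequestException as exc:
        print('request failed:', exc)
        return None
    # Without a charset in the headers, requests may fall back to ISO-8859-1;
    # apparent_encoding guesses the charset from the body instead.
    if resp.encoding is None or resp.encoding.lower() == 'iso-8859-1':
        resp.encoding = resp.apparent_encoding
    return resp.text


# An invalid URL is rejected before any network traffic happens.
print(fetch_html('not-a-valid-url'))
```

Returning None on failure keeps the calling code simple; raising the exception instead would be an equally reasonable design.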

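(3). Parse with BeautifulSoup (these operations work on any bs4 object):

get_text() returns the text content directly:

bs4 = BeautifulSoup(web_html.content, 'lxml')   # create the BeautifulSoup object with a parser; returns the parsed page
# print(bs4)                              # the raw output is rather messy

print(bs4.get_text()) # text only, as a single string; it looks scattered because it contains many '\n'

Enter prettify(), which turns the page into orderly, well-structured text:

print(bs4.prettify())                     # formatted markup: aligned, indented, with line breaks
out_str = bs4.prettify()
with open('test_html.html','w',encoding='utf-8') as f:
    f.write(out_str)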
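
The difference between the two calls is easier to see on a tiny document. The sketch below uses a made-up HTML string and the built-in html.parser (so it runs without lxml); the separator and strip arguments of get_text() reduce the stray newlines mentioned above.

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Demo</title></head><body><p>Hello <a href="/a">world</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Join the text nodes with a space and strip whitespace around each one.
text = soup.get_text(' ', strip=True)
print(text)

# prettify() returns a str with one tag or text node per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```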

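(4). All later operations work on this HTML text, that is, on extracting page content. Three kinds of selectors are involved: method selectors (find, find_all, and so on), CSS selectors, and node selectors.

Node selectors:

# Access a tag directly
print('title text:\n', bs4.title.string)   # the text inside the title tag, as a string

print('title tag:\n', bs4.title)   # the whole <title>...</title> element; a page normally has only one title
print(type(bs4.title))    # note: a bs4.element.Tag, so the same kinds of selection can be chained on the result

# You can also walk down from parent to child, which is more precise
print('title tag:\n', bs4.head.title)   # the <title>...</title> element, reached through head
print('title text:\n', bs4.head.title.string)

print('head tag:\n', bs4.head)    # the whole <head>...</head> element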

# -----------------------------------------------------------------------------

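The examples below run against the original HTML content:

# Attributes of a p tag behave like a dict
# bs4.p returns only the first p tag even though the HTML contains 3 of them: when several
# tags share a name, this selector returns the first unless told otherwise. The result is a
# bs4.element.Tag, so further selection can be chained on it.
print('first p tag:\n', bs4.p)
print('p attribute, way 1:\n', bs4.p.attrs['class']) # way 1: returns a list; raises KeyError if the p tag has no class attribute
print('p attribute, way 2:\n', bs4.p['class'])       # way 2: returns a list
print('p attribute, way 3:\n', bs4.p.get('class'))   # way 3: a plain list, not a bs4 object; get works like dict.get
print('p text:\n', bs4.p.string)  # the text inside <p></p> (None if the tag has more than one child)

# Chained access
print('chained:\n', bs4.p.a)   # again unspecified, so with several a tags inside the p, the first one is returned; the result is a bs4.element.Tag
print('chained:\n', bs4.p.option)
# <!-- ... --> is an HTML comment
print('chained:\n', bs4.p.contents)   # everything inside the p tag, comments included, as a list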

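# Child nodes: each child is a bs4.element.Tag or a bs4.element.NavigableString
print(bs4.p.children)   # an iterator
for num, child in enumerate(bs4.p.children):
    print(num, child)

# Parent node
print(bs4.a.parent)   # parent of the first a tag

# Similar attributes:
# parents: the tag's parent, the parent's parent, and so on up the tree
# next_siblings: the siblings after the tag; siblings are nodes under the same parent
# previous_siblings: the siblings before the tag; siblings are nodes under the same parent

list_ = []
for num, parent in enumerate(bs4.a.parents):
    print(num, parent)
    list_.append(parent)

print(list(bs4.a.next_siblings))   # next_siblings is a generator, so wrap it in list() to print the items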
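
Since the sample HTML is not reproduced here, the following self-contained sketch uses a made-up document with three p tags to illustrate the same node-selector behaviour. The markup and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<html><head><title>Demo page</title></head>
<body>
<p class="first lead" id="p1">Intro <a href="/one">one</a><!-- note --><a href="/two">two</a></p>
<p class="second">Middle</p>
<p class="third">Last</p>
</body></html>"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p                   # only the first of the three <p> tags
classes = first_p['class']         # multi-valued attribute -> list
p_id = first_p.get('id')           # single-valued attribute -> str
missing = first_p.get('title')     # absent attribute -> None instead of KeyError
first_href = first_p.a['href']     # first <a> inside the first <p>

# .string is None here because the tag has more than one child node.
p_string = first_p.string

# contents holds text, tags and the HTML comment.
n_children = len(first_p.contents)

# parents walks upward; the last item is the BeautifulSoup document itself.
ancestor_names = [parent.name for parent in soup.a.parents]

# next_siblings also yields whitespace text nodes, so keep only Tag objects.
later_p_classes = [sib['class'] for sib in first_p.next_siblings if isinstance(sib, Tag)]

print(classes, p_id, missing, first_href, p_string, n_children)
print(ancestor_names)
print(later_p_classes)
```

Note that a misspelled attribute such as next_sibings does not raise an error: BeautifulSoup treats unknown attribute names as tag lookups and returns None.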

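Method selectors:

find_all: searches by tag name, attributes, or text content
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Parameters:

The attrs argument can be used in the following ways:

As a dict passed positionally, {'class': value1, 'id': value2, ...}, or by keyword, attrs = {'class': value1, 'id': value2, ...}. With a single attribute you can also write class_='value1' directly (class_ rather than class, because class is a Python keyword);
If you know only the attribute name, or its value changes, match on the name alone by setting the value to True, for example: attrs = {'class': True};
The value can also be a regular expression (requires import re), for example: attrs = {'class': re.compile(r'\d+')}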
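
The four ways of filtering by attribute can be compared side by side. This is a sketch on a made-up menu snippet; the markup and the hrefs helper are illustrative assumptions, not part of the Weibo page.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="menu" id="nav">
  <a class="item" href="/home" title="Home">Home</a>
  <a class="item item-2" href="/hot" title="Hot">Hot</a>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def hrefs(tags):
    """Hypothetical helper: collect the href of every matched tag."""
    return [tag['href'] for tag in tags]


# class_ matches a tag if any one of its classes equals the value.
by_keyword = hrefs(soup.find_all('a', class_='item'))
# A dict with several keys requires every attribute to match.
by_dict = hrefs(soup.find_all('a', attrs={'class': 'item', 'title': 'Hot'}))
# True matches any tag that has the attribute at all.
has_class = hrefs(soup.find_all('a', attrs={'class': True}))
# A compiled regex is searched against the attribute value.
by_regex = hrefs(soup.find_all('a', attrs={'class': re.compile(r'\d+')}))
# limit caps the number of results.
limited = hrefs(soup.find_all('a', limit=1))

print(by_keyword, by_dict, has_class, by_regex, limited)
```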

# Get the menu
menu_div = bs4.find("div", class_='menu') # use class_, since class is a Python keyword
menu_href_list = []
menu_title_list = []
menu_list = []
for kk in menu_div.find_all("a"):  # find_all returns a list; each item, menu_div.find_all("a")[i], is a bs4.element.Tag
    href = kk['href']              # raises KeyError if the attribute is missing; kk.get('href') returns None instead
    menu_title = kk['title']
    menu = kk.string               # the text inside the current <a></a>
    
    menu_href_list.append(href)
    menu_title_list.append(menu_title)
    menu_list.append(menu)
print('menu links:\n',menu_href_list)
print('menu titles:\n',menu_title_list)
print('menu text:\n',menu_list)

Similar methods:

# =============================================================================
# find(name=None, attrs={}, recursive=True, text=None, **kwargs)
# Like find_all, except that find returns a single element; if several elements match, it returns the first
# 
# find_parents() find_parent()
# find_parents() returns all ancestors; find_parent() returns the direct parent.
# 
# find_next_siblings() find_next_sibling()
# find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
# 
# find_previous_siblings() find_previous_sibling()
# find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
# 
# find_all_next() find_next()
# find_all_next() returns all matching nodes after this node; find_next() returns the first matching one.
# 
# find_all_previous() and find_previous()
# find_all_previous() returns all matching nodes before this node; find_previous() returns the first matching one.
# =============================================================================

def find_next(self, name=None, attrs={}, text=None, **kwargs)
def find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs)
def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs)
def find_next_siblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
def find_previous(self, name=None, attrs={}, text=None, **kwargs)
def find_all_previous(self, name=None, attrs={}, text=None, limit=None, **kwargs)
def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs)
def find_previous_siblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
def find_parent(self, name=None, attrs={}, **kwargs)
def find_parents(self, name=None, attrs={}, limit=None, **kwargs)
def previous(self)

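For example:

print(kk.find_parent())  # kk is the last <a> from the loop above, a bs4.element.Tag, so it supports find_parent()

# ---------------------------------------------------------------

Now for the part that actually fetches the hot-search list: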
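
The relative-search methods listed above are easiest to follow on a small list. A sketch on made-up markup, with all names chosen for illustration:

```python
from bs4 import BeautifulSoup

html = """
<ul id="list">
  <li class="a">one</li>
  <li class="b">two</li>
  <li class="c">three</li>
</ul>
<p>after</p>
"""
soup = BeautifulSoup(html, 'html.parser')
second = soup.find('li', class_='b')

parent_id = second.find_parent('ul')['id']            # nearest matching ancestor
next_one = second.find_next_sibling('li').string      # first following sibling
prev_one = second.find_previous_sibling('li').string  # first preceding sibling
all_prev = [li.string for li in second.find_all_previous('li')]  # matches before this node
all_next = [li.string for li in second.find_all_next('li')]      # matches after this node
following_p = second.find_next('p').string            # not a sibling, but later in the document

print(parent_id, next_one, prev_one, all_prev, all_next, following_p)
```

Passing a tag name skips the whitespace text nodes that sit between elements, which the plain next_sibling attribute would return first.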

# Get the hot-search entries
timehot_rank_list = []
timehot_href_list = []
timehot_content_list = []
timehot_num_list = []
timehot_div = bs4.find("div", {'class':"data",'id':"pl_top_realtimehot"})  # when passing a dict, the keys must match the attribute names in the HTML
# timehot_tbody = timehot_div.find("tbody").get_text()   # get the data as plain text

#timehot_tbody = timehot_div.find("tbody")   # also works
timehot_tbody = timehot_div.tbody       # returns the first tbody; fine to use directly when there is only one
return_str = ''
for ii, mm in enumerate(timehot_tbody.find_all("tr")):
    rank = mm.find("td", class_ = "td-01 ranktop")
    td = mm.find("td", class_="td-02")
    timehot_href_list.append(td.a['href'])
    timehot_content_list.append(td.a.string)
    
    if ii==0:
        timehot_rank_list.append('0')
        timehot_num_list.append('9999999999')
        return_str = return_str +'\t'+ '0' +'\t'+ td.a.string +'\t'+ td.a['href'] +'\t'+ '9999999999' + '\n'
    else:
        timehot_rank_list.append(rank.string)
        timehot_num_list.append(td.span.string)
        return_str = return_str +'\t'+ rank.string +'\t'+ td.a.string +'\t'+ td.a['href'] +'\t'+ td.span.string + '\n'
    
with open('微博熱搜榜.txt','w',encoding='utf-8') as f:
    f.write(return_str)
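
The loop above assumes every row has a rank cell, a link, and a span, and fails with an AttributeError when one is missing. The sketch below shows a more defensive variant. It runs on a hand-written mock table that only mirrors the class names used above; the real page's markup may differ or change, and the function name parse_hot_rows is hypothetical.

```python
from bs4 import BeautifulSoup

# Mock markup (an assumption): a pinned row without rank or count, then ranked rows.
SAMPLE_HTML = """
<div class="data" id="pl_top_realtimehot"><table><tbody>
<tr><td class="td-01"><i class="icon-top"></i></td>
    <td class="td-02"><a href="/weibo?q=pinned">Pinned topic</a></td></tr>
<tr><td class="td-01 ranktop">1</td>
    <td class="td-02"><a href="/weibo?q=a">Topic A</a><span>12345</span></td></tr>
<tr><td class="td-01 ranktop">2</td>
    <td class="td-02"><a href="/weibo?q=b">Topic B</a><span>6789</span></td></tr>
</tbody></table></div>
"""


def parse_hot_rows(html):
    """Return one dict per table row, tolerating missing cells."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', id='pl_top_realtimehot')
    if container is None:          # page layout changed or request was blocked
        return []
    rows = []
    for tr in container.find_all('tr'):
        link = tr.select_one('td.td-02 a')
        if link is None:           # skip header or malformed rows
            continue
        rank_td = tr.select_one('td.td-01')
        span = tr.select_one('td.td-02 span')
        rows.append({
            'rank': rank_td.get_text(strip=True) if rank_td else '',
            'title': link.get_text(strip=True),
            'href': link.get('href', ''),
            'heat': span.get_text(strip=True) if span else '',
        })
    return rows


rows = parse_hot_rows(SAMPLE_HTML)
for row in rows:
    print('\t'.join([row['rank'], row['title'], row['href'], row['heat']]))
```

Here the pinned row simply gets empty rank and heat fields, rather than the placeholder values '0' and '9999999999' used above; either convention works as long as downstream code expects it.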

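CSS selectors: more powerful, but they require some familiarity with front-end development.

Summary first:

Prefix a class selector with '.'
Prefix an id selector with '#'
A tag selector needs no prefix
To select nested elements across several levels, separate the selectors with spaces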

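CSS: single-level selection

# Single-level selection
print(bs4.select('.data'))   # by class
print(bs4.select('table'))   # by tag
print(bs4.select('#pl_top_realtimehot'))   # by id

CSS: two-level selection

# Two-level selection
# Elements with class td-02 inside an element with class data (that is, the hot-search titles)
aa = bs4.select('.data .td-02')    # a list of elements
print(bs4.select('.data .td-02'))

bb = bs4.select('tr td')              # on the Weibo hot-search page each tr has 3 td cells; the td cells of all tr rows are listed in order
print(bs4.select('tr td'))            # by tag: every td inside a tr, i.e. nested selection

CSS: three-level selection

# Three-level selection
print(bs4.select('tr td i'))          # three levels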

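CSS: mixed selection

# Mixed selection
print(bs4.select('#pl_top_realtimehot .td-02')) # '#' is the id selector: elements with class td-02 inside the element whose id is pl_top_realtimehot
print(bs4.select('#pl_top_realtimehot .td-02 a'))  # three-level mixed selection: the links and titles of all hot-search entries

print(bs4.select('#pl_top_realtimehot .td-02 a')[0])  # index into the list; each item is itself a bs4 object

print(type(bs4.select('#pl_top_realtimehot .td-02 a')[0]))  # each item is a bs4.element.Tag, so find and the other selectors can be chained on it

CSS: another way to do nested selection

# Another way to nest selections
for tr in bs4.select('tr'):   # run a further selection on each tr found; each one is a bs4.element.Tag
    print(tr.select('td'))
    print(tr.get_text())   # get the text directly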
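
The role of the space in a selector is worth seeing in isolation: with a space the second part matches a descendant, without one both parts must match the same element. A sketch on a small mock table, where the markup is an assumption that reuses the class names from the examples above:

```python
from bs4 import BeautifulSoup

html = """
<div class="data" id="pl_top_realtimehot">
  <table>
    <tr><td class="td-01 ranktop">1</td><td class="td-02"><a href="/q=a">Topic A</a></td></tr>
    <tr><td class="td-01 ranktop">2</td><td class="td-02"><a href="/q=b">Topic B</a></td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Space = descendant: any .td-02 somewhere inside .data
nested = soup.select('.data .td-02')
# No space = same element: a td that carries both classes
compound = soup.select('td.td-01.ranktop')
# Without the space, one element would need both data and td-02
none_found = soup.select('.data.td-02')
# '>' = direct children only
direct = soup.select('tr > td')
# select_one returns the first match (or None) instead of a list
first_link = soup.select_one('#pl_top_realtimehot .td-02 a')
titles = [a.get_text() for a in soup.select('#pl_top_realtimehot .td-02 a')]

print(len(nested), len(compound), len(none_found), len(direct))
print(first_link['href'], titles)
```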

Putting it all together, the complete code for fetching the hot-search list is as follows:

import requests
from bs4 import BeautifulSoup

url = 'https://s.weibo.com/top/summary'   # Weibo hot search
web_html = requests.get(url)  # accepts further keyword arguments, e.g. params as a dict

bs4 = BeautifulSoup(web_html.content, 'lxml')   # create the BeautifulSoup object with a parser; returns the parsed page

timehot_rank_list = []
timehot_href_list = []
timehot_content_list = []
timehot_num_list = []
timehot_div = bs4.find("div", {'class':"data",'id':"pl_top_realtimehot"})  # when passing a dict, the keys must match the attribute names in the HTML
# timehot_tbody = timehot_div.find("tbody").get_text()   # get the data as plain text

#timehot_tbody = timehot_div.find("tbody")   # also works
timehot_tbody = timehot_div.tbody       # returns the first tbody; fine to use directly when there is only one
return_str = ''
for ii, mm in enumerate(timehot_tbody.find_all("tr")):
    rank = mm.find("td", class_ = "td-01 ranktop")
    td = mm.find("td", class_="td-02")
    timehot_href_list.append(td.a['href'])
    timehot_content_list.append(td.a.string)
    
    if ii==0:
        timehot_rank_list.append('0')
        timehot_num_list.append('9999999999')
        return_str = return_str +'\t'+ '0' +'\t'+ td.a.string +'\t'+ td.a['href'] +'\t'+ '9999999999' + '\n'
    else:
        timehot_rank_list.append(rank.string)
        timehot_num_list.append(td.span.string)
        return_str = return_str +'\t'+ rank.string +'\t'+ td.a.string +'\t'+ td.a['href'] +'\t'+ td.span.string + '\n'
    
with open('微博熱搜榜.txt','w',encoding='utf-8') as f:
    f.write(return_str)


