BeautifulSoup4 find_all() and select(): a simple web-scraping exercise

Reposted article, 2019-11-03 13:58. Tag: BeautifulSoup

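Combining regular expressions with BeautifulSoup makes scraping web pages much less work.

Let's practice on the Baidu Tieba index page: https://tieba.baidu.com/index.html

1. find_all(): searches all descendants of the current node (children, grandchildren, and so on).

The example below uses find_all() to match the links in Tieba's category module whose href contains "娛樂" (entertainment).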

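from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Download the page and parse it with Python's built-in HTML parser
html = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(html, 'html.parser')

# A compiled regular expression filters on the href attribute
for link in soup.find_all('a', href=re.compile('娛樂')):
    # get() returns None for a missing attribute, so format instead of concatenating
    print(f"{link.get('title')}:{link.get('href')}")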

Output:
娛樂明星:/f/index/forumpark?pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
港台東南亞明星:/f/index/forumpark?cn=港台東南亞明星&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
內地明星:/f/index/forumpark?cn=內地明星&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
韓國明星:/f/index/forumpark?cn=韓國明星&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
日本明星:/f/index/forumpark?cn=日本明星&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
時尚人物:/f/index/forumpark?cn=時尚人物&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
歐美明星:/f/index/forumpark?cn=歐美明星&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
主持人:/f/index/forumpark?cn=主持人&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1
其他娛樂明星:/f/index/forumpark?cn=其他娛樂明星&ci=0&pcn=娛樂明星&pci=0&ct=1&rn=20&pn=1

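soup.find_all('a',href=re.compile('娛樂')) is equivalent to soup('a',href=re.compile('娛樂')).
Calling the soup object (or any tag) directly is a shortcut for find_all(), so the example above can also be written with soup(...).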
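The equivalence can be checked offline with a small sketch. The markup below is a hypothetical stand-in for the Tieba page, not its real HTML, and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in markup, not the real Tieba page
html = """
<div>
  <a href="/f/index/forumpark?pcn=娛樂明星" title="娛樂明星">娛樂明星</a>
  <a href="/f/index/forumpark?pcn=電視節目" title="愛綜藝">愛綜藝</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("娛樂")
via_find_all = soup.find_all("a", href=pattern)
via_call = soup("a", href=pattern)  # calling the object is a shortcut for find_all()

# Both calls return the same tags in the same order
same_results = via_find_all == via_call
matched_titles = [a["title"] for a in via_call]
print(same_results, matched_titles)
```

The shortcut saves typing, though spelling out find_all() is probably easier to read in longer scripts.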

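** If none of the built-in filters fit, you can define a function that takes a single argument. When the function is used to filter on a specific attribute, the argument it receives is the value of that attribute, not the tag itself.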

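import re

def not_entertainment_star(href):
    # href is the attribute value, or None for tags without an href
    return href and not re.compile('娛樂明星').search(href)

# Keep every tag whose href exists and does not contain '娛樂明星'
print(soup.find_all(href=not_entertainment_star))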
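To make the difference between the two kinds of function filter concrete, here is a minimal sketch on hypothetical markup. The function names `href_without_stars` and `has_title_but_no_href` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup, not the real Tieba page
html = """
<a href="/f/index/forumpark?pcn=娛樂明星" title="娛樂明星">娛樂明星</a>
<a href="/f/index/forumpark?pcn=電視節目" title="愛綜藝">愛綜藝</a>
<a title="no link">placeholder</a>
"""
soup = BeautifulSoup(html, "html.parser")

def href_without_stars(href):
    # Used as href=...: receives the attribute value, or None if href is missing
    return href is not None and "娛樂明星" not in href

def has_title_but_no_href(tag):
    # Passed positionally: receives the whole Tag object instead
    return tag.has_attr("title") and not tag.has_attr("href")

by_value = [a["title"] for a in soup.find_all(href=href_without_stars)]
by_tag = [t.get_text() for t in soup.find_all(has_title_but_no_href)]
print(by_value)
print(by_tag)
```

Guarding against None matters in the attribute form, because the function is also called for tags that lack the attribute.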

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The examples below refer to tags such as this one from the page:

<a href="/f/index/forumpark?pcn=電視節目&amp;pci=0&amp;ct=1&amp;rn=20&amp;pn=1" rel="noopener" target="_blank" title="愛綜藝">愛綜藝</a>

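find_all('a'): finds all <a> tags.

find_all(title='愛綜藝'): finds all tags whose title attribute is '愛綜藝'.

find(string=re.compile('貼吧')): returns the first string (text node) that contains "貼吧". Note that the result is the string, not the tag around it.

find_all(href=re.compile('娛樂明星'),title='娛樂明星'): several keyword arguments filter on several attributes of a tag at once.

find_all(attrs={"title": "娛樂明星"}): attrs takes a dictionary, which is useful for attributes that cannot be passed as keyword arguments (for example data-* attributes, or name).

find_all(href=re.compile('娛樂明星'),limit=3): limit caps the number of results returned.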
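The arguments listed above can be tried side by side on a small document. This is only a sketch: the markup and the data-tab attribute are invented for the example, and recursive=False is included to show the one parameter from the signature that the list does not illustrate.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; data-tab is an invented attribute
html = """
<div id="nav">
  <a href="/f?kw=a" title="愛綜藝" data-tab="tv">愛綜藝</a>
  <a href="/f?kw=b" title="娛樂明星" data-tab="star">娛樂明星</a>
  <p>百度貼吧 <a href="/f?kw=c" title="貼吧首頁" data-tab="home">首頁</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
nav = soup.find("div", id="nav")

# name: every <a> descendant of nav
all_links = nav.find_all("a")
# recursive=False: only direct children are considered
direct_links = nav.find_all("a", recursive=False)
# keyword argument: exact match on an attribute value
by_title = nav.find_all(title="愛綜藝")
# string: returns the matching text node, not a tag
first_text = nav.find(string=re.compile("貼吧"))
# attrs: needed here because data-tab is not a valid keyword name
by_data = nav.find_all(attrs={"data-tab": "star"})
# limit: stop after two matches
first_two = nav.find_all("a", limit=2)

print(len(all_links), len(direct_links), len(by_title))
print(first_text.strip(), by_data[0]["title"], len(first_two))
```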

2. select(): finds tags with CSS selectors; loop over the result to pull out what you need:

** Find the <a> tags whose href starts with "/f/index":

for link2 in soup.select('a[href^="/f/index"]'):
    # f-string formatting avoids a TypeError when a tag has no title
    print(f"{link2.get('title')}:{link2.get('href')}")


** Find the <a> tags whose href ends with "&pn=1":

for link2 in soup.select('a[href$="&pn=1"]'):
    # f-string formatting avoids a TypeError when a tag has no title
    print(f"{link2.get('title')}:{link2.get('href')}")


** Find the <a> tags whose href contains "娛樂":

for link3 in soup.select('a[href*="娛樂"]'):
    # f-string formatting avoids a TypeError when a tag has no title
    print(f"{link3.get('title')}:{link3.get('href')}")
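The three attribute selectors (prefix, suffix, substring) can be compared on one small document. The markup below is a hypothetical stand-in, and the helper `titles` is made up for this sketch; it falls back to the link text when a tag has no title.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; &amp; is the escaped form of & in HTML attributes
html = """
<a href="/f/index/forumpark?pcn=娛樂明星&amp;pn=1" title="娛樂明星">娛樂明星</a>
<a href="/f/index/forumpark?pcn=電視節目&amp;pn=2" title="愛綜藝">愛綜藝</a>
<a href="/home/main?id=1&amp;pn=1">個人主頁</a>
"""
soup = BeautifulSoup(html, "html.parser")

def titles(selector):
    # Use the title attribute if present, otherwise the link text
    return [a.get("title") or a.get_text() for a in soup.select(selector)]

starts = titles('a[href^="/f/index"]')  # href begins with /f/index
ends = titles('a[href$="&pn=1"]')       # href ends with &pn=1
contains = titles('a[href*="娛樂"]')     # href contains 娛樂

print(starts)
print(ends)
print(contains)
```

The selectors compare against the parsed attribute value, so the selector uses a plain & even though the markup writes it as an entity.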

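soup.select('meta'): selects by tag name.

soup.select('html meta link'): selects by descending through nested tags (a link inside a meta inside html).

soup.select('meta > link:nth-of-type(3)'): selects a link that is a direct child of a meta tag and is the third link among its siblings.

soup.select('div > #head'): selects the direct child of a div whose id is head.

soup.select('div > a'): selects the <a> tags that are direct children of a div (use 'div a' to match at any depth).

soup.select("#searchtb ~ .authortb"): selects the siblings with class authortb that come after the tag with id searchtb.

soup.select("[class~=m_pic]") and soup.select(".m_pic"): both select tags whose class list includes m_pic.

soup.select(".tag-name,.post_author"): a comma-separated list matches tags that satisfy any one of the selectors.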
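A short sketch of the combinators listed above, run against invented markup that reuses the ids and class names from the examples. The helper `texts` is hypothetical and only collects the text of each match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the ids and classes from the selectors above
html = """
<body>
<div>
  <p id="head">header</p>
  <a class="m_pic big" href="/a">direct link</a>
  <span><a href="/b">nested link</a></span>
</div>
<table id="searchtb"></table>
<p class="authortb">first author</p>
<p class="tag-name">tag</p>
<p class="authortb post_author">second author</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    # Text of every element matched by the selector, in document order
    return [el.get_text() for el in soup.select(selector)]

child_id = texts("div > #head")            # direct child with id="head"
child_links = texts("div > a")             # direct children only
descendant_links = texts("div a")          # any depth
siblings = texts("#searchtb ~ .authortb")  # later siblings with the class
by_class = texts(".m_pic")                 # class shorthand
by_class_attr = texts("[class~=m_pic]")    # same match, attribute form
either = texts(".tag-name, .post_author")  # selector list (logical OR)
second_p = texts("body > p:nth-of-type(2)")  # second <p> among its siblings

for name, result in [("div > a", child_links), ("div a", descendant_links),
                     ("~", siblings), ("list", either), ("nth", second_p)]:
    print(name, result)
```

When only the first match is needed, select_one() takes the same selector syntax and returns a single tag or None.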
