Python Web Scraping: Using BeautifulSoup

Reposted article (2017-11-23 12:09). Category: Python web scraping

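BeautifulSoup is a flexible HTML parsing library that lets you extract the information you need without writing regular expressions.

lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.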
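As a minimal sketch of the parser choice described above, the snippet below tries lxml first and falls back to the standard library's html.parser when lxml is not installed. The helper name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(markup, "html.parser")


# Both parsers close the unclosed <b> and <p> tags while parsing
soup = make_soup("<p>Hello<b>world")
print(soup.b.string)
```

The two parsers can build slightly different trees for broken markup (lxml wraps fragments in html and body tags, html.parser does not), so it is safer to pick one parser per project.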

The html_doc below is an HTML document that is missing some closing tags.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Basic usage

from bs4 import BeautifulSoup

# html_doc is the sample document defined above
soup = BeautifulSoup(html_doc, 'lxml')  # create the soup object and choose the parser
print(soup.prettify())    # prettify() re-indents the parsed HTML, including the tags the parser filled in
print(soup.title.string)  # print the text inside the title tag

Result: the missing </body> and </html> tags have been filled in, and the text of the title tag is printed.

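Tag selectors

Selecting elements

Once the soup object has been created, elements can be selected by tag name. Each expression returns the first matching tag, and printing it shows the tag together with its contents.

from bs4 import BeautifulSoup

# html_doc is the sample document defined at the top of this article
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.head)
print(soup.a)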

Getting the tag name:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.name)

Result:
title

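Getting tag attributes:

The first a tag has an href attribute. Its value can be read with either tag['name'] or tag.attrs['name']; the two forms are equivalent.

from bs4 import BeautifulSoup

# html_doc is the sample document defined at the top of this article
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())

print(soup.a['href'])
print(soup.a.attrs['href'])

Result: both attribute lookups print http://example.com/elsie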
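The sketch below illustrates attribute access a little further, under the assumption that the built-in html.parser is used so that it runs without lxml. It shows that tag.get() returns None for a missing attribute and that class is returned as a list. The extra class value "big" is made up for illustration.

```python
from bs4 import BeautifulSoup

# html.parser is used so the sketch runs without lxml installed
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

href = link["href"]          # raises KeyError if the attribute is missing
missing = link.get("title")  # returns None instead of raising
classes = link["class"]      # multi-valued attribute, returned as a list

print(href, missing, classes)
```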

Getting content

# the following example reads the text of an a tag
html_doc = """<a>this is tag a</a>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.a.string)

Result:

this is tag a
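One caveat worth sketching: .string only returns a value when a tag has a single child, otherwise it is None, while get_text() joins all nested text. The markup below is an illustrative assumption, parsed with html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string    # the b tag has exactly one child string
mixed = soup.p.string     # the p tag has two children, so .string is None
text = soup.p.get_text()  # get_text() concatenates every nested string

print(repr(single), repr(mixed), repr(text))
```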

Nested selection

Tag selections can be chained to reach a nested element.

html_doc = """<html><head></head><body><a>this is tag a</a></body></html>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.body.a.string)

Result:

this is tag a

Getting all the text of an HTML document at once

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')  # create the soup object and choose the parser
print(soup.get_text())

Output: the text of every tag in the document, with all markup removed.
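get_text() also accepts a separator and a strip flag, which helps when adjacent tags would otherwise run together. A minimal sketch on made-up markup, using html.parser so that no wrapper tags are added:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p><p>Bye</p>", "html.parser")

raw = soup.get_text()                   # strings are concatenated as-is
clean = soup.get_text(" ", strip=True)  # strip each string, join with a space

print(repr(raw))
print(repr(clean))
```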

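Traversing child and descendant nodes

The HTML is still the sample document. It contains two closed p tags; selecting by tag name always returns the first match and does not search any further.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')  # create the soup object and choose the parser
print(soup.p.contents)

Result: the full contents of the first p tag are returned as a list.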

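Iterating over child and descendant nodes

soup.p.children is an iterator; enumerate() is used here to list its items.

from bs4 import BeautifulSoup

# html_doc is the sample document defined at the top of this article
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p.contents)
for i, child in enumerate(soup.p.children):  # direct children only, no deeper descendants
    print(i, child)
for i, child in enumerate(soup.p.descendants):  # every descendant of the p tag
    print(i, child)

Result:

Iterating over children yields only the direct children of p: the a tags and the text between them.

Iterating over descendants yields every node nested under p, including the text inside each a tag.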
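The difference between the two iterators is easier to see on a small nested fragment. The markup below is an assumption made for illustration and is parsed with html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <a>two <b>three</b></a></p>", "html.parser")

children = list(soup.p.children)        # the string "one " and the a tag
descendants = list(soup.p.descendants)  # also walks into a and b

print(len(children), len(descendants))
for node in descendants:
    print(repr(node))
```

Descendants are visited depth-first in document order, so the innermost string comes last here.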

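Getting the parent and ancestor nodes

from bs4 import BeautifulSoup

# html_doc is the sample document defined at the top of this article
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.a.parent)

Result: p is the parent of a, so the whole p tag is printed.

Getting ancestor nodes

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(list(enumerate(soup.a.parents)))

Result: the enumeration walks upward one level at a time (p, body, html), and the last item is the document object itself, which contains every ancestor.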
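Printing whole ancestors is verbose, so a compact way to see the chain is to collect only the tag names. A minimal sketch on a made-up fragment, parsed with html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a>x</a></p></body></html>", "html.parser"
)

# .parents walks from the direct parent up to the BeautifulSoup object,
# whose name is the special value "[document]"
names = [parent.name for parent in soup.a.parents]
print(names)
```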

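Sibling nodes

from bs4 import BeautifulSoup

# html_doc is the sample document defined at the top of this article
soup = BeautifulSoup(html_doc, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings that follow the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that precede the first a tag

Result: the text between the tags counts as sibling nodes too, so the lists contain strings as well as a tags.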
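Because text nodes are siblings too, it is common to filter the iterator down to tags. The sketch below does this with an isinstance check on made-up markup parsed with html.parser.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    '<p><a id="1">A</a>, <a id="2">B</a> and <a id="3">C</a></p>',
    "html.parser",
)

following = list(soup.a.next_siblings)      # strings and tags mixed
preceding = list(soup.a.previous_siblings)  # nothing precedes the first a tag
tag_ids = [node["id"] for node in following if isinstance(node, Tag)]

print(following)
print(tag_ids)
```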

Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

The examples use the following HTML document:

html = """
<div class="panel">
    <div class="panel-heading">
        <h3>hello</h3>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

Using the find_all() method

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))     # returns a list whose items are bs4.element.Tag objects
print(soup.find_all('ul')[0])  # the list is indexed from 0

Result: find_all('ul') finds every ul tag.

find_all('ul')[0] returns only the first match.

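Nested searches with find_all

from bs4 import BeautifulSoup

# html is the document defined above
soup = BeautifulSoup(html, 'lxml')

# call find_all() on each result of an outer find_all()
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Result: all li children of the first ul tag are found and returned as a list.

The same happens for the second ul tag.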

Searching by attribute

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(attrs={'id': 'list-1'}))  # attrs is a dict mapping attribute names to values
print(soup.find_all(attrs={'class': 'list'}))

Result: searching for id=list-1 returns the first ul tag.

Searching for class=list returns both ul tags.
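Both ul tags match class=list because class is a multi-valued attribute: a search matches if any one of the tag's classes equals the value. A minimal sketch of that behaviour on a trimmed copy of the document, parsed with html.parser:

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Matches any tag that has "list" among its classes
any_list = [ul["id"] for ul in soup.find_all(class_="list")]
# Matches only the tag that also carries "list-small"
small = [ul["id"] for ul in soup.find_all(class_="list-small")]
# A CSS selector can require both classes at once
both = [ul["id"] for ul in soup.select("ul.list.list-small")]

print(any_list, small, both)
```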

Common attributes can also be passed directly as keyword arguments.

from bs4 import BeautifulSoup

# html is the document defined above
soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(class_='element'))  # class is a reserved word in Python, so append an underscore

Result: every tag with class=element is found.

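Searching by text with the text argument (named string in Beautiful Soup 4.4.0 and later)

from bs4 import BeautifulSoup

# html is the document defined above
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # returns the matching strings themselves, not tags

Result:

['Foo', 'Foo']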
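find_all() accepts more than exact values. The sketch below, on a trimmed copy of the document and with html.parser, shows three variations: limit to cap the number of results, a compiled regular expression as the string filter, and a list of tag names. It assumes Beautiful Soup 4.4.0 or later, where the text argument is called string.

```python
import re

from bs4 import BeautifulSoup

html = "<h3>hello</h3><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches
first_two = [li.string for li in soup.find_all("li", limit=2)]
# Match strings against a regular expression
b_strings = [str(s) for s in soup.find_all(string=re.compile("^B"))]
# Match any of several tag names, in document order
mixed = [tag.name for tag in soup.find_all(["h3", "li"])]

print(first_two, b_strings, mixed)
```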

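The find(name, attrs, recursive, text, **kwargs) method

find returns a single element; find_all returns all matching elements.

Usage and return values

from bs4 import BeautifulSoup

# html is the document defined above
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))  # returns a single Tag object, the first match
print(soup.find('li')['class'])

Result: the ul search returns a single tag rather than a list, while the class attribute of the li tag is returned as a list.

Searching by attribute works as with find_all, except that find returns only the first match.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find(attrs={'id': 'list-1'}))
print(soup.find(text='Foo'))

Result: the first ul tag is printed, followed by Foo.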
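One practical difference from find_all() is that find() returns None when nothing matches, so chained access needs a guard. A minimal sketch on made-up markup, parsed with html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul id='list-1'><li class='element'>Foo</li></ul>", "html.parser"
)

table = soup.find("table")  # no table in the document, so this is None
item = soup.find("li")      # first li tag

# Guard before touching attributes of a possibly missing tag
caption = table.string if table is not None else "(no table)"
text = item.string if item is not None else ""

print(caption, text)
```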

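find_parent() and find_parents()

The former returns only the direct parent; the latter returns all ancestors.

find_next_siblings() and find_next_sibling()

The former returns all following siblings; the latter returns only the first following sibling.

find_previous_siblings() and find_previous_sibling()

The former returns all preceding siblings; the latter returns only the first preceding sibling.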
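The methods above can be sketched together on a small list. The markup is a trimmed, whitespace-free variant of the sample document, parsed with html.parser, and the string argument assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

html = '<div><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

bar = soup.find("li", string="Bar")  # start from the middle item

parent_id = bar.find_parent("ul")["id"]
ancestors = [tag.name for tag in bar.find_parents(["ul", "div"])]
next_one = bar.find_next_sibling("li").string
previous_all = [li.string for li in bar.find_previous_siblings("li")]

print(parent_id, ancestors, next_one, previous_all)
```

Unlike the raw .next_siblings iterator, these methods take the same filters as find_all(), so passing a tag name skips the text nodes between tags.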

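CSS selectors

Pass a CSS selector straight to select() to make a selection.

In a CSS selector, a class is written with a leading . and an id with a leading #.

from bs4 import BeautifulSoup

# html is the document defined in the standard selectors section
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # outer element first, then the element nested inside it
print(soup.select('ul'))
print(soup.select('ul li'))             # li tags nested under a ul
print(soup.select('#list-2 .element'))  # ids are selected with #
print(soup.select('ul')[1])             # select() returns a list indexed from 0; this is the second ul

Result: each select() call returns a list of the matching tags.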
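A few more selector forms, sketched on a trimmed copy of the document with html.parser: select_one() for the first match, the > child combinator, and an attribute selector. This assumes a Beautiful Soup version whose select() is backed by the soupsieve package, which is installed alongside recent bs4 releases.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel">'
    '<ul class="list" id="list-1" name="elements">'
    '<li class="element">Foo</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Bar</li></ul>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

first_id = soup.select_one("ul")["id"]  # first match only
nothing = soup.select_one("table")      # None when nothing matches
children = [li.string for li in soup.select("#list-2 > li.element")]
by_attr = [ul["id"] for ul in soup.select('ul[name="elements"]')]

print(first_id, nothing, children, by_attr)
```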

Iterating with selectors

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):  # select() returns a list, so it can be iterated
    print(ul.select('li'))

Result: the same as the nested find_all search, but done with selectors.

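Getting attributes

from bs4 import BeautifulSoup

# html is the document defined in the standard selectors section
soup = BeautifulSoup(html, 'lxml')
# read an attribute from each selected tag
for ul in soup.select('ul'):
    print(ul['id'])  # or print(ul.attrs['id'])

Result:

list-1

list-2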

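Getting text content

from bs4 import BeautifulSoup

# html is the document defined in the standard selectors section
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Result: Foo, Bar, Jay, Foo and Bar are printed, one per line.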
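Putting attribute and text extraction together, the sketch below collects the items of each list into a dictionary keyed by the ul id. The markup is a trimmed variant of the sample document (with deliberate padding around one item), parsed with html.parser; the variable name items is arbitrary.

```python
from bs4 import BeautifulSoup

html = """
<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2"><li class="element"> Jay </li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each ul id to the stripped text of its li children
items = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}
print(items)
```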

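Summary:

lxml is the fastest parser available to BeautifulSoup; if it does not meet your needs, fall back to html.parser.

find_all and find are the most convenient search methods.

If you are familiar with CSS selectors, use select().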
