### Introduction to and usage of the BeautifulSoup parsing library
### Three kinds of selectors: node selectors, method selectors, CSS selectors
### Recommended order of preference: method selectors > CSS selectors > node selectors

## Test text
text = '''
<html><head><title>there is money</title></head>
<body>
<p class="title" name="dmr"><b>there is money</b></p>
<p class="money">good good study, day day up <a href="https://www.baidu.com/1" class="error" id="l1"><span><!-- 1 --></span></a>, <a href="https://www.baidu.com/2" class="error" id="l2"><span>2</span></a> and <a href="https://www.baidu.com/3" class="error" id="l3">3</a>; 66666666666 </p>
<p class='body'>...</p>
'''
1. Basic usage
## Basic usage
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object, using the lxml parser
soup = BeautifulSoup(text, 'lxml')
# Print the document with standard indentation
print(soup.prettify())
# Extract the text content of the title node
print(soup.title.string)

'''
Output:
<html>
 <head>
  <title>
   there is money
  </title>
 </head>
 <body>
  <p class="title" name="dmr">
   <b>
    there is money
   </b>
  </p>
  <p class="money">
   good good study, day day up
   <a class="error" href="https://www.baidu.com/1" id="l1">
    <!-- 1 -->
   </a>
   ,
   <a class="error" href="https://www.baidu.com/2" id="l2">
    2
   </a>
   and
   <a class="error" href="https://www.baidu.com/3" id="l3">
    3
   </a>
   ; 66666666666
  </p>
  <p class="body">
   ...
  </p>
 </body>
</html>
there is money
'''
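The basic usage above assumes the lxml parser is installed. As a hedged sketch, the helper below (the name `make_soup` and the `sample` string are illustrative, not part of the original post) falls back to Python's built-in `html.parser` when the requested parser is unavailable. It relies on `bs4.FeatureNotFound`, the exception Beautiful Soup raises when it cannot find a matching tree builder.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


# Shortened stand-in for the test text used in this post.
sample = "<html><head><title>there is money</title></head><body><p>hi</p></body></html>"

soup = make_soup(sample)
title_text = soup.title.string
print(title_text)
```

Different parsers can repair malformed HTML differently, so the fallback is only safe when the markup is reasonably well formed.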
2. Node selectors
### Node selectors
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
print(type(soup))
print(soup.title)
print(type(soup.title))
print(soup.p)
print(soup.head)

'''
Output:
<class 'bs4.BeautifulSoup'>
<title>there is money</title>
<class 'bs4.element.Tag'>
<p class="title" name="dmr"><b>there is money</b></p>
<head><title>there is money</title></head>
'''

## Extracting information
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
# Text content of the title tag
print(soup.title.string)
# Name of the p tag
print(soup.p.name)
# Attributes of the p tag, as a dict
print(soup.p.attrs)
print(soup.p.attrs.get('name'))
# attrs can be omitted: index the tag directly, as you would a dict
print(soup.p['class'])
print(soup.p.get('class'))
print(soup.p.string)

'''
Output:
there is money
p
{'class': ['title'], 'name': 'dmr'}
dmr
['title']
['title']
there is money
'''

## Nested selection: a selector applied to the result of another selector
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
print(soup.body.p.string)

'''
Output:
there is money
'''

## Selecting related nodes
## Children and descendants
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
# Direct children, including text nodes such as newlines;
# contents returns a list, children returns an iterator (recommended)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.children)
for i, child in enumerate(soup.body.children):
    print(i, child)
# All descendants, returned as a generator
print(soup.body.descendants)
for j, item in enumerate(soup.body.descendants):
    print(j, item)

'''
Output:
['\n', <p class="title" name="dmr"><b>there is money</b></p>, '\n', <p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>, '\n', <p class="body">...</p>, '\n']
7
<list_iterator object at 0x0000000002DAD320>
0 

1 <p class="title" name="dmr"><b>there is money</b></p>
2 

3 <p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
4 

5 <p class="body">...</p>
6 

<generator object Tag.descendants at 0x0000000002D67E58>
0 

1 <p class="title" name="dmr"><b>there is money</b></p>
2 <b>there is money</b>
3 there is money
4 

5 <p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
6 good good study, day day up 
7 <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>
8 <span><!-- 1 --></span>
9  1 
10 , 
11 <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>
12 <span>2</span>
13 2
14  and 
15 <a class="error" href="https://www.baidu.com/3" id="l3">3</a>
16 3
17 ; 66666666666 
18 

19 <p class="body">...</p>
20 ...
21 

'''

## Parent and ancestors
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
# parent is the direct parent; parents is a generator over all ancestors
print(soup.a.parent)
print(soup.a.parents)
for i, parent in enumerate(soup.a.parents):
    print(i, parent)

'''
Output:
<p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
<generator object PageElement.parents at 0x0000000002D68E58>
0 <p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
1 <body>
<p class="title" name="dmr"><b>there is money</b></p>
<p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
<p class="body">...</p>
</body>
2 <html><head><title>there is money</title></head>
<body>
<p class="title" name="dmr"><b>there is money</b></p>
<p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
<p class="body">...</p>
</body></html>
3 <html><head><title>there is money</title></head>
<body>
<p class="title" name="dmr"><b>there is money</b></p>
<p class="money">good good study, day day up <a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a> and <a class="error" href="https://www.baidu.com/3" id="l3">3</a>; 66666666666 </p>
<p class="body">...</p>
</body></html>
'''

## Siblings
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
# next_sibling / previous_sibling return a single node;
# next_siblings / previous_siblings return generators
print('Next sibling: ', soup.a.next_sibling)
print('Previous sibling: ', soup.a.previous_sibling)
print('Next siblings: ', soup.a.next_siblings)
print('Previous siblings: ', soup.a.previous_siblings)

'''
Output:
Next sibling:  , 
Previous sibling:  good good study, day day up 
Next siblings:  <generator object PageElement.next_siblings at 0x0000000002D67E58>
Previous siblings:  <generator object PageElement.previous_siblings at 0x...>
'''
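One detail of node selection worth illustrating is how `.string` behaves. It returns a value only when a tag has exactly one child (following single-child chains downward), and an HTML comment is returned as a `Comment` object rather than ordinary text. The sketch below uses a shortened fragment modelled on the test text and assumes `html.parser` so that no extra parser is needed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Simplified fragment modelled on the test text (an assumption for this
# example; attributes other than id are omitted).
html = (
    '<p class="money">good good study, day day up '
    '<a id="l1"><span><!-- 1 --></span></a>, '
    '<a id="l2"><span>2</span></a></p>'
)
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

# The p tag has several children, so .string cannot pick one and is None.
p_string = soup.p.string

# Each a tag has a single span child with a single string child,
# so .string follows the chain down to that string.
first_string = links[0].string    # the comment inside the first link
second_string = links[1].string   # the plain text inside the second link

# A comment is a special kind of string; check for it explicitly.
first_is_comment = isinstance(first_string, Comment)

# get_text() gathers the text of all descendants into one str.
p_text = soup.p.get_text()

print(p_string, first_is_comment, second_string)
```

When a tag may have more than one child, `get_text()` is usually the safer way to read its text.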
3. Method selectors
### Method selectors (more flexible)
## find_all: returns every match as a list; when matching tags, the elements are Tag objects
## find: returns the first match as a single Tag object
## The same pairing applies to find_parents and find_parent,
## find_next_siblings and find_next_sibling,
## find_previous_siblings and find_previous_sibling,
## find_all_next and find_next,
## find_all_previous and find_previous
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(text, 'lxml')
# Find the nodes named a; the result is a list
print(soup.find_all(name='a'))
print(soup.find_all(name='a')[0])
# Find the nodes whose id attribute is l1, then the nodes whose class attribute is error
print(soup.find_all(attrs={'id': 'l1'}))
print(soup.find_all(class_='error'))
# Match text content against a keyword pattern
# (Beautiful Soup 4.4.0 and later also accept this argument under the name string)
print(soup.find_all(text=re.compile('money')))

'''
Output:
[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]
<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>
[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>]
[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]
['there is money', 'there is money']
'''
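The listing above names `find`, `find_parent`, and the sibling variants but only demonstrates `find_all`. A minimal sketch of the others follows, on a shortened copy of the test paragraph (the comment in the first link is replaced by plain text, and `html.parser` is assumed so that nothing beyond bs4 is required). The variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the "money" paragraph from the test text (assumption:
# the comment inside the first link is replaced by plain text).
html = (
    '<p class="money">good good study, day day up '
    '<a href="https://www.baidu.com/1" class="error" id="l1"><span>1</span></a>, '
    '<a href="https://www.baidu.com/2" class="error" id="l2"><span>2</span></a> and '
    '<a href="https://www.baidu.com/3" class="error" id="l3">3</a>;</p>'
)
soup = BeautifulSoup(html, "html.parser")

# find returns only the first match (or None when nothing matches).
first_link = soup.find("a", class_="error")

# find_parent walks upward to the nearest matching ancestor.
parent_p = first_link.find_parent("p")

# find_next_sibling skips the text node ", " because a tag name is given.
next_link = first_link.find_next_sibling("a")

# The plural form returns every matching later sibling.
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]

# Attribute values can be matched with a compiled regular expression.
hrefs = [a["href"] for a in soup.find_all("a", href=re.compile(r"/[23]$"))]

# limit stops the search after the given number of matches.
limited = soup.find_all("a", limit=2)

print(first_link["id"], parent_p["class"], next_link["id"], later_ids, hrefs)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before chaining further calls on it.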
4. CSS selectors
### CSS selectors: the select method returns a list
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
print(soup.select('p a'))
print(soup.select('.error'))
print(soup.select('#l1 span'))
print(soup.select('a'))
print(type(soup.select('a')))

'''
Output:
[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]
[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]
[<span><!-- 1 --></span>]
[<a class="error" href="https://www.baidu.com/1" id="l1"><span><!-- 1 --></span></a>, <a class="error" href="https://www.baidu.com/2" id="l2"><span>2</span></a>, <a class="error" href="https://www.baidu.com/3" id="l3">3</a>]
<class 'bs4.element.ResultSet'>
'''

## Nested selection, getting attributes, getting text
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'lxml')
# Nested selection
for i in soup.select('a'):
    print(i.select('span'))
# Getting attributes
print(soup.select('a')[0].attrs)
print(soup.select('a')[0].get('class'))
# Getting text
print(soup.select('a')[1].string)
print(soup.select('a')[2].get_text())

'''
Output:
[<span><!-- 1 --></span>]
[<span>2</span>]
[]
{'href': 'https://www.baidu.com/1', 'class': ['error'], 'id': 'l1'}
['error']
2
3
'''
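Beyond the descendant, class, and id selectors shown above, `select` also accepts child combinators and attribute selectors, and `select_one` returns just the first match. The sketch below is illustrative only: it reuses a shortened copy of the test paragraph, with plain text in place of the comment, and assumes `html.parser`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "money" paragraph from the test text (assumption:
# the comment inside the first link is replaced by plain text).
html = (
    '<p class="money">good good study, day day up '
    '<a href="https://www.baidu.com/1" class="error" id="l1"><span>1</span></a>, '
    '<a href="https://www.baidu.com/2" class="error" id="l2"><span>2</span></a> and '
    '<a href="https://www.baidu.com/3" class="error" id="l3">3</a>;</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Child combinator: only a tags that are direct children of p.money.
child_ids = [a["id"] for a in soup.select("p.money > a")]

# select_one returns the first match instead of a list.
second_span_text = soup.select_one("#l2 span").get_text()

# Attribute selector: href values that end with "/3".
third_link_text = soup.select('a[href$="/3"]')[0].get_text()

# select_one returns None when nothing matches.
missing = soup.select_one("div")

print(child_ids, second_span_text, third_link_text, missing)
```

`select_one` plays the same role for CSS selectors that `find` plays for method selectors, so it needs the same `None` check before further calls.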