Using BeautifulSoup in Python

This article is a repost (2019-01-20 16:43, Python).

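Basic usage of the BS4 library:

1. It works best together with the LXML library. Install: pip install lxml

2. It works best together with the Requests library. Install: pip install requests

3. Install bs4: pip install bs4

4. Typing pip directly does not work? Fix: Environment Variables -> System Variables -> Path -> New: C:\Python27\Scripts (adjust the path to match your own Python installation)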
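After the install steps above, it can help to confirm that the three packages are actually importable before running the examples. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration. If `pip` is still not found on the command line, running `python -m pip install <package>` is a common alternative that does not depend on the Scripts folder being on Path.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


# Check the three packages this article relies on
missing = missing_packages(["bs4", "lxml", "requests"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages are available.")
```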

Example: getting a website's title

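            # -*- coding:utf-8 -*-
            from bs4 import BeautifulSoup
            import requests

            url = "https://www.baidu.com"
            # Download the page
            response = requests.get(url)
            # Parse the raw response bytes with the lxml parser
            soup = BeautifulSoup(response.content, 'lxml')
            # Print the text of the <title> tag
            # (print with parentheses works in both Python 2 and Python 3)
            print(soup.title.text)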
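The example above assumes the request succeeds and the page has a title. A slightly more defensive variant might look like the sketch below. The helper names `extract_title` and `fetch_title` and the 10-second timeout are assumptions made for this illustration, and the built-in `html.parser` is used so the sketch runs without lxml (swap in `'lxml'` if it is installed).

```python
from bs4 import BeautifulSoup


def extract_title(markup):
    """Return the stripped <title> text, or None if there is no usable title."""
    soup = BeautifulSoup(markup, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url, timeout=10):
    """Download `url` and return its title (hypothetical helper)."""
    import requests  # imported here so extract_title works without requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return extract_title(response.content)


# Offline checks of the parsing step
print(extract_title("<html><head><title> Demo page </title></head></html>"))
print(extract_title("<html><body><p>No title here</p></body></html>"))
```

Calling `fetch_title("https://www.baidu.com")` would then perform the same download as the original example, but it raises an exception on an HTTP error and returns `None` when the page has no title.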

Locating tags

Example 1:

            # -*- coding:utf-8 -*-
            from bs4 import BeautifulSoup

            html = '''
            <html>
            <head><title>The Dormouse's story</title></head>
            <body>
            <p class="title"><b>The Dormouse's story</b></p>
            <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.</p>
            <p class="story">...</p>
            </body>
            </html>
            '''

            soup = BeautifulSoup(html, 'lxml')

            # BeautifulSoup has a built-in method for pretty-printed output
            print(soup.prettify())

            # Text of the <title> tag
            print(soup.title.string)

            # Name of the <title> tag's parent node
            print(soup.title.parent.name)

            # The first <p> tag (attribute access returns only the first match)
            print(soup.p)

            # The class attribute of the first <p> tag, returned as a list
            print(soup.p["class"])

            # The first <a> tag
            print(soup.a)

            # All <a> tags, as a list
            print(soup.find_all('a'))

            # The tag whose id is 'link3'
            print(soup.find(id='link3'))
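In practice the result of `find_all` is usually looped over to pull out text and attributes. The following is a small sketch along those lines, reusing a fragment of the HTML above. It uses the built-in `html.parser` (an assumption, so that it runs without lxml), and the id `link4` is a deliberately nonexistent value chosen for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href) pairs; .get() returns None if the attribute is absent
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)

# find() returns None when nothing matches, so check before using the result
missing = soup.find(id="link4")
print(missing is None)
```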

Example 2:

            # -*- coding:utf-8 -*-
            from bs4 import BeautifulSoup

            html = '''
            <html>
            <head><title>The Dormouse's story</title></head>
            <body>
            <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.</p>
            <p class="story">...</p>
            </body>
            </html>
            '''

            soup = BeautifulSoup(html, 'lxml')

            # .contents returns the direct children of the first <p> tag
            # (both child tags and text strings) as a list
            print(soup.p.contents)
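Because `.contents` mixes text strings and tags, a common follow-up is to keep only the child tags. The sketch below shows one way to do that with an `isinstance` check. The shortened HTML and the use of `html.parser` are assumptions made to keep the example small and free of the lxml dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="story">Names: '
    '<a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# .contents is a list holding text strings and tags in document order
nodes = soup.p.contents
print(len(nodes))

# .children iterates over the same direct children; keep only real tags
child_tags = [node for node in soup.p.children if isinstance(node, Tag)]
ids = [tag["id"] for tag in child_tags]
print(ids)
```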

find_all example:

            # -*- coding:utf-8 -*-
            from bs4 import BeautifulSoup

            html = '''
            <div class="panel">
                <div class="panel-heading">
                    <h4>Hello</h4>
                </div>
                <div class="panel-body">
                    <ul class="list" id="list-1">
                        <li class="element">Foo</li>
                        <li class="element">Bar</li>
                        <li class="element">Jay</li>
                    </ul>
                    <ul class="list list-small" id="list-2">
                        <li class="element">Foo</li>
                        <li class="element">Bar</li>
                    </ul>
                </div>
            </div>
            '''

            soup = BeautifulSoup(html, 'lxml')

            # Find all <ul> tags
            print(soup.find_all('ul'))

            # Call find_all again on each result to get the <li> tags inside each <ul>
            for ul in soup.find_all('ul'):
                print(ul.find_all('li'))

            # Find tags whose id is list-1
            print(soup.find_all(attrs={'id': 'list-1'}))

            # Find tags whose class is element
            print(soup.find_all(attrs={'class': 'element'}))

            # Find all text nodes equal to 'Foo'
            # (string= is the current name; older releases spell this argument text=)
            print(soup.find_all(string='Foo'))
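`find_all` accepts a few more keyword arguments that are handy alongside `attrs`. The sketch below shows `class_`, `limit`, and a tag name combined with `string`, run against a trimmed copy of the HTML above. It uses `html.parser` as an assumption so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) avoids Python's reserved word `class`;
# it matches tags that carry this class among possibly several classes
small_ids = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
print(small_ids)

# limit stops the search after the given number of matches
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]
print(first_two)

# A tag name plus string= returns the matching tags rather than bare strings
foo_items = soup.find_all("li", string="Foo")
print(len(foo_items))
```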

CSS selector example:

            # -*- coding:utf-8 -*-
            from bs4 import BeautifulSoup

            html = '''
            <div class="panel">
                <div class="panel-heading">
                    <h4>Hello</h4>
                </div>
                <div class="panel-body">
                    <ul class="list" id="list-1">
                        <li class="element">Foo</li>
                        <li class="element">Bar</li>
                        <li class="element">Jay</li>
                    </ul>
                    <ul class="list list-small" id="list-2">
                        <li class="element">Foo</li>
                        <li class="element">Bar</li>
                    </ul>
                </div>
            </div>
            '''

            soup = BeautifulSoup(html, 'lxml')

            # Elements with class panel-heading inside an element with class panel
            print(soup.select('.panel .panel-heading'))

            # All <li> elements nested inside a <ul> (these are tag names, not classes)
            print(soup.select('ul li'))

            # Elements with class element inside the element whose id is list-2
            print(soup.select('#list-2 .element'))

            # Use get_text() to get the text content
            for li in soup.select('li'):
                print(li.get_text())

            # Attributes can be read with [name] or attrs[name]
            for ul in soup.select('ul'):
                print(ul['id'])
                # print(ul.attrs['id'])
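When only the first match is needed, `select_one` can be used instead of indexing into the list returned by `select`. The sketch below also shows the child combinator `>` and the safer `.get()` for attributes that may be absent. The selector `#missing` and the attribute name `data-missing` are placeholders invented for this illustration, and `html.parser` is an assumption so the code runs without lxml.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching tag, or None if nothing matches
first_item = soup.select_one("#list-1 li").get_text()
print(first_item)
print(soup.select_one("#missing") is None)

# "ul.list-small > li" matches <li> tags that are direct children of that <ul>
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]
print(small_items)

# .get() returns None for an absent attribute instead of raising KeyError
ids = [ul.get("id") for ul in soup.select("ul")]
absent = soup.select_one("ul").get("data-missing")
print(ids, absent)
```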
