1. Basic usage of BeautifulSoup
# Parsing tools: re, selenium
# XML parsers
# BeautifulSoup is a parsing library and needs to be paired with a parser.
# The main parsers today: Python's standard library (html.parser) and the lxml HTML parser (preferred).
# BeautifulSoup gives us a way to search the document tree; internally it builds on re.

# 1. What is bs4 and why use it?
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>

<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup  # import BeautifulSoup from bs4

# Call BeautifulSoup to instantiate a soup object
# Argument 1: the text to parse
# Argument 2: the parser (html.parser or lxml)
soup = BeautifulSoup(html_doc, 'lxml')
print(soup)
print(type(soup))

# Pretty-print the document
html = soup.prettify()
print(html)
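The comments above name both html.parser and lxml. As a minimal sketch of my own (not from the original notes), reusing the html_doc string defined above, you can fall back to the standard-library parser when lxml is not installed:

from bs4 import BeautifulSoup

# Sketch only: lxml must be installed separately (pip install lxml);
# html.parser ships with Python, so it always works as a fallback.
try:
    soup = BeautifulSoup(html_doc, 'lxml')         # preferred parser
except Exception:                                  # bs4 raises FeatureNotFound when lxml is missing
    soup = BeautifulSoup(html_doc, 'html.parser')  # standard-library fallback

print(soup.title.text)  # The Dormouse's story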
2. Searching the document tree with bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<b>tank</b><a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.<hr></hr></p><p class="story">...</p>"""

'''
Searching the document tree:
    find()      find one match
    find_all()  find all matches

Searching by tag and by attribute:
    Tag:
        name   match by tag name
        attrs  match by attributes
        text   match by text
        - string filter:   exact string match
        - regex filter:    match with the re module
        - list filter:     match anything in the list
        - bool filter:     True matches any value
        - function filter: for tags that must have some attributes and must not have others
    Attributes:
        - class_
        - id
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # find a tag by the tag name 'p'
# find all nodes whose tag name is p
tag_s1 = soup.find_all(name='p')
print(tag_s1)

# attrs
# find the first node whose class is sister
p = soup.find(attrs={"class": "sister"})
print(p)
# find all nodes whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)

# text
text = soup.find(text="$37")
print(text)

# Combined:
# find an <a> tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, text="Lacie")
print(a_tag)

# # Regex filter
# import re
# # name
# p_tag = soup.find(name=re.compile('p'))
# print(p_tag)

# List filter
# import re
# # name
# tags = soup.find_all(name=['p', 'a', re.compile('html')])
# print(tags)

# Bool filter
# True matches any value
# find a <p> tag that has an id attribute
# p = soup.find(name='p', attrs={"id": True})
# print(p)

# Function filter
# match <a> tags that have an id attribute but no class attribute
# def have_id_no_class(tag):
#     if tag.name == 'a' and tag.has_attr('id') and not tag.has_attr('class'):
#         return tag
#
# tag = soup.find(name=have_id_no_class)
# print(tag)
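The regex, list, bool, and function filters above are left commented out. Below is a small runnable sketch of my own (reusing the html_doc defined above; the specific filter choices are illustrative, not from the original notes) that exercises each filter type once:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

# Regex filter: the first tag whose name contains the letter 'b' (here <body>)
print(soup.find(name=re.compile('b')))

# List filter: the first tag whose name is either 'p' or 'a'
print(soup.find(name=['p', 'a']))

# Bool filter: the first tag that has an id attribute, whatever its value
print(soup.find(attrs={"id": True}))

# Function filter: <a> tags that have an href but no id (a "wanted + unwanted attributes" check)
def has_href_no_id(tag):
    return tag.name == 'a' and tag.has_attr('href') and not tag.has_attr('id')

print(soup.find_all(name=has_href_no_id))  # only the Elsie link matches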
3. Traversing the document tree with bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>""" from bs4 import BeautifulSoup soup=BeautifulSoup(html_doc,'lxml') ''' 遍歷文檔樹: 1.直接使用 ''' # 1.直接使用 print(soup.p) # 查找第一個<p>標簽 print(soup.a) # 查找第一個<a>標簽 # 2.獲取標簽的名稱 print(soup.head.name) # 3.獲取標簽的屬性 print(soup.a.attrs) # 以字典的形式 print(soup.a.attrs['href']) # 獲取a標簽中的href屬性 # 4.獲取標簽的內容 print(soup.p.text) # $37 # 5.嵌套選擇 print(soup.html.head) # 6.子節點,子孫節點 # 找到閉合的標簽 print(soup.body.children) # 找到body所有的子節點,返回的是迭代器的對象,這樣可以節省電腦的資源 print(list(soup.body.children)) # 強制轉化為列表類型 print(soup.body.descendants) #返回子孫節點 print(list(soup.body.descendants)) # 7.父節點、祖先節點 print(soup.p.parent)# 獲取p標簽的父親節點 print(soup.p.parents) # 獲取p標簽所有的祖先節點 # 8.兄弟節點 # 找下一個兄弟 print(soup.p.next_sibling) # 找下面所有的兄弟 print(soup.p.next_siblings) # 此時返回的是迭代器的對象,這樣可以節省電腦的資源 print(list(soup.p.next_siblings)) # 找上面的兄弟,逗號,文本都可以是兄弟 print(soup.a.previous_sibling) # 找到a標簽的上一個兄弟 # 找到a標簽上面所有的兄弟 print(soup.a.previous_siblings) print(list(soup.a.previous_siblings))
4. Basic usage of MongoDB
Relational databases: powerful query capabilities.
Non-relational databases: flexible schema, scalability and performance; data is stored in collections, and documents in a collection do not have to follow one fixed, one-to-one field structure.
1. MongoDB shell
The global variable db shows the database you are currently in.
Create a collection
SQL:
create table t (f1, f2, ...)
MongoDB:
db.student
Insert data
MongoDB:
Insert multiple documents:
db.student.insert([{"name1":"tank1"},{"name2":"tank2"}])
Insert one document:
db.student.insert({"name1":"tank1"})
Query data
Find all documents:
db.student.find({})
Find the documents where name is tank:
db.student.find({"name":"tank"})
from pymongo import MongoClient

# 1. Connect to the MongoDB server
# Argument 1: MongoDB host address
# Argument 2: MongoDB port (default: 27017)
client = MongoClient('localhost', 27017)
print(client)

# 2. Select the tank_db database (created automatically if it does not exist)
print(client['tank_db'])

# 3. Select the people collection (also created automatically)
print(client['tank_db']['people'])

# 4. Insert data into tank_db
# 1. Insert one document
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
client['tank_db']['people'].insert(data1)

# 2. Insert multiple documents
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
data2 = {
    'name': 'tank1',
    'age': 84,
    'sex': 'female'
}
data3 = {
    'name': 'tank2',
    'age': 73,
    'sex': 'male'
}
client['tank_db']['people'].insert([data1, data2, data3])

# 5. Query data
# All documents
data_s = client['tank_db']['people'].find()
print(data_s)  # <pymongo.cursor.Cursor object at 0x000002EEA6720128>
# Iterate the cursor to print every document
for data in data_s:
    print(data)

# A single document
data = client['tank_db']['people'].find_one()
print(data)

# Officially recommended methods (insert() is deprecated in newer pymongo versions):
# insert one document: insert_one
# client['tank_db']['people'].insert_one()
# insert many documents: insert_many
# client['tank_db']['people'].insert_many()
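The closing comments point to insert_one()/insert_many() as the recommended API. The sketch below is my own addition (it reuses the tank_db/people names from above, and the sample values are illustrative): it rewrites the insert and query steps with those methods, shows a filtered find() that mirrors the shell query db.student.find({"name":"tank"}), and illustrates the flexible schema mentioned at the start of this MongoDB section.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
people = client['tank_db']['people']

# Recommended insert methods
people.insert_one({'name': 'tank', 'age': 18, 'sex': 'male'})
people.insert_many([
    {'name': 'tank1', 'age': 84, 'sex': 'female'},
    # a document with a different set of fields is fine: a collection has no fixed schema
    {'name': 'tank2', 'hobby': 'reading'},
])

# Filtered query, the pymongo equivalent of db.student.find({"name":"tank"}) in the shell
for doc in people.find({'name': 'tank'}):
    print(doc)

# A single matching document
print(people.find_one({'name': 'tank1'}))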