1. Basic usage of BeautifulSoup
# Parsing tools: re, selenium
# XML parsers
# BeautifulSoup is a parsing library and needs to be paired with a parser.
# The main parsers today: Python's standard library (html.parser) and the lxml HTML parser (preferred).
# BeautifulSoup gives us a way to search the document tree; internally it builds on re.

# 1. What is bs4 and why use it?
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>

<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup  # import BeautifulSoup from bs4

# Call BeautifulSoup to instantiate a soup object
# Argument 1: the text to parse
# Argument 2: the parser (html.parser or lxml)
soup = BeautifulSoup(html_doc, 'lxml')
print(soup)
print(type(soup))

# Pretty-print the document
html = soup.prettify()
print(html)
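The comments above name both html.parser and lxml. As a minimal sketch of my own (not from the original notes), reusing the html_doc string defined above, you can fall back to the standard-library parser when lxml is not installed:

from bs4 import BeautifulSoup

# Sketch only: lxml must be installed separately (pip install lxml);
# html.parser ships with Python, so it always works as a fallback.
try:
    soup = BeautifulSoup(html_doc, 'lxml')         # preferred parser
except Exception:                                  # bs4 raises FeatureNotFound when lxml is missing
    soup = BeautifulSoup(html_doc, 'html.parser')  # standard-library fallback

print(soup.title.text)  # The Dormouse's story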
2. Searching the document tree with bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<b>tank</b><a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.<hr></hr></p><p class="story">...</p>"""

'''
Searching the document tree:
    find()      find one match
    find_all()  find all matches

Searching by tag and by attribute:
    Tag:
        name   match by tag name
        attrs  match by attributes
        text   match by text
        - string filter:   exact string match
        - regex filter:    match with the re module
        - list filter:     match anything in the list
        - bool filter:     True matches any value
        - function filter: for tags that must have some attributes and must not have others
    Attributes:
        - class_
        - id
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # find a tag by the tag name 'p'
# find all nodes whose tag name is p
tag_s1 = soup.find_all(name='p')
print(tag_s1)

# attrs
# find the first node whose class is sister
p = soup.find(attrs={"class": "sister"})
print(p)
# find all nodes whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)

# text
text = soup.find(text="$37")
print(text)

# Combined:
# find an <a> tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, text="Lacie")
print(a_tag)

# # Regex filter
# import re
# # name
# p_tag = soup.find(name=re.compile('p'))
# print(p_tag)

# List filter
# import re
# # name
# tags = soup.find_all(name=['p', 'a', re.compile('html')])
# print(tags)

# Bool filter
# True matches any value
# find a <p> tag that has an id attribute
# p = soup.find(name='p', attrs={"id": True})
# print(p)

# Function filter
# match <a> tags that have an id attribute but no class attribute
# def have_id_no_class(tag):
#     if tag.name == 'a' and tag.has_attr('id') and not tag.has_attr('class'):
#         return tag
#
# tag = soup.find(name=have_id_no_class)
# print(tag)
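The regex, list, bool, and function filters above are left commented out. Below is a small runnable sketch of my own (reusing the html_doc defined above; the specific filter choices are illustrative, not from the original notes) that exercises each filter type once:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

# Regex filter: the first tag whose name contains the letter 'b' (here <body>)
print(soup.find(name=re.compile('b')))

# List filter: the first tag whose name is either 'p' or 'a'
print(soup.find(name=['p', 'a']))

# Bool filter: the first tag that has an id attribute, whatever its value
print(soup.find(attrs={"id": True}))

# Function filter: <a> tags that have an href but no id (a "wanted + unwanted attributes" check)
def has_href_no_id(tag):
    return tag.name == 'a' and tag.has_attr('href') and not tag.has_attr('id')

print(soup.find_all(name=has_href_no_id))  # only the Elsie link matches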
3. Traversing the document tree with bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>""" from bs4 import BeautifulSoup soup=BeautifulSoup(html_doc,'lxml') ''' 遍歷文檔樹: 1.直接使用 ''' # 1.直接使用 print(soup.p) # 查找第一個<p>標簽 print(soup.a) # 查找第一個<a>標簽 # 2.獲取標簽的名稱 print(soup.head.name) # 3.獲取標簽的屬性 print(soup.a.attrs) # 以字典的形式 print(soup.a.attrs['href']) # 獲取a標簽中的href屬性 # 4.獲取標簽的內容 print(soup.p.text) # $37 # 5.嵌套選擇 print(soup.html.head) # 6.子節點,子孫節點 # 找到閉合的標簽 print(soup.body.children) # 找到body所有的子節點,返回的是迭代器的對象,這樣可以節省電腦的資源 print(list(soup.body.children)) # 強制轉化為列表類型 print(soup.body.descendants) #返回子孫節點 print(list(soup.body.descendants)) # 7.父節點、祖先節點 print(soup.p.parent)# 獲取p標簽的父親節點 print(soup.p.parents) # 獲取p標簽所有的祖先節點 # 8.兄弟節點 # 找下一個兄弟 print(soup.p.next_sibling) # 找下面所有的兄弟 print(soup.p.next_siblings) # 此時返回的是迭代器的對象,這樣可以節省電腦的資源 print(list(soup.p.next_siblings)) # 找上面的兄弟,逗號,文本都可以是兄弟 print(soup.a.previous_sibling) # 找到a標簽的上一個兄弟 # 找到a標簽上面所有的兄弟 print(soup.a.previous_siblings) print(list(soup.a.previous_siblings))
4. Basic usage of MongoDB
Relational databases: powerful query capabilities.
Non-relational databases: flexible schema, scalability and performance; data is stored in collections, and documents in a collection do not have to follow one fixed, one-to-one field structure.
1. MongoDB shell
The global variable db shows the database you are currently in.
Create a collection
SQL:
create table t (f1, f2, ...)
MongoDB:
db.student
Insert data
MongoDB:
Insert multiple documents:
db.student.insert([{"name1":"tank1"},{"name2":"tank2"}])
Insert one document:
db.student.insert({"name1":"tank1"})
Query data
Find all documents:
db.student.find({})
Find the documents where name is tank:
db.student.find({"name":"tank"})
from pymongo import MongoClient

# 1. Connect to the MongoDB server
# Argument 1: MongoDB host address
# Argument 2: MongoDB port (default: 27017)
client = MongoClient('localhost', 27017)
print(client)

# 2. Select the tank_db database (created automatically if it does not exist)
print(client['tank_db'])

# 3. Select the people collection (also created automatically)
print(client['tank_db']['people'])

# 4. Insert data into tank_db
# 1. Insert one document
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
client['tank_db']['people'].insert(data1)

# 2. Insert multiple documents
data1 = {
    'name': 'tank',
    'age': 18,
    'sex': 'male'
}
data2 = {
    'name': 'tank1',
    'age': 84,
    'sex': 'female'
}
data3 = {
    'name': 'tank2',
    'age': 73,
    'sex': 'male'
}
client['tank_db']['people'].insert([data1, data2, data3])

# 5. Query data
# All documents
data_s = client['tank_db']['people'].find()
print(data_s)  # <pymongo.cursor.Cursor object at 0x000002EEA6720128>
# Iterate the cursor to print every document
for data in data_s:
    print(data)

# A single document
data = client['tank_db']['people'].find_one()
print(data)

# Officially recommended methods (insert() is deprecated in newer pymongo versions):
# insert one document: insert_one
# client['tank_db']['people'].insert_one()
# insert many documents: insert_many
# client['tank_db']['people'].insert_many()
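The closing comments point to insert_one()/insert_many() as the recommended API. The sketch below is my own addition (it reuses the tank_db/people names from above, and the sample values are illustrative): it rewrites the insert and query steps with those methods, shows a filtered find() that mirrors the shell query db.student.find({"name":"tank"}), and illustrates the flexible schema mentioned at the start of this MongoDB section.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
people = client['tank_db']['people']

# Recommended insert methods
people.insert_one({'name': 'tank', 'age': 18, 'sex': 'male'})
people.insert_many([
    {'name': 'tank1', 'age': 84, 'sex': 'female'},
    # a document with a different set of fields is fine: a collection has no fixed schema
    {'name': 'tank2', 'hobby': 'reading'},
])

# Filtered query, the pymongo equivalent of db.student.find({"name":"tank"}) in the shell
for doc in people.find({'name': 'tank'}):
    print(doc)

# A single matching document
print(people.find_one({'name': 'tank1'}))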