BeautifulSoup4
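BeautifulSoup4 is a Python library for extracting data from HTML and XML files. It works with the parser of your choice and provides idiomatic ways to navigate, search, and modify the parse tree. It often saves programmers hours or even days of work.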
1. Installing BeautifulSoup4
pip install beautifulsoup4
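2. Detailed usage
from bs4 import BeautifulSoup
from urllib import request

# Fetch the page content
base_url = 'http://langlang2017.com/route.html'
response = request.urlopen(base_url)
html = response.read()

# Parse the data (extract data from the page)
# Create the bs4 object
soup = BeautifulSoup(html, 'lxml')

# Pretty-print the contents of the object
content = soup.prettify()

# Extract specific content from the page
# print(soup.title)  # get the <title> element

# Part 1: attribute-style access (soup.<tag>) matches only the first tag of that name
# 1. Tag name
# print(soup.name)      # name of the soup object itself: '[document]'
# print(soup.div.name)  # name of the tag: 'div'

# 2. attrs: a tag's attributes as a dict
# print(soup.title.attrs)
# print(soup.img.attrs)

# 3. Modify an attribute value
img = soup.img.attrs
# print(img)
domain = 'http://www.langlang2017.com'
img["src"] = domain + img["src"]
# print(img)

# 4. Delete an attribute
img = soup.img.attrs
# print(img)
img.pop("alt", None)  # like del img["alt"], but no KeyError if "alt" is absent
# print(img)

# Part 2
# 1. Get text
# print(soup.title)
# print(soup.title.attrs)
# print(soup.title.name)
# Format: tag.string
# print(soup.title.string)

# Part 3
# tag.contents: list of direct child nodes
head = soup.head.contents
# print(head)
# print(head[3])

# tag.children: iterator over direct child nodes
head_children = soup.head.children
# for i in head_children:
#     print(i)

# tag.descendants: iterator over all nested nodes (children, grandchildren, ...)
# print(soup.div)
# for i in soup.div.descendants:
#     print(i)

# Searching the document with find_all()
# print(soup.meta)  # attribute access returns only the first <meta>
# for i in soup.find_all('meta'):
#     print(i)

# Pass a list of tag names
# print(soup.find_all(["h1", "h2"]))

# Keyword argument (match by attribute)
# print(soup.find_all(id='weixin'))

# Part 4: CSS selectors with soup.select()
# Select by class name
# print(soup.select('.logo'))

# Select by tag name
# print(soup.select('a'))

# Select by id
# print(soup.select('#weixin'))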
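The walkthrough above depends on a live page, so the output is hard to check by hand. The following is a minimal offline sketch of the same operations. Everything in SAMPLE_HTML is a made-up stand-in for the real page (its tags, the /images/logo.png path, and the link text are assumptions), and it uses Python's built-in "html.parser" instead of lxml so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title><meta charset="utf-8"></head>
  <body>
    <div class="logo"><img src="/images/logo.png" alt="logo"></div>
    <h1>Heading one</h1>
    <h2>Heading two</h2>
    <a id="weixin" href="/weixin.html">WeChat</a>
  </body>
</html>
"""

# "html.parser" ships with Python; the original code uses 'lxml' instead.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the first <title> tag.
title_text = soup.title.string

# Modify and delete attributes on the first <img> tag.
img = soup.img
img["src"] = "http://www.langlang2017.com" + img["src"]
img.attrs.pop("alt", None)  # remove "alt" if present

# find_all() with a list of tag names, and with a keyword filter.
heading_names = [tag.name for tag in soup.find_all(["h1", "h2"])]
weixin_by_id = soup.find_all(id="weixin")

# The same lookups expressed as CSS selectors.
weixin_by_css = soup.select("#weixin")
logo_divs = soup.select(".logo")

print(title_text)
print(img)
print(heading_names)
print(weixin_by_id, weixin_by_css, logo_divs)
```

Swapping SAMPLE_HTML for the bytes returned by request.urlopen(...).read() should give the same kind of results on a real page, provided the page actually contains the tags being looked up.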
3. Note: runtime error
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need
Fix: install the lxml package
pip install lxml
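If installing lxml is not possible in a given environment, one alternative is to catch the error and fall back to the parser bundled with Python. This is only a sketch: make_soup is a hypothetical helper name, not part of bs4, and "html.parser" can build a slightly different tree than lxml for malformed markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for illustration, not a bs4 API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser (e.g. lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p id='greeting'>hello</p>")
print(soup.p.string)
```

Installing lxml remains the simpler fix when it is available, since it keeps parsing behavior the same on every machine.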