Python Web Scraping with Beautifulsoup4: Scraping Website Images

This article is a repost. 2019-05-05 14:51 Tags: web scraping / jQuery / JavaScript / Python

Installation:

pip3 install beautifulsoup4
pip install beautifulsoup4

This article uses lxml as the Beautifulsoup4 parser because it parses quickly and tolerates malformed markup well.

Install the parser:

pip install lxml

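Usage:

Import BeautifulSoup from the bs4 module
Import the urlopen function from urllib.request
Read the web page with urlopen; if the page contains Chinese text, decode the response as utf-8
Parse the page with BeautifulSoup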

# coding: utf8
# Python 3.7

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Decode the response bytes as UTF-8 so that Chinese text is read correctly
html = urlopen("https://www.anviz.com/product/entries/1.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
# Match <li> tags by CSS class; the attrs filter should be a dict, not a set
all_li = soup.find_all("li", {"class": "product-subcategory-item"})
for li_title in all_li:
    # get_text() returns all the text inside the tag
    li_item_title = li_title.get_text()
    print(li_item_title)
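
To try the same pattern without touching the network, here is a minimal sketch that parses a small inline HTML string. The markup and its contents are made up for illustration (only the class name product-subcategory-item comes from the code above), and it uses Python's built-in html.parser instead of lxml so that it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="product-subcategory-item"><a href="/a.html"> Item A </a></li>
  <li class="product-subcategory-item"><a href="/b.html"> Item B </a></li>
  <li class="other">Not a product</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class attribute;
# strip=True trims the whitespace around each piece of text.
titles = [li.get_text(strip=True)
          for li in soup.find_all("li", class_="product-subcategory-item")]
print(titles)
```

The class_ keyword and the {"class": ...} dict are two spellings of the same filter, so either one should select the same items.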

Beautifulsoup4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id13

The methods resemble jQuery's:

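# Get every tag of a given name: soup.find_all('a'). find_all() and find() search only the descendants of the current node (children, grandchildren, and so on)
find_all()
soup.find_all("a")  # find every <a> tag
soup.find_all(re.compile("a"))  # find tags whose name contains "a" (requires import re)
soup.find_all(id="link2")  # find tags whose id is "link2"
soup.find_all(href=re.compile("elsie"))  # match each tag's href attribute against the pattern
soup.find_all(id=True)  # find tags that have an id attribute, whatever its value
soup.find_all("a", class_="sister")  # find <a> tags whose class is sister
soup.find_all("p", class_="strikeout")
soup.find_all("p", class_="body strikeout")  # matches the exact class string "body strikeout"
soup.find_all(text="Elsie")  # find strings whose content is exactly Elsie (newer versions call this argument string)
soup.find_all(text=["Tillie", "Elsie", "Lacie"])  # find strings equal to any item in the list
soup.find_all("a", limit=2)  # stop searching once 2 results have been found
# Get the text contained in a tag
get_text()
soup.get_text("|")  # join the pieces of text with "|"
soup.get_text("|", strip=True)  # also strip whitespace around each piece
# Search the ancestors of the current node
find_parents()
find_parent()
# Search the siblings that follow the current node
find_next_siblings()  # returns all matching siblings after the current node
find_next_sibling()  # returns only the first matching sibling after the current node
# Search the siblings that precede the current node
find_previous_siblings()  # returns all matching siblings before the current node
find_previous_sibling()  # returns the first matching sibling before the current node

find_all_next()  # returns all matching nodes after the current node in document order
find_next()  # returns the first matching node after the current node

find_all_previous()  # returns all matching nodes before the current node in document order
find_previous()  # returns the first matching node before the current node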
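
The calls above are easier to follow against a concrete document. The sketch below runs several of them on a shortened version of the "three sisters" sample that the Beautifulsoup4 documentation uses. It relies on html.parser so that lxml is not required, and it spells the text filter as string=, which is the newer name for the text= argument.

```python
import re
from bs4 import BeautifulSoup

HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(HTML_DOC, "html.parser")

# Searching downwards with find_all()
all_ids = [a["id"] for a in soup.find_all("a")]
first_two = soup.find_all("a", limit=2)            # stops after 2 results
elsie = soup.find_all(href=re.compile("elsie"))    # regex against href
names_with_a = [t.name for t in soup.find_all(re.compile("a"))]
names = [str(s) for s in soup.find_all(string=["Tillie", "Elsie", "Lacie"])]

# Navigating from one node to its relatives
link2 = soup.find(id="link2")
parent_class = link2.find_parent("p")["class"]     # class is a list of values
next_id = link2.find_next_sibling("a")["id"]
prev_id = link2.find_previous_sibling("a")["id"]
earlier_bold = link2.find_previous("b").get_text()
later_ids = [a["id"] for a in link2.find_all_next("a")]

print(all_ids, names_with_a, names)
print(parent_class, prev_id, next_id, earlier_bold, later_ids)
```

Note that the sibling methods stay inside the same parent, while find_next() and find_previous() walk the whole document, which is why find_previous("b") can reach the bold title in a different paragraph.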

Pass a string to the .select() method to find tags using CSS selector syntax
soup.select("body a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("#link1 ~ .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
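
As a quick check of what these selectors return, here is a small sketch run against a made-up document in the style of the documentation's sample. The HTML is only illustrative, and html.parser is used so that lxml is not needed. select() always returns a list, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(HTML_DOC, "html.parser")

# "head > title": a <title> that is a direct child of <head>
title = soup.select("head > title")[0].get_text()

# ".sister": every tag carrying the class sister
sister_ids = [a["id"] for a in soup.select(".sister")]

# "#link1 ~ .sister": later siblings of #link1 that have class sister
after_link1 = [a["id"] for a in soup.select("#link1 ~ .sister")]

# "p > a:nth-of-type(2)": the second <a> directly inside a <p>
second_id = soup.select("p > a:nth-of-type(2)")[0]["id"]

# Attribute selectors: exact href match, and "has an href at all"
exact = soup.select('a[href="http://example.com/elsie"]')
first_with_href = soup.select_one("a[href]")

print(title, sister_ids, after_link1, second_id)
```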

The .wrap() method wraps the specified element in a tag you supply and returns the new wrapper
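
A minimal sketch of wrap(), modelled on the example in the Beautifulsoup4 documentation. The new wrapper tags are created with soup.new_tag(), and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# Wrap the text inside <p> in a new <b> tag; wrap() returns the wrapper.
bold = soup.p.string.wrap(soup.new_tag("b"))
print(bold)

# Wrap the whole <p> in a new <div>; the tree is modified in place.
outer = soup.p.wrap(soup.new_tag("div"))
print(outer)
```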

Scraping the product-list images from the anviz website: demo

Modules used: BeautifulSoup, requests, os

# Modules that ship with Python include the following; just import them to use them
import json
import random     # generate random numbers
import datetime
import time
import os       # create directories

# coding: utf8
# Python 3.7

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import os

BASE_URL = "https://www.anviz.com/"
URL = "https://www.anviz.com/product/entries/2.html"
html = requests.get(URL).text
# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs("./imgs/", exist_ok=True)
soup = BeautifulSoup(html, features="lxml")

all_li = soup.find_all("li", class_="product-subcategory-item")
for li in all_li:
    imgs = li.find_all("img")
    for img in imgs:
        # urljoin avoids a doubled slash and also copes with absolute src values
        imgUrl = urljoin(BASE_URL, img["src"])
        # stream=True downloads the body in chunks instead of all at once
        r = requests.get(imgUrl, stream=True)
        imgName = imgUrl.split('/')[-1]
        with open('./imgs/%s' % imgName, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % imgName)

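The URL scraped here is hard-coded. The site is actually divided into three main sections whose URLs differ only in the ID at the end, and I have not yet worked out how to crawl all of them automatically.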
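
One possible way to cover every section is to generate the URLs from a list of IDs and run the same extraction on each page. The sketch below is only an outline under stated assumptions: the IDs 1, 2 and 3 are a guess based on the two URLs used in this article, and section_urls, image_urls and crawl_all are hypothetical helper names, not part of any library. The page download is passed in as a function so that the parsing logic can be tried without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.anviz.com/"
# Assumption: the three sections use the IDs 1, 2 and 3.
SECTION_IDS = [1, 2, 3]


def section_urls(ids=SECTION_IDS):
    """Build one listing URL per section ID."""
    return [urljoin(BASE_URL, "product/entries/%d.html" % i) for i in ids]


def image_urls(html, base_url=BASE_URL):
    """Return the absolute URLs of the images inside the product list items."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for li in soup.find_all("li", class_="product-subcategory-item"):
        for img in li.find_all("img"):
            src = img.get("src")  # skip <img> tags that have no src
            if src:
                urls.append(urljoin(base_url, src))
    return urls


def crawl_all(fetch, ids=SECTION_IDS):
    """Map each section URL to the image URLs found on that page.

    fetch is any callable that takes a URL and returns the page HTML.
    """
    return {url: image_urls(fetch(url)) for url in section_urls(ids)}


# Possible usage with requests (not executed here):
# import requests
# result = crawl_all(lambda url: requests.get(url, timeout=10).text)
```

Each image URL returned by crawl_all() could then be saved with the same streaming download loop as in the demo above. If the real section IDs turn out to be different, only SECTION_IDS needs to change.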
