A simple example of scraping images with XPath in Python

Reposted from the original article. 2020-08-20 17:40. Tags: web scraping / xpath / python / image scraping

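XPath parsing notes:
    - The most commonly used parsing approach, and a convenient and efficient one
    - How XPath parsing works:
        - 1. Instantiate an etree object and load the page source to be parsed into it;
        - 2. Call the object's xpath method with an XPath expression to locate tags and capture their content.
    - Installing the environment:
        - pip install lxml
    - How to instantiate an etree object
        - 1. Load the source of a local HTML file into an etree object:
            etree.parse(filePath)
            (etree.parse uses an XML parser by default; for HTML that is not well-formed, pass etree.HTMLParser() as the parser argument)
        - 2. Load source data fetched from the internet into the object (pass the string variable itself, not a quoted name):
            etree.HTML(page_text)
        - Then call xpath('xpath expression') on the resulting object; it returns a list
    - XPath expressions:
        - /: at the start of an expression it locates from the root node; between steps it represents a single level
        - //: represents multiple levels; at the start of an expression it locates from any position
        - Attribute locating: //div[@class='card zz_other_shici'] (general form: //tag[@attrName='attrValue'])
        - Index locating: //div[@class="card zz_other_shici"]/ul/li[3] ; indexes start at 1, so this locates the third li tag
        - Getting text:
            - /text() gets the text directly inside the tag
            - //text() gets all text inside the tag, including text in descendant tags
        - Getting attributes:
            - /@attrName
    - Other
        - Fixing garbled Chinese text:
            For example, if a string text displays as garbage because a GBK page was decoded as ISO-8859-1: text = text.encode('iso-8859-1').decode('gbk')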
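
The notes above can be tried out with a small sketch. The HTML snippet and its contents are made up for illustration (only the class name comes from the notes), and the last part simulates the garbled-text case rather than fetching a real GBK page.

```python
from lxml import etree

# Hypothetical page source; only the class name is taken from the notes.
page_text = """
<html><body>
  <div class="card zz_other_shici">
    <ul>
      <li><a href="/poem/1.html">First</a></li>
      <li><a href="/poem/2.html">Second</a></li>
      <li>Third <b>bold</b> item</li>
    </ul>
  </div>
</body></html>
"""

# Load the source string into an etree object.
tree = etree.HTML(page_text)

# Attribute locating: xpath() always returns a list of matches.
cards = tree.xpath('//div[@class="card zz_other_shici"]')

# Index locating: indexes start at 1, so li[3] is the third <li>.
third_li = tree.xpath('//div[@class="card zz_other_shici"]/ul/li[3]')[0]

# /text() returns only the text nodes directly inside the element,
direct_text = third_li.xpath('./text()')
# while //text() also collects text from descendant elements.
all_text = third_li.xpath('.//text()')

# /@attrName returns the attribute values.
hrefs = tree.xpath('//div[@class="card zz_other_shici"]/ul/li/a/@href')

# Simulated mojibake: GBK bytes wrongly decoded as ISO-8859-1,
# then repaired with the encode/decode round trip from the notes.
garbled = '中文'.encode('gbk').decode('iso-8859-1')
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(direct_text, all_text, hrefs, repaired)
```

Starting the inner expressions with a dot (./text(), .//text()) keeps them relative to third_li; without the dot, // would search the whole document again.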

Code:

# -*- coding: utf-8 -*-
"""
@FileName   :6.4k圖片解析爬取.py
@CreateTime :2020/8/14 0014 10:01
@Author     : Lurker Zhang
@E-mail     : 289735192@qq.com
@Desc.      :
"""

import json
import os
import time
from urllib.parse import urljoin

import requests
from lxml import etree
# Local project module; this script relies on it for the `headers` dict
from setting.config import *


def main():
    # Source URLs of the image list pages
    source_url = 'http://pic.netbian.com/4kmeinv/'
    temp_url = 'http://pic.netbian.com/4kmeinv/index_{}.html'
    # Number of pages to scrape in this run, an integer >= 1
    page_sum = 5
    # The first page has its own URL
    print('Start downloading: ' + source_url)
    down_pic(source_url)
    # Pages 2..page_sum follow the index_N.html pattern
    for page_num in range(2, page_sum + 1):
        pic_list_url = temp_url.format(page_num)
        print('Start downloading: ' + pic_list_url)
        down_pic(pic_list_url)

    print('Done: {0} images downloaded, {1} failed or skipped.'.format(total_success, total_fail))
    # Store the list of downloaded file names
    with open("../depository/meinv/pic_name_list.json", 'w', encoding='utf-8') as fp:
        json.dump(pic_name_list, fp)


def down_pic(pic_list_url):
    global total_success, total_fail, pic_name_list
    # Fetch the HTML of the image list page
    pic_list_page_text = requests.get(url=pic_list_url, headers=headers).text
    tree_1 = etree.HTML(pic_list_page_text)
    # Collect the detail-page links, then resolve each one to its image URL
    pic_show_url_list = tree_1.xpath('//div[@class="slist"]/ul//a/@href')
    pic_url_list = [get_pic_url(urljoin('http://pic.netbian.com', pic_show_url))
                    for pic_show_url in pic_show_url_list]

    # Download and save the images
    for pic_url in pic_url_list:
        picname = get_pic_name(pic_url)
        if picname not in pic_name_list:
            if save_pic(pic_url, picname):
                # Record the image in the list of downloaded images
                pic_name_list.append(picname)
                total_success += 1
                print("Saved image: {0}, {1} downloaded so far.".format(picname, total_success))
            else:
                print(picname + " failed to save")
                total_fail += 1
        else:
            # Already-downloaded images are counted together with failures
            print("Skipped, already downloaded: " + picname)
            total_fail += 1


def save_pic(pic_url, picname):
    # Use the current date as the name of the target folder
    path = '../depository/meinv/' + time.strftime('%Y%m%d', time.localtime()) + '/'
    # makedirs also creates missing parent folders
    os.makedirs(path, exist_ok=True)
    try:
        pic = requests.get(url=pic_url, headers=headers).content
        with open(path + picname, 'wb') as fp:
            fp.write(pic)
    except (OSError, requests.RequestException):
        return False
    return True


def get_pic_name(pic_url):
    # The file name is the last path segment of the image URL
    return pic_url.split('/')[-1]


def get_pic_url(pic_show_url):
    tree = etree.HTML(requests.get(url=pic_show_url, headers=headers).text)
    src = tree.xpath('//div[@class="photo-pic"]/a/img/@src')[0]
    # urljoin avoids a doubled slash when src already starts with '/'
    return urljoin('http://pic.netbian.com/', src)


if __name__ == '__main__':
    # Load the names of images scraped in earlier runs; a name that already
    # appears in the list is treated as downloaded and skipped
    os.makedirs('../depository/meinv', exist_ok=True)
    if not os.path.exists('../depository/meinv/pic_name_list.json'):
        with open("../depository/meinv/pic_name_list.json", 'w', encoding="utf-8") as fp:
            json.dump([], fp)
    with open("../depository/meinv/pic_name_list.json", "r", encoding="utf-8") as fp:
        pic_name_list = json.load(fp)
    # Counters for this run
    total_success = 0
    total_fail = 0
    main()
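
Two pieces of the script can be tested without touching the network: building the list-page URLs and keeping track of file names that were already downloaded. The sketch below is one possible way to isolate them. The function names build_page_urls, load_names and save_names are hypothetical and do not appear in the original script; the two URLs are the ones used above.

```python
import json
import os
import tempfile

SOURCE_URL = 'http://pic.netbian.com/4kmeinv/'
PAGE_TEMPLATE = 'http://pic.netbian.com/4kmeinv/index_{}.html'


def build_page_urls(page_sum):
    """Return the list-page URLs for the first `page_sum` pages."""
    if page_sum < 1:
        raise ValueError('page_sum must be at least 1')
    # Page 1 has no index suffix; later pages use index_N.html.
    return [SOURCE_URL] + [PAGE_TEMPLATE.format(n) for n in range(2, page_sum + 1)]


def load_names(json_path):
    """Load the set of already-downloaded file names (empty if no file yet)."""
    if not os.path.exists(json_path):
        return set()
    with open(json_path, 'r', encoding='utf-8') as fp:
        return set(json.load(fp))


def save_names(json_path, names):
    """Write the file names back as a sorted JSON list."""
    folder = os.path.dirname(json_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(json_path, 'w', encoding='utf-8') as fp:
        json.dump(sorted(names), fp)


if __name__ == '__main__':
    # Use a temporary folder so the demo leaves the real depository alone.
    demo_path = os.path.join(tempfile.mkdtemp(), 'meinv', 'pic_name_list.json')
    seen = load_names(demo_path)
    seen.add('example.jpg')
    save_names(demo_path, seen)
    print(build_page_urls(3))
    print(load_names(demo_path))
```

A set makes the "already downloaded" check constant-time, while the script's list lookup grows slower as the name list gets longer; the JSON file still stores a plain list, so the on-disk format is unchanged.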
