A Handy Python Tool for Batch-Downloading Website Images

This article is a repost. 2016-10-22 16:44. Tags: concurrency / reuse / python / general programming / BeautifulSoup / requests / programs / thread pool

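This article is for readers who know Python and have a taste for high-resolution pictures on the web. By the end you will know how to use Python libraries to fetch web pages and download images in batches, concurrently. If you can install Python libraries and run a Python program, you can use the code here to batch-download the pictures you want.

While surfing the web, a few "little waves" always brighten the day. Those little waves are, of course, beautiful pictures. Downloading them as you browse works well enough, but good things do not stay around forever, and saving pictures one by one with "Save as" is tedious. Can we download them in batches?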

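Goal

The Pacific Photography site (太平洋攝影網, dp.pconline.com.cn) is a good photography site. If you like natural scenery, it is well worth a long look. After browsing for a while you may want to take the pictures home with you. That is not hard, so let us follow the trail step by step (no screenshots here; open the site and see for yourself).

First, open http://dp.pconline.com.cn/list/all_t145.html ; a page full of lovely thumbnails appears.

Click any of the links to reach the page for the first picture in a series: http://dp.pconline.com.cn/photo/3687487.html . Click again to reach the second picture's page: http://dp.pconline.com.cn/photo/3687487_2.html . Clicking "View original" (查看原圖) below the picture jumps to http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 , which shows a high-resolution image. Right-click and choose "Save as" to store it locally.

Perhaps you are already itching to try: wouldn't it be nice if a single command could collect all those pictures?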

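Approach

Where to start? To automate a task with a program, we first have to find the pattern in it.

Anyone who has done web development knows that the browser's developer console shows the page's HTML, and that this HTML contains either images or links to other HTML pages that contain images. In our case, http://dp.pconline.com.cn/list/all_t145.html is the entry page of a broad topic (for example, nature is t145 and architecture is t292); call it EntryHtml. The entry page links to many child pages, each one a series shot by a photographer with a personal style under that topic; call each of them SerialHtml. Each SerialHtml in turn leads to one page per picture in the series; call each of them picHtml. A picHtml contains a "View original" link that points to the high-resolution page, for example http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 ; call it picOriginLink. Finally, the img element inside picOriginLink gives the real address of the high-resolution image, picOrigin. That is a lot to keep track of, so here is a summary:

EntryHtml (topic entry page) -> SerialHtml (series entry page) -> picHtml (picture page within a series) -> picOriginLink (high-resolution picture page) -> picOrigin (real address of the high-resolution image)

Next we need to work out how these five levels are linked.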

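Inspecting the HTML elements shows:

(1) Each SerialHtml is the target of an a element with class="picLink" on the EntryHtml page.

(2) Each picHtml is the SerialHtml URL with an index added. For example, if SerialHtml is http://dp.pconline.com.cn/photo/3687487.html and the series has 8 pictures, then picHtml = http://dp.pconline.com.cn/photo/3687487_[1-8].html . Note that http://dp.pconline.com.cn/photo/3687487.html and http://dp.pconline.com.cn/photo/3687487_1.html are equivalent, which simplifies the code.

(3) "View original" is a link to the page xxx.jsp that holds the high-resolution image; it is the a element with class="aView aViewHD" on the picHtml page.

(4) Finally, in the xxx.jsp page, find the img element whose src ends with an image extension.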

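So the overall approach is:

STEP1: Fetch the content of EntryHtml, entryContent.

STEP2: Parse entryContent and collect the a elements with class="picLink" into SerialHtmlList.

STEP3: For each page SerialHtml_i in SerialHtmlList:

(1) Fetch the page of its first picture and determine the total number of pictures, total.

(2) From total, generate the list of total picture-page links, picHtmlList:

a. For each page in picHtmlList, find the a element with class="aView aViewHD", hdLink.

b. Fetch the page that hdLink points to and find the img element to get the real image address, picOrigin.

c. Download picOrigin.

Note that a topic spans several pages. The first page is EntryHtml, http://dp.pconline.com.cn/list/all_t145.html , and the second page is http://dp.pconline.com.cn/list/all_t145_p2.html . The first page is equivalent to http://dp.pconline.com.cn/list/all_t145_p1.html , which again simplifies the code. To download the series on several pages of a topic, just add one more loop at the outermost level. That is the flow of the serial version.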
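The two URL patterns described above can be sketched as a pair of small helpers. This is only an illustration of the patterns quoted in the text; the function names entry_page_url and pic_page_urls are made up for this sketch and do not appear in the article's code.

```python
def entry_page_url(topic_id, page):
    """URL of one listing page of a topic, e.g. topic 145 (nature), page 2."""
    return "http://dp.pconline.com.cn/list/all_t%d_p%d.html" % (topic_id, page)


def pic_page_urls(serial_url, total):
    """URLs of the picture pages of one series, numbered 1..total."""
    base = serial_url.rsplit(".", 1)[0]   # drop the trailing ".html"
    return ["%s_%d.html" % (base, i) for i in range(1, total + 1)]


# Pages 1 and 2 of the nature topic, and the 8 picture pages of one series.
entry_urls = [entry_page_url(145, page) for page in (1, 2)]
pic_urls = pic_page_urls("http://dp.pconline.com.cn/photo/3687487.html", 8)
```

Because the first page and the first picture both accept the explicit _p1 and _1 forms, neither helper needs a special case for index 1.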

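Serial version

Approach

Main libraries:

(1) requests: fetches web page content.

(2) BeautifulSoup: walks the HTML document tree to find the nodes we need.

(3) multiprocessing.dummy: the thread-based counterpart of Python's multiprocessing library; it provides multithreading through the multiprocessing API. It is used by the concurrent version later on.

A few techniques:

(1) Use a decorator to catch exceptions in one place and print error information for troubleshooting.

(2) Split the logic into fine-grained functions, which makes it easier to reuse, extend, and optimize.

(3) Use asynchronous calls to improve performance, and use map to express batch operations concisely.

Runtime environment: Python 2.7. Use easy_install or pip to install the third-party libraries requests and BeautifulSoup (the beautifulsoup4 package, imported as bs4). The code passes "lxml" to BeautifulSoup as the parser, so lxml has to be installed as well.

Implementation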

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import requests
from bs4 import BeautifulSoup

saveDir = os.environ['HOME'] + '/joy/pic/pconline/nature'

def createDir(dirName):
    if not os.path.exists(dirName):
        os.makedirs(dirName)

def catchExc(func):
    '''
       decorator: catch any exception raised by func, print it and return None
    '''
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # str(args) / str(kwargs) print the whole tuple / dict;
            # str(*args) would itself fail for calls with several arguments
            print "error catch exception for %s (%s, %s)." % (func.__name__, str(args), str(kwargs))
            print e
            return None
    return _deco


@catchExc
def getSoup(url):
    '''
       get the html content of url and transform into soup object
           in order to parse what i want later
    '''
    result = requests.get(url)
    status = result.status_code
    if status != 200:
        return None
    resp = result.text
    soup = BeautifulSoup(resp, "lxml")
    return soup

@catchExc
def parseTotal(href):
    '''
      total number of pics is obtained from a data request , not static html.
    '''
    photoId = href.rsplit('/',1)[1].split('.')[0]
    url = "http://dp.pconline.com.cn/public/photo/include/2016/pic_photo/intf/loadPicAmount.jsp?photoId=%s" % photoId
    soup = getSoup(url)
    totalNode = soup.find('p')
    total = int(totalNode.text)
    return total

@catchExc
def buildSubUrl(href, ind):
    '''
    if href is http://dp.pconline.com.cn/photo/3687736.html, total is 10
    then suburl is
        http://dp.pconline.com.cn/photo/3687736_[1-10].html
    which contain the origin href of picture
    '''
    return href.rsplit('.', 1)[0] + "_" + str(ind) + '.html'

@catchExc
def download(piclink):
    '''
       download pic from pic href such as
            http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''

    picsrc = piclink.attrs['src']
    picname = picsrc.rsplit('/',1)[1]
    saveFile = saveDir + '/' + picname

    picr = requests.get(picsrc, stream=True)
    # the with block closes the file, so no explicit close() is needed
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

@catchExc
def downloadForASerial(serialHref):
    '''
       download a serial of pics
    '''

    href = serialHref
    total = parseTotal(href)
    print 'href: %s *** total: %s' % (href, total)
    if total is None:
        # the picture count could not be fetched; skip this serial
        return

    for ind in range(1, total+1):
        suburl = buildSubUrl(href, ind)
        print "suburl: ", suburl
        subsoup = getSoup(suburl)
        if subsoup is None:
            # this picture page failed to load; try the next one
            continue

        hdlink = subsoup.find('a', class_='aView aViewHD')
        picurl = hdlink.attrs['ourl']

        picsoup = getSoup(picurl)
        # escape the dot so that it matches a literal ".jpg"
        piclink = picsoup.find('img', src=re.compile(r"\.jpg"))
        download(piclink)


@catchExc
def downloadAllForAPage(entryurl):
    '''
       download serial pics in a page
    '''

    soup = getSoup(entryurl)
    if soup is None:
        return
    #print soup.prettify()
    picLinks = soup.find_all('a', class_='picLink')
    if len(picLinks) == 0:
        return
    hrefs = [link.attrs['href'] for link in picLinks]
    print 'serials in a page: ', len(hrefs)

    for serialHref in hrefs:
        downloadForASerial(serialHref)

def downloadEntryUrl(serial_num, index):
    entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html' % (serial_num, index)
    print "entryUrl: ", entryUrl
    downloadAllForAPage(entryUrl)
    return 0

def downloadAll(serial_num):
    # listing pages start..end (inclusive) of the topic
    start = 1
    end = 2
    return [downloadEntryUrl(serial_num, index) for index in range(start, end+1)]

serial_num = 145

if __name__ == '__main__':
    createDir(saveDir)
    downloadAll(serial_num)

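Concurrent version

Approach

The serial version is obviously slow: the CPU spends most of its time waiting on network connections and I/O. Typical ways to improve performance are:

(1) Split the work into groups of tasks. The groups can be turned into task-parallel computation when needed, and on a weak machine they make it possible to limit the level of concurrency so that the program keeps running stably.

(2) Use multiple threads for I/O-bound operations so that the CPU is not left waiting.

(3) Replace single operations inside loops with batch operations, which makes better use of concurrency.

(4) Use multiple processes for CPU-bound operations or for distributing tasks, to make fuller use of multiple cores.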
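Points (1) to (3) can be illustrated with a small, self-contained sketch. It is not the article's implementation: fake_download is a hypothetical stand-in that sleeps instead of touching the network, and split_into_groups and run_in_groups are names invented here. The group size caps how many tasks are in flight at once, and the thread pool overlaps the waiting inside each group.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # threads behind the Pool API


def fake_download(page):
    """Hypothetical stand-in for an I/O-bound task such as fetching one page."""
    time.sleep(0.05)                 # simulated network wait
    return "page-%d" % page


def split_into_groups(items, group_size):
    """Split a list into consecutive groups of at most group_size items."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def run_in_groups(pages, group_size, workers):
    """Process one group at a time; inside a group, tasks run on worker threads."""
    pool = ThreadPool(workers)
    results = []
    try:
        for group in split_into_groups(pages, group_size):
            # pool.map blocks until the whole group is done and keeps input order
            results.extend(pool.map(fake_download, group))
    finally:
        pool.close()
        pool.join()
    return results


pages = list(range(1, 9))
results = run_in_groups(pages, group_size=4, workers=4)
```

With these numbers the eight simulated downloads run as two batches of four, so the total wait should be roughly two sleeps rather than eight. Lowering group_size or workers trades speed for a gentler load on the machine and on the server.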

Implementation

Directory layout:

pystudy
    common
        common.py
        net.py
        multitasks.py
    tools
        dwloadpics_multi.py

common.py

import os

def createDir(dirName):
    if not os.path.exists(dirName):
        os.makedirs(dirName)

def catchExc(func):
    '''
       decorator: catch any exception raised by func, print it and return None
    '''
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # str(args) / str(kwargs) print the whole tuple / dict;
            # str(*args) would itself fail for calls with several arguments
            print "error catch exception for %s (%s, %s): %s" % (func.__name__, str(args), str(kwargs), e)
            return None
    return _deco

net.py

import requests
from bs4 import BeautifulSoup
from common import catchExc

import time

delayForHttpReq = 0.5 # 500ms

@catchExc
def getSoup(url):
    '''
       get the html content of url and transform into soup object
           in order to parse what i want later
    '''
    # throttle requests: the site answers 503 when it is hit too fast
    time.sleep(delayForHttpReq)
    result = requests.get(url)
    status = result.status_code
    # print 'url: %s , status: %s' % (url, status)
    if status != 200:
        return None
    resp = result.text
    soup = BeautifulSoup(resp, "lxml")
    return soup

@catchExc
def batchGetSoups(pool, urls):
    '''
       fetch all urls concurrently with the given pool and return the list
           of soup objects in the same order (None for urls that failed)
    '''

    urlnum = len(urls)
    if urlnum == 0:
        return []

    return pool.map(getSoup, urls)


@catchExc
def download(piclink, saveDir):
    '''
       download pic from pic href such as
            http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''

    picsrc = piclink.attrs['src']
    picname = picsrc.rsplit('/',1)[1]
    saveFile = saveDir + '/' + picname

    picr = requests.get(picsrc, stream=True)
    # the with block closes the file, so no explicit close() is needed
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

@catchExc
def downloadForSinleParam(paramTuple):
    # pool.map passes a single argument, so the (piclink, saveDir) tuple is unpacked here
    download(paramTuple[0], paramTuple[1])

multitasks.py

from multiprocessing import (cpu_count, Pool)
from multiprocessing.dummy import Pool as ThreadPool

ncpus = cpu_count()

def divideNParts(total, N):
    '''
       divide [0, total) into N parts, with each = total // N:
        return [(0, each), (each, 2*each), ..., ((N-1)*each, total)]
       the last part absorbs the remainder when total is not a multiple of N
    '''

    each = total // N
    parts = []
    for index in range(N):
        begin = index*each
        if index == N-1:
            end = total
        else:
            end = begin + each
        parts.append((begin, end))
    return parts

dwloadpics_multi.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re

from common import createDir, catchExc
from net import getSoup, batchGetSoups, download, downloadForSinleParam
from multitasks import ncpus, Pool, ThreadPool, divideNParts

saveDir = os.environ['HOME'] + '/joy/pic/pconline'
dwpicPool = ThreadPool(5)    # threads that download image files
getUrlPool = ThreadPool(2)   # threads that fetch and parse html pages

@catchExc
def parseTotal(href):
    '''
      total number of pics is obtained from a data request , not static html.
    '''
    photoId = href.rsplit('/',1)[1].split('.')[0]
    url = "http://dp.pconline.com.cn/public/photo/include/2016/pic_photo/intf/loadPicAmount.jsp?photoId=%s" % photoId
    soup = getSoup(url)
    totalNode = soup.find('p')
    total = int(totalNode.text)
    return total

@catchExc
def buildSubUrl(href, ind):
    '''
    if href is http://dp.pconline.com.cn/photo/3687736.html, total is 10
    then suburl is
        http://dp.pconline.com.cn/photo/3687736_[1-10].html
    which contain the origin href of picture
    '''
    return href.rsplit('.', 1)[0] + "_" + str(ind) + '.html'

@catchExc
def getOriginPicLink(subsoup):
    hdlink = subsoup.find('a', class_='aView aViewHD')
    return hdlink.attrs['ourl']

@catchExc
def findPicLink(picsoup):
    # escape the dot so that it matches a literal ".jpg"
    return picsoup.find('img', src=re.compile(r"\.jpg"))

@catchExc
def downloadForASerial(serialHref):
    '''
       download a serial of pics
    '''

    href = serialHref
    total = getUrlPool.map(parseTotal, [href])[0]
    print 'href: %s *** total: %s' % (href, total)
    if total is None:
        # the picture count could not be fetched; skip this serial
        return

    suburls = [buildSubUrl(href, ind) for ind in range(1, total+1)]
    subsoups = batchGetSoups(getUrlPool, suburls)

    # catchExc turns failures into None; drop those entries before the next stage
    picUrls = [url for url in map(getOriginPicLink, subsoups) if url]
    picSoups = batchGetSoups(getUrlPool, picUrls)
    piclinks = [link for link in map(findPicLink, picSoups) if link]
    downloadParams = [(picLink, saveDir) for picLink in piclinks]
    # map_async returns immediately; __main__ waits for the downloads via dwpicPool.join()
    dwpicPool.map_async(downloadForSinleParam, downloadParams)

def downloadAllForAPage(entryurl):
    '''
       download serial pics in a page
    '''

    print 'entryurl: ', entryurl
    soups = batchGetSoups(getUrlPool,[entryurl])
    if not soups or soups[0] is None:
        return

    soup = soups[0]
    #print soup.prettify()
    picLinks = soup.find_all('a', class_='picLink')
    if len(picLinks) == 0:
        return
    hrefs = map(lambda link: link.attrs['href'], picLinks)
    map(downloadForASerial, hrefs)

def downloadAll(serial_num, start, end, taskPool=None):
    entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html'
    entryUrls = [ (entryUrl % (serial_num, ind)) for ind in range(start, end+1)]
    execDownloadTask(entryUrls, taskPool)

def execDownloadTask(entryUrls, taskPool=None):
    if taskPool:
        print 'using pool to download ...'
        taskPool.map(downloadAllForAPage, entryUrls)
    else:
        map(downloadAllForAPage, entryUrls)

if __name__ == '__main__':
    createDir(saveDir)
    taskPool = Pool(processes=ncpus)

    serial_num = 145
    total = 4
    nparts = divideNParts(total, 2)
    for part in nparts:
        # parts are 0-based half-open ranges; listing pages are 1-based and inclusive
        start = part[0]+1
        end = part[1]
        # taskPool=None: pages are handled in this process and the process pool stays idle
        downloadAll(serial_num, start, end, taskPool=None)
    taskPool.close()
    taskPool.join()
    # wait for the asynchronous downloads queued by map_async before exiting
    getUrlPool.close()
    getUrlPool.join()
    dwpicPool.close()
    dwpicPool.join()

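Key points

Decorators

catchExc is a simple exception catcher: it catches the exceptions raised in the decorated function and prints details for troubleshooting. _deco(*args, **kwargs) is a Python function with a generic signature, and the decorator returns a function reference rather than a concrete value.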
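A minimal variant of such a decorator is sketched below. It differs from the article's catchExc in two ways that are additions of this sketch: it uses functools.wraps so the wrapper keeps the name of the decorated function, and it formats the arguments with %r. The names catch_exc and parse_int are illustrative only.

```python
import functools


def catch_exc(func):
    """Return a wrapper that calls func and turns any exception into None."""
    @functools.wraps(func)            # keep func.__name__ and its docstring
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # args is a tuple and kwargs a dict; %r prints them whole
            print("error in %s(%r, %r): %s" % (func.__name__, args, kwargs, e))
            return None
    return _deco


@catch_exc
def parse_int(text):
    return int(text)


good = parse_int("8")       # the wrapped call succeeds and returns the value
bad = parse_int("eight")    # int() raises ValueError; the wrapper returns None
```

Returning None keeps a long batch job running when one item fails, at the price that every caller has to check for None before using the result.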

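Fetching dynamic data

For example, the number of pictures in the series page http://dp.pconline.com.cn/photo/4846936.html is loaded dynamically by JavaScript (the request can be found with Chrome's network capture tools). We therefore have to issue the corresponding request to get the data instead of parsing the static page. The drawback is that the tool now depends on requests specific to this site, which is clearly inflexible.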

function loadPicAmount(){
    var photoId=4846936;
    var url="/public/photo/include/2016/pic_photo/intf/loadPicAmount.jsp?photoId="+photoId;
    $.get(url,function(data){
        var picAmount=data;
        $("#picAmount").append(picAmount);
    });
}
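On the Python side, the same request URL can be rebuilt from the address of a series page. The sketch below only constructs the URL and makes no network call. The helper name pic_amount_url is invented here, and stripping a trailing picture index such as _2 is an extra step of this sketch that the article's parseTotal does not perform.

```python
def pic_amount_url(serial_url):
    """Build the loadPicAmount.jsp URL for the address of a series page."""
    # ".../photo/4846936.html" -> "4846936"; also accept ".../photo/4846936_2.html"
    file_name = serial_url.rsplit("/", 1)[1]
    photo_id = file_name.split(".")[0].split("_")[0]
    return ("http://dp.pconline.com.cn/public/photo/include/2016/pic_photo/"
            "intf/loadPicAmount.jsp?photoId=%s" % photo_id)


count_url = pic_amount_url("http://dp.pconline.com.cn/photo/4846936.html")
```

Keeping the endpoint path in a single function limits the damage when the site changes it, which is the fragility the paragraph above points out.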

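Using Soup

BeautifulSoup really is a powerful tool for picking elements out of a web page in a jQuery-like style. It also shows that borrowing an established idiom to open up a new area makes a tool easier for users to accept.

(1) Get an element by id: find(id="")

(2) Get an element by class: hdlink = subsoup.find('a', class_='aView aViewHD')

(3) Get an element by HTML tag: picsoup.find('img', src=re.compile(r"\.jpg")) ; totalNode = soup.find('p')

(4) Get all matching elements: soup.find_all('a', class_='picLink')

(5) Get the text of an element: totalNode.text

(6) Get an attribute of an element: hdlink.attrs['ourl']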

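Batch processing

The concurrent batch version makes heavy use of map(func, list), lambda expressions, and list comprehensions, which keep batch operations concise and clear.

In addition, any of these map calls can be replaced with a concurrent equivalent when appropriate.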
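The swap from a sequential map to a concurrent one can be sketched as follows. The per-item function build_sub_url is a hypothetical example chosen for this sketch; in a real run the benefit only shows up when the mapped function waits on I/O.

```python
from multiprocessing.dummy import Pool as ThreadPool  # threads behind the Pool API


def build_sub_url(index):
    """Hypothetical per-item function: URL of the index-th picture page."""
    return "http://dp.pconline.com.cn/photo/3687487_%d.html" % index


indexes = [1, 2, 3]

# Sequential: the built-in map (wrapped in list() so it also works on Python 3).
sequential = list(map(build_sub_url, indexes))

# Concurrent: same function, same inputs, results in the same order.
pool = ThreadPool(2)
parallel = pool.map(build_sub_url, indexes)
pool.close()
pool.join()
```

Because pool.map preserves input order, the surrounding code does not have to change when one form is exchanged for the other.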

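Modularization

The concurrent version is split into several Python files, with general-purpose functions pulled out and grouped so that they can be reused later.

For the imports to work, PYTHONPATH has to include the directory that holds these shared files: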

export PYTHONPATH=$PYTHONPATH:~/Workspace/python/pystudy/pystudy/common

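Problems encountered

Multithreading

One problem was that fetching the picture count and the page data was unreliable: sometimes it worked and sometimes it did not. Printing the HTTP responses showed that requests succeeded at first and then failed in intermittent bursts with 503 Service Unavailable. The server is probably protecting itself. To fetch data reliably, the request rate was lowered by sleeping 500 ms before each request; see time.sleep(delayForHttpReq), with delayForHttpReq = 0.5, in getSoup in net.py. After all, the aim is not to attack the server, only to automate downloading pictures from the site conveniently.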
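A fixed delay is what the article uses. A complementary idea, shown here only as a sketch and not part of the article's code, is to retry a request that came back as 503 and to wait longer before each new attempt. The function fetch_with_retry and its parameters are hypothetical; the fetch callable is passed in so that the logic can be tried without a network.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5, sleep=time.sleep):
    """Call fetch(url) until it stops answering 503, waiting longer each time.

    fetch is any callable returning (status_code, body). Returns the body of
    the first 200 response, or None if no attempt succeeded.
    """
    for attempt in range(retries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 503:
            return None                        # not a throttling error; give up
        if attempt < retries - 1:
            sleep(delay * (2 ** attempt))      # 0.5 s, 1 s, 2 s, ...
    return None


# Simulated server: two 503 answers, then success.
answers = [(503, None), (503, None), (200, "<html>ok</html>")]
waits = []
body = fetch_with_retry(lambda url: answers.pop(0), "http://example.com/page",
                        sleep=waits.append)
```

In real use, fetch could wrap requests.get and return (response.status_code, response.text); that wiring is left out here so the sketch has no third-party dependency.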

A problem with map on a process pool

from multiprocessing import Pool

taskPool = Pool(2)

def g(x):
    return x+1


def h():
    return taskPool.map(g, [1,2,3,4])


if __name__ == '__main__':

    print h()
    taskPool.close()
    taskPool.join()

This fails with the following error:

AttributeError: 'module' object has no attribute 'g'

The fix is to move the definition of taskPool into the scope of if __name__ == '__main__': .

if __name__ == '__main__':

    taskPool = Pool(2)
    print h()

    taskPool.close()
    taskPool.join()

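The reason is given in https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers (16.6.1.5. Using a pool of workers):

Functionality within this package requires that the __main__ module be importable by the children.

Honestly, I did not quite understand what that means at first. A likely reading: on Unix the worker processes are forked at the moment Pool(2) is called. In the failing script that happens before g is defined, so the workers' copy of __main__ has no g to look up when a task arrives.

https://stackoverflow.com/questions/20222534/python-multiprocessing-on-windows-if-name-main is another reference. The gist is that processes must not be created while a module is being imported.

PS: I searched the web for ages and finally found the answer in a passage I had carelessly skipped. The lesson: read the official documentation more, and rely less on shortcut answers.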

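To be continued

The article http://www.cnblogs.com/lovesqcc/p/8830526.html implements a more general version of this batch image download tool.

This is an original article; please credit the source when reposting. Thanks! :)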
