A Python 2.7 crawler for syncing photos from Huawei Cloud

Reposted article. Published 2016-05-19 10:28. Tags: file download / crawler / python / Huawei Cloud / multithreading / urllib2

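1. Background

As Huawei phone sales grow, the bundled Huawei Cloud service is used more and more widely. It automatically syncs photos, contacts, notes and so on, which is genuinely convenient, but that convenience comes with a data-management risk.
Huawei currently offers only the www.hicloud.com website for managing this data and no sync tool for Windows, so managing and syncing the data is very awkward.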

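2. Features

After a few days of experimenting, the code currently does the following:
1. Calls the login URL automatically, displays the captcha, and waits for the captcha to be typed in by hand;
2. If the captcha is wrong, calls the login URL again, allowing up to 3 attempts (in the listing, a wrong user name or password exits immediately);
3. Opens the album folders and, album by album, collects the real URLs of the photos and videos;
4. Option 1: saves the real file URLs to a text file, to be batch-downloaded by hand with a tool such as Xunlei;
Option 2: creates local folders and syncs the photos, videos and other files from the server one at a time in a single thread;
Option 3: improves on option 2 by fetching the files with multiple threads.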

3. Code walkthrough

A. Login flow

Visiting http://www.hicloud.com triggers a chain of redirects:
1. First, an in-page refresh to http://www.hicloud.com/others/login.action
2. Then a redirect to https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout
3. Then a redirect to https://hwid1.vmall.com/casserver/remoteLogin?service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&loginUrl=https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&lang=zh-cn&adUrl=https://www.hicloud.com:443/others/show_advert.action
4. Then a redirect to https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&validated=true&service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&adUrl=https://www.hicloud.com:443/others/show_advert.action&lang=zh-cn
This URL renders the login page, and the program logs in through link 4 directly.
(What a pain, all these redirects. Maybe the Huawei admins think this keeps crawlers out? At first I captured the traffic in Firefox, and the redirects made the requests much harder to trace. Then I brought out Fiddler4, and that settled it!)
5. Link 4 includes a request that refreshes the captcha:
https://hwid1.vmall.com/casserver/randomcode?randomCodeType=emui4_login&_t=1462786575782
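The _t parameter is the local system time in milliseconds.
6. Next, the login form is submitted with a POST to https://hwid1.vmall.com/casserver/remoteLogin
7. After a successful login there are three more redirects: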
https://www.hicloud.com:443/others/login.action?lang=zh-cn&ticket=1ST-157502-OV1212126aV9BcM9Sh2Dpe-cas
https://www.hicloud.com:443/others/login.action?lang=zh-cn
https://www.hicloud.com:443/home
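A failed login redirects back to link 4, which is why the program logs in through link 4 directly. The redirect URL after a wrong captcha looks like this: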
https://hwid1.vmall.com/oauth2/account/login?validated=true&errorMessage=random_code_error|user_pwd_continue_error&service=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Flogin.action%3Flang%3Dzh-cn&loginChannel=1000002&reqClientType=1&adUrl=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Fshow_advert.action%3Flang%3Dzh-cn&lang=zh-cn&viewT
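
The failure URL above carries its error codes in the errorMessage query parameter, separated by "|". As a minimal sketch (not part of the original program, which simply searches the URL string), the codes can be pulled out with the standard URL parser. The function name login_errors and the shortened sample URL are illustrative assumptions.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7


def login_errors(final_url):
    """Return the set of error codes found in the errorMessage parameter."""
    query = parse_qs(urlparse(final_url).query)
    codes = set()
    for value in query.get('errorMessage', []):
        # Several codes may be joined with '|'
        codes.update(code for code in value.split('|') if code)
    return codes


# Shortened version of the failure URL shown above (hypothetical sample)
failed = ('https://hwid1.vmall.com/oauth2/account/login?validated=true'
          '&errorMessage=random_code_error|user_pwd_continue_error'
          '&loginChannel=1000002&reqClientType=1&lang=zh-cn')

print(sorted(login_errors(failed)))
print(login_errors('https://www.hicloud.com:443/home'))
```

An empty set means the final URL carried no error code, which the program treats as a successful login.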

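B. Function notes

1. hw.enableCookies()
Sets up some urllib2 behaviour globally, such as turning on debug output and cookie handling. Note the word "globally": installing an opener affects every later urllib2.urlopen() call in the process.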
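
For comparison, here is a small sketch of a cookie-aware opener that is not installed globally. It is an illustration only, not the author's code: make_session_opener is a hypothetical helper, and the import fallback lets the same lines run on Python 3 and Python 2.7.

```python
try:
    import urllib.request as urlreq        # Python 3
    from http.cookiejar import CookieJar
except ImportError:
    import urllib2 as urlreq               # Python 2.7
    from cookielib import CookieJar


def make_session_opener():
    """Build an opener that stores cookies in its own CookieJar."""
    jar = CookieJar()
    opener = urlreq.build_opener(urlreq.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session_opener()
# opener.open(url) would send and store cookies through `jar`.
# Calling urlreq.install_opener(opener) as well is what makes plain
# urlopen() use the same cookies everywhere in the process.
print(len(jar))
```

Passing the opener around explicitly avoids the global side effect, at the cost of replacing each urlopen() call with opener.open().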

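2. hw.getLoginPage()
Requests link 4 from the previous section and returns the response page, which is processed in a later step.
The page supplies several parameters that are needed when the password form is submitted.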

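3. hw.getRadomCode()
Requests a captcha image from the server and opens it with the system image viewer.
Once the image is shown, the process blocks and waits for the user to type in the captcha. (I considered using an OCR package to recognise the characters, but the few public packages I found online were close to useless on Huawei's captchas, so I gave up on that.)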
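
The captcha request needs a _t value that mimics JavaScript's new Date().getTime(). A minimal sketch of building that URL follows; captcha_url is a hypothetical name, and the optional now argument exists only to make the result reproducible.

```python
import time

RANDOM_URL = 'https://hwid1.vmall.com/casserver/randomcode'


def captcha_url(now=None):
    """Build the captcha URL with a millisecond timestamp."""
    if now is None:
        now = time.time()
    millis = int(now * 1000)  # seconds since the epoch -> milliseconds
    return '%s?randomCodeType=emui4_login&_t=%d' % (RANDOM_URL, millis)


print(captcha_url(1462786575))
```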

4. hw.genLoginData(content)
Using the results of steps 2 and 3, assembles the POST string for the password-check submit.
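
The idea is to copy every named hidden input from the login page into the POST body and then append the account, password and captcha. The sketch below shows that idea with the standard-library HTML parser instead of the lxml XPath queries the program uses; HiddenInputs, build_post_data and the sample values are assumptions made for illustration.

```python
from collections import OrderedDict
try:
    from html.parser import HTMLParser   # Python 3
    from urllib.parse import urlencode
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7
    from urllib import urlencode


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements in page order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = OrderedDict()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Inputs without a name attribute are never submitted, so skip them
        if tag == 'input' and attrs.get('type') == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def build_post_data(html, account, password, authcode):
    parser = HiddenInputs()
    parser.feed(html)
    params = parser.fields
    params['userAccount'] = account
    params['password'] = password
    params['authcode'] = authcode
    return urlencode(params)


sample = '''
<input type="hidden" id="form_submit" name="submit" value="true">
<input type="hidden" id="form_loginChannel" name="loginChannel" value="1000002" />
<input type="hidden" id="form_deviceID" name="deviceID" value="" />
<input type="hidden" id="form_loginUrlForBind" value="https://example.invalid/bind" />
'''
print(build_post_data(sample, 'user@example.com', 'secret', 'ab12'))
```

The last sample input has no name attribute, so it is left out of the POST body, just as a browser would leave it out.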

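5. hw.checkUserPwd(postdata)
Calls the password-check URL to perform the actual login.
After a successful login, the CSRFToken is extracted with a regular expression from the page the server returns (in the listing this happens when the album page is opened). The value is critical: many later requests need it.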
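
A small sketch of the token extraction, run against a script fragment like the one the page contains. extract_csrf_token is a hypothetical name, and the pattern is slightly more tolerant of whitespace than the one in the listing.

```python
import re

CSRF_PATTERN = re.compile(r'CSRFToken\s*=\s*"(.*?)"')


def extract_csrf_token(page):
    """Return the CSRFToken value embedded in the page, or None if absent."""
    match = CSRF_PATTERN.search(page)
    return match.group(1) if match else None


sample = '''<script>
        var CSRFToken = "9b64dcad38d269147f2c27dc12171e60aade2a22316de213";
        var accountType = "1";
</script>'''

print(extract_csrf_token(sample))
```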

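6. hw.getAlbumPage()
Goes straight to the Huawei Cloud photo home page, https://www.hicloud.com:443/album
Normally a logged-in user has to click through several steps to reach the photo home page, which means several round trips to the backend. A crawler can simply skip those inessential requests.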

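7. hw.getAlbumList()
The album home page can be displayed in two ways: grouped by time or grouped by album name. We use the latter.
So the album list is fetched first. Note that for this request the server returns a JSON response.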
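
To make the JSON structure concrete, here is a minimal sketch that lists every album id with its photo count. list_albums and the trimmed sample response are illustrative assumptions; the key names (albumList, albumId, ownShareList, shareId, photoNum) are the ones that appear in the response quoted in the source listing.

```python
import json


def list_albums(raw):
    """Return (album id, photo count) pairs from the album-list response."""
    data = json.loads(raw)
    albums = []
    for parent_key, id_key in (('albumList', 'albumId'), ('ownShareList', 'shareId')):
        for entry in data.get(parent_key, []):
            albums.append((entry[id_key], entry['photoNum']))
    return albums


# Trimmed, anonymised sample of the response
sample = '''{"ownerId":"0","code":0,
 "albumList":[{"albumId":"default-album-1","photoNum":2521},
              {"albumId":"default-album-2","photoNum":101}],
 "ownShareList":[{"shareId":"default-album-102-0","photoNum":2}],
 "recShareList":[]}'''

for album_id, photo_num in list_albums(sample):
    print('%s: %d files' % (album_id, photo_num))
```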

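8. hw.getFileList(page,'albumList','albumId')
Using the JSON returned in step 7, loops over the albums and fetches the URLs of the files in each one.
This request also returns JSON, but here the JSON is gzip-compressed, and it turns out that Fiddler4 decompresses it automatically.
(During testing, the responses that came through the Fiddler proxy had already been decompressed, so the error only appeared when the program ran for real without the proxy. While writing this article I found that Fiddler has a switch that controls whether gzip responses are decompressed automatically. Fiddler is powerful; I will write about how to use it another time.)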
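
A minimal sketch of the decompression step, assuming the raw body and the Content-Encoding header value are already in hand. decode_body and gzip_bytes are hypothetical helpers; gzip_bytes only exists to produce test data.

```python
import gzip
import io


def decode_body(body, content_encoding):
    """Gunzip the body when the server says it is gzip, then decode as UTF-8."""
    if content_encoding == 'gzip':
        with gzip.GzipFile(fileobj=io.BytesIO(body)) as gz:
            body = gz.read()
    return body.decode('utf-8')


def gzip_bytes(text):
    """Compress text the way a server would, for demonstration only."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
        gz.write(text.encode('utf-8'))
    return buf.getvalue()


print(decode_body(gzip_bytes(u'{"fileList":[]}'), 'gzip'))
print(decode_body(b'{"code":0}', None))
```

Checking the header instead of always decompressing keeps the code working whether or not a proxy such as Fiddler has already unpacked the response.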

9. hw.getFileList(page,'ownShareList','shareId')
This does the same job as step 8. Huawei Cloud, rather oddly, keeps a separate album directory for WeChat whose JSON node is ownShareList, whereas step 8 uses albumList.
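
Both calls page through an album by advancing an offset until the reported photo count is reached. The sketch below shows that loop with one extra guard: it stops when the server returns an empty page, so a photo count that is larger than the number of files actually returned cannot cause an endless loop. fetch_all and the fake_fetch stand-in for the HTTP request are assumptions for illustration.

```python
def fetch_all(fetch_page, total, page_size=100):
    """Collect items page by page until `total` is reached or a page is empty."""
    items = []
    offset = 0
    while offset < total:
        page = fetch_page(offset, page_size)
        if not page:
            break  # the server has nothing more, whatever `total` claims
        items.extend(page)
        offset += len(page)
    return items


# Stand-in for the real request: the "server" holds 250 files
server_files = ['file%d.jpg' % i for i in range(250)]
calls = []


def fake_fetch(offset, count):
    calls.append(offset)
    return server_files[offset:offset + count]


# The album claims 260 files, ten more than really exist
result = fetch_all(fake_fetch, total=260)
print(len(result), calls)
```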

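Functions 8 and 9 support three download options. To choose one, uncomment the matching line in the code:
# Option 1: save the download URLs to a text file without downloading the files
#icurrentnum += self.saveFileList2Txt(each[childkey],page,icurrentnum)
# Option 2: download the files in a single thread
#icurrentnum += self.downFileList(each[childkey],page)
# Option 3: download the files with multiple threads
# each[childkey] is a unicode string
#print each[childkey].encode('gbk')
icurrentnum += self.downFileListMultiThread(each[childkey],page)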
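
Option 3 starts one thread per file and waits for a group of 25 to finish before starting the next group. A minimal sketch of that batching pattern follows, with a fake download function in place of the real HTTP request; chunks, run_in_batches and fake_download are hypothetical names.

```python
import threading


def chunks(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def run_in_batches(worker, jobs, batch_size=25):
    """Run worker(*job) for every job, at most batch_size threads at a time."""
    for batch in chunks(jobs, batch_size):
        threads = [threading.Thread(target=worker, args=job) for job in batch]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait for the whole batch before starting the next one


done = []
lock = threading.Lock()


def fake_download(url, name):
    # A real worker would fetch `url` and write it to the file `name`
    with lock:
        done.append(name)


jobs = [('https://example.invalid/%d' % i, 'file%d.jpg' % i) for i in range(60)]
run_in_batches(fake_download, jobs)
print(len(done))
```

Waiting per batch is simple but lets one slow file hold up the other 24 slots; a fixed pool of worker threads reading from a queue would keep all slots busy.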

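That is the end of the walkthrough. For the details, read the code; none of it is complicated.
I should add that I have not thought through or polished the exception handling, but the code does work.
In my own case, after six months with a Huawei phone I had 2536 files on the server, 9.24 GB in total. On the evening of 2016-05-14 I synced all of it to local disk over a 20 Mbps China Unicom home broadband line. I forget exactly how long it took, but the program never exited abnormally, so credit to Python's stability.
There is no guarantee that Huawei will not change its backend logic after seeing this, but the general approach should still hold.
As far as anti-crawler measures go, Taobao currently does comparatively well, mainly because its logic changes quickly and, secondly, because it is complex.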

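4. Summary

a. I have not been learning Python or crawling for long, less than a month in total, on and off. I borrowed from a lot of material online, and it was hard going at times but rewarding.
b. Python really is powerful and easy to get started with, and there is a wealth of material online. Package management is handled well too: installing with pip is simple and convenient.
c. The standard library ships with its source code (under c:\python27\lib, where c:\python27 is my Python install directory). When you are unsure about something, reading the source is the best approach, because material found online contains plenty of errors.
You do not need to understand all of it. Learning takes time, and the source is long with many classes. Start from one point and widen out gradually; reading the source also teaches you some of its coding idioms.
d. I also have to complain about character encoding in Python, which is full of pitfalls.
encode and decode troubled me for nearly a week. By now I more or less understand them and can use them.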
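
The core of the encode/decode confusion is that text (unicode) and bytes are different things, and the same text produces different bytes under different encodings. A short sketch, using the album name that appears in the sample data:

```python
# -*- coding: utf-8 -*-

name = u'微信'                          # text: a sequence of characters

utf8_bytes = name.encode('utf-8')      # text -> bytes, 3 bytes per character here
gbk_bytes = name.encode('gbk')         # text -> bytes, 2 bytes per character here

print(len(name), len(utf8_bytes), len(gbk_bytes))

# Decoding must use the encoding the bytes were written in
assert utf8_bytes.decode('utf-8') == name
assert gbk_bytes.decode('gbk') == name
```

This is why the program calls .encode('gbk') on file names before writing to disk on a Chinese-locale Windows system, while the server responses are decoded as UTF-8.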

5. Source code

synchuaweiphoto.py

# -*- coding: utf-8 -*-
__author__ = 'zhongtang'

import urllib
import urllib2
import cookielib
import time
from collections import OrderedDict
from PIL import Image
from lxml import etree
import re
import json
import htmltool
import os
import threading
import gzip
import StringIO
import requests


class HuaWei:
    '''
    Huawei Cloud login.

    Visiting http://www.hicloud.com triggers a chain of redirects:
    1. First, an in-page refresh to http://www.hicloud.com/others/login.action
    2. Then a redirect to https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout
    3. Then a redirect to https://hwid1.vmall.com/casserver/remoteLogin?service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&loginUrl=https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&lang=zh-cn&adUrl=https://www.hicloud.com:443/others/show_advert.action
    4. Then a redirect to https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&validated=true&service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&adUrl=https://www.hicloud.com:443/others/show_advert.action&lang=zh-cn
    This URL renders the login page; the program logs in through link 4 directly.
    5. Link 4 includes a request that refreshes the captcha: https://hwid1.vmall.com/casserver/randomcode?randomCodeType=emui4_login&_t=1462786575782
    6. Next, the form is POSTed to https://hwid1.vmall.com/casserver/remoteLogin
    7. After a successful login there are three more redirects:
    https://www.hicloud.com:443/others/login.action?lang=zh-cn&ticket=1ST-157502-OVRaMo6aV232229Sh2Dpe-cas
    https://www.hicloud.com:443/others/login.action?lang=zh-cn
    https://www.hicloud.com:443/home
    A failed login redirects back to link 4, which is why the program logs in through link 4 directly.
    https://hwid1.vmall.com/oauth2/account/login?validated=true&errorMessage=random_code_error|user_pwd_continue_error&service=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Flogin.action%3Flang%3Dzh-cn&loginChannel=1000002&reqClientType=1&adUrl=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Fshow_advert.action%3Flang%3Dzh-cn&lang=zh-cn&viewT
    '''

    def __init__(self):
        self.username = 'username@yeah.net'  # user name
        self.passwd = 'userpassword'  # password
        self.authcode = ''  # captcha
        self.baseUrl = 'https://hwid1.vmall.com'
        self.loginUrl = self.baseUrl + '/oauth2/account/login?reqClientType=1&validated=true&service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&adUrl=https://www.hicloud.com:443/others/show_advert.action&lang=zh-cn'
        #self.loginUrl='https://www.hicloud.com'
        self.randomUrl = self.baseUrl + '/casserver/randomcode'
        self.checkpwdUrl = self.baseUrl + '/casserver/remoteLogin'
        self.successUrl = 'https://www.hicloud.com:443/album'
        self.getalbumsUrl = 'https://www.hicloud.com/album/getCloudAlbums.action'
        self.getalbumfileUrl = 'https://www.hicloud.com/album/getCloudFiles.action'
        self.loginHeaders = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
            'Connection': 'keep-alive'
        }
        self.CSRFToken = ''
        self.OnceMaxFile = 100  # maximum number of files fetched per request
        self.FileList = {}  # photo list
        self.ht = htmltool.htmltool()
        self.curPath = self.ht.getPyFileDir()
        self.FileNum = 0

    # Enable cookie handling for urllib2
    def enableCookies(self):
        # Create a cookie container
        self.cookies = cookielib.CookieJar()
        # Bind the cookie container to an HTTP cookie processor
        cookieHandler = urllib2.HTTPCookieProcessor(self.cookies)
        # debuglevel=1 prints the HTTP traffic for debugging
        httpHandler = urllib2.HTTPHandler(debuglevel=1)
        httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
        self.opener = urllib2.build_opener(cookieHandler, httpHandler, httpsHandler)
        # Install the opener; every later urlopen() call uses it
        urllib2.install_opener(self.opener)

    # Current time in milliseconds, as a string
    def getJstime(self):
        itime = int(time.time() * 1000)
        return str(itime)

    # Fetch the captcha
    def getRadomCode(self, repeat=2):
        '''
        -- js
        function chgRandomCode(ImgObj, randomCodeImgSrc) {
        ImgObj.src = randomCodeImgSrc+"?randomCodeType=emui4_login&_t=" + new Date().getTime();
        };
        -- http
        GET /casserver/randomcode?randomCodeType=emui4_login&_t=1462786575782 HTTP/1.1
        '''
        data = ''
        ostime = self.getJstime()
        filename = os.path.join(self.curPath, ostime + '.png')
        url = self.randomUrl + "?randomCodeType=emui4_login&_t=" + ostime
        #print url
        try:
            request = urllib2.Request(url, headers=self.loginHeaders)
            response = urllib2.urlopen(request)
            data = response.read()
        except Exception:
            time.sleep(5)
            print u'Failed to fetch captcha image [%s], attempt:\n[%s]' % (url, 2 - repeat)
            if repeat > 0:
                return self.getRadomCode(repeat - 1)
        if len(data) <= 0:
            return
        with open(filename, 'wb') as f:
            f.write(data)
        im = Image.open(filename)
        im.show()
        self.authcode = raw_input('Enter the 4-character captcha: ')
        # Delete the captcha image file
        os.remove(filename)

    def genLoginData(self, content):
        '''
        1<input type="hidden" id="form_submit" name="submit" value="true">
        2<input type="hidden" id="form_loginUrl" name="loginUrl" value="https://hwid1.vmall.com/oauth2/account/login" />
        3<input type="hidden" id="form_service" name="service" value="https://www.hicloud.com:443/others/login.action?lang=zh-cn" />
        4<input type="hidden" id="form_loginChannel" name="loginChannel" value="1000002" />
        5<input type="hidden" id="form_reqClientType" name="reqClientType" value="1" />
        6<input type="hidden" id="form_deviceID" name="deviceID" value="" />
        7<input type="hidden" id="form_adUrl" name="adUrl" value="https://www.hicloud.com:443/others/show_advert.action?lang=zh-cn" />
        8<input type="hidden" id="form_lang" name="lang" value="zh-cn" />
        9<input type="hidden" id="form_inviterUserID" name="inviterUserID" value="" />
        10<input type="hidden" id="form_inviter" name="inviter" value="" />
        11<input type="hidden" id="form_viewType" name="viewType" value="0" />
        12<input type="hidden" id="form_quickAuth" name="quickAuth" value="" />
        <input type="hidden" id="form_loginUrlForBind"  value="https://hwid1.vmall.com/oauth2/portal/thirdAccountBindByPhoneForPCWeb.jsp?themeName=cloudTheme" />
        '''
        tree = etree.HTML(content)
        form = tree.xpath('//div[@class="login-box"]')[0]
        # The 12 hidden fields listed above, in page order
        fieldnames = ['submit', 'loginUrl', 'service', 'loginChannel',
                      'reqClientType', 'deviceID', 'adUrl', 'lang',
                      'inviterUserID', 'inviter', 'viewType', 'quickAuth']
        params = OrderedDict()
        for name in fieldnames:
            params[name] = form.xpath('//*[@name="%s"]/@value' % name)[0]
        params['userAccount'] = self.username
        params['password'] = self.passwd
        params['authcode'] = self.authcode
        return urllib.urlencode(params)

    def getLoginPage(self):
        request = urllib2.Request(self.loginUrl, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        page = response.read()
        return page.decode('utf-8')

    def checkUserPwd(self, postdata):
        # The body of this method is missing from the listing as published.
        # It is reconstructed here from the main program, which searches the
        # returned redirect URL for the error codes.
        request = urllib2.Request(self.checkpwdUrl, headers=self.loginHeaders, data=postdata)
        response = urllib2.urlopen(request)
        return response.geturl()

    def getCSRFToken(self, page):
        '''
        <input type="hidden" value="" id="userHeadPic">
        <input type="hidden" value="1" id="activeUserState"/>
        <input type="hidden" value='[{"deviceType":0,"deviceID":"1231231231212312312312","terminalType":"huawei mt7-tl00","deviceAliasName":"HUAWEI MT7-TL00"}]' id="deviceList" />
        <input type="hidden" value='www.hicloud.com' id="server" />
        <input type="hidden" value='1' id="biFlag" />
        <input type="hidden" value='https://dc.hicloud.com' id="biUrl" />
        <script>
                var CSRFToken = "9b64dcad38d269147f2c27dc12171e60aade2a22316de213";
                var accountType = "1";
                var accountTypeLh = "4";
        </script>
        '''
        self.CSRFToken = ''
        pattern = re.compile('CSRFToken = "(.*?)"', re.S)
        # Save the CSRFToken
        content = re.search(pattern, page)
        if content:
            self.CSRFToken = content.group(1)
            return True
        return False

    # Open the album page and extract the CSRFToken, which every later request needs
    def getAlbumPage(self):
        request = urllib2.Request(self.successUrl, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        page = response.read()
        return self.getCSRFToken(page.decode('utf-8'))

    def saveImage(self, imgUrl, imgName="default.jpg"):
        """
        Save a remote image to a local file.
        @param imgUrl  : URL of the image to save
        @param imgName : local file name to save it as
        @return None
        """
        # Download with requests.get; the URL is https, hence verify=False
        # (which disables certificate verification)
        response = requests.get(imgUrl, stream=True, verify=False)
        image = response.content
        filename = imgName
        print("Saving file " + filename + "\n")
        try:
            # The with block closes the file, even on error
            with open(filename, "wb") as jpg:
                jpg.write(image)
        except IOError:
            print("IO Error\n")

    def downFileMultiThread(self, urllist, namelist):
        """
        Download in parallel, one thread per file. The number of threads is
        not limited here; the caller limits it by passing at most 25 files.
        @param urllist  : URLs of the files to download
        @param namelist : matching local file names
        @return None
        """
        task_threads = []  # worker threads
        for fileurl, filename in zip(urllist, namelist):
            t = threading.Thread(target=self.saveImage, args=(fileurl, filename))
            task_threads.append(t)
        for task in task_threads:
            task.start()
        for task in task_threads:
            task.join()

    # Download an album's files with multiple threads; each album gets its own directory
    def downFileListMultiThread(self, dirname, hjsondata):
        if len(hjsondata) <= 0:
            return 0
        hjson2 = json.loads(hjsondata)
        # Create the directory and change into it
        self.ht.mkdir(dirname)
        filelist = hjson2.get("fileList", [])
        urllist = []
        namelist = []
        for i, each in enumerate(filelist, 1):
            urllist.append(each["fileUrl"].encode('gbk'))
            namelist.append(each["fileName"].encode('gbk'))
            self.FileNum += 1
            # Start a parallel download every 25 files and for the final
            # partial group, then clear the lists
            if i % 25 == 0 or i == len(filelist):
                self.downFileMultiThread(urllist, namelist)
                urllist = []
                namelist = []
        return len(filelist)

    # Download an album's files one by one; each album gets its own directory
    def downFileList(self, dirname, hjsondata):
        if len(hjsondata) <= 0:
            return 0
        hjson2 = json.loads(hjsondata)
        # Create the directory and change into it
        self.ht.mkdir(dirname)
        filelist = hjson2.get("fileList", [])
        for i, each in enumerate(filelist, 1):
            self.saveImage(each["fileUrl"].encode('gbk'), each["fileName"].encode('gbk'))
            self.FileNum += 1
            # Pause for 2 seconds after every 5 files
            if i % 5 == 0:
                time.sleep(2)
        return len(filelist)

    # Save an album's file URLs to a text file; each album gets its own file
    def saveFileList2Txt(self, filename, hjsondata, flag):
        if len(hjsondata) <= 0:
            return 0
        hjson2 = json.loads(hjsondata)
        lfilename = filename + u".txt"
        if flag == 0:  # create the file
            print u'Creating album file ' + lfilename + "\n"
            # A new file means a new album, so restart the count
            self.FileNum = 0
            f = open(lfilename, 'wb')
        else:  # append to the file
            f = open(lfilename, 'a')
        filelist = hjson2.get("fileList", [])
        for each in filelist:
            f.write(each["fileUrl"].encode('gbk') + "\n")
            # Insert a page marker every 1000 lines
            self.FileNum += 1
            if self.FileNum % 1000 == 0:
                f.write('\n\n\n\n\n\n--------------------page %s ------------------\n\n\n\n\n\n' % (int(self.FileNum / 1000)))
        f.close()
        return len(filelist)

    # Loop over the albums and fetch their files
    def getFileList(self, hjsondata, parentkey, childkey):
        # Step 3: fetch each album's file list in a loop, at most
        # OnceMaxFile (100) records per request.
        # In the captured browser request, count is always the maximum (49),
        # whether or not that many files remain, even on the first request for
        # an album holding only 2 files. currentNum advances with each request
        # until the server returns an empty list:
        #{"albumSortFlag":true,"code":0,"info":"success!","fileList":[]}
        # Captured request body:
        #albumIds[]=default-album-102-221216000029851117&ownerId=220012300029851117&height=300&width=300&count=49&currentNum=0&thumbType=imgcropa&fileType=0
        # Parsed album list:
        #[{u'photoNum': 2518, u'albumName': u'default-album-1', u'iversion': -1, u'albumId': u'default-album-1', u'flversion': -1, u'createTime': 1448065264550L, u'size': 0},
        #{u'photoNum': 100, u'albumName': u'default-album-2', u'iversion': -1, u'albumId': u'default-album-2', u'flversion': -1, u'createTime': 1453090781646L, u'size': 0}]
        hjson = json.loads(hjsondata.decode('utf-8'))
        if parentkey in hjson:
            for each in hjson[parentkey]:
                paraAlbum = {}
                paraAlbum['albumIds[]'] = each[childkey]
                paraAlbum['ownerId'] = hjson['ownerId']
                paraAlbum['height'] = '300'
                paraAlbum['width'] = '300'
                paraAlbum['count'] = self.OnceMaxFile
                paraAlbum['thumbType'] = 'imgcropa'
                paraAlbum['fileType'] = '0'
                itotal = each['photoNum']
                icurrentnum = 0
                while icurrentnum < itotal:
                    paraAlbum['currentNum'] = icurrentnum
                    paraAlbumstr = urllib.urlencode(paraAlbum)
                    request = urllib2.Request(self.getalbumfileUrl, headers=self.loginHeaders, data=paraAlbumstr)
                    response = urllib2.urlopen(request)
                    rheader = response.info()
                    page = response.read()
                    # Decompress gzip responses
                    if rheader.get('Content-Encoding') == 'gzip':
                        data = StringIO.StringIO(page)
                        gz = gzip.GzipFile(fileobj=data)
                        page = gz.read()
                        gz.close()
                    page = page.decode('utf-8')
                    # Option 1: save the download URLs to a text file without downloading the files
                    #fetched = self.saveFileList2Txt(each[childkey],page,icurrentnum)
                    # Option 2: download the files in a single thread
                    #fetched = self.downFileList(each[childkey],page)
                    # Option 3: download the files with multiple threads
                    # each[childkey] is a unicode string
                    #print each[childkey].encode('gbk')
                    fetched = self.downFileListMultiThread(each[childkey], page)
                    # An empty page means the server has no more files; stop
                    # even if photoNum claimed more, to avoid looping forever
                    if not fetched:
                        break
                    icurrentnum += fetched
        return

    # Step 1: getCloudAlbums, fetch the album list
    def getAlbumList(self):
        self.loginHeaders = {
            'Host': 'www.hicloud.com',
            'Connection': 'keep-alive',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Origin': 'https://www.hicloud.com',
            'X-Requested-With': 'XMLHttpRequest',
            'CSRFToken': self.CSRFToken,
            'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
            'DNT': '1',
            'Referer': 'https://www.hicloud.com/album',
            'Accept-Encoding': 'gzip,deflate',
            'Accept-Language': 'zh-CN',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        }
        request = urllib2.Request(self.getalbumsUrl, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        page = response.read()
        '''# Response
        {"ownerId":"220012300029851117","code":0,
        "albumList":[{"albumId":"default-album-1","albumName":"default-album-1","createTime":1448065264550,"photoNum":2521,"flversion":-1,"iversion":-1,"size":0},
                     {"albumId":"default-album-2","albumName":"default-album-2","createTime":1453090781646,"photoNum":101,"flversion":-1,"iversion":-1,"size":0}],
        "ownShareList":[{"ownerId":"220012300029851117","resource":"album","shareId":"default-album-102-220123000029851117","shareName":"微信","photoNum":2,"flversion":-1,"iversion":-1,"createTime":1448070407055,"source":"HUAWEI MT7-TL00","size":0,"ownerAcc":"jdstkxx@yeah.net","receiverList":[]}],
        "recShareList":[]}'
        '''
        if len(page) <= 0:
            print u'Failed to fetch the album list: empty response!\n\n'
        return page


# Main program
hw = HuaWei()
hw.enableCookies()
count = 0
while count < 3:
    count += 1
    content = hw.getLoginPage()
    if content == '':
        print u'Failed to fetch the login page, exiting!\n\n[%s]\n\n' % (content)
        break
    # Fetch the captcha
    hw.getRadomCode()
    # Build the POST data for the checkUserPwd submit
    postdata = hw.genLoginData(content)
    #print postdata
    reUrl = hw.checkUserPwd(postdata)
    if "user_pwd_error" in reUrl:
        print u'Wrong user name or password, exiting!\n\n[%s]\n\n' % (reUrl)
        break
    elif "random_code_error" in reUrl:
        print u'Wrong captcha, retrying!\n\n[%s]\n\n' % (reUrl)
        continue
    else:
        print u'Logged in to Huawei Cloud successfully!\n\n'
        if not hw.getAlbumPage():
            print u'Failed to open the album page: no CSRFToken found!\n\n'
            break
        print u'Opened the album home page and got the CSRFToken!\n\n'
        page = hw.getAlbumList()
        if page == '':
            print u'Failed to fetch the album list!\n\n'
            break
        # Process the album list
        hw.getFileList(page, 'albumList', 'albumId')
        # Process the shared album list
        hw.getFileList(page, 'ownShareList', 'shareId')
        print u'Finished. With option 1, the saved URL files can now be opened in Xunlei for batch download.\n\n'
        break

htmltool.py

# -*- coding: utf-8 -*-
__author__ = 'zhongtang'

import re
import HTMLParser
import cgi
import sys
import os


# Helpers for handling HTML markup
class htmltool:
    # Remove img tags, runs of 1-7 spaces, and &nbsp;
    removeImg = re.compile('<img.*?>| {1,7}|&nbsp;')
    # Remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # Tags that are replaced by \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Replace the table cell tag <td> with \t
    replaceTD = re.compile('<td>')
    # Replace single or double <br> with \n
    replaceBR = re.compile('<br><br>|<br>')
    # Strip all remaining tags
    removeExtraTag = re.compile('<.*?>')
    # Collapse runs of blank lines
    removeNoneLine = re.compile('\n+')

    # HTML to text,
    # e.g. '&lt;abc&gt;' --> '<abc>'
    def html2txt(self, html):
        html_parser = HTMLParser.HTMLParser()
        txt = html_parser.unescape(html)
        return txt.strip()

    # Text to HTML,
    # e.g. '<abc>' --> '&lt;abc&gt;'
    def txt2html(self, txt):
        html = cgi.escape(txt)
        return html.strip()

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        x = re.sub(self.removeNoneLine, "\n", x)
        # strip() removes leading and trailing whitespace
        return x.strip()

    # Return the directory of the running script as a unicode string
    def getPyFileDir(self):
        # Script path
        path = sys.path[0]
        # Decode with the file-system encoding; a path with non-ASCII
        # characters is not UTF-8 on a Chinese-locale Windows system
        encoding = sys.getfilesystemencoding() or 'utf-8'
        # For a plain script, sys.path[0] is the script's directory. For a
        # py2exe build it is the path of the compiled file, so take its directory.
        if os.path.isdir(path):
            return path.decode(encoding)
        elif os.path.isfile(path):
            return os.path.dirname(path).decode(encoding)

    # Create a directory under the script directory and change into it
    def mkdir(self, path):
        path = path.strip()
        pathDir = self.getPyFileDir()
        # Both parts are unicode strings
        path = os.path.join(pathDir, path)
        # Create the directory only if it does not exist yet
        if not os.path.exists(path):
            #print u'Creating folder [%s]\n' %(path)
            os.makedirs(path)
        #else:
            #print u'Folder [%s] already exists\n' %(path)
        os.chdir(path)
        return path
