The 2018 Zhihu redesign added several new fields to the login POST data, which stumped a beginner like me for a few days. After a round of searching (GitHub and various expert blogs), I finally managed to simulate a login to the new 2018 version of Zhihu.
The approach is as follows:
1. In Chrome, open the Zhihu login page, press F12 to open the developer tools, press F5 to refresh, and select the Network tab. Enter your account name with a wrong password (with the correct password the login succeeds and jumps straight to the home page, so the login requests can no longer be analysed), and watch which requests are submitted during the login.
The important ones are the four requests in the screenshot above:
(1) The first is a GET request. Its response is JSON; show_captcha being true means a captcha is required, false means it is not.
(2) The second is a PUT request, which returns the captcha as a base64 string; decoding it gives a captcha image containing letters.
(3) The third is a POST request that reports whether the server accepted the captcha; if the result is true, verification succeeded and the final login request can be sent.
(4) The last one is the login request itself (POST). In the new version of Zhihu the login POST data is sent as a Request Payload, which is quite a bit more complex than before.
Just include these fields when sending the request. The tricky ones are signature and timestamp: signature is computed by JavaScript (search the sources for "signature" with Ctrl+Shift+F in DevTools), and the encryption can be reproduced in Python to obtain its value; timestamp is a 13-digit millisecond timestamp. client_id is identical to the trailing part of the authorization request header.
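As a quick illustration before the full spider, here is a minimal standalone sketch of that signature step. The HMAC key below is the one pulled out of Zhihu's JS at the time of writing (it may well have changed since); the logic mirrors the get_signature method in the code further down.

import hmac
import time
from hashlib import sha1

def make_signature(grant_type, client_id, source, timestamp,
                   key=b'd1b964811afb40118a12068ff74a12f4'):
    # HMAC-SHA1 over grant_type + client_id + source + timestamp, hex-encoded
    hm = hmac.new(key, None, sha1)
    for part in (grant_type, client_id, source, timestamp):
        hm.update(part.encode('utf-8'))
    return hm.hexdigest()

timestamp = str(int(time.time() * 1000))   # 13-digit millisecond timestamp
print(make_signature('password', 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                     'com.zhihu.web', timestamp))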
2. The complete code:
# -*- coding: utf-8 -*-
__author__ = 'Mark'
__date__ = '2018/4/15 10:18'

import hmac
import json
import scrapy
import time
import base64
from hashlib import sha1


class ZhihuLoginSpider(scrapy.Spider):
    name = 'zhihu03'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    # agent = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
    headers = {
        'Connection': 'keep-alive',
        'Host': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/signup?next=%2F',
        'User-Agent': agent,
        'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    grant_type = 'password'
    client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
    source = 'com.zhihu.web'
    # 13-digit millisecond timestamp; truncate to int before converting to str
    timestamp = str(int(time.time() * 1000))
    # timestamp2 = str(time.time() * 1000)  # wrong: keeps the decimal part and triggers a 500 (see the note below)

    def get_signature(self, grant_type, client_id, source, timestamp):
        """Build the signature: HMAC-SHA1 over grant_type + client_id + source + timestamp."""
        hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, sha1)
        hm.update(str.encode(grant_type))
        hm.update(str.encode(client_id))
        hm.update(str.encode(source))
        hm.update(str.encode(timestamp))
        return str(hm.hexdigest())

    def parse(self, response):
        print(response.body.decode("utf-8"))

    def start_requests(self):
        # Step 1: GET the captcha API to find out whether a captcha is needed
        yield scrapy.Request('https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                             headers=self.headers, callback=self.is_need_capture)

    def is_need_capture(self, response):
        print(response.text)
        need_cap = json.loads(response.body)['show_captcha']
        print(need_cap)

        if need_cap:
            print('Captcha required')
            # Step 2: PUT to the captcha API to get the base64-encoded captcha image
            yield scrapy.Request(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                headers=self.headers,
                callback=self.capture,
                method='PUT'
            )
        else:
            print('No captcha required')
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            post_data = {
                "client_id": self.client_id,
                "username": "***********",  # your Zhihu user name
                "password": "***********",  # your Zhihu password
                "grant_type": self.grant_type,
                "source": self.source,
                "timestamp": self.timestamp,
                "signature": self.get_signature(self.grant_type, self.client_id, self.source, self.timestamp),  # build the signature
                "lang": "en",
                "ref_source": "homepage",
                "captcha": '',
                "utm_source": "baidu"
            }
            yield scrapy.FormRequest(
                url=post_url,
                formdata=post_data,
                headers=self.headers,
                callback=self.check_login
            )
        # yield scrapy.Request('https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000),
        #                      headers=self.headers, callback=self.capture, meta={"resp": response})
        # yield scrapy.Request('https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
        #                      headers=self.headers, callback=self.capture, meta={"resp": response}, dont_filter=True)

    def capture(self, response):
        # print(response.body)
        try:
            img = json.loads(response.body)['img_base64']
        except ValueError:
            print('Failed to read img_base64 from the response!')
        else:
            img = img.encode('utf8')
            img_data = base64.b64decode(img)

            with open('zhihu03.gif', 'wb') as f:
                f.write(img_data)
            captcha = input('Please enter the captcha: ')
            post_data = {
                'input_text': captcha
            }
            # Step 3: POST the captcha text back for verification
            yield scrapy.FormRequest(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                formdata=post_data,
                callback=self.captcha_login,
                headers=self.headers
            )

    def captcha_login(self, response):
        try:
            cap_result = json.loads(response.body)['success']
            print(cap_result)
        except ValueError:
            print('The captcha verification POST failed!')
        else:
            if cap_result:
                print('Captcha verified!')
                # Step 4: the actual sign-in request
                post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
                post_data = {
                    "client_id": self.client_id,
                    "username": "***********",  # your Zhihu user name
                    "password": "***********",  # your Zhihu password
                    "grant_type": self.grant_type,
                    "source": self.source,
                    "timestamp": self.timestamp,
                    "signature": self.get_signature(self.grant_type, self.client_id, self.source, self.timestamp),  # build the signature
                    "lang": "en",
                    "ref_source": "homepage",
                    "captcha": '',
                    "utm_source": ""
                }
                headers = self.headers
                headers.update({
                    'Origin': 'https://www.zhihu.com',
                    'Pragma': 'no-cache',
                    'Cache-Control': 'no-cache'
                })
                yield scrapy.FormRequest(
                    url=post_url,
                    formdata=post_data,
                    headers=headers,
                    callback=self.check_login
                )

    def check_login(self, response):
        # Check whether login succeeded
        text_json = json.loads(response.text)
        print(text_json)
        yield scrapy.Request('https://www.zhihu.com/inbox', headers=self.headers)
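Assuming the file sits in the spiders/ directory of an ordinary Scrapy project, the spider is run the usual way; the captcha image is saved as zhihu03.gif next to it so you can read off the letters when prompted:

scrapy crawl zhihu03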
3. After quite a bit of struggling, this simulated login to the new Zhihu finally works. It taught me a lesson: there is still a lot left to learn, so keep at it!
Notes:
Problems that came up during the final debugging:
(1) Captcha ticket (cookie) problem: in settings.py, set
COOKIES_ENABLED = True
(2) After entering the captcha, the final login check came back with a 500 error. I first assumed it was a User-Agent problem, but no amount of tweaking helped; debugging eventually showed that the timestamp value I submitted was wrong.
It has to be truncated to an integer first and only then converted to a string. Be careful, careful, careful when writing code! That one was entirely my own fault.
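For reference, the difference is easy to see (the values in the comments are only illustrative):

import time

wrong = str(time.time() * 1000)        # e.g. '1523758094123.4567', the float repr that caused the 500
right = str(int(time.time() * 1000))   # e.g. '1523758094123', the 13-digit integer string the API expects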
References:
https://zhuanlan.zhihu.com/p/34073256
http://www.bubuko.com/infodetail-2485207.html
https://github.com/zkqiang/Zhihu-Login/blob/master/zhihu_login.py
https://github.com/superdicdi/zhihu_login/blob/master/ZhiHu/spiders/zhihu_login.py