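Background
Target site: http://www.gsxt.gov.cn/index.html
Traffic analysis: requests captured with Charles
Working backward from the result
Start with the request to http://www.gsxt.gov.cn/corp-query-search-1.html
The capture shows that the content on this page is static, so this is the page we have to fetch.
The capture also shows the parameters the request needs: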
tab: ent_tab
province:
geetest_challenge: 10faf845f3f031f4aa0c314d5b593477
geetest_validate: 84cec0edcd71ef8e63faafaf251c840a
geetest_seccode: 84cec0edcd71ef8e63faafaf251c840a|jordan
token: 40390420
searchword: (the search keyword)
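If you search the JS for where these values are generated, you find the following (360 Browser was acting up, so from here on I debug in Chrome):
The screenshot shows where the three geetest parameters are generated. Exciting, right? Just work out that generating function and we are done? Not so simple; read on.
As the screenshot shows, I clicked through to the generating function, and... well.
Let's try another route and look at the other two requests.
The third request:
The response to the third request contains a validate field.
Take that value and compare it with the geetest_validate parameter captured in the first request:
First request parameters:
geetest_validate: 84cec0edcd71ef8e63faafaf251c840a
geetest_seccode: 84cec0edcd71ef8e63faafaf251c840a|jordan
Third request value: 84cec0edcd71ef8e63faafaf251c840a
Conclusion: they are identical.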
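The comparison above can be written down as a quick check. This is only a sketch built from the single capture shown here: the helper name seccode_from_validate is made up for illustration, and the idea that geetest_seccode is always the validate value plus "|jordan" is an assumption drawn from this one sample, not something the site documents.

```python
# Values copied from the captured requests above.
validate_from_third_request = "84cec0edcd71ef8e63faafaf251c840a"
geetest_validate = "84cec0edcd71ef8e63faafaf251c840a"
geetest_seccode = "84cec0edcd71ef8e63faafaf251c840a|jordan"


def seccode_from_validate(validate: str) -> str:
    """Hypothetical helper: assumes seccode is validate plus a fixed '|jordan' suffix."""
    return f"{validate}|jordan"


# The validate value in the third response matches the first request's parameter.
same_validate = validate_from_third_request == geetest_validate
# In this capture the seccode is the same value with '|jordan' appended.
same_seccode = seccode_from_validate(validate_from_third_request) == geetest_seccode
print(same_validate, same_seccode)
```

If the relationship holds in general, only validate has to be recovered and seccode follows from it.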
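One problem shows up here:
The challenge value returned by the SearchItemCaptcha?t=1593853193470 request differs from the geetest_challenge value sent with the corp-query-search-1.html request,
and to obtain the validate value you first have to sort out the geetest_challenge parameter.
Setting the rest aside for now, let's first make the request and get gt, then study geetest_challenge afterwards.
Main part
Goal: obtain the response below
Cookie-based anti-scraping
As mentioned above, we need the response from http://www.gsxt.gov.cn/SearchItemCaptcha?t=1593853193470 in order to get the gt parameter.
First, simulate the request: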
import requests
import time
import re
import execjs


class Business_Information(object):
    keyword = '騰訊'  # search keyword (Tencent)
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    # One shared session so cookies persist; the home page is requested first.
    sess = requests.session()
    sess.get('http://www.gsxt.gov.cn/index.html', headers=headers)

    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)  # current time in milliseconds
        }
        # Fetch the JS code that generates the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        print(cookie_html)

    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()
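Result:
It returned a blob like this, which is clearly not the data we want. So what is it? After two hours of digging, it turns out this code exists to generate more JS: only by running it do you get the JS that builds the cookie, and only by running that second piece do you get the real cookie.
Flow: call the endpoint and receive a pile of JS -> extract the needed JS with a regex -> run it to produce the JS that builds the real cookie -> run the generated JS -> obtain the real cookie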
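The first two steps of the flow are plain string handling and can be tried offline. The sketch below is an illustration under assumptions: wrap_challenge_script is a made-up name, the sample HTML is a toy stand-in for the real response, and the regex tolerates optional whitespace so that both `<script>` and `<script >` match. Running the resulting JS is left to execjs.

```python
import re


def wrap_challenge_script(html: str) -> str:
    """Extract the inline script and turn its eval(...) into a returned string."""
    match = re.search(r"<script\s*>(.*?)</script>", html, re.S)
    if match is None:
        raise ValueError("no <script> block found in the response")
    body = match.group(1)
    # Replace the first 'try{eval' so the generated code is captured, not executed.
    patched = re.sub(r"try\s*\{\s*eval", "try{var result; result=", body, count=1)
    # Wrap everything in a function so a JS runtime can call it by name.
    return "function pre_cookie(){" + patched + "return result}"


# Toy input with the same shape as the real page (not the real payload).
sample = "<script>var y='1+1';while(1)try{eval(y);break}catch(_){}</script>"
wrapped = wrap_challenge_script(sample)
print(wrapped)
```

The wrapped string is what would be handed to the JS runtime in the third step of the flow.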
Analysis:
The JS returned by the endpoint (really an HTML page with JS embedded in it):
<script > var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"), y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}", f = function (x, y) { var a = 
0, b = 0, c = 0; x = x.split(""); y = y || 99; while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c; return c }, z = f(y.match(/\w/g).sort(function (x, y) { return f(x) - f(y) }).pop()); while (z++) try { eval(y.replace(/\b\w+\b/g, function (y) { return x[f(y, z) - 1] || ("_" + y) })); break } catch (_) { } </script>
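To make the script above less opaque: f reads a token as a number whose digits are 0-9, a-z (10 to 35) and A-Z (36 to 61), and every token in y is then replaced by the matching word from x, or by "_" plus the token when that word is empty. Below is a rough Python port of that idea, offered as a sketch rather than a drop-in replacement. The names token_value and expand are made up, only the first candidate base is tried (the original keeps incrementing z until eval succeeds), and tokens containing "_" are not handled.

```python
import re


def token_value(token: str, base: int = 99) -> int:
    """Port of f(x, y): read a token as a number in the given base."""
    value = 0
    for ch in token:
        if "A" <= ch <= "Z":
            digit = ord(ch) - 29  # 'A' -> 36 ... 'Z' -> 61
        else:
            digit = int(ch, 36)   # '0'-'9' -> 0-9, 'a'-'z' -> 10-35
        value = digit + base * value
    return value


def expand(template: str, words: list) -> str:
    """Replace each token in template with words[value - 1], or '_' + token."""
    # The base is one more than the largest single digit that occurs.
    base = max(token_value(ch) for ch in re.findall(r"\w", template, re.ASCII)) + 1

    def substitute(match):
        token = match.group(0)
        index = token_value(token, base) - 1
        word = words[index] if 0 <= index < len(words) else ""
        return word or "_" + token

    return re.sub(r"\b\w+\b", substitute, template, flags=re.ASCII)


# The first 20 entries of the x array from the response above.
words = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document".split("@")
decoded = expand("3 f=k.6", words)
print(decoded)  # var _f=document.location
```

The same substitution is what turns `k.4k=` in y into `document.cookie=` once the full x array is used.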
Next, modify it so that the generated code is returned instead of evaluated:
function pre_cookie() { // wrap everything in a function
    var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"),
        y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}",
        f = function (x, y) { var a = 0, b = 0, c = 0; x = x.split(""); y = y || 99; while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c; return c },
        z = f(y.match(/\w/g).sort(function (x, y) { return f(x) - f(y) }).pop());
    while (z++) try {
        var result; // variable that holds the generated code
        result = (y.replace(/\b\w+\b/g, function (y) { // assign the string instead of eval-ing it
            return x[f(y, z) - 1] || ("_" + y)
        }));
        break
    } catch (_) { }
    return result // return the generated code
}
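Then call it with the execjs module:
Result: the output below contains the JS we need for generating the real cookie. Note: when you run this repeatedly, the JS it returns is occasionally malformed (not often; try it if you are curious), so the next step needs a check for that case.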
var _f = function () { setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500); document.cookie = '__jsl_clearance=1593854493.002|0|' + (function () { var _4i = [function (_f) { return _f }, function (_4i) { return _4i }, function (_f) { return eval('String.fromCharCode(' + _f + ')') }, function (_f) { for (var _4i = 0; _4i < _f.length; _4i++) { _f[_4i] = parseInt(_f[_4i]).toString(36) } ; return _f.join('') }], _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D']; for (var _49 = 0; _49 < _f.length; _49++) { _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49]) } ; return _f.join('') })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;' }; if ((function () { try { return !!window.addEventListener; } catch (e) { return false; } })()) { document.addEventListener('DOMContentLoaded', _f, false) } else { document.attachEvent('onreadystatechange', _f) }
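Because the generated JS is occasionally unusable, the check can be paired with a small retry loop. The following is a sketch with made-up names (fetch_until_valid, looks_usable). The marker string "href(){setTimeout" is the one tested for in the full source later in this post, and the fake responses stand in for real calls to the endpoint.

```python
from typing import Callable


def fetch_until_valid(fetch: Callable[[], str],
                      is_valid: Callable[[str], bool],
                      attempts: int = 3) -> str:
    """Call fetch() until is_valid accepts the result or attempts run out."""
    for _ in range(attempts):
        candidate = fetch()
        if is_valid(candidate):
            return candidate
    raise RuntimeError(f"no usable script after {attempts} attempts")


def looks_usable(js: str) -> bool:
    # Assumption: the bad variant is recognisable by this marker string.
    return "href(){setTimeout" not in js


# Fake responses: the first is the bad variant, the second is acceptable.
responses = iter([
    "var _href=function href(){setTimeout('...',1500)}",
    "var _f=function(){document.cookie='__jsl_clearance=...'}",
])
script = fetch_until_valid(lambda: next(responses), looks_usable)
print(script)
```

In real use, fetch would wrap the request plus the first execjs call, so that each retry starts from a fresh response.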
Pull out the cookie-building part and rewrite it:
function generate_cookie_js() { // wrap it in a function
    var cookie = '__jsl_clearance=1593854493.002|0|' + (function () { var _4i = [function (_f) { return _f }, function (_4i) { return _4i }, function (_f) { return eval('String.fromCharCode(' + _f + ')') }, function (_f) { for (var _4i = 0; _4i < _f.length; _4i++) { _f[_4i] = parseInt(_f[_4i]).toString(36) } ; return _f.join('') }], _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D']; for (var _49 = 0; _49 < _f.length; _49++) { _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49]) } ; return _f.join('') })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;';
    return cookie // return the cookie string
};
Call it with the execjs module:
That gives the result: the cookie value the anti-scraping check expects the client to send.
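The string produced by the JS has the shape name=value;Expires=...;Path=/;, as the rewritten function above shows. A small sketch of splitting it into name and value follows. The function name split_cookie_string is made up, and the token after "|0|" in the sample is a placeholder, not a real clearance value.

```python
def split_cookie_string(cookie_string: str) -> tuple:
    """Return (name, value) from a 'name=value;Attr=...;' style string."""
    pair, _, _attributes = cookie_string.partition(";")  # drop Expires/Path
    name, _, value = pair.partition("=")
    return name.strip(), value.strip()


# Placeholder sample in the same format as the generated cookie string.
sample = "__jsl_clearance=1593854493.002|0|EXAMPLE%3D;Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;"
name, value = split_cookie_string(sample)
print(name, value)
```

The name and value can then be handed to the session's cookie jar, for example with sess.cookies.set(name, value).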
Here is the full source for getting past the cookie check:
import requests
import time
import re
import execjs


class Business_Information(object):
    keyword = '騰訊'  # search keyword (Tencent)
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    # One shared session so cookies persist; the home page is requested first.
    sess = requests.session()
    sess.get('http://www.gsxt.gov.cn/index.html', headers=headers)

    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)  # current time in milliseconds
        }
        # Fetch the page that carries the cookie-generating JS
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        # Extract the JS part from the returned HTML
        cookie_js = re.findall("<script>(.*?)</script>", cookie_html, re.S)[0]
        # Build the JS to call: wrap it in a function and capture the string instead of eval-ing it
        edit_js = "function pre_cookie(){" + cookie_js.replace('try{eval', 'try{var result; result=') + "return result}"
        # First JS call: compile the wrapped code
        first_js = execjs.compile(edit_js)
        # Produce the second-stage JS (it changes on every request)
        generate_cookie_js_all = first_js.call("pre_cookie")
        # print(generate_cookie_js_all)
        # Occasionally a malformed variant comes back; stop so the request can be repeated
        if "href(){setTimeout" in generate_cookie_js_all:
            raise Exception('The returned JS is malformed; please fetch it again.')
        # Extract the part that actually builds the cookie
        generate_cookie_js = re.findall(r'document\.(cookie=.*?if)', generate_cookie_js_all)[0]
        generate_cookie_js = "window = {};var get_cookie = function () {" + generate_cookie_js.replace("};if", ";return cookie};")
        # Second JS call: generate the real cookie
        second_js = execjs.compile(generate_cookie_js)
        # Get the real cookie
        cookie = second_js.call('get_cookie')
        print(cookie)
        cookie = cookie.split("__jsl_clearance=")[-1]
        self.sess.cookies.set("__jsl_clearance", cookie)
        # Repeat the request, now with the clearance cookie set
        json_data = self.sess.get(url, params=params, headers=self.headers).json()
        print(json_data)

    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()
Output:
Note: this site's anti-scraping measures have since been updated.