Cookie-based anti-scraping


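Background

Target site: http://www.gsxt.gov.cn/index.html

Traffic analysis: requests captured with Charles

Start from the result and trace it back to its source.

First, look at the request to http://www.gsxt.gov.cn/corp-query-search-1.html

The capture shows that this page returns its content as static HTML, so this is the page we have to fetch.

The capture also shows the parameters it requires: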

tab:ent_tab
province:
geetest_challenge:10faf845f3f031f4aa0c314d5b593477
geetest_validate:84cec0edcd71ef8e63faafaf251c840a
geetest_seccode:84cec0edcd71ef8e63faafaf251c840a|jordan
token:40390420
searchword: (the search keyword)

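Searching the site's JS for where these parameters are generated turns up the following (the 360 browser was misbehaving, so from here on I debug in Chrome):

The search leads to the place where the three geetest parameters are generated. Exciting, right? Reverse that one function and the job is done? It is not that simple. Read on.

I clicked through to the generating function, and was left speechless.

So take another route and look at the other two requests.

The third request:

The response to the third request contains a field named validate.

Take that value and compare it with the geetest_validate parameter captured in the first request: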

Parameters of the first request:
geetest_validate:84cec0edcd71ef8e63faafaf251c840a
geetest_seccode:84cec0edcd71ef8e63faafaf251c840a|jordan
validate in the third response:84cec0edcd71ef8e63faafaf251c840a

Conclusion: they are identical.
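The captured values suggest that geetest_seccode is simply the validate string with `|jordan` appended. The following is a minimal sketch of assembling the search parameters on that assumption. The helper name `build_search_params` is hypothetical, and the relationship is inferred from this single capture only.

```python
def build_search_params(challenge, validate, token, searchword):
    """Assemble the query parameters seen in the captured search request."""
    return {
        "tab": "ent_tab",
        "province": "",
        "geetest_challenge": challenge,
        "geetest_validate": validate,
        # Assumption from the capture: seccode is validate + "|jordan".
        "geetest_seccode": validate + "|jordan",
        "token": token,
        "searchword": searchword,
    }


params = build_search_params(
    challenge="10faf845f3f031f4aa0c314d5b593477",
    validate="84cec0edcd71ef8e63faafaf251c840a",
    token="40390420",
    searchword="騰訊",
)
print(params["geetest_seccode"])
```

Whether the server accepts a seccode built this way for a fresh validate value is not something this capture can confirm.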

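One problem shows up here:

The challenge value returned by the SearchItemCaptcha?t=1593853193470 request is different from the geetest_challenge value carried by the corp-query-search-1.html request.

And to get the validate value, the geetest_challenge parameter has to be dealt with first.

Leave the rest aside for now. First make the request and get gt, then come back to the geetest_challenge parameter later.

Main part

Goal: obtain the response below

Cookie anti-scraping

As mentioned above, we need the response from http://www.gsxt.gov.cn/SearchItemCaptcha?t=1593853193470 in order to get the gt parameter.

First, simulate the request: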

import requests
import time
import re
import execjs

class Business_Information(object):
    keyword = '騰訊'
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
           }
    sess = requests.session()
    sess.get('http://www.gsxt.gov.cn/index.html',headers=headers)
    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)
        }
        # Fetch the page that contains the JS used to generate the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        print(cookie_html)

    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()

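The result:

The server returns a blob like the one shown further down, which is clearly not the data we want. So what is it? After two hours of digging, it turns out that this code generates more JS. Only by running it do you get the JS that generates the cookie, and only by running that second piece of JS do you get the real cookie.

Flow: call the endpoint and get a blob of JS -> extract the JS we need with a regex -> run it to produce the JS that generates the real cookie -> run the generated JS -> get the real cookie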
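A scraper following this flow needs some way to tell the challenge page apart from the real JSON response. The sketch below is one hedged way to do that by inspecting the body. The function name `classify_response` and the sample bodies are made up for illustration, and the check relies only on what the article shows: the challenge page is a script block that mentions `__jsl_clearance`.

```python
import json


def classify_response(text):
    """Return "json" for real data, "js_challenge" for the script page, else "unknown"."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    # The challenge page is a <script> block that mentions the cookie name.
    if "<script" in text and "__jsl_clearance" in text:
        return "js_challenge"
    return "unknown"


print(classify_response('{"gt": "made-up-value"}'))
print(classify_response('<script >var x = "@__jsl_clearance@"</script>'))
```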

Analysis:

The JS returned by the endpoint (really just an HTML page with JS embedded in it):

<script >
var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"),
    y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}",
    f = function (x, y) {
        var a = 0, b = 0, c = 0;
        x = x.split("");
        y = y || 99;
        while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
        return c
    }, z = f(y.match(/\w/g).sort(function (x, y) {
        return f(x) - f(y)
    }).pop());
while (z++) try {
    eval(y.replace(/\b\w+\b/g, function (y) {
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch (_) {
}
</script>        
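Reading the script above, x is a word list joined with "@", and y is a template in which every token is a number that indexes into that list. Each character is a digit (0-9 and a-z count as 0 to 35, A-Z as 36 to 61), and the base is one more than the largest digit that occurs in y. The following Python port is a sketch of that reading, not the site's own code. The names `digit_value`, `token_value` and `unpack` are hypothetical, it assumes tokens contain no underscore, and unlike the JS it tries only the first base rather than retrying when eval fails.

```python
import re


def digit_value(ch):
    """Value of one digit: 0-9 and a-z map to 0..35, A-Z map to 36..61."""
    if "A" <= ch <= "Z":
        return ord(ch) - 29
    return int(ch, 36)


def token_value(token, base):
    """Read a token as a number in the given base (port of the JS function f)."""
    value = 0
    for ch in token:
        value = digit_value(ch) + base * value
    return value


def unpack(dictionary, template):
    """Replace every alphanumeric token in template with its dictionary word."""
    words = dictionary.rstrip("@").split("@")
    # The JS starts from z = largest digit and `while (z++)` bumps it once
    # before the first attempt, so the first base tried is largest digit + 1.
    digits = re.findall(r"[0-9A-Za-z]", template)
    base = max(digit_value(ch) for ch in digits) + 1

    def substitute(match):
        token = match.group(0)
        index = token_value(token, base) - 1
        if 0 <= index < len(words) and words[index]:
            return words[index]
        # Empty or missing entry: the JS keeps the token as "_" + token.
        return "_" + token

    return re.sub(r"\b[0-9A-Za-z]+\b", substitute, template, flags=re.ASCII)


# Toy input: largest digit is 4, so tokens are read in base 5.
print(unpack("var@@document@cookie@name", "1 2=3.4+10"))
```

With the real page, the largest digit in y appears to be `k` (20), so tokens would be read in base 21, which maps `3` to `var` and `k` to `document`.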

Now modify it:

function pre_cookie() { // wrap everything in a function
var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"),
    y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}",
    f = function (x, y) {
        var a = 0, b = 0, c = 0;
        x = x.split("");
        y = y || 99;
        while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
        return c
    }, z = f(y.match(/\w/g).sort(function (x, y) {
        return f(x) - f(y)
    }).pop());
while (z++) try {
    var result;  // variable that will hold the decoded code
    result = (y.replace(/\b\w+\b/g, function (y) { // assign the decoded code instead of eval-ing it
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch (_) {
}
return result  // return the decoded code
}

Next, call it with the execjs module:
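The call itself was shown as a screenshot in the original post. A minimal sketch of what it might look like follows, assuming the PyExecJS package (imported as `execjs`) and a JavaScript runtime such as Node.js are installed. The helper names `wrap_challenge_script` and `run_pre_cookie` are hypothetical.

```python
import re


def wrap_challenge_script(html):
    """Turn the challenge page into a pre_cookie() function that returns the decoded JS."""
    # Allow attributes or stray spaces inside the opening tag, e.g. "<script >".
    body = re.findall(r"<script[^>]*>(.*?)</script>", html, re.S)[0]
    # Swap eval(...) for an assignment so the decoded code is returned, not executed.
    body = re.sub(r"\beval\s*\(", "var result; result = (", body, count=1)
    return "function pre_cookie() {" + body + "\nreturn result }"


def run_pre_cookie(html):
    """Execute the wrapped script; needs PyExecJS and a JavaScript runtime."""
    import execjs  # imported lazily so the helper above works without it

    return execjs.compile(wrap_challenge_script(html)).call("pre_cookie")


# Shortened stand-in for the real page, just to show the rewrite.
print(wrap_challenge_script("<script >while (z++) try { eval(y); break } catch (_) {}</script>"))
```

Calling `run_pre_cookie(cookie_html)` with the text fetched earlier would then return the second-stage JS as a string.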

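Result: the document.cookie assignment (set in bold in the original post) is the JS we need to generate the real cookie. Note: run this many times and it will occasionally return broken JS. It does not happen often (try it if you are curious), but the next step needs a check for it. A likely cause is that the original loop relied on eval throwing an error to move on to the next z, whereas a plain assignment never throws, so the loop always stops at the first z.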

var _f = function () {
    setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500);
    document.cookie = '__jsl_clearance=1593854493.002|0|' + (function () { var _4i = [function (_f) { return _f }, function (_4i) { return _4i }, function (_f) { return eval('String.fromCharCode(' + _f + ')') }, function (_f) { for (var _4i = 0; _4i < _f.length; _4i++) { _f[_4i] = parseInt(_f[_4i]).toString(36) } ; return _f.join('') }], _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D']; for (var _49 = 0; _49 < _f.length; _49++) { _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49]) } ; return _f.join('') })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;' }; if ((function () {
    try {
        return !!window.addEventListener;
    } catch (e) {
        return false;
    }
})()) {
    document.addEventListener('DOMContentLoaded', _f, false)
} else {
    document.attachEvent('onreadystatechange', _f)
}
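Since the decoded script is occasionally broken, a check plus a retry is one way to handle it. The sketch below reuses the marker that the full listing later in this post tests for, `href(){setTimeout`. The function names and the `fetch_and_decode` callback are hypothetical stand-ins for "request the page again and decode it".

```python
def looks_decoded(js_text):
    """Heuristic check that the decoded script is the cookie-setting code."""
    # The article treats output containing this fragment as a bad decode.
    if "href(){setTimeout" in js_text:
        return False
    return "document.cookie" in js_text


def decode_with_retry(fetch_and_decode, attempts=3):
    """Call fetch_and_decode() until its result passes looks_decoded()."""
    for _ in range(attempts):
        js_text = fetch_and_decode()
        if looks_decoded(js_text):
            return js_text
    raise RuntimeError("no usable cookie script after %d attempts" % attempts)


# Fake decoder for demonstration: the first result is bad, the second is usable.
fake_results = iter([
    "var _f=href(){setTimeout('x')}",
    "var _f=function(){document.cookie='a=b'}",
])
print(decode_with_retry(lambda: next(fake_results)))
```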

Extract it and rewrite it:

function generate_cookie_js() { // wrap it in a function
    var cookie = '__jsl_clearance=1593854493.002|0|' + (function () {
        var _4i = [function (_f) {
                return _f
            }, function (_4i) {
                return _4i
            }, function (_f) {
                return eval('String.fromCharCode(' + _f + ')')
            }, function (_f) {
                for (var _4i = 0; _4i < _f.length; _4i++) {
                    _f[_4i] = parseInt(_f[_4i]).toString(36)
                }
                ;
                return _f.join('')
            }],
            _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D'];
        for (var _49 = 0; _49 < _f.length; _49++) {
            _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49])
        }
        ;
        return _f.join('')
    })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;'
return cookie // return the cookie string
};

Call it with the execjs module:

That gives us the result: the cookie value that the anti-scraping check expects the request to carry.
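The string returned by the JS is a full Set-Cookie style value, not just the cookie value. The sketch below splits it apart. The helper name `parse_clearance_cookie` is hypothetical, and "EXAMPLE" stands in for the computed suffix, which is not shown in the article. It also works on the assumption that the leading number is a Unix timestamp, in which case the Expires attribute in the sample sits exactly one hour after it.

```python
from datetime import datetime, timezone


def parse_clearance_cookie(cookie_string):
    """Split 'name=value;Expires=...;Path=/;' into name, value and attributes."""
    first, *rest = [part for part in cookie_string.split(";") if part]
    name, _, value = first.partition("=")
    attributes = dict(part.partition("=")[::2] for part in rest)
    return name, value, attributes


# "EXAMPLE" is a placeholder for the part the obfuscated function computes.
sample = (
    "__jsl_clearance=1593854493.002|0|EXAMPLE;"
    "Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;"
)
name, value, attributes = parse_clearance_cookie(sample)

# Assumption: the number before the first dot is the issue time in seconds.
issued = int(value.split(".")[0])
expires = datetime.strptime(attributes["Expires"], "%a, %d-%b-%y %H:%M:%S GMT")
lifetime = expires.replace(tzinfo=timezone.utc).timestamp() - issued
print(name, value, lifetime)
```

Only the value part belongs in the session's cookie jar. The Expires and Path attributes are instructions to a browser, not part of the value.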

Here is the full source code for the cookie anti-scraping step:

import requests
import time
import re
import execjs

class Business_Information(object):
    keyword = '騰訊'
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
           }
    sess = requests.session()
    sess.get('http://www.gsxt.gov.cn/index.html',headers=headers)
    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)
        }
        # Fetch the challenge page; it contains the JS that generates the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        # Extract the JS from the returned HTML (the tag may appear as "<script >")
        cookie_js = re.findall(r"<script[^>]*>(.*?)</script>", cookie_html, re.S)[0]
        # Wrap it in a function that returns the decoded code instead of eval-ing it
        edit_js = "function pre_cookie(){" + cookie_js.replace('try{eval', 'try{var result; result=') + "return result}"
        # First JS call: compile the wrapper
        first_js = execjs.compile(edit_js)
        # Run it to get the second-stage JS (it changes on every request)
        generate_cookie_js_all = first_js.call("pre_cookie")
        # print(generate_cookie_js_all)
        # Occasionally the decode goes wrong; bail out so the caller can fetch again
        if "href(){setTimeout" in generate_cookie_js_all:
            raise Exception('The decoded JS is malformed; fetch it again.')
        # Pull out the expression that actually builds the cookie
        generate_cookie_js = re.findall(r'document\.(cookie=.*?if)', generate_cookie_js_all)[0]
        generate_cookie_js = "window = {};var get_cookie = function () {" + generate_cookie_js.replace("};if", ";return cookie};")
        # Second JS call: produce the real cookie
        second_js = execjs.compile(generate_cookie_js)
        cookie = second_js.call('get_cookie')
        print(cookie)
        # Keep only the value: drop the name and the Expires/Path attributes
        cookie = cookie.split("__jsl_clearance=")[-1].split(";")[0]
        self.sess.cookies.set("__jsl_clearance", cookie)
        json_data = self.sess.get(url, params=params, headers=self.headers).json()
        print(json_data)


    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()

Output:

Note: this site's anti-scraping mechanism has since been updated.

