Crawler 521 error (another round of me vs. the lovely front end)


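Background:

  Today I suddenly felt like refactoring my proxy pool and adding more proxies to it, so I set out to scrape some proxy IPs. That is how the following story began.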

 

I started by firing off a request straight away:

import requests


def get_xxdaili(url):
    # Browser-like headers for www.66ip.cn
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        "Host": 'www.66ip.cn',
        "Referer": 'http://www.66ip.cn/index.html',
        "Upgrade-Insecure-Requests": '1',
    }
    res = requests.get(url=url, headers=headers)
    return res

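Then, without even glancing at the status code, I went straight to XPath extraction. A moment later: ?????? Why is the result empty? I opened the page source in the browser and everything was there. Maybe my XPath was wrong, so I tweaked it; still nothing. This site was starting to feel sneaky: why couldn't I extract anything? I went through the page source and the Network panel again, then finally looked at the returned status code: 521. After some searching, this is the explanation I found for that code:

  A 521 error occurs because the origin server refuses connections from Cloudflare. More specifically, Cloudflare tries to connect to your origin server on port 80 or 443 but receives a connection-refused error.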

 

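The response here, though, comes with a body containing a script, so it is not a plain connection failure. I noticed the cookie parameters looked suspicious, so I guessed the cookie was the problem (I had never run into 521 before, so at first I had no idea where to look). After collecting some material online, it turned out the cookie is generated by obfuscated JS. From there the plan was clear: analyze the JS and obtain the value it computes. So I grabbed the body of the 521 response: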

<script>var x="27@@attachEvent@var@substr@May@function@@0@href@while@@Array@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@3@@f@fromCharCode@GMT@@return@@pathname@try@hvw@BxG@innerHTML@@Y@@@else@e@onreadystatechange@@a@toLowerCase@20@toString@@@@@new@@593@@@@charAt@1@reverse@Expires@@headless@D@@eval@gYw@1558952840@firstChild@@charCodeAt@8@parseInt@@@for@@window@if@join@false@11@String@@@search@@h@captcha@@0xFF@RegExp@DOMContentLoaded@https@B@replace@@JgSe0upZ@length@split@div@@addEventListener@__jsl_clearance@@1500@d@@@@challenge@@@catch@Mon@@36@@chars@0xEDB88320@match@createElement@@Path@setTimeout@document@@mX@@g@@19@cookie@@@location".replace(/@*$/,"").split("@"),y="4 49=7(){4l('58.a=58.n+58.36.3g(/[\\?|&]39-47/,\\'\\')',42);4m.55='40=2c.1m|9|'+(7(){4 49=[[(+!~~[])]+[-~{}-~{}],[f+f],[(+!~~[])]+((-~-~[])*[-~-~[]]+[]+[[]][9]),[(+!~~[])]+[-~{}+(-~-~[]<<-~!{})],[(+!~~[])]+[~~{}],[(+!~~[])]+[-~[]+(-~!{}+[(-~~~{}<<-~~~{})])/[(-~~~{}<<-~~~{})]],[(+!~~[])]+((-~-~[]^(+!~~[]))+[[]][9]),[~~{}],((-~-~[]^(+!~~[]))+[[]][9]),[-~[]+(-~!{}+[(-~~~{}<<-~~~{})])/[(-~~~{}<<-~~~{})]],[(+!~~[])]+[(+!~~[])],[-~{}-~{}],[(+!~~[])]+[f+f],[(-~-~[]^(+!~~[]))+(-~-~[]^(+!~~[]))+(-~-~[]^(+!~~[]))],[(+!~~[])],((-~-~[])*[-~-~[]]+[]+[[]][9]),[-~{}+(-~-~[]<<-~!{})],[(+!~~[])]+[(-~{}<<(-~-~[]^(+!~~[])))],[(-~{}<<(-~-~[]^(+!~~[])))]],c=d(49.3j);2k(4 45=9;45<49.3j;45++){c[49[45]]=['%','15','3f',[-~{}+(-~-~[]<<-~!{})]+[{}+[[]][9]][9].22(2g),'50',((-~-~[]^(+!~~[]))+[[]][9]),[-~{}-~{}],'2b',[-~{}-~{}],(+[~~[], ~~[]]+[]).22((+[]))+[~~{}]+([-~{}-~{}]/(+![])+[]+[[]][9]).22(-~-~[]+(-~-~[])*[-~-~[]])+[!{}+[]+[[]][9]][9].22(-~{}-~{}),(2m.27+[]).22(-~((-~-~[]^(+!~~[])))-~((-~-~[]^(+!~~[])))),'%','11%',((f)/(+[])+[]).22(~~{})+({}+[]+[]).22([(+!~~[])]+[~~{}]),[-~[]+(-~!{}+[(-~~~{}<<-~~~{})])/[(-~~~{}<<-~~~{})]]+[(+!~~[])],'12',[!-{}+[]+[]][9].22((-~~~{}<<-~~~{}))+[-~{}+(-~-~[]<<-~!{})],'28','38'][45]};l c.30('')})()+';25=4b, 1-6-54 32:1:1e j;4k=/;'};2n((7(){10{l !!2m.3n;}4a(19){l 
31;}})()){4m.3n('3d',49,31)}18{4m.3('1a',49)}",f=function(x,y){var a=0,b=0,c=0;x=x.split("");y=y||99;while((a=x.shift())&&(b=a.charCodeAt(0)-77.5))c=(Math.abs(b)<13?(b+48.5):parseInt(a,36))+y*c;return c},z=f(y.match(/\w/g).sort(function(x,y){return f(x)-f(y)}).pop());while(z++)try{eval(y.replace(/\b\w+\b/g, function(y){return x[f(y,z)-1]||("_"+y)}));break}catch(_){}</script>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
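
The step I skipped at first, checking the status code before parsing, can be sketched roughly as follows. This is a minimal illustration, not the site's documented behavior: the helper name extract_challenge is made up here, and it assumes the challenge always arrives as one inline <script> element like the one above.

```python
import re

# Assumption: the 521 body contains exactly one inline <script> element.
# re.S lets the match span line breaks inside the script.
SCRIPT_RE = re.compile(r"<script>(.*?)</script>", re.S)


def extract_challenge(status_code, body):
    """Return the inline challenge script, or None if this is not a 521 challenge."""
    if status_code != 521:
        return None
    match = SCRIPT_RE.search(body)
    return match.group(1).strip() if match else None
```

With requests this would be called as extract_challenge(res.status_code, res.text); a None result means the page can be parsed directly.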
                                                                                                                       

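After formatting the JS (this sample comes from a different response, so the values differ from the raw one above):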

<script>var x = "@@DOMContentLoaded@P@length@location@new@a@S@RegExp@19@document@0@@pathname@@@@replace@0xEDB88320@@8@reverse@@36@try@GMT@Expires@@@cookie@@27@String@vwpBT@while@chars@parseInt@BpU@charCodeAt@@@return@f@addEventListener@Path@@@@@captcha@div@@href@@D@@function@0xFF@if@substr@@1@@attachEvent@JgSe0upZ@join@firstChild@@w@@@@Array@search@@@Mon@else@BqA@13@@@fromCharCode@onreadystatechange@@@@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@charAt@09@e@May@@@1500@match@window@25@https@var@__jsl_clearance@@1558945513@@toLowerCase@@3@false@@053@d@g@WpLU@for@5@toString@@eval@catch@innerHTML@createElement@split@2@@setTimeout@challenge".replace(/@*$/, "").split("@"),
    y = "402 241=213(){1002('11.204=11.30+11.300.34(/[\\?|&]201-1003/,\\'\\')',342);22.111='403=410.422|23|'+(213(){402 241=[((+!+{})-~(+!+{})-~[-~(+[])+(-~[]<<-~(+!+{}))]+[]+[]),(-~[]+[[]][23])+([-~[]-~[]]*((-~{}+[-~(+!+{})]>>-~(+!+{})))+[]+[]),((-~(+[])|1000)+[]),[((-~![]<<-~![])<<(-~![]<<-~![]))],(-~[]+[[]][23])+((-~(+[])|1000)+[]),(-~[]+[[]][23]),(-~(+!+{})+[[]][23]),(-~[]+[[]][23])+(-~[]+[[]][23]),(-~[]+[[]][23])+((-~[]<<-~(+!+{}))+[[]][23]),[432],[414+(-~![]<<-~![])+(-~![]<<-~![])],(-~[]+[[]][23])+[432],(-~[]+[[]][23])+[~~{}],(-~[]+[[]][23])+(-~(+!+{})+[[]][23]),((-~[]<<-~(+!+{}))+[[]][23]),[~~{}],([-~[]-~[]]*((-~{}+[-~(+!+{})]>>-~(+!+{})))+[]+[])],31=244(241.10);431(402 340=23;340<241.10;340++){31[241[340]]=[[{}+[[]][23]][23].331(([-~(+!+{})]+~~[]>>-~(+!+{}))),'211',((+!+{})-~(+!+{})-~[-~(+[])+(-~[]<<-~(+!+{}))]+[]+[]),'424',[!''+[[]][23]][23].331(-~[]-~[])+({}+[]+[[]][23]).331((1000^-~(+[]))),(-~[]+[[]][23]),'14',([-~[]-~[]]*((-~{}+[-~(+!+{})]>>-~(+!+{})))+[]+[]),'310%',[!/!/+[]][23].331((1000^-~(+[]))),[414+(-~![]<<-~![])+(-~![]<<-~![])],((-~(+[])|1000)+[]),'4','430','120','124','240'][340]};133 31.232('')})()+';103=303, 113-334-21 332:400:311 102;141=/;'};220((213(){101{133 !!344.140;}441(333){133 420;}})()){22.140('3',241,420)}304{22.230('320',241)}",
    f = function (x, y) {
        var a = 0, b = 0, c = 0;
        x = x.split("");
        y = y || 99;
        while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
        return c
    }, z = f(y.match(/\w/g).sort(function (x, y) {
        return f(x) - f(y)
    }).pop());
while (z++) try {
    eval(y.replace(/\b\w+\b/g, function (y) {
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch (_) {
}</script>
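
To make the loader at the bottom of the script less mysterious, here is a rough Python port of its f function and of the token substitution done inside eval. It is a sketch under assumptions: the names digit_value and decode are mine, tokens are assumed to contain only 0-9, a-z and A-Z, and the word list below is just the first 31 entries of x from the formatted sample. The tokens in that sample's y appear to use only the digits 0-4, so the first base the while loop tries is 5.

```python
import re


def digit_value(ch):
    """Value of one token character: 0-9 -> 0..9, a-z -> 10..35, A-Z -> 36..61."""
    if "A" <= ch <= "Z":
        return ord(ch) - 29      # same as (charCode - 77.5) + 48.5 in the JS
    return int(ch, 36)           # same as parseInt(ch, 36) for 0-9 and a-z


def f(token, base=99):
    """Port of the script's f(x, y): read the token as a number in the given base."""
    value = 0
    for ch in token:
        value = digit_value(ch) + base * value
    return value


def decode(packed, words, base):
    """Replace each token with words[f(token) - 1], or '_' + token if that slot is empty."""
    def lookup(match):
        token = match.group(0)
        index = f(token, base) - 1
        word = words[index] if 0 <= index < len(words) else ""
        return word or "_" + token
    return re.sub(r"\b[0-9A-Za-z]+\b", lookup, packed)


# First 31 entries of x from the formatted sample.
words = ("@@DOMContentLoaded@P@length@location@new@a@S@RegExp@19@document@0"
         "@@pathname@@@@replace@0xEDB88320@@8@reverse@@36@try@GMT@Expires"
         "@@@cookie").split("@")

print(decode("22.111", words, 5))   # the "22.111=" fragment of y
print(decode("11.30", words, 5))    # the "11.30" fragment of y
```

The two fragments decode to document.cookie and location.pathname, which matches the unpacked code shown further down.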

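After consulting references and poking at it myself, I found the key spot: the eval call at the end of the script.

So I replaced eval with console.log.

After tidying up the output I got the following (the original screenshot and the JS below come from different responses, so the names differ, but they are essentially the same):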

var _3a = function () {
    setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500);
    document.cookie = '__jsl_clearance=1558947273.79|0|' + (function () {
        var _3a = [((-~(+[]) | 2) + []), ((-~[] << -~(+!+{})) + [[]][0]), (-~[] + [[]][0]) + (-~[] + [[]][0]), (-~[] + [[]][0]) + [~~{}], (-~(+!+{}) + [[]][0]) + (-~[] + [[]][0]), (-~[] + [[]][0]) + ((-~(+[]) | 2) + []), (-~(+!+{}) + [[]][0]) + [~~{}], (-~[] + [[]][0]) + ([-~[] - ~[]] * ((-~{} + [-~(+!+{})] >> -~(+!+{}))) + [] + []), (-~[] + [[]][0]) + [3 + (-~![] << -~![]) + (-~![] << -~![])], (-~[] + [[]][0]) + [5], [5], ((+!+{}) - ~(+!+{}) - ~[-~(+[]) + (-~[] << -~(+!+{}))] + [] + []), ([-~[] - ~[]] * ((-~{} + [-~(+!+{})] >> -~(+!+{}))) + [] + []), [3 + (-~![] << -~![]) + (-~![] << -~![])], (-~[] + [[]][0]) + [((-~![] << -~![]) << (-~![] << -~![]))], [~~{}], [((-~![] << -~![]) << (-~![] << -~![]))], (-~[] + [[]][0]) + ((-~[] << -~(+!+{})) + [[]][0]), (-~[] + [[]][0]) + (-~(+!+{}) + [[]][0]), (-~(+!+{}) + [[]][0]), (-~[] + [[]][0]), (-~[] + [[]][0]) + ((+!+{}) - ~(+!+{}) - ~[-~(+[]) + (-~[] << -~(+!+{}))] + [] + [])],
            _4h = Array(_3a.length);
        for (var _28 = 0; _28 < _3a.length; _28++) {
            _4h[_3a[_28]] = ['YM%', (-~(+!+{}) + [[]][0]), 'xG', [5] + [{} + [] + []][0].charAt(-~[] - ~[]) + ([-~[] - ~[]] * ((-~{} + [-~(+!+{})] >> -~(+!+{}))) + [] + []), 'D', '%', ((-~(+[]) | 2) + []), [window['callP' + 'hantom'] + [] + [[]][0]][0].charAt((-~![] << -~![])) + (!![[]][1] + [] + []).charAt((+!+{})), 'T', 'B', 'B', 'BP', ([-~[] - ~[]] * ((-~{} + [-~(+!+{})] >> -~(+!+{}))) + [] + []) + [5], '%', (!![[]][1] + [] + []).charAt((+!+{})), [!+{} + []][0].charAt(~~'') + [(+!+{}) / ~~'' + [] + []][0].charAt(([-~(+!+{})] + ~~[] >> -~(+!+{}))) + ((-~[] << -~(+!+{})) + [[]][0]) + [!/!/ + [[]][0]][0].charAt(-~[] - ~[]) + [{} + [] + []][0].charAt(-~[] - ~[]), (-~(+!+{}) + [[]][0]), (-~(+!+{}) + [[]][0]), (!+{} + []).charAt(-~![]), ({} + [] + [[]][0]).charAt((2 ^ -~(+[]))) + [(+!+{}) / ~~'' + [] + []][0].charAt((-~![] << -~![])), 'K', 'k%'][_28]
        }
        ;
        return _4h.join('')
    })() + ';Expires=Mon, 27-May-19 09:54:33 GMT;Path=/;'
};
if ((function () {
    try {
        return !!window.addEventListener;
    } catch (e) {
        return false;
    }
})()) {
    document.addEventListener('DOMContentLoaded', _3a, false)
} else {
    document.attachEvent('onreadystatechange', _3a)
}

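As the code above shows, after issuing the first cookie the site computes a second, obfuscated cookie value in JS. So evaluating the expression assigned to document.cookie gives us the cookie we want:

__jsl_clearance=1558954345.795|0|V%2Bp1UYNNA%2Fc4wboCF4SQoA%2Fy9j0%3D;Expires=Mon, 27-May-19 11:52:25 GMT;Path=/;

This is the value we are after. Take the name=value part before the first semicolon, join it with the cookie returned by the first (521) response, and the result is the Cookie header we need.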
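
The concatenation step can be sketched like this. The function name build_cookie_header and the cookie name first_cookie are placeholders for illustration; the real name of the first cookie is whatever the 521 response sets.

```python
def build_cookie_header(first_cookies, document_cookie):
    """Join the cookies from the 521 response with the JS-computed clearance cookie."""
    # Only the leading name=value pair is sent back; Expires and Path are attributes.
    clearance = document_cookie.split(";", 1)[0].strip()
    pairs = ["{}={}".format(name, value) for name, value in first_cookies.items()]
    pairs.append(clearance)
    return "; ".join(pairs)


computed = ("__jsl_clearance=1558954345.795|0|V%2Bp1UYNNA%2Fc4wboCF4SQoA%2Fy9j0%3D;"
            "Expires=Mon, 27-May-19 11:52:25 GMT;Path=/;")
# "first_cookie" is a hypothetical name standing in for the cookie from the first response.
print(build_cookie_header({"first_cookie": "abc123"}, computed))
```

With requests, dict(response.cookies) could be passed as first_cookies.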

 

There are many packages for running JS from Python; here we use js2py and execjs (either one works).

pip install Js2Py
pip install PyExecJS

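 The two versions are largely the same. I was short on time, so much of the code is unoptimized and only gets the job done; please bear with me (I will clean it up later).

js2py implementation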

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/5/27 15:19
# @Author  : yhl
# @Software: PyCharm

import re
import time
import js2py
import random
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
    "Host": 'www.66ip.cn',
    # "Referer": 'http://www.66ip.cn/index.html',
    "Upgrade-Insecure-Requests": '1',
}

def get_521_content(url):
    req = requests.get(url=url, headers=headers)
    cookies = req.cookies
    cookies = '; '.join(['='.join(item) for item in cookies.items()])
    txt_521 = req.text
    txt_521 = ''.join(re.findall('<script>(.*?)</script>', txt_521))
    return (txt_521, cookies, req)


def fixed_fun(function,url):
    print(function)
    js = function.replace("<script>", "").replace("</script>", "").replace("{eval(", "{var my_data_1 = (")
    # print(js)
    # Run the script in a js2py context and read back the value assigned to my_data_1
    context = js2py.EvalJs()
    context.execute(js)
    js_temp = context.my_data_1
    print(js_temp)
    index1 = js_temp.find("document.")
    index2 = js_temp.find("};if((")
    js_temp = js_temp[index1:index2].replace("document.cookie", "my_data_2")
    new_js_temp = re.sub(r'document.create.*?firstChild.href', '"{}"'.format(url), js_temp)
    # print(new_js_temp)
    # print(type(new_js_temp))
    context.execute(new_js_temp)
    data = context.my_data_2
    # print(data)
    __jsl_clearance = str(data).split(';')[0]
    return __jsl_clearance


def get_66daili(url):
    txt_521, cookies, req = get_521_content(url)
    print(req.status_code)
    if req.status_code == 521:
        __jsl_clearance = fixed_fun(txt_521,url)
        headers['Cookie'] = __jsl_clearance + ';' + cookies
        res1 = requests.get(url=url, headers=headers)
    else:
        res1 = req
    res1.encoding = 'gb2312'
    html = etree.HTML(res1.text)
    tr_list = html.xpath('//table//tr')
    for num, tr in enumerate(tr_list, 1):
        proxy_ip_dict = {}
        if num != 1:
            proxy_ip_dict['proxy_ip'] = ''.join(tr.xpath('.//td[1]/text()'))
            proxy_ip_dict['proxy_port'] = ''.join(tr.xpath('.//td[2]/text()'))
            proxy_ip_dict['proxy_local'] = ''.join(tr.xpath('.//td[3]/text()'))
            proxy_ip_dict['proxy_anonymous'] = ''.join(tr.xpath('.//td[4]/text()'))
            print(proxy_ip_dict)  # the page has no proxy_type; add it yourself along with a proxy check


def main():
    for i in range(1, 2000):
        get_66daili('http://www.66ip.cn/%s.html' % (i))
        # Pause briefly between pages to avoid hammering the site
        time.sleep(random.uniform(1, 2))


if __name__ == '__main__':
    main()
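
The string surgery inside fixed_fun is the fragile part, so here it is in isolation, without any JS engine. This is a sketch that mirrors the slicing above; the function name isolate_cookie_statement and the explicit error are my additions, and it assumes the unpacked code keeps the "document.cookie=...};if((" layout shown earlier.

```python
import re


def isolate_cookie_statement(unpacked_js, page_url, var_name="my_data_2"):
    """Cut out the document.cookie assignment and make it runnable outside a browser."""
    start = unpacked_js.find("document.")
    end = unpacked_js.find("};if((")
    if start == -1 or end == -1:
        # Assumption violated: the challenge layout is not the one seen in this post.
        raise ValueError("unexpected challenge layout")
    statement = unpacked_js[start:end].replace("document.cookie", var_name)
    # Some variants read the page URL through a temporary DOM element;
    # substitute the URL as a string literal, since there is no DOM here.
    return re.sub(r"document\.create.*?firstChild\.href",
                  lambda m: '"{}"'.format(page_url), statement)
```

The returned statement is what gets handed to context.execute, after which the chosen variable holds the cookie string.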

 

 

execjs implementation (bug-prone, but it does produce the result; I tested it)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/5/27 17:18
# @Author  : yhl
# @Software: PyCharm

import re
import execjs
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}


def get_521_content():
    req = requests.get('http://www.66ip.cn/1.html', headers=headers)
    cookies = req.cookies
    cookies = '; '.join(['='.join(item) for item in cookies.items()])
    txt_521 = req.text
    txt_521 = ''.join(re.findall('<script>(.*?)</script>', txt_521))
    return (txt_521, cookies)


def fixed_fun(function):
    print(function)
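    # Rewrite only the eval call; a blanket replace would also hit the "eval" entry in the word table
    func_return = function.replace('try{eval(', 'try{return(')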
    resHtml = "function getClearance(){" + func_return + "};"
    ctx = execjs.compile(resHtml)
    temp1 = ctx.call('getClearance')
    print(temp1)
    s = 'var a' + temp1.split('document.cookie')[1].split("Path=/;'")[0] + "Path=/;';return a;"
    s = re.sub(r'document.create.*?firstChild.href', '"{}"'.format('http://www.66ip.cn/1.html'), s) 
    print('s--->',s)
    resHtml = "function getnewClearance(){" + s + "};"
    ctx = execjs.compile(resHtml)
    jsl_clearance = ctx.call('getnewClearance')
    __jsl_clearance = str(jsl_clearance).split(';')[0]
    print(jsl_clearance)

    return __jsl_clearance

if __name__ == '__main__':
    func = get_521_content()
    content = func[0]
    cookie_id = func[1]
    cookie_id1 = fixed_fun(content)
    headers['Cookie'] = cookie_id + ';' + cookie_id1
    res1 = requests.get(url='http://www.66ip.cn/1.html', headers=headers)
    res1.encoding = 'gb2312'
    print(res1.text)

 (Screenshot: output of the execjs version.)

 

