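Preface
I recently ran into a rather fun anti-scraping measure: the generation of Ctrip's eleven parameter.
It is fun because requesting a certain endpoint returns JS code, and with a little debugging you can obtain the eleven parameter in the browser.
But if you try to do the same thing in Node, a headless browser, or anything similar, it simply throws errors.
(If you only want the code, skip to the end: Node environment + Tampermonkey + Python, with a WebSocket carrying messages between the Tampermonkey script and the Python code.)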
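Scraping target
First, the data we want to scrape (shown below): https://hotels.ctrip.com/hotel/beijing1#ctm_ref=hod_hp_sb_lst
This data comes from the following endpoint (the Python code later in this post sends the request to https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx).
It is a POST request, and the request body carries the well-known eleven parameter.
If the eleven parameter is wrong, the data above is not returned.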
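The eleven parameter
As mentioned above, the eleven parameter comes from a piece of JS code (the code is returned directly when a particular URL is requested).
Below is the URL to request (in the Python code later in this post it has the form https://hotels.ctrip.com/domestic/cas/oceanball?callback=<callback>&_=<timestamp>).
The JS code returned by this URL: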
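1. Testing the returned JS code
Open a new tab, paste the returned JS code into the console and run it. It throws an error.
The likely cause is that other JS code it depends on is missing.
// This produces an error similar to the one in the screenshot above
Function.prototype.toString.call(1);
We can also paste the returned JS code into the console of a Ctrip page. This time there is no error, but the page is redirected to the login page.
So my guess is that the returned JS code can only be executed once (after all, the Ctrip page had already run it).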
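2. Setting a breakpoint (to see how the returned JS code runs)
How do we set the breakpoint? Search for a keyword from the URL? Set an XHR breakpoint?
What I did at the time was search for a keyword from the URL, oceanball. That day the search found it directly, but today it no longer does.
An XHR breakpoint will never be hit, because the request is made with JSONP.
So what can we use?
Look at the initiator of the request.
Hover over the spot marked by the red arrow below to see the initiator of this request.
Click the second initiator (the line reading js @cQuery_110421.js:formatted:823).
Why the second one? Because the first one does not work; feel free to try it yourself.
As shown, set a breakpoint there (line 823, which is also the line where the second initiator's code executes).
Instead of refreshing the page every time, which is inefficient, we can switch back and forth between "歡迎度排序" (sort by popularity) and "好評優先" (best rated first).
Whenever the page is refreshed or the option is switched as described, execution pauses at the breakpoint we set.
Pay attention to the call stack on the right.
Click as shown below to move up one frame in the call stack.
Scroll up a little and you will find that this part contains every detail of that URL, from issuing the request to handling the returned result.
As shown below: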
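In fact, it is enough to put the following in front of the returned JS code:
// Here o is the callback parameter from the request URL.
window[o] = function (e) {
    console.log(e()); // prints the result
};
// The JS code returned by the URL goes below.
With that in place, the code prints the result in the browser.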
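The same wrapper can be assembled programmatically before the code is handed to a browser. This is only a minimal sketch: `wrap_returned_js` is a hypothetical helper name that does not appear in the original, and `json.dumps` is used merely to quote the callback name safely.

```python
import json


def wrap_returned_js(callback: str, returned_js: str) -> str:
    """Prepend a window[callback] handler to the JS returned by the endpoint.

    The handler calls the function it receives and logs the result,
    mirroring the hand-written snippet. `wrap_returned_js` is a
    hypothetical name used for illustration only.
    """
    prelude = (
        "window[%s] = function (e) {\n"
        "    console.log(e());\n"
        "};\n"
    ) % json.dumps(callback)
    return prelude + returned_js


if __name__ == "__main__":
    # "CASexample" and the one-line body are placeholder values.
    print(wrap_returned_js("CASexample", "CASexample(function () { return 'x'; });"))
```

Keeping the wrapper in one function makes it easy to swap `console.log` for some other way of reporting the result later on.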
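Now for the important part.
3. How to generate eleven parameters in bulk
I can hardly claim that I paste the code into the browser by hand, run it, and copy the result out each time.
Running it in a Node environment is quite hard, because the code checks the execution environment strictly.
Do not count on stepping through it with breakpoints either. Have you ever seen a function get called more than 180,000 times?
Wondering how I know all this?
I hijacked navigator.userAgent with Object.defineProperty.
Whenever the JS code tries to read navigator.userAgent, execution pauses right there.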
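// Save the real value first: reading navigator.userAgent inside the getter would recurse forever.
var realUserAgent = navigator.userAgent;
Object.defineProperty(navigator, "userAgent", {
    get() {
        debugger;
        return realUserAgent;
    }
});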
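Walking back through the call stack from there, I found a function that appears to be responsible for environment detection. That function was called more than 180,000 times.
It checks a great many things: not only Node, but also headless browsers and more, including things you may never have heard of.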
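Approach
Some experts have already managed to emulate the browser environment in Node. For a beginner like me, that is probably out of reach.
My idea is simple: still let a browser execute the JS code, but make it happen automatically.
The inspiration came from the VS Code extension that refreshes the page whenever a file is saved. It communicates over WebSocket: the server pushes the latest HTML to the client, and the client can process it as needed.
1. First, set up a WebSocket server with Node.js. The nodejs-websocket module can be used for this.
The code is as follows.
Node must be installed first (download it from the official Node website and install it like any other program).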

var ws = require("nodejs-websocket");

console.log("Starting the server...");

// Maps a role ("Browser" or "Python") to the key of the connection that registered it.
var cached = {};

var server = ws.createServer(function (conn) {
    conn.on("text", function (msg) {
        if (!msg) return;
        if (msg.length > 1000) {
            console.log("msg: (JS code)");
        } else {
            console.log("msg", msg);
        }
        var key = conn.key;
        if (msg === "Browser" || msg === "Python") {
            // First message from the browser or from Python: remember who is who.
            cached[msg] = key;
            console.log("cached", cached);
            return;
        }
        console.log(cached, key);
        if (Object.values(cached).includes(key)) {
            // Forward the message to every connection except the sender.
            var targetConn = server.connections.filter(function (c) {
                return c.key !== key;
            });
            console.log(targetConn.map(function (c) { return c.key; }));
            console.log("forwarding message");
            targetConn.forEach(function (c) {
                c.send(msg);
            });
        }
    });
    conn.on("close", function (code, reason) {
        console.log("connection closed");
    });
    conn.on("error", function (err) {
        console.log("connection closed abnormally");
    });
}).listen(8014);

console.log("WebSocket server ready");
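The relay rule in the server is short but easy to misread: the first message a client sends ("Browser" or "Python") registers it, and every later message from a registered client is forwarded to all other connections. The following is a minimal in-memory model of that rule, not a WebSocket server; the `Relay` class and its method names are hypothetical and exist only to make the routing testable.

```python
from typing import Callable, Dict, List


class Relay:
    """In-memory model of the relay rule; no networking involved."""

    def __init__(self) -> None:
        # connection key -> function that delivers a message to that client
        self.connections: Dict[str, Callable[[str], None]] = {}
        # role ("Browser" or "Python") -> connection key
        self.cached: Dict[str, str] = {}

    def connect(self, key: str, send: Callable[[str], None]) -> None:
        self.connections[key] = send

    def on_text(self, key: str, msg: str) -> None:
        if not msg:
            return
        if msg in ("Browser", "Python"):
            self.cached[msg] = key  # the first message registers the role
            return
        if key in self.cached.values():
            # Forward to every connection except the sender.
            for other_key, send in self.connections.items():
                if other_key != key:
                    send(msg)


if __name__ == "__main__":
    relay = Relay()
    browser_inbox: List[str] = []
    python_inbox: List[str] = []
    relay.connect("k1", browser_inbox.append)
    relay.connect("k2", python_inbox.append)
    relay.on_text("k1", "Browser")
    relay.on_text("k2", "Python")
    relay.on_text("k2", "console.log(1)")  # Python -> browser
    relay.on_text("k1", "eleven-value")    # browser -> Python
    print(browser_inbox, python_inbox)
```

Note that the rule forwards to every other connection, so with more than one browser tab connected each tab would receive and run the same code.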
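2. Next, install the browser extension Tampermonkey (油猴). Reaching the store that hosts it requires a proxy if it is blocked on your network.
Clicking Apps takes you to the Chrome Web Store (proxy required); search for Tampermonkey there.
The Tampermonkey script: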

// ==UserScript==
// @name         Ctrip websocket
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  try to take over the world!
// @author       You
// @match        https://hotels.ctrip.com/hotel/beijing1
// @grant        none
// ==/UserScript==

(function () {
    if (!window.WebSocket) {
        return;
    }
    // Expose the socket as window.ws so that the injected code can call ws.send(...).
    var ws = window.ws = new WebSocket("ws://127.0.0.1:8014/");
    ws.onopen = function () {
        // Register this connection with the Node server as the browser side.
        ws.send("Browser");
    };
    ws.onclose = function () {
        console.log("server closed");
    };
    ws.onerror = function () {
        console.log("connection error");
    };
    ws.onmessage = function (e) {
        // Remove the previously injected <script>, then inject and run the new code.
        var execJS = document.getElementById("execJS");
        if (execJS) {
            document.body.removeChild(execJS);
        }
        execJS = document.createElement("script");
        execJS.id = "execJS";
        execJS.innerHTML = e.data;
        document.body.appendChild(execJS);
    };
})();
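A word on why Tampermonkey is needed.
With Tampermonkey, the JS code runs directly inside the Ctrip page rather than in a separately opened page.
(Note that the strictness of Ctrip's server-side validation varies from day to day.)
On the day I first tested this, I simply wrote an HTML file and opened it locally, and that was enough.
If you have not installed Tampermonkey, you can try the HTML file below first. If validation fails, you need the Tampermonkey setup.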

<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
    <style>
        #mess{text-align: center}
    </style>
</head>
<body>
    <script id="execJS"></script>
    <script>
        if (window.WebSocket) {
            // Declared at the top level so the injected code can call ws.send(...).
            // The port must match the one the Node server listens on (8014).
            var ws = new WebSocket("ws://127.0.0.1:8014/");
            ws.onopen = function () {
                // Register this connection with the Node server as the browser side.
                ws.send("Browser");
            };
            ws.onclose = function () {
                console.log("server closed");
            };
            ws.onerror = function () {
                console.log("connection error");
            };
            ws.onmessage = function (e) {
                // Remove the previously injected <script>, then inject and run the new code.
                var execJS = document.getElementById("execJS");
                if (execJS) {
                    document.body.removeChild(execJS);
                }
                execJS = document.createElement("script");
                execJS.id = "execJS";
                execJS.innerHTML = e.data;
                document.body.appendChild(execJS);
            };
        }
    </script>
</body>
</html>
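That day the returned eleven parameter was correct and the request succeeded. Today, however, the test failed, so I compared the result produced in the Ctrip page's console with the one produced in the console of my local HTML file: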
# 21e3255d0f89cdf5c3a347d61e7dafbcf15db34f7afe97cda2b5a7ec578652ee_1965113742
# 21e3255d0f89cdf5c3a347d61e7cafbcf15db34f7afe97cda2b5a7ec578652ee_1965113417
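Unless you look closely, you cannot tell them apart. The last three digits differ (and, in the two strings as pasted here, so does one character in the middle of the hash). This is probably the result of a check on location.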
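Spotting such differences by eye is error-prone, so a character-by-character comparison helps. A minimal sketch follows; `diff_positions` is a hypothetical helper name, and the two strings are the ones quoted above.

```python
def diff_positions(a: str, b: str) -> list:
    """Return the 0-based indices at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    first = "21e3255d0f89cdf5c3a347d61e7dafbcf15db34f7afe97cda2b5a7ec578652ee_1965113742"
    second = "21e3255d0f89cdf5c3a347d61e7cafbcf15db34f7afe97cda2b5a7ec578652ee_1965113417"
    # Print each differing index with the character from each string.
    for i in diff_positions(first, second):
        print(i, first[i], second[i])
```

Running the same comparison on values captured in different environments is a quick way to see which part of eleven reacts to which environment change.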
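Tampermonkey's job is to inject our JS code when the Ctrip site is opened, so that whatever code runs afterwards runs in Ctrip's own environment. The eleven parameter produced this way is correct.
3. Writing the Python code
The Python side connects to the WebSocket service and sends the JS code we want executed; Node relays that code to the front-end page (the Tampermonkey script).
Once the JS code has finished running in the Ctrip environment, it sends the eleven parameter back to Node over the WebSocket, and Node passes the result on to us. This is how the Python code obtains the eleven parameter.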

import datetime
import time

import execjs
import requests
from ws4py.client.threadedclient import WebSocketClient


class CG_Client(WebSocketClient):
    def opened(self):
        print("connected")
        # Register this connection with the Node server as the Python side.
        self.send("Python")

    def closed(self, code, reason=None):
        print("Closed down:", code, reason)

    def received_message(self, resp):
        # Every message relayed back from the browser is treated as an eleven value.
        print("resp", resp)
        today_date = datetime.date.today()
        today = today_date.strftime("%Y-%m-%d")  # today, e.g. "2020-05-11"
        tomorrow = (today_date + datetime.timedelta(days=1)).strftime("%Y-%m-%d")  # tomorrow, e.g. "2020-05-12"
        data = {
            "__VIEWSTATEGENERATOR": "DB1FBB6D", "cityName": "%E5%8C%97%E4%BA%AC",
            "StartTime": today, "DepTime": tomorrow, "RoomGuestCount": "1,1,0",
            "txtkeyword": "", "Resource": "", "Room": "", "Paymentterm": "", "BRev": "",
            "Minstate": "", "PromoteType": "", "PromoteDate": "",
            "operationtype": "NEWHOTELORDER", "PromoteStartDate": "", "PromoteEndDate": "",
            "OrderID": "", "RoomNum": "", "IsOnlyAirHotel": "F", "cityId": "1",
            "cityPY": "beijing", "cityCode": "010", "cityLat": "39.9105329229",
            "cityLng": "116.413784021", "positionArea": "", "positionId": "",
            "hotelposition": "", "keyword": "", "hotelId": "", "htlPageView": "0",
            "hotelType": "F", "hasPKGHotel": "F", "requestTravelMoney": "F",
            "isusergiftcard": "F", "useFG": "F", "HotelEquipment": "", "priceRange": "-2",
            "hotelBrandId": "", "promotion": "F", "prepay": "F", "IsCanReserve": "F",
            "k1": "", "k2": "", "CorpPayType": "", "viewType": "",
            "checkIn": today, "checkOut": tomorrow, "DealSale": "", "ulogin": "",
            "hidTestLat": "0%7C0", "AllHotelIds": "", "psid": "", "isfromlist": "T",
            "ubt_price_key": "htl_search_noresult_promotion", "showwindow": "",
            "defaultcoupon": "", "isHuaZhu": "False", "hotelPriceLow": "",
            "unBookHotelTraceCode": "", "showTipFlg": "", "traceAdContextId": "",
            "allianceid": "0", "sid": "0", "pyramidHotels": "", "hotelIds": "",
            "markType": "0", "zone": "", "location": "", "type": "", "brand": "",
            "group": "", "feature": "", "equip": "",
            "bed": "", "breakfast": "", "other": "", "star": "", "sl": "", "s": "",
            "l": "", "price": "", "a": "0", "keywordLat": "", "keywordLon": "",
            "contrast": "0", "PaymentType": "", "CtripService": "", "promotionf": "",
            "allpoint": "", "page_id_forlog": "102002", "contyped": "0", "productcode": "",
            "eleven": resp.data.decode("utf-8"),
            "orderby": "3", "ordertype": "0", "page": "1",
        }
        headers = {
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
            "referer": "https://hotels.ctrip.com/hotel/shanghai2",
            # Put your own cookie here: Ctrip checks the cookie's IP field
            # (which is obfuscated and encrypted).
            "cookie": "YOUR_COOKIE_HERE",
        }
        url = "https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
        a = requests.post(url, data=data, headers=headers)
        print(a.text)


def getTime():
    # Millisecond timestamp as a string.
    return str(int(time.time() * 1000))


def getCallbackParam():
    with open("./callback.js") as f:
        context = execjs.compile(f.read())
    return context.call("getCallback")


def getContent():
    t = getTime()
    callback = getCallbackParam()
    print(callback)
    url = "https://hotels.ctrip.com/domestic/cas/oceanball?callback=%s&_=%s" % (
        callback,
        t,
    )
    headers = {
        "user-agent": "Mozilla/5.0 (darwin) AppleWebKit/537.36 (KHTML, like Gecko) jsdom/16.2.2",
        "referer": "https://hotels.ctrip.com/hotel/shanghai2",
    }
    r = requests.get(url, headers=headers)
    # Prepend a handler that sends the result back over the browser's WebSocket.
    code = (
        """
        window["%s"] = function (e) {
            var f = e();
            console.log(f);
            ws.send(f);
        };;
        """
        % callback
        + r.text
    )
    print(code)
    ws.send(code)


ws = None
try:
    ws = CG_Client("ws://127.0.0.1:8014/")
    ws.connect()
    getContent()
    # To make several requests, call getContent() again here.
    ws.run_forever()
except KeyboardInterrupt:
    ws.close()
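Two small values in this request are easy to get subtly wrong: the check-in and check-out dates, and the millisecond timestamp (slicing `str(time.time())` yields fewer than 13 digits whenever the float prints with few decimals). The following is a minimal sketch assuming a one-night stay starting today; `stay_dates` and `millis_timestamp` are hypothetical helper names, not part of the original code.

```python
import datetime
import time


def stay_dates(today=None):
    """Return (check_in, check_out) as YYYY-MM-DD strings for a one-night stay."""
    if today is None:
        today = datetime.date.today()
    tomorrow = today + datetime.timedelta(days=1)
    return today.strftime("%Y-%m-%d"), tomorrow.strftime("%Y-%m-%d")


def millis_timestamp(now=None):
    """Return the Unix time in whole milliseconds as a string."""
    if now is None:
        now = time.time()
    return str(int(now * 1000))


if __name__ == "__main__":
    check_in, check_out = stay_dates()
    print(check_in, check_out, millis_timestamp())
```

Both helpers accept an explicit value so that they can be checked against fixed inputs instead of the current clock.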
The Python code depends on a callback.js file with the following content:
// callback.js
// Builds a callback name: "CAS" followed by e random letters.
function e(e) {
    var t = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
             "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
             "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
             "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"],
        a = "CAS",
        o = 0;
    for (; o < e; o++) {
        // Math.ceil gives an index from 1 to 51 (0 only if Math.random() returns
        // exactly 0), so "A" is practically never picked. Kept as in the original.
        var i = Math.ceil(51 * Math.random());
        a += t[i];
    }
    return a;
}

function getCallback() {
    return e(15);
}
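If running a JS engine from Python just for this feels heavy, the same name generator can be written directly in Python. This is a sketch of a straight port, assuming nothing about the name matters beyond the "CAS" prefix followed by 15 letters; `get_callback` is a hypothetical name, and the index formula is deliberately copied from callback.js.

```python
import math
import random
import string

# "A".."Z" followed by "a".."z", the same 52 letters as in callback.js.
LETTERS = string.ascii_uppercase + string.ascii_lowercase


def get_callback(length: int = 15, rng=random) -> str:
    """Return "CAS" followed by `length` random letters."""
    name = "CAS"
    for _ in range(length):
        # Same formula as callback.js: ceil(51 * random()) is an index in 0..51.
        index = math.ceil(51 * rng.random())
        name += LETTERS[index]
    return name


if __name__ == "__main__":
    print(get_callback())
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible when debugging.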
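4. Start-up order
1. Start the Node WebSocket service (node app.js).
2. Refresh the Ctrip page, open the developer tools with F12, and check that the page has connected to the Node WebSocket service.
3. Start the Python code.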
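5. Notes
If you get a port-in-use error (quite normal on a Mac), you can run npm i nodemon and start the server with nodemon app.js. The server then restarts whenever app.js is saved.
If the Python code runs but never receives a result, first check whether Node reported an error, then refresh the Ctrip page.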
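Both situations (port already taken, or nothing listening when Python starts) can be checked up front with a plain TCP connection attempt. This is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper name, and 8014 is the port used by the Node server in this post.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not open".
        return False


if __name__ == "__main__":
    if is_port_open("127.0.0.1", 8014):
        print("something is listening on 8014")
    else:
        print("nothing is listening on 8014; start the Node server first")
```

This only confirms that a process accepts connections on the port; it does not confirm that the process is the WebSocket server or that the browser has connected.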
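6. Speed
Throughput is essentially the speed at which the browser executes the JS. (Node and Chrome both use the V8 engine, so the browser should be at least on par with Node here.)
The only extra cost is the WebSocket round trip, and the WebSocket connection is reused rather than opened anew for each request.
7. Canvas fingerprint
When scraping at volume, every request carries the same canvas fingerprint. One option is to hook the canvas-related APIs.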
8. Python code for scraping reviews
import time

import execjs
import requests
from ws4py.client.threadedclient import WebSocketClient

# Callback name shared between getContent() and received_message().
callback = ""


class CG_Client(WebSocketClient):
    def opened(self):
        print("connected")
        # Register this connection with the Node server as the Python side.
        self.send("Python")

    def closed(self, code, reason=None):
        print("Closed down:", code, reason)

    def received_message(self, resp):
        # Every message relayed back from the browser is treated as an eleven value.
        eleven = resp.data.decode("utf-8")
        print("resp", eleven)
        headers = {
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
            "referer": "https://hotels.ctrip.com/hotel/shanghai2",
            "cookie": "YOUR_COOKIE_HERE",  # put your own cookie here
        }
        params = {
            "MasterHotelID": "608516",
            "hotel": "608516",
            "NewOpenCount": "0",
            "AutoExpiredCount": "0",
            "RecordCount": "1659",
            "OpenDate": "",
            "card": "-1",
            "property": "-1",
            "userType": "-1",
            "productcode": "",
            "keyword": "",
            "roomName": "",
            "orderBy": "2",
            "currentPage": "2",
            "viewVersion": "c",
            "contyped": "0",
            "eleven": eleven,
            "callback": callback,
            "_": getTime(),
        }
        comment_url = "https://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx"
        r = requests.get(comment_url, params=params, headers=headers)
        print(r.url)
        print(r.text)


def getTime():
    # Millisecond timestamp as a string.
    return str(int(time.time() * 1000))


def getCallbackParam():
    # Same generator as callback.js, inlined so no extra file is needed.
    callbackCode = """
    function e(e) {
        var t = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
                 "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
                 "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
                 "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"],
            a = "CAS",
            o = 0;
        for (; o < e; o++) {
            var i = Math.ceil(51 * Math.random());
            a += t[i];
        }
        return a;
    }
    function getCallback() {
        return e(15);
    }
    """
    context = execjs.compile(callbackCode)
    return context.call("getCallback")


def getContent():
    global callback
    t = getTime()
    callback = getCallbackParam()
    print(callback)
    url = "https://hotels.ctrip.com/domestic/cas/oceanball?callback=%s&_=%s" % (
        callback,
        t,
    )
    headers = {
        "user-agent": "Mozilla/5.0 (darwin) AppleWebKit/537.36 (KHTML, like Gecko) jsdom/16.2.2",
        "referer": "https://hotels.ctrip.com/hotel/shanghai2",
        "cookie": "YOUR_COOKIE_HERE",  # put your own cookie here
    }
    r = requests.get(url, headers=headers)
    # Prepend a handler that sends the result back over the browser's WebSocket.
    code = (
        """
        window["%s"] = function (e) {
            var f = e();
            console.log(f);
            ws.send(f);
        };;
        """
        % callback
        + r.text
    )
    ws.send(code)


ws = None
try:
    ws = CG_Client("ws://127.0.0.1:8014/")
    ws.connect()
    getContent()
    ws.run_forever()
except KeyboardInterrupt:
    ws.close()
Screenshot of a successful run