Using the crawlergo dynamic crawler without the Spider-Name header

Reposted article. 2020-11-28 14:22. Categories: security tools / security development

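I originally wanted to pair the AWVS crawler with Xray, but that means installing AWVS on the host and then wiring up the linkage rules, which felt far too heavy when all I needed was its target-crawling feature. On GitHub I found crawlergo, a dynamic crawler module that 360's 0Kee-Team split out of 360 Tianxiang (天相), so I tried driving it from my own code instead.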

Basic usage

Download the latest release and run it from its own directory.

In PowerShell:

./crawlergo -c "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -t 10 http://testphp.vulnweb.com/

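However, the crawler's request headers clearly contain this field:

Spider-Name: crawlergo

The crawlergo team has acknowledged this, and an issue on the project also raises it.

So the first step is to stop that keyword from being blocked by a WAF, by having crawlergo crawl with custom request headers.

Use fake_useragent to generate a forged request header: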

from fake_useragent import UserAgent
ua = UserAgent()


def GetHeaders():
    headers = {'User-Agent': ua.random}
    return headers

When crawling, pass the randomly generated headers, i.e.:

"--custom-headers",json.dumps(GetHeaders())

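Next, modify the system-call sample code provided by the crawlergo team.

The original code is as follows (I have already changed the Chrome path to my local one):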

#!/usr/bin/python3
# coding: utf-8

import simplejson
import subprocess


def main():
    target = "http://testphp.vulnweb.com/"
    cmd = ["./crawlergo", "-c", r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe", "-o", "json", target]
    rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = rsp.communicate()
    # "--[Mission Complete]--" is the separator string that marks the end of the task
    result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
    req_list = result["req_list"]
    print(req_list[0])


if __name__ == '__main__':
    main()

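By default this code prints the first request found on the current domain (req_list[0]).

The output looks like this (screenshot omitted):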
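The script above indexes straight into the result of split(), which raises an IndexError if crawlergo exits before printing the marker. A slightly more defensive parser might look like the sketch below. The helper name parse_crawlergo_output is hypothetical, and the sketch uses the standard-library json module instead of simplejson.

```python
import json

# Separator that crawlergo prints before the JSON result (taken from the script above).
MARKER = "--[Mission Complete]--"


def parse_crawlergo_output(raw: bytes) -> dict:
    """Return the JSON result that follows the first completion marker.

    Hypothetical helper: raises ValueError instead of IndexError when the
    marker is missing, e.g. because the crawl failed.
    """
    text = raw.decode("utf-8", errors="replace")
    _, sep, tail = text.partition(MARKER)
    if not sep:
        raise ValueError("completion marker not found; crawlergo may have failed")
    return json.loads(tail)
```

In main(), this would replace the simplejson.loads(...) line with result = parse_crawlergo_output(output).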

Take the key line of code:

cmd = ["./crawlergo", "-c", r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe", "-o", "json", target]

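Based on this option from the project's parameter list:

--custom-headers Headers: custom HTTP headers, passed in as JSON-serialized data. This is a global setting and applies to every request.

change it to the following (json.dumps also needs import json at the top of the script):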

cmd = ["./crawlergo", "-c", r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe", "--custom-headers", json.dumps(GetHeaders()), "-t", "10", "-o", "json", target]

The GetHeaders() function is given above. The result of the run (screenshot omitted):

As you can see, the Spider-Name: crawlergo field is gone.
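To keep the long argument list readable, the command can be assembled by a small helper. This is only a sketch: build_cmd and its parameter names are hypothetical, it uses only the flags already shown above (-c, --custom-headers, -t, -o), and the usage example passes a fixed User-Agent string rather than fake_useragent so that it runs without extra packages.

```python
import json


def build_cmd(target, chrome_path, headers, tabs=10, binary="./crawlergo"):
    """Assemble the argument list to hand to subprocess.Popen.

    Hypothetical helper; the headers dict is JSON-serialized because
    --custom-headers expects serialized JSON.
    """
    return [
        binary,
        "-c", chrome_path,
        "--custom-headers", json.dumps(headers),
        "-t", str(tabs),
        "-o", "json",
        target,
    ]


if __name__ == "__main__":
    cmd = build_cmd(
        "http://testphp.vulnweb.com/",
        r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
        {"User-Agent": "Mozilla/5.0 (example value)"},
    )
    print(cmd)
```

Passing a list rather than a single command string means subprocess handles the quoting of the Chrome path and of the JSON itself.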

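Handling the returned results

When the output mode is set to json, the deserialized result contains four parts:

all_req_list: every request found during this crawl, including resources of any type from other domains.
req_list: the same-domain results of this crawl, deduplicated with pseudo-static rules and excluding static resource links. In theory a subset of all_req_list.
all_domain_list: list of all domains found.
sub_domain_list: list of subdomains of the target that were found.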
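Since req_list is only described as a subset of all_req_list "in theory", it can be worth checking that on real output. The sketch below assumes every entry in both lists is a dict with a 'url' key. The post only shows this for req_list, so treat it as an assumption for all_req_list. The function name is hypothetical.

```python
def urls_missing_from_all(result: dict) -> list:
    """Return req_list URLs that do not appear in all_req_list.

    Hypothetical helper; an empty return value means req_list really is
    a subset of all_req_list for this crawl.
    """
    all_urls = {req["url"] for req in (result.get("all_req_list") or [])}
    return [
        req["url"]
        for req in (result.get("req_list") or [])
        if req["url"] not in all_urls
    ]
```

For example, print(urls_missing_from_all(result)) can be called right after the result has been deserialized.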

What we want are the same-domain results of the task, so print them:

    result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
    # print(result)
    req_list = result["req_list"]
    for url in req_list:
        print(url['url'])

As the output shows, the deduplication is not quite perfect.
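One way to tighten this up afterwards is a second deduplication pass that treats URLs as equal when they share scheme, host, path and the same set of query parameter names. This is not crawlergo's own algorithm, just a minimal sketch with hypothetical function names, and it will also merge URLs whose parameter values select different pages (for example ?page=home and ?page=admin), so it suits scanner input better than site mapping.

```python
from urllib.parse import parse_qsl, urlsplit


def url_shape(url: str) -> tuple:
    """Reduce a URL to (scheme, host, path, sorted parameter names)."""
    parts = urlsplit(url)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return (parts.scheme, parts.netloc, parts.path, tuple(sorted(names)))


def dedupe_by_shape(urls):
    """Keep the first URL seen for each shape, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        shape = url_shape(url)
        if shape not in seen:
            seen.add(shape)
            unique.append(url)
    return unique
```

It could be applied as dedupe_by_shape(req['url'] for req in req_list) before printing or storing the results.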

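Finally, to make configuration easier, you can write a config.py that holds the Chrome path, which makes the scanning setup more portable, and store the results in a txt file or a queue.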
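A minimal sketch of what that could look like follows. All names here (CHROME_PATH, CRAWLERGO_PATH, OUTPUT_FILE, save_urls, enqueue_urls) are hypothetical, and for brevity the settings and the helpers are shown in one file, whereas in practice the constants would live in config.py and be imported.

```python
import queue
from pathlib import Path

# Hypothetical contents of config.py; adjust the values to the local machine.
CHROME_PATH = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
CRAWLERGO_PATH = "./crawlergo"
OUTPUT_FILE = "urls.txt"


def save_urls(urls, path=OUTPUT_FILE):
    """Append one URL per line to a text file."""
    with Path(path).open("a", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")


def enqueue_urls(urls, q=None):
    """Put URLs on a queue.Queue for a downstream consumer and return it."""
    q = q if q is not None else queue.Queue()
    for url in urls:
        q.put(url)
    return q
```

The text file is convenient for feeding another tool later, while the queue fits the case where a scanner thread in the same process consumes URLs as they arrive.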
