1. Introduction
Crawlab is a distributed spider (crawler) management platform written in Golang, but it does not provide a way to add spiders in bulk.
I couldn't put up with that, so I decided to write a script that adds spiders in batches.
2. The problem to solve
There are several hundred sites to crawl. The spider code is already written, but every spider has to be added to Crawlab by hand, one at a time, which is exhausting.
So the plan is to write a script that adds the spiders automatically.
3. The tricky part
The file upload uses Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryxHvDD0BUGzUw2qhS.
At first I was completely lost: what is multipart/form-data, and what is boundary=----WebKitFormBoundaryxHvDD0BUGzUw2qhS for? So I went off to dig through the documentation.
According to HTTP/1.1 (RFC 2616), the request methods are OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, and so on. So why do we also talk about a "multipart/form-data request"? HTTP is a text-based application-layer protocol on top of TCP/IP, and it splits a request into three parts: the request line, the headers, and the body. Every method is just a different way of using and organizing those three parts; you can look up the details of HTTP elsewhere. Since multipart/form-data does not appear in the method list above, what is it then? It is not a new method at all; it is built on top of POST:
1. A multipart/form-data request is sent with the POST method; it is POST plus a particular way of composing the message.
2. It differs from a plain POST only in the request headers and the request body.
3. The request headers must include a special Content-Type header whose value is multipart/form-data, together with a boundary parameter that separates the multiple parts of the body. The file content and the ordinary text fields have to be kept apart this way, otherwise the receiver cannot parse the body and reassemble the file. The header looks like this: Content-Type: multipart/form-data; boundary=${bound}
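To make this concrete, here is a minimal sketch (the field names and file contents are made up) using the MultipartEncoder from requests_toolbelt, the same helper the upload code below relies on. Printing the encoded body shows how the boundary string separates the ordinary text part from the file part:

import io
import json

from requests_toolbelt import MultipartEncoder

# One ordinary text field plus one fake in-memory "file", just to show the wire format.
m = MultipartEncoder(fields={
    "params": json.dumps({"name": "demo_spider"}),
    "file": ("demo.zip", io.BytesIO(b"not a real zip"), "application/x-zip-compressed"),
})

# The boundary is carried in the Content-Type header ...
print(m.content_type)  # multipart/form-data; boundary=<random hex>
# ... and the body consists of the parts separated by that boundary.
print(m.to_string().decode("utf-8", errors="replace"))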
Next, straight to the code.
The code is for demonstration only; leave a comment if you need help adapting it.
Every xxxxxxxxxxxxxxxxxxxxxxx placeholder must be replaced with values from your own environment.
import json
import zipfile

import requests
from requests_toolbelt import MultipartEncoder

file_name = "xxxxxxxxxxxxxxxxxxxxxxx"  # name of the spider script (without .py)

# Package the spider script plus the shared contest.py into a zip archive.
f = zipfile.ZipFile(file_name + '.zip', 'w', zipfile.ZIP_DEFLATED)
for i in ["contest.py", file_name + ".py"]:
    file = i.split('/')[-1]
    f.write(i, file)
f.close()

url = 'xxxxxxxxxxxxxxxxxxxxxxx'  # the upload endpoint captured from the browser
params = {
    'name': file_name,
    'display_name': file_name,
    'col': "undefined",
    'cmd': "",
}
print(json.dumps(params))

with open(file_name + '.zip', 'rb') as f_:
    # Build the multipart/form-data body: one JSON text field plus the zip file.
    m = MultipartEncoder(
        fields={
            "params": json.dumps(params),
            'file': (file_name + '.zip', f_, 'application/x-zip-compressed'),
        },
    )
    headers = {
        'Content-Type': m.content_type,  # carries the generated boundary
        "Authorization": "xxxxxxxxxxxxxxxxxxxxxxx",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
        "Accept": "*/*",
        "Origin": "xxxxxxxxxxxxxxxxxx",
        "Referer": "xxxxxxxxxxxxxxxxxxxxxxxx",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Cookie": "xxxxxxxxxxxxxxxxxxxxxxx"
    }
    response = requests.post(url, data=m, verify=False, headers=headers)
    content = json.loads(response.content)
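A note on the design choice here: MultipartEncoder (from the requests_toolbelt package) generates the boundary automatically and streams the zip file rather than reading it all into memory, so the Content-Type header must be taken from m.content_type. If you hard-code that header yourself, the boundary in the header will not match the boundary in the body and the server will fail to parse the upload.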
The code above only handles the file upload; it does not set the project or the execute command. The spider detail page has a Save action, and the same request-replay approach can be used to automate that edit-and-save step.
The code is as follows:
# Parse the upload response to get the new spider's IDs.
content = json.loads(response.content)
_id = content['data']['_id']
file_id = content['data']['file_id']
print(content)

# Payload replayed from the detail page's Save request; it fills in the
# project, the execute command, and the rest of the spider's settings.
data = {
    "_id": _id,
    "name": file_name,
    "display_name": file_name,
    "type": "customized",
    "file_id": file_id,
    "col": "",
    "site": "",
    "envs": [],
    "remark": "",
    "src": "/app/spiders/" + file_name,
    "project_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "is_public": False,
    "cmd": "python3 " + file_name + ".py",
    "is_scrapy": False,
    "spider_names": [],
    "template": "",
    "is_git": False,
    "git_url": "",
    "git_branch": "",
    "git_has_credential": False,
    "git_username": "",
    "git_password": "",
    "git_auto_sync": False,
    "git_sync_frequency": "",
    "git_sync_error": "",
    "is_long_task": False,
    "is_dedup": False,
    "dedup_field": "",
    "dedup_method": "",
    "is_web_hook": False,
    "web_hook_url": "",
    "last_run_ts": "0001-01-01T00:00:00Z",
    "last_status": "",
    "config": {
        "name": "",
        "display_name": "",
        "col": "",
        "remark": "",
        "Type": "",
        "engine": "",
        "start_url": "",
        "start_stage": "",
        "stages": [],
        "settings": {},
        "cmd": ""
    },
    "latest_tasks": [],
    "username": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "project_name": "",
    "user_id": "61246ad5a3abed001dfccd82",
    "create_ts": "2021-09-02T02:22:40.9Z",
    "update_ts": "2021-09-02T02:22:40.905Z"
}

url = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # the save endpoint captured from the browser
headers1 = {
    "Host": "crawler.uibe.info",
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "Authorization": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "sec-ch-ua-mobile": "?0",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Content-Type": "application/json;charset=UTF-8",
    "Origin": "xxxxxxxxxxxxxxxxxxxxxxxxx",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
}
# Content-Length is deliberately omitted: requests computes it from the body,
# and a hard-coded value would not match the serialized JSON.

res = requests.post(url=url, data=json.dumps(data), headers=headers1).text
print(res)
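That covers adding a single spider. To add spiders in batches, the two requests above just need to be wrapped into functions and run in a loop. Below is a minimal sketch of the driver, assuming the upload code is refactored into a hypothetical upload_spider(name) returning (_id, file_id) and the save code into a hypothetical save_spider(name, _id, file_id); neither helper exists in Crawlab itself, they are simply the blocks above turned into functions:

spider_names = ["site_a", "site_b", "site_c"]  # hypothetical list of spider script names

for name in spider_names:
    try:
        spider_id, file_id = upload_spider(name)   # zip the script and POST it as multipart/form-data
        save_spider(name, spider_id, file_id)      # replay the Save request to set cmd and project
        print("added spider:", name)
    except Exception as exc:
        # Keep going so one broken spider does not stop the whole batch.
        print("failed to add spider:", name, exc)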