I recently wrote a script for batch-grabbing website titles. It is slow, not very practical, and full of bugs, so use it with caution! This post is mainly a summary of the pitfalls I hit; treat it as module practice.
Source code:
import requests
import argparse
import re

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

def parser_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", help="specify the domain file")
    # parser.add_argument("-f", "--file", help="specify the domain file", action="store_true")  # not usable: store_true only yields True/False, not the filename
    return parser.parse_args()

def httpheaders(url):
    proxies = {
        'http': 'http://127.0.0.1:8080'
    }
    headers = {
        'Connection': 'close',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    # suppress the InsecureRequestWarning that verify=False would otherwise trigger
    requests.packages.urllib3.disable_warnings()
    res = requests.get(url, proxies=proxies, headers=headers, timeout=10, verify=False)
    res.encoding = res.apparent_encoding
    head = res.headers
    # print('[+]url:'+url,' '+'Content-Type:'+head['Content-Type'])
    title = re.findall("<title>(.*)</title>", res.text, re.IGNORECASE)[0].strip()
    print(bcolors.OKGREEN+'[+]url:'+url, ' '+'title:'+title+' length:'+head['Content-Length']+bcolors.ENDC)

def fileopen(filename):
    with open(filename, 'r') as obj:
        for adomain in obj.readlines():
            adomain = adomain.rstrip('\n')
            try:
                httpheaders(adomain)
            except Exception as e:
                print(bcolors.WARNING+'[+]'+adomain+" Connection refused"+bcolors.ENDC)

if __name__ == "__main__":
    try:
        abc = vars(parser_args())
        a = abc['file']
        fileopen(a)
    except FileNotFoundError as e:
        print('No such file in the current directory: '+a)
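For reference, the title grab above comes down to a single findall call. Here is a standalone check of that regex against a made-up HTML snippet (not part of the original script):

import re

html = '<html><head><TITLE> Example Domain </TITLE></head></html>'   # made-up sample page
matches = re.findall("<title>(.*)</title>", html, re.IGNORECASE)
print(matches[0].strip())    # -> Example Domain
# Caveat: without re.DOTALL, a <title> that is split across several lines will not match,
# and matches[0] then raises IndexError, which the outer except reports as a failed host.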
Pitfalls hit this time:
1. When sending HTTPS requests with Python 3 requests while certificate verification is disabled (verify=False), the console prints the following warning:
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
Seeing this message scroll by on the command line over and over is hard to bear if you care about clean output.
Fix: add requests.packages.urllib3.disable_warnings() before the request code runs.
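The same suppression can also be done through urllib3 directly; a minimal sketch, assuming urllib3 is importable on its own (it ships as a dependency of requests), using self-signed.badssl.com as a public TLS test host:

import urllib3
import requests

# silence only the warning class that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

res = requests.get('https://self-signed.badssl.com/', verify=False, timeout=10)
print(res.status_code)   # no InsecureRequestWarning is printed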
2. On using the argparse module:
def parser_args():
    parser = argparse.ArgumentParser()                      # create the ArgumentParser object
    parser.add_argument("-f", "--file", help="specify the domain file")
    # the key point is the second parameter: --file becomes the key, so with "-f aaa" the value of 'file' is 'aaa'
    # parser.add_argument("-f", "--file", help="specify the domain file", action="store_true")  # not usable: store_true drops the filename
    return parser.parse_args()
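A small usage sketch of that pattern; the script name gettitle.py is assumed only for this example:

# gettitle.py
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--file", help="specify the domain file")
args = vars(parser.parse_args())
print(args['file'])

# $ python gettitle.py -f domains.txt
# domains.txt        <- 'file' is the key, 'domains.txt' the value
# With action="store_true" the option takes no value at all, so the filename could not be
# passed this way; args['file'] would only ever be True or False.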
--------------------------------------------------------------------------------
Divider line; by now it is 2:56 a.m. on March 10.
I have been tortured half to death by bugs. You only regret how little you studied when you actually need the knowledge; if I had studied properly back then it would not have come to this T T
Code time!
This version adds threading, so the script runs rocket-fast.
# -*- coding:utf-8 -*-
import threading, requests, re
import logging
logging.captureWarnings(True)       # route the urllib3 warnings into logging instead of stderr
from concurrent.futures import ThreadPoolExecutor
import argparse
import time
# lock = threading.Lock()

def parser_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", help="specify the domain file")
    return parser.parse_args()

def getTitle(url):
    f = open("resultssq.txt", "a", encoding='utf-8')
    headers = {
        'Connection': 'close',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    try:
        res = requests.get(url, headers=headers, timeout=100, verify=False)
        res.encoding = res.apparent_encoding
        a = res.status_code
        title = re.findall("<title>(.*)</title>", res.text, re.IGNORECASE)[0].strip()
        resurlhttp = url.rsplit('//')
        # an http input that was upgraded to https on the same host: still log it as a hit
        if 'https://'+resurlhttp[1] == res.url.rstrip('/'):
            f.write('[+]'+res.url + "\t----\t" + title + '\n')
            return print('[+]'+res.url + "\t----\t" + title)
        if url == res.url.rstrip('/'):
            f.write('[+]'+res.url + "\t----\t" + title + '\n')
            return print('[+]'+res.url + "\t----\t" + title)
        else:
            f.write('[redirect]'+url+'\t => \t'+res.url + "\t----\t" + title + '\n')
            print('[redirect]'+url+'\t => \t'+res.url + "\t----\t" + title)
    except:
        # time.sleep(10)
        # if requests.get itself failed, res was never assigned, so this print raises again
        # and the thread dies silently; this is the cause of the "missing domains" below
        print(res.status_code, url)
    f = f.close()

a = vars(parser_args())
file = a['file']
try:
    with ThreadPoolExecutor(max_workers=100) as executor:
        for i in open(file, errors="ignore").readlines():
            executor.submit(getTitle, i.strip().strip('\\'))
except:
    print('-f specify the domain file')
Added a redirect feature: if a site actually runs over HTTPS but the input line is only a hostname with no web protocol in the URL, the comparison against res.url lets it be recorded as the https address.
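As an aside, requests also exposes the redirect chain directly through res.history, which would avoid the string surgery on the URL; a rough sketch of that alternative, not what the script above does:

import requests

res = requests.get('http://example.com', timeout=10, allow_redirects=True)
if res.history:                                   # non-empty whenever any redirect happened
    print('[redirect]', res.history[0].url, '=>', res.url)
else:
    print('[+]', res.url)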
After the script finished I checked the result file and found sixty or seventy domains missing. Tracing the error, it turned out the except branch simply never calls .write('xxx'), so it was a false alarm. I tried adding a write inside the except branch... that did not work either.
Copy-pasting the command-line output and comparing, about thirty domains were still missing. Damn!
By the logic I wrote, every URL should have produced some output. Then it hit me: it fails when the site is not alive. I wrote the following code to find the URLs that never got output:
f = open('wwwww.txt', 'r', encoding='utf-8')      # the output that was actually printed
f1 = open('taobao.txt', 'r', encoding='utf-8')    # the full domain list
lists = []
for ii in f.readlines():
    ii = ii.rstrip('\n')
    lists.append(ii)
for i in f1.readlines():
    i = i.rstrip('\n')
    if i not in lists:        # membership test against a list, not against f.read()
        print(i)
Damn, this also had a pit: if you test membership directly against f.read(), every check comes back True.
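A tiny illustration of that pit, with made-up strings: membership against the string returned by f.read() is a substring test, so nearly any fragment matches, while membership against a list of stripped lines is an exact-element comparison:

content = "http://a.example.com\nhttp://b.example.com\n"   # stand-in for f.read(), made up
print("example.com" in content)              # True  - only a substring hit, misleading
print("http://a.example.com" in content)     # True  - also just a substring hit

lines = content.splitlines()
print("example.com" in lines)                # False - exact element comparison
print("http://a.example.com" in lines)       # True  - exact element comparison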
That turned up the culprits that were never output; some of those URLs have a live host IP, and for others the host IP cannot even be resolved.

Feeding these un-output URLs back into the script and debugging, the exception caught was HTTPConnectionPool.
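One way to keep those dead hosts from vanishing would be to catch the connection failure explicitly and record a marker line; a rough sketch under that assumption, not part of the original script:

import requests

def probe(url):
    try:
        res = requests.get(url, timeout=10, verify=False)
        return '[+]\t' + res.url
    except requests.exceptions.RequestException as e:
        # ConnectionError and timeouts land here (the HTTPConnectionPool text comes from urllib3)
        return '[dead]\t' + url + '\t' + e.__class__.__name__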
Script run results:

