A while back I came across someone's Python script for scraping girl-picture galleries, so I figured I'd write a tutorial of my own. Most tutorials lean on third-party modules; here we crawl with nothing but the standard library, and then extend the script with image classification and de-duplication. It survived a full crawl of the site and runs rock solid. I've already pulled down tens of thousands of images; the only real limit is your disk.
The girl-picture site has since been scraped to death and shut down, so treat the code below as reference only.
On the front end, each photo sits inside an img tag: <img src="https://mtl.gzhuibei.com/images/img/10431/5.jpg" alt=
So we can pull the URLs out with a plain regular expression.
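As a quick sanity check, here is the same pattern used later in the script applied to that tag (a tiny illustrative snippet of mine, with the tag completed by hand):

import re

html = '<img src="https://mtl.gzhuibei.com/images/img/10431/5.jpg" alt="">'
# The capture group grabs everything between src=" and the closing quote, ending in .jpg
print(re.findall(r'<img src="([^"]+\.jpg)"', html))
# ['https://mtl.gzhuibei.com/images/img/10431/5.jpg']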
First, generate the list of page links:
# Build the page URLs from a template and return them as a list
def SplicingPage(page,start,end):
    url = []
    for each in range(start,end):
        temporary = page.format(each)
        url.append(temporary)
    return url
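For example, with a page template in the style used later in the article (the exact item id here is made up):

pages = SplicingPage("https://www.meitulu.com/item/10431_{}.html", 2, 5)
print(pages)
# ['https://www.meitulu.com/item/10431_2.html',
#  'https://www.meitulu.com/item/10431_3.html',
#  'https://www.meitulu.com/item/10431_4.html']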
Next, fetch each page using only the built-in urllib:
# Fetch the page source using only the standard library
def GetPageURL(page):
    head = GetUserAgent(page)   # GetUserAgent() builds a randomized header; it is defined in the full script below
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=3)
    if respon.status == 200:
        html = respon.read().decode("utf-8")
        return html
Finally, regex-match the image URLs and download them, and we're done. The code is simple enough to read on its own:
page_list = SplicingPage(str(args.url),2,100)   # args.url is the page template passed in via argparse
for item in page_list:
    respon = GetPageURL(str(item))
    subject = re.findall(r'<img src="([^"]+\.jpg)"',respon,re.S)
    for each in subject:
        img_name = each.split("/")[-1]
        img_type = each.split("/")[-1].split(".")[-1]
        save_name = str(random.randint(1111111,99999999)) + "." + img_type
        print("[+] 原始名称: {} 保存为: {} 路径: {}".format(img_name,save_name,each))
        urllib.request.urlretrieve(each,save_name,None)
You can also extract the URLs with an external library:
from lxml import etree

# `response` is assumed to be a requests.Response for a gallery page
html = etree.HTML(response.content.decode())
src_list = html.xpath('//ul[@id="pins"]/li/a/img/@data-original')
alt_list = html.xpath('//ul[@id="pins"]/li/a/img/@alt')
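And a short follow-up sketch of how those two lists might be consumed; the requests import, the download loop and using the alt text as the file name are my additions, not part of the original snippet:

import requests

# Pair each image URL with its alt text and save it under that name.
for src, alt in zip(src_list, alt_list):
    resp = requests.get(src, timeout=10)
    with open("{}.jpg".format(alt), "wb") as fp:
        fp.write(resp.content)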
Some User-Agent strings for getting past basic anti-crawler checks:
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
# iPhone 6:
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25"
And that's what a run looks like. Alright class, pants back on, everyone! Back to studying!
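Before moving on, a minimal sketch of how a list like the one above is typically consumed: keep it in a constant and pick a random entry for every request (the USER_AGENTS name and the truncated list are placeholders of mine):

import random
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    # ... the rest of the list above ...
]

def fetch(url):
    # Rotate the User-Agent on every request so the traffic looks less uniform.
    req = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
    return urllib.request.urlopen(req, timeout=10).read()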
Now for a bonus topic: automatically classifying images with Python. The basic idea behind detecting NSFW pictures is to read every pixel of the image into memory, mark skin-colored pixels as white and clothing as black, work out the pixel area of the whole figure, and then compute the ratio of skin to clothing; if it exceeds a predefined threshold, the image is flagged. That is the principle. A real implementation needs a stack of supporting algorithms, and Python has a library for it: pip install Pillow porndetective
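To make the idea concrete, here is a heavily simplified sketch of the skin-ratio heuristic using only Pillow. The RGB thresholds and the 30% cut-off are illustrative guesses of mine, nothing like the library's real algorithm:

from PIL import Image

def skin_ratio(path, threshold=0.30):
    # Count pixels that fall in a crude "skin tone" RGB range and compare
    # that count against the total number of pixels.
    img = Image.open(path).convert("RGB")
    pixels = list(img.getdata())
    skin = sum(1 for r, g, b in pixels if r > 95 and g > 40 and b > 20 and r > g and r > b)
    ratio = skin / len(pixels)
    return ratio, ratio > threshold

print(skin_ratio("1.jpg"))  # e.g. (0.12, False), illustrative output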
Using the actual library:
>>> from porndetective import PornDetective
>>> test=PornDetective("c://1.jpg")
>>> test.parse()
c://1.jpg JPEG 1600×2400: result=True message='Porn Pic!!'
<porndetective.PornDetective object at 0x0000021ACBA0EFD0>
>>>
>>> test=PornDetective("c://2.jpg")
>>> test.parse()
c://2.jpg JPEG 1620×2430: result=False message='Total skin percentage lower than 15 (12.51)'
<porndetective.PornDetective object at 0x0000021ACBA5F5E0>
>>> test.result
False
Those are the detection results. The accuracy isn't great; strictly speaking the first image isn't really NSFW at all. Still, you could crawl an entire gallery, run every picture through this library, keep the ones it flags and delete the rest, so that only the "premium" material survives.
The library's algorithm has real problems: by this logic, a belly-dance photo would also get flagged. Machine learning is normally used for this and gets much better accuracy; a hard-coded heuristic like this is merely passable and can't do any deeper analysis. Whether an image is NSFW can't be judged from exposed skin alone; you'd have to weigh pose, degree of exposure, clothing type and so on. Still, it's good enough for sifting the better material out of a huge pile of images, which you could do like this:
from PIL import Image
import os
from porndetective import PornDetective

if __name__ == "__main__":
    img_dic = os.listdir("./meizitu/")
    for each in img_dic:
        img = Image.open("./meizitu/{}".format(each))
        width = img.size[0]    # width
        height = img.size[1]   # height
        # Shrink the image first so detection runs faster
        img = img.resize((int(width*0.3), int(height*0.3)), Image.ANTIALIAS)
        img.save("image.jpg")
        test = PornDetective("./image.jpg")
        test.parse()
        if test.result == True:
            print("{} 图片大赞,自动为你保留.".format(each))
        else:
            print("----> {} 图片正常,自动清除,节约空间,存着真的是浪费资源老铁".format(each))
            os.remove("./meizitu/"+str(each))
Next, de-duplicating the images. This code took me a while; I had no idea how to approach it at first, then it clicked. The principle: compute a CRC32 checksum for every image, compare the checksums, keep a mapping from file name to checksum, and from that locate the redundant copies so you can delete the extras and keep just one. Here is the idea sketched out (a version that actually deletes the extras follows the function):
import zlib,os

def Find_Repeat_File(file_path,file_type):
    Catalogue = os.listdir(file_path)
    CatalogueDict = {}   # lookup dict: file name -> CRC32, for later queries
    for each in Catalogue:
        path = (file_path + each)
        if os.path.splitext(path)[1] == file_type:
            with open(path,"rb") as fp:
                crc32 = zlib.crc32(fp.read())
                # print("[*] 文件名: {} CRC32校验: {}".format(path,str(crc32)))
                CatalogueDict[each] = str(crc32)
    CatalogueList = []
    for value in CatalogueDict.values():
        # Pull every CRC32 value out of the dict into CatalogueList
        CatalogueList.append(value)
    CountDict = {}
    for each in CatalogueList:
        # Map each CRC32 value to the number of times it occurs
        CountDict[each] = CatalogueList.count(each)
    RepeatFileFeatures = []
    for key,value in CountDict.items():
        # Any checksum seen more than once marks a duplicate
        if value > 1:
            print("[-] 文件特征: {} 重复次数: {}".format(key,value))
            RepeatFileFeatures.append(key)
    for key,value in CatalogueDict.items():
        # Point back at the files whose CRC32 is among the duplicates
        if value in RepeatFileFeatures:
            print("[*] 重复文件所在目录: {}".format(file_path + key))

if __name__ == "__main__":
    Find_Repeat_File("D://python/",".jpg")
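And a minimal clean-up sketch of mine following the same idea: keep the first file seen for each CRC32 value and delete the rest.

import os, zlib

def Remove_Repeat_File(file_path, file_type=".jpg"):
    seen = {}  # CRC32 -> first file that had this checksum
    for name in os.listdir(file_path):
        path = os.path.join(file_path, name)
        if os.path.splitext(path)[1] != file_type:
            continue
        with open(path, "rb") as fp:
            crc32 = zlib.crc32(fp.read())
        if crc32 in seen:
            print("[-] removing duplicate: {}".format(path))
            os.remove(path)      # duplicate of seen[crc32], delete it
        else:
            seen[crc32] = path   # keep the first copy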
Come on, little bro, let's go talk shop. Learn the tech well and every day is a feast.
The final spider code:
import os,re,random,urllib,argparse
from urllib import request,parse

# Build a randomized request header
def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = str(url + "/" + "".join(random.sample("abcdef23457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

# Fetch the page source using only the standard library
def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=3)
    if respon.status == 200:
        html = respon.read().decode("utf-8")   # or "gbk", depending on the page encoding
        return html

# Build the page URLs from a template and return them as a list
def SplicingPage(page,start,end):
    url = []
    for each in range(start,end):
        temporary = page.format(each)
        url.append(temporary)
    return url

if __name__ == "__main__":
    urls = "https://www.meitulu.com/item/{}_{}.html".format(str(random.randint(1000,20000)),"{}")
    page_list = SplicingPage(urls,2,100)
    for item in page_list:
        try:
            respon = GetPageURL(str(item))
            subject = re.findall(r'<img src="([^"]+\.jpg)"',respon,re.S)
            for each in subject:
                img_name = each.split("/")[-1]
                img_type = each.split("/")[-1].split(".")[-1]
                save_name = str(random.randint(11111111,999999999)) + "." + img_type
                print("[+] 原始名称: {} 保存为: {} 路径: {}".format(img_name,save_name,each))
                #urllib.request.urlretrieve(each,save_name,None)   # download without a custom header
                head = GetUserAgent(str(urls))                     # pick a random request header
                ret = urllib.request.Request(each,headers=head)    # each = the image URL
                respons = urllib.request.urlopen(ret,timeout=10)   # open the image URL
                with open(save_name,"wb") as fp:
                    fp.write(respons.read())
        except Exception:
            # On failure, clean up: delete any image under 100 KB in the current directory
            for each in os.listdir():
                if each.split(".")[-1] == "jpg":
                    if int(os.stat(each).st_size / 1024) < 100:
                        print("[-] 自动清除 {} 小于100kb文件.".format(each))
                        os.remove(each)
            exit(1)
The end result: highly concurrent downloading, with a clear division of labour (one part cleans up duplicates, one deletes anything under 150 KB, one does the crawling; you get to be the site foreman). Pulling an all-nighter tonight.
There is still plenty to optimise in the code above. For example, it crawls random galleries, but suppose we only want a specific subset. To improve it, first collect the links we need: find all the <a> tags on the listing page and extract their targets.
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    get_url = []
    urls = requests.get("https://www.meitulu.com/t/youhuo/")
    soup = BeautifulSoup(urls.text,"html.parser")
    soup_ret = soup.select('div[class="boxs"] ul[class="img"] a')
    for each in soup_ret:
        if str(each["href"]).endswith("html"):
            get_url.append(each["href"])
    for item in get_url:
        for each in range(2,30):
            url = item.replace(".html","_{}.html".format(each))
            with open("url.log","a+") as fp:
                fp.write(url + "\n")
Then just loop over the saved URLs and crawl them. There is no multithreading here, so it will be a bit slow (a thread-pool sketch follows the block):
from bs4 import BeautifulSoup
import requests,random

def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = str(url + "/" + "".join(random.sample("abcdef23457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

url = []
with open("url.log","r") as fp:
    files = fp.readlines()
    for i in files:
        url.append(i.replace("\n",""))

for i in range(0,9999):
    aget = GetUserAgent(url[i])
    try:
        ret = requests.get(url[i],timeout=10,headers=aget)
        if ret.status_code == 200:
            soup = BeautifulSoup(ret.text,"html.parser")
            soup_ret = soup.select('div[class="content"] img')
            for x in soup_ret:
                try:
                    down = x["src"]
                    save_name = str(random.randint(11111111,999999999)) + ".jpg"
                    print("xiazai -> {}".format(save_name))
                    img_download = requests.get(url=down, headers=aget, stream=True)
                    with open(save_name,"wb") as fp:
                        for chunk in img_download.iter_content(chunk_size=1024):
                            fp.write(chunk)
                except Exception:
                    pass
    except Exception:
        pass
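If you want it faster, a thread pool is a straightforward change. This is a sketch of mine, where crawl_one would hold the same per-URL logic as the loop above:

from concurrent.futures import ThreadPoolExecutor

def crawl_one(u):
    aget = GetUserAgent(u)
    try:
        ret = requests.get(u, timeout=10, headers=aget)
        # ... same parsing and downloading logic as above ...
    except Exception:
        pass

# Crawl up to 8 pages at a time instead of one after another.
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(crawl_one, url)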
I'm also publishing the crawlers for two other sites. First, wuso:
import os,urllib,random,argparse,sys
from urllib import request,parse
from bs4 import BeautifulSoup

def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = url + str("/" + "".join(random.sample("abcdefghi123457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=30)
    if respon.status == 200:
        html = respon.read().decode("utf-8")
        return html

if __name__ == "__main__":
    runt = []
    waibu = GetPageURL("https://xxx.me/forum.php?mod=forumdisplay&fid=48&typeid=114&filter=typeid&typeid=114")
    soup1 = BeautifulSoup(waibu,"html.parser")
    ret1 = soup1.select("div[id='threadlist'] ul[id='waterfall'] a")
    for x in ret1:
        runt.append(x.attrs["href"])
    for ss in runt:
        print("[+] 爬行: {}".format(ss))
        try:
            resp = []
            respon = GetPageURL(str(ss))
            soup = BeautifulSoup(respon,"html.parser")
            ret = soup.select("div[class='pct'] div[class='pcb'] td[class='t_f'] img")
            try:
                for i in ret:
                    url = "https://xxx.me/" + str(i.attrs["file"])
                    print(url)
                    resp.append(url)
            except Exception:
                pass
            for each in resp:
                try:
                    img_name = each.split("/")[-1]
                    print("down: {}".format(img_name))
                    head=GetUserAgent("https://wuso.me")
                    ret = urllib.request.Request(each,headers=head)
                    respons = urllib.request.urlopen(ret,timeout=60)
                    with open(img_name,"wb") as fp:
                        fp.write(respons.read())
                except Exception:
                    pass
        except Exception:
            pass
2.0
import os,urllib,random,argparse,sys
from urllib import request,parse
from bs4 import BeautifulSoup

def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = url + str("/" + "".join(random.sample("abcdefghi123457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=30)
    if respon.status == 200:
        html = respon.read().decode("utf-8")
        return html

# Collect every thread link on the current listing page
def getpage():
    # https://.me/forum.php?mod=forumdisplay&fid=48&filter=typeid&typeid=17
    waibu = GetPageURL("https://.me/forum.php?mod=forumdisplay&fid=48&filter=typeid&typeid=17")
    soup1 = BeautifulSoup(waibu,"html.parser")
    ret1 = soup1.select("div[id='threadlist'] ul[id='waterfall'] a")
    for x in ret1:
        print(x.attrs["href"])

# Extract the image URLs from a thread page
def get_page_image(url):
    respon = GetPageURL(str(url))
    soup = BeautifulSoup(respon,"html.parser")
    ret = soup.select("div[class='pcb'] div[class='pattl'] div[class='mbn savephotop'] img")
    resp = []
    try:
        for i in ret:
            url = "https://.me/" + str(i.attrs["file"])
            print(url)
            resp.append(url)
    except Exception:
        pass
    return resp

# Download
if __name__ == "__main__":
    # https://.me/forum.php?mod=viewthread&tid=747730&extra=page%3D1%26filter%3Dtypeid%26typeid%3D17
    # python main.py ""
    args = sys.argv
    user = str(args[1])
    resp = get_page_image(user)
    for each in resp:
        try:
            img_name = each.split("/")[-1]
            head=GetUserAgent("https://.me")
            ret = urllib.request.Request(each,headers=head)
            respons = urllib.request.urlopen(ret,timeout=10)
            with open(img_name,"wb") as fp:
                fp.write(respons.read())
            print("down: {}".format(img_name))
        except Exception:
            pass
The second crawler: this one runs multithreaded, and a separate launcher spawns multiple processes, so it crawls extremely fast, pegging the CPU at 100%.
import os,sys
import subprocess

# lis.log holds one model name per line
fp = open("lis.log","r")
aaa = fp.readlines()
for i in aaa:
    nam = i.replace("\n","")
    cmd = "python thread.py " + nam
    os.popen(cmd)
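os.popen fires the commands off with no way to wait for them. A slightly tidier launcher (a sketch of mine with the same behaviour, plus the ability to wait for every worker) could look like this:

import subprocess, sys

procs = []
with open("lis.log", "r") as fp:
    for line in fp:
        name = line.strip()
        if name:
            # One worker process per name; thread.py is the worker shown below.
            procs.append(subprocess.Popen([sys.executable, "thread.py", name]))

for p in procs:
    p.wait()  # block until every worker has finished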
The multithreaded worker (thread.py):
import requests,random
from bs4 import BeautifulSoup
import os,re,random,urllib,argparse
from urllib import request,parse
import threading,sys

def GetUserAgent(url):
    head = ["Windows; U; Windows NT 6.1; en-us","Windows NT 6.3; x86_64","Windows U; NT 6.2; x86_64",
            "Windows NT 6.1; WOW64","X11; Linux i686;","X11; Linux x86_64;","compatible; MSIE 9.0; Windows NT 6.1;",
            "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows NT 6.0",
            "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;",]
    fox = ["Chrome/60.0.3100.0","Chrome/59.0.2100.0","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
           "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
           "Version/4.0 Safari/534.13","wOSBrowser/233.70 Safari/534.6 TouchPad/1.0","BrowserNG/7.1.18124"]
    agent = "Mozilla/5.0 (" + str(random.sample(head,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
            + ".36 (KHTML, like Gecko) " + str(random.sample(fox,1)[0])
    refer = url
    UserAgent = {"User-Agent": agent,"Referer":refer}
    return UserAgent

def run(user):
    head = GetUserAgent("aHR0cHM6Ly93d3cuYW1ldGFydC5jb20v")
    ret = requests.get("aHR0cHM6Ly93d3cuYW1ldGFydC5jb20vbW9kZWxzL3t9Lw==".format(user),headers=head,timeout=3)
    scan_url = []
    if ret.status_code == 200:
        soup = BeautifulSoup(ret.text,"html.parser")
        a = soup.select("div[class='thumbs'] a")
        for each in a:
            url = "aHR0cHM6Ly93d3cuYW1ldGFydC5jb20v" + str(each["href"])
            scan_url.append(url)
        rando = random.choice(scan_url)
        print("随机编号: {}".format(rando))
        try:
            ret = requests.get(url=str(rando),headers=head,timeout=10)
            if ret.status_code == 200:
                soup = BeautifulSoup(ret.text,"html.parser")
                img = soup.select("div[class='container'] div div a")
                try:
                    for each in img:
                        head = GetUserAgent(str(each["href"]))
                        down = requests.get(url=str(each["href"]),headers=head)
                        img_name = str(random.randint(100000000,9999999999)) + ".jpg"
                        print("[+] 图片解析: {} 保存为: {}".format(each["href"],img_name))
                        with open(img_name,"wb") as fp:
                            fp.write(down.content)
                except Exception:
                    pass
        except Exception:
            exit(1)

if __name__ == "__main__":
    args = sys.argv
    user = str(args[1])
    try:
        os.mkdir(user)
        os.chdir("D://python/test/" + user)
        for item in range(100):
            t = threading.Thread(target=run,args=(user,))
            t.start()
    except FileExistsError:
        exit(0)
Run 20 processes, each carrying 100 threads, and you get roughly 1,500 concurrent requests per second. Because the de-dup job keeps scanning in the background, there are no duplicate images and only the highest-quality copy is kept. Funny discovery: once you have this many girl pics, none of them look good anymore, hahaha.
After crawling the site we have tens of thousands of pictures, but what if we want to see the photo sets of one particular model? Enter the AI face-recognition squad: with some simple machine learning we can recognise a specific face and filter for exactly the pictures we want.
import cv2
import numpy as np

def Display_Face(img_path):
    img = cv2.imread(img_path)                    # read the image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to grayscale
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")  # load the Haar cascade classifier
    face_cascade.load("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        # Draw the bounding box on the original image (blue, thickness 3)
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 3)
    cv2.namedWindow("img",0)
    cv2.resizeWindow("img", 300, 400)
    cv2.imshow('img', img)
    cv2.waitKey()

def Return_Face(img_path):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if (len(faces) == 0):
        return None,None
    (x, y, w, h) = faces[0]
    return gray[y:y + h, x:x + w], faces[0]   # slice rows by height, columns by width

ret = Return_Face("./meizi/172909315.jpg")
print(ret)
Display_Face("./meizi/172909315.jpg")
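One practical note: the code above expects haarcascade_frontalface_default.xml to sit next to the script. If you installed opencv-python, you can also load the bundled copy (a small sketch, assuming cv2.data is available in your build):

import cv2

# Point the classifier at the cascade file shipped with opencv-python.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

Next, the training script, which builds an LBPH recognizer from folders of face photos (one folder per person):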
import cv2,os
import numpy as np

def Return_Face(img_path):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if (len(faces) == 0):
        return None,None
    (x, y, w, h) = faces[0]
    return gray[y:y + h, x:x + w], faces[0]

# Load the images: read an ORL-style face database (one folder per person) and prepare the training data
def LoadImages(data):
    images=[]
    names=[]
    labels=[]
    label=0
    # Walk every sub-folder
    for subdir in os.listdir(data):
        subpath=os.path.join(data,subdir)
        #print('path',subpath)
        # Only process entries that are directories
        if os.path.isdir(subpath):
            # Each folder holds many photos of one person
            names.append(subdir)
            # Walk the image files in the folder
            for filename in os.listdir(subpath):
                imgpath=os.path.join(subpath,filename)
                img=cv2.imread(imgpath,cv2.IMREAD_COLOR)
                gray_img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
                #cv2.imshow('1',img)
                #cv2.waitKey(0)
                images.append(gray_img)
                labels.append(label)
            label+=1
    images=np.asarray(images)
    #names=np.asarray(names)
    labels=np.asarray(labels)
    return images,labels,names

images,labels,names = LoadImages("./")

# Create the LBPH recognizer and start training
face_recognizer = cv2.face.LBPHFaceRecognizer_create()
face_recognizer.train(images, labels)
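The snippet above only trains the recognizer. Here is a minimal sketch of how the trained model might then filter the crawled folder; the "some_model" folder name, the ./picked/ destination and the confidence cut-off of 60 are assumptions of mine:

import os, shutil

target_label = names.index("some_model")   # hypothetical folder name of the person we want
for each in os.listdir("./meizitu/"):
    face, rect = Return_Face("./meizitu/" + each)
    if face is None:
        continue
    label, confidence = face_recognizer.predict(face)
    # For LBPH, lower confidence means a closer match; 60 is an arbitrary cut-off.
    if label == target_label and confidence < 60:
        shutil.copy("./meizitu/" + each, "./picked/" + each)   # assumes ./picked/ exists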
Other crawlers collected from around the web, for reference:
1
# -*- coding: UTF-8 -*-
import sys,requests
from bs4 import BeautifulSoup
sys.path.append("/Python")
import conf.mysql_db as mysqldb

image_count = 1

# Grab the info for every photo set on the listing page
def get_photo_info(url,layout_tablename):
    global PhotoNames
    html = get_html(url)
    # html = fread('ttb.html')
    soup = BeautifulSoup(html, "lxml")
    db = mysqldb.Database()
    icount = 1
    for ul in soup.find_all(class_ = 'ul960c'):
        for li in ul:
            if (str(li).strip()):
                PhotoName = li.span.string
                PhotoUrl = li.img['src']
                imageUrl = 'http://www.quantuwang.co'+li.a['href']
                print('第'+str(icount)+'套图:'+PhotoName+' '+PhotoUrl+' '+imageUrl)
                sql = "insert into "+layout_tablename+"(picname,girlname,picpath,flodername) values('%s','%s'," \
                      "'%s','%s')" % (imageUrl,PhotoName,PhotoUrl,PhotoName)
                db.execute(sql)
                icount = icount + 1
    db.close()
    return True

# Find every image inside a photo set and save its info
def get_images(image_tablename,pic_nums,pic_title,url,layout_count):
    global image_count
    db = mysqldb.Database()
    try:
        for i in range(1, int(pic_nums)):
            pic_url = url[:-5] + str(i) + '.jpg'
            sql = "insert into "+image_tablename+"(id,imageid,flodername,imagepath) " \
                  "values (" + str(i) + ","+str(image_count)+",'" + pic_title + "','" + pic_url + "')"
            db.execute(sql)
            print('第'+str(layout_count)+'套写真'+str(image_count)+',第'+str(i)+'张图片:'+pic_title+' url:'+pic_url)
            image_count = image_count + 1
    except Exception as e:
        print('Error',e)
    db.close()

# Collect the per-page links of a photo set (returns how many there are)
def get_image_pages(url):
    html = get_html(url)
    soup = BeautifulSoup(html, "lxml")
    # print(html)
    image_pages = []
    image_pages.append(url)
    try:
        for ul in soup.find_all(class_='c_page'):
            for li in ul.find_all('a'):
                image_pages.append('http://www.quantuwang.co/'+li.get('href'))
    except Exception as e:
        print('Error',e)
    return len(image_pages)

# Fetch a page: takes a URL, returns its HTML source
def get_html(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # 'Accept-Encoding': 'gzip, deflate',
        # 'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    }
    resp = requests.get(url,headers=headers)
    resp.encoding='utf-8'
    html = resp.text
    # fwrite(html)
    return html

# Build the URL for page i
def handle_url(i,url):
    if i == 1:
        return url
    else:
        url = url[:-5] + "_" + format(i) + ".html"
        return url

def main():
    global image_count
    # image_count = 1391
    url = 'http://www.quantuwang.co/t/f4543e3a7d545391.html'
    layoyt_name = '糯美子Mini'
    layout_tablename = 'pc_dic_'+'nuomeizi'
    image_tablename = 'po_'+'nuomeizi'
    # Clone the table structure
    db = mysqldb.Database()
    try:
        sql = "create table if not exists "+layout_tablename+"(LIKE pc_dic_toxic)"
        db.execute(sql)
        print('创建表:'+layout_tablename)
        sql = "create table if not exists " + image_tablename + "(LIKE po_toxic)"
        db.execute(sql)
        print('创建表:'+image_tablename)
    except Exception as e:
        print('Error',e)
    db.close()
    # Step 1: scrape the listing page
    get_photo_info(url,layout_tablename)
    # Step 2: find every image in each photo set and insert it into the database
    layout_count = 1
    db = mysqldb.Database()
    sql = 'select * from '+layout_tablename+' where ID>0'
    results = db.fetch_all(sql)
    for row in results:
        # Work out how many pages (images) the photo set has
        imgage_nums = get_image_pages(row['picname']) + 1
        get_images(image_tablename,imgage_nums,row['flodername'],row['picpath'],layout_count)
        layout_count = layout_count + 1
    db.close()
    # Update the master table
    db = mysqldb.Database()
    try:
        sql = "select max(imageid) as maxcount from "+image_tablename
        results = db.fetch_one(sql)
        sql = "insert into pc_dic_lanvshen(BeautyName,MinID,MaxID,TableName,IndexName,IndexType) values ('%s',%d,%d,'%s'," \
              "'%s',%d)" % (layoyt_name,1,int(results['maxcount']),image_tablename,layout_tablename,1)
        db.execute(sql)
        print('数据已更新到总表:'+layout_tablename+' '+image_tablename)
    except Exception as e:
        print('Error',e)
    db.close()

if __name__ == '__main__':
    main()
2
#!/usr/local/Cellar/python/3.7.3/bin
# -*- coding: UTF-8 -*-
# https://www.meitulu.com
import sys,requests,time,random,re
from bs4 import BeautifulSoup
sys.path.append("/Python")
import conf.mysql_db as mysqldb

album_count = 1
image_count = 1

# Grab the info for every photo set on a tag page
def get_photo_info(url,layout_tablename):
    global album_count
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    }
    req = requests.get(url, headers=headers)
    req.encoding = 'utf-8'
    # print(req.text)
    soup = BeautifulSoup(req.text, "lxml")
    db = mysqldb.Database()
    for ul in soup.find_all(class_ = 'img'):
        for li in ul:
            if (str(li).strip()):
                AlbumName = li.img['alt']
                AlbumNums = re.findall(r"\d+\.?\d*", li.p.string)[0]
                AlbumUrl = li.a['href']
                PhotoUrl = li.img['src']
                print('第'+str(album_count)+'套图:'+AlbumName+' '+AlbumUrl+' '+PhotoUrl)
                sql = "insert into "+layout_tablename+"(picname,girlname,picpath,imageid,flodername) values('%s','%s','%s','%s','%s')" % (AlbumUrl,AlbumName,PhotoUrl,AlbumNums,AlbumName)
                db.execute(sql)
                album_count = album_count + 1
    db.close()
    return True

# Save the info for every image in a photo set
def get_images(image_tablename,image_nums,flodername,image_url,albumID):
    global image_count
    db = mysqldb.Database()
    for i in range(1, int(image_nums)+1):
        image_path = image_url[:-6] + '/' + str(i) + '.jpg'
        sql = "insert into " + image_tablename + "(imageid,flodername,imagepath,id) values('%s','%s','%s','%s')" % (image_count, flodername, image_path, i)
        db.execute(sql)
        print('第'+str(albumID)+'套写真'+str(image_count)+',第'+str(i)+'张图片:'+flodername+' url:'+image_path)
        image_count = image_count + 1
    db.close()

# Check whether a page exists
def get_html_status(url):
    req = requests.get(url).status_code
    if(req == 200):
        return True
    else:
        return False

def main():
    global album_count
    global image_count
    # image_count = 1391
    url = 'https://www.meitulu.com/t/dingziku/'
    album_name = '丁字裤美女'
    album_tablename = 'pc_dic_'+'dingziku'
    image_tablename = 'po_'+'dingziku'
    # Clone the table structure
    db = mysqldb.Database()
    try:
        sql = "create table if not exists "+album_tablename+"(LIKE pc_dic_toxic)"
        db.execute(sql)
        print('创建表:'+album_tablename)
        sql = "create table if not exists " + image_tablename + "(LIKE po_toxic)"
        db.execute(sql)
        print('创建表:'+image_tablename)
    except Exception as e:
        print('Error',e)
    db.close()
    # Step 1: scrape the listing pages
    get_photo_info(url,album_tablename)
    for i in range(2,100):
        urls = url +str(i)+'.html'
        # urls = url +str(i)+'.html'
        if(get_html_status(urls)):
            get_photo_info(urls,album_tablename)
            time.sleep(random.randint(1, 3))
        else:
            break
    # Step 2: find every image in each photo set and insert it into the database
    db = mysqldb.Database()
    sql = 'select * from '+album_tablename+' where ID>0'
    results = db.fetch_all(sql)
    for row in results:
        get_images(image_tablename,row['imageid'],row['flodername'],row['picpath'],row['ID'])
    db.close()
    # Update the master table
    db = mysqldb.Database()
    try:
        sql = "select max(imageid) as maxcount from "+image_tablename
        results = db.fetch_one(sql)
        sql = "insert into pc_dic_lanvshen(BeautyName,MinID,MaxID,TableName,IndexName,IndexType) values ('%s',%d,%d,'%s'," \
              "'%s',%d)" % (album_name,1,int(results['maxcount']),image_tablename,album_tablename,1)
        db.execute(sql)
        print('数据已更新到总表:'+album_tablename+' '+image_tablename)
    except Exception as e:
        print('Error',e)
    db.close()

if __name__ == '__main__':
    main()
3
#!/usr/local/Cellar/python/3.7.3/bin
# -*- coding: UTF-8 -*-
# https://www.lanvshen.com
import sys,requests,re,time,random
from bs4 import BeautifulSoup
sys.path.append("/Python")
import conf.mysql_db as mysqldb

layout_count = 1
image_count = 1

# Find the info for every photo set
def get_layout(url,layout_tablename):
    global layout_count
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    }
    req = requests.get(url, headers=headers)
    req.encoding = 'utf-8'
    # print(req.text)
    soup = BeautifulSoup(req.text, "lxml")
    db = mysqldb.Database()
    try:
        for ul1 in soup.find_all(class_='hezi'):
            for ul2 in ul1:
                if(str(ul2).strip()):
                    for li in ul2:
                        if (str(li).strip()):
                            layout_url = li.a['href']
                            cover_url = li.img['src']
                            layout_nums = re.findall(r'(\d+)', li.span.string)[0]
                            layout_name = li.find_all("p", class_="biaoti")[0].a.string
                            print('第'+str(layout_count)+'套写真:'+layout_name+" url:"+layout_url)
                            # print('写真集:'+layout_name+' 图片数:'+str(layout_nums)+' 链接:'+cover_url)
                            sql = "insert into "+layout_tablename+"(ID,picname,girlname,picpath,imageid,flodername) values (" +\
                                  str(layout_count)+ ",'" + layout_url + "','" + layout_name + "','"+cover_url+"',"+str(layout_nums)+",'" + layout_name + "')"
                            db.execute(sql)
                            layout_count=layout_count+1
    except Exception as e:
        print('Error',e)
    db.close()

# Find every image inside a photo set
def get_images(image_tablename,pic_nums,pic_title,url):
    global image_count
    global layout_count
    url_num = re.findall(r'(\d+)', url)[0]
    db = mysqldb.Database()
    for i in range(1, int(pic_nums)):
        pic_url = 'https://img.hywly.com/a/1/' + url_num + '/' + str(i) + '.jpg'
        sql = "insert into "+image_tablename+"(id,imageid,flodername,imagepath) " \
              "values (" + str(i) + ","+str(image_count)+",'" + pic_title + "','" + pic_url + "')"
        db.execute(sql)
        print('第'+str(layout_count)+'套写真,第'+str(i)+'张图片:'+pic_title+' url:'+pic_url)
        image_count = image_count + 1
    db.close()

# Check whether a page exists
def get_html_status(url):
    req = requests.get(url).status_code
    if(req == 200):
        return True
    else:
        return False

def main():
    global layout_count
    url='https://www.lanvshen.com/s/16/'
    layoyt_name = '蕾丝美女'
    layout_tablename = 'pc_dic_'+'leisi'
    image_tablename = 'po_'+'leisi'
    # Clone the table structure
    db = mysqldb.Database()
    try:
        sql = "create table if not exists "+layout_tablename+"(LIKE pc_dic_toxic)"
        db.execute(sql)
        print('创建表:'+layout_tablename)
        sql = "create table if not exists " + image_tablename + "(LIKE po_toxic)"
        db.execute(sql)
        print('创建表:'+image_tablename)
    except Exception as e:
        print('Error',e)
    db.close()
    # Find every photo set and insert it into the database
    get_layout(url,layout_tablename)
    for i in range(1,100):
        urls = url + 'index_'+str(i)+'.html'
        # urls = url +str(i)+'.html'
        if(get_html_status(urls)):
            get_layout(urls,layout_tablename)
            time.sleep(random.randint(1, 3))
        else:
            break
    # Find every image in each set and insert it into the database
    layout_count = 1
    db = mysqldb.Database()
    sql = 'select * from '+layout_tablename+' order by ID'
    results = db.fetch_all(sql)
    for row in results:
        get_images(image_tablename,row['imageid'],row['flodername'],row['picname'])
        layout_count = layout_count + 1
    db.close()
    # Update the master table
    db = mysqldb.Database()
    try:
        sql = "select max(imageid) as maxcount from "+image_tablename
        results = db.fetch_one(sql)
        sql = "insert into pc_dic_lanvshen(BeautyName,MinID,MaxID,TableName,IndexName,IndexType) values ('%s',%d,%d,'%s'," \
              "'%s',%d)" % (layoyt_name,1,int(results['maxcount']),image_tablename,layout_tablename,1)
        db.execute(sql)
        print('数据已更新到总表:'+layout_tablename+' '+image_tablename)
    except Exception as e:
        print('Error',e)
    db.close()

if __name__ == '__main__':
    main()
Cough, cough. Quick, Python, help me up, I can still learn to operate an excavator. To be continued...