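Set up a proxy (VPN) before you start; without one, the request comes straight back with a 403.
1. Packages used
requests: opens the connection to the server, sends requests and receives data (other packages such as socket would also work, but requests is the simplest and easiest to use)
BeautifulSoup: parses the data received from the server
urllib: downloads the images from the web page to local disk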
import requests
from bs4 import BeautifulSoup
import urllib.request
2. Fetch the HTML of the target page and parse it
Here I use "blowjob" as the keyword
key_word='blowjob'
url = 'https://www.pornhub.com/video/search?search=' + key_word
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
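Plain string concatenation is fine for a single-word keyword, but a keyword containing spaces or characters such as '&' needs to be percent-encoded before it goes into the query string. A minimal sketch, assuming the search endpoint accepts standard form-style encoding; `build_search_url` and `BASE_URL` are hypothetical names, not part of the original script:

```python
from urllib.parse import quote_plus

BASE_URL = 'https://www.pornhub.com/video/search?search='

def build_search_url(key_word):
    """Return the search URL with key_word percent-encoded (hypothetical helper)."""
    # quote_plus turns spaces into '+' and escapes reserved characters such as '&'
    return BASE_URL + quote_plus(key_word)

print(build_search_url('two words'))
```

requests can also do this encoding itself if the keyword is passed through the `params` argument of `requests.get` instead of being pasted into the URL.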
3. Select all images from the HTML and iterate over them
Use find_all to collect every img tag that carries the attribute 'width':"150" into the list jpg_data, then loop over jpg_data
jpg_data = soup.find_all('img', attrs={'width': "150"})
for cur in jpg_data:
    pic_src = cur['src']
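4. Filter further, find the image URL and download it
cur['src'] is the URL of the current image and cur['alt'] is its title. urllib.request.urlretrieve saves the image to disk; with a bare file name it lands in the current working directory (the directory of this .py file when the script is run from there), and a custom save directory can be used if needed.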
for cur in jpg_data:
    pic_src = cur['src']
    # Keep only entries whose URL points at a .jpg file
    if '.jpg' in pic_src:
        filename = cur['alt'] + '.jpg'
        # Fetch the image at pic_src and save it under filename
        urllib.request.urlretrieve(pic_src, filename)
        print(filename)
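Because the file name is built from the image title in cur['alt'], a title containing characters such as '/' or ':' can make the save fail, and saving into a custom directory requires that the directory exists. A minimal sketch of one way to handle both; `safe_filename` and the directory name `thumbs` are hypothetical and not part of the original script:

```python
import os
import re

def safe_filename(title, save_dir='.', ext='.jpg'):
    """Build a path inside save_dir from an image title (hypothetical helper)."""
    # Replace characters that are not allowed in file names on common systems
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    # Fall back to a fixed name if nothing is left after cleaning
    if not cleaned:
        cleaned = 'untitled'
    # Create the target directory if it does not exist yet
    os.makedirs(save_dir, exist_ok=True)
    return os.path.join(save_dir, cleaned + ext)

print(safe_filename('a/b:c', save_dir='thumbs'))
```

Inside the loop, the result would be passed as the second argument of the download call, for example `urllib.request.urlretrieve(pic_src, safe_filename(cur['alt'], save_dir='thumbs'))`.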
Full code:
import requests
from bs4 import BeautifulSoup
import urllib.request

# Send a browser-like User-Agent with the page request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
url = 'https://www.pornhub.com/video/search?search=blowjob'
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.content, 'html.parser')
# Collect every img tag whose width attribute is "150"
jpg_data = soup.find_all('img', attrs={'width': "150"})
for cur in jpg_data:
    pic_src = cur['src']
    # Keep only entries whose URL points at a .jpg file
    if '.jpg' in pic_src:
        filename = cur['alt'] + '.jpg'
        # Fetch the image at pic_src and save it under filename
        urllib.request.urlretrieve(pic_src, filename)
        print(filename)
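The example above only scrapes the images on the first page of search results for the keyword. To scrape several pages, append '&page=xx' to the url and loop over the page numbers.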
for i in range(0, 10):
    url = 'https://www.pornhub.com/video/search?search=blowjob' + '&page=' + str(i)
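The loop can be wrapped in a small generator so that each page URL is then passed through the same fetch, parse and download steps as the first page. This is only a sketch: `page_urls` is a hypothetical helper, and whether the site numbers its pages from 0 or from 1 is an assumption that should be checked against the real search results.

```python
def page_urls(key_word, first_page, last_page):
    """Yield one search URL per page, first_page to last_page inclusive (hypothetical helper)."""
    base = 'https://www.pornhub.com/video/search?search=' + key_word
    for i in range(first_page, last_page + 1):
        # Append the page number exactly as in the '&page=xx' pattern above
        yield base + '&page=' + str(i)

for url in page_urls('keyword', 1, 3):
    print(url)
```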
Program output: