This post looks at a free IP-proxy website and uses the IPs it provides to visit a CSDN blog, so that the same blog gets visited from a series of different IPs. It is meant purely for fun; feel free to try it yourself.
First, some preparation: set the User-Agent.
#1. headers
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
Then pick an IP-proxy website (I chose https://www.kuaidaili.com/free), parse the page, and extract the IP, port, and protocol type into lists:
#1. Fetch the proxy IP addresses
html=requests.get('https://www.kuaidaili.com/free').content.decode('utf8')
tree = etree.HTML(html)
ip = tree.xpath("//td[@data-title='IP']/text()")
port=tree.xpath("//td[@data-title='PORT']/text()")
model=tree.xpath("//td[@data-title='类型']/text()")
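As a side note, requests takes its proxies argument as a dict mapping a scheme ('http'/'https') to a proxy address, so the three lists above ultimately have to be paired up element by element. A minimal sketch of that pairing, assuming the lists stay aligned one-to-one with the rows of the table:

# Pair each IP with its port and protocol type; the page lists types in upper case ("HTTP"),
# while requests expects lower-case scheme keys.
proxy_list = []
for p_type, p_ip, p_port in zip(model, ip, port):
    proxy_list.append({p_type.lower(): '{}:{}'.format(p_ip, p_port)})
# e.g. {'http': '1.2.3.4:8080'}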
Next, collect the URLs of the individual articles under the blog and store them in a list:
#2. Collect the CSDN article URLs into ChildrenUrl
url='https://blog.csdn.net/weixin_43576564'
response=requests.get(url,headers=headers)
Home=response.content.decode('utf8')
Home=etree.HTML(Home)
urls=Home.xpath("//div[@class='article-item-box csdn-tracking-statistics']/h4/a/@href")
ChildrenUrl=[]
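The hrefs extracted this way are normally already absolute URLs on CSDN, but as a purely defensive step (not part of the original code) they can be normalized against the blog's base address with urllib.parse.urljoin in case any of them come back relative:

from urllib.parse import urljoin

# Make sure every extracted href is an absolute URL before requesting it
absolute_urls = [urljoin(url, u) for u in urls]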
Then visit each article through the proxy IPs: an outer loop rotates through the proxies, and for each proxy an inner loop visits every article once. The total view count is read from the "my blog" page on each visit so you can watch it change in real time, the number of visits to perform is set up front with the progress printed as the task runs, and random.randint() picks a random sleep interval between requests to make the spider a little safer. The full code is below:
import os
import time
import random
import requests
from lxml import etree
# Preparation
#1. headers
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
#1. Fetch the proxy IP addresses
html=requests.get('https://www.kuaidaili.com/free').content.decode('utf8')
tree = etree.HTML(html)
ip = tree.xpath("//td[@data-title='IP']/text()")
port=tree.xpath("//td[@data-title='PORT']/text()")
model=tree.xpath("//td[@data-title='类型']/text()")
#2. Collect the CSDN article URLs into ChildrenUrl
url='https://blog.csdn.net/weixin_43576564'
response=requests.get(url,headers=headers)
Home=response.content.decode('utf8')
Home=etree.HTML(Home)
urls=Home.xpath("//div[@class='article-item-box csdn-tracking-statistics']/h4/a/@href")
ChildrenUrl=[]
for i in range(1,len(urls)):
    ChildrenUrl.append(urls[i])
starttime=time.time()
browses=int(input("Enter the number of visits to perform: "))
browse=0
#3. Rotate through the proxy IPs and crawl the articles
for i in range(1,len(model)):
    # Build the proxy entry for this round (requests expects a {scheme: 'ip:port'} mapping)
    proxies={model[i].lower():'{}:{}'.format(ip[i],port[i])}
    for Curl in ChildrenUrl:
        try:
            browse += 1
            print("Progress: {}/{}".format(browse,browses),end="\t")
            # Visit the article through the proxy
            response=requests.get(Curl,headers=headers,proxies=proxies)
            # Extract the total view count from the page
            look=etree.HTML(response.content)
            number=look.xpath("//div[@class='grade-box clearfix']/dl[2]/dd/text()")
            count=number[0].strip()
            print("Total views: {}".format(count),end="\t")
            '''
            To be re-implemented:
            # do one IP lookup per proxy
            if Curl==ChildrenUrl[5]:
                ipUrl='http://www.ip138.com/'
                response=requests.get(ipUrl,proxies=proxies)
                iphtml=response.content
                ipHtmlTree=etree.HTML(iphtml)
                ipaddress=ipHtmlTree.xpath("//p[@class='result']/text()")
                print(ip[i],ipaddress)
            '''
            # Random pause between requests so the spider looks less robotic
            delay=random.randint(5, 30)
            print("Waiting {} seconds".format(delay),end="\t")
            time.sleep(delay)
            print("Current article URL: {}".format(Curl))
            if browse == browses:
                print("Task finished, total time: {} seconds".format(int(time.time()-starttime)))
                os._exit(0)
        except Exception as e:
            print('error:', e)
            os._exit(0)
    # Print the proxy used in this round
    print(proxies)
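Many of the addresses on free proxy lists are dead or very slow, and the commented-out ip138 block above (marked "to be re-implemented") was meant to confirm which IP a request actually goes out from. A minimal sketch of one way to do that check before using a proxy, using http://httpbin.org/ip as the echo service (my assumption, not the original ip138 approach):

def proxy_works(proxy, timeout=5):
    # Returns True if the proxy answers and reports an outbound IP.
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxy, timeout=timeout)
        print(proxy, '->', r.json().get('origin'))
        return True
    except Exception:
        return False

# Possible usage inside the outer loop, right after building proxies:
# if not proxy_works(proxies):
#     continue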
Screenshot of an actual run:
