1. Overview of Web Crawlers
Crawler concept:
Writing a program that simulates a browser surfing the web and then fetches/scrapes data from the internet.
Simulation: the browser itself is the original, natural "crawler" tool.
Crawler categories:
General-purpose crawler: scrapes an entire page of data; the fetching component of a crawler system.
Focused crawler: scrapes a specific portion of a page; always built on top of a general-purpose crawler.
Incremental crawler: monitors a site for updates so that only newly published data is scraped.
Risk analysis
Use crawlers responsibly.
How crawler risk shows up:
The crawler interferes with the normal operation of the target website;
The crawler collects specific types of data or information that are protected by law.
Avoiding risk:
Strictly follow the site's robots protocol;
While working around anti-crawling measures, optimize your code so it does not disturb the normal operation of the target site;
When using or redistributing the scraped content, review it first; if it contains personal information, private data, or someone's trade secrets, stop immediately and delete it.
Anti-crawling mechanisms
Anti-anti-crawling strategies
robots.txt protocol: a plain-text convention that states which data may and may not be crawled.
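As a quick illustration (my own sketch, not part of the original notes), robots.txt can also be checked programmatically with the standard library's urllib.robotparser; the Sogou URL below is just a placeholder:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')
rp.read()  # download and parse robots.txt

# can_fetch(user_agent, url) returns True if the rules allow that agent to crawl the URL
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=python'))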
Common request headers
User-Agent: identifies the client sending the request
Connection: close
Content-Type
How do you tell whether a page contains dynamically loaded data?
Local search vs. global search in the packet-capture tool
What is the first step before crawling an unfamiliar site?
Determine whether the data you want is dynamically loaded!!!
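A minimal sketch of that first step (my own illustration; the URL and keyword are placeholders): request the page and search the raw HTML for the target text. If it is missing, the data is loaded dynamically by a separate request, as described above.
import requests

# Placeholder URL and keyword -- substitute the page and the data you are looking for.
url = 'https://www.example.com/some-page'
keyword = 'target text'
headers = {'User-Agent': 'Mozilla/5.0'}

page_text = requests.get(url, headers=headers).text
if keyword in page_text:
    print('Found in the page source: the data is NOT dynamically loaded.')
else:
    print('Not in the page source: the data is probably loaded by a separate (ajax/js) request.')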
2. Basic Use of the requests Module
requests module
Concept: a module built on network requests; it is used to simulate a browser sending requests.
Coding workflow:
Specify the url
Send the request
Get the response data (the scraped data)
Persist it to storage
import requests

url = 'https://www.sogou.com'
# The return value is a response object
response = requests.get(url=url)
# text returns the response data as a string
data = response.text
with open('./sogou.html', "w", encoding='utf-8') as f:
    f.write(data)
Build a simple web page collector on top of Sogou
Fix the garbled-text problem
Deal with UA detection
import requests

wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# Holds the dynamic request parameters
params = {
    'query': wd
}
# The params argument encapsulates the URL query parameters
# headers defeats the UA check by spoofing the User-Agent
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# Manually set the response encoding to fix garbled Chinese characters
response.encoding = 'utf-8'
data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "downloaded successfully")
1. Scrape detailed movie data from Douban
Analysis
When the scrollbar reaches the bottom, an ajax request is sent and a batch of movie data comes back.
Dynamically loaded data: data obtained through an extra, separate request.
Dynamic data generated via ajax
Dynamic data generated via js
import requests

limit = input("How many entries from the top of the ranking: ")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# json() returns the deserialized object
data_list = response.json()
with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(name + " " + score + "\n")
print("Done")
2. Scrape KFC store location information
import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
# POST request; here the parameters are carried in the URL query string (params)
params = {
    "cname": "",
    "pid": "",
    "keyword": "青島",
    "pageIndex": "1",
    "pageSize": "10"
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.post(url=url, params=params, headers=headers)
# json() returns the deserialized object
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        address = i['addressDetail']
        f.write(name + "," + address + "\n")
print("Done")
3. Scrape data from the drug administration site
import requests

url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
with open('化妝品.txt', "w", encoding="utf-8") as f:
    for page in range(1, 5):
        params = {
            "on": "true",
            "page": str(page),
            "pageSize": "12",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": ""
        }
        response = requests.post(url=url, params=params, headers=headers)
        data_dic = response.json()
        for item in data_dic["list"]:
            id = item['ID']
            # Each company's detail data is loaded by another (dynamic) request, keyed by ID
            post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
            post_data = {
                "id": id
            }
            response2 = requests.post(url=post_url, params=post_data, headers=headers)
            data_dic2 = response2.json()
            title = data_dic2["epsName"]
            name = data_dic2['legalPerson']
            f.write(title + ":" + name + "\n")
3. Data Parsing
Parsing: extracting data according to specified rules
Purpose: what makes a focused crawler possible
Coding workflow of a focused crawler:
Specify the url
Send the request
Get the response data
Parse the data
Persist it to storage
Data parsing methods:
Regular expressions
bs4
xpath
pyquery (extra)
What is the general principle behind data parsing?
Parsing operates on the page source (a set of html tags).
What is the core purpose of html?
To display data.
How does html display data?
The data an html page displays is always placed inside html tags or in their attributes.
General principle:
1. Locate the tag
2. Take its text or take an attribute
1. Regex Parsing
1. Scrape image posts from Qiushibaike
Scrape a single image
import requests

url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# content returns the response data as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
    f.write(img_data)
print("Done")
Scrape a single page. The target markup looks like this:
<div class="thumb">
<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>
import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # Request the image URL to get the binary image data
    img_data = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(img_data)
print("Done")
Scrape multiple pages
import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1, 5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"Scraping the images on page {i}")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # Request the image URL to get the binary image data
        img_data = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(img_data)
print("Done")
2. bs4 Parsing
Environment setup
pip install bs4
How bs4 parsing works
Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
then call the object's attributes and methods to locate tags and extract data.
How do you instantiate a BeautifulSoup object?
BeautifulSoup(fp, 'lxml'): used for parsing a locally stored html document
BeautifulSoup(page_text, 'lxml'): used for parsing page source fetched from the internet
Tag location
soup.tagName: locates the first tagName tag; only the first one is returned
Attribute-based location
soup.find('div', class_='s'): returns the div tag whose class is s
find_all: same usage as find, but returns a list
Selector-based location
select('selector'): returns a list
tag, class, id, hierarchy (> one level, space for multiple levels)
Extracting data
Take text
tag.string: only the tag's direct text content
tag.text: all text content inside the tag
Take an attribute
soup.find("a", id='tt')['href']
1. Scrape the text of Romance of the Three Kingdoms
http://www.shicimingju.com/book/sanguoyanyi.html
Scrape the chapter titles + chapter contents
1. Parse the chapter titles & each chapter's detail-page url from the home page
from bs4 import BeautifulSoup
import requests

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select(".book-mulu a")
with open('./sanguo.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        new_url = "http://www.shicimingju.com" + a["href"]
        mulu = a.text
        print(mulu)
        # Request each chapter's detail page and parse the chapter content from it
        new_page_text = requests.get(new_url, headers=headers).text
        new_soup = BeautifulSoup(new_page_text, 'lxml')
        neirong = new_soup.find('div', class_='chapter_content').text
        f.write(mulu + ":" + neirong + "\n")
3. xpath Parsing
Environment setup
pip install lxml
How xpath parsing works
Instantiate an etree object and load the page source data into it,
then call the object's xpath method with various xpath expressions to locate tags and extract data.
Instantiating an etree object
tree = etree.parse(filename)
tree = etree.HTML(page_text)
The xpath method always returns a list
Tag location
tree.xpath("")
A leading / in an xpath expression means the tag must be located starting from the root node
A leading // means the tag can be located from any position
A non-leading // means "any number of levels"
A non-leading / means "exactly one level"
Attribute-based location: //div[@class='ddd']
Index-based location: //div[@class='ddd']/li[3]   # indexes start at 1
Index-based location: //div[@class='ddd']//li[2]  # indexes start at 1
Extracting data
Take text:
tree.xpath("//p[1]/text()"): only the direct text content
tree.xpath("//div[@class='ddd']/li[2]//text()"): all the text content
Take an attribute:
tree.xpath('//a[@id="feng"]/@href')
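A small self-contained sketch of the expression forms listed above (the HTML snippet is made up for illustration; it uses a ul tag where the notes above use div):
from lxml import etree

# A made-up snippet just to exercise the expression forms listed above
page_text = '''
<html><body>
  <ul class="ddd">
    <li><p>one</p></li>
    <li><p>two</p><a id="feng" href="http://www.example.com">link</a></li>
    <li><p>three</p></li>
  </ul>
</body></html>
'''
tree = etree.HTML(page_text)

print(tree.xpath('//ul[@class="ddd"]/li'))             # attribute location, one level down
print(tree.xpath('//ul[@class="ddd"]/li[3]//text()'))  # index location, indexes start at 1
print(tree.xpath('//ul[@class="ddd"]/li[2]//text()'))  # all text inside li[2]
print(tree.xpath('//a[@id="feng"]/@href'))             # take an attribute
print(tree.xpath('//p[1]/text()'))                     # direct text of each p that is the first p under its parent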
1. Scrape job postings from Boss Zhipin
from lxml import etree
import requests
import time

url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    'cookie': '__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
for li in li_list:
    # Extract the relevant data from the partial page source represented by li
    # When an xpath expression is used inside a loop like this, it must start with ./ or .//
    detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
    job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # Request the detail page and parse out the job description
    detail_page_text = requests.get(detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//div[@class="text"]/text()')
    # Join the list into a string
    job_desc = ''.join(job_desc)
    print(job_title, company, job_desc)
    time.sleep(5)
2. Scrape Qiushibaike
Scrape the author and the post text. Note that authors are either anonymous or registered users.
from lxml import etree
import requests

url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)
for div in div_list:
    # Usernames are either anonymous or registered users, hence the two xpath branches joined with |
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)
3. Scrape images from a wallpaper site
from lxml import etree
import requests
import os

dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    # The first page has a different URL pattern from the later pages
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # Fix garbled Chinese: re-encode as iso-8859-1, then decode as gbk
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_data = requests.get(img_src, headers=headers).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(img_data)
    print(f"Page {i} done")
4. IP Proxies
Proxy server
Forwards requests, which lets you change the IP address a request appears to come from.
Proxy anonymity levels
Transparent: the server knows you used a proxy and knows your real IP
Anonymous: the server knows you used a proxy, but does not know your real IP
Elite (high anonymity): the server does not know you used a proxy, let alone your real IP
Proxy types
http: this type of proxy can only forward http requests
https: can only forward https requests
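For reference, a minimal sketch (my own illustration; the proxy address is a placeholder) of routing a single requests call through a proxy via the proxies parameter; the dict key should match the protocol of the target URL:
import requests

# Placeholder proxy address -- substitute a working proxy ip:port
proxies = {
    'https': '1.2.3.4:8888',
}
headers = {'User-Agent': 'Mozilla/5.0'}
page_text = requests.get('https://www.baidu.com/s?wd=ip', headers=headers, proxies=proxies).text
print(len(page_text))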
Free proxy IP sites
快代理 (kuaidaili)
西祠代理 (xicidaili)
goubanjia
代理精靈 (recommended): http://http.zhiliandaili.cn/
What do you do when your IP gets banned while crawling?
Use a proxy
Build a proxy pool
Use a dial-up server
import requests
import random
from lxml import etree

# Proxy pool kept as a list of proxy dicts
all_ips = []
proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    dic = {'https': ip}
    all_ips.append(dic)
# Scrape the free proxy IPs listed on kuaidaili
free_proxies = []
for i in range(1, 3):
    url = f"http://www.kuaidaili.com/free/inha/{i}/"
    page_text = requests.get(url, headers=headers, proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # Caution: if tbody is only added by the browser (not present in the raw HTML), it must not appear in the xpath expression
    tr_list = tree.xpath('//*[@id="list"]/table/tbody/tr')
    for tr in tr_list:
        ip = tr.xpath("./td/text()")[0]
        port = tr.xpath("./td[2]/text()")[0]
        dic = {
            "ip": ip,
            "port": port
        }
        print(dic)
        free_proxies.append(dic)
    print(f"Page {i}")
print(len(free_proxies))
5. Handling Cookies
Video-parsing API endpoints
https://www.wocao.xyz/index.php?url=
https://2wk.com/vip.php?url=
https://api.47ks.com/webcloud/?v-
Video-parsing sites
牛巴巴 http://mv.688ing.com/
愛片網 https://ap2345.com/vip/
全民解析 http://www.qmaile.com/
Back to the topic
Why do cookies need to be handled?
They preserve the client's session state.
The request has to carry the cookie. How do you deal with cookie-based anti-crawling?
# Manual handling
Capture the cookie in the packet-capture tool and put it into headers
# Automatic handling
Use the session mechanism
Use case: cookies that change dynamically
session object: used almost exactly like the requests module. If a request made through the session produces a cookie, the cookie is automatically stored in the session.
Scrape data from Xueqiu
import requests

s = requests.Session()
main_url = "https://xueqiu.com"  # request the home page first so the session picks up the cookie
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
params = {
    "size": "8",
    '_type': "10",
    "type": "10"
}
# The cookie produced by this request is stored automatically in the session
s.get(main_url, headers=headers)
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json'
page_text = s.get(url, headers=headers, params=params).json()
print(page_text)
6. CAPTCHA Recognition
Online CAPTCHA-recognition platforms
- 打碼兔
- 雲打碼
- 超級鷹 (Chaojiying): http://www.chaojiying.com/about.html
1. Register and log in (identity verification in the user center)
2. After logging in:
Create a software entry: Software ID -> generates a software id
Download the sample code: Developer docs -> python -> download
Demo of the platform's sample code
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        # im: binary image data; codetype: CAPTCHA type code from the platform docs
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        # Report a misrecognized CAPTCHA by its image id
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

chaojiying = Chaojiying_Client('chaojiying username', 'chaojiying password', '96001')
im = open('a.jpg', 'rb').read()
print(chaojiying.PostPic(im, 1902)['pic_str'])
Recognize the CAPTCHA on the gushiwen site
zbb.py
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

def www(path, type):
    chaojiying = Chaojiying_Client('5423', '521521', '906630')
    im = open(path, 'rb').read()
    return chaojiying.PostPic(im, type)['pic_str']
main.py (don't name this file requests.py, or it will shadow the requests library)
import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
# 1004 is the CAPTCHA type code passed to the platform
img_text = www('./111.jpg', 1004)
print(img_text)
7. Simulated Login
Why does a crawler need to implement simulated login?
Some data is only shown after you log in.
gushiwen site
Anti-crawling mechanisms involved
1. CAPTCHA
2. Dynamic request parameters: the request parameters change on every request
Dynamic capture: usually the dynamic request parameters are hidden in the source of the front-end page
3. The cookie is tied to the CAPTCHA image request
A real pain, that one
import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# Use a session so the cookie set by the CAPTCHA request is carried into the login request
s = requests.Session()
# s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
# s.get(s_url, headers=headers)
# Get the CAPTCHA
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)
# Capture the dynamic request parameters hidden in the page source
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
# The url the login button posts to, captured with the packet-capture tool
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # changes on every request
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "542154983@qq.com",
    "pwd": "zxy521",
    "code": img_text,
    "denglu": "登錄"
}
main_page_text = s.post(login_url, headers=headers, data=data).text
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)
8. Asynchronous Crawling with a Thread Pool
Crawl the first ten pages of Qiushibaike asynchronously using a thread pool
import requests
from multiprocessing.dummy import Pool

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# Collect the urls into a list
urls = []
for i in range(1, 11):
    urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')

# The callback passed to map must take exactly one parameter
def get_request(url):
    return requests.get(url, headers=headers).text

# Instantiate a pool of 10 threads
pool = Pool(10)
response_text_list = pool.map(get_request, urls)
print(response_text_list)
9. Single Thread + Multi-Task Asynchronous Coroutines
1. Introduction
Coroutine: an object
# A coroutine can be thought of as a special kind of function. If a function definition is decorated with the async keyword, calling it does not run the function body immediately; instead it returns a coroutine object.
Task object (task)
# A task object is a further wrapper around a coroutine object. Through the task object you can inspect the coroutine's running state.
# Task objects ultimately have to be registered with the event loop object.
Binding a callback
# A callback is bound to a task object; it runs only after the task's special function has finished executing.
Event loop object
# An object that loops forever. You can also think of it as a container holding multiple task objects (i.e., a set of code blocks waiting to run).
Where the asynchrony shows up
# Once the event loop starts, it executes the task objects in order,
# but when a task hits a blocking point the event loop does not wait; it moves straight on to the next task.
await: the suspend operation; it yields control of the CPU
Single task
from time import sleep
import asyncio

# Callback function:
# its default parameter is the task object
def callback(task):
    print('i am callback!!!')
    print(task.result())  # result() returns the return value of the task's special function

async def get_request(url):
    print('requesting:', url)
    sleep(2)
    print('request finished:', url)
    return 'hello bobo'

# Create a coroutine object
c = get_request('www.1.com')
# Wrap it in a task object
task = asyncio.ensure_future(c)
# Bind the callback to the task object
task.add_done_callback(callback)
# Create an event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(task)  # register the task with the event loop and start the loop
2. Multi-task asynchronous coroutines
import asyncio
from time import sleep
import time

start = time.time()
urls = [
    'http://localhost:5000/a',
    'http://localhost:5000/b',
    'http://localhost:5000/c'
]

# The code blocks waiting to run must not contain code from modules that do not support async
# Any blocking operation inside this function must be decorated with the await keyword
async def get_request(url):
    print('requesting:', url)
    # sleep(2)
    await asyncio.sleep(2)
    print('request finished:', url)
    return 'hello bobo'

tasks = []  # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
Notes:
1. Store the task objects in a list, then register the list with the event loop. During registration the list has to be wrapped with the wait method.
2. The special function behind a task object must not contain code from modules that do not support async, otherwise the whole asynchronous effect is broken. Moreover, every blocking operation inside that function must be decorated with the await keyword.
3. Code that uses the requests module must not appear inside the special function, because requests does not support async.
3. aiohttp
A network-request module that supports asynchronous operation
- Environment setup: pip install aiohttp
import asyncio
import time
import aiohttp
from lxml import etree

urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]

# requests cannot produce the asynchronous effect because it does not support async, so aiohttp is used instead.
# Detail: add async in front of every with, and add await before every blocking step.
async def req(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            # response.read(): bytes
            page_text = await response.text()
            return page_text

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)

if __name__ == '__main__':
    start = time.time()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print(time.time() - start)
10. selenium
Concept
A module for browser automation.
Environment setup:
Install the selenium module
What is the connection between selenium and crawling?
It is a convenient way to get data that is loaded dynamically into a page
Scraping with the requests module: what you see is not necessarily what you get
selenium: what you see is what you get
It can also implement simulated login
Basic operations:
Download address for the Chrome driver:
http://chromedriver.storage.googleapis.com/index.html
Mapping table between chromedriver versions and Chrome versions:
https://blog.csdn.net/huilan_same/article/details/51896672
Action chains
A sequence of actions
Headless browser
A browser without a visible UI
phantomJS
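The notes mention phantomJS; the same headless idea can be sketched with Chrome itself through ChromeOptions (my own example, not from the original notes; the driver path is a placeholder):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# The driver path is a placeholder; point it at your own chromedriver
bro = webdriver.Chrome(executable_path=r'./chromedriver.exe', options=chrome_options)
bro.get('https://www.baidu.com')
print(bro.page_source[:200])  # the dynamically rendered page source is still available
bro.quit()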
1. Basic operations, demonstrated on JD
from selenium import webdriver
from time import sleep

# 1. Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
# 2. Simulate the user issuing a request
url = 'https://www.jd.com'
bro.get(url)
# 3. Locate the tag
search_input = bro.find_element_by_id('key')
# 4. Interact with the located tag
search_input.send_keys('華為')
# 5. A sequence of actions
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# 6. Execute js code
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
sleep(3)
# 7. Close the browser
bro.quit()
2. Scrape the drug administration info
from selenium import webdriver
from lxml import etree
from time import sleep

page_text_list = []
# Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
# Must wait for the page to finish loading
sleep(2)
# page_source is the source of the page currently open in the browser
page_text = bro.page_source
page_text_list.append(page_text)
# The next-page button must be visible in the window before it can be clicked, so scroll to the bottom first
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
# Open the next two pages
for i in range(2):
    bro.find_element_by_id('pageIto_next').click()
    sleep(2)
    page_text = bro.page_source
    page_text_list.append(page_text)
for p in page_text_list:
    tree = etree.HTML(p)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()
3. Action chains
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains

# Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# If the target tag lives inside an iframe's sub-page, you must switch_to that frame before locating it
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
# 1. Instantiate the action chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)
for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)
# Release the mouse button
action.release().perform()
sleep(3)
bro.quit()
4. Dealing with selenium detection
Many sites, such as Taobao, block selenium-driven crawling.
In a normal browser, typing window.navigator.webdriver returns undefined;
in a browser opened by selenium it returns true.
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
# Hide the "browser is being controlled by automated software" flag
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=option)
bro.get('https://www.taobao.com/')
5. Simulated login to 12306
from selenium import webdriver
from selenium.webdriver import ActionChains
from PIL import Image  # used to crop the screenshot (pillow)
from zbb import www
from time import sleep

bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
sleep(5)
# Switch to the account-login tab
zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
zhdl.click()
sleep(1)
username = bro.find_element_by_id('J-userName')
username.send_keys('181873')
pwd = bro.find_element_by_id('J-password')
pwd.send_keys('zx1')
# Capture (crop out) the CAPTCHA image
bro.save_screenshot('main.png')
# Locate the tag that holds the CAPTCHA image
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # top-left corner of the CAPTCHA image relative to the page
size = code_img_ele.size  # width and height of the CAPTCHA image
# Crop rectangle (coordinates of the top-left and bottom-right corners)
rangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
i = Image.open('main.png')
frame = i.crop(rangle)
frame.save('code.png')
# Use the CAPTCHA platform to recognize the image
result = www('./code.png', 9004)
# x1,y1|x2,y2|x3,y3 ==> [[x1,y1],[x2,y2],[x3,y3]]
all_list = []  # each element is one point's coordinates, relative to the top-left corner of the CAPTCHA image
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
action = ActionChains(bro)
# Click each returned point, offset from the CAPTCHA image element
for l in all_list:
    x = l[0]
    y = l[1]
    action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(2)
btn = bro.find_element_by_xpath('//*[@id="J-login"]')
btn.click()
sleep(3)
bro.quit()