1. Overview of Web Crawlers
Crawler concept:
Writing a program that simulates a browser surfing the web and then fetches/scrapes data from the internet.
Simulation: the browser itself is the original, natural "crawler" tool.
Crawler categories:
General-purpose crawler: scrapes an entire page of data; the fetching component of a crawler system.
Focused crawler: scrapes a specific portion of a page; always built on top of a general-purpose crawler.
Incremental crawler: monitors a site for updates so that only newly published data is scraped.
Risk analysis
Use crawlers responsibly.
How crawler risk shows up:
The crawler interferes with the normal operation of the target website;
The crawler collects specific types of data or information that are protected by law.
Avoiding risk:
Strictly follow the site's robots protocol;
While working around anti-crawling measures, optimize your code so it does not disturb the normal operation of the target site;
When using or redistributing the scraped content, review it first; if it contains personal information, private data, or someone's trade secrets, stop immediately and delete it.
Anti-crawling mechanisms
Anti-anti-crawling strategies
robots.txt protocol: a plain-text convention that states which data may and may not be crawled.
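As a quick illustration (my own sketch, not part of the original notes), robots.txt can also be checked programmatically with the standard library's urllib.robotparser; the Sogou URL below is just a placeholder:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')
rp.read()  # download and parse robots.txt

# can_fetch(user_agent, url) returns True if the rules allow that agent to crawl the URL
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=python'))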
Common request headers
User-Agent: identifies the client sending the request
Connection: close
Content-Type
How do you tell whether a page contains dynamically loaded data?
Local search vs. global search in the packet-capture tool
What is the first step before crawling an unfamiliar site?
Determine whether the data you want is dynamically loaded!!!
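A minimal sketch of that first step (my own illustration; the URL and keyword are placeholders): request the page and search the raw HTML for the target text. If it is missing, the data is loaded dynamically by a separate request, as described above.
import requests

# Placeholder URL and keyword -- substitute the page and the data you are looking for.
url = 'https://www.example.com/some-page'
keyword = 'target text'
headers = {'User-Agent': 'Mozilla/5.0'}

page_text = requests.get(url, headers=headers).text
if keyword in page_text:
    print('Found in the page source: the data is NOT dynamically loaded.')
else:
    print('Not in the page source: the data is probably loaded by a separate (ajax/js) request.')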
2. Basic Use of the requests Module
requests module
Concept: a module built on network requests; it is used to simulate a browser sending requests.
Coding workflow:
Specify the url
Send the request
Get the response data (the scraped data)
Persist it to storage
import requests

url = 'https://www.sogou.com'
# The return value is a response object
response = requests.get(url=url)
# text returns the response data as a string
data = response.text
with open('./sogou.html', "w", encoding='utf-8') as f:
    f.write(data)
Build a simple web page collector on top of Sogou
Fix the garbled-text problem
Deal with UA detection
import requests

wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# Holds the dynamic request parameters
params = {
    'query': wd
}
# The params argument encapsulates the URL query parameters
# headers defeats the UA check by spoofing the User-Agent
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# Manually set the response encoding to fix garbled Chinese characters
response.encoding = 'utf-8'
data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "downloaded successfully")
1. Scrape detailed movie data from Douban
Analysis
When the scrollbar reaches the bottom, an ajax request is sent and a batch of movie data comes back.
Dynamically loaded data: data obtained through an extra, separate request.
Dynamic data generated via ajax
Dynamic data generated via js
import requests

limit = input("How many entries from the top of the ranking: ")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# json() returns the deserialized object
data_list = response.json()
with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(name + " " + score + "\n")
print("Done")
2. Scrape KFC store location information
import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
# POST request; here the parameters are carried in the URL query string (params)
params = {
    "cname": "",
    "pid": "",
    "keyword": "青島",
    "pageIndex": "1",
    "pageSize": "10"
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.post(url=url, params=params, headers=headers)
# json() returns the deserialized object
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        address = i['addressDetail']
        f.write(name + "," + address + "\n")
print("Done")
3. Scrape data from the drug administration site
import requests

url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
with open('化妝品.txt', "w", encoding="utf-8") as f:
    for page in range(1, 5):
        params = {
            "on": "true",
            "page": str(page),
            "pageSize": "12",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": ""
        }
        response = requests.post(url=url, params=params, headers=headers)
        data_dic = response.json()
        for item in data_dic["list"]:
            id = item['ID']
            # Each company's detail data is loaded by another (dynamic) request, keyed by ID
            post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
            post_data = {
                "id": id
            }
            response2 = requests.post(url=post_url, params=post_data, headers=headers)
            data_dic2 = response2.json()
            title = data_dic2["epsName"]
            name = data_dic2['legalPerson']
            f.write(title + ":" + name + "\n")
3. Data Parsing
Parsing: extracting data according to specified rules
Purpose: what makes a focused crawler possible
Coding workflow of a focused crawler:
Specify the url
Send the request
Get the response data
Parse the data
Persist it to storage
Data parsing methods:
Regular expressions
bs4
xpath
pyquery (extra)
What is the general principle behind data parsing?
Parsing operates on the page source (a set of html tags).
What is the core purpose of html?
To display data.
How does html display data?
The data an html page displays is always placed inside html tags or in their attributes.
General principle:
1. Locate the tag
2. Take its text or take an attribute
1. Regex Parsing
1. Scrape image posts from Qiushibaike
Scrape a single image
import requests

url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# content returns the response data as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
    f.write(img_data)
print("Done")
Scrape a single page. The target markup looks like this:
<div class="thumb">
<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>
import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # Request the image URL to get the binary image data
    img_data = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(img_data)
print("Done")
Scrape multiple pages
import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1, 5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"Scraping the images on page {i}")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # Request the image URL to get the binary image data
        img_data = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(img_data)
print("Done")
2. bs4 Parsing
Environment setup
pip install bs4
How bs4 parsing works
Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
then call the object's attributes and methods to locate tags and extract data.
How do you instantiate a BeautifulSoup object?
BeautifulSoup(fp, 'lxml'): used for parsing a locally stored html document
BeautifulSoup(page_text, 'lxml'): used for parsing page source fetched from the internet
Tag location
soup.tagName: locates the first tagName tag; only the first one is returned
Attribute-based location
soup.find('div', class_='s'): returns the div tag whose class is s
find_all: same usage as find, but returns a list
Selector-based location
select('selector'): returns a list
tag, class, id, hierarchy (> one level, space for multiple levels)
Extracting data
Take text
tag.string: only the tag's direct text content
tag.text: all text content inside the tag
Take an attribute
soup.find("a", id='tt')['href']
1. Scrape the text of Romance of the Three Kingdoms
http://www.shicimingju.com/book/sanguoyanyi.html
Scrape the chapter titles + chapter contents
1. Parse the chapter titles & each chapter's detail-page url from the home page
from bs4 import BeautifulSoup
import requests

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select(".book-mulu a")
with open('./sanguo.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        new_url = "http://www.shicimingju.com" + a["href"]
        mulu = a.text
        print(mulu)
        # Request each chapter's detail page and parse the chapter content from it
        new_page_text = requests.get(new_url, headers=headers).text
        new_soup = BeautifulSoup(new_page_text, 'lxml')
        neirong = new_soup.find('div', class_='chapter_content').text
        f.write(mulu + ":" + neirong + "\n")
3. xpath Parsing
Environment setup
pip install lxml
How xpath parsing works
Instantiate an etree object and load the page source data into it,
then call the object's xpath method with various xpath expressions to locate tags and extract data.
Instantiating an etree object
tree = etree.parse(filename)
tree = etree.HTML(page_text)
The xpath method always returns a list
Tag location
tree.xpath("")
A leading / in an xpath expression means the tag must be located starting from the root node
A leading // means the tag can be located from any position
A non-leading // means "any number of levels"
A non-leading / means "exactly one level"
Attribute-based location: //div[@class='ddd']
Index-based location: //div[@class='ddd']/li[3]   # indexes start at 1
Index-based location: //div[@class='ddd']//li[2]  # indexes start at 1
Extracting data
Take text:
tree.xpath("//p[1]/text()"): only the direct text content
tree.xpath("//div[@class='ddd']/li[2]//text()"): all the text content
Take an attribute:
tree.xpath('//a[@id="feng"]/@href')
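A small self-contained sketch of the expression forms listed above (the HTML snippet is made up for illustration; it uses a ul tag where the notes above use div):
from lxml import etree

# A made-up snippet just to exercise the expression forms listed above
page_text = '''
<html><body>
  <ul class="ddd">
    <li><p>one</p></li>
    <li><p>two</p><a id="feng" href="http://www.example.com">link</a></li>
    <li><p>three</p></li>
  </ul>
</body></html>
'''
tree = etree.HTML(page_text)

print(tree.xpath('//ul[@class="ddd"]/li'))             # attribute location, one level down
print(tree.xpath('//ul[@class="ddd"]/li[3]//text()'))  # index location, indexes start at 1
print(tree.xpath('//ul[@class="ddd"]/li[2]//text()'))  # all text inside li[2]
print(tree.xpath('//a[@id="feng"]/@href'))             # take an attribute
print(tree.xpath('//p[1]/text()'))                     # direct text of each p that is the first p under its parent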
1. Scrape job postings from Boss Zhipin
from lxml import etree
import requests
import time

url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    'cookie': '__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
for li in li_list:
    # Extract the relevant data from the partial page source represented by li
    # When an xpath expression is used inside a loop like this, it must start with ./ or .//
    detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
    job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # Request the detail page and parse out the job description
    detail_page_text = requests.get(detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//div[@class="text"]/text()')
    # Join the list into a string
    job_desc = ''.join(job_desc)
    print(job_title, company, job_desc)
    time.sleep(5)
2. Scrape Qiushibaike
Scrape the author and the post text. Note that authors are either anonymous or registered users.
from lxml import etree
import requests

url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)
for div in div_list:
    # Usernames are either anonymous or registered users, hence the two xpath branches joined with |
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)
3. Scrape images from a wallpaper site
from lxml import etree
import requests
import os

dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    # The first page has a different URL pattern from the later pages
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # Fix garbled Chinese: re-encode as iso-8859-1, then decode as gbk
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_data = requests.get(img_src, headers=headers).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(img_data)
    print(f"Page {i} done")
4. IP Proxies
Proxy server
Forwards requests, which lets you change the IP address a request appears to come from.
Proxy anonymity levels
Transparent: the server knows you used a proxy and knows your real IP
Anonymous: the server knows you used a proxy, but does not know your real IP
Elite (high anonymity): the server does not know you used a proxy, let alone your real IP
Proxy types
http: this type of proxy can only forward http requests
https: can only forward https requests
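For reference, a minimal sketch (my own illustration; the proxy address is a placeholder) of routing a single requests call through a proxy via the proxies parameter; the dict key should match the protocol of the target URL:
import requests

# Placeholder proxy address -- substitute a working proxy ip:port
proxies = {
    'https': '1.2.3.4:8888',
}
headers = {'User-Agent': 'Mozilla/5.0'}
page_text = requests.get('https://www.baidu.com/s?wd=ip', headers=headers, proxies=proxies).text
print(len(page_text))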
Free proxy IP sites
快代理 (kuaidaili)
西祠代理 (xicidaili)
goubanjia
代理精靈 (recommended): http://http.zhiliandaili.cn/
What do you do when your IP gets banned while crawling?
Use a proxy
Build a proxy pool
Use a dial-up server
import requests
import random
from lxml import etree

# Proxy pool kept as a list of proxy dicts
all_ips = []
proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    dic = {'https': ip}
    all_ips.append(dic)
# Scrape the free proxy IPs listed on kuaidaili
free_proxies = []
for i in range(1, 3):
    url = f"http://www.kuaidaili.com/free/inha/{i}/"
    page_text = requests.get(url, headers=headers, proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # Caution: if tbody is only added by the browser (not present in the raw HTML), it must not appear in the xpath expression
    tr_list = tree.xpath('//*[@id="list"]/table/tbody/tr')
    for tr in tr_list:
        ip = tr.xpath("./td/text()")[0]
        port = tr.xpath("./td[2]/text()")[0]
        dic = {
            "ip": ip,
            "port": port
        }
        print(dic)
        free_proxies.append(dic)
    print(f"Page {i}")
print(len(free_proxies))
5. Handling Cookies
Video-parsing API endpoints
https://www.wocao.xyz/index.php?url=
https://2wk.com/vip.php?url=
https://api.47ks.com/webcloud/?v-
Video-parsing sites
牛巴巴 http://mv.688ing.com/
愛片網 https://ap2345.com/vip/
全民解析 http://www.qmaile.com/
Back to the topic
Why do cookies need to be handled?
They preserve the client's session state.
The request has to carry the cookie. How do you deal with cookie-based anti-crawling?
# Manual handling
Capture the cookie in the packet-capture tool and put it into headers
# Automatic handling
Use the session mechanism
Use case: cookies that change dynamically
session object: used almost exactly like the requests module. If a request made through the session produces a cookie, the cookie is automatically stored in the session.
Scrape data from Xueqiu
import requests

s = requests.Session()
main_url = "https://xueqiu.com"  # request the home page first so the session picks up the cookie
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
params = {
    "size": "8",
    '_type': "10",
    "type": "10"
}
# The cookie produced by this request is stored automatically in the session
s.get(main_url, headers=headers)
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json'
page_text = s.get(url, headers=headers, params=params).json()
print(page_text)
6. CAPTCHA Recognition
Online CAPTCHA-recognition platforms
- 打碼兔
- 雲打碼
- 超級鷹 (Chaojiying): http://www.chaojiying.com/about.html
1. Register and log in (identity verification in the user center)
2. After logging in:
Create a software entry: Software ID -> generates a software id
Download the sample code: Developer docs -> python -> download
Demo of the platform's sample code
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        # im: binary image data; codetype: CAPTCHA type code from the platform docs
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        # Report a misrecognized CAPTCHA by its image id
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

chaojiying = Chaojiying_Client('chaojiying username', 'chaojiying password', '96001')
im = open('a.jpg', 'rb').read()
print(chaojiying.PostPic(im, 1902)['pic_str'])
Recognize the CAPTCHA on the gushiwen site
zbb.py
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

def www(path, type):
    chaojiying = Chaojiying_Client('5423', '521521', '906630')
    im = open(path, 'rb').read()
    return chaojiying.PostPic(im, type)['pic_str']
main.py (don't name this file requests.py, or it will shadow the requests library)
import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
# 1004 is the CAPTCHA type code passed to the platform
img_text = www('./111.jpg', 1004)
print(img_text)
7. Simulated Login
Why does a crawler need to implement simulated login?
Some data is only shown after you log in.
gushiwen site
Anti-crawling mechanisms involved
1. CAPTCHA
2. Dynamic request parameters: the request parameters change on every request
Dynamic capture: usually the dynamic request parameters are hidden in the source of the front-end page
3. The cookie is tied to the CAPTCHA image request
A real pain, that one
import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# Use a session so the cookie set by the CAPTCHA request is carried into the login request
s = requests.Session()
# s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
# s.get(s_url, headers=headers)
# Get the CAPTCHA
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)
# Capture the dynamic request parameters hidden in the page source
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
# The url the login button posts to, captured with the packet-capture tool
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # changes on every request
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "542154983@qq.com",
    "pwd": "zxy521",
    "code": img_text,
    "denglu": "登錄"
}
main_page_text = s.post(login_url, headers=headers, data=data).text
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)
8. Asynchronous Crawling with a Thread Pool
Crawl the first ten pages of Qiushibaike asynchronously using a thread pool
import requests
from multiprocessing.dummy import Pool

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# Collect the urls into a list
urls = []
for i in range(1, 11):
    urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')

# The callback passed to map must take exactly one parameter
def get_request(url):
    return requests.get(url, headers=headers).text

# Instantiate a pool of 10 threads
pool = Pool(10)
response_text_list = pool.map(get_request, urls)
print(response_text_list)
9. Single Thread + Multi-Task Asynchronous Coroutines
1. Introduction
Coroutine: an object
# A coroutine can be thought of as a special kind of function. If a function definition is decorated with the async keyword, calling it does not run the function body immediately; instead it returns a coroutine object.
Task object (task)
# A task object is a further wrapper around a coroutine object. Through the task object you can inspect the coroutine's running state.
# Task objects ultimately have to be registered with the event loop object.
Binding a callback
# A callback is bound to a task object; it runs only after the task's special function has finished executing.
Event loop object
# An object that loops forever. You can also think of it as a container holding multiple task objects (i.e., a set of code blocks waiting to run).
Where the asynchrony shows up
# Once the event loop starts, it executes the task objects in order,
# but when a task hits a blocking point the event loop does not wait; it moves straight on to the next task.
await: the suspend operation; it yields control of the CPU
Single task
from time import sleep
import asyncio

# Callback function:
# its default parameter is the task object
def callback(task):
    print('i am callback!!!')
    print(task.result())  # result() returns the return value of the task's special function

async def get_request(url):
    print('requesting:', url)
    sleep(2)
    print('request finished:', url)
    return 'hello bobo'

# Create a coroutine object
c = get_request('www.1.com')
# Wrap it in a task object
task = asyncio.ensure_future(c)
# Bind the callback to the task object
task.add_done_callback(callback)
# Create an event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(task)  # register the task with the event loop and start the loop
2. Multi-task asynchronous coroutines
import asyncio
from time import sleep
import time

start = time.time()
urls = [
    'http://localhost:5000/a',
    'http://localhost:5000/b',
    'http://localhost:5000/c'
]

# The code blocks waiting to run must not contain code from modules that do not support async
# Any blocking operation inside this function must be decorated with the await keyword
async def get_request(url):
    print('requesting:', url)
    # sleep(2)
    await asyncio.sleep(2)
    print('request finished:', url)
    return 'hello bobo'

tasks = []  # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
Notes:
1. Store the task objects in a list, then register the list with the event loop. During registration the list has to be wrapped with the wait method.
2. The special function behind a task object must not contain code from modules that do not support async, otherwise the whole asynchronous effect is broken. Moreover, every blocking operation inside that function must be decorated with the await keyword.
3. Code that uses the requests module must not appear inside the special function, because requests does not support async.
3. aiohttp
A network-request module that supports asynchronous operation
- Environment setup: pip install aiohttp
import asyncio
import time
import aiohttp
from lxml import etree

urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]

# requests cannot produce the asynchronous effect because it does not support async, so aiohttp is used instead.
# Detail: add async in front of every with, and add await before every blocking step.
async def req(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            # response.read(): bytes
            page_text = await response.text()
            return page_text

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)

if __name__ == '__main__':
    start = time.time()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print(time.time() - start)
10. selenium
Concept
A module for browser automation.
Environment setup:
Install the selenium module
What is the connection between selenium and crawling?
It is a convenient way to get data that is loaded dynamically into a page
Scraping with the requests module: what you see is not necessarily what you get
selenium: what you see is what you get
It can also implement simulated login
Basic operations:
Download address for the Chrome driver:
http://chromedriver.storage.googleapis.com/index.html
Mapping table between chromedriver versions and Chrome versions:
https://blog.csdn.net/huilan_same/article/details/51896672
Action chains
A sequence of actions
Headless browser
A browser without a visible UI
phantomJS
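The notes mention phantomJS; the same headless idea can be sketched with Chrome itself through ChromeOptions (my own example, not from the original notes; the driver path is a placeholder):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# The driver path is a placeholder; point it at your own chromedriver
bro = webdriver.Chrome(executable_path=r'./chromedriver.exe', options=chrome_options)
bro.get('https://www.baidu.com')
print(bro.page_source[:200])  # the dynamically rendered page source is still available
bro.quit()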
1. Basic operations, demonstrated on JD
from selenium import webdriver
from time import sleep

# 1. Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
# 2. Simulate the user issuing a request
url = 'https://www.jd.com'
bro.get(url)
# 3. Locate the tag
search_input = bro.find_element_by_id('key')
# 4. Interact with the located tag
search_input.send_keys('華為')
# 5. A sequence of actions
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# 6. Execute js code
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
sleep(3)
# 7. Close the browser
bro.quit()
2. Scrape the drug administration info
from selenium import webdriver
from lxml import etree
from time import sleep

page_text_list = []
# Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
# Must wait for the page to finish loading
sleep(2)
# page_source is the source of the page currently open in the browser
page_text = bro.page_source
page_text_list.append(page_text)
# The next-page button must be visible in the window before it can be clicked, so scroll to the bottom first
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
# Open the next two pages
for i in range(2):
    bro.find_element_by_id('pageIto_next').click()
    sleep(2)
    page_text = bro.page_source
    page_text_list.append(page_text)
for p in page_text_list:
    tree = etree.HTML(p)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()
3. Action chains
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains

# Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# If the target tag lives inside an iframe's sub-page, you must switch_to that frame before locating it
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
# 1. Instantiate the action chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)
for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)
# Release the mouse button
action.release().perform()
sleep(3)
bro.quit()
4. Dealing with selenium detection
Many sites, such as Taobao, block selenium-driven crawling.
In a normal browser, typing window.navigator.webdriver returns undefined;
in a browser opened by selenium it returns true.
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
# Hide the "browser is being controlled by automated software" flag
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# Instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=option)
bro.get('https://www.taobao.com/')
5. Simulated login to 12306
from selenium import webdriver
from selenium.webdriver import ActionChains
from PIL import Image  # used to crop the screenshot (pillow)
from zbb import www
from time import sleep

bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
sleep(5)
# Switch to the account-login tab
zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
zhdl.click()
sleep(1)
username = bro.find_element_by_id('J-userName')
username.send_keys('181873')
pwd = bro.find_element_by_id('J-password')
pwd.send_keys('zx1')
# Capture (crop out) the CAPTCHA image
bro.save_screenshot('main.png')
# Locate the tag that holds the CAPTCHA image
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # top-left corner of the CAPTCHA image relative to the page
size = code_img_ele.size  # width and height of the CAPTCHA image
# Crop rectangle (coordinates of the top-left and bottom-right corners)
rangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
i = Image.open('main.png')
frame = i.crop(rangle)
frame.save('code.png')
# Use the CAPTCHA platform to recognize the image
result = www('./code.png', 9004)
# x1,y1|x2,y2|x3,y3 ==> [[x1,y1],[x2,y2],[x3,y3]]
all_list = []  # each element is one point's coordinates, relative to the top-left corner of the CAPTCHA image
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
action = ActionChains(bro)
# Click each returned point, offset from the CAPTCHA image element
for l in all_list:
    x = l[0]
    y = l[1]
    action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(2)
btn = bro.find_element_by_xpath('//*[@id="J-login"]')
btn.click()
sleep(3)
bro.quit()