Data Scraping

Reposted from the original article, 2019-10-11 21:17

Coroutines

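- Special function:
    - A function whose definition is modified by the async keyword is a special function.
- Coroutine:
    - An object. When a special function is called, the statements in its body are not
        executed immediately; instead, the call returns a coroutine object.
    - Conclusion: coroutine object == a call to a special function

- Task object
    - Simply a further wrapper around a coroutine object.
    - Conclusion: task object == higher-level coroutine object == a call to a special function
    - Binding a callback:
        - When is the callback executed?
            - After the task object has finished running
        - task.add_done_callback(func)
            - func must take one parameter, which is the task object the callback is bound to
            - parameter.result(): the return value of the special function wrapped by the task object
- Event loop object
    - Purpose: runs the task objects registered in it asynchronously.

- Coding workflow:
    - Define the special function
    - Create the coroutine object
    - Wrap it in a task object
    - Create the event loop object
    - Register the task object with the event loop and start the event loop

- Note: the body of a special function must not contain code from modules that do not
    support async; otherwise the asynchronous effect of the multi-task coroutines is lost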
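
The note above can be checked with a small timing experiment. This is only a sketch: the function names are made up for illustration, and it uses asyncio.run() and asyncio.gather(), which the original workflow does not use. Three jobs that call the blocking time.sleep() run one after another, while three jobs that await asyncio.sleep() overlap.

```python
import asyncio
import time


async def blocking_job(delay):
    # time.sleep() does not support async: it blocks the whole event loop
    time.sleep(delay)


async def cooperative_job(delay):
    # asyncio.sleep() hands control back so the other tasks can run meanwhile
    await asyncio.sleep(delay)


async def run_three(job, delay=0.2):
    # Run three copies of the job together and return the elapsed time
    start = time.perf_counter()
    await asyncio.gather(*(job(delay) for _ in range(3)))
    return time.perf_counter() - start


async def main():
    blocked = await run_three(blocking_job)
    concurrent = await run_three(cooperative_job)
    return blocked, concurrent


blocked, concurrent = asyncio.run(main())
print('blocking: %.2fs, cooperative: %.2fs' % (blocked, concurrent))
```

The blocking variant should take roughly three times the delay, the cooperative one roughly a single delay.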

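Single coroutine

import asyncio
from time import sleep

# Define the special function
async def get_request(url):
    print('Requesting:',url)
    sleep(1)  # blocking sleep; harmless here only because there is a single task
    print('Request finished:',url)

# Calling the function returns a coroutine object
c = get_request('www.1.com')
# Create a task object from the coroutine object
task = asyncio.ensure_future(c)

# Create an event loop object
loop = asyncio.get_event_loop()
# Register the task object with the event loop and start the event loop
loop.run_until_complete(task)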
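
To see that calling a special function only creates a coroutine object and does not run its body, the call can be inspected before it is handed to an event loop. A minimal sketch, assuming a list named calls (hypothetical, for illustration only) to record when the body runs, and using asyncio.run() instead of the explicit loop above:

```python
import asyncio

calls = []  # records every time the function body actually runs


async def get_request(url):
    calls.append(url)
    return url.upper()


c = get_request('www.1.com')       # only creates a coroutine object
is_coroutine = asyncio.iscoroutine(c)
ran_before = list(calls)           # snapshot: the body has not run yet

result = asyncio.run(c)            # now the body executes
print(is_coroutine, ran_before, calls, result)
```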

Multi-task asynchronous coroutines

import asyncio
import time
# Define the special function
async def get_request(url):
    print('Requesting:',url)
    await asyncio.sleep(1)
    print('Request finished:',url)


# One coroutine object will be created per url (3 in total)
urls = [
    '1.com','2.com','3.com'
]

start = time.time()
# Task list: holds multiple task objects
tasks = []
for url in urls:
    c = get_request(url)
    # Create a task object from the coroutine object
    task = asyncio.ensure_future(c)
    tasks.append(task)

# Create an event loop object
loop = asyncio.get_event_loop()
# Register the task objects with the event loop and start the event loop
loop.run_until_complete(asyncio.wait(tasks))  # a list of tasks must be wrapped in asyncio.wait()

print('Total time:',time.time()-start)
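
When the return values are needed directly, asyncio.gather() is an alternative to asyncio.wait(). The original listing does not use it; this is a sketch under that assumption, with a shortened sleep standing in for a real request. gather() returns the results in the same order as the coroutines passed to it.

```python
import asyncio
import time


async def get_request(url):
    await asyncio.sleep(0.1)   # stand-in for a real network request
    return 'response from ' + url


async def main(urls):
    # gather() runs all coroutines concurrently and returns their
    # results in the same order as the inputs
    return await asyncio.gather(*(get_request(u) for u in urls))


start = time.time()
results = asyncio.run(main(['1.com', '2.com', '3.com']))
elapsed = time.time() - start
print(results)
print('Total time:', elapsed)
```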


Binding a callback to a task object

func must take one parameter, which is the task object the callback is bound to
parameter.result(): the return value of the special function wrapped by the task object

task.add_done_callback(parse)

import asyncio
import time
# Define the special function
async def get_request(url):
    print('Requesting:',url)
    await asyncio.sleep(3)
    print('Request finished:',url)
    return 'bobo'

def parse(task):
    print('i am task callback()!!!=----',task.result())


# One coroutine object will be created per url (3 in total)
urls = [
    '1.com','2.com','3.com'
]

start = time.time()
# Task list: holds multiple task objects
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    # Bind the callback
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print('Total time:',time.time()-start)
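
Because the callback receives only the task object, it does not know which url the task belongs to. One way to pass extra data, not shown in the original, is functools.partial: the bound arguments come first and the task is still passed last. A sketch with a hypothetical parsed dict collecting the results:

```python
import asyncio
from functools import partial

parsed = {}  # url -> return value of the special function


async def get_request(url):
    await asyncio.sleep(0.1)
    return 'page of ' + url


def parse(url, task):
    # url was bound by partial(); the task object is passed by asyncio
    parsed[url] = task.result()


async def main(urls):
    tasks = []
    for url in urls:
        task = asyncio.ensure_future(get_request(url))
        task.add_done_callback(partial(parse, url))
        tasks.append(task)
    await asyncio.wait(tasks)


asyncio.run(main(['1.com', '2.com', '3.com']))
print(parsed)
```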


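Multi-task asynchronous crawler (key point)

- Key point: the requests module does not support async, so it must not be used inside multi-task asynchronous coroutines

- aiohttp
    - Concept: a network request module that supports async
    - Coding workflow:
        - Write the basic skeleton:
                with aiohttp.ClientSession() as s:
                    with s.get(url) as response:
                        page_text = response.text()
                        return  page_text
        - Fill in the details:
            - Add the async keyword
                - Put async in front of every with
            - Add the await keyword
                - Put await in front of every blocking step
                    - sending the request
                    - reading the response data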
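
The note above rules out requests inside a coroutine because the call blocks the event loop. If a blocking library has to be kept, one workaround that the original does not cover is to run the blocking call in a thread pool with loop.run_in_executor(). In this sketch blocking_fetch is a hypothetical stand-in for something like requests.get(url).text, simulated with time.sleep() so that it runs without a network.

```python
import asyncio
import time


def blocking_fetch(url):
    # Hypothetical stand-in for a blocking call such as requests.get(url).text
    time.sleep(0.2)
    return 'page of ' + url


async def get_request(url):
    loop = asyncio.get_running_loop()
    # Run the blocking function in the default thread pool so the
    # event loop stays free to drive the other tasks
    return await loop.run_in_executor(None, blocking_fetch, url)


async def main(urls):
    return await asyncio.gather(*(get_request(u) for u in urls))


start = time.time()
pages = asyncio.run(main(['1.com', '2.com', '3.com']))
elapsed = time.time() - start
print(pages, elapsed)
```

aiohttp remains the approach this article uses; the thread pool only helps when the blocking code cannot be replaced.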

import asyncio
import requests
import time
import aiohttp
from lxml import etree

# Special function: send a request and fetch the page source
# async def get_request(url):
#     # requests is a module that does not support async
#     page_text = requests.get(url).text
#     return page_text


async def get_request(url):
    async with aiohttp.ClientSession() as s:
        # get/post also accept proxy = 'http://ip:port'
        # url, headers, data/params work the same way as in requests
        async with await s.get(url) as response:
            page_text = await response.text()  # text(): response data as a string; read(): as bytes
            return  page_text

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    print(tree.xpath('/html/body/div[2]/div[3]/div[3]/h3/a/@href'))

urls = [
    'https://www.huya.com/?ptag=gouzai&rso=huya_h5_395',
]

start = time.time()

tasks = []  # task list
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    # Bind the callback: used for data parsing
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print('Total time:',time.time()-start)

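selenium

- Concept: a module for browser automation.
- Appium is the counterpart module for automating mobile phones.
- How selenium relates to crawlers
    - Conveniently scrapes dynamically loaded data
        - What you see is what you get
    - Conveniently implements simulated login
- Basic usage:
    - Environment setup
        - pip install selenium
        - Download the browser's driver program
            - http://chromedriver.storage.googleapis.com/index.html
            - Mapping between browser versions and driver versions: https://blog.csdn.net/huilan_same/article/details/51896672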

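Basic usage of selenium

# Note: these listings use the Selenium 3 API (find_element_by_*, executable_path);
# Selenium 4 replaces find_element_by_* with find_element(By.<strategy>, value)
from selenium import webdriver # webdriver: drives an external browser
import time
# Instantiate a browser object
bro = webdriver.Chrome(executable_path='chromedriver.exe')

# Send a request through the browser
bro.get('https://www.jd.com/')

# Product search
# Locate the tag
search_input = bro.find_element_by_id('key')
# Type data into the located tag
search_input.send_keys('襪子')  # search term: socks
# Locate the search button
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
time.sleep(2)
btn.click()        # click it
time.sleep(2)
# Scroll the page (js injection)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')



time.sleep(6)

# bro.quit() # close the browser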


Capturing dynamically loaded data

from selenium import webdriver
import time
from lxml import etree
# Instantiate a browser object
bro = webdriver.Chrome(executable_path='chromedriver.exe')

bro.get('https://www.fjggfw.gov.cn/Website/JYXXNew.aspx')
time.sleep(1)

# page_source: the complete source of the current page
page_text = bro.page_source

# Holds the page source of the first page plus the pages reached by clicking "next"
all_page_text = [page_text]

for i in range(3):
    next_page_btn = bro.find_element_by_xpath('//*[@id="kkpager"]/div[1]/span[1]/a[7]')
    next_page_btn.click()
    time.sleep(1)
    all_page_text.append(bro.page_source)

for page_text in all_page_text:
    tree = etree.HTML(page_text)
    title = tree.xpath('//*[@id="list"]/div[1]/div/h4/a/text()')[0]
    print(title)


Action chains

from selenium import webdriver
from selenium.webdriver import ActionChains # the action chain class
from time import sleep
bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
sleep(1)


bro.switch_to.frame('iframeResult') # the argument is the id attribute of the iframe tag
div_tag = bro.find_element_by_id('draggable')

# Use an action chain to perform a drag
action = ActionChains(bro)
# Click and hold
action.click_and_hold(div_tag)

for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(20,20).perform()
    sleep(0.5)


sleep(3)
bro.quit()


Headless Chrome

bro.save_screenshot('./123.jpg')   # the screenshot is saved in the current directory

- Headless browsers
    - phantomjs
    - Headless Chrome (recommended)

from selenium import webdriver
from time import sleep


from selenium.webdriver.chrome.options import Options
# Create an options object that makes chrome start without a UI
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')


bro = webdriver.Chrome(executable_path='chromedriver.exe',options=chrome_options)
bro.get('https://www.taobao.com/')
bro.save_screenshot('./123.jpg')   # the screenshot is saved in the current directory

print(bro.page_source)


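Evading detection

- How to reduce the risk of selenium being detected
    - A website can use the value of window.navigator.webdriver to tell whether selenium is in use
        - undefined: normal browser
        - true: selenium

from selenium import webdriver
from time import sleep

# Evade detection
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])


# The first argument is the path to your browser driver; the r'' prefix makes it
# a raw string so that backslashes in the path are not treated as escapes
bro = webdriver.Chrome(r'chromedriver.exe',options=option)

bro.get('https://www.taobao.com/')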


12306 simulated login

from ChaoJiYing import Chaojiying_Client

from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep
# Requires PIL or Pillow for image processing
from PIL import Image


def transformCode(imgPath,imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im,imgType)['pic_str']



bro = webdriver.Chrome(executable_path='chromedriver.exe')

bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(2)

bro.save_screenshot('main.png')

# Crop the captcha image out of main.png
img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')

location = img_tag.location # coordinates of the top-left corner
size = img_tag.size # width and height of the image behind the img tag
# Crop box: (left, upper, right, lower)
rangle = (int(location['x']),int(location['y']),int(location['x']+size['width']),int(location['y']+size['height']))

i = Image.open('./main.png')
frame = i.crop(rangle)
frame.save('code.png')

result = transformCode('./code.png',9004)

#260,140|260,139  ==> [[260,140],[260,139]]
all_list = []#[[260,140],[260,139]]
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)



for xy in all_list:
    x = xy[0]
    y = xy[1]
    ActionChains(bro).move_to_element_with_offset(img_tag,x,y).click().perform()
    sleep(1)
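
The coordinate parsing in the listing above can be condensed, since str.split('|') already returns a one-element list when no '|' is present. A sketch with a hypothetical helper name, assuming the result string always has the 'x,y|x,y' shape shown in the comment:

```python
def parse_click_points(result):
    """Turn 'x1,y1|x2,y2' (or a single 'x,y') into [[x1, y1], [x2, y2]]."""
    return [[int(v) for v in pair.split(',')] for pair in result.split('|')]


print(parse_click_points('260,140|260,139'))
print(parse_click_points('77,45'))
```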


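Summary:

- Multi-task asynchronous coroutines
    - Special function
        - When it is called, the statements in its body are not executed immediately
        - The call returns a coroutine object
    - Coroutine object: equivalent to the special function
    - Task object: higher-level coroutine object == coroutine == special function
        - Binding a callback: task.add_done_callback(parse)
            - The callback always runs after the task object has finished
            - Mainly used for data parsing
    - Event loop object:
        - Multiple tasks need to be registered/placed in it
        - Once the event loop is started, it executes all registered task objects asynchronously
    - Suspension: when registering a task list with the event loop, wrap it with asyncio.wait(tasks)
    - Caveat: the body of a special function must not contain code from modules that do not support async
        - requests cannot be used in this mode
    - aiohttp: a network request module that supports async
        - Basic skeleton: built with with
        - Fill in the details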

import aiohttp
import asyncio

async def get_request(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            page_text = await response.text()#read()
            return page_text

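- selenium
    - Concept
    - How does it relate to crawlers?
        - Conveniently captures dynamically loaded data (what you see is what you get)
        - Implements simulated login
    - Basic usage:
        - Instantiate a browser object (driver program)
        - get()
        - find family of functions: used to locate tags
        - send_keys(): types data into a tag
        - click
        - execute_script(js): js injection
        - page_source: returns the page source
        - switch_to.frame()
        - save_screenshot()
    - Action chains: ActionChains
    - Headless browsers:
        - phantomjs
        - Headless Chrome
    - Evading detection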

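Scraping air-quality data

- Analysis
    - 1. Changing the search conditions on the page lets the packet-capture tool capture the data packet we want
    - 2. The aqistudyapi.php packet is the one that carries the data to be scraped
        - The url and the request parameter can be extracted from it (the parameter may be a ciphertext, and it changes dynamically)
        - The response data is encrypted ciphertext
    - 3. After the query conditions are changed and the query button is clicked, an ajax request is sent, and that request fetches the aqistudyapi.php packet
        - The data we want to capture is produced by clicking the search button
    - 4. Firefox's developer tools reveal the event handler bound to the search button's click event (getData())
    - 5. Analysing getData(): no ajax request code is found inside its implementation
        - The variable type can be HOUR
        - getAQIData();getWeatherData();
    - 6. Analysing getAQIData();getWeatherData();
        - The two implementations are identical except for the value assigned to the method variable
            - method = (GETDETAIL or GETCITYWEATHER)
            - Neither implementation contains ajax request code, but both call a function named getServerData, so the ajax request code must live in the implementation of getServerData
            - getServerData(method, param, anonymous function, 0.5)
                - method = (GETDETAIL or GETCITYWEATHER)
                - param is a dict containing four key-value pairs (city, type, startTime, endTime)
    - 7. Analysing the implementation of getServerData:
        - The implementation was located with the capture tool's global search, but its js code is encrypted; this form of encryption is called js obfuscation.
        - How to defeat js obfuscation?
            - Deobfuscate the js at http://www.bm8.com.cn/jsConfusion/
        - The ajax request code is finally found inside this implementation:
            - the url of the ajax request
            - the ajax request method
            - the source of the request parameter: getParam(method, param)
            - decryption of the encrypted response data: decodeData(ciphertext)
    - 8. Simulating the execution of js code from python
        - The PyExecJS module lets python simulate the execution of js code
        - Environment setup:
            - pip install PyExecJS
            - Install the nodejs development environment on the local machine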


Obtaining the dynamically changing, encrypted request parameter of the ajax request

# Obtain the dynamically changing, encrypted request parameter of the ajax request (d: xxx)
import execjs
node = execjs.get()

# Params
method = 'GETCITYWEATHER'
city = '北京'
type = 'HOUR'
start_time = '2018-01-25 00:00:00'
end_time = '2018-01-25 23:00:00'

# Compile javascript
file = 'test.js'
ctx = node.compile(open(file,encoding='utf-8').read())

# Get params
js = 'getPostParamCode("{0}", "{1}", "{2}", "{3}", "{4}")'.format(method, city, type, start_time, end_time)
params = ctx.eval(js)
print(params)


Sending the request with the captured request parameter

# Send the request with the captured request parameter
import execjs
import requests

node = execjs.get()

# Params
method = 'GETCITYWEATHER'
city = '北京'
type = 'HOUR'
start_time = '2018-01-25 00:00:00'
end_time = '2018-01-25 23:00:00'

# Compile javascript
file = 'test.js'
ctx = node.compile(open(file,encoding='utf-8').read())

# Get params
js = 'getPostParamCode("{0}", "{1}", "{2}", "{3}", "{4}")'.format(method, city, type, start_time, end_time)
params = ctx.eval(js)

# Send the post request
url = 'https://www.aqistudy.cn/apinew/aqistudyapi.php'
response_text = requests.post(url, data={'d': params}).text
print(response_text)


Decrypting the captured encrypted response data

# Decrypt the captured encrypted response data
import execjs
import requests

node = execjs.get()

# Params
method = 'GETDETAIL'
city = '北京'
type = 'HOUR'
start_time = '2018-01-25 00:00:00'
end_time = '2018-01-25 23:00:00'

# Compile javascript
file = 'test.js'
ctx = node.compile(open(file,encoding='utf-8').read())

# Get params
js = 'getPostParamCode("{0}", "{1}", "{2}", "{3}", "{4}")'.format(method, city, type, start_time, end_time)
params = ctx.eval(js)

# Send the post request
url = 'https://www.aqistudy.cn/apinew/aqistudyapi.php'
response_text = requests.post(url, data={'d': params}).text

# Decrypt the encrypted response data
js = 'decodeData("{0}")'.format(response_text)
decrypted_data = ctx.eval(js)
print(decrypted_data)
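
The listings build the js expression with str.format and hand-written double quotes, which breaks if a value ever contains a quote, backslash or newline. A more defensive way to build the same expression, sketched here with a hypothetical helper js_call that is not part of the original, is to quote each argument with json.dumps; the resulting string could then be passed to ctx.eval() as before.

```python
import json


def js_call(func_name, *args):
    """Build a JavaScript call expression with safely quoted arguments."""
    # json.dumps() escapes quotes, backslashes and newlines, and with its
    # default settings the output is also a valid JavaScript literal
    return '{}({})'.format(func_name, ', '.join(json.dumps(a) for a in args))


print(js_call('decodeData', 'abc"def'))
print(js_call('getPostParamCode', 'GETDETAIL', 'HOUR'))
```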
