1. Introduction

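Setting up a Selenium environment is fairly tedious, and some websites detect Selenium and WebDriver and block them as an anti-scraping measure. This post therefore introduces an alternative, Pyppeteer.

Pyppeteer runs on top of the Chromium browser. If Chromium is not installed when Pyppeteer is first run, the library downloads and configures it automatically, which removes most of the environment setup. Pyppeteer is also built on Python's async/await support, so its operations can run asynchronously, which can make it more efficient than Selenium.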
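
To illustrate why an async design can help, here is a minimal sketch that uses only the standard library's asyncio. The coroutine fake_page_load is a hypothetical stand-in for a real page visit (it only sleeps), and the example.com URLs are placeholders; nothing here calls Pyppeteer itself.

```python
import asyncio
import time


async def fake_page_load(url, delay=0.2):
    # Hypothetical stand-in for a browser page load: it only sleeps.
    await asyncio.sleep(delay)
    return f'loaded {url}'


async def crawl(urls):
    # Start all simulated loads at once and wait for every one to finish.
    started = time.perf_counter()
    results = await asyncio.gather(*(fake_page_load(u) for u in urls))
    return results, time.perf_counter() - started


def run_demo():
    urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
    return asyncio.run(crawl(urls))
```

Because the three simulated loads overlap, the total time stays close to one delay rather than three. Real pages would be loaded by awaiting Pyppeteer calls in place of the sleep.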

2. Installation

pip3 install pyppeteer

3. Quick start

# coding:utf-8
import asyncio
from pyppeteer import launch


async def main():
    # Create the browser object
    browser = await launch(headless=False, args=['--disable-infobars'])

    # Open a new tab
    page = await browser.newPage()

    # Set the viewport size
    await page.setViewport({'width': 1366, 'height': 768})

    # Set the User-Agent
    await page.setUserAgent('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36')

    # Visit the page
    response = await page.goto('https://www.baidu.com')

    # Get the response status, headers, and URL
    print(response.status)
    print(response.headers)
    print(response.url)

    # Get the title of the current page
    print(await page.title())

    # Get the content of the current page
    print(await page.content())  # returned as a string
    # print(await response.text())

    # Cookie operations
    print(await page.cookies())  # get cookies: [{'name': xx, 'value': xxx, ...}, ...]
    # page.deleteCookie() deletes cookies
    # page.setCookie() sets cookies

    # Locate elements
    # 1. Match a single element (CSS selector)
    # element = await page.querySelector('#s-top-left > a')
    # 2. Match all elements (CSS selector)
    elements = await page.querySelectorAll('#s-top-left > a:nth-child(2n)')
    # 3. XPath
    # elements = await page.xpath('//div[@id="s-top-left"]/a')
    for element in elements:
        print(await (await element.getProperty('textContent')).jsonValue())  # get the text content
        print(await (await element.getProperty('href')).jsonValue())  # get the href attribute

    # Simulate typing and clicking
    await page.type('#kw', '中國', {'delay': 1000})  # type with a 1000 ms delay between keystrokes
    await asyncio.sleep(2)
    await page.click('#su')  # click; you can also locate the element first, then await element.click()
    await asyncio.sleep(2)

    # Run JavaScript to scroll to the bottom of the page
    await page.evaluate('window.scrollTo(0,document.body.scrollHeight);')

    # Take a screenshot
    await page.screenshot({'path': 'baidu.png'})

    await asyncio.sleep(5)
    await browser.close()  # close the browser

asyncio.get_event_loop().run_until_complete(main())

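4. Main operations

Launching the browser
- Call the launch method. Its main parameters are:
  - ignoreHTTPSErrors (bool): whether to ignore HTTPS errors. Defaults to False.
  - headless (bool): whether to run in headless mode. Defaults to True. If devtools is True, this parameter is set to False.
  - executablePath (str): path to a browser executable. When set, the default bundled Chromium is not used, so you can point to an existing Chrome or Chromium install.
  - args (List[str]): extra command-line arguments passed to the browser process.
  - slowMo (int|float): slows down each Pyppeteer operation by the given number of milliseconds.
  - userDataDir (str): path to a user data directory, which preserves settings and browsing state between runs.
  - devtools (bool): whether to open DevTools automatically for each page. Defaults to False. If True, headless is forced to False.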
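
The parameters above can be gathered in one place and passed to launch as keyword arguments. The following is a minimal sketch: build_launch_options and open_browser are hypothetical helper names, and the --window-size flag is a Chromium command-line switch that is not mentioned in the original text.

```python
def build_launch_options(headless=True, width=1366, height=768,
                         user_data_dir=None, slow_mo=None):
    """Collect launch() keyword arguments in one dict (hypothetical helper)."""
    options = {
        'headless': headless,
        # --window-size is an assumption added for illustration.
        'args': ['--disable-infobars', f'--window-size={width},{height}'],
    }
    if user_data_dir is not None:
        options['userDataDir'] = user_data_dir
    if slow_mo is not None:
        options['slowMo'] = slow_mo
    return options


async def open_browser(**overrides):
    # Imported lazily so build_launch_options works without pyppeteer installed.
    from pyppeteer import launch
    return await launch(**build_launch_options(**overrides))
```

Inside a coroutine, this could be used as browser = await open_browser(headless=False, slow_mo=50).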

Hiding the "Chrome is being controlled by automated test software" infobar

browser = await launch(headless=False, args=['--disable-infobars'])

Setting the viewport size

width, height = 1366, 768
await page.setViewport({'width': width, 'height': height})

Setting the User-Agent
```
await page.setUserAgent('xxx')
```

Running JavaScript: call the page.evaluate() method

await page.evaluate('window.scrollTo(0,document.body.scrollHeight);')
# scroll to the bottom of the page

Evading webdriver detection

import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False, args=['--disable-infobars'])
    page = await browser.newPage()
    await page.goto('https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/')
    await page.evaluate(
        '''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')
    await asyncio.sleep(100)

asyncio.get_event_loop().run_until_complete(main())
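
One limitation of calling page.evaluate after page.goto is that the site's own scripts may already have read navigator.webdriver by then, and the override does not carry over to later navigations. A possible variation, sketched below, registers the same override with page.evaluateOnNewDocument before navigating, so it runs ahead of the page's scripts on each new document. The names HIDE_WEBDRIVER_JS and goto_with_webdriver_patch are hypothetical, and this changes only one property, so it is not guaranteed to get past any particular site's checks.

```python
# JavaScript that overrides navigator.webdriver in the page.
HIDE_WEBDRIVER_JS = '''() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
}'''


async def goto_with_webdriver_patch(page, url):
    """Register the override first, then navigate (hypothetical helper)."""
    # Registered scripts run in every new document before the site's scripts.
    await page.evaluateOnNewDocument(HIDE_WEBDRIVER_JS)
    return await page.goto(url)
```

Here page is a Pyppeteer page object obtained from browser.newPage(), as in the example above.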

Simulating typing and clicking

await page.type(selector, text, {"delay":100})  # type the text with a 100 ms delay between characters
await asyncio.sleep(2)
await page.click(selector)  # simulate a click
await asyncio.sleep(2)
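
Fixed sleeps either waste time or turn out too short on a slow connection. As a hedged alternative, the sketch below waits for specific elements with page.waitForSelector instead. The helper name type_and_click and its parameters are hypothetical, and result_selector must be a selector that you know appears only after the click has taken effect.

```python
async def type_and_click(page, input_selector, text, button_selector,
                         result_selector, delay_ms=100, timeout_ms=10000):
    """Type into a field, click a button, then wait for the result (hypothetical helper)."""
    # Wait until the input exists instead of sleeping for a fixed time.
    await page.waitForSelector(input_selector, {'timeout': timeout_ms})
    await page.type(input_selector, text, {'delay': delay_ms})
    await page.click(button_selector)
    # Wait until an element that signals completion shows up.
    await page.waitForSelector(result_selector, {'timeout': timeout_ms})
```

If the awaited element never appears, waitForSelector raises a timeout error after timeout_ms milliseconds, which makes failures visible rather than silent.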

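Mouse operations

await page.hover(selector)  # move the mouse over an element
await page.mouse.down()  # press the mouse button
await page.mouse.move(2000, 0, {'steps': 30})  # move the mouse in 30 intermediate steps (move() takes 'steps', not 'delay')
await page.mouse.up()  # release the mouse button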
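
The press, move, release sequence above is the basis of a drag, for example on a slider. A minimal sketch follows, assuming a purely horizontal drag along evenly spaced points; drag_path and drag_horizontally are hypothetical helper names, and the coordinates are viewport pixels that you would need to measure for the real page.

```python
def drag_path(start_x, end_x, y, steps=20):
    """Evenly spaced points from start_x to end_x at height y (hypothetical helper)."""
    return [(start_x + (end_x - start_x) * i / steps, y)
            for i in range(1, steps + 1)]


async def drag_horizontally(page, start_x, end_x, y, steps=20):
    """Press at the start point, move through each point, then release."""
    await page.mouse.move(start_x, y)
    await page.mouse.down()
    for x, point_y in drag_path(start_x, end_x, y, steps):
        await page.mouse.move(x, point_y)
    await page.mouse.up()
```

Moving through many small points rather than one jump produces a sequence of mouse events, which is closer to what a real drag generates.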

Locating elements and reading their text content and attribute values

page.querySelector(selector)  # matches only the first element

element = await page.querySelector('#s-top-left > a')

print(await (await element.getProperty('textContent')).jsonValue())  # get the text content
print(await (await element.getProperty('href')).jsonValue())  # get the href attribute

page.querySelectorAll(selector)  # CSS selector, returns all matching elements

elements = await page.querySelectorAll('#s-top-left > a:nth-child(2n)')
for element in elements:
    print(await (await element.getProperty('textContent')).jsonValue())  # get the text content
    print(await (await element.getProperty('href')).jsonValue())  # get the href attribute

page.xpath(expression)  # XPath

elements = await page.xpath('//div[@id="s-top-left"]/a')
for element in elements:
    print(await (await element.getProperty('textContent')).jsonValue())  # get the text content
    print(await (await element.getProperty('href')).jsonValue())  # get the href attribute

page.Jeval(selector, pageFunction)  # locates an element and runs a JS function on it

print(await page.Jeval('#s-top-left > a:first-child', 'node => node.textContent'))  # get the text content
print(await page.Jeval('#s-top-left > a:first-child', 'node => node.href'))  # get the href attribute
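
When many elements are involved, calling getProperty and jsonValue for each one costs several round trips to the browser per element. A possible alternative is page.querySelectorAllEval, which runs one JS function over all matches and returns the serialized result. The sketch below assumes the matched elements are links; LINKS_JS and collect_links are hypothetical names.

```python
# Maps every matched node to a plain object that can be serialized back to Python.
LINKS_JS = 'nodes => nodes.map(n => ({text: n.textContent, href: n.href}))'


async def collect_links(page, selector):
    """Return (text, href) pairs for all elements matching selector (hypothetical helper)."""
    links = await page.querySelectorAllEval(selector, LINKS_JS)
    return [(link['text'].strip(), link['href']) for link in links]
```

For instance, await collect_links(page, '#s-top-left > a') would return the text and target of each link matched by that selector.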

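Working with frames

page.frames returns a list of all frames in the page. Each frame is operated on in much the same way as page.

page.mainFrame returns the main frame of the current page.

frame_list = page.frames  # get all frames

# Get the title of the current page; both calls give the same result
print(await frame_list[0].title())
print(await page.mainFrame.title())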
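
In practice the frame you want is rarely the first one, so it helps to pick a frame by its URL. A minimal sketch, assuming the target frame's URL contains a known substring; find_frame is a hypothetical helper name.

```python
def find_frame(page, url_part):
    """Return the first frame whose URL contains url_part, or None (hypothetical helper)."""
    return next((frame for frame in page.frames if url_part in frame.url), None)
```

The returned frame can then be used like page, for example await frame.title(), after checking that it is not None.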
