Crawling dynamic pages with phantomjs + selenium

Reposted article. 2016-10-18 00:37. Tags: crawler / Phantom / dynamic pages / Selenium / Python

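I used to crawl dynamic pages by driving a browser with selenium + firefox. Firefox updates frequently, though, and after an update webdriver often fails to start, so I switched to phantomjs + selenium.
Using phantomjs is not much different from using a regular browser.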

1. First, download PhantomJS

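PhantomJS supports all the mainstream platforms; see the download page. Choose a directory to put it in, for example D:\phantomjs.
The phantomjs executable is in the bin directory, so you can add D:\phantomjs\bin to the PATH environment variable. If you do not add it to PATH, selenium has to be given the executable's path explicitly when it drives phantomjs.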
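
A quick way to tell which of the two cases applies is to check whether the executable can be found on PATH. The following is a minimal sketch using only the Python standard library; the function name `find_phantomjs` and its `fallback_dir` parameter are made up for illustration, and the default directory is simply the D:\phantomjs\bin example from this article.

```python
import os
import shutil


def find_phantomjs(fallback_dir=r"D:\phantomjs\bin"):
    """Return a path to the phantomjs executable, or None if it is not found.

    find_phantomjs and fallback_dir are hypothetical names for this sketch.
    """
    # shutil.which searches the directories listed in the PATH variable
    found = shutil.which("phantomjs")
    if found:
        return found
    # Not on PATH: look in the assumed install directory instead
    for name in ("phantomjs.exe", "phantomjs"):
        candidate = os.path.join(fallback_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_phantomjs()
    print(path or "phantomjs not found; an explicit executable_path is needed")
```

If the function returns a path that did not come from PATH, that path is the value to pass as `executable_path` when creating the driver.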

2. Driving PhantomJS from Selenium

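from selenium import webdriver
from selenium.common.exceptions import TimeoutException

## Optional: configure PhantomJS
#cap = webdriver.DesiredCapabilities.PHANTOMJS.copy()    # start from webdriver's default PhantomJS capabilities
#cap["phantomjs.page.settings.resourceTimeout"] = 5000    # resource load timeout, in milliseconds
#cap["phantomjs.page.settings.loadImages"] = False    # whether to load images
#driver = webdriver.PhantomJS(desired_capabilities=cap)

# If phantomjs is not on PATH, pass its location explicitly
# (raw string, so that "\b" is not read as a backspace escape)
#driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs\bin\phantomjs.exe")
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)    # page load timeout, in seconds
#driver.set_script_timeout(5)    # script timeout, in seconds; exceeding either timeout raises TimeoutException

## Stop loading the page once the timeout is reached.
## Some pages keep loading other resources long after the data you want has arrived.
try:
    driver.get("http://www.tvmao.com")    # the URL must include the scheme
except TimeoutException:
    driver.execute_script("window.stop()")

## With the page source in hand, it can be saved and then parsed
page_source = driver.page_source    # page_source is a property, not a method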

############
#
# Data parsing section
#
############
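
The data parsing step is left open above. As one possible illustration, the sketch below pulls the page title and link targets out of an HTML string with Python's standard `html.parser`. The class and function names (`LinkAndTitleParser`, `parse_page`) and the sample HTML are assumptions made for this example; in practice the string would be the `page_source` obtained from the driver, and the fields worth extracting depend on the target site.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(page_source):
    """Return (title, links) extracted from an HTML string."""
    parser = LinkAndTitleParser()
    parser.feed(page_source)
    parser.close()
    return parser.title.strip(), parser.links


# Stand-in for driver.page_source so the sketch runs without a browser
sample = (
    "<html><head><title> Demo </title></head><body>"
    "<a href='/a'>A</a><a>no href</a><a href='/b'>B</a>"
    "</body></html>"
)
title, links = parse_page(sample)
print(title, links)
```

For heavier parsing work a dedicated HTML library may be more convenient; the standard-library parser is used here only to keep the sketch free of extra dependencies.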

For the options that PhantomJS can be configured with, see the official documentation.
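
The two settings used in this article share the key prefix `phantomjs.page.settings.`, so setting several of them can be wrapped in a small helper. This is only a sketch: `phantomjs_capabilities` is a hypothetical name, and it builds a plain dictionary without touching Selenium, so whether a given setting name has any effect depends on PhantomJS itself.

```python
def phantomjs_capabilities(base=None, **page_settings):
    """Return a new capabilities dict with PhantomJS page settings applied.

    phantomjs_capabilities is a hypothetical helper for this sketch.
    base is an optional dict of existing capabilities; it is not modified.
    """
    caps = dict(base or {})
    for name, value in page_settings.items():
        # Same key format as used in this article's configuration lines
        caps["phantomjs.page.settings." + name] = value
    return caps


caps = phantomjs_capabilities(resourceTimeout=5000, loadImages=False)
print(caps)
```

With a Selenium version that still includes the PhantomJS driver, `webdriver.DesiredCapabilities.PHANTOMJS` could be passed as `base` and the result handed to `webdriver.PhantomJS(desired_capabilities=caps)`, as in the commented-out configuration lines of the script.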
