Python Crawler Notes: Scraping Content with PhantomJS + Selenium

Reposted article, 2017-02-26 19:30

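Most websites can be scraped with Python's BeautifulSoup alone. Some sites, however, only fill in their data after the JavaScript on the page has run, and for those a PhantomJS + Selenium combination can be used to fetch the data. I ran into exactly this problem while learning, so I am writing it down here.

The example site I use is China Weather (http://www.weather.com.cn/weather40d/101020100.shtml).

What I want to scrape is the daily maximum temperature from Shanghai's 40-day forecast. So I first tried the ordinary approach: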

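from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the HTML exactly as the server sends it (no JavaScript is executed)
html = urlopen('http://www.weather.com.cn/weather40d/101020100.shtml')
# Name the parser explicitly so the result does not depend on which parsers are installed
html_parse = BeautifulSoup(html, 'html.parser')
# Collect every <span class="max"> element
temp = html_parse.find_all("span", {"class": "max"})
print(temp)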

But print(temp) printed only empty tags, with no temperature values in them: [, ...... ]
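
One way to confirm this diagnosis is to pull out only the text of each matched tag and see whether anything is there. The sketch below is an illustration, not the code used in this article: it relies only on the standard library so it runs without extra installs, and the names MaxSpanParser and extract_max_temps as well as the two sample HTML strings are made up for the example. With BeautifulSoup, the same check would be calling get_text() on each tag in temp.

```python
from html.parser import HTMLParser


class MaxSpanParser(HTMLParser):
    """Collect the text inside every <span class="max"> element."""

    def __init__(self):
        super().__init__()
        self.values = []   # one string per matching span
        self._depth = 0    # greater than 0 while inside a matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # A span nested inside a match: track it so the closing tags balance.
            self._depth += 1
        elif "max" in classes:
            self._depth = 1
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.values[-1] += data.strip()


def extract_max_temps(html):
    """Return the text of each <span class="max"> found in the HTML string."""
    parser = MaxSpanParser()
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical markup: what a server might send before any JavaScript runs ...
raw_html = '<ul><li><span class="max"></span></li><li><span class="max"></span></li></ul>'
# ... and what the page might contain once the script has filled it in.
rendered_html = '<ul><li><span class="max">9</span></li><li><span class="max">12</span></li></ul>'

print(extract_max_temps(raw_html))       # ['', '']
print(extract_max_temps(rendered_html))  # ['9', '12']
```

If every extracted string is empty while the browser shows numbers, the values are most likely written into the page by a script after loading.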

From this I concluded that the data only becomes available after the JavaScript has run, so I switched to PhantomJS + Selenium to get it. The code is as follows:

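from bs4 import BeautifulSoup
from selenium import webdriver  # webdriver.PhantomJS needs Selenium 3.x or earlier; it was removed in Selenium 4
import time

driver = webdriver.PhantomJS(executable_path='F:\\python\\phantomjs-2.1.1-windows\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
driver.get("http://www.weather.com.cn/weather40d/101020100.shtml")
time.sleep(3)  # give the page's JavaScript time to fill in the data
pageSource = driver.page_source  # rendered page source as a string
html_parse = BeautifulSoup(pageSource, 'html.parser')  # explicit parser
temp = html_parse.find_all("span", {"class": "max"})
print(temp)
driver.quit()  # shut down the PhantomJS process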
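
The fixed time.sleep(3) in the script above is a guess: on a slow connection it may be too short, and on a fast one it wastes time. A common alternative is to poll until the data actually shows up. The sketch below shows the idea with the standard library only, so it can run without a browser. The names wait_until, FakeDriver and loaded_source are hypothetical, and FakeDriver merely imitates a page whose data appears on the third read. Selenium itself ships WebDriverWait (in selenium.webdriver.support.ui) for this purpose.

```python
import re
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakeDriver:
    """Hypothetical stand-in for a WebDriver: the data appears on the third read."""

    def __init__(self):
        self._reads = 0

    @property
    def page_source(self):
        self._reads += 1
        if self._reads < 3:
            return '<span class="max"></span>'
        return '<span class="max">9</span>'


driver = FakeDriver()


def loaded_source():
    """Return the page source once a max span contains a number, else None."""
    source = driver.page_source
    return source if re.search(r'class="max">\s*-?\d', source) else None


html = wait_until(loaded_source, timeout=5, interval=0.1)
print(html)  # <span class="max">9</span>
```

With a real driver, the same loaded_source check would read driver.page_source from the browser, and the returned string could be handed to BeautifulSoup as before.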

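This code creates a new Selenium WebDriver and uses it to load the page. The page's JavaScript needs time to run, so we give it 3 seconds (time.sleep(3)). After that, since I personally prefer BeautifulSoup and the WebDriver's page_source property returns the page source as a string, lines 8 and 9 of the code hand that string back to the familiar BeautifulSoup to parse the page content. The program's final output is: [9, 9...... 12, 12, , , , , , , ], so the data can basically be retrieved.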
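
The output above still ends with several blank entries, and the values are text rather than numbers. A small cleanup step can turn them into integers and drop the blanks. This is a minimal sketch under assumptions: to_temperatures is a hypothetical helper, and the scraped list is sample data shaped like the output above, not real results. With BeautifulSoup, the input strings could come from [tag.get_text() for tag in temp].

```python
def to_temperatures(texts):
    """Convert scraped strings to ints, skipping blanks and non-numeric text."""
    temps = []
    for text in texts:
        try:
            temps.append(int(text.strip()))
        except ValueError:
            # Empty placeholder or unexpected content: leave it out.
            continue
    return temps


# Hypothetical sample shaped like the output above: numbers, then blanks.
scraped = ["9", "9", "12", "12", "", "", ""]
print(to_temperatures(scraped))  # [9, 9, 12, 12]
```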

This example is fairly simple, but the core idea stays the same however the details vary. More advanced techniques are left for further study.

If anything in this article is wrong or inappropriate, please point it out so that we can learn and improve together.
