Python Web Scraping for Beginners (39): Getting Started with scrapy-splash, a JavaScript Rendering Service

Reposted from the original article, 2020-01-14 08:47

Life is short, I use Python

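Previous posts in this series:

Python Web Scraping for Beginners (1): Introduction

Python Web Scraping for Beginners (2): Preparation (1) Installing the Basic Libraries

Python Web Scraping for Beginners (3): Preparation (2) Linux Basics

Python Web Scraping for Beginners (4): Preparation (3) Docker Basics

Python Web Scraping for Beginners (5): Preparation (4) Database Basics

Python Web Scraping for Beginners (6): Preparation (5) Installing the Crawler Frameworks

Python Web Scraping for Beginners (7): HTTP Basics

Python Web Scraping for Beginners (8): Web Page Basics

Python Web Scraping for Beginners (9): Crawler Basics

Python Web Scraping for Beginners (10): Session and Cookies

Python Web Scraping for Beginners (11): urllib Basics (1)

Python Web Scraping for Beginners (12): urllib Basics (2)

Python Web Scraping for Beginners (13): urllib Basics (3)

Python Web Scraping for Beginners (14): urllib Basics (4)

Python Web Scraping for Beginners (15): urllib Basics (5)

Python Web Scraping for Beginners (16): urllib in Practice: Scraping Meizitu

Python Web Scraping for Beginners (17): Requests Basics

Python Web Scraping for Beginners (18): Advanced Requests

Python Web Scraping for Beginners (19): XPath Basics

Python Web Scraping for Beginners (20): Advanced XPath

Python Web Scraping for Beginners (21): The Beautiful Soup Parsing Library (Part 1)

Python Web Scraping for Beginners (22): The Beautiful Soup Parsing Library (Part 2)

Python Web Scraping for Beginners (23): Getting Started with the pyquery Parsing Library

Python Web Scraping for Beginners (24): 2019 Douban Movie Rankings

Python Web Scraping for Beginners (25): Scraping Stock Information

Python Web Scraping for Beginners (26): Why You Can't Even Afford a Second-Hand Home in Shanghai

Python Web Scraping for Beginners (27): The Selenium Test Automation Framework, from Getting Started to Giving Up (Part 1)

Python Web Scraping for Beginners (28): The Selenium Test Automation Framework, from Getting Started to Giving Up (Part 2)

Python Web Scraping for Beginners (29): Using Selenium to Scrape Product Information from a Large E-commerce Site

Python Web Scraping for Beginners (30): Proxy Basics

Python Web Scraping for Beginners (31): Building a Simple Proxy Pool Yourself

Python Web Scraping for Beginners (32): Getting Started with the Asynchronous Request Library AIOHTTP

Python Web Scraping for Beginners (33): Scrapy Framework Basics (1)

Python Web Scraping for Beginners (34): Scrapy Framework Basics (2)

Python Web Scraping for Beginners (35): Scrapy Framework Basics (3) Selector

Python Web Scraping for Beginners (36): Scrapy Framework Basics (4) Downloader Middleware

Python Web Scraping for Beginners (37): Scrapy Framework Basics (5) Spider Middleware

Python Web Scraping for Beginners (38): Scrapy Framework Basics (6) Item Pipeline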

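Introduction

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python 3 on top of the Twisted and Qt libraries.

With it, we have another way to scrape dynamically rendered pages.

GitHub: https://github.com/scrapy-plugins/scrapy-splash

Splash documentation: http://splash.readthedocs.io

Features:

Process multiple web pages in parallel;
Get HTML results and/or take screenshots;
Turn off images or use Adblock Plus rules to make rendering faster;
Execute custom JavaScript in the page context;
Write Lua browsing scripts;
Develop Splash Lua scripts in Splash-Jupyter notebooks;
Get detailed rendering information in HAR format.

Installation

Installing Splash involves two parts. The first is the Splash service itself, which we run through Docker; once the container starts, a Splash service is up. The second is the scrapy-splash Python library, which lets Scrapy use the Splash service.

To run the Splash service in Docker, use the following command: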

docker run -p 8050:8050 scrapinghub/splash

If you see output like the following, the service has started successfully.

2020-01-10 13:09:41+0000 [-] Log opened.
2020-01-10 13:09:41.824978 [-] Xvfb is started: ['Xvfb', ':1196586140', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-01-10 13:09:42.153188 [-] Splash version: 3.4
2020-01-10 13:09:42.376664 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-01-10 13:09:42.376820 [-] Python 3.6.8 (default, Oct  7 2019, 12:59:55) [GCC 8.3.0]
2020-01-10 13:09:42.376898 [-] Open files limit: 1048576
2020-01-10 13:09:42.376965 [-] Can't bump open files limit
2020-01-10 13:09:42.394903 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-01-10 13:09:42.395050 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-01-10 13:09:42.594670 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-01-10 13:09:42.594909 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2020-01-10 13:09:42.595245 [-] Site starting on 8050
2020-01-10 13:09:42.595341 [-] Starting factory <twisted.web.server.Site object at 0x7f26e5414fd0>
2020-01-10 13:09:42.595541 [-] Server listening on http://0.0.0.0:8050

Now open http://localhost:8050 in a browser and you should see the Splash home page.

Next, install the scrapy-splash Python library. This part is simple and takes a single command:

pip install scrapy-splash
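
Installing the package does not by itself connect Splash to a Scrapy project. As a rough sketch, assuming Splash is reachable at the local address started above, the scrapy-splash README suggests adding settings along these lines to the project's settings.py. The priority numbers are the README's suggested values and may need adjusting for a particular project.

```python
# settings.py (excerpt): a sketch based on the scrapy-splash README.
# Assumption: Splash is reachable at the address used earlier in this post.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route requests through Splash and handle cookies.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter and HTTP cache storage that take Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With settings like these in place, a spider would typically yield scrapy_splash.SplashRequest objects instead of plain scrapy.Request objects.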

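Trying It Out

Open the Splash home page. The input box defaults to http://google.com; here we replace it with the Baidu home page and render it.

The result shows a rendered screenshot, HAR loading statistics, and the page source.

The HAR result shows that Splash went through the whole rendering process, including loading CSS and JavaScript, so the rendered page is essentially what we would get in a regular browser.

Click the Script button at the top to see a script like this: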

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

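This is a Lua script; Splash's entire rendering process is controlled by it.

Even without knowing Lua syntax, we can roughly guess what the code does: it first uses go() to visit the URL, then uses wait() to wait 0.5 seconds, and finally returns the page's HTML source, a PNG screenshot, and the HAR data.

Splash Lua API

Let's take a quick look at some of the built-in Splash Lua APIs. More are available in the documentation; here I mainly cover the ones related to page operations.

Splash Lua API documentation: https://splash.readthedocs.io/en/stable/scripting-overview.html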

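Navigation

splash:go - load a URL into the browser;
splash:set_content - load the specified content (usually HTML) into the browser;
splash:lock_navigation and splash:unlock_navigation - lock/unlock navigation;
splash:on_navigation_locked lets you inspect requests discarded after navigation was locked;
splash:set_user_agent lets you change the User-Agent header used for requests;
splash:set_custom_headers lets you set the default HTTP headers Splash uses;
splash:on_request lets you filter or replace requests to related resources; it also lets you set an HTTP or SOCKS5 proxy server per request;
splash:on_response_headers lets you filter out responses based on their headers (for example, based on Content-Type);
splash:init_cookies, splash:add_cookie, splash:get_cookies, splash:clear_cookies and splash:delete_cookies manage cookies.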

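Delays

splash:wait lets you wait for a specified amount of time;
splash:call_later schedules a task to run in the future;
splash:wait_for_resume lets you wait until a certain JS event happens;
splash:with_timeout lets you limit the time spent in a block of code.

Extracting information from a page

splash:html returns the page HTML content after the browser has rendered it;
splash:url returns the current URL loaded in the browser;
splash:evaljs and splash:jsfunc let you extract data from a page using JavaScript;
splash:select and splash:select_all let you run CSS selectors in a page; they return Element objects, which have many methods useful for scraping and further processing;
element:text returns the text content of a DOM element;
element:bounds returns the bounding box of an element;
element:styles returns the computed styles of an element;
element:form_values returns the values of a <form> element.

Screenshots

splash:png, splash:jpeg - take a PNG or JPEG screenshot;
splash:set_viewport_full - change the viewport size (call it before splash:png or splash:jpeg) to get a screenshot of the whole page;
splash:set_viewport_size - change the viewport size;
element:png and element:jpeg - take a screenshot of a single DOM element.

Interacting with a page

splash:runjs, splash:evaljs and splash:jsfunc let you run arbitrary JavaScript in the page context;
splash:autoload lets you preload JavaScript libraries or execute some JavaScript code at the start of each page render;
splash:mouse_click, splash:mouse_hover, splash:mouse_press, splash:mouse_release let you send mouse events to specific coordinates on the page;
element:mouse_click and element:mouse_hover let you send mouse events to a specific DOM element;
splash:send_keys and splash:send_text let you send keyboard events to the page;
element:send_keys and element:send_text let you send keyboard events to a specific DOM element;
For a <form>, you can get the initial values with element:form_values, change them in Lua code, fill the form with the updated values using element:fill, and submit it with element:submit;
splash.scroll_position lets you scroll the page.

HTTP requests

splash:http_get - send an HTTP GET request and get the response without loading the page into the browser;
splash:http_post - send an HTTP POST request and get the response without loading the page into the browser.

Inspecting network traffic

splash:har returns all requests and responses in HAR format;
splash:history returns information about redirects and the pages loaded into the main browser window;
splash:on_request lets you capture requests made by the page and by scripts;
splash:on_response_headers lets you inspect (or drop) responses as soon as their headers arrive;
splash:on_response lets you inspect the raw responses received (including the content of related resources);
splash.response_body_enabled enables full response bodies in splash:har and splash:on_response.

Example

We have covered a lot of APIs above, so let's write a simple example: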

function main(splash, args)
  splash:set_viewport_size(400, 700)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end

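Here I set the size of the browser viewport, returned the URL currently being visited, changed the returned image format to JPEG, and also returned the current cookies.

The result is too long to show as a screenshot. Try it yourself; it really is simple.

Splash HTTP API

So far we have run Lua scripts from the Splash home page, but that is not what we ultimately want: we want our own Python programs to work with Splash to scrape pages.

Splash provides a set of HTTP API endpoints. We only need to request them with the right parameters. Below is a brief introduction to each.

More details are in the documentation: https://splash.readthedocs.io/en/stable/api.html

render.html

This endpoint returns the HTML of a JavaScript-rendered page. Its address is the address Splash runs at plus the endpoint name, for example http://localhost:8050/render.html .

For example, let's test with JD.com: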

import requests

url = 'http://localhost:8050/render.html?url=https://www.jd.com'
response = requests.get(url)
print(response.text)

render.html supports many more parameters; see the documentation for details.
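
To make the query-string handling explicit, here is a minimal sketch that builds a render.html URL with two of the documented parameters, wait and timeout. The helper names build_render_url and fetch_rendered_html are made up for this example, and the Splash address is assumed to be the local one used above.

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050"  # assumption: local Splash from the Docker step


def build_render_url(endpoint, target, **options):
    """Build a Splash render URL, percent-encoding the target URL and options."""
    query = urlencode({"url": target, **options})
    return f"{SPLASH}/{endpoint}?{query}"


def fetch_rendered_html(target, wait=0.5, timeout=30):
    """Return the rendered HTML of `target` (needs a running Splash service)."""
    import requests  # imported lazily so the URL helper works without requests

    url = build_render_url("render.html", target, wait=wait, timeout=timeout)
    response = requests.get(url, timeout=timeout + 5)
    response.raise_for_status()
    return response.text


print(build_render_url("render.html", "https://www.jd.com", wait=0.5))
```

Encoding the target matters once it carries its own query string: an unencoded & in the target would otherwise be read as the start of another Splash parameter.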

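render.png

This endpoint returns a screenshot of the page. It takes a few more parameters than render.html, including width and height to control the image size, and it returns the image as binary PNG data. For example: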

import requests

url = 'http://localhost:8050/render.png?url=https://www.jd.com&width=1000&height=700'
response = requests.get(url)
with open('jd.png', 'wb') as f:
    f.write(response.content)

After running this, a file named jd.png appears in the current directory.

render.har

This endpoint returns the HAR data for the page load. For example:

import requests

url = 'http://localhost:8050/render.har?url=https://www.jd.com'
response = requests.get(url)
print(response.text)

The result is too long to paste here. It is JSON containing the HAR data recorded while the page loaded.
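
Because the HAR output is plain JSON, it can be summarized in a few lines of Python. The sketch below assumes the standard HAR layout (log.entries, each with request.url and response.status). The function summarize_har and the sample data are made up for illustration and are not real Splash output.

```python
import json
from collections import Counter


def summarize_har(har):
    """Count responses per HTTP status code in a HAR document (a dict)."""
    entries = har.get("log", {}).get("entries", [])
    return Counter(entry["response"]["status"] for entry in entries)


# Hand-written sample in HAR shape, standing in for response.json() from render.har.
sample = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://example.com/"}, "response": {"status": 200}},
  {"request": {"url": "https://example.com/a.js"}, "response": {"status": 200}},
  {"request": {"url": "https://example.com/x.png"}, "response": {"status": 404}}
]}}
""")

for status, count in sorted(summarize_har(sample).items()):
    print(status, count)
```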

render.json

This endpoint combines the functionality of the previous endpoints and returns its result as JSON. For example:

import requests

url = 'http://localhost:8050/render.json?url=https://httpbin.org/get'
response = requests.get(url)
print(response.text)

The result:

{"url": "https://httpbin.org/get", "requestedUrl": "https://httpbin.org/get", "geometry": [0, 0, 1024, 768], "title": ""}

We can control what is returned by passing different parameters. For example, html=1 adds the page source to the result, png=1 adds a PNG screenshot, and har=1 adds the page's HAR data. Sample code:

import requests

url = 'http://localhost:8050/render.json?url=https://httpbin.org/get&html=1&har=1'
response = requests.get(url)
print(response.text)
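
When png=1 is passed, the screenshot comes back inside the JSON as a base64-encoded string rather than raw bytes, so it has to be decoded before saving. Here is a minimal sketch; save_png_from_json is a hypothetical helper, and the sample payload is fabricated rather than real Splash output.

```python
import base64
import os
import tempfile


def save_png_from_json(result, path):
    """Decode the base64 'png' field of a render.json result and write it to path."""
    data = base64.b64decode(result["png"])
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Fabricated stand-in for response.json(); only the 8-byte PNG signature is encoded.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
fake_result = {"png": base64.b64encode(PNG_SIGNATURE).decode("ascii")}

out_path = os.path.join(tempfile.gettempdir(), "splash_demo.png")
print(save_png_from_json(fake_result, out_path), "bytes written")
```

With a real response, fake_result would be replaced by response.json() from a render.json request that includes png=1.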

execute

This is the most powerful endpoint. The Lua scripts we used earlier reach Splash through it. Let's slightly modify the earlier example:

import requests
from urllib.parse import quote

lua = '''
function main(splash, args)
  -- assert() stops the script with an error if the page fails to load
  assert(splash:go("https://www.geekdigging.com/"))
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
'''

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)

The result is again rather long, so I won't paste it here.
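
Putting the script in the query string works, but longer scripts are easier to send as a JSON body. The execute endpoint also accepts a POST with a JSON body, and extra fields in that body are available in Lua through args. The sketch below only builds and prints the payload; build_execute_payload and run_script are names invented here, and run_script needs a running Splash service.

```python
import json

LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {
    url = splash:url(),
    html = splash:html(),
  }
end
"""


def build_execute_payload(target, wait=0.5):
    """Build the JSON body for POST /execute; extra keys are exposed as args.* in Lua."""
    return {"lua_source": LUA_SCRIPT, "url": target, "wait": wait}


def run_script(target, splash="http://localhost:8050"):
    """Send the script to Splash and return the decoded JSON result."""
    import requests  # lazy import: only needed when actually calling Splash

    response = requests.post(f"{splash}/execute", json=build_execute_payload(target))
    response.raise_for_status()
    return response.json()


payload = build_execute_payload("https://www.geekdigging.com/")
print(json.dumps({k: v for k, v in payload.items() if k != "lua_source"}))
```

Passing the target through args.url instead of hard-coding it in the Lua source lets one script be reused for many pages.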

That's all for this post. Thanks for reading.