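Most websites today serve dynamic pages. Part of the content of a dynamic page is generated by the browser running the JavaScript in the page, which makes such pages comparatively hard to scrape.
Let's start with a very simple dynamic page. Open http://quotes.toscrape.com/js in a browser; it is displayed as follows:
The page contains ten quotes in total, each wrapped in a <div class="quote"> element. Now let's try to scrape the quotes from the page in the Scrapy shell: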
$ scrapy shell http://quotes.toscrape.com/js/
...
>>> response.css('div.quote')
[]
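The result shows that the scrape failed: no <div class="quote"> element containing a quote was found in the page. These <div class="quote"> elements are dynamic content. The page downloaded from the server does not contain them (which is why our scrape failed); they are only generated after the browser executes a piece of JavaScript code in the page.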
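The same effect can be reproduced offline. The following is a minimal sketch using only Python's standard library rather than Scrapy: the `STATIC_HTML` and `RENDERED_HTML` strings are hypothetical, heavily shortened stand-ins for the downloaded page and for the page after a browser has run its script, and `count_quote_divs` is an illustrative helper, not part of any library.

```python
from html.parser import HTMLParser


class QuoteDivCounter(HTMLParser):
    """Counts <div class="quote"> start tags; no JavaScript is executed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


def count_quote_divs(html):
    """Return the number of <div class="quote"> elements in the markup."""
    parser = QuoteDivCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for the page as downloaded from the server:
# the quote markup exists only as a string inside the script.
STATIC_HTML = """
<html><body>
<script>
var data = [{"text": "..."}];
for (var i in data) {
    document.write("<div class='quote'>" + data[i]['text'] + "</div>");
}
</script>
</body></html>
"""

# Hypothetical stand-in for the page after a browser has run the script.
RENDERED_HTML = """
<html><body>
<div class='quote'>...</div>
</body></html>
"""

print(count_quote_divs(STATIC_HTML))
print(count_quote_divs(RENDERED_HTML))
```

The parser treats the body of a script element as plain text, so the markup inside the `document.write` call is not counted as an element. A selector such as `div.quote` comes back empty on the downloaded page for the same reason.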
The JavaScript code in the figure is as follows:
var data = [
    { "tags": [ "change", "deep-thoughts", "thinking", "world" ], "author": { "name": "Albert Einstein", "goodreads_link": "/author/show/9810.Albert_Einstein", "slug": "Albert-Einstein" }, "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d" },
    { "tags": [ "abilities", "choices" ], "author": { "name": "J.K. Rowling", "goodreads_link": "/author/show/1077326.J_K_Rowling", "slug": "J-K-Rowling" }, "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d" },
    { "tags": [ "inspirational", "life", "live", "miracle", "miracles" ], "author": { "name": "Albert Einstein", "goodreads_link": "/author/show/9810.Albert_Einstein", "slug": "Albert-Einstein" }, "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d" },
    { "tags": [ "aliteracy", "books", "classic", "humor" ], "author": { "name": "Jane Austen", "goodreads_link": "/author/show/1265.Jane_Austen", "slug": "Jane-Austen" }, "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d" },
    { "tags": [ "be-yourself", "inspirational" ], "author": { "name": "Marilyn Monroe", "goodreads_link": "/author/show/82952.Marilyn_Monroe", "slug": "Marilyn-Monroe" }, "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d" },
    { "tags": [ "adulthood", "success", "value" ], "author": { "name": "Albert Einstein", "goodreads_link": "/author/show/9810.Albert_Einstein", "slug": "Albert-Einstein" }, "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d" },
    { "tags": [ "life", "love" ], "author": { "name": "Andr\u00e9 Gide", "goodreads_link": "/author/show/7617.Andr_Gide", "slug": "Andre-Gide" }, "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d" },
    { "tags": [ "edison", "failure", "inspirational", "paraphrased" ], "author": { "name": "Thomas A. Edison", "goodreads_link": "/author/show/3091287.Thomas_A_Edison", "slug": "Thomas-A-Edison" }, "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d" },
    { "tags": [ "misattributed-eleanor-roosevelt" ], "author": { "name": "Eleanor Roosevelt", "goodreads_link": "/author/show/44566.Eleanor_Roosevelt", "slug": "Eleanor-Roosevelt" }, "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d" },
    { "tags": [ "humor", "obvious", "simile" ], "author": { "name": "Steve Martin", "goodreads_link": "/author/show/7103.Steve_Martin", "slug": "Steve-Martin" }, "text": "\u201cA day without sunshine is like, you know, night.\u201d" }
];
for (var i in data) {
    var d = data[i];
    var tags = $.map(d['tags'], function(t) { return "<a class='tag'>" + t + "</a>"; }).join(" ");
    document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}
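Reading the code reveals how the content is generated: all the quote information is stored in the array data, and the for loop at the end iterates over each entry in data, using document.write to generate the <div class="quote"> element for each quote.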
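Because the data in this particular page sits in the script as an array literal, one possible shortcut is to read it straight out of the page source without running any JavaScript. The sketch below is only an illustration under stated assumptions: it assumes the literal assigned to `data` is valid JSON and is terminated by `];`, as in the script above. `extract_quotes`, `DATA_PATTERN`, and `SAMPLE_SOURCE` are hypothetical names introduced here, and the sample is a one-entry excerpt of that script.

```python
import json
import re

# Captures the array literal assigned to `data`.
# Assumption: the literal is valid JSON and the first "];" closes the array.
DATA_PATTERN = re.compile(r"var\s+data\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_quotes(page_source):
    """Return a list of dicts with the text, author and tags of each quote."""
    match = DATA_PATTERN.search(page_source)
    if match is None:
        return []
    records = json.loads(match.group(1))
    return [
        {
            "text": record["text"],
            "author": record["author"]["name"],
            "tags": record["tags"],
        }
        for record in records
    ]


# One-entry excerpt of the script, kept as a raw string so that the
# \u201c escapes reach json.loads unchanged.
SAMPLE_SOURCE = r"""
<script>
var data = [
    { "tags": [ "humor", "obvious", "simile" ], "author": { "name": "Steve Martin", "goodreads_link": "/author/show/7103.Steve_Martin", "slug": "Steve-Martin" }, "text": "\u201cA day without sunshine is like, you know, night.\u201d" }
];
for (var i in data) { var d = data[i]; }
</script>
"""

quotes = extract_quotes(SAMPLE_SOURCE)
print(quotes[0]["author"], quotes[0]["tags"])
```

In the Scrapy shell the same function could be called as `extract_quotes(response.text)`. This only works when the data is embedded in the page source; it does not help when the script fetches the data separately.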
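The above is the simplest kind of dynamic page: the data is hard-coded in the JavaScript code. In practice it is more common for JavaScript to fetch data by interacting with the website dynamically through HTTP requests (AJAX) and then use that data to update the HTML page. To scrape this kind of dynamic page, the page first has to be rendered with a JavaScript rendering engine, and only then can it be scraped.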