A Node.js-Based Crawling Tool: Node Crawler

Reposted article, 2016-04-29 09:31

Node Crawler aims to be the best crawling tool for Node.js, although it is no longer maintained.

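Let's scrape the article information from the tech category of the 光合新知 blog.
Visit http://dev.guanghe.tv/category/tech/, right-click, and view the page source. The article information appears in the markup, as shown below: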

<li>
  <a class="post-link" href="/2015/12/Getting-Started-With-React-And-JSX.html">React和JSX入門指導</a>
</li>
<li>
  <a class="post-link" href="/2015/12/ReactJS-For-Stupid-People.html">React 懶人教程</a>
</li>
</ul>

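Because each article is an <li> element, we read the publication date, link, and title of every article from all the <li> elements in the page. Judging by the crawler code, the date is the first child of each <li> and the link is the second; the excerpt above shows only the link.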

Crawler code:

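var Crawler = require('crawler');

var crawler = new Crawler({
  // Allow at most 10 concurrent connections.
  maxConnections: 10,
  // Called for every fetched page; $ is a jQuery-like handle on the HTML.
  callback: function(err, result, $) {
    if (err) {
      console.error(err);
      return;
    }
    // Each article is an <li>: first child is the date, second is the link.
    $('li').each(function(index, li) {
      var children = $(li).children();
      console.log(index + ' :');
      console.log('time:' + children.eq(0).text());
      console.log('url:' + result.uri + children.eq(1).attr('href'));
      console.log('title:' + children.eq(1).text());
    });
  }
});

// Queue the category page for crawling.
crawler.queue('http://dev.guanghe.tv/category/tech/');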

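Install the crawler module with npm install crawler, save the code above as app.js, and run it with node app.js. The callback signature (err, result, $) matches the crawler releases available when this was written; later releases changed it, so an older version of the module may be needed.
You will get output like the following (only part of it is shown):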

0 :
time:Dec 31, 2015
url:http://dev.guanghe.tv/category/tech//2015/12/Getting-Started-With-React-And-JSX.html
title:React和JSX入門指導
1 :
time:Dec 30, 2015
url:http://dev.guanghe.tv/category/tech//2015/12/ReactJS-For-Stupid-People.html
title:React 懶人教程
2 :
time:Dec 24, 2015
url:http://dev.guanghe.tv/category/tech//2015/12/iOSCustomProblem.html
title:iOS開發常見問題
3 :
time:Dec 17, 2015
url:http://dev.guanghe.tv/category/tech//2015/12/iOSXcodeDebug.html
title:Xcode Debug技巧
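
The url values in the output contain a double slash and keep the /category/tech/ prefix, because the code joins result.uri and the href by plain string concatenation while each href is root-relative. Below is a minimal sketch of one way to get the intended address, assuming a Node.js version that provides the built-in URL class. The helper name resolveLink is hypothetical and is not part of the crawler module.

```javascript
// Sketch: resolve a link's href against the page it was found on,
// using Node's built-in WHATWG URL class (no extra modules needed).

// Hypothetical helper, not part of the crawler module.
function resolveLink(pageUri, href) {
  // new URL(relative, base) applies standard URL resolution, so a
  // root-relative href such as "/2015/12/x.html" replaces the whole
  // path of the base instead of being appended to it.
  return new URL(href, pageUri).href;
}

const pageUri = 'http://dev.guanghe.tv/category/tech/';
const href = '/2015/12/Getting-Started-With-React-And-JSX.html';

console.log(resolveLink(pageUri, href));
// http://dev.guanghe.tv/2015/12/Getting-Started-With-React-And-JSX.html
```

Inside the crawler callback, a call such as resolveLink(result.uri, href), where href is the link's href attribute, could take the place of the string concatenation.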
