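I looked at the DotNetSpider samples, but DotNetSpider felt too heavyweight for this job: it is a fairly complete crawler framework.
After comparing the headless browsers listed below, I ended up using PuppeteerSharp + AngleSharp to write a crawler sample.
Like the post referenced above, this sample collects data from the Autohome page https://store.mall.autohome.com.cn/83106681.html.
This article uses PuppeteerSharp to obtain the final page (that is, the page after its JavaScript has run) and AngleSharp to parse the resulting HTML document.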
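To make "the page after its JavaScript has run" concrete, here is a minimal sketch. It does not hit the Autohome page. It uses a hypothetical inline page (`DemoHtml`) whose list item is only added by script, and the class and method names are mine. The `DownloadAsync(BrowserFetcher.DefaultRevision)` call mirrors the one used in this article's sample and may differ in other PuppeteerSharp versions.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

internal static class RenderedPageSketch
{
    // Hypothetical demo markup: the <li> is only added by script, 200 ms after load.
    public const string DemoHtml =
        "<ul id='list'></ul>" +
        "<script>setTimeout(function () {" +
        " document.getElementById('list').innerHTML = \"<li class='carbox'>demo</li>\";" +
        " }, 200);</script>";

    // Loads the markup in headless Chromium, waits until readySelector matches,
    // then returns the serialized DOM (the "final page").
    public static async Task<string> RenderAsync(string html, string readySelector)
    {
        // Same download call as in the article's sample; a no-op if already downloaded.
        await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);

        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        try
        {
            var page = await browser.NewPageAsync();
            await page.SetContentAsync(html);
            // Block until the script-generated element exists in the DOM.
            await page.WaitForSelectorAsync(readySelector);
            return await page.GetContentAsync();
        }
        finally
        {
            // Closing the browser also closes its pages.
            await browser.CloseAsync();
        }
    }

    private static async Task Main()
    {
        var finalHtml = await RenderAsync(DemoHtml, "li.carbox");
        Console.WriteLine(finalHtml);
    }
}
```

A plain HTTP download of the same markup would only contain the empty `<ul>`. The returned string also contains the `<li class="carbox">` element that the script created, and that is the version a parser needs to see.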
Headless Browsers
A list of (almost) all headless web browsers in existence
A web browser without a graphical user interface, controlled programmatically. Used for automation, testing, and other purposes.
Browser engines
These browser engines fully render web pages or run JavaScript in a virtual DOM
Name | About | Supported Languages | License |
---|---|---|---|
Chromium Embedded Framework | CEF is an open source project based on the Google Chromium project. | JavaScript | BSD |
Erik | Headless browser on top of Kanna and WebKit. | Swift | MIT |
jBrowserDriver | A Selenium-compatible headless browser which is written in pure Java. WebKit-based. Works with any of the Selenium Server bindings. | Java | Apache License v2.0 |
PhantomJS | [Unmaintained] PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R(via Selenium) | BSD 3-Clause |
Splash | Splash is a JavaScript rendering service with an HTTP API. It's a lightweight browser implemented in Python using Twisted and Qt. | Any | BSD 3-Clause |
Multi drivers
These libraries can control multiple browser engines (typically using Selenium)
Name | About | Supported Languages | License |
---|---|---|---|
CasperJS | CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). | JavaScript | MIT |
Geb | Geb is a Groovy interface to WebDriver. | Groovy | Apache |
Selenium | Selenium is a suite of tools to automate web browsers across many platforms. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R | Apache |
Splinter | Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items. | Python | - |
SST | SST (selenium-simple-test) is a web test framework that uses Python to generate functional browser-based tests. | Python | - |
Watir | The most elegant way to use Selenium WebDriver with ruby. | Ruby | MIT |
PhantomJS drivers
These libraries control PhantomJS
Name | About | Supported Languages | License |
---|---|---|---|
Ghostbuster | Automated browser testing via phantom.js, with all of the pain taken out! That means you get a real browser, with a real DOM, and can do real testing! | JavaScript | Not specified |
jedi-crawler | Lightsabing Node/PhantomJS crawler; scrape dynamic content : without the hassle | JavaScript | Not specified |
Lotte | Lotte is a headless, automated testing framework built on top of PhantomJS and inspired by Ghostbuster. | JavaScript | MIT |
phantompy | Phantompy is a headless WebKit engine with a powerful Pythonic API built on top of Qt5 WebKit | Python | LGPL-2.1 |
X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT |
Horseman | Promise based Node.js module for PhantomJS. Features chainable API, understandable control-flow, support for multiple tabs, and built-in jQuery. | JavaScript | MIT |
Chromium drivers
These libraries control Chromium
Name | About | Supported Languages | License |
---|---|---|---|
Awesomium | Chromium-based headless browser engine | C++, .NET | Free/Commercial |
Headless Chromium | Chromium feature activated with the --headless flag, currently available in the nightly build of Chromium, not yet released | C++ | Open source |
Puppeteer | Headless Chrome Node API from the Chrome DevTools team | JavaScript | Apache |
PuppeteerSharp | PuppeteerSharp is a port of the official Headless Chrome Node.js Puppeteer API | .NET | MIT |
chrome-remote-interface | Chrome Debugging Protocol interface for Node.js | JavaScript | MIT |
Chromy | Features chainable API, mobile emulation, fundamental API such as javascript evaluation. | JavaScript | MIT |
chromedp | A faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol. | Go | MIT |
Chromeless | Chrome automation made simple. Runs locally or headless on AWS Lambda. | JavaScript | MIT |
Webkit drivers
These drivers control an in-process instance of Webkit
Name | About | Supported Languages | License |
---|---|---|---|
Browserjet | Runs a custom build of webkit, controlled by node.js interface. | JavaScript | Not specified |
ghost.py | ghost.py is a webkit web client written in python. | Python | MIT |
headless_browser | Headless browser based on WebKit written in C++. | C++ | Not Specified |
Jabba-Webkit | Jabba's headless webkit browser for scraping AJAX-powered webpages. | Python | Not specified |
Jasmine-Headless-Webkit | jasmine-headless-webkit uses the QtWebKit widget to run your specs without needing to render a pixel. | Python, JavaScript, Ruby | Free |
Python-Webkit | Python-Webkit is a python extension to Webkit to add full, complete access to Webkit's DOM | Python | GNU |
Spynner | Programmatic web browsing module with AJAX support for Python | Python | Not specified |
Webloop | Scriptable, headless WebKit with a Go API. | Go | BSD 3-Clause |
wkhtmltopdf wkhtmltox wkhtmltoimage | Command line tool rendering HTML into PDF and other image formats. | shell, C | LGPLv3 |
WKZombie | Functional headless browser (with JSON support) for iOS using WebKit and hpple/libxml2. | Swift | MIT |
Other drivers
These libraries control lesser known browsers or OS-provided web libraries
Name | About | Supported Languages | License |
---|---|---|---|
Nightmare | Nightmare is a high-level browser automation library built as an easier alternative to PhantomJS. It runs on the Electron engine. | JavaScript | MIT |
grope | A RubyCocoa interface to the macOS WebKit Framework | RubyCocoa | MIT |
SlimerJS | SlimerJS is similar to PhantomJS, except that it runs Gecko, the browser engine of Mozilla Firefox, instead of WebKit (and it is not yet truly headless). | JavaScript | Mozilla 2.0 |
SpecterJS | A scriptable headless Internet Explorer port of PhantomJS. | JavaScript | MIT |
trifleJS | A headless Internet Explorer browser using the WebBrowser Class with a Javascript API running on the V8 engine. | JavaScript | MIT |
Fake Browser Engine
These libraries are typically naive or HTML-only browsers
Name | About | Supported Languages | License |
---|---|---|---|
AngleSharp | HTML parsing library | .NET | MIT |
Guillotine | A headless browser, written in C# | .NET | LGPL-3.0 |
benv | Stub a browser environment in node.js and headlessly test your client-side code. | JavaScript | MIT |
browser.rb | Headless Ruby browser on top of Nokogiri and TheRubyRacer | Ruby | Not specified |
BrowserKit | BrowserKit simulates the behavior of a web browser. | PHP | MIT |
DamonJS | Bot navigating urls and doing tasks. | JavaScript | Apache |
Headless | Headless browser support for fast web acceptance testing in .NET | .NET | MIT |
HeadlessBrowser | A very miniature headless browser, for testing the DOM on Node.js | JavaScript | Not specified |
HtmlUnit | HtmlUnit is a "GUI-Less browser for Java programs". | Java | Apache |
Jaunt | Java Web Scraping & Automation API | Java | Not specified |
JSDom | A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js. | JavaScript | MIT |
MechanicalSoup | A Python library for automating interaction with websites. | Python | MIT |
mechanize | Stateful programmatic web browsing. | Python | BSD 3-Clause, ZPL 2.1 |
node-as-browser | Create a browser-like environment within Node.js | JavaScript | MIT |
RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. | Python | BSD 3-Clause |
SimpleBrowser | A flexible and intuitive web browser engine designed for automation tasks. Built on the .NET 4 framework. | .NET | BSD 3-Clause |
stanislaw | Naive, mechanize-like HTML parser/form driver. | Python | Not specified |
twill | Twill is a simple language that interacts with basic HTML pages (no JavaScript support). | Python | MIT |
WeasyPrint | WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing. | Python | BSD 3-Clause |
WWW::Mechanize | Headless browser for Perl with many plugins and extensions, notably Test::WWW::Mechanize for testing | Perl | Perl 5 |
X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT |
Xidel (Internet Tools) | An XQuery-based cli web scraper for static X/HTML pages and JSON-APIs. | FreePascal, XQuery | GPL-2 |
Zombie.js | Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. | JavaScript | MIT |
Runs in a browser
Name | About | Supported Languages | License |
---|---|---|---|
DalekJS | [unmaintained and recommend TestCafé] Automated cross browser testing with JavaScript. | JavaScript | MIT |
TestCafé | Automated browser testing for the modern web development stack. | JavaScript | MIT |
Sahi | Sahi is a cross-browser automation/testing tool with the facility to record and playback scripts. | JavaScript, Java, Ruby, PHP | Apache / Commercial |
WatiN | Web Application Testing In .NET | .NET | Apache 2.0 |
Misc tools
Name | About | Supported Languages | License |
---|---|---|---|
browser-launcher | Detect and launch browser versions, headlessly or otherwise | JavaScript | MIT |
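In fact, if the data you need is not loaded by JavaScript, AngleSharp alone is enough.
When the data is loaded by JavaScript, however, you need a real headless browser component. AngleSharp currently only supports executing simple JavaScript, and anything slightly more complex fails. Full JavaScript support is reportedly planned, so stay tuned!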
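For the static case, a minimal sketch of using AngleSharp on its own might look like the following. The markup is a hypothetical stand-in that only mimics the `carbox` structure of the sample page, and the class and method names are mine.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

internal static class StaticHtmlSketch
{
    // Hypothetical static markup; no script is needed to produce the list items.
    public const string Html = @"
<ul>
  <li class=""carbox""><a href=""/car/1""><div class=""carbox-title"">Car A</div></a></li>
  <li class=""carbox""><a href=""/car/2""><div class=""carbox-title"">Car B</div></a></li>
</ul>";

    // Parses the markup with AngleSharp and returns the text of every title element.
    public static async Task<string[]> ExtractTitlesAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);
        // Feed the string to the browsing context as a virtual response.
        using var document = await context.OpenAsync(req => req.Content(html));
        return document.QuerySelectorAll("li.carbox div.carbox-title")
            .Select(element => element.TextContent.Trim())
            .ToArray();
    }

    private static async Task Main()
    {
        foreach (var title in await ExtractTitlesAsync(Html))
        {
            Console.WriteLine(title);
        }
    }
}
```

To load a live static page instead of a string, the commented-out lines in the sample show the idea: build the configuration with `WithDefaultLoader()` and pass the URL to `context.OpenAsync`.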
Code
/*
 * A PuppeteerSharp + AngleSharp crawler console app sample.
 */
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
using Newtonsoft.Json;
using PuppeteerSharp;

namespace CrawlerSamples
{
    internal class Program
    {
        private const string Url = "https://store.mall.autohome.com.cn/83106681.html";
        private const int ChromiumRevision = BrowserFetcher.DefaultRevision;

        private static async Task Main(string[] args)
        {
            // Download the Chromium browser revision package.
            await new BrowserFetcher().DownloadAsync(ChromiumRevision);

            // Test AngleSharp.
            await TestAngleSharp();

            Console.ReadKey();
        }

        private static async Task TestAngleSharp()
        {
            /*
             * Loading the HTML document with AngleSharp alone.
             * TODO: WithJavaScript requires the AngleSharp.Scripting.Javascript NuGet package.
             * Note: JavaScript support is experimental and does not handle complex JavaScript code.
             */
            //IConfiguration config = Configuration.Default.WithDefaultLoader().WithCss().WithCookies().WithJavaScript();
            //IBrowsingContext context = BrowsingContext.New(config);
            //IDocument document = await context.OpenAsync(Url);

            // Load the HTML document with PuppeteerSharp instead.
            var htmlString = await TestPuppeteerSharp();

            /*
             * Parse the HTML document string.
             */
            var context = BrowsingContext.New(Configuration.Default);
            var parser = context.GetService<IHtmlParser>();
            var document = parser.ParseDocument(htmlString);

            // Select the list of carbox elements.
            var carboxList = document.QuerySelectorAll("div.shop-content div.content div.list li.carbox");

            var carModelList = new List<CarModel>();
            foreach (var carbox in carboxList)
            {
                // Parse the element and convert it to a car model object.
                var model = CreateModelWithAngleSharp(carbox);
                carModelList.Add(model);

                // Print the model to the console.
                var jsonString = JsonConvert.SerializeObject(model);
                Console.WriteLine(jsonString);
                Console.WriteLine();
            }

            Console.WriteLine("Total count:" + carModelList.Count);
        }

        private static async Task<string> TestPuppeteerSharp()
        {
            // Enable the headless option.
            var launchOptions = new LaunchOptions { Headless = true };
            // Start the headless browser.
            var browser = await Puppeteer.LaunchAsync(launchOptions);
            // Open a new tab page.
            var page = await browser.NewPageAsync();
            // Request the URL to load the page.
            await page.GoToAsync(Url);
            // Get the HTML content of the page.
            var htmlString = await page.GetContentAsync();

            #region Dispose resources
            // Close the tab page.
            await page.CloseAsync();
            // Close the headless browser; any remaining pages are closed here.
            await browser.CloseAsync();
            #endregion

            return htmlString;
        }

        private static CarModel CreateModelWithAngleSharp(IParentNode node)
        {
            var model = new CarModel
            {
                Title = node.QuerySelector("a div.carbox-title").TextContent,
                ImageUrl = node.QuerySelector("a div.carbox-carimg img").GetAttribute("src"),
                ProductUrl = node.QuerySelector("a").GetAttribute("href"),
                Tip = node.QuerySelector("a div.carbox-tip").TextContent,
                OrdersNumber = node.QuerySelector("a div.carbox-number span").TextContent
            };
            return model;
        }
    }
}
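The listing refers to a `CarModel` type that is not shown here. A minimal sketch of what it could look like, assuming five plain string properties (inferred only from the assignments in `CreateModelWithAngleSharp`, so the class in the repository may differ):

```csharp
using System;

namespace CrawlerSamples
{
    // Assumed shape: the listing only ever assigns strings to these five properties.
    internal class CarModel
    {
        // Text of the div.carbox-title element.
        public string Title { get; set; }

        // "src" attribute of the car image.
        public string ImageUrl { get; set; }

        // "href" attribute of the enclosing link.
        public string ProductUrl { get; set; }

        // Text of the div.carbox-tip element.
        public string Tip { get; set; }

        // Text of the order-count span, kept as a string exactly as scraped.
        public string OrdersNumber { get; set; }
    }
}
```

Keep in mind that `QuerySelector` returns `null` when nothing matches, so if any of these elements is missing from a `carbox`, the property initializers in the listing will throw a `NullReferenceException`.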
Result
Note
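Note that on the first run, this line of code:
await new BrowserFetcher().DownloadAsync(ChromiumRevision);
downloads the portable browser package download-Win64-536395.zip from the network to your local machine. Unzipped, it contains a Chromium browser. This takes a while, so be patient.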
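If a Chrome or Chromium build is already installed, one way to avoid the wait is to point PuppeteerSharp at it through `LaunchOptions.ExecutablePath`. The sketch below is one possible arrangement, not what the sample repository does. The environment variable name `CHROME_PATH` and the helper names are hypothetical, and the fallback download call mirrors the one in this article's sample.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

internal static class LocalBrowserSketch
{
    // Hypothetical environment variable holding the path to a local Chrome/Chromium binary.
    private const string ChromePathVariable = "CHROME_PATH";

    // Builds headless launch options, using the given executable when one is provided.
    public static LaunchOptions BuildLaunchOptions(string executablePath)
    {
        var options = new LaunchOptions { Headless = true };
        if (!string.IsNullOrWhiteSpace(executablePath))
        {
            options.ExecutablePath = executablePath;
        }
        return options;
    }

    private static async Task Main()
    {
        var chromePath = Environment.GetEnvironmentVariable(ChromePathVariable);
        if (string.IsNullOrWhiteSpace(chromePath))
        {
            // No local browser configured: fall back to the download used in the sample.
            await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
        }

        var browser = await Puppeteer.LaunchAsync(BuildLaunchOptions(chromePath));
        try
        {
            // Print the browser version to confirm which binary was started.
            Console.WriteLine(await browser.GetVersionAsync());
        }
        finally
        {
            await browser.CloseAsync();
        }
    }
}
```

A locally installed browser may not match the Chromium revision that the PuppeteerSharp release was tested against, so treat this as a convenience for development rather than a guaranteed drop-in.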
Source
https://github.com/VAllens/CrawlerSamples