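[DotnetSpider series index]
The previous post covered the basic usage, which offers a lot of freedom but requires writing more code. In my industry, most crawlers are focused crawlers that only need to fetch specific pages and turn them into structured data. To speed up development, I implemented a way to build a spider through entity configuration.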
Create a Console project
Add the package via NuGet
DotnetSpider2.Extension
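Define a configuration-based data object
- The data object must inherit from SpiderEntity
- EntityTableAttribute can define the database name, the table name and table-name suffix, indexes, the primary key, and the fields to update
- EntitySelector defines the rule for extracting data objects from the page
- TargetUrlsSelector defines the rule (regular expression) a target link must match before it is added to the queue
Define a bare data object class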
public class Product : SpiderEntity { }
Open a JD product list page in Chrome: http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main
- Press F12 to open the developer tools
- Select a product and inspect the HTML structure

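Each product sits under a DIV whose class is gl-i-wrap j-sku-item, so add an EntitySelector attribute to the Product class. (There is more than one way to write the XPath. If you are not familiar with it, W3School is a good place to learn. The framework also supports CSS selectors and even regular expressions for selecting the right HTML fragment.)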
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
}
Add the database and index information
[EntityTable("test", "sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
}
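Suppose you need to collect the SKU. Inspect the HTML structure and work out the relative XPath. Why relative? Because EntitySelector has already cut the HTML into fragments, every query for an inner HTML element is evaluated relative to the element that EntitySelector matched. Finally, add the database column information.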
[EntityTable("test", "sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku")]
    public string Sku { get; set; }
}
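The two-step selection described above can be tried outside the framework. The following is only a sketch using the standard System.Xml classes, not DotnetSpider's own extraction code. The class name RelativeXPathDemo, the method ExtractSkus and the sample markup are all hypothetical, and the markup is a simplified, well-formed stand-in for the real JD page.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical sample data: a simplified, well-formed stand-in for the JD list markup.
    public const string SampleHtml = @"
<ul>
  <li class='gl-item'><div class='gl-i-wrap j-sku-item' data-sku='1001'><a href='/1001.html'>Phone A</a></div></li>
  <li class='gl-item'><div class='gl-i-wrap j-sku-item' data-sku='1002'><a href='/1002.html'>Phone B</a></div></li>
</ul>";

    public static List<string> ExtractSkus(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var skus = new List<string>();

        // Step 1: the entity-level XPath cuts the page into one fragment per product.
        XmlNodeList fragments = doc.SelectNodes(
            "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]");

        foreach (XmlNode fragment in fragments)
        {
            // Step 2: the property-level XPath starts with "." so it is
            // evaluated relative to the current fragment, not the whole page.
            XmlNode sku = fragment.SelectSingleNode("./@data-sku");
            if (sku != null)
            {
                skus.Add(sku.Value);
            }
        }

        return skus;
    }

    public static void Main()
    {
        foreach (string sku in ExtractSkus(SampleHtml))
        {
            Console.WriteLine(sku);
        }
    }
}
```

Because the inner expression is relative, each fragment yields its own data-sku value. An absolute expression such as //@data-sku would instead search the whole document from every fragment.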
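Inside the crawler, a link is stored as a Request object. Extra property values can be attached when a Request is constructed, and a data object can then read its values from those extras.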
[EntityTable("test", "sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku")]
    public string Sku { get; set; }

    [PropertyDefine(Expression = "name", Type = SelectorType.Enviroment)]
    public string Category { get; set; }
}
Configure the spider (inherit from EntitySpider)
public class JdSkuSampleSpider : EntitySpider
{
    public JdSkuSampleSpider() : base("JdSkuSample", new Site
    {
        //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
    })
    {
    }

    protected override void MyInit(params string[] arguments)
    {
        Identity = Identity ?? "JD SKU SAMPLE";
        ThreadNum = 1;
        // Download HTML with the HTTP client.
        Downloader = new HttpClientDownloader();
        // Store data in MySQL. The MySQL entity pipeline is the default, so this line can be commented out. Don't omit SslMode.
        AddPipeline(new MySqlEntityPipeline("Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;"));
        AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手機" }, { "cat3", "655" } });
        AddEntityType<Product>();
    }
}
The second argument of AddStartUrl, the Dictionary<string, object>, is the data that Enviroment selectors query.
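To make the idea concrete, here is a minimal sketch of what such a lookup amounts to, assuming the extras behave like a plain dictionary keyed by the selector's Expression. The helper ResolveFromExtras and the class EnvironmentLookupDemo are hypothetical and are not part of DotnetSpider. The framework's real conversion rules may differ.

```csharp
using System;
using System.Collections.Generic;

public static class EnvironmentLookupDemo
{
    // Hypothetical helper: look up a property value in the extras attached to a request.
    // Returns null when there are no extras or the key is missing.
    public static object ResolveFromExtras(IDictionary<string, object> extras, string key)
    {
        object value;
        return extras != null && extras.TryGetValue(key, out value) ? value : null;
    }

    public static void Main()
    {
        // Same extras as the AddStartUrl call in the article.
        var extras = new Dictionary<string, object> { { "name", "手機" }, { "cat3", "655" } };

        // A string property takes the value as-is.
        string category = (string)ResolveFromExtras(extras, "name");

        // A numeric property needs a conversion; Convert.ToInt32 is used here as an assumption.
        int categoryId = Convert.ToInt32(ResolveFromExtras(extras, "cat3"));

        Console.WriteLine(category + " / " + categoryId);
    }
}
```

In this sketch, a property whose Expression is "name" receives "手機", and one whose Expression is "cat3" receives 655 after conversion.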
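TargetUrlsSelector configures both the validity check for links and how target URLs are collected. The following means that target URLs are taken from the region selected by the XPath and must also match the regular expression &page=[0-9]+&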
[EntityTable("test", "jd_sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
[TargetUrlsSelector(XPaths = new[] { "//span[@class=\"p-num\"]" }, Patterns = new[] { @"&page=[0-9]+&" })]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku")]
    public string Sku { get; set; }

    [PropertyDefine(Expression = "name", Type = SelectorType.Enviroment)]
    public string Category { get; set; }
}
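The pattern check can be tried on its own with the standard Regex class. This is only a sketch of the filtering idea, not the framework's implementation. The class TargetUrlFilterDemo, the method FilterTargetUrls and the candidate URLs are hypothetical. Treating a link as valid when it matches any one of the patterns is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TargetUrlFilterDemo
{
    // Keep only the candidate links that match at least one pattern (assumed semantics).
    public static List<string> FilterTargetUrls(IEnumerable<string> candidates, IEnumerable<string> patterns)
    {
        List<Regex> regexes = patterns.Select(p => new Regex(p)).ToList();
        return candidates
            .Where(url => regexes.Any(r => r.IsMatch(url)))
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical links, as if collected from the region selected by the XPath.
        var candidates = new[]
        {
            "http://list.jd.com/list.html?cat=9987,653,655&page=3&JL=6_0_0",
            "http://list.jd.com/list.html?cat=9987,653,655&page=4",
            "http://item.jd.com/1001.html"
        };

        foreach (string url in FilterTargetUrls(candidates, new[] { @"&page=[0-9]+&" }))
        {
            Console.WriteLine(url);
        }
    }
}
```

Only the first link passes. The second is rejected because the pattern requires a trailing & after the page number, which is worth keeping in mind if the page parameter can appear at the end of a URL.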

Add a MySQL data pipeline. Only the connection string needs to be configured.
context.AddPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
Complete code
public class JdSkuSampleSpider : EntitySpider
{
    public JdSkuSampleSpider() : base("JdSkuSample", new Site
    {
        //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
    })
    {
    }

    protected override void MyInit(params string[] arguments)
    {
        Identity = Identity ?? "JD SKU SAMPLE";
        ThreadNum = 1;
        // Download HTML with the HTTP client.
        Downloader = new HttpClientDownloader();
        // Store data in MySQL. The MySQL entity pipeline is the default, so this line can be commented out. Don't omit SslMode.
        AddPipeline(new MySqlEntityPipeline("Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;"));
        AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手機" }, { "cat3", "655" } });
        AddEntityType<Product>();
    }
}

[EntityTable("test", "jd_sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
[TargetUrlsSelector(XPaths = new[] { "//span[@class=\"p-num\"]" }, Patterns = new[] { @"&page=[0-9]+&" })]
public class Product : SpiderEntity
{
    [PropertyDefine(Expression = "./@data-sku", Length = 100)]
    public string Sku { get; set; }

    [PropertyDefine(Expression = "name", Type = SelectorType.Enviroment, Length = 100)]
    public string Category { get; set; }

    [PropertyDefine(Expression = "cat3", Type = SelectorType.Enviroment)]
    public int CategoryId { get; set; }

    [PropertyDefine(Expression = "./div[1]/a/@href")]
    public string Url { get; set; }

    [PropertyDefine(Expression = "./div[5]/strong/a")]
    public long CommentsCount { get; set; }

    [PropertyDefine(Expression = ".//div[@class='p-shop']/@data-shop_name", Length = 100)]
    public string ShopName { get; set; }

    [PropertyDefine(Expression = ".//div[@class='p-name']/a/em", Length = 100)]
    public string Name { get; set; }

    [PropertyDefine(Expression = "./@venderid", Length = 100)]
    public string VenderId { get; set; }

    [PropertyDefine(Expression = "./@jdzy_shop_id", Length = 100)]
    public string JdzyShopId { get; set; }

    [PropertyDefine(Expression = "Monday", Type = SelectorType.Enviroment)]
    public DateTime RunId { get; set; }
}
Run the spider
public class Program
{
    public static void Main(string[] args)
    {
        JdSkuSampleSpider spider = new JdSkuSampleSpider();
        spider.Run();
    }
}


A complete spider in fewer than 57 lines of code. Remarkably simple, isn't it?
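Source code
https://github.com/zlzforever/DotnetSpider (stars are much appreciated)
Contributing or questions
This post was written some time ago, and the code in it sometimes lags behind changes to the framework. Please refer to the sample spiders in the DotnetSpider.Sample project.
QQ group: 477731655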
