[Open-source .NET cross-platform data-collection crawler framework: DotnetSpider] [3] Configuration-based spiders


[DotnetSpider series table of contents]

The basic usage introduced in the previous post offers a lot of flexibility, but it requires comparatively more code. In my industry, most crawlers are topic-specific: they only need to collect specified pages and structure the data. To improve development efficiency, I implemented a way to build spiders from entity configuration.

Create a Console project

Add the package via NuGet

DotnetSpider2.Extension
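For example, using the package id given above, from the Visual Studio Package Manager Console (the `dotnet` CLI equivalent would be `dotnet add package DotnetSpider2.Extension`):

```
PM> Install-Package DotnetSpider2.Extension
```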

Define the configuration-style data object

  • The data object must implement ISpiderEntity
  • Schema defines the database name, table name, and table-name suffix
  • Indexes defines the table's primary key, unique indexes, and ordinary indexes
  • EntitySelector defines the rule for extracting data objects from the page

Define a bare data object class

public class Product : ISpiderEntity
{
}
Open the JD product list page in Chrome: http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main
  1. Press F12 to open the developer tools
  2. Select a product and inspect its HTML structure


We can see that each product sits in a DIV whose class is gl-i-wrap j-sku-item, so we add an EntitySelector attribute to the Product class. (This XPath is not the only valid way to write it; if you are unfamiliar with XPath, W3Schools is a good place to learn. The framework also supports CSS selectors and even regular expressions for picking out the right HTML fragment.)

    [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
    public class Product : ISpiderEntity

  1. Add the database and index information

    [Schema("test", "sku", TableSuffix.Today)]
    [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
    [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
    public class Product : ISpiderEntity
  2. Suppose you need to collect the SKU. Inspect the HTML structure and work out the relative XPath. Why relative? Because the EntitySelector has already cut the HTML into fragments, so element queries inside the entity run relative to the element the EntitySelector matched. Finally, add the database column information.

    [Schema("test", "sku", TableSuffix.Today)]
    [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
    [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
    public class Product : ISpiderEntity
    {
        [StoredAs("sku", DataType.String, 25)]
        [PropertySelector(Expression = "./@data-sku")]
        public string Sku { get; set; }
    }
  3. Internally, links are stored as Request objects. When constructing a Request you can attach extra property values, and data objects are then allowed to read values from the Request's extras.

    [StoredAs("category", DataType.String, 20)]
    [PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]
    public string CategoryName { get; set; }
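The relative-selector behavior described in step 2 can be sketched with plain System.Xml, outside DotnetSpider. The markup below is a simplified stand-in for the real JD page, not its actual HTML:

```csharp
using System;
using System.Xml;

class RelativeXPathDemo
{
    static void Main()
    {
        // A simplified stand-in for the JD list markup (not the real page).
        var doc = new XmlDocument();
        doc.LoadXml(
            "<ul>" +
            "<li class='gl-item'><div class='gl-i-wrap j-sku-item' data-sku='1001'/></li>" +
            "<li class='gl-item'><div class='gl-i-wrap j-sku-item' data-sku='1002'/></li>" +
            "</ul>");

        // The EntitySelector first cuts the page into per-product fragments...
        var items = doc.SelectNodes("//li[@class='gl-item']/div[contains(@class,'j-sku-item')]");

        // ...and each PropertySelector then runs *relative* to that fragment,
        // which is why "./@data-sku" starts with "." instead of "//".
        foreach (XmlNode item in items)
        {
            Console.WriteLine(item.SelectSingleNode("./@data-sku").Value);
        }
    }
}
```

Running this prints 1001 and then 1002, one SKU per matched fragment.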
Configure the spider (inherit from EntitySpiderBuilder)
    protected override EntitySpider GetEntitySpider()
    {
        EntitySpider context = new EntitySpider(new Site
        {
            //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
        })
        {
            UserId = "DotnetSpider",
            TaskGroup = "JdSkuSampleSpider"
        };
        context.SetThreadNum(1);
        context.SetIdentity("JD_sku_store_test_" + DateTime.Now.ToString("yyyy_MM_dd_hhmmss"));
        context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
        context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手機" }, { "cat3", "655" } });
        context.AddEntityType(typeof(Product), new TargetUrlExtractor
        {
            Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class=\"p-num\"]" },
            Patterns = new List<string> { @"&page=[0-9]+&" }
        });
        return context;
    }
  1. The second argument to AddStartUrl, a Dictionary<string, object>, supplies the values read by SelectorType.Enviroment selectors

  2. Configure the Scheduler: by default an in-memory queue is used for URL scheduling; for distributed crawling across multiple machines, configure a RedisScheduler instead

    context.SetScheduler(new RedisScheduler
    {
        Host = "",
        Password = "",
        Port = 6379
    });
  3. When adding a data object you can configure validation of which links are legal for it. This is used when one site yields multiple kinds of links that map to different data objects. The same rule also extracts matching URLs from the current page and adds them to the Scheduler for further crawling.

    context.AddEntityType(typeof(Product), new TargetUrlExtractor
    {
        Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class=\"p-num\"]" },
        Patterns = new List<string> { @"&page=[0-9]+&" }
    });
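To see which links the Patterns entry actually admits, the same regex can be checked with System.Text.RegularExpressions. The URLs below are illustrative examples, not taken from a live crawl:

```csharp
using System;
using System.Text.RegularExpressions;

class TargetUrlPatternDemo
{
    static void Main()
    {
        // The same pattern passed to TargetUrlExtractor above.
        var pattern = new Regex(@"&page=[0-9]+&");

        // A paging link (as found inside the p-num region) matches...
        Console.WriteLine(pattern.IsMatch(
            "http://list.jd.com/list.html?cat=9987,653,655&page=3&JL=6_0_0"));  // True

        // ...while a product-detail link does not, so it is not enqueued.
        Console.WriteLine(pattern.IsMatch("http://item.jd.com/123456.html"));   // False
    }
}
```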


  1. Add a MySql entity pipeline; only the connection string needs to be configured

    context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
Complete code
 
        
public class JdSkuSampleSpider : EntitySpiderBuilder
    {
        protected override EntitySpider GetEntitySpider()
        {
            EntitySpider context = new EntitySpider(new Site
            {
                //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
            })
            {
                UserId = "DotnetSpider",
                TaskGroup = "JdSkuSampleSpider"
            };
            context.SetThreadNum(1);
            context.SetIdentity("JD_sku_store_test_" + DateTime.Now.ToString("yyyy_MM_dd_hhmmss"));
            context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
            context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手機" }, { "cat3", "655" } });
            context.AddEntityType(typeof(Product), new TargetUrlExtractor
            {
                Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class=\"p-num\"]" },
                Patterns = new List<string> { @"&page=[0-9]+&" }
            });
            return context;
        }

        [Schema("test", "sku", TableSuffix.Today)]
        [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
        [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
        public class Product : ISpiderEntity
        {
            [StoredAs("sku", DataType.String, 25)]
            [PropertySelector(Expression = "./@data-sku")]
            public string Sku { get; set; }

            [StoredAs("category", DataType.String, 20)]
            [PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]
            public string CategoryName { get; set; }

            [StoredAs("cat3", DataType.String, 20)]
            [PropertySelector(Expression = "cat3", Type = SelectorType.Enviroment)]
            public int CategoryId { get; set; }

            [StoredAs("url", DataType.Text)]
            [PropertySelector(Expression = "./div[1]/a/@href")]
            public string Url { get; set; }

            [StoredAs("commentscount", DataType.String, 32)]
            [PropertySelector(Expression = "./div[5]/strong/a")]
            public long CommentsCount { get; set; }

            [StoredAs("shopname", DataType.String, 100)]
            [PropertySelector(Expression = ".//div[@class='p-shop']/@data-shop_name")]
            public string ShopName { get; set; }

            [StoredAs("name", DataType.String, 50)]
            [PropertySelector(Expression = ".//div[@class='p-name']/a/em")]
            public string Name { get; set; }

            [StoredAs("venderid", DataType.String, 25)]
            [PropertySelector(Expression = "./@venderid")]
            public string VenderId { get; set; }

            [StoredAs("jdzy_shop_id", DataType.String, 25)]
            [PropertySelector(Expression = "./@jdzy_shop_id")]
            public string JdzyShopId { get; set; }

            [StoredAs("run_id", DataType.Date)]
            [PropertySelector(Expression = "Monday", Type = SelectorType.Enviroment)]
            public DateTime RunId { get; set; }

            [PropertySelector(Expression = "Now", Type = SelectorType.Enviroment)]
            [StoredAs("cdate", DataType.Time)]
            public DateTime CDate { get; set; }
        }
    }
 
        
Run the spider
public class Program
{
    public static void Main(string[] args)
    {
        JdSkuSampleSpider spiderBuilder = new JdSkuSampleSpider();
        spiderBuilder.Run("rerun");
    }
}


A complete spider in under 100 lines of code. Isn't that remarkably simple?

