Recently, at a friend's request, I put together a simple crawler. The requirements were modest: it mainly needs to be easy to extend to new websites and to capture the data we want. Analysis of the collected data is a topic for later; I will cover it step by step in future posts.
There are plenty of open-source crawler frameworks. I had previously looked at Java's Nutch, which also offers Lucene-based full-text search, as well as various Python crawlers. So why DotnetSpider? I once built a distributed framework in .NET whose mechanics resemble DotnetSpider's, so it felt familiar right away, and I have grown quite fond of it.
First, the overall layering of the solution:

InternetSpider: a console program; it can later be deployed as a Windows service
ISee.Shaun.Spiders.Business: the crawler's central scheduling layer, responsible for configuring, starting, and running spiders
ISee.Shaun.Spiders.Common: shared utilities, including the reflection helpers, the Dianping area dictionary, and the callback delegate definitions
ISee.Shaun.Spiders.Pipeline: implementations of BasePipeline; mainly handles persisting the data
ISee.Shaun.Spiders.Processor: implementations of BasePageProcessor; mainly handles data extraction via XPath
ISee.Shaun.Spiders.SpiderModel: the data-model layer, responsible for entity definitions and EF data access
Taking Hunan-cuisine data on Dianping as an example, the program runs as follows:
InternetSpider reads the configuration file to obtain the URLs to crawl. Dianping paginates search results to at most 50 pages, so to get more data we have to narrow the search criteria. After some observation, crawling district by district works reasonably well; the URL template is http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}.
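To make the template concrete, here is a minimal standalone sketch of how the two placeholders expand. The district code "r16" is one entry from the area dictionary shown later; three pages are shown purely for brevity:

```csharp
using System;

class UrlTemplateDemo
{
    static void Main()
    {
        // {0} = district code, {1} = page number; Dianping caps any listing at 50 pages
        const string template = "http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}";
        // "r16" (西城區) is one code from the area dictionary
        for (int page = 1; page <= 3; page++)
        {
            Console.WriteLine(string.Format(template, "r16", page));
        }
    }
}
```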
Figure 1: search URL for Hunan cuisine

Figure 2: district-filtered search URL, 11 pages in total

So where do the district codes come from? Just open the page in Chrome; they are all right there in the markup.

Here is the dictionary in full:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.Common
{
    public static class DazhongdianpingArea
    {
        private static Dictionary<string, string> areaDic = null;

        public static Dictionary<string, string> GetAreaDic()
        {
            if (areaDic == null)
            {
                areaDic = new Dictionary<string, string>();
                areaDic.Add("r16", "西城區");
                areaDic.Add("r15", "東城區");
                areaDic.Add("r17", "海淀區");
                areaDic.Add("r328", "石景山區");
                areaDic.Add("r14", "朝陽區");
                areaDic.Add("r20", "豐台區");
                areaDic.Add("r9158", "順義區");
                areaDic.Add("r5950", "昌平區");
                areaDic.Add("r5952", "大興區");
                areaDic.Add("r9157", "房山區");
                areaDic.Add("r5951", "通州區");
                areaDic.Add("c4453", "懷柔區");
                areaDic.Add("c435", "延慶區");
                areaDic.Add("c434", "密雲區");
                areaDic.Add("c4454", "門頭溝區");
                areaDic.Add("c4455", "平谷區");
            }
            return areaDic;
        }
    }
}
OK, now a look at the configuration file, with the URLs wired up:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <configSections>
    <!-- For more information on Entity Framework configuration, visit http://go.microsoft.com/fwlink/?LinkID=237468 -->
    <section name="entityFramework" type="System.Data.Entity.Internal.ConfigFile.EntityFrameworkSection, EntityFramework, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
  </configSections>
  <appSettings>
    <!-- Top-level category crawl URL, 50 pages in total -->
    <add key="WebUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/p{0}" />
    <!-- Refined URL with the district code added -->
    <add key="WebAreaUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}" />
  </appSettings>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6.1" />
  </startup>
  <connectionStrings>
    <!-- Database connection string -->
    <add name="ConnectionStr" connectionString="data source=.;initial catalog=Membership_Spider;integrated security=True;user id=sa;password=123asd!@#;multipleactiveresultsets=True;" providerName="System.Data.SqlClient" />
  </connectionStrings>
  <entityFramework>
    <defaultConnectionFactory type="System.Data.Entity.Infrastructure.LocalDbConnectionFactory, EntityFramework">
      <parameters>
        <parameter value="mssqllocaldb" />
      </parameters>
    </defaultConnectionFactory>
    <providers>
      <provider invariantName="System.Data.SqlClient" type="System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer" />
    </providers>
  </entityFramework>
</configuration>
Once we have the page URLs, we initialize the crawler. I defined a RunSpider class; its constructor takes the names of the Processor and Pipeline implementation classes as strings, the encoding, and so on. Calling Run starts the crawl.
using ISee.Shaun.Spiders.Business;
using ISee.Shaun.Spiders.Common;
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace InternetSpider
{
    class Program
    {
        private static string urlInfo = ConfigurationManager.AppSettings["WebUrls"];
        private static string urlAreaInfo = ConfigurationManager.AppSettings["WebAreaUrls"];

        static void Main(string[] args)
        {
            Run();
        }

        /// <summary>
        /// Begin spider
        /// </summary>
        private static void Run()
        {
            // Build the start URLs: 50 pages for each district code
            Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
            List<string> urls = new List<string>();
            foreach (var key in areaDic.Keys)
            {
                for (int i = 1; i <= 50; i++)
                {
                    urls.Add(string.Format(urlAreaInfo, key, i));
                }
            }
            RunSpider runSpiders = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
            runSpiders.Run(urls);

            // Alternative: crawl the top-level category URL by page number
            //RunSpider runSpider = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
            //runSpider.Run(urlInfo, 50);
        }
    }
}
I won't walk through RunSpider line by line; the comments cover it. Its main job is to make it easy to start new tasks, to crawl sites under different domains, and to launch the child-page crawls triggered through the delegate. Reflection is used so that, later on, batch task configuration files can be defined and their tasks run automatically:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DotnetSpider.Core;
using DotnetSpider.Core.Downloader;
using DotnetSpider.Core.Pipeline;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Scheduler;
using ISee.Shaun.Spiders.Common;
using ISee.Shaun.Spiders.Pipeline;
using ISee.Shaun.Spiders.Processor;

namespace ISee.Shaun.Spiders.Business
{
    public class RunSpider
    {
        private const string ASSEMBLY_PROCESSOR_NAME = "ISee.Shaun.Spiders.Processor";
        private const string ASSEMBLY_PIPELINE_NAME = "ISee.Shaun.Spiders.Pipeline";
        private BaseProcessor processor = null;
        private BasePipeline pipeline = null;
        private Site site = null;
        private string encoding = string.Empty;
        private bool removeOutBound = false;
        private int spiderThreadNums = 1;

        public int SpiderThreadNums { get => spiderThreadNums; set => spiderThreadNums = value; }

        /// <summary>
        /// Constructor
        /// </summary>
        public RunSpider(string processorName, string pipeLineName, string encoding, bool removeOutBound)
        {
            // Obtain the processor implementation class via reflection
            processor = ReflectionInvoke.GetInstance(ASSEMBLY_PROCESSOR_NAME, processorName, null) as BaseProcessor;
            // If the processor needs to report back, it uses this delegate -- here, to launch child-page crawls
            processor.InvokeFoodUrls = this.InvokeNext;
            pipeline = ReflectionInvoke.GetInstance(ASSEMBLY_PIPELINE_NAME, pipeLineName, null) as BasePipeline;
            this.encoding = encoding;
            this.removeOutBound = removeOutBound;
        }

        /// <summary>
        /// Run from a URL template and page count
        /// </summary>
        public void Run(string urlInfo, int times)
        {
            SetSite(encoding, removeOutBound, urlInfo, times);
            Run();
        }

        /// <summary>
        /// Run from a list of URLs
        /// </summary>
        public void Run(List<string> urlList)
        {
            SetSite(encoding, removeOutBound, urlList);
            Run();
        }

        /// <summary>
        /// Begin spider
        /// </summary>
        private void Run()
        {
            Spider spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), processor);
            spider.AddPipeline(pipeline);
            spider.Downloader = new HttpClientDownloader();
            spider.ThreadNum = this.spiderThreadNums;
            spider.EmptySleepTime = 3000;
            spider.Deep = 3;
            spider.Run();
        }

        private void InvokeNext(string processorName, string pipeLineName, List<string> foodUrls)
        {
            RunSpider runSpider = new RunSpider(processorName, pipeLineName, this.encoding, true);
            runSpider.Run(foodUrls);
        }

        /// <summary>
        /// Set the site URLs from a URL template and a variable page number
        /// </summary>
        private void SetSite(string encoding, bool removeOutBound, string urlInfo, int times)
        {
            // Honor the caller's removeOutBound setting
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = removeOutBound };
            if (times == 0)
            {
                this.site.AddStartUrl(urlInfo);
            }
            else
            {
                List<string> urls = new List<string>();
                for (int i = 1; i <= times; ++i)
                {
                    urls.Add(string.Format(urlInfo, i));
                }
                this.site.AddStartUrls(urls);
            }
        }

        /// <summary>
        /// Set the site URLs from a URL list
        /// </summary>
        private void SetSite(string encoding, bool removeOutBound, List<string> urlList)
        {
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = removeOutBound };
            this.site.AddStartUrls(urlList);
        }
    }
}
As for the Processor, I will later add implementation classes for other websites, so the shared properties are pulled up into a base class:
using DotnetSpider.Core;
using DotnetSpider.Core.Processor;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using static ISee.Shaun.Spiders.Common.DelegeteDefine;

namespace ISee.Shaun.Spiders.Processor
{
    public class BaseProcessor : BasePageProcessor
    {
        // URLs of shop detail pages collected for a second-level crawl
        protected List<string> foodUrls = null;

        public CallbackEventHandler InvokeFoodUrls { get; set; }
        protected string SourceWebsite { get; set; }

        public BaseProcessor()
        {
            foodUrls = new List<string>();
        }

        // Subclasses override this to do the actual page parsing
        protected override void Handle(Page page)
        {
            throw new NotImplementedException();
        }

        protected virtual void InvokeCallback(string processorName, string pipeLineName)
        {
            if (InvokeFoodUrls != null && this.foodUrls.Count > 0)
            {
                InvokeFoodUrls(processorName, pipeLineName, this.foodUrls);
            }
        }
    }
}
Next, the concrete implementation. I won't elaborate on XPath here; there is plenty of material online, and if the page structure is unclear you can use Chrome's developer tools, or grab the HTML while debugging and inspect it yourself, so I won't add more screenshots of that kind:
using DotnetSpider.Core;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Selector;
using ISee.Shaun.Spiders.Common;
using ISee.Shaun.Spiders.SpiderModel.Model;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using static ISee.Shaun.Spiders.Common.DelegeteDefine;

namespace ISee.Shaun.Spiders.Processor
{
    public class DazhongdianpingProcessor : BaseProcessor
    {
        public DazhongdianpingProcessor() : base()
        {
            // Tag the source of the data
            SourceWebsite = "大眾點評";
        }

        /// <summary>
        /// Override the base method; this is where the data extraction happens
        /// </summary>
        protected override void Handle(Page page)
        {
            // Use Selectable to query the page and build the data objects we want
            var totalVideoElements = page.Selectable.SelectList(Selectors.XPath(".//div[@class='shop-list J_shop-list shop-all-list']/ul/li")).Nodes();
            if (totalVideoElements == null)
            {
                return;
            }

            // The restaurants parsed from this page
            List<Restaurant> restaurantList = new List<Restaurant>();
            foreach (var restElement in totalVideoElements)
            {
                var restaurant = new Restaurant() { SourceWebsite = SourceWebsite };

                // Extract the restaurant fields via XPath
                restaurant.Name = restElement.Select(Selectors.XPath(".//h4")).GetValue();
                var price = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a[@class='mean-price']/b")).GetValue();
                restaurant.AveragePrice = string.IsNullOrEmpty(price) ? "0" : price.Replace("¥", "");
                restaurant.Type = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a/span[@class='tag']")).GetValue();
                restaurant.Star = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='comment']/span/@title")).GetValue();
                restaurant.ImageUrl = restElement.Select(Selectors.XPath(".//div[@class='pic']/a/img/@src")).GetValue();

                // Resolve the district name from the area code embedded in the page URL
                var areaCode = page.Url.Substring(page.Url.LastIndexOf('/') + 1);
                if (!string.IsNullOrEmpty(areaCode) && (areaCode.Contains("r") || areaCode.Contains("c")))
                {
                    Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
                    string result = areaCode.Substring(0, areaCode.IndexOf('p'));
                    if (areaDic.ContainsKey(result))
                    {
                        restaurant.Area = areaDic[result];
                    }
                }

                // Taste / environment / service scores; require all three to avoid indexing past the end
                List<ISelectable> infoList = restElement.SelectList(Selectors.XPath("./div[@class='txt']/span[@class='comment-list']/span/b")).Nodes() as List<ISelectable>;
                if (infoList != null && infoList.Count >= 3)
                {
                    var result = infoList[0].GetValue();
                    restaurant.Taste = string.IsNullOrEmpty(result) ? string.Empty : result;
                    result = infoList[1].GetValue();
                    restaurant.Environment = string.IsNullOrEmpty(result) ? string.Empty : result;
                    result = infoList[2].GetValue();
                    restaurant.ServiceScore = string.IsNullOrEmpty(result) ? string.Empty : result;
                }

                var recommendList = restElement.SelectList(Selectors.XPath(".//div[@class='txt']/div[@class='recommend']/a")).Nodes();
                if (recommendList != null)
                {
                    restaurant.Recommendation = string.Join(",", recommendList.Select(o => o.GetValue()));
                }
                restaurant.Address = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/span")).GetValue();
                restaurant.Position = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a[@data-click-name='shop_tag_region_click']/span[@class='tag']")).GetValue();

                var shopUrl = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a/@href")).GetValue();
                if (!string.IsNullOrEmpty(shopUrl))
                {
                    restaurant.Code = shopUrl.Substring(shopUrl.LastIndexOf('/') + 1);
                    // Queue the shop's detail page for the second-level crawl
                    this.foodUrls.Add(shopUrl);
                }
                restaurantList.Add(restaurant);
            }

            // To run the second-level crawl, uncomment this and implement the two corresponding classes
            //InvokeCallback("DazhongdianpingFoodProcessor", "DazhongdianpingFoodPipeline");

            // Store the results in the page under a custom key for the pipeline to pick up
            page.AddResultItem("RestaurantList", restaurantList);
        }
    }
}
The data entity definitions:
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.SpiderModel.Model
{
    public class FoodInfo
    {
        [Key]
        public int Id { get; set; }
        public int RestaurantId { get; set; }
        public string Code { get; set; }
        public string RestaurantCode { get; set; }
        public string Name { get; set; }
        public string Price { get; set; }
        public string FoodImageUrl { get; set; }

        [ForeignKey("RestaurantId")]
        public Restaurant restaurant { get; set; }
    }
}
After the data is scraped, the spider automatically hands it off to the pipeline for processing. Straight to the code:
using DotnetSpider.Core;
using DotnetSpider.Core.Pipeline;
using ISee.Shaun.Spiders.SpiderModel.Model;
using ISee.Shaun.Spiders.SpiderModel;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.Pipeline
{
    public class DazhongdianpingPipeline : BasePipeline
    {
        /// <summary>
        /// Persist the restaurant records
        /// </summary>
        public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
        {
            // Iterate over the result set
            foreach (ResultItems entry in resultItems)
            {
                // EF context
                using (var rEntity = new FoodInfoEntity())
                {
                    List<Restaurant> resList = new List<Restaurant>();
                    foreach (Restaurant result in entry.Results["RestaurantList"])
                    {
                        // De-duplicate by restaurant name and address
                        var resultList = rEntity.RestaurantInfo.Where(o => o.Name == result.Name && o.Address == result.Address).ToList();
                        if (resultList.Count == 0)
                        {
                            resList.Add(result);
                        }
                    }
                    if (resList.Count > 0)
                    {
                        rEntity.RestaurantInfo.AddRange(resList);
                        rEntity.SaveChanges();
                    }
                }
            }
        }
    }
}
That's the whole thing; it really is that simple. A few points worth emphasizing:
1. To crawl a large number of pages, you can add extra XML configuration files that define the crawl rules or tasks. (I won't go into detail here; leave a comment if you have questions.)
2. To extend the crawler to sites such as Meituan, just implement the corresponding classes in the Processor and Pipeline layers.
3. For the data entities I used EF Code First; feel free to swap in whatever approach or database you prefer. There are plenty of articles about EF online.
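For point 1, a task-definition file might look something like this. To be clear, this is only my own illustration of what such a schema could be; the element and attribute names below are invented, not part of DotnetSpider or this project:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical task definition: element and attribute names are illustrative only -->
<spiderTasks>
  <spiderTask name="DazhongdianpingXiangcai"
              processor="DazhongdianpingProcessor"
              pipeline="DazhongdianpingPipeline"
              encoding="UTF-8"
              pages="50"
              urlTemplate="http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}" />
</spiderTasks>
```

A small loader could read each spiderTask element, build the URL list from urlTemplate and pages, and construct a RunSpider with the processor and pipeline names via the existing reflection helpers.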
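For point 2, extending to a new site boils down to subclassing BaseProcessor (and BasePipeline) and following the same pattern as the Dianping classes. A minimal sketch, with the caveat that the class name and the XPath expression below are made up for illustration; the real selectors must come from the target site's actual markup:

```csharp
using System.Collections.Generic;
using DotnetSpider.Core;
using DotnetSpider.Core.Selector;
using ISee.Shaun.Spiders.SpiderModel.Model;

namespace ISee.Shaun.Spiders.Processor
{
    // Hypothetical example of a processor for another site; the XPath is a
    // placeholder, not Meituan's real page structure.
    public class MeituanProcessor : BaseProcessor
    {
        public MeituanProcessor() : base()
        {
            // Tag the source of the data
            SourceWebsite = "美團";
        }

        protected override void Handle(Page page)
        {
            // Placeholder selector: replace with the site's real list-item XPath
            var items = page.Selectable.SelectList(Selectors.XPath(".//div[@class='poi-tile']")).Nodes();
            if (items == null) return;

            var restaurantList = new List<Restaurant>();
            foreach (var item in items)
            {
                restaurantList.Add(new Restaurant
                {
                    SourceWebsite = SourceWebsite,
                    Name = item.Select(Selectors.XPath(".//h4")).GetValue()
                });
            }
            // Same key as the Dianping processor, so a matching pipeline can reuse the pattern
            page.AddResultItem("RestaurantList", restaurantList);
        }
    }
}
```

Pairing this with a MeituanPipeline (analogous to DazhongdianpingPipeline) completes the extension; RunSpider can then be constructed with the two class names as strings, exactly as before.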
That's it for today. It was mostly code, so make of it what you will. Also, starting next week, the 1024 Famulei series, on hold for more than two years, will resume; I just want to see it through properly. Best wishes to all!
Addendum: the GitHub repository is at https://github.com/sall84993356/Spiders.git
