Recently, at a friend's request, I put together a simple crawler. The requirements were modest: it mainly needs to be easy to extend to new websites and to capture the data we want. Analysis of the collected data is a topic for later; I will cover it step by step in future posts.
There are plenty of open-source crawler frameworks. I had previously looked at Java's Nutch, which also offers Lucene-based full-text search, as well as various Python crawlers. So why DotnetSpider? I once built a distributed framework in .NET whose mechanics resemble DotnetSpider's, so it felt familiar right away, and I have grown quite fond of it.
First, the overall layering of the solution:

InternetSpider: a console program; it can later be deployed as a Windows service
ISee.Shaun.Spiders.Business: the crawler's central scheduling layer, responsible for configuring, starting, and running spiders
ISee.Shaun.Spiders.Common: shared utilities, including the reflection helpers, the Dianping area dictionary, and the callback delegate definitions
ISee.Shaun.Spiders.Pipeline: implementations of BasePipeline; mainly handles persisting the data
ISee.Shaun.Spiders.Processor: implementations of BasePageProcessor; mainly handles data extraction via XPath
ISee.Shaun.Spiders.SpiderModel: the data-model layer, responsible for entity definitions and EF data access
Taking Hunan-cuisine data on Dianping as an example, the program runs as follows:
InternetSpider reads the configuration file to obtain the URLs to crawl. Dianping paginates search results to at most 50 pages, so to get more data we have to narrow the search criteria. After some observation, crawling district by district works reasonably well; the URL template is http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}.
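To make the template concrete, here is a minimal standalone sketch of how the two placeholders expand. The district code "r16" is one entry from the area dictionary shown later; three pages are shown purely for brevity:

```csharp
using System;

class UrlTemplateDemo
{
    static void Main()
    {
        // {0} = district code, {1} = page number; Dianping caps any listing at 50 pages
        const string template = "http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}";
        // "r16" (西城區) is one code from the area dictionary
        for (int page = 1; page <= 3; page++)
        {
            Console.WriteLine(string.Format(template, "r16", page));
        }
    }
}
```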
Figure 1: search URL for Hunan cuisine

Figure 2: district-filtered search URL, 11 pages in total

So where do the district codes come from? Just open the page in Chrome; they are all right there in the markup.

Here is the dictionary in full:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.Common
{
    public static class DazhongdianpingArea
    {
        private static Dictionary<string, string> areaDic = null;

        public static Dictionary<string, string> GetAreaDic()
        {
            if (areaDic == null)
            {
                areaDic = new Dictionary<string, string>();
                areaDic.Add("r16", "西城區");
                areaDic.Add("r15", "東城區");
                areaDic.Add("r17", "海淀區");
                areaDic.Add("r328", "石景山區");
                areaDic.Add("r14", "朝陽區");
                areaDic.Add("r20", "豐台區");
                areaDic.Add("r9158", "順義區");
                areaDic.Add("r5950", "昌平區");
                areaDic.Add("r5952", "大興區");
                areaDic.Add("r9157", "房山區");
                areaDic.Add("r5951", "通州區");
                areaDic.Add("c4453", "懷柔區");
                areaDic.Add("c435", "延慶區");
                areaDic.Add("c434", "密雲區");
                areaDic.Add("c4454", "門頭溝區");
                areaDic.Add("c4455", "平谷區");
            }
            return areaDic;
        }
    }
}
OK, now a look at the configuration file, with the URLs wired up:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <configSections>
    <!-- For more information on Entity Framework configuration, visit http://go.microsoft.com/fwlink/?LinkID=237468 -->
    <section name="entityFramework" type="System.Data.Entity.Internal.ConfigFile.EntityFrameworkSection, EntityFramework, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
  </configSections>
  <appSettings>
    <!-- Top-level category crawl URL, 50 pages in total -->
    <add key="WebUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/p{0}" />
    <!-- Refined URL with the district code added -->
    <add key="WebAreaUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}" />
  </appSettings>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6.1" />
  </startup>
  <connectionStrings>
    <!-- Database connection string -->
    <add name="ConnectionStr" connectionString="data source=.;initial catalog=Membership_Spider;integrated security=True;user id=sa;password=123asd!@#;multipleactiveresultsets=True;" providerName="System.Data.SqlClient" />
  </connectionStrings>
  <entityFramework>
    <defaultConnectionFactory type="System.Data.Entity.Infrastructure.LocalDbConnectionFactory, EntityFramework">
      <parameters>
        <parameter value="mssqllocaldb" />
      </parameters>
    </defaultConnectionFactory>
    <providers>
      <provider invariantName="System.Data.SqlClient" type="System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer" />
    </providers>
  </entityFramework>
</configuration>
Once we have the page URLs, we initialize the crawler. I defined a RunSpider class; its constructor takes the names of the Processor and Pipeline implementation classes as strings, the encoding, and so on. Calling Run starts the crawl.
using ISee.Shaun.Spiders.Business;
using ISee.Shaun.Spiders.Common;
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace InternetSpider
{
    class Program
    {
        private static string urlInfo = ConfigurationManager.AppSettings["WebUrls"];
        private static string urlAreaInfo = ConfigurationManager.AppSettings["WebAreaUrls"];

        static void Main(string[] args)
        {
            Run();
        }

        /// <summary>
        /// Begin spider
        /// </summary>
        private static void Run()
        {
            // Build the start URLs: 50 pages for each district code
            Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
            List<string> urls = new List<string>();
            foreach (var key in areaDic.Keys)
            {
                for (int i = 1; i <= 50; i++)
                {
                    urls.Add(string.Format(urlAreaInfo, key, i));
                }
            }
            RunSpider runSpiders = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
            runSpiders.Run(urls);

            // Alternative: crawl the top-level category URL by page number
            //RunSpider runSpider = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
            //runSpider.Run(urlInfo, 50);
        }
    }
}
I won't walk through RunSpider line by line; the comments cover it. Its main job is to make it easy to start new tasks, to crawl sites under different domains, and to launch the child-page crawls triggered through the delegate. Reflection is used so that, later on, batch task configuration files can be defined and their tasks run automatically:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DotnetSpider.Core;
using DotnetSpider.Core.Downloader;
using DotnetSpider.Core.Pipeline;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Scheduler;
using ISee.Shaun.Spiders.Common;
using ISee.Shaun.Spiders.Pipeline;
using ISee.Shaun.Spiders.Processor;

namespace ISee.Shaun.Spiders.Business
{
    public class RunSpider
    {
        private const string ASSEMBLY_PROCESSOR_NAME = "ISee.Shaun.Spiders.Processor";
        private const string ASSEMBLY_PIPELINE_NAME = "ISee.Shaun.Spiders.Pipeline";
        private BaseProcessor processor = null;
        private BasePipeline pipeline = null;
        private Site site = null;
        private string encoding = string.Empty;
        private bool removeOutBound = false;
        private int spiderThreadNums = 1;

        public int SpiderThreadNums { get => spiderThreadNums; set => spiderThreadNums = value; }

        /// <summary>
        /// Constructor
        /// </summary>
        public RunSpider(string processorName, string pipeLineName, string encoding, bool removeOutBound)
        {
            // Obtain the processor implementation class via reflection
            processor = ReflectionInvoke.GetInstance(ASSEMBLY_PROCESSOR_NAME, processorName, null) as BaseProcessor;
            // If the processor needs to report back, it uses this delegate -- here, to launch child-page crawls
            processor.InvokeFoodUrls = this.InvokeNext;
            pipeline = ReflectionInvoke.GetInstance(ASSEMBLY_PIPELINE_NAME, pipeLineName, null) as BasePipeline;
            this.encoding = encoding;
            this.removeOutBound = removeOutBound;
        }

        /// <summary>
        /// Run from a URL template and page count
        /// </summary>
        public void Run(string urlInfo, int times)
        {
            SetSite(encoding, removeOutBound, urlInfo, times);
            Run();
        }

        /// <summary>
        /// Run from a list of URLs
        /// </summary>
        public void Run(List<string> urlList)
        {
            SetSite(encoding, removeOutBound, urlList);
            Run();
        }

        /// <summary>
        /// Begin spider
        /// </summary>
        private void Run()
        {
            Spider spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), processor);
            spider.AddPipeline(pipeline);
            spider.Downloader = new HttpClientDownloader();
            spider.ThreadNum = this.spiderThreadNums;
            spider.EmptySleepTime = 3000;
            spider.Deep = 3;
            spider.Run();
        }

        private void InvokeNext(string processorName, string pipeLineName, List<string> foodUrls)
        {
            RunSpider runSpider = new RunSpider(processorName, pipeLineName, this.encoding, true);
            runSpider.Run(foodUrls);
        }

        /// <summary>
        /// Set the site URLs from a URL template and a variable page number
        /// </summary>
        private void SetSite(string encoding, bool removeOutBound, string urlInfo, int times)
        {
            // Honor the caller's removeOutBound setting
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = removeOutBound };
            if (times == 0)
            {
                this.site.AddStartUrl(urlInfo);
            }
            else
            {
                List<string> urls = new List<string>();
                for (int i = 1; i <= times; ++i)
                {
                    urls.Add(string.Format(urlInfo, i));
                }
                this.site.AddStartUrls(urls);
            }
        }

        /// <summary>
        /// Set the site URLs from a URL list
        /// </summary>
        private void SetSite(string encoding, bool removeOutBound, List<string> urlList)
        {
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = removeOutBound };
            this.site.AddStartUrls(urlList);
        }
    }
}
As for the Processor, I will later add implementation classes for other websites, so the shared properties are pulled up into a base class:
using DotnetSpider.Core;
using DotnetSpider.Core.Processor;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using static ISee.Shaun.Spiders.Common.DelegeteDefine;

namespace ISee.Shaun.Spiders.Processor
{
    public class BaseProcessor : BasePageProcessor
    {
        // URLs of shop detail pages collected for a second-level crawl
        protected List<string> foodUrls = null;

        public CallbackEventHandler InvokeFoodUrls { get; set; }
        protected string SourceWebsite { get; set; }

        public BaseProcessor()
        {
            foodUrls = new List<string>();
        }

        // Subclasses override this to do the actual page parsing
        protected override void Handle(Page page)
        {
            throw new NotImplementedException();
        }

        protected virtual void InvokeCallback(string processorName, string pipeLineName)
        {
            if (InvokeFoodUrls != null && this.foodUrls.Count > 0)
            {
                InvokeFoodUrls(processorName, pipeLineName, this.foodUrls);
            }
        }
    }
}
Next, the concrete implementation. I won't elaborate on XPath here; there is plenty of material online, and if the page structure is unclear you can use Chrome's developer tools, or grab the HTML while debugging and inspect it yourself, so I won't add more screenshots of that kind:
using DotnetSpider.Core;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Selector;
using ISee.Shaun.Spiders.Common;
using ISee.Shaun.Spiders.SpiderModel.Model;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using static ISee.Shaun.Spiders.Common.DelegeteDefine;

namespace ISee.Shaun.Spiders.Processor
{
    public class DazhongdianpingProcessor : BaseProcessor
    {
        public DazhongdianpingProcessor() : base()
        {
            // Tag the source of the data
            SourceWebsite = "大眾點評";
        }

        /// <summary>
        /// Override the base method; this is where the data extraction happens
        /// </summary>
        protected override void Handle(Page page)
        {
            // Use Selectable to query the page and build the data objects we want
            var totalVideoElements = page.Selectable.SelectList(Selectors.XPath(".//div[@class='shop-list J_shop-list shop-all-list']/ul/li")).Nodes();
            if (totalVideoElements == null)
            {
                return;
            }

            // The restaurants parsed from this page
            List<Restaurant> restaurantList = new List<Restaurant>();
            foreach (var restElement in totalVideoElements)
            {
                var restaurant = new Restaurant() { SourceWebsite = SourceWebsite };

                // Extract the restaurant fields via XPath
                restaurant.Name = restElement.Select(Selectors.XPath(".//h4")).GetValue();
                var price = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a[@class='mean-price']/b")).GetValue();
                restaurant.AveragePrice = string.IsNullOrEmpty(price) ? "0" : price.Replace("¥", "");
                restaurant.Type = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a/span[@class='tag']")).GetValue();
                restaurant.Star = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='comment']/span/@title")).GetValue();
                restaurant.ImageUrl = restElement.Select(Selectors.XPath(".//div[@class='pic']/a/img/@src")).GetValue();

                // Resolve the district name from the area code embedded in the page URL
                var areaCode = page.Url.Substring(page.Url.LastIndexOf('/') + 1);
                if (!string.IsNullOrEmpty(areaCode) && (areaCode.Contains("r") || areaCode.Contains("c")))
                {
                    Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
                    string result = areaCode.Substring(0, areaCode.IndexOf('p'));
                    if (areaDic.ContainsKey(result))
                    {
                        restaurant.Area = areaDic[result];
                    }
                }

                // Taste / environment / service scores; require all three to avoid indexing past the end
                List<ISelectable> infoList = restElement.SelectList(Selectors.XPath("./div[@class='txt']/span[@class='comment-list']/span/b")).Nodes() as List<ISelectable>;
                if (infoList != null && infoList.Count >= 3)
                {
                    var result = infoList[0].GetValue();
                    restaurant.Taste = string.IsNullOrEmpty(result) ? string.Empty : result;
                    result = infoList[1].GetValue();
                    restaurant.Environment = string.IsNullOrEmpty(result) ? string.Empty : result;
                    result = infoList[2].GetValue();
                    restaurant.ServiceScore = string.IsNullOrEmpty(result) ? string.Empty : result;
                }

                var recommendList = restElement.SelectList(Selectors.XPath(".//div[@class='txt']/div[@class='recommend']/a")).Nodes();
                if (recommendList != null)
                {
                    restaurant.Recommendation = string.Join(",", recommendList.Select(o => o.GetValue()));
                }
                restaurant.Address = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/span")).GetValue();
                restaurant.Position = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a[@data-click-name='shop_tag_region_click']/span[@class='tag']")).GetValue();

                var shopUrl = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a/@href")).GetValue();
                if (!string.IsNullOrEmpty(shopUrl))
                {
                    restaurant.Code = shopUrl.Substring(shopUrl.LastIndexOf('/') + 1);
                    // Queue the shop's detail page for the second-level crawl
                    this.foodUrls.Add(shopUrl);
                }
                restaurantList.Add(restaurant);
            }

            // To run the second-level crawl, uncomment this and implement the two corresponding classes
            //InvokeCallback("DazhongdianpingFoodProcessor", "DazhongdianpingFoodPipeline");

            // Store the results in the page under a custom key for the pipeline to pick up
            page.AddResultItem("RestaurantList", restaurantList);
        }
    }
}
The data entity definitions:
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.SpiderModel.Model
{
    public class FoodInfo
    {
        [Key]
        public int Id { get; set; }
        public int RestaurantId { get; set; }
        public string Code { get; set; }
        public string RestaurantCode { get; set; }
        public string Name { get; set; }
        public string Price { get; set; }
        public string FoodImageUrl { get; set; }

        [ForeignKey("RestaurantId")]
        public Restaurant restaurant { get; set; }
    }
}
After the data is scraped, the spider automatically hands it off to the pipeline for processing. Straight to the code:
using DotnetSpider.Core;
using DotnetSpider.Core.Pipeline;
using ISee.Shaun.Spiders.SpiderModel.Model;
using ISee.Shaun.Spiders.SpiderModel;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.Pipeline
{
    public class DazhongdianpingPipeline : BasePipeline
    {
        /// <summary>
        /// Persist the restaurant records
        /// </summary>
        public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
        {
            // Iterate over the result set
            foreach (ResultItems entry in resultItems)
            {
                // EF context
                using (var rEntity = new FoodInfoEntity())
                {
                    List<Restaurant> resList = new List<Restaurant>();
                    foreach (Restaurant result in entry.Results["RestaurantList"])
                    {
                        // De-duplicate by restaurant name and address
                        var resultList = rEntity.RestaurantInfo.Where(o => o.Name == result.Name && o.Address == result.Address).ToList();
                        if (resultList.Count == 0)
                        {
                            resList.Add(result);
                        }
                    }
                    if (resList.Count > 0)
                    {
                        rEntity.RestaurantInfo.AddRange(resList);
                        rEntity.SaveChanges();
                    }
                }
            }
        }
    }
}
That's the whole thing; it really is that simple. A few points worth emphasizing:
1. To crawl a large number of pages, you can add extra XML configuration files that define the crawl rules or tasks. (I won't go into detail here; leave a comment if you have questions.)
2. To extend the crawler to sites such as Meituan, just implement the corresponding classes in the Processor and Pipeline layers.
3. For the data entities I used EF Code First; feel free to swap in whatever approach or database you prefer. There are plenty of articles about EF online.
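For point 1, a task-definition file might look something like this. To be clear, this is only my own illustration of what such a schema could be; the element and attribute names below are invented, not part of DotnetSpider or this project:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical task definition: element and attribute names are illustrative only -->
<spiderTasks>
  <spiderTask name="DazhongdianpingXiangcai"
              processor="DazhongdianpingProcessor"
              pipeline="DazhongdianpingPipeline"
              encoding="UTF-8"
              pages="50"
              urlTemplate="http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}" />
</spiderTasks>
```

A small loader could read each spiderTask element, build the URL list from urlTemplate and pages, and construct a RunSpider with the processor and pipeline names via the existing reflection helpers.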
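For point 2, extending to a new site boils down to subclassing BaseProcessor (and BasePipeline) and following the same pattern as the Dianping classes. A minimal sketch, with the caveat that the class name and the XPath expression below are made up for illustration; the real selectors must come from the target site's actual markup:

```csharp
using System.Collections.Generic;
using DotnetSpider.Core;
using DotnetSpider.Core.Selector;
using ISee.Shaun.Spiders.SpiderModel.Model;

namespace ISee.Shaun.Spiders.Processor
{
    // Hypothetical example of a processor for another site; the XPath is a
    // placeholder, not Meituan's real page structure.
    public class MeituanProcessor : BaseProcessor
    {
        public MeituanProcessor() : base()
        {
            // Tag the source of the data
            SourceWebsite = "美團";
        }

        protected override void Handle(Page page)
        {
            // Placeholder selector: replace with the site's real list-item XPath
            var items = page.Selectable.SelectList(Selectors.XPath(".//div[@class='poi-tile']")).Nodes();
            if (items == null) return;

            var restaurantList = new List<Restaurant>();
            foreach (var item in items)
            {
                restaurantList.Add(new Restaurant
                {
                    SourceWebsite = SourceWebsite,
                    Name = item.Select(Selectors.XPath(".//h4")).GetValue()
                });
            }
            // Same key as the Dianping processor, so a matching pipeline can reuse the pattern
            page.AddResultItem("RestaurantList", restaurantList);
        }
    }
}
```

Pairing this with a MeituanPipeline (analogous to DazhongdianpingPipeline) completes the extension; RunSpider can then be constructed with the two class names as strings, exactly as before.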
That's it for today. It was mostly code, so make of it what you will. Also, starting next week, the 1024 Famulei series, on hold for more than two years, will resume; I just want to see it through properly. Best wishes to all!
Addendum: the GitHub repository is at https://github.com/sall84993356/Spiders.git
