Crawling cnblogs (博客园) with the DotnetSpider Crawler

This article is a repost. 2016-05-25 22:36 · Open Source

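Crawling cnblogs

Today ModestMT.Zou published the second chapter of his DotnetSpider crawler series. It is short and clear, and I understood most of it, so I wanted to try it myself and picked cnblogs (博客园) as the first target.

The most basic usage is covered in the post below, so this article does not repeat it:

[Open-source cross-platform .NET data-collection crawler framework: DotnetSpider] [2] The most basic, most flexible usage

I have already downloaded the code from https://github.com/zlzforever/DotnetSpider and compiled it successfully.

I am using VS2015, because the project uses some C# 6.0 syntactic sugar.

First, create a new console application in VS2015 and name it DotnetSpiderDemo.

Create a data object.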
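The post does not show the data object itself. A minimal sketch, reconstructed from the properties that the later code reads and writes; the property types are an assumption, and the name Conter is kept exactly as the later code spells it:

```csharp
using System;

// Reconstructed data object: only the members used by the code in this post.
public class Cnblog
{
    public string Title { get; set; }   // post title
    public string Url { get; set; }     // link to the detail page
    public string Author { get; set; }  // author name
    public string Conter { get; set; }  // post body; spelled as in the later code
}

public static class CnblogDemo
{
    public static void Main()
    {
        // Example values are made up for the demo.
        var post = new Cnblog { Title = "Hello", Url = "http://www.cnblogs.com/", Author = "someone" };
        Console.WriteLine($"{post.Title} by {post.Author}");
    }
}
```

Plain auto-properties are enough here, since the processors only assign strings and the pipelines only serialize them.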

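First reference two DLLs:

Java2Dotnet.Spider.Core.dll

Newtonsoft.Json.dll

If DotnetSpider compiled successfully, both can be found in the output directory.

Now write the data processor by implementing the IPageProcessor interface.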

 
    using System.Collections.Generic;
    // The DotnetSpider namespaces from Java2Dotnet.Spider.Core.dll are also needed;
    // the original post does not list them.

    /// <summary>
    /// List page processor.
    /// </summary>
    public class PageListProcessor : IPageProcessor
    {
        public Site Site { get; set; }

        public void Process(Page page)
        {
            // Each post on the list page sits in a <div class="post_item">.
            var totalCnblogElements = page.Selectable.SelectList(Selectors.XPath("//div[@class='post_item']")).Nodes();
            List<Cnblog> results = new List<Cnblog>();
            foreach (var cnblogElement in totalCnblogElements)
            {
                var cnblog = new Cnblog();
                // The expressions start with "." so they are evaluated relative to the current post element.
                cnblog.Title = cnblogElement.Select(Selectors.XPath(".//div[@class='post_item_body']/h3/a")).GetValue();
                cnblog.Url = cnblogElement.Select(Selectors.XPath(".//div[@class='post_item_body']/h3")).Links().GetValue();
                cnblog.Author = cnblogElement.Select(Selectors.XPath(".//div[@class='post_item_foot']/a[1]")).GetValue();
                results.Add(cnblog);
            }
            // Hand the list to the pipelines under the key "Result".
            page.AddResultItem("Result", results);
        }
    }

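For XPath, see the tutorial at http://www.w3school.com.cn/xpath/. I only read through it this afternoon myself; with some XML/HTML background it is not hard.

As for how to write the XPath expressions, I find Chrome's Inspect Element sufficient, since it can copy an element's XPath. There is also an XPath extension for Chrome, but I cannot get past the firewall, so I did not install it.

For example, Chrome gives //*[@id="post_list"]/div[20]/div[2]/h3/a, which you then adjust as needed.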
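To try out expressions like these without running a crawl, a small sketch using the XPath support in System.Xml may help. It does not use DotnetSpider's selectors, the markup is made up to mimic the structure the expressions expect, and real HTML pages are often not well-formed XML, so this is only for experimenting with the expressions themselves:

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Made-up, well-formed markup that mimics the list page structure.
    const string Html = @"<div id='post_list'>
  <div class='post_item'>
    <div class='post_item_body'><h3><a href='http://example.com/1'>First post</a></h3></div>
    <div class='post_item_foot'><a href='http://example.com/u/alice'>alice</a></div>
  </div>
  <div class='post_item'>
    <div class='post_item_body'><h3><a href='http://example.com/2'>Second post</a></h3></div>
    <div class='post_item_foot'><a href='http://example.com/u/bob'>bob</a></div>
  </div>
</div>";

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Html);

        // Absolute expression: every post container in the document.
        foreach (XmlNode item in doc.SelectNodes("//div[@class='post_item']"))
        {
            // Relative expressions start with "." and search only inside the current item.
            XmlNode link = item.SelectSingleNode(".//div[@class='post_item_body']/h3/a");
            XmlNode author = item.SelectSingleNode(".//div[@class='post_item_foot']/a[1]");
            Console.WriteLine($"{link.InnerText} | {link.Attributes["href"].Value} | {author.InnerText}");
        }
    }
}
```

The XPath copied from Chrome is absolute and index-based (div[20]), so it matches exactly one element; replacing the index with a class predicate, as above, is what turns it into a pattern that matches every post.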

Storing the data

Implement the IPipeline interface; whether you save to a file or to a database is up to you.

 
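    using System;
    using System.IO;
    using Newtonsoft.Json;

    public class ListPipeline : IPipeline
    {
        private string _path;

        public ListPipeline(string path)
        {
            if (string.IsNullOrEmpty(path))
            {
                throw new Exception("The file name must not be empty!");
            }
            _path = path;
            if (!File.Exists(_path))
            {
                // File.Create returns an open FileStream; dispose it so the file is not left locked.
                File.Create(_path).Dispose();
            }
        }

        public void Dispose()
        {
        }

        public void Process(ResultItems resultItems, ISpider spider)
        {
            // Serialize the writes; the spider in this demo runs with two threads.
            lock (this)
            {
                foreach (Cnblog entry in resultItems.Results["Result"])
                {
                    // One JSON object per line.
                    File.AppendAllText(_path, JsonConvert.SerializeObject(entry) + Environment.NewLine);
                }
            }
        }
    }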
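One detail of this kind of file pipeline is easy to get wrong: File.Create returns an open FileStream, so the handle should be released before anything else appends to the file. A minimal sketch using only System.IO; the temporary file and the two JSON strings are placeholders for the demo:

```csharp
using System;
using System.IO;

public static class FileCreateDemo
{
    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

        // Create the empty file and release the handle immediately.
        using (File.Create(path)) { }

        // AppendAllText opens, writes and closes the file on every call.
        File.AppendAllText(path, "{\"Title\":\"a\"}" + Environment.NewLine);
        File.AppendAllText(path, "{\"Title\":\"b\"}" + Environment.NewLine);

        Console.WriteLine(File.ReadAllLines(path).Length); // prints 2
        File.Delete(path);
    }
}
```

Since File.AppendAllText creates the file when it does not exist, the explicit File.Create call could also simply be dropped.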

Next, write the code that runs the spider in the Main method of Program.

 
    class Program
    {
        static void Main(string[] args)
        {
            var site = new Site() { EncodingName = "UTF-8" };
            for (int i = 1; i <= 30; i++) // 30 pages
            {
                site.AddStartUrl($"http://www.cnblogs.com/#p{i}");
            }
            Spider spider = Spider.Create(site, new PageListProcessor(), new QueueDuplicateRemovedScheduler())
                .AddPipeline(new ListPipeline("test.json"))
                .SetThreadNum(2); // two threads
            spider.Run();
            Console.Read();
        }
    }
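A detail worth checking in the start URLs: everything after # in a URL is a fragment, which HTTP clients normally do not send to the server. The sketch below uses only System.Uri to show that these start URLs differ only in their fragment; whether DotnetSpider's downloader and scheduler treat them as distinct pages is not something this post confirms, so verify that the saved results really cover different pages.

```csharp
using System;

public static class FragmentDemo
{
    public static void Main()
    {
        for (int i = 1; i <= 3; i++)
        {
            var uri = new Uri($"http://www.cnblogs.com/#p{i}");
            // PathAndQuery is the part that goes into the HTTP request line;
            // Fragment stays on the client side.
            Console.WriteLine($"{uri.PathAndQuery} {uri.Fragment}");
        }
        // prints:
        // / #p1
        // / #p2
        // / #p3
    }
}
```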

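Every list page is now saved. But that is not the end: usually you want the detail pages as well as the list pages, so I added two more classes. This is how I implemented it for now, although it feels a bit slow.

Add a processor for the detail pages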

 
    /// <summary>
    /// Detail page processor.
    /// </summary>
    public class PageDetailProcessor : IPageProcessor
    {
        private Cnblog cnblog;

        public PageDetailProcessor(Cnblog _cnblog)
        {
            cnblog = _cnblog;
        }

        public Site Site { get; set; }

        public void Process(Page page)
        {
            // The post body is the element with id "cnblogs_post_body".
            cnblog.Conter = page.Selectable.Select(Selectors.XPath("//*[@id='cnblogs_post_body']")).GetValue();
            page.AddResultItem("detail", cnblog);
        }
    }

Then add a pipeline that saves the detail page data

 
    using System;
    using System.IO;
    using System.Text;

    public class DetailPipeline : IPipeline
    {
        private string path;

        public DetailPipeline(string _path)
        {
            if (string.IsNullOrEmpty(_path))
            {
                throw new Exception("The path must not be empty!");
            }
            path = _path;
            if (!Directory.Exists(_path))
            {
                Directory.CreateDirectory(_path);
            }
        }

        public void Dispose()
        {
        }

        public void Process(ResultItems resultItems, ISpider spider)
        {
            Cnblog cnblog = resultItems.Results["detail"];
            byte[] bytes = UTF8Encoding.UTF8.GetBytes("Url:" + cnblog.Url + Environment.NewLine + cnblog.Conter);
            // One text file per post, named after its title; "using" closes the stream even if the write throws.
            using (FileStream fs = File.Create(Path.Combine(path, cnblog.Title + ".txt")))
            {
                fs.Write(bytes, 0, bytes.Length);
            }
        }
    }
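Because the file name comes straight from the post title, a title containing characters such as / would make File.Create throw. A minimal sketch of one way to guard against that; ToSafeFileName is a hypothetical helper, not part of DotnetSpider or the original demo, and replacing invalid characters with an underscore is just one possible policy:

```csharp
using System;
using System.IO;
using System.Linq;

public static class FileNameDemo
{
    // Hypothetical helper: replaces every character that is invalid in a file name with '_'.
    public static string ToSafeFileName(string title)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        return new string(title.Select(c => invalid.Contains(c) ? '_' : c).ToArray());
    }

    public static void Main()
    {
        string title = "C#/.NET crawler notes";
        // Path.Combine picks the right directory separator for the current platform.
        string file = Path.Combine("details", ToSafeFileName(title) + ".txt");
        Console.WriteLine(file);
    }
}
```

Two different titles can still map to the same file name after sanitizing, so a real pipeline might also append something unique per post.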

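Modify the ListPipeline class by adding a RequestDetail method. My idea is that each time a list entry is saved, its detail page is requested and then saved as well.

All detail pages are saved in the details directory.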

 
    using System;
    using System.IO;
    using Newtonsoft.Json;

    public class ListPipeline : IPipeline
    {
        private string _path;

        public ListPipeline(string path)
        {
            if (string.IsNullOrEmpty(path))
            {
                throw new Exception("The file name must not be empty!");
            }
            _path = path;
            if (!File.Exists(_path))
            {
                // File.Create returns an open FileStream; dispose it so the file is not left locked.
                File.Create(_path).Dispose();
            }
        }

        public void Dispose()
        {
        }

        public void Process(ResultItems resultItems, ISpider spider)
        {
            lock (this)
            {
                foreach (Cnblog entry in resultItems.Results["Result"])
                {
                    // One JSON object per line.
                    File.AppendAllText(_path, JsonConvert.SerializeObject(entry) + Environment.NewLine);
                    // Fetch and save the detail page right after saving the list entry.
                    RequestDetail(entry);
                }
            }
        }

        /// <summary>
        /// Requests the detail page of a list entry.
        /// </summary>
        /// <param name="entry">The list entry whose detail page is fetched.</param>
        private static void RequestDetail(Cnblog entry)
        {
            ISpider spider;
            var site = new Site() { EncodingName = "UTF-8" };
            site.AddStartUrl(entry.Url);
            // A separate single-threaded spider is created and run for every detail page.
            spider = Spider.Create(site, new PageDetailProcessor(entry), new QueueDuplicateRemovedScheduler())
                .AddPipeline(new DetailPipeline("details"))
                .SetThreadNum(1);
            spider.Run();
        }
    }

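Leave the rest of the code unchanged and run the program; the detail page content is now saved too.

In the end the program runs without major problems, but crawling the detail pages is rather slow. My idea is to add all detail pages to the scheduler together and then process them with several threads; I still need to learn how to do that.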
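The idea of one shared queue drained by several workers can be sketched with plain .NET types. This is not DotnetSpider's scheduler, only an illustration of the pattern; the URLs are placeholders and the "work" is a stand-in for downloading and parsing a page:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class SharedQueueDemo
{
    public static void Main()
    {
        // All detail URLs go into one thread-safe queue up front.
        var queue = new ConcurrentQueue<string>(
            Enumerable.Range(1, 10).Select(i => $"http://example.com/post/{i}"));
        var done = new ConcurrentBag<string>();

        // Four workers pull from the same queue until it is empty.
        var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
        {
            string url;
            while (queue.TryDequeue(out url))
            {
                // A real worker would download and parse the page here.
                done.Add(url);
            }
        })).ToArray();

        Task.WaitAll(workers);
        Console.WriteLine(done.Count); // prints 10
    }
}
```

Compared with starting a new single-threaded spider per entry inside a lock, this keeps every worker busy and pays the setup cost only once.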

Demo download
