Scraping the cnblogs Home Page with HtmlAgilityPack: A Simple Example (2014-03-31)

Reposted article. Originally published 2014-03-31 18:33. Tag: web scraping

I. Time:

2014-03-31 18:08:32. Time to leave work again. It has been a busy day, I am tired, and I did not even have lunch.

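II. Background:

When Windows 8 first came out, I had the idea of building a third-party cnblogs client app, but my skills were not up to it and there were too many distractions, so it went nowhere. Recently I decided to build my own website, so I wanted to scrape content from cnblogs, because reading blogs every day helps me grow and teaches me a lot.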

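III. Approach:

At first I planned to fetch the page myself and parse it with regular expressions, but I ran into all sorts of problems when parsing the titles, and it was clear the approach would not be very efficient anyway. So I switched to HtmlAgilityPack (download: http://htmlagilitypack.codeplex.com/, an open-source DLL that works very well) for the parsing. Microsoft's own mshtml is another option, but I found HtmlAgilityPack the better fit, so I went with it.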
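To illustrate why an HTML parser is easier to work with than regular expressions here, the following is a minimal sketch of XPath-based extraction with HtmlAgilityPack. The sample HTML, the `TitleExtractor` class, and the `//h3/a` query are assumptions made for illustration; they are not taken from the cnblogs page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TitleExtractor
{
    // Returns the text of every <a> directly inside an <h3> in the given HTML.
    public static List<string> ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//h3/a");
        if (links == null)
        {
            return titles;
        }
        foreach (HtmlNode link in links)
        {
            // InnerText keeps HTML entities such as &amp;, so decode them.
            titles.Add(HtmlEntity.DeEntitize(link.InnerText).Trim());
        }
        return titles;
    }

    static void Main()
    {
        // Hypothetical markup, loosely shaped like a list of blog posts.
        string html =
            "<div><h3><a href='/p/1'>First post</a></h3>" +
            "<h3><a href='/p/2'>Tom &amp; Jerry</a></h3></div>";

        foreach (string title in ExtractTitles(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

The parser handles nested and unclosed tags, so the query only has to describe where the data sits in the document tree rather than what the surrounding markup looks like character by character.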

IV. A simple implementation:

using System;
using System.Collections;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace 測試
{
    class Program
    {
        static void Main(string[] args)
        {
            // Earlier attempt with HttpWebRequest, kept for reference.
            //// Create the request
            //HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/#p2");

            //// Get the response
            //HttpWebResponse response = (HttpWebResponse)request.GetResponse();

            //// Get the response stream
            //Stream recStream = response.GetResponseStream();

            //// Encoding (UTF-8, despite the variable name)
            //Encoding gb2312 = Encoding.UTF8;

            //// Decode the stream with that encoding
            //StreamReader sr = new StreamReader(recStream, gb2312);

            //// Read the page content as a string
            //String content = sr.ReadToEnd();

            // Download page 2 of the home page as UTF-8 text.
            WebClient wc = new WebClient();
            wc.BaseAddress = "http://www.cnblogs.com/sitehome/p/2";
            wc.Encoding = Encoding.UTF8;
            HtmlDocument doc = new HtmlDocument();
            string html = wc.DownloadString("http://www.cnblogs.com/sitehome/p/2");
            doc.LoadHtml(html);

            // Absolute XPath of the post list; it depends on the page layout at the time of writing.
            string listNode = "/html/body/div[1]/div[4]/div[6]";
            string[] title = new string[20];
            string[] digest = new string[20];
            string[] time = new string[20];
            string[] uriList = new string[20];
            string str;
            HtmlNode node;

            // Matches a timestamp such as " 2014-03-31 18:33" inside the post footer text.
            Regex r = new Regex("\\s20\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d");

            // Note: SelectSingleNode returns null when the XPath does not match,
            // so this loop assumes that all 20 posts are present.
            for (int i = 0; i < 20; i++)
            {
                // Post title
                str = listNode + "/div[" + (i + 1) + "]/div[2]/h3[1]";
                node = doc.DocumentNode.SelectSingleNode(str);
                title[i] = node.InnerText;

                // Post summary
                str = listNode + "/div[" + (i + 1) + "]/div[2]/p[1]";
                node = doc.DocumentNode.SelectSingleNode(str);
                digest[i] = node.InnerText;

                // Publication time, extracted from the footer text
                str = listNode + "/div[" + (i + 1) + "]/div[2]/div[1]";
                node = doc.DocumentNode.SelectSingleNode(str);
                Match m = r.Match(node.InnerText);
                time[i] = m.ToString().Trim(); // drop the leading whitespace matched by \s

                // Post URL
                str = listNode + "/div[" + (i + 1) + "]/div[2]/h3[1]/a[1]";
                node = doc.DocumentNode.SelectSingleNode(str);
                uriList[i] = node.Attributes["href"].Value;
            }

            foreach (string str2 in title)
            {
                Console.WriteLine(str2);
            }
            foreach (string str2 in uriList)
            {
                Console.WriteLine(str2);
            }
            foreach (string str2 in time)
            {
                Console.WriteLine(str2);
            }
            Console.ReadKey();
        }
    }
}
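The timestamp step in the listing can also be tried in isolation, without any network access. The sketch below is one possible variant rather than the code above: it drops the leading `\s` from the pattern, uses a literal space between date and time, and converts the match to a `DateTime`. The `PostTime` class and the sample footer strings are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class PostTime
{
    // Date and time in the form "2014-03-31 18:33" (single space in between).
    static readonly Regex Stamp = new Regex(@"20\d\d-\d\d-\d\d \d\d:\d\d");

    // Returns the first timestamp found in the text, or null if there is none.
    public static DateTime? Parse(string footerText)
    {
        Match m = Stamp.Match(footerText);
        if (!m.Success)
        {
            return null;
        }
        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            m.Value, "yyyy-MM-dd HH:mm",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed);
        return ok ? parsed : (DateTime?)null;
    }

    static void Main()
    {
        // Hypothetical footer text; the real page text may differ.
        DateTime? found = Parse("someone posted 2014-03-31 18:33 comments(0)");
        Console.WriteLine(found.HasValue ? found.Value.ToString("yyyy-MM-dd HH:mm") : "no timestamp");
        Console.WriteLine(Parse("no date here").HasValue); // False
    }
}
```

Returning a nullable value makes the "no match" case explicit, whereas `Match.ToString()` silently yields an empty string when the pattern is not found.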

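V. Result:

VI. Brief explanation:

The first 20 lines are the post titles.

The middle 20 lines are the post URLs.

The last 20 lines are the publication times.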
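Because the output arrives as three separate blocks, matching a title to its URL and time means counting lines. As a possible alternative, the three parallel arrays could be combined so that each post is printed on one line. This is a sketch only; the `PostFormatter` class, the `|` separator, and the sample values are assumptions.

```csharp
using System;
using System.Collections.Generic;

static class PostFormatter
{
    // Combines parallel arrays into one "time | title | url" line per post.
    public static List<string> ToLines(string[] titles, string[] urls, string[] times)
    {
        // Use the shortest array so a missing entry cannot cause an index error.
        int count = Math.Min(titles.Length, Math.Min(urls.Length, times.Length));
        var lines = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            lines.Add(string.Format("{0} | {1} | {2}", times[i], titles[i], urls[i]));
        }
        return lines;
    }

    static void Main()
    {
        // Hypothetical sample data standing in for the scraped arrays.
        string[] titles = { "First post", "Second post" };
        string[] urls = { "http://www.cnblogs.com/a/p/1.html", "http://www.cnblogs.com/b/p/2.html" };
        string[] times = { "2014-03-31 18:33", "2014-03-31 17:05" };

        foreach (string line in ToLines(titles, urls, times))
        {
            Console.WriteLine(line);
        }
    }
}
```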

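VII. Notes:

1. http://www.cnblogs.com/sitehome/p/2 is page 2 of the home page. I first tried http://www.cnblogs.com/#p2, which never worked. Only after reading the JavaScript in the page source did I understand why: the part after # is a URL fragment, which is not sent to the server, so the paging for #p2 is evidently handled by script in the browser.

2. My skills are limited, so please go easy on me, but suggestions and discussion are welcome.

3. I have started a group for making friends, learning, and career discussion, in the hope that members will help and learn from each other. Idle lurkers please stay away; if you are interested, leave a comment.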
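The point in note 1 about the two URLs can be checked with `System.Uri` alone. The sketch below shows that `#p2` is a fragment and is not part of what a client requests from the server, and it builds the `/sitehome/p/N` address for a given page. The `PageUrl` class and its method names are hypothetical, and the URL pattern is inferred only from the page 2 address used in the article.

```csharp
using System;

static class PageUrl
{
    // Builds the paged home-page URL, following the pattern of the page 2 address.
    public static string ForPage(int page)
    {
        if (page < 1)
        {
            throw new ArgumentOutOfRangeException("page");
        }
        return "http://www.cnblogs.com/sitehome/p/" + page;
    }

    // Returns the URL up to and including the query, i.e. without the fragment.
    public static string WithoutFragment(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Query);
    }

    static void Main()
    {
        var uri = new Uri("http://www.cnblogs.com/#p2");
        Console.WriteLine(uri.Fragment);                                   // #p2
        Console.WriteLine(WithoutFragment("http://www.cnblogs.com/#p2"));  // http://www.cnblogs.com/
        Console.WriteLine(ForPage(2));                                     // http://www.cnblogs.com/sitehome/p/2
    }
}
```

Requesting the `#p2` address therefore fetches the same resource as the plain home page, which is consistent with the first attempt never returning page 2.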
