[Open Source] Scraping cnblogs Articles in 50 Lines of Code

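This article is a repost. 2015-08-07 09:55 6776 Tags: HTML matching / advanced program architecture design / advanced WinForm programming / web crawler / advanced algorithm design and abstraction / Laura.MatchCore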

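Today I came across an article on cnblogs: "Web crawler + HtmlAgilityPack + Windows service: crawling 200,000 blog posts from cnblogs".

On a whim, I immediately set out to scrape cnblogs articles with just 50 lines of code.

This is not a publicity stunt: the screenshots speak for themselves.

Nor is it a malicious attack on cnblogs, so I scrape only 10 pages of data. I hope the cnblogs administrators will understand.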

Data preparation (captured by monitoring requests with the browser's F12 developer tools):

　　Article list URL: http://www.cnblogs.com/mvc/AggSite/PostList.aspx?CategoryId=808&CategoryType=SiteHome&ItemListActionName=PostList&PageIndex=3&ParentCategoryId=0

　　Article list HTML: <a class="titlelnk" href="http://www.cnblogs.com/2010wuhao/p/4707154.html" target="_blank">Android中的Intent詳解</a> (the key part is class="titlelnk")

　　Article body HTML: <div id="cnblogs_post_body"> ...... </div><div id="MySignature"> (the key parts are id="cnblogs_post_body" and id="MySignature")
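To make the three pieces of data above concrete, here is a minimal sketch of what the scraping rules need to capture, written with plain regular expressions. It is not how Laura.MatchCore works internally. The class CnblogsHtmlSketch and its methods are hypothetical names introduced only for illustration, and the patterns assume the attribute order shown in the captured HTML above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, NOT part of Laura.MatchCore: it only illustrates
// which parts of the captured URL and HTML the scraping rules target.
public static class CnblogsHtmlSketch
{
    // Matches <a class="titlelnk" href="...">title</a>, capturing URL and title.
    // Assumption: class comes before href, as in the captured sample.
    private static readonly Regex TitleLink = new Regex(
        "<a\\s+class=\"titlelnk\"\\s+href=\"(?<url>[^\"]+)\"[^>]*>(?<title>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Captures everything between the opening body div and the signature div.
    private static readonly Regex PostBody = new Regex(
        "<div\\s+id=\"cnblogs_post_body\"[^>]*>(?<body>.*?)</div>\\s*<div\\s+id=\"MySignature\"",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Builds the list URL for one page; only PageIndex varies between pages.
    public static string BuildListUrl(int pageIndex)
    {
        return "http://www.cnblogs.com/mvc/AggSite/PostList.aspx?CategoryId=808&CategoryType=SiteHome"
             + "&ItemListActionName=PostList&PageIndex=" + pageIndex + "&ParentCategoryId=0";
    }

    // Returns (title, url) pairs for every title link found in the list HTML.
    public static List<KeyValuePair<string, string>> ExtractTitleLinks(string listHtml)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in TitleLink.Matches(listHtml ?? string.Empty))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["title"].Value.Trim(), m.Groups["url"].Value));
        }
        return result;
    }

    // Returns the article body HTML, or an empty string when the markers are missing.
    public static string ExtractBody(string articleHtml)
    {
        Match m = PostBody.Match(articleHtml ?? string.Empty);
        return m.Success ? m.Groups["body"].Value.Trim() : string.Empty;
    }
}
```

Regular expressions like these are brittle against markup changes, which is presumably why the article configures the rules visually in a matching engine instead.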

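Matching engine:

　　Configure the scraping rules for the article list:

　　Get the formatted HTML:

　　Add matching rules for the actual fields:

　　WYSIWYG: view the match results in real time:

　　Configure the scraping rules for the article body:

　　Debug the scraping results for the article body:

　　Create a new console project:

　　Reference the core framework, Laura.MatchCore:

　　Finish the 50 lines of code, run it in the debugger, and the scraping begins:

Program source (50 lines):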

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;
using Laura.MatchCore.Entity;

namespace Temp_20150806_1838
{
    class Program
    {
        static void Main(string[] args)
        {
            const string 文章列表URL_模板 = "http://www.cnblogs.com/mvc/AggSite/PostList.aspx?CategoryId=808&CategoryType=SiteHome&ItemListActionName=PostList&PageIndex={PageIndex}&ParentCategoryId=0";

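            // Load the two rule sets saved by the matching engine (list page and article page).
            MatchSchema matchSchema_List = (MatchSchema) ReadStream(@"Data\扒取博客園_文章列表.data");
            MatchSchema matchSchema_Content = (MatchSchema) ReadStream(@"Data\扒取博客園_文章內容.data");
            if (matchSchema_List == null || matchSchema_Content == null)
            {
                // ReadStream returns null when a file is missing or cannot be deserialized.
                Console.WriteLine("Could not load the match schema files under Data\\; aborting.");
                return;
            }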

            for (int i = 1; i <= 10; i++) // scrape only 10 pages
            {
                Console.WriteLine("Scraping page {0}", i);
                Console.WriteLine("----------------------------------------");


                string urlList = 文章列表URL_模板.Replace("{PageIndex}", i.ToString());
                string htmlList = ReadHtml(urlList, Encoding.UTF8);

                MatchObject matchObject_List = matchSchema_List.CalculateFieldValues(htmlList);
                List<string> listTitle = matchObject_List.GetValues("文章標題");
                List<string> listUrl = matchObject_List.GetValues("文章URL");

                for (int j = 0; j < listUrl.Count; j++)
                {
                    string urlContent = listUrl[j];
                    string htmlContent = ReadHtml(urlContent, Encoding.UTF8);

                    MatchObject matchObject_Content = matchSchema_Content.CalculateFieldValues(htmlContent);
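                    // Guard against a missing match (assumes GetValue may return null when nothing matches).
                    string content = matchObject_Content.GetValue("文章正文") ?? string.Empty;
                    // Guard against the title and URL lists having different lengths.
                    string title = j < listTitle.Count ? listTitle[j] : string.Empty;

                    Console.WriteLine("Title: {0} \r\nBody: {1}", title, (content.Length >= 100 ? content.Substring(0, 100) : content)); // print only the first 100 characters of the body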
                    Console.WriteLine("----------------------------------------");
                }
            }


            Console.WriteLine("Finished scraping 10 pages of cnblogs");
        }




        // Two utility methods, not counted toward the 50 lines
        public static object ReadStream(string file)
        {
            if (!File.Exists(file)) return null;

            try
            {
                BinaryFormatter myBf = new BinaryFormatter();
                using (FileStream myFs = new FileStream(file, FileMode.Open))
                {
                    object record = myBf.Deserialize(myFs);
                    return record;
                }
            }
            catch { return null; }
        }
        public static string ReadHtml(string url, Encoding encoding)
        {
            string result = string.Empty;
            try
            {
                using (WebClient webClient = new WebClient { Headers = { { "user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)" } } })
                {
                    using (Stream stream = webClient.OpenRead(url))
                    {
                        if (stream != null)
                        {
                            using (StreamReader streamReader = new StreamReader(stream, encoding))
                            {
                                // The enclosing using blocks dispose the reader and stream; no explicit Close() is needed.
                                result = streamReader.ReadToEnd();
                            }
                        }
                    }
                }
            }
            catch (Exception exp)
            {
                string logMsg = string.Format("BaseUtil.WebTools.ReadHtml(url): error while fetching HTML from URL |{0}|: {1}", url, exp);
                Console.WriteLine(logMsg);
            }
            return result;
        }
    }
}

Click to download the source code (including the core framework source).
