A Simple Crawler: Scraping the Cnblogs Article List

Reposted article (see the original). 2014-05-20 08:39, 6773, Asp.Net

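When you need data from another website and it either offers no suitable API or its API is not flexible enough, a crawler is the natural fit. There are several kinds of crawlers, and sites present their content in several ways, so each case has to be analyzed: every site resembles the others in some respects and differs in others.

Most cases can be handled with HttpWebRequest, regardless of whether the site adds tokens, random codes, request parameters, GET or POST submission, a referrer check, multiple responses, and so on. Some sites load their content with Ajax; if the response is JSON or some other fixed format, that is still easy to handle. If the page is very complex, you can fetch it with the WebBrowser control instead. In either case, the last step is to parse the result with regular expressions and pull out the data you need.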
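As a rough illustration of the request options listed above (submission method, referrer, request parameters, cookies), the following sketch configures an HttpWebRequest before sending it. The class name RequestSketch and its methods EncodeForm, Build, and Send are hypothetical names introduced here, not part of the original article. Which headers or tokens a particular site actually checks has to be found by inspecting its traffic.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;

// Hypothetical helper, for illustration only.
public static class RequestSketch
{
    // Encodes key/value pairs as an application/x-www-form-urlencoded body.
    public static string EncodeForm(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value)));
    }

    // Builds (but does not send) a request with a referrer, cookies and a browser-like user agent.
    public static HttpWebRequest Build(string url, string method, string referer, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        request.Referer = referer;          // some sites check where the request came from
        request.CookieContainer = cookies;  // reuse one container so cookies persist across requests
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0";
        if (method == "POST")
        {
            request.ContentType = "application/x-www-form-urlencoded";
        }
        return request;
    }

    // Sends the request, writing the body first if there is one, and returns the response text.
    public static string Send(HttpWebRequest request, string body)
    {
        if (!string.IsNullOrEmpty(body))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(body);
            request.ContentLength = bytes.Length;
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(bytes, 0, bytes.Length);
            }
        }
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Keeping Build separate from Send means the request can be inspected or adjusted before any network traffic happens, and the same CookieContainer can be passed to several requests in a row.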

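So let's scrape the article list on the home page: article title, summary, publication time, author, comment count, and view count. Take a look at the structure diagram first.

The GET request returns static HTML. The code is as follows.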

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    public class HttpCnblogs
    {
        public static List<CnblogsModel> HttpGetHtml()
        {
            // Request the home page with browser-like headers.
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/");
            request.Method = "GET";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0";

            // Read the whole page, disposing the response and reader afterwards.
            string articleContent;
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader sr = new StreamReader(response.GetResponseStream()))
            {
                articleContent = sr.ReadToEnd();
            }

            List<CnblogsModel> list = new List<CnblogsModel>();

            #region Regular expressions
            RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
            // One div.post_item_body per article; the lazy match stops at the first </div>
            Regex regBody = new Regex(@"<div\sclass=""post_item_body"">(.*?)</div>", options);
            // <a> tags: article title, author name, comments, views
            Regex regA = new Regex("<a[^>]*?>(.*?)</a>", options);
            // <p> tag: article summary
            Regex regP = new Regex(@"<p\sclass=""post_item_summary"">(.*?)</p>", options);
            // Extracts the number from the comment and view links, e.g. "Comments(10)" -> 10
            Regex regNumbernew = new Regex(@"\d+", options);
            // Extracts the publication time
            Regex regTime = new Regex(@"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}", options);
            #endregion

            MatchCollection mList = regBody.Matches(articleContent);
            for (int i = 0; i < mList.Count; i++)
            {
                string strBody = mList[i].Groups[1].Value;
                MatchCollection aList = regA.Matches(strBody);
                MatchCollection pList = regP.Matches(strBody);
                int aCount = aList.Count;

                // The indexing below needs at least four links (title, author,
                // comments, views) and a summary; skip blocks that lack them.
                if (aCount < 4 || pList.Count == 0)
                {
                    continue;
                }

                CnblogsModel model = new CnblogsModel();
                model.ArticleTitle = aList[0].Groups[1].Value;
                // With five links, an extra link precedes the author, so the author is at index 2.
                model.ArticleAutor = aCount == 5 ? aList[2].Groups[1].Value : aList[1].Groups[1].Value;
                // The last two links are the comment count and the view count.
                model.ArticleComment = Convert.ToInt32(regNumbernew.Match(aList[aCount - 2].Groups[1].Value).Value);
                model.ArticleTime = regTime.Match(strBody).Value;
                model.ArticleView = Convert.ToInt32(regNumbernew.Match(aList[aCount - 1].Groups[1].Value).Value);
                model.ArticleContent = pList[0].Groups[1].Value;
                list.Add(model);
            }
            return list;
        }
    }

    public class CnblogsModel
    {
        /// <summary>
        /// Article title
        /// </summary>
        public String ArticleTitle { get; set; }
        /// <summary>
        /// Article summary
        /// </summary>
        public String ArticleContent { get; set; }
        /// <summary>
        /// Article author
        /// </summary>
        public String ArticleAutor { get; set; }
        /// <summary>
        /// Article publication time
        /// </summary>
        public String ArticleTime { get; set; }
        /// <summary>
        /// Number of comments
        /// </summary>
        public Int32 ArticleComment { get; set; }
        /// <summary>
        /// Number of views
        /// </summary>
        public Int32 ArticleView { get; set; }
    }
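To see what the regular expressions extract without touching the network, here is a minimal sketch that runs the same patterns over a hard-coded fragment. The fragment in Sample is invented for illustration and only mimics the shape the patterns expect; the real markup on the site may differ. ParseSketch and ParsedItem are hypothetical names, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container for the extracted fields.
public class ParsedItem
{
    public string Title;
    public string Summary;
    public string Author;
    public string Time;
    public int Comments;
    public int Views;
}

public static class ParseSketch
{
    // Invented sample fragment (an assumption, not real site markup).
    public const string Sample =
        "<div class=\"post_item_body\">" +
        "<h3><a href=\"/p/1.html\">Sample title</a></h3>" +
        "<p class=\"post_item_summary\">Sample summary text.</p>" +
        "<div class=\"post_item_foot\">" +
        "<a href=\"/u/someone/\">someone</a> 2014-05-20 08:39 " +
        "<a href=\"#\">Comments(10)</a> <a href=\"#\">Views(256)</a>" +
        "</div></div>";

    public static ParsedItem Parse(string html)
    {
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Lazy match: the captured body ends at the first </div> after the opening tag.
        string body = Regex.Match(html, @"<div\sclass=""post_item_body"">(.*?)</div>", options).Groups[1].Value;
        MatchCollection links = Regex.Matches(body, "<a[^>]*?>(.*?)</a>", options);
        int n = links.Count;

        ParsedItem item = new ParsedItem();
        item.Title = links[0].Groups[1].Value;
        // Same indexing rule as the article's code: with five links the author is the third one.
        item.Author = n == 5 ? links[2].Groups[1].Value : links[1].Groups[1].Value;
        item.Time = Regex.Match(body, @"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}").Value;
        // The last two links hold the counts; \d+ pulls out the digits.
        item.Comments = int.Parse(Regex.Match(links[n - 2].Groups[1].Value, @"\d+").Value);
        item.Views = int.Parse(Regex.Match(links[n - 1].Groups[1].Value, @"\d+").Value);
        item.Summary = Regex.Match(body, @"<p\sclass=""post_item_summary"">(.*?)</p>", options).Groups[1].Value;
        return item;
    }
}
```

Calling ParseSketch.Parse(ParseSketch.Sample) on this fragment yields the title "Sample title", the author "someone", the time "2014-05-20 08:39", 10 comments, and 256 views. Because the patterns depend on exact class names and on the order of the links, any change to the page layout will break the extraction, which is the usual trade-off of regex-based scraping.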

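Finally, take a look at the article models that were retrieved.

This is not particularly well written, so please bear with me. I'm off to prepare for an interview.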

