Dynamically Scraping Web Page Information

This article is a repost. 2016-04-27 17:16, category: C# techniques

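A few days ago, while doing a database lab exercise, I kept adding small amounts of fixed data to the database by hand, and I started wondering how to import large amounts of dynamic data instead. Reading up online, I learned about web crawlers, which can do exactly this job. There are plenty of introductions to crawler principles and basics online, so I won't repeat them here; I found the following articles particularly good (Basic Principles of Web Crawlers, Part 1 and Part 2).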

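This post uses the news section of the cnblogs (博客園) home page as its example. To keep things simple and visual, it uses MVC and shows the scraped data on a page. (Many small sites work exactly this way: they scrape the information they need from around the web and build an application on top of it.) In a real crawl you can also fetch with multiple threads to speed up collection.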
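As a rough illustration of the multi-threaded idea, here is a minimal sketch that downloads several pages in parallel. The class name `ParallelFetchSketch`, the method `FetchAll`, and the `maxParallel` parameter are made up for this example and are not part of the original project. The downloader is passed in as a delegate, so the sketch runs offline with a stand-in; in the real project the `SendUrl` method from HomeController.cs could be passed instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelFetchSketch
{
    // Downloads every URL using at most maxParallel worker threads.
    // 'fetch' is any function that turns a URL into HTML text (hypothetical parameter).
    public static IDictionary<string, string> FetchAll(
        IEnumerable<string> urls, Func<string, string> fetch, int maxParallel)
    {
        // ConcurrentDictionary is safe to write to from several threads at once.
        var pages = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(urls, options, url =>
        {
            pages[url] = fetch(url);
        });

        return pages;
    }

    public static void Main()
    {
        // Stand-in downloader so the sketch runs without network access.
        Func<string, string> fakeFetch = url => "<html>" + url + "</html>";
        var urls = new[] { "http://example.com/a", "http://example.com/b", "http://example.com/c" };

        IDictionary<string, string> pages = FetchAll(urls, fakeFetch, 2);
        Console.WriteLine("Fetched pages: " + pages.Count);
    }
}
```

Capping the degree of parallelism keeps the crawler from flooding the target site with simultaneous requests; the value 2 here is arbitrary.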

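Let's first look at the cnblogs home page and analyze it:

[Screenshot: the cnblogs home page]

The result after scraping:

[Screenshot: the scraped result]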

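How the scraping works: first fetch the HTML of the target URL, then locate the HTML structure around the data you want and check whether it follows a regular pattern. If it does, write a regular expression for that pattern; whatever it matches can then be extracted. Looking at the page source, the pattern for the news section is clear: the entries sit inside the <div> with id="post_list".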

From that we can derive the corresponding regular expression:

"<div\\s*class=\"post_item\">\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*<div\\s*class=\"post_item_body\">\\s*<h3><a\\s*class=\"titlelnk\"\\s*href=\"(?<href>.*)\"\\s*target=\"_blank\">(?<title>.*)</a>.*\\s*<p\\s*class=\"post_item_summary\">\\s*(?<content>.*)\\s*</p>"
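To see what the pattern above captures, here is a small console sketch that runs it against hand-written sample markup. The sample HTML is hypothetical and only mimics the structure the pattern expects; it is not copied from the real site, whose markup may differ or may have changed. The names `PatternDemo`, `Item`, and `Extract` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // The same pattern as above, unchanged.
    private const string Pattern = "<div\\s*class=\"post_item\">\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*<div\\s*class=\"post_item_body\">\\s*<h3><a\\s*class=\"titlelnk\"\\s*href=\"(?<href>.*)\"\\s*target=\"_blank\">(?<title>.*)</a>.*\\s*<p\\s*class=\"post_item_summary\">\\s*(?<content>.*)\\s*</p>";

    // Builds one hypothetical news entry in the shape the pattern expects.
    private static string Item(string href, string title, string summary)
    {
        return "<div class=\"post_item\">\n"
             + "<div class=\"digg\">\n"
             + "<span class=\"diggnum\">0</span>\n"
             + "</div>\n"
             + "<div class=\"post_item_body\">\n"
             + "<h3><a class=\"titlelnk\" href=\"" + href + "\" target=\"_blank\">" + title + "</a></h3>\n"
             + "<p class=\"post_item_summary\">\n"
             + summary + "\n"
             + "</p>\n"
             + "</div>\n"
             + "</div>\n";
    }

    // Hypothetical sample page containing two entries.
    public static readonly string SampleHtml =
        Item("http://example.com/post/1", "First post", "Summary of the first post.")
      + Item("http://example.com/post/2", "Second post", "Summary of the second post.");

    // Returns one array per match, in the order: title, content, href.
    public static List<string[]> Extract(string html)
    {
        var regex = new Regex(Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
        var results = new List<string[]>();
        foreach (Match match in regex.Matches(html))
        {
            results.Add(new[]
            {
                match.Groups["title"].Value,
                match.Groups["content"].Value,
                match.Groups["href"].Value
            });
        }
        return results;
    }

    public static void Main()
    {
        foreach (string[] entry in Extract(SampleHtml))
        {
            Console.WriteLine(entry[0] + " | " + entry[1] + " | " + entry[2]);
        }
    }
}
```

Because `.` does not match a newline without `RegexOptions.Singleline`, each `.*` stays on one line. The run of seven `.*` groups therefore skips at most seven lines between the `post_item` and `post_item_body` tags, which ties the pattern closely to the page's exact layout.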

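The principle is simple, so here is the source code. First create an empty MVC project, then add a controller named HomeController under the Controllers folder, and add a view named Index for that controller.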

Part of HomeController.cs:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using System.Web.Mvc;

namespace WebApplication1.Controllers
{
    public class HomeController : Controller
    {
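        /// <summary>
        /// Sends a request to the given URL and returns the page's HTML content.
        /// </summary>
        /// <param name="strUrl">Address of the page to download.</param>
        /// <returns>The response body as a string.</returns>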
        public static string SendUrl(string strUrl)
        {
            // Dispose the response and the reader deterministically. Exceptions
            // propagate to the caller with their original stack trace.
            WebRequest webRequest = WebRequest.Create(strUrl);
            using (WebResponse webResponse = webRequest.GetResponse())
            using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
        public ActionResult Index()
        {
            string strPattern = "<div\\s*class=\"post_item\">\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*<div\\s*class=\"post_item_body\">\\s*<h3><a\\s*class=\"titlelnk\"\\s*href=\"(?<href>.*)\"\\s*target=\"_blank\">(?<title>.*)</a>.*\\s*<p\\s*class=\"post_item_summary\">\\s*(?<content>.*)\\s*</p>";
            List<List<string>> list = new List<List<string>>();
            Regex regex = new Regex(strPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
            // Download the page once and run the regex over that single copy.
            string html = SendUrl("http://www.cnblogs.com/");
            foreach (Match match in regex.Matches(html))
            {
                List<string> one_list = new List<string>();
                one_list.Add(match.Groups["title"].Value);   // title of the list entry
                one_list.Add(match.Groups["content"].Value); // summary text
                one_list.Add(match.Groups["href"].Value);    // address the entry links to
                list.Add(one_list);
            }
            ViewBag.list = list;
            return View();
        }
    }
}

Part of the Index view:

@{
    Layout = null;
}

<!DOCTYPE html>

<html>
<head>
    <meta name="viewport" content="width=device-width" />
    <title>Index</title>
    <style type="text/css">
        #customers {
            font-family: "Trebuchet MS", Arial, Helvetica, sans-serif;
            width: 100%;
            border-collapse: collapse;
            outline: #00ff00 dotted thick;
        }

            #customers td, #customers th {
                font-size: 1em;
                border: 1px solid #98bf21;
                padding: 3px 7px 2px 7px;
            }

            #customers th {
                font-size: 1.1em;
                text-align: left;
                padding-top: 5px;
                padding-bottom: 4px;
                background-color: #A7C942;
                color: #ffffff;
            }
    </style>
</head>
<body>
    <div>
        <table id="customers">
            <tr>
                <th>Title</th>
                <th>Content</th>
                <th>Link</th>
            </tr>
            @foreach (var a in ViewBag.list)
            {
                int count = 0;
                <tr>
                    @foreach (string b in a)
                    {
                        if (++count == 3)
                        {
                            <td><a href="@b">@HttpUtility.HtmlDecode(b)</a></td>@* decode HTML entities so the text displays normally *@
                        }
                        else if(count==1)
                        {
                            <td><font color="red">@HttpUtility.HtmlDecode(b)</font></td>
                        }
                        else
                        {
                            <td>@HttpUtility.HtmlDecode(b)</td>
                        }
                    }
                </tr>
            }
        </table>
    </div>
</body>
</html>

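At this point we have a complete MVC project that runs as is. It only scrapes one page, though. You could also scrape the pager section of the cnblogs home page (pager_buttom) and add a method that follows the pages; I won't paste that code here, so try it yourself. If you want to import the data into a database, create the corresponding table first and then extract from the HTML, one by one, the fields that match the table's columns. Also, don't store the page source of each news item in the database; store only the item's link. Downloading the full page for a large number of news items takes a long time and slows the crawl down, and storing that many pages in the database takes a lot of space and hurts database performance.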
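As one possible way to prepare the scraped data for a table, the sketch below maps each entry to a typed row that keeps only the link, never the page source. The `NewsItem` class, its three properties, and the assumed "News" table are hypothetical; the original post does not define a schema. Skipping entries with a repeated link is an extra assumption added here, not something the post prescribes. The actual database insert is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type: one property per column of an assumed "News" table.
public class NewsItem
{
    public string Title { get; set; }
    public string Summary { get; set; }
    public string Link { get; set; }   // only the URL is stored, not the page source
}

public static class NewsRows
{
    // Converts the controller's list (title, content, link per entry) into rows,
    // skipping malformed entries and entries whose link was already seen.
    public static List<NewsItem> ToRows(IEnumerable<List<string>> scraped)
    {
        var seenLinks = new HashSet<string>();
        var rows = new List<NewsItem>();
        foreach (List<string> entry in scraped)
        {
            if (entry.Count < 3 || !seenLinks.Add(entry[2]))
            {
                continue;
            }
            rows.Add(new NewsItem { Title = entry[0], Summary = entry[1], Link = entry[2] });
        }
        return rows;
    }

    public static void Main()
    {
        // Made-up entries in the same order the controller uses: title, content, link.
        var scraped = new List<List<string>>
        {
            new List<string> { "First post", "Summary one", "http://example.com/post/1" },
            new List<string> { "First post again", "Summary one", "http://example.com/post/1" },
            new List<string> { "Second post", "Summary two", "http://example.com/post/2" }
        };

        foreach (NewsItem row in ToRows(scraped))
        {
            Console.WriteLine(row.Title + " -> " + row.Link);
        }
    }
}
```

Each `NewsItem` can then be written with a parameterized INSERT, one parameter per property.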

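I'm still a beginner who only started learning this recently, so corrections from more experienced readers are welcome. Thanks, and please don't laugh...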
