Playing with a Small Crawler: Getting Started

Reposted article. Originally published 2012-11-02 22:44, part of the Small Crawler (小爬蟲) series.

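A while ago I worked on a product whose revenue model was simply selling data to users, wrapped in a WPF front end. The data itself was collected by the company's targeted crawlers. I never touched that part in my actual job, but it is something I can play with and study in my own time.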

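Since we want to fetch page content, we necessarily start from a startUrl. From this startUrl we can traverse the whole site breadth-first, just like the graph traversal we learned in a data structures course.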
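To make the graph analogy concrete, here is a minimal sketch of breadth-first traversal over a small in-memory "site". The class name `BfsSketch`, the method `Traverse`, and the `links` dictionary are assumptions made for illustration; a real crawler would discover links by downloading pages rather than looking them up in a dictionary.

```csharp
using System;
using System.Collections.Generic;

public static class BfsSketch
{
    // Returns the pages in the order a breadth-first traversal visits them.
    // 'links' maps each page to the pages it links to (a hypothetical in-memory site).
    public static List<string> Traverse(string startUrl, IDictionary<string, string[]> links)
    {
        var order = new List<string>();
        var seen = new HashSet<string> { startUrl };
        var queue = new Queue<string>();
        queue.Enqueue(startUrl);

        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            order.Add(current);

            string[] outgoing;
            if (!links.TryGetValue(current, out outgoing))
                continue; // this page has no outgoing links

            foreach (var next in outgoing)
            {
                // HashSet.Add returns false when the page was already seen,
                // so each page is queued at most once.
                if (seen.Add(next))
                    queue.Enqueue(next);
            }
        }

        return order;
    }
}
```

With a site where "/" links to "/a" and "/b", and both of those link to "/c", the pages one hop from the start are visited before the page two hops away.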

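Since crawling has two parts, "requesting pages" and "parsing pages", the implementation needs two collections: Todo and Visited. To keep things simple, we start with a single-machine crawler. Crawlers inevitably deal with massive amounts of data, so performance cannot be ignored. For Todo and Visited we choose Queue and HashSet respectively: a Queue gives the first-in, first-out order that breadth-first traversal needs, and a HashSet lookup takes constant time on average. The activity diagram below illustrates the flow.

[Figure: activity diagram of the crawl flow]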
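One way to pair the two collections is to hide them behind a small helper, sketched below. The `Frontier` class and its members are hypothetical names, not part of the original code. The sketch marks a URL as seen at the moment it is queued, so the same URL cannot be queued twice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that pairs the Todo queue with the Visited set.
public class Frontier
{
    private readonly Queue<string> todo = new Queue<string>();
    private readonly HashSet<string> visited = new HashSet<string>(StringComparer.Ordinal);

    public int PendingCount { get { return todo.Count; } }
    public int SeenCount { get { return visited.Count; } }

    // Queues the URL only the first time it is offered.
    // The HashSet membership check is O(1) on average.
    public bool Offer(string url)
    {
        if (!visited.Add(url))
            return false; // already seen, do not queue again

        todo.Enqueue(url);
        return true;
    }

    // Takes the next URL in first-in, first-out order; returns false when nothing is left.
    public bool TryTake(out string url)
    {
        if (todo.Count == 0)
        {
            url = null;
            return false;
        }

        url = todo.Dequeue();
        return true;
    }
}
```

A crawl loop would call `TryTake` until it returns false and call `Offer` for every link found on each downloaded page.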

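During the breadth-first traversal, there are two things to watch out for:

①: Links on a page are sometimes relative addresses, which must be converted to absolute addresses.

②: External links must be removed.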
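Both points can be sketched with `System.Uri`. The class `LinkRules`, its method names, and the `baseDomain` parameter (for example "example.com") are assumptions for illustration. The sketch resolves a link against the page it was found on, and treats a link as internal when its host equals the base domain or is a subdomain of it.

```csharp
using System;

public static class LinkRules
{
    // Resolves 'href' against the page it was found on.
    // Returns null when the result is not a usable http(s) address
    // (for example a mailto: link or a malformed href).
    public static Uri ToAbsolute(Uri pageUri, string href)
    {
        Uri result;
        if (!Uri.TryCreate(pageUri, href, out result))
            return null;

        if (result.Scheme != Uri.UriSchemeHttp && result.Scheme != Uri.UriSchemeHttps)
            return null;

        return result;
    }

    // True when 'candidate' is on 'baseDomain' itself or on one of its subdomains.
    // The leading dot in the suffix check keeps "notexample.com" from matching "example.com".
    public static bool IsInternal(string baseDomain, Uri candidate)
    {
        var host = candidate.Host;
        return host.Equals(baseDomain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + baseDomain, StringComparison.OrdinalIgnoreCase);
    }
}
```

For a page at http://www.example.com/news/index.html, the href "list.html" resolves to http://www.example.com/news/list.html and "../about.html" resolves to http://www.example.com/about.html.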

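Let's take the official site of one of our departments, traverse it breadth-first, and see how many links it has, with external links removed of course.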

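using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    public class Program
    {
        static void Main(string[] args)
        {
            var crawler = new Crawler("http://www.weishangye.com/");

            crawler.DownLoad();

            // Show the links we crawled
            foreach (var item in crawler.visited)
            {
                Console.WriteLine(item);
            }
        }
    }

    public class Crawler
    {
        // Base address (the start URL)
        private readonly Uri baseUri;

        // Base domain, e.g. ".weishangye.com" for "www.weishangye.com"
        private readonly string baseHost;

        /// <summary>
        /// Work queue: URLs waiting to be downloaded
        /// </summary>
        public readonly Queue<string> todo = new Queue<string>();

        /// <summary>
        /// URLs already seen. A URL is added when it is queued,
        /// so the same URL is never queued or downloaded twice.
        /// </summary>
        public readonly HashSet<string> visited = new HashSet<string>();

        // Matches <a href=...>...</a>; compiled once instead of once per page
        private static readonly Regex linkRegex = new Regex(@"(?is)<a[^>]*?href=(['""]?)(?<url>[^'""\s>]+)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>");

        public Crawler(string url)
        {
            baseUri = new Uri(url);

            // Base domain: drop the first label of the host
            // (assumes a host of the form "www.example.com")
            baseHost = baseUri.Host.Substring(baseUri.Host.IndexOf('.'));

            // Queue the start address and mark it as seen
            var start = baseUri.GetLeftPart(UriPartial.Query);
            visited.Add(start);
            todo.Enqueue(start);
        }

        public void DownLoad()
        {
            while (todo.Count > 0)
            {
                var currentUrl = todo.Dequeue();

                try
                {
                    var request = (HttpWebRequest)WebRequest.Create(currentUrl);

                    using (var response = (HttpWebResponse)request.GetResponse())
                    using (var sr = new StreamReader(response.GetResponseStream()))
                    {
                        // Extract URLs and put the unseen ones into todo
                        RefineUrl(sr.ReadToEnd(), response.ResponseUri);
                    }
                }
                catch (WebException)
                {
                    // Skip pages that fail to download (404, timeout, ...)
                }
                catch (IOException)
                {
                    // Skip pages whose body could not be read
                }
            }
        }

        /// <summary>
        /// Extract URLs
        /// </summary>
        /// <param name="html">Page source</param>
        /// <param name="pageUri">Address of the page the source came from</param>
        public void RefineUrl(string html, Uri pageUri)
        {
            foreach (Match m in linkRegex.Matches(html))
            {
                var url = m.Groups["url"].Value;

                // Fragment-only links point back at the current page
                if (url.StartsWith("#", StringComparison.Ordinal))
                    continue;

                // Convert a relative path to an absolute one, resolved against
                // the current page; skip hrefs that cannot be parsed
                Uri uri;
                if (!Uri.TryCreate(pageUri, url, out uri))
                    continue;

                // Only http(s) links can be crawled (skips mailto:, javascript:, ...)
                if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
                    continue;

                // Remove external links (compare against the base domain)
                if (!uri.Host.EndsWith(baseHost, StringComparison.OrdinalIgnoreCase))
                    continue;

                // Drop the fragment so "page.html#a" and "page.html#b" count as one URL
                var absolute = uri.GetLeftPart(UriPartial.Query);

                // HashSet.Add returns false if the URL was already seen
                if (visited.Add(absolute))
                {
                    todo.Enqueue(absolute);
                }
            }
        }
    }
}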
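The link-extraction pattern from the listing can also be tried in isolation on a string, without any network access. The sketch below reuses that pattern unchanged; the wrapper class `LinkExtractor` and the method `ExtractHrefs` are hypothetical names. A regular expression like this is a quick approximation and will not handle every form of HTML, so a dedicated HTML parser is likely the more robust choice for real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as in the crawler listing: the optional quote is captured
    // as group 1 and must be repeated after the URL (\1).
    private static readonly Regex AnchorRegex = new Regex(
        @"(?is)<a[^>]*?href=(['""]?)(?<url>[^'""\s>]+)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>");

    // Returns the raw href values in document order, without resolving or filtering them.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            hrefs.Add(m.Groups["url"].Value);
        }
        return hrefs;
    }
}
```

The returned values are still raw: relative paths, "#" placeholders, and external addresses all come back as written in the page, which is why the crawler resolves and filters them afterwards.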

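Of course there is still plenty of room for optimization, but since this is the opening post of the series, we will leave it here. Getting started quickly comes first.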
