Building a Web Spider/Crawler Scraper in C# (with Source Code), Part 2

Reposted article. Posted 2012-09-18 22:40. Tags: C#, crawler, spider, asp.net, scraping

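Last time we got as far as fetching a page's HTML. The next step is to analyze that HTML: extract every link in it, then strip out the markup we don't need so that only the text content is left.

To analyze the HTML, use a regular expression to capture each link together with its link text.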

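    // Requires: using System.Collections.Generic; using System.Text; using System.Text.RegularExpressions;
    private void FindLink(string html)
    {
        List<string> hrefList = new List<string>(); // link URLs
        List<string> nameList = new List<string>(); // link text
        StringBuilder output = new StringBuilder();

        // \1 matches whichever quote character opened the href value.
        // Singleline lets . match newlines; IgnoreCase also matches <A HREF=...>.
        string pattern = @"<a\s+[^>]*?href\s*=\s*([""'])(?<href>.*?)\1[^>]*>\s*(?<name>.*?)</a>";
        MatchCollection mc = Regex.Matches(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in mc)
        {
            // Add to the lists
            hrefList.Add(m.Groups["href"].Value);
            nameList.Add(m.Groups["name"].Value);
            output.Append(m.Groups["href"].Value).Append("|").Append(m.Groups["name"].Value).Append("\n");
        }
        // Assign once at the end instead of concatenating onto the TextBox in the loop
        this.TextBox3.Text = output.ToString();
    }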

This method only does a simple link search; it does not filter out values such as # or javascript:void(0).
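One way to handle that filtering is sketched below. This is not part of the original source: the `LinkFilter` class and its `Normalize` method are hypothetical names, and the choice to keep only http/https links, resolve relative links against the page URL, and drop duplicates is an assumption about what a crawler would want.

```csharp
using System;
using System.Collections.Generic;

public static class LinkFilter
{
    // Returns absolute http/https URLs, skipping "#", "javascript:" and other unusable hrefs.
    // baseUrl is the address of the page the hrefs were extracted from.
    public static List<string> Normalize(string baseUrl, IEnumerable<string> hrefs)
    {
        Uri baseUri = new Uri(baseUrl);
        HashSet<string> seen = new HashSet<string>();
        List<string> result = new List<string>();

        foreach (string raw in hrefs)
        {
            string href = (raw ?? string.Empty).Trim();

            // Empty values, in-page anchors and script pseudo-links do not lead to a new page
            if (href.Length == 0
                || href.StartsWith("#", StringComparison.Ordinal)
                || href.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
                continue;

            // Resolve relative links such as "/about.html" against the page URL
            Uri absolute;
            if (!Uri.TryCreate(baseUri, href, out absolute))
                continue;

            // Drop mailto:, tel:, ftp: and any other non-web scheme
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;

            // Ignore the fragment so "page.html#top" and "page.html" count as one URL
            string url = absolute.GetLeftPart(UriPartial.Query);
            if (seen.Add(url))
                result.Add(url);
        }
        return result;
    }
}
```

In FindLink this could be called as `LinkFilter.Normalize(pageUrl, hrefList)`, where `pageUrl` stands for whatever URL the HTML was downloaded from.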

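Next, strip out the HTML markup we don't need and keep the text content. This is again mostly regular expressions, and there are plenty of other approaches online. Well-formed HTML pages get cleaned up without trouble, but for sites whose tags aren't properly paired or that simply don't play by the rules, I have nothing to say...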

    // Requires: using System.Text.RegularExpressions;
    public string ClearHtml(string text) // strips HTML, JS and CSS
    {
        // Check for null before calling Trim() to avoid a NullReferenceException
        if (string.IsNullOrEmpty(text))
            return string.Empty;
        text = text.Trim();

        // Singleline lets . match newlines; IgnoreCase also catches <SCRIPT>, <STYLE> etc.
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        text = Regex.Replace(text, @"<head[^>]*>.*?</head>", "", options);
        text = Regex.Replace(text, @"<script[^>]*>.*?</script>", "", options);
        text = Regex.Replace(text, @"<style[^>]*>.*?</style>", "", options);

        text = Regex.Replace(text, @"<[^>]*>", ""); // every remaining tag, including <br> and <p>
        text = Regex.Replace(text, @"&[a-zA-Z]{1,10};", ""); // named entities such as &nbsp;
        text = Regex.Replace(text, @"\s{2,}", " "); // collapse two or more whitespace characters into one space

        text = text.Replace("'", "''"); // doubles single quotes, as in the original (SQL-style escaping)
        text = text.Replace("\t", "");
        return text.Trim();
    }
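ClearHtml deletes entities such as `&amp;` outright, so "Tom &amp; Jerry" loses its ampersand. If you would rather keep those characters, a variant along the following lines might work. It is only a sketch and not from the original source: `HtmlText.ToPlainText` is a hypothetical name, it assumes .NET 4.0 or later for `System.Net.WebUtility.HtmlDecode`, and it replaces tags with spaces (rather than nothing) so that neighbouring words don't run together.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Converts HTML to plain text, decoding entities instead of deleting them.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Remove blocks whose contents are never visible text
        string text = Regex.Replace(html, @"<(head|script|style)\b[^>]*>.*?</\1\s*>", " ", options);

        // Replace every remaining tag with a space so adjacent words stay separated
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // Turn entities into the characters they stand for (&amp; becomes &, &nbsp; becomes U+00A0)
        text = WebUtility.HtmlDecode(text);

        // \s also matches U+00A0 in .NET, so decoded &nbsp; is collapsed here too
        text = Regex.Replace(text, @"\s+", " ");
        return text.Trim();
    }
}
```

Like ClearHtml, this is still regex-based, so badly broken markup will trip it up in the same way.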

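Finally, here is a method that resolves the IP addresses behind a URL. For domains that use load balancing it can return several IPs. It only works when run locally, though; hosted under IIS it needs the Full trust level (the original post linked to an explanation of trust levels).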

    // Requires: using System; using System.Net;
    private void IPAddresses(string url)
    {
        // Let Uri extract the host so ports, paths and query strings are handled correctly
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return;
        string host = uri.Host;
        this.Literal1.Text += "<br>" + host;
        try
        {
            IPHostEntry ipHostEntry = Dns.GetHostEntry(host);
            foreach (IPAddress item in ipHostEntry.AddressList)
            {
                this.Literal1.Text += "<br>IP:" + item;
            }
        }
        catch (Exception ex)
        {
            // Report the failure (DNS error, insufficient trust level) instead of swallowing it
            this.Literal1.Text += "<br>Lookup failed: " + ex.GetType().Name;
        }
    }
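When a host resolves to several addresses, the list can mix IPv4 and IPv6 entries. The sketch below labels each one by address family. It is an illustration rather than part of the original source: `AddressReport.Describe` is a hypothetical helper, and skipping families other than IPv4 and IPv6 is an assumption.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

public static class AddressReport
{
    // Labels each address as IPv4 or IPv6; any other address family is skipped.
    public static List<string> Describe(IEnumerable<IPAddress> addresses)
    {
        List<string> lines = new List<string>();
        foreach (IPAddress address in addresses)
        {
            if (address.AddressFamily == AddressFamily.InterNetwork)
                lines.Add("IPv4: " + address);
            else if (address.AddressFamily == AddressFamily.InterNetworkV6)
                lines.Add("IPv6: " + address);
        }
        return lines;
    }
}
```

It could be fed from a lookup, for example `AddressReport.Describe(Dns.GetHostAddresses(host))`, with each returned line appended to Literal1 the same way as above.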

Building a Web Spider/Crawler Scraper in C#, Part 1

Building a Web Spider/Crawler Scraper in C#, Part 2

Source code download
