[ Crawler ] Techniques for Keeping a Crawler from Being Blocked

Reposted article. 2013-08-08 16:51

Technique 1: Simulate real requests (hit the wall with a random User-Agent, a random proxy, and random time intervals)

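Prepare a User-Agent array and a proxy array, pair their entries at random, and send the requests. Typically a ScrapManager wraps a UserAgentManager and a ProxyManager. Note that when looping over the targets, the crawler needs to sleep for a while between requests.

// Consts.RandInt() is a helper that is not shown in this article; presumably it returns a random number of seconds.
Thread.Sleep(Consts.RandInt() * 1000);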
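Since Consts.RandInt() is not shown, here is a minimal sketch of what a random-interval helper could look like. The RandomDelay class, its method names, and the bounds are hypothetical stand-ins, not the author's code:

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the article's Consts.RandInt(), which is not shown.
public static class RandomDelay
{
    private static readonly Random Rng = new Random();
    private static readonly object Gate = new object();

    // Return a random whole number of seconds in [minSeconds, maxSeconds].
    public static int NextSeconds(int minSeconds, int maxSeconds)
    {
        if (minSeconds < 0 || maxSeconds < minSeconds)
        {
            throw new ArgumentOutOfRangeException("maxSeconds", "Expected 0 <= minSeconds <= maxSeconds.");
        }
        lock (Gate) // System.Random is not thread-safe
        {
            // The upper bound of Random.Next is exclusive, hence the + 1.
            return Rng.Next(minSeconds, maxSeconds + 1);
        }
    }

    // Block the current thread for a random interval.
    public static void Sleep(int minSeconds, int maxSeconds)
    {
        Thread.Sleep(NextSeconds(minSeconds, maxSeconds) * 1000);
    }
}
```

A call such as RandomDelay.Sleep(2, 10) between two requests would then pause for 2 to 10 seconds; suitable bounds depend on the target site.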

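public class ScrapManager
{
    // Load the proxy and User-Agent arrays once at startup.
    public static void Load()
    {
        ProxyManager.Load();
        UserAgentManager.Load();
    }

    // Switch to a new proxy and User-Agent before the next request.
    public static void Next()
    {
        ProxyManager.Next();
        UserAgentManager.Next();
    }
}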

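public class ProxyManager
{
    // Placeholder: replace with a real proxy address.
    public static string Proxy = " your proxy ";

    public static void Load()
    {
        // Load the proxy array here.
    }

    public static void Next()
    {
        // Set Proxy to a random entry of the proxy array here.
    }
}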

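public class UserAgentManager
{
    public static string UserAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)";

    public static void Load()
    {
        // Load the User-Agent array here.
    }

    public static void Next()
    {
        // Set UserAgent to a random entry of the User-Agent array here.
    }
}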
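The Load and Next bodies above are empty. One possible way to fill them in is sketched below, assuming the candidates are kept in a plain in-memory list. The RandomPool class and its members are hypothetical and not part of the original article:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: holds a list of candidates and hands out a random one.
public class RandomPool
{
    private readonly List<string> _items = new List<string>();
    private readonly Random _random;

    // An optional seed makes the sequence repeatable, which helps when testing.
    public RandomPool(int? seed = null)
    {
        _random = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    // The entry chosen by the most recent call to Next (null before the first call).
    public string Current { get; private set; }

    public int Count { get { return _items.Count; } }

    // Replace the pool contents, skipping blank entries.
    public void Load(IEnumerable<string> items)
    {
        _items.Clear();
        foreach (string item in items)
        {
            if (!string.IsNullOrWhiteSpace(item))
            {
                _items.Add(item.Trim());
            }
        }
        Current = null;
    }

    // Pick a random entry and remember it as Current.
    public string Next()
    {
        if (_items.Count == 0)
        {
            throw new InvalidOperationException("The pool is empty; call Load first.");
        }
        Current = _items[_random.Next(_items.Count)];
        return Current;
    }
}
```

With such a pool, ProxyManager.Load() could pass the proxy list to pool.Load(...), and ProxyManager.Next() could assign Proxy = pool.Next(); the same would apply to UserAgentManager. System.Random is not thread-safe, so a multi-threaded crawler would need a lock around Next.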

// Requires: using System.IO; using System.Net;
string HtmlContent = string.Empty;

// Request
HttpWebRequest m_HttpWebRequest = (HttpWebRequest)WebRequest.Create("your link");

// Proxy (the second argument bypasses the proxy for local addresses)
m_HttpWebRequest.Proxy = new WebProxy(ProxyManager.Proxy, true);

// UserAgent
m_HttpWebRequest.UserAgent = UserAgentManager.UserAgent;
m_HttpWebRequest.Method = "GET";
m_HttpWebRequest.Timeout = -1; // -1 is Timeout.Infinite: wait for the response indefinitely

// Response (the using statements dispose the response and the reader)
using (HttpWebResponse m_HttpWebResponse = (HttpWebResponse)m_HttpWebRequest.GetResponse())
using (StreamReader reader = new StreamReader(m_HttpWebResponse.GetResponseStream()))
{
    HtmlContent = reader.ReadToEnd();
}

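Summary: As long as the requests stay random, the crawler generally will not be blocked completely. The limiting factor is the number of proxies on hand, and a lot of them are needed: the author has 14 proxies and still finds it a bit of a strain.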

Technique 2: Embed the original page in an iframe and scrape it on the front end (for HTML pages that load their content via script)

Reference: http://www.cnblogs.com/VincentDao/archive/2013/02/05/2892466.html

Summary: Fairly simple to implement; suited to scraping sites that load their data with scripts.

Technique 3: Forge cookies (for the blocking measures used by certain portals)

// Cookie (attach the container before calling GetResponse())
CookieContainer m_CookieContainer = new CookieContainer();
m_HttpWebRequest.CookieContainer = m_CookieContainer;
m_HttpWebRequest.CookieContainer.Add(new Cookie() { Name = "key", Value = "value", Domain = "www.example.com" });

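Summary: The IE Developer Tools, Firefox, or Chrome can be used to monitor the cookies in requests and responses. This generally solves page scraping for online stores and social networks.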
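Cookies observed in the browser tools usually appear as a single Cookie request header of the form "name=value; name2=value2". A minimal sketch of turning such a string into a CookieContainer follows. The CookieHeaderParser class is hypothetical, and it assumes every cookie belongs to one domain with path "/":

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original article.
public static class CookieHeaderParser
{
    // Turn a raw "name=value; name2=value2" Cookie header into a CookieContainer.
    public static CookieContainer Parse(string cookieHeader, string domain)
    {
        CookieContainer container = new CookieContainer();
        if (string.IsNullOrWhiteSpace(cookieHeader))
        {
            return container;
        }

        foreach (string part in cookieHeader.Split(';'))
        {
            int eq = part.IndexOf('=');
            if (eq <= 0)
            {
                continue; // skip empty or malformed pairs
            }

            string name = part.Substring(0, eq).Trim();
            string value = part.Substring(eq + 1).Trim();
            if (name.Length == 0)
            {
                continue; // a cookie must have a name
            }

            // Assumption: all cookies share the given domain and the root path.
            container.Add(new Cookie(name, value, "/", domain));
        }
        return container;
    }
}
```

The result can be assigned to m_HttpWebRequest.CookieContainer in place of the hand-built container. Values that contain commas may be rejected by System.Net.Cookie and would need quoting or encoding first.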

Technique 4: Use Selenium to drive a browser and scrape pages

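Summary: This has the lowest probability of being blocked and handles the shortcomings and problems exposed above well. It demands a higher skill level from the developer.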
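The article gives no code for this technique. The following is a minimal sketch of what it could look like with the Selenium WebDriver bindings for .NET, assuming the Selenium.WebDriver package, Chrome, and a matching ChromeDriver are installed. The BrowserFetcher class and the choice of headless Chrome are assumptions made for illustration:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical helper, not part of the original article.
public static class BrowserFetcher
{
    // Load a page in a real Chrome instance and return its HTML.
    public static string Fetch(string url)
    {
        ChromeOptions options = new ChromeOptions();
        options.AddArgument("--headless"); // assumption: no visible window is needed

        IWebDriver driver = new ChromeDriver(options);
        try
        {
            driver.Navigate().GoToUrl(url);
            // PageSource reflects the page as the browser holds it at this moment.
            return driver.PageSource;
        }
        finally
        {
            driver.Quit(); // always shut the browser process down
        }
    }
}
```

Pages that keep loading data after the initial load may need an explicit wait before PageSource is read. Starting a browser for every page is slow, so a real crawler would likely reuse one driver across many requests.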
