Many sites defend against scraping by checking the request's Referer, cookies, and User-Agent. But for every such measure there is a countermeasure.
Recently a crawler application I maintain stopped returning data. The logs showed that the target site had added an anti-scraping policy, and after some digging it turned out the site was validating the Referer header. The fix is as follows:
In Java, you can fetch a site's HTML through HttpURLConnection. By setting the Referer header on the connection, you can forge the referer and easily bypass this kind of anti-scraping check:
HttpURLConnection connection = null;
URL url = new URL(urlStr);
if (useProxy) {
    // Route the request through a proxy picked from our proxy pool
    Proxy proxy = ProxyServerUtil.getProxy();
    connection = (HttpURLConnection) url.openConnection(proxy);
} else {
    connection = (HttpURLConnection) url.openConnection();
}
connection.setRequestMethod("POST");
// Forge the Referer so the target site believes the request came from its own pages
connection.setRequestProperty("Referer", "http://xxxx.xxx.com");
// Rotate the User-Agent as well, since many sites also check it
connection.setRequestProperty("User-Agent", ProxyServerUtil.getUserAgent());
connection.setConnectTimeout(10000);
connection.setReadTimeout(10000);
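The snippet above stops before the connection is actually used. Below is a minimal self-contained sketch of the same idea that also reads the response body; the class name, helper methods, URL, and Referer value here are placeholders for illustration, not part of the original code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RefererSpoofDemo {

    // Opens a connection with a forged Referer and User-Agent.
    // Request properties can be set (and inspected) before connect() is called.
    static HttpURLConnection openWithReferer(String urlStr, String referer) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Referer", referer);          // forged referer
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // placeholder UA
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        return conn;
    }

    // Reads the full response body as a UTF-8 string (triggers the actual request).
    static String fetchHtml(HttpURLConnection conn) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                openWithReferer("http://example.com/", "http://example.com/page");
        // No network traffic has happened yet; the forged header is already in place.
        System.out.println(conn.getRequestProperty("Referer"));
    }
}
```

Note that nothing is sent until `getInputStream()` (or `connect()`) is called, so headers can be freely adjusted beforehand; the header lookup is case-insensitive on the wire, but "Referer" (with the historical misspelling) is the standard header name.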