HttpClient (2) -- Simulating a Browser to Fetch Web Pages


I. Setting the User-Agent request header to simulate a browser

   1. When the code from part one is used to request Tuicool (www.tuicool.com), the site returns the following:

Page content: <!DOCTYPE html>
<html>
    <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
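        <p>The system has detected that this request does not come from a real person. Because of limited system resources, we can only refuse your request. If you have questions, you can reach us on Weibo (http://weibo.com/tuicool2012/).</p>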
    </body>
</html>

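  This happens because the site restricts crawling. One way around it is to set the User-Agent request header so that the request looks like it comes from a browser. The code is as follows: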

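// Imports needed by the snippets in this article (HttpClient 4.3+)
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

    /**
     * Fetch a web page with a GET request.
     * @param args unused
     * @throws IOException if the request fails or the response cannot be read
     */
    public static void main(String[] args) throws IOException {
        // Create the GET request
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Send a browser User-Agent so the site does not reject the request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // try-with-resources closes the response and the client, even on an exception
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();   // response body
            if (entity != null) {
                String result = EntityUtils.toString(entity, "UTF-8");
                System.out.println("Page content: " + result);
            }
        }
    }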

   Setting this header on the HttpGet request is enough to make the request look like it comes from a browser.
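For comparison, the same idea can be sketched with the JDK 11+ built-in `java.net.http` client instead of Apache HttpClient. This is a swapped-in API, used here only because the request can be built and inspected without sending anything over the network. The class name `UserAgentSketch` and the helper `browserLikeGet` are hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: builds a GET request that carries a browser User-Agent.
// Uses the JDK's java.net.http API, not Apache HttpClient.
class UserAgentSketch {

    // Same Firefox User-Agent string as in the article's example
    static final String FIREFOX_UA =
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0";

    // Builds the request only; nothing is sent until a client executes it
    static HttpRequest browserLikeGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", FIREFOX_UA)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = browserLikeGet("http://www.tuicool.com");
        // Print the header that the server would see
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

Whichever client is used, the point is the same: the server only sees the header value, so it cannot tell this request apart from one sent by the named browser on that basis alone.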

II. Getting the response Content-Type

   Use entity.getContentType().getValue() to read the Content-Type. The code is as follows:

public static void main(String[] args) throws IOException {
        // Create the GET request with a browser User-Agent
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // try-with-resources closes the response and the client, even on an exception
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();   // response body
            // getContentType() returns null when the response has no Content-Type header
            if (entity != null && entity.getContentType() != null) {
                System.out.println("Content-Type: " + entity.getContentType().getValue());
            }
        }
    }
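A Content-Type value usually looks like `text/html; charset=utf-8`, so it can tell you which charset to decode the body with instead of hard-coding "UTF-8". The following is a minimal sketch using only the JDK. `ContentTypeSketch` and `charsetOf` are hypothetical names, not part of HttpClient, and the parsing is deliberately simple (it does not handle every quoting rule).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient: reads the charset parameter
// from a Content-Type header value.
class ContentTypeSketch {

    // Returns the charset named in the value, or the fallback if none is usable
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type and are separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            // Case-insensitive match on the "charset=" prefix (8 characters)
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The resulting Charset could then be passed to EntityUtils.toString(entity, charset) when reading the body. HttpClient 4.x also provides org.apache.http.entity.ContentType for this kind of parsing, which is probably the better choice in real code.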

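III. Getting the response status

  200 -- OK

  403 -- Forbidden (request refused)

  500 -- Internal server error

  404 -- Page not found

  Use response.getStatusLine().getStatusCode() to read the response status code. The code is as follows: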

public static void main(String[] args) throws IOException {
        // Create the GET request with a browser User-Agent
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // try-with-resources closes the response and the client, even on an exception
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Numeric status code from the response status line, e.g. 200 or 403
            int state = response.getStatusLine().getStatusCode();
            System.out.println("Response status: " + state);
        }
    }
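A crawler usually wants to branch on the status code rather than just print it. HTTP groups status codes by their first digit (2xx success, 3xx redirection, 4xx client error, 5xx server error), so a rough classification can be sketched as below. `StatusSketch`, `describe`, and `isSuccess` are hypothetical names for illustration and use only the JDK.

```java
// Hypothetical helper, not part of HttpClient: classifies an HTTP status code
// by its class (the first digit of the code).
class StatusSketch {

    static String describe(int code) {
        if (code >= 100 && code < 200) return "informational";
        if (code >= 200 && code < 300) return "success";
        if (code >= 300 && code < 400) return "redirection";
        if (code >= 400 && code < 500) return "client error";
        if (code >= 500 && code < 600) return "server error";
        return "unknown";
    }

    // True only for 2xx codes, i.e. when the body is worth parsing
    static boolean isSuccess(int code) {
        return code >= 200 && code < 300;
    }

    public static void main(String[] args) {
        int[] samples = {200, 403, 404, 500};
        for (int code : samples) {
            System.out.println(code + " -- " + describe(code));
        }
    }
}
```

In the example above, the value returned by response.getStatusLine().getStatusCode() could be passed to isSuccess before reading the entity, so that a 403 rejection page is not mistaken for real content.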

IV. HttpClient learning resources

  Open-source blog system - HttpClient

 

