HttpClient (1): Fetching Basic Web Page Information with HttpClient

Reposted article. 2018-10-16 23:15. Tag: HttpClient

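1. Introduction to HttpClient

HttpClient began as a subproject of Apache Jakarta Commons and is now maintained under Apache HttpComponents. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it tracks the latest HTTP versions and recommendations.

Official site: http://hc.apache.org/

Latest version at the time of writing, 4.5: http://hc.apache.org/httpcomponents-client-4.5.x/

Official tutorial: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html

Maven dependency: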

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

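HTTP is probably the most widely used and most important protocol on the Internet today, and more and more Java applications need to access network resources directly over HTTP. The JDK's java.net package already provides basic HTTP access, but for most applications it is not rich or flexible enough, and that is the gap HttpClient fills. HttpClient is used in many projects; for example, the well-known open-source projects Cactus and HtmlUnit both use it. At the time of writing, the latest release is HttpClient 4.5 (GA), dated 2015-09-11.

Summary: for crawling, we mainly use HttpClient to imitate a browser requesting a third-party URL, receive the response and obtain the page data, and then use Jsoup to extract the information we need.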

2. Fetching Page Content with HttpClient

Here we fetch the source of the cnblogs home page.

package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Fetches a web page with a GET request and prints its source.
     *
     * @param args unused
     * @throws IOException if the request fails
     */
    public static void main(String[] args) throws IOException {
        // Create the GET request
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity(); // response body; may be null
            if (entity != null) {
                // "UTF-8" is the fallback used when the response declares no charset
                String result = EntityUtils.toString(entity, "UTF-8");
                System.out.println("Page content: " + result);
            }
        }
    }
}

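The code above fetches the page content directly. Some pages come back with garbled Chinese text; in that case, decode with the page's own encoding, for example gb2312, instead of UTF-8.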
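One way to pick the decoding charset is to read the charset parameter from the response's Content-Type header and fall back to a default when it is missing. The sketch below uses only the JDK. CharsetSniffer and charsetOf are hypothetical names chosen for illustration, and the sketch does not inspect a page's <meta> tag, which some sites use instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of HttpClient): picks a charset from a Content-Type header value.
final class CharsetSniffer {

    /** Returns the charset named in the header, or the fallback if none is declared or it is unsupported. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';', e.g. "text/html; charset=gb2312"
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = part.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

EntityUtils.toString already prefers the charset declared by the response, so a helper like this is mainly useful when you read the raw bytes yourself or want to log which encoding a site reports.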

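3. Imitating a Browser When Fetching Pages

3.1 Setting the User-Agent request header to imitate a browser

When we use the code above to fetch the source of Tuicool (http://www.tuicool.com), the site returns the following (message text translated from Chinese):

Page content: <!DOCTYPE html>
<html>
    <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
        <p>The system has detected that this is not human behaviour. Because of limited system resources, we can only refuse your request. If you have questions, you can contact us on Weibo at http://weibo.com/tuicool2012/.</p>
    </body>
</html>

This happens because the site restricts crawlers. One fix is to set the User-Agent request header so that the request looks like it comes from a browser. The code is as follows: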

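import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Fetches a web page with a GET request, sending a browser-like User-Agent.
     *
     * @param args unused
     * @throws IOException if the request fails
     */
    public static void main(String[] args) throws IOException {
        // Create the GET request
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Imitate a browser by sending a browser User-Agent header
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity(); // response body; may be null
            if (entity != null) {
                String result = EntityUtils.toString(entity, "UTF-8");
                System.out.println("Page content: " + result);
            }
        }
    }
}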

Setting this header on the HttpGet request is enough to imitate browser access.
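For comparison, the same header can be set with the JDK's own java.net classes mentioned in the introduction. The sketch below only prepares a request and reads the header back; it sends nothing over the network. UserAgentDemo and prepare are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical demo class: sets a browser-like User-Agent on a JDK HttpURLConnection.
final class UserAgentDemo {

    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/62.0.3202.94 Safari/537.36";

    /** Creates a connection object with the User-Agent header set; no request is sent here. */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", BROWSER_UA);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://www.tuicool.com");
        // Read the header back from the prepared (unsent) request
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The HttpClient version is shorter and handles connection reuse for you; the JDK version is shown only to make clear that User-Agent is an ordinary request header, not something specific to HttpClient.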

3.2 Getting the response Content-Type

Use entity.getContentType().getValue() to get the Content-Type. The code is as follows:

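import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Sends a GET request and prints the Content-Type of the response.
     *
     * @param args unused
     * @throws IOException if the request fails
     */
    public static void main(String[] args) throws IOException {
        // Create the GET request with a browser User-Agent
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity(); // response body; may be null
            // The Content-Type header may be absent, so check before reading its value
            Header contentType = (entity != null) ? entity.getContentType() : null;
            if (contentType != null) {
                System.out.println("Content-Type: " + contentType.getValue());
            }
        }
    }
}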

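Results:

Most web pages are text/html, sometimes with a charset. For example, requesting www.tuicool.com prints:

    Content-Type: text/html; charset=utf-8

Requesting a JS file, for example http://www.open1111.com/static/js/jQuery.js, prints:

    Content-Type: application/javascript

Requesting a file, for example http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar, prints:

    Content-Type: application/java-archive

There are many other Content-Type values. What use is this to a crawler? While crawling, we can use Content-Type to select the pages we want to fetch, or to filter out responses we do not need.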
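The filtering idea can be sketched as a small predicate that a crawler applies to the Content-Type value before parsing a response. ContentTypeFilter, mediaType and isPage are hypothetical names, and treating only text/html and application/xhtml+xml as pages is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides from a Content-Type value whether a response is a web page.
final class ContentTypeFilter {

    // Assumption: only these media types are treated as crawlable pages
    private static final Set<String> PAGE_TYPES =
            new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml"));

    /** Strips parameters such as "; charset=utf-8" and lower-cases the media type. */
    static String mediaType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the response should be parsed as a page, false if it should be skipped. */
    static boolean isPage(String contentType) {
        return PAGE_TYPES.contains(mediaType(contentType));
    }

    public static void main(String[] args) {
        // The three values shown in the results above
        String[] samples = {
            "text/html; charset=utf-8",
            "application/javascript",
            "application/java-archive"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + (isPage(sample) ? "parse" : "skip"));
        }
    }
}
```

In the HttpClient code, the argument would be the string returned by entity.getContentType().getValue().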

3.3 Getting the response status

Use response.getStatusLine().getStatusCode() to get the response status. The code is as follows:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Sends a GET request and prints the HTTP status code of the response.
     *
     * @param args unused
     * @throws IOException if the request fails
     */
    public static void main(String[] args) throws IOException {
        // Create the GET request with a browser User-Agent
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            int statusCode = response.getStatusLine().getStatusCode();
            System.out.println("Status code: " + statusCode);
        }
    }
}

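Results:

When HttpClient sends a request to a server, a successful request returns status code 200, but not every request succeeds. For example, a request for an address that does not exist returns 404, and an internal server error returns 500. Some servers also have anti-scraping protection: if you collect data too frequently, they return 403 and refuse your request.

There are ways around this; using proxy IPs is covered later.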
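How a crawler might react to each of these status codes can be sketched as a small classifier. StatusPolicy, Action and actionFor are hypothetical names, and the policy itself (back off on 403 or 429, retry later on 5xx, skip everything else that is not 2xx) is an assumption for illustration rather than a rule taken from HttpClient.

```java
// Hypothetical policy: maps an HTTP status code to what a crawler does next.
final class StatusPolicy {

    enum Action { PROCESS, SKIP, RETRY_LATER, BACK_OFF }

    static Action actionFor(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Action.PROCESS;      // e.g. 200: parse the page
        }
        if (statusCode == 403 || statusCode == 429) {
            return Action.BACK_OFF;     // likely anti-scraping: slow down or switch proxy
        }
        if (statusCode >= 500 && statusCode < 600) {
            return Action.RETRY_LATER;  // e.g. 500: server-side error, may be temporary
        }
        return Action.SKIP;             // e.g. 404: the address does not exist
    }

    public static void main(String[] args) {
        int[] codes = {200, 404, 500, 403};
        for (int code : codes) {
            System.out.println(code + " -> " + actionFor(code));
        }
    }
}
```

In the HttpClient code, the argument would be the value returned by response.getStatusLine().getStatusCode().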

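4. Downloading Images

To download an image with HttpClient, first get the input stream with entity.getContent(), then use the file-copy method from Commons IO to save the image to a local file. The dependency and the code are as follows:

4.1 Add the dependency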

    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>

4.2 Core code

package com.jxlg.study.httpclient;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

public class GetPictureByUrl {
    public static void main(String[] args) throws IOException {
        // Image URL
        String url = "https://wx2.sinaimg.cn/mw690/006RYJvjly1fmfk7c049vj30zk0qogq6.jpg";
        // File extension including the dot, e.g. ".jpg"
        String extension = url.substring(url.lastIndexOf('.'));
        // Create the GET request with a browser User-Agent
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity(); // response body; may be null
            if (entity != null) {
                if (entity.getContentType() != null) {
                    System.out.println("Content-Type: " + entity.getContentType().getValue());
                }
                try (InputStream inputStream = entity.getContent()) {
                    // Copy the stream to a local file. The original listing shows "D:love";
                    // the path separator is assumed to have been lost, so "D:/love" is used here.
                    FileUtils.copyToFile(inputStream, new File("D:/love" + extension));
                }
            }
        }
    }
}
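Taking everything after the last dot works for the image URL above, but it can misfire when a URL carries a query string or has no file extension at all. A more defensive sketch looks only at the last path segment, using java.net.URI from the JDK. UrlExtension and extensionOf are hypothetical names, the example.com URL in the demo is made up, and returning an empty string when there is no extension is an assumption.

```java
import java.net.URI;

// Hypothetical helper: extracts a file extension from the path part of a URL.
final class UrlExtension {

    /** Returns the extension including the dot (e.g. ".jpg"), or "" if the last path segment has none. */
    static String extensionOf(String url) {
        String path = URI.create(url).getPath(); // path only: no host, query or fragment
        if (path == null) {
            return "";
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        // Require at least one character before and after the dot
        if (dot > 0 && dot < lastSegment.length() - 1) {
            return lastSegment.substring(dot);
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://wx2.sinaimg.cn/mw690/006RYJvjly1fmfk7c049vj30zk0qogq6.jpg"));
        // Made-up URL with a query string that also contains a dot
        System.out.println(extensionOf("http://example.com/img/photo.png?size=large.v2"));
        // No path at all: the dots in the host name are ignored
        System.out.println(extensionOf("http://www.cnblogs.com"));
    }
}
```

When the URL gives no usable extension, the Content-Type printed by the download code (for example image/jpeg) is another source for choosing one.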
