https://www.cnblogs.com/yxnyd/p/9801396.html
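HttpClient (2): Using IP Proxies and Handling Connection Timeouts with HttpClient
Preface
The previous post only skimmed the surface. HttpClient has many more powerful features:
1. Using a proxy IP with HttpClient
1.1 Introduction
When crawling web pages, some target sites have anti-crawler mechanisms and will take measures to block the IP of a client that visits too frequently or in too regular a pattern.
This is where proxy IPs come in: when one IP gets blocked, switch to another.
Proxy IPs come in several kinds: transparent, anonymous, distorting, and high-anonymity (elite) proxies. High-anonymity proxies are the usual choice.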
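1.2 Types of proxy IP
1) Transparent proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
A transparent proxy nominally "hides" your IP address, but the server can still find out who you are from HTTP_X_FORWARDED_FOR.
2) Anonymous proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy is a step up from a transparent one: the server can tell that you are using a proxy, but not who you are.
One step beyond a plain anonymous proxy is the distorting proxy.
3) Distorting proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
As with an anonymous proxy, the server can still tell that a proxy is in use, but it sees a fake IP address, which makes the disguise more convincing.
4) High-anonymity proxy (elite proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
As the headers show, a high-anonymity proxy gives the server no header that reveals a proxy is in use, which makes it the best choice.
Crawlers generally use high-anonymity proxy IPs.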
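The four header patterns above can be turned into a small check. The following is a minimal sketch in plain Java, assuming you control an endpoint that reports the three values it received and that you know your own real IP. The class `ProxyClassifier`, its enum, and the sample addresses are hypothetical names for illustration, not part of HttpClient.

```java
import java.util.Objects;

final class ProxyClassifier {
    enum ProxyType { TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

    /**
     * Classifies a proxy from the values the target server saw.
     *
     * @param remoteAddr   REMOTE_ADDR as seen by the server (the proxy IP)
     * @param via          HTTP_VIA, or null when the header is absent
     * @param forwardedFor HTTP_X_FORWARDED_FOR, or null when the header is absent
     * @param realClientIp the client's true IP (known because we run the test)
     */
    static ProxyType classify(String remoteAddr, String via,
                              String forwardedFor, String realClientIp) {
        if (via == null && forwardedFor == null) {
            return ProxyType.ELITE;        // no proxy headers at all
        }
        if (Objects.equals(forwardedFor, realClientIp)) {
            return ProxyType.TRANSPARENT;  // real IP is leaked
        }
        if (Objects.equals(forwardedFor, remoteAddr)) {
            return ProxyType.ANONYMOUS;    // only the proxy IP is visible
        }
        return ProxyType.DISTORTING;       // some other (fake) address is reported
    }

    public static void main(String[] args) {
        System.out.println(classify("1.2.3.4", "1.2.3.4", "9.9.9.9", "9.9.9.9"));
        System.out.println(classify("1.2.3.4", null, null, "9.9.9.9"));
    }
}
```

Any combination that does not match the first three patterns falls through to DISTORTING here, which is a simplification of this sketch.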
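Where do proxy IPs come from? A quick web search (Baidu, for example) turns up plenty of proxy IP sites. Most list some free proxies, but paying for a commercial API is more convenient.
1.3 Example: using a proxy IP
Use RequestConfig.custom().setProxy(proxy).build() to set the proxy IP.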
package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class UseProxy {
    public static void main(String[] args) throws IOException {
        // Create the HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");

        // Set the proxy IP, the connect timeout, the socket (read) timeout,
        // and the timeout for obtaining a connection from the connection manager
        HttpHost proxy = new HttpHost("58.60.255.82", 8118);
        RequestConfig requestConfig = RequestConfig.custom()
                .setProxy(proxy)
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .setConnectionRequestTimeout(3000)
                .build();
        httpGet.setConfig(requestConfig);

        // Set the request header
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");

        // try-with-resources closes the response and the client even when the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity(); // the response entity
            if (entity != null) {
                System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
            }
        }
    }
}
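1.4 Obtaining proxy IPs in real development
We can use HttpClient to crawl the 20 most recent high-anonymity proxy IPs from http://www.xicidaili.com/ and store them in a linked list. When an IP gets blocked and the connection times out,
we take the next IP from the list, and so on. When fewer than 5 IPs remain in the list, we crawl a fresh batch of proxy IPs and add them to the list.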
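The rotation scheme described above can be sketched as follows. This is a minimal illustration in plain Java, not the author's code: the class `ProxyPool` and the `fetcher` supplier are hypothetical, and the supplier stands in for whatever code crawls the proxy list site.

```java
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

final class ProxyPool {
    // Refill when fewer than this many proxies remain (the "5" from the text)
    private static final int REFILL_THRESHOLD = 5;

    private final Deque<String> proxies = new LinkedList<>();
    private final Supplier<List<String>> fetcher; // hypothetical crawler returning "host:port" entries

    ProxyPool(Supplier<List<String>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the next proxy, fetching a fresh batch first when the list runs low. */
    synchronized String next() {
        if (proxies.size() < REFILL_THRESHOLD) {
            proxies.addAll(fetcher.get());
        }
        String proxy = proxies.pollFirst();
        if (proxy == null) {
            throw new IllegalStateException("no proxies available");
        }
        return proxy;
    }

    synchronized int size() {
        return proxies.size();
    }
}
```

A caller would ask for `next()` whenever a request through the current proxy times out, then build a new HttpHost from the returned host and port.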
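1.5 HttpClient connect timeout and read timeout
When HttpClient executes an HTTP request, time is spent first on connecting and then on reading the content.
1) HttpClient connect time
The connect time runs from the moment HttpClient sends the request until the connection to the target URL's host is established. In theory, the shorter the distance and the clearer the route, the faster it is,
but because routing is complex the connect time varies, and with bad luck the connection fails altogether. In my tests the default connect timeout was
about 1 minute, after which HttpClient tried to connect again a little later. (RequestConfig leaves the connect timeout at -1 by default, meaning the system default applies, so the observed value depends on the environment.) The problem is that a URL that never connects keeps a thread occupied and holds up other threads.
Put bluntly, it hogs the resource without doing any work. So we should set the timeout explicitly, for example to 10 seconds: if no connection is established within 10 seconds an error is raised, and we can handle it in business logic,
for example by retrying later and writing the problematic URL to the log4j log so that an administrator can review it.
2) HttpClient read time
The read time starts once HttpClient has connected to the target server and begins fetching the content. Reading is usually fast,
but a large response or a problem on the target server (slow database reads, high concurrency, and so on) also affects the read time.
As above, we should set this explicitly, for example to 10 seconds: if reading has not finished within 10 seconds an error is raised, and we can again handle it in business logic.
Take http://central.maven.org/maven2/ as an example. It is an overseas address, so connecting takes relatively long, and there is a lot of content to read, so connect timeouts and read timeouts occur easily.
How do we implement this in code?
HttpClient provides the RequestConfig class for configuring parameters such as the connect timeout, the read timeout, and the proxy IP covered earlier.
Example: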
package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class TimeSetting {
    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");

        // 5-second connect timeout and 5-second read timeout
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .build();
        httpGet.setConfig(config);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");

        // try-with-resources closes the response and the client even when a timeout is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) { // EntityUtils.toString rejects a null entity
                System.out.println("Page content: " + EntityUtils.toString(entity, "UTF-8"));
            }
        }
    }
}
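The read timeout can also be observed without any external site. The sketch below uses only plain java.net sockets rather than HttpClient: it connects to a local server socket that never sends data, so the read fails once the timeout expires. The class `ReadTimeoutDemo` and its method are hypothetical names; the socket read timeout set here plays the same role as setSocketTimeout in RequestConfig.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutDemo {
    /** Connects to a silent local server and reports whether the read timed out. */
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            // Connect phase: bounded by connectTimeoutMs
            client.connect(
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort()),
                    connectTimeoutMs);
            // Read phase: bounded by readTimeoutMs
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read(); // the server never writes anything
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(1000, 200));
    }
}
```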
How to set a proxy server (HTTP proxy) for HttpClient requests
https://www.iteye.com/blog/rd-030-2357128
Two important HttpClient parameters: maxPerRoute and MaxTotal
https://blog.csdn.net/u013905744/article/details/94714696
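Sending a large number of requests with asynchronous HttpClient
Our project uses HttpClient to send a large number of HTTP requests asynchronously, so I am recording the approach here.
Approach: an HttpClient connection pool plus multiple threads.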
import java.nio.charset.CodingErrorAction;
import javax.net.ssl.SSLContext;

import org.apache.http.Consts;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthSchemeProvider;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.config.AuthSchemes;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.config.ConnectionConfig;
import org.apache.http.config.Lookup;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.impl.auth.BasicSchemeFactory;
import org.apache.http.impl.auth.DigestSchemeFactory;
import org.apache.http.impl.auth.KerberosSchemeFactory;
import org.apache.http.impl.auth.NTLMSchemeFactory;
import org.apache.http.impl.auth.SPNegoSchemeFactory;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.apache.http.nio.conn.NoopIOSessionStrategy;
import org.apache.http.nio.conn.SchemeIOSessionStrategy;
import org.apache.http.nio.conn.ssl.SSLIOSessionStrategy;
import org.apache.http.nio.reactor.ConnectingIOReactor;
import org.apache.http.nio.reactor.IOReactorException;
import org.apache.http.ssl.SSLContexts; // HttpCore 4.4+

public class HttpAsyncClient {
    private static int socketTimeout = 500;             // timeout waiting for data: 0.5 s, tune to the workload
    private static int connectTimeout = 2000;           // connect timeout
    private static int poolSize = 100;                  // maximum number of connections in the pool
    private static int maxPerRoute = 100;               // maximum concurrent connections per host (route)
    private static int connectionRequestTimeout = 3000; // timeout for obtaining a connection from the pool

    // HTTP proxy parameters
    private String host = "58.60.255.82";
    private int port = 8118;
    private String username = "";
    private String password = "";

    // Asynchronous client without a proxy
    private CloseableHttpAsyncClient asyncHttpClient;
    // Asynchronous client that goes through the proxy
    private CloseableHttpAsyncClient proxyAsyncHttpClient;

    public HttpAsyncClient() {
        try {
            this.asyncHttpClient = createAsyncClient(false);
            this.proxyAsyncHttpClient = createAsyncClient(true);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public CloseableHttpAsyncClient createAsyncClient(boolean proxy) throws IOReactorException {
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectionRequestTimeout(connectionRequestTimeout)
                .setConnectTimeout(connectTimeout)
                .setSocketTimeout(socketTimeout)
                .build();

        SSLContext sslcontext = SSLContexts.createDefault();

        UsernamePasswordCredentials credentials = new UsernamePasswordCredentials(username, password);
        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, credentials);

        // Register the I/O session strategies that handle http and https
        Registry<SchemeIOSessionStrategy> sessionStrategyRegistry = RegistryBuilder
                .<SchemeIOSessionStrategy>create()
                .register("http", NoopIOSessionStrategy.INSTANCE)
                .register("https", new SSLIOSessionStrategy(sslcontext))
                .build();

        // Configure the I/O threads
        IOReactorConfig ioReactorConfig = IOReactorConfig.custom()
                .setSoKeepAlive(false)
                .setTcpNoDelay(true)
                .setIoThreadCount(Runtime.getRuntime().availableProcessors())
                .build();

        // Set the connection pool size
        ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor(ioReactorConfig);
        PoolingNHttpClientConnectionManager conMgr =
                new PoolingNHttpClientConnectionManager(ioReactor, null, sessionStrategyRegistry, null);
        if (poolSize > 0) {
            conMgr.setMaxTotal(poolSize);
        }
        conMgr.setDefaultMaxPerRoute(maxPerRoute > 0 ? maxPerRoute : 10);

        ConnectionConfig connectionConfig = ConnectionConfig.custom()
                .setMalformedInputAction(CodingErrorAction.IGNORE)
                .setUnmappableInputAction(CodingErrorAction.IGNORE)
                .setCharset(Consts.UTF_8)
                .build();
        conMgr.setDefaultConnectionConfig(connectionConfig);

        Lookup<AuthSchemeProvider> authSchemeRegistry = RegistryBuilder
                .<AuthSchemeProvider>create()
                .register(AuthSchemes.BASIC, new BasicSchemeFactory())
                .register(AuthSchemes.DIGEST, new DigestSchemeFactory())
                .register(AuthSchemes.NTLM, new NTLMSchemeFactory())
                .register(AuthSchemes.SPNEGO, new SPNegoSchemeFactory())
                .register(AuthSchemes.KERBEROS, new KerberosSchemeFactory())
                .build();

        // The original applied requestConfig only to the proxy client, which left the
        // plain client without the timeouts above; here both clients use it.
        HttpAsyncClientBuilder builder = HttpAsyncClients.custom()
                .setConnectionManager(conMgr)
                .setDefaultCredentialsProvider(credentialsProvider)
                .setDefaultAuthSchemeRegistry(authSchemeRegistry)
                .setDefaultCookieStore(new BasicCookieStore())
                .setDefaultRequestConfig(requestConfig);
        if (proxy) {
            builder.setProxy(new HttpHost(host, port));
        }
        return builder.build();
    }

    public CloseableHttpAsyncClient getAsyncHttpClient() {
        return asyncHttpClient;
    }

    public CloseableHttpAsyncClient getProxyAsyncHttpClient() {
        return proxyAsyncHttpClient;
    }
}
public class HttpClientFactory {
    private static HttpAsyncClient httpAsyncClient = new HttpAsyncClient();

    private HttpClientFactory() {
    }

    private static HttpClientFactory httpClientFactory = new HttpClientFactory();

    public static HttpClientFactory getInstance() {
        return httpClientFactory;
    }

    public HttpAsyncClient getHttpAsyncClientPool() {
        return httpAsyncClient;
    }
}
public void sendThredPost(List<FaceBookUserQuitEntity> list, String title, String subTitle, String imgUrl) {
    if (list == null || list.isEmpty()) {
        // The original created this exception without throwing it
        throw new BusinessException("Asia: user query returned no data");
    }
    int number = list.size();
    int num = number / 10; // entries per thread when splitting across 10 threads
    CloseableHttpAsyncClient client =
            HttpClientFactory.getInstance().getHttpAsyncClientPool().getAsyncHttpClient();

    PostThread[] threads;
    if (num > 0) {
        threads = new PostThread[10];
        for (int i = 0; i <= 9; i++) {
            // The last thread also takes the remainder; the original dropped the
            // trailing entries whenever the size was not a multiple of 10
            int end = (i == 9) ? number : (i + 1) * num;
            List<FaceBookUserQuitEntity> threadList = list.subList(i * num, end);
            threads[i] = new PostThread(client, threadList, title, subTitle, imgUrl);
        }
    } else {
        // Fewer than 10 entries: a single thread handles them all
        threads = new PostThread[] { new PostThread(client, list, title, subTitle, imgUrl) };
    }

    for (int k = 0; k < threads.length; k++) {
        threads[k].start();
        logger.info("Asia thread: {} started", k);
    }
    for (int j = 0; j < threads.length; j++) {
        try {
            threads[j].join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
import java.io.IOException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.util.EntityUtils;

// URL, logger and getPostbody(...) belong to the original project and are not shown in this post.
public class PostThread extends Thread {
    private final CloseableHttpAsyncClient httpClient;
    private final List<FaceBookUserQuitEntity> list;
    private final String title;
    private final String subTitle;
    private final String imgUrl;

    public PostThread(CloseableHttpAsyncClient httpClient, List<FaceBookUserQuitEntity> list,
                      String title, String subTitle, String imgUrl) {
        this.httpClient = httpClient;
        this.list = list;
        this.title = title;
        this.subTitle = subTitle;
        this.imgUrl = imgUrl;
    }

    @Override
    public void run() {
        try {
            httpClient.start();
            int size = list.size();
            // Send in batches of at most 100 requests
            for (int k = 0; k < size; k += 100) {
                List<FaceBookUserQuitEntity> subList = list.subList(k, Math.min(k + 100, size));
                final long startTime = System.currentTimeMillis();
                final CountDownLatch latch = new CountDownLatch(subList.size());
                for (FaceBookUserQuitEntity faceBookEntity : subList) {
                    String senderId = faceBookEntity.getSenderId();
                    String player_id = faceBookEntity.getPlayer_id();
                    logger.info("Start sending message: playerid=" + player_id);
                    final String bodyStr = getPostbody(senderId, player_id, title, subTitle, imgUrl, "Play Game", "");
                    if (bodyStr.isEmpty()) {
                        // Nothing is sent for this entry; without this count-down the
                        // latch.await() below would never return
                        latch.countDown();
                        continue;
                    }
                    final HttpPost httpPost = new HttpPost(URL);
                    StringEntity stringEntity = new StringEntity(bodyStr, "utf-8");
                    // The original also set Content-Encoding to "UTF-8"; that header describes
                    // compression, not the charset, so it is omitted here
                    stringEntity.setContentType("application/json");
                    httpPost.setEntity(stringEntity);
                    httpClient.execute(httpPost, new FutureCallback<HttpResponse>() {
                        @Override
                        public void completed(HttpResponse result) {
                            latch.countDown();
                            int statusCode = result.getStatusLine().getStatusCode();
                            if (200 == statusCode) {
                                logger.info("Message request succeeded=" + bodyStr);
                            } else {
                                logger.info("Response status=" + statusCode);
                                logger.info("Message request failed=" + bodyStr);
                            }
                            try {
                                logger.info(EntityUtils.toString(result.getEntity(), "UTF-8"));
                            } catch (IOException e) {
                                e.printStackTrace();
                            }
                        }

                        @Override
                        public void failed(Exception ex) {
                            latch.countDown();
                            logger.info("Message request failed e=" + ex);
                        }

                        @Override
                        public void cancelled() {
                            latch.countDown();
                        }
                    });
                }
                // Wait until every request in this batch has completed, failed or been cancelled
                try {
                    latch.await();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                // Pace the batches: start at most one batch every 10 seconds
                long leftTime = 10000 - (System.currentTimeMillis() - startTime);
                if (leftTime > 0) {
                    try {
                        Thread.sleep(leftTime);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        } catch (UnsupportedCharsetException e) {
            e.printStackTrace();
        }
    }
}
The utility code above can be used as is; the sending logic needs to be adapted to your own requirements.
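The sending logic above divides the list into ten slices by integer division, where it is easy to lose the remainder. The following is a minimal sketch of a split that spreads the remainder over the first slices so that every entry is assigned exactly once. `Partitioner` and `split` are hypothetical names for illustration, not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;

final class Partitioner {
    /**
     * Splits items into at most `parts` contiguous slices whose sizes differ by at most one.
     * The slices are views of the original list (List.subList), not copies.
     */
    static <T> List<List<T>> split(List<T> items, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> slices = new ArrayList<>();
        int base = items.size() / parts;      // minimum slice size
        int remainder = items.size() % parts; // the first `remainder` slices get one extra entry
        int start = 0;
        for (int i = 0; i < parts && start < items.size(); i++) {
            int length = base + (i < remainder ? 1 : 0);
            slices.add(items.subList(start, start + length));
            start += length;
        }
        return slices;
    }
}
```

With 25 entries and 10 parts this yields five slices of 3 followed by five slices of 2; with 3 entries it yields 3 slices of 1, so no thread would be started for an empty slice.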
HttpClient timeout settings in detail
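The three HTTP request timeouts: connectionRequestTimeout, connectTimeout, and socketTimeout
1. connectionRequestTimeout: the timeout for obtaining a connection from the connection pool. When it is exceeded, a ConnectionPoolTimeoutException is thrown.
2. connectTimeout: the timeout for establishing the connection between client and server,
that is, the first of the three phases of an HTTP request (1: establishing the connection; 2: transferring data; 3: closing the connection). When it is exceeded, a ConnectTimeoutException is thrown.
3. socketTimeout: the timeout for the client reading data from the server once the connection is established. When it is exceeded, a SocketTimeoutException is thrown.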
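Of the three, connectionRequestTimeout is the least intuitive, because no network traffic is involved: the caller is only waiting for a free slot in the pool. The following is a minimal plain-Java model of that wait, assuming a pool with a fixed number of slots. `LeaseTimeoutDemo` is a hypothetical class and only imitates the behaviour; it is not how HttpClient implements its pool.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class LeaseTimeoutDemo {
    private final Semaphore slots;

    LeaseTimeoutDemo(int maxTotal) {
        this.slots = new Semaphore(maxTotal);
    }

    /**
     * Waits up to timeoutMs for a free slot.
     * Returning false corresponds to the pool timeout described in item 1.
     */
    boolean lease(long timeoutMs) {
        try {
            return slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a slot to the pool, as closing a response does for a real connection. */
    void release() {
        slots.release();
    }
}
```

With two slots, a third lease fails after its timeout until one of the first two is released, which is why responses must always be closed.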
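HttpClient wraps the low-level Java implementation of HTTP requests and is a widely used component.
HttpClient supports connection pooling, and the two parameters maxPerRoute and MaxTotal configure the pool.
Example 2: the Executor class in Apache's Fluent API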
/**
 * An Executor for fluent requests
 * <p/>
 * A {@link PoolingHttpClientConnectionManager} with maximum 100 connections per route and
 * a total maximum of 200 connections is used internally.
 */
// At most 100 connections per route and at most 200 connections in total
CONNMGR = new PoolingHttpClientConnectionManager(sfr);
CONNMGR.setDefaultMaxPerRoute(100);
CONNMGR.setMaxTotal(200);
CLIENT = HttpClientBuilder.create().setConnectionManager(CONNMGR).build();
Meaning of the maxPerRoute and MaxTotal parameters
What do the two parameters maxPerRoute and MaxTotal mean?
The test code below illustrates them.
Test client
import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;

import org.apache.http.client.HttpClient;
import org.apache.http.client.fluent.Executor;
import org.apache.http.client.fluent.Request;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLInitializationException; // HttpClient 4.4+; in 4.3 it is in org.apache.http.conn.ssl

public class HttpFluentUtil {
    private final static int MaxPerRoute = 2;
    private final static int MaxTotal = 4;

    final static PoolingHttpClientConnectionManager CONNMGR;
    final static HttpClient CLIENT;
    final static Executor EXECUTOR;

    static {
        LayeredConnectionSocketFactory ssl = null;
        try {
            ssl = SSLConnectionSocketFactory.getSystemSocketFactory();
        } catch (final SSLInitializationException ex) {
            // Fall back to a default TLS context
            try {
                final SSLContext sslcontext = SSLContext.getInstance(SSLConnectionSocketFactory.TLS);
                sslcontext.init(null, null, null);
                ssl = new SSLConnectionSocketFactory(sslcontext);
            } catch (final SecurityException | KeyManagementException | NoSuchAlgorithmException ignore) {
            }
        }
        final Registry<ConnectionSocketFactory> sfr = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.getSocketFactory())
                .register("https", ssl != null ? ssl : SSLConnectionSocketFactory.getSocketFactory())
                .build();
        CONNMGR = new PoolingHttpClientConnectionManager(sfr);
        CONNMGR.setDefaultMaxPerRoute(MaxPerRoute);
        CONNMGR.setMaxTotal(MaxTotal);
        CLIENT = HttpClientBuilder.create().setConnectionManager(CONNMGR).build();
        EXECUTOR = Executor.newInstance(CLIENT);
    }

    public static String Get(String uri, int connectTimeout, int socketTimeout) throws IOException {
        return EXECUTOR.execute(Request.Get(uri).connectTimeout(connectTimeout).socketTimeout(socketTimeout))
                .returnContent().asString();
    }

    public static String Post(String uri, StringEntity stringEntity, int connectTimeout, int socketTimeout)
            throws IOException {
        // The original accepted connectTimeout but never applied it
        return EXECUTOR.execute(Request.Post(uri).connectTimeout(connectTimeout).socketTimeout(socketTimeout)
                .addHeader("Content-Type", "application/json").body(stringEntity))
                .returnContent().asString();
    }

    public static void main(String[] args) {
        final String url = "http://localhost:9064/app/test"; // the server sleeps before responding (1 second in the handler shown below)
        for (int i = 0; i < 5; i++) {
            // With MaxPerRoute = 2, the 5 threads return in 3 groups (2, 2, 1),
            // so the whole run takes about three times the server's sleep time
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        String result = HttpFluentUtil.Get(url, 2000, 2000);
                        System.out.println(result);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}
Server side
A very simple Spring MVC handler:
@GetMapping(value = "test")
public String test() throws InterruptedException {
    Thread.sleep(1000);
    return "1";
}
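Test 1: client with MaxPerRoute=5, MaxTotal=4
Server-side result
The server first receives 4 requests and, once they have been processed, receives the remaining 1. In other words, at most MaxTotal requests are in flight at a time.
Test 2: client with MaxPerRoute=2, MaxTotal=5
Server-side result
The server receives 2 requests, then 2 more, then 1, which shows that maxPerRoute is the maximum number of requests that can be in flight in parallel to a single route (target host).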
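The two test results can be reproduced with a small model of a pool that enforces both limits. This is a plain-Java sketch under the assumption that a lease succeeds only when the route is below maxPerRoute and the pool as a whole is below maxTotal. `TwoLevelPool` and its methods are hypothetical names, and the class only imitates the accounting; it is not HttpClient's connection manager.

```java
import java.util.HashMap;
import java.util.Map;

final class TwoLevelPool {
    private final int maxPerRoute;
    private final int maxTotal;
    private final Map<String, Integer> leasedPerRoute = new HashMap<>();
    private int leasedTotal;

    TwoLevelPool(int maxPerRoute, int maxTotal) {
        this.maxPerRoute = maxPerRoute;
        this.maxTotal = maxTotal;
    }

    /** Leases a connection for the route if neither limit is exhausted. */
    synchronized boolean tryLease(String route) {
        int onRoute = leasedPerRoute.getOrDefault(route, 0);
        if (onRoute >= maxPerRoute || leasedTotal >= maxTotal) {
            return false;
        }
        leasedPerRoute.put(route, onRoute + 1);
        leasedTotal++;
        return true;
    }

    /** Returns a previously leased connection for the route. */
    synchronized void release(String route) {
        int onRoute = leasedPerRoute.getOrDefault(route, 0);
        if (onRoute == 0) {
            throw new IllegalStateException("nothing leased on " + route);
        }
        leasedPerRoute.put(route, onRoute - 1);
        leasedTotal--;
    }

    /** How many of n simultaneous requests to one route get a connection immediately. */
    static int admitted(int n, int maxPerRoute, int maxTotal) {
        TwoLevelPool pool = new TwoLevelPool(maxPerRoute, maxTotal);
        int admitted = 0;
        for (int i = 0; i < n; i++) {
            if (pool.tryLease("localhost:9064")) {
                admitted++;
            }
        }
        return admitted;
    }

    public static void main(String[] args) {
        System.out.println(admitted(5, 5, 4)); // settings of test 1
        System.out.println(admitted(5, 2, 5)); // settings of test 2
    }
}
```

For a single target host the effective limit is the smaller of the two values, which matches the 4-then-1 and 2-2-1 patterns seen on the server.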
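When should they be set?
Now that we know what the two parameters mean, when do they need to be set?
Consider the following scenario.
Service 1 calls an interface of service 2 through Fluent. Service 1 sends 400 requests, but Fluent's defaults are maxPerRoute=100 and MaxTotal=200. Suppose the interface takes 500 ms to execute. Because maxPerRoute=100, the requests run in four batches of 100, so completing all of them takes 2000 ms. With maxPerRoute set to 400 (and MaxTotal raised to at least 400 as well, since it caps the pool as a whole), everything completes in 500 ms. In cases like this, where more concurrency is needed, the two parameters should be set.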
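The arithmetic in this scenario can be written down as a small helper. This is a back-of-the-envelope sketch that assumes every request targets the same host, takes the same time, and that requests run in full batches; `BatchEstimate` and `totalMillis` are hypothetical names.

```java
final class BatchEstimate {
    /**
     * Estimates the time to finish `requests` calls to one host when each call
     * takes perRequestMillis and the pool limits concurrency.
     */
    static long totalMillis(int requests, int maxPerRoute, int maxTotal, long perRequestMillis) {
        if (requests < 0 || maxPerRoute <= 0 || maxTotal <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // For a single route, the effective concurrency is the smaller of the two limits
        int concurrency = Math.min(maxPerRoute, maxTotal);
        // Number of batches, rounded up
        long batches = ((long) requests + concurrency - 1) / concurrency;
        return batches * perRequestMillis;
    }

    public static void main(String[] args) {
        System.out.println(totalMillis(400, 100, 200, 500)); // Fluent defaults
        System.out.println(totalMillis(400, 400, 400, 500)); // both limits raised
    }
}
```

Raising only maxPerRoute to 400 while leaving MaxTotal at 200 would still give two batches in this model, which is why both values need to be adjusted together.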
How to set them
1. With Apache Fluent, use the HttpFluentUtil utility class from the test above to execute requests.
2. With RestTemplate, use something like the following:
@Bean
public HttpClient httpClient() {
    Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
            .register("http", PlainConnectionSocketFactory.getSocketFactory())
            .register("https", SSLConnectionSocketFactory.getSocketFactory())
            .build();
    PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager(registry);
    connectionManager.setMaxTotal(restTemplateProperties.getMaxTotal());
    connectionManager.setDefaultMaxPerRoute(restTemplateProperties.getDefaultMaxPerRoute());
    connectionManager.setValidateAfterInactivity(restTemplateProperties.getValidateAfterInactivity());
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(restTemplateProperties.getSocketTimeout())
            .setConnectTimeout(restTemplateProperties.getConnectTimeout())
            .setConnectionRequestTimeout(restTemplateProperties.getConnectionRequestTimeout())
            .build();
    return HttpClientBuilder.create()
            .setDefaultRequestConfig(requestConfig)
            .setConnectionManager(connectionManager)
            .build();
}

@Bean
public ClientHttpRequestFactory httpRequestFactory() {
    return new HttpComponentsClientHttpRequestFactory(httpClient());
}

@Bean
public RestTemplate restTemplate() {
    return new RestTemplate(httpRequestFactory());
}
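RestTemplateProperties is populated from the configuration file, with these keys:
max-total
default-max-per-route
connect-timeout
connection-request-timeout
socket-timeout
Summary:
max-total: the maximum number of connections in the pool
default-max-per-route: the maximum number of concurrent connections to any single route (target host)
connect-timeout: the timeout for establishing the connection to the server
connection-request-timeout: the timeout for obtaining a connection from the pool
socket-timeout: the read timeout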
Reference: https://blog.csdn.net/u013905744/java/article/details/94714696