https://www.cnblogs.com/yxnyd/p/9801396.html
HttpClient (2): Using IP Proxies and Handling Connection Timeouts
Preface
The previous post only scratched the surface. HttpClient has many more powerful features:
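I. Using a proxy IP with HttpClient
1.1 Preface
When crawling web pages, some target sites have anti-crawler mechanisms that block the IP of any client that accesses the site too frequently or in a regular pattern.
This is where proxy IPs come in: when one IP gets blocked, switch to another.
Proxy IPs come in several types: transparent, anonymous, distorting, and elite (high-anonymity) proxies. Elite proxies are the usual choice.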
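1.2 Types of proxy IP
1) Transparent proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
A transparent proxy "hides" your IP address from the direct connection, but the server can still find out who you are from HTTP_X_FORWARDED_FOR.
2) Anonymous proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy is a step up from a transparent one: the server can tell that you are using a proxy, but not who you are.
One step beyond a plain anonymous proxy is the distorting proxy.
3) Distorting proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
As with an anonymous proxy, the server can still tell that you are using a proxy, but it sees a fake IP address, which makes the disguise more convincing.
4) Elite proxy (high-anonymity proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
As you can see, with an elite proxy the server cannot tell that a proxy is in use at all, so it is the best choice.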
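The four header patterns above can be turned into a small classifier. The following is only a sketch: the class and method names are hypothetical, the IP addresses come from documentation ranges, and it assumes you control a test server that reports the REMOTE_ADDR, Via, and X-Forwarded-For values it saw.

```java
import java.util.Objects;

/** Anonymity levels as described in section 1.2. */
enum ProxyLevel { TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

final class ProxyLevelClassifier {
    private ProxyLevelClassifier() {}

    /**
     * Classifies a proxy from what the target server observed.
     *
     * @param remoteAddr   REMOTE_ADDR seen by the server (the proxy's IP)
     * @param via          HTTP_VIA value, or null if the header was absent
     * @param forwardedFor HTTP_X_FORWARDED_FOR value, or null if absent
     * @param realClientIp the tester's own public IP
     */
    static ProxyLevel classify(String remoteAddr, String via,
                               String forwardedFor, String realClientIp) {
        if (via == null && forwardedFor == null) {
            return ProxyLevel.ELITE;        // no trace of a proxy in the headers
        }
        if (Objects.equals(forwardedFor, realClientIp)) {
            return ProxyLevel.TRANSPARENT;  // the real IP leaks through
        }
        if (forwardedFor == null || Objects.equals(forwardedFor, remoteAddr)) {
            return ProxyLevel.ANONYMOUS;    // proxy visible, client hidden
        }
        return ProxyLevel.DISTORTING;       // proxy visible, fake client IP
    }

    public static void main(String[] args) {
        String proxyIp = "203.0.113.10";
        String myIp = "198.51.100.7";
        System.out.println(classify(proxyIp, proxyIp, myIp, myIp));
        System.out.println(classify(proxyIp, null, null, myIp));
    }
}
```

A check like this could be run against each candidate proxy before it is added to the crawler's proxy list.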
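Crawlers generally use elite proxy IPs.
Where do proxy IPs come from? A quick search (on Baidu, for example) turns up plenty of proxy IP sites. Most list some free proxies, but paying for an API is more convenient.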
1.3 Example: using a proxy IP
Use RequestConfig.custom().setProxy(proxy).build() to set the proxy IP.
package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class UseProxy {
    public static void main(String[] args) throws IOException {
        // Create the HttpClient instance; try-with-resources closes it even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Create the HttpGet instance
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Set the proxy IP, the connect timeout, the read (socket) timeout,
            // and the timeout for obtaining a connection from the connection manager
            HttpHost proxy = new HttpHost("58.60.255.82", 8118);
            RequestConfig requestConfig = RequestConfig.custom()
                    .setProxy(proxy)
                    .setConnectTimeout(10000)
                    .setSocketTimeout(10000)
                    .setConnectionRequestTimeout(3000)
                    .build();
            httpGet.setConfig(requestConfig);
            // Set the request header
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity(); // the response entity
                if (entity != null) {
                    System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
                }
            }
        }
    }
}
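1.4 How to obtain proxy IPs in real projects
We can use HttpClient to crawl the 20 newest elite proxy IPs from http://www.xicidaili.com/ and store them in a linked list. When an IP gets blocked or a connection times out, take the next IP from the list, and so on. When fewer than 5 IPs remain in the list, crawl again and add the fresh proxy IPs to it.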
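The rotation strategy described above could be sketched as follows. The class name `ProxyPool` and the `fetcher` supplier are hypothetical: the supplier stands in for whatever code crawls the proxy site and returns "host:port" strings, which the post does not show.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

/** A minimal proxy pool: hands out proxies one by one and refills when running low. */
final class ProxyPool {
    /** Refill when fewer than this many proxies remain (5 in the text above). */
    private static final int REFILL_THRESHOLD = 5;

    private final LinkedList<String> proxies = new LinkedList<>();
    private final Supplier<List<String>> fetcher; // hypothetical crawler of the proxy site

    ProxyPool(Supplier<List<String>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the next proxy, crawling a new batch first if the list is running low. */
    synchronized String next() {
        if (proxies.size() < REFILL_THRESHOLD) {
            proxies.addAll(fetcher.get());
        }
        if (proxies.isEmpty()) {
            throw new IllegalStateException("no proxies available");
        }
        return proxies.poll();
    }

    synchronized int size() {
        return proxies.size();
    }
}
```

The caller would ask for `next()` whenever the current proxy is blocked or a request through it times out.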
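1.5 HttpClient connect timeout and read timeout
When HttpClient executes an HTTP request, time is spent first on connecting and then on reading the content.
1) HttpClient connect time
The connect time runs from the moment HttpClient sends the request until it is connected to the target host. In theory, the shorter the distance and the clearer the route, the faster the connection, but because routing is complex the connect time varies, and with bad luck the connection fails altogether. In the author's tests the default connect timeout was about 1 minute, after which the connection was retried a little later. HttpClient 4.x itself sets no connect timeout by default, so the effective limit comes from the operating system. This causes a problem: a URL that never connects ties up a thread and holds back the other threads, occupying a resource without doing any useful work. So we should set the timeout explicitly, for example to 10 seconds: if no connection is made within 10 seconds an error is raised, and the business logic can handle it, for example by retrying later. The offending URL can also be written to the log4j log for an administrator to review.
2) HttpClient read time
The read time is the time spent fetching the content once HttpClient has connected to the target server. Reading is usually fast, but a large response or problems on the target server (slow database reads, heavy concurrency, and so on) can slow it down.
As above, set it explicitly, for example to 10 seconds: if no data arrives for 10 seconds an error is raised and the business logic can handle it. Strictly speaking, the socket timeout limits the wait between two consecutive data packets, not the total time for the whole response.
Take the address http://central.maven.org/maven2/ as an example: it is an overseas host that is slow to connect to and returns a lot of content, so connect and read timeouts occur easily.
How do we implement this in code?
HttpClient provides the RequestConfig class for setting parameters such as the connect timeout, the read timeout, and the proxy IP described earlier.
Example: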
package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class TimeSetting {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the client and the response even when a timeout is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(5000)  // connect timeout: 5 seconds
                    .setSocketTimeout(5000)   // read (socket) timeout: 5 seconds
                    .build();
            httpGet.setConfig(config);
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    System.out.println("Page content: " + EntityUtils.toString(entity, "UTF-8"));
                }
            }
        }
    }
}
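The business-level handling mentioned in section 1.5 (retry a little later after a timeout, and record the URL that keeps failing) might look roughly like the sketch below. The names `TimeoutRetry` and `fetchWithRetry` are hypothetical, the `Callable` stands in for the actual `httpClient.execute(...)` call, and an in-memory list stands in for the log4j log. It relies on the fact that both `java.net.SocketTimeoutException` and HttpClient 4.x's `ConnectTimeoutException` extend `java.io.InterruptedIOException`.

```java
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

final class TimeoutRetry {
    /** URLs that kept timing out; stands in for the log file an administrator would read. */
    static final List<String> FAILED_URLS = new ArrayList<>();

    private TimeoutRetry() {}

    /**
     * Runs the request up to maxAttempts times, pausing between attempts.
     * Returns the response body, or null if every attempt timed out.
     */
    static String fetchWithRetry(String url, Callable<String> request,
                                 int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (InterruptedIOException timeout) {
                // Connect or read timeout: wait a little, then try again
                if (attempt < maxAttempts) {
                    pause(pauseMillis);
                }
            } catch (Exception other) {
                // Anything else is not a timeout, so do not retry
                throw new IllegalStateException("request failed: " + url, other);
            }
        }
        synchronized (FAILED_URLS) {
            FAILED_URLS.add(url); // record the URL for later inspection
        }
        return null;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With the examples above, the `Callable` would wrap the `execute` call and the `EntityUtils.toString` conversion.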
How to set a proxy server (HTTP proxy) for HttpClient requests
https://www.iteye.com/blog/rd-030-2357128
Two important HttpClient parameters: maxPerRoute and MaxTotal
https://blog.csdn.net/u013905744/article/details/94714696
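Sending large numbers of requests with the asynchronous HttpClient
The project sends large numbers of HTTP requests asynchronously through HttpClient, so this is a record of how it was done.
Approach: use an HttpClient connection pool together with multiple threads.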
import java.nio.charset.CodingErrorAction;

import javax.net.ssl.SSLContext;

import org.apache.http.Consts;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthSchemeProvider;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.config.AuthSchemes;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.config.ConnectionConfig;
import org.apache.http.config.Lookup;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.impl.auth.BasicSchemeFactory;
import org.apache.http.impl.auth.DigestSchemeFactory;
import org.apache.http.impl.auth.KerberosSchemeFactory;
import org.apache.http.impl.auth.NTLMSchemeFactory;
import org.apache.http.impl.auth.SPNegoSchemeFactory;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.apache.http.nio.conn.NoopIOSessionStrategy;
import org.apache.http.nio.conn.SchemeIOSessionStrategy;
import org.apache.http.nio.conn.ssl.SSLIOSessionStrategy;
import org.apache.http.nio.reactor.ConnectingIOReactor;
import org.apache.http.nio.reactor.IOReactorException;
import org.apache.http.ssl.SSLContexts; // HttpCore 4.4+; older versions have it in org.apache.http.conn.ssl

public class HttpAsyncClient {
    private static int socketTimeout = 500;             // read timeout: 0.5 seconds, adjust to your business needs
    private static int connectTimeout = 2000;           // connect timeout
    private static int poolSize = 100;                  // maximum number of connections in the pool
    private static int maxPerRoute = 100;               // maximum concurrent connections per host (route)
    private static int connectionRequestTimeout = 3000; // timeout for obtaining a connection from the pool

    // HTTP proxy parameters
    private String host = "58.60.255.82";
    private int port = 8118;
    private String username = "";
    private String password = "";

    // Asynchronous HttpClient
    private CloseableHttpAsyncClient asyncHttpClient;
    // Asynchronous HttpClient that goes through the proxy
    private CloseableHttpAsyncClient proxyAsyncHttpClient;

    public HttpAsyncClient() {
        try {
            this.asyncHttpClient = createAsyncClient(false);
            this.proxyAsyncHttpClient = createAsyncClient(true);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public CloseableHttpAsyncClient createAsyncClient(boolean proxy) throws IOReactorException {
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectionRequestTimeout(connectionRequestTimeout)
                .setConnectTimeout(connectTimeout)
                .setSocketTimeout(socketTimeout).build();

        SSLContext sslcontext = SSLContexts.createDefault();

        UsernamePasswordCredentials credentials = new UsernamePasswordCredentials(username, password);
        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, credentials);

        // Register the I/O session strategies that handle the http and https schemes
        Registry<SchemeIOSessionStrategy> sessionStrategyRegistry = RegistryBuilder
                .<SchemeIOSessionStrategy> create()
                .register("http", NoopIOSessionStrategy.INSTANCE)
                .register("https", new SSLIOSessionStrategy(sslcontext))
                .build();

        // Configure the I/O threads
        IOReactorConfig ioReactorConfig = IOReactorConfig.custom().setSoKeepAlive(false).setTcpNoDelay(true)
                .setIoThreadCount(Runtime.getRuntime().availableProcessors())
                .build();

        // Set the connection pool size
        ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor(ioReactorConfig);
        PoolingNHttpClientConnectionManager conMgr = new PoolingNHttpClientConnectionManager(
                ioReactor, null, sessionStrategyRegistry, null);
        if (poolSize > 0) {
            conMgr.setMaxTotal(poolSize);
        }
        conMgr.setDefaultMaxPerRoute(maxPerRoute > 0 ? maxPerRoute : 10);

        ConnectionConfig connectionConfig = ConnectionConfig.custom()
                .setMalformedInputAction(CodingErrorAction.IGNORE)
                .setUnmappableInputAction(CodingErrorAction.IGNORE)
                .setCharset(Consts.UTF_8).build();
        conMgr.setDefaultConnectionConfig(connectionConfig);

        Lookup<AuthSchemeProvider> authSchemeRegistry = RegistryBuilder
                .<AuthSchemeProvider> create()
                .register(AuthSchemes.BASIC, new BasicSchemeFactory())
                .register(AuthSchemes.DIGEST, new DigestSchemeFactory())
                .register(AuthSchemes.NTLM, new NTLMSchemeFactory())
                .register(AuthSchemes.SPNEGO, new SPNegoSchemeFactory())
                .register(AuthSchemes.KERBEROS, new KerberosSchemeFactory())
                .build();

        // The original applied requestConfig only to the proxy client, so the plain client
        // ran without the timeouts configured above; it is applied to both here.
        HttpAsyncClientBuilder builder = HttpAsyncClients.custom().setConnectionManager(conMgr)
                .setDefaultCredentialsProvider(credentialsProvider)
                .setDefaultAuthSchemeRegistry(authSchemeRegistry)
                .setDefaultCookieStore(new BasicCookieStore())
                .setDefaultRequestConfig(requestConfig);
        if (proxy) {
            builder.setProxy(new HttpHost(host, port));
        }
        return builder.build();
    }

    public CloseableHttpAsyncClient getAsyncHttpClient() {
        return asyncHttpClient;
    }

    public CloseableHttpAsyncClient getProxyAsyncHttpClient() {
        return proxyAsyncHttpClient;
    }
}
public class HttpClientFactory {
    // The single shared asynchronous client wrapper
    private static HttpAsyncClient httpAsyncClient = new HttpAsyncClient();

    private HttpClientFactory() {
    }

    // Singleton instance of the factory
    private static HttpClientFactory httpClientFactory = new HttpClientFactory();

    public static HttpClientFactory getInstance() {
        return httpClientFactory;
    }

    public HttpAsyncClient getHttpAsyncClientPool() {
        return httpAsyncClient;
    }
}
public void sendThredPost(List<FaceBookUserQuitEntity> list, String title, String subTitle, String imgUrl) {
    if (list == null || list.size() == 0) {
        // The original created this exception without throwing it.
        // Assumes BusinessException is an unchecked exception.
        throw new BusinessException("Asia: the queried user data is empty");
    }
    CloseableHttpAsyncClient client =
            HttpClientFactory.getInstance().getHttpAsyncClientPool().getAsyncHttpClient();
    int number = list.size();
    int num = number / 10; // users per thread when the list is split across 10 threads
    PostThread[] threads;
    if (num > 0) {
        threads = new PostThread[10];
        for (int i = 0; i <= 9; i++) {
            // The last thread also takes the remainder; the original dropped
            // the final (number % 10) users.
            int end = (i == 9) ? number : (i + 1) * num;
            List<FaceBookUserQuitEntity> threadList = list.subList(i * num, end);
            threads[i] = new PostThread(client, threadList, title, subTitle, imgUrl);
        }
    } else {
        // Fewer than 10 users: one thread is enough
        threads = new PostThread[] { new PostThread(client, list, title, subTitle, imgUrl) };
    }
    for (int k = 0; k < threads.length; k++) {
        threads[k].start();
        logger.info("Asia thread: {} started", k);
    }
    for (int j = 0; j < threads.length; j++) {
        try {
            threads[j].join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt status for the caller
            e.printStackTrace();
        }
    }
}
public class PostThread extends Thread {
    // logger, URL and getPostbody(...) belong to the original project and are not shown in the post

    private final CloseableHttpAsyncClient httpClient;
    private final List<FaceBookUserQuitEntity> list;
    private final String title;
    private final String subTitle;
    private final String imgUrl;

    public PostThread(CloseableHttpAsyncClient httpClient, List<FaceBookUserQuitEntity> list, String title, String subTitle, String imgUrl) {
        this.httpClient = httpClient;
        this.list = list;
        this.title = title;
        this.subTitle = subTitle;
        this.imgUrl = imgUrl;
    }

    @Override
    public void run() {
        try {
            int size = list.size();
            // The client must be started before requests are executed
            httpClient.start();
            // Send in batches of 100
            for (int k = 0; k < size; k += 100) {
                List<FaceBookUserQuitEntity> subList = list.subList(k, Math.min(k + 100, size));
                final long startTime = System.currentTimeMillis();
                final CountDownLatch latch = new CountDownLatch(subList.size());
                for (FaceBookUserQuitEntity faceBookEntity : subList) {
                    String senderId = faceBookEntity.getSenderId();
                    String player_id = faceBookEntity.getPlayer_id();
                    logger.info("Start sending message: playerid=" + player_id);
                    final String bodyStr = getPostbody(senderId, player_id, title, subTitle,
                            imgUrl, "Play Game", "");
                    if (bodyStr.isEmpty()) {
                        // Nothing to send. Still count down, otherwise latch.await() below
                        // would wait forever (the original skipped this).
                        latch.countDown();
                        continue;
                    }
                    final HttpPost httpPost = new HttpPost(URL);
                    StringEntity stringEntity = new StringEntity(bodyStr, "utf-8");
                    // The original also called setContentEncoding("UTF-8"); Content-Encoding names
                    // a compression scheme, not a charset, so that call is dropped here.
                    stringEntity.setContentType("application/json");
                    httpPost.setEntity(stringEntity);
                    httpClient.execute(httpPost, new FutureCallback<HttpResponse>() {
                        @Override
                        public void completed(HttpResponse result) {
                            latch.countDown();
                            int statusCode = result.getStatusLine().getStatusCode();
                            if (200 == statusCode) {
                                logger.info("Message request succeeded=" + bodyStr);
                            } else {
                                logger.info("Response status=" + statusCode);
                                logger.info("Message request failed=" + bodyStr);
                            }
                            try {
                                logger.info(EntityUtils.toString(result.getEntity(), "UTF-8"));
                            } catch (IOException e) {
                                e.printStackTrace();
                            }
                        }

                        @Override
                        public void failed(Exception ex) {
                            latch.countDown();
                            logger.info("Message request failed e=" + ex);
                        }

                        @Override
                        public void cancelled() {
                            latch.countDown();
                        }
                    });
                }
                try {
                    // Wait until every request of this batch has completed, failed or been cancelled
                    latch.await();
                    // Throttle: each batch of 100 takes at least 10 seconds in total
                    long leftTime = 10000 - (System.currentTimeMillis() - startTime);
                    if (leftTime > 0) {
                        Thread.sleep(leftTime);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    e.printStackTrace();
                    return;
                }
            }
        } catch (UnsupportedCharsetException e) {
            e.printStackTrace();
        }
    }
}
The utility code above can be used as-is; the sending logic needs to be adapted to your own case.
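When adapting the sending logic, one detail worth getting right is how the user list is divided among the worker threads. Below is one possible way to split a list into a fixed number of chunks so that no element is left out when the size is not a multiple of the thread count. `ListPartitioner` is a hypothetical helper, not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;

final class ListPartitioner {
    private ListPartitioner() {}

    /**
     * Splits items into at most `parts` contiguous chunks whose sizes differ by at most one.
     * The chunks are views on the original list (List.subList), so nothing is copied.
     */
    static <T> List<List<T>> partition(List<T> items, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        if (items.isEmpty() || parts <= 0) {
            return chunks;
        }
        int n = Math.min(parts, items.size()); // never create empty chunks
        int base = items.size() / n;           // minimum chunk size
        int extra = items.size() % n;          // the first `extra` chunks get one more element
        int from = 0;
        for (int i = 0; i < n; i++) {
            int to = from + base + (i < extra ? 1 : 0);
            chunks.add(items.subList(from, to));
            from = to;
        }
        return chunks;
    }
}
```

Each chunk would then be handed to one worker thread, in the same way the sending code above creates one `PostThread` per sublist.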
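HttpClient timeout settings explained
Important HttpClient parameters and important HttpClient connection pool parameters
Two important HttpClient parameters: maxPerRoute and MaxTotal
The three HttpClient connection pool parameters
The meaning of the three timeouts of an HTTP request: connectionRequestTimeout, connectionTimeout and socketTimeout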
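1. connectionRequestTimeout: the timeout for obtaining a connection from the connection pool. When it is exceeded, a ConnectionPoolTimeoutException is thrown.
2. connectTimeout (connectionTimeout): the timeout for establishing the connection between the client and the server. An HTTP request goes through three phases: (1) establishing the connection, (2) transferring data, (3) closing the connection. This timeout applies to the first phase. When it is exceeded, a ConnectTimeoutException is thrown.
3. socketTimeout: the timeout for the client reading data from the server once the connection has been established. When it is exceeded, a SocketTimeoutException is thrown.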
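To see the difference between the connect phase and the read phase without any HTTP library, the sketch below uses plain JDK sockets against a local server that accepts the connection but never sends anything. This is only an illustration of the two phases, not how HttpClient is implemented; the class name is hypothetical and it assumes loopback networking is available.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SocketTimeoutDemo {
    private SocketTimeoutDemo() {}

    /** Connects to a silent local server and reports which phase, if any, timed out. */
    static String readFromSilentServer(int connectTimeoutMs, int readTimeoutMs) {
        boolean connected = false;
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            // Phase 1: establish the connection (bounded by the connect timeout)
            client.connect(
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort()),
                    connectTimeoutMs);
            connected = true;
            // Phase 2: wait for data (bounded by the socket/read timeout)
            client.setSoTimeout(readTimeoutMs);
            int firstByte = client.getInputStream().read();
            return "read byte " + firstByte;
        } catch (SocketTimeoutException e) {
            return connected ? "read-timeout" : "connect-timeout";
        } catch (IOException e) {
            return "io-error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The server never writes, so the read phase is the one that times out
        System.out.println(readFromSilentServer(1000, 200));
    }
}
```

In HttpClient the same two phases are governed by `setConnectTimeout` and `setSocketTimeout` on `RequestConfig`.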
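HttpClient wraps the low-level implementation of HTTP requests in Java and is a widely used component.
HttpClient supports connection pooling, and the two parameters maxPerRoute and MaxTotal configure the pool.
Example 2: the Executor class of Apache's Fluent API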
/**
* An Executor for fluent requests
* <p/>
* A {@link PoolingHttpClientConnectionManager} with maximum 100 connections per route and
* a total maximum of 200 connections is used internally.
*/
// At most 100 connections per route and at most 200 connections in total
CONNMGR = new PoolingHttpClientConnectionManager(sfr);
CONNMGR.setDefaultMaxPerRoute(100);
CONNMGR.setMaxTotal(200);
CLIENT = HttpClientBuilder.create().setConnectionManager(CONNMGR).build();
The meaning of maxPerRoute and MaxTotal
What do the two parameters maxPerRoute and MaxTotal mean?
The following test code illustrates them.
Test client
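import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;

import javax.net.ssl.SSLContext;

import org.apache.http.client.HttpClient;
import org.apache.http.client.fluent.Executor;
import org.apache.http.client.fluent.Request;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLInitializationException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HttpFluentUtil {
    private static final Logger logger = LoggerFactory.getLogger(HttpFluentUtil.class);
    private final static int MaxPerRoute = 2;
    private final static int MaxTotal = 4;
    final static PoolingHttpClientConnectionManager CONNMGR;
    final static HttpClient CLIENT;
    final static Executor EXECUTOR;

    static {
        // Same initialization as Fluent's own Executor, but with custom pool limits
        LayeredConnectionSocketFactory ssl = null;
        try {
            ssl = SSLConnectionSocketFactory.getSystemSocketFactory();
        } catch (final SSLInitializationException ex) {
            final SSLContext sslcontext;
            try {
                sslcontext = SSLContext.getInstance(SSLConnectionSocketFactory.TLS);
                sslcontext.init(null, null, null);
                ssl = new SSLConnectionSocketFactory(sslcontext);
            } catch (final SecurityException ignore) {
            } catch (final KeyManagementException ignore) {
            } catch (final NoSuchAlgorithmException ignore) {
            }
        }
        final Registry<ConnectionSocketFactory> sfr = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.getSocketFactory())
                .register("https", ssl != null ? ssl : SSLConnectionSocketFactory.getSocketFactory()).build();
        CONNMGR = new PoolingHttpClientConnectionManager(sfr);
        CONNMGR.setDefaultMaxPerRoute(MaxPerRoute);
        CONNMGR.setMaxTotal(MaxTotal);
        CLIENT = HttpClientBuilder.create().setConnectionManager(CONNMGR).build();
        EXECUTOR = Executor.newInstance(CLIENT);
    }

    public static String Get(String uri, int connectTimeout, int socketTimeout) throws IOException {
        return EXECUTOR.execute(Request.Get(uri).connectTimeout(connectTimeout).socketTimeout(socketTimeout))
                .returnContent().asString();
    }

    public static String Post(String uri, StringEntity stringEntity, int connectTimeout, int socketTimeout)
            throws IOException {
        // The original ignored the connectTimeout argument; it is applied here
        return EXECUTOR.execute(Request.Post(uri).connectTimeout(connectTimeout).socketTimeout(socketTimeout)
                .addHeader("Content-Type", "application/json").body(stringEntity)).returnContent().asString();
    }

    public static void main(String[] args) {
        // The server sleeps 1 second before responding (see the server code in this post)
        final String url = "http://localhost:9064/app/test";
        // With MaxPerRoute=2 the 5 threads are served in 3 groups (2, 2, 1), about 3 seconds in total
        for (int i = 0; i < 5; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        String result = HttpFluentUtil.Get(url, 2000, 2000);
                        System.out.println(result);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}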
Server side
A very simple Spring MVC handler:
@GetMapping(value = "test")
public String test() throws InterruptedException {
    Thread.sleep(1000); // simulate 1 second of processing
    return "1";
}
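Test 1: client with MaxPerRoute=5, MaxTotal=4
Server-side result:
The server first receives 4 requests and, once they have been processed, receives the remaining 1. In other words, at most MaxTotal requests are in flight at any one time.
Test 2: client with MaxPerRoute=2, MaxTotal=5
Server-side result:
The server receives 2 requests, then 2 more, then 1. This shows that maxPerRoute is the maximum number of requests that can be in flight in parallel to a single service (one route, that is, one target host).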
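The behaviour observed in the two tests can be imitated without any HTTP traffic by guarding fake requests with two semaphores, one for the whole pool and one for the single route. This is a simplified model with hypothetical names, not HttpClient's actual pool implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolLimitSimulation {
    private PoolLimitSimulation() {}

    /**
     * Fires `requests` concurrent fake requests at one route and returns
     * the highest number that were in flight at the same time.
     */
    static int peakConcurrency(int requests, int maxPerRoute, int maxTotal, long serviceMillis) {
        Semaphore route = new Semaphore(maxPerRoute); // limit for this one route
        Semaphore total = new Semaphore(maxTotal);    // limit for the whole pool
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);

        for (int i = 0; i < requests; i++) {
            new Thread(() -> {
                try {
                    route.acquire();
                    try {
                        total.acquire();
                        try {
                            int now = inFlight.incrementAndGet();
                            peak.accumulateAndGet(now, Math::max);
                            Thread.sleep(serviceMillis); // the "server" is working
                        } finally {
                            inFlight.decrementAndGet();
                            total.release();
                        }
                    } finally {
                        route.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("MaxPerRoute=5, MaxTotal=4 -> peak " + peakConcurrency(5, 5, 4, 100));
        System.out.println("MaxPerRoute=2, MaxTotal=5 -> peak " + peakConcurrency(5, 2, 5, 100));
    }
}
```

In this model the peak never exceeds the smaller of the two limits, which matches what the server logs showed in both tests.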
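When should they be set?
Now that we know what the two parameters mean, when do they need to be set?
Consider the following scenario.
Service 1 calls an interface of service 2 through Fluent. Service 1 sends 400 requests, but Fluent by default only allows maxPerRoute=100 and MaxTotal=200. If the interface takes 500 ms to respond, then with maxPerRoute=100 the requests run in four batches of 100, and the whole run takes 4 × 500 ms = 2000 ms. With maxPerRoute set to 400 (and MaxTotal raised to at least 400 as well, since it caps the whole pool), the run takes about 500 ms. In situations like this, where more concurrency is needed, the two parameters should be set.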
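The arithmetic of this scenario can be written down as a small helper. It is a rough estimate under the assumptions that all requests go to one route, every request takes the same time, and the requests run in full batches; `BatchEstimate` is a hypothetical name.

```java
final class BatchEstimate {
    private BatchEstimate() {}

    /** Estimated total time when `requests` calls of equal duration share one connection pool. */
    static long estimateMillis(int requests, int maxPerRoute, int maxTotal, long perRequestMillis) {
        // Requests to a single route are limited by both settings
        int parallel = Math.min(maxPerRoute, maxTotal);
        // Number of batches, rounded up
        int batches = (requests + parallel - 1) / parallel;
        return batches * perRequestMillis;
    }

    public static void main(String[] args) {
        System.out.println(estimateMillis(400, 100, 200, 500)); // Fluent defaults
        System.out.println(estimateMillis(400, 400, 400, 500)); // both limits raised
    }
}
```

With the Fluent defaults this gives 2000 ms, and with both limits at 400 it gives 500 ms. Raising only maxPerRoute to 400 while leaving MaxTotal at 200 would still give 1000 ms in this model.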
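How to set them
1. With Apache Fluent, requests can be executed through the HttpFluentUtil utility class from the test above.
2. With RestTemplate, a configuration along the following lines can be used: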
// restTemplateProperties is an injected configuration bean (RestTemplateProperties)
@Bean
public HttpClient httpClient() {
    Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
            .register("http", PlainConnectionSocketFactory.getSocketFactory())
            .register("https", SSLConnectionSocketFactory.getSocketFactory())
            .build();
    PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager(registry);
    connectionManager.setMaxTotal(restTemplateProperties.getMaxTotal());
    connectionManager.setDefaultMaxPerRoute(restTemplateProperties.getDefaultMaxPerRoute());
    connectionManager.setValidateAfterInactivity(restTemplateProperties.getValidateAfterInactivity());
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(restTemplateProperties.getSocketTimeout())
            .setConnectTimeout(restTemplateProperties.getConnectTimeout())
            .setConnectionRequestTimeout(restTemplateProperties.getConnectionRequestTimeout())
            .build();
    return HttpClientBuilder.create()
            .setDefaultRequestConfig(requestConfig)
            .setConnectionManager(connectionManager)
            .build();
}

@Bean
public ClientHttpRequestFactory httpRequestFactory() {
    return new HttpComponentsClientHttpRequestFactory(httpClient());
}

@Bean
public RestTemplate restTemplate() {
    return new RestTemplate(httpRequestFactory());
}
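RestTemplateProperties is populated from the configuration file, with the following keys:
max-total, default-max-per-route, connect-timeout (timeout for establishing the connection), connection-request-timeout (timeout for obtaining a connection from the pool), socket-timeout (read timeout)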
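The post does not show the RestTemplateProperties class itself. A plain holder that offers the getters used in the bean configuration might look like the sketch below. The field defaults are illustrative assumptions, not values from the original project, and in a Spring Boot application the class would typically also carry a `@ConfigurationProperties` annotation with a prefix of your choosing so that the keys above are bound to it.

```java
/** Hypothetical holder for the pool and timeout settings; all defaults are illustrative. */
final class RestTemplateProperties {
    private int maxTotal = 200;                  // max-total
    private int defaultMaxPerRoute = 100;        // default-max-per-route
    private int validateAfterInactivity = 2000;  // ms idle before a pooled connection is re-validated
    private int connectTimeout = 5000;           // connect-timeout, ms
    private int connectionRequestTimeout = 3000; // connection-request-timeout, ms
    private int socketTimeout = 5000;            // socket-timeout, ms

    public int getMaxTotal() { return maxTotal; }
    public void setMaxTotal(int maxTotal) { this.maxTotal = maxTotal; }

    public int getDefaultMaxPerRoute() { return defaultMaxPerRoute; }
    public void setDefaultMaxPerRoute(int defaultMaxPerRoute) { this.defaultMaxPerRoute = defaultMaxPerRoute; }

    public int getValidateAfterInactivity() { return validateAfterInactivity; }
    public void setValidateAfterInactivity(int validateAfterInactivity) { this.validateAfterInactivity = validateAfterInactivity; }

    public int getConnectTimeout() { return connectTimeout; }
    public void setConnectTimeout(int connectTimeout) { this.connectTimeout = connectTimeout; }

    public int getConnectionRequestTimeout() { return connectionRequestTimeout; }
    public void setConnectionRequestTimeout(int connectionRequestTimeout) { this.connectionRequestTimeout = connectionRequestTimeout; }

    public int getSocketTimeout() { return socketTimeout; }
    public void setSocketTimeout(int socketTimeout) { this.socketTimeout = socketTimeout; }
}
```

All six values are plain `int`s because the HttpClient 4.x setters they feed take `int` arguments.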
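Summary:
max-total: the maximum number of connections in the pool
default-max-per-route: the maximum number of requests that can be in flight in parallel to any single service (route)
connect-timeout: the timeout for establishing the connection to the server
connection-request-timeout: the timeout for obtaining a connection from the pool
socket-timeout: the read timeout
Reference: https://blog.csdn.net/u013905744/java/article/details/94714696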

