Below is the download method from HttpClientDownloader, webmagic's official default implementation.
@Override
public Page download(Request request, Task task) {
    Site site = null;
    if (task != null) {
        site = task.getSite();
    }
    Set<Integer> acceptStatCode;
    String charset = null;
    Map<String, String> headers = null;
    if (site != null) {
        acceptStatCode = site.getAcceptStatCode();
        charset = site.getCharset();
        headers = site.getHeaders();
    } else {
        acceptStatCode = Sets.newHashSet(200);
    }
    logger.info("downloading page {}", request.getUrl());
    CloseableHttpResponse httpResponse = null;
    int statusCode = 0;
    try {
        HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers);
        httpResponse = getHttpClient(site).execute(httpUriRequest);
        statusCode = httpResponse.getStatusLine().getStatusCode();
        request.putExtra(Request.STATUS_CODE, statusCode);
        if (statusAccept(acceptStatCode, statusCode)) {
            Page page = handleResponse(request, charset, httpResponse, task);
            onSuccess(request);
            return page;
        } else {
            logger.warn("code error " + statusCode + "\t" + request.getUrl());
            return null;
        }
    } catch (IOException e) {
        logger.warn("download page " + request.getUrl() + " error", e);
        if (site.getCycleRetryTimes() > 0) {
            return addToCycleRetry(request, site);
        }
        onError(request);
        return null;
    } finally {
        request.putExtra(Request.STATUS_CODE, statusCode);
        try {
            if (httpResponse != null) {
                //ensure the connection is released back to pool
                EntityUtils.consume(httpResponse.getEntity());
            }
        } catch (IOException e) {
            logger.warn("close response fail", e);
        }
    }
}
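The download method decides whether to keep a response by calling statusAccept, whose body is not shown here. A minimal sketch, assuming it is a plain set-membership test with the same 200-only fallback that download uses when no Site is available; the class name StatusCheckSketch and the null handling are illustrative assumptions, not webmagic code:

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-in for the status check used by download(); not webmagic code.
final class StatusCheckSketch {
    // Fallback when no accepted set is supplied, mirroring Sets.newHashSet(200) in download().
    static final Set<Integer> DEFAULT_ACCEPTED = Collections.singleton(200);

    // Returns true when statusCode is in the accepted set (or in the 200-only fallback).
    static boolean statusAccept(Set<Integer> acceptStatCode, int statusCode) {
        Set<Integer> accepted = (acceptStatCode != null) ? acceptStatCode : DEFAULT_ACCEPTED;
        return accepted.contains(statusCode);
    }
}
```

With this sketch, StatusCheckSketch.statusAccept(null, 404) is rejected, while a set containing 404 would let the same response through to handleResponse.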
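The first call worth examining in download is getHttpUriRequest, which builds an org.apache.http.client.methods.HttpUriRequest. This is an important method: it is where the request headers, timeouts, and cookie policy are applied.
Most importantly, it is also where the proxy IP is wired in at the low level.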
protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers) {
    RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl());
    if (headers != null) {
        for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
            requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
        }
    }
    RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
            .setConnectionRequestTimeout(site.getTimeOut())
            .setSocketTimeout(site.getTimeOut())
            .setConnectTimeout(site.getTimeOut())
            .setCookieSpec(CookieSpecs.BEST_MATCH);
    if (site.getHttpProxyPool() != null && site.getHttpProxyPool().isEnable()) {
        HttpHost host = site.getHttpProxyFromPool();
        requestConfigBuilder.setProxy(host);
        request.putExtra(Request.PROXY, host);
    } else if (site.getHttpProxy() != null) {
        HttpHost host = site.getHttpProxy();
        requestConfigBuilder.setProxy(host);
        request.putExtra(Request.PROXY, host);
    }
    requestBuilder.setConfig(requestConfigBuilder.build());
    return requestBuilder.build();
}
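The proxy branch in getHttpUriRequest has a fixed precedence: an enabled proxy pool wins, a single configured proxy comes second, and otherwise no proxy is set. A minimal sketch of just that decision, using plain strings in place of HttpHost; the class ProxyChoiceSketch and its parameters are hypothetical names for illustration:

```java
import java.util.Optional;

// Hypothetical illustration of the proxy precedence in getHttpUriRequest(); not webmagic code.
final class ProxyChoiceSketch {
    // poolEnabled: whether a proxy pool exists and is enabled.
    // pooledProxy: the proxy handed out by the pool (may be null).
    // fixedProxy:  the single proxy configured on the site (may be null).
    static Optional<String> chooseProxy(boolean poolEnabled, String pooledProxy, String fixedProxy) {
        if (poolEnabled) {
            // As in the method above, an enabled pool is used even when a fixed proxy is also set.
            return Optional.ofNullable(pooledProxy);
        }
        // No enabled pool: fall back to the fixed proxy, or to no proxy at all.
        return Optional.ofNullable(fixedProxy);
    }
}
```

For example, chooseProxy(true, "10.0.0.1:8080", "10.0.0.2:8080") picks the pooled address, and chooseProxy(false, null, null) yields an empty Optional, which corresponds to sending the request directly.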
The second call to look at in download is getHttpClient, which returns an org.apache.http.impl.client.CloseableHttpClient:
private CloseableHttpClient getHttpClient(Site site) {
    if (site == null) {
        return httpClientGenerator.getClient(null);
    }
    String domain = site.getDomain();
    // Map<String, CloseableHttpClient> httpClients
    CloseableHttpClient httpClient = httpClients.get(domain);
    if (httpClient == null) {
        synchronized (this) {
            httpClient = httpClients.get(domain);
            if (httpClient == null) {
                httpClient = httpClientGenerator.getClient(site);
                httpClients.put(domain, httpClient);
            }
        }
    }
    return httpClient;
}
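getHttpClient keeps one client per domain and creates it lazily with a check, lock, re-check sequence. The same "create once per key" idea can be sketched with ConcurrentHashMap.computeIfAbsent from the JDK. This is a swapped-in technique for illustration, not what webmagic does; PerDomainCacheSketch and its factory parameter are hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-domain cache; C stands in for CloseableHttpClient.
final class PerDomainCacheSketch<C> {
    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    // factory builds a new client for a domain the first time that domain is requested.
    PerDomainCacheSketch(Function<String, C> factory) {
        this.factory = factory;
    }

    // Returns the cached client for the domain, creating it at most once.
    // Note: ConcurrentHashMap rejects null keys, so domain must not be null.
    C get(String domain) {
        return clients.computeIfAbsent(domain, factory);
    }
}
```

Calling get("example.com") twice returns the same instance and invokes the factory once; a different domain gets its own instance.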
The third call in download is handleResponse, which returns a us.codecraft.webmagic.Page. Page is webmagic's own wrapper object for a downloaded page:
protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
    String content = getContent(charset, httpResponse);
    Page page = new Page();
    page.setRawText(content);
    page.setUrl(new PlainText(request.getUrl()));
    page.setRequest(request);
    page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
    return page;
}
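handleResponse delegates the actual decoding to getContent(charset, httpResponse), which is not shown here. A rough sketch of the basic idea, assuming the body has already been read into a byte array and that UTF-8 is used when no charset is configured; both of those choices, and the class name ContentDecodeSketch, are assumptions for illustration rather than webmagic's behavior:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical decoding helper; the real getContent() may detect the charset differently.
final class ContentDecodeSketch {
    // Decodes the raw body with the configured charset, or UTF-8 (an assumed fallback) when none is set.
    static String decode(byte[] body, String charset) {
        Charset cs = (charset != null) ? Charset.forName(charset) : StandardCharsets.UTF_8;
        return new String(body, cs);
    }
}
```

The decoded string is what would then be passed to page.setRawText in the method above.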