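The following is for learning and discussion only. Do not use it for any other purpose; you bear the consequences if you do.
I. What is HttpClient?
HTTP is probably the most widely used and most important protocol on the Internet today, and more and more Java applications need to access network resources directly over HTTP. Although the JDK's java.net package already provides basic HTTP access, for most applications the JDK library alone is not rich or flexible enough. HttpClient is a subproject of Apache Jakarta Commons that provides an efficient, up-to-date, feature-rich client-side programming toolkit for HTTP, and it supports the latest versions and recommendations of the protocol. HttpClient is used in many projects; for example, Cactus and HTMLUnit, two other well-known open-source projects under Apache Jakarta, both use it. At the time of writing, the latest version is HttpClient 4.3.4 (2014-06-22).
----- Quoted from Baidu Baike
Simply put, HttpClient is a jar from Apache that wraps the HTTP protocol.
The rest of this post uses GET and POST requests to log in to the China Unicom website and fetch a user's basic information and billing data.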
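II. Create a Maven project named httpclient
My environment is JDK 1.7 + IntelliJ IDEA 13.0 + Ubuntu 12.04 + Maven + HttpClient 4.3.4. First, create a Maven project:
As shown in the screenshot, select the quickstart archetype,
then keep clicking Next until the wizard finishes.
Once created, the project looks like this:
Double-click pom.xml and add the required dependency: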
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.4</version>
</dependency>
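Maven automatically downloads the other jars that httpclient depends on. The jars actually required are shown below:
III. Log in to China Unicom and fetch data
1. Simulate login with GET and fetch the monthly bills
China Unicom offers two login forms:
The difference between the two screenshots above is that one requires a captcha and the other does not. The login without a captcha is handled first.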
package com.amos;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URLEncoder;

/**
 * Logs in to China Unicom and fetches billing data.
 *
 * @author amosli
 */
public class LoginChinaUnicom {

    /**
     * @param args unused
     * @throws Exception on any network or I/O failure
     */
    public static void main(String[] args) throws Exception {
        String name = "your China Unicom mobile number";
        String pwd = "your service password";

        // The credentials travel as query parameters, so URL-encode them.
        String url = "https://uac.10010.com/portal/Service/MallLogin?callback=jQuery17202691898950318097_1403425938090&redirectURL=http%3A%2F%2Fwww.10010.com&userName="
                + URLEncoder.encode(name, "UTF-8") + "&password=" + URLEncoder.encode(pwd, "UTF-8")
                + "&pwdType=01&productType=01&redirectType=01&rememberMe=1";

        // DefaultHttpClient is deprecated as of 4.3 but still works. Reusing the same
        // client for every request keeps the login cookies for the bill queries.
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse loginResponse = httpClient.execute(httpGet);

        if (loginResponse.getStatusLine().getStatusCode() == 200) {
            for (Header head : loginResponse.getAllHeaders()) {
                System.out.println(head);
            }
            HttpEntity loginEntity = loginResponse.getEntity();
            String loginEntityContent = EntityUtils.toString(loginEntity);
            System.out.println("Login status: " + loginEntityContent);

            // Login succeeded
            if (loginEntityContent.contains("resultCode:\"0000\"")) {
                // Months to query, in yyyyMM form
                String months[] = new String[]{"201401", "201402", "201403", "201404", "201405"};
                for (String month : months) {
                    String billurl = "http://iservice.10010.com/ehallService/static/historyBiil/execute/YH102010002/QUERY_YH102010002.processData/QueryYH102010002_Data/" + month + "/undefined";
                    HttpPost httpPost = new HttpPost(billurl);
                    HttpResponse billresponse = httpClient.execute(httpPost);
                    if (billresponse.getStatusLine().getStatusCode() == 200) {
                        saveToLocal(billresponse.getEntity(), "chinaunicom.bill." + month + ".2.html");
                    } else {
                        // Release the connection so the next request can reuse it
                        EntityUtils.consume(billresponse.getEntity());
                    }
                }
            }
        }
    }
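Find the login URL and the parameters it expects. The mobile number and service password are not provided here.
Create a DefaultHttpClient and send the login request with GET. If the login succeeds, the response contains the result code 0000.
Then request each month's bill with HttpPost and write the response to a local file.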
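The two string-handling steps in the login flow above (assembling the query string and checking the result code) can be sketched with the JDK alone. This is only an illustration: the class and method names are hypothetical, and the response shape `resultCode:"0000"` is assumed from the `contains` check in the code above rather than taken from any API documentation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helpers for illustration; not part of the original source. */
class LoginUrlSketch {

    /** Joins the parameters into a query string, URL-encoding each name and value as UTF-8. */
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    // Assumed response fragment: resultCode:"0000"
    private static final Pattern RESULT_CODE =
            Pattern.compile("resultCode\\s*:\\s*\"(\\w+)\"");

    /** Returns the resultCode value found in the body, or null if there is none. */
    static String extractResultCode(String body) {
        Matcher m = RESULT_CODE.matcher(body);
        return m.find() ? m.group(1) : null;
    }
}
```

Passing a `LinkedHashMap` to `buildQuery` keeps the parameters in insertion order, and comparing `extractResultCode(body)` with `"0000"` gives a slightly stricter check than a plain substring match.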
    /**
     * Writes the response body to a local file.
     *
     * @param httpEntity response entity whose content is saved
     * @param filename   name of the file to create in the output directory
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        File dir = new File("/home/amosli/workspace/chinaunicom/");
        // mkdirs() also creates any missing parent directories
        if (!dir.isDirectory() && !dir.mkdirs()) {
            System.err.println("Cannot create directory " + dir);
            return;
        }
        File file = new File(dir, filename);
        // try-with-resources closes both streams even if the copy fails;
        // FileOutputStream creates the file if it does not exist yet.
        try (InputStream inputStream = httpEntity.getContent();
             FileOutputStream fileOutputStream = new FileOutputStream(file)) {
            byte[] bytes = new byte[1024];
            int length;
            // read() returns -1 at end of stream
            while ((length = inputStream.read(bytes)) != -1) {
                fileOutputStream.write(bytes, 0, length);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
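On Java 7 and later, the copy loop in saveToLocal can also be delegated to `java.nio.file.Files.copy`. The following is a minimal sketch under that assumption; the class name `SaveStreamSketch` and the method `save` are hypothetical, and the stream would come from `httpEntity.getContent()` in the real program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of the original source. */
class SaveStreamSketch {

    /**
     * Copies the whole stream to dir/filename, creating dir if needed and
     * overwriting an existing file. Returns the number of bytes written.
     */
    static long save(InputStream in, Path dir, String filename) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(filename);
        // The stream is closed when the block exits, whether or not the copy succeeds
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Unlike saveToLocal, this variant lets the IOException propagate, so the caller decides whether a failed download should stop the run.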
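If you only want to print the response rather than save it, you can use EntityUtils.toString(HttpEntity entity). The source of its overload that also takes a default charset is: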
public static String toString(
        final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
    Args.notNull(entity, "Entity");
    final InputStream instream = entity.getContent();
    if (instream == null) {
        return null;
    }
    try {
        Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                "HTTP entity too large to be buffered in memory");
        int i = (int)entity.getContentLength();
        if (i < 0) {
            i = 4096;
        }
        Charset charset = null;
        try {
            final ContentType contentType = ContentType.get(entity);
            if (contentType != null) {
                charset = contentType.getCharset();
            }
        } catch (final UnsupportedCharsetException ex) {
            throw new UnsupportedEncodingException(ex.getMessage());
        }
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = HTTP.DEF_CONTENT_CHARSET;
        }
        final Reader reader = new InputStreamReader(instream, charset);
        final CharArrayBuffer buffer = new CharArrayBuffer(i);
        final char[] tmp = new char[1024];
        int l;
        while((l = reader.read(tmp)) != -1) {
            buffer.append(tmp, 0, l);
        }
        return buffer.toString();
    } finally {
        instream.close();
    }
}
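The implementation is fairly easy to follow. The charset declared in the response's Content-Type is used first; if there is none, the defaultCharset argument is used; if that is null as well, it falls back to HTTP.DEF_CONTENT_CHARSET. So you can specify a charset or leave it out.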
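The charset fallback order in that source can be reproduced in a few lines of plain Java. This is a simplified sketch, not the library's code: the class name is hypothetical, the Content-Type value is parsed with a simple regular expression instead of HttpClient's ContentType class, and ISO-8859-1 stands in for HTTP.DEF_CONTENT_CHARSET (which is ISO-8859-1 in HttpClient 4.3).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HttpClient. */
class CharsetFallbackSketch {

    // Matches the charset parameter of a Content-Type value, e.g. "text/html; charset=UTF-8"
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Picks the charset: Content-Type first, then the caller's default, then ISO-8859-1. */
    static Charset resolve(String contentType, Charset defaultCharset) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                // Charset.forName throws an unchecked exception for unknown names
                return Charset.forName(m.group(1));
            }
        }
        if (defaultCharset != null) {
            return defaultCharset;
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

The practical consequence is that a page served without a charset parameter is decoded as ISO-8859-1 unless the caller passes a default, which is worth remembering when a Chinese page comes back garbled.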
2. Log in with a captcha and fetch the basic information
package com.amos;

import org.apache.http.HttpResponse;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.*;
import org.apache.http.util.EntityUtils;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URLEncoder;

/**
 * Created by amosli on 14-6-22.
 */
public class LoginWithCaptcha {

    public static void main(String args[]) throws Exception {
        // URL that generates the captcha image
        String createCaptchaUrl = "http://uac.10010.com/portal/Service/CreateImage";
        String name = "your China Unicom mobile number";
        String pwd = "your service password";

        // Client for the captcha requests; its cookie store receives the uacverifykey cookie
        HttpClient captchaClient = new DefaultHttpClient();

        // Client for the login and data requests; custom cookies can be added to this store
        CookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient loginClient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();

        // Fetch the captcha image and save it to disk
        HttpGet captchaHttpGet = new HttpGet(createCaptchaUrl);
        HttpResponse captchaResponse = captchaClient.execute(captchaHttpGet);
        if (captchaResponse.getStatusLine().getStatusCode() == 200) {
            LoginChinaUnicom.saveToLocal(captchaResponse.getEntity(),
                    "chinaunicom.capthca." + System.currentTimeMillis());
        } else {
            // Release the connection so the next request can reuse it
            EntityUtils.consume(captchaResponse.getEntity());
        }

        // Read the captcha typed by the user and verify it, repeating until it is accepted
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
        String captcha;
        String uvc = null;
        String verifyResult;
        do {
            System.out.println("Enter the captcha:");
            captcha = bufferedReader.readLine();
            // Alternative: new Scanner(System.in).next()
            if (captcha == null) {
                throw new IllegalStateException("No captcha entered");
            }

            String verifyCaptchaUrl = "http://uac.10010.com/portal/Service/CtaIdyChk?verifyCode="
                    + URLEncoder.encode(captcha.trim(), "UTF-8") + "&verifyType=1";
            HttpResponse verifyResponse = captchaClient.execute(new HttpGet(verifyCaptchaUrl));
            verifyResult = EntityUtils.toString(verifyResponse.getEntity());

            // The uvc value is delivered in the uacverifykey cookie
            AbstractHttpClient abstractHttpClient = (AbstractHttpClient) captchaClient;
            for (Cookie cookie : abstractHttpClient.getCookieStore().getCookies()) {
                System.out.println(cookie.getName() + ":" + cookie.getValue());
                if (cookie.getName().equals("uacverifykey")) {
                    uvc = cookie.getValue();
                }
            }
        } while (!verifyResult.contains("true"));

        // Log in
        String loginurl = "https://uac.10010.com/portal/Service/MallLogin?userName="
                + URLEncoder.encode(name, "UTF-8") + "&password=" + URLEncoder.encode(pwd, "UTF-8")
                + "&pwdType=01&productType=01&verifyCode=" + URLEncoder.encode(captcha.trim(), "UTF-8")
                + "&redirectType=03&uvc=" + uvc;
        HttpGet loginGet = new HttpGet(loginurl);
        CloseableHttpResponse loginResponse = loginClient.execute(loginGet);
        System.out.print("loginResponse:" + EntityUtils.toString(loginResponse.getEntity()));

        // Fetch the basic account information
        HttpPost basicHttpPost = new HttpPost("http://iservice.10010.com/ehallService/static/acctBalance/execute/YH102010005/QUERY_AcctBalance.processData/Result");
        LoginChinaUnicom.saveToLocal(loginClient.execute(basicHttpPost).getEntity(), "chinaunicom.basic.html");

        loginClient.close();
    }
}
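There are two difficulties here: the captcha and the uvc code.
The captcha is saved to a local file and then typed in by hand, so it is fairly easy to deal with.
The uvc code is important. It is stored in a cookie (uacverifykey). I searched online for a long time without finding how to work with cookies in HttpClient, and only found the answer by reading its source: cast the client to AbstractHttpClient and call getCookieStore().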
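Stripped of the HttpClient types, finding the uvc value is a lookup of a cookie by name. The sketch below shows that lookup with the JDK's own `java.net.HttpCookie` parser, so it runs without a live session. The class and method names are hypothetical and the header values in the usage note are made up; only the cookie name `uacverifykey` comes from the code above.

```java
import java.net.HttpCookie;
import java.util.List;

/** Hypothetical helper for illustration; not part of the original source. */
class CookieLookupSketch {

    /**
     * Returns the value of the first cookie with the given name found in the
     * Set-Cookie header values, or null if no header carries it.
     */
    static String findCookieValue(List<String> setCookieHeaders, String name) {
        for (String header : setCookieHeaders) {
            // HttpCookie.parse accepts a header value such as "key=value; Path=/"
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                if (cookie.getName().equals(name)) {
                    return cookie.getValue();
                }
            }
        }
        return null;
    }
}
```

For example, `findCookieValue(Arrays.asList("JSESSIONID=xyz; Path=/", "uacverifykey=abc123; Path=/"), "uacverifykey")` yields `abc123`. With HttpClient 4.3 the same loop can be run over `cookieStore.getCookies()` on the BasicCookieStore that was passed to `setDefaultCookieStore`, which avoids the cast to AbstractHttpClient.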
3. Results
Billing data (this is JSON, so it may not be very easy to read):
4. Source code for this post
https://github.com/amosli/crawl/tree/httpclient