Scraping Guangzhou Library borrowing records: using HttpClient

Reprinted article, 2017-10-16 09:17. Tags: httpClient / crawler

You are welcome to visit my personal website; a star for the site's source code on GitHub would be even better.

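While building my own website, I wanted to keep a record of the books I have read and borrowed. In college I scraped my borrowing records from the university library, but I later deleted the database (==) and only one screenshot survives. So, to treasure my reading days, I now record my borrowing history from the Guangzhou Library. The code runs on my server on a schedule, and the results are shown on the "About me" page of my website. It uses HttpClient for the requests, MySQL for storage, and Spring's built-in scheduling for the timed job. The scraping process is described below.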

1. Page navigation flow

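Normally you open the home page http://www.gzlib.gov.cn/, click through to the login page, and enter your account and password. It looks unremarkable, but simulating the login takes more than sending a single POST to the login URL: do only that, and the response either bounces back to the login page or redirects endlessly.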

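The reason is that the site uses single sign-on (SSO): the library's site is www.gzlib.gov.cn, while login happens on login.gzlib.gov.cn (the original post showed this in a screenshot). The principle has been explained well by many people online; see, for example, the article "SSO單點登錄" (SSO single sign-on).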

2. Approach

The fix is not hard: first simulate a visit to the home page to obtain the library's session. In Python this is session.get("http://www.gzlib.gov.cn/"); printing the cookies afterwards gives:

[<Cookie JSESSIONID=19E2DDED4FE7756AA9161A52737D6B8E for .gzlib.gov.cn/>, <Cookie JSESSIONID=19E2DDED4FE7756AA9161A52737D6B8E for www.gzlib.gov.cn/>, <Cookie clientlanguage=zh_CN for www.gzlib.gov.cn/>]
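The same "visit the home page first" idea can be sketched with only the Python standard library. This is a minimal illustration, not the author's code: the helper names make_session and visit_home are made up here, and the only thing taken from the article is the home page URL.

```python
import http.cookiejar
import urllib.request

HOME_URL = "http://www.gzlib.gov.cn/"  # home page URL from the article


def make_session():
    """Return an opener and the cookie jar it stores cookies in."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def visit_home(opener, url=HOME_URL, timeout=10):
    """Request the home page so the server can set its session cookie."""
    with opener.open(url, timeout=timeout) as response:
        return response.status


# Building the session makes no network request; the jar starts out empty.
opener, jar = make_session()
print(len(jar))
```

Calling visit_home(opener) and then printing list(jar) should show the JSESSIONID cookie, provided the site still behaves as described. Every later request has to go through the same opener so that the cookies are sent back.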

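The overall login-and-scrape flow is as follows:

That is:
(1) Visit the Guangzhou Library home page to obtain that site's session, then open the login page and parse its HTML to get lt (a custom parameter, similar to a verification code) together with the session of the SSO server.
(2) Send a POST request to the target server (the SSO server). The form parameters are username, password, event (default submit; the Python code at the end sends it as _eventId), and lt (the custom request parameter). The server also checks these request headers: Referer (the originating page), Host (the host information), and Content-Type (the body type).
(3) Print the response and search it for your own name. If it is there, the login succeeded; otherwise you were redirected back to the login page.
(4) Use the cookies to access other pages. Here the goal is the borrowing history, so the page requested is http://www.gzlib.gov.cn/member/historyLoanList.jspx.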
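To make step (2) concrete, here is a small sketch of how the form body and headers could be assembled before sending. The function name build_login_request and the sample values are invented for illustration; the field names and the Referer value are the ones used in the Python code at the end of this article. The Host header is left to the HTTP client, which normally derives it from the URL.

```python
from urllib.parse import urlencode

# Referer value taken from the Python code at the end of the article.
DEFAULT_REFERER = "http://www.gzlib.gov.cn/member/historyLoanList.jspx"


def build_login_request(username, password, lt, referer=DEFAULT_REFERER):
    """Return (body, headers) for the login POST; nothing is sent here."""
    form = {
        "username": username,
        "password": password,
        "_eventId": "submit",  # the "event" parameter, default submit
        "lt": lt,              # token scraped from the login form
    }
    headers = {
        "Referer": referer,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urlencode(form), headers


# Placeholder credentials and token, for illustration only.
body, headers = build_login_request("user", "secret", "LT-123-example")
print(body)
```

Because lt is different on every load of the login page, the body has to be rebuilt for each login attempt rather than cached.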

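That covers the basic simulated login and fetch. What remains is parsing the page HTML to extract the book title, the book's index number, and so on, wrapping each record in a JavaBean, and saving it to the database. (Deduplication is not implemented; I am not sure which approach is best.)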
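One possible answer to the deduplication question is to let the database reject repeats through a unique key. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL, where a UNIQUE index combined with INSERT IGNORE plays the same role. The table layout, the column names, and the sample index numbers and dates are all assumptions, since the article does not show the real schema.

```python
import sqlite3


def save_loans(conn, loans):
    """Insert (title, index_no, borrowed_on) rows; return how many were new."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS loan (
               title       TEXT NOT NULL,
               index_no    TEXT NOT NULL,
               borrowed_on TEXT NOT NULL,
               UNIQUE (title, index_no, borrowed_on)
           )"""
    )
    before = conn.execute("SELECT COUNT(*) FROM loan").fetchone()[0]
    # Rows that violate the UNIQUE constraint are skipped silently.
    conn.executemany(
        "INSERT OR IGNORE INTO loan (title, index_no, borrowed_on) VALUES (?, ?, ?)",
        loans,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM loan").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
rows = [
    ("深入理解Hadoop", "IDX-0001", "2017-01-05"),  # made-up index and date
    ("深入理解Hadoop", "IDX-0001", "2017-03-20"),  # same book, later loan
]
print(save_loans(conn, rows))  # first scrape
print(save_loans(conn, rows))  # re-scrape of the same page
```

The key includes the loan date because the same title legitimately shows up several times in the borrowing history; deduplicating on the title alone would discard real loans.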

3. Code

3.1 In Java, HTTP requests are most commonly sent with HttpClient. First, add the HttpClient dependencies:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.7</version>
</dependency>

3.2 Declare the global variables, including the context manager (context).

public class LibraryUtil {
    private static CloseableHttpClient httpClient = null;
    private static HttpClientContext context = null;
    private static CookieStore cookieStore = null;
    static {
        init();
    }
    private static void init() {
        context = HttpClientContext.create();
        cookieStore = new BasicCookieStore();
        // Configure timeouts (in milliseconds)
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(12000).setSocketTimeout(6000)
                .setConnectionRequestTimeout(6000).build();
        // Set the default redirect strategy and the cookie store
        httpClient = HttpClientBuilder.create()
                .setKeepAliveStrategy(new DefaultConnectionKeepAliveStrategy())
                .setRedirectStrategy(new DefaultRedirectStrategy()).setDefaultRequestConfig(requestConfig)
                .setDefaultCookieStore(cookieStore).build();
    }
    ...

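3.3 Declare a get function. The header argument allows custom headers; they are not needed here, but it is kept so the function stays general-purpose.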

    public static CloseableHttpResponse get(String url, Header[] header) throws IOException {
        HttpGet httpget = new HttpGet(url);
        if (header != null && header.length > 0) {
            httpget.setHeaders(header);
        }
        CloseableHttpResponse response = httpClient.execute(httpget, context); // context stores the request context
        return response;
    }

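3.4 Visit the home page to obtain a session. The server keeps the conversation in a session, and the local browser keeps the matching cookie. As long as the local browser has not logged out, reusing its cookie would also work, but since the goal here is to simulate the login itself, that approach is not covered.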

CloseableHttpResponse homeResponse = get("http://www.gzlib.gov.cn/", null);
homeResponse.close();

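At this point, printing the cookies gives the following (the cookie dumps in this section are in the format of Python's requests cookie jar):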

<RequestsCookieJar[
<Cookie JSESSIONID=54702A995ECFC684B192A86467066F20 for .gzlib.gov.cn/>, 
<Cookie JSESSIONID=54702A995ECFC684B192A86467066F20 for www.gzlib.gov.cn/>, 
<Cookie clientlanguage=zh_CN for www.gzlib.gov.cn/>]>

3.5 Visit the login page to pick up the SSO server's cookie, then parse the page to get the custom parameter lt. Parsing uses Jsoup, whose syntax resembles Python's BeautifulSoup.

String loginURL = "http://login.gzlib.gov.cn/sso-server/login?service=http%3A%2F%2Fwww.gzlib.gov.cn%2Flogin.jspx%3FreturnUrl%3Dhttp%253A%252F%252Fwww.gzlib.gov.cn%252F%26locale%3Dzh_CN&appId=www.gzlib.gov.cn&locale=zh_CN";
CloseableHttpResponse loginGetResponse = get(loginURL, null);
String content = toString(loginGetResponse);
String lt = Jsoup.parse(content).select("form").select("input[name=lt]").attr("value");
loginGetResponse.close();
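For comparison, the lt extraction can also be done without any third-party parser. The sketch below uses Python's standard html.parser; the sample form is a made-up stand-in for the real login page, and only the field name lt comes from the article.

```python
from html.parser import HTMLParser


class LtFinder(HTMLParser):
    """Remember the value of the first <input name="lt"> inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.lt = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form and self.lt is None:
            attr = dict(attrs)
            if attr.get("name") == "lt":
                self.lt = attr.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_lt(html):
    """Return the lt token from the login page HTML, or None if absent."""
    finder = LtFinder()
    finder.feed(html)
    return finder.lt


# Hypothetical login form; the real page's markup may differ.
sample = """
<form action="/sso-server/login" method="post">
  <input type="text" name="username">
  <input type="hidden" name="lt" value="LT-123-example" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""
print(extract_lt(sample))
```

A None result is a useful signal that the login page was not actually returned, for example because the request was redirected somewhere else.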

Checking the cookies again shows one more entry (for www.gzlib.gov.cn/sso-server):

<RequestsCookieJar[
<Cookie JSESSIONID=54702A995ECFC684B192A86467066F20 for .gzlib.gov.cn/>, 
<Cookie JSESSIONID=54702A995ECFC684B192A86467066F20 for www.gzlib.gov.cn/>, 
<Cookie clientlanguage=zh_CN for www.gzlib.gov.cn/>, 
<Cookie JSESSIONID=9918DDF929757B244456D4ECD2DAB2CB for www.gzlib.gov.cn/sso-server/>]>

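3.6 Declare a post function for submitting POST requests. As the code shows, the parameters are sent as a URL-encoded form body in UTF-8.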

    public static CloseableHttpResponse postParam(String url, String parameters, Header[] headers)
            throws IOException {
        System.out.println(parameters); // debug output only: this also prints the password
        HttpPost httpPost = new HttpPost(url);
        if (headers != null && headers.length > 0) {
            for (Header header : headers) {
                httpPost.addHeader(header);
            }
        }
        List<NameValuePair> nvps = toNameValuePairList(parameters);
        httpPost.setEntity(new UrlEncodedFormEntity(nvps, "UTF-8"));
        CloseableHttpResponse response = httpClient.execute(httpPost, context);
        return response;
    }

3.7 After a successful login, if no returnUrl was specified, i.e. the login URL was just http://login.gzlib.gov.cn/sso-server/login, the server only shows a login-success page:

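The back end presumably uses the service parameter to control the redirect; to change the page reached after login, modify the URL that follows service. Here it is left in its original form. The cookies now look like this: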

<RequestsCookieJar[
<Cookie JSESSIONID=54702A995ECFC684B192A86467066F20 for .gzlib.gov.cn/>, 
<Cookie JSESSIONID=54702A995ECFC684B192A86467066F20 for www.gzlib.gov.cn/>, 
<Cookie clientlanguage=zh_CN for www.gzlib.gov.cn/>, 
<Cookie CASTGC=TGT-198235-zkocmYyBP6c9G7EXjKyzgKR7I40QI4JBalTkrnr9U6ZkxuP6Tn for www.gzlib.gov.cn/sso-server>, 
<Cookie JSESSIONID=9918DDF929757B244456D4ECD2DAB2CB for www.gzlib.gov.cn/sso-server/>]>

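The presence of CASTGC shows that the login succeeded, and this cookie can now be used to access the other Guangzhou Library pages. In Python the session jumps straight to the target page. With HttpClient in Java it does not: DefaultRedirectStrategy follows redirects automatically only for GET and HEAD requests, so the POST comes back as a 302 redirect. The original post showed the printed headers in a screenshot here: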

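A closer look at the redirect URL shows that the server has in effect issued a general-purpose ticket: the ticket can be used to access any page, and returnUrl is the page to return to. Here we simply request the redirect URL directly.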

Header header = response.getHeaders("Location")[0];
CloseableHttpResponse home = get(header.getValue(), null);

Printing the page then shows the home page that the login redirects back to.
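To see what such a redirect URL carries, it helps to split it into its parts. The sketch below does that with urllib.parse. The Location value is hypothetical: it is built from the service URL used in section 3.5 plus an invented ticket string, because the real header appeared only in the missing screenshot.

```python
from urllib.parse import urlsplit, parse_qs


def inspect_redirect(location):
    """Split a redirect URL into its target page, ticket and returnUrl."""
    parts = urlsplit(location)
    query = parse_qs(parts.query)  # values are percent-decoded lists
    return {
        "page": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "ticket": query.get("ticket", [None])[0],
        "returnUrl": query.get("returnUrl", [None])[0],
    }


# Hypothetical Location header value; the ticket string is invented.
location = (
    "http://www.gzlib.gov.cn/login.jspx"
    "?returnUrl=http%3A%2F%2Fwww.gzlib.gov.cn%2F&locale=zh_CN"
    "&ticket=ST-0000-example"
)
info = inspect_redirect(location)
print(info["page"])
print(info["ticket"])
print(info["returnUrl"])
```

If the ticket comes back as None, the login most likely failed and the server redirected to the login page instead.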

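3.8 Parsing the HTML
With the session obtained and the redirect back to the home page done, request the borrowing-history page and parse the HTML of the result. The Python version uses BeautifulSoup, which is simple and practical; in Java, Jsoup is a good choice too.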

        String html = getHTML();
        Element element = Jsoup.parse(html).select("table.jieyue-table").get(0).select("tbody").get(0);
        Elements trs = element.select("tr");
        for (int i = 0; i < trs.size(); i++) {
            Elements tds = trs.get(i).select("td");
            System.out.println(tds.get(1).text());
        }

Output:

企業IT架構轉型之道
大話Java性能優化
深入理解Hadoop
大話Java性能優化
Java EE開發的顛覆者：Spring Boot實戰
大型網站技術架構：核心原理與案例分析
Java性能權威指南
Akka入門與實踐
高性能網站建設進階指南：Web開發者性能優化最佳實踐：Performance best practices for Web developers
Java EE開發的顛覆者：Spring Boot實戰
深入理解Hadoop
大話Java性能優化
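The same table walk can be sketched with Python's standard html.parser. The class name jieyue-table and the title's position in the second cell come from the Java code above; the sample markup itself is an assumption about how the page might look.

```python
from html.parser import HTMLParser


class LoanTableParser(HTMLParser):
    """Collect the <td> texts of each row in <table class="jieyue-table">."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "jieyue-table" in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def extract_titles(html, column=1):
    """Return the text of the given column for every row that has it."""
    parser = LoanTableParser()
    parser.feed(html)
    return [row[column] for row in parser.rows if len(row) > column]


# Made-up sample page; only the table class is taken from the article.
sample = """
<table class="jieyue-table">
  <thead><tr><th>#</th><th>Title</th></tr></thead>
  <tbody>
    <tr><td>1</td><td><a href="#">深入理解Hadoop</a></td></tr>
    <tr><td>2</td><td>Akka入門與實踐</td></tr>
  </tbody>
</table>
"""
print(extract_titles(sample))
```

Header rows contain only th cells, so they produce empty rows that the length check filters out. This sketch does not handle nested tables.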


Summary

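The code is now integrated into my personal website and scrapes once a day, but plenty is still missing (pagination, deduplication, and so on). If you are interested, take a look at the source; help improving it would be even better. Thanks♪(･ω･)ﾉ. The whole Java code is close to 250 lines (comments included, of course), whereas the Python version takes only about 25 lines =w=, so here is the Python source. You are also welcome to visit my personal website and to give it a star.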

import urllib.parse
import requests
from bs4 import BeautifulSoup

session = requests.session()
session.get("http://www.gzlib.gov.cn/")
session.headers.update(
    {"Referer": "http://www.gzlib.gov.cn/member/historyLoanList.jspx",
     "origin": "http://login.gzlib.gov.cn",
     'Content-Type': 'application/x-www-form-urlencoded',
     'host': 'www.gzlib.gov.cn',
     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
     }
)
baseURL = "http://login.gzlib.gov.cn/sso-server/login"
soup = BeautifulSoup(session.get(baseURL).text, "html.parser")
lt = soup.select("form")[0].find(attrs={'name': 'lt'})['value']
postdict = {"username": "your ID card number",
            "password": "your password (default: last 6 digits of the ID card number)",
            "_eventId": "submit",
            "lt": lt
            }
postdata = urllib.parse.urlencode(postdict)
session.post(baseURL, postdata)
print(session.get("http://www.gzlib.gov.cn/member/historyLoanList.jspx").text)
