An aside: I originally wanted to write this crawler in Java and publish it on cnblogs, but every Java attempt failed. Then I saw someone else do it in Python, asked them about the key points, and discovered that the connection part took only 19 lines in Python!!! Enough said; here are the requirements and the code.
First, the link and page to crawl: http://211.81.31.34/uhtbin/cgisirsi/x/0/0/57/49?user_id=LIBSCI_ENGI&password=LIBSC
After logging in, go to My Account -> Checkouts, Holds and Requests -> Checkout History; that page holds everything we want to scrape.
Then store each record's title, author, checkout date, return date and call number from the checkout history in a MongoDB database. That is the entire requirement for this crawler.
Let's get started.
Software versions:
Python 2.7.11
MongoDB 3.2.1
PyCharm 5.0.4
MongoDB Management Studio 1.9.3
360 Speed Browser (didn't bother checking the version)
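The only third-party packages the script relies on are BeautifulSoup 4 and PyMongo; assuming pip is available for this Python 2.7 install, something like the following should cover them:

pip install beautifulsoup4 pymongo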
I. The login module
Logging in from Python is usually handled with the urllib and urllib2 modules. First, look at the page source:
1 <form name="loginform" method="post" action="/uhtbin/cgisirsi/?ps=nPdFje4RP9/理工大學館/125620449/303"> 2 <!-- Copyright (c) 2004, Sirsi Corporation - myProfile login or view myFavorites --> 3 <!-- Copyright (c) 1998 - 2003, Sirsi Corporation - Sets the default values for USER_ID, ALT_ID, and PIN prompts. - The USER_ID, ALT_ID, and PIN page variables will be returned. --> 4 5 <!-- If the user has not logged in, first try to default to the ID based on the IP address - the $UO and $Uf will be set. If that fails, then default to the IDs in the config file. If the user has already logged in, default to the logged in user's IDs, unless the user is a shared login. --> 6 7 8 9 <!-- only user ID is used if both on --> 10 <div class="user_name"> 11 <label for="user_id">借閱證號碼:</label> 12 <input class="user_name_input" type="text" name="user_id" id="user_id" maxlength="20" value=""/> 13 </div> 14 15 16 <div class="password"> 17 <label for="password">個人密碼:</label> 18 <input class="password_input" type="password" name="password" id="password" maxlength="20" value=""/> 19 </div> 20 21 22 <input type="submit" value="用戶登錄" class="login_button"/>
In the form element we find the action attribute and the post method, but it turns out the action URL is not fixed: it changes randomly, and after a refresh it looks like this instead:
1 <form name="loginform" method="post" action="/uhtbin/cgisirsi/?ps=1Nimt5K1Lt/理工大學館/202330426/303">
The string between /?ps= and the next / changes on every request (the ps token), so we need another module, BeautifulSoup, to fetch the current link at runtime:
1 url = "http://211.81.31.34/uhtbin/cgisirsi/x/0/0/57/49?user_id=LIBSCI_ENGI&password=LIBSC" 2 res = urllib2.urlopen(url).read() 3 soup = BeautifulSoup(res, "html.parser") 4 login_url = "http://211.81.31.34" + soup.findAll("form")[1]['action'].encode("utf8")
That is all it takes to simulate the login with urllib and urllib2. Next, a few BeautifulSoup methods worth listing, since the later HTML parsing relies on them (a short demo follows the list):
1. soup.contents — returns a tag's child nodes as a list
2. soup.children — a generator over a tag's child nodes, convenient for looping
3. soup.parent — gets an element's parent node
4. soup.find_all(name, attrs, recursive, text, **kwargs) — searches all descendant tags of the current tag and returns the ones that match the filters
5. soup.find_all("a", class_="xx") — searching by CSS class (note the trailing underscore in class_, since class is a reserved word in Python)
6. soup.find(name, attrs, recursive, text, **kwargs) — behaves like find_all with limit=1, except that it returns the single matching tag directly rather than a list
For more, see the official BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
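A quick demo of those calls on a throwaway snippet (purely illustrative; this HTML is made up and has nothing to do with the library site):

from bs4 import BeautifulSoup

html = '<div id="box"><a class="xx" href="/a">one</a><a class="xx" href="/b">two</a></div>'
soup = BeautifulSoup(html, "html.parser")

box = soup.find(id="box")                 # find: the first (and only) matching tag
print box.contents                        # the two <a> tags as a list
for child in box.children:                # children: a generator over the same nodes
    print child.getText()                 # "one", then "two"
links = soup.find_all("a", class_="xx")   # search by CSS class
print len(links)                          # 2
print links[0].parent['id']               # the parent of the first <a> is the div: "box"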
II. Parsing the retrieved HTML
First, a look at the shape of the HTML we need to handle:
1 <tbody id="tblSuspensions"> 2 <!-- OCLN changed Listcode to Le to support charge history --> 3 <!-- SIRSI_List Listcode="LN" --> 4 5 <tr> 6 <td class="accountstyle" align="left"> 7 <!-- SIRSI_Conditional IF List_DC_Exists="IB" AND NOT List_DC_Comp="IB^" --> 8 <!-- Start title here --> 9 <!-- Title --> 10 做人要低調,說話要幽默 孫郡鎧編著 11 </td> 12 <td class="accountstyle author" align="left"> 13 <!-- Author --> 14 孫郡鎧 編著 15 </td> 16 <td class="accountstyle due_date" align="center"> 17 <!-- Date Charged --> 18 2015/9/10,16:16 19 </td> 20 <td class="accountstyle due_date" align="left"> 21 <!-- Date Returned --> 22 2015/9/23,15:15 23 </td> 24 25 <td class="accountstyle author" align="center"> 26 <!-- Call Number --> 27 B821-49/S65 28 </td> 29 30 </tr> 31 32 <tr> 33 <td class="accountstyle" align="left"> 34 <!-- SIRSI_Conditional IF List_DC_Exists="IB" AND NOT List_DC_Comp="IB^" --> 35 <!-- Start title here --> 36 <!-- Title --> 37 我用一生去尋找 潘石屹的人生哲學 潘石屹著 38 </td> 39 <td class="accountstyle author" align="left"> 40 <!-- Author --> 41 潘石屹, 1963- 著 42 </td> 43 <td class="accountstyle due_date" align="center"> 44 <!-- Date Charged --> 45 2015/9/10,16:16 46 </td> 47 <td class="accountstyle due_date" align="left"> 48 <!-- Date Returned --> 49 2015/9/25,15:23 50 </td> 51 52 <td class="accountstyle author" align="center"> 53 <!-- Call Number --> 54 B821-49/P89 55 </td> 56 57 </tr>
Out of all that markup, note this line in particular:
<tbody id="tblSuspensions">
This tag marks where the borrowed-book information starts, so we walk every child node of the element with id="tblSuspensions" to collect its contents:
for i, k in enumerate(BeautifulSoup(detail, "html.parser").find(id='tblSuspensions').children):
    # print i, k
    if isinstance(k, element.Tag):
        bookhtml.append(k)
        # print type(k)
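The isinstance check simply skips the blank NavigableString nodes between the rows. An equivalent and slightly more direct route (not what this script does, just an alternative) is to ask the tbody for its <tr> rows explicitly:

from bs4 import BeautifulSoup

# detail holds the HTML of the checkout-history page fetched earlier in the script
rows = BeautifulSoup(detail, "html.parser").find(id='tblSuspensions').find_all('tr')
# rows then plays the same role as bookhtml above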
III. Extracting the fields we need
This step is straightforward; BeautifulSoup from bs4 pulls the values out easily:
for i in bookhtml:
    # print i
    name = i.find(class_="accountstyle").getText()
    author = i.find(class_="accountstyle author", align="left").getText()
    Date_Charged = i.find(class_="accountstyle due_date", align="center").getText()
    Date_Returned = i.find(class_="accountstyle due_date", align="left").getText()
    bookid = i.find(class_="accountstyle author", align="center").getText()
    bookinfo.append(
        [name.strip(), author.strip(), Date_Charged.strip(), Date_Returned.strip(), bookid.strip()])
Here getText() extracts the text content of each cell; strip() then removes leading and trailing whitespace while keeping the spaces in between, e.g. with s = " a a ", s.strip() returns "a a".
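As a quick illustration on the call-number cell from the sample HTML above (just a demo, not part of the script):

from bs4 import BeautifulSoup

td = '<td class="accountstyle author" align="center"> <!-- Call Number --> B821-49/S65 </td>'
cell = BeautifulSoup(td, "html.parser").find(class_="accountstyle")
print cell.getText()          # the text comes back padded with whitespace
print cell.getText().strip()  # B821-49/S65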
IV. Connecting to the database
NoSQL is supposedly going to be big, so I picked MongoDB purely for the novelty, and it turned out to be a real hassle; the installation steps are written up in my previous post.
1. Import the Python module for connecting to MongoDB
import pymongo
2. Create the connection between Python and MongoDB:
# connection database
conn = pymongo.MongoClient("mongodb://root:root@localhost:27017")
db = conn.book
collection = db.book
3. Save the extracted content to the database:
1 user = {"_id": xuehao_ben, 2 "Bookname": name.strip(), 3 "Author": author.strip(), 4 "Rent_Day": Date_Charged.strip(), 5 "Return_Day": Date_Returned.strip()} 6 j += 1 7 collection.insert(user)
That basically wraps up the crawler, but stopping there would be pointless; the real point is what follows.
V. Fetching the borrowing records of the whole school
The library accounts at our school all share the same default password; presumably nobody has been bored enough to change theirs, and hardly anyone has even used this site to look up their own borrowing history. So a loop over student numbers is all it takes to pull the records for the entire school. It was not quite that simple, though: str(0001) is supposed to turn the int into a string, but in the cmd Python interpreter it throws an error (at the 1), and in PyCharm the three leading zeros are silently dropped, so I fell back to the brute-force approach of four nested for loops (a cleaner zero-padding alternative is sketched right after the listing). Here is the complete code:
# encoding=utf8
import urllib2
import urllib
import pymongo
import socket

from bs4 import BeautifulSoup
from bs4 import element

# connection database
conn = pymongo.MongoClient("mongodb://root:root@localhost:27017")
db = conn.book
collection = db.book


# the crawl routine, called once per student number
def xunhuan(xuehao):
    try:
        socket.setdefaulttimeout(60)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("127.0.0.1", 80))
        url = "http://211.81.31.34/uhtbin/cgisirsi/x/0/0/57/49?user_id=LIBSCI_ENGI&password=LIBSC"
        res = urllib2.urlopen(url).read()
        soup = BeautifulSoup(res, "html.parser")
        login_url = "http://211.81.31.34" + soup.findAll("form")[1]['action'].encode("utf8")
        params = {
            "user_id": "賬號前綴你猜你猜" + xuehao,  # account prefix redacted by the author
            "password": "密碼你猜猜"                 # shared default password, also redacted
        }
        print params
        params = urllib.urlencode(params)
        req = urllib2.Request(login_url, params)
        lianjie = urllib2.urlopen(req)
        # print lianjie
        jieyue_res = lianjie.read()
        # print jieyue_res  # HTML of the post-login home page
        houmian = BeautifulSoup(jieyue_res, "html.parser").find_all('a', class_='rootbar')[1]['href']
        # print houmian
        houmian = urllib.quote(houmian.encode('utf8'))
        url_myaccount = "http://211.81.31.34" + houmian
        # print url_myaccount
        # print urllib.urlencode(BeautifulSoup(jieyue_res, "html.parser").find_all('a', class_='rootbar')[0]['href'])

        lianjie2 = urllib.urlopen(url_myaccount)
        myaccounthtml = lianjie2.read()
        detail_url = ''
        # print (BeautifulSoup(myaccounthtml).find_all('ul', class_='gatelist_table')[0]).children
        print "login done, starting to scrape"
        for i in (BeautifulSoup(myaccounthtml, "html.parser").find_all('ul', class_='gatelist_table')[0]).children:
            if isinstance(i, element.NavigableString):
                continue
            for ii in i.children:
                detail_url = ii['href']
            break
        detail_url = "http://211.81.31.34" + urllib.quote(detail_url.encode('utf8'))
        detail = urllib.urlopen(detail_url).read()
        # print detail
        bookhtml = []
        bookinfo = []

        # handle users who have never borrowed anything
        try:
            for i, k in enumerate(BeautifulSoup(detail, "html.parser").find(id='tblSuspensions').children):
                # print i, k
                if isinstance(k, element.Tag):
                    bookhtml.append(k)
                    # print type(k)
            print "look here!!!"
            j = 1
            for i in bookhtml:
                # print i
                name = i.find(class_="accountstyle").getText()
                author = i.find(class_="accountstyle author", align="left").getText()
                Date_Charged = i.find(class_="accountstyle due_date", align="center").getText()
                Date_Returned = i.find(class_="accountstyle due_date", align="left").getText()
                bookid = i.find(class_="accountstyle author", align="center").getText()
                bookinfo.append(
                    [name.strip(), author.strip(), Date_Charged.strip(), Date_Returned.strip(), bookid.strip()])
                xuehao_ben = str(xuehao) + str("_") + str(j)
                user = {"_id": xuehao_ben,
                        "Bookname": name.strip(),
                        "Author": author.strip(),
                        "Rent_Day": Date_Charged.strip(),
                        "Return_Day": Date_Returned.strip()}
                j += 1
                collection.insert(user)
        except Exception, ee:
            print ee
            print "this user has never borrowed a book"
            # the four field values together read "this person has never borrowed a book"
            user = {"_id": xuehao,
                    "Bookname": "此人",
                    "Author": "沒有",
                    "Rent_Day": "借過",
                    "Return_Day": "書"}
            collection.insert(user)

        print "********" + str(xuehao) + "_Finish" + "**********"
    except Exception, e:
        s.close()
        print e
        print "socket timed out, retrying"
        xunhuan(xuehao)


# with contextlib.closing(urllib.urlopen(req)) as A:
#     print A
#     print xuehao
#     print req

for i1 in range(0, 6):
    for i2 in range(0, 9):
        for i3 in range(0, 9):
            for i4 in range(0, 9):
                xueha = str(i1) + str(i2) + str(i3) + str(i4)
                chushi = '0000'
                if chushi == xueha:
                    print "======= crawler start =========="
                else:
                    print xueha + "begin"
                    xunhuan(xueha)

conn.close()
print "End!!!"
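As a footnote to the four nested loops: zero-padding the counter sidesteps the str(0001) problem entirely. A small sketch (the upper bound here is illustrative; xunhuan is the function defined above):

# pad the counter to four digits instead of nesting four for loops
for n in range(1, 6000):
    xueha = "%04d" % n        # 1 -> "0001", 37 -> "0037", ...
    # str(n).zfill(4) would do the same
    print xueha + " begin"
    xunhuan(xueha)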
Below is part of what MongoDB Management Studio shows for the collection:

To wrap up: this crawl ran into plenty of problems and I asked a lot of people for help, yet the result is still not ideal. Even with the try/except in place I keep getting error 10060, a connection timeout (I can only blame the school's server, TT). You may also notice that the field order differs between documents in the database; I have not figured out why yet, so explanations from fellow cnblogs readers are welcome (づ ̄ 3 ̄)づ. Thanks to everyone who helped with this crawl (*^__^*); onwards and upwards ↖(^ω^)↗