Web Crawlers (2): Exception Handling

Reposted article, 2016-06-22 19:13. Tags: Crawlers / Python Web Crawlers / Python

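The previous section briefly covered the preparation needed for learning web crawling and walked through a simple page fetch as an example. The network is messy, though, and a request to a website will not always succeed. The crawler therefore needs to handle the exceptional cases that arise during crawling; otherwise it will stop with an error the first time one occurs.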

Let's look at the exceptions that urlopen can raise:

html = urlopen("http://www.heibanke.com/lesson/crawler_ex00/")

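Two kinds of failure are likely on this line:

1. The page does not exist on the server (or an error occurs while retrieving it)
2. The server does not exist
In the first case the server returns an HTTP error, and urlopen raises an "HTTPError" exception.
In the second case urlopen raises a "URLError" exception; it does not return None. HTTPError is a subclass of URLError, so the HTTPError handler has to come first.
With handling for both cases added, the code from the previous section becomes:

# coding: utf-8
__author__ = 'f403'
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.heibanke.com/lesson/crawler_ex00/")
except HTTPError as e:
    # The server answered, but with an error status such as 404 or 500
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Url is not found")
else:
    # Only parse the page when the request succeeded
    bsobj = BeautifulSoup(html, "html.parser")
    print(bsobj.h1)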
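The two urlopen failure modes can also be wrapped in a small helper. The following is only a sketch: the function name `fetch`, the `opener` parameter, the two fake openers and the `example.invalid` URL are all hypothetical, introduced so the example runs without network access. It relies on the standard-library fact that HTTPError is a subclass of URLError, which is why the order of the except clauses matters.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response object, or None if the request failed.

    `opener` is a hypothetical hook for this sketch; it defaults to urlopen.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        print("HTTP error:", e.code)
    except URLError as e:
        # No HTTP response at all, e.g. the host name cannot be resolved.
        print("Server not reachable:", e.reason)
    return None


def fake_404(url):
    # Stand-in for a server that answers with "404 Not Found".
    raise HTTPError(url, 404, "Not Found", None, io.BytesIO())


def fake_unreachable(url):
    # Stand-in for a server that does not exist.
    raise URLError("name resolution failed")


print(issubclass(HTTPError, URLError))  # True
print(fetch("http://example.invalid/", opener=fake_404))
print(fetch("http://example.invalid/", opener=fake_unreachable))
```

Returning None from the helper is a design choice of this sketch, not something urlopen does itself; the caller then only has to test the result once.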

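With exception handling in place, the program can cope with failures while requesting the page, so it no longer crashes when the fetch fails. That still does not guarantee that the page content matches what we expect. In the program above, for example, nothing guarantees that an h1 tag exists, so this kind of failure needs handling too.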

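These failures also fall into two types:

1. Accessing a tag that does not exist

2. Accessing a child of a tag that does not exist

In the first case BeautifulSoup returns a None object. In the second case an AttributeError is raised, because the child lookup is made on that None.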
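Both cases can be reproduced without fetching anything. The snippet below is a minimal sketch that assumes the bs4 package is installed; the HTML string and the h2 and span tag names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page that has an <h1> but no <h2>.
soup = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")

print(soup.h1)  # the existing tag
print(soup.h2)  # None: the tag does not exist

raised = False
try:
    # soup.h2 is None, so asking for its child fails.
    soup.h2.span
except AttributeError as e:
    raised = True
    print("AttributeError:", e)
```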

With handling for these cases added, the code becomes:

# coding: utf-8
__author__ = 'f403'
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.heibanke.com/lesson/crawler_ex00/")
except HTTPError as e:
    # The server answered, but with an error status such as 404 or 500
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Url is not found")
else:
    bsobj = BeautifulSoup(html, "html.parser")
    try:
        # bsobj.h1 is None when the page has no h1 tag
        t = bsobj.h1
        if t is None:
            print("tag does not exist")
        else:
            print(t)
    except AttributeError as e:
        # Raised when a child is requested from a missing tag
        print(e)
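The None check can be kept out of the main script by moving it into a function. This is a sketch only, assuming bs4 is installed; the name `heading_text` and the sample markup are hypothetical and not part of the original code.

```python
from bs4 import BeautifulSoup


def heading_text(markup):
    """Return the text of the first <h1>, or None if the page has none."""
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.h1
    if tag is None:
        # Checking for None here avoids an AttributeError on the next line.
        return None
    return tag.get_text(strip=True)


print(heading_text("<h1> Hello </h1>"))        # Hello
print(heading_text("<p>no heading here</p>"))  # None
```

Checking for None before touching the tag means no try/except is needed for this particular lookup.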
