Problems and fixes for crawler example 1 in Song Tian's online Python course

Reposted article. 2019-05-18 10:50. Python

I. AttributeError: 'NoneType' object has no attribute 'children' (no 'tbody' was found in the page)

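The error means that soup.find("tbody") returned None, so there was no tbody element whose children could be read. That suggests something went wrong in gethtmltext. Testing with print(r.status_code) showed that the status code was not 200, and gethtmltext itself returned "error", so the page source was never downloaded. (print(r.raise_for_status()) printed None, but that call returns None whenever no HTTP error occurred, so it is not a useful check.) My guesses at the cause: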

(1) The zuihaodaxue.com site has adopted anti-crawling measures.

(2) The tutorial was recorded a long time ago, and its URL is out of date.

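For the first case, I added a User-Agent header and cookies and tested again, but still got AttributeError: 'NoneType' object has no attribute 'children'. In the end it turned out that I had typed r.txt instead of r.text.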

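import requests

def gethtmltext(url):
    # Browser-like User-Agent, in case the site rejects the default one
    headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"}
    # requests expects one dict entry per cookie (name -> value)
    cooks={"Hm_lvt_2ce94714199fe618dcebb5872c6def14":"1558142987",
           "Hm_lpvt_2ce94714199fe618dcebb5872c6def14":"1558147316"}
    try:
        r=requests.get(url,headers=headers,cookies=cooks,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:
        # catch only request failures, so a typo such as r.txt is not hidden
        return "error"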
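The r.txt typo was hard to spot because a bare except also swallows the AttributeError that the typo raises. The following is a minimal sketch of that effect; it needs no network access. FakeResponse, fetch_with_typo and fetch_narrow are hypothetical names introduced only for this illustration, and OSError stands in for requests.RequestException (which is a subclass of it).

```python
class FakeResponse:
    """Hypothetical stand-in for a successful requests response."""
    text = "<html><tbody></tbody></html>"

    def raise_for_status(self):
        return None  # like requests: returns None when there is no HTTP error


def fetch_with_typo(response):
    try:
        response.raise_for_status()
        return response.txt  # typo: should be .text
    except:  # bare except also swallows the AttributeError from the typo
        return "error"


def fetch_narrow(response):
    try:
        response.raise_for_status()
        return response.txt  # same typo
    except OSError:  # stand-in for requests.RequestException
        return "error"


# The bare except hides the typo behind the generic "error" return value.
hidden = fetch_with_typo(FakeResponse())

# The narrow except lets the AttributeError surface with a useful message.
try:
    fetch_narrow(FakeResponse())
    surfaced = None
except AttributeError as exc:
    surfaced = str(exc)

print(hidden)
print(surfaced)
```

With the narrow except clause the traceback names the missing attribute directly, instead of the problem showing up later as a missing tbody.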

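For the second case, the URL given in the course was zuihaodaxue.cn, but the site has since moved to .com, so switching to http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html solves the problem.

If problems remain, the most likely cause is a misspelled name somewhere in the code.

The correct code is: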

import requests

def gethtmltext(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "error"

if __name__ == '__main__':
    url="http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"

II. Wrong format() specifier: ValueError: Invalid format specifier

The following code is wrong and raises ValueError: Invalid format specifier:

def printunivlist(urlist,num):
    # Bug: {3} selects the fourth argument ("總分" in the header row) as the fill character
    tplt = "{0:^6}\t{1:{3}^10}\t{2:^10}\t{3:^10}"
    print(tplt.format("排名","學校","地區","總分",chr(12288)))
    for i in range(num):
        u = urlist[i]
        print(tplt.format(u[0],u[1],u[2],u[3],chr(12288)))

The following code is correct:

def printunivlist(urlist,num):
    # {4} selects the fifth argument, chr(12288), as the fill character
    tplt = "{0:^6}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    print(tplt.format("排名","學校","地區","總分",chr(12288)))
    for i in range(num):
        u = urlist[i]
        print(tplt.format(u[0],u[1],u[2],u[3],chr(12288)))

As you can see, the only difference is 1:{3}^10 versus 1:{4}^10.
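To see the difference in isolation, here is a minimal sketch that formats only the school column with the same five arguments as the header row. The variable names (fill, args, error, cell) are introduced only for this illustration.

```python
fill = chr(12288)  # full-width space, intended as the fill character
args = ("排名", "學校", "地區", "總分", fill)  # argument indexes 0..4

# Wrong: {3} selects "總分", a two-character string, as the fill,
# so the expanded specifier "總分^10" is invalid.
try:
    "{1:{3}^10}".format(*args)
    error = None
except ValueError as exc:
    error = exc

# Right: {4} selects the single full-width space as the fill.
cell = "{1:{4}^10}".format(*args)

print(repr(error))
print(repr(cell))
```

A fill must be exactly one character, which is why the two-character header string triggers the ValueError while chr(12288) does not.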

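Cause analysis:

The template in the original example was tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}", with three columns. Once a fourth column, {3:^10}, is added for the region, chr(12288), the character used to pad the school name to a width of 10, is no longer argument 3 of format() but argument 4.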

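Many readers will ask what the 3 or 4 in {1:{3}^10} refers to, and why the fill is applied to field 1:

First, the misalignment caused by mixing full-width Chinese characters with half-width characters arises in field 1, the school name.

Second, in the original example [Crawling China's top 20 universities with Python], each row has three columns: rank, school name and total score. They take argument positions 0 to 2, so the fill character chr(12288) is argument 3.

Third, once the province column is added, each row has four columns: rank, school name, total score and province. They take argument positions 0 to 3, so the fill character is argument 4.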
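The first point can be checked without a terminal. The sketch below approximates the on-screen width of a padded string with unicodedata.east_asian_width, counting wide and full-width characters as two columns. display_width is a hypothetical helper written for this illustration, and the two school names are sample values; real terminals and fonts may render widths slightly differently.

```python
import unicodedata


def display_width(text):
    """Approximate column width: wide/full-width characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


names = ["清華大學", "中國科學技術大學"]  # sample school names of different lengths

# Default fill: half-width ASCII spaces.
ascii_padded = ["{0:^10}".format(n) for n in names]
# Fill with chr(12288), the full-width space, passed as argument 1.
wide_padded = ["{0:{1}^10}".format(n, chr(12288)) for n in names]

ascii_widths = [display_width(s) for s in ascii_padded]
wide_widths = [display_width(s) for s in wide_padded]

print(ascii_widths)  # widths differ from row to row
print(wide_widths)   # widths are equal
```

Both variants produce strings of 10 characters, but only the full-width fill gives every row the same approximate display width, which is why the school column lines up.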

Source of this explanation: https://blog.csdn.net/Andone_hsx/article/details/84025828

Finally, here is my code:

import requests
from bs4 import BeautifulSoup
import bs4

def gethtmltext(url):
    """Download the page and return its text, or "error" on failure."""
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "error"

def fillunivlist(urlist,html):
    """Append selected <td> values of each table row to urlist."""
    soup = BeautifulSoup(html,"html.parser")
    for tr in soup.find("tbody").children:
        if isinstance(tr,bs4.element.Tag):  # skip text nodes between rows
            tds = tr("td")  # all <td> tags of this row, as a list
            urlist.append([tds[0].string,tds[1].string,tds[2].string,tds[4].string])

def printunivlist(urlist,num):
    """Print the first num rows, padding the school column with full-width spaces."""
    tplt = "{0:^6}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    print(tplt.format("排名","學校","地區","總分",chr(12288)))
    for i in range(num):
        u = urlist[i]
        print(tplt.format(u[0],u[1],u[2],u[3],chr(12288)))

if __name__ == '__main__':
    unifo=[]
    url="http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"
    html=gethtmltext(url)
    fillunivlist(unifo,html)
    printunivlist(unifo,20)
