Some errors encountered while scraping with Python, and their solutions:
must be str, not ReadTimeout
must be str, not ConnectionError
429 Too Many Requests
Garbled text (gb2312)
Error message:
Flight AS1084: scrape error
must be str, not ProxyError    (the error was not handled)
Solution:
Wrap the request in try/except, print (log) the failed flight, then pass so the crawl skips the error and continues.

Error message:
Flight CA3767: scrape error
local variable 'ok' referenced before assignment    (the variable is read before it has been assigned)
Solution:
Make the variable global: global ok

Error message:
Flight MF1930: scrape finished!
must be str, not ReadTimeout    (fetching the page timed out)
content = requests.get(
    'http://happiness.variflight.com/info/detail?fnum',
    proxies=proxies, timeout=30).text
Solution:
On a timeout, catch the exception (except: pass) and reconnect to the page.

Error message:
Flight NS8185: scrape finished!
must be str, not ConnectionError    (database connection error)
Solution:
Reconnect to the database, log the flight, then pass to skip this flight record.

Error message:
429 Too Many Requests    (error page)
403
502
Solution:
These come from requesting pages too frequently. Check whether the response is a normal page, and scrape it only if it is.

Solution:
unc = stringa.decode("gb2312")  # decode the raw gb2312 bytes first
print(unc)  # Python 3 str is Unicode; call unc.encode("utf-8") only when UTF-8 bytes are needed
Garbled HTML (this page is encoded as gb2312):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=gb2312">
<TITLE>′í?ó£o?ú?ù???óμ?í??·£¨URL£??T·¨??è?</TITLE>
<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-family:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>
</HEAD><BODY>
<H1>′í?ó</H1>
<H2>?ú?ù???óμ?í??·£¨URL£??T·¨??è?</H2>
<HR noshade size="1px">
<P>
μ±3¢ê??áè?ò???í??·£¨URL£?ê±£o
<A HREF="http://happiness.variflight.com/info/detail?fnum=CZ3134&dep=TSN&arr=CAN&date=2017-12-28&type=1">http://happiness.variflight.com/info/detail?fnum=CZ3134&dep=TSN&arr=CAN&date=2017-12-28&type=1</A>
<P>
·¢éúá???áDμ?′í?ó£o
<UL>
<LI>
<STRONG>
Read Error
<BR>
?áè?′í?ó
</STRONG>
</UL>

<P>
?μí3??ó|£o
<PRE><I> (104) Connection reset by peer</I></PRE>

<P>
An error condition occurred while reading data from the network. Please
retry your request.
<BR>
?y?úí¨1yí????áè?êy?Yê±·¢éúá?′í?ó£?????D?3¢ê??£
</P>
<P>±??o′?·t???÷1üàí?±£o<A HREF="mailto:support@chinacache.com">support@chinacache.com</A>
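
The fixes above can be pulled together in one small helper. Messages of the form "must be str, not ReadTimeout" look like the TypeError Python 3.6 raises when a non-string object is concatenated onto a string (for example "..." + e in an except block). That reading is an inference from the wording of the messages, not something the original code confirms. The sketch below formats the exception instead of concatenating it, retries on exceptions and on the 403/429/502 statuses, and decodes the body as gb2312. The names describe_error and fetch_with_retry, the zero-argument fetch callable returning (status, raw_bytes), and the choice to decode every page as gb2312 are all assumptions made for illustration.

```python
import time


def describe_error(flight, exc):
    """Build a log line for a failed flight (hypothetical helper)."""
    # Formatting converts the exception to text; "text" + exc would raise TypeError.
    return "{} scrape failed: {}: {}".format(flight, type(exc).__name__, exc)


def fetch_with_retry(fetch, flight, retries=3, delay=1.0,
                     bad_statuses=(403, 429, 502)):
    """Call fetch() until it yields a normal page or the retries run out.

    fetch is an assumed zero-argument callable returning (status, raw_bytes).
    Returns the decoded page text, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            status, raw = fetch()
        except Exception as exc:  # timeout, proxy error, connection error, ...
            print(describe_error(flight, exc))
        else:
            if status not in bad_statuses:
                # Assumption: pages are gb2312, as the error page above declares.
                return raw.decode("gb2312", errors="replace")
            print("{} got HTTP {} on attempt {}".format(flight, status, attempt))
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer before each retry
    return None  # caller can record the flight and move on
```

With requests, fetch could be a small function that calls requests.get(url, proxies=proxies, timeout=30) and returns (resp.status_code, resp.content). A None result means the flight should be logged and skipped, which matches the "record and pass" approach described above.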