Python: Fetching a Web Page's Source Code

Reposted article, 2018-05-07 16:50. Tags: bs4, http, urllib3, Python

Python 3.6

Libraries: urllib3, bs4

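The main program is meant to scrape Amazon's book sales ranking data, but Amazon appears to use anti-crawler measures and rejects requests that look automated, so for now Baidu stands in as the target.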

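Simple page fetching can be done with the commonly used urllib.request module, but urllib3 offers more features and is more widely applicable, so it is worth learning.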
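For comparison, the standard-library route might look like the minimal sketch below. The helper name fetch_bytes and the 10-second timeout are illustrative assumptions and are not part of the original program.

```python
import urllib.request


def fetch_bytes(url, timeout=10.0):
    """Fetch url with urllib.request and return (status, body_bytes)."""
    # urlopen returns a response object that can be used as a context manager,
    # so the connection is closed even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()
```

Calling fetch_bytes('https://www.baidu.com') should return a status code and the raw bytes of the page, comparable to the status and data attributes of a urllib3 response.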

First, import the modules:

import urllib3, bs4

Define the page to request:

urltest = 'https://www.baidu.com'

Define the function. It compares two ways of decoding the response:

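def httpget():
    http = urllib3.PoolManager()   # create a PoolManager instance first
    urllib3.disable_warnings()     # suppress urllib3 warnings such as unverified-HTTPS alerts
    page = http.request('GET', urltest)   # send the GET request
    print(page.status)        # HTTP status code returned by the server
    print(page.data)          # response body as raw bytes (the HTML source)
    print(page.data.decode())  # decode the bytes with the default 'utf-8' codec
    res = bs4.BeautifulSoup(page.data, 'lxml')  # let BeautifulSoup decode and parse (requires the lxml package)
    print(res)
    return None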

Running httpget() prints:

200
b'<!DOCTYPE html><!--STATUS OK--><body link="#0000cc"><div ... (omitted)

<!DOCTYPE html><!--STATUS OK-->
<html>
<head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//t1.baidu.com"/>
    <link rel="dns-prefetch" href="//t2.baidu.com"/>
    <link rel="dns-prefetch" href="//t3.baidu.com"/>
    <link rel="dns-prefetch" href="//t10.baidu.com"/>
    <link rel="dns-prefetch" href="//t11.baidu.com"/>
    <link rel="dns-prefetch" href="//t12.baidu.com"/>
    <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
    <title>百度一下，你就知道</title>
    ... (omitted)
    ... (omitted)


</body></html>

<!DOCTYPE html>
<!--STATUS OK--><html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<link href="//s1.bdstatic.com" rel="dns-prefetch"/>
<link href="//t1.baidu.com" rel="dns-prefetch"/>
<link href="//t2.baidu.com" rel="dns-prefetch"/>
<link href="//t3.baidu.com" rel="dns-prefetch"/>
<link href="//t10.baidu.com" rel="dns-prefetch"/>
<link href="//t11.baidu.com" rel="dns-prefetch"/>
<link href="//t12.baidu.com" rel="dns-prefetch"/>
<link href="//b1.bdstatic.com" rel="dns-prefetch"/>
<title>百度一下，你就知道</title>
... (omitted)
... (omitted)

</body></html>


Process finished with exit code 0
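The function hands raw bytes to BeautifulSoup instead of decoding them first, which leaves the choice of encoding to the parser. One signal a parser can use is the charset the page declares about itself. The stdlib-only sketch below illustrates that idea and is not BeautifulSoup's actual implementation. The names sniff_declared_charset and decode_html, the 2048-byte window, and the UTF-8 default are all assumptions made for illustration.

```python
import re

# Matches charset=... inside a <meta> tag, e.g. <meta charset="gbk"> or
# <meta http-equiv="content-type" content="text/html;charset=utf-8">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE
)


def sniff_declared_charset(raw, default="utf-8"):
    """Return the charset declared in the first 2048 bytes of raw, or default."""
    match = _META_CHARSET.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(raw):
    """Decode raw HTML bytes with the declared charset, replacing bad bytes."""
    charset = sniff_declared_charset(raw)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The page declared a charset Python does not know; fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")
```

With this helper, decode_html(page.data) could replace page.data.decode() when a page declares a non-UTF-8 charset. A declaration can still be missing or wrong, so this remains a heuristic.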

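Neither decoding approach fails here. On a more complex page, however, the plain decode() call is prone to raising an error, because it assumes the bytes are UTF-8, whereas BeautifulSoup tries to detect the page's encoding itself.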

Take this JD.com page, for example:

url = 'https://item.jd.com/6072622.html'

After replacing urltest with url and running the code again, the result is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 146: invalid start byte
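The error means the response contains a byte sequence that is not valid UTF-8. One plausible cause, though the source does not confirm it, is that the page is served in a GBK-family encoding. A hedged way to cope is to try a short list of candidate encodings in order. The helper decode_with_fallback and the choice of gb18030 as the second candidate are assumptions that suit Chinese-language pages and are not a general solution.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and mark every
    # undecodable byte with U+FFFD instead of raising.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# GBK-encoded text is not valid UTF-8, so a bare decode() raises,
# while the fallback chain recovers the text.
sample = "京东".encode("gbk")
text, used = decode_with_fallback(sample)
```

In httpget(), decode_with_fallback(page.data) could stand in for page.data.decode(). Note that a wrong candidate can sometimes decode without raising and produce garbled text, so the order of the candidates matters.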
