Python: Basic usage of the requests and BeautifulSoup4 libraries (building a simple web crawler)

This article is a repost. 2019-11-10 18:35

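I. Basic usage of the requests library

requests is a simple, easy-to-use HTTP library written in Python. It is more concise and convenient than urllib.

requests is a third-party library and must be installed with pip before use.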

pip install requests

1. Basic usage:

import requests

# Use the Baidu home page as an example
response = requests.get('http://www.baidu.com')

# Attributes of the response object
print(response.status_code)  # status code
print(response.url)          # URL of the request
print(response.headers)      # response headers
print(response.cookies)      # cookies
print(response.text)         # page source as text
print(response.content)      # page source as raw bytes

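Output after running:

Status code: 200

URL: http://www.baidu.com/

The headers, the cookie information, and the page source are printed as well. The cookie jar, for example: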


<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

2. Request methods (HTTP testing site: http://httpbin.org/)

import requests

requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
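
To inspect what such calls would send without touching the network, a request can be built and prepared but not sent. This is only a minimal sketch: the params and data values are made up for illustration and do not come from the original examples.

```python
import requests

# Build two requests without sending them; prepare() returns the
# PreparedRequest that requests would put on the wire.
get_req = requests.Request('GET', 'http://httpbin.org/get',
                           params={'q': 'python'}).prepare()    # hypothetical query
post_req = requests.Request('POST', 'http://httpbin.org/post',
                            data={'name': 'demo'}).prepare()    # hypothetical form data

print(get_req.method, get_req.url)        # query string is appended to the URL
print(post_req.method, post_req.url)
print(post_req.body)                      # form data is URL-encoded into the body
print(post_req.headers['Content-Type'])
```

Passing the same params or data keyword to requests.get() or requests.post() sends the request for real.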

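3. Methods of the response object

json(): parses the body of the HTTP response as JSON and returns the resulting Python object, which makes JSON APIs easy to work with.

raise_for_status(): raises an HTTPError if the status code signals a client or server error (4xx or 5xx), and does nothing otherwise. It is typically used inside a try-except statement.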
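
Both methods can be tried without a network connection by building a Response by hand. This is a sketch for illustration only: the fake_response helper is hypothetical, and it sets the private _content attribute, which normal code should leave to requests.

```python
import requests

def fake_response(status_code, body=b''):
    """Build a Response by hand (illustration only; requests normally creates it)."""
    r = requests.Response()
    r.status_code = status_code
    r._content = body  # private attribute, set here only to avoid a network call
    return r

ok = fake_response(200, b'{"origin": "127.0.0.1"}')
ok.raise_for_status()            # 2xx status: no exception is raised
data = ok.json()                 # body parsed into a Python dict
print(data['origin'])

missing = fake_response(404)
try:
    missing.raise_for_status()   # 4xx status: raises HTTPError
    raised = False
except requests.HTTPError:
    raised = True
print(raised)
```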

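requests raises several common exceptions:

ConnectionError: a network problem, such as a failed DNS lookup or a refused connection.

HTTPError: an invalid HTTP response, or an error status reported by raise_for_status().

Timeout: the request to the URL timed out.

TooManyRedirects: the request exceeded the configured maximum number of redirects.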
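
All four exceptions inherit from requests.exceptions.RequestException, so they can be handled one by one or caught together. The sketch below shows one possible ordering; the describe_failure function and its labels are hypothetical names chosen for this example.

```python
import requests
from requests.exceptions import (ConnectionError, HTTPError, RequestException,
                                 Timeout, TooManyRedirects)

def describe_failure(url):
    """Return a short label for the outcome of requesting url."""
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()
        return 'ok'
    except Timeout:              # checked first: ConnectTimeout is also a ConnectionError
        return 'timeout'
    except TooManyRedirects:
        return 'too many redirects'
    except ConnectionError:
        return 'connection error'
    except HTTPError:
        return 'http error'
    except RequestException:     # base class: anything else requests raises
        return 'other request error'

# Every exception listed above shares the same base class.
for exc in (ConnectionError, HTTPError, Timeout, TooManyRedirects):
    print(exc.__name__, issubclass(exc, RequestException))

# A URL without a scheme is rejected before any network traffic happens.
print(describe_failure('not-a-url'))
```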

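The following function is a recommended way to fetch the content of a web page:

import requests

def getHTMLText(url):
    """Return the page text for url, or '' if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError on a 4xx or 5xx status
        r.encoding = 'utf-8'  # decode as UTF-8 regardless of the declared encoding
        return r.text
    except requests.RequestException:
        return ''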

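II. Basic usage of the beautifulsoup4 library

The beautifulsoup4 library parses and processes HTML and XML. Its main strength is that it builds a parse tree from the HTML or XML syntax, from which useful information can be extracted.

beautifulsoup4 is also a third-party library and likewise must be installed with pip before use.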

pip install beautifulsoup4

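Note: do not confuse the beautifulsoup4 library with the older beautifulsoup library. The latter has long been neglected and is no longer maintained.

Before using beautifulsoup4, import it: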

from bs4 import BeautifulSoup

Use BeautifulSoup() to create a BeautifulSoup object.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.baidu.com')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
print(type(soup))

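A BeautifulSoup object is a tree structure that contains every Tag in the HTML page, and these tags are accessible as attributes of the BeautifulSoup object. Commonly used attributes are:

soup.head: the <head> of the HTML page

soup.title: the title of the HTML page, found inside <head>

soup.body: the <body> of the HTML page

soup.p: the first <p> in the HTML page

soup.strings: an iterator over all text strings in the HTML page

soup.stripped_strings: the same strings with surrounding whitespace removed and whitespace-only strings skipped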
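
These attributes can be tried on a small inline document, which avoids depending on a live site. The HTML below is made up for this sketch.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page, used instead of a downloaded one.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)                    # the whole <title> tag
print(soup.title.string)             # only the text inside it
print(soup.p)                        # the first <p>, not the second
print(list(soup.stripped_strings))   # page text without the whitespace-only strings
```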

# Print the <title> tag of the Baidu home page
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.baidu.com')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title)

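In beautifulsoup4, each tag is represented by a Tag object. Commonly used attributes of a Tag object are:

name: the name of the tag itself, a string such as a.

attrs: a dict containing all of the tag's attributes.

contents: a list of the tag's direct children, both child tags and text.

string: the text enclosed by the tag, that is, the actual text shown on the page. It is None when the tag has more than one child.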
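
A short sketch of these four attributes on a made-up fragment (the HTML and the example.com link are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a paragraph holding a link followed by plain text.
html = '<p><a href="http://example.com" id="link1">Example</a> site</p>'
soup = BeautifulSoup(html, 'html.parser')

a = soup.a
print(a.name)            # tag name as a string
print(a.attrs)           # dict of the tag's attributes
print(a.string)          # text enclosed by the tag
print(soup.p.contents)   # direct children of <p>: the <a> tag and the trailing text
print(soup.p.string)     # None, because <p> has more than one child
```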

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.baidu.com')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.a)
print(soup.a.name)
print(soup.a.attrs)
print(soup.a.string)
print(soup.p.contents)

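To traverse the whole HTML page and collect everything that matches a tag, use the find_all() method.

BeautifulSoup.find_all(name, attrs, recursive, string, limit)

It finds the tags that match the arguments and returns them as a list. The parameters are:

name: search by tag name.

attrs: search by attribute values. Supply the attribute names and values as a dict.

recursive: controls the search depth. Use recursive=False to search only the direct children of the current tag.

string: search by the content of the string attribute, passed as string=...

limit: the maximum number of results to return. By default all results are returned.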
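
The parameters above can be compared side by side on a small inline document. This is a sketch under assumptions: the HTML, class names, and link targets are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical listing: two job links directly under <div>, one ad link nested in <span>.
html = """
<div id="jobs">
  <a class="job" href="/job/1">Engineer</a>
  <a class="job" href="/job/2">Designer</a>
  <span><a class="ad" href="/ad">Sponsored</a></span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.find_all('a')                            # name: every <a> in the page
job_links = soup.find_all('a', attrs={'class': 'job'})    # attrs: filter by attribute value
direct = soup.div.find_all('a', recursive=False)          # only direct children of <div>
by_text = soup.find_all('a', string='Designer')           # match on the tag's string
first = soup.find_all('a', limit=1)                       # stop after one result

print(len(all_links), len(job_links), len(direct))
print(by_text[0]['href'])
print(first[0].string)
```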

import requests
from bs4 import BeautifulSoup

# Scrape software-engineer salaries from 51job
r = requests.get('https://m.51job.com/search/joblist.php?jobarea=180400,180200&keyword=%E8%BD%AF%E4%BB%B6%E5%B7%A5%E7%A8%8B%E5%B8%88&partner=webmeta')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
allsalary = soup.find_all('em')
for i in allsalary:
    if len(i.text) == 0:  # skip empty <em> tags
        continue
    print(i.text)
