Python web scraping with requests + selenium + BeautifulSoup


Preface:

  • Environment: Windows 64-bit, Python 3.4
  • Basic usage of the requests library:

1. Install: pip install requests

2. What it does: requests sends network requests, letting you issue the same kinds of HTTP requests a browser would and retrieve a site's data.

3. Common operations:

import requests  # import the requests module

r = requests.get("https://api.github.com/events")  # fetch a page

# set a timeout: stop waiting for a response after the given number of seconds
r2 = requests.get("https://api.github.com/events", timeout=0.001)
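
# note: a 0.001-second timeout will almost certainly expire before the server
# answers, so the line above raises an exception; a minimal sketch of catching
# it (requests raises requests.exceptions.Timeout when a timeout expires):
try:
    requests.get("https://api.github.com/events", timeout=0.001)
except requests.exceptions.Timeout as e:
    print("request timed out:", e)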

payload = {'key1': 'value1', 'key2': 'value2'}
r1 = requests.get("http://httpbin.org/get", params=payload)  # pass URL query parameters

print(r.url)  # print the URL

print(r.text)  # the response body decoded as text

print(r.encoding)  # the current encoding

print(r.content)  # the response body as bytes

print(r.status_code)  # the response status code
print(r.status_code == requests.codes.ok)  # compare against the built-in status-code lookup object

print(r.headers)  # the server's response headers as a Python dictionary
print(r.headers['content-type'])  # lookup is case-insensitive, so any casing works

print(r.history)  # a list of Response objects (the redirect chain)

print(type(r))  # the type of the response object: requests.models.Response
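
The examples above only issue GET requests. As a minimal sketch of the "various HTTP requests" mentioned earlier, here is a POST with form data and a custom header; httpbin.org is a public echo service, and the field names below are made up purely for illustration:

import requests

payload = {'username': 'alice', 'password': 'secret'}  # hypothetical form fields
headers = {'User-Agent': 'Mozilla/5.0'}  # present ourselves as a browser

resp = requests.post("http://httpbin.org/post", data=payload, headers=headers)
print(resp.status_code)  # 200 on success
print(resp.json())  # httpbin echoes the request back as JSON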
  • Basic usage of the BeautifulSoup4 library:

1. Install: pip install BeautifulSoup4

2. What it does: Beautiful Soup is a Python library for extracting data from HTML and XML files.

3. Common operations:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

ss = BeautifulSoup(html_doc, "html.parser")
print(ss.prettify())  # output the document with standard indentation
print(ss.title)  # <title>The Dormouse's story</title>
print(ss.title.name)  # title
print(ss.title.string)  # The Dormouse's story
print(ss.title.parent.name)  # head
print(ss.p)  # <p class="title"><b>The Dormouse's story</b></p>
print(ss.p['class'])  # ['title']
print(ss.a)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(ss.find_all("a"))  # [...]
print(ss.find(id="link3"))  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in ss.find_all("a"):
    print(link.get("href"))  # the href of every <a> tag in the document

print(ss.get_text())  # all of the text content in the document
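
find and find_all locate tags by name and attributes; Beautiful Soup also supports CSS selectors through select and select_one, which are part of the bs4 API. A minimal sketch on a short, made-up snippet:

from bs4 import BeautifulSoup

html_doc = '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a> <a id="link2" href="http://example.com/lacie">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select("p.story > a"))  # every <a> directly inside <p class="story">
print(soup.select_one("#link2"))  # the single tag whose id is link2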
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # create the BeautifulSoup object
find = soup.find('p')  # find the first <p> tag
print("find's return type is ", type(find))  # the return type
print("find's content is", find)  # what find returned
print("find's Tag Name is ", find.name)  # the tag's name
print("find's Attribute(class) is ", find['class'])  # the tag's class attribute

print(find.string)  # the text content inside the tag

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, "html.parser")
comment = soup1.b.string
print(type(comment))  # the comment's content: <class 'bs4.element.Comment'>
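
Because a Comment is a subclass of NavigableString, .string hands it back like ordinary text. A minimal sketch of telling the two apart, using the Comment class that bs4 exports:

from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
node = BeautifulSoup(markup, "html.parser").b.string
if isinstance(node, Comment):  # Comment subclasses NavigableString
    print("comment text:", node)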
  • A quick test run:
import requests
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

r = requests.get('https://unsplash.com')  # send a GET request to the target URL; returns a Response object

print(r.text)  # r.text is the HTML of the HTTP response
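
To tie the two libraries together, a minimal sketch that parses the fetched page and lists its links. Note that unsplash.com renders much of its content with JavaScript, so the raw HTML returned to requests may contain only a few links:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://unsplash.com')
soup = BeautifulSoup(r.text, 'html.parser')
for a in soup.find_all('a'):  # every link present in the raw (pre-JavaScript) HTML
    print(a.get('href'))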

 


