Introduction to Common Python Web-Scraping Libraries (requests, BeautifulSoup, lxml, json)

Reposted from the original article, 2020-03-16 12:02

1. The requests library

GET is the most commonly used method in the HTTP protocol:
import requests

response = requests.get('http://www.baidu.com')
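print(response.status_code)  # print the status code
print(response.url)          # print the request URL
print(response.headers)      # print the response headers
print(response.cookies)      # print the cookies
print(response.text)         # print the page source as text
print(response.content)      # print the page source as raw bytes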

Besides GET, requests provides a function for each of the other HTTP methods:

import requests

requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
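To see what these calls put on the wire without contacting a server, a request can be built and prepared locally. The following is a minimal sketch that reuses the httpbin.org URLs above; the query parameter `q`, the form field `key`, and the User-Agent string are made up for illustration.

```python
import requests

# Build a GET request with query parameters and a custom header,
# then prepare it without sending anything over the network.
get_req = requests.Request(
    'GET',
    'http://httpbin.org/get',
    params={'q': 'python'},                # hypothetical query parameter
    headers={'User-Agent': 'demo-agent'},  # hypothetical header value
).prepare()

print(get_req.method)  # GET
print(get_req.url)     # http://httpbin.org/get?q=python

# A POST request carries its payload in the body instead of the URL.
post_req = requests.Request(
    'POST',
    'http://httpbin.org/post',
    data={'key': 'value'},                 # hypothetical form field
).prepare()

print(post_req.method)                   # POST
print(post_req.body)                     # key=value
print(post_req.headers['Content-Type'])  # application/x-www-form-urlencoded
```

The shortcut functions such as requests.get() and requests.post() accept the same params, data, and headers keyword arguments and send the request immediately; passing a timeout as well is usually a good idea in a scraper.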

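2. The BeautifulSoup library

What BeautifulSoup does:

Parsing a page with Beautiful Soup produces a Soup document: structured data that can be printed with standard indentation and is ready for filtering and extraction.

Elements in a Soup document can be located with the find(), find_all(), and select() methods: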

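1. The find_all() method

soup.find_all('div', "item")  # find div tags with class="item"

find_all(name, attrs, recursive, string, limit, **kwargs)
@PARAMS:
    name: the tag name to match; can be a string, a list, a function, True, or a compiled regular expression
    attrs: attribute filters for the tags to match, such as class.
    recursive: bool; whether to search all descendants (True) or only direct children (False)
    string: matches on text content; used alone, the results are strings; combined with name, the results are tags that match name and contain the string.
    limit: the maximum number of results to return
    **kwargs: further attribute filters given as keyword arguments, e.g. id='link1'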
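The parameters above are easier to compare side by side on a small document. This is a rough sketch on made-up HTML (the tags, classes, and ids are invented for illustration), parsed with Python's built-in html.parser so that no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Small made-up document; html.parser is the parser bundled with Python.
html = """
<div class="item"><a href="/a" id="first">Alpha</a><p>Intro</p></div>
<div class="item"><a href="/b">Beta</a></div>
<div class="other"><span><a href="/c">Gamma</a></span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# name as a string, with the class given as the second positional argument
items = soup.find_all('div', 'item')
print(len(items))  # 2

# name as a list: match any of the listed tag names
print(len(soup.find_all(['a', 'p'])))  # 4

# name as a compiled regular expression: tag names starting with "s"
print([t.name for t in soup.find_all(re.compile('^s'))])  # ['span']

# attrs as a dict, and the same filter written as a keyword argument
by_attrs = soup.find_all('a', attrs={'id': 'first'})
by_kwarg = soup.find_all('a', id='first')
print(by_attrs == by_kwarg)  # True

# string alone returns the matching text nodes, not tags
print(soup.find_all(string='Beta'))  # ['Beta']

# limit stops after the given number of matches
print([a.get_text() for a in soup.find_all('a', limit=2)])  # ['Alpha', 'Beta']

# recursive=False looks only at direct children of the tag being searched
other = soup.find('div', 'other')
print(other.find_all('a', recursive=False))  # []
print(len(other.find_all('a')))              # 1
```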

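2. The find() method

find() works like find_all(). The difference is that find_all() returns every matching tag in the document as a list, while find() returns only the first matching Tag (or None if nothing matches).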
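The difference in return types matters in practice, because a missing element has to be handled differently for each method. A short sketch on a made-up list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')

first = soup.find('li')              # a single Tag: the first match
every = soup.find_all('li')          # a list-like ResultSet with all matches
missing = soup.find('table')         # no match: find() returns None
none_found = soup.find_all('table')  # no match: find_all() returns an empty list

print(first.get_text())                 # one
print([li.get_text() for li in every])  # ['one', 'two']
print(missing)                          # None
print(len(none_found))                  # 0

# Because find() may return None, check before touching attributes.
text = missing.get_text() if missing is not None else ''
```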

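3. The select() method

soup.select('div.item > a > h1') narrows the match from the outermost element to the innermost; the selector string can be copied from the browser's developer tools.

Introduction to the select method

Example: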

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # parse the sample so the soup.select() calls below can run

In CSS, a tag name is written without any prefix, a class name is prefixed with a dot, and an id is prefixed with #. Elements can be filtered the same way with soup.select(), which returns a list.

(1) Find by tag name

print(soup.select('title'))  # select all title tags and print them with their attributes and content
# [<title>The Dormouse's story</title>]

print(soup.select('a'))  # select all a tags
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))  # select all b tags and print them
# [<b>The Dormouse's story</b>]

(2) Find by class name

print(soup.select('.sister'))  # select all tags with class "sister" and print them with their attributes and content
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Find by id

print(soup.select('#link1'))  # select the tag with id "link1" and print it with its attributes and content
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

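(4) Combined selectors

Combined selectors follow the same rules as in a CSS file, where tag names, class names, and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two with a space.

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Direct child selectors

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]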

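(5) Find by attribute

A selector can also include an attribute condition, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Attribute conditions can still be combined with the selectors above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]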
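Once select() has returned the matching tags, the usual next step is to read their text and attributes. The sketch below uses a trimmed and simplified copy of the sample document (the first link is given plain text instead of a comment) and also shows select_one(), which returns a single element; the class name "brother" is made up to show the no-match case.

```python
from bs4 import BeautifulSoup

# Trimmed, simplified copy of the sample document used above.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list of Tag objects; read text and attributes from each.
links = [(a.get_text(), a['href']) for a in soup.select('p.story > a.sister')]
print(links)

# select_one() returns only the first match, or None if nothing matches.
first = soup.select_one('a.sister')
print(first['id'])                   # link1
print(soup.select_one('a.brother'))  # None

# Attribute selectors also support prefix matching with ^=.
prefixed = soup.select('a[href^="http://example.com/"]')
print(len(prefixed))  # 3
```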

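BeautifulSoup usage example:

from bs4 import BeautifulSoup
import requests

url = 'http://example.com'               # placeholder: the original does not give the target URL
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder request headers
n = []                                   # collected link texts

f = requests.get(url, headers=headers)
soup = BeautifulSoup(f.text, 'lxml')

for k in soup.find_all('div', class_='pl2'):  # every div whose class is pl2
    b = k.find_all('a')  # the a tags under that div; on the target page each a holds four spans
    if b:
        n.append(b[0].get_text())  # text of the first a tag, including its spans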

3. The lxml library

lxml is an HTML/XML parser; its main job is to parse HTML/XML and extract data from it.

Example:

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string into an HTML document with etree.HTML
html = etree.HTML(text)

# Serialize the HTML document; tostring() returns bytes, so decode before printing
result = etree.tostring(html).decode('utf-8')

print(result)

Output (shown with whitespace collapsed):

<html><body> <div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </body></html>

As the output shows, lxml automatically repairs the HTML. In this example it not only closed the li tag but also added the body and html tags.
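Extraction with lxml is commonly done through XPath, which the original does not cover. As a rough sketch, assuming a shortened version of the snippet above, the xpath() method of the parsed document can pull out attributes and text:

```python
from lxml import etree

# Shortened version of the sample above; the last <li> is still unclosed.
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# All href attribute values of <a> elements that are children of an <li>
hrefs = html.xpath('//li/a/@href')
print(hrefs)  # ['link1.html', 'link2.html', 'link5.html']

# Text of the links whose parent <li> has class "item-0"
texts = html.xpath('//li[@class="item-0"]/a/text()')
print(texts)  # ['first item', 'fifth item']

# When the expression selects nodes, xpath() returns a list of elements
for li in html.xpath('//li'):
    print(li.get('class'), li.findtext('a'))
```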

4. The json library

Function	Description
json.dumps	Encodes a Python object into a JSON string
json.loads	Decodes a JSON string into a Python object

1. Using json.dumps

#!/usr/bin/python
import json

data = [ { 'name' : '张三', 'age' : 25}, { 'name' : '李四', 'age' : 26} ]

jsonStr1 = json.dumps(data)  # convert the Python object to a JSON string
# Pretty-print the JSON. sort_keys=True sorts the dictionary keys;
# with False (the default) the keys keep the dictionary's own order.
jsonStr2 = json.dumps(data, sort_keys=True, indent=4, separators=(',', ':'))
jsonStr3 = json.dumps(data, ensure_ascii=False)  # keep Chinese characters as-is instead of \uXXXX escapes

print(jsonStr1)
print('---------------separator------------------')
print(jsonStr2)
print('---------------separator------------------')
print(jsonStr3)

Output:

[{"name": "\u5f20\u4e09", "age": 25}, {"name": "\u674e\u56db", "age": 26}]
---------------separator------------------
[
    {
        "age":25,
        "name":"\u5f20\u4e09"
    },
    {
        "age":26,
        "name":"\u674e\u56db"
    }
]
---------------separator------------------
[{"name": "张三", "age": 25}, {"name": "李四", "age": 26}]
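Scraped data is often saved to disk rather than printed. The sketch below uses json.dump and json.load, the file-based counterparts of dumps and loads; the file name people.json and the temporary directory are assumptions made for the example.

```python
import json
import os
import tempfile

data = [{'name': '张三', 'age': 25}, {'name': '李四', 'age': 26}]

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'people.json')

# json.dump writes straight to a file object; open the file as UTF-8
# so that ensure_ascii=False can store the characters unescaped.
with open(path, 'w', encoding='utf-8') as fh:
    json.dump(data, fh, ensure_ascii=False, indent=4)

# json.load reads the file back into Python objects.
with open(path, encoding='utf-8') as fh:
    restored = json.load(fh)

print(restored == data)  # True
```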

2. Using json.loads

#!/usr/bin/python
import json

data = [ { 'name' : '张三', 'age' : 25}, { 'name' : '李四', 'age' : 26} ]

jsonStr = json.dumps(data)
print(jsonStr)

jsonObj = json.loads(jsonStr)
print(jsonObj)
# iterate over the list and print each name
for i in jsonObj:
    print(i['name'])

Output:

[{"name": "\u5f20\u4e09", "age": 25}, {"name": "\u674e\u56db", "age": 26}]
[{'name': '张三', 'age': 25}, {'name': '李四', 'age': 26}]
张三
李四
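Two points are useful when parsing JSON that comes from somewhere else: how JSON types map to Python types, and what happens when the input is malformed. A small sketch with a made-up JSON string:

```python
import json

# Made-up JSON text, such as the body of an API response.
raw = '{"title": "demo", "tags": ["a", "b"], "count": 3, "ok": true, "next": null}'
obj = json.loads(raw)

# JSON types map onto Python types: object -> dict, array -> list,
# number -> int/float, true/false -> bool, null -> None.
print(type(obj).__name__)      # dict
print(obj['tags'][0])          # a
print(obj['ok'], obj['next'])  # True None

# Malformed input raises json.JSONDecodeError (a subclass of ValueError).
try:
    json.loads("{'title': 'demo'}")  # single quotes are not valid JSON
    parsed_ok = True
except json.JSONDecodeError as exc:
    parsed_ok = False
    print('invalid JSON:', exc.msg)
```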
