Python: using the requests and beautifulsoup4 modules

Reposted article. Original published 2018-06-19 18:42 · Category: Python

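Requests is an HTTP library written in Python and released under the Apache2 license. It wraps Python's built-in modules in a much higher-level interface, which makes sending network requests far more pleasant; with Requests you can easily reproduce almost anything a browser can do over HTTP.

BeautifulSoup is a module that takes an HTML or XML string and parses it into a structured tree. You can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML documents simple.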

I. Installing the modules

pip3 install requests
pip3 install beautifulsoup4

II. A simple example combining requests and beautifulsoup4

Fetch the title, link, and image of each news item.

import requests
from bs4 import BeautifulSoup
import uuid


response = requests.get(url="https://www.autohome.com.cn/news/")
response.encoding = response.apparent_encoding  # use the detected encoding so the text decodes correctly

soup = BeautifulSoup(response.text, 'html.parser')  # built-in HTML parser; lxml is usually a better choice
target = soup.find(id="auto-channel-lazyload-article")  # find() matches by tag name, id, or any other attribute
li_list = target.find_all('li')  # returns a list
for li in li_list:
    a_tag = li.find('a')
    if a_tag:
        href = a_tag.attrs.get("href")  # attrs is a dict, so use get() to read one attribute
        title = a_tag.find("h3").text  # find() returns a Tag object; .text gives its text
        img_src = "http:" + a_tag.find("img").attrs.get('src')
        print(href)
        print(title)
        print(img_src)
        img_response = requests.get(url=img_src)
        file_name = str(uuid.uuid4()) + '.jpg'  # generate a unique image file name
        with open(file_name, 'wb') as fp:
            fp.write(img_response.content)
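
The loop above assumes every a tag contains both an h3 and an img. As a minimal offline sketch of the same find/find_all pattern with guards for missing tags, the snippet below parses a made-up HTML fragment. `SAMPLE_HTML` and `extract_items` are hypothetical names, and the markup only imitates the structure the script expects, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real autohome.com.cn page
SAMPLE_HTML = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="//example.com/news/1.html"><img src="//example.com/1.jpg"><h3>First story</h3></a></li>
    <li><a href="//example.com/news/2.html"><h3>Story without image</h3></a></li>
    <li class="ad"></li>
  </ul>
</div>
"""


def extract_items(html):
    """Collect href, title and image URL for each li that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.find(id="auto-channel-lazyload-article")
    if target is None:
        return []
    items = []
    for li in target.find_all("li"):
        a_tag = li.find("a")
        if a_tag is None:          # skip li elements without a link
            continue
        h3 = a_tag.find("h3")
        img = a_tag.find("img")
        has_src = img is not None and img.has_attr("src")
        items.append({
            "href": a_tag.get("href"),
            "title": h3.get_text(strip=True) if h3 is not None else None,
            "img_src": "http:" + img["src"] if has_src else None,
        })
    return items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Returning None for a missing image lets the caller decide whether to skip the download instead of crashing on the first incomplete entry.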

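Summary of usage:

(1) The requests module

response = requests.get(url)    # fetch the URL and return a Response object
response.apparent_encoding      # encoding detected from the response body
response.encoding               # encoding used to decode .text; can be assigned
response.text                   # body as text (str)
response.content                # body as raw data (bytes)
response.status_code            # HTTP status code of the response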
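
To see how encoding, text, and content relate without touching the network, here is a small sketch that builds a Response object by hand. Assigning the private `_content` attribute is a shortcut used only for illustration; real code gets the object back from requests.get().

```python
import requests

# Build a Response by hand so the example runs offline.
# Setting _content directly is an illustration-only shortcut.
raw = "中文".encode("gbk")
response = requests.Response()
response.status_code = 200
response._content = raw

response.encoding = "ISO-8859-1"   # a wrong encoding: the text comes out garbled
garbled = response.text

response.encoding = "gbk"          # the right encoding: the text decodes correctly
decoded = response.text

print(response.status_code)
print(type(response.content))      # bytes, unaffected by .encoding
print(garbled == "中文", decoded == "中文")
```

This is why the scraping example assigns apparent_encoding to encoding before reading text: content never changes, only the way it is decoded.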

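(2) The beautifulsoup4 module

soup = BeautifulSoup('page source', 'html.parser')        # parse the HTML into a BeautifulSoup object
target = soup.find(id="auto-channel-lazyload-article")    # find a tag by attribute; returns the first match
li_list = target.find_all('li')                           # find all tags with this name; returns a list

Note: find() accepts a tag name, attributes, or both:
v1 = soup.find('div')
v1 = soup.find(id='i1')
v1 = soup.find('div',id='i1')

find_all() takes the same arguments.

On a tag object you can use:
obj.text        get the text
obj.attrs       get the attribute dictionary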

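III. The requests module in detail

The module exposes the API functions below. Each of them ends up calling request(), so the rest of this section looks at request() in detail.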

def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)


def options(url, **kwargs):
    r"""Sends an OPTIONS request.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('options', url, **kwargs)


def head(url, **kwargs):
    r"""Sends a HEAD request.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', False)
    return request('head', url, **kwargs)


def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)


def put(url, data=None, **kwargs):
    r"""Sends a PUT request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('put', url, data=data, **kwargs)


def patch(url, data=None, **kwargs):
    r"""Sends a PATCH request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('patch', url, data=data, **kwargs)


def delete(url, **kwargs):
    r"""Sends a DELETE request.

    :param url: URL for the new :class:`Request` object.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('delete', url, **kwargs)

Above: the API functions other than request(). Below: request() itself.

from . import sessions


def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

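Parameter reference:

    :param method:  HTTP method: get, post, put, patch, delete, options, head

    :param url:     the URL to send the request to

    :param params:  parameters passed in the URL query string (typically with GET)
                    requests.request(method='GET',url='http://xxxx.com',params={'k1':'v1','k2':'v2'})
                    is automatically converted to http://xxxx.com?k1=v1&k2=v2

    :param data:    data sent in the request body (typically with POST): a dict, bytes, or a file object
                    requests.request(method='POST',url='http://xxxx.com',data={'user':'aaaa','password':'bbbb'})
                    although written as a dict, it is form-encoded when sent: user=aaaa&password=bbbb

    :param json:    JSON data sent in the request body; a Django server reads it from request.body
                    requests.request(method='POST',url='http://xxxx.com',json={'user':'aaaa','password':'bbbb'})
                    the object is serialized to a JSON string, {"user": "aaaa", "password": "bbbb"}, and placed in the request body
                    compared with data: form encoding is meant for flat key/value pairs and does not preserve nested dicts, whereas json serializes the whole object, so nested dicts and lists are fine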
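
The difference between params, data, and json can be checked without sending anything. A minimal sketch: requests.Request(...).prepare() builds the outgoing request so its URL, headers, and body can be inspected. The URL is the placeholder used above, and the `tags` key is an assumption added to show a nested value.

```python
import json
import requests

url = "http://xxxx.com"

# prepare() builds the request without sending it, so nothing touches the network
query = requests.Request("GET", url, params={"k1": "v1", "k2": "v2"}).prepare()
form = requests.Request("POST", url, data={"user": "aaaa", "password": "bbbb"}).prepare()
as_json = requests.Request("POST", url, json={"user": "aaaa", "tags": ["x", "y"]}).prepare()

print(query.url)                        # params end up in the query string
print(form.headers["Content-Type"])     # application/x-www-form-urlencoded
print(form.body)                        # user=aaaa&password=bbbb
print(as_json.headers["Content-Type"])  # application/json
print(json.loads(as_json.body))         # the nested list survives the round trip
```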

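    :param headers: the request headers
                    sites can inspect them to block scripted logins; for example, the Chouti auto-login in this article is filtered by the User-Agent header. A site can also check Referer to see which page the request came from, for example to prevent hotlinking

    :param cookies: cookies; they travel in the request headers (the Cookie header)

    :param files:   used to upload files with POST, given as key/value pairs
                    requests.post(url='xxx',files={
                        'f1':open('s1.py','rb'),    # name: file object, or the file content itself, e.g. 'f1':'dawfwafawfawf'
                        'f2':('newf1name',open('s1.py','rb'))    # the first tuple item is the file name reported to the server
                    })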
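
As a sketch of where headers, cookies, and files end up in the outgoing request, the snippet below again uses prepare() so that nothing is sent. The upload URL, header value, and cookie value are hypothetical, and an in-memory BytesIO object stands in for open('s1.py','rb').

```python
import io
import requests

fake_file = io.BytesIO(b"print('hello')")    # stands in for open('s1.py', 'rb')

prepared = requests.Request(
    "POST",
    "http://xxxx.com/upload",                # hypothetical URL
    headers={"User-Agent": "demo-agent"},
    cookies={"sessionid": "abc123"},
    files={"f2": ("newf1name", fake_file)},
).prepare()

print(prepared.headers["User-Agent"])        # the custom header is sent as given
print(prepared.headers["Cookie"])            # cookies are carried in the Cookie header
print(prepared.headers["Content-Type"])      # multipart/form-data with a boundary
print(b'filename="newf1name"' in prepared.body)
```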

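    :param auth:    authentication. With HTTP Basic auth the username and password are base64-encoded (an encoding, not encryption) and sent in the Authorization request header; this is what a browser does after you fill in its login prompt
                    from requests.auth import HTTPBasicAuth
                    ret = requests.get('https://api.github.com/user',
                                       auth=HTTPBasicAuth('username', 'password')
                    )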
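
A short sketch of what auth actually adds to the request, assuming the placeholder credentials user and pass. The request is only prepared, not sent.

```python
import base64
import requests
from requests.auth import HTTPBasicAuth

prepared = requests.Request(
    "GET",
    "https://api.github.com/user",
    auth=HTTPBasicAuth("user", "pass"),      # placeholder credentials
).prepare()

header = prepared.headers["Authorization"]
expected = "Basic " + base64.b64encode(b"user:pass").decode("ascii")
print(header)
print(header == expected)

# base64 can be reversed by anyone, so Basic auth is only safe over HTTPS
recovered = base64.b64decode(header.split(" ", 1)[1])
print(recovered)
```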

    :param timeout: timeout, a float or a tuple. A single float is how long to wait for the server to send data; a tuple is (connect timeout, read timeout)
                    ret = requests.get('http://google.com/', timeout=1)
                    ret = requests.get('http://google.com/', timeout=(5, 1))

    :param allow_redirects: Boolean, default True (False for head()). When enabled, redirects are followed and the final page is returned
                    requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)

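    :param proxies: proxies. A site may limit how many operations one public (non-LAN) IP address can perform; sending requests through different public IPs gets around that limit, and the machines that relay the requests are called proxies
                    In practice we send the request to a proxy server, which forwards it to the target site from its own IP
                    requests.post(
                        url = "http://dig.chouti.com/log",
                        data = form_data,
                        proxies = {
                            'http':'http://proxy-host:port',
                            'https':'http://proxy-host:port',
                        }
                    )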

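    :param stream:  fetch the body as a stream: the body is downloaded only as you read it, so each chunk can be written to disk as it arrives. This avoids running out of memory when the file is too large to hold at once
                    from contextlib import closing
                    with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
                        # handle the response here
                        for i in r.iter_content():
                            print(i)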
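
The following is a self-contained sketch of a streamed download written to disk chunk by chunk. So that it runs offline, it starts a throwaway local HTTP server from the standard library; `Handler`, `PAYLOAD`, and the file name are hypothetical, and a real download would simply point at the remote URL.

```python
import http.server
import os
import tempfile
import threading

import requests

PAYLOAD = b"x" * 50_000   # stands in for a large file


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):   # keep the demo output quiet
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/file" % server.server_port

session = requests.Session()
session.trust_env = False   # ignore proxy environment variables for this local demo

path = os.path.join(tempfile.mkdtemp(), "download.bin")
chunks = 0
with session.get(url, stream=True, timeout=5) as r, open(path, "wb") as fp:
    for chunk in r.iter_content(chunk_size=8192):
        fp.write(chunk)     # each chunk is written out as soon as it is read
        chunks += 1

session.close()
server.shutdown()
server.server_close()
size = os.path.getsize(path)
print(chunks, size)
```

Passing a chunk_size keeps memory use bounded by the chunk rather than by the file size.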


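    :param cert:    the client-side certificate. Plain HTTP sends data over a socket unencrypted, which is insecure; HTTPS encrypts the channel with SSL/TLS, and that requires certificates

                    One case: a custom certificate that the client has to install and present itself
                        requests.get(
                            url="https:...",
                            cert="xxx.pem",    # sent with every request; either one .pem file or a ('.crt','.key') tuple of two files
                        )
                    The other case: certificates issued by a paid certificate authority that the operating system already trusts, so the site is verified without extra setup


    :param verify:  Boolean. When False, the server's certificate is not verified and the request still returns a result, which is convenient for sites with a self-signed certificate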

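Supplement: session() in the requests module

In the automatic-login examples in this article, the cookies produced during a login session have to be managed by hand. A session object instead stores the cookies returned by each response and sends them automatically with later requests to the same site.

Note: we still have to set the request headers ourselves.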

import requests

headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'

session = requests.session()

i1 = session.get("https://dig.chouti.com/",headers=headers)
i1.close()

form_data = {
    'phone':"xxxx",
    'password':"xxxx",
    'oneMonth':''
}

i2 = session.post(url="https://dig.chouti.com/login",data=form_data,headers=headers)

i3 = session.post("https://dig.chouti.com/link/vote?linksId=20324146",headers=headers)
print(i3.text)


{"result":{"code":"9999", "message":"推薦成功", "data":{"jid":"cdu_52941024478","likedTime":"1529507291930000","lvCount":"7","nick":"山上有風景","uvCount":"3","voteTime":"小於1分鍾前"}}}
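
The cookie handling can be observed without a real site. This sketch assumes a throwaway local server (the `Handler` class and the `/login` and `/me` paths are hypothetical) that sets a cookie on login and echoes back whatever Cookie header it receives.

```python
import http.server
import threading

import requests


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "token=abc123; Path=/")
            body = b"logged in"
        else:
            # echo back whatever Cookie header the client sent
            body = (self.headers.get("Cookie") or "no cookie").encode()
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep the demo output quiet
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

with requests.Session() as session:
    session.trust_env = False             # ignore proxy environment variables
    session.get(base + "/login", timeout=5)
    with_session = session.get(base + "/me", timeout=5).text

with requests.Session() as fresh:         # a new session starts with no cookies
    fresh.trust_env = False
    without_session = fresh.get(base + "/me", timeout=5).text

server.shutdown()
server.server_close()
print(with_session)
print(without_session)
```

The first session sends the cookie it received from /login; the fresh one has nothing to send, which is the situation the manual cookie_dict bookkeeping works around.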

Supplement: the request object in Django (not the requests module above)

Recommended reading: the Django request object

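Whatever format we send, the data always ends up in request.body, while request.POST may be empty.
Django decides whether to fill request.POST based on the Content-Type request header.

For example: Content-Type: text/html;charset=utf-8;
Common media types:

    text/html : HTML
    text/plain : plain text
    text/xml : XML
    image/gif : GIF image
    image/jpeg : JPEG image
    image/png : PNG image

Media types that start with application:

    application/xhtml+xml : XHTML
    application/xml : XML data
    application/atom+xml : Atom XML feed
    application/json : JSON data
    application/pdf : PDF
    application/msword : Word document
    application/octet-stream : binary stream (for example, an ordinary file download)
    application/x-www-form-urlencoded : the default encType of <form encType="">; the form data is encoded as key/value pairs and sent to the server (the default format for form submission)

Another common media type is used for file uploads:

    multipart/form-data : required when a form uploads files

These are the Content-Type values met most often in day-to-day development.

For example, when data is sent with POST, the server stores the request body in request.body,
then checks the request header for Content-Type: application/x-www-form-urlencoded,
and only then parses the body into request.POST.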
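
The decision described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Django's actual code: `parse_post` is a hypothetical helper, and multipart handling is left out.

```python
from urllib.parse import parse_qs


def parse_post(content_type, body):
    """Return (raw_body, post_dict) roughly the way the text describes."""
    media_type = content_type.split(";")[0].strip().lower()
    post = {}
    if media_type == "application/x-www-form-urlencoded":
        # only form-encoded bodies are parsed into the POST dictionary
        post = {k: v[-1] for k, v in parse_qs(body.decode("utf-8")).items()}
    return body, post


form_body, form_post = parse_post(
    "application/x-www-form-urlencoded", b"user=aaaa&password=bbbb"
)
json_body, json_post = parse_post("application/json", b'{"user": "aaaa"}')

print(form_post)              # parsed key/value pairs
print(json_post, json_body)   # empty dict; the raw body is still available
```

This is why a view receiving JSON has to read request.body and decode it itself.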

IV. The beautifulsoup4 module in detail

Working with tags

Sample HTML used in the examples below:

from bs4 import BeautifulSoup

html = '''
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <a href="/wwewe/fafwaw" class="btn btn2">666daw6fw</a>
    <div id="content" for='1'>
        <p>div>p
            <label>title</label>
        </p>
    </div>
    <hr/>
    <p id="bott">div,p</p>
</body>
</html>
'''

soup = BeautifulSoup(html,features="lxml")

1. name: the tag name

tag = soup.find("a")
print(tag.name) #a
tag = soup.find(id="content")
print(tag.name) #div

2. attrs: working with a tag's attributes

tag = soup.find('a')
print(tag.attrs)    #{'href': '/wwewe/fafwaw', 'class': ['btn', 'btn2']}
print(tag.attrs['href'])    #/wwewe/fafwaw
tag.attrs['id']="btn-primary"   #add an attribute
del tag.attrs['class']  #delete an attribute
tag.attrs['href']="/change"  #change an attribute
print(tag.attrs)    #{'href': '/change', 'id': 'btn-primary'}

3. children: all direct child nodes

from bs4.element import Tag

body = soup.find("body")
print(body.children)    #an iterator over direct children only; deeper descendants stay nested inside their parent tag
for child in body.children:
    # print(type(child))
    # <class 'bs4.element.NavigableString'>: a string node, usually newlines and spaces
    # <class 'bs4.element.Tag'>: a child tag
    if isinstance(child, Tag):
        print(child)

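4. descendants: all descendant nodes

body = soup.find("body")
for child in body.descendants:  #nested tags are visited as well, so inner tags show up again on their own
    # print(type(child))
    # <class 'bs4.element.NavigableString'>: a string node, usually newlines and spaces
    # <class 'bs4.element.Tag'>: a tag node
    if isinstance(child, Tag):  #Tag comes from bs4.element, imported in the previous example
        print(child)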
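
A compact way to see the difference between children and descendants, using a small made-up fragment (a sketch with html.parser, so it does not need lxml):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div id="content"><p>text<label>title</label></p><hr/></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find(id="content")

# keep only Tag nodes and drop the string nodes
child_names = [c.name for c in div.children if isinstance(c, Tag)]
descendant_names = [c.name for c in div.descendants if isinstance(c, Tag)]

print(child_names)        # ['p', 'hr']
print(descendant_names)   # ['p', 'label', 'hr']
```

The label tag appears only in the second list because it is nested inside p rather than being a direct child of the div.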

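5. clear: recursively remove everything inside the tag, keeping the tag itself

body = soup.find("body")
body.clear()  #empty the tag but keep it
print(soup) #the body tag is still there, with nothing inside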

6. decompose: recursively delete the tag and everything inside it

body = soup.find('body')
body.decompose()    #delete recursively, including the tag itself
print(soup) #the body tag is gone

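7. extract: remove the tag and everything inside it (like decompose) and return what was removed

body = soup.find('body')
deltag = body.extract() #remove recursively, including this tag
print(soup) #no body tag
print(deltag)   #the removed tag and its contents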
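
The three removal methods side by side, as a sketch on a small made-up fragment that is re-parsed before each call so the results do not affect one another:

```python
from bs4 import BeautifulSoup

html = "<div><p id='a'>one<b>bold</b></p><p id='b'>two</p></div>"

soup = BeautifulSoup(html, "html.parser")
soup.find(id="a").clear()                 # empties the tag, keeps the tag
after_clear = str(soup)

soup = BeautifulSoup(html, "html.parser")
soup.find(id="a").decompose()             # destroys the tag and its contents
after_decompose = str(soup)

soup = BeautifulSoup(html, "html.parser")
removed = soup.find(id="a").extract()     # detaches the tag and returns it
after_extract = str(soup)

print(after_clear)       # <div><p id="a"></p><p id="b">two</p></div>
print(after_decompose)   # <div><p id="b">two</p></div>
print(after_extract)     # <div><p id="b">two</p></div>
print(removed)           # <p id="a">one<b>bold</b></p>
```

Use extract when the removed fragment is still needed, for example to re-insert it elsewhere.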

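8. decode: convert to a string including the current tag; decode_contents: excluding the current tag

#output as a string; printing the tag directly also works because it implements __str__
body = soup.find('body')
v = body.decode()   #includes the current tag
print(v)
v = body.decode_contents()  #excludes the current tag
print(v)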

9. encode: convert to bytes including the current tag; encode_contents: excluding the current tag

#convert to bytes
body = soup.find('body')
v = body.encode()      #includes body
print(v)
v = body.encode_contents()  #excludes body
print(v)

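10. find: search by tag name, attributes, or text; recursive controls whether the search goes below direct children

tag = soup.find(name="p")    #by default the search is recursive and covers all descendants
print(tag)  #the first p among the descendants
tag = soup.find(name='p',recursive=False)
print(tag)  #None, because the only direct child of soup is the html tag, not body

tag = soup.find('body').find('p')
print(tag)  #the first p among the descendants of body
tag = soup.find('body').find('p',recursive=False)
print(tag)  #<p id="bott">div,p</p>

tag = soup.find('body').find('div',attrs={"id":"content","for":"1"},recursive=False)
print(tag)  #finds that div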

11. find_all: search by tag name, attributes, text, regular expression, or function, with limit and recursive

tags = soup.find_all('p')
print(tags)

tags = soup.find_all('p',limit=1)   #stop after one match; the result is still a list
print(tags)

tags = soup.find_all('p',attrs={'id':"bott"}) #search by attribute
print(tags)

tags = soup.find_all(name=['p','a'])    #all p and a tags
print(tags)

tags = soup.find("body").find_all(name=['p','a'],recursive=False)    #all p and a tags, direct children only
print(tags)

tags = soup.find("body").find_all(name=['p','a'],text="div,p")  #all p and a tags whose text is div,p
print(tags)

Regular expression matching:
import re
pat = re.compile("p")    #tag names that contain p
tags = soup.find_all(name=pat)
print(tags)

pat = re.compile("^lab")    #all tags whose name starts with lab
tags = soup.find_all(name=pat)
print(tags)

pat = re.compile(".*faf.*")
tags = soup.find_all(attrs={"href":pat})    #or simply href=pat
print(tags)

pat = re.compile("cont.*")
tags = soup.find_all(id=pat)
print(tags)

Function matching:

def func(tag):
    return tag.has_attr("class") and tag.has_attr("href")

tags = soup.find_all(name=func)
print(tags)

12. get: read a tag attribute; has_attr: check whether it exists

tag = soup.find('a')
print(tag.get("href"))  #read an attribute of the tag
print(tag.attrs.get("href"))  #same result through the attrs dictionary

print(tag.has_attr("href"))

13. get_text and string: read a tag's text; assign to string to change it

tag = soup.find(id='content')
print(tag.get_text())   #the tag's text, including the text of all descendant tags

tag = soup.find("label")
print(tag.get_text())   #title
print(tag.string)   #title
tag.string = "test"
print(tag.get_text())   #test

14. index: position of a tag within its parent

body = soup.find("body")
child_tag = body.find("div",recursive=False)
if child_tag:
    print(body.index(child_tag))    #must be a direct child, not a deeper descendant

15. is_empty_element: whether the tag is an empty (self-closing) tag

tag = soup.find('hr')
print(tag.is_empty_element) #True for empty, self-closing tags such as hr

16. Nodes related to the current tag

tag.next
tag.next_element
tag.next_elements   #includes text (string) nodes
tag.next_sibling    #the next node at the same level; it can be a string node as well as a Tag
tag.next_siblings

tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings

tag.parent
tag.parents

tag = soup.find(id="content")
print(tag)
print(tag.next) #the next element; here it is a newline
print(tag.next_element) #the next element; here it is a newline
print(tag.next_elements)    #a generator over every following node in document order, nested ones included
for ele in tag.next_elements:
    print(ele)

print(tag.next_sibling) #the next node at the same level
print(tag.next_siblings)    #a generator over the following nodes at the same level only
for ele in tag.next_siblings:
    print(ele)
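
A minimal sketch of the difference between next_element and next_sibling, on a made-up fragment parsed with html.parser. It also shows that next_sibling can be a whitespace string, and that find_next_sibling() is the way to skip to the next Tag.

```python
from bs4 import BeautifulSoup

html = '<div id="content"><p>inner</p></div>\n<hr/><p id="bott">div,p</p>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find(id="content")

first_inside = tag.next_element          # document order: the first node inside the div
same_level = tag.next_sibling            # same level: here the newline after </div>
next_tag = tag.find_next_sibling()       # skips string nodes and returns a Tag
sibling_names = [t.name for t in tag.find_next_siblings()]

print(repr(first_inside))    # <p>inner</p>
print(repr(same_level))      # '\n'
print(next_tag.name)         # hr
print(sibling_names)         # ['hr', 'p']
```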

17. find_* methods: search the nodes related to the current tag by condition; usage mirrors the attributes above

tag.find_next(...)
tag.find_all_next(...)
tag.find_next_sibling(...)
tag.find_next_siblings(...)

tag.find_previous(...)
tag.find_all_previous(...)
tag.find_previous_sibling(...)
tag.find_previous_siblings(...)

tag.find_parent(...)
tag.find_parents(...)

tag = soup.find("label")

# print(tag.parent)
# for par in tag.parents:
#     print(par)

print(tag.find_parent(id='content'))    #walk upward and return the first ancestor that matches
print(tag.find_parents(id='content'))   #walk upward and return all matching ancestors as a list

18. select, select_one: CSS selectors

The selectors below follow the examples in the Beautiful Soup documentation, so ids and classes such as #link1 and .sister refer to that documentation's sample page, not to the html string above.

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

from bs4.element import Tag

#_candidate_generator is a private argument of select() in older bs4 releases; newer releases may not accept it
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
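
For selectors that do match the sample page of this article, here is a short sketch. It repeats the body of the html string defined earlier and uses html.parser, so it runs on its own; the particular selectors are chosen for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<body>
    <a href="/wwewe/fafwaw" class="btn btn2">666daw6fw</a>
    <div id="content" for='1'>
        <p>div>p
            <label>title</label>
        </p>
    </div>
    <hr/>
    <p id="bott">div,p</p>
</body>
'''
soup = BeautifulSoup(html, "html.parser")

direct_p = [t.get("id") for t in soup.select("body > p")]    # direct children of body only
all_p = [t.get("id") for t in soup.select("p")]              # every p, at any depth
label_text = soup.select_one("#content label").get_text()    # first match or None
link_href = soup.select("a.btn2")[0]["href"]                 # tag name plus class
prefix_hits = len(soup.select('a[href^="/wwewe"]'))          # attribute prefix match
no_match = soup.select("span")                               # empty list when nothing matches

print(direct_p, all_p, label_text, link_href, prefix_hits, no_match)
```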

19. Tag: create a new tag

from bs4.element import Tag
tag_obj = Tag(name='pre',attrs={"col":30})
tag_obj.string="This is a new tag"
print(tag_obj)  #<pre col="30">This is a new tag</pre>

20. append: add a tag as the last child of another tag (note: appending a tag that already exists in the tree moves it, so it disappears from its old position)

soup = BeautifulSoup(html,features="lxml")

from bs4.element import Tag
tag_obj = Tag(name='pre',attrs={"col":30})
tag_obj.string="This is a new tag"
# print(tag_obj)  #<pre col="30">This is a new tag</pre>

soup.find(id="content").append(tag_obj) #added at the end

print(soup)

21. insert: insert a tag at a given position inside the current tag

soup = BeautifulSoup(html,features="lxml")

from bs4.element import Tag
tag_obj = Tag(name='pre',attrs={"col":30})
tag_obj.string="This is a new tag"
# print(tag_obj)  #<pre col="30">This is a new tag</pre>

soup.find(id="content").insert(0,tag_obj) #inserted at the front

print(soup)

22. insert_after, insert_before: insert after or before the current tag

soup = BeautifulSoup(html,features="lxml")

from bs4.element import Tag
tag_obj = Tag(name='pre',attrs={"col":30})
tag_obj.string="This is a new tag"
# print(tag_obj)  #<pre col="30">This is a new tag</pre>

soup.find(id="content").insert_before(tag_obj) #placed before the current tag
soup.find(id="content").insert_after(tag_obj) #the same object is moved, so it ends up after the current tag

print(soup)

23. replace_with: replace the current tag with the given tag

soup = BeautifulSoup(html,features="lxml")

from bs4.element import Tag
tag_obj = Tag(name='pre',attrs={"col":30})
tag_obj.string="This is a new tag"
# print(tag_obj)  #<pre col="30">This is a new tag</pre>

soup.find(id="content").replace_with(tag_obj) #the original div is replaced

print(soup)

24. setup: set the relationships between tags (rarely useful)

#signature:
#def setup(self, parent=None, previous_element=None, next_element=None,
#          previous_sibling=None, next_sibling=None)

soup = BeautifulSoup(html,features="lxml")

div = soup.find('div')
a = soup.find('a')

div.setup(next_sibling=a)
print(soup) #unchanged

print(div.next_sibling) #the tag object we just set

25. wrap: enclose the current tag in the given tag

soup = BeautifulSoup(html,features="lxml")

from bs4.element import Tag
tag_obj = Tag(name='pre',attrs={"col":30})
tag_obj.string="This is a new tag"

a = soup.find("a")

a.wrap(tag_obj) #the new tag now encloses the a tag

div = soup.find('div')
tag_obj.wrap(div)   #an existing tag encloses tag_obj; tag_obj is placed at the end of the div's contents
print(soup)

26. unwrap: remove the current tag but keep what it contained

div = soup.find('div')
div.unwrap()
print(soup)
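
wrap and unwrap are inverse operations, which a short sketch on a made-up fragment makes visible. It uses soup.new_tag() to create the wrapper instead of constructing Tag directly.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/x">link</a></div>', "html.parser")

a = soup.find("a")
a.wrap(soup.new_tag("pre"))       # <pre> now encloses <a>
after_wrap = str(soup)

soup.find("pre").unwrap()         # remove <pre> but keep what it contained
after_unwrap = str(soup)

print(after_wrap)     # <div><pre><a href="/x">link</a></pre></div>
print(after_unwrap)   # <div><a href="/x">link</a></div>
```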

V. Logging in to GitHub automatically

import requests
from bs4 import BeautifulSoup

#visit the login page first to obtain the token and cookies
html1 = requests.get(url="https://github.com/login")
html1.encoding = html1.apparent_encoding
soup = BeautifulSoup(html1.text,features="html.parser")
login_token_obj = soup.find(name='input', attrs={'name': 'authenticity_token'})
login_token = login_token_obj.get("value")  #the token embedded in the page
cookie_dict = html1.cookies.get_dict()
html1.close()

#data required by the login form
login_data = {
    'login':"your-username",
    'password':"your-password",
    'authenticity_token':login_token,
    "utf8": "",
    "commit":"Sign in"
}

session_response = requests.post("https://github.com/session",data=login_data,cookies=cookie_dict)  #the cookies must be passed in
cookie_dict.update(session_response.cookies.get_dict())  #merge in the cookies set after login
index_response = requests.get("https://github.com/settings/repositories",cookies=cookie_dict)  #the cookies must be sent

#parse the repository list to get each project's name and size
soup2 = BeautifulSoup(index_response.text,features="html.parser")
item_list = soup2.find_all("div",{'class':'listgroup-item'})
for item in item_list:
    a_obj = item.find("a")
    s_obj = item.find('small')
    print(a_obj.text)
    print(s_obj.text)
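
The token-extraction step can be tried offline. The sketch below assumes a made-up login form (`LOGIN_PAGE`); the real GitHub page differs and changes over time. `build_login_data` is a hypothetical helper that fails loudly when the token is missing instead of raising an AttributeError on None.

```python
from bs4 import BeautifulSoup

# Hypothetical login form; the real GitHub page differs and changes over time
LOGIN_PAGE = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok-123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
'''


def build_login_data(page_html, username, password):
    """Read the hidden token from the form and build the POST payload."""
    soup = BeautifulSoup(page_html, "html.parser")
    token_input = soup.find("input", attrs={"name": "authenticity_token"})
    if token_input is None:
        raise ValueError("authenticity_token input not found")
    return {
        "login": username,
        "password": password,
        "authenticity_token": token_input.get("value"),
        "commit": "Sign in",
    }


login_data = build_login_data(LOGIN_PAGE, "your-username", "your-password")
print(login_data)
```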

VI. Logging in to Chouti (dig.chouti.com) automatically and voting

Recommended reading: Why can't many websites be scraped? Six common ways for crawlers to get past blocking

1. Chouti guards against direct scraping by checking the request headers, so we need to set the headers to avoid being blocked by the site's firewall.

2. Chouti authorizes the gpsd value in the cookies of the first request. Later operations need the gpsd from that first request; using the cookie from any other request causes an error.

import requests

headers = {}  #set the request headers
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'

i1 = requests.get("https://dig.chouti.com/",headers=headers)
i1_cookie = i1.cookies.get_dict()
print(i1_cookie)
i1.close()

form_data = {
    'phone':"xxxx",
    'password':"xxxx",
    'oneMonth':''
}

headers['Accept'] = '*/*'
i2 = requests.post(url="https://dig.chouti.com/login",headers=headers,data=form_data,cookies=i1_cookie)
i2_cookie = i2.cookies.get_dict()
i2_cookie.update(i1_cookie)

i3 = requests.post("https://dig.chouti.com/link/vote?linksId=20306326",headers=headers,cookies=i2_cookie)
print(i3.text)

{'JSESSIONID': 'aaaoJAuXMtUytb02Uw9pw', 'route': '0c5178ac241ad1c9437c2aafd89a0e50', 'gpsd': '91e20c26ddac51c60ce4ca8910fb5669'}

{"result":{"code":"9999", "message":"推薦成功", "data":{"jid":"cdu_52941024478","likedTime":"1529420936883000","lvCount":"23","nick":"山上有風景","uvCount":"2","voteTime":"小於1分鍾前"}}}

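VII. Logging in to Zhihu automatically

Simulating login to the new Zhihu with Python (full code)

Simulating login to the redesigned Zhihu (detailed walkthrough)

Simulated login after Zhihu switched to a REST API (covers the signature)

VIII. Logging in to cnblogs automatically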
