An introduction to the GET and POST methods in Python crawlers and the role of cookies

Reposted article, 2020-09-21 14:59

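First work out how the target site submits its form. The browser's developer tools show this; Chrome is recommended here.

I use the 163 mailbox as an example.

Open the developer tools and go to the Network tab. In the Name column, select the request you want to inspect; the Request Method shown under Headers on the right is the submission method. A status of 200 means the request succeeded. The header information is listed below it. The cookie is generated after you log in and stores the session information. The first visit requires a user name and password; after that, supplying the cookie in the headers is enough to get in as a logged-in user.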
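
As a rough illustration of what the cookie value looks like once it is copied from the Headers panel, the sketch below splits a Cookie header string into name/value pairs using only the standard library. The cookie string and the helper name cookie_header_to_dict are made up for this example; a real 163 mailbox cookie has different names and values.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(cookie_header):
    """Split a raw Cookie header string into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie string in the "name=value; name=value" form shown by the browser.
raw_cookie = "sid=abc123; theme=dark"
cookies = cookie_header_to_dict(raw_cookie)
print(cookies)
```

A dict like this can be handed to requests through the cookies argument of get or post, as an alternative to writing the whole string into the Cookie header.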

Import the requests library, which provides the get and post methods.

import requests
import ssl

user_agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
accept='text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
accept_language='zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3'


upgrade='1'


headers={
  'User-Agent':user_agent,
  'Accept':accept,
  'Accept-Language':accept_language,
  'Cookie':'....'  # paste the cookie generated after you log in here
}

r = requests.get("http://mail.163.com/js6/main.jsp?sid=OAwUtGgglzEJoANLHPggrsKKAhsyheAT&df=mail163_letter#module=welcome.WelcomeModule%7C%7B%7D",headers=headers,verify=False)
# Save the response body as UTF-8 text; the with block closes the file
with open("/temp/csdn.txt","w",encoding='utf-8') as fp:
  fp.write(str(r.content,'utf-8'))

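I imported the ssl library here because the first page I visited had an expired certificate. (The listing never actually calls ssl; the verify argument is what matters.)

When a crawler visits such a site, requests raises: SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)

The get and post methods of requests take a verify parameter; setting it to False turns off certificate verification.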
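
For comparison, here is a minimal sketch of what skipping certificate checks looks like at the level of the standard ssl module, which urllib uses. It is not a description of how requests implements verify=False internally, only an analogous setting, and no request is sent.

```python
import ssl

# Default context: certificates and host names are checked.
strict_ctx = ssl.create_default_context()

# Relaxed context: roughly the urllib counterpart of verify=False.
# check_hostname must be switched off before verify_mode can be CERT_NONE.
relaxed_ctx = ssl.create_default_context()
relaxed_ctx.check_hostname = False
relaxed_ctx.verify_mode = ssl.CERT_NONE

print(strict_ctx.verify_mode, relaxed_ctx.verify_mode)
```

The relaxed context could be passed as urllib.request.urlopen(url, context=relaxed_ctx). Turning verification off removes the protection against tampered connections, so it is best limited to testing.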

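Sending a POST request with the urllib module in a Python crawler

POST requests sent with the urllib module

Example: fetching translation results from Baidu Translate

1. Use the browser's capture tool to find the URL of the POST request

To find the URL behind an ajax request on a page, use the browser's capture tool (the Network panel). Watch the ajax request that Baidu Translate sends for a given term and note the URL it goes to.

Clicking the clear button empties the list of captured requests.

Then type in the term to translate, which sends ajax requests; the requests outlined in red in the original screenshot are those ajax requests.

The All button in the capture tool shows every captured request, including GET requests, POST requests, and ajax-based POST requests.

XHR shows only the ajax (XMLHttpRequest) requests.

Among them, the one we want is the ajax POST request that carries the term being translated, 蘋果 ("apple"), as its request parameter.

Then look at the request URL of this POST request; this is the URL we need to request.

Before sending the POST request, the parameters it carries have to be prepared.

A three-step process:

Step 1: put the POST request parameters into a dictionary

Step 2: encode them with urlencode from the parse module (the return value is a string)

Step 3: convert the result of step 2 to bytes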

import urllib.request
import urllib.parse

# 1. Specify the URL
url = 'https://fanyi.baidu.com/sug'

# Before sending the POST request, prepare the parameters it carries:
# Step 1: put all POST request parameters into a dictionary
data = {
  'kw':'蘋果',
}
# Step 2: encode with urlencode from the parse module (returns a string)
data = urllib.parse.urlencode(data)
# Step 3: convert the result of step 2 to bytes
data = data.encode()

# 2. Send the POST request: the data argument of urlopen is the
#    processed set of parameters carried by the POST request
response = urllib.request.urlopen(url=url,data=data)
data = response.read()
print(data)

Paste the returned result into an online JSON validator and formatter (such as Be JSON),

then click "format and validate" and "Unicode to Chinese".
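
The same formatting can be done locally with the json module instead of an online tool. The sketch below uses a made-up byte string shaped like a JSON response with \u escapes; the field names are illustrative assumptions, not a guaranteed description of what the Baidu endpoint returns.

```python
import json

# Hypothetical response body: JSON bytes in which Chinese text is \u-escaped.
raw = b'{"errno": 0, "data": [{"k": "\\u82f9\\u679c", "v": "apple"}]}'

# json.loads accepts bytes and turns \uXXXX escapes back into characters.
parsed = json.loads(raw)

# ensure_ascii=False keeps Chinese characters readable when pretty-printing.
print(json.dumps(parsed, ensure_ascii=False, indent=2))
```

In the crawler above, the bytes returned by response.read() would take the place of raw.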

import re,json,requests,os
from hashlib import md5
from urllib.parse import urlencode
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from multiprocessing import Pool


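# Request the index page
def get_page_index(offset,keyword):
  # Query parameters to send
  data={
    'offset': offset,
    'format': 'json',
    'keyword': keyword,
    'autoload': 'true',
    'count': '20',
    'cur_tab': 1
  }
  # urlencode turns the dict into a query string the server understands
  url="https://www.toutiao.com/search_content/?"+urlencode(data)
  # Exception handling
  try:
    # Fetch the page
    response=requests.get(url)
    # Check that the status code indicates success
    if response.status_code==200:
      # Return the decoded page text
      return response.text
    # Otherwise return None
    return None
  except RequestException:
    # Report the failure
    print("Failed to request the index page")
    return None

# Parse the data of the requested index page
def parse_page_index(html):
  # Nothing to parse if the index request failed
  if not html:
    return
  # Load the JSON text
  data=json.loads(html)
  # Continue only if the data is non-empty and contains the 'data' key
  if data and 'data' in data.keys():
    # Yield the URL of each gallery
    for item in data.get('data'):
      yield item.get('article_url')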

# Request a gallery detail page
def get_page_detail(url):
  # Set a User-Agent so the request looks like a normal browser visit
  head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
  # Exception handling
  try:
    response=requests.get(url,headers=head)
    if response.status_code==200:
      return response.text
    return None
  except RequestException:
    print("Failed to request the detail page")
    return None

# Parse the data of a gallery detail page
def parse_page_detail(html,url):
  # Exception handling
  try:
    # Parse the HTML and extract the gallery title
    soup=BeautifulSoup(html,'lxml')
    title=soup.select('title')[0].get_text()
    print(title)

    # Use a regular expression to find the gallery data
    image_pattern = re.compile('gallery: (.*?),\n', re.S)
    result = re.search(image_pattern, html)
    if result:
      # Trim the match down to the JSON text
      result=result.group(1)
      result = result[12:]
      result = result[:-2]
      # Remove the backslashes
      result = re.sub(r'\\', '', result)
      # Load the JSON
      data = json.loads(result)
      # Continue only if the data is non-empty and contains 'sub_images'
      if data and 'sub_images' in data.keys():
        # Extract sub_images
        sub_images=data.get('sub_images')
        # Collect the image URLs
        images=[item.get('url') for item in sub_images]
        # Download the images
        for image in images:
          download_images(image)
        # Return the result as a dict
        return {
          'title':title,
          'url':url,
          'images':images
        }
  except Exception:
    # Skip pages that cannot be parsed
    pass

# Request an image URL
def download_images(url):
  # Progress message
  print('Downloading',url)
  # Browser-like headers
  head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
  # Exception handling
  try:
    response = requests.get(url, headers=head)
    if response.status_code == 200:
      # Save the image
      save_image(response.content)
    return None
  except RequestException:
    print("Failed to request the image")
    return None

# Save an image
def save_image(content):
  # Create the 街拍 folder under the current directory if it does not exist
  # (this replaces the hard-coded E:\ path and the chdir call)
  os.makedirs('街拍',exist_ok=True)
  # Path, name and extension; the name is the MD5 digest of the content
  file_path=os.path.join('街拍','{0}.{1}'.format(md5(content).hexdigest(),'jpg'))
  # Write the image; the with block closes the file
  with open(file_path,'wb') as f:
    f.write(content)

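# Main function
def main(offset):
  # Fetch the index page
  html=get_page_index(offset,'街拍')
  # Gallery URLs
  for url in parse_page_index(html):
    if url is not None:
      # Fetch the gallery detail page
      html=get_page_detail(url)
      # Parse the gallery content
      result=parse_page_detail(html,url)

if __name__ == '__main__':

  # Offsets of the pages to visit (10 values: 0, 10, ..., 90)
  group=[i*10 for i in range(10)]

  # Create a pool of worker processes
  pool=Pool()

  # Run main once per offset across the pool
  pool.map(main,group)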
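
The script names each image after the MD5 digest of its bytes, so the same picture downloaded twice ends up under one filename. A minimal sketch of that idea follows; the helper name image_filename and the byte strings are invented for illustration.

```python
from hashlib import md5


def image_filename(content, ext="jpg"):
    """Name a file after the MD5 digest of its bytes."""
    return "{0}.{1}".format(md5(content).hexdigest(), ext)


same_a = image_filename(b"fake image bytes")
same_b = image_filename(b"fake image bytes")
other = image_filename(b"different bytes")

# Identical content maps to the same name; different content does not.
print(same_a == same_b, same_a == other)
```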

GET requests with the requests module in a Python crawler, explained

import requests

# 1. Specify the URL
url = 'https://www.sogou.com/'

# 2. Send the GET request: the get method returns the response object
response = requests.get(url=url)
# 3. Read the response data: the text attribute holds the page content as a string
page_data = response.text
# 4. Persist the data; the with block closes the file
with open("sougou.html","w",encoding="utf-8") as f:
  f.write(page_data)
print("ok")

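How the requests module handles a GET request that carries parameters

Requirement: given a search term, fetch the Sogou results page for it

With urllib, Chinese characters in URL parameters had to be encoded by hand; requests encodes the URL automatically.

Sending a GET request with parameters

params can be passed as a dictionary or as a list of tuples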

def get(url, params=None, **kwargs):
  r"""Sends a GET request.
  :param url: URL for the new :class:`Request` object.
  :param params: (optional) Dictionary, list of tuples or bytes to send
    in the query string for the :class:`Request`.
  :param \*\*kwargs: Optional arguments that ``request`` takes.
  :return: :class:`Response <Response>` object
  :rtype: requests.Response
  """

import requests
# Specify the URL
url = 'https://www.sogou.com/web'


# Build the GET request parameters
params = {
  'query':'周傑倫',
  'ie':'utf-8'
}
response = requests.get(url=url,params=params)
page_text = response.text
# The with block closes the file, so no explicit close() is needed
with open("周傑倫.html","w",encoding="utf-8") as f:
  f.write(page_text)
print("ok")
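
To see what the automatic URL encoding amounts to, this sketch builds the same kind of URL by hand with the standard library. It approximates what requests does with the params argument; the exact string requests produces may differ in detail, and nothing is sent here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://www.sogou.com/web"
params = {"query": "周傑倫", "ie": "utf-8"}

# urlencode percent-encodes the UTF-8 bytes of non-ASCII values.
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Decoding the query string gives the original values back.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```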

Customizing request headers with requests and sending a GET request with parameters

The get method has a headers parameter; assign it a dictionary of request headers.

import requests
# Specify the URL
url = 'https://www.sogou.com/web'


# Build the GET request parameters
params = {
  'query':'周傑倫',
  'ie':'utf-8'
}


# Custom request headers
headers={
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
  }

response = requests.get(url=url,params=params,headers=headers)
page_text = response.text

# The with block closes the file, so no explicit close() is needed
with open("周傑倫.html","w",encoding="utf-8") as f:
  f.write(page_text)
print("ok")

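Ajax GET requests with the requests module in a Python crawler, explained

Sending an ajax GET request with the requests module

Requirement: scrape the movie details in the Douban movie category ranking at https://movie.douban.com/

Use the capture tool to catch the requests of a page that loads its content through ajax

Scrolling down the page with the mouse wheel loads more movies; this partial refresh is an ajax request sent by the current page.

Use the capture tool to catch the ajax GET request behind the refresh, i.e. the request sent when the scroll position reaches the bottom.

The URL of this GET request is the one we will request.

The ajax GET request carries parameters.

The response is no longer page HTML but a JSON string: the movie details fetched by the asynchronous request.

Pay attention to the start and limit parameters; changing them changes which movie details are returned.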

import requests
import json

# URL of the ajax GET request (found with the capture tool)
url = 'https://movie.douban.com/j/chart/top_list?'


# Parameters carried by the ajax GET request (taken from the capture tool), as a dict
param = {
  'type': '13',
  'interval_id': '100:90',
  'action': '',
  'start': '20', # start from movie number 20
  'limit': '20', # how many movies to fetch
  # changing these two parameters changes which movies are returned
}

# Custom request headers; they must be wrapped in a dict
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
}


# An ajax GET request is still sent with the get method
response = requests.get(url=url,params=param,headers=headers)
# The response body is a JSON string
data = response.text
data = json.loads(data)
for data_dict in data:
  print(data_dict["rank"],data_dict["title"])

'''
芙蓉鎮
沉靜如海
淘金記
馬戲團
情迷意亂
士兵之歌
戰爭與和平
怦然心動
大話西游之月光寶盒
羅馬假日
音樂之聲
一一
雨中曲
我愛你
莫娣
卡比利亞之夜
婚姻生活
本傑明·巴頓奇事
情書
春光乍泄
'''
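
Since start and limit decide which slice of the ranking comes back, paging through the list is a matter of stepping start by limit. The sketch below only builds the parameter dicts and sends nothing; the helper name paged_params is an assumption for illustration, and the other values are the ones from the example above.

```python
def paged_params(pages, limit=20):
    """Yield one parameter dict per page, stepping start by limit."""
    for page in range(pages):
        yield {
            'type': '13',
            'interval_id': '100:90',
            'action': '',
            'start': str(page * limit),  # index of the first movie on this page
            'limit': str(limit),         # number of movies per page
        }


starts = [p['start'] for p in paged_params(3)]
print(starts)
```

Each dict could be passed as params to requests.get together with the url and headers defined earlier.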

POST request:

#!/usr/bin/python
#coding=utf-8
# (or: #-*-coding:utf-8-*-)
# Note: this listing is Python 2 (urllib2 and the print statement)

# Import the built-in libraries
import urllib
import urllib2

# A trailing \ continues a statement on the next line
#response = \
  #urllib2.urlopen("https://hao.360.cn/?wd_xp1")
#print response.read()

request = urllib2.Request('http://www.baidu.com')
#response = urllib2.urlopen(request)

# Build the POST parameters
params={}
params['account']='jredu'
params['pwd']=''


# Encode the data
data = urllib.urlencode(params)
# Passing data makes urlopen send a POST request
response = urllib2.urlopen(request,data)
print response.url
print response.code
print response.read()

GET request:

# Import the built-in libraries (Python 2)
import urllib
import urllib2
# A trailing \ continues a statement on the next line
#response = \
  #urllib2.urlopen("https://hao.360.cn/?wd_xp1")
#print response.read()
url='http://www.baidu.com'
#response = urllib2.urlopen(request)

# Build the query parameters for the GET request
params={}
params['account']='jredu'
params['pwd']=''


# Encode the data
data = urllib.urlencode(params)
# Append the encoded parameters to the URL as a query string
request = urllib2.Request(url+"?"+data)
response = urllib2.urlopen(request)
print response.url
print response.code
print response.read()
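
The two listings above are Python 2 code (urllib2 and the print statement). Under Python 3 the same distinction can be sketched with urllib.request; the example only builds the Request objects and inspects them, so nothing is sent. The account and pwd values are the placeholders from the listings.

```python
import urllib.parse
import urllib.request

url = 'http://www.baidu.com'
params = {'account': 'jredu', 'pwd': ''}
encoded = urllib.parse.urlencode(params)

# POST: the parameters travel in the request body, which must be bytes.
post_request = urllib.request.Request(url, data=encoded.encode('utf-8'))

# GET: the parameters are appended to the URL as a query string.
get_request = urllib.request.Request(url + '?' + encoded)

print(post_request.get_method(), post_request.data)
print(get_request.get_method(), get_request.full_url)
```

Either object could then be passed to urllib.request.urlopen to actually send the request.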

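Python big data: parsing HTML pages with the lxml library, with examples

lxml is a Python library that parses HTML/XML and builds a DOM. It is powerful and performs well, and it offers an ElementTree-compatible API and can work together with html5lib and BeautifulSoup.

Before using lxml, make sure the HTML has been decoded from UTF-8, i.e. code = html.decode('utf-8', 'ignore'); otherwise parsing may go wrong. The explanation given is that Chinese text which has not been decoded shows up as escape sequences such as '\u2541', and the slash in them can be mistaken for the end of a tag.

Usage: element node operations

1. Parse the HTML and build the DOM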

from lxml import etree
dom = etree.HTML(html)

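2. Count the child elements of dom: len(dom)

3. View the content of a node: etree.tostring(dom[0])

4. Get a node's tag name: dom[0].tag

5. Get a node's parent: dom[0].getparent()

6. Get the value of one of a node's attributes: dom[0].get("attribute name")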
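
The element operations above can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in, since lxml's etree follows the same element API for these calls; getparent() is specific to lxml and is therefore not shown. The HTML fragment is invented for the example.

```python
import xml.etree.ElementTree as ET

# Small, well-formed fragment made up for illustration.
dom = ET.fromstring(
    "<html><body><p class='intro'>hello</p><p>world</p></body></html>"
)

body = dom[0]
print(len(dom))              # number of child elements of <html>
print(body.tag)              # tag name of the first child
print(body[0].get("class"))  # attribute value of the first <p>
print(ET.tostring(body[0]))  # serialized content of the first <p>
```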

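XPath support:

XPath, the XML Path Language, describes a location in an XML document in a way that resembles a directory tree, using "/" to separate levels. The first "/" stands for the document root (note: the document itself, not the outermost tag). For an HTML file, for example, the outermost node is "/html".

Ways to select elements with XPath:

1. Absolute path, e.g. page.xpath("/html/body/p"), which finds all p tags directly under the body node

2. Relative path, e.g. page.xpath("//p"), which finds all p tags anywhere in the HTML.

Ways to filter with XPath:

1. The selection is a list, so an element can be picked by index [n]

2. Filter by attribute value: p = page.xpath("//p[@style='font-size:200%']")

3. Without attributes, filter with text() (the element's text), position() (the element's position), last(), and so on

Getting attribute values

dom.xpath(".//a/@href")

Getting text

dom.xpath(".//a/text()")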
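
The path and filter styles listed above can be sketched with ElementTree's limited XPath support, again as a stand-in for lxml. ElementTree does not accept absolute paths on an element, nor the /@href and /text() steps, so attributes and text are read from the matched elements instead; with lxml the expressions from the text can be passed to xpath() directly. The fragment is invented for the example.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<p style='font-size:200%'>big</p>"
    "<p><a href='/a'>first</a></p>"
    "<p><a href='/b'>second</a></p>"
    "</body></html>"
)

all_p = page.findall(".//p")                             # every <p> in the tree
styled = page.findall(".//p[@style='font-size:200%']")   # filter by attribute value
last_p = page.findall("./body/p[last()]")                # positional filter
hrefs = [a.get("href") for a in page.iterfind(".//a")]   # attribute values
texts = [a.text for a in page.iterfind(".//a")]          # text content

print(len(all_p), styled[0].text, hrefs, texts)
```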

#!/usr/bin/python
# -*- coding:utf-8 -*-
from scrapy.spiders import Spider
from lxml import etree
from jredu.items import JreduItem
class JreduSpider(Spider):
  name = 'tt' # spider name; required and must be unique
  allowed_domains = ['sohu.com']
  start_urls = [
    'http://www.sohu.com'
  ]
  def parse(self, response):
    content = response.body.decode('utf-8')
    dom = etree.HTML(content)
    for ul in dom.xpath("//div[@class='focus-news-box']/div[@class='list16']/ul"):
      lis = ul.xpath("./li")
      for li in lis:
        item = JreduItem() # create the item object
        if ul.index(li) == 0:
          # The first entry keeps its title inside <strong>
          strong = li.xpath("./a/strong/text()")
          item['title']= strong[0]
          item['href'] = li.xpath("./a/@href")[0]
        else:
          # Other entries: use the last <a> in the list item
          la = li.xpath("./a[last()]/text()")
          item['title'] = la[0]
          item['href'] = li.xpath("./a[last()]/@href")[0]
        yield item
