Common Login Techniques for Python Crawlers (3): Logging In with HTTP Requests


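As the previous post showed, logging in through a browser is fairly simple and needs little analysis. Logging in with raw HTTP requests is the opposite: it is mostly analysis.

The workflow for developing a request-based login program:

    Analyze the request -> simulate the request -> test the login -> adjust the parameters -> test the login -> login succeeds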

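1. Analyze the page

Start from the web page. Open the cnblogs (博客園) login page, press F12 to bring up the developer tools, select the Network tab, and then log in. After a successful login you will see roughly the following requests:

The circled signin request is clearly the login request. Other sites call it login or something similar; the idea is the same.

Let's take a closer look at this request.

The main things to notice: it uses the POST method, it carries a long list of request headers, and it has three parameters.

Start with the parameters. Based on the page analysis in the previous post, we can guess that input1 and input2 are the username and password,

and remember is probably the "remember me" checkbox. A few trial logins would confirm this; for now we hard-code it to false.

Next, copy every parameter from the browser verbatim, replay the request, and look at the result: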

import requests

session = requests.Session()

url = "https://passport.cnblogs.com/user/signin"

data = {
    "input1": "MXBZobfesF1W+pRwgRdyYtqYIGjMrL3jq/cCgRdA10Mn5WTe/stf/WoTtSfLMzHj72LtU9+A8xvR6mrENzUM+8IJllSrCpqXvgLgInBVQYpc4PTYfrYswrR3WL4oRu+5wUvUUSYGUFVDbHjPIXLk63WCbJs6uCCCXtReGoHQgSA=",
    "input2": "kXm57UelqJrj3FUy/oyzGt8sfSiU8vdbU59kBTTtCFhGBlnpY2SylhJ3jRr2ayFyIFwsg20DC9UBWxI9P85C4otnXbpknulA56AUYcTGsbaPSewX2+gU9+3+5LpKRxQFnufW+fkP5oiVESj/uV/9WeONAqaU52Z7UsNgxvr/L3Q=",
    "remember": False
}

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
    "Host": "passport.cnblogs.com",
    "Origin": "https://passport.cnblogs.com",
    "Referer": "https://passport.cnblogs.com/user/signin?ReturnUrl=https%3A%2F%2Fwww.cnblogs.com%2F",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
    "VerificationToken": "Pl9U45ZjLRvnUGroHRmgwdKmWv8OzORHGEU6PjAuj1yyLXQjAqBIYjfNtIh-lMMQJgyzbFRMh4TbvAQNnq0uD3Qcj9k1:nxi3tgeeeOYyz7REolByuYtTow8Qw0AYQElwZ5vIj5oUJr-Tna1n2wG8WLaVNOIFNCyx_eiI2tWM9m2nsbUM9BJol881",
    "Upgrade-Insecure-Requests": "1",
    "X-Requested-With": "XMLHttpRequest",
    'Content-Type': 'application/x-www-form-urlencoded'
}
r = session.post(url, headers=headers, data=data)
print(r)
print(r.text)

The code above uses only basic requests features. The printed content is still the login page, which means the login failed.
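To make it easier to see what actually goes over the wire with a Content-Type of application/x-www-form-urlencoded, here is a small standard-library sketch. It assumes that requests encodes a dict passed as data= in the same style as urllib.parse.urlencode, and the short values are hypothetical stand-ins for the long strings above.

```python
from urllib.parse import urlencode

# Hypothetical short stand-ins for the long base64 values in the post.
form = {"input1": "ab+/c=", "input2": "xyz", "remember": False}

# urlencode() builds an application/x-www-form-urlencoded body.
# The base64 characters '+', '/' and '=' are percent-encoded.
body = urlencode(form)
print(body)  # input1=ab%2B%2Fc%3D&input2=xyz&remember=False
```

Note that Python's False ends up as the text False rather than false. If a server turned out to be strict about this, passing the string "false" would be one thing to try.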

 

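2. Analyze the encryption

Look at the format of the input values: they appear to be encrypted. Copy one and run it through an online base64 decoder first, in the faint hope that it is nothing more than base64 encoding.

It turns out the username and password are not simply base64-encoded: the decoded result is pure gibberish. You can try it yourself.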
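The same check can be done locally instead of on a web page. The sketch below is only an illustration: inspect_base64 is a made-up helper, and the "opaque" value is hypothetical data standing in for a captured input1, not the real one.

```python
import base64


def inspect_base64(value):
    """Decode a base64 string; return (byte length, is printable ASCII)."""
    raw = base64.b64decode(value)
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        # Bytes outside ASCII: not plain text that was merely encoded.
        return len(raw), False
    return len(raw), text.isprintable()


# A username that was only base64-encoded decodes back to readable text.
plain = base64.b64encode(b"Masako").decode("ascii")
print(inspect_base64(plain))

# Stand-in for ciphertext: 128 arbitrary non-ASCII bytes (hypothetical).
opaque = base64.b64encode(bytes(range(128, 256))).decode("ascii")
print(inspect_base64(opaque))
```

A decoded length of exactly 128 bytes is itself a hint, since that is the ciphertext size of a 1024-bit RSA key.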

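Because I have recently been reverse-engineering the login schemes of various sites, my first guess was that the values are RSA-encrypted before being encoded. But how would you work that out without that experience?

Go back to the page, open the Elements panel, and search (Ctrl+F) for input1. It appears in more places than the login form; looking at the other hits, most are inside a <script> in the page header.

Find this one and you can see that the page simply encrypts the values in JavaScript. If you are interested, look at the JS code (https://passport.cnblogs.com/scripts/jsencrypt.min.js): it is RSA.

It also makes clear that the remember parameter is indeed the "log in automatically next time" checkbox on the page.

In Python we can use the rsa module in place of the JavaScript encryption.

Copy the public key from the page and use it as the key, fill in your own username and password, then encrypt with rsa and encode with base64: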

import rsa
import base64
from web_encrypt import str2key

username = "Masako"
password = "123456"

rsa_str = "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB"
pub_key = str2key(rsa_str)
modulus = int(pub_key[0], 16)
exponent = int(pub_key[1], 16)
key = rsa.PublicKey(modulus, exponent)
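# rsa.encrypt() expects bytes, so encode the strings first
encrypt_name = rsa.encrypt(username.encode("utf-8"), key)
encrypt_pw = rsa.encrypt(password.encode("utf-8"), key)

# base64-encode the ciphertext to match the format of the captured values
input1 = base64.b64encode(encrypt_name).decode("ascii")
input2 = base64.b64encode(encrypt_pw).decode("ascii")

print(input1)
print(input2)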

The output looks like this:

Q/+Aq2og1LeCQDPqVbfhUohK3R+hu0CTcCajTJC1mO/GqxSHWqUx2mrMMt3GJrSZ+Ip66dIh+0RpKbRPyk1Sqj/MV1+SL00HUQSgwZOdlQBl+gQfYEq6RSqjw2Id4gHXgb5TcG63Q8r2TEoEWk9Yi45sx2rbARG/2FuRZqYg8zQ=
nFVRcbBqj7OcPHvIoznWrGUOfhq83rfN0f/nEBG/B+lSON6hUAnHCkwHg5S5nkOo+Avv7F1NrxskV/JI+ysbFHskjPp+T24X/vcjIj8VH68qW5u+4EtrQJGomOgefkXdKeA+A1eu7cAeZqDdGgf4d/Rb43A6S+dahvoGJSqiN1I=

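The format is now very close to the captured values; this is in fact exactly the data that needs to go into the request. (The exact characters differ on every run because rsa.encrypt pads the message with random bytes, so the output will never match a captured value character for character.)

The str2key() function in the code above is one I wrote myself. Its job is to convert the RSA public key string from the page into a form Python can use.

I searched for a long time without finding a good way to do this, so I wrote my own; I'll cover it in a later post when I have time.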
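The post does not show str2key(), so the following is not the author's implementation. It is a minimal standard-library sketch of one possible way to get the modulus and exponent, assuming the string on the page is a base64-encoded DER SubjectPublicKeyInfo structure. The names read_tlv and parse_public_key are made up for this illustration, and unlike str2key() the sketch returns integers rather than hex strings.

```python
import base64

# The public key string copied from the login page (as used in the post).
PAGE_KEY = "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB"


def read_tlv(data, pos):
    """Return (tag, value_bytes, next_pos) for the DER element at pos."""
    tag = data[pos]
    length = data[pos + 1]
    pos += 2
    if length & 0x80:
        # Long form: the low 7 bits give the number of length bytes.
        count = length & 0x7F
        length = int.from_bytes(data[pos:pos + count], "big")
        pos += count
    return tag, data[pos:pos + length], pos + length


def parse_public_key(b64_key):
    """Return (modulus, exponent) as ints from a base64 DER public key."""
    der = base64.b64decode(b64_key)
    _, spki, _ = read_tlv(der, 0)            # outer SEQUENCE
    _, _, pos = read_tlv(spki, 0)            # algorithm identifier, skipped
    _, bit_string, _ = read_tlv(spki, pos)   # BIT STRING wrapping the key
    _, rsa_seq, _ = read_tlv(bit_string, 1)  # skip the "unused bits" byte
    _, n_bytes, pos = read_tlv(rsa_seq, 0)   # INTEGER: modulus
    _, e_bytes, _ = read_tlv(rsa_seq, pos)   # INTEGER: public exponent
    return int.from_bytes(n_bytes, "big"), int.from_bytes(e_bytes, "big")


modulus, exponent = parse_public_key(PAGE_KEY)
print(modulus.bit_length(), exponent)
```

The two integers could then be passed straight to rsa.PublicKey(modulus, exponent). The rsa package also has a loader for this key format, PublicKey.load_pkcs1_openssl_der, which may be simpler if its extra dependency is available.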

 

3. Integrate and debug

Now combine steps 1 and 2: pass the encrypted values from step 2 into the request from step 1 and see whether it works.

#!/usr/bin/python
# encoding: utf-8

import rsa
import base64
import requests
from web_encrypt import str2key


def login(username, password):
    rsa_str = "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB"
    pub_key = str2key(rsa_str)
    modulus = int(pub_key[0], 16)
    exponent = int(pub_key[1], 16)
    key = rsa.PublicKey(modulus, exponent)
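    # rsa.encrypt() expects bytes, so encode the strings first
    encrypt_name = rsa.encrypt(username.encode("utf-8"), key)
    encrypt_pw = rsa.encrypt(password.encode("utf-8"), key)

    input1 = base64.b64encode(encrypt_name).decode("ascii")
    input2 = base64.b64encode(encrypt_pw).decode("ascii")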

    session = requests.Session()

    url = "https://passport.cnblogs.com/user/signin"

    data = {
        "input1": input1,
        "input2": input2,
        "remember": False
    }

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "passport.cnblogs.com",
        "Origin": "https://passport.cnblogs.com",
        "Referer": "https://passport.cnblogs.com/user/signin?ReturnUrl=https%3A%2F%2Fwww.cnblogs.com%2F",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
        "VerificationToken": "Pl9U45ZjLRvnUGroHRmgwdKmWv8OzORHGEU6PjAuj1yyLXQjAqBIYjfNtIh-lMMQJgyzbFRMh4TbvAQNnq0uD3Qcj9k1:nxi3tgeeeOYyz7REolByuYtTow8Qw0AYQElwZ5vIj5oUJr-Tna1n2wG8WLaVNOIFNCyx_eiI2tWM9m2nsbUM9BJol881",
        "Upgrade-Insecure-Requests": "1",
        "X-Requested-With": "XMLHttpRequest",
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    r = session.post(url, headers=headers, data=data)
    print(r)
    print(r.text)

if __name__ == "__main__":
    username = "Masako"
    password = "*****"
    login(username, password)

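The code is getting longer, so I wrapped it in a function.

Try it, and it still fails.

Look at the signin request again. Apart from the usual headers, it has no cookie or anything else that identifies the user, nothing unusual... wait, there is one very unfamiliar field.

What is VerificationToken? We seem to have overlooked it. Log in a few more times and compare the recorded requests: the value is different every time, so it is generated dynamically.

So how do we obtain this changing value? I usually look in two places: 1. check whether the Network log shows a separate request that fetches it; 2. check whether the value appears somewhere in the page itself.

Here, the field can be found in the page:

It sits right below the encryption code from step 2.

We could have spotted this parameter back in step 1 when analyzing the request, but through inexperience or carelessness we are only fixing it now.

This ajax call is the code that builds the login request. As you can see, it sets a headers object too.

I chose to fetch the page and pull the field out with a regular expression.

The code is as follows: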

import re
import requests

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
    "Host": "passport.cnblogs.com",
    "Origin": "https://passport.cnblogs.com",
    "Referer": "https://passport.cnblogs.com/user/signin?ReturnUrl=https%3A%2F%2Fwww.cnblogs.com%2F",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
}

url = "https://passport.cnblogs.com/user/signin"

session = requests.Session()

r = session.get(url, headers=headers)

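# Match on r.text (the decoded body); the token sits between
# 'VerificationToken': and the closing brace of the headers object.
tmp = re.findall("'VerificationToken':(.*?)}", r.text, re.S)
token = tmp[0].strip()
token = token.strip("'\r\n")
print(token)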
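The extraction step can be tried offline on a page fragment. In the sketch below, SAMPLE_PAGE is a hypothetical snippet written to resemble the ajax call described above (the real page markup is not reproduced in this post), and extract_token is a made-up helper that uses a slightly stricter pattern and returns None instead of raising when nothing matches.

```python
import re

# Hypothetical page fragment; the real login page may differ.
SAMPLE_PAGE = """
$.ajax({
    url: ajax_url,
    type: 'post',
    headers: {
        'Content-Type': 'application/json; charset=utf-8',
        'VerificationToken': 'abc123:def456'
    },
});
"""


def extract_token(html):
    """Return the quoted VerificationToken value, or None if absent."""
    match = re.search(r"'VerificationToken'\s*:\s*'([^']+)'", html)
    return match.group(1) if match else None


print(extract_token(SAMPLE_PAGE))
print(extract_token("<html></html>"))
```

Returning None makes a missing token easy to detect, which is useful because the page layout can change at any time.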

Add the token-fetching code to the login program

and pass the token in the headers of the login request:

#!/usr/bin/python
# encoding: utf-8

import re
import rsa
import base64
import requests
from web_encrypt import str2key


def login(username, password):

    rsa_str = "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB"
    pub_key = str2key(rsa_str)
    modulus = int(pub_key[0], 16)
    exponent = int(pub_key[1], 16)
    key = rsa.PublicKey(modulus, exponent)
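    # rsa.encrypt() expects bytes, so encode the strings first
    encrypt_name = rsa.encrypt(username.encode("utf-8"), key)
    encrypt_pw = rsa.encrypt(password.encode("utf-8"), key)

    input1 = base64.b64encode(encrypt_name).decode("ascii")
    input2 = base64.b64encode(encrypt_pw).decode("ascii")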

    session = requests.Session()

    url = "https://passport.cnblogs.com/user/signin"

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "passport.cnblogs.com",
        "Origin": "https://passport.cnblogs.com",
        "Referer":"https://passport.cnblogs.com/user/signin?ReturnUrl=https%3A%2F%2Fwww.cnblogs.com%2F",
        "User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
    }

    data = {
        "input1": input1,
        "input2": input2,
        "remember": False
    }

    before = session.get(url, headers=headers)
    tmp = re.findall("'VerificationToken':(.*?)}", before.text, re.S)
    token = tmp[0].strip()
    token = token.strip("'\r\n")

    headers["VerificationToken"] = token
    headers["X-Requested-With"] = "XMLHttpRequest"
    r = session.post(url, headers=headers, data=data)
    print(r)
    print(r.text)

if __name__ == "__main__":
    username = "Masako"
    password = "*****"
    login(username, password)

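With the correct username and password filled in, it returns:

{"success":true}

which means the login succeeded. Done!

Save r.cookies and you can access the cnblogs content that requires a login.

You can also simply keep using this logged-in session.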

