Create the project

```shell
scrapy startproject zhihu_login
scrapy genspider zhihu www.zhihu.com
```
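For reference, the two commands above should leave a layout roughly like the following (assuming the default Scrapy project template of that era; file names may differ slightly between versions):

```
zhihu_login/
    scrapy.cfg
    zhihu_login/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            zhihu.py    # created by genspider
```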
Write the spider
- Zhihu's login page URL is http://www.zhihu.com/#signin. To make things easier, we override start_requests
```python
# -*- coding: utf-8 -*-
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]

    def start_requests(self):
        # The return value must be an iterable of Requests
        return [scrapy.Request('http://www.zhihu.com/#signin')]

    def parse(self, response):
        print response
```
Test whether the request returns correctly; the result is

```
[scrapy] DEBUG: Retrying <GET http://www.zhihu.com/robots.txt> (failed 1 times): 500 Internal Server Error
```

After adding a USER_AGENT in settings and testing again, it returns 200, which shows the problem was Zhihu validating the browser. The request now succeeds:

```
DEBUG: Crawled (200) <GET http://www.zhihu.com/robots.txt> (referer: None)
```
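The settings change is a single line. The exact User-Agent string is not critical as long as it looks like a real browser; the Chrome string below is only an example:

```python
# settings.py
# Any realistic browser User-Agent works; this Chrome string is one example
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
```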
- Determine which parameters the POST needs. Using the browser developer tools, the POST fields are as follows (in the case where no captcha appears):

```
_xsrf (can be found in the HTML)
email
password
remember_me
```
- Define a login function for the POST login. The code below extracts the value of _xsrf

```python
# -*- coding: utf-8 -*-
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]

    def start_requests(self):
        # The return value must be an iterable of Requests
        return [scrapy.Request('http://www.zhihu.com/#signin', callback=self.login)]

    def login(self, response):
        print '-------'  # for easier debugging
        _xsrf = response.xpath(".//*[@id='sign-form-1']/input[2]/@value").extract()[0]
        print _xsrf
```
Log in with FormRequest

```python
    def login(self, response):
        print '-------'  # for easier debugging
        _xsrf = response.xpath(".//*[@id='sign-form-1']/input[2]/@value").extract()[0]
        print _xsrf
        return [scrapy.FormRequest(
            url='http://www.zhihu.com/login/email',  # the real POST endpoint
            formdata={
                '_xsrf': _xsrf,
                'email': 'xxxxxxxx',     # email
                'password': 'xxxxxxxx',  # password
                'remember_me': 'true',
            },
            headers=self.headers,
            callback=self.check_login,
        )]
```
- Check whether the login succeeded. Zhihu's response is JSON; if the r field in it is 0, the login was successful

```python
    def check_login(self, response):
        if json.loads(response.body)['r'] == 0:
            yield scrapy.Request(
                'http://www.zhihu.com',
                headers=self.headers,
                callback=self.page_content,
                # This URL was already requested once, so set dont_filter=True
                # (default False) or the duplicate filter drops the request
                dont_filter=True,
            )
```
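The r == 0 check itself can be exercised without a live request. A small sketch with made-up response bodies in the format described above:

```python
import json

# Simulated login responses in the assumed format: r == 0 means success
success_body = '{"r": 0, "msg": "ok"}'
failure_body = '{"r": 1, "errcode": 100005}'

def login_succeeded(body):
    """Return True when the JSON login response reports r == 0."""
    return json.loads(body).get('r') == 0
```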
Complete spider code

```python
# -*- coding: utf-8 -*-
import scrapy
import json


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    headers = {
        'Host': 'www.zhihu.com',
        'Referer': 'http://www.zhihu.com',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    }

    def start_requests(self):
        # The return value must be an iterable of Requests
        return [scrapy.Request('http://www.zhihu.com/#signin', callback=self.login)]

    def login(self, response):
        print '-------'  # for easier debugging
        _xsrf = response.xpath(".//*[@id='sign-form-1']/input[2]/@value").extract()[0]
        print _xsrf
        return [scrapy.FormRequest(
            url='http://www.zhihu.com/login/email',  # the real POST endpoint
            formdata={
                '_xsrf': _xsrf,
                'email': 'xxxxxxxx',     # email
                'password': 'xxxxxxxx',  # password
                'remember_me': 'true',
            },
            headers=self.headers,
            callback=self.check_login,
        )]

    def check_login(self, response):
        if json.loads(response.body)['r'] == 0:
            yield scrapy.Request(
                'http://www.zhihu.com',
                headers=self.headers,
                callback=self.page_content,
                dont_filter=True,
            )

    def page_content(self, response):
        with open('first_page.html', 'wb') as f:
            f.write(response.body)
        print 'done'
```
Note: I have only just started learning Scrapy and don't yet know how to handle the case where a captcha appears; pointers from more experienced users are welcome.