An excellent in-depth write-up:
```python
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule


class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    login_page = 'http://www.domain.com/login'
    start_urls = ['http://www.domain.com/useful_page/',
                  'http://www.domain.com/another_useful_page/']

    rules = (
        # Escape the dot so only ".html" URLs match, not e.g. "-foo1html".
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'name': 'herman', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin: initialized() must be returned so
            # that the queued start_urls requests are actually scheduled.
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from the page here.
        pass
```
Note: this code snippet comes from: http://www.sharejs.com/codes/python/8544
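The snippet above targets a pre-1.0 Scrapy release; the `scrapy.contrib.*` modules and `SgmlLinkExtractor` were deprecated and later removed. As a rough sketch of the same log-in-then-crawl idea on a modern Scrapy, overriding `start_requests()` instead of using `InitSpider` (the spider name, form fields, and URLs are carried over from the example above, and the `CrawlSpider` link-following rules are omitted for brevity):

```python
import scrapy
from scrapy.http import FormRequest


class MyModernSpider(scrapy.Spider):
    name = 'myspider_modern'
    allowed_domains = ['domain.com']
    login_page = 'http://www.domain.com/login'
    start_urls = ['http://www.domain.com/useful_page/',
                  'http://www.domain.com/another_useful_page/']

    def start_requests(self):
        # Log in first; the pages we actually want are requested only
        # after the session cookie has been obtained.
        yield scrapy.Request(self.login_page, callback=self.login)

    def login(self, response):
        # Fill the login <form> found on the page and submit it.
        yield FormRequest.from_response(
            response,
            formdata={'name': 'herman', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # response.text replaces response.body for string matching on Python 3.
        if "Hi Herman" in response.text:
            self.logger.info("Successfully logged in. Let's start crawling!")
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse_item)
        else:
            self.logger.error("Login failed.")

    def parse_item(self, response):
        # Scrape data from the page here.
        pass
```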
Using headers
```python
import urllib2

# Send a custom User-Agent header with the request (Python 2 / urllib2).
request_headers = {'User-Agent': 'PeekABoo/1.3.7'}
request = urllib2.Request('http://sebsauvage.net', None, request_headers)
urlfile = urllib2.urlopen(request)
```
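`urllib2` exists only on Python 2; in Python 3 the same functionality lives in `urllib.request`. A minimal sketch of the equivalent call, reusing the example's User-Agent string:

```python
from urllib.request import Request, urlopen

# Same idea on Python 3: urllib2 was folded into urllib.request.
request_headers = {'User-Agent': 'PeekABoo/1.3.7'}
request = Request('http://sebsauvage.net', headers=request_headers)
with urlopen(request) as urlfile:
    html = urlfile.read()
```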