Scrapy Crawler Learning Series, Part 7: Solutions to Common Scrapy Problems

This article is a repost. 2017-11-25 11:12. Tags: crawler / python / scrapy

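1 Common errors

1.1 Error: ImportError: No module named win32api

Official reference: https://doc.scrapy.org/en/latest/faq.html#scrapy-crashes-with-importerror-no-module-named-win32api

The official reference links to the win32 package (pywin32). Download and install it and the error goes away.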

1.2 DEBUG: Forbidden by robots.txt: <GET https://www.baidu.com>

Official reference: https://doc.scrapy.org/en/latest/topics/settings.html#robotstxt-obey

Set ROBOTSTXT_OBEY = False in settings.py.
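To see why the request is refused in the first place, the kind of check behind that DEBUG message can be reproduced with the standard library's urllib.robotparser. This is a minimal sketch and not Scrapy's own code; the robots.txt content, the bot name and the helper is_allowed are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that disallows everything refuses the request ...
print(is_allowed(ROBOTS_TXT, "mybot", "https://www.example.com/"))
# ... while an empty robots.txt allows it.
print(is_allowed("", "mybot", "https://www.example.com/"))
```

With ROBOTSTXT_OBEY = False, Scrapy skips this kind of check altogether.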

1.3 XPath returns no results when scraping an XML document

Official reference: https://doc.scrapy.org/en/latest/faq.html#i-m-scraping-a-xml-document-and-my-xpath-selector-doesn-t-return-any-items

response.selector.remove_namespaces()
response.xpath("//link")

Normally there is no need to call remove_namespaces(). Try it only when the XPath query returns no data.
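The underlying cause is that an element inside a default XML namespace does not match its plain name. The sketch below shows the same effect with the standard library's xml.etree.ElementTree rather than Scrapy selectors; the small Atom-style document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A tiny Atom-style document, made up for this illustration.
XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://www.example.com/a"/>
  <link href="http://www.example.com/b"/>
</feed>"""

root = ET.fromstring(XML)

# The plain name matches nothing, because every element lives in the Atom namespace.
plain = root.findall(".//link")

# Qualifying the name with the namespace URI finds both elements.
qualified = root.findall(".//{http://www.w3.org/2005/Atom}link")

print(len(plain), len(qualified))
```

Removing the namespaces, as remove_namespaces() does in Scrapy, lets the plain name match again without having to qualify every step of the path.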

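1.4 Garbled response text (wrong encoding)

Official reference: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.encoding

1 Pass encoding to the response constructor (the encoding argument of TextResponse)

2 Declare it in the HTTP header (Content-Type)

3 Declare it in the response body

4 Re-decode the received response yourself. This is the last resort.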

def parse(self, response):
    # Which encoding to use depends on the page's actual encoding
    response = response.replace(encoding="gbk")

    # TODO: extract an item
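Why the encoding matters can be seen without Scrapy at all. In this minimal sketch the byte string stands in for a GBK-encoded page body (an assumption for the illustration), and decoding it with the wrong codec yields garbage rather than an error.

```python
# Bytes as a server might send them for a GBK-encoded page.
body = "中文".encode("gbk")

# Decoding with the wrong codec does not fail here; it silently
# produces different characters (mojibake).
wrong = body.decode("latin-1")

# Decoding with the page's real encoding recovers the text.
right = body.decode("gbk")

print(repr(wrong), repr(right))
```

Calling response.replace(encoding="gbk") asks Scrapy to apply the second kind of decoding to the response body.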

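2 Common solutions

2.1 Emailing the number of scraped items to a given user

The official documentation covers email: https://doc.scrapy.org/en/latest/topics/email.html

A fellow blogger's article that uses Scrapy's mail module: http://blog.csdn.net/you_are_my_dream/article/details/60868329

I tried sending mail with Scrapy's mail module. The log reported success, but the message never arrived and I could not work out why, so I switched to smtplib. Modify pipelines.py as follows: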

class MailPipeline(object):
    def __init__(self):
        self.count = 0

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        self.count = self.count + 1
        # Always return the item; otherwise later pipelines receive nothing to process.
        return item

    def close_spider(self, spider):
        import smtplib
        from email.mime.text import MIMEText
        _user = "1072892917@qq.com"
        # Not the login password but the SMTP authorization code.
        # See http://blog.csdn.net/you_are_my_dream/article/details/60868329
        _pwd = "xxxxxxxx"
        _to = "1072892917@qq.com"

        msg = MIMEText("Test")
        msg["Subject"] = str(self.count)   # send the number of scraped items as the subject
        msg["From"] = _user
        msg["To"] = _to
        try:
            # See http://service.mail.qq.com/cgi-bin/help?subtype=1&no=167&id=28
            s = smtplib.SMTP_SSL("smtp.qq.com", 465)
            s.login(_user, _pwd)
            s.sendmail(_user, _to, msg.as_string())
            s.quit()
            print("Success!")
        except smtplib.SMTPException as e:
            print("Failed, %s" % e)
import json
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

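Modify settings.py as follows:

ITEM_PIPELINES = {
    # The numbers 302 and 303 set the order: the lower the number, the earlier an item passes through that pipeline.
    'quotesbot.pipelines.MailPipeline': 302,
    'quotesbot.pipelines.JsonWriterPipeline': 303,
}

With this setup, each scraped item first passes through MailPipeline, which counts it, and then through JsonWriterPipeline, which persists it. The email with the final count is sent when the spider closes. You can also change the pipeline order and send the persisted file as an email attachment.

Note: Scrapy's mail module is built on Twisted's mail support and uses non-blocking IO, so it is asynchronous.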
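The attachment idea could look roughly like the following. This is a minimal sketch using only the standard library's email package; the helper name build_report, the addresses and the sample bytes are hypothetical, and it only builds the message, leaving the sending to smtplib as in MailPipeline.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_report(sender, recipient, item_count, attachment_name, attachment_bytes):
    """Build (but do not send) a message that reports item_count and carries a file."""
    msg = MIMEMultipart()
    msg["Subject"] = "Scraped %d items" % item_count
    msg["From"] = sender
    msg["To"] = recipient
    msg.attach(MIMEText("The exported items are attached."))

    # Attach the exported file as a generic binary part.
    part = MIMEApplication(attachment_bytes)
    part.add_header("Content-Disposition", "attachment", filename=attachment_name)
    msg.attach(part)
    return msg


# Placeholder addresses and file content, for illustration only.
msg = build_report("sender@example.com", "to@example.com", 3,
                   "items.jl", b'{"a": 1}\n')
print(msg["Subject"])
```

In a pipeline, attachment_bytes would be read from the exported file, assuming that file has already been flushed or closed by the time the mail is built.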

2.2 Using BeautifulSoup in Scrapy

Scrapy official reference: https://doc.scrapy.org/en/latest/faq.html#can-i-use-scrapy-with-beautifulsoup

bs4 official reference (English): https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

bs4 official reference (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

2.3 Scraping an item whose attributes are spread across several pages

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-scrape-an-item-with-attributes-in-different-pages

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

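The partially filled item is handed to the follow-up request through the request's meta. The callback of the last request yields the finished item.

2.4 Scraping a page that requires login

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-simulate-a-user-login-in-my-spider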

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.text:  # response.body is bytes on Python 3, so test the decoded text
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

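This uses FormRequest to submit the username and password and receive the server's response. Inspect that response: if the login succeeded, continue scraping; if it failed, stop.

2.5 Running a spider without creating a project

Official reference: https://doc.scrapy.org/en/latest/faq.html#can-i-run-a-spider-without-creating-a-project

Official reference: https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script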

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

2.6 The simplest way to store scraped data

Official reference: https://doc.scrapy.org/en/latest/faq.html#simplest-way-to-dump-all-my-scraped-items-into-a-json-csv-xml-file

scrapy crawl myspider -o items.json
scrapy crawl myspider -o items.csv
scrapy crawl myspider -o items.xml
scrapy crawl myspider -o items.jl

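This is the quickest method, but there is one catch: by default the JSON output escapes non-ASCII characters as \uXXXX sequences, so Chinese text is not readable in the file. We want UTF-8 output instead.

1. Set FEED_EXPORT_ENCODING = 'utf-8' in the settings.

2. See my post on exporting items in various formats: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html

Both methods work. I recommend the second because it is easier to extend.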
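The difference between escaped and UTF-8 JSON output can be seen with the standard library's json module. This is only an analogy for what the feed export setting changes, not Scrapy's exporter code, and the sample item is made up.

```python
import json

item = {"title": "中文"}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are, ready to be
# written to a UTF-8 file.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the readability of the file differs.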

2.7 Stopping the spider when a condition is met

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself

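from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    if b'Bandwidth exceeded' in response.body:  # response.body is bytes
        raise CloseSpider('bandwidth_exceeded')

To stop after a given number of items has been scraped, you can do the following: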

# -*- coding: utf-8 -*-
import scrapy

from scrapy.exceptions import CloseSpider

class ToScrapeSpiderXPath(scrapy.Spider):

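    def __init__(self, *args, **kwargs):
        super(ToScrapeSpiderXPath, self).__init__(*args, **kwargs)
        self.count = 0        # number of items seen so far
        self.max_count = 100  # maximum number of items to scrape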
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            self.count =self.count +1
            if self.count > self.max_count:
                raise CloseSpider('max_count_reached')
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

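Of course, you can also call self.crawler.stop().

Scrapy also ships a built-in extension that closes the spider when certain conditions are met. See https://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider

A brief summary of this extension's settings:

CLOSESPIDER_TIMEOUT: close the spider once it has been open longer than the given number of seconds.
CLOSESPIDER_ITEMCOUNT: close the spider once the given number of items has passed through the item pipeline. Requests already in the downloader queue (up to CONCURRENT_REQUESTS) are still processed, so the final item count can exceed the threshold.
CLOSESPIDER_PAGECOUNT: close the spider once the given number of pages has been crawled.
CLOSESPIDER_ERRORCOUNT: close the spider once the given number of errors has been caught.

If the built-in conditions do not suit you, you can write your own extension. See: https://doc.scrapy.org/en/latest/topics/extensions.html#writing-your-own-extension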
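As a settings fragment, the four options above could be set like this. The values are purely illustrative assumptions, not recommendations, and leaving an option at 0 keeps that condition disabled.

```python
# settings.py (illustrative values only)
CLOSESPIDER_TIMEOUT = 3600     # close after one hour (seconds)
CLOSESPIDER_ITEMCOUNT = 100    # close after 100 items have passed the pipeline
CLOSESPIDER_PAGECOUNT = 500    # close after 500 crawled responses
CLOSESPIDER_ERRORCOUNT = 10    # close after 10 errors
```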

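2.7 Avoiding getting banned

Official reference: https://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

1 Keep a list of user agents and pick one from it for each request.

2 Disable cookies. Some sites use cookies to spot bots.

3 Use a download delay (download_delay). Some sites limit the number of requests per unit of time and ban clients that send too many.

4 If possible, use Google cache instead of requesting the site directly.

5 Use a pool of IPs, for example ProxyMesh or scrapoxy.

6 Use a highly distributed downloader, for example Crawlera.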
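The first tip can be sketched as a small downloader middleware. This is a minimal sketch under stated assumptions: the class name RandomUserAgentMiddleware and the user agent strings are placeholders made up for the illustration, and the only Scrapy interface relied on is the process_request(request, spider) hook.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader-middleware sketch: set a random User-Agent on each request."""

    # Placeholder strings for illustration; use real browser strings in practice.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    ]

    def process_request(self, request, spider):
        # Overwrite the header with a randomly chosen user agent.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To use something like this, the class would be registered in the DOWNLOADER_MIDDLEWARES setting under your own project's module path. Tips 2 and 3 need no code: they correspond to the COOKIES_ENABLED and DOWNLOAD_DELAY settings.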

2.8 Passing arguments when starting a spider

Official reference: https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]

Then run the spider like this:

scrapy crawl myspider -a category=electronics

2.9 Modifying the pipeline to support export in multiple formats

Official reference: https://doc.scrapy.org/en/latest/topics/exporters.html#using-item-exporters

Blog reference (my own): http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html

Reference project: https://github.com/zhaojiedi1992/ScrapyCnblogs

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import signals
from scrapy.exporters import *
import logging
logger=logging.getLogger(__name__)
class BaseExportPipeLine(object):
    def __init__(self,**kwargs):
        self.files = {}
        self.exporter=kwargs.pop("exporter",None)
        self.dst=kwargs.pop("dst",None)
        self.option=kwargs
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open(self.dst, 'wb')
        self.files[spider] = file
        self.exporter = self.exporter(file,**self.option)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

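# Options accepted by the pipelines below:
# 'fields_to_export': ["url", "edit_url", "title"]  export only these fields; supported by all the pipelines below
# 'export_empty_fields': False  whether to export empty fields; supported by all the pipelines below
# 'encoding': 'utf-8'  output encoding; supported by all the pipelines below
# 'indent': 1  indentation; used below by JsonExportPipeline
# "item_element": "item"  name of each item's XML element; XmlExportPipeline only, producing <item></item>
# "root_element": "items"  name of the XML root element; XmlExportPipeline only, producing <items>...many items...</items>
# "include_headers_line": True  whether to write the header row of field names; CsvExportPipeline only
# "join_multivalued": ","  string used to join multi-valued fields inside one CSV cell (not the column delimiter); CsvExportPipeline only
# 'protocol': 2  pickle protocol; PickleExportPipeline only
# "dst": "items.json"  destination file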
class JsonExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":JsonItemExporter,"dst":"items.json","encoding":"utf-8","indent":4,}
        super(JsonExportPipeline, self).__init__(**option)
class JsonLinesExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":JsonLinesItemExporter,"dst":"items.jl","encoding":"utf-8"}
        super(JsonLinesExportPipeline, self).__init__(**option)
class XmlExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":XmlItemExporter,"dst":"items.xml","item_element":"item","root_element":"items","encoding":'utf-8'}
        super(XmlExportPipeline, self).__init__(**option)
class CsvExportPipeline(BaseExportPipeLine):
    def __init__(self):
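        # Setting the separator this way did not work in my tests (join_multivalued joins multi-valued fields; it is not the column delimiter)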
        option={"exporter":CsvItemExporter,"dst":"items.csv","encoding":"utf-8","include_headers_line":True, "join_multivalued":","}
        super(CsvExportPipeline, self).__init__(**option)
class  PickleExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":PickleItemExporter,"dst":"items.pickle",'protocol':2}
        super(PickleExportPipeline, self).__init__(**option)
class  MarshalExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":MarshalItemExporter,"dst":"items.marshal"}
        super(MarshalExportPipeline, self).__init__(**option)
class  PprintExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":PprintItemExporter,"dst":"items.pprint.jl"}
        super(PprintExportPipeline, self).__init__(**option)
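To take effect, these pipelines still have to be enabled in settings.py, just like the MailPipeline example in section 2.1. A minimal sketch follows; the package name myproject is a placeholder for your own project, and the priorities are arbitrary illustrative values.

```python
# settings.py ("myproject" is a placeholder for your own project package)
ITEM_PIPELINES = {
    "myproject.pipelines.JsonExportPipeline": 300,
    "myproject.pipelines.CsvExportPipeline": 301,
}
```

Enable only the formats you need; each enabled pipeline writes its own file.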
