Practical Scrapy: batch-downloading your own cnblogs articles

Reposted article. 2017-04-02 21:55 · Python

First, define a few fields in items.py to hold the page data (URL, title, and HTML source), as shown below:

import scrapy


class MycnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page_title = scrapy.Field()
    page_url = scrapy.Field()
    page_html = scrapy.Field()

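The most important part is the spider. It inherits from CrawlSpider, which lets us define regular-expression rules that tell the crawler which pages to fetch, for example the next-page links and the individual articles.

In the spider, the parse_item method parses each target page to obtain the article's URL, title, and content.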

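Note: in parse_item we add a base tag to the downloaded HTML source. Relative links in the saved file then resolve against cnblogs, so the page opens with the cnblogs CSS styles instead of a broken layout.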
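The plain string replacement used in the spider only matches a bare, lowercase <head> tag. As a minimal sketch, assuming some pages might put attributes on the tag or write it in uppercase, a regex-based helper could insert the base element more tolerantly. The name add_base_tag is hypothetical and not part of the original project.

```python
import re

# Matches the opening <head> tag, with or without attributes, in any letter case.
# It does not match other tags that merely start with "head", such as <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_base_tag(html, base_url="http://www.cnblogs.com/"):
    """Insert a <base> element right after the first opening <head> tag.

    Hypothetical helper for illustration; html is returned unchanged
    when no <head> tag is found.
    """
    base = "<base href='%s'>" % base_url
    # count=1 so that only the first match is touched.
    return HEAD_OPEN.sub(lambda m: m.group(0) + base, html, count=1)
```

In parse_item this would stand in for the html.replace(...) call, for example html = add_base_tag(html).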

The spider source is as follows:

# -*- coding: utf-8 -*-
from mycnblogs.items import MycnblogsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CnblogsSpider(CrawlSpider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/hongfei/']
    rules = (
        # Follow the pagination links; with no callback, follow defaults to True
        Rule(LinkExtractor(allow=(r'default\.html\?page=\d+',))),
        # Crawl every article and parse it with parse_item to get its URL, title, and content
        Rule(LinkExtractor(allow=(r'hongfei/p/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'hongfei/articles/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'hongfei/archive/\d+/\d+/\d+/\d+\.html',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = MycnblogsItem()
        item['page_url'] = response.url
        item['page_title'] = response.xpath("//title/text()").extract_first()
        html = response.body.decode("utf-8")
        html = html.replace("<head>", "<head><base href='http://www.cnblogs.com/'>")
        item['page_html'] = html
        yield item
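The allow patterns in the rules are regular expressions that, as far as I know, LinkExtractor searches for anywhere in each absolute URL. A quick way to see which rule a URL would fall under, without running a crawl, is to try the patterns with the standard re module. This is only a sketch: classify is a hypothetical helper, and every sample URL except the article address cited in this post is made up for illustration.

```python
import re

# Patterns from the spider rules, written here as raw strings
# with the literal dots escaped.
PATTERNS = {
    "next_page": r"default\.html\?page=\d+",
    "post": r"hongfei/p/",
    "article": r"hongfei/articles/",
    "archive": r"hongfei/archive/\d+/\d+/\d+/\d+\.html",
}


def classify(url):
    """Return the names of all patterns found anywhere in url."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


print(classify("http://www.cnblogs.com/hongfei/p/6659934.html"))
# The URLs below are made up for illustration.
print(classify("http://www.cnblogs.com/hongfei/default.html?page=2"))
print(classify("http://www.cnblogs.com/hongfei/archive/2012/06/12/1.html"))
print(classify("http://www.cnblogs.com/hongfei/tag/php/"))  # matches no rule
```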

In pipelines.py, the process_item method handles each item the spider returns:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
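import os
import re

class MycnblogsPipeline(object):

    def process_item(self, item, spider):
        # Create the output directory if needed; open() fails when it is missing.
        os.makedirs('./blogs', exist_ok=True)
        # Page titles can contain characters that are not valid in file names.
        title = (item['page_title'] or 'untitled').strip()
        title = re.sub(r'[\\/:*?"<>|\r\n]+', '_', title)
        file_name = os.path.join('./blogs', title + '.html')
        with open(file_name, mode='w', encoding='utf-8') as f:
            f.write(item['page_html'])
        return item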

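Typical uses of an item pipeline include:

Cleaning HTML data
Validating scraped data (checking that the item contains certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database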
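As an example of the duplicate-checking use, here is a minimal sketch of a pipeline that drops any item whose page_url has already been seen. DuplicateUrlPipeline is a hypothetical class that is not part of the original project, and the fallback DropItem definition is only there so the snippet can be tried without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch runs without Scrapy; Scrapy provides the real DropItem.
    class DropItem(Exception):
        pass


class DuplicateUrlPipeline(object):
    """Hypothetical pipeline: drop any item whose page_url was seen before."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item['page_url']
        if url in self.seen_urls:
            # Raising DropItem tells Scrapy to stop processing this item.
            raise DropItem("duplicate page: %s" % url)
        self.seen_urls.add(url)
        return item
```

Scrapy's scheduler already filters duplicate requests by default, so for this project the sketch mainly illustrates the shape of such a pipeline.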

To enable an item pipeline component, add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
   'mycnblogs.pipelines.MycnblogsPipeline': 300,
}
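The integer assigned to each class in ITEM_PIPELINES sets the order in which pipelines run: items pass through lower values first, and values are conventionally kept in the 0-1000 range. The sketch below assumes a second pipeline named DuplicateUrlPipeline, which is hypothetical and used only to illustrate ordering.

```python
# settings.py fragment (sketch). DuplicateUrlPipeline is a hypothetical
# second pipeline, named here only to illustrate ordering.
ITEM_PIPELINES = {
    'mycnblogs.pipelines.DuplicateUrlPipeline': 200,  # lower value, runs first
    'mycnblogs.pipelines.MycnblogsPipeline': 300,     # runs second, writes the file
}

# Order in which an item would pass through the pipelines.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```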

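Running the spider (scrapy crawl cnblogs, from the project directory) collects all the articles and saves them locally in the blogs directory.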

Original article: http://www.cnblogs.com/hongfei/p/6659934.html
