Crawling my own blog posts with the Scrapy framework

Reposted article. Originally published 2014-05-04 15:00. Category: Python and web basics

Scrapy is a fairly simple, easy-to-use crawling framework written in Python. A good Chinese version of its documentation is at http://scrapy-chs.readthedocs.org/zh_CN/latest/

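The most important parts of a project:

items.py: defines the fields you want to save. Each field is declared with Field, and the resulting item behaves much like a Python dict.

pipelines.py: processes the Items that were extracted. The processing steps are whatever you define for your own needs.

spiders: where you define your own spiders.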
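As a rough illustration of the pipelines.py role described above, here is a minimal sketch of an item pipeline. Scrapy calls a pipeline's process_item(item, spider) method for each item a spider returns. The class name TitleCleanupPipeline, the whitespace cleanup, and the plain dict standing in for an Item are all assumptions made for this example and are not part of the original project.

```python
class TitleCleanupPipeline(object):
    """Hypothetical pipeline: tidy the headTitle field and count items."""

    def __init__(self):
        self.seen = 0  # number of items that passed through

    def process_item(self, item, spider):
        # extract() yields a list of strings; strip whitespace from each one.
        item['headTitle'] = [t.strip() for t in item.get('headTitle', [])]
        self.seen += 1
        # Returning the item hands it on to the next pipeline, if any.
        return item


if __name__ == "__main__":
    # Stand-alone check with a plain dict in place of a real item.
    pipeline = TitleCleanupPipeline()
    print(pipeline.process_item({'headTitle': ['  Hello  ']}, spider=None))
```

In a real project a pipeline like this would also need to be listed under ITEM_PIPELINES in settings.py before Scrapy uses it.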

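There are also several kinds of spider:

1) Spider: the most basic spider, which the other spider classes generally inherit from. It requests the given URLs and hands each response to the parse method by default.

2) CrawlSpider: inherits from Spider and is the one used most in practice. You define Rule objects that decide which links are followed and how pages are processed. One caveat: do not name a rule callback parse, because that overrides the parse method that CrawlSpider itself relies on and causes errors. The key work is writing the Rules, which means analysing the structure of the actual pages.

3) XMLFeedSpider and CSVFeedSpider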
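The warning about naming a callback parse can be illustrated with a toy model. This is not Scrapy's actual code: ToyCrawlSpider, its list of (substring, callback name) rules, and the use of URL strings in place of responses are hypothetical stand-ins that only mimic the idea that the base class's own parse method is where the rules get applied.

```python
class ToyCrawlSpider(object):
    """Toy stand-in for a rule-driven spider: parse() applies the rules."""
    rules = []  # list of (substring, callback_name) pairs

    def parse(self, url):
        results = []
        for pattern, callback_name in self.rules:
            if pattern in url:
                results.append(getattr(self, callback_name)(url))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [('/p/', 'parse_item')]

    def parse_item(self, url):
        return 'item from ' + url


class BrokenSpider(ToyCrawlSpider):
    rules = [('/p/', 'parse_item')]

    def parse_item(self, url):
        return 'item from ' + url

    def parse(self, url):
        # Overriding parse replaces the rule handling, so parse_item never runs.
        return []


if __name__ == "__main__":
    url = 'http://www.cnblogs.com/huhuuu/p/3384978.html'
    print(GoodSpider().parse(url))
    print(BrokenSpider().parse(url))
```

In the toy model the broken spider silently produces nothing, which is the kind of failure the caveat is meant to prevent.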

Hands-on:

In items.py:

from scrapy.item import Item, Field


class Website(Item):

    headTitle = Field()
    description = Field()
    url = Field()

In spider.py:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from dirbot.items import Website
import sys

sys.stdout = open('output.txt', 'w')  # send printed output to output.txt


add = 0  # number of posts parsed so far
class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/huhuuu",
    ]

    rules = (
        # Extract links matching huhuuu/default.html?page=N and follow them
        # (with no callback, follow defaults to True)
        Rule(LinkExtractor(allow=(r'huhuuu/default\.html\?page=(\w+)', ))),

        # Extract links matching 'huhuuu/p/' and parse them with parse_item
        Rule(LinkExtractor(allow=(r'huhuuu/p/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # counts the posts

        print(add)
        add += 1

        sel = Selector(response)
        items = []

        item = Website()
        # XPath chosen by inspecting the page's HTML source
        item['headTitle'] = sel.xpath('/html/head/title/text()').extract()
        item['url'] = response.url  # store the URL string, not the response object
        print(item)
        items.append(item)
        return items

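Finally, run scrapy crawl huhu from the project directory.

Result:

My blog has roughly 400 posts, yet the crawl found only about 100. What was going on?

I checked some of the URLs that were found. Nearly all of them were posts from October 2013 up to the most recent ones, which made no sense. Then I looked at the URLs of the older posts and realised that they are structured differently from the newer ones.

Current: http://www.cnblogs.com/huhuuu/p/3384978.html

Old: http://www.cnblogs.com/huhuuu/archive/2012/04/10/2441060.html

After adding a rule for the old URL format, the crawl picks up every post on the blog that is not password-protected: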

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from dirbot.items import Website
import sys

sys.stdout = open('output.txt', 'w')  # send printed output to output.txt


add = 0  # number of posts parsed so far
class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/huhuuu",
    ]

    rules = (
        # Extract links matching huhuuu/default.html?page=N and follow them
        # (with no callback, follow defaults to True)
        Rule(LinkExtractor(allow=(r'huhuuu/default\.html\?page=(\w+)', ))),

        # Extract links matching 'huhuuu/p/' and parse them with parse_item
        Rule(LinkExtractor(allow=(r'huhuuu/p/', )), callback='parse_item'),
        # Older posts use the archive URL form, so they need their own rule
        Rule(LinkExtractor(allow=(r'huhuuu/archive/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # counts the posts

        print(add)
        add += 1

        sel = Selector(response)
        items = []

        item = Website()
        # XPath chosen by inspecting the page's HTML source
        item['headTitle'] = sel.xpath('/html/head/title/text()').extract()
        item['url'] = response.url  # store the URL string, not the response object
        print(item)
        items.append(item)
        return items
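To see why the extra rule recovers the older posts, the allow patterns can be tried against the two URL shapes with Python's re module. This sketch assumes that a link extractor tests each allow pattern with a regular-expression search against the full URL. The helper name matching_patterns is made up for the example.

```python
import re

# The two post-matching patterns used in the rules above.
ALLOW_PATTERNS = ['huhuuu/p/', 'huhuuu/archive/']


def matching_patterns(url, patterns=ALLOW_PATTERNS):
    """Return the patterns that are found somewhere in the URL."""
    return [p for p in patterns if re.search(p, url)]


if __name__ == "__main__":
    new_url = 'http://www.cnblogs.com/huhuuu/p/3384978.html'
    old_url = 'http://www.cnblogs.com/huhuuu/archive/2012/04/10/2441060.html'
    print(matching_patterns(new_url))
    print(matching_patterns(old_url))
```

Only the archive pattern finds the old-style URL, so without it those posts are never passed to parse_item.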

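I also wrote a version that crawls the posts listed on the cnblogs home page. Apart from the start URL, only the Rules need to change: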

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from dirbot.items import Website
import sys

sys.stdout = open('output.txt', 'w')  # send printed output to output.txt


add = 0  # number of posts parsed so far
class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/",
    ]

    rules = (
        # Follow the home page's paging links (no callback, so follow defaults to True)
        Rule(LinkExtractor(allow=(r'sitehome/p/[0-9]+', ))),

        # Parse any blog post whose URL contains <something>/p/
        Rule(LinkExtractor(allow=(r'[^\s]+/p/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # counts the posts

        print(add)
        add += 1

        sel = Selector(response)
        items = []

        item = Website()
        # XPath chosen by inspecting the page's HTML source
        item['headTitle'] = sel.xpath('/html/head/title/text()').extract()
        item['url'] = response.url  # store the URL string, not the response object
        print(item)
        items.append(item)
        return items
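One detail worth checking in this version is that the second pattern, [^\s]+/p/, also matches the paging links targeted by the first rule. Scrapy's documentation states that when several rules match the same link, the first one defined is used, so the order of the rules matters here. The sketch below mimics that first-match behaviour with plain re. The function first_matching_rule and the sample paging URL are hypothetical, chosen only for illustration.

```python
import re

# (pattern, callback name) pairs in the same order as the rules above.
RULES = [
    (r'sitehome/p/[0-9]+', None),       # paging links: follow only
    (r'[^\s]+/p/', 'parse_item'),       # blog posts: parse
]


def first_matching_rule(url, rules=RULES):
    """Return (index, callback) of the first rule whose pattern is found in url."""
    for index, (pattern, callback) in enumerate(rules):
        if re.search(pattern, url):
            return index, callback
    return None


if __name__ == "__main__":
    paging_url = 'http://www.cnblogs.com/sitehome/p/2'  # assumed URL shape
    post_url = 'http://www.cnblogs.com/huhuuu/p/3384978.html'
    print(first_matching_rule(paging_url))
    print(first_matching_rule(post_url))
```

If the two rules were swapped, the paging links would be claimed by the post rule and sent to parse_item instead of simply being followed.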

Reference: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/spiders.html
