Scrapy is a popular Python crawling framework that is simple to use. This post walks through using it to scrape a personal blog's article list. For installing Python and setting up Scrapy, please consult other resources (or watch for my follow-up posts).
This post uses Python 2.7.9 and Scrapy 0.14.3.
1. Assume our spider is named vpoetblog.
At the command line, change to the Desktop directory and run `scrapy startproject vpoetblog`:

Once the command succeeds, a folder named vpoetblog is created on the Desktop.

Its directory layout is:
```
│  scrapy.cfg
│
└─vpoetblog
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py
```
After we add the data file and the spider, the layout becomes:

```
│  scrapy.cfg
│  data.txt              // holds the scraped data
│
└─vpoetblog
    │  items.py          // defines the items to scrape
    │  pipelines.py      // saves the scraped data
    │  settings.py
    │  __init__.py
    │
    └─spiders
            blog_spider.py   // main spider: defines the crawl rules
            __init__.py
```
items.py looks like this:
```python
# -*- coding: cp936 -*-
from scrapy.item import Item, Field

class VpoetblogItem(Item):
    # define the fields for your item here like:
    # name = Field()
    article_name = Field()  # article title
    public_time = Field()   # publish time
    read_num = Field()      # read count
```
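Since each field is filled from `extract()`, which returns a list of strings, an item stores lists. A standalone illustration, with made-up values (run it from the project directory so `vpoetblog.items` imports):

```python
# Illustration only: VpoetblogItem behaves like a dict, and each
# field holds the list that extract() returns. Values are made up.
from vpoetblog.items import VpoetblogItem

item = VpoetblogItem()
item['article_name'] = [u'My first post']
item['public_time'] = [u'2015-01-01 10:00']
item['read_num'] = [u'(123)']
print item['article_name'][0]  # My first post
```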
pipelines.py looks like this:
```python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class Pipeline(object):
    def process_item(self, item, spider):
        # Validation left disabled for now:
        # for data in item:
        #     if not data:
        #         raise DropItem("Missing %s of blogpost" % data)
        # Append each scraped item to data.txt
        output = open('data.txt', 'a')
        output.write('article_name:' + item['article_name'][0] + ' ')
        output.write('public_time:' + item['public_time'][0] + ' ')
        # newline added so each article lands on its own line
        output.write('read_num:' + item['read_num'][0] + '\n')
        output.close()
        return item
```
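If you want the validation that the commented-out lines sketch, a minimal working version could look like the snippet below. `ValidatingPipeline` is a hypothetical name, and it checks the three fields the item actually defines; the original commented code referenced `item['url']`, a field this item does not have.

```python
# Sketch: the commented-out validation made runnable.
# ValidatingPipeline is a hypothetical name; it drops items
# that are missing any of the three defined fields.
from scrapy.exceptions import DropItem

class ValidatingPipeline(object):
    def process_item(self, item, spider):
        for field in ('article_name', 'public_time', 'read_num'):
            if not item.get(field):
                raise DropItem("Missing %s in scraped item" % field)
        return item
```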
settings.py looks like this; it registers the pipeline and throttles requests with a randomized two-second download delay:
```python
# Scrapy settings for vpoetblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#

BOT_NAME = 'vpoetblog'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['vpoetblog.spiders']
NEWSPIDER_MODULE = 'vpoetblog.spiders'

ITEM_PIPELINES = {
    'vpoetblog.pipelines.Pipeline': 300
}

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
```
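Scrapy resolves each ITEM_PIPELINES entry from its dotted path (in later Scrapy releases the attached number is an execution order, lower running first). As a quick sanity check that the path is importable, a sketch using Scrapy's own `load_object` helper:

```python
# Sketch: confirm the pipeline's dotted path resolves to a class.
# load_object is the helper Scrapy itself uses to load components.
from scrapy.utils.misc import load_object

cls = load_object('vpoetblog.pipelines.Pipeline')
print cls  # should print the Pipeline class
```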
Finally, blog_spider.py looks like this:

```python
# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from vpoetblog.items import VpoetblogItem


class BlogSpider(CrawlSpider):
    name = "vpoetblog"
    allowed_domains = ["blog.csdn.net"]
    start_urls = ["http://blog.csdn.net/u013018721/article/list/1"]

    rules = [
        # Follow pagination links through the article list
        Rule(SgmlLinkExtractor(allow=(r'http://blog.csdn.net/u013018721/article/list/\d+'))),
        # Parse individual article pages
        Rule(SgmlLinkExtractor(allow=(r'http://blog.csdn.net/u013018721/article/details/\d+')), callback="parse_item"),
    ]

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        item = VpoetblogItem()
        item['article_name'] = sel.select('//*[@class="link_title"]/a/text()').extract()
        item['public_time'] = sel.select('//*[@class="link_postdate"]/text()').extract()
        item['read_num'] = sel.select('//*[@class="link_view"]/text()').extract()
        return item
```
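To see what those XPath expressions extract, you can run the selector against an HTML fragment shaped like CSDN's article list. The markup below is a made-up sample, not the real page:

```python
# Standalone sketch with fabricated markup mimicking the class
# names the spider selects on; not the actual CSDN page.
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

html = ('<span class="link_title"><a href="/detail/1">My first post</a></span>'
        '<span class="link_postdate">2015-01-01 10:00</span>'
        '<span class="link_view">(123)</span>')
response = HtmlResponse(url='http://blog.csdn.net/u013018721', body=html)

sel = HtmlXPathSelector(response)
print sel.select('//*[@class="link_title"]/a/text()').extract()   # [u'My first post']
print sel.select('//*[@class="link_postdate"]/text()').extract()  # [u'2015-01-01 10:00']
print sel.select('//*[@class="link_view"]/text()').extract()      # [u'(123)']
```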
To run the spider, change into the project directory and enter `scrapy crawl vpoetblog`.

When the crawl finishes, data.txt in the project root holds one line per scraped article, recording its title, publish time, and read count.