After installing Scrapy, I'm sure everyone is itching to build a spider of their own. I'm no exception, so below I record in detail the steps needed to set up a Scrapy project. If you haven't installed Scrapy yet, or the installation left you frustrated and stuck, see my earlier article 安裝python爬蟲scrapy踩過的那些坑和編程外的思考. This post uses cnblogs (博客園) as the example: we crawl a blog's post list and save it to a JSON file.
Environment: CentOS 6.0 virtual machine
Scrapy (for installation, see 安裝python爬蟲scrapy踩過的那些坑和編程外的思考)
1. Create the project cnblogs
```
[root@bogon share]# scrapy startproject cnblogs
2015-06-10 15:45:03 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-10 15:45:03 [scrapy] INFO: Optional features available: ssl, http11
2015-06-10 15:45:03 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'cnblogs' created in:
    /mnt/hgfs/share/cnblogs

You can start your first spider with:
    cd cnblogs
    scrapy genspider example example.com
```
2. Inspect the project structure
```
[root@bogon share]# tree cnblogs/
cnblogs/
├── cnblogs
│   ├── __init__.py
│   ├── items.py        # defines the structure to extract from pages
│   ├── pipelines.py    # post-processes the extracted data
│   ├── settings.py     # crawler settings
│   └── spiders
│       └── __init__.py
└── scrapy.cfg          # project configuration file
```
3. Define the structure to extract from cnblogs pages by editing items.py
Here we extract four fields:
- article title
- article link
- URL of the list page the article appears on
- summary
```python
[root@bogon cnblogs]# vi cnblogs/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    listUrl = scrapy.Field()
```
4. Create the spider
```python
[root@bogon cnblogs]# vi cnblogs/spiders/cnblogs_spider.py

#coding=utf-8
from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
from cnblogs.items import *

class CnblogsSpider(CrawlSpider):
    # name of the spider
    name = "CnblogsSpider"
    # domains the spider may crawl; URLs outside this list are skipped
    allowed_domains = ["cnblogs.com"]
    # entry URLs for the crawl
    start_urls = [
        "http://www.cnblogs.com/rwxwsblog/default.html?page=1"
    ]
    # rules for which URLs to follow, with parse_item as the callback;
    # note that the ? must be escaped in the regular expression
    rules = [
        Rule(sle(allow=("/rwxwsblog/default.html\?page=\d{1,}")),
             follow=True,
             callback='parse_item')
    ]

    # callback: extract data into items using XPath and CSS selectors
    def parse_item(self, response):
        items = []
        sel = Selector(response)
        base_url = get_base_url(response)
        postTitle = sel.css('div.day div.postTitle')
        postCon = sel.css('div.postCon div.c_b_p_desc')
        # titles, urls and descriptions live in loosely coupled structures,
        # paired here by index; this could be made more robust later
        for index in range(len(postTitle)):
            item = CnblogsItem()
            item['title'] = postTitle[index].css("a").xpath('text()').extract()[0]
            item['link'] = postTitle[index].css('a').xpath('@href').extract()[0]
            item['listUrl'] = base_url
            item['desc'] = postCon[index].xpath('text()').extract()[0]
            items.append(item)
        return items
```
Note:
The first line must be #coding=utf-8 or # -*- coding: utf-8 -*-, otherwise you will get an error like:

```
SyntaxError: Non-ASCII character '\xe5' in file /mnt/hgfs/share/cnblogs/cnblogs/spiders/cnblogs_spider.py on line 15, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

The spider's name is CnblogsSpider; we will use it later.
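The index-pairing logic in parse_item can be sanity-checked outside Scrapy with nothing but the standard library. The fragment below is a toy, well-formed stand-in for the real cnblogs markup (the class names are copied from the selectors above; everything else is invented):

```python
import xml.etree.ElementTree as ET

# Toy fragment mimicking the post-list structure the spider walks:
# each postTitle[i] is paired with postCon[i] by index.
HTML = """
<root>
  <div class="day">
    <div class="postTitle"><a href="/rwxwsblog/p/1.html">First post</a></div>
    <div class="postCon"><div class="c_b_p_desc">Summary one</div></div>
  </div>
  <div class="day">
    <div class="postTitle"><a href="/rwxwsblog/p/2.html">Second post</a></div>
    <div class="postCon"><div class="c_b_p_desc">Summary two</div></div>
  </div>
</root>
"""

def extract_items(fragment):
    root = ET.fromstring(fragment)
    # Find title links and description nodes separately, then zip by index
    titles = root.findall(".//div[@class='postTitle']/a")
    descs = root.findall(".//div[@class='c_b_p_desc']")
    items = []
    for a, d in zip(titles, descs):
        items.append({"title": a.text, "link": a.get("href"), "desc": d.text})
    return items

items = extract_items(HTML)
```

In the real spider the same pairing is done with Scrapy selectors on live HTML; the sketch only shows that zipping the two node lists works as long as the page keeps titles and summaries in matching order.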
5. Edit pipelines.py
```python
[root@bogon cnblogs]# vi cnblogs/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs

class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy calls close_spider on item pipelines automatically
    # when the spider finishes
    def close_spider(self, spider):
        self.file.close()
```
Note the class name JsonWithEncodingCnblogsPipeline; it is referenced in settings.py.
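The ensure_ascii=False argument is what keeps Chinese titles human-readable in the output file; without it, json.dumps escapes every non-ASCII character. A small stand-alone demonstration (the item dict is made up):

```python
import json

# A made-up item, as the pipeline would see it after dict(item)
item = {"title": "安裝python爬蟲scrapy", "link": "http://www.cnblogs.com/rwxwsblog/"}

# Default: every non-ASCII character becomes a \uXXXX escape
escaped = json.dumps(item)

# ensure_ascii=False writes the characters as-is, which is why the
# pipeline opens cnblogs.json with the utf-8 codec
readable = json.dumps(item, ensure_ascii=False)
```

Both strings parse back to the same dict; the difference is purely the readability of the file on disk.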
6. Edit settings.py and add the following two settings
```python
ITEM_PIPELINES = {
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
}
LOG_LEVEL = 'INFO'
```
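The number 300 in ITEM_PIPELINES is a priority in the 0-1000 range: when several pipelines are enabled, Scrapy passes each item through them in ascending priority order. A sketch with a second, purely hypothetical pipeline entry to show the ordering:

```python
# Lower priority values run first. SomeFilterPipeline is hypothetical,
# added only to illustrate how the numbers decide the order.
ITEM_PIPELINES = {
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    'cnblogs.pipelines.SomeFilterPipeline': 100,
}

# Sort the pipeline paths by their priority value
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

With these values, items would pass through the filter first and reach the JSON writer second.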
7. Run the spider with scrapy crawl <spider name> (the name defined in cnblogs_spider.py)
```
[root@bogon cnblogs]# scrapy crawl CnblogsSpider
```
8. Check the results (cnblogs.json is the filename defined in pipelines.py)

```
more cnblogs.json
```
9. If needed, the results can be converted to plain text; see my other article python將json格式的數據轉換成文本格式的數據或sql文件.
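Since the pipeline writes one JSON object per line, such a conversion takes only a few lines of standard-library code. A minimal sketch (the tab-separated layout and the sample records are illustrative):

```python
import json

def json_lines_to_text(src_lines):
    """Turn JSON-lines records into tab-separated text rows."""
    rows = []
    for line in src_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append("\t".join([record["title"], record["link"]]))
    return "\n".join(rows)

# In practice src_lines would come from open('cnblogs.json', encoding='utf-8')
sample = [
    '{"title": "post one", "link": "http://example.com/1"}\n',
    '{"title": "post two", "link": "http://example.com/2"}\n',
]
text = json_lines_to_text(sample)
```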
The source code can be downloaded here: https://github.com/jackgitgz/CnblogsSpider
10. You may still be wondering: can we save the data directly to a database? Yes, we can. Upcoming articles will cover this step by step, so stay tuned.
References:
http://doc.scrapy.org/en/master/
http://blog.csdn.net/HanTangSongMing/article/details/24454453