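We will now build a project that uses the Scrapy crawling framework to scrape the latest American TV series.
Preparation:
Target URL: http://www.meijutt.com/new100.html
Fields to scrape: series name, status, TV network, update time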
1. Create the working directory
mkdir scrapyProject
cd scrapyProject
2. Create the project and generate the spider
scrapy startproject meiju100
cd meiju100
scrapy genspider meiju meijutt.com
3. Check the directory structure
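After the two commands in step 2, a Scrapy project normally consists of scrapy.cfg plus a meiju100 package holding items.py, pipelines.py, settings.py, and spiders/meiju.py (newer Scrapy releases also generate middlewares.py). If the `tree` command is not available, a few lines of Python can print the layout you actually have. This is only a sketch: `list_tree` is a helper name invented for this illustration and is not part of Scrapy.

```python
import os


def list_tree(root):
    """Return every file under root as a sorted list of relative paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            # Normalise separators so the output looks the same on every OS
            paths.append(os.path.relpath(full, root).replace(os.sep, '/'))
    return sorted(paths)
```

Calling `print('\n'.join(list_tree('.')))` from the directory that contains scrapy.cfg lists one file per line.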

4. Define the fields to scrape (items.py)
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class Meiju100Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    storyName = scrapy.Field()   # series name
    storyState = scrapy.Field()  # status
    tvStation = scrapy.Field()   # TV network
    updateTime = scrapy.Field()  # update time
5. Write the spider (meiju.py)
# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item


class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        items = []
        # Each <li> in the list holds one series
        subSelector = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for sub in subSelector:
            item = Meiju100Item()
            item['storyName'] = sub.xpath('./h5/a/text()').extract()
            # Try the <font> inside the first <span>, then the <span> text itself
            item['storyState'] = sub.xpath('./span[1]/font/text()').extract()
            if not item['storyState']:
                item['storyState'] = sub.xpath('./span[1]/text()').extract()
            # Fall back to '未知' ("unknown") when no network is listed
            item['tvStation'] = sub.xpath('./span[2]/text()').extract()
            if not item['tvStation']:
                item['tvStation'] = [u'未知']
            # Try the second <div> text, then the <font> inside it
            item['updateTime'] = sub.xpath('./div[2]/text()').extract()
            if not item['updateTime']:
                item['updateTime'] = sub.xpath('./div[2]/font/text()').extract()
            items.append(item)
        return items
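The fallback pattern in parse (try one XPath, then a second, then a default) can be tried offline. The sketch below is an approximation rather than the spider itself. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so it runs without a network connection. The HTML fragment is invented for illustration and the real page markup may differ. `text_or_fallback` and `parse_rows` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Invented fragment that mimics the structure the spider's XPaths expect.
SAMPLE = """
<ul class="top-list fn-clear">
  <li>
    <h5><a>Show A</a></h5>
    <span><font>Episode 3</font></span>
    <span>HBO</span>
    <div>1</div>
    <div>2017-01-02</div>
  </li>
  <li>
    <h5><a>Show B</a></h5>
    <span>Finished</span>
    <span></span>
    <div>2</div>
    <div><font>2017-01-03</font></div>
  </li>
</ul>
"""


def text_or_fallback(node, paths, default=None):
    """Return the first non-empty text found at any of the given paths."""
    for path in paths:
        found = node.find(path)
        if found is not None and found.text and found.text.strip():
            return found.text.strip()
    return default


def parse_rows(markup):
    """Extract one dict per <li>, applying the same fallbacks as the spider."""
    root = ET.fromstring(markup.strip())
    rows = []
    for li in root.findall('li'):
        rows.append({
            'storyName': text_or_fallback(li, ['h5/a']),
            'storyState': text_or_fallback(li, ['span[1]/font', 'span[1]']),
            'tvStation': text_or_fallback(li, ['span[2]'], default='unknown'),
            'updateTime': text_or_fallback(li, ['div[2]', 'div[2]/font']),
        })
    return rows
```

Listing the candidate paths in order keeps each field's fallback chain on one line, which is easier to extend than nested if/else branches.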
6. Process the scraped results (pipelines.py)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import io
import time


class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # One output file per day, named YYYYMMDDmovie.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'movie.txt'
        # io.open with an explicit encoding writes UTF-8 on both Python 2 and 3
        with io.open(fileName, 'a', encoding='utf-8') as fp:
            # Each field is a list of extracted strings; write the first entry of each
            fp.write(u'\t'.join([
                item['storyName'][0],
                item['storyState'][0],
                item['tvStation'][0],
                item['updateTime'][0],
            ]) + u'\n')
        return item
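The pipeline does two separate things: it picks a date-stamped file name and it appends one tab-separated line per item. A minimal sketch that splits those steps into small functions makes them easy to check without running a crawl. The names `daily_file_name`, `format_row`, and `append_row` are assumptions made for this example and do not appear in the project.

```python
import io
import time

# Field order used for every output line
FIELDS = ('storyName', 'storyState', 'tvStation', 'updateTime')


def daily_file_name(now=None):
    """Build the YYYYMMDDmovie.txt name; now is seconds since the epoch."""
    stamp = time.strftime('%Y%m%d', time.localtime(now))
    return stamp + 'movie.txt'


def format_row(item):
    """Join the first extracted value of each field with tabs."""
    return u'\t'.join(item[field][0] for field in FIELDS) + u'\n'


def append_row(path, item):
    """Append one formatted line to path, encoded as UTF-8."""
    with io.open(path, 'a', encoding='utf-8') as fp:
        fp.write(format_row(item))
```

A plain dict whose values are one-element lists is enough to exercise these helpers, since they only rely on `item[field][0]`.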
7. Configure settings.py
……
ITEM_PIPELINES = {'meiju100.pipelines.Meiju100Pipeline':1}
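The integer assigned to each pipeline sets its position in the chain: Scrapy runs pipelines from the lowest value to the highest, and the values are conventionally chosen in the 0 to 1000 range. The sketch below illustrates that ordering with plain Python. `DedupPipeline` and the helper `execution_order` are invented for this example; only Meiju100Pipeline exists in this project.

```python
# Hypothetical settings with two pipelines to show the ordering rule.
ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 300,
    'meiju100.pipelines.DedupPipeline': 100,  # invented for illustration
}


def execution_order(pipelines):
    """Return pipeline paths sorted lowest value first, as Scrapy orders them."""
    return sorted(pipelines, key=pipelines.get)
```

With a single pipeline, as in this project, any value in the range works.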
8. Run the spider
scrapy crawl meiju
9. Results
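The crawl leaves a tab-separated text file named YYYYMMDDmovie.txt in the directory where the spider was started. One way to inspect it, sketched here under the assumption that each line holds the four fields in the order written by the pipeline, is to read it back into dicts. `read_rows` is a hypothetical helper, not part of the project.

```python
import io

# Field order assumed to match the pipeline's output
FIELDS = ('storyName', 'storyState', 'tvStation', 'updateTime')


def read_rows(path):
    """Parse the tab-separated output file into a list of dicts."""
    rows = []
    with io.open(path, encoding='utf-8') as fp:
        for line in fp:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            rows.append(dict(zip(FIELDS, line.split('\t'))))
    return rows
```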

10. Code download
http://files.cnblogs.com/files/kongzhagen/meiju100.zip
