Let's now write a project that uses the Scrapy crawling framework to scrape the latest American TV shows.
Preparation:
Target URL: http://www.meijutt.com/new100.html
Fields to scrape: show name, status, TV station, update time
1. Create a working directory

mkdir scrapyProject
cd scrapyProject
2. Create the Scrapy project and generate a spider

scrapy startproject meiju100
cd meiju100
scrapy genspider meiju meijutt.com
3. Check the directory structure
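For reference, the layout left behind by scrapy startproject and scrapy genspider should look roughly like the tree below (the exact set of generated files varies slightly between Scrapy versions):

meiju100/
    scrapy.cfg            # deploy configuration file
    meiju100/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 4)
        pipelines.py      # item pipelines (step 6)
        settings.py       # project settings (step 7)
        spiders/
            __init__.py
            meiju.py      # spider created by genspider (step 5)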
4. Define the fields to scrape (items.py)
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Meiju100Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    storyName = scrapy.Field()   # show name
    storyState = scrapy.Field()  # status
    tvStation = scrapy.Field()   # TV station
    updateTime = scrapy.Field()  # update time
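An Item behaves like a dict whose keys are limited to the declared fields, which is what lets the spider in the next step assign to item['storyName'] and friends. A minimal illustrative check (the value assigned here is made up):

from meiju100.items import Meiju100Item

item = Meiju100Item()
item['storyName'] = [u'Some Show']  # hypothetical value, for illustration only
print(dict(item))                   # {'storyName': [u'Some Show']}
# item['rating'] = [u'5']           # would raise KeyError: undeclared field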
5. Write the spider (meiju.py)
# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item


class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        items = []
        # Each <li> under the "top-list fn-clear" list is one show.
        subSelector = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for sub in subSelector:
            item = Meiju100Item()
            item['storyName'] = sub.xpath('./h5/a/text()').extract()
            # The status text is sometimes wrapped in a <font> tag;
            # fall back to the plain span text when it is not.
            item['storyState'] = sub.xpath('./span[1]/font/text()').extract()
            if not item['storyState']:
                item['storyState'] = sub.xpath('./span[1]/text()').extract()
            item['tvStation'] = sub.xpath('./span[2]/text()').extract()
            if not item['tvStation']:
                item['tvStation'] = [u'未知']  # "unknown"
            # The update time likewise may or may not sit inside a <font> tag.
            item['updateTime'] = sub.xpath('./div[2]/text()').extract()
            if not item['updateTime']:
                item['updateTime'] = sub.xpath('./div[2]/font/text()').extract()
            items.append(item)
        return items
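Before launching the full crawl, the XPath expressions can be tested interactively with Scrapy's shell. A quick session might look like this; the actual output depends on whatever the live page contains at the time:

scrapy shell http://www.meijutt.com/new100.html
>>> response.xpath('//ul[@class="top-list fn-clear"]/li/h5/a/text()').extract()[:3]
[u'...', u'...', u'...']  # the first three show names on the page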
6. Process the scraped items (pipelines.py)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # Append each item as one tab-separated line to a date-stamped
        # file such as 20170315movie.txt. Every field is encoded to
        # UTF-8 explicitly (Python 2), which avoids the fragile
        # reload(sys)/sys.setdefaultencoding hack.
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'movie.txt'
        with open(fileName, 'a') as fp:
            fp.write(item['storyName'][0].encode('utf8') + '\t' +
                     item['storyState'][0].encode('utf8') + '\t' +
                     item['tvStation'][0].encode('utf8') + '\t' +
                     item['updateTime'][0].encode('utf8') + '\n')
        return item
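The code above is Python 2, matching the original project. On Python 3, where current Scrapy releases run and extract() returns str, the same write would drop the .encode calls and open the file with an explicit encoding; a minimal sketch, assuming the same item fields:

# Python 3 variant of the write:
with open(fileName, 'a', encoding='utf8') as fp:
    fp.write('\t'.join([item['storyName'][0],
                        item['storyState'][0],
                        item['tvStation'][0],
                        item['updateTime'][0]]) + '\n')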
7. Configure settings.py

……
ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 1,
}

The number assigned to the pipeline is its order: Scrapy runs pipelines with lower values first, and values conventionally sit in the 0-1000 range.
8. Launch the spider (from inside the project directory)

scrapy crawl meiju
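If a structured dump of the items is all that's needed, Scrapy's built-in feed export can produce one without any custom pipeline, e.g.:

scrapy crawl meiju -o meiju.json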
9. Results
When the crawl finishes, a file named after today's date (e.g. 20170315movie.txt) appears in the directory the spider was launched from, with one tab-separated line per show: name, status, TV station, and update time.
10. Download the code
http://files.cnblogs.com/files/kongzhagen/meiju100.zip