Scrapy crawler framework, part 2: scraping video details from a website

Reposted article. 2018-09-29 14:16

Scraping video details from http://www.id97.com/

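Setting up the environment (judging by the import in movie.py, the project is named moviePro and the spider movie, so it was presumably created with scrapy startproject moviePro followed by scrapy genspider movie www.id97.com):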

Settings in the spider file movie.py:

# -*- coding: utf-8 -*-
import scrapy

from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.id97.com']
    start_urls = ['http://www.id97.com/']

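    def secondPageParse(self, response):
        # Pick up the item that parse() handed over through the request's meta dict.
        item = response.meta['item']
        # These absolute XPaths depend on the page layout. If the raw HTML has no
        # <tbody> (browsers insert it when rendering), remove /tbody from the paths.
        item['actor'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
        item['show_time'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[7]/td[2]/text()').extract_first()

        # The item is complete now, so hand it on to the pipeline.
        yield item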

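    def parse(self, response):
        # Each div in this list is one movie entry on the start page.
        div_list = response.xpath('/html/body/div[1]/div[2]/div[1]/div/div')
        for div in div_list:
            item = MovieproItem()

            item['name'] = div.xpath('./div/div[@class="meta"]//a/text()').extract_first()
            # The genre is spread over several <a> tags, so select //text() and call
            # extract() to get every value; the result is a list of strings.
            item['kind'] = div.xpath('./div/div[@class="meta"]/div[@class="otherinfo"]//text()').extract()
            # Join the list into a single string.
            item['kind'] = ''.join(item['kind'])
            # Link to the detail page, requested next to get the movie's full description.
            item['url'] = div.xpath('./div/div[@class="meta"]//a/@href').extract_first()
            if item['url'] is None:
                # No detail link in this div; skip it rather than request None.
                continue

            # Pass the item to the second-level callback through meta so it can finish filling it in.
            yield scrapy.Request(url=response.urljoin(item['url']), callback=self.secondPageParse, meta={'item': item})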

Settings in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
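    name = scrapy.Field()       # movie title
    kind = scrapy.Field()       # genre text, joined into one string
    url = scrapy.Field()        # link to the detail page
    actor = scrapy.Field()      # lead actor, filled in from the detail page
    show_time = scrapy.Field()  # release date, filled in from the detail page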

Settings in pipelines.py (the item pipeline):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

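class MovieproPipeline(object):
    def process_item(self, item, spider):
        # The keys stay in Chinese, as in the original output file.
        dic_item = {
            '電影名字': item['name'],       # movie title
            '影片類型': item['kind'],       # genre
            '主演': item['actor'],          # lead actor
            '上映時間': item['show_time'],  # release date
        }

        # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes.
        json_str = json.dumps(dic_item, ensure_ascii=False)
        # Append one JSON object per line so the file can be read back record by record.
        with open('./movie_des.json', 'at', encoding='utf-8') as f:
            f.write(json_str + '\n')
        print(item['name'])
        return item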
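
Reopening the file for every item works but is wasteful on larger crawls. The following is a hedged sketch of one alternative, not the article's pipeline: the class name MovieJsonLinesPipeline, the .jsonl path, and the English keys are assumptions made for illustration. It relies on the open_spider and close_spider hooks that Scrapy calls on pipelines at the start and end of a crawl.

```python
import json


class MovieJsonLinesPipeline:
    """Hypothetical variant that keeps one file handle open for the whole crawl."""

    def __init__(self, path='./movie_des.jsonl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        record = {
            'name': item.get('name'),
            'kind': item.get('kind'),
            'actor': item.get('actor'),
            'show_time': item.get('show_time'),
        }
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable.
        self.file.write(json.dumps(record, ensure_ascii=False) + '\n')
        return item
```

As the template comment in pipelines.py says, a pipeline runs only after it has been added to the ITEM_PIPELINES setting in settings.py.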

Log level settings:

Set the log level manually in settings.py (the line can go anywhere in the file)
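
The value used in the original is not shown. As a minimal sketch, assuming the aim is to silence everything below errors, the line in settings.py could look like this:

```python
# settings.py (fragment)
# LOG_LEVEL is a standard Scrapy setting. Valid values are 'CRITICAL', 'ERROR',
# 'WARNING', 'INFO' and 'DEBUG'; Scrapy defaults to 'DEBUG'.
# 'ERROR' is an assumed choice here, not a value taken from the article.
LOG_LEVEL = 'ERROR'
```

With this in place, only messages at ERROR level or above are logged during the crawl.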

Write the specified log output to a file for storage:
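
The setting itself is missing from this copy. A minimal sketch, assuming Scrapy's standard LOG_FILE setting is what was meant; the file name log.txt is a placeholder:

```python
# settings.py (fragment)
# LOG_FILE is a standard Scrapy setting: when it is set, log output goes to
# this file instead of the console. 'log.txt' is an assumed, illustrative name.
LOG_FILE = 'log.txt'
```

A relative path like this one is resolved against the directory the crawl is started from.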
