Scrapy Crawler Framework, Part 2: Scraping Video Details from a Website


Target site for the video details: http://www.id97.com/

Set up the environment (a Scrapy project named moviePro containing a spider named movie):

The spider file, movie.py:

# -*- coding: utf-8 -*-
import scrapy

from moviePro.items import MovieproItem
class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.id97.com']
    start_urls = ['http://www.id97.com/']

    def secondPageParse(self,response):
        item = response.meta['item']
        item['actor']=response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
        item['show_time'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[7]/td[2]/text()').extract_first()

        yield item

    def parse(self, response):

        div_list=response.xpath('/html/body/div[1]/div[2]/div[1]/div/div')
        for div in div_list:
            item = MovieproItem()

            item['name']=div.xpath('./div/div[@class="meta"]//a/text()').extract_first()
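            # The genre block holds several <a> tags, so use //text() to collect every text node.
            # Multiple values come back, so call extract() rather than extract_first().
            item['kind']=div.xpath('./div/div[@class="meta"]/div[@class="otherinfo"]//text()').extract()  # returns a list, which has to be converted to a string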

            item['kind'] = ''.join(item['kind'])
            # Grab the detail-page link; it is used to send a second request for the movie's detailed description
            item['url'] = div.xpath('./div/div[@class="meta"]//a/@href').extract_first()

            # Pass the item object to the detail-page callback through meta so the callback can fill in the remaining fields
            yield scrapy.Request(url=item['url'],callback=self.secondPageParse,meta={'item':item})
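
The genre field above arrives as a list of raw text nodes, and a plain ''.join() keeps whatever newlines and indentation the page contains. The following is a minimal sketch of one way to tidy that list before storing it. The helper name join_text_nodes and the sample list are hypothetical and not part of the original project; the real text nodes on the site may look different.

```python
def join_text_nodes(parts, sep=' '):
    """Collapse a list of extracted text nodes into one clean string.

    Hypothetical helper: strips each node and drops the ones that are
    empty after stripping, then joins the rest with `sep`.
    """
    cleaned = (part.strip() for part in parts)
    return sep.join(part for part in cleaned if part)


# Assumed sample of what .extract() might return for the genre block
raw_kind = ['\n    ', 'Drama', ' / ', 'Action', '\n']
kind = join_text_nodes(raw_kind)
```

In the spider this would stand in for the ''.join(item['kind']) line, if whitespace in the stored value turns out to be a problem.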

The item definition, items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name=scrapy.Field()
    kind=scrapy.Field()
    url=scrapy.Field()
    actor=scrapy.Field()
    show_time=scrapy.Field()

The pipeline, pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class MovieproPipeline(object):
    def process_item(self, item, spider):
        dic_item={
            'title':item['name'],
            'genre':item['kind'],
            'lead actor':item['actor'],
            'release date':item['show_time'],

        }

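        # ensure_ascii=False keeps non-ASCII text readable in the output file
        json_str=json.dumps(dic_item,ensure_ascii=False)
        with open('./movie_des.json','at',encoding='utf-8') as f:
            # One JSON object per line; without the newline the records run together
            f.write(json_str+'\n')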
        print(item['name'])
        return item
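
The pipeline above reopens the output file for every item. As a sketch of an alternative, Scrapy also calls open_spider and close_spider on a pipeline if they are defined, so the file can be opened once and closed at the end. The class name JsonLinesMoviePipeline, the path parameter, and the .jsonl file name are assumptions made for illustration; they are not in the original project.

```python
import json


class JsonLinesMoviePipeline:
    """Hypothetical variant that keeps one file handle open for the whole crawl."""

    def __init__(self, path='./movie_des.jsonl'):
        # Assumed default output path, one JSON object per line
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # .get() yields None for fields the spider could not fill in
        record = {
            'name': item.get('name'),
            'kind': item.get('kind'),
            'actor': item.get('actor'),
            'show_time': item.get('show_time'),
        }
        self.file.write(json.dumps(record, ensure_ascii=False) + '\n')
        return item
```

Either pipeline only runs once it is registered under ITEM_PIPELINES in settings.py, as the template comment at the top of pipelines.py points out.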

 

Setting the log level:

Set the log level manually in settings.py (the line can go anywhere in the file).
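
The original screenshot for this step is not available, so the fragment below is only a sketch of what such a line can look like. LOG_LEVEL is a standard Scrapy setting; the value 'ERROR' is an assumed choice here, not necessarily the one the author used.

```python
# settings.py (fragment)
# Scrapy's default level is DEBUG. Accepted values, from most to least
# severe: CRITICAL, ERROR, WARNING, INFO, DEBUG.
# 'ERROR' is an assumed example: only errors and critical messages are shown.
LOG_LEVEL = 'ERROR'
```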

 

Write the selected log messages to a file for storage:
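
Again as a sketch only, since the original screenshot is missing: Scrapy's LOG_FILE setting redirects log output to a file. The file name log.txt is a hypothetical choice.

```python
# settings.py (fragment)
# When LOG_FILE is set, Scrapy writes its log output to this file
# instead of printing it to the console. The name is an assumed example.
LOG_FILE = 'log.txt'
```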

 

