Scraping Douban Movies with Python's Scrapy Framework and Visualizing the Results

Reposted article, 2019-03-13 23:02

1. Introduction to the Scrapy framework

[Figure: Scrapy]

The main components to know are the spiders, the engine, the scheduler, the downloader, and the item pipeline.

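Common Scrapy commands are as follows:

[Figure: common Scrapy commands]

Correspondingly, a Scrapy project consists of the spider file that you add yourself plus the generated items, pipelines, and settings files. That is all there is to it.

The items file declares the names of the fields to scrape. The pipelines file handles the data flow, for example writing to a file or loading into a database. The main spider file sets the domain, gives the XPath for each field, and fills in the field values for every page.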

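import scrapy

class DoubanItem(scrapy.Item):
    # One field per attribute scraped from each Top 250 entry.
    movieRank = scrapy.Field()      # position in the list
    movieName = scrapy.Field()      # film title
    Director = scrapy.Field()       # first text line of the info paragraph (director credits)
    movieDesc = scrapy.Field()      # one-line quote, may be empty
    movieRate = scrapy.Field()      # rating
    peopleCount = scrapy.Field()    # text giving the number of raters
    movieDate = scrapy.Field()      # release year
    movieCountry = scrapy.Field()   # country or region of origin
    movieCategory = scrapy.Field()  # genres
    moviePost = scrapy.Field()      # poster image URL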

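import json

class DoubanPipeline:
    def __init__(self):
        # Output file: one JSON object per line.
        self.f = open("douban.json","w",encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        content = json.dumps(dict(item),ensure_ascii = False)+"\n"
        self.f.write(content)
        return item

    def close_spider(self,spider):
        # Called once when the spider finishes; release the file handle.
        self.f.close()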
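The pipeline above writes one JSON object per line. The following is a minimal standard-library sketch of that file format, independent of Scrapy, showing that each line can be parsed on its own and that `ensure_ascii=False` keeps Chinese text readable. The two rows and the file name `sample.json` are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical rows standing in for scraped items.
rows = [
    {"movieRank": "1", "movieName": "示例电影一"},
    {"movieRank": "2", "movieName": "示例电影二"},
]

# Write to a temporary directory so the sketch does not touch douban.json.
path = os.path.join(tempfile.mkdtemp(), "sample.json")

# Same serialization as process_item: one JSON document per line.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read the file back line by line; every line is a complete JSON object.
with open(path, encoding="utf-8") as f:
    raw_lines = f.read().splitlines()
loaded = [json.loads(line) for line in raw_lines]

print(raw_lines[0])
```

Because the file as a whole is not a single JSON array, a reader has to treat it line by line, which is why the pandas calls later in the article pass `lines=True`.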

For working out the XPath expressions, I recommend the Chrome extension XPath Helper.

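import scrapy
from douban.items import DoubanItem  # assumes the project package is named "douban"

class DoubanSpider(scrapy.Spider):  # class name assumed; the original omits the header
    name = "douban"  # spider name assumed
    allowed_domains = ['douban.com']
    baseURL = "https://movie.douban.com/top250?start="
    offset = 0
    start_urls = [baseURL + str(offset)]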


    def parse(self, response):
        # Each film on the page sits in its own <div class="item">.
        node_list = response.xpath("//div[@class='item']")

        for node in node_list:
            item = DoubanItem()
            item['movieName'] = node.xpath("./div[@class='info']/div[1]/a/span/text()").extract()[0]
            item['movieRank'] = node.xpath("./div[@class='pic']/em/text()").extract()[0]
            item['Director'] = node.xpath("./div[@class='info']/div[@class='bd']/p[1]/text()[1]").extract()[0]
            # Not every film has a one-line quote, so fall back to an empty string.
            item['movieDesc'] = node.xpath("./div[@class='info']/div[@class='bd']/p[@class='quote']/span[@class='inq']/text()").extract_first(default="")

            item['movieRate'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            item['peopleCount'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[4]/text()").extract()[0]
            # The second text node of the first <p> reads "year / country / genres",
            # separated by "\xa0/\xa0"; evaluate the XPath once and split it.
            meta = node.xpath("./div[2]/div[2]/p[1]/text()[2]").extract()[0].lstrip().split('\xa0/\xa0')
            item['movieDate'] = meta[0]
            item['movieCountry'] = meta[1]
            item['movieCategory'] = meta[2]
            item['moviePost'] = node.xpath("./div[@class='pic']/a/img/@src").extract()[0]
            yield item

        # 250 films at 25 per page: the last page starts at offset 225.
        # (Checking "< 250" would request an extra, empty page at start=250.)
        if self.offset < 225:
            self.offset += 25
            url = self.baseURL+str(self.offset)
            yield scrapy.Request(url,callback = self.parse)
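Two pieces of the parse logic can be checked without running a crawl: splitting the "year / country / genres" text on `'\xa0/\xa0'`, and the sequence of page URLs. The sketch below uses only plain Python. The helper names `split_meta` and `page_urls` and the sample string are hypothetical and are not part of the spider.

```python
SEPARATOR = "\xa0/\xa0"  # non-breaking spaces around a slash, as in the spider


def split_meta(raw):
    """Split the info text into (year, country, genres), padding missing parts."""
    parts = [part.strip() for part in raw.split(SEPARATOR)]
    parts += [""] * (3 - len(parts))  # guard against entries with fewer fields
    return parts[0], parts[1], parts[2]


def page_urls(base_url, total=250, page_size=25):
    """Return one URL per page: offsets 0, 25, ..., up to but excluding total."""
    return [base_url + str(offset) for offset in range(0, total, page_size)]


# Sample text shaped like the second text node of the info paragraph.
sample = "\n        1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n    "
year, country, genres = split_meta(sample)

urls = page_urls("https://movie.douban.com/top250?start=")
print(year, country, genres)
print(len(urls), urls[-1])
```

Padding short results with empty strings is one way to avoid an `IndexError` on an entry whose text does not follow the usual three-part layout; whether such entries exist on the live page is not something this sketch verifies.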

At this point the spider is essentially complete and produces the JSON file we need.
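For the pipeline to receive items, it has to be enabled in the generated settings file. The fragment below is a sketch of what that might look like. `ITEM_PIPELINES`, `USER_AGENT`, and `DOWNLOAD_DELAY` are standard Scrapy settings, but the package name `douban`, the user agent string, and the delay value are assumptions, not taken from the original project.

```python
# settings.py (fragment)

# Enable the JSON-writing pipeline. The dotted path assumes the project
# package is called "douban"; the number (0-1000) sets the run order.
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
}

# Assumption: the site may reject Scrapy's default user agent, so a
# browser-like string is supplied here. Replace it with your own.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"

# Assumption: a short pause between requests keeps the crawl polite.
DOWNLOAD_DELAY = 1
```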

Next comes the visualization.

First, let's take stock of the data we have.

import pandas as pd

# lines=True: the pipeline wrote one JSON object per line.
douban = pd.read_json('douban.json',lines=True,encoding='utf-8')
douban.info()

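From this we can analyze each film's country of origin, release year, and genre, as well as the influence of certain directors within the Top 250.

For a quick first look, we can use the value_counts() function.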

douban = pd.read_json('douban.json',lines=True,encoding='utf-8')
# Keep only the first country listed for each film, then count films per country.
df_Country = douban['movieCountry'].astype(str).str.strip().str.split().str[0]
print(df_Country.value_counts())
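To make the counting step concrete, here is a small sketch on an in-memory Series rather than the scraped file. The five country strings are invented sample values. The sketch assumes that multiple countries are separated by spaces and that the first one listed is the one to count.

```python
import pandas as pd

# Hypothetical movieCountry values; a film may list several countries.
countries = pd.Series(["美国", "美国 英国", " 日本", "香港 中国大陆", "美国"])

# Strip padding, split on whitespace, and keep the first country of each film.
first_country = countries.str.strip().str.split().str[0]

# value_counts() returns the number of films per country, largest first.
counts = first_country.value_counts()
print(counts)
```

Counting only the first listed country credits a co-production to a single country. Counting every listed country instead would need the split lists to be expanded first, for example with `Series.explode()`.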

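American films make up about half of the list, 122 of 250, which reflects the strength of the Hollywood film industry. Japanese and Hong Kong films likewise hold an important place in China. Surprisingly, the number of mainland Chinese films is disappointing: Douban users are still very demanding when it comes to domestic films.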

douban = pd.read_json('douban.json',lines=True,encoding='utf-8')
# Bucket each film by decade: first three digits of the year plus "0s", e.g. "1994" -> "1990s".
df_Date = douban['movieDate'].astype(str).str.strip().str[:3] + '0s'
print(df_Date.value_counts())
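The decade grouping can be tried on a few sample strings. This sketch swaps in a regular expression to pull out the first four-digit year, which is an addition of mine intended to cope with entries that carry extra text around the year. The sample values are invented.

```python
import pandas as pd

# Hypothetical movieDate values, including one with several years and notes.
dates = pd.Series(["1994", " 2010", "1961(中国大陆) / 1964(中国大陆)"])

# Take the first four-digit run in each string as the release year.
years = dates.str.extract(r"(\d{4})", expand=False)

# First three digits plus "0s" gives the decade label.
decades = years.str[:3] + "0s"
print(decades.value_counts())
```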

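Films released since 2000 account for more than 70% of the list. Considering that only nine years of the 2010s have passed and that ratings lag behind releases, newer films are on the whole more popular with the audience. This may be related to how the Douban Top 250 is selected, since a film needs a certain minimum number of ratings.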

douban = pd.read_json('douban.json',lines=True,encoding='utf-8')
# Keep only the first genre listed for each film, then count films per genre.
df_Cate = douban['movieCategory'].astype(str).str.strip().str.split().str[0]
print(df_Cate.value_counts())

Drama films, with the ups and downs of their plots, win audience approval more easily.

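Below are a few of the visualization charts.

I am not very good at plotting in Python, so the charts are a bit ugly. In practice I would recommend Echarts or a similar plugin, or Excel or BI software, for producing the charts; they are more convenient and look better.

This is my first attempt at this kind of scraping and visualization, so there are bound to be shortcomings. Corrections are welcome.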
