Python Crawler in Practice: Douban Simulated Login + Review Scraping + Word Cloud

Reposted article, 2020-04-02 11:16, Python

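Project description

Scrape the short comments on Douban for the film 《哪吒之魔童降世》 (Ne Zha) and build a word cloud from them.

Techniques covered:

Object-oriented Python
Simulated login and content scraping
HTML parsing with BeautifulSoup (the counterpart of JSoup in Java)
Word segmentation and word cloud generation

What you can do afterwards: scrape other web content that interests you, such as novels, images, music, and films, or collect other valuable data, for example e-commerce product information for a price-comparison site.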

Environment setup

Install Python 3.x using the installer from the official website;
Install the third-party packages used in this project

pip install requests
pip install beautifulsoup4
pip install Pillow
pip install pandas
pip install numpy
pip install jieba
pip install wordcloud

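About the third-party packages
requests: fetches data from a URL
beautifulsoup4: parses HTML and extracts the useful data from a page
Pillow (imported as PIL): loads and displays images
pandas: data processing and saving to tabular files
numpy: data processing and matrix operations
jieba: Chinese word segmentation
wordcloud: generates the word cloud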
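
Before moving on, it can help to confirm that the packages above are actually importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for illustration. Note that two packages are imported under a different name than the one passed to pip: `beautifulsoup4` as `bs4` and `Pillow` as `PIL`.

```python
import importlib.util

# Import names for the packages listed above; note that Pillow is imported
# as "PIL" and beautifulsoup4 as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "PIL", "pandas", "numpy", "jieba", "wordcloud"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```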

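Douban simulated login

Why is a simulated login needed?

Some websites restrict access for visitors who are not logged in. For example, without logging in, only 200 Douban review comments can be read.

Simulated login workflow:

Open the login page;
Open the Chrome DevTools console (right-click the page and choose “Inspect”, or press F12);
Log in;
Capture the login request in the Network panel of Chrome DevTools

This gives the following information:
Login URL: https://accounts.douban.com/j/mobile/login/basic
Login parameters:

{
    'ck': '',
    'name': "your Douban account",
    'password': "your Douban password",
    'remember': 'false',
    'ticket': ''
}

Reference login code: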

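import requests

class DouBan:
    def __init__(self):
        self.login_url = 'https://accounts.douban.com/j/mobile/login/basic'
        # Browser-like User-Agent so the requests look like they come from Chrome
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
        }
        # Form fields captured from the login request in DevTools
        self.login_data = {
            'ck': '',
            'name': "your Douban account",
            'password': "your Douban password",
            'remember': 'false',
            'ticket': ''
        }
        # A Session keeps the login cookies for all later requests
        self.session = requests.Session()
        self.login()

    def login(self):
        # Post the login form and print the JSON reply for inspection
        response = self.session.post(self.login_url, data=self.login_data, headers=self.headers)
        print(response.json())

    def get_html(self, url):
        # Fetch a page through the logged-in session
        return self.session.get(url, headers=self.headers)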
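
The `login` method only prints the JSON reply, so a failed login would go unnoticed and the crawler would carry on with an anonymous session. Below is a minimal sketch of a check. It assumes the reply has a top-level `status` field equal to `"success"` when the login worked; that field name is an assumption about Douban's response and is not confirmed by this article, and `login_succeeded` is a hypothetical helper.

```python
def login_succeeded(payload):
    """Return True if a decoded login response looks successful.

    Assumption: the JSON body has a top-level "status" key whose value is
    "success" when the login worked. Adjust this to whatever the real
    response contains.
    """
    return isinstance(payload, dict) and payload.get("status") == "success"


# Hypothetical payloads, for illustration only.
print(login_succeeded({"status": "success"}))   # True
print(login_succeeded({"status": "failed"}))    # False
```

Inside `login`, one could call `login_succeeded(response.json())` and raise an error when it returns False, rather than scraping with a session that never logged in.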

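Scraping the reviews

Find the review page for 《哪吒之魔童降世》 on Douban.
Analyze the short-comment page and decide which fields to extract:

User name
('.comment-info a')[0].text
Star rating
('.rating')[0]['class'][0][7:8]
Comment text
('.short')[0].text
Time
('.comment-time')[0].text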
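
The star-rating expression is the least obvious of the four: it takes the first CSS class of the rating element and slices out the character at index 7. This works if that class looks like `allstar40`, where `allstar` is seven characters long and the next digit is the star count; that class shape is inferred from the slice and is an assumption here. A slightly more defensive sketch, with a hypothetical helper `stars_from_class`:

```python
import re


def stars_from_class(class_name):
    """Extract the star count from a class name such as 'allstar40'.

    Returns the first digit after the 'allstar' prefix as an int, or None
    if the class name does not have that shape.
    """
    match = re.fullmatch(r"allstar(\d)\d*", class_name)
    return int(match.group(1)) if match else None


print("allstar40"[7:8])               # the slice used above; prints 4
print(stars_from_class("allstar40"))  # 4
print(stars_from_class("rating"))     # None
```

Unlike the bare slice, the helper returns None instead of an unrelated character when the class list comes in a different order.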

Pagination
1) Determine the pagination URL
https://movie.douban.com/subject/26794435/comments?start=0&limit=20&sort=new_score&status=P
2) Determine the total count (that is, when to stop)
Scrape only 500 comments
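
The pagination scheme can be sketched on its own: the `start` parameter is the offset of the first comment on a page, and with `limit=20` it advances by 20 per page. The function name `page_urls` is made up for illustration.

```python
COMMENT_URL = ("https://movie.douban.com/subject/26794435/comments"
               "?start=%d&limit=20&sort=new_score&status=P")


def page_urls(total=500, page_size=20):
    """Build one URL per page, stepping the `start` offset by `page_size`."""
    return [COMMENT_URL % start for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 25
print(urls[-1])    # last page starts at offset 480
```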

# Assumes the DouBan class above is saved as nezha/douban2.py
from nezha.douban2 import DouBan
import time
import random
from bs4 import BeautifulSoup
import pandas as pd
import jieba
from wordcloud import WordCloud
import numpy as np
from PIL import Image

class nezha2:
    def __init__(self):
        self.comment_url = 'https://movie.douban.com/subject/26794435/comments?start=%d&limit=20&sort=new_score&status=P'
        self.comment_count = 500
        self.douban = DouBan()

    def get_comments(self):
        comments = {'users': [], 'ratings': [], 'shorts': [], 'times': []}
        # Step through the pages, 20 comments at a time
        for i in range(0, self.comment_count, 20):
            # Random pause of up to one second between requests
            time.sleep(random.random())
            url = self.comment_url % i
            response = self.douban.get_html(url)
            print('Progress:', i, 'comments, status:', response.status_code)
            # Name the parser explicitly so bs4 does not have to guess one
            soup = BeautifulSoup(response.text, 'html.parser')
            for comment in soup.select('.comment-item'):
                try:
                    user = comment.select('.comment-info a')[0].text
                    rating = comment.select('.rating')[0]['class'][0][7:8]
                    short = comment.select('.short')[0].text
                    t = comment.select('.comment-time')[0].text.strip()
                except (IndexError, KeyError):
                    # Skip comments that lack a field (for example, no star rating)
                    continue
                else:
                    comments['users'].append(user)
                    comments['ratings'].append(rating)
                    comments['shorts'].append(short)
                    comments['times'].append(t)

        comments_pd = pd.DataFrame(comments)
        # Save the full comment records
        comments_pd.to_csv('comments.csv')
        # Save only the comment text as the input for word segmentation
        comments_pd['shorts'].to_csv('shorts.csv', index=False)

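Word segmentation

Segment the text with jieba. Be sure to filter out meaningless words, otherwise the result fills up with words such as “我” (I), “是” (is), and “一” (one).

    def word_cut(self):
        # Register custom words so jieba keeps them as single tokens
        jieba.load_userdict('data/mywords.txt')

        # Load the scraped comment text
        with open('shorts.csv', 'r', encoding='utf8') as f:
            comments = f.read()

        # Load the stop words, one per line; a set makes lookups fast
        with open('data/stop.txt', encoding='utf8') as f:
            stop_words = set(f.read().splitlines())

        words = []
        # Drop the stop words
        for word in jieba.cut(comments):
            if word not in stop_words:
                words.append(word)

        words = ' '.join(words)
        return words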
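
The filtering step can be tried in isolation without jieba. The sketch below uses a hand-written token list as a stand-in for the output of `jieba.cut()`; the tokens, the stop words, and the helper name `filter_tokens` are all made up for illustration. It also drops whitespace-only tokens, which the method above leaves in.

```python
from collections import Counter


def filter_tokens(tokens, stop_words):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    stop_set = set(stop_words)  # set membership is O(1) on average
    return [t for t in tokens if t.strip() and t not in stop_set]


# Hypothetical tokens, standing in for the output of jieba.cut().
tokens = ["我", "喜歡", "哪吒", "，", "哪吒", "是", "好", "電影", " "]
stop_words = ["我", "是", "，"]

kept = filter_tokens(tokens, stop_words)
print(" ".join(kept))                 # 喜歡 哪吒 哪吒 好 電影
print(Counter(kept).most_common(1))   # [('哪吒', 2)]
```

Counting the kept tokens this way gives a quick check on which words will dominate the word cloud.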

Word cloud

Generate the word cloud with wordcloud

    def generate_wordcount(self):
        word_cloud = WordCloud(
            background_color='white',
            font_path='/System/Library/Fonts/PingFang.ttc',  # font with Chinese glyphs (macOS path)
            mask=np.array(Image.open('data/nezha.jpg')),  # image that gives the cloud its shape
            max_font_size=100
        )
        word_cloud.generate(self.word_cut())
        # Show the result, then save it to disk
        word_cloud.to_image().show()
        word_cloud.to_file('word.jpg')
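
The `font_path` above is a macOS location, so the code needs a different path on other systems. A minimal sketch of picking the first font file that exists is shown below. Only the first candidate comes from this article; the Windows and Linux paths are assumptions about typical installs and may not match a given machine, and `pick_font` is a hypothetical helper.

```python
import os

# Candidate fonts with CJK glyphs. Only the first path comes from the article;
# the others are assumptions about typical Windows and Linux installs.
FONT_CANDIDATES = [
    "/System/Library/Fonts/PingFang.ttc",              # macOS (from the article)
    "C:/Windows/Fonts/msyh.ttc",                       # Windows, Microsoft YaHei (assumed)
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # Linux, WenQuanYi (assumed)
]


def pick_font(candidates=FONT_CANDIDATES):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


print(pick_font())
```

If the function returns None, it is probably better to stop and install a suitable font than to continue, since a font without Chinese glyphs is likely to render the words as empty boxes.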

Word cloud result
