Over the weekend I saw a friend post on WeChat Moments about the newly released film 《少年的你》 (Better Days). Having just brushed up on web scraping, I decided on a whim to crawl the film's short comments on Douban and see what the word of mouth looks like. These notes are for learning purposes only and must not be used for commercial profit! If they infringe on the interests of any company, please let me know and I will delete them!
This post records the process of crawling the reviews with the requests library and the regular-expression module re. To install requests, run: pip3 install requests
1) Log in. First, register an account: crawling all of the short comments requires being logged in, which is itself a mild anti-scraping measure. Once we have an account, the first problem to solve is the login.
To capture the login URL, deliberately submit a wrong username and password; the browser's developer tools then show the URL and the corresponding request parameters: https://accounts.douban.com/j/mobile/login/basic
With that information in hand, we just send the request and log in. The code is as follows:
def login_douban():
    try:
        login_url = 'https://accounts.douban.com/j/mobile/login/basic'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
            'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony'
        }
        data = {
            'name': 'your username',
            'password': 'your password',
            'remember': 'false'
        }
        response = session.post(login_url, headers=headers, data=data)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('login failed')
        return None
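A note on why the post goes through session rather than calling requests.post directly: a requests.Session keeps the cookies that Douban sets in the login response, so every later request made through the same session is treated as logged in (session itself is created in the next snippet). A tiny usage sketch:

# Hypothetical usage: log in once at start-up, then keep reusing the same
# module-level `session` for every page request afterwards.
if login_douban() is not None:
    print('login request succeeded')
else:
    print('login failed')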
2) Crawl. Once logged in, we can open the short-comments page for 《少年的你》: https://movie.douban.com/subject/30166972/comments?status=P. Scroll to the bottom and page through the comments, and a URL with an obvious pattern appears: https://movie.douban.com/subject/30166972/comments?start=0&limit=20&sort=new_score&status=P
Note the parameters start=0&limit=20: they mean "starting from comment 0, return 20 comments", i.e. comments 0 through 19. With that we can define a method to fetch one page of comments; the code is below. The key point is that we use requests.Session() to keep the session state (and its login cookies) across requests.
session = requests.Session()


def get_comment_one_page(page=0):
    start = int(page * 20)
    comment_url = 'https://movie.douban.com/subject/30166972/comments?start=%d&limit=20&sort=new_score&status=P' % start
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    }
    try:
        response = session.get(comment_url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('failed to fetch comments')
        return None
3) Parse. After fetching the HTML in step 2, we need to parse it and pull out the comment data.
Looking at the page source, each comment is wrapped in a div with class="comment-item", so we can extract what we need from that div with a regular expression. In the pattern below, .*? matches any characters non-greedily, and the parts we want to extract are wrapped in (): the three capture groups pull out the rating, the comment time and the comment text.
'<div class="comment-item".*?comment-info">.*?rating".*?title="(.*?)">.*?"comment-time.*?title="(.*?)">.*?short">(.*?)</span>.*?</div>'
<div class="comment-item" data-cid="2012951340"> <div class="avatar"> <a title="天天天藍" href="https://www.douban.com/people/182748049/"> <img src="https://img3.doubanio.com/icon/u182748049-1.jpg" class=""> </a> </div> <div class="comment"> <h3> <span class="comment-vote"> <span class="votes">8867</span> <input value="2012951340" type="hidden"> <a href="javascript:;" class="j a_show_login" onclick="">有用</a> </span> <span class="comment-info"> <a href="https://www.douban.com/people/182748049/" class="">天天天藍</a> <span>看過</span> <span class="allstar50 rating" title="力薦"></span> <span class="comment-time " title="2019-10-25 09:34:22"> 2019-10-25 </span> </span> </h3> <p class=""> <span class="short">應該創造怎樣的世界讓少年成長是這個電影的主題...</span> </p> </div> </div>
The extraction method looks like this:
def parse_comment_one_page(html):
    pattern = re.compile(
        '<div class="comment-item".*?comment-info">.*?rating".*?title="(.*?)">.*?"comment-time.*?title="(.*?)">.*?short">(.*?)</span>.*?</div>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'star': item[0],
            'time': item[1],
            'context': item[2]
        }
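To sanity-check the pattern against a live page, steps 2 and 3 can be chained together like this (a small sketch, assuming login_douban() has already been called):

# Fetch page 0 and print what the parser extracts. Each yielded item is a
# dict with 'star' (the rating text such as 力薦), 'time' (the comment
# timestamp) and 'context' (the comment body).
html = get_comment_one_page(0)
if html is not None:
    for comment in parse_comment_one_page(html):
        print(comment['star'], comment['time'], comment['context'][:30])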
4) Crawl all pages. This step is actually the hard part, because you will find that your account gets blocked almost immediately.. ha ha. The proper fix is to rotate proxies (a rough sketch of what that looks like follows the code below), but I have not gotten around to studying that yet.
def get_comment_all_page():
    page = 0
    html = get_comment_one_page(page)
    # Keep going until a page fails to load.
    while html is not None:
        for item in parse_comment_one_page(html):
            print(item)
            write_to_file(item)
            # save_data_base(parse_comment_one_page(html))
        page += 1
        html = get_comment_one_page(page)
        time.sleep(random.random() * 3)
    print('done crawling')
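For completeness, the rotating-proxy idea mentioned above boils down to sending each request through a different proxy. A minimal sketch, assuming you have a pool of working HTTP proxies (the addresses below are placeholders, not real proxies):

import random

# Hypothetical proxy pool; in practice these would come from a proxy service
# or a proxy-pool project, and dead proxies should be dropped as they fail.
PROXY_POOL = [
    'http://127.0.0.1:1080',
    'http://127.0.0.1:1081',
]

def get_with_random_proxy(url, headers):
    proxy = random.choice(PROXY_POOL)
    # requests accepts a per-request proxies mapping; here the same proxy is
    # used for both http and https traffic.
    return session.get(url, headers=headers,
                       proxies={'http': proxy, 'https': proxy}, timeout=10)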
After the content has been crawled, we can write it to a file, or into a database:
def write_to_file(comments):
    with open(COMMENTS_FILE_PATH, 'a', encoding='utf-8') as f:
        f.write(json.dumps(comments, ensure_ascii=False) + '\n')


def save_data_base(items):
    rows = []
    for item in items:
        jsonstr = json.dumps(item, ensure_ascii=False)
        context = json.loads(jsonstr)
        tup = (context['star'], context['time'], context['context'])
        rows.append(tup)

    connection = pymysql.connect(
        host='127.0.0.1',
        port=10080,
        user='test',
        password='123456',
        db='webspider')
    cursor = connection.cursor()
    cmd = "insert into shaoniandeni (star,time,context) values (%s,%s,%s)"
    try:
        cursor.executemany(cmd, rows)
        connection.commit()
    except:
        connection.rollback()
        traceback.print_exc()
    finally:
        cursor.close()
        connection.close()
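Note that save_data_base assumes a shaoniandeni table already exists in the webspider database. The exact schema is not shown, so the following is only my guess at a workable one (run once before crawling):

import pymysql

# Assumed table layout: rating text, comment time as scraped, comment body.
create_table_sql = """
CREATE TABLE IF NOT EXISTS shaoniandeni (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    star    VARCHAR(16),
    time    VARCHAR(32),
    context TEXT
) DEFAULT CHARSET=utf8mb4
"""

connection = pymysql.connect(host='127.0.0.1', port=10080, user='test',
                             password='123456', db='webspider')
try:
    with connection.cursor() as cursor:
        cursor.execute(create_table_sql)
    connection.commit()
finally:
    connection.close()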
5) Analysis. Since I never finished crawling, the analysis is not accurate, but it still gives a rough picture. Grouping the rows by rating in MySQL shows how the scores are distributed.
Out of curiosity about what the people who gave the lowest rating wrote, I also pulled up the comments rated 很差 ("very bad"). The kind of queries involved in both checks is sketched below.
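This is only my assumption of what those queries look like, reusing the connection settings from save_data_base: a GROUP BY on the star column for the distribution, and a filter on star = '很差' for the negative comments.

import pymysql

connection = pymysql.connect(host='127.0.0.1', port=10080, user='test',
                             password='123456', db='webspider')
try:
    with connection.cursor() as cursor:
        # Breakdown of comments per rating text (e.g. 力薦 ... 很差).
        cursor.execute(
            "SELECT star, COUNT(*) FROM shaoniandeni GROUP BY star ORDER BY COUNT(*) DESC")
        for star, count in cursor.fetchall():
            print(star, count)

        # The comments that rated the film 很差 (the lowest rating).
        cursor.execute("SELECT context FROM shaoniandeni WHERE star = %s", ('很差',))
        for (context,) in cursor.fetchall():
            print(context)
finally:
    connection.close()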
6) Generate the word cloud. When generating the word cloud I filtered out a number of words; see the full source code below for the details.
7) Full source code.

import os
import re
import json
import time
import jieba
import random
import requests
import traceback
import pymysql
import numpy as np
import pymysql.cursors
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from requests.exceptions import RequestException

COMMENTS_FILE_PATH = 'douban_comments.txt'
# Word-cloud font (needs a font that contains Chinese glyphs)
WC_FONT_PATH = 'C:\\Windows\\fonts\\STFANGSO.TTF'
# Word-cloud shape (mask) image
WC_MASK_IMG = 'index.jpg'

session = requests.Session()


def login_douban():
    try:
        login_url = 'https://accounts.douban.com/j/mobile/login/basic'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
            'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony'
        }
        data = {
            'name': '',      # your Douban username
            'password': '',  # your Douban password
            'remember': 'false'
        }
        response = session.post(login_url, headers=headers, data=data)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('login failed')
        return None


def get_comment_one_page(page=0):
    start = int(page * 20)
    comment_url = 'https://movie.douban.com/subject/30166972/comments?start=%d&limit=20&sort=new_score&status=P' % start
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    }
    try:
        response = session.get(comment_url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('failed to fetch comments')
        return None


def parse_comment_one_page(html):
    pattern = re.compile(
        '<div class="comment-item".*?comment-info">.*?rating".*?title="(.*?)">.*?"comment-time.*?title="(.*?)">.*?short">(.*?)</span>.*?</div>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'star': item[0],
            'time': item[1],
            'context': item[2]
        }


def write_to_file(comments):
    with open(COMMENTS_FILE_PATH, 'a', encoding='utf-8') as f:
        f.write(json.dumps(comments, ensure_ascii=False) + '\n')


def get_comment_all_page():
    page = 0
    html = get_comment_one_page(page)
    # Keep going until a page fails to load.
    while html is not None:
        for item in parse_comment_one_page(html):
            print(item)
            write_to_file(item)
            # save_data_base(parse_comment_one_page(html))
        page += 1
        html = get_comment_one_page(page)
        time.sleep(random.random() * 3)
    print('done crawling')


def save_data_base(items):
    rows = []
    for item in items:
        jsonstr = json.dumps(item, ensure_ascii=False)
        context = json.loads(jsonstr)
        tup = (context['star'], context['time'], context['context'])
        rows.append(tup)

    connection = pymysql.connect(
        host='127.0.0.1',
        port=10080,
        user='test',
        password='123456',
        db='webspider')
    cursor = connection.cursor()
    cmd = "insert into shaoniandeni (star,time,context) values (%s,%s,%s)"
    try:
        cursor.executemany(cmd, rows)
        connection.commit()
    except:
        connection.rollback()
        traceback.print_exc()
    finally:
        cursor.close()
        connection.close()


def cut_word():
    """
    Segment the crawled comments into words.
    :return: the segmented text, space separated
    """
    with open(COMMENTS_FILE_PATH, "r", encoding="utf-8") as file:
        comment_txt = file.read()
    jieba.add_word('周冬雨')
    jieba.add_word('易烊千璽')
    jieba.add_word('白夜行')
    jieba.add_word('東野圭吾')
    wordlist = jieba.cut(comment_txt, cut_all=True)
    wl = " ".join(wordlist)
    # print(wl)
    return wl


def create_word_cloud():
    """
    Generate the word cloud.
    """
    # Load the word-cloud shape (mask) image
    wc_mask = np.array(Image.open(WC_MASK_IMG))
    # Words filtered out during data cleaning
    stop_words = ['就是', '不是', '但是', '還是', '這種', '只是', '這樣', '這個', '一個', '什么', '電影',
                  '沒有', '真的', '周冬雨', '易烊千璽', '冬雨', '千璽', '我們', '他們', '少年']
    # Word-cloud configuration: font, background colour, mask shape, size
    wc = WordCloud(background_color="white", max_words=50, mask=wc_mask, scale=4,
                   max_font_size=50, random_state=42, stopwords=stop_words,
                   font_path=WC_FONT_PATH)
    # Generate the word cloud
    wc.generate(cut_word())
    # With only a mask set, the word cloud takes the shape of the image
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.figure()
    plt.show()


def main():
    # login_douban()
    # get_comment_all_page()
    create_word_cloud()


if __name__ == '__main__':
    main()
8) Libraries used by the crawler (as imported in the source above): requests, re, json, jieba, wordcloud, matplotlib, numpy, Pillow (PIL) and pymysql.