Over the weekend I saw a friend post on WeChat Moments about the newly released film 《少年的你》 (Better Days). Having just brushed up on web scraping, I decided on a whim to crawl the film's short comments on Douban and see what the word of mouth looks like. These notes are for learning purposes only and must not be used for commercial profit! If they infringe on the interests of any company, please let me know and I will delete them!
This post records the process of crawling the reviews with the requests library and the regular-expression module re. To install requests, run: pip3 install requests
1) Log in. First, register an account: crawling all of the short comments requires being logged in, which is itself a mild anti-scraping measure. Once we have an account, the first problem to solve is the login.
To capture the login URL, deliberately submit a wrong username and password; the browser's developer tools then show the URL and the corresponding request parameters: https://accounts.douban.com/j/mobile/login/basic
With that information in hand, we just send the request and log in. The code is as follows:
def login_douban():
    try:
        login_url = 'https://accounts.douban.com/j/mobile/login/basic'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
            'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony'
        }
        data = {
            'name': 'your username',
            'password': 'your password',
            'remember': 'false'
        }
        response = session.post(login_url, headers=headers, data=data)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('login failed')
        return None
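A note on why the post goes through session rather than calling requests.post directly: a requests.Session keeps the cookies that Douban sets in the login response, so every later request made through the same session is treated as logged in (session itself is created in the next snippet). A tiny usage sketch:

# Hypothetical usage: log in once at start-up, then keep reusing the same
# module-level `session` for every page request afterwards.
if login_douban() is not None:
    print('login request succeeded')
else:
    print('login failed')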
2) Crawl. Once logged in, we can open the short-comments page for 《少年的你》: https://movie.douban.com/subject/30166972/comments?status=P. Scroll to the bottom and page through the comments, and a URL with an obvious pattern appears: https://movie.douban.com/subject/30166972/comments?start=0&limit=20&sort=new_score&status=P
Note the parameters start=0&limit=20: they mean "starting from comment 0, return 20 comments", i.e. comments 0 through 19. With that we can define a method to fetch one page of comments; the code is below. The key point is that we use requests.Session() to keep the session state (and its login cookies) across requests.
session = requests.Session()


def get_comment_one_page(page=0):
    start = int(page * 20)
    comment_url = 'https://movie.douban.com/subject/30166972/comments?start=%d&limit=20&sort=new_score&status=P' % start
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    }
    try:
        response = session.get(comment_url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('failed to fetch comments')
        return None
3) Parse. After fetching the HTML in step 2, we need to parse it and pull out the comment data.
Looking at the page source, each comment is wrapped in a div with class="comment-item", so we can extract what we need from that div with a regular expression. In the pattern below, .*? matches any characters non-greedily, and the parts we want to extract are wrapped in (): the three capture groups pull out the rating, the comment time and the comment text.
'<div class="comment-item".*?comment-info">.*?rating".*?title="(.*?)">.*?"comment-time.*?title="(.*?)">.*?short">(.*?)</span>.*?</div>'
<div class="comment-item" data-cid="2012951340"> <div class="avatar"> <a title="天天天藍" href="https://www.douban.com/people/182748049/"> <img src="https://img3.doubanio.com/icon/u182748049-1.jpg" class=""> </a> </div> <div class="comment"> <h3> <span class="comment-vote"> <span class="votes">8867</span> <input value="2012951340" type="hidden"> <a href="javascript:;" class="j a_show_login" onclick="">有用</a> </span> <span class="comment-info"> <a href="https://www.douban.com/people/182748049/" class="">天天天藍</a> <span>看過</span> <span class="allstar50 rating" title="力薦"></span> <span class="comment-time " title="2019-10-25 09:34:22"> 2019-10-25 </span> </span> </h3> <p class=""> <span class="short">應該創造怎樣的世界讓少年成長是這個電影的主題...</span> </p> </div> </div>
The extraction method looks like this:
def parse_comment_one_page(html):
    pattern = re.compile(
        '<div class="comment-item".*?comment-info">.*?rating".*?title="(.*?)">.*?"comment-time.*?title="(.*?)">.*?short">(.*?)</span>.*?</div>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'star': item[0],
            'time': item[1],
            'context': item[2]
        }
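To sanity-check the pattern against a live page, steps 2 and 3 can be chained together like this (a small sketch, assuming login_douban() has already been called):

# Fetch page 0 and print what the parser extracts. Each yielded item is a
# dict with 'star' (the rating text such as 力薦), 'time' (the comment
# timestamp) and 'context' (the comment body).
html = get_comment_one_page(0)
if html is not None:
    for comment in parse_comment_one_page(html):
        print(comment['star'], comment['time'], comment['context'][:30])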
4) Crawl all pages. This step is actually the hard part, because you will find that your account gets blocked almost immediately.. ha ha. The proper fix is to rotate proxies (a rough sketch of what that looks like follows the code below), but I have not gotten around to studying that yet.
def get_comment_all_page():
    page = 0
    html = get_comment_one_page(page)
    # Keep going until a page fails to load.
    while html is not None:
        for item in parse_comment_one_page(html):
            print(item)
            write_to_file(item)
            # save_data_base(parse_comment_one_page(html))
        page += 1
        html = get_comment_one_page(page)
        time.sleep(random.random() * 3)
    print('done crawling')
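For completeness, the rotating-proxy idea mentioned above boils down to sending each request through a different proxy. A minimal sketch, assuming you have a pool of working HTTP proxies (the addresses below are placeholders, not real proxies):

import random

# Hypothetical proxy pool; in practice these would come from a proxy service
# or a proxy-pool project, and dead proxies should be dropped as they fail.
PROXY_POOL = [
    'http://127.0.0.1:1080',
    'http://127.0.0.1:1081',
]

def get_with_random_proxy(url, headers):
    proxy = random.choice(PROXY_POOL)
    # requests accepts a per-request proxies mapping; here the same proxy is
    # used for both http and https traffic.
    return session.get(url, headers=headers,
                       proxies={'http': proxy, 'https': proxy}, timeout=10)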
After the content has been crawled, we can write it to a file, or into a database:
def write_to_file(comments):
    with open(COMMENTS_FILE_PATH, 'a', encoding='utf-8') as f:
        f.write(json.dumps(comments, ensure_ascii=False) + '\n')


def save_data_base(items):
    rows = []
    for item in items:
        jsonstr = json.dumps(item, ensure_ascii=False)
        context = json.loads(jsonstr)
        tup = (context['star'], context['time'], context['context'])
        rows.append(tup)

    connection = pymysql.connect(
        host='127.0.0.1',
        port=10080,
        user='test',
        password='123456',
        db='webspider')
    cursor = connection.cursor()
    cmd = "insert into shaoniandeni (star,time,context) values (%s,%s,%s)"
    try:
        cursor.executemany(cmd, rows)
        connection.commit()
    except:
        connection.rollback()
        traceback.print_exc()
    finally:
        cursor.close()
        connection.close()
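Note that save_data_base assumes a shaoniandeni table already exists in the webspider database. The exact schema is not shown, so the following is only my guess at a workable one (run once before crawling):

import pymysql

# Assumed table layout: rating text, comment time as scraped, comment body.
create_table_sql = """
CREATE TABLE IF NOT EXISTS shaoniandeni (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    star    VARCHAR(16),
    time    VARCHAR(32),
    context TEXT
) DEFAULT CHARSET=utf8mb4
"""

connection = pymysql.connect(host='127.0.0.1', port=10080, user='test',
                             password='123456', db='webspider')
try:
    with connection.cursor() as cursor:
        cursor.execute(create_table_sql)
    connection.commit()
finally:
    connection.close()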
5) Analysis. Since I never finished crawling, the analysis is not accurate, but it still gives a rough picture. Grouping the rows by rating in MySQL shows how the scores are distributed.
Out of curiosity about what the people who gave the lowest rating wrote, I also pulled up the comments rated 很差 ("very bad"). The kind of queries involved in both checks is sketched below.
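This is only my assumption of what those queries look like, reusing the connection settings from save_data_base: a GROUP BY on the star column for the distribution, and a filter on star = '很差' for the negative comments.

import pymysql

connection = pymysql.connect(host='127.0.0.1', port=10080, user='test',
                             password='123456', db='webspider')
try:
    with connection.cursor() as cursor:
        # Breakdown of comments per rating text (e.g. 力薦 ... 很差).
        cursor.execute(
            "SELECT star, COUNT(*) FROM shaoniandeni GROUP BY star ORDER BY COUNT(*) DESC")
        for star, count in cursor.fetchall():
            print(star, count)

        # The comments that rated the film 很差 (the lowest rating).
        cursor.execute("SELECT context FROM shaoniandeni WHERE star = %s", ('很差',))
        for (context,) in cursor.fetchall():
            print(context)
finally:
    connection.close()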
6) Generate the word cloud. When generating the word cloud I filtered out a number of words; see the full source code below for the details.
7) Full source code.

import os
import re
import json
import time
import jieba
import random
import requests
import traceback
import pymysql
import numpy as np
import pymysql.cursors
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from requests.exceptions import RequestException

COMMENTS_FILE_PATH = 'douban_comments.txt'
# Word-cloud font (needs a font that contains Chinese glyphs)
WC_FONT_PATH = 'C:\\Windows\\fonts\\STFANGSO.TTF'
# Word-cloud shape (mask) image
WC_MASK_IMG = 'index.jpg'

session = requests.Session()


def login_douban():
    try:
        login_url = 'https://accounts.douban.com/j/mobile/login/basic'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
            'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony'
        }
        data = {
            'name': '',      # your Douban username
            'password': '',  # your Douban password
            'remember': 'false'
        }
        response = session.post(login_url, headers=headers, data=data)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('login failed')
        return None


def get_comment_one_page(page=0):
    start = int(page * 20)
    comment_url = 'https://movie.douban.com/subject/30166972/comments?start=%d&limit=20&sort=new_score&status=P' % start
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    }
    try:
        response = session.get(comment_url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('failed to fetch comments')
        return None


def parse_comment_one_page(html):
    pattern = re.compile(
        '<div class="comment-item".*?comment-info">.*?rating".*?title="(.*?)">.*?"comment-time.*?title="(.*?)">.*?short">(.*?)</span>.*?</div>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'star': item[0],
            'time': item[1],
            'context': item[2]
        }


def write_to_file(comments):
    with open(COMMENTS_FILE_PATH, 'a', encoding='utf-8') as f:
        f.write(json.dumps(comments, ensure_ascii=False) + '\n')


def get_comment_all_page():
    page = 0
    html = get_comment_one_page(page)
    # Keep going until a page fails to load.
    while html is not None:
        for item in parse_comment_one_page(html):
            print(item)
            write_to_file(item)
            # save_data_base(parse_comment_one_page(html))
        page += 1
        html = get_comment_one_page(page)
        time.sleep(random.random() * 3)
    print('done crawling')


def save_data_base(items):
    rows = []
    for item in items:
        jsonstr = json.dumps(item, ensure_ascii=False)
        context = json.loads(jsonstr)
        tup = (context['star'], context['time'], context['context'])
        rows.append(tup)

    connection = pymysql.connect(
        host='127.0.0.1',
        port=10080,
        user='test',
        password='123456',
        db='webspider')
    cursor = connection.cursor()
    cmd = "insert into shaoniandeni (star,time,context) values (%s,%s,%s)"
    try:
        cursor.executemany(cmd, rows)
        connection.commit()
    except:
        connection.rollback()
        traceback.print_exc()
    finally:
        cursor.close()
        connection.close()


def cut_word():
    """
    Segment the crawled comments into words.
    :return: the segmented text, space separated
    """
    with open(COMMENTS_FILE_PATH, "r", encoding="utf-8") as file:
        comment_txt = file.read()
    jieba.add_word('周冬雨')
    jieba.add_word('易烊千璽')
    jieba.add_word('白夜行')
    jieba.add_word('東野圭吾')
    wordlist = jieba.cut(comment_txt, cut_all=True)
    wl = " ".join(wordlist)
    # print(wl)
    return wl


def create_word_cloud():
    """
    Generate the word cloud.
    """
    # Load the word-cloud shape (mask) image
    wc_mask = np.array(Image.open(WC_MASK_IMG))
    # Words filtered out during data cleaning
    stop_words = ['就是', '不是', '但是', '還是', '這種', '只是', '這樣', '這個', '一個', '什么', '電影',
                  '沒有', '真的', '周冬雨', '易烊千璽', '冬雨', '千璽', '我們', '他們', '少年']
    # Word-cloud configuration: font, background colour, mask shape, size
    wc = WordCloud(background_color="white", max_words=50, mask=wc_mask, scale=4,
                   max_font_size=50, random_state=42, stopwords=stop_words,
                   font_path=WC_FONT_PATH)
    # Generate the word cloud
    wc.generate(cut_word())
    # With only a mask set, the word cloud takes the shape of the image
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.figure()
    plt.show()


def main():
    # login_douban()
    # get_comment_all_page()
    create_word_cloud()


if __name__ == '__main__':
    main()
8) Libraries used by the crawler (as imported in the source above): requests, re, json, jieba, wordcloud, matplotlib, numpy, Pillow (PIL) and pymysql.