Sentiment Analysis: English Movie Reviews

Reposted article (original post dated 2019-03-26 09:27). Category: NLP

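I. Introduction

　Sentiment analysis, sometimes called opinion mining, is an important branch of NLP. It analyzes the sentiment expressed in reviews, articles, news reports and similar text. Understanding these sentiments matters because they can guide or inform many downstream decisions.

　In NLP, text or words must first be converted to a numeric form so that machine learning or deep learning algorithms can use them. Several models exist for this conversion, such as the bag-of-words model (BOW) and word2vec.

　The BOW model is explained below with an example: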

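　a) Assumptions behind BOW

　　The BOW model treats a document as nothing more than a collection of words, ignoring word order, grammar and syntax. Each word's occurrence is assumed to be independent of whether any other word occurs.

　　In other words, any word at any position in the document is treated as occurring independently of the document's meaning.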
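
　　To make the order-independence assumption concrete, here is a minimal sketch using only the Python standard library. The helper name bag_of_words and the two sample sentences are illustrative assumptions, not part of the original tutorial.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring case and word order."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order.
a = bag_of_words("the weather is sweet")
b = bag_of_words("sweet is the weather")

# Under the BOW assumption both sentences get the same representation.
print(a == b)
```

　　Because only the counts are kept, any information carried by word order is lost, which is the price BOW pays for its simplicity.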

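　b) TF-IDF

　　Suppose we have three documents:

　　　1. The sun is shining
　　　2. The weather is sweet
　　　3. The sun is shining and the weather is sweet

　　From these three documents (for simplicity, one sentence stands for one document) we build a dictionary, or vocabulary. How is the dictionary built? First, list the words that occur; then assign each word an index.

　　The three documents contain 7 distinct words (case-insensitive): the, is, sun, shining, and, weather, sweet.

　　Numbering these 7 words from 0 in alphabetical order gives a word-to-index dictionary: {'and':0,'is':1,'shining':2,'sun':3,'sweet':4,'the':5,'weather':6}

　　Using this dictionary, the three documents are converted into feature vectors, where each position holds the number of times the corresponding word occurs: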

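　　The first sentence becomes: [0 1 1 1 0 1 0]

　　The second sentence becomes: [0 1 0 0 1 1 1]

　　The third sentence becomes: [1 2 1 1 1 2 1]

　　A 0 means the word does not occur in the document, 1 means it occurs once, and 2 means it occurs twice. These counts are the term frequency tf(t,d): the number of times term t occurs in document d.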
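
　　The construction above can be reproduced in a few lines of plain Python. This is a rough sketch that splits on whitespace, which is a simplification of what a real vectorizer does; the variable names are illustrative assumptions.

```python
docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining and the weather is sweet",
]

# Tokenize case-insensitively on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Number the distinct words alphabetically, starting from 0.
distinct_words = sorted({word for doc in tokenized for word in doc})
vocabulary = {word: idx for idx, word in enumerate(distinct_words)}

# Count how often each vocabulary word occurs in each document: tf(t, d).
vectors = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
for vec in vectors:
    print(vec)
```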

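　　When some words occur frequently in every document, they usually carry little useful or distinctive information. How can their weight in the vector be reduced? One technique is the inverse document frequency (idf).

　　Combining the raw term frequency with the inverse document frequency gives the term frequency-inverse document frequency (tf-idf), computed as: \(\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)\)

　　where \(\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d,t)}\), \(n_d\) is the total number of documents (3 here), and \(\text{df}(d,t)\) is the number of documents that contain term t.

　　Taking the logarithm keeps terms with a low document frequency from being given too much weight, and the 1 added to the denominator guards against division by zero when \(\text{df}(d,t)\) is 0. Some implementations also add 1 to the numerator, which becomes \(1+n_d\).

　　For example, for the word "the" in the first sentence (document d1): \(\text{tf-idf}(\text{the}, d_1) = \text{tf}(\text{the}, d_1) \times \text{idf}(\text{the}, d_1) = 1 \times \log\frac{3}{1+3} = \log 0.75\)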
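
　　As a quick check of this formula, the sketch below evaluates it for two words of the example. It assumes the natural logarithm (the text only writes "log"), and the names n_docs, df, idf and tf_idf are illustrative.

```python
import math

n_docs = 3
# Document frequency of each term, read off the three example documents.
df = {"and": 1, "is": 3, "shining": 2, "sun": 2, "sweet": 2, "the": 3, "weather": 2}

def idf(term):
    """Textbook idf: log(n_d / (1 + df(d, t)))."""
    return math.log(n_docs / (1 + df[term]))

def tf_idf(tf, term):
    """tf-idf for a term that occurs tf times in the document."""
    return tf * idf(term)

# 'the' occurs once in the first document and appears in all 3 documents.
the_weight = tf_idf(1, "the")   # equals log(0.75)
# 'and' appears in only 1 document, so it receives a larger weight.
and_weight = tf_idf(1, "and")   # equals log(1.5)
print(the_weight, and_weight)
```

　　With this variant, a word that appears in every document ends up with a negative weight, since log 0.75 is below zero, while the rarer word gets a positive one.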

　　These calculations are available off the shelf; below we use scikit-learn to do them.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)  # learn the vocabulary and count the words
print(count.vocabulary_)         # word -> column index

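　　The code above imports CountVectorizer, instantiates it as count, and prints the learned vocabulary, which is the same word-to-index dictionary built by hand earlier.

　　Next, print the document vectors; they match the vectors worked out above: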

print(bag.toarray())

　　We then compute the tf-idf values of the documents:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(bag).toarray())

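　　Note: when computing tf-idf, sklearn also normalizes the result; TfidfTransformer uses the L2 norm by default.

　　Following sklearn's formula, \(\text{tf-idf}(t,d) = \text{tf}(t,d) \times \left( \ln\frac{1+n_d}{1+\text{df}(d,t)} + 1 \right)\), the result is easy to verify. Take the first sentence as an example: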

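　　The first element (index 0) is "and": its term frequency is 0, so its tf-idf is 0.

　　The second element (index 1) is "is": its term frequency is 1. For the idf, the numerator is 1+3 = 4 and the denominator is 1 + (number of documents containing the word) = 1+3 = 4, so ln(4/4) = 0; adding 1 gives an idf of 1, and the tf-idf is 1.

　　The third element (index 2) is "shining": its term frequency is 1. The numerator is 1+3 = 4 and the denominator is 1+2 = 3, so ln(4/3) = 0.287682; adding 1 gives a tf-idf of 1.29.

　　......

　　The seventh element (index 6) is "weather": its term frequency is 0, so its tf-idf is 0.

　　Putting this together, the vector for the first sentence (document) is \(v = \text{tf-idf}(t,d_1) = [0, 1, 1.29, 1.29, 0, 1, 0]\). The sum of the squared elements, using the less rounded value 1.2877, is \(0 + 1 + 1.2877^2 + 1.2877^2 + 0 + 1 + 0 \approx 5.316\)

　　\(\text{tf-idf}(t,d_1)_{norm} = v/\sqrt{\sum_i v_i^2} = v/\sqrt{5.316} = v/2.31 = [0, 0.43, 0.56, 0.56, 0, 0.43, 0]\), which agrees with the tf-idf values computed by sklearn.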
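
　　The hand calculation for the first sentence can also be reproduced with the standard library alone. The sketch below applies the smoothed formula and the L2 normalization step by step; the variable names are illustrative, and the document frequencies are read off the three example sentences.

```python
import math

n_docs = 3
tf = [0, 1, 1, 1, 0, 1, 0]   # term counts for "The sun is shining"
df = [1, 3, 2, 2, 2, 3, 2]   # documents containing and, is, shining, sun, sweet, the, weather

# Smoothed idf with the natural log: ln((1 + n_d) / (1 + df)) + 1
raw = [t * (math.log((1 + n_docs) / (1 + d)) + 1) for t, d in zip(tf, df)]

# L2 normalization: divide every component by the Euclidean length.
length = math.sqrt(sum(x * x for x in raw))
normalized = [round(x / length, 2) for x in raw]

print([round(x, 2) for x in raw])
print(normalized)
```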

II. Worked Example: Sentiment Analysis of English Movie Reviews

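　1) Downloading the data

　　Link: http://ai.stanford.edu/~amaas/data/sentiment (in case readers cannot open that link, the downloaded file aclImdb_v1.tar.gz has also been uploaded to Baidu Cloud: https://pan.baidu.com/s/1AuMyDBuJcZ-KT5AvsdXQig with extraction code 6m11)

　　Create a folder named 情感分析 ("sentiment analysis") on the desktop and extract the downloaded data into it. The dataset then sits in the 情感分析/aclImdb folder, which contains the train and test directories.

　　The train and test directories each have two subdirectories, neg and pos. The neg directory holds a large number of txt files with negative reviews, and pos holds a large number of txt files with positive reviews. Both train and test share this layout.

　　Inside the 情感分析 folder, hold Shift and right-click, choose to open a command window there, and run jupyter notebook. Then create a new notebook named emotion_analysis.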

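　2) Data processing

　　pyprind is a small progress-bar utility that readers may need to install: pip install pyprind

　　We merge the neg and pos reviews under train and test into a single table, showing a progress bar while the data is processed: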

import os

import pandas as pd
import pyprind

pbar = pyprind.ProgBar(50000)

labels = {'pos': 1, 'neg': 0}
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='UTF-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

　　In the author's run, this step took less than 2 minutes.

　　To randomize the order of the data we shuffle it with sklearn.utils.shuffle and then save it to a csv file:

from sklearn.utils import shuffle

df = shuffle(df)
df.to_csv('./movie_data.csv', index = False)

　　Reloading the data shows that the labels are now shuffled:

df = pd.read_csv('./movie_data.csv')
df.head(10)

　3) Processing with NLTK

　　a) Downloading the stop words

import nltk
nltk.download('stopwords')

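　　NLTK prints a confirmation message once the download succeeds.

　　b) Preprocessing the text: filtering stop words, removing extraneous symbols, and so on

　　The review files contain many HTML tags such as <br />, for example in the file 10011_9.txt.

　　These need to be removed: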

import re
s = 'movies of the summer.<br /><br />Robin Williams'

re.sub('<[^>]*>', '', s)
#'movies of the summer.Robin Williams'

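　　Inside square brackets, ^ means negation, so [^>] matches any character other than >. \W matches any non-word character, such as spaces and parentheses: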

import re
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))  # a set makes membership tests fast

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)          # strip HTML tags
    text = re.sub(r'[\W]+', ' ', text.lower())   # replace runs of non-word characters with a space
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

　　The \W pattern above strips the non-word characters.
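
　　To see what the tokenizer produces without downloading the NLTK corpus, the sketch below swaps in a tiny hand-picked stop-word set. That set and the sample sentence are assumptions made for illustration; the tutorial itself uses stopwords.words('english').

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop-word list.
stop = {"a", "and", "is", "of", "the", "this"}

def tokenizer(text):
    text = re.sub(r"<[^>]*>", "", text)           # strip HTML tags
    text = re.sub(r"[\W]+", " ", text.lower())    # non-word runs -> single space
    return [w for w in text.split() if w not in stop]

sample = "This is one of the BEST movies of the summer.<br /><br />Robin Williams!"
tokens = tokenizer(sample)
print(tokens)
```

　　The tags are removed first, so "summer." and "Robin" are only separated afterwards, when the full stop is replaced by a space.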

　　c) Define a generator function that reads documents from the csv file

def stream_docs(path):
    with open(path, 'r', encoding='UTF-8') as csv:
        next(csv)#skip header
        for line in csv:  # line kept whole: index -1 is the newline, -2 the label, -3 the comma
            text, label = line[:-3], int(line[-2])
            yield text, label

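　　Each data line of movie_data.csv ends with a comma, the one-digit label and a newline, which is what the slicing above relies on.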
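
　　The slicing can be tried on a small in-memory file. In the sketch below, stream_docs_from is a hypothetical variant of stream_docs that takes an already open handle, and the two reviews are made up for illustration.

```python
import io

def stream_docs_from(handle):
    """Same slicing as stream_docs, but reading from any open text handle."""
    next(handle)  # skip header
    for line in handle:
        yield line[:-3], int(line[-2])

# Hypothetical two-review file held in memory.
fake_csv = io.StringIO('review,sentiment\n"Loved it, truly",1\nDull and slow,0\n')

rows = list(stream_docs_from(fake_csv))
print(rows)
```

　　Since the line is sliced rather than parsed as CSV, any quotation marks added around a review stay in the text. The approach also assumes that every review fits on one line and that the label is a single digit.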

　　About generator functions: a generator function returns an iterator that produces its items on demand, which saves memory. An example:

doc_stream = stream_docs(path='./movie_data.csv')
text, label = next(doc_stream)
print(text, label)

# fetch one more element
text, label = next(doc_stream)
print(text, label)

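　　The example above retrieves successive elements from the iterator with the next() function.

　　This small example shows that calling a generator function returns an iterable object; doc_stream above is one such object.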
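
　　The lazy behaviour is easier to see with a generator that announces each item as it is produced. The function numbered_lines below is a made-up example, not part of the tutorial code.

```python
def numbered_lines(lines):
    """Yield (index, line) pairs one at a time instead of building a list."""
    for i, line in enumerate(lines):
        print("producing item", i)
        yield i, line

gen = numbered_lines(["first review", "second review"])  # nothing runs yet
first = next(gen)       # runs the body up to the first yield
second = next(gen)      # resumes right after the first yield
remaining = list(gen)   # the generator is now exhausted
print(first, second, remaining)
```

　　Only one item exists in memory at a time, which is why the same pattern works for a csv file far larger than RAM.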

　　d) Define a function that fetches one mini-batch of data at a time

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None,None
        
    return docs, y
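
　　One detail worth knowing: when the stream runs out in the middle of a batch, the function returns (None, None) and the partly filled batch is discarded. The sketch below repeats the function so that it runs on its own and feeds it a hypothetical in-memory stream of five items.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical in-memory stream of five (text, label) pairs.
stream = iter([("good", 1), ("bad", 0), ("fine", 1), ("awful", 0), ("great", 1)])

print(get_minibatch(stream, 2))   # a full batch of two
print(get_minibatch(stream, 2))   # another full batch
print(get_minibatch(stream, 2))   # one item left: the partial batch is dropped
```

　　With 50,000 reviews this does not lose any data as long as the batch sizes add up exactly, for example 45 batches of 1,000 followed by one batch of 5,000.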

　　As an example, we set size to 3 to read three rows from the file; doc_stream is the iterable object created by stream_docs:

doc_stream = stream_docs(path='./movie_data.csv')

docs, y = get_minibatch(doc_stream, 3)
for i in range(3):
    print(docs[i])
    print(y[i])


　　e) Use sklearn's HashingVectorizer to turn the text into feature vectors

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

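# n_features=2**21 gives about 2 million hash buckets, as in the complete code below
vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
# 'log_loss' selects logistic regression (named 'log' before scikit-learn 1.1); the removed n_iter argument is dropped
clf = SGDClassifier(loss='log_loss', random_state=1)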
doc_stream = stream_docs(path='./movie_data.csv')
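
　　The idea behind a hashing vectorizer can be sketched without sklearn: each token is hashed straight to a column index, so no vocabulary has to be stored or fitted. The function hash_vectorize below is a hypothetical illustration that uses SHA-256 from the standard library; sklearn's HashingVectorizer uses a different hash function and, by default, also applies sign alternation and normalization, so the numbers will not match.

```python
import hashlib

def hash_vectorize(tokens, n_features):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features   # stable bucket for this token
        vec[index] += 1
    return vec

v1 = hash_vectorize(["great", "movie", "great"], n_features=8)
v2 = hash_vectorize(["movie", "great", "great"], n_features=8)
print(v1)
print(v1 == v2, sum(v1))
```

　　Different tokens can land in the same bucket. A large n_features keeps such collisions rare, which is why a small value would leave too few distinct features for a useful classifier.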

　4) Training the model

import pyprind
import numpy as np
import warnings
warnings.filterwarnings('ignore')

pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    x_train, y_train = get_minibatch(doc_stream, size=1000)
    if not x_train:
        break
    
    x_train = vect.transform(x_train)
    clf.partial_fit(x_train, y_train, classes=classes)
    pbar.update()


　5) Evaluating the model

x_test, y_test = get_minibatch(doc_stream, size=5000)
x_test = vect.transform(x_test)
print('accuracy: %.3f' % clf.score(x_test, y_test))


　　The accuracy reaches 87%.
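
　　For a classifier, score reports the mean accuracy: the fraction of test samples whose predicted label equals the true label. The sketch below spells that out with a hypothetical accuracy helper and made-up labels for eight reviews.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels and predictions for eight reviews.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
result = accuracy(y_true, y_pred)
print('accuracy: %.3f' % result)
```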

　6) Complete code

#coding=utf-8
import pandas as pd
import os

import pyprind
pbar = pyprind.ProgBar(50000)

# merge the pos and neg reviews from train and test
labels = {'pos':1, 'neg':0}
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/{}/{}'.format(s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='UTF-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])
from sklearn.utils import shuffle
df = shuffle(df)
df.to_csv('./movie_data.csv', index = False)
df = pd.read_csv('./movie_data.csv')
print('df_to_csv done!')

# preprocessing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) or ;-( before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized=[w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    with open(path, 'r', encoding='UTF-8') as csv:
        next(csv)#skip header
        for line in csv:  # line kept whole: index -1 is the newline, -2 the label, -3 the comma
            text, label = line[:-3], int(line[-2])
            yield text, label

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None,None

    return docs, y

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer( decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
clf = SGDClassifier(loss='log_loss', random_state=1)  # 'log_loss' was named 'log' before scikit-learn 1.1; the removed n_iter argument is dropped
doc_stream = stream_docs(path='./movie_data.csv')

import pyprind
import numpy as np
import warnings
warnings.filterwarnings('ignore')

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    x_train, y_train = get_minibatch(doc_stream, size=1000)
    if not x_train:
        break

    x_train = vect.transform(x_train)
    clf.partial_fit(x_train, y_train, classes=classes)
    pbar.update()

x_test, y_test = get_minibatch(doc_stream, size=5000)
x_test = vect.transform(x_test)
print('accuracy: %.3f' % clf.score(x_test, y_test))

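III. Acknowledgments

　This article draws on: http://www.feiguyunai.com/index.php/2017/10/20/pythonai-nlp-emotionanaly01/

　Thanks to that author for sharing. I have also uploaded the code to Baidu Cloud (extraction code: iyi5).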
