Classifying Reviews of a Meituan Merchant with pandas, Python, and sklearn (Text Classification)

This article is a repost (view original). 2018-08-14 22:34 4553 python / machine learning

Language Processing and Classification of Meituan Shop Reviews (NLP)

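Part 1 covered data analysis.
Part 2 covered visualization.
This is part 3 of the series: text classification.
The main packages used are jieba, sklearn, and pandas. This post starts with the bag-of-words model, which represents text as numeric feature vectors. One feature vector is built per document and most of its entries are 0. The nonzero values are raw term frequencies, tf (term frequency), and the resulting matrix is sparse.
Other models will be built in later posts.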
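
To make the bag-of-words idea concrete, here is a minimal sketch that builds the term-frequency matrix by hand with the standard library. The three toy documents are made up for illustration and assumed to be already segmented into space-separated tokens; `bag_of_words` is a hypothetical helper, not part of sklearn.

```python
from collections import Counter

def bag_of_words(segmented_docs):
    """Return (vocabulary, matrix) where matrix[i][j] is the raw count
    of vocabulary[j] in document i."""
    tokenized = [doc.split() for doc in segmented_docs]
    # One column per distinct token, in sorted order
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Toy documents (assumed already segmented)
docs = ["lamb skewers tasty tasty", "roast duck tasty", "queue long"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)

# Most cells are zero, which is why a sparse format pays off on real data
zero_cells = sum(value == 0 for row in matrix for value in row)
print(zero_cells, "of", len(matrix) * len(vocabulary), "cells are zero")
```

Even with three short documents most cells are zero. With thousands of reviews and a vocabulary of a couple of thousand terms, storing only the nonzero entries is far cheaper, which is what the sparse matrix returned by CountVectorizer does.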

Import the usual data-analysis libraries

import pandas as pd
import numpy as np

Read the file

df=pd.read_excel("all_data_meituan.xlsx")[["comment","star"]]
df.head()

Check the size of the DataFrame

df.shape

(17400, 2)

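# label reviews with star > 30 as positive (1) and the rest as negative (0)
df['sentiment']=df['star'].apply(lambda x:1 if x>30 else 0)
df=df.drop_duplicates() # drop duplicate reviews
df=df.dropna() # drop missing values; 1046 rows remain (3138 / 3)

# replicate the data to three times its original size
X=pd.concat([df[['comment']],df[['comment']],df[['comment']]])
y=pd.concat([df.sentiment,df.sentiment,df.sentiment])
X.columns=['comment']
X=X.reset_index(drop=True) # reset_index must be called and its result assigned
y=y.reset_index(drop=True)
X.shape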

(3138, 1)


import jieba # import the word-segmentation library
def chinese_word_cut(mytext):
    # segment the text and join the tokens with spaces
    return " ".join(jieba.cut(mytext))
X['cut_comment']=X["comment"].apply(chinese_word_cut)
X['cut_comment'].head()

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\HUANG_~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.880 seconds.
Prefix dict has been built succesfully.

0    還行 吧 ， 建議 不要 排隊 那個 烤鴨 和 羊肉串 ， 因為 烤肉 時間 本來 就 不夠...
1    去過 好 幾次 了   東西 還是 老 樣子   沒 增添 什么 新花樣   環境 倒 是 ...
2    一個 字 ： 好 ！ ！ ！   # 羊肉串 #   # 五花肉 #   # 牛舌 #   ...
3    第一次 來 吃 ， 之前 看過 好多 推薦 說 這個 好吃 ， 真的 抱 了 好 大 希望 ...
4    羊肉串 真的 不太 好吃 ， 那種 說 膻 不 膻 說 臭 不 臭 的 味 。 烤鴨 還 行...
Name: cut_comment, dtype: object

Import sklearn's data-splitting module and set the size of the test set; shuffle defaults to True

from sklearn.model_selection import  train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,random_state=42,test_size=0.25)
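
One side effect of the threefold replication is worth checking before trusting the test score: because every review exists three times, a random split can place copies of the same review in both the training and the test portion. The sketch below uses made-up review ids and only the standard library (it is not the split used in this post). With 30 rows and a test size of 8, which is not a multiple of 3, at least one test review must also occur in training.

```python
import random

# Hypothetical review ids standing in for the deduplicated comments
unique_reviews = [f"review_{i}" for i in range(10)]

# Replicate three times, as the concat step does, then split at random
replicated = unique_reviews * 3
rng = random.Random(42)
rng.shuffle(replicated)
n_test = 8  # roughly 25% of the 30 rows
test_rows, train_rows = replicated[:n_test], replicated[n_test:]

# Reviews that appear in both the training and the test portion
leaked = set(test_rows) & set(train_rows)
print(f"{len(leaked)} of {len(set(test_rows))} distinct test reviews also occur in training")

# Possible alternative: split the unique reviews first,
# then replicate only the training part
shuffled = unique_reviews[:]
rng.shuffle(shuffled)
test_unique, train_unique = shuffled[:3], shuffled[3:]
train_replicated = train_unique * 3
print(set(test_unique) & set(train_replicated))  # empty set: no overlap
```

If the overlap matters for the evaluation, splitting before replicating keeps the test reviews unseen, at the cost of a smaller test set.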

Load the stop words

def get_custom_stopwords(stop_words_file):
    # read one stop word per line and strip surrounding whitespace
    with open(stop_words_file,encoding="utf-8") as f:
        custom_stopwords_list=[i.strip() for i in f.readlines()]
    return custom_stopwords_list

stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file) # load the stop words

Import the bag-of-words model

from sklearn.feature_extraction.text import  CountVectorizer
vect=CountVectorizer()  # instantiate
vect # inspect the parameters

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

# dir(vect)  # inspect the attributes of vect

Run fit_transform on the segmented training text; the resulting matrix is 2353 x 1965

vect.fit_transform(X_train["cut_comment"])

<2353x1965 sparse matrix of type '<class 'numpy.int64'>'
	with 20491 stored elements in Compressed Sparse Row format>

vect.fit_transform(X_train["cut_comment"]).toarray().shape

(2353, 1965)

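pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(),columns=vect.get_feature_names()).iloc[:,0:25].head()
# convert the matrix to a DataFrame and show the first 25 columns
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
# print(vect.get_feature_names())
# 1965 features, which is not very large (no stop words applied yet)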

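Many of the features turn out to be numbers or other meaningless tokens. When instantiating the vectorizer we therefore pass a regular-expression token pattern that drops these features, and we remove the stop words at the same time.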
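
The effect of the token pattern can be previewed with the standard `re` module, since a token pattern is just a regular expression applied to each document. This is a small sketch on a made-up segmented string; it compares CountVectorizer's default pattern with the one passed to the vectorizer in this step.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"       # CountVectorizer's default token_pattern
NO_DIGIT_START = r"(?u)\b[^\d\W]\w+\b"   # pattern used in this post

# Made-up example of a segmented review (tokens separated by spaces)
segmented = "羊肉串 30 元 100 好吃 5星 味道 a1"

default_tokens = re.findall(DEFAULT_PATTERN, segmented)
filtered_tokens = re.findall(NO_DIGIT_START, segmented)

print(default_tokens)   # ['羊肉串', '30', '100', '好吃', '5星', '味道', 'a1']
print(filtered_tokens)  # ['羊肉串', '好吃', '味道', 'a1']
```

Both patterns require at least two word characters, so the single-character token 元 is dropped either way. The second pattern additionally requires the first character not to be a digit, which removes 30, 100, and 5星 while keeping a1.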

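# drop stop words; the token pattern skips tokens that begin with a digit
vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',stop_words=frozenset(stopwords))
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(), columns=vect.get_feature_names()).head()
# 1691 columns: removing numeric tokens and stop words cuts nearly 300 columns, from 1965 to 1691
# max_df = 0.8 # drop terms that appear in more than this proportion of documents (too common); adjust as needed
# min_df = 3 # drop terms that appear in fewer than this number of documents (too rare); adjust as needed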
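
The two commented-out options filter terms by document frequency, that is, by the number of documents a term appears in. A minimal standard-library sketch of that filtering follows; `filter_by_document_frequency` is a hypothetical helper written for illustration, with `min_df` taken as an absolute count and `max_df` as a proportion, matching how the two values are written in the comments.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is at least min_df (a count)
    and at most max_df * n_docs (max_df is a proportion)."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it occurs in it
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Toy corpus: "tasty" is in every document, "lamb" in three, the rest in one
docs = [
    ["tasty", "lamb"],
    ["tasty", "duck"],
    ["tasty", "lamb", "queue"],
    ["tasty", "service"],
    ["lamb", "tasty"],
]

kept_default = filter_by_document_frequency(docs)
kept_filtered = filter_by_document_frequency(docs, min_df=2, max_df=0.8)
print(kept_default)   # ['duck', 'lamb', 'queue', 'service', 'tasty']
print(kept_filtered)  # ['lamb']
```

Here "tasty" is removed as too common (5 of 5 documents exceeds 80%) and the terms seen in a single document are removed as too rare. On a small review corpus an aggressive `min_df` can discard a large share of the vocabulary, so the thresholds are worth checking against the resulting column count.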

After removing the numeric features

Model building

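Import multinomial naive Bayes from sklearn's naive_bayes module.
Naive Bayes is commonly used for text classification tasks such as spam SMS filtering. It is very fast and its results are usually not far behind those of heavier models.
MultinomialNB can be used with its default parameters; if the model's predictive performance falls short, the parameters can be tuned.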
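
For readers who want to see what a multinomial naive Bayes classifier computes, here is a minimal from-scratch sketch with add-one smoothing (the same role `alpha=1.0` plays in MultinomialNB). The helper names and the four toy documents are made up for illustration; this is not sklearn's implementation.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        # Smoothed estimate: (count + alpha) / (total count + alpha * |vocab|)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(doc, model):
    """Return the class with the highest log posterior; unseen words are ignored."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data: 1 = positive review, 0 = negative review
train_docs = [
    ["tasty", "tasty", "recommend"],
    ["tasty", "clean"],
    ["awful", "queue"],
    ["awful", "awful"],
]
train_labels = [1, 1, 0, 0]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(["tasty", "tasty"], model))  # 1
print(predict(["awful", "queue"], model))  # 0
```

Training only counts words per class and prediction only sums logarithms, which is why the method is so fast. The smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.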

from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()

from sklearn.pipeline import make_pipeline # import make_pipeline
pipe=make_pipeline(vect,nb)
pipe.steps # view the pipeline steps (make_pipeline works like Pipeline but names the steps automatically)

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None,
          stop_words=frozenset({'', '范圍', '但願', 'vs', '為', '過去', '集中', '這般', '孰知', '認為', '論', '36', '前后', '每年', '長期以來', 'our', '要不', '使用', '好象', 'such', '不但', '一下', 'how', '召開', '6', '全體', '嚴格', '除開', 'get', '可好', '畢竟', 'but', '如前所述', '滿足', 'your', 'keeps', '只', '大抵', '己', 'concerning', "they're", '再則', '有意的'...'reasonably', '絕對', '咧', '除此以外', '50', '得了', 'seeming', '只是', '背靠背', '弗', 'need', '其', '第二', '再者說'}),
          strip_accents=None, token_pattern='(?u)\\b[^\\d\\W]\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

pipe.fit(X_train.cut_comment, y_train)

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...e, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

Predictions on the test set

y_pred = pipe.predict(X_test.cut_comment)
# predict on the test set (the pipeline handles both the transform and the prediction)

# accuracy of the model on the test set
from sklearn import  metrics
metrics.accuracy_score(y_test,y_pred)

0.82929936305732488

# confusion matrix of the model on the test set
metrics.confusion_matrix(y_test,y_pred)
# test-set results: 474 true positives, 112 false positives, 22 false negatives, 177 true negatives

array([[177, 112],
       [ 22, 474]], dtype=int64)
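
As a quick sanity check, the accuracy reported above can be recomputed from this matrix. The sketch below assumes the layout sklearn documents for `confusion_matrix`, with rows as true labels and columns as predicted labels, and uses plain Python lists in place of the NumPy array.

```python
# Test-set confusion matrix copied from the output above
confusion = [[177, 112],
             [22, 474]]

# Row 0 holds the true negatives and false positives,
# row 1 the false negatives and true positives
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
print(total)     # 785
print(accuracy)  # about 0.8293, matching accuracy_score above
```

The total of 785 also agrees with a 25% test split of the 3138 rows.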

import matplotlib.pyplot as plt  # imported at module level so plt.show() below works

def get_confusion_matrix(conf, clas):
    """Draw a confusion matrix with the count written in each cell."""
    fig, ax = plt.subplots(figsize=(2.5, 2.5))
    ax.matshow(conf, cmap=plt.cm.Blues, alpha=0.3)
    tick_marks = np.arange(len(clas))
    plt.xticks(tick_marks, clas, rotation=45)
    plt.yticks(tick_marks, clas)
    for i in range(conf.shape[0]):
        for j in range(conf.shape[1]):
            # matshow puts rows on the y axis and columns on the x axis,
            # so conf[i, j] belongs at x=j, y=i
            ax.text(x=j, y=i, s=conf[i, j], va='center', ha='center')
    plt.xlabel("predicted label")
    plt.ylabel("true label")

conf = metrics.confusion_matrix(y_test, y_pred)
class_names = np.array(['0', '1'])
get_confusion_matrix(np.array(conf), clas=class_names)
plt.show()

Predict on the entire dataset

y_pred_all = pipe.predict(X['cut_comment'])

metrics.accuracy_score(y,y_pred_all)
# accuracy on the whole sample set; it is higher than on the test set, which suggests some overfitting

0.85659655831739967

metrics.confusion_matrix(y,y_pred_all)
# confusion matrix for the whole dataset

array([[ 801,  369],
       [  81, 1887]], dtype=int64)
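
The whole-dataset figure mixes training and test rows, so the overfitting gap is easier to read when the training rows are isolated. Because the whole set is exactly the training set plus the test set, subtracting the test confusion matrix from the whole-set matrix gives the training-set matrix. A small sketch using the two matrices printed in this post:

```python
# Confusion matrices copied from the outputs in this post
confusion_all = [[801, 369],
                 [81, 1887]]
confusion_test = [[177, 112],
                  [22, 474]]

# Whole set = train + test, so the training matrix is the cell-wise difference
confusion_train = [
    [a - t for a, t in zip(row_all, row_test)]
    for row_all, row_test in zip(confusion_all, confusion_test)
]

def accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

train_total = sum(sum(row) for row in confusion_train)
print(confusion_train)           # [[624, 257], [59, 1413]]
print(train_total)               # 2353, the number of training rows
print(accuracy(confusion_train)) # about 0.866
print(accuracy(confusion_test))  # about 0.829
```

The training accuracy of about 0.866 against a test accuracy of about 0.829 is the gap behind the overfitting remark.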

y.value_counts()
# number of positive and negative samples in the dataset

1    1968
0    1170
Name: sentiment, dtype: int64

metrics.f1_score(y_true=y,y_pred=y_pred_all)
# f1_score of the model on the whole dataset

0.89346590909090906

metrics.recall_score(y, y_pred_all)
# recall: the share of all positive samples that are detected, TP / (FN + TP)

0.95884146341463417

metrics.precision_score(y, y_pred_all)
# precision: the share of predicted positives that are truly positive, TP / (FP + TP)

0.83643617021276595

print(metrics.classification_report(y, y_pred_all))
# classification report

             precision    recall  f1-score   support

      0       0.91      0.68      0.78      1170
      1       0.84      0.96      0.89      1968

avg / total       0.86      0.86      0.85      3138
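
The per-class rows of the report can be reproduced from the whole-dataset confusion matrix, which also shows how the numbers for class 0 are defined. This is a hand-rolled sketch (`per_class_report` is a hypothetical helper, not an sklearn function), again assuming rows are true labels and columns are predicted labels.

```python
def per_class_report(confusion):
    """Return {class: (precision, recall, f1, support)}, rounded to 2 decimals."""
    n = len(confusion)
    report = {}
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[r][k] for r in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum (support)
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = (round(precision, 2), round(recall, 2), round(f1, 2), actual_k)
    return report

# Whole-dataset confusion matrix copied from the output in this post
confusion_all = [[801, 369],
                 [81, 1887]]

report = per_class_report(confusion_all)
for label, row in report.items():
    print(label, row)
# 0 (0.91, 0.68, 0.78, 1170)
# 1 (0.84, 0.96, 0.89, 1968)
```

The low recall for class 0 (0.68) means roughly a third of the negative reviews are predicted as positive, which the overall accuracy alone does not reveal.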
