import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
df=pd.read_excel("all_data_meituan.xlsx")[["comment","star"]]
df.head()
                                             comment  star
0  還行吧,建議不要排隊那個烤鴨和羊肉串,因為烤肉時間本來就不夠,排那個要半小時,然后再回來吃烤...    40
1  去過好幾次了 東西還是老樣子 沒增添什么新花樣 環境倒是挺不錯 離我們這也挺近 味道還可以 ...    40
2  一個字:好!!! #羊肉串# #五花肉# #牛舌# #很好吃# #雞軟骨# #拌菜# #抄河...    50
3  第一次來吃,之前看過好多推薦說這個好吃,真的抱了好大希望,排隊的人挺多的,想吃得趁早來啊。還...    20
4  羊肉串真的不太好吃,那種說膻不膻說臭不臭的味。烤鴨還行,大蝦沒少吃,也就到那吃大蝦了,吃完了...    30
df.shape
(17400, 2)
df['sentiment']=df['star'].apply(lambda x:1 if x>30 else 0)  # star runs 10-50 (1-5 stars x 10); treat 4-5 stars as positive
df=df.drop_duplicates()  # drop duplicate reviews
df=df.dropna()
# stack three copies of the deduplicated data (tripling the sample; note that
# identical copies will land on both sides of the later train/test split)
X=pd.concat([df[['comment']],df[['comment']],df[['comment']]])
y=pd.concat([df.sentiment,df.sentiment,df.sentiment])
X.columns=['comment']
X=X.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
y=y.reset_index(drop=True)
X.shape
(3138, 1)
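Before vectorizing it is worth confirming the two classes are roughly balanced; a quick check using the labels already in memory (the counts themselves were not recorded in this run):
y.value_counts()                 # absolute count per class
y.value_counts(normalize=True)   # class proportions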
import jieba
def chinese_word_cut(mytext):
    # segment with jieba and join tokens with spaces so CountVectorizer can split on whitespace
    return " ".join(jieba.cut(mytext))
X['cut_comment']=X["comment"].apply(chinese_word_cut)
X['cut_comment'].head()
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\FRED-H~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.651 seconds.
Prefix dict has been built succesfully.
0 還行 吧 , 建議 不要 排隊 那個 烤鴨 和 羊肉串 , 因為 烤肉 時間 本來 就 不夠...
1 去過 好 幾次 了 東西 還是 老 樣子 沒 增添 什么 新花樣 環境 倒 是 ...
2 一個 字 : 好 ! ! ! # 羊肉串 # # 五花肉 # # 牛舌 # ...
3 第一次 來 吃 , 之前 看過 好多 推薦 說 這個 好吃 , 真的 抱 了 好 大 希望 ...
4 羊肉串 真的 不太 好吃 , 那種 說 膻 不 膻 說 臭 不 臭 的 味 。 烤鴨 還 行...
Name: cut_comment, dtype: object
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,random_state=42,test_size=0.25)
def get_custom_stopwords(stop_words_file):
with open(stop_words_file,encoding="utf-8") as f:
custom_stopwords_list=[i.strip() for i in f.readlines()]
return custom_stopwords_list
stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)
stopwords[-10:]
['100', '01', '02', '03', '04', '05', '06', '07', '08', '09']
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer()
vect
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
vect.fit_transform(X_train["cut_comment"])
<2353x1965 sparse matrix of type '<class 'numpy.int64'>'
with 20491 stored elements in Compressed Sparse Row format>
vect.fit_transform(X_train["cut_comment"]).toarray().shape
(2353, 1965)
# pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(),columns=vect.get_feature_names()).iloc[:10,:22]
# print(vect.get_feature_names())
# 1965 features -- not very large (stop words not removed yet)
vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',stop_words=frozenset(stopwords)) # remove stop words and digit-initial tokens
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(), columns=vect.get_feature_names()).head()
# 1691 columns; excluding the columns whose features are numbers trims three more, leaving 1691
# max_df = 0.8 # would drop terms appearing in more than 80% of documents (too common)
# min_df = 3   # would drop terms appearing in fewer than 3 documents (too rare)
   amazing  happy  ktv  pm2  一萬個  一個多  一個月  一串  一人  一件  ...  麻煩  麻醬  黃喉  黃桃  黃花魚  黃金  黑乎乎  黑椒  黑胡椒  齊全
0        0      0    0    0     0     0     0    0    0    0  ...    0    0    0    0     0    0     0    0     0    0
1        0      0    0    0     0     0     0    0    0    0  ...    0    0    0    0     0    0     0    0     0    0
2        0      0    0    0     0     0     0    0    0    0  ...    0    0    0    0     0    0     0    0     0    0
3        0      0    0    0     0     0     0    0    0    0  ...    0    0    0    0     0    0     0    0     0    0
4        0      0    0    0     0     0     0    0    0    0  ...    0    0    0    0     0    0     0    0     0    0

5 rows × 1691 columns
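The effect of the custom token_pattern can be checked in isolation: it keeps tokens of two or more word characters whose first character is not a digit, so purely numeric tokens such as the '01'-'09' stop words can never become features. A standalone sketch:
import re
pattern = re.compile(r'(?u)\b[^\d\W]\w+\b')
pattern.findall("pm2 100 好吃 2018 ktv")  # ['pm2', '好吃', 'ktv']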
Support Vector Machine classification
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn import metrics
svc_cl=SVC()
pipe=make_pipeline(vect,svc_cl)
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None,
steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None,
stop_words=...,
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred)
0.6318471337579618
metrics.confusion_matrix(y_test,y_pred)
array([[ 0, 289],
[ 0, 496]], dtype=int64)
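The confusion matrix shows the default RBF SVC assigns every test sample to the positive class, so its 63.2% "accuracy" is simply the positive base rate of the test split; a quick sanity check with the arrays already in scope:
(y_test == 1).mean()  # 496 / (289 + 496) ≈ 0.6318
The grid search below tunes the kernel and its hyperparameters to get past this degenerate fit.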
Support Vector Machine with grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
tfidf=TfidfTransformer()  # rescale raw counts to tf-idf weights before the SVC
svc=SVC()
pipe_svc=Pipeline([("scl",vect),('tfidf',tfidf),("clf",svc)])
para_range=[0.0001,0.001,0.01,0.1,1.0,10,100,1000]
para_grid=[
{'clf__C':para_range,
'clf__kernel':['linear']},
{'clf__gamma':para_range,
'clf__kernel':['rbf']}
]
gs=GridSearchCV(estimator=pipe_svc,param_grid=para_grid,cv=10,n_jobs=-1)
gs.fit(X_train.cut_comment,y_train)
GridSearchCV(cv=10, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('scl', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None,
stop_words=frozenset({'...,
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))]),
fit_params=None, iid=True, n_jobs=-1,
param_grid=[{'clf__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000], 'clf__kernel': ['linear']}, {'clf__gamma': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000], 'clf__kernel': ['rbf']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
gs.best_estimator_.fit(X_train.cut_comment,y_train)  # redundant: with refit=True the best estimator is already refit on the full training set
Pipeline(memory=None,
steps=[('scl', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None,
stop_words=frozenset({'...,
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])
y_pred = gs.best_estimator_.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred)
0.9503184713375796
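The winning configuration can be read off the fitted search object (its values were not recorded in the original run):
gs.best_params_  # best kernel plus its C or gamma
gs.best_score_   # mean 10-fold CV accuracy of that configuration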
K-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')  # Minkowski distance with p=2 is Euclidean
pipe=make_pipeline(vect,knn)
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None,
steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None,
stop_words=...owski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'))])
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred)
0.7070063694267515
metrics.confusion_matrix(y_test,y_pred)
array([[ 87, 202],
[ 28, 468]], dtype=int64)
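k=5 is an arbitrary choice; the grid-search recipe used for the SVC applies just as well here. A minimal sketch with an illustrative parameter grid (pipe_knn and gs_knn are new names):
pipe_knn = Pipeline([("scl", vect), ("clf", KNeighborsClassifier())])
gs_knn = GridSearchCV(pipe_knn, param_grid={'clf__n_neighbors': [3, 5, 7, 9, 15]}, cv=10, n_jobs=-1)
gs_knn.fit(X_train.cut_comment, y_train)
gs_knn.best_params_, gs_knn.best_score_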
Decision tree
from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier(criterion='entropy',random_state=1)
pipe=make_pipeline(vect,tree)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred)
0.9388535031847134
metrics.confusion_matrix(y_test,y_pred)
array([[256, 33],
[ 15, 481]], dtype=int64)
Random forest
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(criterion='entropy',random_state=1,n_jobs=2)
pipe=make_pipeline(vect,forest)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred)
# adding a tf-idf step actually lowers accuracy here, from 96.5% to 95.0%
0.9656050955414013
metrics.confusion_matrix(y_test,y_pred)
array([[265, 24],
[ 3, 493]], dtype=int64)
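The comparison mentioned in the comment above amounts to one extra pipeline step; a sketch of that run (pipe_tfidf is an illustrative name, and the ~95.0% figure comes from the comment, not a recorded output):
pipe_tfidf = make_pipeline(vect, tfidf, forest)
pipe_tfidf.fit(X_train.cut_comment, y_train)
metrics.accuracy_score(y_test, pipe_tfidf.predict(X_test.cut_comment))  # ≈ 0.950 per the comment above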
Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier(criterion='entropy',random_state=1)
bag=BaggingClassifier(base_estimator=tree,
n_estimators=10,
max_samples=1.0,
max_features=1.0,
bootstrap=True,
bootstrap_features=False,
n_jobs=1,random_state=1)
pipe=make_pipeline(vect,tfidf,bag)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred) # without the tf-idf transform: 93.2%; adding it raises accuracy to 95.5%
0.9554140127388535
metrics.confusion_matrix(y_test,y_pred)
array([[260, 29],
[ 6, 490]], dtype=int64)
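Test-set accuracies of all the models above, side by side:
Model                               Test accuracy
SVC (default RBF, raw counts)       0.632
SVC (grid search, counts + tf-idf)  0.950
K-nearest neighbors (k=5)           0.707
Decision tree (entropy)             0.939
Random forest (entropy)             0.966
Bagged decision trees (tf-idf)      0.955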