Language Processing and Text Classification of Meituan Shop Reviews (Logistic Regression)

Reposted article, 2018-08-16 14:20. Tags: python / machine learning / logistic regression

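- Part 1: data cleaning and analysis
- Part 2: visualization
- Part 3: naive Bayes text classification

This is the fourth article in the series. It focuses on the parameters of the logistic regression classifier and how to tune them.

The main packages used are jieba, sklearn and pandas. This post starts with the bag-of-words model, which represents text as numeric feature vectors: each document gets one feature vector, most of whose entries are 0, much like the one-hot encoding of categorical features described in an earlier post, so the resulting matrix is sparse.

This post covers:
- a comparison of two classifiers, naive Bayes and logistic regression
- the parameter details of logistic regression and how to tune them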
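
To make the bag-of-words idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer on a toy corpus. The three English "reviews" are made up for illustration; they stand in for comments that have already been segmented into space-separated tokens (for Chinese text, jieba would do the segmentation first).

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy, pre-tokenized "reviews" (hypothetical data, not from the Meituan set).
toy_docs = [
    "taste good service good",
    "service bad taste average",
    "taste good",
]

toy_vect = CountVectorizer()          # default settings: word counts, unigrams
X_toy = toy_vect.fit_transform(toy_docs)

# One row per document, one column per vocabulary term.
print(sorted(toy_vect.vocabulary_))   # ['average', 'bad', 'good', 'service', 'taste']
print(sparse.issparse(X_toy))         # True: the result is a sparse matrix
print(X_toy.toarray())
# [[0 0 2 1 1]
#  [1 1 0 1 1]
#  [0 0 1 0 1]]
```

Each column counts one vocabulary term, so a real review corpus with thousands of distinct words produces rows that are almost entirely zeros.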

Import the common data analysis libraries

import pandas as pd
import numpy as np

Read the file

df=pd.read_excel("all_data_meituan.xlsx")[["comment","star"]]
df.head()

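Data preprocessing was covered in the previous post; if you have forgotten it, open this link to review.

- The preprocessed features are shown directly below: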
  
![](http://ww1.sinaimg.cn/large/9ebd4c2bgy1fu9c4y3h26j20u5057wem.jpg)

### Naive Bayes is a fast text classifier; here it serves as the baseline model against which logistic regression is compared




#### Model building


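- Import the multinomial naive Bayes classifier from sklearn's naive_bayes module
- Naive Bayes is commonly used for text classification tasks such as SMS spam filtering; it is very fast and its results are usually not much worse than other methods
- MultinomialNB can be used with its default parameters; adjust them if the model's predictive performance falls short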
  
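```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline  # import the make_pipeline helper

nb = MultinomialNB()

# vect is the CountVectorizer defined earlier (it appears in the pipe.steps output)
pipe = make_pipeline(vect, nb)
pipe.steps  # inspect the pipeline steps (make_pipeline behaves like Pipeline)
```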

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None,
          stop_words=frozenset({'', '范圍', '但願', 'vs', '為', '過去', '集中', '這般', '孰知', '認為', '論', '36', '前后', '每年', '長期以來', 'our', '要不', '使用', '好象', 'such', '不但', '一下', 'how', '召開', '6', '全體', '嚴格', '除開', 'get', '可好', '畢竟', 'but', '如前所述', '滿足', 'your', 'keeps', '只', '大抵', '己', 'concerning', "they're", '再則', '有意的'...'reasonably', '絕對', '咧', '除此以外', '50', '得了', 'seeming', '只是', '背靠背', '弗', 'need', '其', '第二', '再者說'}),
          strip_accents=None, token_pattern='(?u)\\b[^\\d\\W]\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

pipe.fit(X_train.cut_comment, y_train)

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...e, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

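Predictions on the test set

y_pred = pipe.predict(X_test.cut_comment)
# predict on the test set (the pipeline handles both the transform and the prediction)

# accuracy of the model on the test set
from sklearn import metrics
metrics.accuracy_score(y_test,y_pred)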

0.82929936305732488

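### Logistic regression

#### Model building

First, run a preliminary experiment with the default logistic regression parameters.
In the scikit-learn version used here, the defaults are solver='liblinear', max_iter=100, multi_class='ovr', penalty='l2' (newer scikit-learn releases default to solver='lbfgs').
For ease of demonstration we do not wrap make_pipeline in a function; we call it separately each time so the steps are clearer.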

from sklearn.linear_model import LogisticRegression
# lr=LogisticRegression(solver='saga',max_iter=10000)
lr=LogisticRegression()  # instantiate with default parameters
pipe_lr=make_pipeline(vect,lr) 
pipe_lr.steps

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None,
          stop_words=frozenset({'', 'besides', '中小', '不管怎樣', '引起', '它們的', 'take', "c's", 'hopefully', 'no', '就算', '斷然', '直到', 'some', '最后一班', '許多', '非獨', '嘻', '：', '時', '兩者', '惟其', '從優', 'so', 'specified', '50', 'sometimes', '明顯', '嗬', '人家', '截至', '開始', '動不動', '大體', '以及', '使', 'own', 'whoever', "wasn't", 'cha...'我是', '／', 'my', '再則', '正常', '49', '關於', '願意', '其他', '這么', '粗', 'ｃ］', '＄', '29', '要求', '第十一', '自后'}),
          strip_accents=None, token_pattern='(?u)\\b[^\\d\\W]\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('logisticregression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False))]

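With default parameters, logistic regression reaches 0.8726 on the same test set, against 0.8293 for naive Bayes: an improvement of about 4 percentage points. This is with the default solver, before tuning regularization or any other parameter.

Predictions on the test set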

pipe_lr.fit(X_train.cut_comment, y_train)
y_pred_lr = pipe_lr.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred_lr)

0.87261146496815289

Now we change the solver to saga, keep the default penalty of l2, and refit the model and predict again.

lr_solver = LogisticRegression(solver='saga')
pipe_lr1=make_pipeline(vect,lr_solver)
pipe_lr1.steps

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None,
          stop_words=frozenset({'', 'besides', '中小', '不管怎樣', '引起', '它們的', 'take', "c's", 'hopefully', 'no', '就算', '斷然', '直到', 'some', '最后一班', '許多', '非獨', '嘻', '：', '時', '兩者', '惟其', '從優', 'so', 'specified', '50', 'sometimes', '明顯', '嗬', '人家', '截至', '開始', '動不動', '大體', '以及', '使', 'own', 'whoever', "wasn't", 'cha...'我是', '／', 'my', '再則', '正常', '49', '關於', '願意', '其他', '這么', '粗', 'ｃ］', '＄', '29', '要求', '第十一', '自后'}),
          strip_accents=None, token_pattern='(?u)\\b[^\\d\\W]\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('logisticregression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='saga', tol=0.0001,
            verbose=0, warm_start=False))]

pipe_lr1.fit(X_train.cut_comment, y_train)

C:\Anaconda3\envs\nlp\lib\site-packages\sklearn\linear_model\sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)


Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...penalty='l2', random_state=None, solver='saga', tol=0.0001,
          verbose=0, warm_start=False))])

This warning means that with solver='saga' (a variant of stochastic average gradient descent) the coefficients did not converge within the allowed iterations. Stochastic average gradient methods need more iterations here, so the maximum iteration count max_iter has to be raised.

One point to stress: this does not mean saga performs badly. On large datasets saga tends to converge faster than the other optimizers.
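
The effect of max_iter on this warning can be reproduced on synthetic data. The sketch below is an illustration only: the dataset from make_classification and the names X_demo, y_demo and fit_and_collect are assumptions, not part of the original experiment.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical; the article uses review text instead).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)


def fit_and_collect(max_iter):
    """Fit a saga model and return it with any ConvergenceWarning it raised."""
    model = LogisticRegression(solver="saga", max_iter=max_iter, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_demo, y_demo)
    conv = [w for w in caught if issubclass(w.category, ConvergenceWarning)]
    return model, conv


short_model, short_warnings = fit_and_collect(max_iter=1)
long_model, long_warnings = fit_and_collect(max_iter=10000)

print("max_iter=1     -> warnings:", len(short_warnings), "epochs:", short_model.n_iter_[0])
print("max_iter=10000 -> warnings:", len(long_warnings), "epochs:", long_model.n_iter_[0])
```

The fitted attribute n_iter_ reports how many epochs were actually run, which is a quick way to check whether a given max_iter left any headroom.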

After raising max_iter and refitting, accuracy reaches 0.87388535031847137, a slight improvement.

lr_solver = LogisticRegression(solver='saga',max_iter=10000)
pipe_lr1=make_pipeline(vect,lr_solver)
pipe_lr1.steps
pipe_lr1.fit(X_train.cut_comment, y_train)

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...penalty='l2', random_state=None, solver='saga', tol=0.0001,
          verbose=0, warm_start=False))])

y_pred_lr1 = pipe_lr1.predict(X_test.cut_comment)
metrics.accuracy_score(y_test,y_pred_lr1)

0.87388535031847137

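Some additional notes on the logistic regression parameters (really, parameters of the convex optimization)

solver: the optimization algorithm
- On small datasets liblinear converges faster, and its accuracy is almost the same as saga's.
- saga is a variant of sag that supports both kinds of regularization; the regularization strength and type (l1, l2) are tuned further below.
- The scikit-learn website recommends saga as a general-purpose choice: it supports both l1 and l2 regularization and converges faster on large data.
- sag, lbfgs and newton-cg support l2 regularization only and converge fairly quickly on high-dimensional data (many features). They do not support l1 because they require the loss function to have continuous first or second derivatives.
- saga is best suited to large datasets where both the number of samples and the number of features are large, and it performs very well there. Because it supports l1 regularization, it also works for high-dimensional sparse matrices.
- liblinear uses the open-source liblinear library, which optimizes the loss iteratively with coordinate descent. It supports both l1 and l2 but not true multiclass classification (multiclass is handled through ovr).
- lbfgs: a quasi-Newton method. It optimizes the loss iteratively using an approximation to the Hessian (the matrix of second derivatives) built from gradient information, rather than the exact Hessian.
- newton-cg: also a member of the Newton family; it uses the second-derivative (Hessian) information of the loss to optimize iteratively.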
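
One practical consequence of the penalty choice is that l1 regularization drives some coefficients to exactly zero, while l2 only shrinks them. A minimal sketch, assuming synthetic data from make_classification and an arbitrary C=0.1 (both chosen purely for illustration, not taken from the article):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with many uninformative features (hypothetical example).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

zero_counts = {}
for penalty in ("l1", "l2"):
    # liblinear supports both penalties, so only the penalty changes here.
    clf = LogisticRegression(solver="liblinear", penalty=penalty, C=0.1, random_state=0)
    clf.fit(X_demo, y_demo)
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))

# Number of coefficients that are exactly zero under each penalty.
print(zero_counts)
```

This is why l1 is attractive for sparse, high-dimensional text features: it acts as a built-in feature selector.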
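In LogisticRegression, C is the inverse of the regularization coefficient λ (the cross-validated version takes Cs, a list of floats or an int)
penalty: the regularization type (l1, l2)
multi_class: the multiclass strategy (ovr, multinomial)
- ovr is supported by all five solvers; multinomial is not supported by liblinear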
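
For reference, the role of C can be written out. With labels $y_i \in \{-1, 1\}$, one common way to state the l2-penalized binary objective, and the form scikit-learn's documentation has used, is:

$$\min_{w, c}\ \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)$$

and with the l1 penalty:

$$\min_{w, c}\ \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)$$

Dividing either objective by $C$ leaves the data term unweighted and multiplies the penalty by $1/C$, so $\lambda$ plays the role of $1/C$ (up to the constant factor convention): a smaller C means stronger regularization.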
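class_weight: class weights
- class_weight={0:0.9,1:0.1} gives class 0 a weight of 90% and class 1 a weight of 10%. With class_weight='balanced' the weights are computed from the training samples: the more samples a class has, the lower its weight, and the fewer samples, the higher its weight.
- The first use case is when misclassification is costly. When separating healthy people from patients, labeling a patient as healthy is very expensive. We would rather label a healthy person as a patient, since manual review can still catch that, than miss a patient, so we can raise the weight of the patient class.
- The second use case is heavy class imbalance, for example a patient-to-healthy ratio of 1:700. Ignoring the weights easily yields a classifier with very high accuracy that is nonetheless meaningless. Here 'balanced' can be used, and the classifier adjusts the weights automatically from the class proportions.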
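
To see what 'balanced' produces for the 1:700 case, here is a small sketch using scikit-learn's compute_class_weight helper. The label array y_demo is hypothetical and built only to match the ratio in the text.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 1:700 imbalance: 0 = healthy, 1 = patient.
y_demo = np.array([0] * 700 + [1] * 1)
classes = np.array([0, 1])

# The weights scikit-learn derives for class_weight='balanced'.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same values by hand: n_samples / (n_classes * count_per_class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), balanced.tolist())))
```

The rare class receives a weight of 350.5 and the common class about 0.5, so each class contributes equally to the loss in total.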
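sample_weight: sample weights
- When the classes are imbalanced, the sample is not an unbiased estimate of the population, which can lead to a very low detection rate. There are two ways to adjust the weights:
- the first is class_weight='balanced'; the second is to pass sample_weight when fitting, via fit(X, y, sample_weight=None)
max_iter: maximum number of iterations, default 100. With some optimizers the loss has not converged after the default number of iterations, so it needs to be increased.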
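
The two weighting routes can be compared directly. The sketch below assumes synthetic imbalanced data and an arbitrary 5x weight on the minority class (neither comes from the article), and fits the same model once through class_weight and once through sample_weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data (hypothetical): roughly 90% class 0, 10% class 1.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Option 1: weight the minority class through class_weight.
by_class = LogisticRegression(solver="lbfgs", class_weight={0: 1.0, 1: 5.0}, max_iter=1000)
by_class.fit(X_demo, y_demo)

# Option 2: pass the equivalent per-sample weights to fit().
weights = np.where(y_demo == 1, 5.0, 1.0)
by_sample = LogisticRegression(solver="lbfgs", max_iter=1000)
by_sample.fit(X_demo, y_demo, sample_weight=weights)

# Largest difference between the two sets of coefficients.
print(np.abs(by_class.coef_ - by_sample.coef_).max())
```

When the model sits inside a pipeline built with make_pipeline, the weights would be passed with the step-name prefix, for example `pipe.fit(X, y, logisticregression__sample_weight=weights)`.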

### Tuning parameters with LogisticRegressionCV

LogisticRegressionCV uses l2 regularization by default; here the solver is set to saga

import time

t1=time.time()
from sklearn.linear_model import LogisticRegressionCV
lrvc = LogisticRegressionCV(Cs=[0.0001,0.005,0.001,0.05,0.01,0.1,0.5,1,10],scoring='accuracy',random_state=42,solver='saga',max_iter=10000,penalty='l2')
pipe=make_pipeline(vect,lrvc)
print(pipe.get_params)
pipe.fit(X_train.cut_comment, y_train)
y_pred=pipe.predict(X_test.cut_comment)
print(metrics.accuracy_score(y_test,y_pred))
t2=time.time()
print("time spent l2,saga",t2-t1)

<bound method Pipeline.get_params of Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=... random_state=42, refit=True,
           scoring='accuracy', solver='saga', tol=0.0001, verbose=0))])>
0.899363057325
time spent l2,saga 5.017577648162842

LogisticRegressionCV with the solver set to saga and l1 regularization

t1=time.time()
from sklearn.linear_model import LogisticRegressionCV
lrvc = LogisticRegressionCV(Cs=[0.0001,0.005,0.001,0.05,0.01,0.1,0.5,1,10],scoring='accuracy',random_state=42,solver='saga',max_iter=10000,penalty='l1')
pipe_cvl1=make_pipeline(vect,lrvc)
print(pipe_cvl1.get_params)
pipe_cvl1.fit(X_train.cut_comment, y_train)
y_pred=pipe_cvl1.predict(X_test.cut_comment)
print(metrics.accuracy_score(y_test,y_pred))
t2=time.time()
print("time spent l1,saga",t2-t1)

<bound method Pipeline.get_params of Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=... random_state=42, refit=True,
           scoring='accuracy', solver='saga', tol=0.0001, verbose=0))])>
0.915923566879
time spent l1,saga 64.17242479324341

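With the saga solver, l1 regularization takes longer than l2 to reach the best parameters (about 64.2 s versus 5.0 s in the runs above).
We also compared liblinear and saga under l1 regularization: in the runs shown here liblinear took about 0.22 s against 64.2 s for saga, a gap of roughly 290 times.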

# LogisticRegressionCV with l1 regularization and solver='liblinear': much faster than saga, converges quickly,
# and the accuracy is about the same; it just does not support true multiclass (kudos to liblinear)
t3=time.time()
from sklearn.linear_model import LogisticRegressionCV
lrvc = LogisticRegressionCV(Cs=[0.0001,0.005,0.001,0.05,0.01,0.1,0.5,1,10],scoring='accuracy',random_state=42,solver='liblinear',max_iter=10000,penalty='l1')
pipe_cvl1=make_pipeline(vect,lrvc)
print(pipe_cvl1.get_params)
pipe_cvl1.fit(X_train.cut_comment, y_train)
y_pred=pipe_cvl1.predict(X_test.cut_comment)
print("accuracy:", metrics.accuracy_score(y_test,y_pred))
t4=time.time()
print("time spent l1 liblinear ",t4-t3)

<bound method Pipeline.get_params of Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...om_state=42, refit=True,
           scoring='accuracy', solver='liblinear', tol=0.0001, verbose=0))])>
accuracy: 0.912101910828
time spent l1 liblinear  0.22439861297607422

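Later posts will cover building and tuning other classic models, including SVM (linear and kernel), decision tree and knn, as well as ensemble methods such as random forest, bagging and GBDT.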
