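I. Introduction
This post improves the accuracy of the earlier Sohu news text classifier built with jieba, word2vec, and LR. The dataset and the word segmentation step are unchanged, so they are not repeated here; readers can refer to the processing described in the earlier post.
jieba segmentation yields 24,000 segmented results (sohu_train.txt has 24,000 rows, and each row corresponds to one segmented result).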
with open('cutWords_list.txt') as file:
    cutWords_list = [k.split() for k in file]

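1) TfidfVectorizer model
Instantiate the model object with TfidfVectorizer from sklearn.feature_extraction.text. The corpus is not a constructor argument: the constructor's first parameter is input (the kind of source, 'content' by default), so the documents are passed to fit_transform in the feature engineering step. The three keyword arguments used here mean:
- stop_words is the stop-word list; its data type is list
- min_df ignores terms whose document frequency is below this value; an int is an absolute number of documents, a float is a proportion of documents
- max_df ignores terms whose document frequency is above this value; an int is an absolute number of documents, a float is a proportion of documents
from sklearn.feature_extraction.text import TfidfVectorizer
# stopword_list is assumed to be the stop-word list already loaded during preprocessing
tfidf = TfidfVectorizer(stop_words=stopword_list, min_df=40, max_df=0.3)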
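To make the effect of min_df and max_df concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is an illustration only, not scikit-learn's implementation; the function name filter_by_document_frequency and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Return the terms whose document frequency lies within [min_df, max_df].

    An int threshold is an absolute document count; a float is a
    proportion of the corpus.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if low <= c <= high)

# Hypothetical toy corpus of four already-segmented documents.
toy_docs = [
    ["stock", "rise", "market"],
    ["stock", "fall", "market"],
    ["team", "match", "market"],
    ["team", "match", "title"],
]
kept_terms = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.5)
print(kept_terms)
```

In this toy corpus "market" appears in 3 of 4 documents and is dropped by max_df=0.5, while terms that appear in only one document are dropped by min_df=2.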
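2) Feature engineering
import pandas as pd
# The file is read without a header row, so columns are addressed by position:
# column 0 holds the category label and column 1 holds the article text.
train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
X = tfidf.fit_transform(train_df[1])
print('Vocabulary size', len(tfidf.vocabulary_))
print(X.shape)
Each article is now vectorized, and the feature dimension is 3946.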
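As a rough illustration of what a single row of X contains, the sketch below computes one TF-IDF vector by hand. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, L2 normalization); the function name tfidf_vector and the toy values are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, vocabulary, doc_freq, n_docs):
    """Build an L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(doc_tokens)
    weights = []
    for term in vocabulary:
        # Smoothed inverse document frequency (assumed default formula).
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights.append(counts[term] * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

# Hypothetical vocabulary and document frequencies from a 4-document corpus.
toy_vocabulary = ["match", "stock", "team"]
toy_doc_freq = {"match": 2, "stock": 2, "team": 2}
vector = tfidf_vector(["stock", "team", "stock"], toy_vocabulary, toy_doc_freq, n_docs=4)
print(vector)
```

Each article becomes a vector with one entry per vocabulary term, which is why the second dimension of X.shape equals the vocabulary size.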

3) Model training
3.1) LabelEncoder
from sklearn.preprocessing import LabelEncoder
import pandas as pd
train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
labelEncoder = LabelEncoder()
# Once train_df is given named columns, [0] can no longer be used to get the first column
y = labelEncoder.fit_transform(train_df[0])
y.shape
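The encoding step can be pictured as a mapping from sorted unique labels to integers. The following is a small pure-Python sketch of that idea, not the LabelEncoder source; fit_label_encoder and the toy labels are hypothetical.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer id, in sorted label order."""
    classes = sorted(set(labels))
    label_to_id = {label: i for i, label in enumerate(classes)}
    return classes, label_to_id

# Hypothetical category labels.
toy_labels = ["sports", "finance", "sports", "tech"]
classes, label_to_id = fit_label_encoder(toy_labels)
encoded = [label_to_id[label] for label in toy_labels]  # labels -> ids
decoded = [classes[i] for i in encoded]                 # ids -> labels
print(classes, encoded)
```

Keeping the list of classes around is what later allows integer predictions to be shown with their original category names.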

3.2) Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for scoring
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
# Note: recent scikit-learn releases deprecate multi_class; if a warning appears, the argument can be dropped
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)
logistic_model.score(test_X, test_y)
The official documentation for the logistic regression parameters: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
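For intuition about the multinomial setting, a multinomial logistic regression turns one linear score per class into probabilities with a softmax. The sketch below shows only that prediction step with made-up weights; it does not fit anything, and all names and numbers in it are hypothetical.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, weights, biases):
    """Compute class probabilities from one weight row and bias per class."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical 3-class model over 2 features.
toy_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
toy_biases = [0.0, 0.0, 0.0]
probs = predict_proba([1.0, 0.0], toy_weights, toy_biases)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```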

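3.3) Saving the model
pickle is part of the Python standard library, so it does not need to be installed with pip; import it directly. pickle.dump takes two arguments:
The 1st argument is the object to save, which can be of any picklable type; because there are 3 models to save, the 1st argument in the code below is a dict
The 2nd argument is the file object to save to
import pickle
with open('tfidf.model', 'wb') as file:
    save = {
        'labelEncoder': labelEncoder,
        'tfidfVectorizer': tfidf,
        'logistic_model': logistic_model
    }
    pickle.dump(save, file)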
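The same save-and-restore round trip can be tried in memory with pickle.dumps and pickle.loads, which avoids touching the disk. The dictionary below is a hypothetical stand-in for the three fitted objects.

```python
import pickle

# Hypothetical bundle standing in for the fitted encoder, vectorizer and model.
toy_bundle = {
    "labels": ["finance", "sports"],
    "vocabulary": {"stock": 0, "team": 1},
    "threshold": 0.5,
}

payload = pickle.dumps(toy_bundle)   # serialize to bytes
restored = pickle.loads(payload)     # rebuild an equal, independent object
print(type(payload).__name__, restored == toy_bundle)
```

Note that unpickling executes code from the payload, so pickle files should only be loaded from sources you trust.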
3.4) Cross-validation
This step does not require rerunning any of the preceding steps, so the jupyter notebook can be restarted first. Then call the load function of the pickle library to load the saved model objects:
import pickle
with open('tfidf.model', 'rb') as file:
    tfidf_model = pickle.load(file)
tfidfVectorizer = tfidf_model['tfidfVectorizer']
labelEncoder = tfidf_model['labelEncoder']
logistic_model = tfidf_model['logistic_model']
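After loading the models, reload the training data:
import pandas as pd
train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
# transform (not fit_transform) reuses the vocabulary and label mapping learned earlier
X = tfidfVectorizer.transform(train_df[1])
y = labelEncoder.transform(train_df[0])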
Instantiate the logistic regression model object with LogisticRegression from sklearn.linear_model.
Instantiate the cross-validation object with ShuffleSplit from sklearn.model_selection.
Call cross_val_score from sklearn.model_selection to get the score of each cross-validation round.
Finally, print each round's score and the mean score:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# 5 independent random splits, each holding out 30% of the samples
cv_split = ShuffleSplit(n_splits=5, test_size=0.3)
score_ndarray = cross_val_score(logistic_model, X, y, cv=cv_split)
print(score_ndarray)
print(score_ndarray.mean())
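The splitting scheme used above can be sketched in a few lines of plain Python: each round reshuffles all indices and cuts off a test portion, so test sets from different rounds may overlap. This is an illustration under the assumption that the test size is rounded up; shuffle_split, its seed parameter, and the toy sizes are hypothetical.

```python
import math
import random

def shuffle_split(n_samples, n_splits=5, test_size=0.3, seed=0):
    """Yield (train_indices, test_indices) for independent random splits."""
    rng = random.Random(seed)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test size up
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

# Hypothetical run: 20 samples, 5 rounds, 25% held out each round.
splits = list(shuffle_split(20, n_splits=5, test_size=0.25))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Averaging the scores of several such rounds gives a steadier estimate than a single train/test split.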

4) Model evaluation
Build the confusion matrix:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
import pandas as pd
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
# LogisticRegressionCV also selects the regularization strength by internal cross-validation
logistic_model = LogisticRegressionCV(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)
predict_y = logistic_model.predict(test_X)
# Rows are the true classes, columns are the predicted classes
pd.DataFrame(confusion_matrix(test_y, predict_y), columns=labelEncoder.classes_, index=labelEncoder.classes_)
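To read the table, it helps to see how the counts are accumulated: the entry in row i, column j is the number of samples whose true class is i and whose predicted class is j. A minimal sketch with hypothetical labels:

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count samples by (true class, predicted class)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Hypothetical encoded labels for six samples and three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix_counts(toy_true, toy_pred, 3)
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions, so off-diagonal entries show which categories get confused with each other.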

Build the precision, recall, f1-score, and support report table:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support
def eval_model(y_true, y_pred, labels):
    # Compute Precision, Recall, F1 and support for each class
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Compute the overall support-weighted average Precision, Recall, F1 and the total support
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        'Label': labels,
        'Precision': p,
        'Recall': r,
        'F1': f1,
        'Support': s
    })
    res2 = pd.DataFrame({
        'Label': ['Overall'],
        'Precision': [tot_p],
        'Recall': [tot_r],
        'F1': [tot_f1],
        'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]
predict_y = logistic_model.predict(test_X)
eval_model(test_y, predict_y, labelEncoder.classes_)
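The per-class numbers in the report, and the support-weighted overall row, can be reproduced by hand on a tiny example. The sketch below is pure Python with hypothetical labels; per_class_report and weighted_average are illustrative names, not library functions.

```python
def per_class_report(y_true, y_pred, n_classes):
    """Return (precision, recall, f1, support) for each class."""
    rows = []
    for c in range(n_classes):
        true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        support = sum(1 for t in y_true if t == c)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, support))
    return rows

def weighted_average(values, weights):
    """Average the values, weighting each class by its support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical encoded labels for six samples and three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
rows = per_class_report(toy_true, toy_pred, 3)
supports = [row[3] for row in rows]
overall_recall = weighted_average([row[1] for row in rows], supports)
print(rows, overall_recall)
```

The support-weighted recall equals plain accuracy, here 4 correct predictions out of 6.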

5) Model testing
import pandas as pd
test_df = pd.read_csv('sohu_test.txt', sep='\t', header=None)
test_X = tfidfVectorizer.transform(test_df[1])
test_y = labelEncoder.transform(test_df[0])
predict_y = logistic_model.predict(test_X)
eval_model(test_y, predict_y, labelEncoder.classes_)

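6) Summary
The training set has 24,000 samples and the test set has 12,000 samples. With cross-validation, the model's mean score is 0.8711.
In the model evaluation step, switching to LogisticRegressionCV raised the score by 0.0365 (about 3 percentage points), to 0.9076.
Finally, the f1-score on the test set is 0.8990. Overall, the classifier performs well and could be put to practical use.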
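7) Acknowledgments
This post is based on the following Jianshu article: https://www.jianshu.com/p/96b983784dae
Many thanks to its author for the detailed walkthrough.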
8) Flowchart

