Reference: https://mp.weixin.qq.com/s/6vkz18Xw4USZ3fldd_wf5g
1. Dataset download links
https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/train_set.csv.zip
https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a.csv.zip
The data comes from a Tianchi competition: the training set has 200,000 samples and test set A has 50,000. The text has been anonymized at the character level, so each document is a space-separated sequence of integer token ids.

Here we simply split the training set further into a training set and a test set.
The label-to-id mapping in the dataset is as follows:
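One way to do that split is scikit-learn's train_test_split (a sketch; the file path is an assumption, and the sections below instead just slice the first 10,000 of 15,000 rows):

import pandas as pd
from sklearn.model_selection import train_test_split

# Tab-separated file with a 'label' and a 'text' column.
df = pd.read_csv('train_set.csv', sep='\t')
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42,
                                     stratify=df['label'])  # keep class ratios
print(train_df.shape, test_df.shape)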
{'tech': 0, 'stocks': 1, 'sports': 2, 'entertainment': 3, 'current affairs': 4, 'society': 5, 'education': 6, 'finance': 7, 'home': 8, 'gaming': 9, 'real estate': 10, 'fashion': 11, 'lottery': 12, 'horoscope': 13}
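With that mapping in hand, a quick class-balance check on the loaded data (a sketch assuming the train_df from section 3; id2name is just the mapping above, inverted):

# Samples per category; small classes matter because the metric is macro F1.
id2name = {0: 'tech', 1: 'stocks', 2: 'sports', 3: 'entertainment', 4: 'current affairs',
           5: 'society', 6: 'education', 7: 'finance', 8: 'home', 9: 'gaming',
           10: 'real estate', 11: 'fashion', 12: 'lottery', 13: 'horoscope'}
print(train_df['label'].map(id2name).value_counts())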
Evaluation metric: macro-averaged F1 score (the unweighted mean of the per-class F1 scores), which is what every f1_score call below computes.
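A toy example of how macro averaging works: per-class F1 scores are computed first and then averaged without class weighting, so small classes count as much as large ones (made-up labels):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]
print(f1_score(y_true, y_pred, average=None))     # per-class F1: [0.8, 0.5, 0.667]
print(f1_score(y_true, y_pred, average='macro'))  # their unweighted mean: ~0.656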

2. Import the required packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
3. Read the data
train_path = "/content/drive/My Drive/nlpdata/news/train_set.csv"
train_df = pd.read_csv(train_path, sep='\t', nrows=15000)  # use the first 15,000 rows
train_df['text']   # inspect the text column
train_df['label']  # inspect the label column
4. Text classification
(1) n-gram counts + ridge classification
vectorizer = CountVectorizer(max_features=3000)  # unigram counts by default (ngram_range=(1, 1))
train_test = vectorizer.fit_transform(train_df['text'])
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])  # first 10,000 rows: training
val_pred = clf.predict(train_test[10000:])                     # last 5,000 rows: validation
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
0.65441877581244
(2) TF-IDF + ridge classification
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
0.8719372173702
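Compared with (1), only the vectorizer changed: TF-IDF reweights raw counts by inverse document frequency, down-weighting tokens that occur in most documents. A toy illustration on a made-up three-document corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['3750 648 900', '3750 648 1200', '3750 5000']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))           # '3750', present in every doc, gets the lowest weight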
5. How the parameters affect the model
Take a sample of 5,000 rows, keep the other parameters fixed, increase alpha from 0.15 to 1.5, and plot F1 as a function of alpha.
(1) For the ridge classifier: the effect of alpha
sample = train_df[0:5000]
n = int(2 * len(sample) / 3)  # two thirds for training, one third for validation
tfidf = TfidfVectorizer(ngram_range=(2, 3), max_features=2500)
train_test = tfidf.fit_transform(sample['text'])
train_x = train_test[:n]
train_y = sample['label'].values[:n]
test_x = train_test[n:]
test_y = sample['label'].values[n:]
f1 = []
for i in range(10):
    clf = RidgeClassifier(alpha=0.15 * (i + 1), solver='sag')
    clf.fit(train_x, train_y)
    val_pred = clf.predict(test_x)
    f1.append(f1_score(test_y, val_pred, average='macro'))
plt.plot([0.15 * (i + 1) for i in range(10)], f1)
plt.xlabel('alpha')
plt.ylabel('f1_score')
plt.show()
(figure: macro F1 vs. alpha)
As the plot shows, alpha should be neither too small nor too large: the smaller it is, the stronger the fit but the weaker the generalization; the larger it is, the weaker the fit but the stronger the generalization.
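The knob works through the ridge objective, which adds an L2 penalty alpha * ||w||^2 to the squared loss, so larger alpha shrinks the weights. A quick check of that shrinkage, reusing train_x/train_y from the block above:

# Larger alpha => smaller average |coefficient| (stronger shrinkage).
for a in (0.15, 0.75, 1.5):
    clf = RidgeClassifier(alpha=a).fit(train_x, train_y)
    print(a, abs(clf.coef_).mean())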
(2) The effect of max_features
Set max_features to 1000, 2000, 3000, and 4000 in turn, holding the other parameters fixed, to study its effect on model accuracy.
f1 = []
features = [1000, 2000, 3000, 4000]
for i in range(4):
    tfidf = TfidfVectorizer(ngram_range=(2, 3), max_features=features[i])
    train_test = tfidf.fit_transform(sample['text'])
    train_x = train_test[:n]
    train_y = sample['label'].values[:n]
    test_x = train_test[n:]
    test_y = sample['label'].values[n:]
    clf = RidgeClassifier(alpha=0.1, solver='sag')  # alpha held fixed so only max_features varies
    clf.fit(train_x, train_y)
    val_pred = clf.predict(test_x)
    f1.append(f1_score(test_y, val_pred, average='macro'))
plt.plot(features, f1)
plt.xlabel('max_features')
plt.ylabel('f1_score')
plt.show()
(figure: macro F1 vs. max_features)
The larger max_features is, the higher the accuracy, but past a certain point further increases have little noticeable effect.
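max_features keeps only the most frequent terms, so the features added beyond a few thousand are increasingly rare n-grams that carry little signal. One way to put the cap in perspective is the size of the uncapped vocabulary (a sketch reusing sample from above; fitting with no cap can be memory-hungry):

# Fit without a feature cap to see how many distinct (2,3)-grams the sample contains.
tfidf_full = TfidfVectorizer(ngram_range=(2, 3))
tfidf_full.fit(sample['text'])
print(len(tfidf_full.vocabulary_))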
(3) The effect of ngram_range
ngram_range sets the lower and upper bounds on the length of the extracted n-grams. Given typical Chinese word lengths, n between 1 and 4 is a reasonable search range; the code below compares (1,1), (2,2), (3,3), and (1,3) at four alpha values.
f1 = []
ngram_ranges = [(1, 1), (2, 2), (3, 3), (1, 3)]
for i in range(4):  # four alpha values: 0.1, 0.2, 0.3, 0.4
    for ngram in ngram_ranges:
        tfidf = TfidfVectorizer(ngram_range=ngram, max_features=2000)
        train_test = tfidf.fit_transform(sample['text'])
        train_x = train_test[:n]
        train_y = sample['label'].values[:n]
        test_x = train_test[n:]
        test_y = sample['label'].values[n:]
        clf = RidgeClassifier(alpha=0.1 * (i + 1), solver='sag')
        clf.fit(train_x, train_y)
        val_pred = clf.predict(test_x)
        f1.append(f1_score(test_y, val_pred, average='macro'))
print(f1)
The 16 macro F1 scores, one row per alpha and one column per ngram_range (rounded to 4 decimals):

alpha   (1,1)    (2,2)    (3,3)    (1,3)
0.1     0.7932   0.7831   0.6293   0.8437
0.2     0.8127   0.7916   0.6425   0.8513
0.3     0.8215   0.7979   0.6500   0.8517
0.4     0.8275   0.7964   0.6577   0.8485

(1,3) is consistently the best and (3,3) clearly the worst, so mixing unigrams through trigrams beats any single longer n-gram size.
6. Other classification models
All of them use TF-IDF as the preprocessing step; since each subsection rebuilds the same matrix, a shared-preprocessing sketch follows below.
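A sketch of computing the features once (the variable names X_train etc. are my own; the subsections below keep their repeated per-section form so each stays self-contained):

# Fit TF-IDF once and reuse one train/validation split for the three models below.
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
X = tfidf.fit_transform(train_df['text'])  # 15000 x 5000 sparse matrix
X_train, y_train = X[:10000], train_df['label'].values[:10000]
X_val, y_val = X[10000:], train_df['label'].values[10000:]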
(1) Logistic regression
from sklearn import linear_model

tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
train_test = tfidf.fit_transform(train_df['text'])  # feature matrix: 15000 x max_features
reg = linear_model.LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print('Number of news items per predicted class')
print(pd.Series(val_pred).value_counts())
print('\nF1 score:')
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
Number of news items per predicted class
0 1032
1 1029
2 782
3 588
4 375
5 316
6 224
8 166
7 161
9 123
10 109
11 60
12 23
13 12
dtype: int64
F1 score:
0.8464704900433653
(2) SGDClassifier
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
train_test = tfidf.fit_transform(train_df['text'])  # feature matrix: 15000 x max_features
reg = linear_model.SGDClassifier(loss='log_loss',   # named 'log' in scikit-learn < 1.1
                                 penalty='l2', alpha=0.0001, l1_ratio=0.15)
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print('Number of news items per predicted class')
print(pd.Series(val_pred).value_counts())
print('\nF1 score:')
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

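A side note on why one might pick SGDClassifier: it fits the same linear model by stochastic gradient descent, so it can also train on the full 200,000-row file in chunks. A sketch, assuming HashingVectorizer in place of TF-IDF (a streaming pass cannot fit a vocabulary up front); the chunk size and n_features are arbitrary choices:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)  # stateless, no fit needed
clf = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.0001)
classes = np.arange(14)  # all 14 label ids must be declared up front
for chunk in pd.read_csv(train_path, sep='\t', chunksize=5000):
    clf.partial_fit(vec.transform(chunk['text']), chunk['label'], classes=classes)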
(3) SVM
from sklearn import svm

tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
train_test = tfidf.fit_transform(train_df['text'])  # feature matrix: 15000 x max_features
reg = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', decision_function_shape='ovr')
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print('Number of news items per predicted class')
print(pd.Series(val_pred).value_counts())
print('\nF1 score:')
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
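Since the kernel here is linear, scikit-learn's LinearSVC (liblinear-based) is usually a much faster drop-in on high-dimensional sparse TF-IDF features; a sketch reusing train_test and train_df from the block above:

from sklearn.svm import LinearSVC

reg = LinearSVC(C=1.0)  # one-vs-rest by default, no kernel matrix to build
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))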

