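In an earlier analysis I used random forest regression, with data standardization and hyperparameter tuning. Here I use a random forest classifier instead, to do binary classification of good versus not-so-good wines.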
First, import the packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the data:
data = pd.read_csv('winequality-red.csv')
data.head()
data.describe()
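Notes on the columns:
fixed acidity: non-volatile acids
volatile acidity: volatile acids
citric acid: citric acid content
residual sugar: residual sugar
chlorides: chlorides
free sulfur dioxide: free sulfur dioxide
total sulfur dioxide: total sulfur dioxide
density: density
pH: pH value
sulphates: sulphates
alcohol: alcohol content
quality: quality score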
Every column has a count of 1599, so there are no missing values. Let's check whether there are duplicate rows:
extra = data[data.duplicated()]
extra.shape
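There are 240 duplicate rows, but I am not dropping them for now, because the quality ratings were given by different tasters and identical rows may be separate assessments.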
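The missing-value and duplicate checks above can be sketched on a small stand-in table. The `toy` frame and its values are made up for illustration and are not rows from the wine dataset. The sketch also shows how the `keep` argument of `duplicated()` changes what gets counted.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data (values are made up).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.0, 9.4],
    "pH": [3.51, 3.20, 3.51, 3.16, 3.51],
    "quality": [5, 5, 5, 6, 5],
})

# Missing values per column; all zeros means every column is complete.
missing_per_column = toy.isnull().sum()

# By default, duplicated() flags every repeat after the first occurrence.
n_repeats = int(toy.duplicated().sum())

# keep=False flags all members of each duplicate group, first occurrence included.
n_in_groups = int(toy.duplicated(keep=False).sum())

print(missing_per_column.to_dict())
print(n_repeats, n_in_groups)
```

Because the default counts only the repeats, the number reported by `data[data.duplicated()]` is the number of rows that `drop_duplicates()` would remove, not the number of rows that have a twin.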
Data visualization
sns.set()
data.hist(figsize=(10,10), color='red')
plt.show()
Only quality is a discrete variable, and most of its values are 5 or 6. Next, look at the correlations between the variables:
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Correlation of Features', y=1.05, size=15)
sns.heatmap(data.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()
Observations:
Alcohol has the highest correlation with wine quality, followed by the acidity measures, sulphates, density and chlorides.
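Reading a ranking off a heatmap is error-prone, so the same information can be pulled out as a sorted column. The sketch below uses synthetic data. The coefficients are my own assumptions, chosen so that quality depends on alcohol and volatile acidity but not on pH, and they say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three of the wine features (assumed distributions).
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.2, n)
ph = rng.normal(3.3, 0.15, n)

# Assumed relationship: quality rises with alcohol, falls with volatile
# acidity, and ignores pH entirely.
quality = (5.6 + 0.5 * (alcohol - 10.4)
           - 1.5 * (volatile_acidity - 0.5)
           + rng.normal(0, 0.5, n))

toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile_acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

# Correlation of each feature with the target, ordered by absolute strength.
corr_with_quality = toy.corr()["quality"].drop("quality")
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print(ranked)
```

On the real data the equivalent call would be `data.corr()['quality'].drop('quality')`. Sorting by absolute value keeps strong negative correlations near the top.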
Using a classifier:
Split the wines into two groups: a quality score above 5 counts as "good wine".
y = data.quality  # set 'quality' as target
X = data.drop('quality', axis=1)  # rest are features
print(y.shape, X.shape)  # check correctness
# Create a new binary target y1: 1 for quality > 5, 0 otherwise
y1 = (y > 5).astype(int)
y1.head()
# plot histogram
ax = y1.plot.hist(color='green')
ax.set_title('Wine quality distribution', fontsize=14)
ax.set_xlabel('aggregated target value')
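Besides the histogram, the class balance of the new binary target can be read as plain numbers. A minimal sketch on a made-up series of quality scores (the values are hypothetical, not from the dataset):

```python
import pandas as pd

# Hypothetical quality scores, for illustration only.
quality = pd.Series([5, 6, 5, 7, 4, 6, 5, 8, 6, 6], name="quality")

# Same thresholding rule as in the text: quality > 5 becomes class 1.
label = (quality > 5).astype(int)

# Absolute counts and relative shares per class, ordered by class label.
counts = label.value_counts().sort_index()
shares = label.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.to_dict())
```

The shares matter when interpreting accuracy later: a model that always predicts the larger class already reaches that class's share.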
Training a predictive model with a random forest classifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix
Split the data into training and test sets
seed = 8  # set seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y1, test_size=0.2, random_state=seed)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
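The split above is a plain random split. One optional variation, which the original analysis does not use, is to pass `stratify` so that both parts keep the same class proportions. The sketch below works on synthetic arrays (`X_toy` and `y_toy` are placeholders) and also checks how a 20% test share is rounded for 1599 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# How test_size=0.2 is rounded for 1599 rows (only the row count is taken from the text).
idx_train, idx_test = train_test_split(np.arange(1599), test_size=0.2, random_state=8)
print(len(idx_train), len(idx_test))

# Synthetic features and a 60/40 binary target, for illustration only.
rng = np.random.default_rng(8)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 120 + [0] * 80)

# stratify=y_toy asks for the same class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

With a nearly balanced target the difference is usually small, but stratifying removes one source of run-to-run variation in the test metrics.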
Cross-validating and evaluating the random forest classifier
# Instantiate the Random Forest Classifier
RF_clf = RandomForestClassifier(random_state=seed)
RF_clf
# Run k-fold cross-validation on the training set and look at the mean accuracy score
cv_scores = cross_val_score(RF_clf, X_train, y_train, cv=10, scoring='accuracy')
print('The accuracy scores for the iterations are {}'.format(cv_scores))
print('The mean accuracy score is {}'.format(cv_scores.mean()))
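When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. If shuffled folds are wanted, the splitter can be passed explicitly. This is a sketch on synthetic data from `make_classification`. The fold count, the number of trees and the data are assumptions chosen to keep it quick, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=8
)

clf = RandomForestClassifier(n_estimators=25, random_state=8)

# Explicit splitter: stratified folds, shuffled with a fixed seed.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

scores = cross_val_score(clf, X_toy, y_toy, cv=folds, scoring="accuracy")

# The spread across folds shows how stable the estimate is.
print(scores)
print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a later improvement is larger than the fold-to-fold noise.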
Making predictions
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)
# Print the first 5 results for a quick look
for i in range(0, 5):
    print('Actual wine quality is ', y_test.iloc[i], ' and predicted is ', pred_RF[i])
One of the first five predictions is wrong. Let's look at the metrics.
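# Evaluate the random forest predictions on the test set
print(accuracy_score(y_test, pred_RF))
# log_loss is applied to hard 0/1 labels here, so every miss is penalized as a fully confident error
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
There are 81 misclassifications in total.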
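The misclassification count is the sum of the off-diagonal cells of the confusion matrix, and accuracy follows from it directly. The actual matrix is not reproduced in the text, so the cell values below are hypothetical. They are chosen only so that the off-diagonal cells add up to the 81 errors quoted above, and so that the total matches a test set of 320 rows, which is what a 20% split of 1599 rows gives.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predicted.
# Only the error total (81) comes from the text; the split across cells is assumed.
cm = np.array([
    [100, 40],   # actual 0: 100 correct, 40 predicted as 1
    [41, 139],   # actual 1: 41 predicted as 0, 139 correct
])

n_total = int(cm.sum())
n_correct = int(np.trace(cm))      # the diagonal holds the correct predictions
n_errors = n_total - n_correct     # everything off the diagonal is an error
accuracy = n_correct / n_total

print(n_total, n_errors, accuracy)
```

Under these assumptions, 81 errors in 320 test wines corresponds to an accuracy of roughly 74.7%.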
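Compared with the logistic regression classifier, the random forest classifier performs better.
Let's tune the hyperparameters of the random forest classifier
from sklearn.model_selection import GridSearchCV
# 'auto' (equivalent to 'sqrt' for classifiers) is left out of max_features because recent scikit-learn releases no longer accept it
grid_values = {'n_estimators': [50, 100, 200],
               'max_depth': [None, 30, 15, 5],
               'max_features': ['sqrt', 'log2'],
               'min_samples_leaf': [1, 20, 50, 100]}
grid_RF = GridSearchCV(RF_clf, param_grid=grid_values, scoring='accuracy')
grid_RF.fit(X_train, y_train)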
grid_RF.best_params_
Apart from the number of estimators, the recommended values are all the defaults.
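`best_params_` shows only the winning combination. To see how close the runners-up were, the full results table can be inspected through `cv_results_`. This is a sketch on synthetic data with a deliberately tiny grid. Both the grid and the data are assumptions made so the example runs quickly, and neither is the grid used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=8
)

# Deliberately small grid (4 combinations) so the search finishes quickly.
small_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=8),
    param_grid=small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(search.best_params_, search.best_score_)
print(results)
```

If several combinations score within one standard deviation of the best, the simpler or cheaper one is often a reasonable choice.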
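# Refit with the tuned number of trees and re-evaluate on the test set
RF_clf = RandomForestClassifier(n_estimators=100, random_state=seed)
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
After hyperparameter tuning, the RF classifier's accuracy rose to 82.5%, and the log loss fell accordingly. The number of misclassifications also dropped to 56 (out of 320 test samples).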
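A note on the log loss figures: they are computed from hard 0/1 predictions, so they mostly restate the error count. Log loss is normally computed from predicted probabilities, which `predict_proba` provides. The sketch below compares the two variants on synthetic data. The data and the model settings are assumptions, so the numbers it prints have no bearing on the wine results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=8
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8
)

clf = RandomForestClassifier(n_estimators=50, random_state=8).fit(X_tr, y_tr)

# Variant used in the text: hard labels are treated as probabilities of 0 or 1.
loss_hard = log_loss(y_te, clf.predict(X_te))

# Probability-based variant: column 1 is the predicted probability of class 1.
loss_proba = log_loss(y_te, clf.predict_proba(X_te)[:, 1])

print(loss_hard, loss_proba)
```

The probability-based value rewards well-calibrated confidence, which makes it more informative than accuracy alone when comparing two models.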
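Using the random forest classifier as a basic recommender that labels red wines as "recommended" (quality 6 and above) or "not recommended" (quality 5 and below), a prediction accuracy of 82.5% seems reasonable.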