Analyzing Red Wine Data with Python

Reposted article, 2018-06-20 12:39

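In a related analysis I used random forest regression, together with data standardization and hyperparameter tuning. Here I use a random forest classifier to perform binary classification of good wines versus not-so-good wines.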

First, import the packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the data:

data = pd.read_csv('winequality-red.csv')
data.head()

data.describe()

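Notes on the columns:

fixed acidity: non-volatile acids

volatile acidity: volatile acids

citric acid: citric acid

residual sugar: residual sugar

chlorides: chlorides

free sulfur dioxide: free sulfur dioxide

total sulfur dioxide: total sulfur dioxide

density: density

pH: pH

sulphates: sulphates

alcohol: alcohol

quality: quality score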

Every column has a count of 1599, so there are no missing values. Let's check whether there are duplicate rows:

extra = data[data.duplicated()]
extra.shape

There are 240 duplicate rows, but I will not delete them for now, because the quality ratings were given by different wine tasters.
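Both checks can be sketched on a small made-up frame. The `toy` name and its values are illustrative assumptions, not taken from the wine data. `isnull().sum()` counts missing values per column directly, and `duplicated()` flags every row that repeats an earlier one.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine table
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, 9.4, 10.0],
    'pH':      [3.51, 3.20, 3.51, 3.16],
    'quality': [5, 5, 5, 6],
})

# Number of missing (NaN) values in each column
missing_per_column = toy.isnull().sum()

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print('duplicate rows:', n_duplicates)
```

If the duplicates ever need to go, `drop_duplicates()` returns a copy without them and leaves the original frame untouched.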

Data visualization

sns.set()
data.hist(figsize=(10,10), color='red')
plt.show()

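Quality is the only discrete variable, and its values are concentrated at 5 and 6. Next, let's look at the correlations between the variables: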

colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Correlation of Features', y=1.05, size=15)
# Pass the colormap defined above so it is actually used
sns.heatmap(data.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True,
            cmap=colormap, linecolor='white', annot=True)
plt.show()

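Observation:
Alcohol has the highest correlation with wine quality, followed by the various acidity measures, sulphates, density, and chlorides.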
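The same ranking can be read off numerically instead of from the heatmap. The sketch below uses synthetic data (the `toy` frame and its `unrelated` column are assumptions for illustration) and sorts features by the absolute value of their correlation with `quality`, so strong negative correlations are not overlooked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: 'alcohol' drives 'quality', 'unrelated' is pure noise
alcohol = rng.normal(10, 1, n)
toy = pd.DataFrame({
    'alcohol': alcohol,
    'unrelated': rng.normal(0, 1, n),
    'quality': alcohol + rng.normal(0, 0.5, n),
})

# Correlation of every feature with the target, target itself removed
corr_with_quality = toy.corr()['quality'].drop('quality')

# Rank by strength, ignoring the sign
ranked = corr_with_quality.abs().sort_values(ascending=False)
print(ranked)
```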

Using a classifier:

Split the wines into two groups: a quality score above 5 counts as a "good" wine.

y = data.quality                  # set 'quality' as target
X = data.drop('quality', axis=1)  # rest are features
print(y.shape, X.shape)           # check correctness

# Create a new y1
y1 = (y > 5).astype(int)
y1.head()

# plot histogram
ax = y1.plot.hist(color='green')
ax.set_title('Wine quality distribution', fontsize=14)
ax.set_xlabel('aggregated target value')
plt.show()
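The class balance shown by the histogram can also be checked as numbers. This is a minimal sketch on made-up quality scores (the `quality` series here is hypothetical); `value_counts(normalize=True)` gives the share of each class, which is the baseline accuracy a trivial majority-class predictor would reach.

```python
import pandas as pd

# Hypothetical quality scores, not the real data
quality = pd.Series([5, 6, 5, 7, 4, 6, 5, 8])

# 1 = good (quality above 5), 0 = not good (quality 5 or below)
label = (quality > 5).astype(int)

counts = label.value_counts().sort_index()                 # rows per class
shares = label.value_counts(normalize=True).sort_index()   # fraction per class

print(counts)
print(shares)
```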

Train a predictive model with a random forest classifier

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix

Split the data into training and test sets

seed = 8 # set seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y1, test_size=0.2,
                                                    random_state=seed)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Train and evaluate the random forest classifier with cross-validation

# Instantiate the Random Forest Classifier
RF_clf = RandomForestClassifier(random_state=seed)
RF_clf

# Run k-fold cross-validation on the training set and look at the mean accuracy score
cv_scores = cross_val_score(RF_clf,X_train, y_train, cv=10, scoring='accuracy')
print('The accuracy scores for the iterations are {}'.format(cv_scores))
print('The mean accuracy score is {}'.format(cv_scores.mean()))
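With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below makes the folds explicit and reports the spread alongside the mean; it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, `clf`, and `folds` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with 11 features, standing in for the wine data
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=8)

clf = RandomForestClassifier(n_estimators=50, random_state=8)

# Explicit stratified folds, shuffled with a fixed seed for reproducibility
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=8)

scores = cross_val_score(clf, X_demo, y_demo, cv=folds, scoring='accuracy')
print('mean accuracy: {:.3f} (std {:.3f})'.format(scores.mean(), scores.std()))
```

A large standard deviation across folds is a hint that a single train/test split may give a noisy accuracy estimate.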

Make predictions

RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)

# Print 5 results to see
for i in range(0,5):
    print('Actual wine quality is ', y_test.iloc[i], ' and predicted is ', pred_RF[i])

One of the first five predictions is wrong. Let's look at the metrics.

# Metrics for the random forest predictions (pred_RF is defined above)
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))

print(confusion_matrix(y_test, pred_RF))

There are 81 misclassifications in total.
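The misclassification count is the sum of the off-diagonal cells of the confusion matrix. The sketch below shows that on made-up labels (all arrays are hypothetical). It also computes log loss from class probabilities, which is what `log_loss` is designed for; with a fitted classifier those would come from `predict_proba` rather than `predict`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss

# Hypothetical true labels, hard predictions, and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
proba = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7],
                  [0.1, 0.9], [0.6, 0.4], [0.7, 0.3]])

cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal; everything else is an error
n_errors = int(cm.sum() - np.trace(cm))

# Log loss evaluated on probabilities (columns ordered as classes 0, 1)
loss_from_proba = log_loss(y_true, proba)

print(cm)
print('errors:', n_errors)
print('log loss:', round(loss_from_proba, 3))
```

Passing hard 0/1 predictions to `log_loss` still runs, but every error is then charged as a fully confident mistake, so the value mostly mirrors the error count.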

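Compared with a logistic regression classifier, the random forest classifier performs better.

Let's tune the hyperparameters of the random forest classifier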

from sklearn.model_selection import GridSearchCV
# 'auto' was dropped from max_features: for classifiers it meant the same as
# 'sqrt', and recent scikit-learn releases no longer accept it
grid_values = {'n_estimators':[50,100,200],'max_depth':[None,30,15,5],
               'max_features':['sqrt','log2'],'min_samples_leaf':[1,20,50,100]}
grid_RF = GridSearchCV(RF_clf,param_grid=grid_values,scoring='accuracy')
grid_RF.fit(X_train, y_train)

grid_RF.best_params_

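Apart from the number of estimators, the recommended values are the defaults. (When this was written, scikit-learn defaulted to 10 estimators; since version 0.22 the default is 100.)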
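Beyond `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of the winning combination and a model already refit with it. The sketch below uses a deliberately small grid on synthetic data so it runs quickly; `X_demo`, `y_demo`, `small_grid`, and `search` are assumed names for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the wine data
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=8)

# Small illustrative grid: 2 x 2 = 4 combinations
small_grid = {'n_estimators': [10, 30], 'max_depth': [None, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=8),
                      param_grid=small_grid, scoring='accuracy', cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)            # winning parameter combination
print(round(search.best_score_, 3))   # its mean cross-validated accuracy

# With the default refit=True, this model is already trained on all of X_demo
tuned_clf = search.best_estimator_
```

Using `best_estimator_` avoids retyping the chosen parameters by hand when building the final model.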

RF_clf = RandomForestClassifier(n_estimators=100,random_state=seed)
RF_clf.fit(X_train,y_train)
pred_RF = RF_clf.predict(X_test)
print(accuracy_score(y_test,pred_RF))
print(log_loss(y_test,pred_RF))

print(confusion_matrix(y_test,pred_RF))

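After hyperparameter tuning, the accuracy of the RF classifier rose to 82.5%, and the log loss fell accordingly. The number of misclassifications also dropped to 56.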
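The two figures are consistent with each other, which can be checked with a little arithmetic. This sketch assumes the 1599 rows and `test_size=0.2` quoted earlier, and relies on `train_test_split` rounding a fractional test size up to a whole number of rows.

```python
import math

n_rows = 1599      # rows in the data set, from the text
test_size = 0.2    # test fraction used in the split
n_errors = 56      # misclassifications reported after tuning

# 1599 * 0.2 = 319.8, rounded up to 320 test rows
n_test = math.ceil(n_rows * test_size)

# Accuracy is the share of test rows classified correctly
accuracy = (n_test - n_errors) / n_test

print(n_test, accuracy)
```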

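Treating the random forest classifier as a basic recommender that labels red wines as "recommended" (quality 6 or above) or "not recommended" (quality 5 or below), a prediction accuracy of 82.5% seems reasonable.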
