Iris Dataset Analysis
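The Iris dataset is a classic dataset that is frequently used as an example in both statistical learning and machine learning. It contains 150 records in 3 classes, 50 per class, and each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features can be used to predict which of three species (Iris-setosa, Iris-versicolor, Iris-virginica) a flower belongs to.
Reportedly, in practice the basic criterion for telling these three species apart is actually the seeds (because the petals wither very easily).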
0 Preparing the data
The following performs an exploratory analysis of iris. First, import the relevant packages and the dataset:
# Import the relevant packages
import numpy as np
import pandas as pd
from pandas import plotting

# Jupyter magic: render figures inline (only valid inside a notebook)
%matplotlib inline
import matplotlib.pyplot as plt
# Note: matplotlib 3.6+ renames this style to 'seaborn-v0_8'
plt.style.use('seaborn')
import seaborn as sns
sns.set_style("whitegrid")

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# Load the dataset (raw string, so the backslashes in the Windows path are not read as escapes)
iris = pd.read_csv(r'F:\pydata\dataset\kaggle\iris.csv', usecols=[1, 2, 3, 4, 5])
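If the CSV is not available at that path, a roughly equivalent frame can be built from the copy of the dataset bundled with scikit-learn. This is a minimal sketch rather than the loading step used in this article: the helper name `load_iris_frame` is hypothetical, the column names and the `Iris-` label prefix are chosen here to mimic the CSV, and a few measurements in scikit-learn's copy may differ slightly from the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Build a DataFrame shaped like the CSV from scikit-learn's bundled iris data."""
    bunch = load_iris()
    # Column names are assumed here so that they match the ones used in this article
    columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
    frame = pd.DataFrame(bunch.data, columns=columns)
    # scikit-learn names the classes 'setosa', 'versicolor', 'virginica';
    # the 'Iris-' prefix is added to imitate the CSV labels (an assumption)
    frame['Species'] = ['Iris-' + bunch.target_names[i] for i in bunch.target]
    return frame


iris_alt = load_iris_frame()
print(iris_alt.shape)
print(iris_alt['Species'].value_counts())
```

The later cells only rely on the five column names, so a frame built this way should be usable in place of `iris`.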
Inspect the dataset information:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
View the first 5 records of the dataset:
iris.head()

1 Exploratory analysis
First, view the summary statistics for each feature column:
iris.describe()

Violin plots and point plots show the relationship between each feature and the species: the former through the distribution of the data, the latter through the slope of the line joining the species means.
# Set the color theme
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864']
# Draw the violin plots, one panel per feature
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.violinplot(x='Species', y='SepalLengthCm', data=iris, palette=antV, ax=axes[0, 0])
sns.violinplot(x='Species', y='SepalWidthCm', data=iris, palette=antV, ax=axes[0, 1])
sns.violinplot(x='Species', y='PetalLengthCm', data=iris, palette=antV, ax=axes[1, 0])
sns.violinplot(x='Species', y='PetalWidthCm', data=iris, palette=antV, ax=axes[1, 1])
plt.show()

# Draw the point plots, one panel per feature
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.pointplot(x='Species', y='SepalLengthCm', data=iris, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='Species', y='SepalWidthCm', data=iris, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='Species', y='PetalLengthCm', data=iris, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='Species', y='PetalWidthCm', data=iris, color=antV[0], ax=axes[1, 1])
plt.show()
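By default a point plot draws the mean of each group, so the same comparison can also be read as a small table of per-species means. A minimal sketch, assuming scikit-learn's bundled copy of the data (its species labels are `setosa`, `versicolor`, `virginica` rather than the CSV's names) and the column names used in this article:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild a frame with the article's column names (assumed) from scikit-learn's copy
bunch = load_iris()
df = pd.DataFrame(
    bunch.data,
    columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'],
)
df['Species'] = bunch.target_names[bunch.target]

# Mean of every feature within each species: the values a point plot connects
species_means = df.groupby('Species').mean()
print(species_means.round(2))
```

Features whose means differ a lot between species, relative to their spread in the violin plots, are the ones most likely to help a classifier.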

Generate a matrix of pairwise plots showing the relationships between the features:
g = sns.pairplot(data=iris, palette=antV, hue='Species')

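Andrews curves are a way of visualizing multidimensional data by mapping each observation to a function.
Each multivariate observation becomes a curve whose Fourier series coefficients are the observation's feature values, which is said to be useful for detecting outliers in time series data.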
plt.subplots(figsize=(10, 8))
plotting.andrews_curves(iris, 'Species', colormap='cool')
plt.show()
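To make the mapping concrete: for an observation x = (x1, x2, x3, x4, ...), the curve is commonly defined as f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... for t between -pi and pi. The sketch below evaluates that definition with NumPy. The function name `andrews_curve` is illustrative and not part of pandas, and the sample measurements are only example values.

```python
import numpy as np


def andrews_curve(x, t):
    """Evaluate the Andrews curve of one observation x at the points t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term: first feature divided by sqrt(2)
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    # Remaining features alternate between sin and cos terms of rising frequency
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result


t = np.linspace(-np.pi, np.pi, 200)
sample = [5.1, 3.5, 1.4, 0.2]  # example measurements of a single flower
curve = andrews_curve(sample, t)
print(curve.shape)
```

Observations with similar feature values produce curves that stay close together, which is why the curves of one species tend to form a visible band.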

Next, visualize linear regressions based on the sepal and the petal measurements respectively:
g = sns.lmplot(data=iris, x='SepalWidthCm', y='SepalLengthCm', palette=antV, hue='Species')

g = sns.lmplot(data=iris, x='PetalWidthCm', y='PetalLengthCm', palette=antV, hue='Species')
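With `hue='Species'`, each species gets its own regression line in these plots. The fitted lines can be approximated numerically with an ordinary least-squares fit per species, as in the sketch below. It uses `np.polyfit` on scikit-learn's bundled copy of the data, so the numbers are an approximation of what the plot draws and not seaborn's internal result; the dictionary name `fits` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild a frame with the article's column names (assumed) from scikit-learn's copy
bunch = load_iris()
df = pd.DataFrame(
    bunch.data,
    columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'],
)
df['Species'] = bunch.target_names[bunch.target]

# Fit PetalLengthCm = slope * PetalWidthCm + intercept separately for each species
fits = {}
for name, group in df.groupby('Species'):
    slope, intercept = np.polyfit(group['PetalWidthCm'], group['PetalLengthCm'], deg=1)
    fits[name] = (slope, intercept)
    print(f'{name}: slope={slope:.2f}, intercept={intercept:.2f}')
```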

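Finally, use a heatmap to find the correlations between the features of the dataset. Values close to +1 or -1 indicate that two features are strongly correlated:
fig = plt.gcf()
fig.set_size_inches(12, 8)
# Correlate only the numeric feature columns: recent pandas versions raise an error
# when corr() is called on a frame that still contains the string-valued Species column
ax = sns.heatmap(iris.drop(columns='Species').corr(), annot=True, cmap='GnBu',
                 linewidths=1, linecolor='k', square=True, vmin=-1, vmax=1,
                 cbar_kws={"orientation": "vertical"}, cbar=True)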

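The heatmap shows that sepal width and sepal length are only weakly correlated, whereas petal width and petal length are highly correlated.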
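The two coefficients behind that observation can be pulled out of the correlation matrix directly. A minimal sketch, again assuming scikit-learn's bundled copy of the data and the article's column names, so the values may differ marginally from those annotated on the heatmap:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Numeric features only, with the article's column names (assumed)
bunch = load_iris()
df = pd.DataFrame(
    bunch.data,
    columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'],
)

# Pearson correlation matrix, the same quantity the heatmap displays
corr = df.corr()
sepal_corr = corr.loc['SepalLengthCm', 'SepalWidthCm']
petal_corr = corr.loc['PetalLengthCm', 'PetalWidthCm']
print(f'sepal length vs width: {sepal_corr:.3f}')
print(f'petal length vs width: {petal_corr:.3f}')
```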
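2 Machine learning
Next, use machine learning to predict the species from the sepal and petal measurements.
Before training, the dataset has to be split into a training set and a test set. First, use label encoding to convert the 3 species names into categorical values (0, 1, 2).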
# Load the feature set and the label set
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = iris['Species']
# Encode the label set
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print(y)
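`LabelEncoder` assigns the integer codes in sorted order of the class names and keeps the mapping, so predictions can be turned back into names later. A small sketch on a hand-written list of labels (the list is illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels in arbitrary order
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the names; a name's position in it is its integer code
print(list(encoder.classes_))
print(codes)

# inverse_transform maps integer codes back to the original names
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```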
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Then split the dataset into training and test data at a ratio of 7:3:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=101)
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
(105, 4) (105,) (45, 4) (45,)
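A plain random split does not guarantee that the three species appear in equal numbers in the test set. Passing `stratify=y` to `train_test_split` keeps the class proportions, which can make accuracy on a 45-sample test set a little less dependent on the draw. The sketch below compares the two options on scikit-learn's bundled copy of the data; it is an optional variation, not the split used in this article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random split, as in the article
_, _, _, plain_test_y = train_test_split(X, y, test_size=0.3, random_state=101)

# Stratified split: class proportions of y are preserved in both parts
_, _, _, strat_test_y = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# Number of test samples per class (0, 1, 2)
plain_counts = np.bincount(plain_test_y, minlength=3)
strat_counts = np.bincount(strat_test_y, minlength=3)
print('plain split:     ', plain_counts)
print('stratified split:', strat_counts)
```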
Check the accuracy of different models:
# Support Vector Machine
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
# accuracy_score expects (y_true, y_pred)
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
The accuracy of the SVM is: 1.0
# Logistic Regression
model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Logistic Regression is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
The accuracy of the Logistic Regression is: 0.9555555555555556
# Decision Tree
model = DecisionTreeClassifier()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Decision Tree is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
The accuracy of the Decision Tree is: 0.9555555555555556
# K-Nearest Neighbours
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KNN is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
The accuracy of the KNN is: 1.0
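Accuracy on a single 45-sample test set can shift noticeably with the random split, so a difference of one or two misclassified flowers between models is not very informative. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an optional addition under stated assumptions: it uses scikit-learn's bundled copy of the data, 5 folds, and `max_iter=1000` and `random_state=0` chosen here for convergence and repeatability.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The same four model families as above
models = {
    'SVM': svm.SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=3),
}

# Mean accuracy over 5 folds for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```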
The models above used all of the dataset's features. Below, the petal and the sepal measurements are used separately:
# Petal features: split, then separate the features from the labels
petal = iris[['PetalLengthCm', 'PetalWidthCm', 'Species']]
train_p, test_p = train_test_split(petal, test_size=0.3, random_state=0)
train_x_p = train_p[['PetalWidthCm', 'PetalLengthCm']]
train_y_p = train_p.Species
test_x_p = test_p[['PetalWidthCm', 'PetalLengthCm']]
test_y_p = test_p.Species
# Sepal features: same split settings
sepal = iris[['SepalLengthCm', 'SepalWidthCm', 'Species']]
train_s, test_s = train_test_split(sepal, test_size=0.3, random_state=0)
train_x_s = train_s[['SepalWidthCm', 'SepalLengthCm']]
train_y_s = train_s.Species
test_x_s = test_s[['SepalWidthCm', 'SepalLengthCm']]
test_y_s = test_s.Species
# Support Vector Machine: petals first, then sepals
model = svm.SVC()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the SVM using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the SVM using Sepal is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))
The accuracy of the SVM using Petals is: 0.9777777777777777
The accuracy of the SVM using Sepal is: 0.8
# Logistic Regression: petals first, then sepals
model = LogisticRegression()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))
The accuracy of the Logistic Regression using Petals is: 0.6888888888888889
The accuracy of the Logistic Regression using Sepals is: 0.6444444444444445
# Decision Tree: petals first, then sepals
model = DecisionTreeClassifier()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))
The accuracy of the Decision Tree using Petals is: 0.9555555555555556
The accuracy of the Decision Tree using Sepals is: 0.6666666666666666
# K-Nearest Neighbours: petals first, then sepals
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the KNN using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the KNN using Sepals is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))
The accuracy of the KNN using Petals is: 0.9777777777777777
The accuracy of the KNN using Sepals is: 0.7333333333333333
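These results make it clear that models trained on the petal measurements are more accurate than those trained on the sepal measurements. As the heatmap in the exploratory analysis showed, the correlation between sepal width and sepal length is very low, while the correlation between petal width and petal length is very high.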
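The petal-versus-sepal comparison repeats the same fit, predict, and score steps for every model, so it can be condensed into a loop. The sketch below is one possible restructuring, not the code that produced the figures above: it assumes scikit-learn's bundled copy of the data, fixes `random_state=0` for the tree, and sets `max_iter=1000` for logistic regression, so the exact accuracies may differ from those printed earlier (library defaults have also changed across scikit-learn versions).

```python
import pandas as pd
from sklearn import metrics, svm
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Rebuild a frame with the article's column names (assumed) from scikit-learn's copy
bunch = load_iris()
df = pd.DataFrame(
    bunch.data,
    columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'],
)
df['Species'] = bunch.target_names[bunch.target]

# One split shared by both feature sets, so they are scored on the same flowers
train, test = train_test_split(df, test_size=0.3, random_state=0)

feature_sets = {
    'Petals': ['PetalWidthCm', 'PetalLengthCm'],
    'Sepals': ['SepalWidthCm', 'SepalLengthCm'],
}
models = {
    'SVM': svm.SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=3),
}

# Fit every model on every feature set and record its test accuracy
results = {}
for model_name, model in models.items():
    for set_name, cols in feature_sets.items():
        model.fit(train[cols], train['Species'])
        prediction = model.predict(test[cols])
        accuracy = metrics.accuracy_score(test['Species'], prediction)
        results[(model_name, set_name)] = accuracy
        print(f'The accuracy of the {model_name} using {set_name} is: {accuracy:.4f}')
```

Keeping the results in a dictionary keyed by model and feature set makes it straightforward to tabulate or plot the comparison afterwards.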