A First Application: Classifying Iris Species

Reposted article, 2019-08-01 10:13. Tags: python, machine learning

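This example uses the Iris dataset, a classic dataset in machine learning and statistics.

Everything in this article comes from the book 《python機器學習基礎教程》 (Introduction to Machine Learning with Python); leave a comment if you would like a copy.

Meet the Data: What Does the Dataset Contain?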

from sklearn.datasets import load_iris

data = load_iris()
print('key of load_iris:\n{}'.format(data.keys()))

Output:
key of load_iris:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

data: the measurements. Each row of data holds the sepal length, sepal width, petal length, and petal width of one flower.

from sklearn.datasets import load_iris

data = load_iris()
# print('key of load_iris:\n{}'.format(data.keys()))
print('data of load_iris:\n{}'.format(data.data[:5]))


Output:
D:\software\Anaconda3\python.exe D:/MyCode/learn/11.py
data of load_iris:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

target: the labels, that is, the class of each sample. There are three classes here, encoded as 0, 1, and 2.

from sklearn.datasets import load_iris

data = load_iris()
# print('key of load_iris:\n{}'.format(data.keys()))
# print('data of load_iris:\n{}'.format(data.data[:5]))
print('target of load_iris:\n{}'.format(data.target))


Output:

D:\software\Anaconda3\python.exe D:/MyCode/learn/11.py
target of load_iris:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

target_names: the names of the classes (three species)

from sklearn.datasets import load_iris

data = load_iris()
# print('key of load_iris:\n{}'.format(data.keys()))
# print('data of load_iris:\n{}'.format(data.data[:5]))
# print('target of load_iris:\n{}'.format(data.target))
print('target_name of load_iris:\n{}'.format(data.target_names))


Output:
D:\software\Anaconda3\python.exe D:/MyCode/learn/11.py
target_name of load_iris:
['setosa' 'versicolor' 'virginica']

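DESCR: a text description of the dataset

filename: the path of the data file

feature_names: the names of the features, one for each column of data

[Figure: how these fields relate to one another]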
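To make the relationship between these fields concrete, here is a minimal sketch that prints their shapes and looks up the species of the first flower. The variable names iris and first_label are chosen for illustration and do not appear in the original.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data is a 2-D array: one row per flower, one column per measurement.
print("data shape:", iris.data.shape)
# target is a 1-D array with one integer label per flower.
print("target shape:", iris.target.shape)
# feature_names labels the columns of data.
print("feature names:", iris.feature_names)

# target_names maps each integer label to a species name.
first_label = iris.target[0]
print("first flower:", iris.data[0], "->", iris.target_names[first_label])
```

Row i of data, entry i of target, and target_names[target[i]] all describe the same flower.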

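Training Data and Test Data

In supervised learning, the data is split into two parts: training data and test data.

The training data is what the program learns from, and it contains both the measurements and the known results (labels).

The test data is used to judge how accurate our algorithm is. Data used to assess model performance in this way is called test data, the test set, or the hold-out set.

The train_test_split function in scikit-learn shuffles the dataset and splits it. By default it puts 75% of the rows, together with their labels, into the training set, and the remaining 25%, with their labels, into the test set. The ratio between training and test set is up to you, but using 25% of the data for testing is a good rule of thumb.

For usage notes on train_test_split, see https://blog.csdn.net/mrxjh/article/details/78481578

train_test_split shuffles the dataset with a pseudorandom number generator; x_train then holds 75% of the rows and x_test the remaining 25%. The code is as follows: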

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(data['data'],data['target'],random_state=0)

print('x_train length is:', len(x_train))
print('x_test length is:', len(x_test))
print('y_train length is:', len(y_train))
print('y_test length is:', len(y_test))



Output:
D:\software\Anaconda3\python.exe D:/MyCode/learn/11.py
x_train length is: 112
x_test length is: 38
y_train length is: 112
y_test length is: 38
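The sketch below spells out the split ratio and checks that a fixed seed makes the shuffle reproducible. Passing test_size=0.25 explicitly is an addition for clarity (it matches the default behavior described above), and the names split_a, split_b, and same_split are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=0.25 matches the default; it is written out here for clarity.
# random_state fixes the shuffle so the split is reproducible.
split_a = train_test_split(iris.data, iris.target, test_size=0.25, random_state=0)
split_b = train_test_split(iris.data, iris.target, test_size=0.25, random_state=0)

x_train, x_test, y_train, y_test = split_a
print("train rows:", len(x_train), "test rows:", len(x_test))

# The same seed yields the same shuffle, hence identical test sets.
same_split = bool((x_test == split_b[1]).all())
print("same split with same seed:", same_split)
```

With 150 rows, 25% is 37.5, which is rounded up to 38 test rows and leaves 112 for training, consistent with the lengths printed above.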

Inspecting the Data

Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
from introduction_to_ml_with_python import mglearn


iris_dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris_dataset['data'],iris_dataset['target'],random_state=0)

iris_dataframe = pd.DataFrame(x_train, columns=iris_dataset.feature_names)

grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o', hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

plt.show()

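Here, mglearn can be downloaded from https://github.com/amueller/introduction_to_ml_with_python.git, or from my own repository, where the file is named 16.py. The project address is https://gitee.com/hardykay/machineLearning.git

To be clear, all of this is for learning purposes; if anything here infringes your rights, contact me and I will remove it right away. If you are unsure how to install things, please read up on the pip command and on cloning code with git.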
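If mglearn is not available, the same plot can be sketched with a built-in matplotlib colormap. This is an assumption-laden variant, not the book's code: cmap="viridis" stands in for mglearn.cm3, and the helper plot_iris_pairs is a hypothetical name introduced here.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def plot_iris_pairs():
    """Draw a scatter matrix of the training features, colored by class."""
    iris = load_iris()
    x_train, _, y_train, _ = train_test_split(
        iris.data, iris.target, random_state=0
    )
    frame = pd.DataFrame(x_train, columns=iris.feature_names)
    # cmap="viridis" is a stand-in for mglearn.cm3 (an assumption, not the
    # book's choice); any matplotlib colormap name can be used instead.
    return pd.plotting.scatter_matrix(
        frame, c=y_train, figsize=(15, 15), marker="o",
        hist_kwds={"bins": 20}, s=60, alpha=0.8, cmap="viridis",
    )


axes = plot_iris_pairs()
# plt.show()  # uncomment to display the figure interactively
```

The function returns a 4 x 4 grid of axes: histograms of each feature on the diagonal and pairwise scatter plots elsewhere.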

Building the First Model: k-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

# Build the model
knn = KNeighborsClassifier(n_neighbors=1, algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, p=2, weights='uniform')
knn.fit(x_train, y_train)

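Using KNeighborsClassifier

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1)

n_neighbors is the k in kNN: the number of nearest neighbors of the query point that take part in the classification.

weights controls how the nearest neighbors are weighted when they vote on a class. The default, 'uniform', gives every neighbor equal weight; 'distance' weights each neighbor by the inverse of its distance; you can also supply your own weighting function. For example, suppose the three points nearest the query point are one of class A and two of class B, with the A point very close to the query and the two B points somewhat farther away. With uniform weights, 3-NN assigns the query to class B. With distance weights, the A point carries more weight because it is closer, and if its weight exceeds the combined weight of the two B points, the algorithm assigns the query to class A. Which weighting to use depends on the application.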
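The A-versus-B example above can be reproduced with a tiny one-dimensional dataset. The coordinates below are made up for illustration: one A point at distance 0.1 from the query and two B points at distances 1.0 and 1.1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-A point very close to the query, two class-B points farther away.
X = np.array([[0.1], [1.0], [1.1]])
y = np.array(["A", "B", "B"])
query = np.array([[0.0]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

# Equal votes: B wins 2 to 1.
print("uniform :", uniform.predict(query)[0])
# Inverse-distance votes: A gets 1/0.1 = 10, B gets 1/1.0 + 1/1.1 (about 1.91).
print("distance:", distance.predict(query)[0])
```

The two classifiers see the same three neighbors; only the vote weighting differs, which is enough to flip the prediction.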

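algorithm selects the method used to find the nearest neighbors: 'brute', 'kd_tree', or 'ball_tree'. The kd_tree algorithm is described in detail in the kd-tree article; ball_tree is another tree-based kNN algorithm, and brute is the most direct brute-force computation. Depending on the sample size and the number of feature dimensions, each has its own advantages. The default, 'auto', picks the most suitable algorithm automatically when fitting, so 'auto' is generally fine.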
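All three search methods compute exact nearest neighbors, so they should differ in speed rather than in results. The following sketch checks this on synthetic data; the dataset, the seed, and the choice of k=3 are assumptions made for the illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
queries = rng.normal(size=(20, 2))

# Fit one classifier per search method and collect its predictions.
predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm).fit(X, y)
    predictions[algorithm] = knn.predict(queries)

all_agree = (
    np.array_equal(predictions["brute"], predictions["kd_tree"])
    and np.array_equal(predictions["brute"], predictions["ball_tree"])
)
print("all algorithms agree:", all_agree)
```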

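leaf_size is the size of the leaves (the nodes without children in the binary tree) of the tree built by kd_tree or ball_tree. In the kd-tree article every leaf held exactly one data point, but in practice a leaf may hold more than one; when the search reaches a leaf, it simply runs a brute-force computation over the points inside it. For many use cases the leaf size matters little: it affects the speed and memory use of building and querying the tree, not the predictions, so the default leaf_size=30 used in the code above is usually fine.

metric and p are the distance-function options introduced in the kNN primer article. With metric='minkowski' and a given p, the distance between two points x and y is

d(x, y) = (sum_i |x_i - y_i|^p)^(1/p)

so p=2 gives the Euclidean distance and p=1 the Manhattan distance.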
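The Minkowski distance of order p sums the p-th powers of the absolute coordinate differences and then takes the p-th root. A minimal NumPy sketch follows; the helper name minkowski_distance is hypothetical and is not part of scikit-learn's API.

```python
import numpy as np


def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))


a, b = [0.0, 0.0], [3.0, 4.0]
print("p=1 (Manhattan):", minkowski_distance(a, b, p=1))  # 3 + 4 = 7
print("p=2 (Euclidean):", minkowski_distance(a, b, p=2))  # sqrt(9 + 16) = 5
```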

Evaluating the Model

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import mglearn
from sklearn.neighbors import KNeighborsClassifier
import numpy as np


iris_dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)

iris_dataframe = pd.DataFrame(x_train, columns=iris_dataset.feature_names)

grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o', hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

# plt.show()

# Build the model
knn = KNeighborsClassifier(n_neighbors=1, algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, p=2, weights='uniform')
knn.fit(x_train, y_train)

# We put the measurements of this flower into a single row of a 2-D NumPy array,
# because scikit-learn expects its input data to be two-dimensional.
X_new = np.array([[5, 2.9, 1, 0.2]])
# print("X_new.shape: {}".format(X_new.shape))

prediction = knn.predict(X_new)
# According to the model, this new iris belongs to class 0, that is, the species setosa.
# But how do we know whether we can trust the model? We do not know the true species
# of this sample, which is the whole point of building the model!
# print("Prediction: {}".format(prediction))
# print("Predicted target name: {}".format(iris_dataset['target_names'][prediction]))

y_pred = knn.predict(x_test)
# print("Test set predictions:\n {}".format(y_pred))
print("Test set score: {:.2f}".format(knn.score(x_test, y_test)))


Output:

D:\software\Anaconda3\python.exe D:/MyCode/machineLearning/18.py
Test set score: 0.97
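The score here is the accuracy: the fraction of test flowers whose predicted species matches the true one. As a sketch, it can be computed by hand and compared with knn.score; the names manual_accuracy and builtin_accuracy are illustrative, and the classifier is built with default arguments apart from n_neighbors=1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)

y_pred = knn.predict(x_test)
# Accuracy is the share of test flowers whose predicted label equals the true one.
manual_accuracy = np.mean(y_pred == y_test)
builtin_accuracy = knn.score(x_test, y_test)

print("manual : {:.2f}".format(manual_accuracy))
print("score(): {:.2f}".format(builtin_accuracy))
```

Both lines should print the same number, since score() for a classifier reports mean accuracy on the given data and labels.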

Summary:

1. Use from sklearn.datasets import load_iris to load the iris data.

2. Use from sklearn.model_selection import train_test_split to shuffle the data and split it into training data and test data.

3. Use import pandas as pd to inspect the data.

4. Use the kNN algorithm as the learning algorithm.

    Instantiate the model:

knn = KNeighborsClassifier(n_neighbors=1, algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, p=2, weights='uniform')

    Fit it on the training data:

knn.fit(x_train, y_train)

    Predict on the test set:

y_pred = knn.predict(x_test)   # y_pred holds the predictions for the test set

    Evaluate:

knn.score(x_test, y_test)
or
print("Test set score: {:.2f}".format(knn.score(x_test, y_test)))
