Classifying the Iris Dataset with BPNN, Decision Tree, KNN, and SVM in Python

Reposted article, 2020-10-05 01:15. Tags: big data analysis / python / fundamentals

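Dataset preparation
Method 1 DecisionTree
Method 2 BPNN
Method 3 SVM
- Background
  - SVM
  - Kernel functions
- Implementation
Method 4 KNN
- KNN classifier implementation
- Classification results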

Dataset preparation

Loading the data

Load the data with sklearn's datasets module

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris_feature = iris['data']
iris_target = iris['target']
iris_target_name = iris['target_names']

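Splitting the data

Use sklearn's built-in function to split the data into a training set and a test set

The ratio of training set to test set is 2:1
To make it easier to compare the methods, we fix the random seed at 10

feature_train, feature_test, target_train, target_test = train_test_split(iris_feature, iris_target, test_size=0.33, random_state=10)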
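As a quick sanity check on the 2:1 ratio, the sketch below repeats the split and prints the resulting sizes. It only uses the sklearn calls shown above; the names n_train and n_test are introduced here purely for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
feature_train, feature_test, target_train, target_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=10)

# n_train and n_test are illustrative names, not part of the original code
n_train, n_test = len(feature_train), len(feature_test)
print(n_train, n_test)  # the 150 samples are divided into 100 and 50
```

Because the seed is fixed, every method below sees exactly the same training and test samples.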

Visualization

Use plt to visualize the data; the dataset is shown below

import matplotlib.pyplot as plt


def show():
    t0 = [index for index in range(len(iris_target)) if iris_target[index] == 0]
    t1 = [index for index in range(len(iris_target)) if iris_target[index] == 1]
    t2 = [index for index in range(len(iris_target)) if iris_target[index] == 2]
    # Targets 0, 1, 2 correspond to setosa, versicolor, virginica (see iris_target_name)
    plt.scatter(x=iris_feature[t0, 0], y=iris_feature[t0, 1], color='r', label='Iris-setosa')
    plt.scatter(x=iris_feature[t1, 0], y=iris_feature[t1, 1], color='g', label='Iris-versicolor')
    plt.scatter(x=iris_feature[t2, 0], y=iris_feature[t2, 1], color='b', label='Iris-virginica')

    # Columns 0 and 1 of the feature matrix are sepal length and sepal width
    plt.xlabel("Sepal length")
    plt.ylabel("Sepal width")
    plt.title("Dataset overview")
    plt.legend()  # without this call the labels above are never drawn
    plt.show()

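Method 1 DecisionTree

Class definition

To build the decision tree, we first define a node class

class Node:
    def __init__(self, dimension, threshold, isLeaf, left, right, species):
        self.dimension = dimension  # dimension to split on
        self.threshold = threshold  # split threshold
        self.isLeaf = isLeaf  # whether this is a leaf node
        self.left = left  # left branch (None for a leaf node)
        self.right = right  # right branch (None for a leaf node)
        self.species = species  # class (if this is a leaf node)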

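Building the decision tree

The tree is built with the CART algorithm. The building blocks are introduced bottom-up, following their dependencies

Gini value

The formula is
$$
Gini(D)=1-\sum_{v}p_v^2
$$

The smaller the Gini value, the fewer samples of other classes are mixed into the dataset, i.e. the purer it is

p_v is the frequency of class v among all samples in the dataset

The implementation is as follows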

def get_gini(label):
    """
    Compute the Gini value
    :param label: array holding the class of each sample
    :return: the Gini value
    """
    gini = 1
    dic = {}
    for target in label:
        if target in dic.keys():
            dic[target] += 1
        else:
            dic[target] = 1
    for value in dic.values():
        tmp = value / len(label)
        gini -= tmp * tmp
    return gini
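To make the formula concrete, here is a minimal sketch that evaluates the Gini value on three tiny label lists. The helper gini_value is a hypothetical stand-in written with collections.Counter; it is meant to compute the same quantity as get_gini above, not to replace it.

```python
from collections import Counter


def gini_value(labels):
    """Gini value 1 - sum(p_v^2) of a sequence of class labels (illustrative helper)."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in counts.values())


pure = gini_value([0, 0, 0, 0])   # a single class
half = gini_value([0, 0, 1, 1])   # two classes, evenly mixed
thirds = gini_value([0, 1, 2])    # three classes, evenly mixed
print(pure, half, thirds)
```

A pure set gives 0, an even two-class mix gives 1 - 2 * (1/2)^2 = 0.5, and an even three-class mix gives 1 - 3 * (1/3)^2 = 2/3, so the value grows as the set becomes more mixed.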

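Gini index

The formula is
$$
GiniIndex(D,a)=\sum_{v}\frac{|D^v|}{|D|}Gini(D^v)
$$
where D^v are the subsets obtained by splitting D on attribute a

Because every attribute of the iris dataset is a floating-point number, we need a threshold to turn each split into a binary one. The approach here is to enumerate every possible split, which takes three steps:

Sort the attribute values along the given dimension
Take the mean of each pair of adjacent values as a candidate threshold, removing duplicates
Iterate over all candidate thresholds, pick the one with the smallest Gini index, and return that Gini index together with the threshold

The implementation is as follows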

def get_gini_index_min(feature, label, dimension):
    """
    Get the minimum GiniIndex along one dimension
    :param feature: list of all attributes
    :param label: list of labels
    :param dimension: dimension (starting from 0)
    :return: gini_index (minimum GiniIndex)  threshold (corresponding threshold)
    """
    attr = feature[:, dimension]
    gini_index = 1
    threshold = 0
    attr_sort = sorted(attr)
    candicate_thre = []
    # Collect the candidate thresholds
    for i in range(len(attr_sort) - 1):
        tmp = (attr_sort[i] + attr_sort[i + 1]) / 2
        if tmp not in candicate_thre:
            candicate_thre.append(tmp)
    # Find the minimum GiniIndex
    for thre_tmp in candicate_thre:
        index_small_list = [index for index in range(len(feature)) if attr[index] < thre_tmp]
        label_small_tmp = label[index_small_list]
        index_large_list = [index for index in range(len(feature)) if attr[index] >= thre_tmp]
        label_large_tmp = label[index_large_list]
        gini_index_tmp = get_gini(label_small_tmp) * len(label_small_tmp) / len(attr) + get_gini(label_large_tmp) * len(
            label_large_tmp) / len(attr)
        if gini_index_tmp < gini_index:
            gini_index = gini_index_tmp
            threshold = thre_tmp
    print(gini_index, threshold)
    return gini_index, threshold
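The enumeration can be traced by hand on a four-sample column. The sketch below is a compact NumPy variant under one stated difference: it deduplicates the attribute values with np.unique before taking midpoints, so every candidate lies strictly between two distinct values. The names gini_value and best_threshold are hypothetical helpers for this example.

```python
import numpy as np


def gini_value(labels):
    """Gini value of a label array (illustrative helper)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(attr, labels):
    """Return (minimum weighted Gini index, threshold) for one attribute column."""
    values = np.unique(attr)                      # sorted, duplicates removed
    candidates = (values[:-1] + values[1:]) / 2   # midpoints of adjacent values
    best = (1.0, None)
    for thre in candidates:
        small = labels[attr < thre]
        large = labels[attr >= thre]
        index = (len(small) * gini_value(small) + len(large) * gini_value(large)) / len(labels)
        if index < best[0]:
            best = (index, thre)
    return best


attr = np.array([1.0, 1.2, 3.0, 3.4])
labels = np.array([0, 0, 1, 1])
gini_index, threshold = best_threshold(attr, labels)
print(gini_index, threshold)
```

The candidates are 1.1, 2.1 and 3.2. Only the middle one separates the two classes completely, so it yields a Gini index of 0 and is selected.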

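Finding the split dimension

The iris dataset has four dimensions, and we need to decide which one to split on. We therefore compute the minimum Gini index along each dimension in turn and split on the dimension whose minimum is the smallest

With the function above for computing the minimum Gini index, we can select the dimension with the smallest Gini index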

def find_dimension_by_GiniIndex(feature, label):
    """
    Find the dimension to split on
    :param feature: list of all attributes
    :param label: list of labels
    :return: gini_index_min, threshold, dimension
    """
    dimension = 0
    threshold = 0
    gini_index_min = 1
    for d in range(feature.shape[1]):
        gini_index, thre = get_gini_index_min(feature, label, d)
        if gini_index < gini_index_min:
            gini_index_min = gini_index
            dimension = d
            threshold = thre
    # Return the minimum over all dimensions, not the value of the last dimension visited
    print(gini_index_min, threshold, dimension)
    return gini_index_min, threshold, dimension

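Building the decision tree

With the tools above, the decision tree can be built recursively

The recursion stops in two cases

The dataset has only one element left, so there is nothing more to split
The dataset has many elements but they all belong to the same class, so the set is already pure (Gini value 0) and needs no further recursion

The implementation is as follows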

def devide_by_dimension_and_thre(feature, label, threshold, dimension):
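    """
    Split the dataset by threshold and dimension, returning the small set and the large set
    :param feature: list of all attributes
    :param label: list of labels
    :param threshold: split threshold
    :param dimension: split dimension
    :return: feature_small, label_small, feature_large, label_large
    """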
    attr = feature[:, dimension]
    index_small_list = [index for index in range(len(feature)) if attr[index] < threshold]
    feature_small = feature[index_small_list]
    label_small = label[index_small_list]
    index_large_list = [index for index in range(len(feature)) if attr[index] >= threshold]
    feature_large = feature[index_large_list]
    label_large = label[index_large_list]
    return feature_small, label_small, feature_large, label_large


def build_tree(feature, label):
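    """
    Build the decision tree recursively
    :param feature: list of all attributes
    :param label: list of labels
    :return: root Node of the decision tree
    """
    if len(label) > 1:
        if get_gini(label) == 0:  # Gini value of 0: every sample has the same class, so this is a leaf
            return Node(None, None, True, None, None, label[0])
        gini_index, threshold, dimension = find_dimension_by_GiniIndex(feature, label)
        feature_small, label_small, feature_large, label_large = devide_by_dimension_and_thre(feature, label,
                                                                                              threshold,
                                                                                              dimension)
        if len(label_small) == 0 or len(label_large) == 0:
            # No threshold separates the samples (identical attributes): stop with a majority-vote leaf
            labels = list(label)
            return Node(None, None, True, None, None, max(set(labels), key=labels.count))
        # Still impure: keep splitting and build the left and right branches recursively
        left = build_tree(feature_small, label_small)
        right = build_tree(feature_large, label_large)
        return Node(dimension, threshold, False, left, right, None)
    else:
        # A single sample is a leaf by itself
        return Node(None, None, True, None, None, label[0])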

Classification results

Use graphviz to visualize the trained decision tree

Validate the accuracy by predicting on the test set

def predict(root: Node, feature_line):
    """
    Predict with the tree
    :param root: root node of the decision tree
    :param feature_line: attribute values to predict for
    :return: predicted label
    """
    node = root
    while not node.isLeaf:
        if feature_line[node.dimension] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.species


def score(root, feature, label):
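    """
    Evaluate the model
    :param root: root node of the decision tree
    :param feature: list of test-set attributes
    :param label: list of test-set labels
    :return: accuracy
    """
    correct = 0
    for index in range(len(feature)):
        predicted = predict(root, feature[index])  # renamed from `type` to avoid shadowing the builtin
        if predicted == label[index]:
            correct += 1
    accuracy = correct / len(feature)
    print('correct rate is', accuracy)
    return accuracy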
    
    
res = build_tree(feature_train, target_train)
score(res, feature_test, target_test)

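The resulting accuracy is 0.96

After checking with randomly chosen seeds for the split (that is, random splits at a 2:1 training-to-test ratio), the accuracy stayed above 90% every time, which shows that the decision tree method can classify the iris dataset effectively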
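One way to cross-check a hand-written tree is to compare it with sklearn's own CART implementation on the same split. The sketch below is only a reference point, not the method used above; the names reference_tree and reference_accuracy are introduced for this example, and the accuracy it prints may differ from the figure reported above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
feature_train, feature_test, target_train, target_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=10)

# criterion='gini' selects the same impurity measure as the hand-written tree
reference_tree = DecisionTreeClassifier(criterion='gini', random_state=10)
reference_tree.fit(feature_train, target_train)
reference_accuracy = reference_tree.score(feature_test, target_test)
print('sklearn decision tree accuracy:', reference_accuracy)
```

If the two accuracies are far apart, the split search or the stopping condition of the hand-written tree is the first place to look.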

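Method 2 BPNN

BPNN (Back Propagation Neural Network) produces its output through forward propagation, passes the error back through back propagation, and uses it to optimize the network's parameters, which is how the neural network is trained.

Here we build a BP neural network directly and use it to classify the iris dataset, letting the code carry the explanation.

Network structure

Build a neural network as shown in the figure

Some definitions

Input layer: input
Hidden layer: hide
Output layer: output

Algorithm implementation

Parameter initialization

The class definition and initialization are as follows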

import numpy as np


class NeuralNetwork(object):
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        """
        :param input_nodes:  number of input-layer nodes
        :param hidden_nodes:  number of hidden-layer nodes
        :param output_nodes:  number of output-layer nodes
        :param learning_rate:  learning rate
        """
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        self.weights_input_to_hidden = np.random.normal(0.0, self.hidden_nodes ** -0.5,
                                                        (self.hidden_nodes, self.input_nodes))

        self.weights_hidden_to_output = np.random.normal(0.0, self.output_nodes ** -0.5,
                                                         (self.output_nodes, self.hidden_nodes))

        self.lr = learning_rate  # learning rate
        self.activation_function = self.sigmoid

    def sigmoid(self, x):
        return 1.0 / (1 + np.exp(-x))

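The sigmoid function is chosen as the activation function

Forward propagation

Forward propagation describes how the network's output is computed once the parameters of every node are known.

Input layer: inputs
Hidden-layer input: obtained by multiplying the input layer by the weights

hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
Hidden-layer output: obtained by passing the hidden-layer input through the activation function

hidden_outputs = self.activation_function(hidden_inputs)
Output-layer input: obtained by multiplying the hidden-layer output by the weights

final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
Output-layer output: many textbooks apply the activation function once more here. However, the expected output is the class target (0, 1, 2), whereas sigmoid only takes values in (0, 1), so we do not apply it a second time. The results show that accuracy remains good with this choice.

final_outputs = final_inputs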
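The array shapes at each step are easy to lose track of, so here is a minimal standalone sketch of one forward pass. It assumes 4 input nodes, 10 hidden nodes and 1 output node, and uses a seeded generator (rng) and a free-standing sigmoid function that are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
input_nodes, hidden_nodes, output_nodes = 4, 10, 1

weights_input_to_hidden = rng.normal(0.0, hidden_nodes ** -0.5, (hidden_nodes, input_nodes))
weights_hidden_to_output = rng.normal(0.0, output_nodes ** -0.5, (output_nodes, hidden_nodes))


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


record = [5.1, 3.5, 1.4, 0.2]            # one iris sample with four attributes
inputs = np.array(record, ndmin=2).T     # column vector, shape (4, 1)

hidden_inputs = np.dot(weights_input_to_hidden, inputs)          # shape (10, 1)
hidden_outputs = sigmoid(hidden_inputs)                          # shape (10, 1)
final_inputs = np.dot(weights_hidden_to_output, hidden_outputs)  # shape (1, 1)
final_outputs = final_inputs                                     # no activation on the output layer

print(inputs.shape, hidden_outputs.shape, final_outputs.shape)
```

Each weight matrix has shape (nodes in the next layer, nodes in the previous layer), which is why the input has to be a column vector.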

Complete code

    def train(self, inputs_list, targets_list):
        # Forward propagation
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T

        hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
        hidden_outputs = self.activation_function(hidden_inputs)

        final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
        final_outputs = final_inputs  # the target takes the values 0, 1, 2, so no activation is applied here; otherwise the output would be limited to the range 0 to 1
        # continued below

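Back propagation

Back propagation mainly uses gradient descent to correct the parameters and improve the fit

(1) Compute the total error

The total error is:
$$
E_{total}=\sum\frac{1}{2}(target-output)^2
$$

The goal of back propagation is to correct the parameters so that E_total is minimized.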

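(2) Correct the hidden-to-output parameters

Take the weight weights_hidden_to_output[0] as an example (written w[0] for convenience). To find out how much it contributes to the total error, take the partial derivative of the error with respect to it.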
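As a sketch of that derivative, assume the squared-error loss E_total = 1/2 (output - target)^2 and recall that the output layer has no activation, so output = \sum_j w[j] h_j. The symbol h_j (the j-th hidden-layer output) is notation introduced here. The chain rule then gives

$$
\frac{\partial E_{total}}{\partial w[0]}=\frac{\partial E_{total}}{\partial output}\cdot\frac{\partial output}{\partial w[0]}=(output-target)\cdot h_0
$$

and one gradient-descent step with learning rate lr is

$$
w[0]\leftarrow w[0]-lr\cdot(output-target)\cdot h_0
$$

This matches the code for this step, where final_outputs - targets plays the role of (output - target) and hidden_outputs supplies h_0.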

In the same way, the updates for all of weights_hidden_to_output can be computed

The implementation is as follows

delta_output_out = final_outputs - targets
delta_output_in = delta_output_out
delta_weight_ho_out = np.dot(delta_output_in, hidden_outputs.T)
self.weights_hidden_to_output -= (self.lr * delta_weight_ho_out)

(3) Correct the input-to-hidden parameters

This step needs the derivative of the sigmoid activation function used in between
$$
sigmoid'(x)=sigmoid(x)(1-sigmoid(x))
$$
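The identity sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) can be checked numerically. The sketch below compares the closed form with a central finite difference; the step eps and the sample points are arbitrary choices made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


x = np.linspace(-4, 4, 9)
eps = 1e-6  # finite-difference step, chosen for illustration

analytic = sigmoid(x) * (1 - sigmoid(x))                      # closed-form derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central finite difference

max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

The two agree up to floating-point noise, which is why the back-propagation code can use hidden_outputs * (1 - hidden_outputs) directly instead of differentiating again.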
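Take the weight weights_input_to_hidden[0] as an example (written w[0] for convenience). To find out how much it contributes to the total error, take the partial derivative of the error with respect to it.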
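A sketch of this derivative, again assuming the squared-error loss E_total = 1/2 (output - target)^2. The notation is introduced here for clarity: v_{ji} is the input-to-hidden weight from input x_i to hidden node j, h_j is that node's sigmoid output, and u_j is the hidden-to-output weight attached to it. The error reaches v_{ji} through the output, the hidden output and the hidden input in turn:

$$
\frac{\partial E_{total}}{\partial v_{ji}}=\frac{\partial E_{total}}{\partial output}\cdot\frac{\partial output}{\partial h_j}\cdot\frac{\partial h_j}{\partial v_{ji}}=(output-target)\cdot u_j\cdot h_j(1-h_j)\cdot x_i
$$

The factor h_j(1-h_j) is the sigmoid derivative, and the three factors correspond to delta_hidden_out, delta_hidden_in and the product with inputs.T in the code for this step.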

In the same way, the updates for all of weights_input_to_hidden can be computed

The implementation is as follows

delta_hidden_out = np.dot(self.weights_hidden_to_output.T, delta_output_in)
delta_hidden_in = delta_hidden_out * hidden_outputs * (1 - hidden_outputs)
delta_wih = np.dot(delta_hidden_in, inputs.T)
self.weights_input_to_hidden -= (self.lr * delta_wih)

Reference for the forward- and back-propagation sections

https://www.cnblogs.com/charlotte77/p/5629865.html

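Model training

# train_features / train_targets are taken to be the training split from the dataset section
train_features, train_targets = feature_train, target_train

epochs = 1000  # number of training rounds
learning_rate = 0.001
hidden_nodes = 10
output_nodes = 1
batch_size = 50
input_nodes = train_features.shape[1]
network = NeuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

for e in range(epochs):  # run epochs rounds of training
    batch = np.random.choice(len(train_features), size=batch_size)  # randomly pick 50 samples from the training set
    for record, target in zip(train_features[batch],
                              train_targets[batch]):
        network.train(record, target)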
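The code above does not show how the continuous network output is turned into a class. One plausible approach, sketched below as an assumption rather than the author's method, is to run the forward pass and round the output to the nearest of 0, 1, 2. The functions predict_classes and accuracy and the toy weights w_ih and w_ho are hypothetical names for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


def predict_classes(weights_input_to_hidden, weights_hidden_to_output, features):
    """Forward pass for a batch of samples, then round each output to a class id in 0..2."""
    inputs = np.array(features, ndmin=2).T                              # shape (n_features, n_samples)
    hidden_outputs = sigmoid(np.dot(weights_input_to_hidden, inputs))
    final_outputs = np.dot(weights_hidden_to_output, hidden_outputs)    # shape (1, n_samples)
    return np.clip(np.rint(final_outputs[0]), 0, 2).astype(int)


def accuracy(predicted, targets):
    return float(np.mean(predicted == np.asarray(targets)))


# Toy weights chosen by hand so the result is easy to verify: all hidden outputs are 0.5,
# so the network output is 10 * 0.2 * 0.5 = 1.0 for any input.
w_ih = np.zeros((10, 4))
w_ho = np.full((1, 10), 0.2)
samples = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
predicted = predict_classes(w_ih, w_ho, samples)
toy_accuracy = accuracy(predicted, [1, 2])
print(predicted, toy_accuracy)
```

With a trained network, the same helper could be called as predict_classes(network.weights_input_to_hidden, network.weights_hidden_to_output, feature_test) and compared against target_test.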

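Classification results

The loss curve over 1000 training rounds is shown below

Training-set accuracy: 0.98
Test-set accuracy: 0.96

This shows that the BPNN method can classify the iris dataset effectively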

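Method 3 SVM

Background

SVM

SVM is a supervised learning algorithm. Its main idea is to build an optimal decision hyperplane that maximizes the distance between the samples of the two classes that lie closest to the plane on either side, which gives good generalization on classification problems

Taking the figure below as an example, yellow and blue are two candidate decision hyperplanes. The distance between the nearest samples of the two classes on either side of the yellow plane is larger, so yellow can be called the optimal decision hyperplane.

"Support vectors" are the training points that lie closest to the decision surface and are therefore the hardest to classify. The four points marked with dashed lines in the figure are such points.

Once such a hyperplane has been found, suppose its equation is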
$$
W^TX+b=0
$$
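where X is the input vector, W is the weight vector, and b is the bias. Samples are then assigned to one of two classes by the following two criteria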
$$
W^TX+b>0
$$

$$
W^TX+b<0
$$
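The two criteria amount to checking the sign of W^T X + b. A minimal sketch, using a made-up hyperplane x1 + x2 - 3 = 0 and two made-up points (none of these values come from the iris data):

```python
import numpy as np


def classify(W, b, X):
    """Return +1 where W^T X + b > 0 and -1 where W^T X + b < 0 (0 exactly on the hyperplane)."""
    return np.sign(np.dot(X, W) + b).astype(int)


# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions
W = np.array([1.0, 1.0])
b = -3.0
X = np.array([[0.5, 1.0],    # 0.5 + 1.0 - 3 = -1.5, negative side
              [2.5, 2.0]])   # 2.5 + 2.0 - 3 =  1.5, positive side
sides = classify(W, b, X)
print(sides)
```

Training an SVM is the search for the W and b that make this rule correct on the training data with the largest possible margin.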

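Kernel functions

To separate non-linear data, a linear boundary is not enough. As the figure shows, the two classes cannot be separated by a straight line; a curve is needed instead

Viewed from a higher-dimensional perspective, the idea is to map the data into a higher-dimensional space in which it becomes linearly separable.

For example, after mapping from two dimensions to three, a plane can be used to separate the two classes

A function that takes vectors in the original space as input and returns the dot product of the corresponding vectors in the feature space (the transformed data space, possibly of higher dimension) is called a kernel function.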
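A small worked case may help. For the degree-2 polynomial kernel K(x, z) = (x . z)^2 in two dimensions, the matching feature map is phi(v) = (v1^2, sqrt(2) v1 v2, v2^2). The sketch below checks that both routes give the same number; the kernel choice and the sample vectors are illustrative assumptions, not the kernel used in the implementation later on.

```python
import numpy as np


def phi(v):
    """Explicit map from 2-D to 3-D: (v1, v2) -> (v1^2, sqrt(2) v1 v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # dot product after mapping to 3-D
via_kernel = poly_kernel(x, z)      # same value without ever building phi
print(explicit, via_kernel)
```

Here x . z = 4, so both values are 16. The kernel reaches the feature-space dot product without constructing the higher-dimensional vectors, which is what makes high-dimensional mappings affordable.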

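An example from the web:

In the implementation below we choose rbf as the kernel function. A radial basis function (RBF) is a scalar function that is symmetric along the radial direction; the most commonly used one is the Gaussian kernel.

The Gaussian kernel essentially measures the "similarity" between samples. In a space that describes this similarity, samples of the same class cluster together more tightly and thus become linearly separable.

1. Use a non-linear mapping to transform the data into a feature space F
2. Classify with a linear learner in the feature space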
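To see the "similarity" reading of the Gaussian kernel, the sketch below evaluates exp(-gamma * ||x - z||^2) for a sample against itself, a nearby sample and a distant one. The function name rbf_kernel_value and the three sample vectors are assumptions for this example.

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))


a = [5.1, 3.5, 1.4, 0.2]
near = [5.0, 3.4, 1.5, 0.2]
far = [6.7, 3.0, 5.2, 2.3]

same = rbf_kernel_value(a, a, gamma=0.01)
close = rbf_kernel_value(a, near, gamma=0.01)
distant = rbf_kernel_value(a, far, gamma=0.01)
print(same, close, distant)
```

The value is 1 for identical samples and decays towards 0 as they move apart. A larger gamma makes the decay faster, so each training sample influences a smaller neighborhood.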

Implementation

Implemented with the SVM model that ships with sklearn

import matplotlib.pyplot as plt
from sklearn import svm

# data is assumed to be a local module (data.py) that holds the split from the dataset section
from data import feature_train, target_train, feature_test, target_test

svm_classifier = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.01)
svm_classifier.fit(feature_train, target_train)

print("Training set:", svm_classifier.score(feature_train, target_train))
print("Test set:", svm_classifier.score(feature_test, target_test))
target_test_predict = svm_classifier.predict(feature_test)
comp = zip(target_test, target_test_predict)
print(list(comp))

plt.figure()
plt.subplot(121)
plt.scatter(feature_test[:, 0], feature_test[:, 1], c=target_test.reshape((-1)), edgecolors='k', s=50)
plt.subplot(122)
plt.scatter(feature_test[:, 0], feature_test[:, 1], c=target_test_predict.reshape((-1)), edgecolors='k', s=50)
plt.show()

The classification results are as follows

Training-set accuracy: 0.95
Test-set accuracy: 0.92

Method 4 KNN

KNN classifier implementation

Distance computation

The formula is
$$
d=\sqrt{(x_0-y_0)^2+(x_1-y_1)^2+(x_2-y_2)^2+(x_3-y_3)^2}
$$

    def get_distance(self, feature_line1, feature_line2):
        tmp = 0
        for i in range(len(feature_line1)):
            tmp += (feature_line1[i] - feature_line2[i]) ** 2
        return tmp ** 0.5
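The loop computes the ordinary Euclidean distance, so it should agree with NumPy's np.linalg.norm on the difference vector. The sketch below rewrites get_distance as a plain function (without self) for this comparison and uses two made-up points.

```python
import numpy as np


def get_distance(feature_line1, feature_line2):
    """Loop-based Euclidean distance, written as a plain function for this sketch."""
    tmp = 0
    for i in range(len(feature_line1)):
        tmp += (feature_line1[i] - feature_line2[i]) ** 2
    return tmp ** 0.5


p = np.array([0.0, 0.0, 0.0, 0.0])
q = np.array([3.0, 4.0, 0.0, 0.0])

loop_distance = get_distance(p, q)
numpy_distance = np.linalg.norm(p - q)
print(loop_distance, numpy_distance)
```

Both give 5 for this 3-4-5 example. The vectorized form is usually faster on large arrays, while the loop keeps the formula visible.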

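Choosing the class

Among the k training samples closest to the query, pick the class that occurs most often as the classification result

    def get_type(self, k, feature_line):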
        dic = {}
        for index in range(len(self.feature)):
            dist = self.get_distance(self.feature[index], feature_line)
            dic[index] = dist
        # sort
        sort_dic = sorted(dic.items(), key=lambda x: x[1], reverse=False)
        # print(sort_dic)
        vote = {}
        for i in range(k):
            index = sort_dic[i][0]
            type = self.labels[index]
            if type not in vote.keys():
                vote[type] = 1
            else:
                vote[type] += 1
        vote_rank = sorted(vote.items(), key=lambda x: x[1], reverse=True)
        # print(vote_rank)
        return vote_rank[0][0]
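The voting step on its own can be sketched with collections.Counter. The function majority_vote and the toy distances and labels below are hypothetical; they show the same "sort by distance, take k, count" idea on data small enough to check by hand.

```python
from collections import Counter


def majority_vote(distances, labels, k):
    """Return the most frequent label among the k smallest distances."""
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


distances = [0.9, 0.1, 0.4, 0.3, 2.0]
labels = [2, 0, 1, 0, 2]
winner = majority_vote(distances, labels, k=3)
print(winner)
```

The three nearest samples are at indices 1, 3 and 2 with labels 0, 0 and 1, so class 0 wins with two votes.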

Complete code

class KNNClassifier:
    def __init__(self, feature, labels):
        self.feature = feature
        self.labels = labels

    def get_distance(self, feature_line1, feature_line2):
        tmp = 0
        for i in range(len(feature_line1)):
            tmp += (feature_line1[i] - feature_line2[i]) ** 2
        return tmp ** 0.5

    def get_type(self, k, feature_line):
        dic = {}
        for index in range(len(self.feature)):
            dist = self.get_distance(self.feature[index], feature_line)
            dic[index] = dist
        # sort
        sort_dic = sorted(dic.items(), key=lambda x: x[1], reverse=False)
        # print(sort_dic)
        vote = {}
        for i in range(k):
            index = sort_dic[i][0]
            type = self.labels[index]
            if type not in vote.keys():
                vote[type] = 1
            else:
                vote[type] += 1
        vote_rank = sorted(vote.items(), key=lambda x: x[1], reverse=True)
        # print(vote_rank)
        return vote_rank[0][0]

    def predict(self, k, feature):
        res = []
        for feature_line in feature:
            res.append(self.get_type(k, feature_line))
        return res

    def score(self, k, feature, labels):
        predict_set = self.predict(k, feature)
        return len([index for index in range(len(labels)) if predict_set[index] == labels[index]]) / len(labels)

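Classification results

With k=5, the results are as follows:

Training-set accuracy: 0.97

Test-set accuracy: 0.96

Distribution plot drawn with plt:

As can be seen, KNN classifies the iris dataset well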
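As a reference point, sklearn's KNeighborsClassifier can be run on the same split with the same k. This is a cross-check under the assumption that the split matches the one in the dataset section, not the classifier written above; the names reference_knn, knn_train_accuracy and knn_test_accuracy are introduced for this example, and tie-breaking details may make the numbers differ slightly from the hand-written version.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
feature_train, feature_test, target_train, target_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=10)

# n_neighbors=5 mirrors the k=5 used above; the default metric is Euclidean distance
reference_knn = KNeighborsClassifier(n_neighbors=5)
reference_knn.fit(feature_train, target_train)
knn_train_accuracy = reference_knn.score(feature_train, target_train)
knn_test_accuracy = reference_knn.score(feature_test, target_test)
print(knn_train_accuracy, knn_test_accuracy)
```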
