Sklearn Library Example: Decision Tree Classification

This article is a repost. 2016-12-08 12:55. Tags: Machine Learning / Data Mining / Python

The scikit-learn documentation on decision trees: http://scikit-learn.org/stable/modules/tree.html

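1. About decision trees: A decision tree is a non-parametric supervised learning method used mainly for classification and regression. The goal is to build a model that predicts the value of a target variable by learning decision rules inferred from the data features. For example, a decision tree can approximate a sine curve with a series of if-then-else decision rules.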
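The sine-curve approximation can be sketched as follows. This is only an illustration: the synthetic data, the random seed, and the two max_depth values (2 and 5) are assumptions chosen for the example, not taken from the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (an assumption for this sketch): 80 points with y = sin(x)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()

# A shallow tree yields a coarse step function, a deeper tree a finer one
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
deeper = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
pred_shallow = shallow.predict(X_test)
pred_deeper = deeper.predict(X_test)

# Every prediction is the constant stored in one leaf, so the number of
# distinct predicted values cannot exceed the number of leaves (2**max_depth)
print(len(np.unique(pred_shallow)), len(np.unique(pred_deeper)))
```

Each leaf corresponds to one chain of if-then-else tests on x, which is why the fitted curve is piecewise constant.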

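Advantages of decision trees:

Simple to understand and interpret; trees can be visualized.
Little data preparation is needed. Other techniques often require data normalization, creation of dummy variables, and removal of blank values. (Note: at the time of writing, this module did not support missing values.)
The cost of using the tree (predicting) is logarithmic in the number of data points used to train it.
Able to handle both numerical and categorical data.
Able to handle multi-output problems.
Uses a white-box model (a model whose internal structure can be inspected directly). If a given situation is observable in the model, the result is easily explained with Boolean logic. In a black-box model (for example, an artificial neural network), by contrast, results may be hard to interpret.
The model can be validated with statistical tests, which makes it possible to account for its reliability.
Performs well even when its assumptions are somewhat violated by the true model that generated the data.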
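To make the white-box point concrete, the learned rules can be printed as plain text. This is a minimal sketch that assumes a scikit-learn version providing sklearn.tree.export_text (0.21 or later); max_depth=2 is chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately small tree so that the printed rule set stays readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold tests ending in a class
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one threshold test on a feature, and each leaf line names the predicted class, so any single prediction can be traced by hand.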

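Disadvantages of decision trees:

Decision-tree learners can create over-complex rules that do not generalize well, that is, overfitting. Pruning, setting a minimum number of samples per leaf node, or setting a maximum tree depth is sometimes necessary to avoid this.
Decision trees can be unstable, because small variations in the data may produce a completely different tree. Using trees within an ensemble mitigates this.
Learning an optimal decision tree is an NP-complete problem. Practical learning algorithms are therefore heuristic, such as greedy algorithms that make the locally optimal choice at each node, and they cannot guarantee a globally optimal tree. Training multiple trees on randomly sampled features and samples mitigates this.
Some concepts are hard to learn because decision trees do not express them easily, such as XOR, parity, or multiplexer problems.
Decision-tree learners produce biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting.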
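Two of the mitigations above can be sketched as follows: limiting tree growth, and averaging many randomized trees. The parameter values are arbitrary assumptions for illustration, and RandomForestClassifier is just one ensemble option in scikit-learn; the article does not name a specific one.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Unconstrained tree: grows until the leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained tree: capped depth and a minimum number of samples per leaf
limited = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

print("full tree:   ", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("limited tree:", limited.get_depth(), "levels,", limited.get_n_leaves(), "leaves")

# Ensemble: each tree sees a bootstrap sample and random feature subsets
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print("trees in forest:", len(forest.estimators_))
```

The constrained tree usually fits the training data slightly worse, which is the intended trade-off; whether it generalizes better has to be checked on held-out data.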

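2. Classification

DecisionTreeClassifier is capable of multi-class classification. It takes two arrays as input: an array X of shape [n_samples, n_features] holding the training samples, and an array Y of shape [n_samples] holding the class labels of those samples.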

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
 
# Predict the class of a new sample
clf.predict([[2., 2.]])
# Predict the probability of each class for the same sample
clf.predict_proba([[2., 2.]])

Next, we use the iris dataset:

from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
 
# export the tree in Graphviz format using the export_graphviz exporter
# (the exporter writes to the open file handle; its return value is not needed)
with open("iris.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)
 
# predict the class of samples
clf.predict(iris.data[:1, :])
# the probability of each class
clf.predict_proba(iris.data[:1, :])
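The exported tree is easier to read when nodes carry feature and class names. The sketch below assumes the feature_names, class_names, filled, and rounded parameters of export_graphviz, and returns the DOT source as a string instead of writing a file; random_state=0 is added only for reproducibility.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# With out_file=None the DOT source is returned as a string
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=list(iris.feature_names),  # label splits with feature names
    class_names=list(iris.target_names),     # label nodes with class names
    filled=True,                             # color nodes by majority class
    rounded=True,
)

# Show the beginning of the generated DOT source
print(dot_data[:300])
```

The resulting string can be written to a .dot file and rendered with the dot tool in the same way as iris.dot.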

Install Graphviz and add it to the PATH environment variable, then use dot to create a PDF file: dot -Tpdf iris.dot -o iris.pdf

For instructions on installing Graphviz, see: http://blog.csdn.net/lanchunhui/article/details/49472949

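After running these steps, the folder contains two files: iris.dot and iris.pdf. Opening iris.pdf shows the rendered decision tree.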

Alternatively, install the pydotplus package (pip install pydotplus) and generate the PDF directly in Python:

import pydotplus 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = pydotplus.graph_from_dot_data(dot_data) 
graph.write_pdf("iris.pdf")

Note: this code raised an error when I ran it, and I was not able to resolve it despite spending a long time on it. See: http://stackoverflow.com/questions/31209016/python-pydot-and-decisiontree/36456995#36456995
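If the pydotplus route keeps failing, one alternative that avoids Graphviz altogether is to draw the tree with matplotlib. This is a sketch that assumes a scikit-learn version providing sklearn.tree.plot_tree (0.21 or later); the output file name iris_tree.png, the figure size, and max_depth=3 are arbitrary choices, not part of the original article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Draw the fitted tree onto a matplotlib axes and save it as an image
fig, ax = plt.subplots(figsize=(10, 6))
annotations = tree.plot_tree(
    clf,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
fig.savefig("iris_tree.png")
```

This does not fix the pydotplus error itself; it only sidesteps the dependency on the external dot executable.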

The following is the demo code from the scikit-learn website:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis("tight")

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        # Row indices of the samples that belong to class i
        idx = np.where(y == i)[0]
        # c is a single named color here, so no colormap is needed
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i])

    plt.axis("tight")

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()

Running the code produces a 2x3 grid of decision surfaces, one for each pair of features.
