Building and Visualizing Decision Trees in Python

Reposted 2020-11-02 00:05

Author | Nikhil Adithyan
Compiled by | VK
Source | Towards Data Science

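Decision Trees

Decision trees are a building block of some of the most powerful supervised learning methods in use today. A decision tree is essentially a flowchart in the form of a binary tree, in which each node splits a set of observations according to some feature variable.

The goal of a decision tree is to divide the data into groups such that every element in a group belongs to the same class. Decision trees can also be used to approximate a continuous target variable. In that case, the tree chooses its splits so that the mean squared error within each group is minimized.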
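To make "groups of the same class" and "minimal mean squared error" concrete, here is a minimal sketch of the two quantities a tree tries to drive down. The helper names `entropy` and `mse` and the toy groups are illustrative assumptions, not part of the original article.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical groups: one pure, one evenly mixed between two classes
pure_group = ['drugY', 'drugY', 'drugY', 'drugY']
mixed_group = ['drugY', 'drugY', 'drugX', 'drugX']

# A pure group has zero entropy; an even two-class mix has one bit
print(entropy(pure_group), entropy(mixed_group))

# For a continuous target, a tight group has a lower MSE than a spread-out one
print(mse([2.0, 2.0, 2.0, 2.0]), mse([1.0, 1.0, 3.0, 3.0]))
```

A classification tree prefers splits whose child groups look like `pure_group`; a regression tree prefers splits whose child groups have a small `mse`.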

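An important property of decision trees is that they are easy to interpret. You do not need to be familiar with machine learning techniques to understand what a decision tree is doing, and a plot of the tree can be read directly.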
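As a small illustration of that interpretability, scikit-learn can print a fitted tree as plain if/else rules with `export_text`. This sketch uses the bundled iris dataset as a stand-in, since it ships with scikit-learn; the dataset and the `max_depth = 2` setting are assumptions made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset bundled with scikit-learn
iris = load_iris()

# A shallow tree keeps the printed rule set short
tree = DecisionTreeClassifier(max_depth = 2, random_state = 0)
tree.fit(iris.data, iris.target)

# export_text renders the splits as indented, human-readable rules
rules = export_text(tree, feature_names = list(iris.feature_names))
print(rules)
```

Each line of the printout is either a threshold test on one feature or a leaf with its predicted class, so the whole model can be read top to bottom.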

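Pros and Cons

The advantages of decision trees are:

Decision trees can generate understandable rules.
Decision trees classify new observations without requiring much computation.
Decision trees can handle both continuous and categorical variables.
Decision trees give a clear indication of which fields are most important.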
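The last point can be checked on a fitted scikit-learn tree through its `feature_importances_` attribute. The sketch below uses synthetic data in which only the first column determines the label; the data and the column names `f0`, `f1`, `f2` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: three random features, label depends on the first one only
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

model = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
model.fit(X, y)

# Importances sum to 1; the feature that drives the label should dominate
for name, score in zip(['f0', 'f1', 'f2'], model.feature_importances_):
    print(name, round(float(score), 3))
```

On real data the importances are rarely this clear-cut, and they describe the fitted tree rather than any causal relationship.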

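The disadvantages of decision trees are:

Decision trees are less suited to estimation tasks where the goal is to predict the value of a continuous attribute.
Decision trees are prone to errors in classification problems with many classes and relatively few training examples.
Decision trees can be computationally expensive to train. At each node, every candidate splitting field must be sorted before its best split can be found. Some algorithms use combinations of fields, and a search must be made for the optimal combining weights. Pruning can also be expensive, because many candidate subtrees must be formed and compared.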
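The sort-then-scan cost mentioned in the last point can be seen in a minimal sketch of a split search on a single numeric feature. This is an illustration under stated assumptions, not scikit-learn's implementation: the helpers `gini` and `best_threshold` are hypothetical, Gini impurity is used as the split score, and the five values are the Na_to_K and Drug entries of the first rows shown later in the article.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))  # the sorting step needed for every candidate field
    best = (None, float('inf'))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

na_to_k = [25.355, 13.093, 10.114, 7.798, 18.043]
drug = ['drugY', 'drugC', 'drugC', 'drugX', 'drugY']

threshold, score = best_threshold(na_to_k, drug)
print(threshold, score)
```

A real implementation repeats this search for every feature at every node and updates the class counts incrementally instead of rebuilding the two lists, which is why training cost grows with both the number of rows and the number of fields.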

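Decision Trees in Python

Python is a general-purpose programming language that offers data scientists powerful machine learning packages and tools. In this article, we will use scikit-learn, Python's best-known machine learning package, to build a decision tree model. We will create the model with the "DecisionTreeClassifier" class provided by scikit-learn and then visualize it with the "plot_tree" function.

Step 1: Importing the packages

The main packages we use to build the model are pandas, scikit-learn and NumPy. Follow the code below to import the required packages in Python.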

import pandas as pd # data processing
import numpy as np # working with arrays
import matplotlib.pyplot as plt # visualization
from matplotlib import rcParams # figure size
from termcolor import colored as cl # text customization

from sklearn.tree import DecisionTreeClassifier as dtc # tree algorithm
from sklearn.model_selection import train_test_split # splitting the data
from sklearn.metrics import accuracy_score # model accuracy
from sklearn.tree import plot_tree # tree diagram

rcParams['figure.figsize'] = (25, 20)

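With all the packages needed for the model imported, it is time to import the data and do some EDA on it.

Step 2: Importing the data and EDA

In this step, we use the pandas package to import the data and run some EDA on it. The dataset we will build our decision tree model on is a drug dataset, recording which drug was prescribed to each patient based on certain criteria. Let's import the data in Python!

Python implementation: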

df = pd.read_csv('drug.csv')
df.drop('Unnamed: 0', axis = 1, inplace = True)

print(cl(df.head(), attrs = ['bold']))

Output:

   Age Sex      BP Cholesterol  Na_to_K   Drug
0   23   F    HIGH        HIGH   25.355  drugY
1   47   M     LOW        HIGH   13.093  drugC
2   47   M     LOW        HIGH   10.114  drugC
3   28   F  NORMAL        HIGH    7.798  drugX
4   61   F     LOW        HIGH   18.043  drugY

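Now we have a clear picture of the dataset. With the data imported, let's use the "info" method to get some basic information about it, including the number of entries, the index range, column names, non-null counts and column dtypes.

Python implementation:

df.info()

Output: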

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB

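Step 3: Processing the data

We can see that attributes such as Sex, BP and Cholesterol are categorical and stored with the object dtype. The problem is that scikit-learn's decision tree implementation does not accept string-valued ("object") features in X. These values therefore need to be converted to numeric codes: 0 or 1 for Sex and Cholesterol, and 0, 1 or 2 for BP. Let's do this in Python.

Python implementation: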

for i in df.Sex.values:
    if i  == 'M':
        df.Sex.replace(i, 0, inplace = True)
    else:
        df.Sex.replace(i, 1, inplace = True)

for i in df.BP.values:
    if i == 'LOW':
        df.BP.replace(i, 0, inplace = True)
    elif i == 'NORMAL':
        df.BP.replace(i, 1, inplace = True)
    elif i == 'HIGH':
        df.BP.replace(i, 2, inplace = True)

for i in df.Cholesterol.values:
    if i == 'LOW':
        df.Cholesterol.replace(i, 0, inplace = True)
    else:
        df.Cholesterol.replace(i, 1, inplace = True)

print(cl(df, attrs = ['bold']))

Output:

     Age  Sex  BP  Cholesterol  Na_to_K   Drug
0     23    1   2            1   25.355  drugY
1     47    1   0            1   13.093  drugC
2     47    1   0            1   10.114  drugC
3     28    1   1            1    7.798  drugX
4     61    1   0            1   18.043  drugY
..   ...  ...  ..          ...      ...    ...
195   56    1   0            1   11.567  drugC
196   16    1   0            1   12.006  drugC
197   52    1   1            1    9.894  drugX
198   23    1   1            1   14.020  drugX
199   40    1   0            1   11.349  drugX

[200 rows x 6 columns]

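We can see that the "object" values have all been replaced by integer codes. In the Cholesterol column, for example, the code maps "LOW" to 0 and every other value, such as "HIGH", to 1. Note, however, that Sex and Cholesterol are 1 in every row printed above, including rows 1 and 2, which were "M" in the raw data and were meant to become 0. A likely cause is that each loop iterates over a column while replacing values in that same column, so values that are already encoded can fall into the else branch and be overwritten. It is worth checking each column with value_counts() after encoding it. The results reported in the following steps were obtained with the encoding shown here. We are now ready to create the dependent and independent variables from the data.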
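An encoding that does not depend on loop order can be written with explicit dictionaries and `Series.map`. The sketch below is a hedged alternative, not the code that produced the article's results; the three-row `sample` frame and the dictionary names are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample with the same categorical columns as the article's data
sample = pd.DataFrame({
    'Sex': ['F', 'M', 'M'],
    'BP': ['HIGH', 'LOW', 'NORMAL'],
    'Cholesterol': ['HIGH', 'HIGH', 'HIGH'],
})

sex_codes = {'M': 0, 'F': 1}
bp_codes = {'LOW': 0, 'NORMAL': 1, 'HIGH': 2}

# map looks every value up once, so earlier replacements cannot be overwritten
sample['Sex'] = sample['Sex'].map(sex_codes)
sample['BP'] = sample['BP'].map(bp_codes)

# Same rule as the article's loop: 'LOW' becomes 0, every other level becomes 1
sample['Cholesterol'] = (sample['Cholesterol'] != 'LOW').astype(int)

print(sample)
```

If a column never contains "LOW", the last rule makes it constant, so it is worth inspecting `unique()` on the raw column before choosing the codes.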

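Step 4: Splitting the data

With the data processed into the right structure, we now define the "X" variable (independent variables) and the "Y" variable (dependent variable). Let's do this in Python.

Python implementation: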

X_var = df[['Sex', 'BP', 'Age', 'Cholesterol', 'Na_to_K']].values # independent variables
y_var = df['Drug'].values # dependent variable

print(cl('X variable samples : {}'.format(X_var[:5]), attrs = ['bold']))
print(cl('Y variable samples : {}'.format(y_var[:5]), attrs = ['bold']))

Output:

X variable samples : [[ 1.     2.    23.     1.    25.355]
 [ 1.     0.    47.     1.    13.093]
 [ 1.     0.    47.     1.    10.114]
 [ 1.     1.    28.     1.     7.798]
 [ 1.     0.    61.     1.    18.043]]
Y variable samples : ['drugY' 'drugC' 'drugC' 'drugX' 'drugY']

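We can now use the "train_test_split" function from scikit-learn to split the data into a training set and a test set, passing in the X and Y variables we defined. Follow the code to split the data in Python.

Python implementation: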

X_train, X_test, y_train, y_test = train_test_split(X_var, y_var, test_size = 0.2, random_state = 0)

print(cl('X_train shape : {}'.format(X_train.shape), attrs = ['bold'], color = 'black'))
print(cl('X_test shape : {}'.format(X_test.shape), attrs = ['bold'], color = 'black'))
print(cl('y_train shape : {}'.format(y_train.shape), attrs = ['bold'], color = 'black'))
print(cl('y_test shape : {}'.format(y_test.shape), attrs = ['bold'], color = 'black'))

Output:

X_train shape : (160, 5)
X_test shape : (40, 5)
y_train shape : (160,)
y_test shape : (40,)
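The split above is purely random. As a hedged aside, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts, which can matter when some classes are rare. The toy arrays below are hypothetical and only illustrate the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 samples of one class, 5 of another
X = np.arange(40).reshape(20, 2)
y = np.array(['drugX'] * 15 + ['drugC'] * 5)

# stratify = y preserves the 3:1 class ratio in both the training and the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts = True))
```

The article's own split does not use stratification, so this is an optional variation rather than a description of how its results were produced.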

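We now have all the components needed to build the decision tree model, so let's go ahead and build it in Python.

Step 5: Building the model and predicting

With the "DecisionTreeClassifier" provided by scikit-learn, building a decision tree is straightforward. We can then use the trained model to make predictions on our data. Finally, the quality of those predictions can be measured with the accuracy metric. Let's carry out this process in Python!

Python implementation: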

model = dtc(criterion = 'entropy', max_depth = 4)
model.fit(X_train, y_train)

pred_model = model.predict(X_test)

print(cl('Accuracy of the model is {:.0%}'.format(accuracy_score(y_test, pred_model)), attrs = ['bold']))

Output:

Accuracy of the model is 88%

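In the first line of the code, we define a variable named "model" that holds the DecisionTreeClassifier. Next, we fit the model on the training set. We then define a variable named "pred_model", which stores the model's predictions for the test data. Finally, we compare the predicted values with the actual values and obtain an accuracy of 88%.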
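Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand next to `accuracy_score` and adds a confusion matrix to show where the errors fall; the two label arrays are hypothetical and are not the article's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (one of five predictions is wrong)
y_true = np.array(['drugY', 'drugC', 'drugC', 'drugX', 'drugY'])
y_pred = np.array(['drugY', 'drugX', 'drugC', 'drugX', 'drugY'])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = (y_true == y_pred).mean()
print(manual_accuracy, accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes, in the order given by labels
labels = ['drugC', 'drugX', 'drugY']
cm = confusion_matrix(y_true, y_pred, labels = labels)
print(cm)
```

A single accuracy figure hides which classes are confused with each other, so the per-class view is a useful complement when some drugs are rarer than others.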

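Step 6: Visualizing the model

Now that we have a decision tree model, let's visualize it with the "plot_tree" function provided by scikit-learn. Follow the code to produce a tree diagram from the fitted model.

Python implementation: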

feature_names = ['Sex', 'BP', 'Age', 'Cholesterol', 'Na_to_K'] # same column order as X_var
target_names = model.classes_.tolist() # class labels in the order the model uses

plot_tree(model, 
          feature_names = feature_names, 
          class_names = target_names, 
          filled = True, 
          rounded = True)

plt.savefig('tree_visualization.png')

Output:

[Figure: plot of the fitted decision tree, saved as tree_visualization.png]

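Conclusion

Many techniques and algorithms exist for optimizing decision trees and avoiding overfitting, pruning among them. Decision trees are generally unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal tree, but their simplicity makes them a strong candidate for a wide range of applications. Before neural networks became popular, decision trees were regarded as state-of-the-art machine learning algorithms. Some ensemble models, such as random forests, are more powerful than a plain decision tree.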
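As one example of pruning, scikit-learn supports cost-complexity pruning through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. The sketch below uses synthetic noisy data, and picking the middle value of the alpha path is an arbitrary choice made only for illustration; in practice the value is usually selected by cross-validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels, so an unpruned tree tends to overfit
rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = (X[:, 0] + 0.3 * rng.standard_normal(300) > 0.5).astype(int)

# Grow the full tree, then ask for the sequence of effective alphas
full_tree = DecisionTreeClassifier(random_state = 0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Illustrative choice: refit with the alpha in the middle of the path
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state = 0, ccp_alpha = alpha).fit(X, y)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```

Larger alphas remove more of the tree, trading some training accuracy for a simpler model that is less sensitive to noise.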

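Decision trees are powerful because of their simplicity and interpretability. Decision trees and random forests are widely used in user sign-up modeling, credit scoring, failure prediction, medical diagnosis and other areas. The complete code for this article is provided below.

Complete code: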

import pandas as pd # data processing
import numpy as np # working with arrays
import matplotlib.pyplot as plt # visualization
from matplotlib import rcParams # figure size
from termcolor import colored as cl # text customization

from sklearn.tree import DecisionTreeClassifier as dtc # tree algorithm
from sklearn.model_selection import train_test_split # splitting the data
from sklearn.metrics import accuracy_score # model accuracy
from sklearn.tree import plot_tree # tree diagram

rcParams['figure.figsize'] = (25, 20)

df = pd.read_csv('drug.csv')
df.drop('Unnamed: 0', axis = 1, inplace = True)

print(cl(df.head(), attrs = ['bold']))

df.info()

for i in df.Sex.values:
    if i  == 'M':
        df.Sex.replace(i, 0, inplace = True)
    else:
        df.Sex.replace(i, 1, inplace = True)
        
for i in df.BP.values:
    if i == 'LOW':
        df.BP.replace(i, 0, inplace = True)
    elif i == 'NORMAL':
        df.BP.replace(i, 1, inplace = True)
    elif i == 'HIGH':
        df.BP.replace(i, 2, inplace = True)
        
for i in df.Cholesterol.values:
    if i == 'LOW':
        df.Cholesterol.replace(i, 0, inplace = True)
    else:
        df.Cholesterol.replace(i, 1, inplace = True)
        
print(cl(df, attrs = ['bold']))

X_var = df[['Sex', 'BP', 'Age', 'Cholesterol', 'Na_to_K']].values # independent variables
y_var = df['Drug'].values # dependent variable

print(cl('X variable samples : {}'.format(X_var[:5]), attrs = ['bold']))
print(cl('Y variable samples : {}'.format(y_var[:5]), attrs = ['bold']))

X_train, X_test, y_train, y_test = train_test_split(X_var, y_var, test_size = 0.2, random_state = 0)

print(cl('X_train shape : {}'.format(X_train.shape), attrs = ['bold'], color = 'red'))
print(cl('X_test shape : {}'.format(X_test.shape), attrs = ['bold'], color = 'red'))
print(cl('y_train shape : {}'.format(y_train.shape), attrs = ['bold'], color = 'green'))
print(cl('y_test shape : {}'.format(y_test.shape), attrs = ['bold'], color = 'green'))

model = dtc(criterion = 'entropy', max_depth = 4)
model.fit(X_train, y_train)

pred_model = model.predict(X_test)

print(cl('Accuracy of the model is {:.0%}'.format(accuracy_score(y_test, pred_model)), attrs = ['bold']))

feature_names = ['Sex', 'BP', 'Age', 'Cholesterol', 'Na_to_K'] # same column order as X_var
target_names = model.classes_.tolist() # class labels in the order the model uses

plot_tree(model, 
          feature_names = feature_names, 
          class_names = target_names, 
          filled = True, 
          rounded = True)

plt.savefig('tree_visualization.png')

Original article: https://towardsdatascience.com/building-and-visualizing-decision-tree-in-python-2cfaafd8e1bb
