Python Machine Learning Notes (2): Decision Trees with DecisionTreeClassifier and GridSearchCV

Reposted article, 2021-10-26 12:02. Category: Machine Learning

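I. Contents

1. How the decision tree algorithm works
2. Data preprocessing example
3. Building the decision tree model
4. Choosing parameters
5. Cross-validation and multi-parameter selection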

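II. How the decision tree algorithm works

A decision tree is a tree-shaped structure. Each internal node tests one feature, each branch corresponds to an outcome of that test, and each leaf represents a class.
1. It is one of the most classic machine learning models;
2. Its predictions are easy to understand and explain;
3. It handles both categorical and continuous data.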

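2.1 Which feature should be split on first?

Quantifying information: information entropy and Gini impurity.
Information entropy measures how mixed (disordered) the class labels are. The reduction in entropy produced by a split is the information gain, and the feature with the largest information gain is split on first.
Gini impurity also measures how mixed the labels are. Its advantage over entropy is that it is faster to compute, because it needs no logarithm.
Whichever feature produces the largest change in information is the one chosen for the split. If a feature is continuous, it must first be discretized, for example by comparing it against a threshold.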
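The two measures can be made concrete with a few lines of plain Python. This is a minimal sketch using the standard definitions (entropy as the negative sum of p·log2(p), Gini impurity as 1 minus the sum of p squared); the function names and the toy labels are illustrative assumptions, not taken from the original post or from scikit-learn's internals.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * impurity(g) for g in groups)
    return impurity(labels) - weighted

# Hypothetical toy labels: 1 = survived, 0 = did not survive.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
# Split A separates the classes fairly well; split B barely helps.
split_a = [[1, 1, 1, 0], [0, 0, 0, 0]]
split_b = [[1, 1, 0, 0, 0], [1, 0, 0]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print("entropy:", entropy(parent), "gini:", gini(parent))
print("gain A:", gain_a, "gain B:", gain_b)
```

Split A has the larger gain, so a tree builder that greedily maximizes information gain would prefer it. Passing `impurity=gini` to `information_gain` ranks the same splits by Gini decrease instead.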

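2.2 How decision trees deal with overfitting

Pre-pruning: set a threshold and stop creating a branch when the split would reduce entropy by less than that threshold;
Post-pruning: once the tree has been built, check the information gain at each node and remove branches that contribute too little;
Limit the maximum depth of the tree.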
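As a rough illustration, the three ideas map onto `DecisionTreeClassifier` parameters as sketched below. Two assumptions to keep in mind: the data is synthetic (generated with `make_classification`, not the Titanic set), and scikit-learn's built-in post-pruning is cost-complexity pruning via `ccp_alpha`, which is a different criterion from the per-node information gain check described above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a fixed seed so the run is repeatable.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Unrestricted tree: grows until the training data is fitted.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: skip splits whose weighted impurity decrease is below 0.01.
pre = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0).fit(X, y)

# Depth limit: allow at most three levels of splits.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Post-pruning: grow the tree, then cut back weak subtrees (cost-complexity pruning).
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

for name, tree in [("full", full), ("pre", pre), ("shallow", shallow), ("post", post)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

The threshold values 0.01 and the depth 3 are arbitrary choices for the sketch; suitable values depend on the data.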

The example in this post uses passenger information from the Titanic to predict whether each passenger survived.

III. Data preprocessing

3.1 Load the data and select the features

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv("datasets/tt/train.csv")

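# Keep the columns ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']:
# survived or not, passenger class, sex, age, number of siblings/spouses aboard,
# number of parents/children aboard, fare, port of embarkation.
# .copy() makes the later column assignments act on an independent DataFrame.
data=data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch','Fare','Embarked']].copy()
data.info()
#<class 'pandas.core.frame.DataFrame'>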
#RangeIndex: 891 entries, 0 to 890
#Data columns (total 8 columns):
#Survived    891 non-null int64
#Pclass      891 non-null int64
#Sex         891 non-null object
#Age         714 non-null float64
#SibSp       891 non-null int64
#Parch       891 non-null int64
#Fare        891 non-null float64
#Embarked    889 non-null object
#dtypes: float64(2), int64(4), object(2)
#memory usage: 55.8+ KB

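# Age has missing values (only 714 of 891 are non-null); fill them with the mean age
data['Age']=data['Age'].fillna(data['Age'].mean())

# Encode Sex as a number, and one-hot encode passenger class and port of embarkation
# (the two rows with a missing Embarked get 0 in e1, e2 and e3)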
data['Sex']=data['Sex'].apply(lambda x : 1 if x == 'male' else 0)
data['p1']=np.array(data['Pclass'] == 1).astype(np.int32)
data['p2']=np.array(data['Pclass'] == 2).astype(np.int32)
data['p3']=np.array(data['Pclass'] == 3).astype(np.int32)
del data['Pclass']
data['e1']=np.array(data['Embarked'] == 'S').astype(np.int32)
data['e2']=np.array(data['Embarked'] == 'C').astype(np.int32)
data['e3']=np.array(data['Embarked'] == 'Q').astype(np.int32)
del data['Embarked']

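# Extract the feature matrix and the target vector
data_train=data[[ 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'p1', 'p2', 'p3','e1', 'e2', 'e3']].values
# Keep the target as a 1-D array of shape (n_samples,), the usual shape for a single-output y
data_target=data['Survived'].values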
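The hand-written indicator columns above can also be produced with `pandas.get_dummies`. The sketch below is an alternative, not the original author's approach; the four-row DataFrame is hypothetical and only reuses the Titanic column names, and the resulting columns are named `Pclass_1`, `Embarked_S` and so on rather than `p1` or `e1`.

```python
import pandas as pd

# Hypothetical four-row sample with the same column names as the Titanic data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 2, 3],
    "Embarked": ["S", "C", None, "Q"],
})

# Map Sex to 1 for male and 0 otherwise, matching the lambda used above.
sample["Sex"] = (sample["Sex"] == "male").astype(int)

# One indicator column per category; a missing Embarked yields all zeros.
encoded = pd.get_dummies(sample, columns=["Pclass", "Embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

One design difference: `get_dummies` only creates columns for categories that actually occur in the data, so a category absent from a later data set would silently drop its column.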

IV. Building the decision tree model

Splitting the data

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data_train,data_target,test_size=0.2,random_state=24)

from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(x_train,y_train)
# Accuracy on the test set first, then on the training set
model.score(x_test,y_test),model.score(x_train,y_train)
>> (0.77094972067039103, 0.9845505617977528)

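The model is overfitting: training accuracy is about 0.98 but test accuracy is only about 0.77. The maximum depth and the minimum impurity decrease threshold are examined separately below.

4.1 Setting the maximum depth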

def m_scores(depth):
    model=DecisionTreeClassifier(max_depth=depth)
    model.fit(x_train,y_train)
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    return train_score,test_score

depths = range(2,15)
scores = [m_scores(depth) for depth in depths]
scores  # one (train_score, test_score) pair for each depth from 2 to 14
>>>
[(0.7837078651685393, 0.7988826815642458),
 (0.8188202247191011, 0.81564245810055869),
 (0.824438202247191, 0.81564245810055869),
 (0.8455056179775281, 0.81564245810055869),
 (0.86235955056179781, 0.87150837988826813),
 (0.8721910112359551, 0.82681564245810057),
 (0.8820224719101124, 0.78770949720670391),
 (0.90730337078651691, 0.83798882681564246),
 (0.9143258426966292, 0.81564245810055869),
 (0.9283707865168539, 0.79329608938547491),
 (0.9438202247191011, 0.77653631284916202),
 (0.9536516853932584, 0.77653631284916202),
 (0.9606741573033708, 0.77094972067039103)]

train_s = [s[0] for s in scores]
test_s = [s[1] for s in scores]
plt.plot(depths, train_s, 'g-', label='train')
plt.plot(depths, test_s, label='test')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
plt.show()

4.2 Setting the minimum impurity decrease threshold

def v_score(value):
    model = DecisionTreeClassifier(min_impurity_decrease=value)
    model.fit(x_train, y_train)
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    return train_score,test_score

values = np.linspace(0,0.5,50)
scores = [v_score(value) for value in values]
train_s = [s[0] for s in scores]
test_s = [s[1] for s in scores]
plt.plot(values,train_s,'r',label='train')
plt.plot(values,test_s,label='test')
plt.xlabel('min_impurity_decrease')
plt.legend()
plt.show()

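Find the best score

best_index = np.argmax(test_s)
best_score = test_s[best_index]
best_value = values[best_index]
best_score,best_value
>> (0.9845505617977528, 0.0)

Note: 0.9845505617977528 is identical to the training accuracy of the unrestricted tree reported above, so this output appears to have been produced from train_s rather than test_s. A threshold of 0.0 means no pruning at all, so this figure should not be read as the best test accuracy.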

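V. Cross-validation and multi-parameter selection

Two problems remain, however:
1. When 'max_depth' and 'min_impurity_decrease' are tuned at the same time, how should the combination be chosen?
2. 'train_test_split' splits the data randomly, so how can the variation between random splits be dealt with?

Cross-validation addresses both. Every sample takes part in evaluation, which avoids the situation where some samples are only ever used for training.
For example, split the data into 10 parts, use 1 part as test data and the other 9 as training data, then rotate through all 10 parts in turn, so that all of the data is used for evaluation.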
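The rotation described above can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical name written for this illustration; it splits the sample indices in order, whereas scikit-learn's own splitters may additionally shuffle or stratify by class.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 20 samples in 10 folds: each fold tests 2 samples and trains on the other 18.
folds = list(k_fold_indices(n_samples=20, k=10))
for train_idx, test_idx in folds[:3]:
    print("train size:", len(train_idx), "test indices:", test_idx)
```

Averaging the score over all folds gives an estimate that depends much less on one particular random split than a single `train_test_split` does.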

from sklearn.model_selection import GridSearchCV

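# Candidate values for each parameter
values = np.linspace(0, 0.5, 50)
depths = range(2,15)

# Parameter grid as a dictionary
param_grid = {'max_depth': depths, 'min_impurity_decrease': values}

# Pass the estimator, the parameter grid and the number of folds (cv=5 splits the data into 5 parts)
model = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)

# Fit on the full data set; GridSearchCV does the splitting internally
model.fit(data_train, data_target)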

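Inspect the best parameters: the best maximum depth is 6, the best minimum impurity decrease threshold is 0.0, and the cross-validated score is about 0.81.

model.best_params_,model.best_score_
>>> ({'max_depth': 6, 'min_impurity_decrease': 0.0}, 0.81369656644278443)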
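Because the search above is fitted on all of the data, `best_score_` is a cross-validated estimate rather than a score on data the search never saw. One possible variation, sketched here as an assumption and not as the original author's workflow, is to hold out a test set first, run the grid search on the remainder, and then score `best_estimator_` on the held-out part. The data is synthetic and the grid is deliberately small so that the sketch runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; data_train / data_target from above could be used instead.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=24)

# A small illustrative grid (6 depths x 6 thresholds).
param_grid = {
    "max_depth": list(range(2, 8)),
    "min_impurity_decrease": np.linspace(0, 0.05, 6),
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)  # cross-validation only ever sees the training part

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_te, y_te)
print(search.best_params_, search.best_score_, test_accuracy)
```

The per-combination results are also available in `search.cv_results_`, which is useful for checking how close the runner-up parameter settings were.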
