Python Machine Learning Notes (2): Decision Trees - DecisionTreeClassifier - GridSearchCV

Reposted article, 2021-10-26 12:02. Category: machine learning.

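I. Contents

1. How decision trees work
2. A data preprocessing example
3. Building the decision model
4. Choosing parameters
5. Cross-validation and multi-parameter selection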

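II. How Decision Trees Work

A decision tree is a tree-like structure: each branch node tests one feature, samples are routed according to the test result, and each leaf represents a class.
1. One of the most classic machine learning models;
2. Its predictions are easy to understand and to explain;
3. It can handle both categorical and continuous data;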

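2.1 Which feature should be split on first?

Quantifying information: information entropy and Gini impurity
Information entropy: measures how disordered (mixed) the information is; the reduction in entropy produced by a split is the information gain, and the feature with the largest information gain is split on first;
Gini impurity: also measures how mixed the classes are; compared with information entropy, its advantage is that it is faster to compute, since no logarithm is needed;
Whichever feature brings the largest change in information is the one chosen for the split. If a feature takes continuous values, they have to be discretized first, for example by splitting at a threshold;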
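
The two measures above can be computed by hand in a few lines. The following is a minimal sketch, not the implementation used by scikit-learn: the function names `entropy`, `gini` and `information_gain` and the toy arrays are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(labels, feature):
    """Parent entropy minus the size-weighted entropy of the groups
    obtained by splitting on each distinct value of `feature`."""
    labels = np.asarray(labels)
    feature = np.asarray(feature)
    weighted = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical toy data: 1 = survived, 0 = did not survive
survived = np.array([1, 1, 1, 0, 0, 0, 0, 1])
sex = np.array([0, 0, 0, 1, 1, 1, 1, 0])    # separates the two classes perfectly
parch = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # each group is still a 50/50 mix

print(entropy(survived), gini(survived))
print(information_gain(survived, sex), information_gain(survived, parch))
```

On this toy data the `sex` column removes all of the entropy while `parch` removes none, so a tree built on information gain would split on `sex` first.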

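2.2 How decision trees counter overfitting

Pre-pruning: set a threshold; when a split would reduce the entropy by less than this value, stop creating branches;
Post-pruning: after the tree has been built, check the information gain at each node and remove branches that contribute too little;
Controlling the maximum depth of the tree;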
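
As a rough illustration, the three approaches can be mapped onto `DecisionTreeClassifier` parameters: `max_depth` limits the depth, `min_impurity_decrease` acts as a pre-pruning threshold, and `ccp_alpha` (available in recent scikit-learn versions) applies cost-complexity post-pruning. Cost-complexity pruning uses a different criterion from the information-gain check described above, so treat it as a stand-in. The data here is synthetic and the threshold values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely illustrative (not the Titanic set)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

trees = {
    'unconstrained': DecisionTreeClassifier(random_state=0),
    'max_depth=3': DecisionTreeClassifier(max_depth=3, random_state=0),
    # pre-pruning: a split must reduce impurity by at least this much
    'min_impurity_decrease=0.01': DecisionTreeClassifier(
        min_impurity_decrease=0.01, random_state=0),
    # post-pruning: minimal cost-complexity pruning after the tree is grown
    'ccp_alpha=0.01': DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
}

for name, tree in trees.items():
    tree.fit(X, y)
    print(name, 'depth:', tree.get_depth(), 'leaves:', tree.get_n_leaves())
```

Comparing the depth and leaf counts shows how much each setting shrinks the tree relative to the unconstrained one.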

The example in the following sections uses Titanic passenger data to predict whether a passenger survived.

III. Data Preprocessing

3.1 Load the data and select the features

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv("datasets/tt/train.csv")

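# Select the fields ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch','Fare','Embarked'] as the training data
# (survived or not, passenger class, sex, age, number of siblings/spouses aboard, number of parents/children aboard, fare, port of embarkation).
data=data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch','Fare','Embarked']].copy()  # copy so the column assignments below do not act on a view
data.info()
#<class 'pandas.core.frame.DataFrame'>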
#RangeIndex: 891 entries, 0 to 890
#Data columns (total 8 columns):
#Survived    891 non-null int64
#Pclass      891 non-null int64
#Sex         891 non-null object
#Age         714 non-null float64
#SibSp       891 non-null int64
#Parch       891 non-null int64
#Fare        891 non-null float64
#Embarked    889 non-null object
#dtypes: float64(2), int64(4), object(2)
#memory usage: 55.8+ KB

# Age has missing values; fill them with the mean
data['Age']=data['Age'].fillna(data['Age'].mean())

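# Encode Sex as a number; one-hot encode passenger class and port of embarkation
# (the 2 rows with a missing Embarked end up with 0 in all three port columns)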
data['Sex']=data['Sex'].apply(lambda x : 1 if x == 'male' else 0)
data['p1']=np.array(data['Pclass'] == 1).astype(np.int32)
data['p2']=np.array(data['Pclass'] == 2).astype(np.int32)
data['p3']=np.array(data['Pclass'] == 3).astype(np.int32)
del data['Pclass']
data['e1']=np.array(data['Embarked'] == 'S').astype(np.int32)
data['e2']=np.array(data['Embarked'] == 'C').astype(np.int32)
data['e3']=np.array(data['Embarked'] == 'Q').astype(np.int32)
del data['Embarked']

# Extract the feature matrix and the target column
data_train=data[[ 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'p1', 'p2', 'p3','e1', 'e2', 'e3']].values
data_target=data['Survived'].values.reshape(len(data),1)
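
The manual one-hot encoding above can also be written with `pandas.get_dummies`. This is only an alternative sketch on a hypothetical three-row frame that reuses the Titanic column names; the generated column names (`Pclass_1`, `Embarked_S`, ...) differ from the `p1`/`e1` names used in this article.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the Titanic data
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Pclass': [3, 1, 2],
    'Embarked': ['S', 'C', None],  # one missing port, as in the real data
})

# Sex becomes 1 for male and 0 otherwise, matching the lambda used above
toy['Sex'] = (toy['Sex'] == 'male').astype(int)

# One indicator column per distinct value; a missing value gives all zeros
encoded = pd.get_dummies(toy, columns=['Pclass', 'Embarked'], dtype=int)
print(encoded)
```

Note that `get_dummies` only creates columns for values that actually occur, so a port that is absent from the data would get no column, whereas the explicit comparisons above always create all three.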

IV. Building the Decision Model

Splitting the data

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data_train,data_target,test_size=0.2,random_state=24)

from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(x_train,y_train)
# Accuracy on the test set first, then on the training set
model.score(x_test,y_test),model.score(x_train,y_train)
>>`(0.77094972067039103, 0.9845505617977528)`

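The model is overfitting: accuracy is about 0.98 on the training set but only about 0.77 on the test set. The maximum depth and the information-gain threshold are examined in turn below.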

4.1 Setting the maximum depth:

def m_scores(depth):
    model=DecisionTreeClassifier(max_depth=depth)
    model.fit(x_train,y_train)
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    return train_score,test_score

depths = range(2,15)
scores = [m_scores(depth) for depth in depths]
scores
>>>
[(0.7837078651685393, 0.7988826815642458),
 (0.8188202247191011, 0.81564245810055869),
 (0.824438202247191, 0.81564245810055869),
 (0.8455056179775281, 0.81564245810055869),
 (0.86235955056179781, 0.87150837988826813),
 (0.8721910112359551, 0.82681564245810057),
 (0.8820224719101124, 0.78770949720670391),
 (0.90730337078651691, 0.83798882681564246),
 (0.9143258426966292, 0.81564245810055869),
 (0.9283707865168539, 0.79329608938547491),
 (0.9438202247191011, 0.77653631284916202),
 (0.9536516853932584, 0.77653631284916202),
 (0.9606741573033708, 0.77094972067039103)]

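train_s = [s[0] for s in scores]
test_s = [s[1] for s in scores]
plt.plot(depths, train_s, 'g-', label='train')  # training accuracy per depth
plt.plot(depths, test_s, label='test')          # test accuracy per depth
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
plt.show()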

4.2 Setting the minimum information-gain threshold

def v_score(value):
    model = DecisionTreeClassifier(min_impurity_decrease=value)
    model.fit(x_train, y_train)
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    return train_score,test_score

values = np.linspace(0,0.5,50)
scores = [v_score(value) for value in values]
train_s = [s[0] for s in scores]
test_s = [s[1] for s in scores]
plt.plot(values,train_s,'r')
plt.plot(values,test_s)
plt.show()

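Finding the best score

# Index of the highest test score, then the score and threshold at that index
best_index = np.argmax(test_s)
best_score = test_s[best_index]
best_value = values[best_index]
# Note: the value recorded below equals the training accuracy reported earlier for the
# unconstrained tree, so it appears to have been produced from train_s rather than test_s.
best_score,best_value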
>> `(0.9845505617977528, 0.0)`

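V. Cross-Validation and Multi-Parameter Selection

However, two problems remain:
1. When 'max_depth' and 'min_impurity_decrease' are set at the same time, how should the combination be chosen?
2. Splitting the data with 'train_test_split' is a random process; how can the variation between random splits be dealt with?

Cross-validation addresses this: every sample takes part in the evaluation, which avoids having some of the data used only for training.
For example, divide the data into 10 parts, use 1 part as test data and the other 9 as training data, then rotate through all 10 parts in turn, so that all of the data is used for evaluation.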
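
The 10-part rotation described above can be sketched with `KFold` and `cross_val_score` from scikit-learn. The data here is synthetic and the `max_depth=3` setting is an arbitrary assumption; the point is only to show that each sample lands in the test part exactly once and that one score is produced per part.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely illustrative (not the Titanic set)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 10 parts: each part serves as the test data once, the other 9 as training data
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# Collect every index that appears in a test part, across all 10 rounds
tested = sorted(int(i) for _, test_idx in folds.split(X) for i in test_idx)

# One accuracy value per round; the mean is the cross-validated score
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X, y, cv=folds)
print(len(scores), scores.mean(), scores.std())
```

Averaging over the rounds makes the estimate far less dependent on one particular random split than a single `train_test_split` call.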

from sklearn.model_selection import GridSearchCV

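# Candidate values for the two parameters
values = np.linspace(0, 0.5, 50)
depths = range(2,15)

# Parameter grid as a dictionary
param_grid = {'max_depth': depths, 'min_impurity_decrease': values}

# The classifier to search over, the parameter grid, and the number of cross-validation folds (cv=5 splits the data into 5 parts)
model = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)

# Fit on the full data set directly
model.fit(data_train, data_target)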

Inspect the best parameters: the maximum depth is 6, the minimum information-gain threshold is 0.0, and the score is about 0.81

model.best_params_,model.best_score_
>>> `({'max_depth': 6, 'min_impurity_decrease': 0.0}, 0.81369656644278443)`
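
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` object also exposes the full score table in `cv_results_` and the refitted model in `best_estimator_`. The sketch below uses synthetic data and a deliberately small grid (both are assumptions chosen to keep it quick), so its numbers are unrelated to the Titanic result above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely illustrative (not the Titanic set)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {'max_depth': [2, 4, 6],
              'min_impurity_decrease': [0.0, 0.01, 0.05]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# One row per parameter combination, with the mean and spread over the 5 folds
results = pd.DataFrame(search.cv_results_)
columns = ['param_max_depth', 'param_min_impurity_decrease',
           'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score').head())

# With the default refit=True, the best combination is retrained on all the data
best_tree = search.best_estimator_
print(search.best_params_, best_tree.get_depth())
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top-ranked combination is clearly better or just within the noise of the folds.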
