Python Machine Learning (4): Classification Algorithms - Decision Trees


 

I. The Principle of Decision Trees

The idea behind decision trees is very down-to-earth: it is the if-then conditional branching familiar from programming. The earliest decision trees were classification methods that used exactly this kind of structure to partition data.

II. Real-World Examples of Decision Trees

A blind date

 
[Figure: blind-date decision tree]

Daughter: How old is he?
Mother: 26.
Daughter: Is he handsome?
Mother: Quite handsome.
Daughter: Does he have a high income?
Mother: Not very high; about average.
Daughter: Is he a civil servant?
Mother: Yes, he works at the tax bureau.
Daughter: Fine, I'll go meet him.

Whether a bank should grant a loan

Bank president: Does the applicant own a house?
Clerk: Yes.
Bank president: Then we can consider the loan.
Clerk: And if they don't own a house?
Bank president: Do they have a stable job?
Clerk: Yes.
Bank president: Then we can still consider the loan.
Clerk: And if not?
Bank president: No house and no stable job either? Then what would we be lending for?
Clerk: Understood.

 

 
[Figure: loan decision tree]

Predicting whether a football team will win the championship

 
[Figure: championship-prediction decision tree]

III. Information Theory Basics

Information entropy:

Suppose we are guessing which of 32 football teams will win the championship. Number the teams 1 to 32 and ask questions such as: "Is the champion among teams 1-16?" Continuing this binary search, only five questions are needed to pin down the answer.
With 32 teams and 5 questions asked, the amount of information is defined as 5 bits: log2(32) = 5. The bit is the unit of information.
With 64 teams, binary search needs 6 questions, so the information content is log2(64) = 6 bits.
This required number of questions is what the technical term "information entropy" refers to; its unit is the bit.
The formula is:

H(D) = -∑_k p_k · log2(p_k)

where p_k is the proportion of the samples in D that belong to class k.

 

The role of information entropy:
When the decision tree is grown, the feature whose split reduces entropy the most (the one with the largest information gain, defined below) is placed at the root, and less informative features end up further down, toward the leaves; the tree is generated by ordering features from most to least informative.

Conditional entropy:

The conditional entropy H(D|A) measures the uncertainty that remains about the random variable D once the random variable A is known.
The formula is:

H(D|A) = ∑_i P(A = a_i) · H(D | A = a_i)

Informally: it is the amount of information still needed to describe D once A is known.

Information gain:

The information gain g(D, A) of feature A on training set D is defined as the difference between the information entropy H(D) of the set D and the conditional entropy H(D|A) of D given feature A.
The formula is:

g(D, A) = H(D) - H(D|A)

How should information gain be understood? It measures how much knowing the feature X reduces the uncertainty about the class Y. Put simply: the more you know, the less remains unknown (uncertain).
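To make the three formulas concrete, here is a minimal sketch (mine, not from the original article) that computes H(D), H(D|A) and g(D, A) for a toy loan dataset with a single binary feature, "owns a house":

from collections import Counter
from math import log2

def entropy(labels):
    # H(D) = -sum over classes k of p_k * log2(p_k)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    # H(D|A) = sum over values a of P(A=a) * H(D|A=a)
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Toy data: does the applicant own a house? -> was the loan approved?
owns_house = ['yes', 'yes', 'no', 'no', 'no', 'no']
approved   = ['yes', 'yes', 'yes', 'no', 'no', 'no']

h_d = entropy(approved)                           # H(D)   = 1.0 bit
h_da = conditional_entropy(owns_house, approved)  # H(D|A) ~ 0.54 bits
print(h_d, h_da, h_d - h_da)                      # g(D,A) ~ 0.46 bits

A feature with a larger g(D, A) removes more uncertainty, which is exactly why it is placed closer to the root.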

IV. The Decision Tree API

Decision tree:

sklearn.tree.DecisionTreeClassifier


class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
Decision tree classifier
criterion: 'gini' (Gini impurity) by default; 'entropy' selects information gain instead
max_depth: maximum depth of the tree
random_state: random seed

Methods:
dec.fit(X, y): build a decision tree classifier from the training set (X, y)
dec.apply(X): return the index of the leaf that each sample is predicted as
dec.cost_complexity_pruning_path(X, y): compute the pruning path during minimal cost-complexity pruning
dec.decision_path(X): return the decision path in the tree
dec.get_depth(): return the depth of the tree
dec.get_n_leaves(): return the number of leaves of the tree
dec.get_params(): return the parameters of the estimator
dec.predict(X): predict the class (or regression value) for X
dec.predict_log_proba(X): predict the log of the class probabilities of X
dec.predict_proba(X): predict the class probabilities of X
dec.score(X, y): return the mean accuracy on the test data X and labels y
dec.set_params(min_samples_split=3): set the parameters of the estimator
Here X is the sample data and y is the corresponding target labels.

Building a decision tree and saving it locally:

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
li = load_iris()
dec = DecisionTreeClassifier()
# Build a decision tree classifier from the training set (X, y)
dec.fit(li.data, li.target)
# Predict the classes of X
dec.predict(li.data)
# Mean accuracy on data X and labels y
dec.score(li.data, li.target)
# Save the tree to the file tree.dot
tree.export_graphviz(dec, out_file='tree.dot')

Contents of tree.dot:


digraph Tree {
node [shape=box] ;
0 [label="X[2] <= 2.45\ngini = 0.667\nsamples = 150\nvalue = [50, 50, 50]"] ;
1 [label="gini = 0.0\nsamples = 50\nvalue = [50, 0, 0]"] ;
.....
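The .dot file can be rendered to an image; a sketch assuming the graphviz Python package and the dot executable are installed (see the environment setup section below):

import graphviz
with open('tree.dot') as f:
    src = graphviz.Source(f.read())
src.format = 'png'
src.render('tree')  # writes tree.png

Equivalently, on the command line: dot -Tpng tree.dot -o tree.png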

V. A Worked Example

1. Load the data and split it into training and test sets

from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import graphviz
import pandas as pd
data = load_wine()
# Split first, then assemble the training split into a DataFrame for inspection
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=30)
dataFrame = pd.concat([pd.DataFrame(X_train), pd.DataFrame(y_train)], axis=1)
print(dataFrame)

 

 
[Figure: the assembled training DataFrame]

2. Instantiate the model

clf = tree.DecisionTreeClassifier(criterion='gini'
                                 ,max_depth=None
                                 ,min_samples_leaf=1
                                 ,min_samples_split=2
                                 ,random_state=0
                                 ,splitter='best'
                                 )

3. Fit the training data
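Training is a single fit call, the same pattern used later in this article:

clf = clf.fit(X_train, y_train)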

4. Score on the test set

score = clf.score(X_test,y_test)

 

5. Draw the decision tree with graphviz

feature_name = ['alcohol','malic acid','ash','alcalinity of ash','magnesium','total phenols','flavanoids','nonflavanoid phenols','proanthocyanins','color intensity','hue','od280/od315 of diluted wines','proline']
dot_data  = tree.export_graphviz(clf
                             ,out_file=None
                             ,feature_names=feature_name
                             ,class_names=["red wine","white wine","grape wine"]  # display aliases for the three classes
                             ,filled=True
                             ,rounded=True
                            )
graph = graphviz.Source(dot_data)
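In a Jupyter notebook the graph object renders inline; to also write the drawing to disk, the Source object can be rendered to a file (the output name here is arbitrary):

graph.format = 'png'
graph.render('wine_tree')  # writes wine_tree.png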

 

 
[Figure: the rendered wine decision tree]

6. Examining feature importances

clf.feature_importances_
[*zip(feature_name,clf.feature_importances_)]

 

[('alcohol', 0.0),
('malic acid', 0.0),
('ash', 0.023800041266200594),
('alcalinity of ash', 0.0),
('magnesium', 0.0),
('total phenols', 0.0),
('flavanoids', 0.14796731056604398),
('nonflavanoid phenols', 0.0),
('proanthocyanins', 0.023717402234026283),
('color intensity', 0.3324466124446747),
('hue', 0.021345662010623646),
('od280/od315 of diluted wines', 0.0),
('proline', 0.45072297147843077)]
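To read off the influential features at a glance, the pairs can be sorted by importance (a small convenience sketch):

for name, imp in sorted(zip(feature_name, clf.feature_importances_), key=lambda t: t[1], reverse=True):
    print(f'{name}: {imp:.3f}')  # proline and color intensity dominate here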

7. Parameter stability and randomness

Question: why do different people get different scores from the same data, and why can even the same person, running the same data, get a different score each time?

Answer: because building the tree involves randomness: sklearn evaluates candidate features in a random order at each split (and the train/test split itself is random if no seed is fixed), so repeated runs can produce different trees and different scores. The random_state parameter exists to pin this down: once random_state is set, every run gives the same result. Which value should it take? There is no rule; you can try values from 0 to n and keep whichever yields the highest score.

clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=30)
clf = clf.fit(X_train, y_train)
score = clf.score(X_test, y_test) # mean prediction accuracy
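Following the advice above, the seed search itself can be automated; a quick (admittedly ad hoc) sketch:

best_state, best_score = 0, 0
for state in range(50):  # candidate seeds 0..49, an arbitrary range
    candidate = tree.DecisionTreeClassifier(criterion="entropy", random_state=state)
    candidate.fit(X_train, y_train)
    s = candidate.score(X_test, y_test)
    if s > best_score:
        best_state, best_score = state, s
print(best_state, best_score)

Note that picking the seed that maximizes the test score is itself a mild form of overfitting to the test set; cross-validation would be the cleaner way to compare settings.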

8. Tuning the pruning parameters

Why prune?

With their default values the pruning parameters let the tree grow without limit, and on some datasets such trees become enormous and consume a great deal of memory. Worse, unlimited growth causes overfitting: the model performs very well on the training set but only moderately on the test set. It does so well on the training set precisely because it has also memorized the training set's noise. So we tune the pruning parameters. The commonly used ones are:

  • min_samples_leaf: the minimum number of samples a leaf must contain; a split that would leave fewer is not made. Too small invites overfitting; too large stops the model from learning the data. A starting value of 5 is suggested.

  • min_samples_split: the minimum number of samples a node must contain before it may be split; nodes with fewer samples are not split further.

  • max_depth: the maximum depth of the tree; branches beyond this depth are cut off. Start at 3 and judge from the effect whether to go deeper: if the accuracy at depth 3 is only 50%, increase the depth; if it is already 80-90%, consider capping the depth there and saving computation.

clf = tree.DecisionTreeClassifier(min_samples_leaf=10
                                , min_samples_split=20
                                , max_depth=3)
clf.fit(X_train, y_train)
dot_data = tree.export_graphviz(clf
                               ,feature_names=feature_name
                               ,class_names=["red wine","white wine","grape wine"]  # display aliases for the three classes
                               ,filled=True
                               ,rounded=True)
graphviz.Source(dot_data)

 

 
[Figure: the pruned tree]

9. Finding the optimal pruning parameter

import matplotlib.pyplot as plt
score_list = []
for i in range(1,11):
    clf = tree.DecisionTreeClassifier(max_depth=i
                                    ,criterion="entropy"
                                    ,random_state=30
                                    ,splitter="random")
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    score_list.append(score)
plt.plot(range(1,11),score_list,color="red",label="max_depth",linewidth=2)
plt.legend()
plt.show()

 

 
[Figure: test score versus max_depth]

 

As the plot shows, the score already peaks at max_depth = 3; increasing the depth further only adds risk of overfitting.
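Besides scanning max_depth by hand, cost_complexity_pruning_path (listed in the API section above) gives a more principled set of candidate prunings; a sketch on the same train/test split:

path = tree.DecisionTreeClassifier(random_state=30).cost_complexity_pruning_path(X_train, y_train)
ccp_scores = []
for alpha in path.ccp_alphas:
    pruned = tree.DecisionTreeClassifier(random_state=30, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    ccp_scores.append(pruned.score(X_test, y_test))
print(max(ccp_scores))

Each ccp_alpha corresponds to pruning away the weakest link of the fully grown tree; larger alphas give smaller trees.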

VI. Advantages and Disadvantages of Decision Trees

Advantages

  • Simple to understand and to interpret; trees can be visualized.
  • Requires little data preparation, whereas other techniques often require data normalization.

Disadvantages

  • Decision tree learners can create over-complex trees that do not generalize the data well; this is called overfitting.
  • Decision trees can be unstable: small variations in the data can result in a completely different tree being generated.

Improvements

  • Pruning (the CART algorithm)
  • Random forests

VII. Environment Setup

On Windows, graphviz cannot be installed with pip alone. There are many workarounds online, and the exact error varies from machine to machine; here is how I solved it:

Download and install Graphviz
Set the environment variable
Load Graphviz from Python

Download:
The official Graphviz Windows builds are at https://graphviz.gitlab.io/_pages/Download/Download_windows.html; download and install following the prompts.

I created a graphviz folder inside my anaconda directory and installed it there, which makes it easy to find.

Setting the environment variable:
Without it you will see an error like:

'dot' is not recognized as an internal or external command, operable program or batch file.

Go to: My Computer - System Properties - Advanced system settings - Advanced - Environment Variables - System variables - Path, and edit Path.

Append the bin directory of the Graphviz installation to Path (append it, do not overwrite the variable).

In my case that is: D:\Anaconda\graphviz\bin

Testing the installation:
1. Win+R, open a command prompt
2. Run the command: dot -version
3. If version information is printed, the setting has taken effect.

Loading graphviz
Now the pip installation works:

pip install graphviz
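After installation, a quick sanity check from Python confirms that both the binding and the dot executable are visible (a minimal sketch):

import graphviz
print(graphviz.version())           # version of the dot executable found on PATH
g = graphviz.Digraph()
g.edge('has house?', 'grant loan')  # a trivial one-edge graph
g.format = 'png'
g.render('check')                   # should write check.png without raising ExecutableNotFound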

VIII. Complete Code

1. Iris implementation:

Code:

# coding: utf-8

from sklearn import datasets
import math
import numpy as np


def getInformationEntropy(arr, leng):
    # Entropy (natural log) of a 3-class count vector `arr` over `leng` samples;
    # a class with a zero count contributes nothing because log(1) = 0
    return -(arr[0] / leng * math.log(arr[0] / leng if arr[0] > 0 else 1) + arr[1] / leng * math.log(
        arr[1] / leng if arr[1] > 0 else 1) + arr[2] / leng * math.log(arr[2] / leng if arr[2] > 0 else 1))


# Discretize the values of feature `index`: sort the samples by that feature
# and greedily search for the two cut points that minimize the total entropy
def discretization(index):
    feature1 = np.array([iris.data[:, index], iris.target]).T
    feature1 = feature1[feature1[:, 0].argsort()]

    counter1 = np.array([0, 0, 0])
    counter2 = np.array([0, 0, 0])

    resEntropy = 100000
    for i in range(len(feature1[:, 0])):

        counter1[int(feature1[i, 1])] = counter1[int(feature1[i, 1])] + 1
        counter2 = np.copy(counter1)

        for j in range(i + 1, len(feature1[:, 0])):

            counter2[int(feature1[j, 1])] = counter2[int(feature1[j, 1])] + 1
            # Greedy search for the best pair of cut points (i, j)
            if i != j and j != len(feature1[:, 0]) - 1:

                total = (i + 1) * getInformationEntropy(counter1, i + 1) + (j - i) * getInformationEntropy(
                    counter2 - counter1, j - i) + (length - j - 1) * getInformationEntropy(np.array(num) - counter2,
                                                                                           length - j - 1)
                if total < resEntropy:
                    resEntropy = total
                    res = np.array([i, j])
    res_value = [feature1[res[0], 0], feature1[res[1], 0]]
    print(res, resEntropy, res_value)
    return res_value


# Find suitable split values for every feature
def getRazors():
    a = []
    for i in range(len(iris.feature_names)):
        print(i)
        a.append(discretization(i))

    return np.array(a)


# Randomly split the data: 80% training set, 20% test set
def divideData():
    completeData = np.c_[iris.data, iris.target.T]
    np.random.shuffle(completeData)
    trainData = completeData[range(int(length * 0.8)), :]
    testData = completeData[range(int(length * 0.8), length), :]
    return [trainData, testData]


def getEntropy(counter):
    # Entropy (natural log) of an arbitrary count vector
    res = 0
    denominator = np.sum(counter)
    if denominator == 0:
        return 0
    for value in counter:
        if value == 0:
            continue
        res += value / denominator * math.log(value / denominator if value > 0 and denominator > 0 else 1)
    return -res


def findMaxIndex(dataSet):
    # Index of the largest element (the majority class)
    maxIndex = 0
    maxValue = -1
    for index, value in enumerate(dataSet):
        if value > maxValue:
            maxIndex = index
            maxValue = value
    return maxIndex


def recursion(featureSet, dataSet, counterSet):
    # If every remaining sample belongs to a single class, return that class
    if counterSet[0] == 0 and counterSet[1] == 0 and counterSet[2] != 0:
        return iris.target_names[2]
    if counterSet[0] != 0 and counterSet[1] == 0 and counterSet[2] == 0:
        return iris.target_names[0]
    if counterSet[0] == 0 and counterSet[1] != 0 and counterSet[2] == 0:
        return iris.target_names[1]

    if len(featureSet) == 0:
        return iris.target_names[findMaxIndex(counterSet)]
    if len(dataSet) == 0:
        return []

    res = 1000
    final = 0
    bestSets = bestCounters = None
    for feature in featureSet:
        i = razors[feature][0]
        j = razors[feature][1]
        set1, set2, set3 = [], [], []
        counter1 = [0, 0, 0]
        counter2 = [0, 0, 0]
        counter3 = [0, 0, 0]
        for data in dataSet:
            index = int(data[-1])
            if data[feature] < i:
                set1.append(data)
                counter1[index] = counter1[index] + 1
            elif i <= data[feature] <= j:
                set2.append(data)
                counter2[index] = counter2[index] + 1
            else:
                set3.append(data)
                counter3[index] = counter3[index] + 1

        # Weighted average entropy after a three-way split on this feature
        a = (len(set1) * getEntropy(counter1) + len(set2) * getEntropy(counter2) + len(set3) * getEntropy(
            counter3)) / len(dataSet)

        if a < res:
            res = a
            final = feature
            # Remember the partition produced by the winning feature
            bestSets = (set1, set2, set3)
            bestCounters = (counter1, counter2, counter3)

    # Recurse on the selected feature's partition; give each branch its own copy
    # of the remaining features so that sibling branches stay independent
    remaining = [f for f in featureSet if f != final]
    child = [0, 0, 0, 0]
    child[0] = final
    child[1] = recursion(list(remaining), bestSets[0], bestCounters[0])
    child[2] = recursion(list(remaining), bestSets[1], bestCounters[1])
    child[3] = recursion(list(remaining), bestSets[2], bestCounters[2])

    return child


def judge(data, tree):
    # Walk the nested-list tree until a class name (a leaf) is reached
    root = "unknown"
    while len(tree) > 0:
        if isinstance(tree, str) and tree in iris.target_names:
            return tree
        root = tree[0]
        if isinstance(root, str):
            return root

        if isinstance(root, int):
            if data[root] < razors[root][0] and tree[1] != []:
                tree = tree[1]
            elif tree[2] != [] and (tree[1] == [] or (data[root] >= razors[root][0] and data[root] <= razors[root][1])):
                tree = tree[2]
            else:
                tree = tree[3]
    return root


if __name__ == '__main__':

    iris = datasets.load_iris()
    num = [0, 0, 0]  # number of samples in each class
    for label in iris.target:
        num[int(label)] = num[int(label)] + 1

    length = len(iris.target)
    [trainData, testData] = divideData()

    razors = getRazors()

    tree = recursion(list(range(len(iris.feature_names))), trainData,
                     [np.sum(trainData[:, -1] == 0), np.sum(trainData[:, -1] == 1), np.sum(trainData[:, -1] == 2)])
    print("Tree built from this training split: ", tree)
    index = 0
    right = 0
    for data in testData:
        result = judge(testData[index], tree)
        truth = iris.target_names[int(testData[index][-1])]

        print("result is ", result, "  truth is ", truth)
        index = index + 1
        if result == truth:
            right = right + 1
    print("accuracy: ", right / index)

Execution result:

0
[ 54 148] 38.11466484080816 [5.5, 7.7]
1
[ 75 148] 60.48021792365705 [3.0, 4.2]
2
[ 49 148] 4.6815490923045076 [1.9, 6.7]
3
[ 49 148] 4.6815490923045076 [0.6, 2.5]
Tree built from this training split:  [3, 'setosa', [0, [], [1, [], [2, 'setosa', 'versicolor', 'setosa'], 'setosa'], 'virginica'], 'setosa']
result is  versicolor   truth is  virginica
result is  setosa   truth is  setosa
result is  versicolor   truth is  versicolor
result is  versicolor   truth is  virginica
result is  versicolor   truth is  versicolor
result is  setosa   truth is  setosa
result is  versicolor   truth is  versicolor
result is  setosa   truth is  setosa
result is  versicolor   truth is  virginica
result is  versicolor   truth is  virginica
result is  setosa   truth is  setosa
result is  setosa   truth is  setosa
result is  versicolor   truth is  virginica
result is  setosa   truth is  setosa
result is  setosa   truth is  setosa
result is  versicolor   truth is  versicolor
result is  setosa   truth is  setosa
result is  setosa   truth is  setosa
result is  versicolor   truth is  versicolor
result is  versicolor   truth is  virginica
result is  versicolor   truth is  virginica
result is  versicolor   truth is  versicolor
result is  setosa   truth is  setosa
result is  versicolor   truth is  virginica
result is  versicolor   truth is  virginica
result is  versicolor   truth is  virginica
result is  versicolor   truth is  versicolor
result is  versicolor   truth is  virginica
result is  setosa   truth is  setosa
result is  setosa   truth is  setosa
accuracy:  0.6333333333333333

2. Hair-and-voice (gender) implementation:

Code:

from math import log
import operator

def calcShannonEnt(dataSet):  # compute the entropy of the dataset
    numEntries=len(dataSet)  # number of rows
    labelCounts={}
    for featVec in dataSet:
        currentLabel=featVec[-1] # the last field of each row is the class
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1  # count the classes and how many rows each has
    shannonEnt=0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries # proportion of a single class
        shannonEnt-=prob*log(prob,2) # accumulate each class's entropy term
    return shannonEnt

def createDataSet1():    # build the toy data: [hair length, voice pitch, gender]
    # Rows reconstructed from the classic version of this example
    # (the values were garbled in the original post)
    dataSet = [['long', 'coarse', 'male'],
               ['short', 'coarse', 'male'],
               ['short', 'coarse', 'male'],
               ['long', 'fine', 'female'],
               ['short', 'fine', 'female'],
               ['short', 'coarse', 'female'],
               ['long', 'coarse', 'female'],
               ['long', 'coarse', 'female']]
    labels = ['hair','voice']  # the two features
    return dataSet,labels

def splitDataSet(dataSet,axis,value): # rows whose feature `axis` equals `value`, with that column removed
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec =featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):  # choose the best feature to split on
    numFeatures = len(dataSet[0])-1
    baseEntropy = calcShannonEnt(dataSet)  # entropy before splitting
    bestInfoGain = 0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob =len(subDataSet)/float(len(dataSet))
            newEntropy +=prob*calcShannonEnt(subDataSet)  # entropy after splitting on this feature
        infoGain = baseEntropy - newEntropy  # information gain: entropy before minus entropy after
        if (infoGain>bestInfoGain):   # the feature that reduces entropy the most is the best split
            bestInfoGain=infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):    # majority vote, e.g. 2 males and 1 female is classified as male
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]  # the classes: male or female
    if classList.count(classList[0])==len(classList):
        return classList[0]
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet) # choose the best feature
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}} # the tree is stored as a nested dict
    del(labels[bestFeat])
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    for value in uniqueVals:
        subLabels=labels[:]
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree


if __name__=='__main__':
    dataSet, labels=createDataSet1()  # build the toy data
    print(createTree(dataSet, labels))  # print the resulting decision tree

Execution result:

{'voice': {'fine': 'female', 'coarse': {'hair': {'long': 'female', 'short': 'male'}}}}

3. Titanic implementation

Code:

#coding=utf-8
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer # feature transformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import tree

#1. Load the data
titanic = pd.read_csv('C:/data/titanic.txt')
#print(titanic.head())
#print(titanic.info())
X = titanic[['pclass','age','sex']].copy()  # select the features to classify on; .copy() avoids pandas' SettingWithCopyWarning
y = titanic['survived']
print (X.shape)
#(1313, 3)

#2. Preprocessing: train/test split, imputation
X['age'].fillna(X['age'].mean(),inplace=True)   # only 633 ages are present; filling with the mean (or median) distorts the model least
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=33)  # split the data

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))   # extract features from the training data
X_test = vec.transform(X_test.to_dict(orient='records'))         # extract features from the test data
# After the transformation, every categorical feature is split out into its own one-hot column; numeric features are unchanged
print (vec.feature_names_)  #['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']

#3. Predict the test data with the decision tree
dtc = DecisionTreeClassifier()
dtc.fit(X_train,y_train)
y_predict = dtc.predict(X_test)

#4. Report the results
print ('Accuracy:',dtc.score(X_test,y_test))
print (classification_report(y_test,y_predict,target_names=['died','survived']))  # ground truth first, predictions second

#5. Save the resulting tree
with open("jueceshu.dot", 'w') as f:
    f = tree.export_graphviz(dtc, out_file = f)

Execution result:

(1313, 3)
['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
Accuracy: 0.7811550151975684
              precision    recall  f1-score   support
        died       0.91      0.78      0.84       236
    survived       0.58      0.80      0.67        93
    accuracy                           0.78       329
   macro avg       0.74      0.79      0.75       329
weighted avg       0.81      0.78      0.79       329

 

