Machine Learning: The Decision Tree (ID3) Algorithm

Reposted article, 2018-06-13 14:53. Tags: python / machine learning

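I recently worked through the decision tree chapter of Machine Learning in Action (《機器學習實戰》). Below I use the example from the book to review ID3, one of the algorithms for constructing decision trees.

Marine animal data: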

No.	Can survive without surfacing	Has flippers	Is a fish
1	yes	yes	yes
2	yes	yes	yes
3	yes	no	no
4	no	yes	no
5	no	yes	no

Converted into a data set:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    return dataSet, labels

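I. Background

1. Entropy

I think of entropy simply as a measure of how disordered the data is. The more ordered the data, the lower the entropy; the more mixed or scattered the data, the higher the entropy. So after a data set is split, the more uniform the labels within each subset, the lower the entropy, and the more mixed the labels, the higher the entropy.

A more theoretical explanation: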

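Entropy is defined as the expected value of information. To make that precise, we first need to define information: if the items to be classified can fall into several classes, the information of the symbol x_i is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability of choosing that class, i.e. the number of samples in the class divided by the total number of samples.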

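To compute the entropy, we take the expected value of the information over all possible classes:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where n is the number of classes and p(x_i) is the proportion of samples that belong to class x_i.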
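To make the "more mixed means higher entropy" intuition concrete, here is a minimal sketch that evaluates the standard Shannon entropy on three small label lists. The helper name `entropy` and the toy labels are illustrative assumptions, not part of the book's code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over every class that actually occurs.
    return sum(-(n / total) * log2(n / total) for n in counts.values())


print(entropy(['a', 'a', 'a', 'a']))  # one class: no disorder
print(entropy(['a', 'a', 'b', 'b']))  # two equally likely classes
print(entropy(['a', 'b', 'c', 'd']))  # four equally likely classes
```

The three calls evaluate to 0, 1 and 2 bits respectively: the entropy grows as the labels spread over more equally likely classes.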

Computing the Shannon entropy of a given data set:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # Build a dictionary that counts the samples for each label
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # Compute the Shannon entropy: H = -sum(p * log2(p))
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)
    return shannonEnt

Running the code: data set myDat1 contains only two classes, while myDat2 contains three, so its entropy is higher:

>>> myDat1

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

>>> trees.calcShannonEnt(myDat1)

0.9709505944546686

>>> myDat2

[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

>>> trees.calcShannonEnt(myDat2)

1.3709505944546687

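2. Information gain

Information gain measures how much more ordered the data (that is, the labels) becomes when a data set is split.

Information gain = Shannon entropy of the original data set - Shannon entropy of the data after the split, where the entropy after the split is the sum of the subsets' entropies, each weighted by the fraction of samples that falls into that subset.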
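As a worked example of this definition, the sketch below computes the information gain of each feature of the marine animal data set. The helper names `entropy` and `info_gain` are assumptions made for this illustration; they are not the book's functions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())


def info_gain(rows, feature_index):
    """Entropy of all labels minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    # Group the labels by the value of the chosen feature.
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - after


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(info_gain(data, 0))  # gain from splitting on 'no surfacing'
print(info_gain(data, 1))  # gain from splitting on 'flippers'
```

The gains come out at roughly 0.42 for 'no surfacing' and 0.17 for 'flippers', so splitting on 'no surfacing' orders the labels more.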

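II. Splitting the data set on a given feature

The function takes three arguments: the data set to split, the index of the feature to split on, and the feature value a sample must have to be kept.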

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # Copy every element except the feature at position axis
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            # Collect the reduced samples into the resulting subset
            retDataSet.append(reducedFeatVec)
    return retDataSet

Output:

>>> myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

>>> trees.splitDataSet(myDat,0,1)

[[1, 'yes'], [1, 'yes'], [0, 'no']]

>>> trees.splitDataSet(myDat,0,0)

[[1, 'no'], [1, 'no']]

III. Choosing the best split, that is, picking the most suitable feature to split the data set on

def chooseBestFeatureToSplit(dataSet):
    # Number of features (the last column is the class label)
    numFeatures = len(dataSet[0]) - 1
    # Shannon entropy of the original data set
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        # Collect the values of the i-th feature across all samples
        featList = [example[i] for example in dataSet]
        # Set of distinct values of this feature
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Split on each value and accumulate the weighted Shannon entropy
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        # Remember the feature with the largest information gain so far
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    # Note: stays -1 if no feature yields a positive information gain
    return bestFeature

IV. If every feature has already been used but the class labels are still not unique, the class of that leaf node is decided by majority vote.

import operator

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    # Sort the (label, count) pairs by count, largest first (items() replaces Python 2's iteritems())
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
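The same majority vote can also be written with the standard library's `collections.Counter`. This is only an alternative sketch; the function name `majority_label` is an assumption and does not appear in the book.

```python
from collections import Counter


def majority_label(class_list):
    """Return the most frequent label in class_list."""
    # most_common(1) returns a list holding one (label, count) pair.
    return Counter(class_list).most_common(1)[0][0]


print(majority_label(['yes', 'no', 'no']))  # prints no
```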

V. Building the decision tree

Next we use the building blocks above to create the decision tree.

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    # If the data set contains only one class, return that class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # If all features are used up and the labels are still mixed, return the majority class (see majorityCnt)
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    # Note: this removes the label from the caller's list in place
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # Copy the labels so that recursive calls do not affect each other
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

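Recursion always makes my head spin, so to make this easier to follow I traced one complete run of the tree-building process from start to finish.

Original data set:

dataset: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels: ['no surfacing', 'flippers']

After calling createTree(dataSet, labels), the data is processed as follows (each numbered block, such as 1, 1-1 or 1-2, represents one complete createTree call):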

1.

dataset: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels: ['no surfacing', 'flippers']

classList = ['yes', 'yes', 'no', 'no', 'no']

Choose the best feature to split on: bestFeat = 0

bestFeatLabel = 'no surfacing'

Build the tree: myTree = {'no surfacing': {}}

After removing this feature: labels = ['flippers']

Values of this feature (no surfacing): featValues = [1, 1, 1, 0, 0]

Distinct feature values: uniqueVals = {0, 1}

(1) When the feature value is 0:

subLabels = ['flippers']

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [[1, 'no'], [1, 'no']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

1-1.

dataset: [[1, 'no'], [1, 'no']]

labels: ['flippers']

classList = ['no', 'no']

classList contains only one class, so the call returns 'no'

Back in call 1:

myTree[bestFeatLabel][0] = 'no'

myTree[bestFeatLabel] = {0: 'no'}

That is, myTree = {'no surfacing': {0: 'no'}}

(2) When the feature value is 1:

subLabels = ['flippers']

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [[1, 'yes'], [1, 'yes'], [0, 'no']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

1-2.

dataset: [[1, 'yes'], [1, 'yes'], [0, 'no']]

labels: ['flippers']

classList = ['yes', 'yes', 'no']

Choose the best feature to split on: bestFeat = 0

bestFeatLabel = 'flippers'

Build the tree: myTree = {'flippers': {}}

After removing this feature: labels = []

Values of this feature (flippers): featValues = [1, 1, 0]

Distinct feature values: uniqueVals = {0, 1}

(1) When the feature value is 0:

subLabels = []

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [['no']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

1-2-1.

dataset: [['no']]

labels: []

classList = ['no']

classList contains only one class, so the call returns 'no'

Back in call 1-2:

myTree[bestFeatLabel][0] = 'no'

myTree[bestFeatLabel] = {0: 'no'}

That is, myTree = {'flippers': {0: 'no'}}

(2) When the feature value is 1:

subLabels = []

Resulting subset: splitDataSet(dataSet, bestFeat, value) = [['yes'], ['yes']]

myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

1-2-2.

dataset: [['yes'], ['yes']]

labels: []

classList = ['yes', 'yes']

classList contains only one class, so the call returns 'yes'

Back in call 1-2:

myTree[bestFeatLabel][1] = 'yes'

myTree[bestFeatLabel] = {0: 'no', 1: 'yes'}

That is, myTree = {'flippers': {0: 'no', 1: 'yes'}}, which call 1-2 returns

Back in call 1:

myTree[bestFeatLabel][1] = {'flippers': {0: 'no', 1: 'yes'}}

myTree[bestFeatLabel] = {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}

That is, myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
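The walkthrough above can also be reproduced automatically. The following is a compact, self-contained sketch of the same ID3 procedure that prints each call indented by its recursion depth. All names (`entropy`, `split`, `best_feature`, `build`) are assumptions made for this illustration. Unlike the book's createTree, it copies the label list instead of deleting from it, and it falls back to a majority vote when no feature gives a positive gain.

```python
from collections import Counter
from math import log2


def entropy(rows):
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def split(rows, axis, value):
    # Keep rows whose feature equals value, dropping that feature column.
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


def best_feature(rows):
    # Index of the feature with the largest positive gain, or -1 if none.
    base, best_gain, best = entropy(rows), 0.0, -1
    for i in range(len(rows[0]) - 1):
        after = 0.0
        for value in set(row[i] for row in rows):
            subset = split(rows, i, value)
            after += len(subset) / len(rows) * entropy(subset)
        if base - after > best_gain:
            best_gain, best = base - after, i
    return best


def build(rows, labels, depth=0):
    pad = '    ' * depth  # indentation shows the recursion depth
    print(f'{pad}createTree(dataset={rows}, labels={labels})')
    classes = [row[-1] for row in rows]
    if classes.count(classes[0]) == len(classes):
        print(f'{pad}-> single class, return {classes[0]!r}')
        return classes[0]
    feat = best_feature(rows)
    if feat < 0:  # no features left, or none that helps (assumed fallback)
        winner = Counter(classes).most_common(1)[0][0]
        print(f'{pad}-> majority vote returns {winner!r}')
        return winner
    name = labels[feat]
    rest = labels[:feat] + labels[feat + 1:]  # copy; caller's list is untouched
    tree = {name: {}}
    for value in sorted(set(row[feat] for row in rows)):
        print(f'{pad}split on {name!r} == {value}')
        tree[name][value] = build(split(rows, feat, value), rest, depth + 1)
    print(f'{pad}-> return {tree}')
    return tree


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
tree = build(data, ['no surfacing', 'flippers'])
```

Each level of indentation in the printed trace corresponds to one nested createTree call, which plays the role of the numbered blocks in the manual walkthrough.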

[Figure: visualization of the decision tree from the example]

VI. Classifying with the decision tree

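def classify(inputTree, featLabels, testVec):
    # The single key of the root dictionary is the feature tested at this node
    firstStr = list(inputTree.keys())[0]  # Python 3: keys() is a view and cannot be indexed
    secondDict = inputTree[firstStr]
    # Find the position of this feature in the test vector
    featIndex = featLabels.index(firstStr)
    classLabel = None  # returned if the feature value does not appear in the tree
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                # Internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else: classLabel = secondDict[key]
    return classLabel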

Output:

>>> myTree

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

>>> labels

['no surfacing', 'flippers']

>>> trees.classify(myTree,labels,[1,0])

'no'

>>> trees.classify(myTree,labels,[1,1])

'yes'
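Because the tree is just a nested dictionary, classification can also be written as a loop instead of a recursion. The sketch below is one possible way to do that; the name `classify_iter` and the choice to return None for a feature value that does not appear in the tree are assumptions for illustration.

```python
def classify_iter(tree, feat_labels, test_vec):
    """Walk a nested-dict decision tree without recursion."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))  # the feature tested at this node
        value = test_vec[feat_labels.index(feature)]
        branches = node[feature]
        if value not in branches:
            return None  # value never seen while building the tree
        node = branches[value]
    return node  # a leaf holds the class label


my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_iter(my_tree, labels, [1, 0]))  # prints no
print(classify_iter(my_tree, labels, [1, 1]))  # prints yes
print(classify_iter(my_tree, labels, [2, 1]))  # prints None
```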

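VII. Storing the decision tree

Building a decision tree can be time-consuming, whereas classifying with an already-built tree is fast, so it is worth saving the tree with the pickle module and reloading it when needed.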

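def storeTree(inputTree, filename):
    import pickle
    # pickle writes bytes, so the file must be opened in binary mode
    with open(filename,'wb') as fw:
        pickle.dump(inputTree,fw)

def grabTree(filename):
    import pickle
    # Read in binary mode to match how the file was written
    with open(filename,'rb') as fr:
        return pickle.load(fr)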
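A quick way to check that storing and loading work is a round trip through a temporary file. This is a minimal sketch; the file name classifierStorage.pkl is a hypothetical choice, and the tree literal is the one built earlier in this article.

```python
import os
import pickle
import tempfile

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# Write to a throwaway file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifierStorage.pkl')  # hypothetical file name
    with open(path, 'wb') as fw:
        pickle.dump(my_tree, fw)
    with open(path, 'rb') as fr:
        restored = pickle.load(fr)

print(restored == my_tree)  # prints True
```

The integer keys 0 and 1 survive the round trip unchanged. Since unpickling can execute arbitrary code, only load pickle files that come from a source you trust.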

References:

[1] Machine Learning in Action (《機器學習實戰》)

[2] Notes on Machine Learning in Action: Decision Trees (ID3), https://www.cnblogs.com/DianeSoHungry/p/7059104.html
