| Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| 1   | Sunny    | Hot         | High     | Weak   | No         |
| 2   | Sunny    | Hot         | High     | Strong | No         |
| 3   | Overcast | Hot         | High     | Weak   | Yes        |
| 4   | Rain     | Mild        | High     | Weak   | Yes        |
| 5   | Rain     | Cool        | Normal   | Weak   | Yes        |
| 6   | Rain     | Cool        | Normal   | Strong | No         |
| 7   | Overcast | Cool        | Normal   | Strong | Yes        |
| 8   | Sunny    | Mild        | High     | Weak   | No         |
| 9   | Sunny    | Cool        | Normal   | Weak   | Yes        |
| 10  | Rain     | Mild        | Normal   | Weak   | Yes        |
| 11  | Sunny    | Mild        | Normal   | Strong | Yes        |
| 12  | Overcast | Mild        | High     | Strong | Yes        |
| 13  | Overcast | Hot         | Normal   | Weak   | Yes        |
| 14  | Rain     | Mild        | High     | Strong | No         |
Given the table above, how do we decide whether to play tennis?
We can use a decision tree.
A decision tree is an instance-based inductive learning algorithm: from unordered, unstructured data it derives a classification rule represented as a tree.
Advantages: low computational cost; the result is easy to read.
Disadvantages: prone to overfitting, so pruning is needed (for example, splitting on Day classifies the training set perfectly, one leaf per row, but generalizes to nothing else); time-ordered data requires a lot of preprocessing.
The ID3 algorithm:
1. For the training instances, compute the information gain of each attribute.
2. Make the attribute P with the largest information gain the root node, and partition the samples into subsets by the values of P.
3. If a subset contains only positive or only negative examples, output that decision directly; otherwise, call the algorithm recursively on the subset to choose the next node.

Entropy: a measure of the uncertainty of a random variable.
Conditional entropy: the uncertainty of a random variable given some condition.
Information gain: entropy minus conditional entropy, i.e. how much the uncertainty decreases once the condition is known.
We choose the attribute with the largest information gain as the node because, once that attribute's value is known, the uncertainty about whether to play drops the most.
In other words, that attribute matters most for the decision.
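As a quick sketch of the definition, entropy over a list of class labels can be computed like this (the helper name `entropy` is my own, not part of the listing later in this article):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

# The 14 PlayTennis outcomes from the table above: 9 Yes, 5 No
play = ['yes'] * 9 + ['no'] * 5
print(round(entropy(play), 2))  # 0.94
```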
To solve the problem above:
First, compute the entropy of the whole system over PlayTennis:
P(No) = 5/14
P(Yes) = 9/14
Entropy(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.94
Next, compute the conditional entropy for each attribute.
Take Wind as an example: 8 records have Wind = Weak (6 positive, 2 negative) and 6 records have Wind = Strong (3 positive, 3 negative).
Entropy(Weak) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Entropy(Strong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.0
The corresponding information gain is:
Gain(Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong) = 0.048
Similarly, Gain(Humidity) = 0.151, Gain(Outlook) = 0.247, Gain(Temperature) = 0.029.
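These gains are easy to check numerically; a minimal sketch over the same 14 rows (the function names `entropy` and `info_gain` are my own):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def info_gain(rows, attr):
    """Gain(A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy([r[-1] for r in rows])
    for v in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

data = [['sunny', 'hot', 'high', 'weak', 'no'],
        ['sunny', 'hot', 'high', 'strong', 'no'],
        ['overcast', 'hot', 'high', 'weak', 'yes'],
        ['rain', 'mild', 'high', 'weak', 'yes'],
        ['rain', 'cool', 'normal', 'weak', 'yes'],
        ['rain', 'cool', 'normal', 'strong', 'no'],
        ['overcast', 'cool', 'normal', 'strong', 'yes'],
        ['sunny', 'mild', 'high', 'weak', 'no'],
        ['sunny', 'cool', 'normal', 'weak', 'yes'],
        ['rain', 'mild', 'normal', 'weak', 'yes'],
        ['sunny', 'mild', 'normal', 'strong', 'yes'],
        ['overcast', 'mild', 'high', 'strong', 'yes'],
        ['overcast', 'hot', 'normal', 'weak', 'yes'],
        ['rain', 'mild', 'high', 'strong', 'no']]

for i, name in enumerate(['Outlook', 'Temperature', 'Humidity', 'Wind']):
    print(name, round(info_gain(data, i), 3))
```

(Computed exactly, the Humidity gain is about 0.152 at three decimals; 0.151 is the commonly quoted rounded figure.)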
The root node is therefore Outlook.
In the corresponding decision tree, Outlook splits the samples into three subsets:
Sunny: {1, 2, 8, 9, 11}, 2 positive, 3 negative
Overcast: {3, 7, 12, 13}, 4 positive, 0 negative
Rain: {4, 5, 6, 10, 14}, 3 positive, 2 negative
For the Sunny subset we then have:
| Day | Outlook | Temperature | Humidity | Wind   | PlayTennis |
|-----|---------|-------------|----------|--------|------------|
| 1   | Sunny   | Hot         | High     | Weak   | No         |
| 2   | Sunny   | Hot         | High     | Strong | No         |
| 8   | Sunny   | Mild        | High     | Weak   | No         |
| 9   | Sunny   | Cool        | Normal   | Weak   | Yes        |
| 11  | Sunny   | Mild        | Normal   | Strong | Yes        |
Entropy(S_sunny) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971
For Wind: when Weak, 1 positive and 2 negative; when Strong, 1 positive and 1 negative.
Entropy(Weak) = -(1/3)*log2(1/3) - (2/3)*log2(2/3) = 0.918
Entropy(Strong) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1
Gain(Wind) = Entropy(S_sunny) - (3/5)*Entropy(Weak) - (2/5)*Entropy(Strong) = 0.020
Similarly, Gain(Humidity) = 0.971 and Gain(Temperature) = 0.571, so Humidity is chosen for this branch.
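The same numeric check can be run on the Sunny subset alone; a minimal sketch (function names are my own):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def info_gain(rows, attr):
    """Gain(A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy([r[-1] for r in rows])
    for v in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# The five Sunny rows (days 1, 2, 8, 9, 11); Outlook is already used, so dropped
sunny = [['hot', 'high', 'weak', 'no'],
         ['hot', 'high', 'strong', 'no'],
         ['mild', 'high', 'weak', 'no'],
         ['cool', 'normal', 'weak', 'yes'],
         ['mild', 'normal', 'strong', 'yes']]

for i, name in enumerate(['Temperature', 'Humidity', 'Wind']):
    print(name, round(info_gain(sunny, i), 3))
# Humidity wins with gain 0.971: High is always No, Normal is always Yes
```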
At this point, part of the decision tree can be drawn:

The corresponding Python code:

import math

# Shannon entropy of the class labels in a data set
def calcShannonEnt(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]  # the last column is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

def CreateDataSet():
    dataset = [['sunny', 'hot', 'high', 'weak', 'no'],
               ['sunny', 'hot', 'high', 'strong', 'no'],
               ['overcast', 'hot', 'high', 'weak', 'yes'],
               ['rain', 'mild', 'high', 'weak', 'yes'],
               ['rain', 'cool', 'normal', 'weak', 'yes'],
               ['rain', 'cool', 'normal', 'strong', 'no'],
               ['overcast', 'cool', 'normal', 'strong', 'yes'],
               ['sunny', 'mild', 'high', 'weak', 'no'],
               ['sunny', 'cool', 'normal', 'weak', 'yes'],
               ['rain', 'mild', 'normal', 'weak', 'yes'],
               ['sunny', 'mild', 'normal', 'strong', 'yes'],
               ['overcast', 'mild', 'high', 'strong', 'yes'],
               ['overcast', 'hot', 'normal', 'weak', 'yes'],
               ['rain', 'mild', 'high', 'strong', 'no'],
               ]
    labels = ['outlook', 'temperature', 'humidity', 'wind']
    return dataset, labels

# Rows whose attribute at position `axis` equals `value`, with that attribute removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Choose the attribute with the largest information gain as the node
def chooseBestFeatureToSplit(dataSet):
    numberFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# If the attributes are exhausted but the subset is still mixed, take a majority vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # max(classCount) alone would compare label names alphabetically; compare counts instead
    return max(classCount, key=classCount.get)

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop splitting when all samples share one class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # attributes exhausted: majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

myDat, labels = CreateDataSet()
tree = createTree(myDat, labels)
print(tree)
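The nested dict returned by createTree can also be used to classify a new sample; a minimal sketch (`classify` is my own helper, and the hard-coded tree below is the one this data set yields: root Outlook, with Sunny splitting on Humidity and Rain on Wind):

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict decision tree down to a leaf label."""
    if not isinstance(tree, dict):
        return tree  # reached a leaf
    attr = next(iter(tree))
    value = sample[feature_names.index(attr)]
    return classify(tree[attr][value], feature_names, sample)

tree = {'outlook': {'overcast': 'yes',
                    'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}},
                    'rain':  {'wind': {'weak': 'yes', 'strong': 'no'}}}}
features = ['outlook', 'temperature', 'humidity', 'wind']
print(classify(tree, features, ['rain', 'mild', 'high', 'strong']))  # no
```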
For building decision trees, the sklearn library does provide a tree module, but note that:
scikit-learn uses an optimised version of the CART algorithm.
so the ID3 algorithm used in this article is not supported.
However, the package at https://pypi.python.org/pypi/decision-tree-id3/0.1.2 does support ID3.
Install it per its documentation, taking care with the versions of its dependencies: upgrade or install them as needed.
from id3 import Id3Estimator
from id3 import export_graphviz

X = [['sunny', 'hot', 'high', 'weak'],
     ['sunny', 'hot', 'high', 'strong'],
     ['overcast', 'hot', 'high', 'weak'],
     ['rain', 'mild', 'high', 'weak'],
     ['rain', 'cool', 'normal', 'weak'],
     ['rain', 'cool', 'normal', 'strong'],
     ['overcast', 'cool', 'normal', 'strong'],
     ['sunny', 'mild', 'high', 'weak'],
     ['sunny', 'cool', 'normal', 'weak'],
     ['rain', 'mild', 'normal', 'weak'],
     ['sunny', 'mild', 'normal', 'strong'],
     ['overcast', 'mild', 'high', 'strong'],
     ['overcast', 'hot', 'normal', 'weak'],
     ['rain', 'mild', 'high', 'strong'],
     ]
Y = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
f = ['outlook', 'temperature', 'humidity', 'wind']

estimator = Id3Estimator()
estimator.fit(X, Y, check_input=True)
export_graphviz(estimator.tree_, 'tree.dot', f)
Then generate a PDF with the GraphViz tool:
dot -Tpdf tree.dot -o tree.pdf
Result:

Of course, you can also make predictions:
print(estimator.predict([['rain', 'mild', 'high', 'strong']]))

