Decision Tree Algorithm (ID3)


Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

Given the table above, how do we decide whether to play tennis?

A decision tree can be used.

A decision tree is an instance-based inductive learning algorithm: from unordered, unstructured data it derives a tree-shaped classification rule.

Advantages: low computational cost; the resulting model is easy to read.

Disadvantages: prone to overfitting, so pruning is needed (for example, splitting on Day gives a one-to-one mapping that is perfectly accurate on this table but useless on any new data); time-ordered data requires extensive preprocessing.

 

The ID3 algorithm:

1. For the training instances, compute the information gain of each attribute.

2. Take the attribute P with the largest information gain as the root node, and partition the samples into subsets by the values of P.

3. For each subset, if it contains only positive or only negative examples, output that decision directly; otherwise apply the algorithm recursively to choose the next node.

 

 

Entropy: the uncertainty of a random variable.

Conditional entropy: the uncertainty of a random variable given some condition.

Information gain: entropy minus conditional entropy, i.e. how much the uncertainty decreases once the condition is known.

 

We take the attribute with the largest information gain as the node because, once the value of that attribute is known, the uncertainty about whether to play tennis drops the most.

In other words, that attribute matters most for the decision.
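The entropy definition above can be sanity-checked in a few lines of Python (an illustrative sketch; the probabilities are made-up examples, not taken from the table):

```python
from math import log2

def entropy(probs):
    # H = -sum(p * log2(p)); higher means more uncertainty
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: maximal uncertainty, 1.0 bit
print(entropy([0.9, 0.1]))  # skewed coin: less uncertain, about 0.47 bits
print(entropy([1.0]))       # certain outcome: 0.0 bits
```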

 

 

To solve the problem above:

First, compute the entropy of the whole system over PlayTennis:

P(No) = 5/14

P(Yes) = 9/14

Entropy(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.94

 

Next, compute the conditional entropy of each attribute.

For example, Wind:

There are 8 records with Wind = Weak, of which 6 are positive (PlayTennis = Yes) and 2 are negative; there are 6 records with Wind = Strong, with 3 positive and 3 negative.

Entropy(Weak) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811

Entropy(Strong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.0

The corresponding information gain is:

Gain(Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong) = 0.048

 

Similarly, Gain(Humidity) = 0.151; Gain(Outlook) = 0.247; Gain(Temperature) = 0.029.
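The four gains can be reproduced with a short script (a sketch assuming the 14-row table above; differences in the last digit are rounding):

```python
from collections import Counter
from math import log2

# the 14 PlayTennis rows: outlook, temperature, humidity, wind, label
rows = [
    ('sunny', 'hot', 'high', 'weak', 'no'),       ('sunny', 'hot', 'high', 'strong', 'no'),
    ('overcast', 'hot', 'high', 'weak', 'yes'),   ('rain', 'mild', 'high', 'weak', 'yes'),
    ('rain', 'cool', 'normal', 'weak', 'yes'),    ('rain', 'cool', 'normal', 'strong', 'no'),
    ('overcast', 'cool', 'normal', 'strong', 'yes'), ('sunny', 'mild', 'high', 'weak', 'no'),
    ('sunny', 'cool', 'normal', 'weak', 'yes'),   ('rain', 'mild', 'normal', 'weak', 'yes'),
    ('sunny', 'mild', 'normal', 'strong', 'yes'), ('overcast', 'mild', 'high', 'strong', 'yes'),
    ('overcast', 'hot', 'normal', 'weak', 'yes'), ('rain', 'mild', 'high', 'strong', 'no'),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    # information gain = label entropy minus the conditional entropy given column col
    g = entropy([r[-1] for r in rows])
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for name, col in [('outlook', 0), ('temperature', 1), ('humidity', 2), ('wind', 3)]:
    print(name, round(gain(rows, col), 3))
```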

 

So the root node is Outlook.

The corresponding decision tree:

Outlook splits the data into three subsets:

Sunny: {1, 2, 8, 9, 11} — 2 positive, 3 negative

Overcast: {3, 7, 12, 13} — 4 positive, 0 negative

Rain: {4, 5, 6, 10, 14} — 3 positive, 2 negative

This gives:

Sunny:

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
1    Sunny    Hot          High      Weak    No
2    Sunny    Hot          High      Strong  No
8    Sunny    Mild         High      Weak    No
9    Sunny    Cool         Normal    Weak    Yes
11   Sunny    Mild         Normal    Strong  Yes

 

Entropy(Sunny) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971

For Wind: when Weak, there is 1 positive and 2 negative; when Strong, 1 positive and 1 negative.

Entropy(Weak) = -(1/3)*log2(1/3) - (2/3)*log2(2/3) = 0.918

Entropy(Strong) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1

 

Gain(Wind) = Entropy(Sunny) - (3/5)*Entropy(Weak) - (2/5)*Entropy(Strong) = 0.0202

 

Similarly, Gain(Humidity) = 0.971 and Gain(Temperature) = 0.571, so Humidity is chosen as the node under the Sunny branch.
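Running the same gain computation on just the five Sunny rows confirms the numbers (a sketch; entropy and gain are as defined earlier):

```python
from collections import Counter
from math import log2

# the five Sunny-branch rows (days 1, 2, 8, 9, 11): temperature, humidity, wind, label
sunny = [
    ('hot', 'high', 'weak', 'no'),   ('hot', 'high', 'strong', 'no'),
    ('mild', 'high', 'weak', 'no'),  ('cool', 'normal', 'weak', 'yes'),
    ('mild', 'normal', 'strong', 'yes'),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    g = entropy([r[-1] for r in rows])
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for name, col in [('temperature', 0), ('humidity', 1), ('wind', 2)]:
    print(name, round(gain(sunny, col), 3))
# Humidity separates the subset perfectly, so its gain equals
# the full subset entropy (0.971) and it wins the split.
```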

 

 

At this point a partial decision tree can be drawn:

 

 

 

The corresponding Python code:

import math
# compute the Shannon entropy of a dataset
def calcShannonEnt(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]  # the last column is the class label
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] +=1
         
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*math.log(prob, 2)
    return shannonEnt
     
def CreateDataSet():
    dataset = [['sunny', 'hot','high','weak', 'no' ],
               ['sunny', 'hot','high','strong', 'no' ],
               ['overcast', 'hot','high','weak', 'yes' ],
               ['rain', 'mild','high','weak', 'yes' ],
               ['rain', 'cool','normal','weak', 'yes' ],
                ['rain', 'cool','normal','strong', 'no' ],
                ['overcast', 'cool','normal','strong', 'yes' ],
                ['sunny', 'mild','high','weak', 'no' ],
                ['sunny', 'cool','normal','weak', 'yes' ],
                ['rain', 'mild','normal','weak', 'yes' ],
                ['sunny', 'mild','normal','strong', 'yes' ],
                ['overcast', 'mild','high','strong', 'yes' ],
                ['overcast', 'hot','normal','weak', 'yes' ],
                ['rain', 'mild','high','strong', 'no' ],
               ]
    labels = ['outlook', 'temperature', 'humidity', 'wind']
    return dataset, labels
# select the samples where attribute axis == value, dropping that column
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
     
    return retDataSet
# choose the attribute with the largest information gain as the node
def chooseBestFeatureToSplit(dataSet):
    numberFeatures = len(dataSet[0])-1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy =0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if(infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
# when all attributes are used up but the subset is still mixed, take a majority vote
def majorityCnt(classList):
    classCount ={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote] += 1
    return max(classCount, key=classCount.get)  # class with the highest count
  
 
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop splitting when all samples share the same class
    if classList.count(classList[0])==len(classList):
        return classList[0]
    # no attributes left: majority vote
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
 
         
         
myDat,labels = CreateDataSet()
tree = createTree(myDat,labels)
print(tree)

  

For building decision trees, the sklearn library provides a tree module, but:

scikit-learn uses an optimised version of the CART algorithm.

so it does not support the ID3 algorithm used in this article.

However, the package at https://pypi.python.org/pypi/decision-tree-id3/0.1.2 does support ID3.

Follow the instructions on that page, and mind the versions of its dependencies: upgrade or install them as needed.

from id3 import Id3Estimator
from id3 import export_graphviz

X = [['sunny',    'hot',   'high',   'weak'],
     ['sunny',    'hot',   'high',   'strong'], 
     ['overcast', 'hot',   'high',   'weak'], 
     ['rain',     'mild',  'high',   'weak'], 
     ['rain',     'cool',  'normal', 'weak'], 
     ['rain',     'cool',  'normal', 'strong'], 
     ['overcast', 'cool',  'normal', 'strong'], 
     ['sunny',    'mild',  'high',   'weak'], 
     ['sunny',    'cool',  'normal', 'weak'], 
     ['rain',     'mild',  'normal', 'weak'], 
     ['sunny',    'mild',  'normal', 'strong'], 
     ['overcast', 'mild',  'high',   'strong'], 
     ['overcast', 'hot',   'normal', 'weak'], 
     ['rain',     'mild',  'high',   'strong'], 
]
Y = ['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']
f = ['outlook','temperature','humidity','wind']
estimator = Id3Estimator()
estimator.fit(X, Y, check_input=True)
export_graphviz(estimator.tree_, 'tree.dot', f)

Then generate a PDF with the Graphviz tool:

dot -Tpdf tree.dot -o tree.pdf

Result:

 

Of course, you can also make predictions:

print(estimator.predict([['rain',     'mild',  'high',   'strong']]))

  

 

