| Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| 1   | Sunny    | Hot         | High     | Weak   | No         |
| 2   | Sunny    | Hot         | High     | Strong | No         |
| 3   | Overcast | Hot         | High     | Weak   | Yes        |
| 4   | Rain     | Mild        | High     | Weak   | Yes        |
| 5   | Rain     | Cool        | Normal   | Weak   | Yes        |
| 6   | Rain     | Cool        | Normal   | Strong | No         |
| 7   | Overcast | Cool        | Normal   | Strong | Yes        |
| 8   | Sunny    | Mild        | High     | Weak   | No         |
| 9   | Sunny    | Cool        | Normal   | Weak   | Yes        |
| 10  | Rain     | Mild        | Normal   | Weak   | Yes        |
| 11  | Sunny    | Mild        | Normal   | Strong | Yes        |
| 12  | Overcast | Mild        | High     | Strong | Yes        |
| 13  | Overcast | Hot         | Normal   | Weak   | Yes        |
| 14  | Rain     | Mild        | High     | Strong | No         |
Given the table above, how do we decide whether to play tennis?
We can use a decision tree.
A decision tree is an instance-based inductive learning algorithm: from unordered, unstructured data it derives a classification rule represented as a tree.
Advantages: low computational cost; the result is easy to read.
Disadvantages: prone to overfitting, so pruning is needed (for example, splitting on Day classifies the training set perfectly, one leaf per row, but generalizes to nothing else); time-ordered data requires a lot of preprocessing.
The ID3 algorithm:
1. For the training instances, compute the information gain of each attribute.
2. Make the attribute P with the largest information gain the root node, and partition the samples into subsets by the values of P.
3. If a subset contains only positive or only negative examples, output that decision directly; otherwise, call the algorithm recursively on the subset to choose the next node.

Entropy: a measure of the uncertainty of a random variable.
Conditional entropy: the uncertainty of a random variable given some condition.
Information gain: entropy minus conditional entropy, i.e. how much the uncertainty decreases once the condition is known.
We choose the attribute with the largest information gain as the node because, once that attribute's value is known, the uncertainty about whether to play drops the most.
In other words, that attribute matters most for the decision.
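As a quick sketch of the definition, entropy over a list of class labels can be computed like this (the helper name `entropy` is my own, not part of the listing later in this article):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

# The 14 PlayTennis outcomes from the table above: 9 Yes, 5 No
play = ['yes'] * 9 + ['no'] * 5
print(round(entropy(play), 2))  # 0.94
```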
To solve the problem above:
First, compute the entropy of the whole system over PlayTennis:
P(No) = 5/14
P(Yes) = 9/14
Entropy(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.94
Next, compute the conditional entropy for each attribute.
Take Wind as an example: 8 records have Wind = Weak (6 positive, 2 negative) and 6 records have Wind = Strong (3 positive, 3 negative).
Entropy(Weak) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Entropy(Strong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.0
The corresponding information gain is:
Gain(Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong) = 0.048
Similarly, Gain(Humidity) = 0.151, Gain(Outlook) = 0.247, Gain(Temperature) = 0.029.
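These gains are easy to check numerically; a minimal sketch over the same 14 rows (the function names `entropy` and `info_gain` are my own):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def info_gain(rows, attr):
    """Gain(A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy([r[-1] for r in rows])
    for v in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

data = [['sunny', 'hot', 'high', 'weak', 'no'],
        ['sunny', 'hot', 'high', 'strong', 'no'],
        ['overcast', 'hot', 'high', 'weak', 'yes'],
        ['rain', 'mild', 'high', 'weak', 'yes'],
        ['rain', 'cool', 'normal', 'weak', 'yes'],
        ['rain', 'cool', 'normal', 'strong', 'no'],
        ['overcast', 'cool', 'normal', 'strong', 'yes'],
        ['sunny', 'mild', 'high', 'weak', 'no'],
        ['sunny', 'cool', 'normal', 'weak', 'yes'],
        ['rain', 'mild', 'normal', 'weak', 'yes'],
        ['sunny', 'mild', 'normal', 'strong', 'yes'],
        ['overcast', 'mild', 'high', 'strong', 'yes'],
        ['overcast', 'hot', 'normal', 'weak', 'yes'],
        ['rain', 'mild', 'high', 'strong', 'no']]

for i, name in enumerate(['Outlook', 'Temperature', 'Humidity', 'Wind']):
    print(name, round(info_gain(data, i), 3))
```

(Computed exactly, the Humidity gain is about 0.152 at three decimals; 0.151 is the commonly quoted rounded figure.)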
The root node is therefore Outlook.
In the corresponding decision tree, Outlook splits the samples into three subsets:
Sunny: {1, 2, 8, 9, 11}, 2 positive, 3 negative
Overcast: {3, 7, 12, 13}, 4 positive, 0 negative
Rain: {4, 5, 6, 10, 14}, 3 positive, 2 negative
For the Sunny subset we then have:
| Day | Outlook | Temperature | Humidity | Wind   | PlayTennis |
|-----|---------|-------------|----------|--------|------------|
| 1   | Sunny   | Hot         | High     | Weak   | No         |
| 2   | Sunny   | Hot         | High     | Strong | No         |
| 8   | Sunny   | Mild        | High     | Weak   | No         |
| 9   | Sunny   | Cool        | Normal   | Weak   | Yes        |
| 11  | Sunny   | Mild        | Normal   | Strong | Yes        |
Entropy(S_sunny) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971
For Wind: when Weak, 1 positive and 2 negative; when Strong, 1 positive and 1 negative.
Entropy(Weak) = -(1/3)*log2(1/3) - (2/3)*log2(2/3) = 0.918
Entropy(Strong) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1
Gain(Wind) = Entropy(S_sunny) - (3/5)*Entropy(Weak) - (2/5)*Entropy(Strong) = 0.020
Similarly, Gain(Humidity) = 0.971 and Gain(Temperature) = 0.571, so Humidity is chosen for this branch.
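The same numeric check can be run on the Sunny subset alone; a minimal sketch (function names are my own):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def info_gain(rows, attr):
    """Gain(A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy([r[-1] for r in rows])
    for v in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# The five Sunny rows (days 1, 2, 8, 9, 11); Outlook is already used, so dropped
sunny = [['hot', 'high', 'weak', 'no'],
         ['hot', 'high', 'strong', 'no'],
         ['mild', 'high', 'weak', 'no'],
         ['cool', 'normal', 'weak', 'yes'],
         ['mild', 'normal', 'strong', 'yes']]

for i, name in enumerate(['Temperature', 'Humidity', 'Wind']):
    print(name, round(info_gain(sunny, i), 3))
# Humidity wins with gain 0.971: High is always No, Normal is always Yes
```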
At this point, part of the decision tree can be drawn:

The corresponding Python code:

import math

# Shannon entropy of the class labels in a data set
def calcShannonEnt(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]  # the last column is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

def CreateDataSet():
    dataset = [['sunny', 'hot', 'high', 'weak', 'no'],
               ['sunny', 'hot', 'high', 'strong', 'no'],
               ['overcast', 'hot', 'high', 'weak', 'yes'],
               ['rain', 'mild', 'high', 'weak', 'yes'],
               ['rain', 'cool', 'normal', 'weak', 'yes'],
               ['rain', 'cool', 'normal', 'strong', 'no'],
               ['overcast', 'cool', 'normal', 'strong', 'yes'],
               ['sunny', 'mild', 'high', 'weak', 'no'],
               ['sunny', 'cool', 'normal', 'weak', 'yes'],
               ['rain', 'mild', 'normal', 'weak', 'yes'],
               ['sunny', 'mild', 'normal', 'strong', 'yes'],
               ['overcast', 'mild', 'high', 'strong', 'yes'],
               ['overcast', 'hot', 'normal', 'weak', 'yes'],
               ['rain', 'mild', 'high', 'strong', 'no'],
               ]
    labels = ['outlook', 'temperature', 'humidity', 'wind']
    return dataset, labels

# Rows whose attribute at position `axis` equals `value`, with that attribute removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Choose the attribute with the largest information gain as the node
def chooseBestFeatureToSplit(dataSet):
    numberFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# If the attributes are exhausted but the subset is still mixed, take a majority vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # max(classCount) alone would compare label names alphabetically; compare counts instead
    return max(classCount, key=classCount.get)

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop splitting when all samples share one class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # attributes exhausted: majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

myDat, labels = CreateDataSet()
tree = createTree(myDat, labels)
print(tree)
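The nested dict returned by createTree can also be used to classify a new sample; a minimal sketch (`classify` is my own helper, and the hard-coded tree below is the one this data set yields: root Outlook, with Sunny splitting on Humidity and Rain on Wind):

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict decision tree down to a leaf label."""
    if not isinstance(tree, dict):
        return tree  # reached a leaf
    attr = next(iter(tree))
    value = sample[feature_names.index(attr)]
    return classify(tree[attr][value], feature_names, sample)

tree = {'outlook': {'overcast': 'yes',
                    'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}},
                    'rain':  {'wind': {'weak': 'yes', 'strong': 'no'}}}}
features = ['outlook', 'temperature', 'humidity', 'wind']
print(classify(tree, features, ['rain', 'mild', 'high', 'strong']))  # no
```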
For building decision trees, the sklearn library does provide a tree module, but note that:
scikit-learn uses an optimised version of the CART algorithm.
so the ID3 algorithm used in this article is not supported.
However, the package at https://pypi.python.org/pypi/decision-tree-id3/0.1.2 does support ID3.
Install it per its documentation, taking care with the versions of its dependencies: upgrade or install them as needed.
from id3 import Id3Estimator
from id3 import export_graphviz

X = [['sunny', 'hot', 'high', 'weak'],
     ['sunny', 'hot', 'high', 'strong'],
     ['overcast', 'hot', 'high', 'weak'],
     ['rain', 'mild', 'high', 'weak'],
     ['rain', 'cool', 'normal', 'weak'],
     ['rain', 'cool', 'normal', 'strong'],
     ['overcast', 'cool', 'normal', 'strong'],
     ['sunny', 'mild', 'high', 'weak'],
     ['sunny', 'cool', 'normal', 'weak'],
     ['rain', 'mild', 'normal', 'weak'],
     ['sunny', 'mild', 'normal', 'strong'],
     ['overcast', 'mild', 'high', 'strong'],
     ['overcast', 'hot', 'normal', 'weak'],
     ['rain', 'mild', 'high', 'strong'],
     ]
Y = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
f = ['outlook', 'temperature', 'humidity', 'wind']

estimator = Id3Estimator()
estimator.fit(X, Y, check_input=True)
export_graphviz(estimator.tree_, 'tree.dot', f)
Then generate a PDF with the GraphViz tool:
dot -Tpdf tree.dot -o tree.pdf
Result:

Of course, you can also make predictions:
print(estimator.predict([['rain', 'mild', 'high', 'strong']]))

