A Python Implementation of Decision Trees
Preface:
An important task of a decision tree is to understand the knowledge hidden in data: a decision tree can take an unfamiliar dataset and extract a set of rules from it, and this process of the machine creating rules from a dataset is itself the process of machine learning.
Advantages of decision trees:
1: Low computational complexity
2: Output that is easy to interpret
3: Insensitivity to missing intermediate values
4: Ability to handle irrelevant features
Disadvantage: prone to overfitting
Applicable data types: numerical and nominal
We implement a Decision Tree step by step in Python, in the following stages:
- Loading the dataset
- Computing entropy
- Splitting the data on the best splitting feature
- Choosing the best splitting feature by maximum information gain
- Building the decision tree recursively
- Classifying samples
1. Loading the dataset
from numpy import *

#load "iris.data" into the workspace
#raw strings (r"...") keep the Windows backslashes from being treated as escapes
traindata = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (0,1,2,3),dtype = float)
trainlabel = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (4,),dtype = str)
feaname = ["#0","#1","#2","#3"] # feature names of the 4 attributes (features)
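As a quick sanity check (my addition, assuming the standard 150-sample UCI iris.data file), the loaded arrays should have these shapes:

#print traindata.shape  # (150, 4) -- 150 samples, 4 numeric features
#print trainlabel.shape # (150,)   -- one class string per sample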
2. Computing entropy
Entropy was introduced by Shannon (the giant of information theory); see Wikipedia for the definition.
Note that the entropy computed here is H(C|X=xi), not H(C|X); H(C|X) is computed in step 4, where each term is further weighted by its probability and summed.
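For reference, the standard Shannon definitions used by the code are:

H(C) = -\sum_i p(c_i) \log_2 p(c_i)
H(C|X) = \sum_i p(x_i) \, H(C|X=x_i)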
Code:
from math import log

def calentropy(label):
    n = label.size # the number of samples
    count = {} #create dictionary "count"
    for curlabel in label:
        if curlabel not in count.keys():
            count[curlabel] = 0
        count[curlabel] += 1
    entropy = 0
    for key in count:
        pxi = float(count[key])/n #notice converting to float first
        entropy -= pxi*log(pxi,2)
    return entropy

#testcode:
#x = calentropy(trainlabel)
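A quick check of the function (illustrative numbers of my own, not from the original post): two equally frequent classes carry exactly one bit of entropy, while a pure subset carries none.

#print calentropy(array(['A','A','B','B'])) # -> 1.0 (uniform over two classes)
#print calentropy(array(['A','A','A','A'])) # -> 0.0 (a pure subset)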
3. Splitting the data on the best splitting feature
Assume the best splitting feature has already been found (splitfea_idx); the actual split happens here.
The second function, idx2data, takes the two index sets produced by splitdata and returns datal (samples less than the pivot), datag (samples greater than the pivot), and the matching labels labell and labelg. The mean of the selected feature is used as the pivot.
Code:
#split the dataset according to feature "splitfea_idx"
def splitdata(oridata,splitfea_idx):
    arg = args[splitfea_idx] #the pivot: the mean of the chosen feature
    idx_less = [] #new list holding indices with feature value less than the pivot
    idx_greater = [] #indices with feature value greater than the pivot
    n = len(oridata)
    for idx in range(n):
        d = oridata[idx]
        if d[splitfea_idx] < arg:
            #add this entry into the "less" set
            idx_less.append(idx)
        else:
            idx_greater.append(idx)
    return idx_less,idx_greater

#testcode:
#idx_less,idx_greater = splitdata(traindata,2)


#give the data and labels according to index
def idx2data(oridata,label,splitidx,fea_idx):
    idxl = splitidx[0] #split_less_indices
    idxg = splitidx[1] #split_greater_indices
    datal = []
    datag = []
    for i in idxl: #drop column fea_idx: each split consumes one feature
        datal.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
    for i in idxg:
        datag.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
    labell = label[idxl]
    labelg = label[idxg]
    return datal,datag,labell,labelg
Here args holds the thresholds that decide where each node splits (one value per feature: samples above the threshold go to the > branch, samples below it to the < branch). We can define it as the per-feature mean:

args = mean(traindata,axis = 0)

Test: splitting on feature 2 produces the "less" and "greater" index sets; that is, splitting the sample set at args[2] leaves 57 samples in the < branch and 93 in the > branch.
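These counts can be reproduced directly from the testcode above (the expected numbers are the ones just reported):

#idx_less,idx_greater = splitdata(traindata,2)
#print len(idx_less),len(idx_greater) # expect: 57 93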
4. Choosing the best splitting feature by maximum information gain
The information gain is info_gain in the code; the conditional entropy it subtracts is computed as described in the in-code comment.
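Concretely, for a candidate feature X split at its pivot, the gain computed below is the standard definition, matching the in-code comment:

IG(C;X) = H(C) - H(C|X) = H(C) - [\, p(x<pivot)\,H(C|x<pivot) + p(x\geq pivot)\,H(C|x\geq pivot) \,]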
Code:
#select the best feature to split on
def choosebest_splitnode(oridata,label):
    n_fea = len(oridata[0])
    n = len(label)
    base_entropy = calentropy(label)
    best_gain = -1
    for fea_i in range(n_fea): #calculate entropy under each splitting feature
        cur_entropy = 0
        idxset_less,idxset_greater = splitdata(oridata,fea_i)
        prob_less = float(len(idxset_less))/n
        prob_greater = float(len(idxset_greater))/n

        #entropy(C|X) = \sum{p(xi)*entropy(C|X=xi)}
        cur_entropy += prob_less*calentropy(label[idxset_less])
        cur_entropy += prob_greater*calentropy(label[idxset_greater])

        info_gain = base_entropy - cur_entropy #gain = entropy before the split minus entropy after it
        if(info_gain>best_gain):
            best_gain = info_gain
            best_idx = fea_i
    return best_idx

#testcode:
#x = choosebest_splitnode(traindata,trainlabel)
This test runs over the whole dataset: which feature gets chosen for the very first split?
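To see the answer on your own copy of the data, map the returned index back to a feature name (a small addition of mine; the winner depends on the dataset, so it is not hard-coded here):

#best = choosebest_splitnode(traindata,trainlabel)
#print best, feaname[best]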
5. Building the decision tree recursively
See the comments in the code; buildtree constructs the tree recursively.
The recursion terminates when:
① the branch contains no samples (the subset is empty), or
② all samples in the subset belong to the same class, or
③ no features remain: each split consumes one feature, so when none are left we stop recursing and return the majority label of the current subset.
#create the decision tree based on information gain
def buildtree(oridata, label):
    if label.size==0: #if no samples belong to this branch
        return "NULL"
    listlabel = label.tolist()
    #stop when all samples in this subset belong to one class
    if listlabel.count(label[0])==label.size:
        return label[0]

    #return the majority label in this subset if no features are available
    if len(feanamecopy)==0:
        cnt = {}
        for cur_l in label:
            if cur_l not in cnt.keys():
                cnt[cur_l] = 0
            cnt[cur_l] += 1
        maxx = -1
        for key in cnt:
            if maxx < cnt[key]:
                maxx = cnt[key]
                maxkey = key
        return maxkey

    bestsplit_fea = choosebest_splitnode(oridata,label) #get the best splitting feature
    print bestsplit_fea,len(oridata[0])
    cur_feaname = feanamecopy[bestsplit_fea] #the feature name becomes the dictionary key
    print cur_feaname
    nodedict = {cur_feaname:{}}
    del(feanamecopy[bestsplit_fea]) #delete the current feature from feanamecopy
    split_idx = splitdata(oridata,bestsplit_fea) #the index sets for the less and greater branches
    data_less,data_greater,label_less,label_greater = idx2data(oridata,label,split_idx,bestsplit_fea)

    #build the tree recursively; the left and right subtrees are the "<" and ">" branches, respectively
    nodedict[cur_feaname]["<"] = buildtree(data_less,label_less)
    nodedict[cur_feaname][">"] = buildtree(data_greater,label_greater)
    return nodedict

#testcode:
#feanamecopy = list(feaname) #buildtree consumes this list, so initialize it first (my assumption: a copy of feaname)
#mytree = buildtree(traindata,trainlabel)
#print mytree
Result:
mytree is our result: a key such as '#1' means that node splits on feature #1, and its '<' and '>' entries hold the branches for the less and greater data, respectively.
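Schematically, the returned tree is a nested dict of the following shape (an illustration of the structure only; the actual features and labels depend on the run):

#{'#1': {'<': <subtree or class label>,
#        '>': <subtree or class label>}}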
6. Classifying samples
Classification uses the constructed mytree, recursively following the branches:
#classify a new sample
def classify(mytree,testdata):
    if type(mytree).__name__ != 'dict': #reached a leaf: return the class label
        return mytree
    fea_name = mytree.keys()[0] #the splitting feature's name (Python 2; in Python 3 use list(mytree.keys())[0])
    fea_idx = feaname.index(fea_name) #the index of feature 'fea_name'
    val = testdata[fea_idx]
    nextbranch = mytree[fea_name]

    #judge whether the current value is > or < the pivot (the feature mean)
    if val>args[fea_idx]:
        nextbranch = nextbranch[">"]
    else:
        nextbranch = nextbranch["<"]
    return classify(nextbranch,testdata)

#testcode
tt = traindata[0]
x = classify(mytree,tt)
print x
Result:
To verify the code's correctness, we can change the args parameter and set every threshold to 0 (smaller than any feature value):

args = [0,0,0,0]

Rebuilding the tree and classifying then shows that no sample falls below the pivot (0), so the value under every '<' key in the dict is the empty branch (the "NULL" returned by buildtree).
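As one further sanity check (my addition, not from the original post), restore args to the feature means, rebuild the tree, and classify every training sample to estimate the training accuracy:

#args = mean(traindata,axis = 0)
#feanamecopy = list(feaname) #re-initialize, since buildtree consumes it
#mytree = buildtree(traindata,trainlabel)
#correct = 0
#for i in range(len(traindata)):
#    if classify(mytree,traindata[i]) == trainlabel[i]:
#        correct += 1
#print float(correct)/len(traindata)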
