統計學習方法（五）——決策樹

本文轉載自查看原文 2014-07-14 21:37 3072 決策樹/ 統計學習方法

/*先把標題給寫了，這樣就能經常提醒自己*/

　　決策樹是一種容易理解的分類算法，它可以認為是if-then規則的一個集合。主要的優點是模型具有可讀性，且分類速度較快，不用進行過多的迭代訓練之類。決策樹學習通常包括3個步驟：特征選擇、決策樹的生成和決策樹的修剪。比較常用到的算法有ID3、C4.5和CART。

1. 決策樹模型

　　決策樹是一種樹形結構的分類模型，它由結點和有向邊組成，結點分為內部結點和葉結點，內部結點表示一個特征或屬性，葉結點表示一個類。

決策樹的分類即是從樹的根節點開始對實例的某一個特征進行判斷，通過內部結點逐步下潛到葉結點的過程。

2. 特征選擇

　　特征選擇在於選取對訓練數據具有分類能力的特征，通常的選擇准則是信息增益或信息增益率。為了便於說明，書中給出了一個例子

希望通過所給的訓練數據學習一個貸款申請的決策樹，當新客戶提出貸款申請時，根據申請人的特征決定是否可貸。

從認知上個人覺得特征的選擇就是找出一些具有代表性，對於分類辨識度高的特征，如此能夠快速准確的為實例分類，從數學的角度上來講，就要涉及到信息論與概率統計中的熵了。在此不贅述太多，直接給出特征選擇的算法（信息增益）。

輸入：訓練數據集D和特征A；

輸出：特征A對訓練數據集D的信息增益和增益率

（1）計算數據集D的經驗熵

（2）計算特征A的經驗條件熵

（3）計算信息增益

（4）信息增益率

對於書中的例子，首先計算經驗熵

然后計算各特征的信息增益，分別以表示年齡、有工作、有房子和信貸情況4個特征，則

分別計算的信息增益，由於的信息增益值最大，則選擇其為最優特征，當然也可以計算出信息增益率的結果作為選擇的依據。

3. 決策樹的生成

ID3和C4.5算法基本上一樣，只是在特征選擇的依據上C4.5采用了改進后的信息增益率。因為本文只介紹其中的ID3算法即可。

ID3算法步驟

輸入：訓練數據集D，特征集A，閾值e

輸出：決策樹T

（1）若D中所有實例屬於同一類Ck，則T為單結點樹，並將類Ck作為該結點的類標記，返回T；

（2）若A=空，則T為單結點樹，將D中實例數最多的類Ck作為結點類標記，返回T；

（3）否則，計算A中各特征對D的信息增益，選擇信息增益值最大的特征Ag；

（4）如果Ag的信息增益小於閾值e，則T為單結點樹，將D中最多的類Ck作為結點類標記，返回T；

（5）否則，對Ag的每一可能值ai，依Ag=ai將D分割為若干子集Di，將Di中實例數最大多的類作為類標記，構建子結點，由結點及其子結點構成樹T，返回T；

（6）對於第i個子結點，以Di為訓練集，以A-Ag為特征集，遞歸調用步驟（1）~（5），得到子樹Ti，返回Ti。

從描述上感覺決策樹的生成還是挺簡單明了的，但是具體的實現上樹的生成是最最難的，要注意的細節很多，花了倆個晚上才搞好的，遇到了好多坑

代碼塊1：信息增益類

package org.juefan.decisiontree;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import org.juefan.basic.FileIO;
import org.juefan.bayes.Data;

public class InfoGain {
        //數據實例存儲類
    class Data {
        public ArrayList<Object> x;
        public Object y;
        
        /**讀取一行數據轉化為標准格式*/
        public Data(String content){
            String[] strings = content.split("\t| |:");
            ArrayList<Object> xList = new ArrayList<Object>();
            for(int i = 1; i < strings.length; i++){
                xList.add(strings[i]);
            }
            this.x = new ArrayList<>();
            this.x = xList;
            this.y = strings[0];
        }
        
        public Data(){
            x  = new ArrayList<>();
            y = 0;
        }
        
        public String toString(){
            StringBuilder builder = new StringBuilder();
            builder.append("[ ");
            for(int i = 0; i < x.size() - 1; i++)
                builder.append(x.get(i).toString()).append(",");
            builder.append(x.get(x.size() - 1).toString());
            builder.append(" ]");
            return builder.toString();
        }
    }
    
    //返回底數為2的對數值
    public static double log2(double d){
        return Math.log(d)/Math.log(2);
    }
    
    /**
     * 計算經驗熵
     * @param datas 當前數據集，可以為訓練數據集中的子集
     * @return 返回當前數據集的經驗熵
     */
    public  double getEntropy(ArrayList<Data> datas){
        int counts = datas.size();
        double entropy = 0;
        Map<Object, Double> map = new HashMap<Object, Double>();
        for(Data data: datas){
            if(map.containsKey(data.y)){
                map.put(data.y, map.get(data.y) + 1);
            }else {
                map.put(data.y, 1D);
            }
        }      
        for(double v: map.values())
            entropy -= (v/counts * log2(v/counts));
        return entropy;
    }

    /**
     * 計算條件熵
     * @param datas 當前數據集，可以為訓練數據集中的子集
     * @param feature 待計算的特征位置
     * @return 第feature個特征的條件熵
     */
    public double getCondiEntropy(ArrayList<Data> datas, int feature){
        int counts = datas.size();
        double condiEntropy = 0;
        Map<Object, ArrayList<Data>> tmMap = new HashMap<>();
        for(Data data: datas){
            if(tmMap.containsKey(data.x.get(feature))){
                tmMap.get(data.x.get(feature)).add(data);
            }else {
                ArrayList<Data> tmDatas = new ArrayList<>();
                tmDatas.add(data);
                tmMap.put(data.x.get(feature), tmDatas);
            }
        }        
        for(ArrayList<Data> datas2: tmMap.values()){
            condiEntropy += (double)datas2.size()/counts * getEntropy(datas2);
        }
        return condiEntropy;
    }
    
    /**
     * 計算信息增益（ID3算法）
     * @param datas 當前數據集，可以為訓練數據集中的子集
     * @param feature 待計算的特征位置
     * @return 第feature個特征的信息增益
     */
    public double getInfoGain(ArrayList<Data> datas, int feature){
        return getEntropy(datas) - getCondiEntropy(datas, feature);
    }
    
    /**
     * 計算信息增益率（C4.5算法）
     * @param datas 當前數據集，可以為訓練數據集中的子集
     * @param feature 待計算的特征位置
     * @return 第feature個特征的信息增益率
     */
    public double getInfoGainRatio(ArrayList<Data> datas, int feature){
        return getInfoGain(datas, feature)/getEntropy(datas);
    }
}

代碼塊2：決策樹類

package org.juefan.decisiontree;

import java.util.ArrayList;
import java.util.List;

public class TreeNode {
    private String feature;　　//候選特征
    private List<TreeNode> childTreeNode;
    private String targetFunValue;　　//特征對應的值
    private String nodeName;　　//分類的類別
    
    public TreeNode(String nodeName){       
        this.nodeName = nodeName;
        this.childTreeNode = new ArrayList<TreeNode>();
    }
    
    public TreeNode(){
        this.childTreeNode = new ArrayList<TreeNode>();
    }

    public void printTree(){
        if(targetFunValue != null)
            System.out.print("特征值: " + targetFunValue + "\t");
        if(nodeName != null)
            System.out.print("類型: " + nodeName + "\t");
        System.out.println();
        for(TreeNode treeNode: childTreeNode){
            System.out.println("當前特征為：" + feature);
            treeNode.printTree();
        }
    }

    public String getAttributeValue() {
        return feature;
    }

    public void setAttributeValue(String attributeValue) {
        this.feature = attributeValue;
    }

    public List<TreeNode> getChildTreeNode() {
        return childTreeNode;
    }

    public void setChildTreeNode(List<TreeNode> childTreeNode) {
        this.childTreeNode = childTreeNode;
    }

    public String getTargetFunValue() {
        return targetFunValue;
    }

    public void setTargetFunValue(String targetFunValue) {
        this.targetFunValue = targetFunValue;
    }

    public String getNodeName() {
        return nodeName;
    }

    public void setNodeName(String nodeName) {
        this.nodeName = nodeName;
    }
}

代碼塊3：決策樹的生成

package org.juefan.decisiontree;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.juefan.basic.FileIO;
import org.juefan.bayes.Data;

public class DecisionTree {
    public static final double e = 0.1;
    public InfoGain infoGain = new InfoGain();
    
    public TreeNode buildTree(ArrayList<Data> datas, ArrayList<String> featureName){
        TreeNode treeNode = new TreeNode();
        ArrayList<String> feaName = new ArrayList<>();
        feaName = featureName;
        if(isSingle(datas) || getMaxInfoGain(datas) < e){
            treeNode.setNodeName(getLabel(datas).toString());
            return treeNode;
        }else  {
            int feature = getMaxInfoGainFeature(datas);
            treeNode.setAttributeValue(feaName.get(feature + 1));
            ArrayList<String> tList = new ArrayList<>();
            tList = feaName;
            Map<Object, ArrayList<Data>> tMap = new HashMap<>();
            for(Data data: datas){
                if(tMap.containsKey(data.x.get(feature))){
                    Data tData = new Data();
                    for(int i = 0; i < data.x.size(); i++)
                        if(i != feature)
                            tData.x.add(data.x.get(i));
                    tData.y = data.y;
                    tMap.get(data.x.get(feature)).add(tData);
                }else {
                    Data tData = new Data();
                    for(int i = 0; i < data.x.size(); i++)
                        if(i != feature)
                            tData.x.add(data.x.get(i));
                    tData.y = data.y;
                    ArrayList<Data> tDatas = new ArrayList<>();
                    tDatas.add(tData);
                    tMap.put(data.x.get(feature),tDatas);
                }
            }
            List<TreeNode> treeNodes = new ArrayList<>();
            int child = 0;
            for(Object key: tMap.keySet()){
                //這一步太坑爹了，java的拷背坑真多啊，害我浪費了半天的時間
                ArrayList<String> tList2 = new ArrayList<>(tList);
                tList2.remove(feature + 1);
                treeNodes.add(buildTree(tMap.get(key), tList2));
                treeNodes.get(child ++).setTargetFunValue(key.toString());
            }
            treeNode.setChildTreeNode(treeNodes);
            feaName.remove(feature + 1);
        }    
        return treeNode;
    }
    
    /**
     * 獲取實例中的最大類
     * @param datas 實例集
     * @return 出現次數最多的類
     */
    public Object getLabel(ArrayList<Data> datas){
        Map<Object, Integer> map = new HashMap<Object, Integer>();
        Object label = null;
        int max = 0;
        for(Data data: datas){
            if(map.containsKey(data.y)){
                map.put(data.y, map.get(data.y) + 1);
                if(map.get(data.y) > max){
                    max = map.get(data.y);
                    label = data.y;
                }
            }else {
                map.put(data.y, 1);
            }
        }
        return label;
    }
    
    /**
     * 計算信息增益（率）的最大值
     * @param datas
     * @return 最大的信息增益值
     */
    public double getMaxInfoGain(ArrayList<Data> datas){
        double max = 0;
        for(int i = 0; i < datas.get(0).x.size(); i++){
            double temp = infoGain.getInfoGain(datas, i);
            if(temp > max)
                max = temp;
        }
        return max;
    }
    
    /**信息增益最大的特征*/
    public int getMaxInfoGainFeature(ArrayList<Data> datas){
        double max = 0;
        int feature = 0;
        for(int i = 0; i < datas.get(0).x.size(); i++){
            double temp = infoGain.getInfoGain(datas, i);
            if(temp > max){
                max = temp;
                feature = i;
            }
        }
        return feature;
    }
    
    /**判斷是否只有一類*/
    public boolean isSingle(ArrayList<Data> datas){
        Set<Object> set = new HashSet<>();
        for(Data data: datas)
            set.add(data.y);
        return set.size() == 1? true:false;
    }
    
    public static void main(String[] args) {
        ArrayList<Data> datas = new ArrayList<>();
        FileIO fileIO = new FileIO();
        DecisionTree decisionTree = new DecisionTree();
        fileIO.setFileName(".//file//decision.tree.txt");
        fileIO.FileRead("utf-8");
        ArrayList<String> featureName = new ArrayList<>();
        //獲取文件的標頭
        for(String string: fileIO.fileList.get(0).split("\t"))
            featureName.add(string);
        for(int i = 1; i < fileIO.fileList.size(); i++){
            datas.add(new Data(fileIO.fileList.get(i)));
        }
        TreeNode treeNode = new TreeNode();
        treeNode = decisionTree.buildTree(datas, featureName);
        treeNode.printTree();
    }
}

運行情況：

輸入文件 ".//file//decision.tree.txt" 內容為:

類型年齡有工作有自己的房子信貸情況
否青年否否一般
否青年否否好
是青年是否好
是青年是是一般
否青年否否一般
否中年否否一般
否中年否否好
是中年是是好
是中年否是非常好
是中年否是非常好
是老年否是非常好
是老年否是好
是老年是否好
是老年是否非常好
否老年否否一般

運行結果為：

當前特征為：有自己的房子
特征值: 是類型: 是
當前特征為：有自己的房子
特征值: 否
當前特征為：有工作
特征值: 是類型: 是
當前特征為：有工作
特征值: 否類型: 否

對代碼有興趣的可以上本人的GitHub查看：https://github.com/JueFan/StatisticsLearningMethod/

里面也有具體的實例數據

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 5、統計學習方法--決策樹《統計學習方法》——從零實現決策樹【統計學習方法】決策樹統計學習方法之決策樹（2）信息增益比，決策樹的生成算法統計學習方法ｃ++實現之四　決策樹統計學習方法李航---第5章決策樹決策樹（統計學習方法（李航））的貸款的例子的實現李航-統計學習方法-筆記-5：決策樹統計學習五：1.決策樹基本概念統計學習方法（一）