Spark 2.0 Machine Learning Series, Part 3: Decision Trees


Overview

    • A classification decision tree is a tree-structured model describing how instances are classified. A decision tree can be viewed as a set of if-then rules that is mutually exclusive and exhaustive. Decision trees are almost always built with a greedy (non-backtracking) algorithm, constructed top-down by recursive divide-and-conquer.
    • Building a decision tree generally involves three steps: 
      • feature selection
      • tree generation
      • pruning

Types of Decision Tree Algorithms

    • The main decision-tree algorithms are ID3, C4.5, C5.0 and CART. ID3, C4.5, and CART all use a greedy (non-backtracking) algorithm, building the tree top-down by recursive divide-and-conquer. Each split is chosen to maximize the "difference" between the resulting groups; the algorithms differ mainly in how this "difference" is measured.
    • ID3 and CART were invented independently at about the same time, and together they form the foundation of decision-tree induction research.
    • On ID3, see: http://blog.csdn.net/acdreamers/article/details/44661149
      In information theory, the smaller the expected information, the larger the information gain and hence the higher the purity. The core idea of ID3 is to use information gain as the attribute-selection criterion: at each node, split on the attribute whose split yields the largest information gain. The algorithm performs a top-down greedy search through the space of possible decision trees.

         With information gain, a feature's importance is measured by how much information the feature contributes to the classification system: the more information it contributes, the more important the feature. Before looking at information gain, recall the definition of information entropy, where P(C_i) is the probability of class C_i (see the formulas after this list).

   The concept of entropy originated in physics, where it measures the disorder of a thermodynamic system; in information theory, entropy measures uncertainty. Shannon introduced information entropy in 1948, defining it in terms of the probabilities of discrete random events: the more ordered a system, the lower its information entropy, and the more chaotic a system, the higher its information entropy. Information entropy can therefore be regarded as a measure of a system's degree of order.

    • C4.5: improves on ID3 and overcomes two of its shortcomings: 
      (1) selecting attributes by information gain is biased toward attributes with many branches, i.e. attributes with many distinct values; 
      (2) ID3 cannot handle continuous attributes. 
      C4.5 addresses these as follows: 
      (1) it splits the data on the feature with the largest information gain ratio and, like ID3, applies post-pruning; using a ratio largely fixes shortcoming (1) (the gain-ratio formula also appears after this list); 
      (2) continuous features are discretized at classification time.

    • C5.0 is Quinlan's latest release of the algorithm and requires a proprietary license. Compared with C4.5 it uses less memory, builds smaller rulesets, and produces more accurate results; thanks to its improved speed and memory use it is suited to large datasets. C5.0 chooses the split variable by the rate of decrease of information entropy: the best split variable and threshold are those for which entropy drops fastest, a drop in entropy meaning a drop in uncertainty.
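
For reference, here are the textbook formulas referred to above (added in this write-up; the original post did not reproduce them inline). The information entropy of a dataset D with classes C_1, ..., C_k is

H(D) = -\sum_{i=1}^{k} P(C_i) \log_2 P(C_i)

and the information gain of splitting D on attribute A into subsets D_1, ..., D_v, which ID3 maximizes, is

Gain(D, A) = H(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|} H(D_j)

C4.5 instead maximizes the gain ratio, which normalizes the gain by the entropy of the split itself and thereby counteracts the bias toward many-valued attributes:

GainRatio(D, A) = \frac{Gain(D, A)}{SplitInfo(D, A)}, \qquad SplitInfo(D, A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}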

(1) ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.
    (2) C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule's precondition if the accuracy of the rule improves without it.
    (3) C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.
    (4) CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets.

CART is basically similar to C4.5; the main differences are: (1) its leaf nodes are not concrete class labels but a function f() that defines the regression output under that leaf's conditions; (2) CART builds binary trees rather than multiway trees. For regression, split quality is measured by the variance of the targets, as sketched below.
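
For the regression case, a standard "difference" measure, and the one the Spark MLlib docs list as the impurity for regression trees, is the variance of the target values y_i in a node:

Impurity(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} (y_i - \mu)^2, \qquad \mu = \frac{1}{|D|} \sum_{i=1}^{|D|} y_i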

Detailed Analysis of Spark MLlib Decision Tree Code

    • The Spark MLlib decision trees discussed here use Spark 2.0's DataFrame-based API.
    • Decision trees and tree ensembles are a popular family of algorithms for classification and regression problems, with many advantages: 
      • results are easy to interpret;
      • categorical features are handled natively;
      • they extend to multiclass classification;
      • features need not be normalized;
      • interactions between features can be captured. 
        Random forests and boosting algorithms are both ensembles of decision trees. 
        Spark's decision trees handle binary classification, multiclass classification, and regression, with both continuous and categorical features. Because the dataset is partitioned by rows, training can be distributed over large datasets (millions or even billions of instances).
(1) Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
(2) The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances.

When writing a decision-tree workflow as a Spark Pipeline, the following three feature transformers (Feature Transformers) are commonly used but not especially easy to understand (read each together with its example later in this section). 
http://spark.apache.org/docs/latest/ml-features.html#vectorindexer

VectorIndexer

Main purpose: improving the classification performance of ML methods such as decision trees and random forests. 
VectorIndexer indexes the categorical (discrete-valued) features inside a dataset's feature vectors. 
It automatically decides which features are categorical and indexes them. Concretely, you set a parameter maxCategories: a feature whose number of distinct values is at most maxCategories is re-indexed to 0~K (K <= maxCategories-1); a feature with more distinct values than maxCategories is treated as continuous and is not re-indexed (it is left entirely unchanged). This is much easier to follow with the example below.

          VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:

    1. Take an input column of type Vector and a parameter maxCategories. Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories distinct values are declared categorical.
    2. Compute 0-based category indices for each categorical feature.
    3. Index categorical features and transform original feature values to indices.

    Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.

    This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.

Here is an example with a simple dataset:

//Define the input/output columns and set the maximum number of categories
//to 5: a feature (column) with more than 5 distinct values is treated as continuous
VectorIndexerModel featureIndexerModel=new VectorIndexer()
                 .setInputCol("features")
                 .setMaxCategories(5)
                 .setOutputCol("indexedFeatures")
                 .fit(rawData);
//Add it to the Pipeline
Pipeline pipeline=new Pipeline()
                 .setStages(new PipelineStage[]
                         {labelIndexerModel,
                         featureIndexerModel,
                         dtClassifier,
                         converter});
pipeline.fit(rawData).transform(rawData).select("features","indexedFeatures").show(20,false);
//The output looks like this:        
+-------------------------+-------------------------+
|features                 |indexedFeatures          |
+-------------------------+-------------------------+
|(3,[0,1,2],[2.0,5.0,7.0])|(3,[0,1,2],[2.0,1.0,1.0])|
|(3,[0,1,2],[3.0,5.0,9.0])|(3,[0,1,2],[3.0,1.0,2.0])|
|(3,[0,1,2],[4.0,7.0,9.0])|(3,[0,1,2],[4.0,3.0,2.0])|
|(3,[0,1,2],[2.0,4.0,9.0])|(3,[0,1,2],[2.0,0.0,2.0])|
|(3,[0,1,2],[9.0,5.0,7.0])|(3,[0,1,2],[9.0,1.0,1.0])|
|(3,[0,1,2],[2.0,5.0,9.0])|(3,[0,1,2],[2.0,1.0,2.0])|
|(3,[0,1,2],[3.0,4.0,9.0])|(3,[0,1,2],[3.0,0.0,2.0])|
|(3,[0,1,2],[8.0,4.0,9.0])|(3,[0,1,2],[8.0,0.0,2.0])|
|(3,[0,1,2],[3.0,6.0,2.0])|(3,[0,1,2],[3.0,2.0,0.0])|
|(3,[0,1,2],[5.0,9.0,2.0])|(3,[0,1,2],[5.0,4.0,0.0])|
+-------------------------+-------------------------+
Analysis: the feature vector contains 3 features, numbered 0, 1, 2. In row 1, for example, the values 2.0, 5.0, 7.0 are transformed into 2.0, 1.0, 1.0.
Note that only features 1 and 2 were transformed; feature 0 was not. This is because feature 0 takes 6 distinct values (2, 3, 4, 5, 8, 9), more than the earlier setMaxCategories(5)
setting, so it is treated as continuous and is not transformed.
Feature 1: (4, 5, 6, 7, 9) --> (0, 1, 2, 3, 4)
Feature 2: (2, 7, 9) --> (0, 1, 2)

Explanation of the output DataFrame format (row 1):

3 features  indices 0,1,2   values before transformation  
|(3,    [0,1,2],      [2.0,5.0,7.0])|
3 features  indices 0,1,2   values after transformation
|(3,    [0,1,2],      [2.0,1.0,1.0])|

StringIndexer

Once the VectorIndexer above is understood, StringIndexer is easy: it re-indexes the dataset's label column following a similar idea. The example below shows how.

//Define a StringIndexerModel that maps label to indexedLabel
StringIndexerModel labelIndexerModel=new StringIndexer().
                setInputCol("label")
                .setOutputCol("indexedLabel")
                .fit(rawData);
//Add labelIndexerModel to the Pipeline
Pipeline pipeline=new Pipeline()
                 .setStages(new PipelineStage[]
                         {labelIndexerModel,
                         featureIndexerModel,
                         dtClassifier,
                         converter});
//Inspect the result
pipeline.fit(rawData).transform(rawData).select("label","indexedLabel").show(20,false);

Labels are indexed by frequency of occurrence, from 0 to numOfLabels-1 (the number of classes): the most frequent label is mapped to 0, and so on:
label=3 occurs most often (4 times), so it is indexed as 0;
next is label=2 (3 times), indexed as 1; and so on:
+-----+------------+
|label|indexedLabel|
+-----+------------+
|3.0  |0.0         |
|4.0  |3.0         |
|1.0  |2.0         |
|3.0  |0.0         |
|2.0  |1.0         |
|3.0  |0.0         |
|2.0  |1.0         |
|3.0  |0.0         |
|2.0  |1.0         |
|1.0  |2.0         |
+-----+------------+

Two things to watch when applying StringIndexer elsewhere: (1) StringIndexer essentially maps String --> index (number); a numeric input column is first cast to string and then indexed (cast numeric to string and then index the string values), so both strings and numbers can be re-indexed. (2) When a fitted model is used to transform a new dataset, it may encounter unseen labels; see the example below.

StringIndexer indexes strings by frequency:
 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0
 If the mapping (a,b,c) -> (0.0, 2.0, 1.0) was fitted on the data above, then using the model on data whose category column has values beyond (a,b,c), say d and e, runs into trouble:
 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | d        |
 3  | e        |
 4  | a        | 0.0
 5  | c        | 1.0
 Spark provides two ways to handle this:
 StringIndexerModel labelIndexerModel=new StringIndexer().
                setInputCol("label")
                .setOutputCol("indexedLabel")
                //.setHandleInvalid("error")
                .setHandleInvalid("skip")
                .fit(rawData);
 (1) The default, i.e. .setHandleInvalid("error"): throws an exception
 org.apache.spark.SparkException: Unseen label: d,e
 (2) .setHandleInvalid("skip"): ignores the rows carrying those labels and runs normally, producing:
 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 4  | a        | 0.0
 5  | c        | 1.0

IndexToString

Correspondingly, where there is a StringIndexer there should be an IndexToString. After StringIndexer re-indexes the labels, a model is trained on the re-indexed labels and then used for prediction, so the predicted labels are also in re-indexed form and must be mapped back. In the example below, only the mapped-back convetedPrediction column corresponds to the original labels.

Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. 
                .setInputCol("prediction")//Spark默認預測label行
                .setOutputCol("convetedPrediction")//轉換回來的預測label
                .setLabels(labelIndexerModel.labels());//需要指定前面建好相互相互模型
Pipeline pipeline=new Pipeline()
                 .setStages(new PipelineStage[]
                         {labelIndexerModel,
                         featureIndexerModel,
                         dtClassifier,
                         converter});
pipeline.fit(rawData).transform(rawData)
        .select("label","prediction","convetedPrediction").show(20,false);  
+-----+----------+------------------+
|label|prediction|convetedPrediction|
+-----+----------+------------------+
|3.0  |0.0       |3.0               |
|4.0  |1.0       |2.0               |
|1.0  |2.0       |1.0               |
|3.0  |0.0       |3.0               |
|2.0  |1.0       |2.0               |
|3.0  |0.0       |3.0               |
|2.0  |1.0       |2.0               |
|3.0  |0.0       |3.0               |
|2.0  |1.0       |2.0               |
|1.0  |2.0       |1.0               |
+-----+----------+------------------+

Tree Pruning and Decision Tree Parameter Settings in Spark MLlib

**Pruning parameter settings:** Pre-pruning methods "prune" the tree by stopping its construction early, via conditions such as the following:

    • maxDepth: caps the maximum possible depth of the decision tree. The final tree may end up shallower than maxDepth because of other termination conditions or because of pruning.
    • minInfoGain: the minimum information gain (a threshold); a node will not split further if no split exceeds it.
    • minInstancesPerNode: a node will not split further if it holds fewer samples than this threshold. 
      In practice it is quite hard to find a suitable threshold: too high a threshold may oversimplify the tree, while too low a threshold may not simplify it enough.

    • The pre-pruning parameters minInfoGain and minInstancesPerNode in effect work by repeatedly adjusting the stopping conditions until the result looks reasonable. That is not a good approach; in fact, we often do not even know what result we are looking for. This is what motivates post-pruning (post-pruning requires no user-specified parameters and is the more principled pruning method).

    • Does Spark MLlib use any post-pruning? I have not yet worked that out.
    • Of course, post-pruning is not always more effective than pre-pruning. To find the best model, the more sensible practice is to use both techniques together.
DecisionTreeClassifier dtClassifier=new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("indexedFeatures")
            .setMaxDepth(maxDepth)
            //.setMinInfoGain(0.5)
            //.setMinInstancesPerNode(10)
            //.setImpurity("gini")//Gini impurity
            .setImpurity("entropy");//or entropy

From the Spark docs, recursive construction stops at a node when one of the following holds:
     • The node depth is equal to the maxDepth training parameter.
     • No split candidate leads to an information gain greater than minInfoGain.
     • No split candidate produces child nodes which each have at least minInstancesPerNode training instances.

Node impurity and information-gain settings: 
For classification, either of the following can be set: 

.setImpurity("gini")//Gini impurity 
.setImpurity("entropy")//or entropy 

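The original post displayed the impurity formulas as an image. For reference, the textbook definition of Gini impurity is (entropy was given earlier):

Gini(D) = \sum_{i=1}^{k} P(C_i)(1 - P(C_i)) = 1 - \sum_{i=1}^{k} P(C_i)^2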

Evaluating Classification Results

(1) Manual exploration: simply set up a loop over the key parameter maxDepth and the two impurity measures, computing the accuracy for each combination (as done in the complete code below). 
(2) Cross-validation with CrossValidator (a sketch also appears right after this list); see my other article: 
Spark 2.0 model selection and hyperparameter tuning with Pipeline, cross-validation and ParamMap 
http://www.cnblogs.com/itboys/p/8310134.html
(3) For evaluation metrics other than accuracy for two-class classification problems, see my other article: 
Logistic regression parameter settings and classification evaluation (Spark 2.0, Python scikit) 

http://www.cnblogs.com/itboys/p/8315834.html
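
As a sketch of approach (2), and assuming the pipeline, dtClassifier, and training objects from the complete code below, a grid over maxDepth could be cross-validated roughly like this (the grid values and fold count are illustrative, not from the original post):

import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.CrossValidator;
import org.apache.spark.ml.tuning.CrossValidatorModel;
import org.apache.spark.ml.tuning.ParamGridBuilder;

//Build a parameter grid over maxDepth (illustrative values)
ParamMap[] paramGrid=new ParamGridBuilder()
        .addGrid(dtClassifier.maxDepth(), new int[]{2,4,6,8})
        .build();
//3-fold cross-validation over the whole pipeline, scored by accuracy
CrossValidator cv=new CrossValidator()
        .setEstimator(pipeline)
        .setEvaluator(new MulticlassClassificationEvaluator()
                .setLabelCol("indexedLabel")
                .setPredictionCol("prediction")
                .setMetricName("accuracy"))
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3);
CrossValidatorModel cvModel=cv.fit(training);//keeps the best maxDepth

CrossValidator refits the best parameter combination on the full training set, so cvModel.transform(test) can then be evaluated exactly as in the manual loop.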

Complete Spark 2.0 Decision Tree Classification Code

package my.spark.ml.practice; 
import org.apache.log4j.Level; 
import org.apache.log4j.Logger; 
import org.apache.spark.ml.Pipeline; 
import org.apache.spark.ml.PipelineModel; 
import org.apache.spark.ml.PipelineStage; 
import org.apache.spark.ml.classification.DecisionTreeClassificationModel; 
import org.apache.spark.ml.classification.DecisionTreeClassifier; 
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator; 
import org.apache.spark.ml.feature.IndexToString; 
import org.apache.spark.ml.feature.StringIndexer; 
import org.apache.spark.ml.feature.StringIndexerModel; 
import org.apache.spark.ml.feature.VectorIndexer; 
import org.apache.spark.ml.feature.VectorIndexerModel; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession;

public class myDecisionTreeClassifer {

public static void main(String[] args) {
    SparkSession spark=SparkSession
                       .builder()
                       .master("local[4]")
                       .appName("myDecisonTreeClassifer")
                       .getOrCreate();
    //Suppress logging
      Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
      Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF); 
    //-------------------0 Load the data------------------------------------------
    String path="/home/hadoop/spark/spark-2.0.0-bin-hadoop2.6" +
              "/data/mllib/sample_multiclass_classification_data.txt";
            //"/data/mllib/sample_libsvm_data.txt";
    Dataset<Row> rawData=spark.read().format("libsvm").load(path);
    Dataset<Row>[] split=rawData.randomSplit(new double[]{0.8,0.2});
    Dataset<Row> training=split[0];
    Dataset<Row> test=split[1];     

    //rawData.show(100);//sanity check: show 100 rows, with no truncation
    //-------------------1 Build the decision-tree training Pipeline-------------------------------
    //1.1 Re-index the labels
    StringIndexerModel labelIndexerModel=new StringIndexer().
            setInputCol("label")
            .setOutputCol("indexedLabel")
            //.setHandleInvalid("error")
            .setHandleInvalid("skip")
            .fit(rawData);
    //1.2 Re-index the feature vectors
    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 5 distinct values are 
    //treated as continuous.
    //This targets discrete features: discrete feature values get indexed.
    //.setMaxCategories(5) means a feature with more than 5 distinct values
    //is treated as continuous, i.e. indexing no longer applies to it
    VectorIndexerModel featureIndexerModel=new VectorIndexer()
             .setInputCol("features")
             .setMaxCategories(5)
             .setOutputCol("indexedFeatures")
             .fit(rawData);
    //1.3 Decision-tree classifier
    /*DecisionTreeClassifier dtClassifier=
            new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")//使用index后的label
            .setFeaturesCol("indexedFeatures");//使用index后的features
            */
    //1.3 Decision-tree classifier parameter settings
    for(int maxDepth=2;maxDepth<10;maxDepth++){
        DecisionTreeClassifier dtClassifier=new DecisionTreeClassifier()
        .setLabelCol("indexedLabel")
        .setFeaturesCol("indexedFeatures")
        .setMaxDepth(maxDepth)
        //.setMinInfoGain(0.5)
        //.setMinInstancesPerNode(10)
        //.setImpurity("gini")//Gini不純度
        .setImpurity("entropy")//或者熵
        //.setMaxBins(100)//其它可調試的還有一些參數
        ;

        //1.4 Map the indexed prediction label back
    IndexToString converter=new IndexToString()
            .setInputCol("prediction")//自動產生的預測label行名字
            .setOutputCol("convetedPrediction")
            .setLabels(labelIndexerModel.labels());
    //The four stages of the Pipeline
    Pipeline pipeline=new Pipeline()
             .setStages(new PipelineStage[]
                     {labelIndexerModel,
                     featureIndexerModel,
                     dtClassifier,
                     converter});
    //Train the pipeline model on the training set
    PipelineModel pipelineModel=pipeline.fit(training);
    //-----------------------------3 Multiclass evaluation----------------------------
    //Predict on the test set
    Dataset<Row> testPrediction=pipelineModel.transform(test);
    MulticlassClassificationEvaluator evaluator=
            new MulticlassClassificationEvaluator()
                .setLabelCol("indexedLabel")
                .setPredictionCol("prediction")
                .setMetricName("accuracy");
    //Evaluate
    System.out.println("MaxDepth is: "+maxDepth);
    double accuracy= evaluator.evaluate(testPrediction);
    System.out.println("accuracy is: "+accuracy);
    //Print the decision-tree model
    DecisionTreeClassificationModel treeModel =
              (DecisionTreeClassificationModel) (pipelineModel.stages()[2]);
    System.out.println("Learned classification tree model depth"
                    +treeModel.depth()+" numNodes "+treeModel.numNodes());
              //+ treeModel.toDebugString());   //print the whole decision-tree rule set    
    }//maxDepth loop                   
}
}
Classification model performance on the test set:
entropy:
MaxDepth is: 2
accuracy is: 0.896551724137931
Learned classification tree model depth2 numNodes 5
MaxDepth is: 3
accuracy is: 0.9310344827586207
Learned classification tree model depth3 numNodes 7
MaxDepth is: 4
accuracy is: 0.9310344827586207
Learned classification tree model depth4 numNodes 9
MaxDepth is: 5
accuracy is: 0.9310344827586207
Learned classification tree model depth5 numNodes 11
MaxDepth is: 6
accuracy is: 0.9310344827586207
Learned classification tree model depth5 numNodes 11

Gini:
MaxDepth is: 2
accuracy is: 0.8928571428571429
Learned classification tree model depth2 numNodes 5
MaxDepth is: 3
accuracy is: 0.9642857142857143
Learned classification tree model depth3 numNodes 9
MaxDepth is: 4
accuracy is: 0.9285714285714286
Learned classification tree model depth4 numNodes 13
MaxDepth is: 5
accuracy is: 0.9285714285714286
Learned classification tree model depth4 numNodes 13
MaxDepth is: 6
accuracy is: 0.9285714285714286
Learned classification tree model depth4 numNodes 13

In addition, treeModel.toDebugString() yields a rule set like the following:
MaxDepth is: 3
accuracy is: 0.9666666666666667
Learned classification tree model depthDecisionTreeClassificationModel 
(uid=dtc_62e3aea12022) of depth 3 with 9 nodes
  If (feature 2 <= -0.694915)
   Predict: 0.0
  Else (feature 2 > -0.694915)
   If (feature 3 <= 0.25)
    If (feature 2 <= 0.322034)
     Predict: 2.0
    Else (feature 2 > 0.322034)
     Predict: 1.0
   Else (feature 3 > 0.25)
    If (feature 2 <= 0.288136)
     Predict: 1.0
    Else (feature 2 > 0.288136)
     Predict: 1.0

Conclusions:

    • Increasing the tree depth generally yields a more accurate model, but the deeper the tree, the more complex the model and the more severely it overfits the training set.
    • The choice between the two impurity measures makes comparatively little difference to performance. 
      (These conclusions follow Machine Learning with Spark, p. 113.)

Other Algorithms

There are several ways to do post-pruning; here is a fairly simple one. 
Pseudocode:

Split the test set according to the existing tree:
    If any subset is itself a subtree, recursively run the pruning procedure on that subset
    Compute the error after merging the current two leaf nodes (1)
    Compute the error without merging
    If merging would lower the error, merge the leaf nodes

The error in (1) is the total squared error: for each data point, take the squared
difference between its value and the mean, then sum. It is one way of expressing
disorder: the more disordered the data, the larger the value. A drop in this error
means the node has become "purer", hence "if merging would lower the error, merge
the leaf nodes". A runnable sketch of this procedure follows.
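
Below is a minimal, self-contained Java sketch of this merge-based post-pruning, assuming a toy regression-tree Node class; this is hypothetical illustration code, not Spark MLlib's internal tree API. Each data row is a double[] whose last element is the target value.

import java.util.Arrays;

class Node {
    Node left, right;    // both null for a leaf
    int feature;         // split feature index (internal nodes only)
    double threshold;    // rows with row[feature] <= threshold go left
    double prediction;   // mean target of the training rows that reached this node

    boolean isLeaf() { return left == null && right == null; }
}

public class PostPruning {
    // Total squared error of predicting 'value' for every row's target.
    static double squaredError(double[][] rows, double value) {
        double sum = 0.0;
        for (double[] row : rows) {
            double diff = row[row.length - 1] - value;
            sum += diff * diff;
        }
        return sum;
    }

    // Test rows routed to the left or right child by this node's split rule.
    static double[][] route(double[][] rows, Node node, boolean goLeft) {
        return Arrays.stream(rows)
                .filter(r -> goLeft ? r[node.feature] <= node.threshold
                                    : r[node.feature] > node.threshold)
                .toArray(double[][]::new);
    }

    // Prune the tree bottom-up against held-out test rows.
    static Node prune(Node node, double[][] testRows) {
        if (node.isLeaf() || testRows.length == 0) return node;
        // Split the test data by the existing tree and recurse into the subtrees first.
        double[][] leftRows = route(testRows, node, true);
        double[][] rightRows = route(testRows, node, false);
        node.left = prune(node.left, leftRows);
        node.right = prune(node.right, rightRows);
        // If both children are now leaves, compare merged vs. unmerged test error.
        if (node.left.isLeaf() && node.right.isLeaf()) {
            double errorNoMerge = squaredError(leftRows, node.left.prediction)
                                + squaredError(rightRows, node.right.prediction);
            double merged = (node.left.prediction + node.right.prediction) / 2.0;
            double errorMerge = squaredError(testRows, merged);
            if (errorMerge < errorNoMerge) { // merging lowers the error: collapse
                node.left = null;
                node.right = null;
                node.prediction = merged;
            }
        }
        return node;
    }
}

One caveat: when no test rows reach a node, this sketch leaves the subtree unchanged, which is just one reasonable choice; some implementations collapse such subtrees instead.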

This article draws on the following blog posts:

(1) http://blog.csdn.net/gumpeng/article/details/51397737 
    A Python version with very detailed comments and the most accessible examples. 
(2) ID3: 
    http://blog.csdn.net/acdreamers/article/details/44661149 
(3) scikit-learn decision trees: 
    http://blog.csdn.net/sandyzhs/article/details/46814805 
(4) The difference between Gini impurity and entropy: 
    http://blog.csdn.net/lingtianyulong/article/details/34522757 
(5) Algorithm Grocery Store: classification algorithms, decision trees: 
    http://www.cnblogs.com/leoo2sk/archive/2010/09/19/decision-tree.html 
(6) Decision tree pruning: 
    http://blog.sina.com.cn/s/blog_4e4dec6c0101fdz6.html 
(7) OpenCV 2.4.9 source analysis: Decision Trees: 
    http://blog.csdn.net/zhaocj/article/details/50503450 
    Contains very detailed code and theory explanations.

