Spark Random Forest Classifier

Reposted article (2020-03-04 11:51). Tag: spark

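1. Overview

A random forest is an ensemble of decision trees. Random forests are among the most successful machine learning models for classification and regression. They combine many decision trees to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and can capture non-linearities and feature interactions.

spark.mllib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features.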

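Basic algorithm

A random forest trains a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is slightly different. Combining the predictions of all trees reduces the variance of the predictions, which improves performance on test data.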
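
The idea of combining per-tree predictions can be sketched in plain Scala. This is a minimal illustration of majority voting for classification, not Spark's implementation; `MajorityVoteSketch`, the `Tree` type alias, and the three toy trees are hypothetical names and values made up for this example.

```scala
object MajorityVoteSketch {
  // A "tree" here is just a function from a feature vector to a class label.
  type Tree = Array[Double] => Int

  // Each tree casts one vote; the forest predicts the class with the most votes.
  def predict(trees: Seq[Tree], features: Array[Double]): Int =
    trees.map(tree => tree(features))
      .groupBy(identity)
      .maxBy { case (_, votes) => votes.size }
      ._1

  def main(args: Array[String]): Unit = {
    // Three hypothetical one-split trees that disagree for some inputs.
    val trees: Seq[Tree] = Seq(
      (f: Array[Double]) => if (f(0) > 2.0) 1 else 0,
      (f: Array[Double]) => if (f(0) > 2.5) 1 else 0,
      (f: Array[Double]) => if (f(1) > 1.0) 1 else 0
    )
    println(predict(trees, Array(2.2, 1.5)))
  }
}
```

For the input (2.2, 1.5), two of the three toy trees vote for class 1, so the program prints 1. Ties are broken arbitrarily in this sketch.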

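Training

The randomness injected into the training process includes:

    Subsampling the original dataset on each iteration to get a different training set (also known as bootstrapping).
    Considering a different random subset of features to split on at each tree node.

Apart from these randomizations, each tree is trained in the same way as an individual decision tree.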
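
The two sources of randomness above can be illustrated without Spark. The sketch below is an assumption-laden toy: `RandomizationSketch`, `bootstrapIndices`, and `featureSubset` are hypothetical helpers, and Spark's own sampling is implemented differently. It only shows what "sample rows with replacement" and "pick a random subset of features" mean.

```scala
import scala.util.Random

object RandomizationSketch {
  // Bootstrap: draw n row indices with replacement from 0 until n.
  def bootstrapIndices(n: Int, rng: Random): Vector[Int] =
    Vector.fill(n)(rng.nextInt(n))

  // Pick k distinct feature indices out of numFeatures (no replacement).
  def featureSubset(numFeatures: Int, k: Int, rng: Random): Vector[Int] =
    rng.shuffle((0 until numFeatures).toVector).take(k).sorted

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // fixed seed so repeated runs give the same draws
    val numRows = 10
    val numFeatures = 4

    // Each tree gets its own bootstrap sample of the rows.
    for (tree <- 1 to 3) {
      val rows = bootstrapIndices(numRows, rng)
      println(s"tree $tree trains on rows ${rows.mkString(",")}")
    }

    // Each node considers only a subset of features; the square root of the
    // feature count is a common choice for classification.
    val k = math.max(1, math.sqrt(numFeatures).toInt)
    println(s"candidate features at a node: ${featureSubset(numFeatures, k, rng).mkString(",")}")
  }
}
```

Because rows are drawn with replacement, some indices typically appear more than once in a sample while others are left out, which is what makes each tree's training set different.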

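Parameters

    numTrees: the number of trees in the forest.
        Increasing the number of trees decreases the variance of the predictions, which improves the model's test-time accuracy.
        Training time increases roughly linearly with the number of trees.
    maxDepth: the maximum depth of each tree in the forest.
        Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are more prone to overfitting.
        In general, it is acceptable to train deeper trees with a random forest than with a single decision tree. A single tree is more likely to overfit than a random forest, because averaging over the many trees in the forest reduces variance.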
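
As a rough sketch of how these two parameters are set with the DataFrame-based `RandomForestClassifier`, assuming Spark ML is on the classpath: the object name `ForestParamsSketch` and the values 50, 8 and 42 are illustrative assumptions, not recommendations, and the column names simply mirror the pipeline used later in this article.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

object ForestParamsSketch {
  // Builds an untrained classifier; no SparkSession is needed just to set parameters.
  def buildClassifier(): RandomForestClassifier =
    new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(50) // illustrative: more trees lower the variance, at roughly linear training cost
      .setMaxDepth(8)  // illustrative: deeper trees are more expressive but slower to train
      .setSeed(42L)    // fixes the random sampling so that runs are repeatable

  def main(args: Array[String]): Unit = {
    val rf = buildClassifier()
    println(s"numTrees=${rf.getNumTrees}, maxDepth=${rf.getMaxDepth}")
  }
}
```

A reasonable way to choose the values is to increase numTrees until test accuracy stops improving, then adjust maxDepth while watching the gap between training and test accuracy.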

2. Code

package com.home.spark.ml

import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
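import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumed definition: the listing uses Iris without showing it.
// One row of the iris dataset: a feature vector plus the species name.
// Declared at the top level so that toDF() can derive an encoder for it.
case class Iris(features: Vector, label: String)

object Ex_RandomForests {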
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf(true).setMaster("local[2]").setAppName("spark ml")
    val spark = SparkSession.builder().config(conf).getOrCreate()

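    // Converting an RDD to a DataFrame or Dataset requires the implicit conversions of a SparkSession instance.
    // Import them here; note that `spark` is the name of the SparkSession value, not a package.
    import spark.implicits._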

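    // Load and parse the data file, converting it to a DataFrame.
    // Alternative input in LIBSVM format:
    // val data = spark.read.format("libsvm").load("input/sample_libsvm_data.txt")

    // Each line of iris.data.txt holds four numeric measurements followed by the species name.
    val rawData = spark.sparkContext.textFile("input/iris.data.txt")
      .filter(_.trim.nonEmpty) // skip blank lines, such as an empty line at the end of the file
      .map(_.split(","))
      .map(a => Iris(
        Vectors.dense(a(0).toDouble, a(1).toDouble, a(2).toDouble, a(3).toDouble),
        a(4)))
      .toDF()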

    rawData.createOrReplaceTempView("iris")
    val data = spark.sql("select * from iris")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)
    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(30,false)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))

    val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
    println("Learned classification forest model:\n" + rfModel.toDebugString)

    // Release the local Spark resources once the example has finished.
    spark.stop()
  }
}
