The xgboost source tree contains a Spark-with-DataFrame implementation under jvm-packages: https://github.com/dmlc/xgboost/tree/master/jvm-packages. For various reasons, that code would not compile in my IDE, so I wrote the following code for future reference.
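To build code like this outside the jvm-packages tree, the project needs the xgboost4j and xgboost4j-spark artifacts on the classpath. A plausible sbt setup is sketched below; the version numbers are assumptions and must be matched to your own Spark and Scala build:

```scala
// build.sbt — versions are illustrative, not prescriptive; align them
// with the Spark distribution on your cluster before compiling.
libraryDependencies ++= Seq(
  "ml.dmlc" % "xgboost4j" % "0.7",
  "ml.dmlc" % "xgboost4j-spark" % "0.7",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided"
)
```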
package xgboost

import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostModel}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object App {
  def main(args: Array[String]): Unit = {
    val trainPath: String = "xxx/train.txt"
    val testPath: String = "xxx/test.txt"
    val binaryModelPath: String = "xxx/model.binary"
    val textModelPath: String = "xxx/model.txt"

    val spark = SparkSession
      .builder()
      .master("yarn")
      .getOrCreate()

    // define xgboost parameters
    val maxDepth = 3
    val numRound = 4
    val nWorker = 1
    val paramMap = List(
      "eta" -> 0.1,
      "max_depth" -> maxDepth,
      "objective" -> "binary:logistic").toMap

    // read libsvm files
    val dfTrain = spark.read.format("libsvm").load(trainPath).toDF("labelCol", "featureCol")
    val dfTest = spark.read.format("libsvm").load(testPath).toDF("labelCol", "featureCol")
    dfTrain.show(true)
    printf("begin...")

    val model: XGBoostModel = XGBoost.trainWithDataFrame(dfTrain, paramMap, numRound, nWorker,
      useExternalMemory = true,
      featureCol = "featureCol", labelCol = "labelCol",
      missing = 0.0f)

    // predict the test set
    val predict: DataFrame = model.transform(dfTest)
    val scoreAndLabels = predict.select(model.getPredictionCol, model.getLabelCol)
      .rdd
      .map { case Row(score: Double, label: Double) => (score, label) }

    // compute the AUC on the test set
    val metric = new BinaryClassificationMetrics(scoreAndLabels)
    val auc = metric.areaUnderROC()
    println("auc:" + auc)

    // save the model in both binary and text form
    saveBinaryModel(model, spark, binaryModelPath)
    saveTextModel(model, spark, textModelPath, numRound, maxDepth)
  }

  def saveBinaryModel(model: XGBoostModel, spark: SparkSession, path: String): Unit = {
    model.saveModelAsHadoopFile(path)(spark.sparkContext)
  }

  def saveTextModel(model: XGBoostModel, spark: SparkSession, path: String, numRound: Int, maxDepth: Int): Unit = {
    // dump each booster tree as text, prefixed with its index
    val dumpModel = model
      .booster
      .getModelDump()
      .toList
      .zipWithIndex
      .map(x => s"booster:[${x._2}]\n${x._1}")
    val header = s"numRound: $numRound, maxDepth: $maxDepth"
    print(dumpModel)
    import spark.implicits._
    val text: List[String] = header +: dumpModel
    text.toDF
      .coalesce(1)
      .write
      .mode("overwrite")
      .text(path)
  }
}
Notes:
1. Both the training set and the test set are in libsvm format, for example:
1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
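As a quick illustration of the format, each line is `<label> <index>:<value> <index>:<value> …`. A hand-rolled parser (a sketch independent of Spark; `parseLibsvmLine` is a hypothetical helper, not part of any library) might look like:

```scala
object LibsvmDemo {
  // Parse one libsvm-format line: "<label> <index>:<value> <index>:<value> ..."
  // Returns the label and a sparse map of feature index -> value.
  def parseLibsvmLine(line: String): (Double, Map[Int, Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { kv =>
      val Array(idx, value) = kv.split(":")
      idx.toInt -> value.toDouble
    }.toMap
    (label, features)
  }
}
```

Note that Spark's built-in `libsvm` data source goes further: it treats the file's indices as one-based and converts them to a zero-based sparse vector when loading.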
2. The final dumped model looks like this:
numRound: 4, maxDepth: 3
booster:[0]
0:[f29<2] yes=1,no=2,missing=2
	1:leaf=0.152941
	2:leaf=-0.191209
booster:[1]
0:[f29<2] yes=1,no=2,missing=2
	1:leaf=0.141901
	2:leaf=-0.174499
booster:[2]
0:[f29<2] yes=1,no=2,missing=2
	1:leaf=0.132731
	2:leaf=-0.161685
booster:[3]
0:[f29<2] yes=1,no=2,missing=2
	1:leaf=0.124972
	2:leaf=-0.15155
Explanation: "numRound: 4, maxDepth: 3" means 4 trees were generated, each with a maximum depth of 3. booster:[n] marks the n-th tree, followed by that tree's structure. Node 0 is the root; every internal node has two children, numbered in level order, so nodes 1 and 2 are the children of root node 0. Nodes at the same depth share the same indentation, one level deeper than their parent.
Each node line starts with the node id, and the brackets name the feature used for the split (f29 is index 29 of the zero-based feature vector Spark builds, which corresponds to feature 30 in the one-based libsvm file) together with the split condition. "[f29<2] yes=1,no=2,missing=2" means: if f29 < 2 the sample goes to node 1; if f29 >= 2 it goes to node 2; and if the feature is missing (None) it also goes to node 2.
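The dump above is enough to score a sample by hand. Under binary:logistic, the raw leaf scores of the trees a sample lands in are summed and passed through a sigmoid. For a sample with f29 = 1, every tree routes it to node 1, so the positive-class probability is the sigmoid of the four "yes" leaf values. A minimal sketch (the object and method names are mine, for illustration only):

```scala
object ScoreByHand {
  // Leaf values of node 1 (the f29<2 branch) in each of the 4 boosters above.
  val yesLeaves = Seq(0.152941, 0.141901, 0.132731, 0.124972)

  // binary:logistic maps the summed raw leaf scores through a sigmoid.
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Probability of the positive class for a sample that falls into
  // node 1 of every tree (i.e. f29 < 2 in each split).
  def positiveProb: Double = sigmoid(yesLeaves.sum)
}
```

This gives roughly 0.6347, which matches the second entry of the probabilities column in the prediction output below; summing the node-2 leaves instead reproduces the 0.3365 entry of the other row.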
3. The prediction output looks like this:
|labelCol|featureCol|probabilities|prediction|
|1.0|(126,[2,9,10,20,29,33,35,39,40,52,57,64,68,76,85,87,91,94,101,104,116,123],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|[0.3652743101119995,0.6347256898880005]|1.0|
|0.0|(126,[2,9,19,20,22,33,35,38,40,52,55,64,68,76,85,87,91,94,101,105,115,119],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|[0.6635029911994934,0.3364970088005066]|0.0|
