I. Distributed Estimation of Pi
1. Computation Principle
Suppose a square with side length x has area S = x². Its inscribed circle has radius x/2 and thus area C = Pi×(x/2)² = Pi×x²/4. The ratio of circle area to square area is therefore C/S = Pi/4, which gives Pi = 4×C/S.
A computer can generate a large number of random points uniformly distributed inside the square and approximate the areas by point counts. Let Ps be the number of points inside the square and Pc the number that also fall inside the circle; as the number of random points tends to infinity, 4×Pc/Ps converges to Pi.
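To separate the math from the Spark mechanics, here is a minimal single-machine sketch of the same estimator in plain Scala (the sample size 100000 is an arbitrary choice for illustration):

import scala.math.random

object LocalPi {
  def main(args: Array[String]): Unit = {
    val n = 100000                 // number of random points (arbitrary)
    val inCircle = (1 to n).count { _ =>
      val x = random * 2 - 1       // x and y are uniform in [-1, 1),
      val y = random * 2 - 1       // so (x, y) is uniform over the square
      x * x + y * y < 1            // does the point fall inside the inscribed circle?
    }
    println("Pi is roughly " + 4.0 * inCircle / n)
  }
}

The Spark program below parallelizes exactly this loop: each partition draws its share of the points, and reduce sums the in-circle counts.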
2. Running Directly in IDEA
(1) Start IDEA, then Create New Project - Scala - select the JDK and Scala SDK (Create - Browse - all jars under /home/jun/scala-2.12.6/lib) - Finish
(2) Right-click src - New - Package - enter com.jun - OK
(3) File - Project Structure - Libraries - + - Java - all jars under /home/jun/spark-2.3.1-bin-hadoop2.7/jars - OK
(4) Right-click com.jun - New - Scala Class - Name: sparkPi, Kind: Object - OK, then enter the following code in the editor:
package com.jun

import scala.math.random
import org.apache.spark._

object sparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("spark Pi")
    val spark = new SparkContext(conf)
    // Number of partitions; each partition samples 100000 points.
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1   // x and y are uniform in [-1, 1)
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
(5) Run - Edit Configurations - + - Application - fill in the run configuration - OK (a typical configuration is sketched below)
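The code calls setAppName but not setMaster, so the master URL must come from the run configuration. The author's exact parameters are not shown; a plausible setup (an assumption, not the original) is Main class com.jun.sparkPi with the master passed through VM options:

-Dspark.master=local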
(6) Right-click in the code editor - Run 'sparkPi'
This produced an error. The problem is a version mismatch: the Spark website shows that spark-2.3.1 supports only scala-2.11.x, so the Scala SDK has to be replaced with a 2.11 version.
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
	at org.apache.spark.internal.config.ConfigHelpers$.stringToSeq(ConfigBuilder.scala:48)
	at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:124)
	at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:124)
	at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:142)
	at org.apache.spark.internal.config.package$.<init>(package.scala:152)
	at org.apache.spark.internal.config.package$.<clinit>(package.scala)
	at org.apache.spark.SparkConf$.<init>(SparkConf.scala:668)
	at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
	at org.apache.spark.SparkConf.set(SparkConf.scala:94)
	at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:76)
	at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:75)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:789)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:231)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:462)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:462)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:788)
	at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
	at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
	at org.apache.spark.SparkConf.<init>(SparkConf.scala:57)
	at com.jun.sparkPi$.main(sparkPi.scala:8)
	at com.jun.sparkPi.main(sparkPi.scala)
Process finished with exit code 1
The Spark website states the following in its introduction to the spark 2.3.1 release, so the Scala version was changed to 2.11.8. The IDEA build and the downloaded Scala plugin versions then failed to match, so in the end the Scala plugin was installed over the network instead.
Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.1 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
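The required Scala binary version is also encoded in the _2.11 suffix of Spark's artifact names. As an aside (this project adds the jars by hand rather than using a build tool), with sbt the same constraint would be pinned like this:

scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"  // %% resolves to spark-core_2.11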
Running it again, the result can be found amid a pile of log output:
2018-07-24 11:00:17 INFO DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 0.779 s
2018-07-24 11:00:17 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 1.286323 s
Pi is roughly 3.13792
2018-07-24 11:00:18 INFO AbstractConnector:318 - Stopped Spark@2c9399a4{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-24 11:00:18 INFO BlockManagerInfo:54 - Removed broadcast_0_piece0 on master:35290 in memory (size: 1176.0 B, free: 323.7 MB)
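The error of this run matches what the math predicts. Each point lands in the circle with probability p = Pi/4 ≈ 0.7854, so with n = 200000 points (2 slices × 100000) the standard deviation of the estimate 4×Pc/n is 4×sqrt(p(1−p)/n) ≈ 4×sqrt(0.1686/200000) ≈ 0.0037. The observed deviation |3.13792 − 3.14159| ≈ 0.0037 is right at one standard deviation.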
3. Preparation for Distributed Running
Distributed running means submitting a jar from a client command line to the Spark cluster, so the program above must first be compiled and packaged into a jar.
(1) File - Project Structure - Artifacts - + - JAR - From modules with dependencies - set Main Class to com.jun.sparkPi - OK - under Output Layout keep only a single compile output - OK
(2) Build - Build Artifacts - Build
(3) Copy the resulting jar into the Spark installation directory:
[jun@master bin]$ cp /home/jun/IdeaProjects/sparkAPP/out/artifacts/sparkAPP_jar/sparkAPP.jar /home/jun/spark-2.3.1-bin-hadoop2.7/
4. Distributed Running
(1) Local mode
[jun@master bin]$ /home/jun/spark-2.3.1-bin-hadoop2.7/bin/spark-submit --master local --class com.jun.sparkPi /home/jun/spark-2.3.1-bin-hadoop2.7/sparkAPP.jar
The result is printed on the local command line:
2018-07-24 11:12:21 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 34 ms on localhost (executor driver) (2/2)
2018-07-24 11:12:21 INFO DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 1.591 s
2018-07-24 11:12:21 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-07-24 11:12:21 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 1.833831 s
Pi is roughly 3.14082
2018-07-24 11:12:21 INFO AbstractConnector:318 - Stopped Spark@285f09de{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-24 11:12:21 INFO SparkUI:54 - Stopped Spark web UI at http://master:4040
2018-07-24 11:12:21 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-07-24 11:12:21 INFO MemoryStore:54 - MemoryStore cleared
2018-07-24 11:12:21 INFO BlockManager:54 - BlockManager stopped
(2) Hadoop YARN cluster mode
[jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit --master yarn --deploy-mode cluster sparkAPP.jar
The command line returns the processing information (--class can be omitted here because the jar's manifest already records com.jun.sparkPi as the main class):
2018-07-24 11:17:14 INFO Client:54 -
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 192.168.1.102
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1532402191014
	 final status: SUCCEEDED
	 tracking URL: http://master:18088/proxy/application_1532394200431_0002/
	 user: jun
The result is viewed in stdout under the logs linked from the tracking URL:
2018-07-24 11:17:14 INFO DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 0.910 s
2018-07-24 11:17:14 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 0.970826 s
Pi is roughly 3.14076
2018-07-24 11:17:14 INFO AbstractConnector:318 - Stopped Spark@76017b73{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
2018-07-24 11:17:14 INFO SparkUI:54 - Stopped Spark web UI at http://slave1:41837
2018-07-24 11:17:14 INFO YarnAllocator:54 - Driver requested a total number of 0 executor(s).
(3) Hadoop YARN client mode
[jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit --master yarn --deploy-mode client sparkAPP.jar
The result can be read directly on the local client:
2018-07-24 11:20:21 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 3592 ms on slave1 (executor 1) (2/2)
2018-07-24 11:20:21 INFO YarnScheduler:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-07-24 11:20:21 INFO DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 12.041 s
2018-07-24 11:20:21 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 13.017473 s
Pi is roughly 3.1387
2018-07-24 11:20:22 INFO AbstractConnector:318 - Stopped Spark@29a6924f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-24 11:20:22 INFO SparkUI:54 - Stopped Spark web UI at http://master:4040
2018-07-24 11:20:22 INFO YarnClientSchedulerBackend:54 - Interrupting monitor thread
2018-07-24 11:20:22 INFO YarnClientSchedulerBackend:54 - Shutting down all executors
2018-07-24 11:20:22 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor t
5. Code Analysis
TODO
II. Loan Risk Prediction Based on Spark MLlib
1. Computation Principle
There is a CSV file (germancredit.csv, referenced by the code below) that stores a user credit dataset. For example:
1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1
1,1,9,4,0,2799,1,3,2,3,1,2,1,36,3,1,2,3,2,1,1
1,2,12,2,9,841,2,4,2,2,1,4,1,23,3,1,1,2,1,1,1
1,1,12,4,0,2122,1,3,3,3,1,2,1,39,3,1,2,2,2,1,2
1,1,12,4,0,2171,1,3,4,3,1,4,2,38,1,2,2,2,1,1,2
1,1,10,4,0,2241,1,2,1,3,1,3,1,48,3,1,2,2,2,1,2
In this credit dataset, each sample is labeled with one of two classes: 1 (creditable) or 0 (not creditable). Each sample has 21 fields. The first field, 1 or 0, is the creditability label; the other 20 feature fields are: balance, duration, history, purpose, amount, savings, employment status, installment percent, marital status, guarantors, residence duration, assets, age, concurrent credits, apartment, number of existing credits, occupation, dependents, telephone, and foreign-worker status.
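To make the encoding concrete, this is how the first sample line above is turned into a training row by the parseCredit helper in the program below (a sketch: 1-based categorical codes are shifted down by one, while fields such as duration, amount, and age are kept as-is):

val line = "1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1"
val values = line.split(",").map(_.toDouble)
// label:   values(0)      -> creditability = 1.0
// shifted: values(1) - 1  -> balance       = 0.0
// kept:    values(2)      -> duration      = 18.0
// kept:    values(5)      -> amount        = 1049.0
// kept:    values(13)     -> age           = 21.0

These values match the first row printed by creditDF.show in the output further below.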
The program uses a random forest model, an ensemble of decision trees, to classify and predict the risk of bank credit loans.
2. Running the Program
(1) Create a new Scala project, package, and class in IDEA, configure the Project SDK and Scala SDK, copy the CSV file into the new project, and paste the following code into the editor:
package com.jun

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }
import org.apache.spark.ml.{ Pipeline, PipelineStage }
import org.apache.spark.mllib.evaluation.RegressionMetrics

object Credit {
  // One row of the credit dataset; the first field is the label.
  case class Credit(
    creditability: Double,
    balance: Double, duration: Double, history: Double, purpose: Double, amount: Double,
    savings: Double, employment: Double, instPercent: Double, sexMarried: Double, guarantors: Double,
    residenceDuration: Double, assets: Double, age: Double, concCredit: Double, apartment: Double,
    credits: Double, occupation: Double, dependents: Double, hasPhone: Double, foreign: Double
  )

  // Map raw CSV values to the case class; the 1-based categorical codes are shifted to 0-based.
  def parseCredit(line: Array[Double]): Credit = {
    Credit(
      line(0),
      line(1) - 1, line(2), line(3), line(4), line(5),
      line(6) - 1, line(7) - 1, line(8), line(9) - 1, line(10) - 1,
      line(11) - 1, line(12) - 1, line(13), line(14) - 1, line(15) - 1,
      line(16) - 1, line(17) - 1, line(18) - 1, line(19) - 1, line(20) - 1
    )
  }

  def parseRDD(rdd: RDD[String]): RDD[Array[Double]] = {
    rdd.map(_.split(",")).map(_.map(_.toDouble))
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkDFebay")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._

    // Load the CSV, parse it into Credit rows, and cache the DataFrame.
    val creditDF = parseRDD(sc.textFile("germancredit.csv")).map(parseCredit).toDF().cache()
    creditDF.registerTempTable("credit")
    creditDF.printSchema
    creditDF.show

    // A few exploratory queries over the data.
    sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, avg(duration) as avgdur FROM credit GROUP BY creditability ").show
    creditDF.describe("balance").show
    creditDF.groupBy("creditability").avg("balance").show

    // Assemble the 20 feature columns into a single vector column.
    val featureCols = Array("balance", "duration", "history", "purpose", "amount",
      "savings", "employment", "instPercent", "sexMarried", "guarantors",
      "residenceDuration", "assets", "age", "concCredit", "apartment",
      "credits", "occupation", "dependents", "hasPhone", "foreign")
    val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
    val df2 = assembler.transform(creditDF)
    df2.show

    // Index the label column.
    val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")
    val df3 = labelIndexer.fit(df2).transform(df2)
    df3.show

    // 70/30 train/test split with a fixed seed for reproducibility.
    val splitSeed = 5043
    val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)

    // Train a random forest with hand-picked hyperparameters.
    val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3)
      .setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
    val model = classifier.fit(trainingData)

    val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
    val predictions = model.transform(testData)
    model.toDebugString
    val accuracy = evaluator.evaluate(predictions)
    println("accuracy before pipeline fitting" + accuracy)

    val rm = new RegressionMetrics(
      predictions.select("prediction", "label").rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
    )
    println("MSE: " + rm.meanSquaredError)
    println("MAE: " + rm.meanAbsoluteError)
    println("RMSE Squared: " + rm.rootMeanSquaredError)
    println("R Squared: " + rm.r2)
    println("Explained Variance: " + rm.explainedVariance + "\n")

    // Grid search over random-forest hyperparameters with 10-fold cross-validation.
    val paramGrid = new ParamGridBuilder()
      .addGrid(classifier.maxBins, Array(25, 31))
      .addGrid(classifier.maxDepth, Array(5, 10))
      .addGrid(classifier.numTrees, Array(20, 60))
      .addGrid(classifier.impurity, Array("entropy", "gini"))
      .build()

    val steps: Array[PipelineStage] = Array(classifier)
    val pipeline = new Pipeline().setStages(steps)

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(10)

    val pipelineFittedModel = cv.fit(trainingData)
    val predictions2 = pipelineFittedModel.transform(testData)
    val accuracy2 = evaluator.evaluate(predictions2)
    println("accuracy after pipeline fitting" + accuracy2)

    // Inspect the best model found by the cross-validated grid search.
    println(pipelineFittedModel.bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel].stages(0))
    pipelineFittedModel
      .bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel]
      .stages(0)
      .extractParamMap

    val rm2 = new RegressionMetrics(
      predictions2.select("prediction", "label").rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
    )
    println("MSE: " + rm2.meanSquaredError)
    println("MAE: " + rm2.meanAbsoluteError)
    println("RMSE Squared: " + rm2.rootMeanSquaredError)
    println("R Squared: " + rm2.r2)
    println("Explained Variance: " + rm2.explainedVariance + "\n")
  }
}
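Note the cost of the tuning step: the grid enumerates 2 maxBins × 2 maxDepth × 2 numTrees × 2 impurity = 16 parameter combinations, and with setNumFolds(10) the CrossValidator fits one model per combination per fold, i.e. 160 random-forest fits, before refitting the best configuration on the full training set.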
(2) Edit the launch configuration: Edit Configurations - Application - Name (Credit), Main Class (com.jun.Credit), Program arguments (/home/jun/IdeaProjects/Credit), VM options (-Dspark.master=local -Dspark.app.name=Credit -server -XX:PermSize=128M -XX:MaxPermSize=256M)
(3) Run Credit
(4) The console output is swamped with INFO log lines and the results are hard to see. To hide them, copy Spark's default log configuration file from the installation directory into the project's src directory and lower the log level shown on the console:
[jun@master conf]$ cp /home/jun/spark-2.3.1-bin-hadoop2.7/conf/log4j.properties.template /home/jun/IdeaProjects/Credit/src/
[jun@master conf]$ cd /home/jun/IdeaProjects/Credit/src/
[jun@master src]$ mv log4j.properties.template log4j.properties
[jun@master src]$ gedit log4j.properties
In the log configuration file, change the root logger so that only ERROR-level messages reach the console (the template ships with the level set to INFO):
log4j.rootCategory=ERROR, console
Run again; the final output is:
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256M; support was removed in 8.0
root
 |-- creditability: double (nullable = false)
 |-- balance: double (nullable = false)
 |-- duration: double (nullable = false)
 |-- history: double (nullable = false)
 |-- purpose: double (nullable = false)
 |-- amount: double (nullable = false)
 |-- savings: double (nullable = false)
 |-- employment: double (nullable = false)
 |-- instPercent: double (nullable = false)
 |-- sexMarried: double (nullable = false)
 |-- guarantors: double (nullable = false)
 |-- residenceDuration: double (nullable = false)
 |-- assets: double (nullable = false)
 |-- age: double (nullable = false)
 |-- concCredit: double (nullable = false)
 |-- apartment: double (nullable = false)
 |-- credits: double (nullable = false)
 |-- occupation: double (nullable = false)
 |-- dependents: double (nullable = false)
 |-- hasPhone: double (nullable = false)
 |-- foreign: double (nullable = false)

+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
| 1.0| 0.0| 18.0| 4.0| 2.0|1049.0| 0.0| 1.0| 4.0| 1.0| 0.0| 3.0| 1.0|21.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 9.0| 4.0| 0.0|2799.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|
| 1.0| 1.0| 12.0| 2.0| 9.0| 841.0| 1.0| 3.0| 2.0| 1.0| 0.0| 3.0| 0.0|23.0| 2.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2122.0| 0.0| 2.0| 3.0| 2.0| 0.0| 1.0| 0.0|39.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2171.0| 0.0| 2.0| 4.0| 2.0| 0.0| 3.0| 1.0|38.0| 0.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|
| 1.0| 0.0| 10.0| 4.0| 0.0|2241.0| 0.0| 1.0| 1.0| 2.0| 0.0| 2.0| 0.0|48.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 0.0| 8.0| 4.0| 0.0|3398.0| 0.0| 3.0| 1.0| 2.0| 0.0| 3.0| 0.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|
| 1.0| 0.0| 6.0| 4.0| 0.0|1361.0| 0.0| 1.0| 2.0| 2.0| 0.0| 3.0| 0.0|40.0| 2.0| 1.0| 0.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 3.0| 18.0| 4.0| 3.0|1098.0| 0.0| 0.0| 4.0| 1.0| 0.0| 3.0| 2.0|65.0| 2.0| 1.0| 1.0| 0.0| 0.0| 0.0| 0.0|
| 1.0| 1.0| 24.0| 2.0| 3.0|3758.0| 2.0| 0.0| 1.0| 1.0| 0.0| 3.0| 3.0|23.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3905.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|
| 1.0| 0.0| 30.0| 4.0| 1.0|6187.0| 1.0| 3.0| 1.0| 3.0| 0.0| 3.0| 2.0|24.0| 2.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 6.0| 4.0| 3.0|1957.0| 0.0| 3.0| 1.0| 1.0| 0.0| 3.0| 2.0|31.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 1.0| 48.0| 3.0| 10.0|7582.0| 1.0| 0.0| 2.0| 2.0| 0.0| 3.0| 3.0|31.0| 2.0| 1.0| 0.0| 3.0| 0.0| 1.0| 0.0|
| 1.0| 0.0| 18.0| 2.0| 3.0|1936.0| 4.0| 3.0| 2.0| 3.0| 0.0| 3.0| 2.0|23.0| 2.0| 0.0| 1.0| 1.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 6.0| 2.0| 3.0|2647.0| 2.0| 2.0| 2.0| 2.0| 0.0| 2.0| 0.0|44.0| 2.0| 0.0| 0.0| 2.0| 1.0| 0.0| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3939.0| 0.0| 2.0| 1.0| 2.0| 0.0| 1.0| 0.0|40.0| 2.0| 1.0| 1.0| 1.0| 1.0| 0.0| 0.0|
| 1.0| 1.0| 18.0| 2.0| 3.0|3213.0| 2.0| 1.0| 1.0| 3.0| 0.0| 2.0| 0.0|25.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 1.0| 36.0| 4.0| 3.0|2337.0| 0.0| 4.0| 4.0| 2.0| 0.0| 3.0| 0.0|36.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 3.0| 11.0| 4.0| 0.0|7228.0| 0.0| 2.0| 1.0| 2.0| 0.0| 3.0| 1.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 0.0|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
only showing top 20 rows

+-------------+------------------+------------------+------------------+
|creditability| avgbalance| avgamt| avgdur|
+-------------+------------------+------------------+------------------+
| 0.0|0.9033333333333333|3938.1266666666666| 24.86|
| 1.0|1.8657142857142857| 2985.442857142857|19.207142857142856|
+-------------+------------------+------------------+------------------+

+-------+------------------+
|summary| balance|
+-------+------------------+
| count| 1000|
| mean| 1.577|
| stddev|1.2576377271108938|
| min| 0.0|
| max| 3.0|
+-------+------------------+

+-------------+------------------+
|creditability| avg(balance)|
+-------------+------------------+
| 0.0|0.9033333333333333|
| 1.0|1.8657142857142857|
+-------------+------------------+

+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign| features|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
| 1.0| 0.0| 18.0| 4.0| 2.0|1049.0| 0.0| 1.0| 4.0| 1.0| 0.0| 3.0| 1.0|21.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|(20,[1,2,3,4,6,7,...|
| 1.0| 0.0| 9.0| 4.0| 0.0|2799.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...|
| 1.0| 1.0| 12.0| 2.0| 9.0| 841.0| 1.0| 3.0| 2.0| 1.0| 0.0| 3.0| 0.0|23.0| 2.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|[1.0,12.0,2.0,9.0...|
| 1.0| 0.0| 12.0| 4.0| 0.0|2122.0| 0.0| 2.0| 3.0| 2.0| 0.0| 1.0| 0.0|39.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...|
| 1.0| 0.0| 12.0| 4.0| 0.0|2171.0| 0.0| 2.0| 4.0| 2.0| 0.0| 3.0| 1.0|38.0| 0.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...|
| 1.0| 0.0| 10.0| 4.0| 0.0|2241.0| 0.0| 1.0| 1.0| 2.0| 0.0| 2.0| 0.0|48.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,10.0,4.0,0.0...|
| 1.0| 0.0| 8.0| 4.0| 0.0|3398.0| 0.0| 3.0| 1.0| 2.0| 0.0| 3.0| 0.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,8.0,4.0,0.0,...|
| 1.0| 0.0| 6.0| 4.0| 0.0|1361.0| 0.0| 1.0| 2.0| 2.0| 0.0| 3.0| 0.0|40.0| 2.0| 1.0| 0.0| 1.0| 1.0| 0.0| 1.0|[0.0,6.0,4.0,0.0,...|
| 1.0| 3.0| 18.0| 4.0| 3.0|1098.0| 0.0| 0.0| 4.0| 1.0| 0.0| 3.0| 2.0|65.0| 2.0| 1.0| 1.0| 0.0| 0.0| 0.0| 0.0|[3.0,18.0,4.0,3.0...|
| 1.0| 1.0| 24.0| 2.0| 3.0|3758.0| 2.0| 0.0| 1.0| 1.0| 0.0| 3.0| 3.0|23.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|(20,[0,1,2,3,4,5,...|
| 1.0| 0.0| 11.0| 4.0| 0.0|3905.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...|
| 1.0| 0.0| 30.0| 4.0| 1.0|6187.0| 1.0| 3.0| 1.0| 3.0| 0.0| 3.0| 2.0|24.0| 2.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0|[0.0,30.0,4.0,1.0...|
| 1.0| 0.0| 6.0| 4.0| 3.0|1957.0| 0.0| 3.0| 1.0| 1.0| 0.0| 3.0| 2.0|31.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[0.0,6.0,4.0,3.0,...|
| 1.0| 1.0| 48.0| 3.0| 10.0|7582.0| 1.0| 0.0| 2.0| 2.0| 0.0| 3.0| 3.0|31.0| 2.0| 1.0| 0.0| 3.0| 0.0| 1.0| 0.0|[1.0,48.0,3.0,10....|
| 1.0| 0.0| 18.0| 2.0| 3.0|1936.0| 4.0| 3.0| 2.0| 3.0| 0.0| 3.0| 2.0|23.0| 2.0| 0.0| 1.0| 1.0| 0.0| 0.0| 0.0|[0.0,18.0,2.0,3.0...|
| 1.0| 0.0| 6.0| 2.0| 3.0|2647.0| 2.0| 2.0| 2.0| 2.0| 0.0| 2.0| 0.0|44.0| 2.0| 0.0| 0.0| 2.0| 1.0| 0.0| 0.0|[0.0,6.0,2.0,3.0,...|
| 1.0| 0.0| 11.0| 4.0| 0.0|3939.0| 0.0| 2.0| 1.0| 2.0| 0.0| 1.0| 0.0|40.0| 2.0| 1.0| 1.0| 1.0| 1.0| 0.0| 0.0|[0.0,11.0,4.0,0.0...|
| 1.0| 1.0| 18.0| 2.0| 3.0|3213.0| 2.0| 1.0| 1.0| 3.0| 0.0| 2.0| 0.0|25.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,18.0,2.0,3.0...|
| 1.0| 1.0| 36.0| 4.0| 3.0|2337.0| 0.0| 4.0| 4.0| 2.0| 0.0| 3.0| 0.0|36.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,36.0,4.0,3.0...|
| 1.0| 3.0| 11.0| 4.0| 0.0|7228.0| 0.0| 2.0| 1.0| 2.0| 0.0| 3.0| 1.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 0.0|[3.0,11.0,4.0,0.0...|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
only showing top 20 rows

+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign| features|label|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
| 1.0| 0.0| 18.0| 4.0| 2.0|1049.0| 0.0| 1.0| 4.0| 1.0| 0.0| 3.0| 1.0|21.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|(20,[1,2,3,4,6,7,...| 0.0|
| 1.0| 0.0| 9.0| 4.0| 0.0|2799.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...| 0.0|
| 1.0| 1.0| 12.0| 2.0| 9.0| 841.0| 1.0| 3.0| 2.0| 1.0| 0.0| 3.0| 0.0|23.0| 2.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|[1.0,12.0,2.0,9.0...| 0.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2122.0| 0.0| 2.0| 3.0| 2.0| 0.0| 1.0| 0.0|39.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...| 0.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2171.0| 0.0| 2.0| 4.0| 2.0| 0.0| 3.0| 1.0|38.0| 0.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...| 0.0|
| 1.0| 0.0| 10.0| 4.0| 0.0|2241.0| 0.0| 1.0| 1.0| 2.0| 0.0| 2.0| 0.0|48.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,10.0,4.0,0.0...| 0.0|
| 1.0| 0.0| 8.0| 4.0| 0.0|3398.0| 0.0| 3.0| 1.0| 2.0| 0.0| 3.0| 0.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,8.0,4.0,0.0,...| 0.0|
| 1.0| 0.0| 6.0| 4.0| 0.0|1361.0| 0.0| 1.0| 2.0| 2.0| 0.0| 3.0| 0.0|40.0| 2.0| 1.0| 0.0| 1.0| 1.0| 0.0| 1.0|[0.0,6.0,4.0,0.0,...| 0.0|
| 1.0| 3.0| 18.0| 4.0| 3.0|1098.0| 0.0| 0.0| 4.0| 1.0| 0.0| 3.0| 2.0|65.0| 2.0| 1.0| 1.0| 0.0| 0.0| 0.0| 0.0|[3.0,18.0,4.0,3.0...| 0.0|
| 1.0| 1.0| 24.0| 2.0| 3.0|3758.0| 2.0| 0.0| 1.0| 1.0| 0.0| 3.0| 3.0|23.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|(20,[0,1,2,3,4,5,...| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3905.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...| 0.0|
| 1.0| 0.0| 30.0| 4.0| 1.0|6187.0| 1.0| 3.0| 1.0| 3.0| 0.0| 3.0| 2.0|24.0| 2.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0|[0.0,30.0,4.0,1.0...| 0.0|
| 1.0| 0.0| 6.0| 4.0| 3.0|1957.0| 0.0| 3.0| 1.0| 1.0| 0.0| 3.0| 2.0|31.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[0.0,6.0,4.0,3.0,...| 0.0|
| 1.0| 1.0| 48.0| 3.0| 10.0|7582.0| 1.0| 0.0| 2.0| 2.0| 0.0| 3.0| 3.0|31.0| 2.0| 1.0| 0.0| 3.0| 0.0| 1.0| 0.0|[1.0,48.0,3.0,10....| 0.0|
| 1.0| 0.0| 18.0| 2.0| 3.0|1936.0| 4.0| 3.0| 2.0| 3.0| 0.0| 3.0| 2.0|23.0| 2.0| 0.0| 1.0| 1.0| 0.0| 0.0| 0.0|[0.0,18.0,2.0,3.0...| 0.0|
| 1.0| 0.0| 6.0| 2.0| 3.0|2647.0| 2.0| 2.0| 2.0| 2.0| 0.0| 2.0| 0.0|44.0| 2.0| 0.0| 0.0| 2.0| 1.0| 0.0| 0.0|[0.0,6.0,2.0,3.0,...| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3939.0| 0.0| 2.0| 1.0| 2.0| 0.0| 1.0| 0.0|40.0| 2.0| 1.0| 1.0| 1.0| 1.0| 0.0| 0.0|[0.0,11.0,4.0,0.0...| 0.0|
| 1.0| 1.0| 18.0| 2.0| 3.0|3213.0| 2.0| 1.0| 1.0| 3.0| 0.0| 2.0| 0.0|25.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,18.0,2.0,3.0...| 0.0|
| 1.0| 1.0| 36.0| 4.0| 3.0|2337.0| 0.0| 4.0| 4.0| 2.0| 0.0| 3.0| 0.0|36.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,36.0,4.0,3.0...| 0.0|
| 1.0| 3.0| 11.0| 4.0| 0.0|7228.0| 0.0| 2.0| 1.0| 2.0| 0.0| 3.0| 1.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 0.0|[3.0,11.0,4.0,0.0...| 0.0|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
only showing top 20 rows

accuracy before pipeline fitting0.7264394897138242
MSE: 0.22442244224422442
MAE: 0.22442244224422442
RMSE Squared: 0.47373245850820106
R Squared: -0.1840018388690956
Explained Variance: 0.09866135128364424

accuracy after pipeline fitting0.7523847833582331
RandomForestClassificationModel (uid=rfc_3146cd3eaaac) with 60 trees
MSE: 0.23762376237623759
MAE: 0.2376237623762376
RMSE Squared: 0.48746667822143247
R Squared: -0.25364900586139494
Explained Variance: 0.15708699582829524

Process finished with exit code 0
Comparing "accuracy before pipeline fitting0.7264394897138242" with "accuracy after pipeline fitting0.7523847833582331" shows that the cross-validated grid search pays off: the random forest trained with the initial hand-picked hyperparameters scores 72.64% on the test set, while the best model found by the pipeline scores 75.24%. (Strictly speaking, the figure printed as "accuracy" is the BinaryClassificationEvaluator's default metric, the area under the ROC curve.)
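If a literal classification accuracy is wanted instead, a MulticlassClassificationEvaluator with an explicit accuracy metric could be swapped in (a sketch, not part of the original program; predictions2 is the DataFrame produced above):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Fraction of test rows where prediction == label.
val accEvaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
println("test accuracy: " + accEvaluator.evaluate(predictions2))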
3. Code Analysis
TODO