Spark MLlib Classification Algorithms: Support Vector Machines
(1) Concept
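A support vector machine (SVM) is a classification algorithm. It improves the generalization ability of the learner by minimizing structural risk, that is, by jointly minimizing the empirical risk and the confidence interval, so that good statistical regularities can still be learned from a small number of samples. Put simply, it is a binary classification model whose basic form is the linear classifier with the largest margin in feature space. The learning strategy is therefore margin maximization, which ultimately reduces to solving a convex quadratic programming problem. Reference: http://www.cnblogs.com/end/p/3848740.html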
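For readers who want the convex quadratic program spelled out, the standard hard-margin form is sketched below. The symbols are assumptions introduced here, not taken from the post: $x_i$ is a feature vector, $y_i \in \{-1, +1\}$ its label, $w$ and $b$ the weights and bias of the separating hyperplane, and $n$ the number of samples.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the margin between the two classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert^{2}$ is the same as maximizing the margin.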
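(2) Applying SVM classification in Spark MLlib
1. Dataset: see the earlier post "Spark MLlib Classification Algorithms: Logistic Regression"
2. Process the data and build the training and test sets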
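import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

val orig_file = sc.textFile("train_nohead.tsv")
// println(orig_file.first())
val data_file = orig_file.map(_.split("\t")).map { r =>
  val trimmed = r.map(_.replace("\"", ""))        // strip the quotes around each field
  val label = trimmed(r.length - 1).toDouble       // the last column is the label
  val feature = trimmed.slice(4, r.length - 1)     // columns 4 up to the label are features
    .map(d => if (d == "?") 0.0 else d.toDouble)   // treat missing values ("?") as 0.0
  LabeledPoint(label, Vectors.dense(feature))
}

/* Feature standardization */
val vectors = data_file.map(x => x.features)
val rows = new RowMatrix(vectors)
println(rows.computeColumnSummaryStatistics().variance) // variance of each column
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors) // standardize
val scaled_data = data_file
  .map(point => LabeledPoint(point.label, scaler.transform(point.features)))
  .randomSplit(Array(0.7, 0.3), 11L)
val data_train = scaled_data(0)
val data_test = scaled_data(1)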
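To make the standardization step concrete, here is a minimal plain-Scala sketch of what subtracting the column mean and dividing by the column standard deviation does to a tiny matrix. It does not use Spark. The names `standardize` and `toy` are hypothetical, and the use of the sample variance (dividing by n - 1) and the handling of constant columns are assumptions about how such a scaler typically behaves.

```scala
// Plain-Scala sketch of column-wise standardization (no Spark required).
// Assumption: the standard deviation uses the unbiased sample variance (n - 1).
def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  require(rows.size > 1, "need at least two rows to estimate a variance")
  val n = rows.size
  val dims = rows.head.length
  // Per-column mean.
  val mean = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
  // Per-column sample standard deviation.
  val std = Array.tabulate(dims) { j =>
    math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / (n - 1))
  }
  // Constant columns (std == 0) are mapped to 0.0 instead of dividing by zero.
  rows.map { r =>
    Array.tabulate(dims)(j => if (std(j) == 0.0) 0.0 else (r(j) - mean(j)) / std(j))
  }
}

// Hypothetical toy data: the first column varies, the second is constant.
val toy = Seq(Array(1.0, 10.0), Array(2.0, 10.0), Array(3.0, 10.0))
val scaledToy = standardize(toy)
scaledToy.foreach(r => println(r.mkString(", ")))
```

After scaling, every non-constant column has mean 0 and unit variance, which keeps features with large raw ranges from dominating the SGD updates.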
2. Build the support vector machine model and evaluate it
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

/* Train the SVM model */
val numIteration = 10 // assumed value; the post does not show where numIteration is defined
val model_Svm = SVMWithSGD.train(data_train, numIteration)

// Accuracy on the test set: 0.6060885608856088
val correct_svm = data_test.map { point =>
  if (model_Svm.predict(point.features) == point.label) 1 else 0
}.sum() / data_test.count()

val metrics = Seq(model_Svm).map { model =>
  val scoreAndLabels = data_test.map { point =>
    (model.predict(point.features), point.label)
  }
  val binaryMetrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, binaryMetrics.areaUnderPR(), binaryMetrics.areaUnderROC())
}
metrics.foreach { case (m, pr, roc) =>
  println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
/* SVMModel, Area under PR: 72.5527%, Area under ROC: 60.4180%*/
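One thing worth keeping in mind when reading these numbers: by default `predict` on an MLlib `SVMModel` returns a 0/1 label (the margin thresholded at 0), so the AUC here is computed from hard predictions rather than raw scores. Calling `clearThreshold()` on the model makes `predict` return the raw margin instead. The plain-Scala sketch below illustrates why that matters. The function `auc` and the toy scores are hypothetical and exist only for illustration.

```scala
// Rank-based AUC: the probability that a random positive outscores a random
// negative, counting ties as one half. Quadratic time, for tiny examples only.
def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
  val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
  val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
  require(pos.nonEmpty && neg.nonEmpty, "need both classes")
  val wins = for { p <- pos; q <- neg } yield {
    if (p > q) 1.0 else if (p == q) 0.5 else 0.0
  }
  wins.sum / (pos.size * neg.size)
}

// Hypothetical raw margins (w·x + b) paired with their true labels.
val raw = Seq((2.0, 1.0), (0.3, 1.0), (-0.2, 1.0), (0.5, 0.0), (-1.0, 0.0), (-0.4, 0.0))
// The same points after thresholding the margin at 0.
val thresholded = raw.map { case (s, l) => (if (s > 0) 1.0 else 0.0, l) }

println(f"AUC on raw margins:     ${auc(raw)}%.4f")
println(f"AUC on 0/1 predictions: ${auc(thresholded)}%.4f")
```

Thresholding discards the ranking among points on the same side of the boundary, so the two AUC values generally differ. Evaluating on raw margins usually gives a more informative ROC curve.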
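3. Tuning the model parameters
Logistic regression (SGD) and SVM models take the same parameters because both use stochastic gradient descent (SGD) as the underlying optimization technique. They differ in the loss function: logistic regression uses the logistic loss, while SVM uses the hinge loss.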
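The difference between the two loss functions can be seen in a few lines of plain Scala. This is only a sketch: `hingeLoss` and `logisticLoss` are hypothetical helper names, labels are assumed to be encoded as -1/+1, and `margin` stands for the raw score w·x + b.

```scala
// Labels are encoded as y in {-1, +1}; margin is the raw score w·x + b.
def hingeLoss(margin: Double, y: Double): Double =
  math.max(0.0, 1.0 - y * margin)

def logisticLoss(margin: Double, y: Double): Double =
  math.log1p(math.exp(-y * margin))

// Compare both losses for a positive example at several margins.
Seq(-2.0, 0.0, 0.5, 1.0, 3.0).foreach { m =>
  println(f"margin=$m%5.1f  hinge=${hingeLoss(m, 1.0)}%.4f  logistic=${logisticLoss(m, 1.0)}%.4f")
}
```

The hinge loss is exactly zero once an example is classified correctly with a margin of at least 1, so such examples stop contributing to the gradient. The logistic loss stays positive for every margin and keeps pushing all examples further from the boundary.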
3.1 Define the tuning function and the model evaluation function
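import org.apache.spark.mllib.classification.{ClassificationModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

/* Tuning function: train an SVM with the given optimizer settings */
def trainWithParams(input: RDD[LabeledPoint], regParam: Double,
                    numIterations: Int, updater: Updater, stepSize: Double) = {
  val svm = new SVMWithSGD
  svm.optimizer
    .setNumIterations(numIterations)
    .setUpdater(updater)
    .setRegParam(regParam)
    .setStepSize(stepSize)
  svm.run(input)
}

/* Evaluation function: return the label together with the area under the ROC curve */
def createMetrics(label: String, data: RDD[LabeledPoint], model: ClassificationModel) = {
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (label, metrics.areaUnderROC)
}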
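3.2 Varying the number of iterations (once a certain number of iterations has been completed, further iterations have little effect on the result)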
val iterResults = Seq(1, 5, 10, 50).map { param =>
  val model = trainWithParams(data_train, 0.0, param, new SimpleUpdater, 1.0)
  createMetrics(s"$param iterations", data_test, model)
}
iterResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
/*
1 iterations, AUC = 59.02%
5 iterations, AUC = 60.04%
10 iterations, AUC = 60.42%
50 iterations, AUC = 60.42%
*/
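3.3 Varying the step size (too large a step size clearly hurts performance)
In SGD, the step size controls how far the algorithm moves along the steepest gradient direction each time it processes training samples and updates the weight vector. A larger step size converges faster, but one that is too large can overshoot the minimum, making convergence unstable and leaving the model at a worse solution.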
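The role of the step size can be written as a single update rule. This is a sketch under stated assumptions: $\alpha$ is the configured step size, $t$ the iteration number, $B_t$ the batch of samples used at iteration $t$, and labels are encoded as $y_i \in \{-1, +1\}$. The $1/\sqrt{t}$ decay reflects how MLlib's `SimpleUpdater` scales the step size as far as I know, and none of these symbols appear in the original post.

```latex
w_{t+1} = w_t - \frac{\alpha}{\sqrt{t}}\, g_t,
\qquad
g_t = \frac{1}{|B_t|} \sum_{i \in B_t}
\begin{cases}
-\,y_i\, x_i & \text{if } y_i\, w_t^{\top} x_i < 1 \\
0 & \text{otherwise}
\end{cases}
```

A small $\alpha$ barely moves the weights within a handful of iterations, while a large $\alpha$ makes each step jump past the region where the hinge loss is low.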
val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainWithParams(data_train, 0.0, numIteration, new SimpleUpdater, param)
  createMetrics(s"$param step size", data_test, model)
}
stepResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
/*
0.001 step size, AUC = 59.02%
0.01 step size, AUC = 59.02%
0.1 step size, AUC = 59.01%
1.0 step size, AUC = 60.42%
10.0 step size, AUC = 56.09%
*/
3.4 Regularization
val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainWithParams(data_train, param, numIteration, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", data_test, model)
}
regResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
/*
0.001 L2 regularization parameter, AUC = 60.42%
0.01 L2 regularization parameter, AUC = 60.42%
0.1 L2 regularization parameter, AUC = 60.37%
1.0 L2 regularization parameter, AUC = 60.56%
10.0 L2 regularization parameter, AUC = 41.54%
*/
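As the results show, light regularization has little effect on model performance. With heavy regularization, however, the model underfits and its performance drops.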
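To see how the regularization parameter enters each update, here is a plain-Scala sketch of one L2-regularized SGD step. The helpers `l2Step` and `shrinkFactor` are hypothetical and are not MLlib code. The update form (shrink the weights by 1 - eta_t * lambda, then take a gradient step with eta_t = stepSize / sqrt(t)) is an assumption about how an L2 updater typically works.

```scala
// Assumed form of one L2-regularized SGD step (hypothetical helper, not MLlib code):
//   w <- (1 - eta_t * lambda) * w - eta_t * g,  with eta_t = stepSize / sqrt(t)
def l2Step(w: Array[Double], g: Array[Double],
           stepSize: Double, lambda: Double, t: Int): Array[Double] = {
  val eta = stepSize / math.sqrt(t.toDouble)
  w.zip(g).map { case (wi, gi) => (1.0 - eta * lambda) * wi - eta * gi }
}

// Factor by which the existing weights are multiplied before the gradient step.
def shrinkFactor(stepSize: Double, lambda: Double, t: Int): Double =
  1.0 - (stepSize / math.sqrt(t.toDouble)) * lambda

// Same regularization values as in the experiment, with step size 1.0.
Seq(0.001, 0.01, 0.1, 1.0, 10.0).foreach { lambda =>
  println(f"lambda=$lambda%6.3f  shrink factor at t=1: ${shrinkFactor(1.0, lambda, 1)}%.3f")
}
```

Under this assumed update, small values of lambda leave the weights almost untouched, which matches the flat AUC for 0.001 through 0.1. With a step size of 1.0 and lambda = 10.0 the factor becomes negative in early iterations, so the weights would be pushed far from a useful solution. That may contribute to the sharp drop in AUC, although the experiment itself does not confirm the cause.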
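(3) Summary
1. Improving accuracy feels quite hard. The prerequisite is still to analyze the data first and process the different features appropriately.
2. More to learn going forward.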