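My starting reference was this article: http://blog.csdn.net/sadfasdgaaaasdfa/article/details/45970185
However, the functions it uses are outdated, so the code had to be changed. The other starting point was the gradient descent figure in my own post: http://www.cnblogs.com/charlesblc/p/6206198.html
After many revisions, and a lot of time spent on randomizing the initial vector, I finally got it working. The code is as follows: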
package com.spark.my

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import breeze.linalg.DenseVector
import breeze.numerics.exp

/**
 * Created by baidu on 16/11/28.
 */
object GradientDemo {

  // See the "case class" section below.
  case class DataPoint(x: DenseVector[Double], y: Double)

  def parsePoint(x: Array[Double]): DataPoint = {
    // NOTE: slice(0, x.size - 2) keeps columns 0 .. size-3, so the column just
    // before the label is dropped; slice(0, x.size - 1) would keep every feature.
    // Left unchanged so that the code matches the output shown below (w num: 56).
    // DataPoint(Vectors.dense(x.slice(0, x.size-2)), x(x.size-1))
    DataPoint(DenseVector(x.slice(0, x.size - 2)), x(x.size - 1))
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    println("Begin load gradient file")

    // Load the data set: one space-separated row of numbers per line.
    val text = sc.textFile("hdfs://master.Hadoop:8390/gradient_data/spam.data.txt")
    val lines = text.map { line =>
      line.split(" ").map(_.toDouble)
    }

    val points = lines.map(parsePoint(_)) // writing (parsePoint(_)) seems to give the same result

    // Random initial weights, one per feature kept by parsePoint.
    var w = DenseVector.rand(lines.first().size - 2)
    val iterations = 100
    for (i <- 1 to iterations) {
      // Per-point logistic-loss gradient (this form assumes labels y in {-1, +1}),
      // summed over all points. No step size is applied to the update.
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Finish data loading, w num: " + w.length + "; w: " + w)
  }
}
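The slicing in parsePoint is easy to get wrong, because slice(from, until) excludes the element at index until. Below is a minimal sketch in plain Scala (no Spark or Breeze) that compares the original slice with one that keeps every column before the label. The names ParseSketch, Row, parseLine and originalSlice are hypothetical and used only for illustration, and the assumption that the last column is the label comes from how parsePoint reads x(x.size-1).

```scala
object ParseSketch {
  // Hypothetical row type: every column except the last is a feature,
  // and the last column is assumed to be the label.
  final case class Row(features: Array[Double], label: Double)

  // Splits on whitespace and converts every column to Double.
  def parseLine(line: String): Row = {
    val cols = line.trim.split("\\s+").map(_.toDouble)
    Row(cols.init, cols.last) // init = all but the last element
  }

  // What the slice in the post keeps: columns 0 .. size-3 (one feature fewer).
  def originalSlice(cols: Array[Double]): Array[Double] =
    cols.slice(0, cols.length - 2)
}
```

For a four-column line, parseLine keeps three features while originalSlice keeps only two, which is consistent with the 56-element w in the post's output.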
Then, on machine m42n05, I first copied the file http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data to Hadoop:
$ hadoop fs -mkdir /gradient_data
$ hadoop fs -put spam.data.txt /gradient_data/
$ hadoop fs -ls /gradient_data/
Found 1 items
-rw-r--r-- 3 work supergroup 698341 2016-12-21 17:59 /gradient_data/spam.data.txt
Then I copied the jar over as well and ran this command:
$ ./bin/spark-submit --class com.spark.my.GradientDemo --master spark://10.117.146.12:7077 myjars/scala-demo.jar

Output (the repeated Jetty ContextHandler "Started" and "Stopped" INFO lines are omitted here):

16/12/21 18:17:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/21 18:17:58 INFO util.log: Logging initialized @1689ms
16/12/21 18:17:58 INFO server.Server: jetty-9.2.z-SNAPSHOT
16/12/21 18:17:58 INFO server.ServerConnector: Started ServerConnector@53e211ee{HTTP/1.1}{0.0.0.0:4040}
16/12/21 18:17:58 INFO server.Server: Started @1811ms
Begin load gradient file
16/12/21 18:18:00 INFO mapred.FileInputFormat: Total input paths to process : 1
16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Finish data loading, w num: 56; w: DenseVector(0.5742670447735152, 0.3793477463119241, 0.9681722093411653, 0.5967720119758925, 1.513648869152009, 0.8246263930800145, 0.8513296345703405, 0.5016541916805365, 0.10371045067354999, 1.0622529560536655, 0.7333760424194737, 2.1149483032187897, 0.9299367625800867, 0.7255747859512406, 0.13008556580706143, 1.4831202765138185, 0.7729907277492736, 0.9723309264036033, 13.394753146641808, 0.5531526429090097, 2.7444722115693665, 0.11325813324181622, 0.5096129116641023, 0.7201439311127137, 0.44719912156747926, 0.8273500952621051, 0.6736417633922696, 0.046531684571481415, 0.017895929000231802, 0.4726397794671698, 0.394438566392741, 0.8438784726078483, 0.4144073806784945, 0.18873920886297268, 0.4760240368798872, 0.31604719205329873, 0.694745503752298, 0.721380820951884, 0.988535475648986, 0.13515871744899247, 0.15694652560543523, 0.6939378895510522, 0.9279201378471407, 0.3336083293555714, 0.38938263676999685, 0.17159756568171308, 0.18897754115255144, 0.7281027812135723, 0.7233165381530381, 1.1093715737790655, 0.15675561193336351, 2.059622965151493, 0.6839713282339183, 0.11528695729374866, 7.413534050555067, 23.13404922028611)
16/12/21 18:18:07 INFO server.ServerConnector: Stopped ServerConnector@53e211ee{HTTP/1.1}{0.0.0.0:4040}
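The job ran to completion and printed a 56-dimensional w, so the data was processed without errors.
To watch the progress, add the following line inside the iteration loop: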
println("In data loading, w num: " + w.length + "; w: " + w)
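Then I copied the jar over again and reran the job. A lot of intermediate output appeared, but w barely changes from one iteration to the next; for some components only the last few digits differ: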
In data loading, w num: 56; w: DenseVector(0.8387794911469437, 0.041931950643148204, 0.610593576873822, 0.775693127624059, 0.9595814255406686, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233533456, 0.8863247999385253, 0.4007587753541083, 2.0631977325748334, 0.8211289850510815, 1.2076387347473903, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.39766237687949973, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406284, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.0933173640821512, 0.10820509511118362, 1.9426418431339785, 0.2017114624971559, 0.9827542778431644, 5.224634203803431, 16.694903977208174)
In data loading, w num: 56; w: DenseVector(0.8387794911469437, 0.041931950643148204, 0.6105935768739001, 0.775693127624059, 0.9595814255414439, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233534118, 0.8863247999385373, 0.4007587753541083, 2.0631977325749897, 0.8211289850510815, 1.2076387347474142, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.3976623768795117, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406296, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.093317364082217, 0.10820509511118362, 1.942641843152015, 0.2017114624971559, 0.982754277843168, 5.22463420411604, 16.694903977520784)
How gradient descent works
For a good explanation of how gradient descent works, see:
http://blog.csdn.net/woxincd/article/details/7040944
And this one:
http://www.cnblogs.com/maybe2030/p/5089753.html?utm_source=tuicool&utm_medium=referral
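Looking closely, the formulas in those articles do not quite match the one in the code. The code appears to use the Sigmoid function.
I still need to digest this properly.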
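One possible reading, offered as an assumption rather than something the linked articles state: the per-point expression in the code is the gradient of the logistic loss, provided the labels are encoded as y in {-1, +1}. With sigma denoting the sigmoid, x_i and y_i the features and label of row i, and L(w) the summed loss (these symbols are introduced here for illustration), a short derivation is:

```latex
\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
L(w) = \sum_i \log\bigl(1 + e^{-y_i\, w \cdot x_i}\bigr)
\]
\[
\nabla_w L(w)
  = \sum_i \frac{-y_i\, x_i\, e^{-y_i\, w \cdot x_i}}{1 + e^{-y_i\, w \cdot x_i}}
  = \sum_i \bigl(\sigma(y_i\, w \cdot x_i) - 1\bigr)\, y_i\, x_i
\]
```

The second equality uses e^{-z} / (1 + e^{-z}) = 1 - sigma(z). Each summand has the same shape as (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x, and the sum over i corresponds to reduce(_ + _). The linked articles may be deriving the update for a different loss, which would explain why their formulas look different.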
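The main formula used in the code above is:
(1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
Here p.x is an n-dimensional vector and p.y is a scalar.
reduce(_ + _) then adds up the result for every row, so the final gradient is again an n-dimensional vector.
Next comes the update w -= gradient.
Repeating this for N iterations yields the new w.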
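The steps above can be sketched without Spark or Breeze, using plain arrays. This is a minimal sketch under stated assumptions, not the code from the post: labels are assumed to be -1.0 or +1.0, the weights start at zero instead of random values, and a stepSize parameter is added (the post's update has none). The names LogisticGdSketch, Point, dot, pointGradient, step and train are hypothetical.

```scala
object LogisticGdSketch {
  // One labelled example; y is assumed to be -1.0 or +1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Per-point gradient: (1 / (1 + exp(-y * (w dot x))) - 1) * y * x
  def pointGradient(w: Array[Double], p: Point): Array[Double] = {
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }

  // Sums the per-point gradients (the reduce(_ + _) step), then updates w.
  // stepSize is an assumption added for this sketch.
  def step(w: Array[Double], points: Seq[Point], stepSize: Double): Array[Double] = {
    val gradient = points.map(pointGradient(w, _)).reduce { (a, b) =>
      a.zip(b).map { case (u, v) => u + v }
    }
    w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
  }

  // Runs a fixed number of iterations starting from the zero vector.
  def train(points: Seq[Point], iterations: Int, stepSize: Double): Array[Double] = {
    val w0 = Array.fill(points.head.x.length)(0.0)
    (1 to iterations).foldLeft(w0)((w, _) => step(w, points, stepSize))
  }
}
```

On a small linearly separable data set, calling LogisticGdSketch.train(points, 50, 0.1) should return a w for which p.y * dot(w, p.x) is positive for every point. Without a step size, each update is the full summed gradient, which can be large when there are many rows.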
case class
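For the differences between a case class and a class, see: http://www.tuicool.com/articles/yEZr6ve
A case class in Scala is essentially an ordinary class, with a few differences:
1. It can be instantiated without new (adding new is also allowed), whereas an ordinary class requires new in Scala 2;
2. It gets a more readable toString implementation;
3. equals and hashCode are implemented by default;
4. It is serializable by default, i.e. it implements Serializable;
5. It automatically inherits some methods from scala.Product;
6. Its constructor parameters are public, so they can be accessed directly;
7. It supports pattern matching.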
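The seven points above can be tried out with a small example. This is a sketch only; Sample, describe and CaseClassSketch are hypothetical names chosen for illustration and do not appear in the original code.

```scala
object CaseClassSketch {
  // Hypothetical case class used only for illustration.
  final case class Sample(label: String, value: Double)

  // Item 7: pattern matching on the constructor parameters.
  def describe(s: Sample): String = s match {
    case Sample("spam", v) => s"spam with value $v"
    case Sample(other, _)  => s"not spam: $other"
  }

  def main(args: Array[String]): Unit = {
    val a = Sample("spam", 1.5)                   // 1: no `new` needed
    println(a)                                    // 2: readable toString
    println(a == Sample("spam", 1.5))             // 3: structural equals / hashCode
    println(a.isInstanceOf[java.io.Serializable]) // 4: serializable by default
    println(a.productArity)                       // 5: method inherited from scala.Product
    println(a.label)                              // 6: constructor parameter is public
    println(describe(a))                          // 7: pattern matching
  }
}
```

In the Spark job, DataPoint presumably benefits from item 4 in particular, since objects stored in an RDD need to be serializable to be shipped between nodes.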
Breeze
Also, the DenseVector used above is the class from Breeze.
LinearRegressionWithSGD
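This is the linear regression implemented in Spark, based on stochastic gradient descent. There are similar functions as well:
The linear regression algorithms available in MLlib are LinearRegressionWithSGD, RidgeRegressionWithSGD and LassoWithSGD; the main classes involved in MLlib regression analysis are GeneralizedLinearAlgorithm and GradientDescent.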
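Using Java from Scala
The final version above uses DenseVector, so the snippet below was not needed. It does show, however, that Java classes can be used directly from Scala:
import java.util.Random
val rand = new Random(53)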
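The post does not show how this rand was going to be used. A minimal sketch, assuming it was meant to fill the initial weight vector in place of DenseVector.rand, could look like this. RandomInitSketch and randomVector are hypothetical names, and the uniform [0, 1) distribution is an assumption.

```scala
import java.util.Random

object RandomInitSketch {
  // Builds an n-element vector of uniform values in [0, 1)
  // from a seeded java.util.Random, so runs are reproducible.
  def randomVector(n: Int, seed: Long): Array[Double] = {
    val rand = new Random(seed)
    Array.fill(n)(rand.nextDouble())
  }
}
```

Calling RandomInitSketch.randomVector(56, 53L) twice returns the same values, because the generator is re-created from the same seed each time.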