Stratified Sampling in Spark MLlib

This article is a repost. Posted 2018-10-16 23:36. Category: Spark MLlib

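Learning Spark's MLlib component: basic concepts
1. Explanation
I won't go over the concept of stratified sampling itself; the concrete operations are as follows.
RDDs have operations that perform sampling directly, such as sampleByKey (for key-value RDDs) and sample. This article mainly covers these two.
(1) Strings of length 2 are assigned to stratum 2 and strings of length 3 to stratum 1; stratum 1 and stratum 2 are then sampled with different probabilities.
Data: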

aa
bb
cc
dd
ee
aaa
bbb
ccc
ddd
eee

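For example:
val fractions: Map[Int, Double] = List((1, 0.2), (2, 0.8)).toMap // sampling fraction for each stratum
sampleByKey(withReplacement = false, fractions, 0)
fractions sets the sampling fraction for each stratum: 0.2 for stratum 1 and 0.8 for stratum 2.
withReplacement = false means sampling without replacement.
0 is the seed for the random number generator.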
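To make the parameters above concrete, here is a minimal sketch in plain Scala (no Spark) of what sampling by key without replacement amounts to: each pair is kept independently with the probability given for its key. The names `BernoulliByKeySketch` and `sampleByKeySketch` are made up for illustration, and the sketch uses `scala.util.Random` rather than Spark's own generator, so its output will not match the Spark results in this article even with the same seed. It also assumes every key in the data has an entry in `fractions`.

```scala
import scala.util.Random

object BernoulliByKeySketch {
  // Keep each (key, value) pair independently with probability fractions(key).
  // Assumption: every key that occurs in `data` has an entry in `fractions`.
  def sampleByKeySketch[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    require(fractions.values.forall(_ >= 0.0), "Negative sampling rates.")
    val rng = new Random(seed)
    // nextDouble() is uniform in [0, 1), so a pair survives with probability fractions(key)
    data.filter { case (key, _) => rng.nextDouble() < fractions(key) }
  }

  def main(args: Array[String]): Unit = {
    // Same stratum rule as in this article: length 3 -> stratum 1, otherwise stratum 2
    val data = Seq("aa", "bb", "cc", "dd", "ee", "aaa", "bbb", "ccc", "ddd", "eee")
      .map(s => (if (s.length == 3) 1 else 2, s))
    val fractions = Map(1 -> 0.2, 2 -> 0.8)
    sampleByKeySketch(data, fractions, seed = 0L).foreach(println)
  }
}
```

Because every pair is decided independently, the number of pairs drawn from each stratum is only approximately the fraction times the stratum size, and it can vary from seed to seed.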

Source code:

  /**
   * Return a subset of this RDD sampled by key (via stratified sampling).
   *
   * Create a sample of this RDD using variable sampling rates for different keys as specified by
   * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
   * RDD, to produce a sample of size that's approximately equal to the sum of
   * math.ceil(numItems * samplingRate) over all key values.
   *
   * @param withReplacement whether to sample with or without replacement
   * @param fractions map of specific keys to sampling rates
   * @param seed seed for the random number generator
   * @return RDD containing the sampled subset
   */
  def sampleByKey(withReplacement: Boolean,
      fractions: Map[K, Double],
      seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {

    require(fractions.values.forall(v => v >= 0.0), "Negative sampling rates.")

    val samplingFunc = if (withReplacement) {
      StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed)
    } else {
      StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed)
    }
    self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
  }
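The Scaladoc above says the sample size is approximately the sum of math.ceil(numItems * samplingRate) over all keys. As a rough illustration, the sketch below computes that target size in plain Scala for the data in this article (5 strings per stratum). The names `ExpectedSampleSize` and `expectedSize` are hypothetical helpers, not part of Spark.

```scala
object ExpectedSampleSize {
  // Sum of math.ceil(numItems * samplingRate) over all keys, following the Scaladoc wording.
  // Assumption: every key in `counts` has an entry in `fractions`.
  def expectedSize[K](counts: Map[K, Long], fractions: Map[K, Double]): Long =
    counts.map { case (key, n) => math.ceil(n * fractions(key)).toLong }.sum

  def main(args: Array[String]): Unit = {
    val counts = Map(1 -> 5L, 2 -> 5L)       // 5 strings of length 3, 5 strings of length 2
    val fractions = Map(1 -> 0.2, 2 -> 0.8)  // same fractions as in this article
    println(expectedSize(counts, fractions)) // ceil(5 * 0.2) + ceil(5 * 0.8) = 1 + 4 = 5
  }
}
```

This is only the approximate target. Since sampleByKey decides each element independently, an actual run can return fewer or more elements than this number.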

2. Code:

import org.apache.spark.{SparkConf, SparkContext}

object StratifiedSamplingLearning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName(this.getClass.getSimpleName.filter(!_.equals('$')))
    val sc = new SparkContext(conf)

    println("First:")
    val data = sc.textFile("D:\\TestData\\StratifiedSampling.txt") // read the data
      .map(row => { // process each line
        if (row.length == 3) // check the string length
          (row, 1) // length 3 -> stratum 1
        else (row, 2) // anything else -> stratum 2
      }).map(each => (each._2, each._1)) // swap so that the stratum becomes the key
    data.foreach(println)

    println("sampleByKey:")
    val fractions: Map[Int, Double] = List((1, 0.2), (2, 0.8)).toMap // sampling fraction for each stratum
    val approxSample = data.sampleByKey(withReplacement = false, fractions, 0) // draw the stratified sample
    approxSample.foreach(println)

    println("Second:")
    val randRDD = sc.parallelize(List((7, "cat"), (6, "mouse"), (7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
    val sampleMap = List((7, 0.4), (6, 0.8)).toMap
    val sample2 = randRDD.sampleByKey(false, sampleMap, 42).collect
    sample2.foreach(println)

    println("Third:")
    val a = sc.parallelize(1 to 20, 3)
    val b = a.sample(true, 0.8, 0)  // with replacement
    val c = a.sample(false, 0.8, 0) // without replacement
    println("RDD a : " + a.collect().mkString(" , "))
    println("RDD b : " + b.collect().mkString(" , "))
    println("RDD c : " + c.collect().mkString(" , "))

    sc.stop()
  }
}

3. Results:

First:
(2,aa)
(1,bbb)
(2,bb)
(1,ccc)
(2,cc)
(1,ddd)
(2,dd)
(1,eee)
(2,ee)
(1,aaa)
sampleByKey:
(2,aa)
(2,bb)
(2,cc)
(2,ee)
Second:
(7,cat)
(6,mouse)
(6,book)
(6,screen)
(7,heater)
Third:
RDD a : 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20
RDD b : 2 , 4 , 5 , 6 , 10 , 14 , 19 , 20
RDD c : 1 , 2 , 4 , 5 , 8 , 10 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20
