Computing the median of a Spark RDD

Reposted article. 2017-07-12 10:47. Tags: python, spark

lookup(key)

Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

>>> l = range(1000)
>>> rdd = sc.parallelize(zip(l, l), 10)
>>> rdd.lookup(42)  # slow
[42]
>>> sorted = rdd.sortByKey()
>>> sorted.lookup(42)  # fast
[42]
>>> sorted.lookup(1024)
[]
>>> rdd2 = sc.parallelize([(('a', 'b'), 'c')]).groupByKey()
>>> list(rdd2.lookup(('a', 'b'))[0])
['c']

To find the median, sort the RDD and take the middle element, or the average of the two middle elements when the number of elements is even. Here is an example with RDD[Int]:

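  import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on older Spark versions)
  import org.apache.spark.rdd.RDD

  val rdd: RDD[Int] = ???

  // Sort the values, attach each value's position, and key the RDD by that position.
  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }

  val count = sorted.count()

  // Note: an empty RDD is not handled here; lookup(...).head would throw.
  val median: Double = if (count % 2 == 0) {
    // Even count: average the two middle elements.
    val l = count / 2 - 1
    val r = l + 1
    (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
  } else sorted.lookup(count / 2).head.toDouble  // Odd count: the single middle element.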


Experiment:

all_data = sc.parallelize([25,1,2,3,4,5,6,7,8,100])
all_data.sortBy(lambda x: x).zipWithIndex().map(lambda x: (x[1], x[0])).collect()
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 25), (9, 100)]
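
The experiment stops at the indexed, sorted pairs. A minimal sketch of how the remaining steps of the Scala answer might look in PySpark follows. The function name median_from_rdd is hypothetical, and the .cache() call and the empty-RDD check are additions that are not in the original answer. It uses only sortBy, zipWithIndex, map, cache, count and lookup on the RDD.

```python
def median_from_rdd(rdd):
    """Median of a numeric RDD via sort + index + lookup.

    Hypothetical helper that mirrors the Scala answer above.
    """
    # Sort the values, attach each value's position, and key by position.
    indexed = (
        rdd.sortBy(lambda x: x)
        .zipWithIndex()
        .map(lambda pair: (pair[1], pair[0]))
        .cache()  # assumption: cache so count() and lookup() do not re-sort
    )
    count = indexed.count()
    if count == 0:
        # Addition: the original answer does not handle an empty RDD.
        raise ValueError("median of an empty RDD is undefined")

    mid = count // 2
    if count % 2 == 0:
        # Even count: average the two middle elements.
        return (indexed.lookup(mid - 1)[0] + indexed.lookup(mid)[0]) / 2.0
    # Odd count: the single middle element.
    return float(indexed.lookup(mid)[0])
```

For the all_data RDD from the experiment, the two middle elements are at indices 4 and 5 (values 5 and 6), so median_from_rdd(all_data) should return 5.5.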
