The difference between repartition and partitionBy in Spark

Reposted article, dated 2018-10-25 23:21. Category: Spark

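Both repartition and partitionBy redistribute the data of an RDD across partitions, and both normally rely on a HashPartitioner (repartition always uses one internally). One difference is that partitionBy is only available on a PairRDD. The more surprising difference is that when both are applied to the same PairRDD, they produce different results.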
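The comparison can be reproduced with a small local job. The following is a minimal sketch, assuming Spark core is on the classpath and a local master is acceptable; the object name, the sample data and the helper keysPerPartition are made up for illustration and are not from the original article.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RepartitionVsPartitionBy {
  // Hypothetical sample data: two records for each of the keys 1, 2 and 3.
  val sample: Seq[(Int, String)] =
    Seq((1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e"), (3, "f"))

  // For each partition, in partition order, the set of keys stored in it.
  def keysPerPartition(rdd: RDD[(Int, String)]): Array[Set[Int]] =
    rdd.glom().collect().map(_.map(_._1).toSet)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("repartition-vs-partitionBy")
      .setMaster("local[2]") // assumption: run locally with two threads
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(sample, 2)

      // repartition ignores the record keys when choosing a target partition.
      println("repartition(3): " +
        keysPerPartition(pairs.repartition(3)).mkString(" | "))

      // partitionBy routes each record by the hash of its own key.
      println("partitionBy(3): " +
        keysPerPartition(pairs.partitionBy(new HashPartitioner(3))).mkString(" | "))
    } finally {
      sc.stop()
    }
  }
}
```

With partitionBy, every key ends up in exactly one partition. With repartition, the same key will typically show up in more than one partition.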

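partitionBy puts records with the same key into the same partition, which is the result we expect; repartition does not. To see why, let us look at the source code of repartition: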

/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 *
 * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions. If a larger number
 * of partitions is requested, it will stay at the current number of partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * @note With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner. The optional partition coalescer
 * passed in must be serializable.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}

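So even for a PairRDD, repartition does not use the RDD's own keys. It wraps every element in a new pair whose key is a generated integer (a random starting position for each input partition, incremented by one for each element), and the HashPartitioner is applied to that generated key, not to the original key.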
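The effect of the generated key can be seen without a cluster. The following is a rough sketch in plain Scala that mimics the two routing rules on in-memory sequences. It assumes that HashPartitioner maps a non-null key to its non-negative hashCode modulo the partition count, and it leaves out the CoalescedRDD wrapper. The object RepartitionKeySketch and all of its members are hypothetical names, not Spark APIs.

```scala
import scala.util.Random
import scala.util.hashing

object RepartitionKeySketch {
  // Assumed HashPartitioner rule for a non-null key:
  // non-negative (key.hashCode mod numPartitions).
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Mirrors distributePartition in coalesce: a seeded random start for the
  // input partition, then one increment per element. The element's own key
  // is never looked at.
  def syntheticKeys[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // partitionBy-style routing: the record's own key decides the partition.
  def routeByOwnKey[V](parts: Seq[Seq[(Int, V)]], n: Int): Map[Int, Seq[(Int, V)]] =
    parts.flatten.groupBy { case (k, _) => hashPartition(k, n) }

  // repartition-style routing: the generated position decides the partition.
  def routeBySyntheticKey[V](parts: Seq[Seq[(Int, V)]], n: Int): Map[Int, Seq[(Int, V)]] =
    parts.zipWithIndex
      .flatMap { case (items, index) => syntheticKeys(index, items, n) }
      .groupBy { case (pos, _) => hashPartition(pos, n) }
      .map { case (p, tagged) => p -> tagged.map(_._2) }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: two input partitions of key/value records.
    val input = Seq(
      Seq((1, "a"), (1, "b"), (2, "c")),
      Seq((2, "d"), (3, "e"), (3, "f")))

    println("by own key:       " + routeByOwnKey(input, 3).toSeq.sortBy(_._1))
    println("by synthetic key: " + routeBySyntheticKey(input, 3).toSeq.sortBy(_._1))
  }
}
```

Because the generated positions within one input partition are consecutive integers, neighbouring records such as (1, "a") and (1, "b") are sent to different output partitions. That spreads the data evenly, but it also separates records that share a key. When records with the same key must stay together, partitionBy with an explicit partitioner is the operation to reach for.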
