The difference between spark.sql.shuffle.partitions and spark.default.parallelism

This article is a repost. 2019-06-01 19:04 Spark

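When configuring the parallelism of Spark jobs, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?

First, let's look at how they are defined.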

Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

Property Name: spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

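The two definitions look quite similar, but practical testing shows a clear split: spark.default.parallelism only takes effect when working with RDDs and has no effect on Spark SQL, while spark.sql.shuffle.partitions is a setting specific to Spark SQL.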

We can change both settings when submitting a job by passing --conf, as follows:

spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
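If the values come from a script rather than being typed by hand, the same command line can be assembled programmatically. The sketch below is only an illustration: `build_submit_args` is a hypothetical helper, and `my_job.py` is a placeholder for the application file, which the command above does not show.

```python
import shlex


def build_submit_args(app, conf):
    """Build a spark-submit argument list with one --conf pair per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application file goes after the options
    return args


settings = {
    "spark.sql.shuffle.partitions": 20,
    "spark.default.parallelism": 20,
}
command = build_submit_args("my_job.py", settings)  # placeholder application file

# Print the command as a single shell-safe line for inspection.
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command)` on a machine where spark-submit is on the PATH. Keeping the settings in a dictionary makes it simple to vary the two values independently between runs.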
