When configuring the parallelism of Spark jobs, two parameters come up frequently: spark.sql.shuffle.partitions and spark.default.parallelism. So what exactly is the difference between them?
First, let's look at their definitions:
Property Name | Default | Meaning |
---|---|---|
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine; Mesos fine grained mode: 8; Others: total number of cores on all executor nodes or 2, whichever is larger | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. |
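As a quick sanity check of the defaults in the table, you can ask a session what these values resolve to. A minimal sketch, assuming a 4-core local run; the app name is purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// "inspect-defaults" is just an illustrative app name.
val spark = SparkSession.builder()
  .appName("inspect-defaults")
  .master("local[4]") // assumes 4 local cores
  .getOrCreate()

// Prints "200" unless overridden.
println(spark.conf.get("spark.sql.shuffle.partitions"))

// In local[4] mode this resolves to 4 (the number of cores),
// matching the "Local mode" rule in the table above.
println(spark.sparkContext.defaultParallelism)
```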
Their definitions read quite similarly, but actual testing shows:

- spark.default.parallelism only takes effect when working with RDDs; it has no effect on Spark SQL.
- spark.sql.shuffle.partitions is a setting specific to Spark SQL.

A sketch demonstrating both behaviors follows.
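The sketch below is a minimal, illustrative example rather than a definitive test: the partition counts (100 and 50), the app name, and the toy data are arbitrary, and adaptive query execution is explicitly disabled so the SQL shuffle count stays deterministic on Spark 3.x:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismDemo extends App {
  val spark = SparkSession.builder()
    .appName("parallelism-demo")                   // hypothetical app name
    .master("local[4]")                            // assumes a 4-core local run
    .config("spark.default.parallelism", "100")    // illustrative value
    .config("spark.sql.shuffle.partitions", "50")  // illustrative value
    .config("spark.sql.adaptive.enabled", "false") // keep the SQL shuffle count fixed (Spark 3.x)
    .getOrCreate()
  import spark.implicits._
  val sc = spark.sparkContext

  // RDD path: the reduceByKey shuffle picks up spark.default.parallelism.
  val reduced = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
  println(s"RDD shuffle partitions: ${reduced.getNumPartitions}") // expected: 100

  // SQL path: the groupBy shuffle uses spark.sql.shuffle.partitions instead;
  // spark.default.parallelism plays no role here.
  val agg = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v").groupBy("k").count()
  println(s"SQL shuffle partitions: ${agg.rdd.getNumPartitions}") // expected: 50

  spark.stop()
}
```

Running this, the RDD aggregation lands in 100 partitions while the structurally identical DataFrame aggregation lands in 50, which is exactly the split described in the two bullets above.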