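Two parameters come up often when setting the parallelism of a Spark job: spark.sql.shuffle.partitions and spark.default.parallelism. So what exactly is the difference between them?
First, let's look at their definitions: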
Property Name | Default | Meaning |
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager. | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. |
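Their definitions look quite similar, but in actual testing they behave differently:
spark.default.parallelism only takes effect when working with RDDs and has no effect on Spark SQL, while spark.sql.shuffle.partitions is a setting specific to Spark SQL.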
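
The rule described above can be sketched as a small pure-Python model. It only illustrates which setting governs the partition count after a shuffle and is not Spark's actual implementation. The function name `post_shuffle_partitions` and its `api` argument are hypothetical, and the model deliberately ignores details such as a parent RDD that already has a partitioner or adaptive query execution coalescing partitions.

```python
from typing import Mapping, Sequence

SQL_KEY = "spark.sql.shuffle.partitions"
RDD_KEY = "spark.default.parallelism"
SQL_DEFAULT = 200  # documented default of spark.sql.shuffle.partitions


def post_shuffle_partitions(
    api: str,
    conf: Mapping[str, str],
    parent_partitions: Sequence[int] = (),
) -> int:
    """Toy model (hypothetical helper): partitions produced by a shuffle."""
    if api == "sql":
        # Spark SQL joins and aggregations look only at the SQL setting.
        return int(conf.get(SQL_KEY, SQL_DEFAULT))
    if api == "rdd":
        # RDD shuffles use spark.default.parallelism when it is set ...
        if RDD_KEY in conf:
            return int(conf[RDD_KEY])
        # ... otherwise the largest partition count among the parent RDDs.
        if not parent_partitions:
            raise ValueError("need parent partition counts when the key is unset")
        return max(parent_partitions)
    raise ValueError(f"unknown api: {api!r}")


if __name__ == "__main__":
    conf = {RDD_KEY: "20"}
    # The RDD setting does not change the SQL result.
    print(post_shuffle_partitions("sql", conf))          # 200
    print(post_shuffle_partitions("rdd", conf, [8, 4]))  # 20
```

In this model, setting only spark.default.parallelism leaves the SQL result at its default of 200, which matches the behaviour observed in testing.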
Both values can be changed with --conf when submitting a job, as follows:
spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
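
When jobs are launched from a script, the same command line can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `spark_submit_command` and the application file `my_job.py` are hypothetical placeholders, not something from the original command.

```python
import shlex
from typing import List, Mapping, Sequence


def spark_submit_command(
    conf: Mapping[str, object],
    app: str,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Build a spark-submit argument list with one --conf per setting."""
    cmd = ["spark-submit"]
    for key, value in conf.items():
        # Each setting is passed as its own "--conf key=value" pair.
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)        # application script or jar (placeholder here)
    cmd.extend(app_args)   # arguments forwarded to the application
    return cmd


if __name__ == "__main__":
    cmd = spark_submit_command(
        {
            "spark.sql.shuffle.partitions": 20,
            "spark.default.parallelism": 20,
        },
        "my_job.py",  # hypothetical application file
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting issues.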