The difference between spark.sql.shuffle.partitions and spark.default.parallelism


 

When tuning the parallelism of Spark jobs, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. So what exactly is the difference between them?

First, let's look at their definitions:

Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

Property Name: spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Mesos fine-grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
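The two settings are also configured differently. As a minimal sketch (assuming an existing SparkSession named `spark`, e.g. the one predefined in spark-shell; the value 50 is illustrative, not a recommendation), the SQL setting can be read and changed at runtime, while the core setting must be fixed before the SparkContext starts:

```scala
// SQL configs are accessible through the RuntimeConfig API:
spark.conf.get("spark.sql.shuffle.partitions")        // "200" unless overridden
spark.conf.set("spark.sql.shuffle.partitions", "50")  // applies to subsequent SQL shuffles

// spark.default.parallelism is a core (non-SQL) setting, so it has to be
// set before the SparkContext is created, e.g. on the SparkConf or via
//   spark-submit --conf spark.default.parallelism=8 ...
spark.sparkContext.defaultParallelism                 // the resolved value
```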

Their definitions look quite similar, but in actual testing:

  • spark.default.parallelism only takes effect when processing RDDs; it has no effect on Spark SQL.
  • spark.sql.shuffle.partitions, on the other hand, is a setting specific to Spark SQL, as the sketch below demonstrates.
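This behavioral difference can be verified with a short program. Below is a self-contained sketch (the app name, master URL, and the values 8 and 200 are illustrative; spark.sql.adaptive.enabled is turned off only so that adaptive execution does not coalesce the SQL shuffle partitions and hide the configured value):

```scala
import org.apache.spark.sql.SparkSession

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelism-demo")
      .master("local[4]")
      .config("spark.default.parallelism", "8")
      .config("spark.sql.shuffle.partitions", "200")
      .config("spark.sql.adaptive.enabled", "false") // keep partition counts deterministic
      .getOrCreate()
    import spark.implicits._

    // RDD path: reduceByKey has no parent partitioner here, so it falls
    // back to spark.default.parallelism.
    val counts = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .reduceByKey(_ + _)
    println(counts.getNumPartitions) // 8, from spark.default.parallelism

    // SQL path: the groupBy shuffle ignores spark.default.parallelism
    // and uses spark.sql.shuffle.partitions instead.
    val agg = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v").groupBy("k").count()
    println(agg.rdd.getNumPartitions) // 200, from spark.sql.shuffle.partitions

    spark.stop()
  }
}
```

In other words, to control the parallelism of DataFrame and Spark SQL shuffles you must set spark.sql.shuffle.partitions; setting spark.default.parallelism there will silently do nothing.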

 

