From the official docs
What is a shuffle?
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
I copied that passage verbatim; it boils down to one sentence:
the shuffle is the process of pulling data that lives on different nodes over to the same node.
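To make this concrete, here is a minimal Scala sketch, runnable locally; the app name, master setting, and sample data are made up for illustration. reduceByKey has to gather every value of each key from all partitions, which is exactly the all-to-all exchange the docs describe.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-demo")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Four partitions of (key, value) pairs; the same key can appear in several partitions.
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)

    // To sum the values of each key, Spark must pull every ("a", _) record to one task
    // and every ("b", _) record to another; that pull across partitions is the shuffle.
    val counts = rdd.reduceByKey(_ + _)
    counts.collect().foreach(println)   // (a,4), (b,6)

    spark.stop()
  }
}
```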
Which operators cause a shuffle?
Operations which can cause a shuffle include
repartition operations like repartition and coalesce,
‘ByKey operations (except for counting) like groupByKey and reduceByKey,
and join operations like cogroup and join.
That one sentence neatly summarizes the categories of shuffle operators in Spark:
Repartition operators
(repartition, coalesce)
ByKey operators
(groupByKey, reduceByKey)
Join operators
(cogroup, join)
A closer look at the three categories of shuffle operators
The operators listed on the official site are really just the most commonly used ones; a short code sketch follows each list below.
Repartition operators
repartition
coalesce
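A sketch of the repartition operators, assuming the same SparkContext `sc` as in the example above (in spark-shell, `sc` is predefined); the partition counts are arbitrary:

```scala
val data = sc.parallelize(1 to 100, 8)

// repartition always performs a full shuffle to reach the target number of partitions.
val wider = data.repartition(16)

// coalesce only shuffles when asked to: by default it merges partitions without a shuffle,
// but passing shuffle = true (which is what repartition does internally) forces one.
val narrower   = data.coalesce(2)                  // no shuffle
val rebalanced = data.coalesce(2, shuffle = true)  // shuffle
```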
ByKey operators
groupByKey
reduceByKey
aggregateByKey
combineByKey
sortByKey
sortBy
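A sketch of the ByKey operators on a small pair RDD (same `sc` assumption; the aggregation logic is just a running sum for illustration):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)

val grouped    = pairs.groupByKey()                      // shuffle: all values of a key end up together
val reduced    = pairs.reduceByKey(_ + _)                // shuffle, with map-side combining first
val aggregated = pairs.aggregateByKey(0)(_ + _, _ + _)   // shuffle: zero value + seqOp + combOp
val combined   = pairs.combineByKey(
  (v: Int) => v,                                         // createCombiner
  (acc: Int, v: Int) => acc + v,                         // mergeValue within a partition
  (a: Int, b: Int) => a + b)                             // mergeCombiners across partitions
val byKeyOrder = pairs.sortByKey()                       // shuffle: range-partitions the data by key
val byValue    = pairs.sortBy(_._2)                      // shuffle: built on top of sortByKey
```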
Join operators
cogroup
join
leftOuterJoin
intersection
subtract
subtractByKey
(Let's tentatively file the last three under the join category as well.)
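A sketch of the join-style operators (same `sc` assumption; the datasets are made up):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("d", "z")))

val joined  = left.join(right)           // shuffle both sides so matching keys are co-located
val leftOut = left.leftOuterJoin(right)  // same, but keeps unmatched keys from `left`
val grouped = left.cogroup(right)        // (key, (Iterable[Int], Iterable[String]))

val nums  = sc.parallelize(Seq(1, 2, 3, 4))
val other = sc.parallelize(Seq(3, 4, 5))
val common   = nums.intersection(other)  // shuffle to line up equal elements across partitions
val numsOnly = nums.subtract(other)      // shuffle
val keysOnly = left.subtractByKey(right) // shuffle, compares keys only
```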
Postscript
The official docs cover three categories; here is one more:
Deduplication operator
distinct
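distinct also shuffles, since duplicates of the same element may sit in different partitions and must be brought together to be deduplicated. A small sketch under the same `sc` assumption:

```scala
val dup = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), 4)

// Deduplicating across partitions requires collecting equal elements in one place,
// so distinct() triggers a shuffle.
val unique = dup.distinct()
unique.collect().sorted.foreach(println)   // 1, 2, 3
```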
Original article: https://blog.csdn.net/Android_xue/article/details/102806676