From the official docs
What is a shuffle?
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
I copied that passage verbatim; it boils down to one sentence:
the shuffle is the process of pulling data that lives on different nodes over to the same node.
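To make this concrete, here is a minimal Scala sketch, runnable locally; the app name, master setting, and sample data are made up for illustration. reduceByKey has to gather every value of each key from all partitions, which is exactly the all-to-all exchange the docs describe.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-demo")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Four partitions of (key, value) pairs; the same key can appear in several partitions.
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)

    // To sum the values of each key, Spark must pull every ("a", _) record to one task
    // and every ("b", _) record to another; that pull across partitions is the shuffle.
    val counts = rdd.reduceByKey(_ + _)
    counts.collect().foreach(println)   // (a,4), (b,6)

    spark.stop()
  }
}
```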
Which operators cause a shuffle?
Operations which can cause a shuffle include
repartition operations like repartition and coalesce,
‘ByKey operations (except for counting) like groupByKey and reduceByKey,
and join operations like cogroup and join.
That one sentence neatly summarizes the categories of shuffle operators in Spark:
Repartition operators
(repartition, coalesce)
ByKey operators
(groupByKey, reduceByKey)
Join operators
(cogroup, join)
A closer look at the three categories of shuffle operators
The operators listed on the official site are really just the most commonly used ones; a short code sketch follows each list below.
Repartition operators
repartition
coalesce
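A sketch of the repartition operators, assuming the same SparkContext `sc` as in the example above (in spark-shell, `sc` is predefined); the partition counts are arbitrary:

```scala
val data = sc.parallelize(1 to 100, 8)

// repartition always performs a full shuffle to reach the target number of partitions.
val wider = data.repartition(16)

// coalesce only shuffles when asked to: by default it merges partitions without a shuffle,
// but passing shuffle = true (which is what repartition does internally) forces one.
val narrower   = data.coalesce(2)                  // no shuffle
val rebalanced = data.coalesce(2, shuffle = true)  // shuffle
```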
ByKey operators
groupByKey
reduceByKey
aggregateByKey
combineByKey
sortByKey
sortBy
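A sketch of the ByKey operators on a small pair RDD (same `sc` assumption; the aggregation logic is just a running sum for illustration):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)

val grouped    = pairs.groupByKey()                      // shuffle: all values of a key end up together
val reduced    = pairs.reduceByKey(_ + _)                // shuffle, with map-side combining first
val aggregated = pairs.aggregateByKey(0)(_ + _, _ + _)   // shuffle: zero value + seqOp + combOp
val combined   = pairs.combineByKey(
  (v: Int) => v,                                         // createCombiner
  (acc: Int, v: Int) => acc + v,                         // mergeValue within a partition
  (a: Int, b: Int) => a + b)                             // mergeCombiners across partitions
val byKeyOrder = pairs.sortByKey()                       // shuffle: range-partitions the data by key
val byValue    = pairs.sortBy(_._2)                      // shuffle: built on top of sortByKey
```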
Join operators
cogroup
join
leftOuterJoin
intersection
subtract
subtractByKey
(Let's tentatively file the last three under the join category as well.)
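A sketch of the join-style operators (same `sc` assumption; the datasets are made up):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("d", "z")))

val joined  = left.join(right)           // shuffle both sides so matching keys are co-located
val leftOut = left.leftOuterJoin(right)  // same, but keeps unmatched keys from `left`
val grouped = left.cogroup(right)        // (key, (Iterable[Int], Iterable[String]))

val nums  = sc.parallelize(Seq(1, 2, 3, 4))
val other = sc.parallelize(Seq(3, 4, 5))
val common   = nums.intersection(other)  // shuffle to line up equal elements across partitions
val numsOnly = nums.subtract(other)      // shuffle
val keysOnly = left.subtractByKey(right) // shuffle, compares keys only
```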
Postscript
The official docs cover three categories; here is one more:
Deduplication operator
distinct
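distinct also shuffles, since duplicates of the same element may sit in different partitions and must be brought together to be deduplicated. A small sketch under the same `sc` assumption:

```scala
val dup = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), 4)

// Deduplicating across partitions requires collecting equal elements in one place,
// so distinct() triggers a shuffle.
val unique = dup.distinct()
unique.collect().sorted.foreach(println)   // 1, 2, 3
```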
Original article: https://blog.csdn.net/Android_xue/article/details/102806676