Spark源码分析之Sort-Based Shuffle读写流程

本文转载自查看原文 2017-12-16 22:33 1914 Spark

一、概述

我们知道Spark Shuffle机制总共有三种：

1.未优化的Hash Shuffle：每一个ShuffleMapTask都会为每一个ReducerTask创建一个单独的文件，总的文件数是S * R,不仅文件数量很多，造成频繁的磁盘和网络I/O,而且内存负担也很大，GC频繁,经常出现OOM。

2.优化后Hash Shuffle：改进后的Shuffle,启用consolidation机制，Executor每一个core上的ShuffleMapTask共享文件，减少文件数目，比如Executor有2个core,总共有20个ShuffleMapTask，ReducerTask任务为4个，那么这里总共只有2 * 4 = 8个文件，和未优化之前相比较20 * 4 = 80个文件比较，改进较大。但是如果数据很大的情况下，优化后的Hash Shuffle依然会存在各种问题。比如数据量很大的时候groupByKey操作，必须保证每一个partition的数据内存可以存放。

3.Sort-Based Shuffle: 为了缓解Shuffle过程产生文件数过多和Writer缓存开销过大的问题，spark引入了类似于hadoop Map-Reduce的shuffle机制。该机制每一个ShuffleMapTask不会为后续的任务创建单独的文件，而是会将所有的Task结果写入同一个文件，并且对应生成一个索引文件。以前的数据是放在内存缓存中，等到数据完了再刷到磁盘，现在为了减少内存的使用，在内存不够用的时候，可以将输出溢写到磁盘，结束的时候，再将这些不同的文件联合内存的数据一起进行归并，从而减少内存的使用量。一方面文件数量显著减少，另一方面减少Writer缓存所占用的内存大小，而且同时避免GC的风险和频率。

二、Sort-BasedShuffle写机制

2.1 ShuffleMapTask获取ShuffleManager

Spark1.6之后，取消hash机制的shuffle, 只剩下基于sort的shuffle机制。我们可以在配置文件指定spark.shuffle.manager，如果没有指定默认就是sort，但是tungsten-sort也是基于SortShuffleManager的

valshortShuffleMgrNames = Map( "sort"-> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName, "tungsten-sort"->classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName) val shuffleMgrName= conf.get("spark.shuffle.manager","sort") val shuffleMgrClass= shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase,shuffleMgrName) val shuffleManager= instantiateClass[ShuffleManager](shuffleMgrClass)

2.2 根据ShuffleManager获取writer

ShuffleManager会根据注册的handle来决定实例化哪一个writer.如果注册的是SerializedShuffleHandle，就获取UnsafeShuffleWriter；如果注册的是BypassMergeSortShuffleHandle，就获取BypassMergeSortShuffleWriter；如果注册的是BaseShuffleHandle，就获取SortShuffleWriter

首先：我们看一下ShuffleManager如何注册ShuffleHandle的？

override def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int, dependency: ShuffleDependency[K, V, C]): ShuffleHandle = { // 如果满足使用BypassMergeSort,就优先使用BypassMergeSortShuffleHandle
  if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) { new BypassMergeSortShuffleHandle[K, V]( shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]]) } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) { // 如果支持序列化模式，则使用SerializedShuffleHandle
    new SerializedShuffleHandle[K, V]( shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]]) } else { // 否则使用BaseShuffleHandle
    new BaseShuffleHandle(shuffleId, numMaps, dependency) } }

然后：看一下shouldBypassMergeSort这个方法，判断是否应该使用BypassMergeSort

使用这个模式需要满足的条件：

# 不能指定aggregator,即不能聚合

# 不能指定ordering，即分区内数据不能排序

# 分区的数目 < spark.shuffle.sort.bypassMergeThrshold指定的阀值

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = { // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) { require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!") false } else { val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200) dep.partitioner.numPartitions <= bypassMergeThreshold } }

最后：我们分析一下canUseSerializedShuffle函数，来确定是否使用Tungsten-Sort支持的序列化模式SerializedShuffleHandle

满足条件：

# shuffle依赖不带有聚合操作

# 支持序列化值的重新定位

# 分区数量少于16777216个

 def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = { val shufId = dependency.shuffleId // 获取分区数
    val numPartitions = dependency.partitioner.numPartitions // 如果不支持序列化值的重新定位
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) { log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " + s"${dependency.serializer.getClass.getName}, does not support object relocation") false } // 如果定义聚合器
    else if (dependency.aggregator.isDefined) { log.debug( s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined") false } // 如果分区数量大于16777216个
    else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) { log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " + s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions") false } else { log.debug(s"Can use serialized shuffle for shuffle $shufId") true } } } override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V] = { numMapsForShuffle.putIfAbsent( handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps) val env = SparkEnv.get handle match { // 如果使用SerializedShuffleHandle则获取UnsafeShuffleWriter
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      new UnsafeShuffleWriter( env.blockManager, shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver], context.taskMemoryManager(), unsafeShuffleHandle, mapId, context, env.conf) // 如果使用BypassMergeSortShuffleHandle则获取BypassMergeSortShuffleWriter
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      new BypassMergeSortShuffleWriter( env.blockManager, shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver], bypassMergeSortHandle, mapId, context, env.conf) // 如果使用BaseShuffleHandle则获取SortShuffleWriter
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      new SortShuffleWriter(shuffleBlockResolver, other, mapId, context) } }

2.3 BypassMergeSortShuffleWriter的写机制分析

BypassMergeSortShuffleWriter：实现带Hash风格的基于Sort的Shuffle机制。在Reducer端任务数比较少的情况下，基于Hash的Shuffle实现机制明显比Sort的Shuffle实现快。所以基于Sort的Shuffle实现机制提供一个方案，当Reducer任务数少于配置的属性spark.shuffle.sort.bypassMergeThreshold设置的个数的时候，则使用此种方案。

特点：

# 主要用于处理不需要排序和聚合的Shuffle操作，所以数据是直接写入文件，数据量较大的时候，网络I/O和内存负担较重

# 主要适合处理Reducer任务数量比较少的情况下

# 将每一个分区写入一个单独的文件，最后将这些文件合并,减少文件数量；但是这种方式需要并发打开多个文件，对内存消耗比较大

public void write(Iterator<Product2<K,V>> records)throws IOException{ assert (partitionWriters== null); if (!records.hasNext()) { partitionLengths= new long[numPartitions]; shuffleBlockResolver.writeIndexFileAndCommit(shuffleId,mapId, partitionLengths,null); mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(),partitionLengths); return; } final SerializerInstanceserInstance = serializer.newInstance(); final long openStartTime= System.nanoTime(); // 构建一个对于task结果进行分区的数量的writer数组，即每一个分区对应着一个writer // 这种写入方式，会同时打开numPartition个writer,所以分区数不宜设置过大 // 避免带来过重的内存开销。现在默认writer的缓存大小是32k，比起以前100k小太多了
  partitionWriters= new DiskBlockObjectWriter[numPartitions]; // 构建一个对于task结果进行分区的数量的FileSegment数组，寄一个分区的writer对应着一组FileSegment
  partitionWriterSegments= new FileSegment[numPartitions]; for (int i = 0; i < numPartitions; i++) { // 创建临时的shuffle block,返回一个(shuffle blockid,file)的元组
    final Tuple2<TempShuffleBlockId,File> tempShuffleBlockIdPlusFile= blockManager.diskBlockManager().createTempShuffleBlock(); // 获取该分区对应的文件
    final Filefile = tempShuffleBlockIdPlusFile._2(); // 获取该分区对应的blockId
    final BlockIdblockId = tempShuffleBlockIdPlusFile._1(); // 构造每一个分区的writer
    partitionWriters[i] = blockManager.getDiskWriter(blockId,file, serInstance,fileBufferSize, writeMetrics); } writeMetrics.incWriteTime(System.nanoTime() -openStartTime); // 如果有数据，获取数据，对key进行分区，然后将<key,value>写入该分区对应的文件
  while (records.hasNext()) { final Product2<K,V> record =records.next(); final K key = record._1(); partitionWriters[partitioner.getPartition(key)].write(key,record._2()); } // 遍历所有分区的writer列表，刷新数据到文件，构建FileSegment数组
  for (inti = 0; i < numPartitions; i++) { final DiskBlockObjectWriterwriter = partitionWriters[i]; // 把数据刷到磁盘，构建一个FileSegment
    partitionWriterSegments[i] =writer.commitAndGet(); writer.close(); } // 根据shuffleId和mapId，构建ShuffleDataBlockId，创建文件，文件格式为: //shuffle_{shuffleId}_{mapId}_{reduceId}.data
  File output= shuffleBlockResolver.getDataFile(shuffleId,mapId); // 创建临时文件
  File tmp= Utils.tempFileWith(output); try { // 合并前面的生成的各个中间临时文件，并获取分区对应的数据大小，然后就可以计算偏移量
    partitionLengths= writePartitionedFile(tmp); // 创建索引文件，将每一个分区的起始位置、结束位置和偏移量写入索引， // 且将合并的data文件临时文件重命名，索引文件的临时文件重命名
 shuffleBlockResolver.writeIndexFileAndCommit(shuffleId,mapId, partitionLengths,tmp); } finally { if (tmp.exists() && !tmp.delete()) { logger.error("Error while deleting tempfile {}",tmp.getAbsolutePath()); } } // 封装并返回任何结果
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(),partitionLengths); } private long[]writePartitionedFile(FileoutputFile) throws IOException { // 构建一个分区数量的数组
  final long[] lengths = new long[numPartitions]; if (partitionWriters== null) { // We werepassed an empty iterator
    return lengths; } // 创建合并文件的临时文件输出流
  final FileOutputStreamout = new FileOutputStream(outputFile,true); final long writeStartTime= System.nanoTime(); boolean threwException= true; try { // 进行分区文件的合并，返回每一个分区文件长度
    for (inti = 0; i < numPartitions; i++) { // 获取该分区对应的FileSegment对应的文件
      final Filefile = partitionWriterSegments[i].file(); // 如果文件存在
      if (file.exists()) { final FileInputStreamin = new FileInputStream(file); boolean copyThrewException= true; try { // 把该文件拷贝到合并文件的临时文件，并返回文件长度
          lengths[i] =Utils.copyStream(in,out, false,transferToEnabled); copyThrewException= false; } finally { Closeables.close(in,copyThrewException); } if (!file.delete()) { logger.error("Unable to delete file forpartition {}",i); } } } threwException= false; } finally { Closeables.close(out,threwException); writeMetrics.incWriteTime(System.nanoTime() -writeStartTime); } partitionWriters= null; return lengths; } defwriteIndexFileAndCommit( shuffleId: Int, mapId: Int, lengths: Array[Long], dataTmp: File): Unit = { // 获取索引文件
  val indexFile= getIndexFile(shuffleId,mapId) // 生成临时的索引文件
  val indexTmp= Utils.tempFileWith(indexFile) try { val out = new DataOutputStream(newBufferedOutputStream(newFileOutputStream(indexTmp))) Utils.tryWithSafeFinally{ // 将offset写入索引文件写入临时的索引文件
      var offset= 0L out.writeLong(offset) for (length <- lengths) { offset += length out.writeLong(offset) } } { out.close() } // 获取数据文件
    val dataFile= getDataFile(shuffleId,mapId) // There isonly one IndexShuffleBlockResolver per executor, this synchronization make sure // the following check and rename areatomic.
    synchronized { // 传递进去的索引、数据文件以及每一个分区的文件的长度
      val existingLengths= checkIndexAndDataFile(indexFile,dataFile, lengths.length) if (existingLengths!= null) { // Anotherattempt for the same task has already written our map outputs successfully, // so just use the existingpartition lengths and delete our temporary map outputs.
        System.arraycopy(existingLengths,0, lengths,0, lengths.length) if (dataTmp!= null && dataTmp.exists()) { dataTmp.delete() } indexTmp.delete() } else { // This is thefirst successful attempt in writing the map outputs for this task, // so override any existing indexand data files with the ones we wrote.
        if (indexFile.exists()) { indexFile.delete() } if (dataFile.exists()) { dataFile.delete() } if (!indexTmp.renameTo(indexFile)) { throw new IOException("fail to rename file"+ indexTmp + " to " + indexFile) } if (dataTmp!= null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) { throw new IOException("fail to rename file"+ dataTmp + " to " + dataFile) } } } } finally { if (indexTmp.exists() && !indexTmp.delete()) { logError(s"Failed to delete temporary index file at${indexTmp.getAbsolutePath}") } } }

总结：

基于BypassMergeSortShuffleWriter的机制：

# 首先确定ShuffleMapTask的结果应该分为几个分区，并且为每一个分区创建一个DiskBlockObjectWriter和临时文件

# 将每一个ShuffleMapTask的结果通过Partitioner进行分区,写入对应分区的临时文件

# 将分区刷到磁盘文件，并且创建每一个分区文件对应的FileSegment数组

# 根据shuffleId和mapId，构建ShuffleDataBlockId，创建合并文件data和合并文件的临时文件，文件格式为:

shuffle_{shuffleId}_{mapId}_{reduceId}.data

# 将每一个分区对应的文件的数据合并到合并文件的临时文件，并且返回一个每一个分区对应的文件长度的数组

# 创建索引文件index和索引临时文件，每一个分区的长度和offset写入索引文件等；并且重命名临时data文件和临时index文件

# 将一些信息封装到MapStatus返回

2.4 SortShuffleWriter的写机制分析

SortShuffleWriter它主要是判断在Map端是否需要本地进行combine操作。如果需要聚合，则使用PartitionedAppendOnlyMap；如果不进行combine操作，则使用PartitionedPairBuffer添加数据存放于内存中。然后无论哪一种情况都需要判断内存是否足够，如果内存不够而且又申请不到内存，则需要进行本地磁盘溢写操作，把相关的数据写入溢写到临时文件。最后把内存里的数据和磁盘溢写的临时文件的数据进行合并，如果需要则进行一次归并排序，如果没有发生溢写则是不需要归并排序，因为都在内存里。最后生成合并后的data文件和index文件。

2.4.1 遍历数据，将task的输出写入文件

# 创建外部排序器ExternalSorter, 只是根据是否需要本地combine与否从而决定是否传入aggregator和keyOrdering参数

# 将写入数据全部放入外部排序器ExternalSorter,并且根据是否需要spill进行spill操作

# 创建data文件和临时的data文件，文件格式为'shuffle_{shuffleId}_{mapId}_{reducerId}.data' 先将数据写入临时data文件

# 创建index索引文件和临时index文件，写入每一个分区的offset以及length信息等，并且重命名data临时文件和index临时文件

# 把部分信息封装到MapStatus返回

override def write(records: Iterator[Product2[K, V]]): Unit = { // 是否map端需要在本地进行combine操作，如果需要，则需要传入aggregator和keyOrdering，创建ExternalSorter // aggregator用于指示进行combiner的操作( keyOrdering用于传递key的排序规则);
  sorter = if (dep.mapSideCombine) { require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!") new ExternalSorter[K, V, C]( context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer) } else { // 如果不需要在本地进行combine操作, 就不需要aggregator和keyOrdering // 那么本地每个分区的数据不会做聚合和排序
    new ExternalSorter[K, V, V]( context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer) } // 将写入数据全部放入外部排序器ExternalSorter,并且根据是否需要spill进行spill操作
 sorter.insertAll(records) // 创建data文件，文件格式为'shuffle_{shuffleId}_{mapId}_{reducerId}.data'
  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId) // 为data文件创建临时的文件
  val tmp = Utils.tempFileWith(output) try { // 创建Shuffle Block Id:shuffle_{shuffleId}_{mapId}_{reducerId}
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID) val partitionLengths = sorter.writePartitionedFile(blockId, tmp) // 创建index索引文件，写入每一个分区的offset以及length信息等，并且重命名data临时文件和index临时文件
 shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp) // 把部分信息封装到MapStatus返回
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths) } finally { if (tmp.exists() && !tmp.delete()) { logError(s"Error while deleting temp file ${tmp.getAbsolutePath}") } } }

2.4.2 将写入数据全部放入外部排序器ExternalSorter,并且根据是否需要spill进行spill操作

# 判断aggregator是否为空，如果不为空，表示需要在本地combine

# 如果需要本地combine，则使用PartitionedAppendOnlyMap，先在内存进行聚合，如果需要一些磁盘，则开始溢写磁盘

# 如果不进行combine操作，则使用PartitionedPairBuffer添加数据存放于内存中，如果需要一些磁盘，则开始溢写磁盘

def insertAll(records: Iterator[Product2[K, V]]): Unit = { // 判断aggregator是否为空，如果不为空，表示需要在本地combine
  val shouldCombine = aggregator.isDefined // 如果需要本地combine
  if (shouldCombine) { // 使用AppendOnlyMap优先在内存中进行combine // 获取aggregator的merge函数，用于merge新的值到聚合记录
    val mergeValue = aggregator.get.mergeValue // 获取aggregator的createCombiner函数，用于创建聚合的初始值
    val createCombiner = aggregator.get.createCombiner var kv: Product2[K, V] = null
    // 创建update函数，如果有值进行mergeValue,如果没有则createCombiner
    val update = (hadValue: Boolean, oldValue: C) => { if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2) } while (records.hasNext) { // 处理一个元素，就更新一次结果
 addElementsRead() // 取出一个(key,value)
      kv = records.next() // 对key计算分区，然后开始进行merge
 map.changeValue((getPartition(kv._1), kv._1), update) // 如果需要溢写内存数据到磁盘
      maybeSpillCollection(usingMap = true) } } else { // 不需要进行本地combine
    while (records.hasNext) { // 处理一个元素，就更新一次结果
 addElementsRead() // 取出一个(key,value)
      val kv = records.next() // 往PartitionedPairBuffer添加数据
 buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C]) // 如果需要溢写内存数据到磁盘
      maybeSpillCollection(usingMap = false) } } }

2.4.3 根据是否需要本地combine,从而决定初始化哪一个数据结构

private def maybeSpillCollection(usingMap: Boolean): Unit = { var estimatedSize = 0L
  // 如果使用PartitionedAppendOnlyMap存放数据，主要方便进行聚合
  if (usingMap) { // 首先估计一下该map的大小
    estimatedSize = map.estimateSize() // 然后会根据预估的map大小决定是否需要进行spill
    if (maybeSpill(map, estimatedSize)) { map = new PartitionedAppendOnlyMap[K, C] } } else {//否则使用PartitionedPairBuffer,以用于本地不需要进行聚合的情况 // 首先估计一下该map的大小
    estimatedSize = buffer.estimateSize() // 然后会根据预估的map大小决定是否需要进行spill
    if (maybeSpill(buffer, estimatedSize)) { buffer = new PartitionedPairBuffer[K, C] } } if (estimatedSize > _peakMemoryUsedBytes) { _peakMemoryUsedBytes = estimatedSize } }

2.4.4 判断是否需要溢写磁盘，如果需要则开始溢写

# 如果已经读取的数据是32的倍数且预计的当前需要的内存大于阀值的时候，准备申请内存

# 申请不成功或者申请完毕之后还是当前需要的内存还是不够，则表示需要进行spill

# 如果需要spill，则调用spill方法开始溢写磁盘，溢写完毕之后释放内存

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = { var shouldSpill = false
  // 如果读取的数据是32的倍数，而且当前内存大于内存阀值，默认是5M // 会先尝试向MemoryManager申请(2 * currentMemory - myMemoryThreshold)大小的内存 // 如果能够申请到，则不进行Spill操作，而是继续向Buffer中存储数据， // 否则就会调用spill()方法将Buffer中数据输出到磁盘文件
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { // 向MemoryManager申请内存的大小
    val amountToRequest = 2 * currentMemory - myMemoryThreshold // 分配内存,并更新已经使用的内存
    val granted = acquireMemory(amountToRequest) // 更新现在内存阀值
    myMemoryThreshold += granted // 再次判断当前内存是否大于阀值，如果还是大于阀值则继续spill
    shouldSpill = currentMemory >= myMemoryThreshold } shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold // 如果需要进行spill，则开始进行spill操作
  if (shouldSpill) { _spillCount += 1 logSpillage(currentMemory) // 开始spill
 spill(collection) _elementsRead = 0 _memoryBytesSpilled += currentMemory // 释放内存
 releaseMemory() } shouldSpill }

2.4.5 溢写磁盘

# 返回一个根据指定的比较器排序的迭代器

# 溢写内存里的数据到磁盘一个临时文件

# 更新溢写的临时磁盘文件

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = { // 返回一个根据指定的比较器排序的迭代器
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator) // 溢写内存里的数据到磁盘一个临时文件
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator) // 更新溢写的临时磁盘文件
  spills += spillFile }

2.4.6 溢写内存里的数据到磁盘一个临时文件

# 创建临时的blockId和文件

# 针对临时文件创建DiskBlockObjectWriter

# 循环读取内存里的数据

# 内存里的数据数据写入文件

# 将数据刷到磁盘

# 创建SpilledFile然后返回

private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator) : SpilledFile = { // 因为这些文件在shuffle期间可能被读取，他们压缩应该被spark.shuffle.spill.compress控制而不是 // spark.shuffle.compress，所以我们需要创建临时的shuffle block
  val (blockId, file) = diskBlockManager.createTempShuffleBlock() // These variables are reset after each flush
  var objectsWritten: Long = 0 val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics // 创建针对临时文件的writer
  val writer: DiskBlockObjectWriter = blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics) // 批量写入磁盘的列表
  val batchSizes = new ArrayBuffer[Long] // 每一个分区有多少数据
  val elementsPerPartition = new Array[Long](numPartitions) // 刷新数据到磁盘
  def flush(): Unit = { // 每一个分区对应文件刷新到磁盘，并返回对应的FileSegment
    val segment = writer.commitAndGet() // 获取该FileSegment对应的文件的长度，并且更新batchSizes
    batchSizes += segment.length _diskBytesSpilled += segment.length objectsWritten = 0 } var success = false
  try { // 循环读取内存里的数据
    while (inMemoryIterator.hasNext) { // 获取partitionId
      val partitionId = inMemoryIterator.nextPartition() require(partitionId >= 0 && partitionId < numPartitions, s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})") // 内存里的数据数据写入文件
 inMemoryIterator.writeNext(writer) elementsPerPartition(partitionId) += 1 objectsWritten += 1
      // 将数据刷到磁盘
      if (objectsWritten == serializerBatchSize) { flush() } } // 遍历完了之后，刷新到磁盘
    if (objectsWritten > 0) { flush() } else { writer.revertPartialWritesAndClose() } success = true } finally { if (success) { writer.close() } else { // This code path only happens if an exception was thrown above before we set success; // close our stuff and let the exception be thrown further
 writer.revertPartialWritesAndClose() if (file.exists()) { if (!file.delete()) { logWarning(s"Error deleting ${file}") } } } } // 创建SpilledFile然后返回
 SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition) }

2.4.7 对结果排序，合并文件

# 溢写文件为空，则内存足够，不需要溢写结果到磁盘, 返回一个对结果排序的迭代器, 遍历数据写入data临时文件；再将数据刷到磁盘文件，返回FileSegment对象；构造一个分区文件长度的数组

# 溢写文件不为空，则需要将溢写的文件和内存数据合并，合并之后则需要进行归并排序(merge-sort)；数据写入data临时文件，再将数据刷到磁盘文件，返回FileSegment对象；构造一个分区文件长度的数组

# 返回分区文件长度的数组

def writePartitionedFile(blockId: BlockId, outputFile: File): Array[Long] = { // Track location of each range in the output file // 临时的data文件跟踪每一个分区的位置 // 创建每一个分区对应的文件长度的数组
  val lengths = new Array[Long](numPartitions) // 创建DiskBlockObjectWriter对象
  val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize, context.taskMetrics().shuffleWriteMetrics) // 判断是否有进行spill的文件
  if (spills.isEmpty) { // 如果是空的表示我们只有内存数据,内存足够，不需要溢写结果到磁盘 // 如果指定aggregator，就返回PartitionedAppendOnlyMap里的数据，否则返回 // PartitionedPairBuffer里的数据
    val collection = if (aggregator.isDefined) map else buffer // 返回一个对结果排序的迭代器
    val it = collection.destructiveSortedWritablePartitionedIterator(comparator) while (it.hasNext) { // 获取partitionId
      val partitionId = it.nextPartition() // 通过writer将内存数据写入临时文件
      while (it.hasNext && it.nextPartition() == partitionId) { it.writeNext(writer) } // 数据刷到磁盘，并且创建FileSegment数组
      val segment = writer.commitAndGet() // 构造一个分区文件长度的数组
      lengths(partitionId) = segment.length } } else { // 否则，表示有溢写文件，则需要进行归并排序(merge-sort) // We must perform merge-sort; get an iterator by partition and write everything directly. // 每一个分区的数据都写入到data文件的临时文件
    for ((id, elements) <- this.partitionedIterator) { if (elements.hasNext) { for (elem <- elements) { writer.write(elem._1, elem._2) } // 数据刷到磁盘，并且创建FileSegment数组
        val segment = writer.commitAndGet() // 构造一个分区文件长度的数组
        lengths(id) = segment.length } } } writer.close() context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled) context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled) context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes) lengths }

2.4.8 返回遍历所有数据的迭代器

# 没有溢写，则判断是否需要对key排序，如果不需要则只是将数据按照partitionId排序，否则首先按照partitionId排序，然后partition内部再按照key排序

# 如果发生溢写，则需要将磁盘上溢写文件和内存里的数据进行合并

def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
  // 是否需要本地combine
  val usingMap = aggregator.isDefined
  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
  // 如果没有发生磁盘溢写
  if (spills.isEmpty) {
    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    // 而且不需要排序
    if (!ordering.isDefined) {
      // 数据只是按照partitionId排序，并不会对key进行排序
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
    } else {
      // 否则我们需要先按照partitionId排序，然后分区内部对key进行排序
      groupByPartition(destructiveIterator(
        collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
    }
  } else {
    // 如果发生了溢写操作，则需要将磁盘上溢写文件和内存里的数据进行合并
    merge(spills, destructiveIterator(
      collection.partitionedDestructiveSortedIterator(comparator)))
  }
}

三 Sort-Based Shuffle读机制

假设我们执行了reduceByKey算子，那么生成的RDD的就是ShuffleRDD,下游在运行任务的时候，则需要获取上游ShuffleRDD的数据，所以ShuffleRDD的compute方法是Shuffle读的起点。

下游的ReducerTask，可能是ShuffleMapTask也有可能是ResultTask，首先会去Driver获取parent stage中ShuffleMapTask输出的位置信息，根据位置信息获取index文件,然后解析index文件，从index文件中获取相关的位置等信息，然后读data文件获取属于自己那部分内容。

那什么ReducerTask什么时候去获取数据呢？当parent stage的所有ShuffleMapTask结束后再去fetch，然后一边fetch一边计算。

3.1ShuffleRDD的compute方法

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = { // ResultTask或者ShuffleMapTask在执行到ShuffleRDD时，肯定会调用ShuffleRDD的compute方法 // 来计算当前这个RDD的partition数据
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]] // 获取ShuffleManager的reader去拉取ShuffleMapTask，需要聚合的数据
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context) .read() .asInstanceOf[Iterator[(K, C)]] }

三 Sort-Based Shuffle读机制 假设我们执行了reduceByKey算子，那么生成的RDD的就是ShuffleRDD,下游在运行任务的时候，则需要获取上游ShuffleRDD的数据，所以ShuffleRDD的compute方法是Shuffle读的起点。 下游的ReducerTask，可能是ShuffleMapTask也有可能是ResultTask，首先会去Driver获取parent stage中ShuffleMapTask输出的位置信息，根据位置信息获取index文件,然后解析index文件，从index文件中获取相关的位置等信息，然后读data文件获取属于自己那部分内容。 那什么ReducerTask什么时候去获取数据呢？当parent stage的所有ShuffleMapTask结束后再去fetch，然后一边fetch一边计算。 3.1 ShuffleRDD的compute方法

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = { // ResultTask或者ShuffleMapTask在执行到ShuffleRDD时，肯定会调用ShuffleRDD的compute方法 // 来计算当前这个RDD的partition数据
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]] // 获取ShuffleManager的reader去拉取ShuffleMapTask，需要聚合的数据
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context) .read() .asInstanceOf[Iterator[(K, C)]] }

3.2 调用BlockStoreShuffleReader的read方法开始读取数据 # 创建ShuffleBlockFetcherIterator，一个迭代器，它获取多个块，对于本地块，从本地读取对于远程块，通过远程方法读取 # 如果reduce端需要聚合：如果map端已经聚合过了，则对读取到的聚合结果进行聚合； 如果map端没有聚合，则针对未合并的<k,v>进行聚合 # 如果需要对key排序，则进行排序。基于sort的shuffle实现过程中，默认只是按照partitionId排序。在每一个partition内部并没有排序，因此添加了keyOrdering变量，提供是否需要对分区内部的key排序

override def read(): Iterator[Product2[K, C]] = { // 构造ShuffleBlockFetcherIterator，一个迭代器，它获取多个块，对于本地块，从本地读取 // 对于远程块，通过远程方法读取
  val blockFetcherItr = new ShuffleBlockFetcherIterator( context, blockManager.shuffleClient, blockManager, //MapOutputTracker在SparkEnv启动的时候实例化
 mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition), // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024, SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue)) // 基于配置文件对于流进行包装
  val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) => serializerManager.wrapStream(blockId, inputStream) } // 获取序列化实例
  val serializerInstance = dep.serializer.newInstance() // 对于每一个流创建一个<key,value>迭代器
  val recordIter = wrappedStreams.flatMap { wrappedStream => serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator } // Update the context task metrics for each record read.
  val readMetrics = context.taskMetrics.createTempShuffleReadMetrics() val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]]( recordIter.map { record => readMetrics.incRecordsRead(1) record }, context.taskMetrics().mergeShuffleReadMetrics()) // An interruptible iterator must be used here in order to support task cancellation
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter) // 如果reduce端需要聚合
  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) { // 如果map端已经聚合过了
    if (dep.mapSideCombine) { //则对读取到的聚合结果进行聚合
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]] // 针对map端各个partition对key进行聚合后的结果再次聚合
 dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context) } else { // 如果map端没有聚合，则针对未合并的<k,v>进行聚合
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]] dep.aggregator.get.combineValuesByKey(keyValuesIterator, context) } } else { require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!") interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]] } // 如果需要对key排序，则进行排序。基于sort的shuffle实现过程中，默认只是按照partitionId排序 // 在每一个partition内部并没有排序，因此添加了keyOrdering变量，提供是否需要对分区内部的key排序
 dep.keyOrdering match { case Some(keyOrd: Ordering[K]) =>
      // 为了减少内存压力和避免GC开销，引入了外部排序器，当内存不足时会根据配置文件 // spark.shuffle.spill决定是否进行spill操作
      val sorter =
        new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer) sorter.insertAll(aggregatedIter) context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop()) case None =>
      // 不需要排序直接返回
 aggregatedIter } } def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = { // 是否需要本地combine
  val usingMap = aggregator.isDefined val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer // 如果没有发生磁盘溢写
  if (spills.isEmpty) { // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps // we don't even need to sort by anything other than partition ID // 而且不需要排序
    if (!ordering.isDefined) { // 数据只是按照partitionId排序，并不会对key进行排序
 groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None))) } else { // 否则我们需要先按照partitionId排序，然后分区内部对key进行排序
 groupByPartition(destructiveIterator( collection.partitionedDestructiveSortedIterator(Some(keyComparator)))) } } else { // 如果发生了溢写操作，则需要将磁盘上溢写文件和内存里的数据进行合并
 merge(spills, destructiveIterator( collection.partitionedDestructiveSortedIterator(comparator))) } }

3.2 调用BlockStoreShuffleReader的read方法开始读取数据

# 创建ShuffleBlockFetcherIterator，一个迭代器，它获取多个块，对于本地块，从本地读取对于远程块，通过远程方法读取

# 如果reduce端需要聚合：如果map端已经聚合过了，则对读取到的聚合结果进行聚合；如果map端没有聚合，则针对未合并的<k,v>进行聚合

# 如果需要对key排序，则进行排序。基于sort的shuffle实现过程中，默认只是按照partitionId排序。在每一个partition内部并没有排序，因此添加了keyOrdering变量，提供是否需要对分区内部的key排序

override def read(): Iterator[Product2[K, C]] = { // 构造ShuffleBlockFetcherIterator，一个迭代器，它获取多个块，对于本地块，从本地读取 // 对于远程块，通过远程方法读取
  val blockFetcherItr = new ShuffleBlockFetcherIterator( context, blockManager.shuffleClient, blockManager, //MapOutputTracker在SparkEnv启动的时候实例化
 mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition), // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024, SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue)) // 基于配置文件对于流进行包装
  val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) => serializerManager.wrapStream(blockId, inputStream) } // 获取序列化实例
  val serializerInstance = dep.serializer.newInstance() // 对于每一个流创建一个<key,value>迭代器
  val recordIter = wrappedStreams.flatMap { wrappedStream => serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator } // Update the context task metrics for each record read.
  val readMetrics = context.taskMetrics.createTempShuffleReadMetrics() val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]]( recordIter.map { record => readMetrics.incRecordsRead(1) record }, context.taskMetrics().mergeShuffleReadMetrics()) // An interruptible iterator must be used here in order to support task cancellation
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter) // 如果reduce端需要聚合
  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) { // 如果map端已经聚合过了
    if (dep.mapSideCombine) { //则对读取到的聚合结果进行聚合
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]] // 针对map端各个partition对key进行聚合后的结果再次聚合
 dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context) } else { // 如果map端没有聚合，则针对未合并的<k,v>进行聚合
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]] dep.aggregator.get.combineValuesByKey(keyValuesIterator, context) } } else { require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!") interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]] } // 如果需要对key排序，则进行排序。基于sort的shuffle实现过程中，默认只是按照partitionId排序 // 在每一个partition内部并没有排序，因此添加了keyOrdering变量，提供是否需要对分区内部的key排序
 dep.keyOrdering match { case Some(keyOrd: Ordering[K]) =>
      // 为了减少内存压力和避免GC开销，引入了外部排序器，当内存不足时会根据配置文件 // spark.shuffle.spill决定是否进行spill操作
      val sorter =
        new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer) sorter.insertAll(aggregatedIter) context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop()) case None =>
      // 不需要排序直接返回
 aggregatedIter } }

3.3 通过MapOutputTracker的getMapSizesByExecutorId去获取MapStatus

要去读取数据，我们就需要知道从哪儿读取，读取哪一些数据，这些信息在上游shuffle会存封装在MapStatus中

def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int) : Seq[(BlockManagerId, Seq[(BlockId, Long)])] = { logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition") // 根据shuffleId获取MapStatus
  val statuses = getStatuses(shuffleId) // 将得到MapStatus数组进行转化
  statuses.synchronized { return MapOutputTracker.convertMapStatuses(shuffleId, startPartition, endPartition, statuses) } }

# 向MapOutputTrackerMasterEndpoint发送GetMapOutputStatuses消息, MapOutputTrackerMasterEndpoint收到消息之后，MapOutputTrackerMaster会添加这个请求到队列，并且它有一个后台线程一直不断从该队列获取请求，获取请求之后返回。

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = { // 如果接收的是GetMapOutputStatuses消息，则表示获取MapOutput状态
  case GetMapOutputStatuses(shuffleId: Int) => val hostPort = context.senderAddress.hostPort logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort) // 调用MapOutputTrackerMaster的post方法获取
    val mapOutputStatuses = tracker.post(new GetMapOutputMessage(shuffleId, context)) case StopMapOutputTracker => logInfo("MapOutputTrackerMasterEndpoint stopped!") context.reply(true) stop() } def post(message: GetMapOutputMessage): Unit = { // 往这个队列插入一个请求
 mapOutputRequests.offer(message) } private class MessageLoop extends Runnable { override def run(): Unit = { try { while (true) { try { // 从队列中取出一个GetMapOutputMessage消息
          val data = mapOutputRequests.take() if (data == PoisonPill) { // Put PoisonPill back so that other MessageLoops can see it.
 mapOutputRequests.offer(PoisonPill) return } val context = data.context val shuffleId = data.shuffleId val hostPort = context.senderAddress.hostPort logDebug("Handling request to send map output locations for shuffle " + shuffleId +
            " to " + hostPort) // 根据shuffleId获取序列化的MapOutputStatuses
          val mapOutputStatuses = getSerializedMapOutputStatuses(shuffleId) context.reply(mapOutputStatuses) } catch { case NonFatal(e) => logError(e.getMessage, e) } } } catch { case ie: InterruptedException => // exit
 } } }

3.4 创建ShuffleBlockFetcherIterator，在其内部会调用初始化方法initialize方法

# 切分本地和远程的block，并且将远程block随机排序

# 发送请求到远程获取block数据

# 拉取本地block的数据

private[this] def initialize(): Unit = { // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup()) // 切分本地和远程的block
  val remoteRequests = splitLocalRemoteBlocks() // 然后进行随机排序
  fetchRequests ++= Utils.randomize(remoteRequests) assert ((0 == reqsInFlight) == (0 == bytesInFlight), "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight) // 发送请求到远程获取数据
 fetchUpToMaxBytes() val numFetches = remoteRequests.size - fetchRequests.size logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime)) // 拉取本地的数据
 fetchLocalBlocks() logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime)) }

3.5 切分本地和远程的block

private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = { // 远端请求从最多5个node去获取数据,每一个节点拉取的数据取决于spark.reducer.maxMbInFlight即maxBytesInFlight参数 // 加入整个集群只允许每次在5台拉取5G的数据，那么每一节点只允许拉取1G数据,这样就可以允许他们并行从5个节点获取， // 而不是主动从一个节点获取
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L) logDebug("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize) // 创建FetchRequest队列，用于存放拉取的数据的请求,每一个请求可能包含多个block, // 具体多少取决于总的请求block大小是否超过目标阀值
  val remoteRequests = new ArrayBuffer[FetchRequest] var totalBlocks = 0
  for ((address, blockInfos) <- blocksByAddress) { // 获取block的大小，并更新总的block数量信息
    totalBlocks += blockInfos.size // 要获取的数据在本地
    if (address.executorId == blockManager.blockManagerId.executorId) { // 更新要从本地block拉取的集合
      localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1) // 更新要拉取的block数量
      numBlocksToFetch += localBlocks.size } else {//数据不在本地时
      val iterator = blockInfos.iterator var curRequestSize = 0L // 当前请求的大小 // 存放当前的远端请求
      var curBlocks = new ArrayBuffer[(BlockId, Long)] // 遍历每一个block
      while (iterator.hasNext) { val (blockId, size) = iterator.next() // 过滤掉空的block
        if (size > 0) { curBlocks += ((blockId, size)) // 更新要拉取的远端的blockId的集合列表
          remoteBlocks += blockId // 更新要拉取的block数量
          numBlocksToFetch += 1 curRequestSize += size } else if (size < 0) { throw new BlockException(blockId, "Negative block size " + size) } // 如果当前请求的大小已经超过了阀值
        if (curRequestSize >= targetRequestSize) { // 创建一个新的FetchRequest,放到请求队列
          remoteRequests += new FetchRequest(address, curBlocks) // 重置当前block列表
          curBlocks = new ArrayBuffer[(BlockId, Long)] logDebug(s"Creating fetch request of $curRequestSize at $address") // 重置当前请求数量为0
          curRequestSize = 0 } } // 添加最终的请求
      if (curBlocks.nonEmpty) { remoteRequests += new FetchRequest(address, curBlocks) } } } logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks") remoteRequests }

免责声明！

本站转载的文章为个人学习借鉴使用，本站对版权不负任何法律责任。如果侵犯了您的隐私权益，请联系本站邮箱yoyou2525@163.com删除。

猜您在找 Apache Spark源码走读之24 -- Sort-based Shuffle的设计与实现 Spark源码分析 – Shuffle Spark On YARN启动流程源码分析（一） Collections.shuffle()源码分析 Collections.shuffle()源码分析 soundtouch源码分析__based on csdn ： Spark On YARN（Yarn-Cluster模式）启动流程源码分析（二） Spark（五十一）：Spark On YARN（Yarn-Cluster模式）启动流程源码分析（二） Spark的Shuffle和MR的Shuffle异同