Spark Source Code Series (9): Spark Source Code Analysis and Optimization

Reposted article, 2020-05-27 15:44. Tag: Spark source code analysis

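Chapter 1. Spark Source Code Analysis: The Four Kinds of RDD Dependencies

I. The Four Kinds of RDD Dependencies

RDDs have four kinds of dependencies: ShuffleDependency, PruneDependency, RangeDependency, and OneToOneDependency. org.apache.spark.Dependency has two direct subclasses, ShuffleDependency and NarrowDependency. NarrowDependency is an abstract class with three implementations: OneToOneDependency, RangeDependency, and PruneDependency.

II. Narrow Dependencies

How does a narrow dependency determine which partitions of the parent RDD a child partition depends on? NarrowDependency declares an abstract method for this: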

/**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

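Its input is a partition id of the child RDD, and its output is the sequence of ids of the parent RDD partitions that this child partition depends on.

The three subclasses implement it as follows.

OneToOneDependency

First, OneToOneDependency implements getParents as follows: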

override def getParents(partitionId: Int): List[Int] = List(partitionId)

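The implementation is a single line: the child partition index equals the parent partition index. Each parent partition feeds the child partition with the same index, so partitions are related one-to-one, and so are the RDDs.

RangeDependency

Next, RangeDependency implements getParents as follows: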

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}

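First, the three variables: inStart is the start of the range in the parent RDD; outStart is the start of the range in the child RDD; length is the length of the range.

The rule for finding the parent partition index is: if the child partition index falls inside the child range [outStart, outStart + length), the parent partition index is child partition index - outStart + inStart. The term (- outStart + inStart) is the offset between the start of the parent range and the start of the child range; adding it to the child partition index gives the corresponding parent partition. Otherwise the child partition depends on no partition of this parent, and Nil is returned. Partitions of parent and child are related one-to-one. At the RDD level the relationship may be one-to-one (when inStart and outStart are both 0 and the range covers every partition, the dependency behaves like a OneToOneDependency), many-to-one, or one-to-many.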
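
The index arithmetic can be tried out without Spark. The following is a minimal sketch, not Spark's own class: the object RangeParentsSketch and the helper parentsInUnion are hypothetical names, and the layout (a child made of parent A with 3 partitions followed by parent B with 2 partitions) is an assumption chosen for illustration.

```scala
object RangeParentsSketch {
  // Mirrors the body of RangeDependency.getParents: map a child partition
  // index back to the parent partition index, or return Nil when the child
  // index lies outside [outStart, outStart + length).
  def rangeParents(inStart: Int, outStart: Int, length: Int)(partitionId: Int): List[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else Nil

  // Assumed layout: child partitions 0..2 come from parent A,
  // child partitions 3..4 come from parent B.
  def parentsInUnion(childPartition: Int): (List[Int], List[Int]) = {
    val fromA = rangeParents(inStart = 0, outStart = 0, length = 3)(childPartition)
    val fromB = rangeParents(inStart = 0, outStart = 3, length = 2)(childPartition)
    (fromA, fromB)
  }
}
```

With this layout, child partition 4 resolves to partition 1 of parent B and to nothing in parent A, which is the "one dependency per parent, each covering its own range" arrangement.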

PruneDependency

Finally, PruneDependency implements getParents as follows:

  /**
  * Represents a dependency between the PartitionPruningRDD and its parent. In this
  * case, the child RDD contains a subset of partitions of the parents'.
  */
 private[spark] class PruneDependency[T](rdd: RDD[T], partitionFilterFunc: Int => Boolean)
   extends NarrowDependency[T](rdd) {
 
   @transient
   val partitions: Array[Partition] = rdd.partitions
     .filter(s => partitionFilterFunc(s.index)).zipWithIndex
     .map { case(split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }
 
   override def getParents(partitionId: Int): List[Int] = {
     List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
   }
 }

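First, three names need explaining: rdd is a reference to the parent RDD instance; partitionFilterFunc is a callback that selects the parent RDD partitions that satisfy a condition; PartitionPruningRDDPartition is declared as follows: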

private[spark] class PartitionPruningRDDPartition(idx: Int, val parentSplit: Partition)
  extends Partition {
  override val index = idx
}

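partitions is built as follows: fetch the parent RDD's partitions through the parent reference, keep those whose partition index passes the filter function, and renumber the survivors from 0. Each surviving parent partition and its new index are then wrapped in a new PartitionPruningRDDPartition and stored in partitions. In effect, the parent RDD's partitions are pruned with a filter first.

getParents looks up the child partition by its index, takes the parent partition it wraps (parentSplit), and reads that partition's index member, which is the partition's position among all partitions of the parent RDD. Child and parent partitions are related one-to-one, and the child RDD contains a subset of the parent's partitions. At the RDD level the relationship may be many-to-one, one-to-many, or one-to-one.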
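
The renumbering step is easy to lose track of, so here is a small Spark-free sketch of it. PruneParentsSketch, prune, and getParents are hypothetical names; a Vector of parent indices stands in for the array of PartitionPruningRDDPartition objects.

```scala
object PruneParentsSketch {
  // Keep only the parent partition indices accepted by the filter. The
  // position of each survivor in the result is its new child partition
  // index, numbered from 0.
  def prune(parentPartitionCount: Int, keep: Int => Boolean): Vector[Int] =
    (0 until parentPartitionCount).filter(keep).toVector

  // Same shape as PruneDependency.getParents: one child index maps to
  // exactly one parent index.
  def getParents(pruned: Vector[Int], childIndex: Int): List[Int] =
    List(pruned(childIndex))
}
```

For a parent with 6 partitions and a filter that keeps odd indices, the child has 3 partitions, and child partition 2 reads parent partition 5.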

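In short, in the three narrow dependency implementations above, each child partition maps to at most one parent partition.

III. Wide Dependencies

Now for the main subject, ShuffleDependency, which represents a dependency on the output of a shuffle stage. Start with its constructor, that is, the variables and instances it depends on: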

 @DeveloperApi
 class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
     val partitioner: Partitioner,
     val serializer: Serializer = SparkEnv.get.serializer,
     val keyOrdering: Option[Ordering[K]] = None,
     val aggregator: Option[Aggregator[K, V, C]] = None,
     val mapSideCombine: Boolean = false)
   extends Dependency[Product2[K, V]]

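Here, _rdd is the parent RDD instance. partitioner is the partitioner used to partition the shuffle output. serializer is used for serialization; the default is org.apache.spark.serializer.JavaSerializer, and it can be set through the spark.serializer parameter. keyOrdering is the ordering of the shuffle keys. aggregator is the combiner used for the shuffle on the map or reduce side. mapSideCombine says whether to perform partial aggregation (pre-aggregation on the map side, which can reduce network traffic and the work done on the reduce side); it defaults to false because not every computation suits it. A global mean or variance, for example, cannot be obtained by simply re-applying the same function to partial results, whereas a global maximum or minimum can. Note that when mapSideCombine is true, the aggregator must be set, because it performs the map-side combine before the shuffle.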
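
To make the point about which computations suit map-side pre-aggregation concrete, here is a plain Scala sketch with no Spark involved. The object MapSideCombineSketch and its two hard-coded "map-side partitions" are hypothetical. It shows that a maximum combines directly, that averaging partial averages gives a wrong answer, and that carrying (sum, count) as the combined value repairs the average.

```scala
object MapSideCombineSketch {
  // Two hypothetical map-side partitions holding values for the same key.
  val partitions: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0, 3.0), Seq(10.0))

  // max combines cleanly: the max of per-partition maxima is the global max.
  def globalMax: Double = partitions.map(_.max).max

  // Averaging per-partition averages is wrong when partition sizes differ.
  def naiveAverage: Double = {
    val partial = partitions.map(p => p.sum / p.size)
    partial.sum / partial.size
  }

  // Carrying (sum, count) through the combine step yields the true average.
  def combinedAverage: Double = {
    val (sum, count) = partitions
      .map(p => (p.sum, p.size))
      .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    sum / count
  }
}
```

Here naiveAverage gives 6.0 while the true average of the four values is 4.0, which is why the combined type C of an aggregator is allowed to differ from the value type V.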

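The Seven Partitioner Implementations

A partitioner defines how the key-value pairs of an RDD are partitioned by key. It maps each key to a partition id from 0 to numPartitions - 1. Note that a partitioner must be deterministic: given the same key it must return the same partition. This makes it possible to recompute partition data after a task failure and keeps the data of every partition involved in the computation consistent. In other words, the partitioner decides which partition each record flows to during the shuffle.

org.apache.spark.Partitioner has 7 implementing classes.

First, the methods that Partitioner defines: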

 abstract class Partitioner extends Serializable {
   def numPartitions: Int
   def getPartition(key: Any): Int
 }

numPartitions returns the number of partitions of the child RDD; getPartition returns the child partition index for the given key.

HashPartitioner implements getPartition as follows. The idea is key.hashCode modulo the number of child partitions:

 def getPartition(key: Any): Int = key match {
     case null => 0
     case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
   }
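
Why a dedicated nonNegativeMod instead of a bare %? On the JVM the remainder takes the sign of the dividend, so a negative hash code would produce a negative partition id. The sketch below is a standalone approximation (HashPartitionSketch is a hypothetical name, and nonNegativeMod is re-implemented here rather than imported from Spark's Utils):

```scala
object HashPartitionSketch {
  // Keep the result in [0, mod) even when x is negative, since
  // Scala's % can return a negative remainder.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same shape as HashPartitioner.getPartition: null keys go to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }
}
```

For example, -7 % 4 evaluates to -3, while nonNegativeMod(-7, 4) returns 1.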

RangePartitioner implements getPartition as follows:

  def getPartition(key: Any): Int = {
     val k = key.asInstanceOf[K]
     var partition = 0
     if (rangeBounds.length <= 128) { // at most 128 range bounds
       // If we have less than 128 partitions naive search
       while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
         partition += 1
       }
     } else { // more than 128 range bounds
       // Determine which binary search method to use only once.
       partition = binarySearch(rangeBounds, k) // binary search
       // binarySearch either returns the match location or -[insertion point]-1
       if (partition < 0) {
         partition = -partition-1
       }
       if (partition > rangeBounds.length) {
         partition = rangeBounds.length
       }
     }
     if (ascending) {
       partition
     } else {
       rangeBounds.length - partition
     }
   }
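
The two branches compute the same answer for sorted, distinct bounds; the linear scan is simply cheaper for short arrays. A minimal sketch under simplifying assumptions (Int keys, ascending order only, java.util.Arrays.binarySearch in place of Spark's own binarySearch helper; RangePartitionSketch is a hypothetical name):

```scala
import java.util.Arrays

object RangePartitionSketch {
  // rangeBounds must be sorted ascending; n bounds define n + 1 partitions.
  def linear(rangeBounds: Array[Int], k: Int): Int = {
    var partition = 0
    while (partition < rangeBounds.length && k > rangeBounds(partition)) partition += 1
    partition
  }

  def binary(rangeBounds: Array[Int], k: Int): Int = {
    val found = Arrays.binarySearch(rangeBounds, k)
    // A negative result encodes -(insertion point) - 1.
    if (found < 0) -found - 1 else found
  }
}
```

With bounds 10, 20, 30, keys up to 10 land in partition 0, keys in 11..20 in partition 1, and keys above 30 in partition 3.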

PythonPartitioner implements getPartition as follows; it closely resembles the hash partitioner:

 override def getPartition(key: Any): Int = key match {
     case null => 0
     // we don't trust the Python partition function to return valid partition ID's so
     // let's do a modulo numPartitions in any case
     case key: Long => Utils.nonNegativeMod(key.toInt, numPartitions)
     case _ => Utils.nonNegativeMod(key.hashCode(), numPartitions)
   }

PartitionIdPassthrough implements getPartition as follows:

 override def getPartition(key: Any): Int = key.asInstanceOf[Int]

GridPartitioner implements getPartition as follows. The idea is to locate the partition in a grid from a pair of coordinates:

 override val numPartitions: Int = rowPartitions * colPartitions
 
   /**
    * Returns the index of the partition the input coordinate belongs to.
    *
    * @param key The partition id i (calculated through this method for coordinate (i, j) in
    *            `simulateMultiply`, the coordinate (i, j) or a tuple (i, j, k), where k is
    *            the inner index used in multiplication. k is ignored in computing partitions.
    * @return The index of the partition, which the coordinate belongs to.
    */
   override def getPartition(key: Any): Int = {
     key match {
       case i: Int => i
       case (i: Int, j: Int) =>
         getPartitionId(i, j)
       case (i: Int, j: Int, _: Int) =>
         getPartitionId(i, j)
       case _ =>
         throw new IllegalArgumentException(s"Unrecognized key: $key.")
     }
   }
 
   /** Partitions sub-matrices as blocks with neighboring sub-matrices. */
   private def getPartitionId(i: Int, j: Int): Int = {
     require(0 <= i && i < rows, s"Row index $i out of range [0, $rows).")
     require(0 <= j && j < cols, s"Column index $j out of range [0, $cols).")
     i / rowsPerPart + j / colsPerPart * rowPartitions
   }
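
A worked instance of the formula i / rowsPerPart + j / colsPerPart * rowPartitions may help. The sketch below assumes a hypothetical 4 x 4 grid of blocks grouped 2 x 2 per partition; GridPartitionSketch is an illustrative name, and the way rowPartitions is derived here (ceiling of rows / rowsPerPart) is an assumption about how the real class computes it.

```scala
object GridPartitionSketch {
  // Hypothetical grid: 4 x 4 blocks, grouped 2 x 2 per partition.
  val rows = 4
  val cols = 4
  val rowsPerPart = 2
  val colsPerPart = 2
  val rowPartitions: Int = math.ceil(rows * 1.0 / rowsPerPart).toInt
  val colPartitions: Int = math.ceil(cols * 1.0 / colsPerPart).toInt
  val numPartitions: Int = rowPartitions * colPartitions

  // Partitions are numbered column-major over the grid of partitions.
  def partitionId(i: Int, j: Int): Int = {
    require(0 <= i && i < rows, s"Row index $i out of range [0, $rows).")
    require(0 <= j && j < cols, s"Column index $j out of range [0, $cols).")
    i / rowsPerPart + j / colsPerPart * rowPartitions
  }
}
```

Blocks (0, 0) and (1, 1) share partition 0, block (3, 0) falls in partition 1, block (0, 3) in partition 2, and block (3, 3) in partition 3.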

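There are many more implementations, including anonymous classes, which are not covered one by one here. In short, a wide dependency uses the partitioner to decide which partition each record goes to.

That covers both the narrow and the wide dependencies of an RDD.

Chapter 2. Spark Source Code Analysis: SparkContext Initialization

I. Creating or Reusing a Session

Spark 2.0 introduced the concept of SparkSession. The code to create a session or reuse an existing one is: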

 val spark = SparkSession
   .builder
   .appName("SparkTC")
   .getOrCreate()

It uses the builder pattern to create a SparkSession or reuse an existing one. The code of org.apache.spark.sql.SparkSession.Builder#getOrCreate is:

  def getOrCreate(): SparkSession = synchronized {
   assertOnDriver() // note: a SparkSession can only be created and accessed on the driver
   // Get the session from current thread's active session.
   // activeThreadSession is an InheritableThreadLocal (a subclass of ThreadLocal); the data is thread-local, so no lock is needed
   var session = activeThreadSession.get()
   // if the session is not null and its SparkContext has not been stopped, reuse the existing session
   if ((session ne null) && !session.sparkContext.isStopped) {
     options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
     if (options.nonEmpty) {
       logWarning("Using an existing SparkSession; some configuration may not take effect.")
     }
     return session
   }
 
   // lock the SparkSession object to prevent the session from being initialized twice
   SparkSession.synchronized {
     // If the current thread does not have an active session, get it from the global session.
     // if the default session exists and its SparkContext has not been stopped, it can be reused as well
     session = defaultSession.get()
     if ((session ne null) && !session.sparkContext.isStopped) {
       options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
       if (options.nonEmpty) {
         logWarning("Using an existing SparkSession; some configuration may not take effect.")
       }
       return session
     }
 
     // create the session
     val sparkContext = userSuppliedContext.getOrElse { // userSuppliedContext is empty unless the caller supplied a SparkContext
       val sparkConf = new SparkConf()
       options.foreach { case (k, v) => sparkConf.set(k, v) }
 
       // set a random app name if not given.
       if (!sparkConf.contains("spark.app.name")) {
         sparkConf.setAppName(java.util.UUID.randomUUID().toString)
       }
 
       SparkContext.getOrCreate(sparkConf)
       // Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
     }
 
     // Initialize extensions if the user has defined a configurator class.
     val extensionConfOption = sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
     if (extensionConfOption.isDefined) {
       val extensionConfClassName = extensionConfOption.get
       try {
         val extensionConfClass = Utils.classForName(extensionConfClassName)
         val extensionConf = extensionConfClass.newInstance()
           .asInstanceOf[SparkSessionExtensions => Unit]
         extensionConf(extensions)
       } catch {
         // Ignore the error if we cannot find the class or when the class has the wrong type.
         case e @ (_: ClassCastException |
                   _: ClassNotFoundException |
                   _: NoClassDefFoundError) =>
           logWarning(s"Cannot use $extensionConfClassName to configure session extensions.", e)
       }
     }
     // instantiate the SparkSession, passing it the SparkContext obtained above
     session = new SparkSession(sparkContext, None, None, extensions)
     options.foreach { case (k, v) => session.initialSessionOptions.put(k, v) }
     // set the default session
     setDefaultSession(session)
     // set the active session
     setActiveSession(session)
 
     // Register a successfully instantiated context to the singleton. This should be at the
     // end of the class definition so that the singleton is updated only if there is no
     // exception in the construction of the instance.
     // add a Spark listener that resets the default session when the application ends
     sparkContext.addSparkListener(new SparkListener {
       override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
         defaultSession.set(null)
       }
     })
   }
 
   return session
 }
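
The lookup order in getOrCreate (thread-local session first, then the global default under a lock, then creation) can be isolated from Spark. The following is a simplified model, not the real class: SessionLookupSketch and its Session type are hypothetical, and a stopped flag stands in for sparkContext.isStopped.

```scala
import java.util.concurrent.atomic.AtomicReference

object SessionLookupSketch {
  final class Session(val name: String) { @volatile var stopped = false }

  private val activeThreadSession = new InheritableThreadLocal[Session]
  private val defaultSession = new AtomicReference[Session]

  def getOrCreate(name: String): Session = {
    // 1. Thread-local lookup needs no lock.
    val active = activeThreadSession.get()
    if ((active ne null) && !active.stopped) active
    else this.synchronized {
      // 2. Fall back to the global default session.
      val global = defaultSession.get()
      if ((global ne null) && !global.stopped) global
      else {
        // 3. Nothing usable exists: create and register a new session.
        val created = new Session(name)
        defaultSession.set(created)
        activeThreadSession.set(created)
        created
      }
    }
  }
}
```

A second call returns the same instance while the first is still running; once that session is marked stopped, the next call creates a fresh one.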

org.apache.spark.SparkContext#getOrCreate is as follows:

  def getOrCreate(config: SparkConf): SparkContext = {
   // Synchronize to ensure that multiple create requests don't trigger an exception
   // from assertNoOtherContextIsRunning within setActiveContext
   // lock on a plain Object
   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
     // activeContext is an AtomicReference; its set and update operations are atomic
     if (activeContext.get() == null) {
       // a JVM has only one active SparkContext
       setActiveContext(new SparkContext(config), allowMultipleContexts = false)
     } else {
       if (config.getAll.nonEmpty) {
         logWarning("Using an existing SparkContext; some configuration may not take effect.")
       }
     }
     activeContext.get()
   }
 }
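
One practical consequence of this method is that configuration passed on a later call is ignored when a context already exists, which is what the warning is about. A small Spark-free model of the same lock-plus-AtomicReference pattern (ContextSingletonSketch and Context are hypothetical names; a Map stands in for SparkConf):

```scala
import java.util.concurrent.atomic.AtomicReference

object ContextSingletonSketch {
  final class Context(val conf: Map[String, String])

  private val constructorLock = new Object
  private val activeContext = new AtomicReference[Context](null)

  def getOrCreate(conf: Map[String, String]): Context = constructorLock.synchronized {
    if (activeContext.get() == null) {
      // First caller wins: the context is built from its configuration.
      activeContext.set(new Context(conf))
    } else if (conf.nonEmpty) {
      println("Using an existing context; some configuration may not take effect.")
    }
    activeContext.get()
  }
}
```

Calling getOrCreate twice with different settings returns the same object, carrying the settings of the first call.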

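II. SparkContext Initialization

SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext may be active per JVM; the active one must be stopped with stop() before a new one is created. Calling the constructor first initializes the class's member variables and then enters the initialization process, which is wrapped in a try-catch block that runs as part of the constructor.

That standalone block is as follows: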

   try {
   // 1. Initialize the configuration
   _conf = config.clone()
   _conf.validateSettings()
 
   if (!_conf.contains("spark.master")) {
     throw new SparkException("A master URL must be set in your configuration")
   }
   if (!_conf.contains("spark.app.name")) {
     throw new SparkException("An application name must be set in your configuration")
   }
 
   // log out spark.app.name in the Spark driver logs
   logInfo(s"Submitted application: $appName")
 
   // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
   if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
     throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
       "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
   }
 
   if (_conf.getBoolean("spark.logConf", false)) {
     logInfo("Spark configuration:\n" + _conf.toDebugString)
   }
 
   // Set Spark driver host and port system properties. This explicitly sets the configuration
   // instead of relying on the default value of the config constant.
   _conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
   _conf.setIfMissing("spark.driver.port", "0")
 
   _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
 
   _jars = Utils.getUserJars(_conf)
   _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
     .toSeq.flatten
   // 2. Initialize the event log directory and the compression codec
   _eventLogDir =
     if (isEventLogEnabled) {
       val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
         .stripSuffix("/")
       Some(Utils.resolveURI(unresolvedDir))
     } else {
       None
     }
 
   _eventLogCodec = {
     val compress = _conf.getBoolean("spark.eventLog.compress", false)
     if (compress && isEventLogEnabled) {
       Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
     } else {
       None
     }
   }
   // 3. LiveListenerBus asynchronously passes SparkListenerEvents to the registered SparkListeners.
   _listenerBus = new LiveListenerBus(_conf)

   // Initialize the app status store and listener before SparkEnv is created so that it gets
   // all events.
   // 4. Provide the app with an in-memory KV store
   _statusStore = AppStatusStore.createLiveStore(conf)
   // 5. Register the AppStatusListener with the LiveListenerBus
   listenerBus.addToStatusQueue(_statusStore.listener.get)

   // Create the Spark execution environment (cache, map output tracker, etc)
   // 6. Create the driver-side env
   // SparkEnv holds all the runtime objects of a running Spark instance (master or worker),
   // including the serializer, RpcEnv, block manager, map output tracker, and so on.
   // Spark currently finds the SparkEnv through a global variable, so all threads can access the same SparkEnv.
   // After the SparkContext is created, it can be accessed through SparkEnv.get.
   _env = createSparkEnv(_conf, isLocal, listenerBus)
   SparkEnv.set(_env)
 
   // If running the REPL, register the repl's output dir with the file server.
   _conf.getOption("spark.repl.class.outputDir").foreach { path =>
     val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new File(path))
     _conf.set("spark.repl.class.uri", replUri)
   }
   // 7. Low-level API for monitoring and reporting the status of Spark jobs and stages
   _statusTracker = new SparkStatusTracker(this, _statusStore)

   // 8. Console progress bar
   _progressBar =
     if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
       Some(new ConsoleProgressBar(this))
     } else {
       None
     }
 
   // 9. Spark UI, implemented with Jetty
   _ui =
     if (conf.getBoolean("spark.ui.enabled", true)) {
       Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
         startTime))
     } else {
       // For tests, do not enable the UI
       None
     }
   // Bind the UI before starting the task scheduler to communicate
   // the bound port to the cluster manager properly
   _ui.foreach(_.bind())
 
   // 10. Create the Hadoop configuration
   _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
 
   // 11. Add each JAR given through the constructor
   if (jars != null) {
     jars.foreach(addJar)
   }
 
   if (files != null) {
     files.foreach(addFile)
   }
   // 12. Compute the executor memory
   _executorMemory = _conf.getOption("spark.executor.memory")
     .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
     .orElse(Option(System.getenv("SPARK_MEM"))
     .map(warnSparkMem))
     .map(Utils.memoryStringToMb)
     .getOrElse(1024)
 
   // Convert java options to env vars as a work around
   // since we can't set env vars directly in sbt.
   for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
     value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
     executorEnvs(envKey) = value
   }
   Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
     executorEnvs("SPARK_PREPEND_CLASSES") = v
   }
   // The Mesos scheduler backend relies on this environment variable to set executor memory.
   // TODO: Set this only in the Mesos scheduler.
   executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
   executorEnvs ++= _conf.getExecutorEnv
   executorEnvs("SPARK_USER") = sparkUser
 
   // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
   // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
   // 13. Create the HeartbeatReceiver endpoint
   _heartbeatReceiver = env.rpcEnv.setupEndpoint(
     HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
 
   // Create and start the scheduler
   // 14. Create the task scheduler and the scheduler backend
   val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
   _schedulerBackend = sched
   _taskScheduler = ts
   // 15. Create the DAGScheduler instance
   _dagScheduler = new DAGScheduler(this)
   _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
 
   // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
   // constructor
   // 16. Start the task scheduler
   _taskScheduler.start()

   // 17. Get the application ID from the task scheduler
   _applicationId = _taskScheduler.applicationId()
   // 18. Get the application attempt ID from the task scheduler
   _applicationAttemptId = taskScheduler.applicationAttemptId()
   _conf.set("spark.app.id", _applicationId)
   if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
     System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
   }
   // 19. Set the application ID on the UI
   _ui.foreach(_.setAppId(_applicationId))
   // 20. Initialize the block manager
   _env.blockManager.initialize(_applicationId)
 
   // The metrics system for Driver need to be set spark.app.id to app ID.
   // So it should start after we get app ID from the task scheduler and set spark.app.id.
   // 21. Start the metrics system
   _env.metricsSystem.start()
   // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
   // 22. Hand the metrics system's servlet handlers to the UI
   _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))

   // 23. Initialize the event logging listener
   _eventLogger =
     if (isEventLogEnabled) {
       val logger =
         new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
           _conf, _hadoopConfiguration)
       logger.start()
       listenerBus.addToEventLogQueue(logger)
       Some(logger)
     } else {
       None
     }
 
   // Optionally scale number of executors dynamically based on workload. Exposed for testing.
   // 24. If dynamic executor allocation is enabled, instantiate and start the ExecutorAllocationManager
   val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
   _executorAllocationManager =
     if (dynamicAllocationEnabled) {
       schedulerBackend match {
         case b: ExecutorAllocationClient =>
           Some(new ExecutorAllocationManager(
             schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf,
             _env.blockManager.master))
         case _ =>
           None
       }
     } else {
       None
     }
   _executorAllocationManager.foreach(_.start())
 
   // 25. Initialize and start the ContextCleaner
   _cleaner =
     if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
       Some(new ContextCleaner(this))
     } else {
       None
     }
   _cleaner.foreach(_.start())
   // 26. Set up and start the listener bus
   setupAndStartListenerBus()
   // 27. The task scheduler is ready; post the environment-update event
   postEnvironmentUpdate()
   // 28. Post the application-start event
   postApplicationStart()

   // Post init
   // 29. Wait until the scheduler backend is ready
   _taskScheduler.postStartHook()
   // 30. Register the DAGScheduler metrics source
   _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
   // 31. Register the BlockManager metrics source
   _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
   // 32. Register the ExecutorAllocationManager metrics source
   _executorAllocationManager.foreach { e =>
     _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
   }
 
   // Make sure the context is stopped if the user forgets about it. This avoids leaving
   // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
   // is killed, though.
   logDebug("Adding shutdown hook") // force eager creation of logger
   // 33. Add a shutdown hook that stops the SparkContext when the JVM shuts down
   _shutdownHookRef = ShutdownHookManager.addShutdownHook(
     ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
     logInfo("Invoking stop() from shutdown hook")
     try {
       stop()
     } catch {
       case e: Throwable =>
         logWarning("Ignoring Exception while stopping SparkContext from shutdown hook", e)
     }
   }
 } catch {
   case NonFatal(e) =>
     logError("Error initializing SparkContext.", e)
     try {
       stop()
     } catch {
       case NonFatal(inner) =>
         logError("Error stopping SparkContext after init error.", inner)
     } finally {
       throw e
     }
 }

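As the code above shows, SparkContext initialization is quite complex and involves many Spark components, including the asynchronous event bus LiveListenerBus, SparkEnv, SparkUI, DAGScheduler, the metrics system, EventLoggingListener, TaskScheduler, ExecutorAllocationManager, and ContextCleaner. Treat this as an overview for now; some of these components are examined in more detail later.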
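
One small piece of the block above, the executor memory resolution in step 12, is a chain of Option fallbacks and can be sketched in isolation. ExecutorMemorySketch is a hypothetical name, plain Maps stand in for SparkConf and the process environment, and toMb is a deliberately simplified stand-in for Utils.memoryStringToMb that only understands the "m" and "g" suffixes.

```scala
object ExecutorMemorySketch {
  // Simplified stand-in for Utils.memoryStringToMb: only "<n>m" and "<n>g".
  def toMb(s: String): Int = {
    val lower = s.trim.toLowerCase
    if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
    else if (lower.endsWith("m")) lower.dropRight(1).toInt
    else throw new IllegalArgumentException(s"Unsupported memory string: $s")
  }

  // Same precedence as step 12: configuration first, then the two
  // environment variables, then a default of 1024 MB.
  def executorMemoryMb(conf: Map[String, String], env: Map[String, String]): Int =
    conf.get("spark.executor.memory")
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))
      .orElse(env.get("SPARK_MEM"))
      .map(toMb)
      .getOrElse(1024)
}
```

The order of the orElse calls is the whole design: an explicit spark.executor.memory always beats the environment, and the default applies only when none of the three sources is set.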

Chapter 3. Spark Source Code Analysis: Introducing LiveListenerBus

I. LiveListenerBus

The official description reads:

Asynchronously passes SparkListenerEvents to registered SparkListeners.

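That is, it asynchronously passes each SparkListenerEvent to the SparkListeners that have been registered. The asynchrony is implemented with a producer-consumer model.

It defines 4 blocking event queues, named shared, appStatus, executorManagement, and eventLog. Each queue is of type org.apache.spark.scheduler.AsyncEventQueue#AsyncEventQueue and is kept in the queues variable. Listeners can be registered on every queue, and a queue that has no listeners left is removed.

Two flags, started and stopped, indicate whether the listener bus has been started or stopped. If an event arrives before the bus has started, it is first placed in a mutable buffer of pending events; otherwise it is posted directly to every queue.

The class it depends on directly is AsyncEventQueue; LiveListenerBus's multiple event queues are a further layer of encapsulation over AsyncEventQueue.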
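
The "buffer until started, then fan out to every queue" behavior can be modeled in a few lines. This is a simplified, synchronous sketch rather than the real class: ListenerBusSketch and Bus are hypothetical names, events are plain strings, and in-memory queues replace AsyncEventQueue.

```scala
import scala.collection.mutable

object ListenerBusSketch {
  final class Bus(queueNames: Seq[String]) {
    private val queues: Map[String, mutable.Queue[String]] =
      queueNames.map(n => n -> mutable.Queue.empty[String]).toMap
    private var started = false
    // Events posted before start() are parked here and replayed on start.
    private val pending = mutable.ListBuffer.empty[String]

    def post(event: String): Unit = synchronized {
      if (started) queues.values.foreach(_ += event)
      else pending += event
    }

    def start(): Unit = synchronized {
      started = true
      pending.foreach(e => queues.values.foreach(_ += e))
      pending.clear()
    }

    def contents(name: String): List[String] = synchronized(queues(name).toList)
  }
}
```

An event posted before start() is invisible to every queue until the bus starts, after which it is delivered ahead of later events.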

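II. AsyncEventQueue

It extends SparkListenerBus.

Like the bus, it has two flags, started and stopped, that indicate whether the queue has been started or stopped.

Internally it maintains listenersPlusTimers, whose main purpose is to hold the listener objects registered on this queue.

The post operation puts the event into an internal LinkedBlockingQueue, whose default capacity is 10000.

A dispatcher repeatedly calls take on the LinkedBlockingQueue to fetch an event and then forwards it to every listener. The forwarding is implemented by org.apache.spark.scheduler.SparkListenerBus#doPostEvent, whose code is: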

  protected override def doPostEvent(
       listener: SparkListenerInterface,
       event: SparkListenerEvent): Unit = {
     event match {
       case stageSubmitted: SparkListenerStageSubmitted =>
         listener.onStageSubmitted(stageSubmitted)
       case stageCompleted: SparkListenerStageCompleted =>
         listener.onStageCompleted(stageCompleted)
       case jobStart: SparkListenerJobStart =>
         listener.onJobStart(jobStart)
       case jobEnd: SparkListenerJobEnd =>
         listener.onJobEnd(jobEnd)
       case taskStart: SparkListenerTaskStart =>
         listener.onTaskStart(taskStart)
       case taskGettingResult: SparkListenerTaskGettingResult =>
         listener.onTaskGettingResult(taskGettingResult)
       case taskEnd: SparkListenerTaskEnd =>
         listener.onTaskEnd(taskEnd)
       case environmentUpdate: SparkListenerEnvironmentUpdate =>
         listener.onEnvironmentUpdate(environmentUpdate)
       case blockManagerAdded: SparkListenerBlockManagerAdded =>
         listener.onBlockManagerAdded(blockManagerAdded)
       case blockManagerRemoved: SparkListenerBlockManagerRemoved =>
         listener.onBlockManagerRemoved(blockManagerRemoved)
       case unpersistRDD: SparkListenerUnpersistRDD =>
         listener.onUnpersistRDD(unpersistRDD)
       case applicationStart: SparkListenerApplicationStart =>
         listener.onApplicationStart(applicationStart)
       case applicationEnd: SparkListenerApplicationEnd =>
         listener.onApplicationEnd(applicationEnd)
       case metricsUpdate: SparkListenerExecutorMetricsUpdate =>
         listener.onExecutorMetricsUpdate(metricsUpdate)
       case executorAdded: SparkListenerExecutorAdded =>
         listener.onExecutorAdded(executorAdded)
       case executorRemoved: SparkListenerExecutorRemoved =>
         listener.onExecutorRemoved(executorRemoved)
       case executorBlacklistedForStage: SparkListenerExecutorBlacklistedForStage =>
         listener.onExecutorBlacklistedForStage(executorBlacklistedForStage)
       case nodeBlacklistedForStage: SparkListenerNodeBlacklistedForStage =>
         listener.onNodeBlacklistedForStage(nodeBlacklistedForStage)
       case executorBlacklisted: SparkListenerExecutorBlacklisted =>
         listener.onExecutorBlacklisted(executorBlacklisted)
       case executorUnblacklisted: SparkListenerExecutorUnblacklisted =>
         listener.onExecutorUnblacklisted(executorUnblacklisted)
       case nodeBlacklisted: SparkListenerNodeBlacklisted =>
         listener.onNodeBlacklisted(nodeBlacklisted)
       case nodeUnblacklisted: SparkListenerNodeUnblacklisted =>
         listener.onNodeUnblacklisted(nodeUnblacklisted)
       case blockUpdated: SparkListenerBlockUpdated =>
         listener.onBlockUpdated(blockUpdated)
       case speculativeTaskSubmitted: SparkListenerSpeculativeTaskSubmitted =>
         listener.onSpeculativeTaskSubmitted(speculativeTaskSubmitted)
       case _ => listener.onOtherEvent(event)
     }
   }

It then calls the matching method on the listener.

In this way, the events on the bus are consumed by the listeners.
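
Putting the pieces together, the producer-consumer loop looks roughly like the sketch below: a bounded LinkedBlockingQueue, one dispatcher thread that blocks on take, and a pattern match that routes each event to a listener callback. Everything here is illustrative (AsyncDispatchSketch, the two event types, and the PoisonPill stop marker are hypothetical and far simpler than Spark's event hierarchy).

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ListBuffer

object AsyncDispatchSketch {
  sealed trait Event
  final case class JobStart(id: Int) extends Event
  final case class JobEnd(id: Int) extends Event
  case object PoisonPill extends Event // hypothetical stop marker

  trait Listener {
    def onJobStart(e: JobStart): Unit = ()
    def onJobEnd(e: JobEnd): Unit = ()
  }

  // Routes one event to the matching callback, in the style of doPostEvent.
  def doPostEvent(listener: Listener, event: Event): Unit = event match {
    case s: JobStart => listener.onJobStart(s)
    case e: JobEnd   => listener.onJobEnd(e)
    case PoisonPill  => ()
  }

  // Posts the events, drains them on a dispatcher thread, and returns
  // what the listener observed, in order.
  def run(events: Seq[Event]): List[String] = {
    val queue = new LinkedBlockingQueue[Event](10000)
    val seen = ListBuffer.empty[String]
    val listener = new Listener {
      override def onJobStart(e: JobStart): Unit = { seen += s"start-${e.id}"; () }
      override def onJobEnd(e: JobEnd): Unit = { seen += s"end-${e.id}"; () }
    }
    val dispatcher = new Thread(() => {
      var next = queue.take()
      while (next != PoisonPill) {
        doPostEvent(listener, next)
        next = queue.take()
      }
    })
    dispatcher.start()
    (events :+ PoisonPill).foreach(e => queue.put(e))
    dispatcher.join() // join makes the listener's writes visible to the caller
    seen.toList
  }
}
```

Because a single thread drains the queue, listeners see events in the order they were posted, and a slow listener delays only its own queue.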

Chapter 4. Spark Source Code Analysis: Creating and Starting the TaskScheduler

I. Instantiating the TaskScheduler

val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)

This calls org.apache.spark.SparkContext#createTaskScheduler, whose source is:

 /**
    * Create a task scheduler based on a given master URL.
    * Return a 2-tuple of the scheduler backend and the task scheduler.
    */
   private def createTaskScheduler(
       sc: SparkContext,
       master: String,
       deployMode: String): (SchedulerBackend, TaskScheduler) = {
     import SparkMasterRegex._
 
     // When running locally, don't try to re-execute tasks on failure.
     val MAX_LOCAL_TASK_FAILURES = 1
 
     master match {
       case "local" =>
         val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
         val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
         scheduler.initialize(backend)
         (backend, scheduler)
 
       case LOCAL_N_REGEX(threads) =>
         def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
         // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
         val threadCount = if (threads == "*") localCpuCount else threads.toInt
         if (threadCount <= 0) {
           throw new SparkException(s"Asked to run locally with $threadCount threads")
         }
         val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
         val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
         scheduler.initialize(backend)
         (backend, scheduler)
 
       case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
         def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
         // local[*, M] means the number of cores on the computer with M failures
         // local[N, M] means exactly N threads with M failures
         val threadCount = if (threads == "*") localCpuCount else threads.toInt
         val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
         val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
         scheduler.initialize(backend)
         (backend, scheduler)
 
       case SPARK_REGEX(sparkUrl) =>
         val scheduler = new TaskSchedulerImpl(sc)
         val masterUrls = sparkUrl.split(",").map("spark://" + _)
         val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
         scheduler.initialize(backend)
         (backend, scheduler)
 
       case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
         // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
         val memoryPerSlaveInt = memoryPerSlave.toInt
         if (sc.executorMemory > memoryPerSlaveInt) {
           throw new SparkException(
             "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
               memoryPerSlaveInt, sc.executorMemory))
         }
 
         val scheduler = new TaskSchedulerImpl(sc)
         val localCluster = new LocalSparkCluster(
           numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
         val masterUrls = localCluster.start()
         val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
         scheduler.initialize(backend)
         backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
           localCluster.stop()
         }
         (backend, scheduler)
 
       case masterUrl =>
         val cm = getClusterManager(masterUrl) match {
           case Some(clusterMgr) => clusterMgr
           case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
         }
         try {
           val scheduler = cm.createTaskScheduler(sc, masterUrl)
           val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
           cm.initialize(scheduler, backend)
           (backend, scheduler)
         } catch {
           case se: SparkException => throw se
           case NonFatal(e) =>
             throw new SparkException("External scheduler cannot be instantiated", e)
         }
     }
   }
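
The dispatch on the master URL relies on Scala regex extractors, which match the whole string and bind the capture groups. A reduced sketch of that technique follows; MasterUrlSketch and describe are hypothetical names, and the three patterns are assumed approximations of the ones in SparkMasterRegex, not copies verified against a particular Spark version.

```scala
object MasterUrlSketch {
  // Patterns assumed to approximate those in SparkMasterRegex.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val LocalNFailures = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local" =>
      "local: 1 thread, 1 max failure"
    case LocalN(threads) =>
      s"local: $threads threads, 1 max failure"
    case LocalNFailures(threads, maxFailures) =>
      s"local: $threads threads, $maxFailures max failures"
    case SparkUrl(hosts) =>
      // Each comma-separated host gets its own spark:// prefix.
      val masterUrls = hosts.split(",").map("spark://" + _).mkString(",")
      s"standalone: $masterUrls"
    case other =>
      s"external cluster manager: $other"
  }
}
```

Anything that matches none of the built-in patterns, such as yarn, falls through to the last case, which is where the external cluster manager lookup happens in the real method.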

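The implementations differ by master URL.

That completes the instantiation part. The second half focuses on how the TaskScheduler starts in yarn-client mode.

II. TaskScheduler Startup in yarn-client Mode

Initializing the scheduling pools

The org.apache.spark.SparkContext#createTaskScheduler method contains the following call: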

  case masterUrl =>
         val cm = getClusterManager(masterUrl) match {
           case Some(clusterMgr) => clusterMgr
           case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
         }
         try {
           val scheduler = cm.createTaskScheduler(sc, masterUrl)
           val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
           cm.initialize(scheduler, backend)
           (backend, scheduler)
         } catch {
           case se: SparkException => throw se
           case NonFatal(e) =>
             throw new SparkException("External scheduler cannot be instantiated", e)
         }

In cm.initialize(scheduler, backend), cm is org.apache.spark.scheduler.cluster.YarnClusterManager, the TaskScheduler implementation is org.apache.spark.scheduler.cluster.YarnScheduler, and the SchedulerBackend implementation is org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. YarnClusterManager implements initialize as follows:

   override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit = {
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
  }

YarnScheduler does not override initialize, so the call resolves to the implementation in its parent class TaskSchedulerImpl:

 def initialize(backend: SchedulerBackend) {
     this.backend = backend
     schedulableBuilder = {
       schedulingMode match {
         case SchedulingMode.FIFO =>
           new FIFOSchedulableBuilder(rootPool)
         case SchedulingMode.FAIR =>
           new FairSchedulableBuilder(rootPool, conf)
         case _ =>
           throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
           s"$schedulingMode")
       }
     }
     schedulableBuilder.buildPools()
   }

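As the code shows, initialize first stores the SchedulerBackend reference in the TaskScheduler and then builds the scheduling pools.

There are two scheduling modes, FIFO and FAIR. FIFO is the default; the mode can be set with the spark.scheduler.mode parameter. The Pool objects are created using the builder pattern.

org.apache.spark.scheduler.FIFOSchedulableBuilder#buildPools is an empty implementation and does nothing. org.apache.spark.scheduler.FairSchedulableBuilder#buildPools loads a scheduler allocation file. The file can be specified with the spark.scheduler.allocation.file parameter; if it is not set, the default fairscheduler.xml is tried, and if that does not exist either, nothing is loaded. If pool definitions are found, each configured pool is built and added to the root pool. Finally, the default pool is initialized and added to the root pool.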
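The builder selection described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: SketchMode, SketchPool, SketchPoolBuilder and the other names are hypothetical and are not Spark classes, and the list of configured pool names stands in for whatever the allocation file defines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the pool-building step. None of these
// types are Spark classes; they only mirror the shape of the logic.
enum SketchMode { FIFO, FAIR }

class SketchPool {
    final String name;
    final List<SketchPool> children = new ArrayList<>();
    SketchPool(String name) { this.name = name; }
}

interface SketchPoolBuilder {
    void buildPools();
}

// FIFO: nothing to build, so buildPools is a no-op.
class FifoSketchBuilder implements SketchPoolBuilder {
    @Override public void buildPools() { }
}

// FAIR: add the configured pools, then make sure a "default" pool exists.
class FairSketchBuilder implements SketchPoolBuilder {
    private final SketchPool root;
    private final List<String> configuredPools;

    FairSketchBuilder(SketchPool root, List<String> configuredPools) {
        this.root = root;
        this.configuredPools = configuredPools;
    }

    @Override public void buildPools() {
        for (String name : configuredPools) {
            root.children.add(new SketchPool(name));
        }
        boolean hasDefault = false;
        for (SketchPool p : root.children) {
            if (p.name.equals("default")) hasDefault = true;
        }
        if (!hasDefault) root.children.add(new SketchPool("default"));
    }
}

class SketchPoolBuilders {
    // Pick the builder from the scheduling mode, rejecting unknown modes.
    static SketchPoolBuilder forMode(SketchMode mode, SketchPool root, List<String> configuredPools) {
        switch (mode) {
            case FIFO: return new FifoSketchBuilder();
            case FAIR: return new FairSketchBuilder(root, configuredPools);
            default: throw new IllegalArgumentException("Unsupported scheduling mode: " + mode);
        }
    }
}
```

Keeping the mode switch in one factory method means the caller only ever sees the builder interface, which is the point of using the builder pattern here.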

Setting the taskScheduler variable in HeartbeatReceiver

 _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

_heartbeatReceiver is an RpcEndpointRef, so the request is ultimately received and handled by the HeartbeatReceiver endpoint, in org.apache.spark.HeartbeatReceiver#receiveAndReply:

 case TaskSchedulerIsSet =>
       scheduler = sc.taskScheduler
       context.reply(true)

RPC is covered in detail in a later chapter, so it is not explained further here.

Starting the TaskScheduler

The initialization code of org.apache.spark.SparkContext starts the TaskScheduler with the following call:

 _taskScheduler.start()

In yarn-client mode this invokes the start method of org.apache.spark.scheduler.cluster.YarnScheduler, which inherits the implementation of its parent class TaskSchedulerImpl:

  override def start() {
     // 1. Start the task scheduler backend
     backend.start()
     // 2. Schedule the periodic speculation check
     if (!isLocal && conf.getBoolean("spark.speculation", false)) {
       logInfo("Starting speculative execution thread")
       speculationScheduler.scheduleWithFixedDelay(new Runnable {
         override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
           checkSpeculatableTasks()
         }
       }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
     }
   }

Step 1, starting the task scheduler backend. The start method of org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend is as follows:

  /**
   * Create a Yarn client to submit an application to the ResourceManager.
   * This waits until the application is running.
   */
  override def start() {
    // 1. Get the driver host and port
    val driverHost = conf.get("spark.driver.host")
    val driverPort = conf.get("spark.driver.port")
    val hostport = driverHost + ":" + driverPort
    // 2. Record the address of the driver web UI
    sc.ui.foreach { ui => conf.set("spark.driver.appUIAddress", ui.webUrl) }

    val argsArrayBuf = new ArrayBuffer[String]()
    argsArrayBuf += ("--arg", hostport)

    logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
    val args = new ClientArguments(argsArrayBuf.toArray)
    totalExpectedExecutors = SchedulerBackendUtils.getInitialTargetExecutorNumber(conf)
    // 3. Create the deploy client (in yarn-client mode this is org.apache.spark.deploy.yarn.Client)
    client = new Client(args, conf)
    // 4. Submit the application and bind its application id to YARN
    bindToYarn(client.submitApplication(), None)

    // SPARK-8687: Ensure all necessary properties have already been set before
    // we initialize our driver scheduler backend, which serves these properties
    // to the executors
    super.start()
    // 5. Check the state of the YARN application; it must not be killed, finished, and so on
    waitForApplication()
    // 6. Start the monitor thread
    monitorThread = asyncMonitorApplication()
    monitorThread.start()
  }

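Step 3 deserves a closer look. The source below is org.apache.spark.deploy.Client, the standalone deploy client, and shows how a driver-side client sets up its RPC environment. Note that in yarn-client mode the Client constructed in step 3 is org.apache.spark.deploy.yarn.Client, which is a different class: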

 object Client {
   def main(args: Array[String]) {
     // scalastyle:off println
     if (!sys.props.contains("SPARK_SUBMIT")) {
       println("WARNING: This client is deprecated and will be removed in a future version of Spark")
       println("Use ./bin/spark-submit with \"--master spark://host:port\"")
     }
     // scalastyle:on println
     new ClientApp().start(args, new SparkConf())
   }
 }
 
 private[spark] class ClientApp extends SparkApplication {
 
   override def start(args: Array[String], conf: SparkConf): Unit = {
     val driverArgs = new ClientArguments(args)
 
     if (!conf.contains("spark.rpc.askTimeout")) {
       conf.set("spark.rpc.askTimeout", "10s")
     }
     Logger.getRootLogger.setLevel(driverArgs.logLevel)
 
     val rpcEnv =
       RpcEnv.create("driverClient", Utils.localHostName(), 0, conf, new SecurityManager(conf))
 
     val masterEndpoints = driverArgs.masters.map(RpcAddress.fromSparkURL).
       map(rpcEnv.setupEndpointRef(_, Master.ENDPOINT_NAME))
     rpcEnv.setupEndpoint("client", new ClientEndpoint(rpcEnv, driverArgs, masterEndpoints, conf))
 
     rpcEnv.awaitTermination()
   }
 
 }

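As shown, the main method of Client creates a ClientApp and calls its start method. start first parses the driver arguments, then creates the driver-side RPC environment, then uses the parsed master information to set up an endpoint ref for each master, and finally registers the client endpoint, which returns a client endpoint ref.

III. Periodically running the speculation check

Now for the second step of the start method of org.apache.spark.scheduler.cluster.YarnScheduler. Speculative execution is disabled by default in Spark. The reason is that when many tasks are running slowly, Spark launches a duplicate copy of each of them, which can consume all available resources and have an unpredictable impact on the cluster and on the other jobs submitted to it. When speculation is enabled, start sets up a fixed-delay timer that periodically runs checkSpeculatableTasks: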

 // Check for speculatable tasks in all our active jobs.
   def checkSpeculatableTasks() {
     var shouldRevive = false
     synchronized {
       shouldRevive = rootPool.checkSpeculatableTasks(MIN_TIME_TO_SPECULATION) // 1. Check whether any task should be speculated
     }
     if (shouldRevive) {
       backend.reviveOffers() // 2. Ask the backend for offers so the speculative copy can run
     }
   }

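The check in step 1 has two implementations, one in Pool and one in TaskSetManager; Pool recursively calls its child pools to collect speculatable tasks. If speculation is needed, step 2 calls reviveOffers on the task scheduler backend. Roughly, it gathers the free resources on the executors and assigns them to the speculative tasks.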
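The periodic check can be modeled with a standard ScheduledExecutorService. The following is a minimal sketch, not Spark's code: SpeculationTicker and its two callbacks are hypothetical names, with the BooleanSupplier standing in for rootPool.checkSpeculatableTasks and the Runnable standing in for backend.reviveOffers.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the periodic speculation check.
class SpeculationTicker {
    private final BooleanSupplier hasSpeculatableTasks; // plays the role of rootPool.checkSpeculatableTasks
    private final Runnable reviveOffers;                // plays the role of backend.reviveOffers
    private final Object lock = new Object();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    SpeculationTicker(BooleanSupplier hasSpeculatableTasks, Runnable reviveOffers) {
        this.hasSpeculatableTasks = hasSpeculatableTasks;
        this.reviveOffers = reviveOffers;
    }

    // One round: decide under the lock, but call the backend outside of it.
    void checkOnce() {
        boolean shouldRevive;
        synchronized (lock) {
            shouldRevive = hasSpeculatableTasks.getAsBoolean();
        }
        if (shouldRevive) {
            reviveOffers.run();
        }
    }

    // Run checkOnce repeatedly, waiting intervalMs between the end of one run and the next.
    void start(long intervalMs) {
        timer.scheduleWithFixedDelay(this::checkOnce, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        timer.shutdownNow();
    }
}
```

Holding the lock only while reading scheduler state, and calling the backend after releasing it, keeps the potentially slow revive call from blocking other users of the same lock.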

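To summarize, this chapter walked through how the task scheduler is started during SparkContext startup, using yarn-client mode as the example.

The discussion of RpcEnv was skipped here. The next chapter covers the overall architecture of Spark's built-in RPC mechanism and how it runs.

Chapter 5. Spark source code analysis: RPC

I. Creating the NettyRpcEnv

1. Spark RPC usage example

We use the API calls in org.apache.spark.deploy.ClientApp#start as the entry point for creating the RPC environment.

// 1. Create the RPC environment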
val rpcEnv = RpcEnv.create("driverClient", Utils.localHostName(), 0, conf, new SecurityManager(conf))

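2. Creating the NettyRpcEnv

The sequence diagram for creating the NettyRpcEnv is as follows:

RpcEnv is a Scala companion object (essentially a Java singleton). It calls NettyRpcEnvFactory to create the NettyRpcEnv object, using Java's built-in serialization, and then uses the Utils class to start the server with retries. Once the server has started, the NettyRpcEnv is returned to the caller.

The code of org.apache.spark.rpc.netty.NettyRpcEnv#startServer is as follows: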

  def startServer(bindAddress: String, port: Int): Unit = {
     val bootstraps: java.util.List[TransportServerBootstrap] =
       if (securityManager.isAuthenticationEnabled()) {
         java.util.Arrays.asList(new AuthServerBootstrap(transportConf, securityManager))
       } else {
         java.util.Collections.emptyList()
       }
     server = transportContext.createServer(bindAddress, port, bootstraps)
     dispatcher.registerRpcEndpoint(
       RpcEndpointVerifier.NAME, new RpcEndpointVerifier(this, dispatcher))
   }

The TransportServer constructor calls init. The source of org.apache.spark.network.server.TransportServer#init is as follows:

 private void init(String hostToBind, int portToBind) {
 
   IOMode ioMode = IOMode.valueOf(conf.ioMode());
   EventLoopGroup bossGroup =
     NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
   EventLoopGroup workerGroup = bossGroup;
 
   PooledByteBufAllocator allocator = NettyUtils.createPooledByteBufAllocator(
     conf.preferDirectBufs(), true /* allowCache */, conf.serverThreads());
 
   bootstrap = new ServerBootstrap()
     .group(bossGroup, workerGroup)
     .channel(NettyUtils.getServerChannelClass(ioMode))
     .option(ChannelOption.ALLOCATOR, allocator)
     .option(ChannelOption.SO_REUSEADDR, !SystemUtils.IS_OS_WINDOWS)
     .childOption(ChannelOption.ALLOCATOR, allocator);
 
   this.metrics = new NettyMemoryMetrics(
     allocator, conf.getModuleName() + "-server", conf);
 
   if (conf.backLog() > 0) {
     bootstrap.option(ChannelOption.SO_BACKLOG, conf.backLog());
   }
 
   if (conf.receiveBuf() > 0) {
     bootstrap.childOption(ChannelOption.SO_RCVBUF, conf.receiveBuf());
   }
 
   if (conf.sendBuf() > 0) {
     bootstrap.childOption(ChannelOption.SO_SNDBUF, conf.sendBuf());
   }
 
   bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
     @Override
     protected void initChannel(SocketChannel ch) {
       RpcHandler rpcHandler = appRpcHandler;
       for (TransportServerBootstrap bootstrap : bootstraps) {
         rpcHandler = bootstrap.doBootstrap(ch, rpcHandler);
       }
       context.initializePipeline(ch, rpcHandler);
     }
   });
 
   InetSocketAddress address = hostToBind == null ?
       new InetSocketAddress(portToBind): new InetSocketAddress(hostToBind, portToBind);
   channelFuture = bootstrap.bind(address);
   channelFuture.syncUninterruptibly();
 
   port = ((InetSocketAddress) channelFuture.channel().localAddress()).getPort();
   logger.debug("Shuffle server started on port: {}", port);
 }

Its main job is to initialize the Netty server through the Netty API.

The source of org.apache.spark.rpc.netty.Dispatcher#registerRpcEndpoint is as follows:

  def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
   val addr = RpcEndpointAddress(nettyEnv.address, name)
   val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
   synchronized {
     if (stopped) {
       throw new IllegalStateException("RpcEnv has been stopped")
     }
     if (endpoints.putIfAbsent(name, new EndpointData(name, endpoint, endpointRef)) != null) {
       throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
     }
     val data = endpoints.get(name)
     endpointRefs.put(data.endpoint, data.ref)
     receivers.offer(data)  // for the OnStart message
   }
   endpointRef
 }

When an EndpointData is created, its Inbox enqueues an OnStart message. The process method of Inbox handles that message as follows:

 case OnStart =>
   endpoint.onStart()
   if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
     inbox.synchronized {
       if (!stopped) {
         enableConcurrent = true
       }
     }
   }

This calls the endpoint's onStart method and then decides whether concurrent processing is enabled. The endpoint here is RpcEndpointVerifier, whose onStart method is as follows:

 /**
    * Invoked before [[RpcEndpoint]] starts to handle any message.
    */
   def onStart(): Unit = {
    // By default, do nothing.
  }

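In other words it does nothing and returns immediately, which completes the NettyRpcEnv initialization flow. To recap: the RpcEnv companion object calls the Netty RPC factory to create the NettyRpcEnv object, the TransportServer is started with a retry mechanism, and NettyRpcEnv registers RpcEndpointVerifier with the Dispatcher. Finally the NettyRpcEnv is returned to the API caller. Components such as Dispatcher and TransportServer are not examined in depth at this point; they are analyzed one by one below.

Dispatcher is the message dispatcher; it routes each message to the appropriate endpoint.

The class is fairly simple. Its class diagram is as follows:

We analyze the internal structure and mechanics of the class starting from its member variables:

endpoints is a ConcurrentMap[String, EndpointData] that maps an endpoint name to its EndpointData. EndpointData holds the endpoint name, references to the RpcEndpoint and the NettyRpcEndpointRef, and an Inbox object (which itself holds the RpcEndpoint and NettyRpcEndpointRef references).
endpointRefs: ConcurrentMap[RpcEndpoint, RpcEndpointRef] maps each RpcEndpoint to its RpcEndpointRef.
receivers is a LinkedBlockingQueue[EndpointData], a blocking queue of EndpointData objects. It tracks the receivers (that is, EndpointData instances) whose inboxes may contain messages to process. When a message is posted to the Dispatcher, it is first posted to the Inbox of the target EndpointData, and then that EndpointData is put into receivers, as the source shows: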

// Posts a message to a specific endpoint.
private def postMessage(
      endpointName: String,
      message: InboxMessage,
      callbackIfStopped: (Exception) => Unit): Unit = {
    val error = synchronized {
      // 1. Look up the EndpointData by endpoint name
      val data = endpoints.get(endpointName)
      if (stopped) {
        Some(new RpcEnvStoppedException())
      } else if (data == null) {
        Some(new SparkException(s"Could not find $endpointName."))
      } else {
        // 2. Post the message to the endpoint's inbox
        data.inbox.post(message)
        // 3. Put data into receivers so a worker thread picks it up
        receivers.offer(data)
        None
      }
    }
    // We don't need to call `onStop` in the `synchronized` block
    error.foreach(callbackIfStopped)
  }

stopped indicates whether the Dispatcher has been stopped.
threadpool is a ThreadPoolExecutor. The number of threads is computed as follows:

    val availableCores =
      if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
    val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
      math.max(2, availableCores))

Once the thread count is known, a fixed-size thread pool is created to run MessageLoop tasks. MessageLoop is a Runnable that keeps taking EndpointData objects from the receivers blocking queue and calling the process method of each one's inbox.
PoisonPill is an empty EndpointData that acts as a sentinel. To stop the Dispatcher, PoisonPill is offered to receivers; when a MessageLoop task takes the poison pill it exits immediately and its thread ends. The name is apt, and this is a graceful way to shut down a thread pool that is worth reusing in our own work.

To summarize the member variables: postMessage posts an InboxMessage into the Inbox of an EndpointData and puts that EndpointData into receivers; the thread pool keeps taking entries from this queue and dispatching their messages.
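The interplay of endpoints, receivers, the message loop, and the poison pill can be sketched as follows. This is a miniature under stated assumptions, not Spark's Dispatcher: MiniDispatcher and Mailbox are hypothetical names, messages are plain strings, and a single loop thread is used, so the pill does not need to be handed on to other threads.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical miniature of the dispatcher; a sketch, not Spark's implementation.
class MiniDispatcher {
    // Per-endpoint mailbox plus the handler that consumes it.
    static final class Mailbox {
        final Queue<String> messages = new ConcurrentLinkedQueue<>();
        final Consumer<String> handler;
        Mailbox(Consumer<String> handler) { this.handler = handler; }
    }

    // Sentinel: taking it from the queue tells the loop thread to exit.
    private static final Mailbox POISON_PILL = new Mailbox(m -> { });

    private final Map<String, Mailbox> endpoints = new ConcurrentHashMap<>();
    private final LinkedBlockingQueue<Mailbox> receivers = new LinkedBlockingQueue<>();
    private final Thread loop = new Thread(this::messageLoop, "mini-dispatcher-loop");
    private boolean stopped = false;

    MiniDispatcher() { loop.start(); }

    void register(String name, Consumer<String> handler) {
        if (endpoints.putIfAbsent(name, new Mailbox(handler)) != null) {
            throw new IllegalArgumentException("There is already an endpoint called " + name);
        }
    }

    // Returns false if the dispatcher is stopped or the endpoint is unknown.
    synchronized boolean post(String name, String message) {
        Mailbox box = endpoints.get(name);
        if (stopped || box == null) return false;
        box.messages.add(message); // 1. store the message in the mailbox
        receivers.offer(box);      // 2. mark the mailbox as having pending work
        return true;
    }

    private void messageLoop() {
        try {
            while (true) {
                Mailbox box = receivers.take(); // blocks until some mailbox has work
                if (box == POISON_PILL) return;
                String msg;
                while ((msg = box.messages.poll()) != null) {
                    box.handler.accept(msg);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stop accepting messages, feed the pill, and wait for the loop thread to finish.
    void stopAndAwait() {
        synchronized (this) {
            stopped = true;
            receivers.offer(POISON_PILL);
        }
        try {
            loop.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the pill is queued behind any mailboxes that already have work, messages posted before the stop are still delivered before the thread exits.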

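II. Dispatcher, Inbox and Outbox

1. Why Inbox exists

The data object already contains the RpcEndpoint and RpcEndpointRef, so the Dispatcher could call the endpoint's methods directly. Why introduce a separate Inbox abstraction? Let us look at Inbox while the question is fresh.

2. Inbox

The official description of Inbox: An inbox that stores messages for an RpcEndpoint and posts messages to it thread-safely. In other words, it stores messages on behalf of an RpcEndpoint and delivers them to that endpoint while guaranteeing thread safety.

The class diagram is as follows:

Two methods play roles similar to put and get: post and process. Both are called by the Dispatcher. post appends a message to the tail of the message queue; process polls messages from the queue and handles them, returning once the queue is empty.

The source of these two key methods is as follows: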

  def post(message: InboxMessage): Unit = inbox.synchronized {
     if (stopped) {
       // We already put "OnStop" into "messages", so we should drop further messages
       onDrop(message)
     } else {
       messages.add(message)
       false
     }
 }
 
 
 /**
    * Calls action closure, and calls the endpoint's onError function in the case of exceptions.
    */
   private def safelyCall(endpoint: RpcEndpoint)(action: => Unit): Unit = {
     try action catch {
       case NonFatal(e) =>
         try endpoint.onError(e) catch {
           case NonFatal(ee) =>
             if (stopped) {
               logDebug("Ignoring error", ee)
             } else {
               logError("Ignoring error", ee)
             }
         }
     }
 }
 
 /**
    * Process stored messages.
    */
   def process(dispatcher: Dispatcher): Unit = {
     var message: InboxMessage = null
     inbox.synchronized {
       if (!enableConcurrent && numActiveThreads != 0) {
         return
       }
       message = messages.poll()
       if (message != null) {
         numActiveThreads += 1
       } else {
         return
       }
     }
     while (true) {
       safelyCall(endpoint) {
         message match {
           case RpcMessage(_sender, content, context) =>
             try {
               endpoint.receiveAndReply(context).applyOrElse[Any, Unit](content, { msg =>
                 throw new SparkException(s"Unsupported message $message from ${_sender}")
               })
             } catch {
               case e: Throwable =>
                 context.sendFailure(e)
                 // Throw the exception -- this exception will be caught by the safelyCall function.
                 // The endpoint's onError function will be called.
                 throw e
             }
 
           case OneWayMessage(_sender, content) =>
             endpoint.receive.applyOrElse[Any, Unit](content, { msg =>
               throw new SparkException(s"Unsupported message $message from ${_sender}")
             })
 
           case OnStart =>
             endpoint.onStart()
             if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
               inbox.synchronized {
                 if (!stopped) {
                   enableConcurrent = true
                 }
               }
             }
 
           case OnStop =>
             val activeThreads = inbox.synchronized { inbox.numActiveThreads }
             assert(activeThreads == 1,
               s"There should be only a single active thread but found $activeThreads threads.")
             dispatcher.removeRpcEndpointRef(endpoint)
             endpoint.onStop()
             assert(isEmpty, "OnStop should be the last message")
 
           case RemoteProcessConnected(remoteAddress) =>
             endpoint.onConnected(remoteAddress)
 
           case RemoteProcessDisconnected(remoteAddress) =>
             endpoint.onDisconnected(remoteAddress)
 
           case RemoteProcessConnectionError(cause, remoteAddress) =>
             endpoint.onNetworkError(cause, remoteAddress)
         }
       }
 
       inbox.synchronized {
         // "enableConcurrent" will be set to false after `onStop` is called, so we should check it
         // every time.
         if (!enableConcurrent && numActiveThreads != 1) {
           // If we are not the only one worker, exit
           numActiveThreads -= 1
           return
         }
         message = messages.poll()
         if (message == null) {
           numActiveThreads -= 1
           return
         }
       }
     }
 }

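The InboxMessage class hierarchy is as follows:

Each InboxMessage subtype appears as a case in the process method. OneWayMessage and RpcMessage carry message content; the others are event notifications (endpoint lifecycle and connection events) rather than user messages.

The process method deals with draining messages in a batch, thread safety, exception handling, and branching over the message types.

This answers the earlier question. Inbox was factored out so that the Dispatcher has a single responsibility: routing messages. How a routed message is actually handled is left to Inbox, which concentrates on batch draining, thread safety, exception handling, and per-type handling.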
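The concurrency guard in process (the numActiveThreads counter and the enableConcurrent flag) can be sketched in isolation. This is an illustration under stated assumptions: MiniInbox is a hypothetical class with string messages, its stop method simply sets a flag instead of enqueuing an OnStop message, and caught exceptions are collected in a list rather than being forwarded to an onError callback.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical single-endpoint inbox; the String message type is illustrative.
class MiniInbox {
    private final LinkedList<String> messages = new LinkedList<>();
    private final Consumer<String> endpoint;
    private final List<Throwable> errors = new ArrayList<>();
    private boolean stopped = false;
    private boolean enableConcurrent = false;
    private int numActiveThreads = 0;

    MiniInbox(Consumer<String> endpoint) { this.endpoint = endpoint; }

    // Returns false when the message is dropped because the inbox is stopped.
    synchronized boolean post(String message) {
        if (stopped) return false;
        messages.add(message);
        return true;
    }

    synchronized void stop() {
        stopped = true;
        enableConcurrent = false;
    }

    synchronized List<Throwable> errors() { return new ArrayList<>(errors); }

    // Drain queued messages; unless concurrency is enabled, only one thread may be inside.
    void process() {
        String message;
        synchronized (this) {
            if (!enableConcurrent && numActiveThreads != 0) return;
            message = messages.poll();
            if (message == null) return;
            numActiveThreads += 1;
        }
        while (true) {
            try {
                endpoint.accept(message);
            } catch (RuntimeException e) {
                // In the spirit of safelyCall: record the failure and keep going.
                synchronized (this) { errors.add(e); }
            }
            synchronized (this) {
                if (!enableConcurrent && numActiveThreads != 1) {
                    numActiveThreads -= 1;
                    return;
                }
                message = messages.poll();
                if (message == null) {
                    numActiveThreads -= 1;
                    return;
                }
            }
        }
    }
}
```

A single call to process drains everything that is queued, which is the batch behavior mentioned above, and a failing message does not prevent the ones behind it from being handled.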

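3. Outbox

Outbox is structured much like Inbox, so it is not analyzed in detail.

The OutboxMessage class hierarchy is as follows:

OneWayOutboxMessage has fixed behavior. Its source is as follows:

It has no callback.

The callback of RpcOutboxMessage is passed in through the constructor. Its source is as follows:

RpcOutboxMessage does have a callback, supplied through the constructor; its onFailure and onSuccess are template methods.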

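III. RpcEndpoint and RpcEndpointRef

1. RpcEndpoint

The documentation describes RpcEndpoint as follows: An end point for the RPC that defines what functions to trigger given a message. It is guaranteed that onStart, receive and onStop will be called in sequence. The life-cycle of an endpoint is: constructor -> onStart -> receive* -> onStop. Note: receive can be called concurrently. If you want receive to be thread-safe, please use ThreadSafeRpcEndpoint. If any error is thrown from one of RpcEndpoint methods except onError, onError will be invoked with the cause. If onError throws an error, RpcEnv will ignore it.

Its subclass hierarchy is as follows:

It also has an abstract sub-trait, ThreadSafeRpcEndpoint.

The documentation describes ThreadSafeRpcEndpoint as a trait that requires RpcEnv to send messages to it thread-safely. Thread-safe means that one message is fully processed before the same ThreadSafeRpcEndpoint processes the next. In other words, changes to the internal fields of a ThreadSafeRpcEndpoint are visible when the next message is processed, and those fields need not be volatile or equivalent. However, there is no guarantee that the same thread will execute the same ThreadSafeRpcEndpoint for different messages. Messages are therefore processed sequentially, never concurrently. The methods of the trait RpcEndpoint are as follows:

Its members are explained below: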

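rpcEnv: the RpcEnv that this RpcEndpoint is registered with.
self: the RpcEndpointRef of this RpcEndpoint. It is valid once onStart is called and becomes null after onStop. Before onStart the endpoint has not been registered and has no valid RpcEndpointRef, so do not call self before onStart.
receive: handles messages from RpcEndpointRef.send or RpcCallContext.reply. An unmatched message causes a SparkException to be thrown and passed to onError.
receiveAndReply: handles messages from RpcEndpointRef.ask. An unmatched message causes a SparkException to be thrown and passed to onError.
onError: invoked when any exception is thrown during message handling.
onConnected: invoked when remoteAddress connects to the current node.
onDisconnected: invoked when the current node loses the connection to remoteAddress.
onNetworkError: invoked when a network error occurs on the connection between the current node and remoteAddress.
onStart: invoked before the RpcEndpoint starts handling any message.
onStop: invoked when the RpcEndpoint is stopping; self will be null and cannot be used to send messages.
stop: stops the RpcEndpoint.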

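2. RpcEndpointRef

RpcEndpointRef is a reference to a remote RpcEndpoint. RpcEndpointRef is thread-safe.

RpcEndpointRef closely mirrors RpcEndpoint. We first look at the abstract class RpcEndpointRef and its internal structure.

First, its inheritance structure:

The parent class in this hierarchy is RpcEndpointRef. We start with its member variables and methods.

It has three member variables:

maxRetries: the maximum number of connection attempts. It can be set with the spark.rpc.numRetries parameter; the default is 3. This variable is currently unused.
retryWaitMs: the maximum time to wait between attempts, in milliseconds. It can be set with the spark.rpc.retry.wait parameter; the default is 3s. This variable is currently unused.
defaultAskTimeout: the default timeout for ask operations. It can be set with the spark.rpc.askTimeout or spark.network.timeout parameter; the default is 120s.

Member methods:

address: abstract; returns the RpcAddress of the RpcEndpointRef.
name: abstract; returns the name of the endpoint.
send: abstract; sends a one-way asynchronous message with fire-and-forget semantics.
ask: abstract; sends a message to the corresponding RpcEndpoint.receiveAndReply and returns a Future for receiving the reply within the timeout. There are two overloads: the one without an RpcTimeout supplies defaultAskTimeout and calls the one that takes an RpcTimeout.
askSync: built on the abstract ask. Like ask it has two overloads: the one without an RpcTimeout supplies defaultAskTimeout and calls the one that takes an RpcTimeout. The overload with an RpcTimeout calls ask to obtain a Future and then waits for it to complete before returning. This is the template method pattern: ask is specified to return a Future, and askSync calls ask and blocks on the result of that Future.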
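The relationship between ask and askSync can be sketched as a small template method. The names here (SketchEndpointRef, UpperCaseRef) are hypothetical, messages are plain strings, and CompletableFuture stands in for the Scala Future that the real API returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical reference type illustrating the ask/askSync split; not Spark's API.
abstract class SketchEndpointRef {
    private final long defaultAskTimeoutMs;

    SketchEndpointRef(long defaultAskTimeoutMs) {
        this.defaultAskTimeoutMs = defaultAskTimeoutMs;
    }

    // The only piece a subclass must supply: how a request is actually sent.
    abstract CompletableFuture<String> ask(String message, long timeoutMs);

    // Overload without a timeout: fill in the default and delegate.
    CompletableFuture<String> ask(String message) {
        return ask(message, defaultAskTimeoutMs);
    }

    String askSync(String message) {
        return askSync(message, defaultAskTimeoutMs);
    }

    // Template method: obtain the future from ask, then block on it.
    String askSync(String message, long timeoutMs) {
        try {
            return ask(message, timeoutMs).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a reply", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("Ask failed for message: " + message, e);
        }
    }
}

// A local reference that answers immediately, used only to exercise the template.
class UpperCaseRef extends SketchEndpointRef {
    UpperCaseRef() { super(1000L); }

    @Override
    CompletableFuture<String> ask(String message, long timeoutMs) {
        return CompletableFuture.completedFuture(message.toUpperCase());
    }
}
```

A transport-specific subclass only has to implement the two-argument ask; the default-timeout overloads and the blocking variant come for free.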

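Next, NettyRpcEndpointRef, the only implementation of RpcEndpointRef.

It is the NettyRpcEnv version of RpcEndpointRef. Its behavior depends on where it is created. On the node that "owns" the RpcEndpoint, it is a simple wrapper around an RpcEndpointAddress instance. On other machines that receive a serialized version of the reference, the behavior changes: the instance keeps track of the TransportClient that sent the reference, so that messages to the endpoint are sent over that client connection instead of opening a new one. The RpcAddress of this ref can be null, which means the ref can only be used through a client connection, because the process hosting the endpoint is not listening for incoming connections. Such refs should not be shared with third parties, since they would be unable to send messages to the endpoint.

Member variables:

conf: a SparkConf instance.
endpointAddress: an RpcEndpointAddress instance, containing the RpcAddress (host and port) and the RPC endpoint name.
nettyEnv: a NettyRpcEnv instance.
client: a TransportClient instance; it does not take part in serialization.

Member methods:

It overrides the ask method inherited from the superclass, as follows:

It overrides the send method inherited from the superclass, as follows:

Two methods deal with serialization, writeObject (serialization) and readObject (deserialization), as follows: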

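3. RequestMessage

While we are here, let us look at RequestMessage. Its code is as follows:

A RequestMessage carries a message from a sender to a receiver. It is responsible for serializing the sender RpcAddress, the receiver RpcAddress, the receiver endpoint name, and the message content.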
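To make the four serialized parts concrete, here is a round-trip sketch using plain DataOutputStream and DataInputStream. The wire layout is an assumption chosen for illustration: SketchRequest is a hypothetical class, the content is a string rather than an arbitrary serialized object, and null addresses are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical message with an illustrative wire layout; the real class may differ.
final class SketchRequest {
    final String senderHost;
    final int senderPort;
    final String receiverHost;
    final int receiverPort;
    final String receiverName;
    final String content;

    SketchRequest(String senderHost, int senderPort, String receiverHost, int receiverPort,
                  String receiverName, String content) {
        this.senderHost = senderHost;
        this.senderPort = senderPort;
        this.receiverHost = receiverHost;
        this.receiverPort = receiverPort;
        this.receiverName = receiverName;
        this.content = content;
    }

    // Write sender address, receiver address, receiver name, then the content.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(senderHost);
            out.writeInt(senderPort);
            out.writeUTF(receiverHost);
            out.writeInt(receiverPort);
            out.writeUTF(receiverName);
            out.writeUTF(content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the fields back in exactly the order they were written.
    static SketchRequest deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String senderHost = in.readUTF();
            int senderPort = in.readInt();
            String receiverHost = in.readUTF();
            int receiverPort = in.readInt();
            String receiverName = in.readUTF();
            String content = in.readUTF();
            return new SketchRequest(senderHost, senderPort, receiverHost, receiverPort,
                receiverName, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Putting the routing fields ahead of the content lets a receiver decide where a message goes before it has to deserialize the payload.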

Summary: this section analyzed RpcEndpoint and RpcEndpointRef, and also introduced RequestMessage, which supports serialization.

IV. TransportContext and TransportClientFactory

1. TransportContext

The official documentation describes TransportContext as follows:

Contains the context to create a TransportServer, TransportClientFactory, and to setup Netty Channel pipelines with a TransportChannelHandler. There are two communication protocols that the TransportClient provides, control-plane RPCs and data-plane "chunk fetching". The handling of the RPCs is performed outside of the scope of the TransportContext (i.e., by a user-provided handler), and it is responsible for setting up streams which can be streamed through the data plane in chunks using zero-copy IO. The TransportServer and TransportClientFactory both create a TransportChannelHandler for each channel. As each TransportChannelHandler contains a TransportClient, this enables server processes to send messages back to the client on an existing channel.

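This context object is used to create a TransportServer, to create a TransportClientFactory, and to set up Netty channel pipelines with a TransportChannelHandler. These are its three main functions.

TransportClient provides two communication protocols: control-plane RPCs and data-plane "chunk fetching".

The rpcHandler passed in through the constructor handles RPC requests. It is also responsible for setting up streams, which can be transferred in chunks using zero-copy IO.

TransportServer and TransportClientFactory both create a TransportChannelHandler for each channel. Each TransportChannelHandler contains a TransportClient, which lets the server process send messages back to the client over the existing channel.

Member variables:

logger: the logger.
conf: a TransportConf object.
rpcHandler: an RpcHandler instance.
closeIdleConnections: whether to close idle connections.
ENCODER: encodes outgoing network-layer messages; a MessageEncoder instance.
DECODER: decodes incoming network-layer messages; a MessageDecoder instance.

Three groups of methods:

Creating a TransportClientFactory (two methods):

Creating a TransportServer (four methods):

Setting up the Netty channel pipeline; the methods involved and their call relationships are as follows:

Note: the TransportClient is instantiated while the Netty channel pipeline is being set up. In the whole RPC module, this is the only method that instantiates TransportClient.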

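2. TransportClientFactory

TransportClient instances are created with the createClient method of TransportClientFactory. The factory maintains a connection pool to other hosts and should return the same TransportClient for the same remote host. All TransportClients share a single worker thread pool, and TransportClients are reused whenever possible.

Before a new TransportClient is considered created, all given TransportClientBootstraps are run.

It maintains a connection pool internally, as follows:

The class diagram of TransportClientFactory is as follows:

The member variables of TransportClientFactory are:

logger: the logger.
context: a TransportContext instance.
conf: a TransportConf instance.
clientBootstraps: a List<TransportClientBootstrap>.
connectionPool: a ConcurrentHashMap<SocketAddress, ClientPool> that maps a SocketAddress to a ClientPool; that is, the connections to a given host and port are encapsulated in a ClientPool.
rand: a Random used to pick a TransportClient from a ClientPool.
numConnectionsPerPeer: the number of connections to one RpcAddress.
socketChannelClass: the Class object of the Channel type.
workerGroup: an EventLoopGroup, mainly used to register channels.
pooledAllocator: a PooledByteBufAllocator, responsible for allocating buffers.
metrics: a NettyMemoryMetrics object that collects memory usage metrics from the PooledByteBufAllocator.

Its member methods are simple: a handful of methods that create TransportClients.

Creating a managed TransportClient, where "managed" means the created object is placed in connectionPool:

Creating an unmanaged TransportClient, where the new object is not placed in connectionPool:

Both methods call the core createClient method, whose source is as follows:

The Bootstrap class makes it easier for a client to create a channel. It can be thought of as the builder in the builder pattern.

It hides the complex channel initialization process inside the Bootstrap class.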
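The managed path (look up the pool for an address, pick a random slot, create a client only if the slot is empty) can be sketched without Netty. Everything below is a simplification under stated assumptions: SketchClientFactory is a hypothetical class, the type parameter C stands in for a client type such as TransportClient, and the connect function replaces the real channel setup. Liveness checks on cached clients are left out.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical pooled factory; C stands in for a client type such as TransportClient.
class SketchClientFactory<C> {
    // Fixed number of slots per remote address, each guarded by its own lock.
    private static final class ClientPool<C> {
        final List<C> clients = new ArrayList<>();
        final Object[] locks;

        ClientPool(int size) {
            locks = new Object[size];
            for (int i = 0; i < size; i++) {
                clients.add(null);       // slot starts out empty
                locks[i] = new Object();
            }
        }
    }

    private final Map<InetSocketAddress, ClientPool<C>> connectionPool = new ConcurrentHashMap<>();
    private final int numConnectionsPerPeer;
    private final Random rand;
    private final Function<InetSocketAddress, C> connect;

    SketchClientFactory(int numConnectionsPerPeer, Random rand,
                        Function<InetSocketAddress, C> connect) {
        this.numConnectionsPerPeer = numConnectionsPerPeer;
        this.rand = rand;
        this.connect = connect;
    }

    // Managed creation: reuse the cached client in a randomly chosen slot if there is one.
    C createClient(String host, int port) {
        // An unresolved address serves as the key, so no DNS lookup happens here.
        InetSocketAddress address = InetSocketAddress.createUnresolved(host, port);
        ClientPool<C> pool =
            connectionPool.computeIfAbsent(address, a -> new ClientPool<>(numConnectionsPerPeer));
        int index = rand.nextInt(numConnectionsPerPeer);
        synchronized (pool.locks[index]) {
            C cached = pool.clients.get(index);
            if (cached == null) {
                cached = connect.apply(address);
                pool.clients.set(index, cached);
            }
            return cached;
        }
    }
}
```

Locking per slot rather than per factory means that connecting to one peer does not block lookups for another, while two threads racing for the same slot still create only one client.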

V. TransportResponseHandler, TransportRequestHandler and TransportChannelHandler

1. TransportResponseHandler

The class description:

Handler that processes server responses, in response to requests issued from a [[TransportClient]]. It works by tracking the list of outstanding requests (and their callbacks). Concurrency: thread safe and can be called from multiple threads.

That is, it handles server responses to requests issued by a TransportClient. It works by tracking the list of outstanding requests and their callbacks. It is thread-safe.

Its key member fields are:

channel: the SocketChannel bound to it.
outstandingFetches: a ConcurrentHashMap from StreamChunkId to ChunkReceivedCallback.
outstandingRpcs: a ConcurrentHashMap from request id to RpcResponseCallback.
streamCallbacks: a ConcurrentLinkedQueue of Pair<String, StreamCallback>, where the String is the stream id.
timeOfLastRequestNs: the system time of the last RPC request or chunk fetch, in nanoseconds.
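The bookkeeping role of outstandingRpcs can be sketched as a map from request id to callback that is consulted when a response arrives. This is a simplified model, not the real handler: SketchResponseTracker, its Callback interface, and RecordingCallback are hypothetical names, and response bodies are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker for outstanding RPCs; not the real TransportResponseHandler.
class SketchResponseTracker {
    interface Callback {
        void onSuccess(String response);
        void onFailure(Throwable cause);
    }

    private final Map<Long, Callback> outstandingRpcs = new ConcurrentHashMap<>();
    private final AtomicLong timeOfLastRequestNs = new AtomicLong(0);

    // Called when a request is sent: remember who is waiting for the reply.
    void addRpcRequest(long requestId, Callback callback) {
        timeOfLastRequestNs.set(System.nanoTime());
        outstandingRpcs.put(requestId, callback);
    }

    void removeRpcRequest(long requestId) {
        outstandingRpcs.remove(requestId);
    }

    int numOutstandingRequests() {
        return outstandingRpcs.size();
    }

    // Called when a response arrives: complete and forget the matching request.
    void handleResponse(long requestId, String body) {
        Callback callback = outstandingRpcs.remove(requestId);
        if (callback != null) {
            callback.onSuccess(body);
        }
        // A response for an unknown id is ignored in this sketch.
    }

    // Called when the channel dies: fail everything that is still waiting.
    void failOutstandingRequests(Throwable cause) {
        for (Long requestId : new ArrayList<>(outstandingRpcs.keySet())) {
            Callback callback = outstandingRpcs.remove(requestId);
            if (callback != null) {
                callback.onFailure(cause);
            }
        }
    }
}

// Test helper that records what it was told.
class RecordingCallback implements SketchResponseTracker.Callback {
    final List<String> events = new ArrayList<>();
    @Override public void onSuccess(String response) { events.add("ok:" + response); }
    @Override public void onFailure(Throwable cause) { events.add("fail:" + cause.getMessage()); }
}
```

Removing the entry before invoking the callback ensures each callback fires at most once, even if a late response and a channel failure race with each other.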

Its key method handle is as follows:

2. TransportRequestHandler

The class description:

A handler that processes requests from clients and writes chunk data back. Each handler is attached to a single Netty channel, and keeps track of which streams have been fetched via this channel, in order to clean them up if the channel is terminated (see #channelUnregistered). The messages should have been processed by the pipeline setup by TransportServer.

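It is a handler that processes requests from clients and writes chunk data back to them. Each handler is attached to a single Netty channel and keeps track of which streams have been fetched through that channel, so they can be cleaned up if the channel is terminated. The messages should already have been processed by the pipeline set up by TransportServer.

Its member variables are:

channel: the Channel (a SocketChannel) associated with this handler.
reverseClient: a TransportClient on the same channel, which makes it possible to communicate back to the requester.
rpcHandler: an RpcHandler that handles all RPC messages.
streamManager: a StreamManager that returns any individual chunk of a stream.
maxChunksBeingTransferred: the maximum number of chunks allowed to be in transfer at the same time.

Its key method handle is as follows:

We look at just one branch as an example:

It calls the receive method of rpcHandler. When processing completes, a successful call produces an RpcResponse object and a failed call produces an RpcFailure object. Because the result may need to travel across the network, it is wrapped further by the respond method, as follows:

That is, the respond method returns the result of the server-side request to the client.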

3. TransportChannelHandler

The class description:

The single Transport-level Channel handler which is used for delegating requests to the TransportRequestHandler and responses to the TransportResponseHandler. All channels created in the transport layer are bidirectional. When the Client initiates a Netty Channel with a RequestMessage (which gets handled by the Server's RequestHandler), the Server will produce a ResponseMessage (handled by the Client's ResponseHandler). However, the Server also gets a handle on the same Channel, so it may then begin to send RequestMessages to the Client. This means that the Client also needs a RequestHandler and the Server needs a ResponseHandler, for the Client's responses to the Server's requests. This class also handles timeouts from a io.netty.handler.timeout.IdleStateHandler. We consider a connection timed out if there are outstanding fetch or RPC requests but no traffic on the channel for at least requestTimeoutMs. Note that this is duplex traffic; we will not timeout if the client is continuously sending but getting no responses, for simplicity.

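It is the transport-level handler that delegates requests to TransportRequestHandler and responses to TransportResponseHandler.

All channels created in the transport layer are bidirectional. When the client initiates a Netty channel with a RequestMessage (handled by the server's RequestHandler), the server produces a ResponseMessage (handled by the client's ResponseHandler). The server also gets a handle on the same channel, however, so it may start sending RequestMessages to the client. The client therefore needs a RequestHandler as well, and the server needs a ResponseHandler for the client's responses to the server's requests. This class also handles timeouts from io.netty.handler.timeout.IdleStateHandler. A connection is considered timed out if there are outstanding fetch or RPC requests but no traffic on the channel for at least requestTimeoutMs. Note that this is duplex traffic: for simplicity, there is no timeout if the client keeps sending but gets no responses.

The key method channelRead is as follows:

This method delegates requests to TransportRequestHandler and responses to TransportResponseHandler.

Because this handler is ultimately added to the channel's pipeline, every message arriving on the channel triggers this method, which in turn calls the corresponding handler.

In short, Spark RPC sends requests and receives responses through Netty channels.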
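The delegation performed in channelRead amounts to a type test on the incoming message. A minimal sketch, assuming hypothetical message types (SketchRequestMsg, SketchResponseMsg) and handlers expressed as plain Consumer callbacks rather than the real Netty and Spark classes:

```java
import java.util.function.Consumer;

// Hypothetical message types; the real hierarchy is richer.
final class SketchRequestMsg {
    final String body;
    SketchRequestMsg(String body) { this.body = body; }
}

final class SketchResponseMsg {
    final String body;
    SketchResponseMsg(String body) { this.body = body; }
}

// Hypothetical duplex handler: one object serves both directions of a channel.
class SketchChannelHandler {
    private final Consumer<SketchRequestMsg> requestHandler;
    private final Consumer<SketchResponseMsg> responseHandler;

    SketchChannelHandler(Consumer<SketchRequestMsg> requestHandler,
                         Consumer<SketchResponseMsg> responseHandler) {
        this.requestHandler = requestHandler;
        this.responseHandler = responseHandler;
    }

    // Returns true if the message was handled here; false means
    // "not ours, pass it along to the next handler in the pipeline".
    boolean channelRead(Object message) {
        if (message instanceof SketchRequestMsg) {
            requestHandler.accept((SketchRequestMsg) message);
            return true;
        } else if (message instanceof SketchResponseMsg) {
            responseHandler.accept((SketchResponseMsg) message);
            return true;
        }
        return false;
    }
}
```

Because both handlers hang off the same channel handler, either side of a connection can act as requester or responder, which is what makes the channel effectively bidirectional.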

VI. TransportClient and TransportServer

1. TransportClient class description

First, the description given in the official documentation:

Client for fetching consecutive chunks of a pre-negotiated stream. This API is intended to allow efficient transfer of a large amount of data, broken up into chunks with size ranging from hundreds of KB to a few MB. Note that while this client deals with the fetching of chunks from a stream (i.e., data plane), the actual setup of the streams is done outside the scope of the transport layer. The convenience method "sendRPC" is provided to enable control plane communication between the client and server to perform this setup. For example, a typical workflow might be: client.sendRPC(new OpenFile("/foo")) --> returns StreamId = 100 client.fetchChunk(streamId = 100, chunkIndex = 0, callback) client.fetchChunk(streamId = 100, chunkIndex = 1, callback) ... client.sendRPC(new CloseStream(100)) Construct an instance of TransportClient using TransportClientFactory. A single TransportClient may be used for multiple streams, but any given stream must be restricted to a single client, in order to avoid out-of-order responses. NB: This class is used to make requests to the server, while TransportResponseHandler is responsible for handling responses from the server. Concurrency: thread safe and can be called from multiple threads.

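In other words, it is a client for fetching consecutive chunks of a pre-negotiated stream. The API is meant for transferring large amounts of data efficiently, broken into chunks ranging from a few hundred KB to a few MB. The client handles fetching chunks from a stream (the data plane), while the actual setup of the stream happens outside the transport layer; the convenience method sendRPC provides the control-plane communication needed for that setup. A typical workflow might be:

    // open the remote file; returns StreamId = 100
    client.sendRPC(new OpenFile("/foo"))
    // fetch chunk 0 of the remote file
    client.fetchChunk(streamId = 100, chunkIndex = 0, callback)
    // fetch chunk 1 of the remote file
    client.fetchChunk(streamId = 100, chunkIndex = 1, callback)
    ...
    // close the remote file
    client.sendRPC(new CloseStream(100))

TransportClient instances are constructed with TransportClientFactory. A single TransportClient may be used for multiple streams, but any given stream must be restricted to a single client to avoid out-of-order responses. Note that this class makes requests to the server, while TransportResponseHandler handles the responses from the server. The class is thread-safe and can be called from multiple threads.

In short, TransportClient is the lowest-level client class of Spark RPC. It is used to send RPC requests to the server and to fetch chunks of a stream from the server.

The structure of the class is as follows:

It has two inner classes, RpcChannelListener and StdChannelListener, whose inheritance relationship is as follows:

Their common parent type, GenericFutureListener, is documented as follows: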

Listens to the result of a Future. The result of the asynchronous operation is notified once this listener is added by calling Future.addListener(GenericFutureListener).

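That is, it listens for the result of a Future. A listener is added with Future.addListener(GenericFutureListener) and is notified of the final result of the asynchronous operation through its operationComplete method. In StdChannelListener, operationComplete mainly logs what happened and, on failure, calls the failure hook handleFailure, which is an empty implementation, as follows:

The subclass RpcChannelListener implements handleFailure as follows:

This handleFailure method forwards the failure: it calls the onFailure method of the RpcResponseCallback that was passed in through the constructor.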
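The relationship between the two listeners is a template method with an overridable failure hook. The sketch below models it without Netty; SketchStdListener and SketchRpcListener are hypothetical names, operationComplete takes a success flag and a cause instead of a Netty Future, and wrapping the cause in an IOException is an assumption made for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical base listener: logs the outcome and exposes a failure hook.
class SketchStdListener {
    final List<String> log = new ArrayList<>();
    private final String requestId;

    SketchStdListener(String requestId) { this.requestId = requestId; }

    // Template method: the flow is fixed, only handleFailure varies.
    final void operationComplete(boolean success, Throwable cause) {
        if (success) {
            log.add("sent " + requestId);
        } else {
            String errorMsg = "Failed to send " + requestId;
            log.add(errorMsg);
            handleFailure(errorMsg, cause);
        }
    }

    // Hook: does nothing by default.
    protected void handleFailure(String errorMsg, Throwable cause) { }
}

// Hypothetical RPC listener: forgets the pending request and notifies the caller.
class SketchRpcListener extends SketchStdListener {
    private final Runnable removeRpcRequest;
    private final Consumer<Throwable> onFailure;

    SketchRpcListener(String requestId, Runnable removeRpcRequest, Consumer<Throwable> onFailure) {
        super(requestId);
        this.removeRpcRequest = removeRpcRequest;
        this.onFailure = onFailure;
    }

    @Override
    protected void handleFailure(String errorMsg, Throwable cause) {
        removeRpcRequest.run();
        onFailure.accept(new IOException(errorMsg, cause));
    }
}
```

The base class owns the logging so that every request type is traced the same way, and only request types with a waiting callback need to override the hook.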

Next, the main methods of TransportClient:

fetchChunk: Requests a single chunk from the remote side, from the pre-negotiated streamId. Chunk indices go from 0 onwards. It is valid to request the same chunk multiple times, though some streams may not support this. Multiple fetchChunk requests may be outstanding simultaneously, and the chunks are guaranteed to be returned in the same order that they were requested, assuming only a single TransportClient is used to fetch the chunks. Its source is as follows:

stream: Request to stream the data with the given stream ID from the remote end. Its source is as follows:

sendRpc：Sends an opaque message to the RpcHandler on the server-side. The callback will be invoked with the server's response or upon any failure.

uploadStream：Send data to the remote end as a stream. This differs from stream() in that this is a request to send data to the remote end, not to receive it from the remote.

sendRpcSync：Synchronously sends an opaque message to the RpcHandler on the server-side, waiting for up to a specified timeout for a response.

send：Sends an opaque message to the RpcHandler on the server-side. No reply is expected for the message, and no delivery guarantees are made.

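removeRpcRequest: Removes any state associated with the given RPC. In practice it removes the tracked RPC request from the handler.
close: Close the channel.
timeOut: Mark this channel as having timed out.

As can be seen, TransportClient is a fairly low-level client used to send low-level requests, either data-plane requests for chunks of a stream or control-plane RPC requests. Every request method takes a callback that handles the result returned for the request.

2. TransportClient initialization

It is created by TransportClientFactory. The key code of the factory's core method createClient(java.net.InetSocketAddress) is as follows: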

 // 1. Add a ChannelInitializer handler
 bootstrap.handler(new ChannelInitializer<SocketChannel>() {
   @Override
   public void initChannel(SocketChannel ch) {
     TransportChannelHandler clientHandler = context.initializePipeline(ch);
     clientRef.set(clientHandler.getClient());
     channelRef.set(ch);
   }
 });
 
 // Connect to the remote server
 long preConnect = System.nanoTime();
 // 2. 連接到遠程的服務端，返回一個ChannelFuture 對象，調用其 await 方法等待其結果返回。
 ChannelFuture cf = bootstrap.connect(address);
 // 3. 等待channelFuture 對象其結果返回。
 if (!cf.await(conf.connectionTimeoutMs())) {
   throw new IOException(
     String.format("Connecting to %s timed out (%s ms)", address, conf.connectionTimeoutMs()));
 } else if (cf.cause() != null) {
   throw new IOException(String.format("Failed to connect to %s", address), cf.cause());
 }

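The handler is initialized inside connect. Once the handler has been added to the ChannelPipeline, the initialization is executed by a thread pool: DefaultChannelPipeline's callHandlerAdded0 method calls the handler's handlerAdded method, handlerAdded calls a private initChannel method, and that in turn calls the protected abstract initChannel method, which dispatches to the initChannel of the anonymous ChannelInitializer subclass shown above. That initChannel calls TransportContext's initializePipeline method, and it is there that the TransportClient object is instantiated.

Next, look at createChannelHandler, the core method used by TransportContext's initializePipeline:

Then look at how NettyRpcEnv initializes the transportContext:

From the above, the rpcHandler is a NettyRpcHandler, which depends on three objects: the Dispatcher, the nettyEnv, and a StreamManager.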

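3. TransportServer

Official description:

Server for the efficient, low-level streaming service.

That is, it is the server side of an efficient, low-level streaming service (streaming of data, not media streaming).

It is created through TransportContext's createServer method:

Its constructor is as follows:

The method to focus on is init:

ServerBootstrap is used to initialize the server. As with TransportClientFactory creating a TransportClient through Bootstrap, it registers a ChannelInitializer callback, so the analysis above applies here as well.

This completes the analysis of TransportClient and TransportServer.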

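VII. Spark RPC summary

The overall architecture of Spark RPC is as follows:

Some notes:

Spark's network layer depends directly on the Netty framework; its adapters are bound directly to Netty channels.
The channel's encoder, decoder, and other Netty components are not shown in the diagram.
A channel is full duplex, so NettyRpcEnv has both a TransportClient and a TransportServer.
Requests are either data-plane chunk requests or control-plane RPC requests. Chunk requests are handled by the StreamManager; RPC requests are dispatched further by the Dispatcher to the appropriate endpoint. Results travel back to the sender over the channel.
An RpcEndpointRef is either a thin wrapper around a local RpcEndpoint or a stand-in for a remote one. When a message is sent to an RpcEndpointRef that is local, the Dispatcher dispatches it further. When the target is remote, the message is wrapped in an OutboxMessage and sent by the local TransportClient over the channel to the remote RpcEndpoint.

This completes the analysis of Spark RPC.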

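Chapter 6. Spark source analysis: storage

I. SerializerManager

About SerializerManager:

It configures serialization, compression, and encryption for the various Spark components, including automatically selecting the Serializer used for shuffle. Data in Spark must be serialized whenever it crosses network IO or local disk IO. The default Serializer is org.apache.spark.serializer.JavaSerializer; under certain conditions Kryo, i.e. org.apache.spark.serializer.KryoSerializer, is used instead.

1. The two supported serializers

If the value's type is one of the eight primitive types, null, or String, Kryo is used; otherwise the default serializer, Java serialization, is used.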
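The selection rule just described can be sketched as follows. This is a minimal illustration, not Spark's code: SerializerChoice and chooseSerializer are hypothetical names, and runtime classes stand in for the ClassTag that Spark actually inspects.

```java
import java.util.ArrayList;
import java.util.Set;

/** Hypothetical helper mirroring the rule described above; not Spark's actual class. */
final class SerializerChoice {
    // Boxed counterparts of the eight JVM primitive types, plus String.
    private static final Set<Class<?>> KRYO_FRIENDLY = Set.of(
        Boolean.class, Byte.class, Character.class, Short.class,
        Integer.class, Long.class, Float.class, Double.class, String.class);

    /** Returns "kryo" for null, primitives, and strings; otherwise "java". */
    static String chooseSerializer(Object value) {
        if (value == null || KRYO_FRIENDLY.contains(value.getClass())) {
            return "kryo";
        }
        return "java";
    }

    public static void main(String[] args) {
        System.out.println(chooseSerializer(42));                       // boxed Integer
        System.out.println(chooseSerializer("text"));                   // String
        System.out.println(chooseSerializer(new ArrayList<String>()));  // arbitrary object
    }
}
```

The point of the rule is that Kryo needs no class registration for these simple types, so it is a safe fast path, while arbitrary user classes fall back to Java serialization.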

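SerializerManager also decides whether block streams are compressed when they are read and written:

2. Whether data streams are compressed

By default:

When compression is enabled, the default codec is lz4, which can be changed with spark.io.compression.codec. All supported codecs are listed below:

3. How compressed streams are read and written

The compression-aware InputStream and OutputStream are wrappers around the original InputStream and OutputStream. LZ4BlockOutputStream serves as the example.

The following function returns the compressing OutputStream:

First, the inheritance hierarchy of LZ4BlockOutputStream:

The wrapped stream is stored in the out field of FilterOutputStream:

The core method of an OutputStream is write, so go straight to LZ4BlockOutputStream's write method:

Here buffer is a byte array whose size defaults to 32k and can be set with spark.io.compression.lz4.blockSize; LZ4BlockOutputStream keeps the value in blockSize.

The method to focus on is flushBufferedData:

Its implementation works as follows:

Data written into buffer is compressed by the compressor into compressedBuffer, a header (the magic bytes and related fields) is written, and finally the compressed buffer is written to out, which completes the write.

So LZ4BlockOutputStream does the compression, and the compressed data is written to the target OutputStream.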
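The buffer-then-flush pattern described above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the LZ4 implementation: java.util.zip.Deflater stands in for the LZ4 compressor, and the class name, the MAGIC value, and the header layout are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;

/** Illustrative block compressor; Deflater stands in for the LZ4 compressor. */
final class BlockCompressingOutputStream extends FilterOutputStream {
    static final int MAGIC = 0x424C4B31;  // hypothetical block marker
    private final byte[] buffer;          // uncompressed bytes waiting to be flushed
    private int used = 0;

    BlockCompressingOutputStream(OutputStream out, int blockSize) {
        super(out);                       // the wrapped stream is kept in FilterOutputStream.out
        this.buffer = new byte[blockSize];
    }

    @Override
    public void write(int b) throws IOException {
        if (used == buffer.length) {
            flushBufferedData();          // buffer full: compress it and emit one block
        }
        buffer[used++] = (byte) b;
    }

    /** Compresses the buffered bytes, then writes header and payload to the wrapped stream. */
    private void flushBufferedData() throws IOException {
        if (used == 0) {
            return;
        }
        Deflater deflater = new Deflater();
        deflater.setInput(buffer, 0, used);
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] tmp = new byte[1024];
        while (!deflater.finished()) {
            compressed.write(tmp, 0, deflater.deflate(tmp));
        }
        deflater.end();
        DataOutputStream header = new DataOutputStream(out);
        header.writeInt(MAGIC);             // block marker
        header.writeInt(compressed.size()); // compressed length
        header.writeInt(used);              // original length
        compressed.writeTo(out);
        used = 0;
    }

    @Override
    public void flush() throws IOException {
        flushBufferedData();
        out.flush();
    }

    /** Convenience wrapper for demos: compresses data into a byte array. */
    static byte[] compressAll(byte[] data, int blockSize) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream s = new BlockCompressingOutputStream(sink, blockSize)) {
            s.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

Callers only ever see an OutputStream, which is why the serializer can be layered on top of the compressing stream without knowing that compression is happening.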

II. How broadcast is implemented

1. BroadcastManager initialization

The BroadcastManager initialization method is as follows:

The inheritance hierarchy of TorrentBroadcastFactory:

2. BroadcastFactory

An interface for all the broadcast implementations in Spark (to allow multiple broadcast implementations). SparkContext uses a BroadcastFactory implementation to instantiate a particular broadcast for the entire Spark job.

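That is, it is the interface for all broadcast implementations in Spark. SparkContext uses a BroadcastFactory implementation to instantiate a particular broadcast for the entire Spark job. Its only implementation is TorrentBroadcastFactory.

It has two important methods:

newBroadcast creates a broadcast variable.

3. TorrentBroadcastFactory

Its main methods are:

newBroadcast instantiates the TorrentBroadcast class.

unbroadcast calls the unpersist method of TorrentBroadcast.

4. Broadcast, the parent class of TorrentBroadcast

Official description: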

A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. Broadcast variables are created from a variable v by calling org.apache.spark.SparkContext.broadcast. The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The interpreter session below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).

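In other words, a broadcast variable lets the programmer cache a read-only variable on each machine instead of shipping a copy of it with every task. It can be used, for example, to give every node a copy of a large input dataset efficiently. Spark also tries to use efficient broadcast algorithms to reduce communication cost. A broadcast variable is created by calling SparkContext's broadcast method; it wraps the real value, which is returned by the broadcast object's value method. Once a value has been broadcast it should not be modified, so that every node sees the same data.

The inheritance hierarchy of TorrentBroadcast:

TorrentBroadcast is the only subclass of Broadcast.

5. TorrentBroadcast

Its description: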

A BitTorrent-like implementation of org.apache.spark.broadcast.Broadcast. The mechanism is as follows: The driver divides the serialized object into small chunks and stores those chunks in the BlockManager of the driver. On each executor, the executor first attempts to fetch the object from its BlockManager. If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or other executors if available. Once it gets the chunks, it puts the chunks in its own BlockManager, ready for other executors to fetch from. This prevents the driver from being the bottleneck in sending out multiple copies of the broadcast data (one per executor). When initialized, TorrentBroadcast objects read SparkEnv.get.conf.

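How it works:

The driver splits the serialized data into small chunks and stores them in the driver's BlockManager. On each executor, the executor first tries to get the data from its own BlockManager; if it is not there, it fetches the chunks remotely from the driver or from other executors. Once it has a chunk, it puts it into its own BlockManager, ready to be fetched by other nodes. This keeps the driver from becoming the bottleneck when copies must be sent to many executors.

6. Writing data on the driver

The broadcast data is stored in two forms:

First, the whole object is stored as a single block in memory and on disk: the MemoryStore holds the deserialized object, and the copy on disk is first serialized into bytes by the SerializerManager.
Second, the object is written to an output stream according to blockSize (4m by default, set with spark.broadcast.blockSize) and compressCodec (enabled by default and disabled with spark.broadcast.compress; the codec defaults to lz4 and is set with spark.io.compression.codec), which splits it into several small chunks. The chunks are then persisted to the BlockManager, again in memory and on disk, but already serialized.

TorrentBroadcast's blockifyObject method is as follows:

The compressing OutputStream decorates a ChunkedByteBufferOutputStream.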
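The chunking step can be illustrated on its own. The sketch below only shows splitting already-serialized bytes into blockSize-sized pieces and joining them again; it leaves out serialization and compression, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of splitting a serialized object into fixed-size pieces. */
final class Blockify {
    /** Splits serialized bytes into chunks of at most blockSize bytes. */
    static List<byte[]> blockify(byte[] serialized, int blockSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < serialized.length; start += blockSize) {
            int end = Math.min(start + blockSize, serialized.length);
            chunks.add(Arrays.copyOfRange(serialized, start, end));
        }
        return chunks;
    }

    /** Reassembles the chunks in order, as a reader does after fetching every piece. */
    static byte[] unBlockify(List<byte[]> chunks) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            joined.write(chunk, 0, chunk.length);
        }
        return joined.toByteArray();
    }

    public static void main(String[] args) {
        List<byte[]> chunks = blockify(new byte[10], 4);
        System.out.println("number of chunks: " + chunks.size());
    }
}
```

Only the last chunk can be shorter than blockSize, which is why a reader needs nothing more than the number of pieces to reassemble the object.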

7. Reading data on the driver or an executor

When value is called on a broadcast variable, TorrentBroadcast's getValue method is invoked:

The _value field is declared as follows:

private lazy val _value: T = readBroadcastBlock()

Now look at the readBroadcastBlock method:

  private def readBroadcastBlock(): T = Utils.tryOrIOException {
  TorrentBroadcast.synchronized {
    val broadcastCache = SparkEnv.get.broadcastManager.cachedValues

    Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse {
      setConf(SparkEnv.get.conf)
      val blockManager = SparkEnv.get.blockManager
      blockManager.getLocalValues(broadcastId) match {
        case Some(blockResult) =>
          if (blockResult.data.hasNext) {
            val x = blockResult.data.next().asInstanceOf[T]
            releaseLock(broadcastId)

            if (x != null) {
              broadcastCache.put(broadcastId, x)
            }

            x
          } else {
            throw new SparkException(s"Failed to get locally stored broadcast data: $broadcastId")
          }
        case None =>
          logInfo("Started reading broadcast variable " + id)
          val startTimeMs = System.currentTimeMillis()
          val blocks = readBlocks()
          logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs))

          try {
            val obj = TorrentBroadcast.unBlockifyObject[T](
              blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec)
            // Store the merged copy in BlockManager so other tasks on this executor don't
            // need to re-fetch it.
            val storageLevel = StorageLevel.MEMORY_AND_DISK
            if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) {
              throw new SparkException(s"Failed to store $broadcastId in BlockManager")
            }

            if (obj != null) {
              broadcastCache.put(broadcastId, obj)
            }

            obj
          } finally {
            blocks.foreach(_.dispose())
          }
      }
    }
  }
}

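Notes on the source:

val broadcastCache = ...: broadcastManager.cachedValues holds the values of all broadcasts. It is a map whose keys are strong references and whose values are weak references, so a value can be reclaimed by the garbage collector once nothing else references it.

Option(broadcastCache.get(broadcastId)): looks the value up in cachedValues by broadcastId. On a miss, the block passed to getOrElse runs.

blockManager.getLocalValues(broadcastId): tries to read the broadcast value from the local BlockManager (from the MemoryStore or DiskStore; this is the complete value, not the small chunks). On a hit, the BlockManager lock is released and the value is put into cachedValues. On a miss, readBlocks fetches the chunks, they are converted back into the broadcast value object, and that object is stored in the local BlockManager and put into cachedValues.

The readBlocks method is as follows: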

  /** Fetch torrent blocks from the driver and/or other executors. */
private def readBlocks(): Array[BlockData] = {
  // Fetch chunks of data. Note that all these chunks are stored in the BlockManager and reported
  // to the driver, so other executors can pull these chunks from this executor as well.
  val blocks = new Array[BlockData](numBlocks)
  val bm = SparkEnv.get.blockManager

  for (pid <- Random.shuffle(Seq.range(0, numBlocks))) {
    val pieceId = BroadcastBlockId(id, "piece" + pid)
    logDebug(s"Reading piece $pieceId of $broadcastId")
    // First try getLocalBytes because there is a chance that previous attempts to fetch the
    // broadcast blocks have already fetched some of the blocks. In that case, some blocks
    // would be available locally (on this executor).
    bm.getLocalBytes(pieceId) match {
      case Some(block) =>
        blocks(pid) = block
        releaseLock(pieceId)
      case None =>
        bm.getRemoteBytes(pieceId) match {
          case Some(b) =>
            if (checksumEnabled) {
              val sum = calcChecksum(b.chunks(0))
              if (sum != checksums(pid)) {
                throw new SparkException(s"corrupt remote block $pieceId of $broadcastId:" +
                  s" $sum != ${checksums(pid)}")
              }
            }
            // We found the block from remote executors/driver's BlockManager, so put the block
            // in this executor's BlockManager.
            if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)) {
              throw new SparkException(
                s"Failed to store $pieceId of $broadcastId in local BlockManager")
            }
            blocks(pid) = new ByteBufferBlockData(b, true)
          case None =>
            throw new SparkException(s"Failed to get $pieceId of $broadcastId")
        }
    }
  }
  blocks
}

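Notes on the source:

bm.getLocalBytes(pieceId): looks for the chunk in the local BlockManager by piece id.

case Some(block): the chunk was found locally, so the lock is released.

case None: the chunk was not found locally, so it is fetched remotely by piece id. The fetched chunk is verified against its checksum and then stored in the local BlockManager.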
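The local-first, then-remote lookup can be modelled without Spark. In the sketch below two in-memory maps stand in for the local and remote BlockManagers, and PieceFetcher with its fields is a hypothetical name; the Adler32 checksum is an assumption chosen because it ships with the JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.Adler32;

/** Hypothetical model of fetching broadcast pieces; maps stand in for BlockManagers. */
final class PieceFetcher {
    final Map<String, byte[]> local = new HashMap<>();  // this executor's store
    private final Map<String, byte[]> remote;           // driver and other executors
    private final long[] checksums;                      // expected checksum per piece

    PieceFetcher(Map<String, byte[]> remote, long[] checksums) {
        this.remote = remote;
        this.checksums = checksums;
    }

    static long checksum(byte[] piece) {
        Adler32 adler = new Adler32();
        adler.update(piece, 0, piece.length);
        return adler.getValue();
    }

    /** Fetches every piece in random order, preferring the local store. */
    byte[][] readBlocks() {
        int numBlocks = checksums.length;
        byte[][] blocks = new byte[numBlocks][];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numBlocks; i++) {
            order.add(i);
        }
        Collections.shuffle(order);  // random order spreads load across sources
        for (int pid : order) {
            String pieceId = "piece" + pid;
            byte[] piece = local.get(pieceId);
            if (piece == null) {
                piece = remote.get(pieceId);
                if (piece == null) {
                    throw new IllegalStateException("Failed to get " + pieceId);
                }
                if (checksum(piece) != checksums[pid]) {
                    throw new IllegalStateException("corrupt remote block " + pieceId);
                }
                local.put(pieceId, piece);  // now other executors could fetch it from here
            }
            blocks[pid] = piece;
        }
        return blocks;
    }
}
```

Storing each fetched piece locally is what turns every executor into a potential source for the others, which is the BitTorrent-like part of the design.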

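III. Spark memory management

1. Overview

The memory management classes all live in the org.apache.spark.memory package of the spark core module.

The package documentation describes it as follows: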

This package implements Spark's memory management system. This system consists of two main components, a JVM-wide memory manager and a per-task manager:

- org.apache.spark.memory.MemoryManager manages Spark's overall memory usage within a JVM. This component implements the policies for dividing the available memory across tasks and for allocating memory between storage (memory used caching and data transfer) and execution (memory used by computations, such as shuffles, joins, sorts, and aggregations).
- org.apache.spark.memory.TaskMemoryManager manages the memory allocated by individual tasks. Tasks interact with TaskMemoryManager and never directly interact with the JVM-wide MemoryManager. Internally, each of these components have additional abstractions for memory bookkeeping:

- org.apache.spark.memory.MemoryConsumers are clients of the TaskMemoryManager and correspond to individual operators and data structures within a task. The TaskMemoryManager receives memory allocation requests from MemoryConsumers and issues callbacks to consumers in order to trigger spilling when running low on memory.

- org.apache.spark.memory.MemoryPools are a bookkeeping abstraction used by the MemoryManager to track the division of memory between storage and execution.

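That is, memory management involves two main components: a JVM-wide memory manager and a per-task memory manager.

MemoryManager manages Spark's overall memory usage within a JVM. It implements the policies for dividing available memory across tasks and for allocating memory between storage (memory used for caching and data transfer) and execution (memory used by computations such as shuffles, joins, sorts, and aggregations).
TaskMemoryManager manages the memory allocated by individual tasks. Tasks interact with TaskMemoryManager and never directly with the JVM-wide MemoryManager.

Internally, each of these two components has additional abstractions for memory bookkeeping:

MemoryConsumers are clients of the TaskMemoryManager and correspond to individual operators and data structures within a task. The TaskMemoryManager receives memory allocation requests from MemoryConsumers and issues callbacks to them to trigger spilling when memory runs low.
MemoryPools are a bookkeeping abstraction used by the MemoryManager to track the division of memory between storage and execution.

As shown in the figure:

The two implementations of MemoryManager: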

There are two implementations of org.apache.spark.memory.MemoryManager which vary in how they handle the sizing of their memory pools: - org.apache.spark.memory.UnifiedMemoryManager, the default in Spark 1.6+, enforces soft boundaries between storage and execution memory, allowing requests for memory in one region to be fulfilled by borrowing memory from the other. - org.apache.spark.memory.StaticMemoryManager enforces hard boundaries between storage and execution memory by statically partitioning Spark's memory and preventing storage and execution from borrowing memory from each other. This mode is retained only for legacy compatibility purposes.

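There are two implementations of org.apache.spark.memory.MemoryManager, which differ in how they size their memory pools:

org.apache.spark.memory.UnifiedMemoryManager, the default since Spark 1.6, enforces a soft boundary between storage and execution memory, so a request for memory in one region can be satisfied by borrowing from the other.
org.apache.spark.memory.StaticMemoryManager enforces a hard boundary between storage and execution memory by statically partitioning Spark's memory, which prevents the two from borrowing from each other. This mode is retained only for legacy compatibility.

Here is a class diagram I drew to give a more direct picture of how the classes relate:

The classes involved are described one by one below.

2. MemoryMode

The memory mode distinguishes on-heap from off-heap memory. MemoryMode is an enum, and ON_HEAP and OFF_HEAP are its two constants, i.e. instances of MemoryMode rather than subclasses in the usual sense.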

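3. MemoryPool

Documentation:

Manages bookkeeping for an adjustable-sized region of memory. This class is internal to the MemoryManager.

That is, it does the bookkeeping for a memory region whose size can be adjusted. Think of memory as a vault and MemoryPool as the steward who keeps the ledger, recording what is lent out and what is returned. The class is designed specifically for MemoryManager.

Bookkeeping is not really the core of Spark's memory management, but it matters, and its core methods are all called by MemoryManager.

Once this class is understood, its subclasses are easy to follow. The steward has two implementations: StorageMemoryPool and ExecutionMemoryPool.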

3.1 StorageMemoryPool

Documentation:

Performs bookkeeping for managing an adjustable-size pool of memory that is used for storage (caching).

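Put simply, it keeps the books for the memory region used for storage, i.e. caching.

Its class structure is as follows:

It has three kinds of methods:

acquireMemory: acquires N bytes of memory for the given block, evicting other blocks from memory if there is not enough free space. The source is as follows:

For the logic marked in the figure, see the MemoryStore analysis below.

releaseMemory: releases memory. The source is as follows:

This is simple: it just subtracts from the _memoryUsed counter.

freeSpaceToShrinkPool: frees up to spaceToFree bytes so that the storage pool can be shrunk by that amount. The source is as follows:

The method is called before the storage pool is shrunk; its return value is the number of bytes by which the pool can actually be shrunk.

The storage pool is shrunk in order to grow the execution pool. This method only prepares for the shrink; it does not itself change the pool size.

Implementation: it first computes how much memory has to be evicted. If eviction is needed, it calls MemoryStore's evictBlocksToFreeSpace method, much like acquireMemory does; otherwise it returns directly.

Summary: this class keeps the books for the storage memory pool, and it asks the MemoryStore to evict blocks when an allocation does not fit or when the pool's free memory alone is not enough for a requested shrink.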

3.2 ExecutionMemoryPool

Documentation:

Implements policies and bookkeeping for sharing an adjustable-sized pool of memory between tasks. Tries to ensure that each task gets a reasonable share of memory, instead of some task ramping up to a large amount first and then causing others to spill to disk repeatedly. If there are N tasks, it ensures that each task can acquire at least 1 / 2N of the memory before it has to spill, and at most 1 / N. Because N varies dynamically, we keep track of the set of active tasks and redo the calculations of 1 / 2N and 1 / N in waiting tasks whenever this set changes. This is all done by synchronizing access to mutable state and using wait() and notifyAll() to signal changes to callers. Prior to Spark 1.6, this arbitration of memory across tasks was performed by the ShuffleMemoryManager.

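It implements the policies and bookkeeping for sharing an adjustable-sized pool of memory between tasks. It tries to ensure that each task gets a reasonable share of memory, instead of one task ramping up to a large amount first and then causing the others to spill to disk repeatedly.

With N tasks, it ensures that each task can acquire at least 1 / 2N of the memory before it has to spill, and at most 1 / N.

Because N changes dynamically, the pool keeps track of the set of active tasks and redoes the 1 / 2N and 1 / N calculations in waiting tasks whenever that set changes.

All of this is done by synchronizing access to mutable state and using wait() and notifyAll() to signal changes to callers. Before Spark 1.6, this arbitration of memory across tasks was performed by ShuffleMemoryManager.

The internal structure of the class is as follows:

memoryForTask is declared as follows: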

 @GuardedBy("lock")
 private val memoryForTask = new mutable.HashMap[Long, Long]()

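The key is the taskAttemptId and the value is that task's memory usage in bytes. The map records how much memory each task is using.

It also has three kinds of methods:

Getting the total memory used, or the memory used by one task. The source is as follows:

memoryForTask records the amount of memory used by each task.

Allocating memory to a task. The source is as follows:

numBytes is the amount of memory requested, in bytes, and taskAttemptId identifies the task making the request. maybeGrowPool is a callback that may grow the execution pool; it takes one Long argument, the amount of additional memory by which the pool should be expanded. computeMaxPoolSize is a callback that returns the maximum size the pool is allowed to have at this moment. It is not a field because the maximum pool size is variable in some cases: under unified memory management, for example, the execution pool can be expanded by evicting cached blocks, which shrinks the storage pool.

If the task has not requested memory before, the entry (taskAttemptId -> 0) is added to the memoryForTask map and notifyAll is called on lock to wake the threads waiting on it, because the number of active tasks has changed.

Because synchronized is a mutual-exclusion lock, only one of the woken threads executes the while loop at any given time.

First, (requested memory - free memory in the pool) determines whether the pool needs to grow. The storage pool may have borrowed memory from the execution pool, so maybeGrowPool is called to try to reclaim it.

computeMaxPoolSize gives the maximum size the pool may have right now. From it the per-task maximum and minimum are computed, and from those the largest amount that may be granted to this task (maxToGrant) and the amount actually granted (toGrant).

If the amount granted is less than the amount requested and (memory currently held by the task + amount granted) < the per-task minimum, the thread waits on the lock until memory becomes available; otherwise the memory is allocated to the task.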
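The grant decision can be written as a pure function. This is a sketch of the arithmetic only, assuming the 1 / N and 1 / 2N bounds described above; ExecutionGrant, decide, and the WAIT sentinel are hypothetical names, and the real pool blocks on the lock instead of returning a sentinel.

```java
/** Hypothetical sketch of the per-task grant arithmetic in an execution pool. */
final class ExecutionGrant {
    static final long WAIT = -1L;  // sentinel: the caller should wait for memory

    /**
     * @param numBytes        bytes requested by the task
     * @param curMem          bytes the task already holds
     * @param poolSize        current size of the execution pool
     * @param maxPoolSize     largest size the pool could grow to right now
     * @param memoryFree      free bytes in the pool
     * @param numActiveTasks  N, the number of active tasks
     */
    static long decide(long numBytes, long curMem, long poolSize, long maxPoolSize,
                       long memoryFree, int numActiveTasks) {
        long maxMemoryPerTask = maxPoolSize / numActiveTasks;      // the 1 / N cap
        long minMemoryPerTask = poolSize / (2L * numActiveTasks);  // the 1 / 2N floor
        long maxToGrant = Math.min(numBytes, Math.max(0L, maxMemoryPerTask - curMem));
        long toGrant = Math.min(maxToGrant, memoryFree);
        if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
            return WAIT;  // task would stay below its guaranteed share, so it waits
        }
        return toGrant;
    }

    public static void main(String[] args) {
        // Four tasks share a 1000-byte pool: cap 250 bytes, floor 125 bytes per task.
        System.out.println(decide(200, 100, 1000, 1000, 500, 4));
        System.out.println(decide(200, 0, 1000, 1000, 10, 4));
    }
}
```

A task that already holds at least its 1 / 2N share is given whatever is available, even if that is less than it asked for, and is expected to spill; only a task below the floor is made to wait.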

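Note that wait and notifyAll are not paired within this method: registering a new taskAttemptId does not by itself make memory available. The lock is passed in from outside, so MemoryManager may also act on it and free up memory that can then be allocated to tasks.

Releasing a task's memory. The source is as follows:

There are two methods: releaseAllMemoryForTask releases all memory currently held by a task, and releaseMemory releases a given amount of a task's memory.

Approach:

releaseAllMemoryForTask first looks up the total memory held by the task and then calls releaseMemory to release it.

releaseMemory compares the task's current usage with the amount to release. If the amount to release does not exceed the current usage, it is simply subtracted. If the task's usage drops to zero or below, the task is removed from the map. Finally the threads waiting on lock are notified so that they can retry their allocations.

In this bookkeeping scheme an arriving task is not guaranteed to get memory immediately, so the lock plays a major role in coordinating the resource and also guards against over-allocating memory.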

4. MemoryManager

Documentation:

An abstract memory manager that enforces how memory is shared between execution and storage. In this context, execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. There exists one MemoryManager per JVM.

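It is an abstract memory manager that enforces how memory is shared between execution and storage. In this context, execution memory is the memory used for computation in shuffles, joins, sorts, and aggregations, while storage memory is the memory used for caching and for propagating internal data across the cluster. There is one MemoryManager per JVM.

First, the MemoryPools it depends on. The source is as follows:

The lock object in each MemoryPool is the MemoryManager itself.

There are two storage pools and two execution pools: one on-heap and one off-heap of each.

onHeapStorageMemory and onHeapExecutionMemory are passed in through the constructor and are set aside for now.

maxOffHeapMemory defaults to 0 and can be set with spark.memory.offHeap.size. The documentation describes this parameter as follows: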

The absolute amount of memory in bytes which can be used for off-heap allocation. 
This setting has no impact on heap memory usage, so if your executors' total memory consumption must fit within some hard limit 
then be sure to shrink your JVM heap size  accordingly. This must be set to a positive value when spark.memory.offHeap.enabled=true.

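Off-heap storage memory = maximum off-heap memory (maxOffHeapMemory) x the storage fraction, which defaults to 0.5 and can be adjusted with spark.memory.storageFraction.

Off-heap execution memory = maximum off-heap memory - off-heap storage memory

There are also constants related to Tungsten memory management:

These three constants define Tungsten's memory mode, page size, and memory allocator.

Its methods are as follows:

Getting the maximum memory available to the storage pool. This is abstract and left to subclasses.

Getting the memory in use

Acquiring memory. These methods are also abstract and left to subclasses.

Releasing memory

These requests are all delegated to the corresponding MemoryPool.

Before Spark 1.6, memory was managed by the MemoryManager subclass StaticMemoryManager.

5. StaticMemoryManager

In static memory management there is a strict boundary between the execution pool and the storage pool, and the sizes of the two pools never change.

Note: to use this mode, set spark.memory.useLegacyMode to true (the default is false).

The rest of this section concentrates on UnifiedMemoryManager, the MemoryManager subclass used by default since 1.6.

6. UnifiedMemoryManager

First, the documentation: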

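This MemoryManager enforces a soft boundary between the storage pool and the execution pool, so each can borrow memory from the other as their needs change. The size of the region shared by execution and storage is configured with spark.memory.fraction (default 0.6). Where the boundary sits within that region is determined by spark.memory.storageFraction (default 0.5), so by default the storage pool is 0.6 * 0.5 = 0.3 of the usable heap space. Storage can borrow as much free execution memory as it likes, but when execution needs that memory back, cached blocks are evicted from the storage pool until execution's demand is met. Similarly, execution can borrow as much free storage memory as it likes. The difference is that execution memory is never evicted by storage: caching a block can fail because execution holds most of the memory and will not release it, in which case the new block is handled according to its StorageLevel.

The methods it implements from its parent class MemoryManager are the main interest:

Getting the maximum memory available to the storage pool:

maxHeapMemory is a member passed in through the constructor, and maxOffHeapMemory is derived from the spark.memory.offHeap.size setting.

The maximum memory the storage pool may use therefore changes in real time: total memory is fixed, while the execution pool's usage varies as tasks run.

Acquiring memory, one method at a time:

Acquiring execution memory: depending on the memory mode (on-heap or off-heap), it first selects the storage pool, the execution pool, the size of the storage region, and the maximum total memory.

It then calls the execution pool's acquireMemory method. computeMaxExecutionPoolSize varies with the storage pool's current usage, and the callback that grows the execution pool is invoked to make sure enough space is available for execution to allocate.

acquireUnrollMemory simply calls acquireStorageMemory.

acquireStorageMemory: depending on the memory mode (on-heap or off-heap), it first selects the storage pool, the execution pool, and the maximum total memory.

If the requested amount exceeds the maximum storage memory, the request fails immediately. Otherwise it checks whether the request exceeds the storage pool's free memory; if so, it borrows free memory from the execution pool. The amount actually borrowed lies between 0 and numBytes - storagePool.memoryFree. It then calls storagePool's acquireMemory, which evicts existing blocks if there is still not enough room.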

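Finally, the companion object:

Its apply method acts as a factory method. The figure below helps explain Spark's memory layout:

System memory: configurable with spark.testing.memory (mainly for tests); by default it is the maximum memory available to the JVM.

Reserved memory: configurable with spark.testing.reservedMemory (mainly for tests); the default is 300M.

Minimum system memory: reserved memory * 1.5, rounded up.

Constraint on system memory: it must be at least the minimum system memory, i.e. at least 450M by default. It can be adjusted with --driver-memory or spark.driver.memory, or with --executor-memory or spark.executor.memory.

Usable memory = system memory - reserved memory

The fraction of usable memory shared by execution and storage defaults to 0.6 and can be adjusted with spark.memory.fraction.

Maximum on-heap memory = usable memory * that fraction

The storage pool's share of it defaults to 0.5 and can be adjusted with spark.memory.storageFraction.

Default on-heap storage memory = maximum on-heap memory * storage fraction. So by default the on-heap storage pool is (maximum JVM memory - 300M) * 0.6 * 0.5, i.e. 30% of the usable memory and a little under 30% of the JVM's maximum memory.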
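The formulas above can be checked with a few lines of arithmetic. This is a sketch with hypothetical names (UnifiedMemorySizing and its methods); the 300M reserve, the 1.5 factor, and the two fractions come from the text, while truncating each product to a whole number of bytes is an assumption.

```java
/** Hypothetical sketch of the unified memory sizing formulas described above. */
final class UnifiedMemorySizing {
    static final long RESERVED = 300L * 1024 * 1024;  // reserved memory, 300M

    /** Usable memory = system memory - reserved memory, after the minimum check. */
    static long usable(long systemMemory) {
        long minSystemMemory = (long) Math.ceil(RESERVED * 1.5);  // 450M
        if (systemMemory < minSystemMemory) {
            throw new IllegalArgumentException(
                "System memory " + systemMemory + " must be at least " + minSystemMemory);
        }
        return systemMemory - RESERVED;
    }

    /** Region shared by execution and storage (spark.memory.fraction). */
    static long maxHeapMemory(long systemMemory, double memoryFraction) {
        return (long) (usable(systemMemory) * memoryFraction);
    }

    /** Initial storage region inside the shared region (spark.memory.storageFraction). */
    static long storageRegion(long systemMemory, double memoryFraction, double storageFraction) {
        return (long) (maxHeapMemory(systemMemory, memoryFraction) * storageFraction);
    }

    public static void main(String[] args) {
        long fourGb = 4096L * 1024 * 1024;
        System.out.println("shared region:  " + maxHeapMemory(fourGb, 0.6));
        System.out.println("storage region: " + storageRegion(fourGb, 0.6, 0.5));
    }
}
```

Because the 300M reserve is subtracted before the fractions are applied, small heaps lose a noticeably larger share to the reserve than large ones.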

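Note: in the figure, spark.memory.fraction is 0.75, the default in Spark 1.6. In Spark 2.4.3 the default is 0.6.

This largely completes the analysis of Spark's memory management module.

Summary: this section first introduced the memory pools, i.e. the MemoryPool implementations, and then focused on the memory management mechanism used since Spark 1.6, in particular how Spark divides its memory internally and how it adjusts the division dynamically.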

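IV. Spark in-memory storage

1. Overview

The classes related to in-memory storage are related as follows:

MemoryStore is the class responsible for in-memory storage. It depends on BlockManager, SerializerManager, BlockInfoManager, and MemoryManager.

BlockManager implements BlockEvictionHandler and provides the dropFromMemory method, which drops a block from memory when necessary and may spill it to disk.

BlockInfoManager implements a locking mechanism for reading and writing blocks; see below.

MemoryManager is the memory manager. Since Spark 1.6 the sizes of its storage and execution pools can change dynamically: each can borrow free memory from the other's pool when needed.

BlockInfo holds the metadata of a block.

BlockId names follow a different format for each type of block.

StorageLevel is the storage level of a block; it is itself serializable.

When storing a collection as serialized bytes fails, the result is returned as a PartiallySerializedBlock.

When storing a collection as an array of Java objects fails, the result is returned as a PartiallyUnrolledIterator.

RedirectableOutputStream is an OutputStream that wraps another OutputStream and forwards data directly to it.

ValuesHolder is an in-memory staging area. Its getBuilder method returns a MemoryEntryBuilder, which converts the staged data into a MemoryEntry that can be stored in the MemoryStore.

The source is analyzed class by class below: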

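2. BlockInfo

It records the metadata of a block.

level: a StorageLevel, the block's storage level

classTag: the class of the block's contents, used to choose the serializer

tellMaster: whether changes to the block are reported to the master. This is true in most cases, but not for broadcast blocks.

size: the size of the block, in bytes

readerCount: the number of read locks currently held on the block

writerTask: the taskAttemptId of the task that currently holds the block's write lock. BlockInfo.NON_TASK_WRITER means the lock is held by non-task code, such as a driver thread, and BlockInfo.NO_WRITER means nobody holds the write lock.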

3. BlockId

A Block can be uniquely identified by its filename, but each type of Block has a different set of keys which produce its unique name. If your BlockId should be serializable, be sure to add it to the BlockId.apply() method.

Its subclasses are marked in the diagram above.

4. BlockInfoManager

Documentation:

Component of the BlockManager which tracks metadata for blocks and manages block locking. The locking interface exposed by this class is readers-writer lock. Every lock acquisition is automatically associated with a running task and locks are automatically released upon task completion or failure. This class is thread-safe.

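It has three member variables:

infos maps each block id to its BlockInfo.

writeLocksByTask maps each task to the ids of the blocks it holds write locks on.

readLocksByTask maps each task to the ids of the blocks it holds read locks on. Read locks are re-entrant, so the ConcurrentHashMultiset allows the same block id to appear more than once.

Its methods are as follows:

Registering a task

Getting the current task

Acquiring a read lock

Approach: if the block exists and no task is writing it, the read lock is granted immediately; otherwise the caller waits on the lock.

Acquiring a write lock

Approach: if the block exists and no task is reading or writing it, the task is recorded in the write-lock map, which means the write lock has been acquired; otherwise the caller waits.

Asserting that a task holds the write lock on a block

Downgrading a write lock

Approach: the task bound to the block is compared with the current task; if they are the same, unlock is called and a read lock is then acquired.

Releasing a lock:

Approach: if the current task holds the write lock, it is released directly; otherwise the reader count is decremented and one read-lock record is removed. Finally the tasks waiting on the lock are woken.

Acquiring the write lock for writing a new block

Releasing all locks held by a given task

Approach: fetch the task's read-lock and write-lock records, then remove every entry in the write-lock set and every entry in the read-lock set.

Removing a block and releasing its write lock

The read and write lock state is reset and the binding between the block id and its BlockInfo is removed.

There are also some query methods, which are not covered in detail.

A brief summary:

Read locks are re-entrant: the same read lock can be acquired repeatedly. A read lock can be acquired as long as no task is writing the block; whether other tasks are reading it does not matter.

A write lock can be held by only one task at a time. It can be acquired only when no task is reading the block and no task is writing it.

This design suits blocks that are read far more often than they are written. Suppose instead that a block were written far more often than read: concurrent writes to the same block by several tasks would become serial and write throughput would suffer, because there is only one BlockInfoManager object, i.e. one monitor, and every waiting writer competes for it. When reads far outnumber writes, readers can read freely: they are essentially never blocked, there is almost no lock-switching overhead, and different tasks can read the same block at the same time, which improves read throughput.

In short, BlockInfoManager implements its own read-write locking for blocks, and the design is a classic one that is well worth studying.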
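The acquisition rules summarized above fit in a small class. The sketch below is a simplified, non-blocking model with hypothetical names (BlockLocks and its methods); it tracks only a reader count and a writer id per block and leaves out the per-task records and the blocking wait.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of per-block read/write lock bookkeeping. */
final class BlockLocks {
    static final long NO_WRITER = -1L;

    private static final class Info {
        int readerCount = 0;
        long writerTask = NO_WRITER;
    }

    private final Map<String, Info> infos = new HashMap<>();

    synchronized void register(String blockId) {
        infos.putIfAbsent(blockId, new Info());
    }

    /** Read lock: granted whenever no task holds the write lock; re-entrant by counting. */
    synchronized boolean tryLockForReading(String blockId) {
        Info info = infos.get(blockId);
        if (info == null || info.writerTask != NO_WRITER) {
            return false;
        }
        info.readerCount++;
        return true;
    }

    /** Write lock: granted only when there are no readers and no writer. */
    synchronized boolean tryLockForWriting(String blockId, long taskId) {
        Info info = infos.get(blockId);
        if (info == null || info.writerTask != NO_WRITER || info.readerCount != 0) {
            return false;
        }
        info.writerTask = taskId;
        return true;
    }

    /** Releases the write lock if one is held, otherwise one read lock. */
    synchronized void unlock(String blockId) {
        Info info = infos.get(blockId);
        if (info == null) {
            return;
        }
        if (info.writerTask != NO_WRITER) {
            info.writerTask = NO_WRITER;
        } else if (info.readerCount > 0) {
            info.readerCount--;
        }
        notifyAll();  // a blocking variant would wait() on this monitor and wake up here
    }
}
```

All state is guarded by one monitor, which keeps the bookkeeping simple; that single monitor is also the serialization point discussed in the summary above.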

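5. RedirectableOutputStream

Documentation:

A wrapper which allows an open [[OutputStream]] to be redirected to a different sink.

That is, the class lets an open OutputStream be redirected to another OutputStream.

The source is simple:

The os member is the OutputStream currently being redirected to.

6. MemoryEntry

A MemoryEntry is essentially one block in memory; it points at the actual data stored in memory.

As the diagram above shows, it has two subclasses:

DeserializedMemoryEntry holds an array of deserialized Java objects. value is the array holding the deserialized data, size is its estimated size in bytes, and classTag records the element class that would otherwise be lost to erasure. This kind of entry can only be kept in on-heap memory.

SerializedMemoryEntry holds the serialized data. buffer is a ChunkedByteBuffer wrapping the actual Array[ByteBuffer] data, memoryMode says whether the data is stored on-heap or off-heap, classTag records the element class from before serialization, and size is the size of the byte data.

7. MemoryEntryBuilder

The build method builds the in-memory data into a MemoryEntry.

8. ValuesHolder

In essence this is an in-memory staging area. Data is written into it temporarily, and its getBuilder method then returns a MemoryEntryBuilder, which is used to construct the MemoryEntry.

storeValue writes a value, and estimatedSize estimates the memory used by the holder. getBuilder returns a MemoryEntryBuilder that can later create the MemoryEntry.

After getBuilder has been called, the holder is closed and no more data may be written to it.

It has two subclasses: DeserializedValuesHolder, which stages Java objects, and SerializedValuesHolder, which stages serialized bytes.

The implementations are as follows:

1. DeserializedValuesHolder

2. SerializedValuesHolder

Next comes the centerpiece of Spark's in-memory storage, MemoryStore.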

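9. MemoryStore

Documentation:

Stores blocks in memory, either as Arrays of deserialized Java objects or as serialized ByteBuffers.

The internal structure of the class is as follows:

Member variables:

entries is essentially a map held in memory from blockId to block contents. Its accessOrder is true, so the most recently accessed entry is moved to the tail of the linked list.

onHeapUnrollMemoryMap maps a taskAttemptId to the on-heap memory that task has reserved for unrolling blocks.

offHeapUnrollMemoryMap maps a taskAttemptId to the off-heap memory that task has reserved for unrolling blocks.

unrollMemoryThreshold is the initial memory requested before a block is unrolled. It can be adjusted with spark.storage.unrollMemoryThreshold and defaults to 1MB.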
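The effect of accessOrder = true on entries is easy to see with a plain java.util.LinkedHashMap. The block ids and sizes below are made up for illustration, and AccessOrderDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

/** Shows how an access-ordered LinkedHashMap yields least-recently-used iteration order. */
final class AccessOrderDemo {
    static List<String> iterationOrder() {
        // The third constructor argument, accessOrder = true, orders entries by last access.
        LinkedHashMap<String, Integer> entries = new LinkedHashMap<>(32, 0.75f, true);
        entries.put("rdd_0_0", 100);
        entries.put("rdd_0_1", 200);
        entries.put("rdd_0_2", 300);
        entries.get("rdd_0_0");  // touching an entry moves it to the tail
        return new ArrayList<>(entries.keySet());
    }

    public static void main(String[] args) {
        System.out.println(iterationOrder());  // prints [rdd_0_1, rdd_0_2, rdd_0_0]
    }
}
```

Iterating from the head therefore visits the least recently used blocks first, which is the order an eviction scan wants.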

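The most important methods are examined directly below:

putBytes: this method is only called by BlockManager. The _bytes callback produces the ChunkedByteBuffer that is cached as-is:

Approach: memory is first requested from the MemoryManager. If the request succeeds, the _bytes callback is invoked to obtain the ChunkedByteBuffer, which is wrapped in a SerializedMemoryEntry and finally cached in entries.

Storing the values of an iterator in memory as Java objects

Approach: the values are wrapped in a DeserializedValuesHolder and putIterator is called. ValuesHolder is the abstraction that lets putIterator cache both serialized bytes and arrays of Java objects.

Storing the values of an iterator in memory as serialized bytes

Approach: the values are wrapped in a SerializedValuesHolder and putIterator is called.

MAX_ROUNDED_ARRAY_LENGTH and unrollMemoryThreshold are defined as follows:

public static int MAX_ROUNDED_ARRAY_LENGTH = Integer.MAX_VALUE - 15;
private val unrollMemoryThreshold: Long = conf.getLong("spark.storage.unrollMemoryThreshold", 1024 * 1024)

unrollMemoryThreshold defaults to 1MB and can be resized with the spark.storage.unrollMemoryThreshold parameter.

Because putIterator takes a ValuesHolder, caching bytes and caching Java objects share one method. The two methods above both call putIterator, shown below:

Approach:

Step 1: define the variables: the initial unroll memory, the unroll memory growth factor, how often to check memory, and so on.

Step 2: request the initial unroll memory from MemoryManager; if that succeeds, record the reservation.

Step 3: enter the while loop at lines 223-240. In this loop:

Loop condition: there are more values to unroll and the last memory request succeeded.
Each iteration adds one value to the ValuesHolder and increments the count of unrolled elements. Every UNROLL_MEMORY_CHECK_PERIOD elements, it checks whether the ValuesHolder's estimated size has reached the memory reserved so far; if it has, more memory is requested from MemoryManager and added to the reserved total.

Step 4:

If the last request to MemoryManager succeeded, get the builder from the ValuesHolder and compute the precise memory cost. If the precise size exceeds the reserved memory, request the difference from MemoryManager and add it to the reserved total.

Otherwise, log the memory usage and return the unroll memory that was reserved for this block.

Step 5:

If the last request to MemoryManager succeeded, call MemoryEntryBuilder's build method to construct a MemoryEntry that can be stored directly, ask MemoryManager to release the unroll memory and acquire storage memory in its place, and make sure the storage memory request succeeds. Finally the entry is put into entries.

Otherwise, log the memory usage and return the unroll memory that was reserved for this block.

A note on the word unroll, which I did not fully understand at first: it refers to moving the collection's data into the staging area, and unroll memory is the memory that operation needs.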
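Steps 2 and 3 can be sketched as a small simulation. It assumes every element costs a fixed number of bytes and that unroll memory comes from a simple counter; UnrollSketch and its members are hypothetical names, and only the check period of 16 and the growth factor of 1.5 come from the text.

```java
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical simulation of the periodic memory check while unrolling an iterator. */
final class UnrollSketch {
    static final int CHECK_PERIOD = 16;       // check memory every 16 elements
    static final double GROWTH_FACTOR = 1.5;  // over-request to avoid frequent checks

    long available;     // hypothetical pool of unroll memory, in bytes
    long reserved = 0;  // unroll memory reserved for this block so far

    UnrollSketch(long available) {
        this.available = available;
    }

    private boolean reserve(long bytes) {
        if (bytes > available) {
            return false;
        }
        available -= bytes;
        reserved += bytes;
        return true;
    }

    /** Returns true if every element was unrolled; each element costs elementSize bytes. */
    boolean unroll(Iterator<?> values, long elementSize, long initialThreshold) {
        long threshold = initialThreshold;
        long currentSize = 0;
        int elementsUnrolled = 0;
        boolean keepUnrolling = reserve(initialThreshold);  // step 2: initial reservation
        while (values.hasNext() && keepUnrolling) {         // step 3: the unroll loop
            values.next();
            currentSize += elementSize;
            if (elementsUnrolled % CHECK_PERIOD == 0 && currentSize >= threshold) {
                long amountToRequest = (long) (currentSize * GROWTH_FACTOR - threshold);
                keepUnrolling = reserve(amountToRequest);
                threshold += amountToRequest;
            }
            elementsUnrolled++;
        }
        return keepUnrolling;
    }

    public static void main(String[] args) {
        UnrollSketch sketch = new UnrollSketch(10_000);
        boolean ok = sketch.unroll(Collections.nCopies(100, "x").iterator(), 10, 100);
        System.out.println(ok + ", reserved " + sketch.reserved + " bytes");
    }
}
```

Because the size is only checked periodically, the reserved amount can lag behind the real size when the loop ends, which is why the precise check in step 4 is needed.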

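The constants and methods this method depends on:

unrollMemoryThreshold was covered above. The constants UNROLL_MEMORY_CHECK_PERIOD and UNROLL_MEMORY_GROWTH_FACTOR are defined as follows:

That is, UNROLL_MEMORY_CHECK_PERIOD defaults to 16 and UNROLL_MEMORY_GROWTH_FACTOR defaults to 1.5.

reserveUnrollMemoryForThisTask is shown below. Roughly, it requests unroll memory from MemoryManager and, on success, records the newly reserved memory in the on-heap or off-heap unroll memory map according to memoryMode.

releaseUnrollMemoryForThisTask is shown below. It picks the on-heap or off-heap map according to memoryMode and subtracts the released amount from the task's unroll memory; if the task's usage reaches 0, the task's entry is removed from the map.

Logging a block's unroll memory and the current memory usage

Getting cached values:

Approach: look up the MemoryEntry in entries by blockId and extract the data according to the entry's type.

Removing a block or clearing the cache is simple and needs little explanation:

Trying to evict blocks to free a given amount of memory for storing a given block. The code is as follows:

The method takes three parameters: the blockId that memory is being allocated for, the size of the block, and the memory mode (on-heap or off-heap).

Lines 469-485: the dropBlock method. It gets the data from the MemoryEntry and calls BlockManager to drop the block from memory. If the block's StorageLevel allows disk, the block is written to disk first and then removed from memory, and its StorageLevel is updated. Finally the new StorageLevel is checked: if the block is still in memory or on disk, the lock is released; otherwise the block is removed from BlockInfoManager.

Line 443: find the RDD that the block belongs to.

Lines 451-467: lock entries, then iterate over it, checking whether each block can be evicted from memory. If it can, the block is added to selectedBlocks and its size is added to freedMemory.

Line 461 calls lockForWriting in non-blocking mode: if the write lock cannot be acquired immediately, the call returns at once and that block is skipped instead of waiting. Why take the write lock first? Because a write lock is exclusive and not re-entrant: once it is held, no other task can read or write the block.

Lines 487-528: if the memory that can be freed is less than the new block needs, the write locks are released, none of the selected blocks are evicted, and the method returns.

If the memory that can be freed is enough for the new block, each selected block is visited in turn: its entry is fetched and dropBlock is called, and the method returns the amount of memory freed. The finally block covers the case where the thread is interrupted during dropBlock, which would otherwise leave the write locks on the remaining blocks unreleased.

The methods it depends on are as follows:

When storing to memory fails, a PartiallySerializedBlock or a PartiallyUnrolledIterator is returned.

PartiallyUnrolledIterator is an Iterator that can be used to traverse the block's data and that also releases the unroll memory.

PartiallySerializedBlock can turn the failed block into a PartiallyUnrolledIterator for traversal, discard the failed block directly, or write the data out to a given OutputStream; in each case the unroll memory is released.

Summary:

This section covered Spark's in-memory storage: the locking mechanism implemented by BlockInfoManager; MemoryEntry, MemoryEntryBuilder, and the other classes around the ValuesHolder staging area; and the centerpiece, MemoryStore, including storing blocks, releasing blocks, and evicting blocks to make room for a new one.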

V. Spark disk storage

1. Overview

Disk storage is comparatively simple. The related classes are shown below:

We start with the dependency DiskBlockManager.

2. DiskBlockManager

Documentation:

Creates and maintains the logical mapping between logical blocks and physical on-disk locations. One block is mapped to one file with a name given by its BlockId. Block files are hashed among the directories listed in spark.local.dir (or in SPARK_LOCAL_DIRS, if it's set).

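It creates and maintains the mapping between logical blocks and the physical files they are written to. A logical block is mapped to a concrete file through the name of its BlockId.

1. Class structure

Its class structure is as follows:

The class exists to create and maintain the mapping between logical blocks and their files on disk. There are two ways to keep such a mapping: store every key-value pair explicitly in a Map, or define a mapping function, like a partition function, that maps a given key to its value.

2. Member variables

The member variables are as follows:

subDirsPerLocalDir: the number of subdirectories under each local directory. It defaults to 64 and is set with spark.diskStore.subDirectories.

subDirs: a two-dimensional array representing the combinations of local directories and subdirectories, i.e. ${local dir 1 ... local dir n}/${subdir 1 ... subdir 64}

localDirs: the root directories where blocks are written locally, obtained from the createLocalDirs method:

Approach: it calls Utils.getConfiguredLocalDirs to get the configured directories, then maps over them and calls Utils.createDirectory to create, under each configured directory, a directory whose name starts with blockmgr. The createDirectory method it relies on is as follows:

This method retries up to 10 times, to guard against the new directory's name colliding with an existing one.

The getConfiguredLocalDirs method is as follows:

Most production deployments run on YARN, so we go straight to where the directories are under Spark on YARN. Look at the getYarnLocalDirs method:

How is LOCAL_DIRS defined?

Since the tasks run on YARN, the next step is to track it down in the Hadoop YARN container source code.

3. Locating the LOCAL_DIRS environment variable

The sanitizeEnv method of the ContainerLaunch class contains the following statement:

The addToMap method is as follows:

That is, the value is added to the environment map and to the nmVars set.

sanitizeEnv is called from ContainerLaunch's call method:

The appDirs variable is defined as follows:

So each appDir has the form ${localDir}/usercache/${user}/appcache/${application-id}/

localDirs is defined as follows:

dirHandler is a LocalDirsHandlerService. It is a service, and its serviceInit method instantiates a MonitoringTimerTask:

The MonitoringTimerTask constructor contains:

The NM_LOCAL_DIRS constant is defined as follows:

That is, the yarn.nodemanager.local-dirs parameter, which is defined in yarn-default.xml.

So the local directory is:

${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/${application-id}/

Combined with createDirectory, the local directory used for disk storage is:

${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/${application-id}/blockmgr-<random UUID>/

4. Core methods

Creating a File object from a file name:

Approach: first compute a non-negative hash code from filename, which is the name field of the BlockId (abs(hashcode)).

dirId is the index of the local directory (counting from 0) and subDirId is the index of the subdirectory within it (counting from 0). The two are joined into the directory subDir.

A File object is then created with subDir as its parent and returned.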
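The two-level hashing can be sketched as below. The class and method names are hypothetical, and the exact formulas for dirId and subDirId, as well as the two-digit hexadecimal subdirectory name, are assumptions about the listing that is not reproduced here.

```java
import java.io.File;

/** Hypothetical sketch of mapping a block file name to a directory by hashing. */
final class BlockFileLocator {
    /** Non-negative hash; Math.abs(Integer.MIN_VALUE) is still negative, so map it to 0. */
    static int nonNegativeHash(String s) {
        int h = s.hashCode();
        return h == Integer.MIN_VALUE ? 0 : Math.abs(h);
    }

    static File locate(String[] localDirs, int subDirsPerLocalDir, String filename) {
        int hash = nonNegativeHash(filename);
        int dirId = hash % localDirs.length;                            // which local dir
        int subDirId = (hash / localDirs.length) % subDirsPerLocalDir;  // which subdir in it
        File subDir = new File(localDirs[dirId], String.format("%02x", subDirId));
        return new File(subDir, filename);
    }

    public static void main(String[] args) {
        String[] dirs = {"/data1/blockmgr-a", "/data2/blockmgr-b"};  // made-up paths
        System.out.println(locate(dirs, 64, "rdd_0_0"));
    }
}
```

Because the location is a pure function of the file name, no lookup table has to be stored or kept in sync: any component that knows the BlockId can recompute the path.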

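The methods related to getFile are as follows:

They are simple and need little explanation.

Creating a temporary block, either a temporary local block or a temporary shuffle block:

There is one more method, the callback run when DiskBlockManager is stopped:

deleteFilesOnStop controls whether the locally stored block files are deleted when DiskBlockManager stops; if it is true, they are removed.

When BlockManager initializes the DiskBlockManager, deleteFilesOnStop is passed in through the constructor.

Summary: DiskBlockManager creates and maintains the mapping between logical blocks and their files on disk, and it also creates the temporary files used for shuffle and for local blocks.

Next come the classes that DiskStore may use, together with their related classes.

3. CountingWritableChannel

It wraps a sink and, while writing to it, records the total amount of data written. The source is as follows:

The code is simple and needs little explanation.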
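Since the listing is not reproduced here, the following is a minimal sketch of such a counting wrapper. CountingChannel, getCount, and the countBytes helper are hypothetical names; only the idea of wrapping a sink and summing the bytes written comes from the text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

/** Hypothetical channel wrapper that counts the bytes written to the wrapped sink. */
final class CountingChannel implements WritableByteChannel {
    private final WritableByteChannel sink;
    private long count = 0L;

    CountingChannel(WritableByteChannel sink) {
        this.sink = sink;
    }

    long getCount() {
        return count;
    }

    @Override
    public int write(ByteBuffer src) throws IOException {
        int written = sink.write(src);  // delegate, then record what was actually written
        if (written > 0) {
            count += written;
        }
        return written;
    }

    @Override
    public boolean isOpen() {
        return sink.isOpen();
    }

    @Override
    public void close() throws IOException {
        sink.close();
    }

    /** Demo helper: writes the chunks to an in-memory sink and returns the byte count. */
    static long countBytes(byte[]... chunks) {
        CountingChannel channel =
            new CountingChannel(Channels.newChannel(new ByteArrayOutputStream()));
        try {
            for (byte[] chunk : chunks) {
                ByteBuffer buf = ByteBuffer.wrap(chunk);
                while (buf.hasRemaining()) {
                    channel.write(buf);
                }
            }
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return channel.getCount();
    }
}
```

Counting the return value of write rather than the size of the buffer matters because a channel may accept only part of a buffer in one call.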

4. ManagedBuffer

Class description:

This interface provides an immutable view for data in the form of bytes. The implementation should specify how the data is provided: - FileSegmentManagedBuffer: data backed by part of a file - NioManagedBuffer: data backed by a NIO ByteBuffer - NettyManagedBuffer: data backed by a Netty ByteBuf The concrete buffer implementation might be managed outside the JVM garbage collector. For example, in the case of NettyManagedBuffer, the buffers are reference counted. In that case, if the buffer is going to be passed around to a different thread, retain/release should be called.

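The class structure is as follows:

5. EncryptedManagedBuffer

It is an adapter that delegates almost all conversion requests to blockData. Let's look at the classes related to it.

First, the BlockData interface that the wrapped blockData implements.

6. BlockData

The interface description is as follows:

It is an interface that abstracts how a block is stored and provides several ways to read the underlying block data.

The methods it defines are as follows:

The methods are described below:

toInputStream returns an input stream for reading the block.

toNetty returns a Netty-friendly wrapper around the block data so that Netty can read it.

toChunkedByteBuffer wraps the block in a ChunkedByteBuffer.

toByteBuffer converts the block data into a ByteBuffer that can be read directly in memory.

When all operations on the block are finished, dispose must be called to do the cleanup.

size is the size of the block.

It has three implementations: DiskBlockData, EncryptedBlockData and ByteBufferBlockData.

These correspond to the three forms a block can take: on disk, encrypted, and as a ByteBuffer in memory.

Each is described below.

7. DiskBlockData

This class converts a block file on disk into the requested stream or object.

Let's start with its simple methods.

Constructor:

The relevant fields are:

minMemoryMapBytes is the minimum block size at which a disk block is memory-mapped. The default is 2 MB and it can be adjusted with spark.storage.memoryMapThreshold.

maxMemoryMapBytes is the maximum size of a memory-mapped chunk of a disk block. The default is (Integer.MAX_VALUE - 15) bytes and it can be adjusted with spark.storage.memoryMapLimitForTests.

The corresponding source is as follows:

The simpler methods are as follows:

size returns the size of the block file directly.

dispose is an empty implementation.

open is a private method that obtains the FileChannel used to read the block file.

toByteBuffer is implemented as follows:

The tryWithResource method of Utils is shown below. It first calls createResource, then calls the apply method of the function object, and finally releases the resource: create the resource, use the resource, release the resource.

That is, it first obtains the FileChannel for the block file. If blockSize is smaller than the minimum memory-map size, the channel's data is read into a buffer and a HeapByteBuffer is returned. The data is written to the heap, so this is a non-direct buffer and the data passes through intermediate memory. Otherwise, the map method of FileChannelImpl is used and a MappedByteBuffer is returned.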
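The two branches can be sketched as below. This is a simplified illustration under stated assumptions: `BlockReader.toByteBuffer` is a hypothetical helper, it uses `FileChannel.open` rather than whatever the original `open` method does, and it assumes the file fits in a single buffer (size below `Int.MaxValue`).

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.channels.FileChannel.MapMode
import java.nio.file.StandardOpenOption

object BlockReader {
  // Small files are copied into a heap buffer; larger ones are memory-mapped.
  def toByteBuffer(file: File, minMemoryMapBytes: Long): ByteBuffer = {
    val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
    try {
      val size = channel.size()
      if (size < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(size.toInt) // heap (non-direct) buffer
        // Keep reading until the buffer is full or end of file is reached.
        while (buf.hasRemaining && channel.read(buf) >= 0) {}
        buf.flip()
        buf
      } else {
        // The mapping stays valid after the channel is closed.
        channel.map(MapMode.READ_ONLY, 0, size)
      }
    } finally {
      channel.close()
    }
  }
}
```

The threshold exists because setting up a memory mapping has a fixed cost, so for very small blocks a plain copy into the heap is usually the cheaper option.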

The MappedByteBuffer documentation says:

A direct byte buffer whose content is a memory-mapped region of a file.
Mapped byte buffers are created via the FileChannel.map method. This class extends the ByteBuffer class with operations that are specific to memory-mapped file regions.
A mapped byte buffer and the file mapping that it represents remain valid until the buffer itself is garbage-collected.
The content of a mapped byte buffer can change at any time, for example if the content of the corresponding region of the mapped file is changed by this program or another. Whether or not such changes occur, and when they occur, is operating-system dependent and therefore unspecified. 
All or part of a mapped byte buffer may become inaccessible at any time, for example if the mapped file is truncated. An attempt to access an inaccessible region of a mapped byte buffer will not change the buffer's content and will cause an unspecified exception to be thrown either at the time of the access or at some later time. It is therefore strongly recommended that appropriate precautions be taken to avoid the manipulation of a mapped file by this program, or by a concurrently running program, except to read or write the file's content.
Mapped byte buffers otherwise behave no differently than ordinary direct byte buffers.

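It is a direct buffer, meaning the data is read from the mapped file without first being copied into intermediate heap memory. See the ByteBuffer documentation on direct vs. non-direct buffers: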

Direct vs. non-direct buffers
A byte buffer is either direct or non-direct. Given a direct byte buffer, the Java virtual machine will make a best effort to perform native I/O operations directly upon it. That is, it will attempt to avoid copying the buffer's content to (or from) an intermediate buffer before (or after) each invocation of one of the underlying operating system's native I/O operations.
A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
A direct byte buffer may also be created by mapping a region of a file directly into memory. An implementation of the Java platform may optionally support the creation of direct byte buffers from native code via JNI. If an instance of one of these kinds of buffers refers to an inaccessible region of memory then an attempt to access that region will not change the buffer's content and will cause an unspecified exception to be thrown either at the time of the access or at some later time.
Whether a byte buffer is direct or non-direct may be determined by invoking its isDirect method. This method is provided so that explicit buffer management can be done in performance-critical code.

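The toChunkedByteBuffer method is as follows:

First, a ChunkedByteBuffer holds its data as multiple small chunks rather than as one contiguous array.

The file is read into in-memory HeapByteBuffer objects, one per chunk. Each chunk is added to a ListBuffer of chunks, which is finally converted to an Array and stored in the ChunkedByteBuffer.

toNetty is implemented as follows:

DefaultFileRegion is described further below, so it is not covered here.

8. EncryptedBlockData

This class converts an encrypted block file on disk into a specific stream or object.

The constructor is as follows:

file is the block file, blockSize is the size of the block file, and key is the encryption key.

Let's look at three simple methods first:

The open method no longer just obtains a FileChannelImpl from a FileInputStream. After obtaining the FileChannelImpl, it calls the createReadableChannel method of CryptoStreamUtils, as follows:

The channel is further wrapped in a CryptoInputStream. A read on the ErrorHandlingReadableChannel is actually a read from the CryptoInputStream, which holds a cipher initialized from the key. This cipher is responsible for decrypting the data.

The toByteBuffer method is as follows:

Idea: if the block size is within the Int range, the encrypted block is decrypted and kept in memory.

Apart from the decryption step, toChunkedByteBuffer is the same as toChunkedByteBuffer in DiskBlockData and needs no further explanation. The code is as follows:

The source of the toNetty method is as follows:

The ReadableChannelFileRegion class is introduced below, so it is not covered here.

The source of the toInputStream method is as follows:

Idea: the InputStream cannot be taken directly from what open returns, because the returned channel offers no way to obtain an InputStream. Channels.newInputStream returns a ChannelInputStream, which decorates the channel.

9. ByteBufferBlockData

It is simple overall. The main thing to look at is the dispose method. The dispose method of ChunkedByteBuffer is as follows:

That is, it uses the dispose method of StorageUtils to clean up each chunk. The dispose method of StorageUtils is as follows:

That is, it obtains the buffer's cleaner and calls the cleaner's clean method. Let's take DirectByteBufferR as an example for further explanation:

The Cleaner is initialized in its constructor, as follows:

base is the memory address returned by the allocateMemory method of Unsafe after allocating memory of the given size, and size is the size of that memory.

Class declaration:

Indeed, it is a phantom reference: once its referent is no longer strongly reachable, the garbage collector can enqueue it for cleanup.

The Cleaner constructor is as follows:

var1 is the object to be cleaned, and var2 is the Runnable that performs the cleanup.

Now its member variables:

Indeed, it is itself a node in a doubly linked list, and the Cleaner class also holds the head of that list.

Its create method is as follows:

Idea: create a cleaner and add it to the doubly linked list.

The clean method of Cleaner is as follows:

It first calls the remove method. If that succeeds, it runs the memory cleanup task. Note that no asynchronous task is involved here: the run method of the Runnable is called synchronously.

The remove method is as follows:

Idea: remove the given cleaner from the doubly linked list.

The Deallocator class is as follows:

The allocateMemory method of Unsafe uses off-heap memory. Memory allocated this way is not on the heap and is not managed by the GC, so Unsafe.freeMemory() is used to release it.

It first releases the memory through Unsafe and then calls the unreserveMemory method of Bits:

This concludes the dispose method.

Next, let's look at the inheritance hierarchy of ReadableChannelFileRegion:

We walk through the classes in this order: ReferenceCounted --> FileRegion --> AbstractReferenceCounted --> AbstractFileRegion --> ReadableChannelFileRegion.

10. ReferenceCounted

The class description is as follows: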


A reference-counted object that requires explicit deallocation. When a new ReferenceCounted is instantiated, it starts with the reference count of 1. retain() increases the reference count, and release() decreases the reference count. If the reference count is decreased to 0, the object will be deallocated explicitly, and accessing the deallocated object will usually result in an access violation. If an object that implements ReferenceCounted is a container of other objects that implement ReferenceCounted, the contained objects will also be released via release() when the container's reference count becomes 0.

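This is an interface in the netty package.

It is a reference-counted object that requires explicit deallocation.

When a ReferenceCounted object is instantiated, its reference count is set to 1. Calling retain increases the reference count, and calling release decreases it.

If the reference count drops to 0, the object is explicitly deallocated, and accessing a deallocated object usually causes an access violation.

If an object that implements ReferenceCounted is a container of other objects that implement ReferenceCounted, then when the container's reference count drops to 0 the contained objects must also be released through release, that is, their reference counts are decreased by 1.

There are four groups of core methods: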

retain：Increases the reference count by 1 or the specified increment.

touch：Records the current access location of this object for debugging purposes. If this object is determined to be leaked, the information recorded by this operation will be provided to you via ResourceLeakDetector. This method is a shortcut to touch(null).

release：Decreases the reference count by 1 and deallocates this object if the reference count reaches at 0. Returns true if and only if the reference count became 0 and this object has been deallocated

refCnt：Returns the reference count of this object. If 0, it means this object has been deallocated.

11. FileRegion

It is also an interface in netty. A FileRegion transfers its data to a target channel through a channel that supports zero-copy.

A region of a file that is sent via a Channel which supports zero-copy file transfer .

Note: zero-copy file transfer has requirements on the JDK version and the operating system:

FileChannel.transferTo(long, long, WritableByteChannel) has at least four known bugs in the old versions of Sun JDK and perhaps its derived ones. Please upgrade your JDK to 1.6.0_18 or later version if you are going to use zero-copy file transfer.
If your operating system (or JDK / JRE) does not support zero-copy file transfer, sending a file with FileRegion might fail or yield worse performance. For example, sending a large file doesn't work well in Windows.
Not all transports support it

The interface structure is as follows:

The methods it adds are explained below:

count：Returns the number of bytes to transfer.

position：Returns the offset in the file where the transfer began.

transferred：Returns the bytes which was transfered already.

transferTo：Transfers the content of this file region to the specified channel.

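12. AbstractReferenceCounted

This class uses a single variable to record increases and decreases of the reference count.

The class structure is as follows:

Member variables first:

refCnt is the volatile variable that records the reference count internally. refCntUpdater is a constant of type AtomicIntegerFieldUpdater, which uses reflection to atomically update a volatile int field of a class.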

A reflection-based utility that enables atomic updates to designated volatile int fields of designated classes. This class is designed for use in atomic data structures in which several fields of the same node are independently subject to atomic updates.
Note that the guarantees of the compareAndSet method in this class are weaker than in other atomic classes. Because this class cannot ensure that all uses of the field are appropriate for purposes of atomic access, it can guarantee atomicity only with respect to other invocations of compareAndSet and set on the same updater.

The methods are as follows:

Setting or getting the refCnt variable:

Increasing the reference count:

Decreasing the reference count:
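The retain/release contract can be sketched with a compare-and-set loop. This is an illustration and not Netty's implementation: it uses an `AtomicInteger` in place of the `AtomicIntegerFieldUpdater` described above, and the names `RefCounted` and `TrackedResource` are hypothetical.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Simplified stand-in for a reference-counted base class.
abstract class RefCounted {
  private val refCnt = new AtomicInteger(1) // a new object starts with a count of 1

  def count: Int = refCnt.get

  // Called exactly once, when the count reaches 0.
  protected def deallocate(): Unit

  def retain(): this.type = {
    var done = false
    while (!done) {
      val cur = refCnt.get
      if (cur <= 0) throw new IllegalStateException("already deallocated")
      done = refCnt.compareAndSet(cur, cur + 1)
    }
    this
  }

  // Returns true only if this call dropped the count to 0 and deallocated the object.
  def release(): Boolean = {
    var deallocated = false
    var done = false
    while (!done) {
      val cur = refCnt.get
      if (cur <= 0) throw new IllegalStateException("already deallocated")
      if (refCnt.compareAndSet(cur, cur - 1)) {
        done = true
        if (cur == 1) {
          deallocate()
          deallocated = true
        }
      }
    }
    deallocated
  }
}

// Example subclass that only records whether it was freed.
class TrackedResource extends RefCounted {
  @volatile var freed = false
  override protected def deallocate(): Unit = freed = true
}
```

The compare-and-set loop is what makes concurrent retain and release calls safe without a lock: a thread that loses the race simply re-reads the count and tries again.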

13. AbstractFileRegion

AbstractFileRegion extends AbstractReferenceCounted, but it is still an abstract class that implements only part of the functionality, as follows:

14. DefaultFileRegion

The documentation says:

Default FileRegion implementation which transfer data from a FileChannel or File. Be aware that the FileChannel will be automatically closed once refCnt() returns 0.

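Let's look at its main member variables first:

f: the source file to transfer.

file: the source FileChannel to transfer.

position: the byte position at which the transfer starts.

count: the total number of bytes to transfer.

transferred: the number of bytes already transferred.

The source of the key method transferTo is as follows:

Idea: first compute the total number of bytes still to be transferred, then transfer to the given target sink starting at the offset relative to position.

Note: the position argument is an offset relative to the region's initial start position; the absolute position is this.position + position.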
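The relative-to-absolute position arithmetic can be sketched as follows. This is a simplified model, not Netty's `DefaultFileRegion`: the class `SimpleFileRegion` is hypothetical, it reopens the file on every call instead of caching the channel, and it omits reference counting.

```scala
import java.io.File
import java.nio.channels.{FileChannel, WritableByteChannel}
import java.nio.file.StandardOpenOption

// A region of `count` bytes starting at absolute offset `position` in `file`.
class SimpleFileRegion(file: File, val position: Long, val count: Long) {
  private var transferredBytes = 0L

  def transferred: Long = transferredBytes

  // relPos is relative to the start of the region, not to the start of the file.
  def transferTo(target: WritableByteChannel, relPos: Long): Long = {
    val remaining = count - relPos
    require(relPos >= 0 && remaining >= 0, s"position out of range: $relPos")
    if (remaining == 0) 0L
    else {
      val ch = FileChannel.open(file.toPath, StandardOpenOption.READ)
      try {
        // Absolute file offset = region start + relative offset.
        val written = ch.transferTo(position + relPos, remaining, target)
        if (written > 0) transferredBytes += written
        written
      } finally ch.close()
    }
  }
}
```

For a region with position 2 and count 5, a call with a relative position of 1 starts at absolute offset 3 and has 4 bytes left to send.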

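The open method is shown below. It returns a FileChannel that supports random access to the file.

Its deallocate method is as follows:

Idea: drop the member variable's reference to the FileChannel so that the FileChannel can be garbage collected, then close the FileChannel.

Summary: it obtains a FileChannel (a FileChannelImpl) with random access through RandomAccessFile, computes the absolute position and the total number of bytes to transfer from the relative position, and finally transfers the data to the target.

Its reference counting is handled by the corresponding methods of its parent class AbstractReferenceCounted.

15. ReadableChannelFileRegion

Its source is as follows:

Its internal buffer is 64 KB, and the _transferred variable records the number of bytes already transferred. A ReadableByteChannel is read sequentially, so the pos parameter is not used.

Next, let's focus on DiskStore.

16. DiskStore

It is what saves blocks to disk.

The constructor is as follows:

It has three member variables:

blockSizes records the mapping between each block's blockId and its size. The size of the block with a given blockId can be obtained through the get method, as follows:

The putBytes method is as follows:

putBytes writes data to disk. getBytes returns BlockData. Note that it only returns a reference to the file, not the file contents, so the various BlockData conversions described above work directly against the FileChannel, that is, the local file. This makes full use of features such as zero-copy and makes data transfer more efficient.

The put method is as follows:

The idea is simple. It first obtains from diskManager the File object that represents the block's file on disk, then obtains a FileChannel and calls the callback function to write the data into the local block file. It then records the block and its size, and finally closes the out channel. If an exception is thrown along the way, the partially written data is cleaned up, so that the write is atomic (it either fully succeeds or fully fails).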
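The write-then-clean-up-on-failure pattern can be sketched as below. This is a minimal model under stated assumptions, not Spark's DiskStore: `MiniDiskStore` is a hypothetical class, it maps a block id directly to a file name in one directory, and it ignores encryption.

```scala
import java.io.{File, FileOutputStream}
import java.nio.channels.WritableByteChannel
import scala.collection.concurrent.TrieMap

class MiniDiskStore(dir: File) {
  private val blockSizes = TrieMap.empty[String, Long]

  def getSize(blockId: String): Long = blockSizes.getOrElse(blockId, 0L)
  def contains(blockId: String): Boolean = new File(dir, blockId).exists()

  // Runs writeFunc against the block file; on any failure the file is removed again.
  def put(blockId: String)(writeFunc: WritableByteChannel => Unit): Unit = {
    val file = new File(dir, blockId)
    if (file.exists()) throw new IllegalStateException(s"Block $blockId is already present")
    val out = new FileOutputStream(file).getChannel
    var threwException = true
    try {
      writeFunc(out)
      blockSizes.put(blockId, out.position()) // bytes written so far
      threwException = false
    } finally {
      try {
        out.close()
      } finally {
        if (threwException) {
          blockSizes.remove(blockId)
          file.delete()
        }
      }
    }
  }
}
```

A caller passes the write logic as a function, for example `store.put("b1")(ch => { ch.write(buffer); () })`, so the store controls opening, closing and cleanup while the caller controls only the bytes.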

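The methods that put depends on are as follows:

The openForWrite method first obtains a FileChannel and then, if the data is encrypted, creates an encrypting channel on top of it to handle the encrypted data.

Summary: this part introduced DiskBlockManager, which maintains the mapping between a blockId and the physical block file; how the LOCAL_DIRS environment variable is defined in Hadoop YARN; the BlockData interface, which defines how a block is stored and how it is converted to a stream, a channel or other objects, together with its three concrete implementations; the Cleaner mechanism for cleaning up direct ByteBuffer memory and its related classes; the DefaultFileRegion and ReadableChannelFileRegion classes used for data transfer and their related classes; and finally DiskStore, the centerpiece of disk storage, with a focus on its methods for storing and deleting data.

Shortcomings: this part does not cover NIO for disk IO or the related Netty classes in much detail; the relevant documentation can be read for a deeper understanding. After all, dealing with the disk efficiently is an important skill. If there is a chance later, the JDK source for collections, IO and multithreading may get a thorough analysis, but that is for another time. For now the plan is to go through the parts of Spark that seem most important.

VI. Analysis of the Spark storage system

1. Overview

First, the relationships among the classes related to BlockManager are as follows:

Starting from NettyRpcEnv, here is a brief description.

SecurityManager is mainly responsible for security authentication of the underlying communication.

BlockManagerMaster is mainly responsible for communication between the executor side and the driver, and wraps the driver's RpcEndpointRef.

NettyBlockTransferService uses Netty to fetch a set of blocks.

MapOutputTracker is a class that tracks the locations of a stage's map output. The driver and the executors have their own implementations, MapOutputTrackerMaster and MapOutputTrackerWorker respectively.

ShuffleManager is initialized in SparkEnv and exists on both the driver and the executors. It is responsible for registering shuffles on the driver and for reading and writing shuffle data on the executors.

BlockManager is the core class of the Spark storage system. It runs on every node (driver or executor) and provides reading and writing of local or remote blocks on various storage media, including disk, on-heap memory and off-heap memory.

Below we go through the classes in the diagram that have not been analyzed before:

2. SecurityManager

1. Overview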

Spark class responsible for security. In general this class should be instantiated by the SparkEnv and most components should access it from that. There are some cases where the SparkEnv hasn't been initialized yet and this class must be instantiated directly. This class implements all of the configuration related to security features described in the "Security" document. Please refer to that document for specific features implemented here.

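This class is mainly responsible for Spark security. It is normally initialized by SparkEnv.

2. Class structure

Its structure is as follows:

3. Member variables

WILDCARD_ACL: the constant *, meaning that all groups or users have view or modify permission.

authOn: whether authentication is enabled for network transport. Controlled by spark.authenticate, default false.

aclsOn: whether ACLs are enabled. Controlled by spark.acls.enable or spark.ui.acls.enable, default false.

adminAcls: administrator ACL. Controlled by spark.admin.acls, default is the empty string.

adminAclsGroups: administrator group ACL. Controlled by spark.admin.acls.groups, default is the empty string.

viewAcls: users on the view access control list.

viewAclsGroups: groups on the view access control list.

modifyAcls: users on the modify access control list.

modifyAclsGroups: groups on the modify access control list.

defaultAclUsers: default users on the access control list. Set from the user.name system property together with the SPARK_USER environment variable.

secretKey: the secret key.

hadoopConf: the Hadoop configuration object.

defaultSSLOptions: the default SSL options, as follows:

The parse method of SSLOptions, shown below, is mainly used to load security-related configuration:

defaultSSLOptions is used together with the getSSLOptions method:

4. Core methods

The setters and getters for adminAcls, viewAclsGroups, modifyAcls and modifyAclsGroups are simple and are not described further.
Checking UI view permission and modify permission:

Getting the secret key:

Getting the security user:

Initializing security:

5. Summary

This class is used for Spark security. It mainly contains methods for setting and getting permissions, obtaining the secret key, obtaining the security user, and verifying permissions.

Next, let's look at the BlockManagerMaster class.

3. BlockManagerMaster

1. Overview and class structure

It mainly consists of utility functions that obtain node, block or BlockManager information through the driver.

2. Member variables

driverEndpoint is an EndpointRef object. It may refer to the local driver endpoint or to a remote endpoint, so it can be used to communicate with the local driver as well as with a remote driver endpoint.

timeout is the Spark RPC timeout. The default is 120s, and it can be set with spark.rpc.askTimeout or spark.network.timeout.

Core methods:

Removing an executor, in a synchronous and an asynchronous variant. These two methods are used only on the driver side. As follows:

Registering a BlockManager with the driver

Updating block information

Asking the driver for the locations of a block

Asking the driver for information about all BlockManagers in the cluster

Asking the driver for an executor endpoint ref

Removing a block, RDD, shuffle or broadcast

Asking the driver for the memory status of every BlockManager

Asking the driver for the disk status

Asking the driver for the block status

Whether there are matching blocks

Checking whether a block is cached

Its helper method tell is as follows:

Summary

BlockManagerMaster is mainly responsible for interacting with the driver to obtain information related to the underlying storage.

4. ShuffleClient

1. Class description

2. Core methods

init initializes the ShuffleClient and requires the appId of the application the executor belongs to.
fetchBlocks asynchronously requests blocks from another node. Its parameters are: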

host – the host of the remote node. port – the port of the remote node. execId – the executor id. blockIds – block ids to fetch. listener – the listener to receive block fetching status. downloadFileManager – DownloadFileManager to create and clean temp files. If it's not null, the remote blocks will be streamed into temp shuffle files to reduce the memory usage, otherwise, they will be kept in memory.

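shuffleMetrics records shuffle-related metrics.

5. BlockTransferService

1. Class description

2. Core methods

init: it additionally provides initialization with a BlockDataManager, which makes it convenient to fetch blocks from local storage or to store blocks locally.

close: closes the ShuffleClient.

port: the port the service is listening on.

hostname: the hostname the service is listening on.

fetchBlocks: same as in the parent class and not implemented here. Because of the inheritance relationship it could be left out.

uploadBlocks: uploads a block to a remote node and returns a future.

fetchBlockSync: synchronously fetches a block from a remote node and returns only when the block data has been fetched, as follows:

It defines the basic framework for handling the result after a block has been fetched.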
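Turning a callback-style fetch into a blocking call can be sketched with a Promise. This is an illustration only: `FetchListener`, `SyncFetch.fetchSync` and the `asyncFetch` parameter are hypothetical names, and the block data is modeled as a plain byte array rather than a ManagedBuffer.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Hypothetical listener with one callback per outcome.
trait FetchListener {
  def onSuccess(blockId: String, data: Array[Byte]): Unit
  def onFailure(blockId: String, error: Throwable): Unit
}

object SyncFetch {
  // asyncFetch stands for any callback-based fetch; it is supplied by the caller.
  def fetchSync(blockId: String, timeout: Duration)(
      asyncFetch: (String, FetchListener) => Unit): Array[Byte] = {
    val result = Promise[Array[Byte]]()
    asyncFetch(blockId, new FetchListener {
      override def onSuccess(id: String, data: Array[Byte]): Unit = {
        result.trySuccess(data)
        ()
      }
      override def onFailure(id: String, error: Throwable): Unit = {
        result.tryFailure(error)
        ()
      }
    })
    // Blocks the calling thread until a callback completes the promise or the timeout expires.
    Await.result(result.future, timeout)
  }
}
```

The listener only completes the promise, so the same asynchronous fetch path serves both the blocking and the non-blocking caller.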

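uploadBlockSync: uploads synchronously and returns only when the upload has succeeded. As follows:

3. The three subclasses of ManagedBuffer

Let's look at three subclasses of ManagedBuffer: FileSegmentManagedBuffer, EncryptedManagedBuffer and NioManagedBuffer.

FileSegmentManagedBuffer: a ManagedBuffer backed by a segment of a file.

EncryptedManagedBuffer: a ManagedBuffer backed by a segment of an encrypted file.

NioManagedBuffer: a ManagedBuffer backed by a ByteBuffer.

6. NettyBlockTransferService

Class description:

It is a BlockTransferService that uses Netty to fetch a set of shuffle blocks at a time.

1. Member variables

hostname: the hostname the TransportServer listens on.

serializer: a JavaSerializer instance used to serialize and deserialize Java objects.

authEnabled: whether authentication is enabled.

transportConf: a TransportConf object, mainly used to initialize settings such as the number of shuffle threads, spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads. By default the number of threads is in the range [1, 8], depending on the number of available cores and the number of cores specified. These two parameters determine the number of threads on the underlying Netty server side and client side.

transportContext: a TransportContext, the context used to create the TransportServer and TransportClient.

server: the TransportServer object, the Netty server side.

clientFactory: a TransportClientFactory used to create TransportClient instances.

appId: the application id, given by the spark.app.id parameter.

Core methods

init mainly initializes the underlying Netty server and client, as follows:

Closing the ShuffleClient:

Uploading data:

config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM is determined by the spark.maxRemoteBlockSizeFetchToMem parameter, and its default is Int.MaxValue - 512.

So block data whose size is within the Int range is handled by Netty RPC. 128 MB is clearly within the Int range, so by this reasoning blocks the size of an HDFS block are transferred by Spark through Netty RPC.

The source for fetching block data from a remote node is as follows:

First, fetching supports retries. The default number of retries is 3 and can be set with spark.shuffle.io.maxRetries. The actual remote fetch is done by OneForOneBlockFetcher.

2. Design of the retry mechanism for fetching remote blocks

When the number of retries is not greater than 0, BlockFetchStarter is used directly to create a OneForOneBlockFetcher that fetches the data.

When the number is greater than 0, RetryingBlockFetcher is used to fetch the data with retries.

Its member variables first:

executorService: the shared thread pool used to wait and to run retry tasks.

fetchStarter: initializes the OneForOneBlockFetcher object.

listener: the listener notified of block fetch success or failure.

maxRetries: the maximum number of retries.

retryWaitTime: the wait time before the next retry. It can be set with spark.shuffle.io.retryWait, default 5s.

retryCount: the number of retries performed so far.

outstandingBlocksIds: the set of blockIds that still need to be fetched.

currentListener: it listens only to the results of the current fetcher.

Core methods:

Idea: first, initialize the list of blockIds to fetch, the retry count and currentListener. Then call fetchStarter to start the fetch. Each time a block is fetched successfully, the success method of currentListener is called; on failure, its failure method is called. If an exception occurs during the fetch, it first decides whether a retry is needed. If so, it initiates the retry and submits the wait-then-fetch task to the shared thread pool.

Now the related methods and classes:

The RetryingBlockFetchListener class has two methods, a callback for fetch success and a callback for fetch failure.

The success callback first checks whether the current currentListener is this listener itself and whether the returned blockId is in the list of blockIds to fetch. If both conditions hold, the blockId is removed from the list and the corresponding success method of listener is called.

The failure callback first checks whether the current currentListener is this listener itself and whether the returned blockId is in the list of blockIds to fetch. If both conditions hold, it then decides whether a retry is needed. If so, it initiates a retry; otherwise it calls the failure method of listener directly.

Whether a retry is needed:

Idea: retry if the exception is an IO exception and there are retries left.

Initiating a retry:

Summary: the retrying block fetcher introduces an intermediate layer, the custom RetryingBlockFetchListener, to handle retries and event propagation (that is, calling the success or failure method of the original listener), as well as updating the list of blockIds to fetch and the retry count.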
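The retry decision and the bookkeeping of outstanding blocks can be sketched as follows. This is a deliberately simplified, synchronous model and not the `RetryingBlockFetcher` source: the real class is asynchronous and listener-driven, whereas here `RetryingFetch.fetchAll` and the `fetchOne` parameter are hypothetical and the wait happens on the calling thread.

```scala
import java.io.IOException

object RetryingFetch {
  // Retry only for IO errors (directly or as the cause) while attempts remain.
  def shouldRetry(e: Throwable, retryCount: Int, maxRetries: Int): Boolean = {
    val isIOException = e.isInstanceOf[IOException] ||
      (e.getCause != null && e.getCause.isInstanceOf[IOException])
    isIOException && retryCount < maxRetries
  }

  // Fetches every block, re-attempting only the ones still outstanding after a failure.
  def fetchAll(blockIds: Seq[String], maxRetries: Int, retryWaitMs: Long)(
      fetchOne: String => Array[Byte]): Map[String, Array[Byte]] = {
    var outstanding = blockIds.toList
    var fetched = Map.empty[String, Array[Byte]]
    var retryCount = 0
    while (outstanding.nonEmpty) {
      val id = outstanding.head
      try {
        fetched += (id -> fetchOne(id))
        outstanding = outstanding.tail // success: drop the block from the outstanding list
      } catch {
        // Non-retryable errors are not matched here and propagate to the caller.
        case e: Exception if shouldRetry(e, retryCount, maxRetries) =>
          retryCount += 1
          Thread.sleep(retryWaitMs)
      }
    }
    fetched
  }
}
```

Blocks that were already fetched are never requested again, which mirrors the role of the outstanding blockId set described above.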

7. MapOutputTracker

1. Class description

Its class structure is as follows:

2. Member variables

trackerEndpoint: an EndpointRef object, the executor-side proxy for the driver-side MapOutputTrackerMasterEndpoint.

epoch：The driver-side counter is incremented every time that a map output is lost. This value is sent to executors as part of tasks, where executors compare the new epoch number to the highest epoch number that they received in the past. If the new epoch number is higher then executors will clear their local caches of map output statuses and will re-fetch (possibly updated) statuses from the driver.

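epochLock: a lock object.

3. Core methods

Sending a message to the driver-side trackerEndpoint

An executor obtains the block information for the range that each task in a shuffle needs to read. The partition range includes the start and excludes the end.

Deleting the status information of a given shuffle

Stopping the service

Its subclasses MapOutputTrackerMaster and MapOutputTrackerWorker will be described further in the later analysis of shuffle.

8. ShuffleManager

1. Class description

Class structure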

registerShuffle：Register a shuffle with the manager and obtain a handle for it to pass to tasks.
getWriter：Get a writer for a given partition. Called on executors by map tasks.
getReader：Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive). Called on executors by reduce tasks.
unregisterShuffle：Remove a shuffle's metadata from the ShuffleManager.
shuffleBlockResolver：Return a resolver capable of retrieving shuffle block data based on block coordinates.
stop：Shut down this ShuffleManager.

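It has a single subclass, SortShuffleManager, which will be described further when we analyze the Spark shuffle process.

Next, let's look at the centerpiece of the Spark storage system: BlockManager.

9. BlockManager

1. Class description

It runs on every node (driver or executor) and provides reading and writing of local or remote blocks on various storage media, including disk, on-heap memory and off-heap memory.

2. Constructor

Most of the variables involved have already been described and are not repeated here.

The class is very large, so its class structure diagram is not shown. Its member variables and more important methods are described below.

3. Member variables

externalShuffleServiceEnabled: whether the external shuffle service is enabled. Configured with spark.shuffle.service.enabled, default false.

remoteReadNioBufferConversion: whether xxxxx. Configured with spark.network.remoteReadNioBufferConversion, default false.

diskBlockManager: a DiskBlockManager object that manages the mapping between blocks and physical block files.

blockInfoManager: a BlockInfoManager object, the block read/write lock manager.

futureExecutionContext: an ExecutionContextExecutorService that wraps a thread pool. The thread name prefix is block-manager-future and the maximum number of threads is 128.

memoryStore: a MemoryStore object used for memory storage.

diskStore: a DiskStore object used for disk storage.

maxOnHeapMemory: the maximum on-heap memory.

maxOffHeapMemory: the maximum off-heap memory.

externalShuffleServicePort: the port of the external shuffle service. Set with spark.shuffle.service.port, default 7337.

blockManagerId: a BlockManagerId object, the unique identifier of this BlockManager.

shuffleServerId: a BlockManagerId object, the unique identifier of the BlockManager that serves shuffle data.

shuffleClient: if the external shuffle service is enabled, that is, externalShuffleServiceEnabled is true, an ExternalShuffleClient is used; otherwise the blockTransferService object passed in through the constructor is used.

maxFailuresBeforeLocationRefresh: the maximum number of fetch failures before block locations are refreshed from the driver. Set with spark.block.failures.beforeLocationRefresh, default 5.

slaveEndpoint: the ref of BlockManagerSlaveEndpoint, which listens for and handles requests from the master.

asyncReregisterTask: the asynchronous re-registration task.

asyncReregisterLock: a lock object.

cachedPeers: all other BlockManagers in the Spark cluster.

peerFetchLock: a lock object used when fetching all BlockManagers in the Spark cluster.

lastPeerFetchTime: the last time all BlockManagers in the Spark cluster were fetched.

blockReplicationPolicy: a BlockReplicationPolicy object. It has two implementations, BasicBlockReplicationPolicy and RandomBlockReplicationPolicy.

remoteBlockTempFileManager: a RemoteBlockDownloadFileManager object.

maxRemoteBlockToMem: controlled by spark.maxRemoteBlockSizeFetchToMem, default Int.MaxValue - 512.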

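4. Core methods [short version]

Note: not much analysis is given here, since most of the content was already covered in the earlier parts on memory storage and disk storage.

Initialization method

Idea: initialize blockReplicationPolicy, which can be specified with spark.storage.replication.policy and defaults to RandomBlockReplicationPolicy; initialize the BlockManagerId and register this BlockManager with the driver; initialize shuffleServerId.

Re-registering the BlockManager with the driver:

Idea: register the BlockManager with the driver through BlockManagerMaster.

Getting block data, as follows:

Its helper method getLocalBytes is shown below. Idea: if the data is shuffle data, the block is obtained through shuffleBlockResolver; otherwise a read lock is acquired through BlockInfoManager and the data is then read.

The doGetLocalBytes method is shown below. Idea: handle the block according to whether it needs deserialization and whether it is stored on disk. The operations rely directly on MemoryStore and DiskStore.

Storing block data by calling putBytes directly:

Its helper method is shown below and calls doPutBytes directly:

The doPutBytes method is as follows:

The doPut method is shown below. Idea: acquire the write lock and run the putBody function:

Saving serialized bytes

Saving Java objects:

Caching data that was read in memory:

Getting information about the other BlockManagers in the Spark cluster:

Replicating a block to other replicas:

Its helper method is as follows:

Evicting a block from memory:

Removing a block:

The stop method

BlockManager mainly provides reading and writing of local or remote blocks on various storage media, including disk, on-heap memory and off-heap memory, as well as methods for obtaining information about the BlockManagers in the Spark cluster, evicting blocks from memory, and so on.

Its remote interaction relies on the underlying Netty module. Many of the storage methods depend on the implementations of MemoryStore and DiskStore and are not explained one by one.

10. Summary

This part introduced the last portion of the Spark storage system. It was written somewhat hastily and some classes may have been left out, but it is more than enough for understanding the Spark storage system. Local storage relies on MemoryStore and DiskStore. Remote calls rely on NettyBlockTransferService, BlockManagerMaster, MapOutputTracker and others, most of which communicate with the driver or other executors through Netty.

Spark shuffle, broadcast and similar features also rely on the storage system. Next we move to the core of Spark and explore how RDDs are turned into stages and how each job runs.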

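Chapter 7. Spark source analysis: task scheduling and computation

I. DAG generation and stage division

Before discussing the DAG, a few words about RDDs.

1. An overview of RDDs

The documentation says:

RDD stands for Resilient Distributed Dataset. It is the basic abstraction in Spark and represents an immutable, partitioned dataset that can be computed in parallel.

Characteristics of an RDD:

It contains a list of partitions
A function is computed on each split
It depends on other RDDs
For key-value RDDs there is a partitioner
Each computation has preferred locations

For more, see the Spark paper: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

RDD operations

RDDs support two types of operations:

transformation: creates a new dataset from an existing one. It is lazy: all operations that produce an RDD are lazy, meaning that they do not compute a result right away. They only remember the base dataset they depend on (a file, a message queue and so on), and the transformations are computed only when an action needs a result returned to the driver. This design makes RDD computation more efficient.
action: runs a computation on the dataset and returns a value to the driver.

Note: reduce is an action, while reduceByKey is a transformation, because it returns a distributed dataset and does not return data to the driver node.

2. Action functions

The official documentation lists the RDD action functions as follows:

Note: these are only the common functions, not all action functions.

3. Characteristics of action functions

So what characterizes an action function?

As described above, an action returns a value to the driver node. In other words, its return value is a concrete non-RDD value or Unit, not an RDD.

4. Transformation functions

The official documentation lists the transformation functions as follows:

5. Characteristics of transformation functions

As mentioned above, a transformation takes an existing dataset and returns the result of the computation as a new RDD. In other words, its return value is an RDD.

6. Summary

Once the characteristics of actions and transformations are understood, the definition of a function is enough to tell whether it is an action or a transformation.

2. RDD dependencies

In the official documentation, the discussion of RDD operations is followed immediately by shuffle, so we follow the same order.

1. Shuffle

The official explanation of shuffle is as follows:

Note: a shuffle happens only for particular operations and has nothing to do with the distinction between actions and transformations.

The official documentation gives some common examples.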

Operations which can cause a shuffle include repartition operations like repartition and coalesce, ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

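2. The four kinds of RDD dependencies

So what is shuffle related to?

Shuffle is related to dependencies. RDD dependencies are divided into wide dependencies and narrow dependencies. There are three kinds of narrow dependency: one-to-one, range and prune. There is only one kind of wide dependency, the shuffle dependency.

That is, if an RDD's dependency on its parent RDD is a wide dependency, then a shuffle takes place when the child RDD is produced from the parent RDD.

As shown in the figure:

The figure also shows that not every join is a wide dependency.

3. How dependencies appear in the source

What we usually call an RDD is an abstract class in Spark, from which all RDD subclasses inherit. Its full name is org.apache.spark.rdd.RDD, as follows:

It has two parameters. One is the SparkContext and the other is deps, a collection of Dependency objects. Dependency is the common parent class of all dependencies, so deps holds the dependencies on the parent RDDs.

The parent class of the narrow dependencies is NarrowDependency, whose constructor takes the parent RDD as a parameter. The constructor of the wide dependency, ShuffleDependency, also takes the parent RDD as a parameter.

3. The variability of RDD dependencies

1. The getDependencies method

This is only the default method defined in the abstract RDD parent class; different subclasses have different implementations.

It is re-implemented in the following classes:

Whether a dependency is a shuffle dependency also depends on the partitioning, as can be seen in the dependency implementations of the following RDDs:

2. CoGroupedRDD

3. SubtractedRDD

4. The importance of the DAG in a Spark job

As shown in the figure below, the execution of an application is divided into four phases:

Phase 1: we write the driver program and define the actions and transformations on RDDs. These dependencies form the DAG of operations.

Phase 2: based on the DAG, DAGScheduler divides it into different stages.

Phase 3: each stage has a TaskSet. DAGScheduler hands the TaskSet to TaskScheduler for execution, and once the tasks are done TaskScheduler returns the results to DAGScheduler.

Phase 4: TaskScheduler dispatches the tasks to the worker nodes for execution, and the results are returned to TaskScheduler.

This part focuses on phases 1 and 2. Phases 3 and 4 are covered later.

Note: the source of the figure is unknown.

5. Creating the DAG

Let's start with a top N example.

1. A real top N example

Requirement: a large file contains many repeated integers; find the 10 numbers that occur most often.

The code is as follows (a few repartition calls were added on purpose to get more stages):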

scala> val sourceRdd = sc.textFile("/tmp/hive/hive/result", 10).repartition(5)
sourceRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at repartition at <console>:27

scala> val allTopNs = sourceRdd.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).repartition(10).sortByKey(ascending = true, 100).map(tup => (tup._2, tup._1)).mapPartitions(
     |   iter => {
     |     iter.toList.sortBy(tup => tup._1).takeRight(100).iterator
     |   }
     | ).collect()
// result omitted

scala> val finalTopN = scala.collection.SortedMap.empty[Int, String].++(allTopNs)
// result omitted

scala> finalTopN.takeRight(10).foreach(tup => { println(tup._2 + " occurs times : " + tup._1) })
53 occurs times : 1070
147 occurs times : 1072
567 occurs times : 1073
931 occurs times : 1075
267 occurs times : 1077
768 occurs times : 1080
612 occurs times : 1081
877 occurs times : 1082
459 occurs times : 1084
514 occurs times : 1087

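Now let's look at the generated DAG and stages.

Job overview

The Description shows the last method of each job.

DAG of stages 0 to 3:

DAG of stages 4 to 8:

The Description of each stage shows the last method of that stage.

2. Summary

As can be seen, the RDD dependencies are formed by the operations that the driver performs on RDDs.

The DAG within a stage is built from the RDD dependencies.

Let's look at the source.

6. Stage

1. Constructor

The parameters are as follows: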

id – Unique stage ID
rdd – RDD that this stage runs on: for a shuffle map stage, it's the RDD we run map tasks on, while for a result stage, it's the target RDD that we ran an action on
numTasks – Total number of tasks in stage; result stages in particular may not need to compute all partitions, e.g. for first(), lookup(), and take().
parents – List of stages that this stage depends on (through shuffle dependencies).
firstJobId – ID of the first job this stage was part of, for FIFO scheduling.
callSite – Location in the user program associated with this stage: either where the target RDD was created, for a shuffle map stage, or where the action for a result stage was called

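callSite simply records the location in user code that corresponds to the stage.

2. Member variables

3. Member methods

These are relatively simple.

4. Subclasses of Stage

It has two subclasses, as follows:

7. ResultStage

Class description: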

ResultStages apply a function on some partitions of an RDD to compute the result of an action. The ResultStage object captures the function to execute, func, which will be applied to each partition, and the set of partition IDs, partitions. Some stages may not run on all partitions of the RDD, for actions like first() and lookup().

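A ResultStage applies a function to some partitions of an RDD to compute the result of an action. For operations such as first() and lookup(), some stages may not run on all partitions of the RDD.

In short, a ResultStage applies the action's function to the RDD to produce the result of the computation.

The source is as follows:

8. ShuffleMapStage

1. Class description

A ShuffleMapStage is an intermediate stage that produces data for a shuffle. Such stages occur before a shuffle. When they finish, the result data is saved so that the reduce tasks can fetch it.

2. Constructor

shuffleDep records the shuffle that each stage belongs to.

9. Dividing stages

As mentioned above, every RDD has dependencies on its parent RDDs, and these dependencies form a directed acyclic graph, the DAG.

When a user runs an action on an RDD, the scheduler examines the RDD's lineage (that is, its dependencies) to build the DAG of stages to execute.

As shown in the figure below:

Before discussing stage division, let's go through the classes related to DAGScheduler.

10. EventLoop

1. Class description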

Note: The event queue will grow indefinitely. So subclasses should make sure onReceive can handle events in time to avoid the potential OOM.

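It defines the framework for asynchronous message handling.

2. Message queue

Internally it has a blocking double-ended queue that holds the messages:

3. Posting to the message queue

External threads call the post method to put events into the blocking queue:

4. Consumer thread

There is a thread that consumes the messages:

onReceive is an abstract method implemented by subclasses.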
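The three parts above (queue, post, consumer thread) can be sketched as a small class. This is an illustrative model and not Spark's EventLoop source: `MiniEventLoop` is a hypothetical name, and details such as startup and shutdown hooks are left out.

```scala
import java.util.concurrent.{BlockingQueue, LinkedBlockingDeque}
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

abstract class MiniEventLoop[E](name: String) {
  // Unbounded queue: it grows if onReceive cannot keep up with post.
  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  // Single daemon thread that takes events off the queue and dispatches them.
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try onReceive(event)
          catch { case NonFatal(e) => onError(e) }
        }
      } catch {
        case _: InterruptedException => // stop() interrupts take(); exit quietly
      }
    }
  }

  def start(): Unit = eventThread.start()

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      eventThread.interrupt()
      eventThread.join()
    }
  }

  // Called by producer threads; returns as soon as the event is queued.
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}
```

Because one thread consumes the queue, handlers run one at a time and in posting order, which is why a subclass can update its state in onReceive without extra locking.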

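Next, its implementation class: DAGSchedulerEventProcessLoop.

It receives events of type DAGSchedulerEvent. DAGSchedulerEvent is a sealed trait, and its implementations are as follows:

Each of its event subclasses has a corresponding branch in the doOnReceive method, as follows:

11. DAGScheduler

The definition of this class is already over 2,000 lines long, so it is not covered in full. This part covers only the fields and methods related to creating stages and tasks.

1. Class description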

The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run the job. It then submits stages as TaskSets to an underlying TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent tasks that can run right away based on the data that's already on the cluster (e.g. map output files from previous stages), though it may fail if this data becomes unavailable.

Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage, but operations with shuffle dependencies require multiple stages (one to write a set of map output files, and another to read those files after a barrier). In the end, every stage will have only shuffle dependencies on other stages, and may compute multiple operations inside it. The actual pipelining of these operations happens in the RDD.compute() functions of various RDDs

In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred locations to run each task on, based on the current cache status, and passes these to the low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being lost, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task a small number of times before cancelling the whole stage. When looking through this code, there are several key concepts:

- Jobs (represented by ActiveJob) are the top-level work items submitted to the scheduler. For example, when the user calls an action, like count(), a job will be submitted through submitJob. Each Job may require the execution of multiple stages to build intermediate data.

- Stages (Stage) are sets of tasks that compute intermediate results in jobs, where each task computes the same function on partitions of the same RDD. Stages are separated at shuffle boundaries, which introduce a barrier (where we must wait for the previous stage to finish to fetch outputs). There are two types of stages: ResultStage, for the final stage that executes an action, and ShuffleMapStage, which writes map output files for a shuffle. Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.

- Tasks are individual units of work, each sent to one machine.

- Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them and likewise remembers which shuffle map stages have already produced output files to avoid redoing the map side of a shuffle.

- Preferred locations: the DAGScheduler also computes where to run each task in a stage based on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.

- Cleanup: all data structures are cleared when the running jobs that depend on them finish, to prevent memory leaks in a long-running application.

To recover from failures, the same stage might need to run multiple times, which are called "attempts". If the TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost stage(s) that compute the missing tasks. As part of this process, we might also have to create Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since tasks from the old attempt of a stage could still be running, care must be taken to map any events received in the correct Stage object.

Here's a checklist to use when making or reviewing changes to this class:

- All data structures should be cleared when the jobs involving them end to avoid indefinite accumulation of state in long-running programs.

- When adding a new data structure, update DAGSchedulerSuite.assertDataStructuresEmpty to include the new structure. This will help to catch memory leaks.

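Next, let us look directly at how stages are divided.

12. Stage division as seen in the source code

1. From the action function to the DAGScheduler

The collect function is defined as follows:

It calls SparkContext's runJob method, which passes through several overloads before reaching the final runJob method:

Internally, that method calls the DAGScheduler's runJob method.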

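2. How the DAGScheduler divides stages

The DAGScheduler's runJob method is as follows:

The idea: submitting the job returns a JobWaiter object. The caller waits on it until the job completes and then runs the success or failure handling that matches the job's final state.

submitJob is as follows:

The job is wrapped in a JobSubmitted event, and that event is posted to the eventProcessLoop object, which is defined as follows:

In other words, the event enters the asynchronous message-processing model, DAGSchedulerEventProcessLoop, mentioned above.

In DAGSchedulerEventProcessLoop's doOnReceive, the branch that handles the JobSubmitted event is:

That branch calls the DAGScheduler's handleJobSubmitted method, shown below:

This method does two things:

Create the ResultStage
Submit the stage

This article analyzes only the first step; the second is covered in the next article.

The createResultStage method is as follows:

The getOrCreateParentStages method creates or looks up the stages for the RDD's shuffle dependencies, and the stages are then divided along those shuffle dependencies. The source is as follows:

It first collects the RDD's nearest ancestor shuffle dependencies. The getShuffleDependencies method, shown below, works like a depth-first traversal of a tree.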
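The traversal described above can be sketched in a few lines. This is a minimal illustration, not Spark's code: Node, Narrow, and Shuffle are hypothetical stand-ins for RDD, NarrowDependency, and ShuffleDependency. The sketch walks narrow dependencies with an explicit stack and stops at the first shuffle on each path, so only the nearest shuffle dependencies are returned.

```scala
import scala.collection.mutable

object ShuffleDependencySketch {
  // Hypothetical stand-ins for Spark's RDD and Dependency classes.
  sealed trait Dep
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node, shuffleId: Int) extends Dep
  final case class Node(name: String, deps: Seq[Dep] = Nil)

  // Depth-first walk that follows narrow dependencies only and records
  // the shuffle id of every shuffle boundary it meets first.
  def immediateShuffleIds(root: Node): Set[Int] = {
    val found = mutable.Set.empty[Int]
    val visited = mutable.Set.empty[Node]
    var waiting: List[Node] = List(root)
    while (waiting.nonEmpty) {
      val node = waiting.head
      waiting = waiting.tail
      if (visited.add(node)) {
        node.deps.foreach {
          case Shuffle(_, id) => found += id               // stop at the boundary
          case Narrow(parent) => waiting = parent :: waiting
        }
      }
    }
    found.toSet
  }
}
```

Shuffles that lie behind another shuffle are deliberately not reported; they belong to an earlier stage and are discovered when that stage is created.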

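The getOrCreateShuffleMapStage method creates a ShuffleMapStage from a shuffle dependency, as shown below. The idea: first check whether a stage for this shuffle is already recorded in shuffleIdToMapStage. If it is, the stage has already been created. Otherwise, find the shuffle dependencies of the ancestor RDDs that have no stage yet, create stages for them, and then create the ShuffleMapStage itself.

shuffleIdToMapStage is defined as follows:

This map only contains stages that belong to currently running jobs.

The unique id of a shuffle dependency is its shuffleId, a globally unique id generated by SparkContext.

The getMissingAncestorShuffleDependencies method is shown below. The idea: traverse the dependency graph depth-first and collect every ancestor shuffle dependency that is not yet registered in shuffleIdToMapStage.

That completes the logic for finding shuffle dependencies. Next comes the method that creates the ShuffleMapStage:

The idea: build the ShuffleMapStage, update stageIdToStage and shuffleIdToMapStage, register the shuffle with MapOutputTrackerMaster if it has not been registered yet, and finally return the ShuffleMapStage.

The updateJobIdStageIdMaps method is shown below. The idea: assign the given jobId, the same one used by the ResultStage, to every ShuffleMapStage that the ResultStage depends on:

This completes the analysis of how stages are divided.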

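13. Summary

Following the official documentation, this article described the main RDD operations, actions and transformations, then introduced RDD dependencies, and finally analyzed how the DAGScheduler divides stages along shuffle dependencies.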

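II. Stage submission

1. Introduction

2. Continuing from the previous article

In the previous article we analyzed only the stage-creation part of the DAGScheduler's handleJobSubmitted method. Now we look at the source for the stage-submission part.

1. How a stage is submitted

First an ActiveJob object is constructed and the cached block location information is cleared. Then the mapping from jobId to the job object is recorded in the jobIdToActiveJob map, and the jobId is added to the set of active jobs.

Next, the unique ids of all stages of the job are collected, each id is resolved to its stage object, and the stage's latestInfo method is called to obtain its StageInfo.

These are wrapped in a SparkListenerJobStart event and posted to listenerBus. listenerBus is a LiveListenerBus object that internally holds a collection of four message queues.

Finally, submitStage is called to submit the stage.

Let us first look at the description of ActiveJob.

2. The ActiveJob class

It represents a job running in the DAGScheduler. There are two kinds of job: a result job, which computes a ResultStage to execute an action, and a map-stage job, which computes the map outputs of a ShuffleMapStage before any downstream stage is submitted.

Constructor

finalStage is the last stage of the job.

3. Preparation before submitting a stage

Let us go straight to the submitStage method:

The idea: first find the stage's missing parent stages, that is, parents whose output is not yet available. If there are none, the stage's tasks are submitted. Otherwise submitStage is called recursively on each missing parent, which in turn checks the grandparents, and the current stage is left waiting.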
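The recursion can be sketched as follows. This is a simplified illustration under stated assumptions: Stage here is a hypothetical case class, and the sketch flattens the process into a submission order, whereas the real scheduler parks a stage in a waiting set and resubmits it when its parents finish.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical stage model: `done` means the stage's output is already available.
  final case class Stage(id: Int, parents: Seq[Stage] = Nil, done: Boolean = false)

  // Returns stage ids in the order in which their tasks would be submitted.
  def submissionOrder(finalStage: Stage): Seq[Int] = {
    val order = mutable.ArrayBuffer.empty[Int]
    def submit(stage: Stage): Unit = {
      if (!order.contains(stage.id)) {
        val missing = stage.parents.filterNot(_.done).sortBy(_.id)
        missing.foreach(submit)   // missing parents first, recursively
        order += stage.id         // then the stage itself
      }
    }
    submit(finalStage)
    order.toList
  }
}
```

Parents whose output already exists are skipped, which is why a cached or previously computed shuffle does not cause its stage to run again.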

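1. Finding missing parent stages

The getMissingParentStages method is as follows:

The idea: it keeps creating parent stages as needed. See the previous article, "spark 源碼分析之十九 -- DAG的生成和Stage的划分", for more detail.

4. Submitting a stage

The submitMissingTasks method is very long. To make it easier to analyze, it is split by function into the following parts:

1. Determining which partitions the stage needs to compute

The org.apache.spark.scheduler.ResultStage#findMissingPartitions method is as follows:

The org.apache.spark.scheduler.ShuffleMapStage#findMissingPartitions method is as follows:

The org.apache.spark.MapOutputTrackerMaster#findMissingPartitions method is as follows:

2. Recording the stage and its partitions in the OutputCommitCoordinator

OutputCommitCoordinator's stageStart is implemented as follows:

In essence, it just puts the stage into a map.

3. Getting the preferred locations of the partitions

The idea: use the stage's RDD and the partition id to look up the preferred locations of that partition in the RDD.

Here is the getPreferredLocs method:

Its comment says that it is thread-safe. Let us see how that is implemented in the getPreferredLocsInternal method.

This method covers four cases:

If this (RDD, partition) pair has already been visited, return Nil.
If the partition is cached, return the locations recorded in the cache.
If the RDD has its own placement preferences, as is the case for input such as HDFS files, use the preferred locations recorded by the RDD.
If none of the three cases above applies and the RDD has narrow dependencies, call the method recursively to get the preferred locations of the parent RDD partitions that the child partition depends on.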
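The four cases can be written down as a small resolver. This is a sketch only: Rdd is a hypothetical case class whose fields (cached, ownPrefs, narrowParents) are invented for illustration, and narrow parents are assumed to be one-to-one so the same partition index is used for the parent.

```scala
import scala.collection.mutable

object PreferredLocsSketch {
  // Hypothetical RDD model; field names are illustrative, not Spark's.
  final case class Rdd(
      id: Int,
      cached: Map[Int, Seq[String]] = Map.empty,    // partition -> hosts holding a cached copy
      ownPrefs: Map[Int, Seq[String]] = Map.empty,  // partition -> hosts preferred by the RDD itself
      narrowParents: Seq[Rdd] = Nil)                // one-to-one parents only

  def preferredLocs(rdd: Rdd, partition: Int): Seq[String] =
    resolve(rdd, partition, mutable.Set.empty[(Int, Int)])

  private def resolve(rdd: Rdd, partition: Int, visited: mutable.Set[(Int, Int)]): Seq[String] = {
    if (!visited.add((rdd.id, partition))) Nil      // case 1: already visited
    else {
      val cached = rdd.cached.getOrElse(partition, Nil)
      val own = rdd.ownPrefs.getOrElse(partition, Nil)
      if (cached.nonEmpty) cached                    // case 2: cached locations
      else if (own.nonEmpty) own                     // case 3: the RDD's own preferences
      else rdd.narrowParents.iterator                // case 4: first parent with an answer
        .map(parent => resolve(parent, partition, visited))
        .find(_.nonEmpty)
        .getOrElse(Nil)
    }
  }
}
```

The visited set is what keeps the recursion from exploring the same (RDD, partition) pair repeatedly in a diamond-shaped lineage.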

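The two middle cases deserve a closer look.

Reading from the cache

The getCacheLocs method is as follows:

The idea: check the RDD's storage level first. If the RDD has no storage level, return Nil. Otherwise build a BlockId for each partition of the RDD, ask the BlockManager in the storage system for the block locations, convert them into TaskLocation objects, and return them.

Getting the RDD's preferred locations

The RDD's preferredLocations method is as follows:

The idea: if the RDD is checkpointed, use the locations of the checkpoint data. Otherwise call the RDD's own getPreferredLocations, which returns Nil by default.

The returned objects are TaskLocation instances, described briefly below.

TaskLocation

Class description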

A location where a task should run. This can either be a host or a (host, executorID) pair. In the latter case, we will prefer to launch the task on that executorID, but our next level of preference will be executors on the same host if this is not possible.

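It has three subclasses:

The three classes are defined as follows:

They are simple and need little explanation.

The TaskLocation companion object is shown below. The method used here is the second apply method:

4. Creating a new StageInfo

The corresponding method is:

The org.apache.spark.scheduler.Stage#makeNewStageAttempt method is as follows:

It is simple and mainly calls StageInfo's fromStage method.

First, the StageInfo class.

StageInfo

StageInfo wraps information about a stage and is used to pass stage information to the scheduler and to SparkListeners.

Its companion object is as follows:

5. Broadcasting the task function

The corresponding source is:

The broadcast mechanism distributes the data to the driver and to every executor in the Spark cluster. For the implementation details of broadcast, see …

6. Generating the task set

Different kinds of task are generated depending on the type of the stage. More about tasks is covered in stage four of this series.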

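7. The TaskScheduler submits the TaskSet

The corresponding code is:

Here taskScheduler is a TaskSchedulerImpl, the only implementation of TaskScheduler. It is responsible for scheduling tasks.

The org.apache.spark.scheduler.TaskSchedulerImpl#submitTasks method is implemented as follows:

The createTaskSetManager method is as follows:

SchedulableBuilder is the interface for building the tree of Schedulables.

schedulableBuilder is defined as follows:

schedulingMode can be changed through the spark.scheduler.mode parameter and defaults to FIFO.

schedulableBuilder is initialized as follows:

The addTaskSetManager method of the FIFO schedulableBuilder is as follows:

It calls the addSchedulable method of the internal Pool object:

More about TaskSetManager is covered in stage four of this series.

backend is a SchedulerBackend instance. It is initialized by createTaskScheduler during SparkContext initialization.

In YARN mode there are two implementations: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend for yarn-client mode and org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend for yarn-cluster mode.

Both classes are defined in the yarn directory under the resource-managers directory of the Spark project. Kubernetes and Mesos are supported as well but are not discussed here.

The inheritance hierarchy of these two classes is:

org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend#reviveOffers is implemented as follows:

It sends a ReviveOffers request to the driver.

On the driver side, the receive method of CoarseGrainedSchedulerBackend has the following branch for this event:

Internally this goes through a series of RPC steps. For RPC, see "spark 源碼分析之十二--Spark RPC剖析之Spark RPC總結".

That branch calls the driver-side makeOffers method:

5. Summary

This article analyzed how the stages produced by the DAGScheduler are submitted to the TaskScheduler, and how the TaskScheduler hands the TaskSet on through the SchedulerBackend.

Next comes the part where tasks actually run, which the next article covers in detail. SchedulerBackend, Task, and other classes closely tied to task execution are also explained more fully there.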

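III. The task execution flow

1. Introduction

As the figure below shows, the previous two articles analyzed DAG construction, stage division, and the submission of stages as TaskSets. This article analyzes the whole execution flow of a task after its TaskSet has been submitted by the TaskScheduler. How each task actually computes differs fundamentally between the two kinds of stage, and that is left to the next article.

We start with the subclasses of SchedulerBackend. In YARN mode there are two implementations: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend for yarn-client mode and org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend for yarn-cluster mode, as shown in the figure below.

Both classes are defined in the yarn directory under the resource-managers directory of the Spark project.

Below is a brief look at the definitions and implementations of these classes.

2. ExecutorAllocationClient

In short, this class is responsible for requesting executors from, or killing executors through, the cluster manager. Its core methods are listed below without further explanation; see the source for details.

3. SchedulerBackend

1. Main methods

killTask: asks an executor to kill a running task

applicationId: returns the application's id

applicationAttemptId: returns the id of the current application attempt

getDriverLogUrls: returns the URLs of the driver logs. These URLs are shown as links in the Executors tab of the UI for the driver.

maxNumConcurrentTasks: the maximum number of tasks that can currently run concurrently

Now let us look at its subclass.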

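4. CoarseGrainedSchedulerBackend

1. Class declaration

A scheduler backend that waits for coarse-grained executors to connect. This backend holds onto each executor for the duration of the Spark job rather than relinquishing executors whenever a task is done and asking the scheduler to launch a new executor for each new task. Executors may be launched in a variety of ways, such as Mesos tasks for the coarse-grained Mesos mode or standalone processes for Spark's standalone deploy mode (spark.deploy.*).

2. The inner class DriverEndpoint

Class description and structure

rpcEnv refers to the NettyRpcEnv on each node.

executorsPendingLossReason: the executors that have been lost but whose loss reason is not yet known

addressToExecutorId: the mapping from each executor's address to its executor id

Next, Task and its inheritance hierarchy.

5. Task

1. Class description

It is the basic unit of execution.

2. Class structure

Its internal structure is:

Its core methods are described below.

runTask: runs the task. It is called by the run method and is abstract, so subclasses implement it.

kill: kills the task. The source is as follows:

Next, its inheritance hierarchy.

3. Inheritance hierarchy

The inheritance hierarchy of Task is as follows: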

A unit of execution. We have two kinds of Task's in Spark:

- org.apache.spark.scheduler.ShuffleMapTask

- org.apache.spark.scheduler.ResultTask

A Spark job consists of one or more stages.

The very last stage in a job consists of multiple ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task and sends the task output back to the driver application. A ShuffleMapTask executes the task and divides the task output to multiple buckets (based on the task's partitioner).

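Now let us see how each of the two Task implementations defines runTask.

6. ResultTask

Class name: org.apache.spark.scheduler.ResultTask

Its runTask method is as follows:

7. ShuffleMapTask

Class name: org.apache.spark.scheduler.ShuffleMapTask

Its runTask method is as follows:

8. Executor

Full name: org.apache.spark.executor.Executor

1. Class description

The Executor object is the abstraction of a Spark executor, backed by a thread pool that runs tasks. The source suggests that Spark's term "Executor" comes from the Executors of the Java thread-pool API.

Below we mainly analyze its internal structure.

2. The thread pool that runs tasks

The thread pool is defined as follows:

3. The heartbeat mechanism

The Executor keeps sending heartbeats to the driver to report its health:

EXECUTOR_HEARTBEAT_INTERVAL defaults to 10s and can be changed through the spark.executor.heartbeatInterval parameter.

The startDriverHeartBeater method is as follows:

The reportHeartBeat method it depends on is as follows:

4. Killing tasks: the reaper mechanism

First, an introduction to TaskReaper.

TaskReaper

Class description: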

Supervises the killing / cancellation of a task by sending the interrupted flag, optionally sending a Thread.interrupt(), and monitoring the task until it finishes. Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptable or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources that are being used by these zombie tasks. The TaskReaper was introduced in SPARK-18761 as a mechanism to monitor and clean up zombie tasks. For backwards-compatibility / backportability this component is disabled by default and must be explicitly enabled by setting spark.task.reaper.enabled=true. A TaskReaper is created for a particular task when that task is killed / cancelled. Typically a task will have only one TaskReaper, but it's possible for a task to have up to two reapers in case kill is called twice with different values for the interrupt parameter. Once created, a TaskReaper will run until its supervised task has finished running. If the TaskReaper has not been configured to kill the JVM after a timeout (i.e. if spark.task.reaper.killTimeout < 0) then this implies that the TaskReaper may run indefinitely if the supervised task never exits.

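Its source is as follows:

The idea: send the kill signal and wait for a while. If the task stops, return. Otherwise an exception is thrown in YARN mode; local mode is not affected.

Whether the reaper mechanism is enabled

The reaper mechanism is disabled by default and can be enabled with the spark.task.reaper.enabled parameter.

The taskReaper thread pool

It is also a daemon thread pool that lets several workers run at once, so several tasks can be stopped at the same time.

Killing a task

When a task is killed, the killTask method is called. The source is as follows:

9. The driver-side SchedulerBackend receives the task request

As mentioned earlier, after the SchedulerBackend receives the request it calls the makeOffers method:

It first asks the TaskScheduler to allocate resources, which returns TaskDescription objects, and then uses those objects to launch the tasks.

10. Allocating resources

1. Filtering out executors that are about to be removed

ExecutorData records information about an executor, including its address, port, free CPU cores, and total CPU cores.

The executorIsAlive method is defined as follows:

That is, the executor is neither in the set of executors pending removal nor in the set of lost executors.

2. Building the WorkerOffer collection

A WorkerOffer object represents the free resources on one executor. The class is defined as follows:

3. Allocating resources

The org.apache.spark.scheduler.TaskSchedulerImpl#resourceOffers method is as follows:

The idea: first filter out the unusable WorkerOffer objects, then allocate resources to each TaskSet. If a TaskSet is a barrier TaskSet, the RPC endpoint of the barrierCoordinator also has to be initialized.

Recording mappings

The mapping between hostname and executorId and the mapping between executorId and taskId are recorded. The source is as follows:

The source of executorAdded is as follows:

The source of org.apache.spark.scheduler.DAGScheduler#executorAdded is as follows:

After passing through the eventProcessLoop asynchronous message queue, the event is handled by the following branch:

The final handling logic is shown below: the healthy executor is removed from the collection of failure epochs.

The method that returns the rack of a host is not implemented and returns None.

Updating the set of unavailable executors

blacklistTrackerOpt is defined as follows:

The org.apache.spark.scheduler.BlacklistTracker#isBlacklistEnabled method is as follows:

BLACKLIST_ENABLED is controlled by the spark.blacklist.enabled parameter, which is unset by default. Blacklisting is also enabled if spark.scheduler.executorTaskBlacklistTime is set to a value greater than 0.

BlacklistTracker is mainly used to track problematic executors and hosts. Its class description is: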

BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add additional blacklisting of executors and nodes for individual tasks and stages which works in concert with the blacklisting here. The tracker needs to deal with a variety of workloads, eg.:

bad user code -- this may lead to many task failures, but that should not count against individual executors

many small stages -- this may prevent a bad executor for having many failures within one stage, but still many failures over the entire application

"flaky" executors -- they don't fail every task, but are still faulty enough to merit blacklisting See the design doc on SPARK-8425 for a more in-depth discussion.

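Filtering out unusable WorkerOffers

WorkerOffers whose host or executor is blacklisted are filtered out. The corresponding source is:

Sorting the TaskSetManagers

The corresponding source is:

The WorkerOffer collection is first shuffled randomly, then the free cores and free slots of each offer are computed, and then the sorted queue of TaskSetManagers is obtained. rootPool is a Pool object whose source was described in "The TaskScheduler submits the TaskSet" and is not repeated here.

CPUS_PER_TASK defaults to 1, meaning one task uses one core. For that reason, avoid spawning threads inside Spark operators where possible: with a single core per task they gain little. The value can be changed through the spark.task.cpus parameter.

The source of org.apache.spark.scheduler.Pool#getSortedTaskSetQueue is as follows:

The source of TaskSetManager's getSortedTaskSetQueue is as follows:

Recomputing locality:

The source of org.apache.spark.scheduler.TaskSetManager#executorAdded is as follows:

The source of org.apache.spark.scheduler.TaskSetManager#computeValidLocalityLevels is as follows:

This is a good place to understand the five data locality levels. Levels added to the array earlier are preferred.

Allocating resources to each TaskSet

The corresponding source is:

If there are enough slots, or the TaskSet is not a barrier TaskSet, resources are allocated to the TaskSet.

The source of org.apache.spark.scheduler.TaskSchedulerImpl#resourceOfferSingleTaskSet is as follows:

The idea: iterate over shuffledOffers. If an offer has at least as many free CPU cores as one slot requires, allocate resources from it. After each allocation, record the mappings from taskId to TaskSetManager, from taskId to executorId, and from executorId to its tasks. Finally, subtract the cores used by one slot from the offer's free cores.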
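The allocation loop can be illustrated with a stripped-down model. This is a sketch, not Spark's implementation: Offer is a hypothetical case class, locality and blacklisting are ignored, and tasks are plain integer ids. It keeps making rounds over the offers, placing at most one task per offer per round, until a round launches nothing.

```scala
import scala.collection.mutable

object ResourceOfferSketch {
  // Hypothetical, simplified view of one executor's free resources.
  final case class Offer(executorId: String, cores: Int)

  // Returns taskId -> executorId for every task that could be placed.
  def assign(offers: Seq[Offer], pendingTasks: Seq[Int], cpusPerTask: Int): Map[Int, String] = {
    val available = offers.map(_.cores).toArray
    val queue = mutable.Queue.empty[Int]
    queue ++= pendingTasks
    val taskToExecutor = mutable.Map.empty[Int, String]
    var launchedInRound = true
    while (launchedInRound && queue.nonEmpty) {
      launchedInRound = false
      for (i <- offers.indices) {
        if (queue.nonEmpty && available(i) >= cpusPerTask) {
          taskToExecutor(queue.dequeue()) = offers(i).executorId
          available(i) -= cpusPerTask   // one slot's worth of cores is now taken
          launchedInRound = true
        }
      }
    }
    taskToExecutor.toMap
  }
}
```

Placing one task per offer per round spreads tasks across executors instead of filling the first executor before touching the next.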

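The method it depends on, org.apache.spark.scheduler.TaskSetManager#resourceOffer, is shown below. The idea: first check that neither the executor nor its host is blacklisted. If either is, return None. Otherwise start allocating resources.

The allocation steps are:

Compute the data locality.
Dequeue each task and build its TaskDescription object.

The method it depends on, org.apache.spark.scheduler.TaskSetManager#getAllowedLocalityLevel, is shown below. Its purpose is to compute the most relaxed data locality level currently allowed for the task.

Initializing the BarrierCoordinator

If resources were allocated successfully and the TaskSet is a barrier TaskSet, the BarrierCoordinator is initialized. The source is as follows:

The method it depends on, org.apache.spark.scheduler.TaskSchedulerImpl#maybeInitBarrierCoordinator, is as follows:

11. Running tasks

In org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint#makeOffers, once resources have been allocated the tasks can be run. The source is as follows:

1. Serializing the TaskDescription

The launchTasks method it depends on is as follows:

The org.apache.spark.scheduler.TaskDescription#encode method is a serialization step that turns the in-memory TaskDescription object into a byte buffer. The source is as follows:

maxRpcMessageSize is defined as follows:

The source of org.apache.spark.util.RpcUtils#maxMessageSizeBytes is as follows:

It defaults to 128 MB and can be changed through the spark.rpc.message.maxSize parameter.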

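After the cores required by one slot are subtracted from executorData's free cores, the executor is asked to run the task.

2. Sending an RPC request for the executor to run the task

The corresponding part of launchTasks is:

After the underlying RPC transport, the branch of the executor endpoint's receive method that handles the message is:

It has two main steps. The first deserializes the TaskDescription bytes into an object.

The second calls the executor to run the task.

Each step is examined below.

3. The executor deserializes the TaskDescription

The idea: the bytes in the ByteBuffer received over RPC are deserialized into an in-memory object, the TaskDescription.

4. The executor runs the task

The Executor object is the abstraction of a Spark executor, backed by a thread pool that runs tasks. The source suggests that Spark's term "Executor" comes from the Executors of the Java thread-pool API.

The source of the launchTask method is as follows:

TaskRunner is an implementation of Runnable; a worker in the thread pool executes its run method.

Next, the TaskRunner class.

12. TaskRunner

13. Running the task

The run method is long, so it is described in several parts.

1. Preparing the environment

The corresponding source is:

It initializes the environment, changes the task's state to RUNNING, and records the starting GC time.

2. Preparing the task configuration

The source is as follows:

It deserializes the Task object and sets up the task's dependencies.

3. Running the task

It records the task start time and the starting CPU time, runs the task, and finally releases memory.

The source of the method it depends on, org.apache.spark.util.Utils#tryWithSafeFinally, is as follows:

As the source shows, the first function is the body to execute and the second is the code to run in the finally block, which here releases the memory.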
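A helper of this shape can be sketched as below. This is a minimal reimplementation for illustration and is not guaranteed to match Spark's source line for line: it runs the body, always runs the cleanup, and if both fail it rethrows the body's exception with the cleanup failure attached as a suppressed exception, so the original error is not masked.

```scala
object SafeFinallySketch {
  // Run `block`, then always run `finallyBlock`. If both throw, the
  // exception from `block` wins and the cleanup failure is suppressed.
  def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
    var original: Throwable = null
    try {
      block
    } catch {
      case t: Throwable =>
        original = t
        throw t
    } finally {
      try {
        finallyBlock
      } catch {
        case t: Throwable if original != null && (original ne t) =>
          original.addSuppressed(t)
          throw original
      }
    }
  }
}
```

A call looks like `tryWithSafeFinally { runTheTask() } { releaseMemory() }`, where both function names are placeholders for whatever the caller needs to run and clean up.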

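4. Handling failed tasks

The source is as follows:

5. Updating metrics

Metrics are not discussed in depth here. The source is as follows:

6. Serializing the task result

The idea: serialize the return value into a ByteBuffer.

7. Returning the result to the driver

The org.apache.spark.executor.CoarseGrainedExecutorBackend#statusUpdate method is as follows:

After the RPC, the receive method of the driver-side org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint is as follows:

The idea: update the task's state, then offer the resources of the same executor again so that further tasks can run on it.

8. Updating the task state

The org.apache.spark.scheduler.TaskSchedulerImpl#statusUpdate method is as follows:

9. Handling failed tasks

The source is shown below and is not analyzed further:

10. Handling successful tasks

The source is as follows:

The source of the method it depends on, org.apache.spark.scheduler.TaskSchedulerImpl#handleSuccessfulTask, is as follows:

The source of org.apache.spark.scheduler.TaskSetManager#handleSuccessfulTask is as follows:

The source of org.apache.spark.scheduler.TaskSchedulerImpl#markPartitionCompletedInAllTaskSets is as follows:

The source of org.apache.spark.scheduler.TaskSetManager#markPartitionCompleted is as follows:

The source of org.apache.spark.scheduler.TaskSetManager#maybeFinishTaskSet is as follows:

11. Notifying the DAGScheduler that the task has finished

At the end of org.apache.spark.scheduler.TaskSetManager#handleSuccessfulTask, the dagScheduler's taskEnded method is called. The source is as follows:

It posts an event to the eventProcessLoop queue for asynchronous handling:

In org.apache.spark.scheduler.DAGSchedulerEventProcessLoop#doOnReceive, the branch that handles this event is:

That branch calls org.apache.spark.scheduler.DAGScheduler#handleTaskCompletion. The code that handles a successful result is:

We focus on how the return value is handled. If the task executed an action, the first branch is taken. If it executed a shuffle map operation, the second branch is taken.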

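12. Handling the return value of an action job

First, the first branch:

The code related to the return value is:

The source of org.apache.spark.scheduler.JobWaiter#taskSucceeded is as follows:

The idea: call the resultHandler function defined by the RDD operation to store the return value. If all tasks have finished once this task completes, jobPromise can be marked as successful, and the driver can go on to use the value returned by the action.

Taking collect as an example, the definition of resultHandler can be traced back from the methods that org.apache.spark.SparkContext#submitJob depends on:

This shows that resultHandler is a function argument passed in by the caller.

Working forward from the collect method:

The overloaded runJob methods of SparkContext that it calls are:

The part marked in red in the figure above is the resultHandler function. collect is applied to every partition of the RDD.

In other words, the first argument of org.apache.spark.scheduler.JobWaiter#taskSucceeded is the partition index, and the second is the value the action computed on that partition of the RDD.

The resultHandler assigns the return value directly to the slot for that partition in results. In the end the data of all partitions is returned to the driver. Note that at this point the result is an array of arrays, that is, a two-dimensional array.

Finally, collect flattens the two-dimensional array into a one-dimensional one:

Internally this builds an ArrayBuilder, appends the array elements to it, and returns the newly constructed array. Because everything is held in driver memory, this can cause an out-of-memory error, so collect is not recommended for fetching large result sets.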
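The interplay of resultHandler and the final flatten can be shown without a cluster. The sketch below is an illustration under stated assumptions: partitions are plain in-memory sequences, and the "tasks" are run in reverse order only to show that the handler places each result by partition index regardless of completion order.

```scala
import scala.reflect.ClassTag

object CollectSketch {
  def collect[T: ClassTag](partitions: Seq[Seq[T]]): Array[T] = {
    // One slot per partition, filled in by the handler.
    val results = new Array[Array[T]](partitions.size)
    val resultHandler: (Int, Array[T]) => Unit =
      (index, res) => results(index) = res
    // Tasks may finish in any order; the index keeps the output ordered.
    partitions.indices.reverse.foreach { i =>
      resultHandler(i, partitions(i).toArray)
    }
    // Flatten the array of arrays into a single array.
    Array.concat(results: _*)
  }
}
```

The whole result lives in one JVM at the end, which is the reason the text warns against collecting large data sets.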

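Now for the second branch.

13. Handling the return value of a shuffle job

The return value of a shuffle job is of type MapStatus.

First, a few words about the MapStatus class.

MapStatus

Its main methods are:

location: the BlockManager that manages the shuffle output data.

getSizeForBlock: returns the size of the given block.

Its inheritance hierarchy is:

CompressedMapStatus is a compressed MapStatus, meaning that the MapStatus can be compressed when it is serialized for network transfer.

HighlyCompressedMapStatus stores the sizes of huge blocks individually and, for the rest, keeps the average block size and whether each block is empty.

Handling the return value of a shuffle job

We focus only on how the return value is handled. The relevant source in org.apache.spark.scheduler.DAGScheduler#handleTaskCompletion is:

The source of org.apache.spark.MapOutputTrackerMaster#registerMapOutput is shown below, where mapId is the partition id:

The member variable shuffleStatuses is defined as follows:

That is, shuffleStatuses holds the mapping from shuffleId to ShuffleStatus on the driver, so that later stages can fetch the MapStatus information of this stage through the MapOutputTrackerMasterEndpoint ref. The details are analyzed in the next section.

14. Summary

This article described in detail how tasks flow through Spark internally but did not cover how a task actually computes. Tasks come in two kinds, matching ResultStage and ShuffleMapStage, and the execution of the two kinds differs fundamentally. The next article analyzes this in more detail.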

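IV. Task memory management

1. Questions

How is memory managed while a Spark task runs?
How is on-heap memory addressed? How does the design cope with memory addresses changing under JVM garbage collection? How does the internal memory pool reclaim buffers?
What allocates off-heap and on-heap memory, respectively? How are data offsets computed?
What is a MemoryConsumer?
How is data addressed within a memory page?

The memory of a single task is managed by org.apache.spark.memory.TaskMemoryManager.

2. TaskMemoryManager

It manages the memory of a single task.

Memory is divided into off-heap memory and on-heap memory.

Off-heap memory can be addressed directly with a 64-bit long address.

On-heap memory is addressed by the combination of a base object and an offset.

The problem encountered when designing the class:

Storing the addresses of structures that live inside other structures is a problem, for example pointers to records inside a hash map or a sorting buffer. Even if we decided to use 128 bits for an address, we could not simply store the address of the base object, because garbage collection means that address is not guaranteed to stay stable. (Under generational collection, objects keep being moved, and every move changes the object's memory address, although this is transparent to developers who do not care about object addresses.)

The final design:

For off-heap memory, only the raw address is stored, because off-heap memory is not affected by GC. For on-heap memory, the upper 13 bits of a 64-bit long hold the page number and the lower 51 bits hold the offset within that page. A page table stores the base objects, and a base object's index in the page table is its page number. There are at most 8192 pages, which in theory allows addressing 8192 * (2^31 - 1) * 8 bytes, about 140 TB of data. Here 2^31 - 1 is the maximum int value: an on-heap page is backed by a long array, and the maximum length of such an array is 2^31 - 1. In practice the limit is lower, because in some structures the 64 bits hold the data's partition id in addition to the page number and in-page offset.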

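3. MemoryLocation

The base object and the offset are wrapped in a MemoryLocation object. In other words, this class is what memory addressing is built on:

Its only subclass is org.apache.spark.unsafe.memory.MemoryBlock.

4. MemoryBlock

It represents a contiguous block of memory, consisting of a start position and a fixed size. The start position is represented by a MemoryLocation.

So it has four attributes:

The start address of the contiguous block: the base object and offset inherited from the parent class.

The fixed size, length, and the unique identifier of the block, its page number.

Its main methods are shown below. Platform is a class tied to the underlying platform and is not discussed further.

5. MemoryAllocator

It is responsible for requesting memory. The implementations of this interface are what actually allocate memory. The TaskMemoryManager, examined further below, only manages memory and does not perform the actual allocation.

Its inheritance hierarchy is shown below. There are two subclasses:

The main constants and methods it defines are:

The main methods allocate and free memory blocks. Next we look at the two subclass implementations.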

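6. HeapMemoryAllocator

Full name: org.apache.spark.unsafe.memory.HeapMemoryAllocator

It allocates on-heap memory, mainly as long arrays, and the largest allocation is 16 GB.

1. Member variables

bufferPoolBySize is a HashMap whose values hold weak references, so the data may be reclaimed when the JVM runs a GC. What it stores are memory blocks that have been freed and are no longer in use.

2. Whether to use the memory pool

The pool is used only when the requested block size reaches a threshold.

3. Allocating memory

The idea: first compute the number of words from the number of bytes, then compute the byte size after alignment, and assert that the aligned size is greater than or equal to the unaligned size. Why align? Because the memory of a long array is always a multiple of 8 bytes.

If the aligned size meets the pooling condition, the pool for that size is looked up. If the pool is not empty, previously freed arrays are taken from it one by one until one is found that has not been garbage collected. That array is wrapped in a MemoryBlock, filled with the "clean" marker value to overwrite its old contents, and returned.

Otherwise a new long array with the computed number of words is created, wrapped in a MemoryBlock, filled with the "clean" marker value, and returned. In short, long arrays are what is pooled and long arrays are what hold the data.

4. Freeing memory

First the memory being freed is overwritten with the "free" marker value, and pageNumber is set to a placeholder page number.

Then the underlying long array is saved in a temporary variable, the base object is set to null, and the offset is set to 0.

The aligned size is computed for that long array. The size of a memory page is not necessarily the array length * 8; size here is the size of the memory page, so it has to be aligned.

If the aligned page size meets the pooling condition, the array is put into the pool, where it waits either to be reused by a later allocation or to be reclaimed by the JVM's GC.

After this method returns, the long array is referenced by a LinkedList (the pool), but only through a weak reference. The array is therefore unreachable by strong references, and the JVM may reclaim it during a GC.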
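The pooling idea can be sketched as follows. This is an illustration only: the 1 MB threshold is an assumed value, the marker-value filling is left out, and plain long arrays stand in for MemoryBlock. What it shows is word alignment plus a per-size pool of weakly referenced arrays.

```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

object HeapPoolSketch {
  // Assumed threshold for illustration; check Spark's source for the real constant.
  val PoolingThresholdBytes: Long = 1024L * 1024L

  private val pool =
    mutable.Map.empty[Long, mutable.Queue[WeakReference[Array[Long]]]]

  // Round a byte count up to a whole number of 8-byte words.
  def alignedSize(bytes: Long): Long = ((bytes + 7) / 8) * 8

  def allocate(bytes: Long): Array[Long] = {
    val aligned = alignedSize(bytes)
    var reused: Array[Long] = null
    if (aligned >= PoolingThresholdBytes) {
      val queue = pool.getOrElseUpdate(
        aligned, mutable.Queue.empty[WeakReference[Array[Long]]])
      // Skip entries whose array has already been garbage collected.
      while (reused == null && queue.nonEmpty) reused = queue.dequeue().get()
    }
    if (reused != null) reused else new Array[Long]((aligned / 8).toInt)
  }

  def free(array: Array[Long]): Unit = {
    val aligned = array.length * 8L
    if (aligned >= PoolingThresholdBytes) {
      val queue = pool.getOrElseUpdate(
        aligned, mutable.Queue.empty[WeakReference[Array[Long]]])
      queue += new WeakReference[Array[Long]](array)
    }
  }
}
```

Because the pool holds only weak references, a freed array that nobody reuses can still be collected, so the pool itself never pins memory.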

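5. Summary of on-heap memory

Data in on-heap memory really is affected by the JVM's GC, and its actual memory address can change. Using an array as the container together with an offset neatly sidesteps the problem, and the pool reduces the cost of repeatedly initializing arrays. Freed arrays are held through weak references, so the pool cannot keep them from being collected and cause a memory leak.

7. UnsafeMemoryAllocator

Full name: org.apache.spark.unsafe.memory.UnsafeMemoryAllocator

It allocates off-heap memory.

1. Allocating memory

The idea: the Unsafe class is used underneath to allocate off-heap memory. The offset here is the address in operating-system memory, and the base object is null.

2. Freeing memory

Freed off-heap memory cannot be pooled. Off-heap memory is not managed by the JVM, so leftover unused memory would never be reclaimed, causing a more serious memory leak. Worse, off-heap memory is system memory, so in severe cases this can cause system-level problems.

3. Summary of off-heap memory

In short, off-heap memory is allocated and freed through Java's built-in Unsafe class. In the unified addressing scheme its base object is null, and its offset is the real address of the memory page in the operating system.

Next, the member variables and core methods of TaskMemoryManager.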

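8. A closer look at TaskMemoryManager

1. Member variables

The main member variables are:

OFFSET_BITS: the number of bits used for the offset within a page, that is, 64 minus the 13 bits of the page number, or 51

MAXIMUM_PAGE_SIZE_BYTES: about 17 GB, the maximum amount of memory a page can hold

pageTable: holds the memory pages

allocatedPages: tracks which page numbers are in use

memoryManager: responsible for Spark's memory management. For details, see "spark 源碼分析之十五 -- Spark內存管理剖析".

taskAttemptId: the id of the task attempt

tungstenMemoryMode: the Tungsten memory mode, either off-heap or on-heap

consumers: all consumers of the task's memory

2. Core methods

All the methods are:

We analyze their source one by one below.

1. Acquiring execution memory

The idea: first request the execution memory from the MemoryManager. If there is not enough, go through the MemoryConsumers and call their spill method to spill in-memory data to disk, stopping as soon as enough memory has been released to satisfy the request.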
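The acquire-then-spill loop can be modelled in a few lines. This is a simplified sketch: Consumer and TaskMemory are hypothetical classes, a consumer's spill frees everything it holds, and consumers are spilled largest first, which is a simplification and not necessarily the order Spark uses.

```scala
object AcquireMemorySketch {
  // Hypothetical consumer: spill() writes its data out and frees all it holds.
  final class Consumer(val name: String, var used: Long) {
    def spill(): Long = { val freed = used; used = 0L; freed }
  }

  final class TaskMemory(var free: Long, consumers: Seq[Consumer]) {
    // Take what the pool has; if that is not enough, spill other
    // consumers one at a time until the request is satisfied.
    def acquire(required: Long, requester: Consumer): Long = {
      var got = math.min(required, free)
      free -= got
      val candidates = consumers
        .filter(c => (c ne requester) && c.used > 0)
        .sortBy(c => -c.used)
        .iterator
      while (got < required && candidates.hasNext) {
        free += candidates.next().spill()
        val extra = math.min(required - got, free)
        free -= extra
        got += extra
      }
      requester.used += got
      got
    }
  }
}
```

The return value can be smaller than the request when spilling every other consumer still does not free enough, so callers have to check how much they actually received.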

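2. Releasing execution memory

This is not a release of memory in the real sense. The bookkeeper simply crosses this memory usage off its books. Actually freeing memory still requires calling a MemoryConsumer's spill method to spill in-memory data to disk.

3. Getting the page size

4. Allocating a page

The idea: first acquire execution memory. Once that succeeds, find a free page number.

If the page number exceeds the maximum allowed, release the memory just acquired and give up. Otherwise use the MemoryAllocator to allocate the page, set its page number, put it under the management of the page table, and return the page. For how the MemoryAllocator allocates memory, see the analysis of on-heap and off-heap allocation above.

5. Freeing a page

The idea: call the MemoryAllocator's free method to free the memory, then call method 2 to cross off the memory usage.

6. Encoding a memory address

The idea: the upper 13 bits hold the page number and the lower 51 bits hold the offset of the address.

7. Decoding a memory address

The idea: the reverse of the encoding in method 6.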
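The 13-bit / 51-bit layout translates directly into shifts and masks. The sketch below uses its own names (PageAddressSketch, encode, decodePageNumber, decodeOffset), which are illustrative and not taken from Spark's source.

```scala
object PageAddressSketch {
  val PageNumberBits = 13
  val OffsetBits = 64 - PageNumberBits               // 51
  val MaskLower51Bits: Long = (1L << OffsetBits) - 1

  // Pack a page number into the upper 13 bits and an offset into the lower 51.
  def encode(pageNumber: Int, offsetInPage: Long): Long = {
    require(pageNumber >= 0 && pageNumber < (1 << PageNumberBits), "page number out of range")
    (pageNumber.toLong << OffsetBits) | (offsetInPage & MaskLower51Bits)
  }

  // Unsigned shift so that page numbers with the top bit set decode correctly.
  def decodePageNumber(address: Long): Int = (address >>> OffsetBits).toInt

  def decodeOffset(address: Long): Long = address & MaskLower51Bits
}
```

Because the page number is only an index into the page table, the encoded value stays valid even when the garbage collector moves the underlying array.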

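8. Getting the base object for a memory address. This requires an on-heap page; otherwise there is no base object.

9. Getting the offset within the page for a memory address

For on-heap memory, the decoded offset is returned as is.

For off-heap memory, the real offset is the page's offset at allocation time plus the in-page offset. It is an absolute offset in operating-system terms.

10. Freeing all pages

The idea: free the memory through the MemoryAllocator, and ask the bookkeeping MemoryManager to release the execution memory and all other memory of the task.

11. Getting the execution memory used by a single task

The idea: ask the MemoryManager for the execution memory usage of the given task.

Next is the consumer object that interacts with TaskMemoryManager: MemoryConsumer.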

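9. MemoryConsumer

1. Class description

It is a consumer of task memory.

Its class structure is:

2. Member variables

taskMemoryManager: manages the task's memory.

used: the amount of memory in use.

mode: whether the memory is on-heap or off-heap.

pageSize: the page size.

3. Main methods

spill: spills in-memory data to disk. It is abstract and left to subclasses.

The methods for acquiring and freeing memory are not analyzed in detail; they all delegate to TaskMemoryManager.

More about MemoryConsumer and its subclasses is covered in the next article, on the shuffle write path.

10. Summary

This article analyzed how memory is managed while a task runs. Its importance may not be obvious yet, but shuffles that involve sorting make heavy use of in-memory sorters, and those sorters hold large amounts of data that needs to be managed.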

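V. Spark shuffle writes: preparation

1. Preface

1. Continuing from the previous article

Let us look again at the runTask methods of ResultTask and ShuffleMapTask. For now we only care about the data-processing logic, which is annotated in the two figures below.

2. ResultTask

Class name: org.apache.spark.scheduler.ResultTask

Its runTask method is as follows:

3. ShuffleMapTask

Class name: org.apache.spark.scheduler.ShuffleMapTask

Its runTask method is as follows:

4. The class that manages shuffle data: IndexShuffleBlockResolver

Next is the IndexShuffleBlockResolver class. It is responsible for reading and deleting shuffle data and for updating and deleting shuffle index data.

The inheritance hierarchy of IndexShuffleBlockResolver is:

First, the parent trait ShuffleBlockResolver.

5. ShuffleBlockResolver

It retrieves shuffle blocks by a logical shuffle identifier (such as mapId, reduceId, or shuffleId). Shuffle data is usually wrapped in a File or FileSegment.

Its interface is defined as follows:

getBlockData returns the shuffle data for a given ShuffleBlockId.

Next, the implementation in IndexShuffleBlockResolver.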

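6. IndexShuffleBlockResolver

This class is responsible for reading and deleting shuffle data and for updating and deleting shuffle index data.

Its class structure is:

blockManager is the BlockManager on the executor.

transportConf mainly holds shuffle-related configuration.

NOOP_REDUCE_ID is 0, because the reduce id is not yet known at this point.

The core methods are:

1. Getting the shuffle data file. The source is shown below. The idea: use the blockManager's DiskBlockManager to get the physical file for the shuffle's blockId.

2. Getting the shuffle index file. The source is shown below. The idea: use the blockManager's DiskBlockManager to get the physical file for the shuffle index's blockId.

3. Removing shuffle data by mapId. The source is shown below. The idea: delete the shuffle data file and index file for the given shuffleId and mapId.

4. Validating the shuffle index and data. The source is shown below.

As the code shows, the first long in the file is a placeholder and is always 0.

The values stored after it describe the size of each block: each long that is read is the sum of the sizes of all preceding blocks.

So the size of the current block = the offset read this time - the offset read last time.

This index design is neat. The block sizes add up to the size of the whole file, and the offset of every block within the file is recorded in the index file as well.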
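The cumulative-offset layout is easy to reproduce in memory. The sketch below is an illustration with invented names (toOffsets, segment); it keeps the offsets in an array instead of a file, but the arithmetic is the one described above.

```scala
object IndexFileSketch {
  // Offsets as stored in the index: a leading 0, then the running total
  // after each block, so n blocks produce n + 1 longs.
  def toOffsets(lengths: Array[Long]): Array[Long] =
    lengths.scanLeft(0L)(_ + _)

  // Locate the segment for one reducer: (start offset, length).
  def segment(offsets: Array[Long], reduceId: Int): (Long, Long) =
    (offsets(reduceId), offsets(reduceId + 1) - offsets(reduceId))
}
```

Two adjacent longs are enough to locate any block, so a reader seeks to position reduceId * 8 in the index file and reads 16 bytes, without scanning the rest of the index.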

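5. Writing the index file. The source is shown below.

The idea: first get the shuffle data file and create a temporary index file.

Then check whether a valid index and data file already exist, reading the size of each block from the index. If they do, another attempt has already written this map output: the lengths array is updated from the existing index, the temporary data file is deleted, and the method returns.

If they do not, the index of the new data is written to the temporary index file, any old data file and index file are deleted, and the temporary data file and temporary index file are renamed to the final data and index files.

This design ensures that the index is always updated together with the data.

6. Getting block data by ShuffleBlockId. The source is shown below.

The idea:

First open the index file for the shuffle data, seek to the position for the requested reduceId, and read the offsets that give the block's size. Then create a FileSegmentManagedBuffer that reads the corresponding segment of the data file.

This shows that reduceId is the index of a small block (segment) within the physical block file.

7. Stopping the block resolver. The implementation is empty.

In summary, this class shows the design of Spark's shuffle index. It is a useful reference when you need to design an index over a File and its FileSegments in your own work.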

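2. Preparation before writing shuffle data

The key code in runTask of org.apache.spark.scheduler.ShuffleMapTask is:

The manager here is SortShuffleManager, the only implementation of ShuffleManager.

The source of org.apache.spark.shuffle.sort.SortShuffleManager#getWriter is as follows:

numMapsForShuffle is defined as follows:

It stores the mapping from shuffleId to the number of mappers.

1. Getting the ShuffleHandle

First, the ShuffleHandle class.

ShuffleHandle

Here is a rough overview of ShuffleHandle.

Class description:

This class is used internally by Spark. It carries some information about a shuffle and is mainly used by the ShuffleManager. In essence it is a marker: apart from a few shuffle attributes it has no methods of its own, and a case class would arguably suit it better.

The class source is:

The inheritance hierarchy is:

BaseShuffleHandle

Full name: org.apache.spark.shuffle.BaseShuffleHandle

Class description:

It is the basic implementation of ShuffleHandle.

The class source is:

Next, its two subclasses.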

BypassMergeSortShuffleHandle

Full name: org.apache.spark.shuffle.sort.BypassMergeSortShuffleHandle

Class description:

Used to identify when we've chosen to use the bypass merge sort shuffle path. Its source is:

SerializedShuffleHandle

Full name: org.apache.spark.shuffle.sort.SerializedShuffleHandle

Class description:

Used to identify when we've chosen to use the serialized shuffle.

The class source is:

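Getting the ShuffleHandle

org.apache.spark.ShuffleDependency contains the following definitions:

shuffleId is a globally unique id generated by SparkContext.

The source of org.apache.spark.shuffle.sort.SortShuffleManager#registerShuffle is as follows:

As shown, the number of mappers equals the number of partitions of the parent RDD.

Next, the condition for using bypassMergeSort, org.apache.spark.shuffle.sort.SortShuffleWriter#shouldBypassMergeSort:

The idea: if the dependency does not use mapSideCombine and the number of output partitions of its partitioner does not exceed the bypassMergeSort threshold, bypassMergeSort is used. The threshold defaults to 200 and can be set through the spark.shuffle.sort.bypassMergeThreshold parameter.

The condition for using serializedShuffle, org.apache.spark.shuffle.sort.SortShuffleManager#canUseSerializedShuffle:

The idea: this shuffle mode can be used if the serializer supports relocation of serialized objects, mapSideCombine is not used, and the number of output partitions is not greater than (1 << 24).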
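Put together, the two conditions form a three-way decision. The sketch below models it with plain booleans and integers; the object names and the parameter list are illustrative, and the order of the checks (bypass first, then serialized, then the base handle) is an assumption about how the conditions combine.

```scala
object ShuffleHandleSketch {
  sealed trait Handle
  case object BypassMergeSort extends Handle
  case object Serialized extends Handle
  case object Base extends Handle

  val MaxSerializedPartitions: Int = 1 << 24   // 16777216

  def choose(
      mapSideCombine: Boolean,
      numPartitions: Int,
      serializerSupportsRelocation: Boolean,
      bypassThreshold: Int = 200): Handle = {
    if (!mapSideCombine && numPartitions <= bypassThreshold) {
      BypassMergeSort
    } else if (serializerSupportsRelocation && !mapSideCombine &&
        numPartitions <= MaxSerializedPartitions) {
      Serialized
    } else {
      Base
    }
  }
}
```

Any shuffle with map-side combine falls through to the base handle, because both special paths write records without aggregating them first.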

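2. Getting a ShuffleWriter from the ShuffleHandle

First, a brief description of ShuffleWriter.

ShuffleWriter

Class description: it writes the output of a map task to the shuffle system. Its inheritance hierarchy is shown below and corresponds to the three shuffle implementations identified by ShuffleHandle.

Getting the ShuffleWriter

The source of org.apache.spark.shuffle.sort.SortShuffleManager#getWriter is as follows:

Each mapper has one writer, and each writer writes the output of one map partition.

3. Summary

Starting from the differences and similarities between the two kinds of task, this article showed why the Spark shuffle matters, then analyzed the kinds of shuffle data and the class that manages them. Finally it introduced the markers for the three shuffle types and how Spark decides which one to use.

Next we move on to how mappers write data. Spark has three implementations, each analyzed in its own article. We look at them one by one.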

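VI. Spark shuffle writes, part one: BypassMergeSortShuffleWriter

1. Preface

Source first, explanation after:

The flow is as follows:

2. Writing map data into partition files according to the partitioner

If there is no data to write, the data file is empty, every segment size in the index file is 0, and an initialized MapStatus is returned.

If there is data to write to the reducers' files, the serializer instance is created first, then a DiskBlockObjectWriter is created for each partition in the partitionWriters array, and the FileSegment array for the partitions is initialized.

Then every record to be written is visited. Its key is extracted, the Partitioner's getPartition function determines the index of the target reduce partition, that index selects the DiskBlockObjectWriter responsible for writing, and that writer writes the key-value pair to the partition's temporary file.

Once all records have been visited, each partition is visited in turn: commitAndGet is called on the partition's writer, which returns the FileSegment for that partition.

The source of commitAndGet is as follows:

At this point, in most cases the data of each reduce partition has been written to its own file. Why a separate file when the type is FileSegment? Because the temporary ShuffleBlockIds returned by DiskBlockManager do not repeat. The source of org.apache.spark.storage.DiskBlockManager#createTempShuffleBlock is as follows:

However, creating a temporary file here only creates a file handle; the physical file does not exist yet, so this method cannot guarantee that the temporary files it creates never collide. Several partitions writing to the same temporary file therefore remains possible, although very unlikely.

Finally, the small partition files are merged into a single file.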
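The write-then-concatenate flow can be mimicked in memory. This is a sketch under stated assumptions: byte-array streams stand in for the per-partition temporary files, records are string pairs written as "key=value" lines, and partitioning is a plain hash of the key. None of these choices come from Spark's source.

```scala
import java.io.ByteArrayOutputStream

object BypassWriteSketch {
  // Returns (concatenated data, length of each partition's segment).
  def write(records: Seq[(String, String)], numPartitions: Int): (Array[Byte], Array[Long]) = {
    // One writer per reduce partition, like the per-partition temp files.
    val writers = Array.fill(numPartitions)(new ByteArrayOutputStream())
    records.foreach { case (key, value) =>
      val partition = Math.floorMod(key.hashCode, numPartitions)  // hash partitioning
      writers(partition).write(s"$key=$value\n".getBytes("UTF-8"))
    }
    // Segment lengths are what the index file is later built from.
    val lengths = writers.map(_.size().toLong)
    // Concatenate the partition outputs in partition order.
    val merged = new ByteArrayOutputStream()
    writers.foreach(writer => merged.write(writer.toByteArray))
    (merged.toByteArray, lengths)
  }
}
```

No sorting happens anywhere: records are routed by partition as they arrive, and the final file is simply the partition outputs laid end to end.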

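First, the getDataFile method of the ShuffleBlockResolver (an IndexShuffleBlockResolver instance) returns the File handle of the data file, and tempFileWith in org.apache.spark.util.Utils produces a temporary file. The source of org.apache.spark.util.Utils#tempFileWith is shown below; it returns a file with a UUID suffix:

3. Merging the partition files

Finally, the writePartitionedFile method of org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter merges the small files into one large file and returns an array containing the size of the file segment for each partition. The source is as follows:

4. Updating the index file

Finally the index file is updated and the data file is renamed, which completes the write. The source is not explained again here; see the IndexShuffleBlockResolver section of the article on shuffle write preparation.

5. Summary

BypassMergeSortShuffleWriter partitions by file and does no sorting. In the end the partition data is written into one complete file, with an index file that records the size of the FileSegment for each partition. The design is plain, simple, and easy to implement.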

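VII. Spark shuffle writes, part two: UnsafeShuffleWriter

1. Preface

First, the main class that UnsafeShuffleWriter depends on: ShuffleExternalSorter.

2. The external sorter of the sort-based shuffle: ShuffleExternalSorter

Before reading this section, it is worth reviewing task memory management in "spark 源碼分析之二十二-- Task的內存管理", because ShuffleExternalSorter sorts in memory. When a task performs in-memory operations on large amounts of data, that memory has to be managed.

Before analyzing the class itself, we analyze the classes it depends on.

1. Dependency: block metadata, SpillInfo

It records metadata about a block.

Its class structure is:

blockId is the temporary shuffle blockId, file is the spill file that holds all partitions of that spill, and partitionLengths holds the size of each partition.

2. Dependency: the partition sorter, ShuffleInMemorySorter

An array usable with either kind of memory: LongArray

The address of the last element of the array is:

off-heap (baseObj == null): baseOffset + (length - 1) * WIDTH, where baseOffset is a real OS address
on-heap: address(baseObj) + baseOffset + (length - 1) * WIDTH, where baseOffset is relative to the base object

Setting all elements to 0:

Setting an element

It uses the Unsafe class underneath to set the value.

Getting an element

It uses the Unsafe class underneath to get the value.

The record pointer packer: PackedRecordPointer

Full name: org.apache.spark.shuffle.sort.PackedRecordPointer

Member constants:

Packing a record pointer and partition:

Getting the record's address:

Getting the record's partition:

The custom comparator: SortComparator

The idea is simple: sort by partition, so that records of the same partition end up together.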
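The packing and the partition-only ordering can be illustrated together. The layout assumed here, which is not spelled out in the text above, is 24 bits of partition id, then a 13-bit page number, then a 27-bit in-page offset, with the input pointer using the 13-bit / 51-bit layout from the memory-management section. All names in the sketch are illustrative.

```scala
object PackedPointerSketch {
  val MaskLower27Bits: Long = (1L << 27) - 1
  val MaskLower51Bits: Long = (1L << 51) - 1
  val MaskUpper13Bits: Long = ~MaskLower51Bits

  // Assumed layout of the packed value:
  // [24-bit partition id][13-bit page number][27-bit offset in page]
  def pack(recordPointer: Long, partitionId: Int): Long = {
    // Move the page number from bits 51..63 down to bits 27..39.
    val pageNumber = (recordPointer & MaskUpper13Bits) >>> 24
    val compressed = pageNumber | (recordPointer & MaskLower27Bits)
    (partitionId.toLong << 40) | compressed
  }

  def partitionId(packed: Long): Int = (packed >>> 40).toInt

  // Rebuild the 13-bit / 51-bit pointer from the packed value.
  def recordPointer(packed: Long): Long = {
    val page = (packed << 24) & MaskUpper13Bits
    val offset = packed & MaskLower27Bits
    page | offset
  }
}
```

With the partition id in the top bits, sorting the packed longs groups records by partition without ever touching the record data. The 27-bit offset also means a page addressed this way can be at most 128 MB.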

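The iterator over the custom array: ShuffleSorterIterator

It is defined as follows:

The idea is simple. hasNext behaves like the JDK standard library's. In addition there is loadNext, which each time loads the element at the next array position into packedRecordPointer, from which the record's address and partition can then be read.

Getting the iterator

The source for getting the iterator is:

useRadixSort indicates whether radix sort is used. It is used by default and is configured through the spark.shuffle.sort.useRadixSort parameter.

If radix sort is not used, Spark's Sorter is used instead. Sorter is implemented with TimSort, an optimized merge sort.

In either case the data in ShuffleSorterIterator is already sorted and only needs to be read out by iteration.

Inserting data into the custom array

The idea is simple: what is inserted is the record's address and its partition. The two are packed by PackedRecordPointer and stored in the array.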

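3. Inheritance hierarchy

Its inheritance hierarchy is:

It is a subclass of MemoryConsumer and implements the spill method.

4. Member variables

Its member variables are:

DISK_WRITE_BUFFER_SIZE: the size of the buffer used before writing to disk, 1 MB

numPartitions: the number of reduce partitions

TaskContext: the task's execution context.

numElementsForSpillThreshold: the number of elements in the ShuffleInMemorySorter that forces a spill.

fileBufferSizeBytes: the buffer size of the DiskBlockObjectWriter.

diskWriteBufferSize: the size of the buffer used when spilling to disk.

allocatedPages: the allocated memory pages.

spills: the spill information

peakMemoryUsedBytes: the peak memory usage.

inMemSorter: the in-memory sorter

currentPage: the memory page currently in use

pageCursor: the page cursor, marking the current position within the page.

5. Constructor

Its constructor is:

fileBufferSizeBytes: configured through the spark.shuffle.file.buffer parameter, 32k by default

numElementsForSpillThreshold: configured through the spark.shuffle.spill.numElementsForceSpillThreshold parameter, the maximum int value by default.

diskWriteBufferSize: configured through spark.shuffle.spill.diskWriteBufferSize, 1 MB by default

6. Core methods

The main methods are:

We analyze the most important ones.

Spilling

Its source is:

The idea is simple: call writeSortedFile to write the data to a file, free the memory, and reset inMemSorter.

The freeMemory method is as follows:

The source of writeSortedFile is as follows:

In the figure, the steps are roughly divided into four parts. The overall idea: iterate over all partition data in the sorter so that records of the same partition are written into the same FileSegment. These FileSegments together make up one spill file. The FileSegment sizes are stored in a SpillInfo, which is finally added to the spills collection. The third step, getting the address, deserves attention. For an on-heap address, recordPage is the base object and recordOffsetInPage is the record's offset relative to the base object. For an off-heap address, recordPage is null because off-heap addresses have no base object, baseOffset is the absolute address in operating-system memory, and recordOffsetInPage = offsetInPage + baseOffset. For details, see the implementation of TaskMemoryManager in "spark 源碼分析之二十二-- Task的內存管理".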

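Inserting a record

Its source is:

Note: for on-heap memory, baseObject is the allocated array and baseOffset is the byte offset of the data within that array. For off-heap memory, baseObject is null and baseOffset is the address in operating-system memory.

When the address is encoded, the in-page offset for on-heap memory is the offset relative to baseObject, and for off-heap memory it is the real offset minus baseOffset.

Before the record data is written, space is reserved at the current offset for the record length: 4 bytes if the platform supports unaligned access, otherwise 8 bytes. This mirrors how the data is read back when spilling, so it can be compared with the writeSortedFile method above.

The source of org.apache.spark.shuffle.sort.ShuffleExternalSorter#growPointerArrayIfNecessary is as follows:

Explanation: hasSpaceForAnotherRecord compares the next write position in the array with the array's usable capacity. If the position has reached the capacity, there is no room for another record and the array has to grow. used is the amount of the array currently in use, and the new array is twice that size.

The source of org.apache.spark.shuffle.sort.ShuffleExternalSorter#acquireNewPageIfNecessary is as follows:

Explanation: a new memory page is allocated when the current page's cursor plus the required size exceeds the current page's capacity.

Closing and getting the spill information

Its source is:

The idea: perform one last spill, then return the spill information.

Cleaning up resources

The idea: free the in-memory sorter's memory and delete the temporary spill files.

Getting the peak memory usage

The source is as follows:

The idea: if the current memory usage exceeds the recorded peak, update the peak; either way, return the peak.

7. Summary

The in-memory sorter built into this sorter sorts records of the same partition together. When data spills, records of the same partition are gathered into one segment of the spill file.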

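3. Writing data with UnsafeShuffleWriter

Source first, explanation after:

The idea: the flow is simple. Every record is inserted into the sorter one by one, then the sorter is closed and the output file is produced.

Each step is examined below:

1. Initializing the sorter

The constructor of org.apache.spark.shuffle.sort.UnsafeShuffleWriter is as follows:

A few notes:

DEFAULT_INITIAL_SORT_BUFFER_SIZE is 4096

DEFAULT_INITIAL_SER_BUFFER_SIZE is 1 MB

The maximum number of reduce partitions is 16777216

SHUFFLE_FILE_BUFFER_SIZE defaults to 32k and is configured through the spark.shuffle.file.buffer parameter.

SHUFFLE_UNSAFE_FILE_OUTPUT_BUFFER_SIZE defaults to 32k and is configured through the spark.shuffle.unsafe.file.output.buffer parameter.

Its open method is as follows:

This method involves three classes: ShuffleExternalSorter, MyByteArrayOutputStream, and SerializationStream. ShuffleExternalSorter was analyzed above. MyByteArrayOutputStream is a subclass of ByteArrayOutputStream that writes data into on-heap memory. SerializationStream is a serializing stream; the data ends up in the serBuffer memory stream, and after its flush method is called, the stream's internal buf holds the written data:

2. Overview of the write

The source of the core write method is as follows:

It has two main steps. The first iterates over every record and inserts it into the sorter. The second closes the sorter, writes the data to a single shuffle file, and updates the shuffle index. Finally, the resources the sorter used during the shuffle are cleaned up.

First, step one: inserting data into the sorter.

3. Inserting data into the sorter

The key and value of a record are serialized into serBuffer's buf byte array and then inserted into the sorter (ShuffleExternalSorter). In the sorter the serialized data is written to memory (spilling to disk if memory runs short), and its address is recorded in the ShuffleInMemorySorter, as described above.

4. Merging the spill files into one file

After every record has been inserted into the sorter, the sorter's closeAndGetSpills method is called. It performs one last spill and returns all the SpillInfo objects produced during the shuffle. ShuffleBlockResolver is then used to get the shuffle file for the shuffle's blockId, mergeSpills merges all the spill files into that final shuffle file, the shuffle index file is updated, and the MapStatus of the shuffle result is set.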

org.apache.spark.shuffle.sort.UnsafeShuffleWriter#closeAndWriteOutput 源碼如下：

其關鍵方法 org.apache.spark.shuffle.sort.UnsafeShuffleWriter#mergeSpills 源碼如下：

如果溢出文件為0，直接返回全是0的分區數組。

如果溢出文件為1，文件重命名后返回只有一個元素的分區數組。

如果溢出文件多於1個則，多個溢出文件開始merge。

首先先看一下五個變量：

encryptionEnabled：是否啟用加密，默認為false，通過 spark.io.encryption.enabled 參數來設置。

transferToEnabled：是否可以使用nio的transferTo傳輸，默認為true，通過 spark.file.transferTo 參數來設置。

compressionEnabled：是否使用壓縮，默認為true，通過 spark.shuffle.compress 參數來設置。

compressionCodec：默認壓縮類，默認為LZ4CompressionCodec，通過 spark.io.compression.codec 參數來設置。

fastMergeEnabled：是否啟用fast merge，默認為true，通過 spark.shuffle.unsafe.fastMergeEnabled 參數來設置。

fastMergeIsSupported：是否支持 fast merge，如果不使用壓縮或者是壓縮算法是 org.apache.spark.io.SnappyCompressionCodec、org.apache.spark.io.LZFCompressionCodec、org.apache.spark.io.LZ4CompressionCodec、org.apache.spark.io.ZStdCompressionCodec這四種支持連接的壓縮算法中的一種都是可以使用 fast merge的。

三種merge多個文件的方式：transfered-based fast merge、fileStream-based fast merge以及slow merge三種方式。

使用transfered-based fast merge條件：使用 fast merge並且壓縮算法支持fast merge，並且啟用了nio的transferTo傳輸且不啟用文件加密。

使用fileStream-based fast merge條件：使用 fast merge並且壓縮算法支持fast merge，並且未啟用nio的transferTo傳輸或啟用了文件加密。

使用slow merge條件：未使用 fast merge或壓縮算法不支持fast merge。
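The three conditions above can be condensed into a small decision function. This is a minimal sketch, not Spark's code: the `MergeStrategy` type and the `MergeStrategyChooser` object are hypothetical names invented for illustration, and only the four boolean inputs come from the text.

```scala
// Hypothetical names; a simplified model of the merge-path decision described above.
sealed trait MergeStrategy
case object TransferToFastMerge extends MergeStrategy
case object FileStreamFastMerge extends MergeStrategy
case object SlowMerge extends MergeStrategy

object MergeStrategyChooser {
  def choose(
      fastMergeEnabled: Boolean,
      fastMergeIsSupported: Boolean,
      transferToEnabled: Boolean,
      encryptionEnabled: Boolean): MergeStrategy = {
    if (fastMergeEnabled && fastMergeIsSupported) {
      // Fast merge concatenates spill segments without decompressing them.
      if (transferToEnabled && !encryptionEnabled) TransferToFastMerge
      else FileStreamFastMerge
    } else {
      // Otherwise every record passes through the decompress/compress streams.
      SlowMerge
    }
  }
}
```

With the default settings listed above (fast merge on, LZ4, transferTo on, encryption off), `MergeStrategyChooser.choose(true, true, true, false)` yields `TransferToFastMerge`.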

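Now let's look at the three ways of merging spills.

transferTo-based fast merge

The source of its core method org.apache.spark.shuffle.sort.UnsafeShuffleWriter#mergeSpillsWithTransferTo is as follows:

Its dependency org.apache.spark.util.Utils#copyFileStreamNIO is as follows:

This is simple: underneath it relies on the transferTo method of Java NIO.

fileStream-based fast merge

The source of its core method org.apache.spark.shuffle.sort.UnsafeShuffleWriter#mergeSpillsWithFileStream is as follows. No compression codec is passed in here; see the source of org.apache.spark.shuffle.sort.UnsafeShuffleWriter#mergeSpills.

slow merge

Its core method org.apache.spark.shuffle.sort.UnsafeShuffleWriter#mergeSpillsWithFileStream is the same one used by the fileStream-based fast merge, so it is not explained again. The only difference is that a compression codec is passed in; see the source of org.apache.spark.shuffle.sort.UnsafeShuffleWriter#mergeSpills.

5. Updating the shuffle index

For more detail, see the source of org.apache.spark.shuffle.IndexShuffleBlockResolver#writeIndexFileAndCommit. It was already analyzed in the previous article, "The spark shuffle write trilogy: BypassMergeSortShuffleWriter", in the part on writing data with BypassMergeSortShuffleWriter, and is not analyzed again here.

4. Summary

ShuffleExternalSorter keeps spilling data into small spill files. Inside a spill file the data is ordered by partition, while the data inside a partition is unordered.

Data for many partitions spills into the same spill file at once. At the end, one of the three merge methods merges the spill files into a single file, and the data inside each partition remains unordered. The final format is the same as the result of the first kind of shuffle write: a partitioned shuffle data file plus a shuffle index file that records the partition sizes.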

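VIII. The spark shuffle write trilogy: SortShuffleWriter

1. Questions

How is pre-aggregation done in a spark shuffle, and what is the underlying data structure? Pre-aggregation happens when data is written to memory; does it also happen when spill files are read and merged into the final file?
How is shuffle data sorted? Is the data inside a partition ordered? If so, which algorithm does Spark use to sort the keys of each partition?
How does the shuffle spill relate to TaskMemoryManager?
In the spill stage, which algorithm sorts the in-memory data?
In the stage that merges spill files, which algorithm sorts the in-memory data?
Why does reading a spill file into memory return an iterator instead of the data itself?

... and many more details.

2. Preface

This article analyzes the last kind of shuffle write.

We start with the classes that this third kind of shuffle depends on.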

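3. SizeTrackingAppendOnlyMap

This class extends AppendOnlyMap and mixes in the SizeTracker trait.

Its methods are as follows:

Everything it relies on lives in its parents; it is merely a combination of the two. To understand what really happens, look at its parent class AppendOnlyMap and the trait SizeTracker.

1. Parent class AppendOnlyMap

This class extends the Iterable trait and the Serializable interface.

Its class structure is as follows:

Member variables

The member variables are as follows:

LOAD_FACTOR: the load factor, 0.7. When the fraction of occupied slots exceeds the load factor, the table must grow.

mask: maps any number into the range [0, mask].

data: the array that actually stores the data.

haveNullValue: whether a value is stored for the null key. It is needed because null in the array also means "no element at this index".

nullValue: the value stored for the null key.

destroyed: whether the data has already been destroyed.

Theoretical maximum capacity: 2^29 slots (MAXIMUM_CAPACITY = 1 << 29). This is a slot count of about 512M, not a size of 512 MB.

The member methods are as follows:

Getting a value by key

Explanation:

1. If the key is null, return nullValue, the value kept separately for the null key.
2. Otherwise hash the key's hashCode once more, then AND it with the mask to map it into the range [0, mask].
3. Read the slot at that position. If its key is the requested key, return the value. If its key is null, nothing was ever stored, so return null. If its key is a different key, probe onward with pos = (pos + delta) & mask, where delta grows by 1 on every attempt, and repeat step 3 until a matching key or an empty slot is found.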
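The lookup steps above can be sketched as a tiny open-addressing table. This is an illustration under stated assumptions rather than Spark's AppendOnlyMap: `TinyOpenMap` is a hypothetical class, it stores non-null String keys only, it never grows, and `byteswap32` from the Scala standard library stands in for the rehash function.

```scala
import scala.annotation.tailrec
import scala.util.hashing.byteswap32

// Hypothetical, simplified open-addressing map (non-null String keys, Int values, no growth).
final class TinyOpenMap(capacity: Int = 8) {
  require(capacity > 0 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val mask = capacity - 1            // maps any hash into [0, mask]
  private val keys = new Array[String](capacity)
  private val values = new Array[Int](capacity)
  private var size = 0

  // Walk the probe sequence until we hit the key itself or an empty slot.
  @tailrec
  private def probe(key: String, pos: Int, delta: Int): Int = {
    val cur = keys(pos)
    if (cur == null || cur == key) pos
    else probe(key, (pos + delta) & mask, delta + 1)
  }

  // Rehash the hashCode, then mask it to get the starting slot.
  private def slotFor(key: String): Int = probe(key, byteswap32(key.hashCode) & mask, 1)

  def put(key: String, value: Int): Unit = {
    val pos = slotFor(key)
    if (keys(pos) == null) {
      // Keep one slot free so that probing always terminates; a real map would grow here.
      require(size < capacity - 1, "table full")
      keys(pos) = key
      size += 1
    }
    values(pos) = value
  }

  def get(key: String): Option[Int] = {
    val pos = slotFor(key)
    if (keys(pos) == null) None else Some(values(pos))
  }
}
```

Because the capacity is a power of two and delta grows by one each step, the probe sequence eventually visits every slot, which is why a single empty slot is enough to guarantee that a lookup stops.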

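Setting a key-value pair

Idea for updating a key: the same as the lookup, except that once the slot is found the value is updated instead of returned.

Applying a function to the value of a given key

Idea for updating a key: the same as the lookup, except that nothing is returned once the slot is found. If the slot is empty (null), the value is assigned; otherwise the value is replaced with the result of the update function.

Getting the unsorted iterator

In essence this walks the array. The elements are sparse, so only occupied slots are returned. No further explanation is needed.

Getting the sorted iterator (destructiveSortedIterator)

First the array is compacted so that the data is contiguous. Then it is sorted by key. Finally an iterator is returned, and the data in this iterator is ordered.

rehash

Growing the table

If the fraction of capacity in use exceeds the load factor, the table grows.

The new capacity is twice the old capacity. Every non-null element of the old array is mapped into the new array.

2. Parent trait SizeTracker

A general interface for collections to keep track of their estimated sizes in bytes. We sample with a slow exponential back-off using the SizeEstimator to amortize the time, as each call to SizeEstimator is somewhat expensive (order of a few milliseconds).

Member variables

SAMPLE_GROWTH_RATE is the exponential growth factor of the sampling interval. With a factor of 2, for example, samples are taken at updates 1, 2, 4, 8, 16, ... (Spark itself uses 1.1.)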
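To make the exponential back-off concrete, the sketch below lists the update counts at which a sample would be taken for a given growth factor. `SamplingSchedule` and `samplePoints` are hypothetical names, and rounding the next sample point up with `math.ceil` is an assumption about how the interval is advanced.

```scala
object SamplingSchedule {
  // Returns the update counts (1-based) at which a size sample would be taken.
  def samplePoints(growthRate: Double, maxUpdates: Long): List[Long] = {
    require(growthRate > 1.0, "the interval must grow")
    Iterator
      .iterate(1L)(n => math.ceil(n * growthRate).toLong) // next sample point
      .takeWhile(_ <= maxUpdates)
      .toList
  }
}
```

`SamplingSchedule.samplePoints(2.0, 16)` gives the 1, 2, 4, 8, 16 sequence from the text. A factor close to 1 samples far more often, trading estimation accuracy against the cost of each SizeEstimator call.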

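The core methods are as follows:

Sampling

Estimating the size

Resampling

Sampling after an update

3. Dependency -- SizeEstimator

It is mainly used to estimate how much memory data occupies.

4. ExternalAppendOnlyMap

1. Inheritance

Its parent is the abstract class Spillable.

First, the parent class Spillable.

2. Superclass -- Spillable

Class description: when memory runs short, this class spills the in-memory collection to disk.

Its member variables are as follows and are not explained further.

The main methods are as follows:

Spilling memory to disk

It implements the abstract spill method of its parent. The source is as follows:

Idea: the spill is carried out only when the consumer that triggered it is not this object and the memory mode is on-heap.

Its dependencies are as follows:

The source of org.apache.spark.util.collection.Spillable#forceSpill is as follows. It is an abstract method with no implementation.

The method that releases memory, which calls the parent's freeMemory method:

Trying to spill in order to free memory

The source of org.apache.spark.util.collection.Spillable#maybeSpill is as follows:

The spill method it depends on is shown below. Note that this method spills the collection's data to disk; it is abstract and left for subclasses to implement.

This class leaves two methods for subclasses to implement: forceSpill and spill.

ExternalAppendOnlyMap is a further wrapper around SizeTrackingAppendOnlyMap, which was covered above.

3. Data comparator -- HashComparator

Its source is as follows:

In short, it compares keys by their hash codes.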
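A comparator of this kind can be sketched in a few lines. `HashOrdering` is a hypothetical name used for illustration; treating a null key as hash 0 is an assumption.

```scala
// Hypothetical sketch: order keys by hashCode only, so no Ordering on K is required.
final class HashOrdering[K] extends Ordering[K] {
  private def hashOf(k: K): Int = if (k == null) 0 else k.hashCode()

  override def compare(a: K, b: K): Int = Integer.compare(hashOf(a), hashOf(b))
}
```

Such an ordering is only partial: two different keys can compare as equal when their hash codes collide (for example the strings "Aa" and "BB"), so code that merges by hash still has to check key equality before combining values.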

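4. SpillableIterator

It is an inner class of org.apache.spark.util.collection.ExternalAppendOnlyMap that implements the Iterator trait. It is used together with ExternalAppendOnlyMap and also calls methods of ExternalAppendOnlyMap.

Member variables

Its member variables are as follows:

SPILL_LOCK is an object lock. Every spill acquires the lock first, performs the spill, and releases the lock when done.

cur is the next unread element.

hasSpilled indicates whether a spill has happened.

Core methods

1. Spilling

Its source is as follows:

2. Destroying the data and freeing memory

Its dependency org.apache.spark.util.collection.ExternalAppendOnlyMap#freeCurrentMap is as follows:

Reading the next element

Whether there is a next element

Getting the next element

Converting to a CompletionIterator

Summary

In essence it is a wrapper. The data is handed to its constructor as an Iterator, and it is itself an Iterator. Besides implementing the Iterator methods, it can spill to disk, destroy the in-memory data, and convert itself to a CompletionIterator.

5. DiskMapIterator

This class reads the data of a file. The file is divided into multiple segments, and a dedicated array records the size of each segment, as the constructor shows:

Here file is the data file to read, blockId is the blockId of the file in the shuffle system, and batchSizes holds the size of each file segment.

The member variables are as follows:

Below we analyze the whole class, starting from the main Iterator methods.

Whether there is a next element

The source of its dependency org.apache.spark.util.collection.ExternalAppendOnlyMap.DiskMapIterator#readNextItem is as follows:

Idea: read the next key-value pair. If the current batch turns out to be exhausted after the read, call nextBatchStream, which closes the current deserialization stream and initializes the deserialization stream for the next file segment.

Its dependency org.apache.spark.util.collection.ExternalAppendOnlyMap.DiskMapIterator#nextBatchStream is as follows:

Idea: first check whether all batches have been read. If so, clean up and return null. Otherwise close the current deserialization stream, compute the start and end offsets of the next batch, and finally initialize a deserialization stream and return it to the caller.

Its dependency org.apache.spark.util.collection.ExternalAppendOnlyMap.DiskMapIterator#cleanup is as follows:

Idea: close the current deserialization stream and file stream, then delete the file if it exists.

Reading the next element

The idea is simple: nextItem was already deserialized when hasNext was evaluated.

6. Constructors

It has two overloaded constructors:

and

The parameters are:

createCombiner: a function that creates the combined value from one original value.

mergeValue: a function that produces a combined value from a combined value and one original value.

mergeCombiners: a function that produces a combined value from two combined values.

In essence these functions express a step-by-step merge and aggregation.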
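How the three functions cooperate can be sketched with plain Scala collections. This is an illustration, not the ExternalAppendOnlyMap implementation: `CombineSketch` and its two methods are hypothetical, everything stays in memory, and the combiner in the usage note is a (sum, count) pair chosen for the example.

```scala
import scala.collection.mutable

object CombineSketch {
  // Fold raw values into one combiner per key (the map-side step).
  def combineValuesByKey[K, V, C](
      records: Iterator[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] = {
    val acc = mutable.HashMap.empty[K, C]
    records.foreach { case (k, v) =>
      acc(k) = acc.get(k) match {
        case Some(c) => mergeValue(c, v)   // key seen before: fold the value in
        case None    => createCombiner(v)  // first value for this key
      }
    }
    acc.toMap
  }

  // Merge two sets of already-combined values (the step that joins partial results).
  def combineCombiners[K, C](
      left: Map[K, C],
      right: Map[K, C],
      mergeCombiners: (C, C) => C): Map[K, C] =
    right.foldLeft(left) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }
}
```

For a per-key average, `createCombiner` would be `v => (v, 1)`, `mergeValue` would be `(c, v) => (c._1 + v, c._2 + 1)`, and `mergeCombiners` would add the two pairs component-wise; the average is then the sum divided by the count.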

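7. Member variables

serializerBatchSize: the batch size used when writing a spill to file. The batch is counted in objects written, not in bytes as with an ordinary buffer.

_diskBytesSpilled: the total number of bytes spilled.

fileBufferSize: the file buffer size, 32k by default.

_peakMemoryUsedBytes: the peak memory usage.

keyComparator: the comparator used for in-memory sorting.

8. Core method: inserting data

Spilling

Idea: first call destructiveSortedIterator on currentMap, which compacts the internal data, sorts it, and returns the ordered data as an Iterator. Then spill that data to disk, and finally record the spill in spilledMaps. The source of the dependency org.apache.spark.util.collection.ExternalAppendOnlyMap#spillMemoryIteratorToDisk is as follows:

Idea: create a local temporary block and obtain its writer, then walk the iterator over the in-memory array and write all data to the file through the writer. The file is written in batches: a flush is performed every time serializerBatchSize objects have been written, followed by one final flush. The file is then closed and a DiskMapIterator is returned.

Forced spill

Destroying the iterator

Getting the iterator

5. Pre-aggregation class -- Aggregator

Its source is as follows:

The two methods of this class, combineValuesByKey and combineCombinersByKey, both rely on ExternalAppendOnlyMap.

Next, the internals of the ExternalSorter class.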

6. A sorter that supports sorting and pre-aggregation -- ExternalSorter

1. Class description

Sorts and potentially merges a number of key-value pairs of type (K, V) to produce key-combiner pairs of type (K, C). Uses a Partitioner to first group the keys into partitions, and then optionally sorts keys within each partition using a custom Comparator. Can output a single partitioned file with a different byte range for each partition, suitable for shuffle fetches. If combining is disabled, the type C must equal V -- we'll cast the objects at the end. Note: Although ExternalSorter is a fairly generic sorter, some of its configuration is tied to its use in sort-based shuffle (for example, its block compression is controlled by spark.shuffle.compress). We may need to revisit this if ExternalSorter is used in other non-shuffle contexts where we might want to use different configuration settings.

First, its constructor:

2. Constructor

The parameters are:

aggregator: an optional aggregator used to merge data.

partitioner: an optional partitioner. If present, data is sorted by partition ID first and then by key.

ordering: an optional ordering. It sorts by key within each partition, and it may also be a total ordering.

serializer: the serializer used when spilling in-memory data to disk.

Its member variables and core methods are not analyzed yet. The methods revolve around two themes: inserting data, and merging multiple spill files.

Now for some of its inner classes.

3. An iterator that reads the data of one partition only -- IteratorForPartition

This class implements the Iterator trait and only iterates over the data of one specific partition. It is defined as follows:

It is simple and needs no further explanation.

4. Description of a spill file -- SpilledFile

This is a case class that records key information about a spill file. The constructor fields are:

file: the spill file.

blockId: the blockId of the spill file.

serializerBatchSizes: the size of each serialized batch.

elementsPerPartition: the number of elements in each partition.

It is simple and defines no methods.

5. Reading the contents of a spill file -- SpillReader

It reads a spill file that is laid out by partition, and expects the partitions to be read in partition order.

Its class structure is as follows:

Member variables

First its member variables:

batchOffsets: the offset of each serialized batch.

partitionId: the partition ID.

indexInPartition: the index within the partition.

batchId: the ID of the batch.

indexInBatch: the index within the batch.

lastPartitionId: the previous partition ID.

nextPartitionToRead: the ID of the next partition to read.

fileStream: the file input stream.

deserializeStream: the deserialization stream.

nextItem: the next key-value pair.

finished: whether reading has finished.

Now its core methods:

Getting the deserialization stream of the next batch

The idea is very similar to how DiskMapIterator obtains its next stream and is not explained further.

Reading the data of the next partition

It returns an iterator. The source of org.apache.spark.util.collection.ExternalSorter.SpillReader#readNextPartition is as follows:

Idea: the hasNext of the returned iterator first reads the next item. If the item read is null, it returns false, meaning there is no more data to return.

The source of its dependency org.apache.spark.util.collection.ExternalSorter.SpillReader#readNextItem is as follows:

Idea: when the current batch has been fully read, close the stream for that batch and continue with the stream of the next batch.

Its dependency org.apache.spark.util.collection.ExternalSorter.SpillReader#skipToNextPartition is as follows:

To sum up the idea:

Each call reads one partition of the file; when that partition has been read, the reader moves on to the next partition. Reads happen in batches: one partition may span several batches, but one batch belongs to exactly one partition.

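7. SpillableIterator

It closely resembles org.apache.spark.util.collection.ExternalAppendOnlyMap.SpillableIterator and is implemented in much the same way. Both implement the Iterator trait and wrap an Iterator passed to the constructor, so it can be read side by side with the SpillableIterator above.

Its member variables are as follows:

nextUpStream: the stream of the next batch.

1. The Iterator implementation

First the Iterator methods:

2. Spilling

Its source is as follows:

Idea: first create the in-memory iterator, then walk it and spill the data to disk. The key method is spillMemoryIteratorToDisk.

8. Two data structures that hold data before a spill

1. PartitionedAppendOnlyMap

It is a subclass of SizeTrackingAppendOnlyMap and WritablePartitionedPairCollection.

Its source is as follows:

2. PartitionedPairBuffer

This class is backed by an array in which the data is laid out contiguously. It does not support pre-aggregating multiple records with the same key.

It is a subclass of SizeTracker and WritablePartitionedPairCollection.

Its source is as follows:

Inserting data

Growing the array

Getting the sorted iterator

Getting the iterator that reads the array data

Now for the last way of writing shuffle data.

9. Writing data with SortShuffleWriter

This kind of shuffle supports pre-aggregation.

The source of its write operation is as follows:

1. Initializing the sorter

If a combine is required on the map side, aggregator and keyOrdering must be specified. The map-side data is then pre-aggregated and the data inside each partition is ordered, with hashCode as the sort rule.

Otherwise both parameters are None, meaning the map-side data is not pre-aggregated and the data inside each partition is unordered.

2. Inserting data into the sorter

Its source is as follows:

The source of org.apache.spark.util.collection.ExternalSorter#insertAll is as follows:

Idea: if a map-side combine is required, PartitionedAppendOnlyMap is used, because it supports combining. If no map-side combine is required, PartitionedPairBuffer is used, which does no pre-aggregation. After every write, a check is made to see whether the in-memory data needs to be spilled to disk.

Both classes were described in detail above.

The source of its dependency addElementsRead is as follows:

The source of maybeSpillCollection, the core method that spills in-memory data to disk, is as follows:

Idea: a flag, usingMap, says whether the map data structure (PartitionedAppendOnlyMap) is in use. The two branches are almost identical and differ only in the arguments passed to maybeSpill. The memory size used in both cases is an estimate obtained by sampling. Its dependency org.apache.spark.util.collection.Spillable#maybeSpill is as follows:

Idea: if the number of elements read is a multiple of 32 and the memory currently in use has reached the current memory threshold, more memory is requested from TaskMemoryManager. If the request succeeds, the granted amount is returned. Note: while memory is being requested from TaskMemoryManager and not enough is available, org.apache.spark.util.collection.Spillable#spill may also be called, which in turn calls org.apache.spark.util.collection.ExternalSorter#forceSpill. Its source is as follows, where readingIterator is an object of type SpillableIterator.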
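The check described above can be modelled without any Spark classes. This is a minimal sketch under stated assumptions: `SpillDecider` is a hypothetical class, `acquire` stands in for the request to the memory manager and returns the number of bytes actually granted, and the amount requested (enough to double the current usage) is an assumption about the policy.

```scala
// Hypothetical model of the "should we spill now?" decision.
final class SpillDecider(initialThreshold: Long, acquire: Long => Long) {
  private var threshold = initialThreshold

  def currentThreshold: Long = threshold

  def shouldSpill(elementsRead: Long, currentMemory: Long): Boolean = {
    // Only check every 32 elements, and only once usage has reached the threshold.
    if (elementsRead % 32 == 0 && currentMemory >= threshold) {
      val amountToRequest = 2 * currentMemory - threshold // assumed: try to double
      threshold += acquire(amountToRequest)               // may be granted less than asked
      currentMemory >= threshold                          // still over: spill
    } else {
      false
    }
  }
}
```

With a memory manager that grants nothing, `new SpillDecider(100, _ => 0L).shouldSpill(32, 150)` is true. With one that grants every request, the threshold rises to 300 and no spill is needed.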

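Its dependency org.apache.spark.util.collection.Spillable#logSpillage prints some spill logs and is not discussed further.

The source of its dependency org.apache.spark.util.collection.ExternalSorter#spill is as follows:

The idea is fairly simple. It first obtains the iterator of the sorted collection and passes it to org.apache.spark.util.collection.ExternalSorter#spillMemoryIteratorToDisk, which spills the in-memory data to a temporary disk file and returns a SpilledFile object. That object is recorded in spills, the variable that tracks the spill files created while in-memory data was spilled.

The source of the spill method org.apache.spark.util.collection.ExternalSorter#spillMemoryIteratorToDisk is as follows:

First the writer for the serialized file is obtained. Then the data iterator is walked and the data is written to disk. During the write, the size of each segment and the number of elements in each partition are recorded. Finally the spill file, the per-partition element counts, and the size of each segment are wrapped in a SpilledFile object and returned.

3. Merging multiple files into one

Its core code is as follows:

Idea: first initialize a temporary final file (with a uuid suffix), then initialize the blockId, and finally call writePartitionedFile of org.apache.spark.util.collection.ExternalSorter. The data is written to the temporary file, and the size of the FileSegment of each partition in that file is returned.

The source of its key method org.apache.spark.util.collection.ExternalSorter#writePartitionedFile is as follows:

Idea: if no spill file was ever produced, check whether map-side aggregation is required. If it is, the data has been written to the map; otherwise it is in the buffer. The collection is then converted to an iterator that emits the in-memory data in sorted order, and that iterator is walked to write the data into the temporary final file while the partition sizes are updated and returned.

If spill files already exist, partitionedIterator of org.apache.spark.util.collection.ExternalSorter is called first to merge the data and return an iterator over the merged result.

Finally the data of every partition is written to the temporary final file and the partition sizes are updated; the partition sizes are then returned.

The focus below is the merging method org.apache.spark.util.collection.ExternalSorter#partitionedIterator, whose source is as follows:

Note first that when this method is entered through the branch above, the set of earlier spill files is not empty, so the first branch of the method is not executed. It still deserves a brief explanation.

It has three dependencies:

The source of the dependency org.apache.spark.util.collection.ExternalSorter#destructiveIterator is as follows:

Idea: isShuffleSort is true, because we are in the shuffle sort flow, so the first branch is taken and no SpillableIterator is returned.

Note that the comparator here is the same as the one used for in-memory sorting, so the ordering is the same.

The source of the dependency org.apache.spark.util.collection.ExternalSorter#groupByPartition is as follows:

Idea: for each partition, return an IteratorForPartition.

Note: because the set of earlier spill files is not empty at this point, this method is not called.

The source of the dependency org.apache.spark.util.collection.ExternalSorter#merge is as follows:

Idea: merge receives two arguments, the collection of SpilledFile objects that represents the spill files and the iterator that represents the in-memory data.

First, for every spill file a SpillReader is created to read it. Then, for every partition, an IteratorForPartition is created over the in-memory data and an iterator over that partition is read from every spill file. These are combined with the in-memory iterator passed to merge into one collection of iterators.

If pre-aggregation is required, mergeWithAggregation is called. If sorting is required, mergeSort is called to sort the data. If neither applies, flatten is called on the collection to flatten it into a single iterator.

It has two dependencies:

The source of org.apache.spark.util.collection.ExternalSorter#mergeSort is as follows:

Idea: a heap-backed priority queue is used to sort the data, and an iterator is returned. The heap orders the iterators by the element at their head. Each step removes one Iterator from the heap, takes one element from it, and puts the Iterator back, because after an element is removed the head of the Iterator may have changed and its position must be recomputed. With this dynamic dequeue and enqueue, data of the same partition from different Iterators is always emitted together. Note that when ordering or aggregator is specified, the comparator supports two-level sorting: by partition, and then by key inside the partition. The source of that comparator is as follows:

If neither ordering nor aggregator is specified, the comparator is:

That is, the data is sorted by partition only, much like the final format of the second kind of shuffle, and the data inside a partition is unordered.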
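The heap-based merge can be sketched as a generic k-way merge. This is a minimal illustration rather than ExternalSorter#mergeSort itself: `KWayMerge` and `mergeSorted` are hypothetical names, and every input iterator is assumed to be sorted already according to the same ordering.

```scala
import scala.collection.mutable

object KWayMerge {
  // Lazily merges already-sorted iterators into one sorted iterator.
  def mergeSorted[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
    type Buffered = BufferedIterator[T]
    // PriorityQueue dequeues the largest element, so reverse the ordering
    // to make the iterator with the smallest head come out first.
    val heap = mutable.PriorityQueue.empty[Buffered](
      Ordering.by[Buffered, T](_.head)(ord).reverse)
    iterators.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

    new Iterator[T] {
      override def hasNext: Boolean = heap.nonEmpty

      override def next(): T = {
        if (!hasNext) throw new NoSuchElementException("merge exhausted")
        val it = heap.dequeue()
        val item = it.next()
        // The head of this iterator changed, so re-insert it to restore heap order.
        if (it.hasNext) heap.enqueue(it)
        item
      }
    }
  }
}
```

Using (partitionId, key) tuples with the default tuple ordering reproduces the two-level sort described above: `KWayMerge.mergeSorted(Seq(Iterator((0, "a"), (1, "c")), Iterator((0, "b"), (2, "a"))))` emits both partition 0 records before anything from partition 1.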

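The source of org.apache.spark.util.collection.ExternalSorter#mergeWithAggregation is as follows:

Idea: if the data as a whole does not need to be ordered, the combiner is applied to the data as a whole, so that records with the same key end up aggregated together. If the data as a whole must be ordered, the already ordered data is aggregated on the fly while the next element is emitted, and the final data is ordered as a whole.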
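For the ordered case, aggregating while iterating only needs to look at adjacent elements. The sketch below assumes a total ordering, so that equal keys are always adjacent; `SortedAggregate` and `combineSorted` are hypothetical names, and the extra handling needed when only hash codes are compared is left out.

```scala
object SortedAggregate {
  // Combines runs of adjacent equal keys in a key-sorted stream, lazily.
  def combineSorted[K, C](sorted: Iterator[(K, C)])(
      mergeCombiners: (C, C) => C): Iterator[(K, C)] = {
    val in = sorted.buffered
    new Iterator[(K, C)] {
      override def hasNext: Boolean = in.hasNext

      override def next(): (K, C) = {
        val (key, first) = in.next()
        var acc = first
        // Keep consuming while the next record carries the same key.
        while (in.hasNext && in.head._1 == key) {
          acc = mergeCombiners(acc, in.next()._2)
        }
        (key, acc)
      }
    }
  }
}
```

Only one run of equal keys is held at a time, so the output stays ordered and memory use does not grow with the size of the input.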

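4. Creating the index file

Its key source is as follows:

The idea is simple; see the corresponding part of "The spark shuffle write trilogy: UnsafeShuffleWriter".

10. Summary

This article analyzed the last kind of spark shuffle write. Before a spill, the data is kept either in a custom array-backed Map or in a list. If an aggregator is specified, the Map structure is used; it supports map-side pre-aggregation, whereas the list does not.

The data is sorted on every spill. If ordering is specified, it is sorted by partition first and then by key inside each partition, and is finally spilled to a temporary file on disk. In the merge stage, SpillReader reads the data back and sorts it together with the data that was never spilled, so the data can land in the final data file fully ordered.

This completes the analysis of the three kinds of spark shuffle write. A later article will analyze the shuffle read.

IX. spark shuffle read

1. Questions

How is shuffle data transferred? Is a whole file transferred, or only the part of the file that belongs to the reduce task?
Does the shuffle read spill? How is that handled?
Can the shuffle read sort and aggregate? How?

...

2. Overview

1. Computing or reading an RDD

The source of org.apache.spark.rdd.RDD#iterator is as follows. It is a final method implemented only here; subclasses are not allowed to override it:

Idea: if the RDD has a storage level other than NONE (that is, it is marked for caching), org.apache.spark.rdd.RDD#getOrCompute is called, which obtains the data from the underlying storage system or by recomputation. Otherwise org.apache.spark.rdd.RDD#computeOrReadCheckpoint is called, which obtains the data from a checkpoint or by computation.

We look at its dependencies one by one:

The source of org.apache.spark.rdd.RDD#getOrCompute is as follows:

It first tries to get the block from Spark's underlying storage system. If the storage system does not have it, org.apache.spark.rdd.RDD#computeOrReadCheckpoint is called. Its source is as follows:

There are three main ways to obtain the data: through Spark's underlying storage system, through the parent RDD's checkpoint, or by direct computation.

2. Handling the returned data

After the read, the data is handled in much the same way: it is returned as an iterator using org.apache.spark.InterruptibleIterator. The source of org.apache.spark.InterruptibleIterator is as follows:

It is simple. It uses the delegation pattern and hands the "iterate to the next element" behavior over to the wrapped iterator.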
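The delegation idea can be shown with a small wrapper. This is a sketch, not Spark's InterruptibleIterator: `CancellableIterator` is a hypothetical class, and an `AtomicBoolean` stands in for the task's cancellation state.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper: forwards to `delegate`, but checks a cancel flag first.
final class CancellableIterator[T](cancelled: AtomicBoolean, delegate: Iterator[T])
    extends Iterator[T] {

  override def hasNext: Boolean = {
    // Checking on every hasNext lets a long-running loop stop promptly.
    if (cancelled.get()) throw new InterruptedException("task was cancelled")
    delegate.hasNext
  }

  override def next(): T = delegate.next()
}
```

Because the wrapper is itself an Iterator, callers do not need to know it is there; they simply see an exception from `hasNext` once the flag is set.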

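Now let's look at the implementation details of the three ways of obtaining data.

3. Through Spark's underlying storage system

Its core source is as follows:

Idea: first fetch the block from the storage system of the local or a remote executor. If the block exists, it is returned directly. If it does not, computeOrReadCheckpoint is called to compute the partition or read it from the parent RDD's checkpoint, and the data is persisted according to its persistence level (StorageLevel). For more on persistence, see the Spark storage part of the Spark source analysis series.

4. Through the parent RDD's checkpoint

Its core source is as follows:

Reading through the parent RDD's checkpoint also ends up going through Spark's underlying storage system or direct computation to produce the data.

No further explanation is given.

Now to the main topic: how the shuffle read is carried out.

5. Direct computation

Its core method is as follows:

First, org.apache.spark.rdd.RDD#compute is an abstract method.

We look at the implementation with which the reduce side of a shuffle reads map data.

The shuffle result is represented by org.apache.spark.rdd.ShuffledRDD.

Its compute method is as follows:

Overall idea: obtain a ShuffleReader object from the shuffleManager, call the reader's read method to read the data, and finally cast the result to Iterator[(K, C)].

The shuffleManager here is org.apache.spark.shuffle.sort.SortShuffleManager.

The source of its getReader is as follows:

A brief description of the parameters:

handle: an instance of ShuffleHandle, which has three subclasses; see the article on the preparation work for spark shuffle writes for details.

startPartition: the index of the first partition.

endPartition: the index of the end partition.

context: the context object of the running Task.

It returns an org.apache.spark.shuffle.BlockStoreShuffleReader object, which we look at next.

6. BlockStoreShuffleReader

The inheritance of this class is as follows:

ShuffleReader is described as follows:

Obtained inside a reduce task to read combined records from the mappers.

ShuffleReader has only one method, read. Its subclass BlockStoreShuffleReader is also simple and only implements read.

We go straight to the source of this method.

In the figure above, the whole flow is divided into 5 steps: obtaining the block input streams, deserializing the input streams, adding monitoring, aggregating the data, and sorting the data.

We look at these 5 steps in turn. None of the input streams and iterators in these steps loads a large data set into memory all at once. That holds even in the aggregation and sorting stages, although those stages may spill data. Below we see how each step works in detail.

7. Obtaining the block input streams

Its core source is as follows:

First, a closer look at ShuffleBlockFetcherIterator.

1. Obtaining input streams with ShuffleBlockFetcherIterator

This class is used to obtain the input streams of blocks.

Passing blockId and related information to the constructor

Its constructor is as follows:

It extends the Iterator trait and is an Iterator[(BlockId, InputStream)].

More on the constructor parameters:

context: the TaskContext, the context object of the running task.

shuffleClient: NettyBlockTransferService by default, or ExternalShuffleClient when the external shuffle service is used.

blockManager: the core class of the underlying storage system.

blocksByAddress: the blockManager information and the block information of the required blocks.

It is obtained through org.apache.spark.MapOutputTracker#getMapSizesByExecutorId, whose source is as follows:

The source of org.apache.spark.MapOutputTrackerWorker#getStatuses is as follows:

Idea: if the MapStatus for the shuffleId is available locally, return it. Otherwise use MapOutputTrackerMasterEndpointRef to ask the MapOutputTrackerMaster on the driver for the corresponding MapStatus information.

The source of org.apache.spark.MapOutputTracker#convertMapStatuses is as follows:

Idea: convert the MapStatus into an iterator over BlockManagerId, BlockId, and the corresponding size.

streamWrapper: a wrapper that decrypts and decompresses the input stream. The source of its dependency org.apache.spark.serializer.SerializerManager#wrapStream is as follows:

Reading the data

The next method keeps reading the input streams of remote and local blocks. It is not analyzed in detail here; see the methods related to next in ShuffleBlockFetcherIterator.scala.

8. Deserializing the input streams

The core method is as follows:

The source of its dependency scala.collection.Iterator#flatMap is as follows:

As this shows, even here the data does not all land in memory. A stream is much like a pipe: the data is never loaded into memory all at once. Iterators are simply chained one after another to form a new processing chain. At every link of the chain the data is loaded into memory lazily, which is a useful technique for processing large volumes of data. It is also a concrete form of the chain of responsibility pattern.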
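The laziness of such an iterator chain is easy to observe with plain Scala iterators. In this sketch, `LazyChain` and the `opened` counter are hypothetical; each inner Seq plays the role of one block's input stream.

```scala
object LazyChain {
  // Returns how many "streams" were opened and the first three records read.
  def demo(): (Int, List[Int]) = {
    var opened = 0
    val streams = Iterator(Seq(1, 2), Seq(3, 4), Seq(5, 6)) // three pretend block streams
    val records = streams
      .flatMap { s => opened += 1; s.iterator }              // "deserialize" one stream at a time
      .map(_ * 10)                                           // another link in the chain
    val firstThree = records.take(3).toList
    (opened, firstThree)
  }
}
```

Reading three records never touches the third stream: nothing is opened until a downstream consumer actually asks for the next element, which is the property the shuffle reader relies on.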

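9. Adding monitoring

This step is essentially no different from the previous one: it adds another link to the chain of responsibility. Its core source is as follows:

The source of the core method scala.collection.Iterator#map is as follows:

Once again a new iterator stage is added to the chain.

10. Aggregating the data

Aggregating the data is also simple.

Its core source is as follows: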

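Aggregation involves spilling data, and if a spill happens it also involves the spill-and-merge operation of ExternalSorter.

The core source is not explained further; if you are interested, see "The spark shuffle write trilogy: SortShuffleWriter".

11. Sorting the data

Sorting the data is also simple. If sorting is requested, ExternalSorter sorts the data inside each partition.

Its core source is as follows:

1. Summary

The summary covers implementation details and design ideas.

2. Implementation details

ShuffleBlockFetcherIterator first fetches blocks from the local or remote nodes and turns them into streams, returning an iterator over a small portion of the data at a time. Deserialization, decompression, and decryption of the streams are then wrapped in an iterator that runs after it, followed by the iterators for monitoring, aggregation, sorting, and so on. These iterators keep the processing of large volumes of data efficient. In the aggregation and sorting stages, large data sets are continually spilled to disk, and the data is still returned as an iterator. This keeps memory from being filled by large data sets and improves throughput and processing efficiency.

3. Design ideas

Three points on the design:

Chain of responsibility and iterators are used together, which makes the program easy to extend, makes the processing stages pluggable, and keeps the flow clear.
On aggregation and sorting: as mentioned in the earlier articles on the shuffle write, the aggregation and sorting classes are independent and only loosely coupled to shuffle processing, so the classes that handle in-memory sorting, aggregation, and spilling can be reused in both the read and the write stage of a shuffle.
The design of the shuffle data is also clever. The shuffle data is partitioned by reduceId and the partition information is stored in the index file, so each reduce task only needs to fetch the part of a file that belongs to its partition. This greatly reduces the network transfer of useless data and makes the shuffle more efficient. It is also worth noting that the format of the shuffle data is a contract: however the data is processed in the map stage, its final form is fixed in advance, which greatly reduces the coupling between the processing classes of the map and reduce stages.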
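How an index of partition sizes lets a reduce task locate only its own bytes can be sketched as follows. The layout assumed here, one cumulative offset per partition boundary starting at 0, is an illustration; `ShuffleIndexSketch` and its methods are hypothetical names.

```scala
object ShuffleIndexSketch {
  // offsets(i) is where partition i starts; offsets(i + 1) is where it ends.
  def offsets(partitionLengths: Array[Long]): Array[Long] =
    partitionLengths.scanLeft(0L)(_ + _)

  // Returns (start offset, length) of the segment that belongs to one reduce partition.
  def segmentFor(offsets: Array[Long], reduceId: Int): (Long, Long) = {
    val start = offsets(reduceId)
    val length = offsets(reduceId + 1) - start
    (start, length)
  }
}
```

For partition lengths 10, 0 and 25 the offsets are 0, 10, 10 and 35, so reduce task 2 reads 25 bytes starting at offset 10 and never touches the rest of the data file.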

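This completes the detailed analysis of the spark shuffle stage.

VIII. spark sql

I. A spark sql execution plan example

1. Preface

A SQL statement goes through lexical analysis, syntax analysis, the logical plan, and the physical plan before it is finally turned into an executable RDD, with many steps in between. Lexical and syntax analysis are both done by ANTLR4; studying ANTLR4 is a good way to learn more.

This article gives a brief description of the logical and physical plans generated for a simple SQL statement.

2. Sample code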

case class Person(name: String, age: Long)
private def runBasicDataFrameExample2(spark: SparkSession): Unit = {
  import spark.implicits._
  val df: DataFrame = spark.sparkContext
    .parallelize(
      Array(
        Person("zhangsan", 10),
        Person("lisi", 20),
        Person("wangwu", 30))).toDF("name", "age")
  df.createOrReplaceTempView("people")
  spark.sql("select * from people where age >= 20").show()
}

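3. Example of the generated logical and physical plans

The generated logical and physical plans; the right-hand side shows the corresponding result obtained from the toString method of QueryExecution.

4. Analysis of the key QueryExecution source

I made a simple analysis of the key source, shown in the figure below:

Here SparkSqlParser uses AstBuilder to generate the unresolved LogicalPlan.

5. Closing

Note that a Spark SQL statement submitted from the driver goes through lexical analysis, syntax analysis, and the logical plan to reach an executable physical plan. The first three parts belong to the spark catalyst submodule and have little to do with which SQL execution engine finally runs the query. The physical plan is the basis and precondition for the later conversion to RDDs.

This article touches on all the key steps of Spark SQL. QueryExecution can be analyzed further; I recommend modifying the SparkSQL source and debugging locally. Later articles will go deeper, combining the book 《SparkSQL 內核剖析》 with problems I ran into at work and in study.

II. How can the abstract syntax tree generated by SparkSQL be inspected?

1. Preface

Section 4.3 of 《Spark SQL內核剖析》 notes that the nodes of the abstract syntax tree generated in the Catalyst framework all have names ending in Context. The tree is produced when ANTLR4 and the generated SqlBaseParser parse the SQL; that part of the source is the syntax analysis, and the nodes of the resulting tree are all subclasses of ParserRuleContext.

2. Questions

ANTLR4 parses SQL into an abstract syntax tree. What does this tree finally look like, and how can it be inspected?

1. Test case

spark.sql("select id, count(name) from student group by id").show()

2. Entry point in the source

The sql method of SparkSession is as follows: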

def sql(sqlText: String): DataFrame = {
    // Step 1: parse the SQL text into a LogicalPlan.
    // sqlParser is a SparkSqlParser.
    val logicalPlan: LogicalPlan = sessionState.sqlParser.parsePlan(sqlText)
    // Step 2: build a DataFrame from the LogicalPlan.
    val frame: DataFrame = Dataset.ofRows(self, logicalPlan)
    frame
  }

3. Locating SparkSqlParser

The entry code involves the key class SessionState, which is initialized as follows:

lazy val sessionState: SessionState = {
    parentSessionState
      .map(_.clone(this))
      .getOrElse {
        // Build through org.apache.spark.sql.internal.SessionStateBuilder
        val state = SparkSession.instantiateSessionState(
          SparkSession.sessionStateClassName(sparkContext.conf),
          self)
        initialSessionOptions.foreach { case (k, v) => state.conf.setConfString(k, v) }
        state
      }
  }

The method org.apache.spark.sql.SparkSession$#sessionStateClassName is as follows:

private def sessionStateClassName(conf: SparkConf): String = {
    // spark.sql.catalogImplementation is either hive or in-memory; the default is in-memory
    conf.get(CATALOG_IMPLEMENTATION) match {
      case "hive" => HIVE_SESSION_STATE_BUILDER_CLASS_NAME // hive implementation: org.apache.spark.sql.hive.HiveSessionStateBuilder
      case "in-memory" => classOf[SessionStateBuilder].getCanonicalName // org.apache.spark.sql.internal.SessionStateBuilder
    }
  }

The builder pattern is used here: org.apache.spark.sql.internal.SessionStateBuilder exists to build the SessionState. SparkSession.instantiateSessionState shows the details:

/**
   * Helper method to create an instance of `SessionState` based on `className` from conf.
   * The result is either `SessionState` or a Hive based `SessionState`.
   */
  private def instantiateSessionState(
      className: String,
      sparkSession: SparkSession): SessionState = {
    try {
      // org.apache.spark.sql.internal.SessionStateBuilder
      // invoke `new [Hive]SessionStateBuilder(SparkSession, Option[SessionState])`
      val clazz = Utils.classForName(className)
      val ctor = clazz.getConstructors.head
      ctor.newInstance(sparkSession, None).asInstanceOf[BaseSessionStateBuilder].build()
    } catch {
      case NonFatal(e) =>
        throw new IllegalArgumentException(s"Error while instantiating '$className':", e)
    }
  }

BaseSessionStateBuilder has two main implementations: org.apache.spark.sql.hive.HiveSessionStateBuilder (hive mode) and org.apache.spark.sql.internal.SessionStateBuilder (in-memory mode, the default).

The source of the method org.apache.spark.sql.internal.BaseSessionStateBuilder#build is as follows:

/**
   * Build the [[SessionState]].
   */
  def build(): SessionState = {
    new SessionState(
      session.sharedState,
      conf,
      experimentalMethods,
      functionRegistry,
      udfRegistration,
      () => catalog,
      sqlParser,
      () => analyzer,
      () => optimizer,
      planner,
      streamingQueryManager,
      listenerManager,
      () => resourceLoader,
      createQueryExecution,
      createClone)
  }

SessionState takes many parameters. The key ones are:

conf: the SQLConf object that holds the configuration of the SparkSession.

functionRegistry: the FunctionRegistry object, responsible for registering functions. It keeps an internal map of the registered functions.

udfRegistration: the UDFRegistration object, used to register UDFs. It relies on FunctionRegistry.

catalogBuilder: () => SessionCatalog: returns the SessionCatalog object, which manages the Catalog of the SparkSession.

sqlParser: ParserInterface: in practice a SparkSqlParser instance, which internally uses AstBuilder to turn the parsed SQL syntax tree into a logical plan.

analyzerBuilder: () => Analyzer: provided by org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer, which customizes org.apache.spark.sql.catalyst.analysis.Analyzer.

optimizerBuilder: () => Optimizer: provided by org.apache.spark.sql.internal.BaseSessionStateBuilder.optimizer, which customizes org.apache.spark.sql.execution.SparkOptimizer.

planner: SparkPlanner: provided by org.apache.spark.sql.internal.BaseSessionStateBuilder.planner, which customizes org.apache.spark.sql.execution.SparkPlanner.

resourceLoaderBuilder: () => SessionResourceLoader: returns the resource loader, mainly used to load the jars or resources of functions.

createQueryExecution: LogicalPlan => QueryExecution: creates a QueryExecution object from a LogicalPlan.

4. The parsePlan method

SparkSqlParser does not implement this method itself. The implementation is in its parent class AbstractSqlParser:

/** Creates LogicalPlan for a given SQL string. */
    // Generate the logical plan (LogicalPlan) from the SQL statement
  override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
      val singleStatementContext: SqlBaseParser.SingleStatementContext = parser.singleStatement()
    astBuilder.visitSingleStatement(singleStatementContext) match {
      case plan: LogicalPlan => plan
      case _ =>
        val position = Origin(None, None)
        throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
    }
  }

The block after the parse call is a callback function that is invoked inside the parse method:

The source of org.apache.spark.sql.execution.SparkSqlParser#parse is as follows:

private val substitutor = new VariableSubstitution(conf) // variable substitutor

  protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    super.parse(substitutor.substitute(command))(toResult)
  }

Here substitutor is a variable substitutor that replaces the variables in the SQL text. Next, the parse method of the parent class AbstractSqlParser:

protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    logDebug(s"Parsing command: $command")

    // Lexical analysis
    val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
    lexer.removeErrorListeners()
    lexer.addErrorListener(ParseErrorListener)
    lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

    // Syntax analysis
    val tokenStream = new CommonTokenStream(lexer)
    val parser = new SqlBaseParser(tokenStream)
    parser.addParseListener(PostProcessor)
    parser.removeErrorListeners()
    parser.addErrorListener(ParseErrorListener)
    parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

    try {
      try {
        // first, try parsing with potentially faster SLL mode
        parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
        // Use AstBuilder to generate the unresolved LogicalPlan
        toResult(parser)
      }
      catch {
        case e: ParseCancellationException =>
          // if we fail, parse with LL mode
          tokenStream.seek(0) // rewind input stream
          parser.reset()

          // Try Again.
          parser.getInterpreter.setPredictionMode(PredictionMode.LL)
          toResult(parser)
      }
    }
    catch {
      case e: ParseException if e.command.isDefined =>
        throw e
      case e: ParseException =>
        throw e.withCommand(command)
      case e: AnalysisException =>
        val position = Origin(e.line, e.startPosition)
        throw new ParseException(Option(command), e.message, position, position)
    }
  }

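This method sets up the ANTLR4 lexer and parser and then calls toResult(parser), where toResult is the callback passed in by parsePlan. Inside that callback, parser.singleStatement() parses the SQL into the abstract syntax tree (AST).

By the time astBuilder.visitSingleStatement is called, the AST has already been generated.

5. Printing the generated AST

1. Source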

override def visitSingleStatement(ctx: SingleStatementContext): LogicalPlan = withOrigin(ctx) {
    val statement: StatementContext = ctx.statement
    printRuleContextInTreeStyle(statement, 1)
    // Visit the AST (via accept) to generate the logical plan
    visit(statement).asInstanceOf[LogicalPlan]
  }

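Before the visitor pattern walks the AST nodes to generate the unresolved LogicalPlan, I added a method that prints the freshly parsed abstract syntax tree. The code of printRuleContextInTreeStyle is as follows: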

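/**
   * Prints the abstract syntax tree in a tree layout.
   * Assumes an implicit Java-to-Scala collection conversion
   * (for example scala.collection.JavaConversions._) is in scope for foreach.
   */
  private def printRuleContextInTreeStyle(ctx: ParserRuleContext, level:Int): Unit = {
    val prefix:String = "|"
    val curLevelStr: String = "-" * level
    val childLevelStr: String = "-" * (level + 1)
    println(s"${prefix}${curLevelStr} ${ctx.getClass.getCanonicalName}")
    val children: util.List[ParseTree] = ctx.children
    if( children == null || children.size() == 0) {
      return
    }
    children.iterator().foreach {
      // Rule nodes: recurse one level deeper.
      case context: ParserRuleContext => printRuleContextInTreeStyle(context, level + 1)
      // Other children (terminal nodes) are printed with the enclosing rule's class name,
      // which is why a context name shows up again one level deeper in the output.
      case _ => println(s"${prefix}${childLevelStr} ${ctx.getClass.getCanonicalName}")
    }
  }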

2. Printed examples for three kinds of SQL

SQL example 1 (with where)

The generated AST is as follows:

|- org.apache.spark.sql.catalyst.parser.SqlBaseParser.StatementDefaultContext
|-- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryContext
|--- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SingleInsertQueryContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryTermDefaultContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryPrimaryDefaultContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.RelationContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableNameContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableIdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableAliasContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ConstantDefaultContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NumericLiteralContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext

SQL example 2 (with sorting)

select name from student where age > 18 order by id desc

The generated AST is as follows:

|- org.apache.spark.sql.catalyst.parser.SqlBaseParser.StatementDefaultContext
|-- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryContext
|--- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SingleInsertQueryContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryTermDefaultContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryPrimaryDefaultContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.RelationContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableNameContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableIdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableAliasContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ConstantDefaultContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NumericLiteralContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SortItemContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.SortItemContext

SQL example 3 (with grouping)

select id, count(name) from student group by id

The generated AST is as follows:

|- org.apache.spark.sql.catalyst.parser.SqlBaseParser.StatementDefaultContext
|-- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryContext
|--- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SingleInsertQueryContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryTermDefaultContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryPrimaryDefaultContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.FunctionCallContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QualifiedNameContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FunctionCallContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|----------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FunctionCallContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.RelationContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableNameContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableIdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableAliasContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.AggregationContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.AggregationContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.AggregationContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext

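5. Summary

This chapter started from the test code, followed how ANTLR4 is invoked to parse SQL into an AST, and modified the source to print that AST. For now, the ANTLR step that turns SQL into an AST remains a black box, but we have obtained the input that the later stages of Spark SQL consume.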

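Chapter 9: Spark Streaming

I. Spark Streaming receiving Kafka messages, part 1: the two receiving modes

The source code analyzed here is from Spark 1.6.

First, look at the class documentation of org.apache.spark.streaming.dstream.InputDStream: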

This is the abstract base class for all input streams. This class provides methods start() and stop() which is called by Spark Streaming system to start and stop receiving data. Input streams that can generate RDDs from new data by running a service/thread only on the driver node (that is, without running a receiver on worker nodes), can be implemented by directly inheriting this InputDStream. For example, FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for new files and generates RDDs with the new files. For implementing input streams that requires running a receiver on the worker nodes, use org.apache.spark.streaming.dstream.ReceiverInputDStream as the parent class.

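In translation:

InputDStream is the abstract base class of all input streams. It provides the start and stop methods, which the Spark Streaming system calls to start and stop receiving data.
There are two ways to receive data:
1. The input stream runs a thread or service on the driver node only and generates RDDs from new data. Such a stream inherits directly from InputDStream.
2. The input stream runs a receiver on worker nodes and generates RDDs from new data. Such a stream inherits from org.apache.spark.streaming.dstream.ReceiverInputDStream.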

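In other words, in Spark 1.6 the abstract base class of all input streams is org.apache.spark.streaming.dstream.InputDStream. Its subclasses are shown in the figure below:

The two classes that integrate with Kafka are marked in the figure above.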

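Here is a brief comparison of the two approaches:

Similarities:

1. Both fetch messages through SimpleConsumer internally, and the from offset and until offset are already determined before the fetch.

2. Both must determine the leader, the offset, and related information before constructing the FetchRequest.

3. Both have an internal rate estimator that balances the ingestion rate.

Differences:

Offset management differs.

DirectKafkaInputDStream can manage offsets in an external store, for example a database such as Redis or MySQL, or HBase.

KafkaInputDStream relies on ZooKeeper to manage consumer offsets, so it has to monitor the state of ZooKeeper internally.

The node on which the receiver runs differs.

The "receiver" of DirectKafkaInputDStream runs on the driver node. Strictly speaking it has no Receiver: the driver only computes the offset ranges of each batch, and the messages are pulled by KafkaRDD tasks on executors.

The receiver of KafkaInputDStream runs on an executor, not on the driver.

The underlying RDDs differ.

DirectKafkaInputDStream produces a KafkaRDD, whose internal iterator is KafkaRDDIterator.

KafkaInputDStream produces a WriteAheadLogBackedBlockRDD or a BlockRDD, whose internal iterator is a custom NextIterator.

The mechanisms for guaranteeing exactly-once semantics differ.

DirectKafkaInputDStream relies on offsets and the KafkaRDD mechanism to guarantee exactly-once semantics.

KafkaInputDStream relies on the offsets in ZooKeeper plus the WAL mechanism to guarantee exactly-once semantics: after a message is received, it is first saved to the WAL under the checkpoint directory.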
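
To make the "offsets in an external store" idea concrete, here is a minimal sketch, assuming the store can commit a batch's output and its until offset in one atomic step. The Store class and its commit method are hypothetical and stand in for a transaction against Redis, MySQL, or HBase. They are not part of Spark or Kafka.

```scala
object OffsetStoreSketch {
  /** Hypothetical external store: keeps an aggregate and the offset it covers. */
  final class Store {
    private var committedOffset = 0L
    private var total = 0L

    /** Apply a batch only if it starts exactly where the last commit ended.
      * A replayed or out-of-order batch is rejected, so its output is never counted twice. */
    def commit(fromOffset: Long, untilOffset: Long, batchSum: Long): Boolean = synchronized {
      if (fromOffset != committedOffset) {
        false
      } else {
        total += batchSum
        committedOffset = untilOffset
        true
      }
    }

    /** (last committed until offset, aggregate so far) */
    def state: (Long, Long) = synchronized((committedOffset, total))
  }
}
```

Because the result and the offset move together, re-running a batch after a failure leaves the stored state unchanged.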

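II. Spark Streaming receiving Kafka messages, part 2: the receiver running on the driver side

Let us go through the source to understand how DirectKafkaInputDStream ensures exactly-once semantics when Kafka is the input stream.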

val stream: InputDStream[(String, String, Long)] = KafkaUtils.createDirectStream
      [String, String, StringDecoder, StringDecoder, (String, String, Long)](
        ssc, kafkaParams, fromOffsets,
        (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message(), mmd.offset))

The corresponding source is as follows:

def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag,
    R: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      messageHandler: MessageAndMetadata[K, V] => R
  ): InputDStream[R] = {
    val cleanedHandler = ssc.sc.clean(messageHandler)
    new DirectKafkaInputDStream[K, V, KD, VD, R](
      ssc, kafkaParams, fromOffsets, cleanedHandler)
  }

The class documentation of DirectKafkaInputDStream reads:

A stream of org.apache.spark.streaming.kafka.KafkaRDD where each given Kafka topic/partition corresponds to an RDD partition. The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number of messages per second that each partition will accept. Starting offsets are specified in advance, and this DStream is not responsible for committing offsets, so that you can control exactly-once semantics. For an easy interface to Kafka-managed offsets, see org.apache.spark.streaming.kafka.KafkaCluster

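In short, it is a stream of KafkaRDDs in which every partition of every given topic corresponds to one RDD partition.

The parent class InputDStream documents the compute method as follows:

Method that generates a RDD for the given time
That is, it generates a new RDD for the given batch time.

This is the entry point for RDD generation: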

override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  // 1. First obtain the until offsets of this batch
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
  // 2. Create the KafkaRDD instance
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

  // Report the record number and metadata of this batch interval to InputInfoTracker.
  // Compute the offset range of each partition for this batch
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp) // 獲取 until offset
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  // 3. Report this batch's metadata and offset information to InputInfoTracker
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")
  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
  // 4. Update the current offsets
  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}
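
The bookkeeping in steps 1 to 4 can be sketched without Spark as follows. TopicPartition, OffsetSpan, and nextBatch are hypothetical stand-ins for TopicAndPartition, OffsetRange, and the body of compute. The point is only that the until offsets of one batch become the from offsets of the next, so consecutive batches neither overlap nor leave gaps.

```scala
object BatchOffsetsSketch {
  // Hypothetical stand-ins for Kafka's TopicAndPartition and Spark's OffsetRange.
  final case class TopicPartition(topic: String, partition: Int)
  final case class OffsetSpan(tp: TopicPartition, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  /** One batch: pair every current offset with its until offset, then advance.
    * Returns the spans of this batch and the current offsets for the next batch. */
  def nextBatch(
      currentOffsets: Map[TopicPartition, Long],
      untilOffsets: Map[TopicPartition, Long]
  ): (Seq[OffsetSpan], Map[TopicPartition, Long]) = {
    val spans = currentOffsets.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, from) => OffsetSpan(tp, from, untilOffsets(tp)) }
    // The until offsets of this batch are the starting point of the next one.
    (spans, untilOffsets)
  }
}
```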

1. Obtaining the until offsets of this batch

Let us look in detail at how the leader offsets are obtained, that is, the latestLeaderOffsets method:

@tailrec
protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {

  val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
  // Either.fold would confuse @tailrec, do it manually
  if (o.isLeft) { // Left carries the error
    val err = o.left.get.toString
    if (retries <= 0) {
      throw new SparkException(err)
    } else {
      log.error(err)
      Thread.sleep(kc.config.refreshLeaderBackoffMs)
      latestLeaderOffsets(retries - 1)
    }
  } else { // Right carries the result
    o.right.get
  }
}
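
The retry pattern of latestLeaderOffsets can be isolated in a few lines. This is a sketch under assumptions: the generic retry helper and its String error type are hypothetical, and backoffMs stands in for kc.config.refreshLeaderBackoffMs. Like the original, it matches on the Either by hand so that the recursive call stays in tail position.

```scala
import scala.annotation.tailrec

object RetrySketch {
  /** Call `attempt` until it returns Right, or throw once the retries run out.
    * The first call plus `retries` further calls are made at most. */
  @tailrec
  def retry[A](retries: Int, backoffMs: Long)(attempt: () => Either[String, A]): A =
    attempt() match {
      case Right(value) => value
      case Left(err) =>
        if (retries <= 0) {
          throw new RuntimeException(err)
        } else {
          Thread.sleep(backoffMs) // give the cluster time to elect a new leader
          retry(retries - 1, backoffMs)(attempt)
        }
    }
}
```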

Now consider the call kc.getLatestLeaderOffsets(currentOffsets.keySet). The field is declared as protected val kc = new KafkaCluster(kafkaParams), so this calls getLatestLeaderOffsets of KafkaCluster. The call chain is as follows:

def getLatestLeaderOffsets(
    topicAndPartitions: Set[TopicAndPartition]
  ): Either[Err, Map[TopicAndPartition, LeaderOffset]] =
  getLeaderOffsets(topicAndPartitions, OffsetRequest.LatestTime)
// which calls the following overload:
def getLeaderOffsets(
    topicAndPartitions: Set[TopicAndPartition],
    before: Long
  ): Either[Err, Map[TopicAndPartition, LeaderOffset]] = {
  getLeaderOffsets(topicAndPartitions, before, 1).right.map { r =>
    r.map { kv =>
      // mapValues isnt serializable, see SI-7005
      kv._1 -> kv._2.head
    }
  }
}
// That overload in turn calls the one below, which fetches the leader offsets; here, the latest (largest) offset:
def getLeaderOffsets(
    topicAndPartitions: Set[TopicAndPartition],
    before: Long,
    maxNumOffsets: Int
  ): Either[Err, Map[TopicAndPartition, Seq[LeaderOffset]]] = {
  // Look up the host and port of the leader of every partition
  findLeaders(topicAndPartitions).right.flatMap { tpToLeader =>
    // tp -> (l.host -> l.port) ==> (l.host -> l.port) -> Seq[tp]
    val leaderToTp: Map[(String, Int), Seq[TopicAndPartition]] = flip(tpToLeader)
    // Connection endpoints of all leaders
    val leaders = leaderToTp.keys
    var result = Map[TopicAndPartition, Seq[LeaderOffset]]()
    val errs = new Err
    // Ask each leader for the offsets of its partitions; here, the latest offset
    withBrokers(leaders, errs) { consumer =>
      val partitionsToGetOffsets: Seq[TopicAndPartition] =
        leaderToTp((consumer.host, consumer.port))
      val reqMap = partitionsToGetOffsets.map { tp: TopicAndPartition =>
        tp -> PartitionOffsetRequestInfo(before, maxNumOffsets)
      }.toMap
      val req = OffsetRequest(reqMap) 
      val resp = consumer.getOffsetsBefore(req)
      val respMap = resp.partitionErrorAndOffsets
      partitionsToGetOffsets.foreach { tp: TopicAndPartition =>
        respMap.get(tp).foreach { por: PartitionOffsetsResponse =>
          if (por.error == ErrorMapping.NoError) {
            if (por.offsets.nonEmpty) {
              result += tp -> por.offsets.map { off =>
                LeaderOffset(consumer.host, consumer.port, off)
              }
            } else {
              errs.append(new SparkException(
                s"Empty offsets for ${tp}, is ${before} before log beginning?"))
            }
          } else {
            errs.append(ErrorMapping.exceptionFor(por.error))
          }
        }
      }
      if (result.keys.size == topicAndPartitions.size) {
        return Right(result)
      }
    }
    val missing = topicAndPartitions.diff(result.keySet)
    errs.append(new SparkException(s"Couldn't find leader offsets for ${missing}"))
    Left(errs)
  }
}
// Look up the host and port of each partition leader from its TopicAndPartition
def findLeaders(
    topicAndPartitions: Set[TopicAndPartition]
  ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
  val topics = topicAndPartitions.map(_.topic)
  // Fetch the metadata of all partitions of the given topics
  val response = getPartitionMetadata(topics).right
  // Extract the leader host and port of every requested partition
  val answer = response.flatMap { tms: Set[TopicMetadata] =>
    val leaderMap = tms.flatMap { tm: TopicMetadata =>
      tm.partitionsMetadata.flatMap { pm: PartitionMetadata =>
        val tp = TopicAndPartition(tm.topic, pm.partitionId)
        if (topicAndPartitions(tp)) {
          pm.leader.map { l =>
            tp -> (l.host -> l.port)
          }
        } else {
          None
        }
      }
    }.toMap

    if (leaderMap.keys.size == topicAndPartitions.size) {
      Right(leaderMap)
    } else {
      val missing = topicAndPartitions.diff(leaderMap.keySet)
      val err = new Err
      err.append(new SparkException(s"Couldn't find leaders for ${missing}"))
      Left(err)
    }
  }
  answer
}
// Fetch the metadata of all partitions of the given set of topics
def getPartitionMetadata(topics: Set[String]): Either[Err, Set[TopicMetadata]] = {
  // Build the TopicMetadataRequest
  val req = TopicMetadataRequest(
    TopicMetadataRequest.CurrentVersion, 0, config.clientId, topics.toSeq)
  val errs = new Err
  // Shuffle the broker list into random order
  withBrokers(Random.shuffle(config.seedBrokers), errs) { consumer =>
    val resp: TopicMetadataResponse = consumer.send(req)
    val respErrs = resp.topicsMetadata.filter(m => m.errorCode != ErrorMapping.NoError)

    if (respErrs.isEmpty) {
      return Right(resp.topicsMetadata.toSet)
    } else {
      respErrs.foreach { m =>
        val cause = ErrorMapping.exceptionFor(m.errorCode)
        val msg = s"Error getting partition metadata for '${m.topic}'. Does the topic exist?"
        errs.append(new SparkException(msg, cause))
      }
    }
  }
  Left(errs)
}
// Try a call against potentially multiple brokers, accumulating errors
private def withBrokers(brokers: Iterable[(String, Int)], errs: Err)
  (fn: SimpleConsumer => Any): Unit = {
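  // Although this is a foreach loop, the caller's closure returns as soon as it obtains the metadata.
  // The loop exists to retry against other brokers, so that a single Kafka node going down does not
  // fail the lookup. On top of that there is a retry mechanism across rounds: once every node has
  // been tried, the caller sleeps 200 ms (the default) and starts the next round, which covers
  // brokers that are only momentarily unavailable.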
  brokers.foreach { hp => 
    var consumer: SimpleConsumer = null
    try {
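      // Open a SimpleConsumer connection to this broker
      consumer = connect(hp._1, hp._2)
      fn(consumer) // send the request; in getPartitionMetadata this fetches the partition metadata
      /* fn is the closure supplied by the caller; in getPartitionMetadata it is: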
consumer =>
    val resp: TopicMetadataResponse = consumer.send(req)
    val respErrs = resp.topicsMetadata.filter(m => m.errorCode != ErrorMapping.NoError)

    if (respErrs.isEmpty) {
      return Right(resp.topicsMetadata.toSet)
    } else {
      respErrs.foreach { m =>
        val cause = ErrorMapping.exceptionFor(m.errorCode)
        val msg = s"Error getting partition metadata for '${m.topic}'. Does the topic exist?"
        errs.append(new SparkException(msg, cause))
      }
    }
  }
  Left(errs)
*/
    } catch {
      case NonFatal(e) =>
        errs.append(e)
    } finally {
      if (consumer != null) {
        consumer.close()
      }
    }
  }
}


private def flip[K, V](m: Map[K, V]): Map[V, Seq[K]] =
  m.groupBy(_._2).map { kv =>
    kv._1 -> kv._2.keys.toSeq
  }
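
Two helpers from the listing above are easy to try in isolation. The sketch below keeps flip as written and replaces withBrokers by a hypothetical firstSuccess function that expresses the same "ask brokers in turn, stop at the first answer, accumulate errors otherwise" idea with an ordinary loop instead of a return from inside a closure. HostPort, the String error type, and the ask parameter are assumptions for illustration.

```scala
object BrokerLookupSketch {
  type HostPort = (String, Int)

  /** Same idea as flip above: group the keys of a map by their value. */
  def flip[K, V](m: Map[K, V]): Map[V, Seq[K]] =
    m.groupBy(_._2).map { case (v, kvs) => v -> kvs.keys.toSeq }

  /** Ask each broker in order; return the first answer, or every error if all fail. */
  def firstSuccess[A](brokers: Seq[HostPort])(
      ask: HostPort => Either[String, A]
  ): Either[Seq[String], A] = {
    val errs = Seq.newBuilder[String]
    val it = brokers.iterator
    while (it.hasNext) {
      ask(it.next()) match {
        case Right(answer) => return Right(answer) // stop at the first broker that answers
        case Left(err)     => errs += err          // remember the failure and move on
      }
    }
    Left(errs.result())
  }
}
```

Grouping partitions by leader first means one request per broker rather than one per partition.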

Next, the latest offset of each partition leader is used to determine the until offset of each partition. This is what the clamp function does:

// limits the maximum number of messages per partition
protected def clamp(
  leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
  maxMessagesPerPartition.map { mmp =>
    leaderOffsets.map { case (tp, lo) =>
      // estimated until offset = current offset + max messages allowed in this batch
      // The until offset of each topic partition is the smaller of the leader's latest offset
      // and the estimated until offset
      tp -> lo.copy(offset = Math.min(currentOffsets(tp) + mmp, lo.offset))
    }
  }.getOrElse(leaderOffsets) // With no usable rate estimate (e.g. the first batch) and spark.streaming.kafka.maxRatePerPartition unset, return each leader's latest offset unchanged
}


protected def maxMessagesPerPartition: Option[Long] = {
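  // rateController estimates the ingestion rate
  val estimatedRateLimit = rateController.map(_.getLatestRate().toInt)
  // Total number of topic partitions
  val numPartitions = currentOffsets.keys.size
  // Compute the effective rate limit per partition
  val effectiveRateLimitPerPartition = estimatedRateLimit
    .filter(_ > 0) // drop non-positive rates
    .map { limit =>
      // maxRateLimitPerPartition comes from spark.streaming.kafka.maxRatePerPartition; the default is 0
      if (maxRateLimitPerPartition > 0) {
        // Take the smaller of the configured rate and the estimated per-partition rate
        Math.min(maxRateLimitPerPartition, (limit / numPartitions))
      } else { // Not configured: estimated rate / number of partitions
        limit / numPartitions
      }
    }.getOrElse(maxRateLimitPerPartition) // With no usable estimate, fall back to the configured rate (0 when unset)

  if (effectiveRateLimitPerPartition > 0) { // the effective per-partition rate is positive
    val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
    // Convert the per-second rate into the maximum number of messages per partition for one batch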
    Some((secsPerBatch * effectiveRateLimitPerPartition).toLong)
  } else {
    None
  }
}
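
The arithmetic of maxMessagesPerPartition and clamp is easier to follow with concrete numbers. The sketch below restates both as pure functions. The parameter names are assumptions that replace the fields used in the listing (rateController, maxRateLimitPerPartition, currentOffsets, batchDuration).

```scala
object ClampSketch {
  /** Messages allowed per partition in one batch, or None for "no limit".
    * estimatedRate: total messages per second from the rate controller, if any.
    * maxRatePerPartition: spark.streaming.kafka.maxRatePerPartition (0 means unset). */
  def maxMessagesPerPartition(
      estimatedRate: Option[Int],
      maxRatePerPartition: Int,
      numPartitions: Int,
      batchMillis: Long
  ): Option[Long] = {
    val perPartition = estimatedRate
      .filter(_ > 0)
      .map { limit =>
        val share = limit / numPartitions // integer division, as in the listing
        if (maxRatePerPartition > 0) math.min(maxRatePerPartition, share) else share
      }
      .getOrElse(maxRatePerPartition)
    if (perPartition > 0) Some((batchMillis.toDouble / 1000 * perPartition).toLong) else None
  }

  /** until offset = min(current offset + allowance, leader's latest offset). */
  def clamp(current: Long, leaderLatest: Long, allowance: Option[Long]): Long =
    allowance.map(mmp => math.min(current + mmp, leaderLatest)).getOrElse(leaderLatest)
}
```

For example, with an estimate of 3000 messages per second over 3 partitions, a configured cap of 500, and a 2 second batch, each partition may advance by at most 1000 offsets.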

2. Creating the KafkaRDD

The apply method of the KafkaRDD companion object:

def apply[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag](
    sc: SparkContext,
    kafkaParams: Map[String, String],
    fromOffsets: Map[TopicAndPartition, Long],
    untilOffsets: Map[TopicAndPartition, LeaderOffset],
    messageHandler: MessageAndMetadata[K, V] => R
  ): KafkaRDD[K, V, U, T, R] = {
  // Extract the TopicAndPartition -> leader (host, port) mapping from untilOffsets
  val leaders = untilOffsets.map { case (tp, lo) =>
      tp -> (lo.host, lo.port)
  }.toMap

  val offsetRanges = fromOffsets.map { case (tp, fo) =>
      // Combine the from offset and the until offset into an OffsetRange
      val uo = untilOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }.toArray
  // Return an instance of the KafkaRDD class
  new KafkaRDD[K, V, U, T, R](sc, kafkaParams, offsetRanges, leaders, messageHandler)
}

First, the documentation of KafkaRDD:

A batch-oriented interface for consuming from Kafka.
Starting and ending offsets are specified in advance,
so that you can control exactly-once semantics.
Because both the starting and the ending offsets are fixed in advance, we can control exactly-once semantics.

The key part is KafkaRDD's compute method, which takes a partition as its parameter:

override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
  if (part.fromOffset == part.untilOffset) { // from offset == until offset: return an empty iterator
    log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
      s"skipping ${part.topic} ${part.partition}")
    Iterator.empty
  } else {
    new KafkaRDDIterator(part, context)
  }
}

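The source of KafkaRDDIterator follows. The class is easy to understand because it overrides only two non-private methods, close and getNext. close shuts down the SimpleConsumer instance (mainly the socket connection and the blockingChannel used to read responses and write requests), and getNext fetches the data.

The class source is as follows: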

private class KafkaRDDIterator(
    part: KafkaRDDPartition,
    context: TaskContext) extends NextIterator[R] {

  context.addTaskCompletionListener{ context => closeIfNeeded() }

  log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
    s"offsets ${part.fromOffset} -> ${part.untilOffset}")
  // KafkaCluster is the client API for talking to the Kafka cluster
  val kc = new KafkaCluster(kafkaParams)
  // Decoder for the key of a Kafka message.
  // classTag is a method defined in the scala.reflect package object; it returns a ClassTag whose
  // runtimeClass holds the Class object of the generic type that was erased at runtime. Every Decoder
  // implementation has a constructor that takes a VerifiableProperties argument. After obtaining that
  // constructor, the concrete Decoder is instantiated by reflection and then cast to Decoder.
  val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[K]]
  // Decoder for the value of a Kafka message
  val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[V]]
  val consumer = connectLeader
  var requestOffset = part.fromOffset
  var iter: Iterator[MessageAndOffset] = null

  // The idea is to use the provided preferred host, except on task retry attempts,
  // to minimize number of kafka metadata requests
  private def connectLeader: SimpleConsumer = {
    if (context.attemptNumber > 0) {
      // On a task retry (attemptNumber > 0), look the leader up again through the configured broker list; return as soon as the leader of the topic partition is found
      kc.connectLeader(part.topic, part.partition).fold(
        errs => throw new SparkException(
          s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
            errs.mkString("\n")),
        consumer => consumer
      )
    } else {
      kc.connect(part.host, part.port)
    }
  }
  // What to do when a fetch fails; in effect this is a hook function
  private def handleFetchErr(resp: FetchResponse) {
    if (resp.hasError) {
      val err = resp.errorCode(part.topic, part.partition)
      if (err == ErrorMapping.LeaderNotAvailableCode ||
        err == ErrorMapping.NotLeaderForPartitionCode) {
        log.error(s"Lost leader for topic ${part.topic} partition ${part.partition}, " +
          s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms")
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
      }
      // Let normal rdd retry sort out reconnect attempts
      throw ErrorMapping.exceptionFor(err)
    }
  }
  // Note that the result is an iterator of MessageAndOffset (a Message wrapping a ByteBuffer, plus its offset)
  private def fetchBatch: Iterator[MessageAndOffset] = {
    // As the name suggests, this builder constructs a FetchRequest object
    val req = new FetchRequestBuilder()
      .addFetch(part.topic, part.partition, requestOffset, kc.config.fetchMessageMaxBytes)
      .build()
    // Call SimpleConsumer.fetch to send the FetchRequest and receive the topic's messages
    val resp = consumer.fetch(req)
    // Check for errors: throw an exception if there is one, otherwise go on to process the messages
    handleFetchErr(resp)
    // kafka may return a batch that starts before the requested offset
    // (for example when messages are stored as a compressed set), so messages whose offset is smaller than the requested offset must be dropped
    resp.messageSet(part.topic, part.partition)
      .iterator
      .dropWhile(_.offset < requestOffset)
  }

  override def close(): Unit = {
    if (consumer != null) {
      consumer.close()
    }
  }
  // The key method is getNext. Its return type is R, the result type of messageHandler; in the KafkaUtils overloads that take no messageHandler, R is the (K, V) pair of key and value
  override def getNext(): R = {
    if (iter == null || !iter.hasNext) { // first call, or the current batch has been consumed
      iter = fetchBatch // fetch the next iterator of MessageAndOffset
    }
    if (!iter.hasNext) { // nothing left to fetch: every message of this range has been processed, so set the flag and return null
      assert(requestOffset == part.untilOffset, errRanOutBeforeEnd(part))
      finished = true
      null.asInstanceOf[R]
    } else {
      val item = iter.next() // the next MessageAndOffset
      if (item.offset >= part.untilOffset) { // the message is at or beyond this batch's until offset: return null
        assert(item.offset == part.untilOffset, errOvershotEnd(item.offset, part))
        finished = true
        null.asInstanceOf[R]
      } else { // the offset of this MessageAndOffset is >= from offset and < until offset
        requestOffset = item.nextOffset // the next message to request from the Kafka cluster is the one after this message
        // MessageAndMetadata wraps the information of a single message: topic, partition, the message ByteBuffer, the message offset, the key decoder and the value decoder
        // messageHandler is a callback; in this example it is (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message(), mmd.offset)
        messageHandler(new MessageAndMetadata(
          part.topic, part.partition, item.message, item.offset, keyDecoder, valueDecoder))
      }
    }
  }
}
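
The control flow of fetchBatch and getNext can be simulated against an in-memory log. In this sketch Msg, fetch, and read are hypothetical. The fake fetch deliberately starts one message before the requested offset to imitate a broker returning a batch that begins early. The loop then yields exactly the messages with fromOffset <= offset < untilOffset, which is the property the iterator relies on.

```scala
object RangeIteratorSketch {
  final case class Msg(offset: Long, value: String) {
    def nextOffset: Long = offset + 1
  }

  /** Hypothetical broker call: up to batchSize messages, starting one message early. */
  def fetch(log: Vector[Msg], requestOffset: Long, batchSize: Int): Iterator[Msg] = {
    val start = math.max(0L, requestOffset - 1)
    log.iterator.filter(_.offset >= start).take(batchSize)
  }

  /** Collect the values in [fromOffset, untilOffset), fetching batch by batch.
    * batchSize should be at least 2, since every fetch may start one message early. */
  def read(log: Vector[Msg], fromOffset: Long, untilOffset: Long, batchSize: Int): List[String] = {
    var requestOffset = fromOffset
    var iter: Iterator[Msg] = Iterator.empty
    val out = List.newBuilder[String]
    var finished = fromOffset == untilOffset
    while (!finished) {
      if (!iter.hasNext) {
        // Drop anything the broker returned from before the requested offset.
        iter = fetch(log, requestOffset, batchSize).dropWhile(_.offset < requestOffset)
      }
      if (!iter.hasNext) {
        finished = true // the log ran out
      } else {
        val item = iter.next()
        if (item.offset >= untilOffset) {
          finished = true // reached the end of this batch's range
        } else {
          requestOffset = item.nextOffset
          out += item.value
        }
      }
    }
    out.result()
  }
}
```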

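3. Summary

This section answers the following questions:

1. How does this class receive Kafka messages? Each batch is obtained through a KafkaRDD. KafkaRDD's compute method returns an iterator that wraps the batched fetching of a Kafka partition's data, calls the user-supplied message handler, and returns the result one record at a time. The exactly-once consumption of Spark Streaming is guaranteed through KafkaRDD: before the KafkaRDD is created, the until offset of each partition has already been determined from the current offset, the estimated rate, the configured maximum fetch rate per partition, and the latest offset obtained from the partition leader. fromOffset and untilOffset together form an OffsetRange. The iterator created in KafkaRDD drops messages whose offset lies outside that OffsetRange and finally calls the user-supplied handler to turn each message into the format the user wants.

2. How does this class turn the messages of a single Kafka partition into the data of a single RDD partition? KafkaRDD's compute method takes a partition as its parameter. That partition is a KafkaRDDPartition instance, which carries the offset range, topic, and partition of the messages. The method returns a KafkaRDDIterator, which provides access to the Kafka data of that partition. Internally it uses SimpleConsumer to fetch data in batches from the leader node and then picks the wanted messages out of each batch (as bounded by the offset range).

3. How does this class estimate the Kafka consumption rate? Through the PIDRateEstimator class, which estimates the rate from the batch completion time, the number of records processed in the batch, the actual processing time, and the batch scheduling delay.

4. How does this class do WAL? It does not. This class cannot write a WAL.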
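
To give a feel for how a rate can be estimated from those four inputs, here is a generic PID-style update. It is a sketch, not a copy of Spark's PIDRateEstimator: the State class, the parameter names, the gains (kp, ki, kd), and the minimum rate are illustrative assumptions. It requires processingDelayMs > 0 and a timestamp later than the previous update.

```scala
object PidRateSketch {
  /** latestRate in records/sec, the previous error, and the time of the previous update. */
  final case class State(latestRate: Double, latestError: Double, latestTimeMs: Long)

  def update(
      s: State,
      timeMs: Long,            // when the batch finished
      numElements: Long,       // records processed in the batch
      processingDelayMs: Long, // how long processing took
      schedulingDelayMs: Long, // how long the batch waited before it started
      batchIntervalMs: Long,
      kp: Double = 1.0,
      ki: Double = 0.2,
      kd: Double = 0.0,
      minRate: Double = 100.0
  ): State = {
    val delaySinceUpdate = (timeMs - s.latestTimeMs).toDouble / 1000
    val processingRate = numElements.toDouble / processingDelayMs * 1000 // records/sec
    // Proportional term: how far the current limit is from what was actually achieved.
    val error = s.latestRate - processingRate
    // Integral term: backlog implied by the scheduling delay, spread over one batch interval.
    val historicalError = schedulingDelayMs.toDouble * processingRate / batchIntervalMs
    // Derivative term: how fast the error is changing.
    val dError = (error - s.latestError) / delaySinceUpdate
    val newRate = math.max(s.latestRate - kp * error - ki * historicalError - kd * dError, minRate)
    State(newRate, error, timeMs)
  }
}
```

With these gains, a limit of 1000 records per second drops to 800 after a batch that only sustained 800 records per second, and drops further when batches also queue up before being scheduled.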

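III. Spark Streaming receiving Kafka messages, part 3: how the Kafka broker handles fetch requests

First, the class documentation of KafkaServer:

Represents the lifecycle of a single Kafka broker. Handles all functionality required to start up and shutdown a single Kafka node.
That is, it represents the lifecycle of a single broker and handles everything needed to start up and shut down a broker node.

In the startup method of this class, a thread pool is instantiated: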

/* start processing requests */
// Handles all requests
apis = new KafkaApis(socketServer.requestChannel, replicaManager, adminManager, groupCoordinator, transactionCoordinator,
  kafkaController, zkUtils, config.brokerId, config, metadataCache, metrics, authorizer, quotaManagers,
  brokerTopicStats, clusterId, time)
// The request-handling thread pool
requestHandlerPool = new KafkaRequestHandlerPool(config.brokerId, socketServer.requestChannel, apis, time,
  config.numIoThreads)

The source of KafkaRequestHandlerPool is as follows:

 class KafkaRequestHandlerPool(val brokerId: Int,
                               val requestChannel: RequestChannel,
                               val apis: KafkaApis,
                               time: Time,
                               numThreads: Int) extends Logging with KafkaMetricsGroup {
 
   /* a meter to track the average free capacity of the request handlers */
   private val aggregateIdleMeter = newMeter("RequestHandlerAvgIdlePercent", "percent", TimeUnit.NANOSECONDS)
 
   this.logIdent = "[Kafka Request Handler on Broker " + brokerId + "], "
   val runnables = new Array[KafkaRequestHandler](numThreads)
   for(i <- 0 until numThreads) { // instantiate every runnable
     runnables(i) = new KafkaRequestHandler(i, brokerId, aggregateIdleMeter, numThreads, requestChannel, apis, time)
     // create and start the daemon thread
     Utils.daemonThread("kafka-request-handler-" + i, runnables(i)).start()
   }
   // shut down every thread in the pool
   def shutdown() {
     info("shutting down")
     for (handler <- runnables)
       handler.initiateShutdown()
     for (handler <- runnables)
       handler.awaitShutdown()
     info("shut down completely")
   }
 }

Next, the source of KafkaRequestHandler:

 class KafkaRequestHandler(id: Int,
                           brokerId: Int,
                           val aggregateIdleMeter: Meter,
                           val totalHandlerThreads: Int,
                           val requestChannel: RequestChannel,
                           apis: KafkaApis,
                           time: Time) extends Runnable with Logging {
   this.logIdent = "[Kafka Request Handler " + id + " on Broker " + brokerId + "], "
   private val latch = new CountDownLatch(1)
 
   def run() {
     while (true) { // this run method loops until shutdown
       try {
         var req : RequestChannel.Request = null
         while (req == null) { // keep polling until a request arrives
           // We use a single meter for aggregate idle percentage for the thread pool.
           // Since meter is calculated as total_recorded_value / time_window and
           // time_window is independent of the number of threads, each recorded idle
           // time should be discounted by # threads.
           val startSelectTime = time.nanoseconds
           req = requestChannel.receiveRequest(300)
           val endTime = time.nanoseconds
           if (req != null)
             req.requestDequeueTimeNanos = endTime
           val idleTime = endTime - startSelectTime
           aggregateIdleMeter.mark(idleTime / totalHandlerThreads)
         }
 
         if (req eq RequestChannel.AllDone) {
           debug("Kafka request handler %d on broker %d received shut down command".format(id, brokerId))
           latch.countDown()
           return
         }
         trace("Kafka request handler %d on broker %d handling request %s".format(id, brokerId, req))
         apis.handle(req) // handle the request
       } catch {
         case e: FatalExitError =>
           latch.countDown()
           Exit.exit(e.statusCode)
         case e: Throwable => error("Exception when handling request", e)
       }
     }
   }
 
   def initiateShutdown(): Unit = requestChannel.sendRequest(RequestChannel.AllDone)
 
   def awaitShutdown(): Unit = latch.await()
 
 }
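
The polling loop and the two-phase shutdown above can be illustrated with a small standalone sketch. It is not Kafka code: `HandlerPoolSketch`, `Work`, `AllDone` and `runPool` are hypothetical names, and a plain `LinkedBlockingQueue` stands in for `RequestChannel`. What it keeps from the original is the shape of the technique: poll with a 300 ms timeout, treat a sentinel request as the shutdown command, and use a `CountDownLatch` so the pool can wait for every handler to exit.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object HandlerPoolSketch {
  // Hypothetical request types; AllDone plays the role of RequestChannel.AllDone.
  sealed trait Request
  final case class Work(id: Int) extends Request
  case object AllDone extends Request

  final class Handler(queue: LinkedBlockingQueue[Request], handled: AtomicInteger)
      extends Runnable {
    private val latch = new CountDownLatch(1)

    override def run(): Unit = {
      var running = true
      while (running) {
        // poll returns null on timeout, which simply means "try again"
        queue.poll(300, TimeUnit.MILLISECONDS) match {
          case null =>
            ()
          case AllDone =>
            latch.countDown()
            running = false
          case Work(_) =>
            handled.incrementAndGet()
        }
      }
    }

    // One sentinel per handler; whichever handler dequeues it stops.
    def initiateShutdown(): Unit = queue.put(AllDone)
    def awaitShutdown(): Unit = latch.await()
  }

  /** Starts the handlers, submits the requests, shuts down, and returns the handled count. */
  def runPool(numThreads: Int, numRequests: Int): Int = {
    val queue = new LinkedBlockingQueue[Request]()
    val handled = new AtomicInteger(0)
    val handlers = Vector.fill(numThreads)(new Handler(queue, handled))
    handlers.zipWithIndex.foreach { case (h, i) =>
      val t = new Thread(h, s"sketch-request-handler-$i")
      t.setDaemon(true)
      t.start()
    }
    (1 to numRequests).foreach(i => queue.put(Work(i)))
    handlers.foreach(_.initiateShutdown())
    handlers.foreach(_.awaitShutdown())
    handled.get()
  }
}
```

Because the queue is FIFO, the sentinels are dequeued only after all earlier requests, so `HandlerPoolSketch.runPool(3, 10)` should report all 10 requests as handled before the pool stops.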

The key part is the source of kafka.server.KafkaApis#handle:

 /**
  * Top-level method that handles all requests and multiplexes to the right api
  */
 def handle(request: RequestChannel.Request) {
   try {
     trace("Handling request:%s from connection %s;securityProtocol:%s,principal:%s".
       format(request.requestDesc(true), request.connectionId, request.securityProtocol, request.session.principal))
     ApiKeys.forId(request.requestId) match {
       case ApiKeys.PRODUCE => handleProduceRequest(request)
       case ApiKeys.FETCH => handleFetchRequest(request) // the request that fetches messages
       case ApiKeys.LIST_OFFSETS => handleListOffsetRequest(request)
       case ApiKeys.METADATA => handleTopicMetadataRequest(request)
       case ApiKeys.LEADER_AND_ISR => handleLeaderAndIsrRequest(request)
       case ApiKeys.STOP_REPLICA => handleStopReplicaRequest(request)
       case ApiKeys.UPDATE_METADATA_KEY => handleUpdateMetadataRequest(request)
       case ApiKeys.CONTROLLED_SHUTDOWN_KEY => handleControlledShutdownRequest(request)
       case ApiKeys.OFFSET_COMMIT => handleOffsetCommitRequest(request)
       case ApiKeys.OFFSET_FETCH => handleOffsetFetchRequest(request)
       case ApiKeys.FIND_COORDINATOR => handleFindCoordinatorRequest(request)
       case ApiKeys.JOIN_GROUP => handleJoinGroupRequest(request)
       case ApiKeys.HEARTBEAT => handleHeartbeatRequest(request)
       case ApiKeys.LEAVE_GROUP => handleLeaveGroupRequest(request)
       case ApiKeys.SYNC_GROUP => handleSyncGroupRequest(request)
       case ApiKeys.DESCRIBE_GROUPS => handleDescribeGroupRequest(request)
       case ApiKeys.LIST_GROUPS => handleListGroupsRequest(request)
       case ApiKeys.SASL_HANDSHAKE => handleSaslHandshakeRequest(request)
       case ApiKeys.API_VERSIONS => handleApiVersionsRequest(request)
       case ApiKeys.CREATE_TOPICS => handleCreateTopicsRequest(request)
       case ApiKeys.DELETE_TOPICS => handleDeleteTopicsRequest(request)
       case ApiKeys.DELETE_RECORDS => handleDeleteRecordsRequest(request)
       case ApiKeys.INIT_PRODUCER_ID => handleInitProducerIdRequest(request)
       case ApiKeys.OFFSET_FOR_LEADER_EPOCH => handleOffsetForLeaderEpochRequest(request)
       case ApiKeys.ADD_PARTITIONS_TO_TXN => handleAddPartitionToTxnRequest(request)
       case ApiKeys.ADD_OFFSETS_TO_TXN => handleAddOffsetsToTxnRequest(request)
       case ApiKeys.END_TXN => handleEndTxnRequest(request)
       case ApiKeys.WRITE_TXN_MARKERS => handleWriteTxnMarkersRequest(request)
       case ApiKeys.TXN_OFFSET_COMMIT => handleTxnOffsetCommitRequest(request)
       case ApiKeys.DESCRIBE_ACLS => handleDescribeAcls(request)
       case ApiKeys.CREATE_ACLS => handleCreateAcls(request)
       case ApiKeys.DELETE_ACLS => handleDeleteAcls(request)
       case ApiKeys.ALTER_CONFIGS => handleAlterConfigsRequest(request)
       case ApiKeys.DESCRIBE_CONFIGS => handleDescribeConfigsRequest(request)
     }
   } catch {
     case e: FatalExitError => throw e
     case e: Throwable => handleError(request, e)
   } finally {
     request.apiLocalCompleteTimeNanos = time.nanoseconds
   }
 }

Now look at handleFetchRequest:

 // call the replica manager to fetch messages from the local replica
     replicaManager.fetchMessages(
       fetchRequest.maxWait.toLong, // 0 in this case
       fetchRequest.replicaId,
       fetchRequest.minBytes,
       fetchRequest.maxBytes,
       versionId <= 2,
       authorizedRequestInfo,
       replicationQuota(fetchRequest),
       processResponseCallback,
       fetchRequest.isolationLevel)

The source of fetchMessages is as follows:

  /**
  * Fetch messages from the leader replica, and wait until enough data can be fetched and return;
  * the callback function will be triggered either when timeout or required fetch info is satisfied
  */
 def fetchMessages(timeout: Long,
                   replicaId: Int,
                   fetchMinBytes: Int,
                   fetchMaxBytes: Int,
                   hardMaxBytesLimit: Boolean,
                   fetchInfos: Seq[(TopicPartition, PartitionData)],
                   quota: ReplicaQuota = UnboundedQuota,
                   responseCallback: Seq[(TopicPartition, FetchPartitionData)] => Unit,
                   isolationLevel: IsolationLevel) {
   val isFromFollower = replicaId >= 0
   val fetchOnlyFromLeader: Boolean = replicaId != Request.DebuggingConsumerId
   val fetchOnlyCommitted: Boolean = ! Request.isValidBrokerId(replicaId)
   // read from local logs
   val logReadResults = readFromLocalLog(
     replicaId = replicaId,
     fetchOnlyFromLeader = fetchOnlyFromLeader,
     readOnlyCommitted = fetchOnlyCommitted,
     fetchMaxBytes = fetchMaxBytes,
     hardMaxBytesLimit = hardMaxBytesLimit,
     readPartitionInfo = fetchInfos,
     quota = quota,
     isolationLevel = isolationLevel)
 
   // if the fetch comes from the follower,
   // update its corresponding log end offset
   if(Request.isValidBrokerId(replicaId))
     updateFollowerLogReadResults(replicaId, logReadResults)
 
   // check if this fetch request can be satisfied right away
   val logReadResultValues = logReadResults.map { case (_, v) => v }
   val bytesReadable = logReadResultValues.map(_.info.records.sizeInBytes).sum
   val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
     errorIncurred || (readResult.error != Errors.NONE))
   // respond immediately if 1) fetch request does not want to wait
   //                        2) fetch request does not require any data
   //                        3) has enough data to respond
   //                        4) some error happens while reading data
   if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
     val fetchPartitionData = logReadResults.map { case (tp, result) =>
       tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records,
         result.lastStableOffset, result.info.abortedTransactions)
     }
     responseCallback(fetchPartitionData)
   } else {// DelayedFetch
     // construct the fetch results from the read results
     val fetchPartitionStatus = logReadResults.map { case (topicPartition, result) =>
       val fetchInfo = fetchInfos.collectFirst {
         case (tp, v) if tp == topicPartition => v
       }.getOrElse(sys.error(s"Partition $topicPartition not found in fetchInfos"))
       (topicPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
     }
     val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
       fetchOnlyCommitted, isFromFollower, replicaId, fetchPartitionStatus)
     val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, isolationLevel, responseCallback)
 
     // create a list of (topic, partition) pairs to use as keys for this delayed fetch operation
     val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }
 
     // try to complete the request immediately, otherwise put it into the purgatory;
     // this is because while the delayed fetch operation is being created, new requests
     // may arrive and hence make this operation completable.
     delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
   }
 }
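
The four "respond immediately" conditions in fetchMessages reduce to a single predicate. The sketch below restates it as a pure function so the cases are easy to check. `FetchDecisionSketch` and its parameter names are illustrative assumptions, not part of ReplicaManager.

```scala
object FetchDecisionSketch {
  /** True when the fetch should be answered right away instead of being parked
    * as a delayed fetch. Mirrors the condition used in fetchMessages. */
  def respondImmediately(
      timeoutMs: Long,            // maxWait of the fetch request
      requestedPartitions: Int,   // number of entries in fetchInfos
      bytesReadable: Long,        // bytes already available in the local log
      fetchMinBytes: Int,         // minBytes of the fetch request
      errorReadingData: Boolean   // any partition read returned an error
  ): Boolean =
    timeoutMs <= 0 ||
      requestedPartitions == 0 ||
      bytesReadable >= fetchMinBytes ||
      errorReadingData
}
```

With a timeout of 0, as in the call shown earlier, the predicate is always true, so such a fetch never enters the purgatory regardless of how much data is available.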

Following the call into readFromLocalLog:

  /**
  * Read from multiple topic partitions at the given offset up to maxSize bytes
  */
 // Reads from multiple topic partitions up to the maximum size (1 MB by default).
 // Isolation levels: read committed and read uncommitted.
 def readFromLocalLog(replicaId: Int,
                      fetchOnlyFromLeader: Boolean,
                      readOnlyCommitted: Boolean,
                      fetchMaxBytes: Int,
                      hardMaxBytesLimit: Boolean,
                      readPartitionInfo: Seq[(TopicPartition, PartitionData)],
                      quota: ReplicaQuota,
                      isolationLevel: IsolationLevel): Seq[(TopicPartition, LogReadResult)] = {
 
   def read(tp: TopicPartition, fetchInfo: PartitionData, limitBytes: Int, minOneMessage: Boolean): LogReadResult = {
     val offset = fetchInfo.fetchOffset
     val partitionFetchSize = fetchInfo.maxBytes
     val followerLogStartOffset = fetchInfo.logStartOffset
 
     brokerTopicStats.topicStats(tp.topic).totalFetchRequestRate.mark()
     brokerTopicStats.allTopicsStats.totalFetchRequestRate.mark()
 
     try {
       trace(s"Fetching log segment for partition $tp, offset $offset, partition fetch size $partitionFetchSize, " +
         s"remaining response limit $limitBytes" +
         (if (minOneMessage) s", ignoring response/partition size limits" else ""))
 
       // decide whether to only fetch from leader
       val localReplica = if (fetchOnlyFromLeader)
         getLeaderReplicaIfLocal(tp)
       else
         getReplicaOrException(tp)
 
       val initialHighWatermark = localReplica.highWatermark.messageOffset
       val lastStableOffset = if (isolationLevel == IsolationLevel.READ_COMMITTED)
         Some(localReplica.lastStableOffset.messageOffset)
       else
         None
 
       // decide whether to only fetch committed data (i.e. messages below high watermark)
       val maxOffsetOpt = if (readOnlyCommitted)
         Some(lastStableOffset.getOrElse(initialHighWatermark))
       else
         None
 
       /* Read the LogOffsetMetadata prior to performing the read from the log.
        * We use the LogOffsetMetadata to determine if a particular replica is in-sync or not.
        * Using the log end offset after performing the read can lead to a race condition
        * where data gets appended to the log immediately after the replica has consumed from it
        * This can cause a replica to always be out of sync.
        */
       val initialLogEndOffset = localReplica.logEndOffset.messageOffset
       val initialLogStartOffset = localReplica.logStartOffset
       val fetchTimeMs = time.milliseconds
       val logReadInfo = localReplica.log match {
         case Some(log) =>
           val adjustedFetchSize = math.min(partitionFetchSize, limitBytes)
 
           // Try the read first, this tells us whether we need all of adjustedFetchSize for this partition
           // try to read the data from the Log
           val fetch = log.read(offset, adjustedFetchSize, maxOffsetOpt, minOneMessage, isolationLevel)
 
           // If the partition is being throttled, simply return an empty set.
           if (shouldLeaderThrottle(quota, tp, replicaId))
             FetchDataInfo(fetch.fetchOffsetMetadata, MemoryRecords.EMPTY)
           // For FetchRequest version 3, we replace incomplete message sets with an empty one as consumers can make
           // progress in such cases and don't need to report a `RecordTooLargeException`
           else if (!hardMaxBytesLimit && fetch.firstEntryIncomplete)
             FetchDataInfo(fetch.fetchOffsetMetadata, MemoryRecords.EMPTY)
           else fetch
 
         case None =>
           error(s"Leader for partition $tp does not have a local log")
           FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY)
       }
 
       LogReadResult(info = logReadInfo,
                     highWatermark = initialHighWatermark,
                     leaderLogStartOffset = initialLogStartOffset,
                     leaderLogEndOffset = initialLogEndOffset,
                     followerLogStartOffset = followerLogStartOffset,
                     fetchTimeMs = fetchTimeMs,
                     readSize = partitionFetchSize,
                     lastStableOffset = lastStableOffset,
                     exception = None)
     } catch {
       // NOTE: Failed fetch requests metric is not incremented for known exceptions since it
       // is supposed to indicate un-expected failure of a broker in handling a fetch request
       case e@ (_: UnknownTopicOrPartitionException |
                _: NotLeaderForPartitionException |
                _: ReplicaNotAvailableException |
                _: OffsetOutOfRangeException) =>
         LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
                       highWatermark = -1L,
                       leaderLogStartOffset = -1L,
                       leaderLogEndOffset = -1L,
                       followerLogStartOffset = -1L,
                       fetchTimeMs = -1L,
                       readSize = partitionFetchSize,
                       lastStableOffset = None,
                       exception = Some(e))
       case e: Throwable =>
         brokerTopicStats.topicStats(tp.topic).failedFetchRequestRate.mark()
         brokerTopicStats.allTopicsStats.failedFetchRequestRate.mark()
         error(s"Error processing fetch operation on partition $tp, offset $offset", e)
         LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
                       highWatermark = -1L,
                       leaderLogStartOffset = -1L,
                       leaderLogEndOffset = -1L,
                       followerLogStartOffset = -1L,
                       fetchTimeMs = -1L,
                       readSize = partitionFetchSize,
                       lastStableOffset = None,
                       exception = Some(e))
     }
   }
   // maxSize, 1 MB by default
   var limitBytes = fetchMaxBytes
   val result = new mutable.ArrayBuffer[(TopicPartition, LogReadResult)]
   var minOneMessage = !hardMaxBytesLimit // without a hard limit, the first message may exceed the size limits
   readPartitionInfo.foreach { case (tp, fetchInfo) =>
     val readResult = read(tp, fetchInfo, limitBytes, minOneMessage)
     val messageSetSize = readResult.info.records.sizeInBytes
     // Once we read from a non-empty partition, we stop ignoring request and partition level size limits
     if (messageSetSize > 0)
       minOneMessage = false
     limitBytes = math.max(0, limitBytes - messageSetSize)
     result += (tp -> readResult)
   }
   result
 }
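
The closing loop of readFromLocalLog shares one byte budget across all requested partitions and relaxes the limits only until the first non-empty read. The following sketch models that bookkeeping under simplifying assumptions: a partition is just a list of message sizes, and only whole messages are returned. `FetchBudgetSketch`, `readPartition` and `plan` are hypothetical names.

```scala
import scala.collection.mutable

object FetchBudgetSketch {
  /** Bytes returned for one partition: as many whole messages as fit in `limit`,
    * or the first message alone when nothing fits and minOneMessage is set. */
  def readPartition(messageSizes: Seq[Int], limit: Int, minOneMessage: Boolean): Int = {
    // Running totals after each message; keep the largest total within the limit.
    val fitting =
      messageSizes.scanLeft(0)(_ + _).tail.takeWhile(_ <= limit).lastOption.getOrElse(0)
    if (fitting == 0 && minOneMessage) messageSizes.headOption.getOrElse(0) else fitting
  }

  /** Walks the partitions in request order, shrinking the shared budget as it goes. */
  def plan(
      partitions: Seq[(String, Seq[Int])],
      partitionMaxBytes: Int,
      fetchMaxBytes: Int,
      hardMaxBytesLimit: Boolean
  ): List[(String, Int)] = {
    var limitBytes = fetchMaxBytes
    var minOneMessage = !hardMaxBytesLimit
    val result = mutable.ArrayBuffer.empty[(String, Int)]
    partitions.foreach { case (tp, sizes) =>
      val bytes = readPartition(sizes, math.min(partitionMaxBytes, limitBytes), minOneMessage)
      // After the first non-empty read, the size limits are enforced again.
      if (bytes > 0) minOneMessage = false
      limitBytes = math.max(0, limitBytes - bytes)
      result += ((tp, bytes))
    }
    result.toList
  }
}
```

The design point this illustrates is that earlier partitions in the request can consume the whole budget, which leaves later partitions with empty results for that round.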

The source of Log.read is as follows:

 /**
  * Read messages from the log.
  *
  * @param startOffset The offset to begin reading at
  * @param maxLength The maximum number of bytes to read
  * @param maxOffset The offset to read up to, exclusive. (i.e. this offset NOT included in the resulting message set)
  * @param minOneMessage If this is true, the first message will be returned even if it exceeds `maxLength` (if one exists)
  * @param isolationLevel The isolation level of the fetcher. The READ_UNCOMMITTED isolation level has the traditional
  *                       read semantics (e.g. consumers are limited to fetching up to the high watermark). In
  *                       READ_COMMITTED, consumers are limited to fetching up to the last stable offset. Additionally,
  *                       in READ_COMMITTED, the transaction index is consulted after fetching to collect the list
  *                       of aborted transactions in the fetch range which the consumer uses to filter the fetched
  *                       records before they are returned to the user. Note that fetches from followers always use
  *                       READ_UNCOMMITTED.
  *
  * @throws OffsetOutOfRangeException If startOffset is beyond the log end offset or before the log start offset
  * @return The fetch data information including fetch starting offset metadata and messages read.
  */
 def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None, minOneMessage: Boolean = false,
          isolationLevel: IsolationLevel): FetchDataInfo = {
   trace("Reading %d bytes from offset %d in log %s of length %d bytes".format(maxLength, startOffset, name, size))
 
   // Because we don't use lock for reading, the synchronization is a little bit tricky.
   // We create the local variables to avoid race conditions with updates to the log.
   val currentNextOffsetMetadata = nextOffsetMetadata
   val next = currentNextOffsetMetadata.messageOffset
   if (startOffset == next) {
     val abortedTransactions =
       if (isolationLevel == IsolationLevel.READ_COMMITTED) Some(List.empty[AbortedTransaction])
       else None
     return FetchDataInfo(currentNextOffsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false,
       abortedTransactions = abortedTransactions)
   }
 
   var segmentEntry = segments.floorEntry(startOffset)
 
   // return error on attempt to read beyond the log end offset or read below log start offset
   if (startOffset > next || segmentEntry == null || startOffset < logStartOffset)
     throw new OffsetOutOfRangeException("Request for offset %d but we only have log segments in the range %d to %d.".format(startOffset, logStartOffset, next))
 
   // Do the read on the segment with a base offset less than the target offset
   // but if that segment doesn't contain any messages with an offset greater than that
   // continue to read from successive segments until we get some messages or we reach the end of the log
   while(segmentEntry != null) {
     val segment = segmentEntry.getValue
 
     // If the fetch occurs on the active segment, there might be a race condition where two fetch requests occur after
     // the message is appended but before the nextOffsetMetadata is updated. In that case the second fetch may
     // cause OffsetOutOfRangeException. To solve that, we cap the reading up to exposed position instead of the log
     // end of the active segment.
     val maxPosition = {
       if (segmentEntry == segments.lastEntry) {
         val exposedPos = nextOffsetMetadata.relativePositionInSegment.toLong
         // Check the segment again in case a new segment has just rolled out.
         if (segmentEntry != segments.lastEntry)
           // New log segment has rolled out, we can read up to the file end.
           segment.size
         else
           exposedPos
       } else {
         segment.size
       }
     }
     // read the data from the segment
     val fetchInfo = segment.read(startOffset, maxOffset, maxLength, maxPosition, minOneMessage)
     if (fetchInfo == null) {
       segmentEntry = segments.higherEntry(segmentEntry.getKey)
     } else {
       return isolationLevel match {
         case IsolationLevel.READ_UNCOMMITTED => fetchInfo
         case IsolationLevel.READ_COMMITTED => addAbortedTransactions(startOffset, segmentEntry, fetchInfo)
       }
     }
   }
 
   // okay we are beyond the end of the last segment with no data fetched although the start offset is in range,
   // this can happen when all messages with offset larger than start offsets have been deleted.
   // In this case, we will return the empty set with log end offset metadata
   FetchDataInfo(nextOffsetMetadata, MemoryRecords.EMPTY)
 }
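
Log.read locates the starting segment with `floorEntry` and moves on with `higherEntry` when a segment yields nothing. The sketch below shows those two lookups on a `ConcurrentSkipListMap` keyed by base offset. The segment values are plain strings and the names (`SegmentLookupSketch`, `build`, `segmentFor`, `nextSegmentAfter`) are assumptions made for illustration.

```scala
import java.util.concurrent.ConcurrentSkipListMap

object SegmentLookupSketch {
  /** Builds a map from base offset to a placeholder segment name. */
  def build(baseOffsets: Seq[Long]): ConcurrentSkipListMap[java.lang.Long, String] = {
    val segments = new ConcurrentSkipListMap[java.lang.Long, String]()
    baseOffsets.foreach(o => segments.put(java.lang.Long.valueOf(o), s"segment-$o"))
    segments
  }

  /** The segment whose base offset is the greatest one <= offset, if any. */
  def segmentFor(
      segments: ConcurrentSkipListMap[java.lang.Long, String],
      offset: Long
  ): Option[String] =
    Option(segments.floorEntry(java.lang.Long.valueOf(offset))).map(_.getValue)

  /** The next segment after the one with the given base offset, if any. */
  def nextSegmentAfter(
      segments: ConcurrentSkipListMap[java.lang.Long, String],
      baseOffset: Long
  ): Option[String] =
    Option(segments.higherEntry(java.lang.Long.valueOf(baseOffset))).map(_.getValue)
}
```

A `None` from `segmentFor` corresponds to the `segmentEntry == null` check that makes Log.read throw OffsetOutOfRangeException.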

The read method of LogSegment:

 /**
  * Read a message set from this segment beginning with the first offset >= startOffset. The message set will include
  * no more than maxSize bytes and will end before maxOffset if a maxOffset is specified.
  *
  * @param startOffset A lower bound on the first offset to include in the message set we read
  * @param maxSize The maximum number of bytes to include in the message set we read
  * @param maxOffset An optional maximum offset for the message set we read
  * @param maxPosition The maximum position in the log segment that should be exposed for read
  * @param minOneMessage If this is true, the first message will be returned even if it exceeds `maxSize` (if one exists)
  *
  * @return The fetched data and the offset metadata of the first message whose offset is >= startOffset,
  *         or null if the startOffset is larger than the largest offset in this log
  */
 @threadsafe
 def read(startOffset: Long, maxOffset: Option[Long], maxSize: Int, maxPosition: Long = size,
          minOneMessage: Boolean = false): FetchDataInfo = {
   if (maxSize < 0)
     throw new IllegalArgumentException("Invalid max size for log read (%d)".format(maxSize))
 
   val logSize = log.sizeInBytes // this may change, need to save a consistent copy
   val startOffsetAndSize = translateOffset(startOffset)
   // if the start position is already off the end of the log, return null
   if (startOffsetAndSize == null)
     return null
   // start position within the segment file
   val startPosition = startOffsetAndSize.position
   val offsetMetadata = new LogOffsetMetadata(startOffset, this.baseOffset, startPosition)
   // adjusted maximum size
   val adjustedMaxSize =
     if (minOneMessage) math.max(maxSize, startOffsetAndSize.size)
     else maxSize
 
   // return a log segment but with zero size in the case below
   if (adjustedMaxSize == 0)
     return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY)
 
   // calculate the length of the message set to read based on whether or not they gave us a maxOffset
   val fetchSize: Int = maxOffset match {
     case None =>
       // no max offset, just read until the max position
       min((maxPosition - startPosition).toInt, adjustedMaxSize)
     case Some(offset) =>
       // there is a max offset, translate it to a file position and use that to calculate the max read size;
       // when the leader of a partition changes, it's possible for the new leader's high watermark to be less than the
       // true high watermark in the previous leader for a short window. In this window, if a consumer fetches on an
       // offset between new leader's high watermark and the log end offset, we want to return an empty response.
       if (offset < startOffset)
         return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false)
       val mapping = translateOffset(offset, startPosition)
       val endPosition =
         if (mapping == null)
           logSize // the max offset is off the end of the log, use the end of the file
         else
           mapping.position
       min(min(maxPosition, endPosition) - startPosition, adjustedMaxSize).toInt
   }
 
   FetchDataInfo(offsetMetadata, log.read(startPosition, fetchSize),
     firstEntryIncomplete = adjustedMaxSize < startOffsetAndSize.size)
 }
 
The source of log.read(startPosition, fetchSize), i.e. FileRecords#read, is as follows:

 /**
  * Return a slice of records from this instance, which is a view into this set starting from the given position
  * and with the given size limit.
  *
  * If the size is beyond the end of the file, the end will be based on the size of the file at the time of the read.
  *
  * If this message set is already sliced, the position will be taken relative to that slicing.
  *
  * @param position The start position to begin the read from
  * @param size The number of bytes after the start position to include
  * @return A sliced wrapper on this message set limited based on the given position and size
  */
 public FileRecords read(int position, int size) throws IOException {
     if (position < 0)
         throw new IllegalArgumentException("Invalid position: " + position);
     if (size < 0)
         throw new IllegalArgumentException("Invalid size: " + size);
 
     final int end;
     // handle integer overflow
     if (this.start + position + size < 0)
         end = sizeInBytes();
     else
         end = Math.min(this.start + position + size, sizeInBytes());
     return new FileRecords(file, channel, this.start + position, end, true);
 }
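
The only subtle step in the slice computation is the integer-overflow guard on the end position. The sketch below isolates it as a pure function. `SliceBoundsSketch.sliceEnd` is a hypothetical helper that follows the same arithmetic as the Java method shown above, not an API of FileRecords.

```scala
object SliceBoundsSketch {
  /** End position of a slice, clamped to the current size of the file.
    * A negative sum means start + position + size overflowed Int. */
  def sliceEnd(start: Int, position: Int, size: Int, sizeInBytes: Int): Int = {
    require(position >= 0, s"Invalid position: $position")
    require(size >= 0, s"Invalid size: $size")
    val sum = start + position + size
    if (sum < 0) sizeInBytes else math.min(sum, sizeInBytes)
  }
}
```

For example, a request for `Int.MaxValue` bytes from position 10 overflows the sum, and the slice is clamped to the end of the file instead of producing a negative end position.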

The source of processResponseCallback (defined in kafka.server.KafkaApis#handleFetchRequest) is as follows:

 // fetch response callback invoked after any throttling
   def fetchResponseCallback(bandwidthThrottleTimeMs: Int) {
     def createResponse(requestThrottleTimeMs: Int): RequestChannel.Response = {
       val convertedData = new util.LinkedHashMap[TopicPartition, FetchResponse.PartitionData]
       fetchedPartitionData.asScala.foreach { case (tp, partitionData) =>
         convertedData.put(tp, convertedPartitionData(tp, partitionData))
       }
       val response = new FetchResponse(convertedData, 0)
       val responseStruct = response.toStruct(versionId)
 
       trace(s"Sending fetch response to client $clientId of ${responseStruct.sizeOf} bytes.")
       response.responseData.asScala.foreach { case (topicPartition, data) =>
         // record the bytes out metrics only when the response is being sent
         brokerTopicStats.updateBytesOut(topicPartition.topic, fetchRequest.isFromFollower, data.records.sizeInBytes)
       }
 
       val responseSend = response.toSend(responseStruct, bandwidthThrottleTimeMs + requestThrottleTimeMs,
         request.connectionId, request.header)
       RequestChannel.Response(request, responseSend)
     }
 
     if (fetchRequest.isFromFollower)
       sendResponseExemptThrottle(createResponse(0))
     else
       sendResponseMaybeThrottle(request, request.header.clientId, requestThrottleMs =>
         requestChannel.sendResponse(createResponse(requestThrottleMs)))
   }
 
   // When this callback is triggered, the remote API call has completed.
   // Record time before any byte-rate throttling.
   request.apiRemoteCompleteTimeNanos = time.nanoseconds
 
   if (fetchRequest.isFromFollower) {
     // We've already evaluated against the quota and are good to go. Just need to record it now.
     val responseSize = sizeOfThrottledPartitions(versionId, fetchRequest, mergedPartitionData, quotas.leader)
     quotas.leader.record(responseSize)
     fetchResponseCallback(bandwidthThrottleTimeMs = 0)
   } else {
     // Fetch size used to determine throttle time is calculated before any down conversions.
     // This may be slightly different from the actual response size. But since down conversions
     // result in data being loaded into memory, it is better to do this after throttling to avoid OOM.
     val response = new FetchResponse(fetchedPartitionData, 0)
     val responseStruct = response.toStruct(versionId)
     quotas.fetch.recordAndMaybeThrottle(request.session.sanitizedUser, clientId, responseStruct.sizeOf,
       fetchResponseCallback)
   }
 }

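In summary, a fetch is resolved down to a specific LogSegment, and the records in that LogSegment are read by start position and size. The maximum size defaults to 1 MB and is configurable.

A data isolation mechanism is also provided, supporting read committed and read uncommitted (the default is read uncommitted). Because the fetch timeout is 0 here, the call returns immediately even when no data is available.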

IV. Spark Streaming Receiving Kafka Messages, Part 4: The Receiver Running on Workers

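Data is fetched by distributed receivers, and a WAL is used to achieve at-least-once semantics:

 conf.set("spark.streaming.receiver.writeAheadLog.enable","true") // enable the WAL

The three delivery semantics are:

 1. At most once: each record is processed at most once (0 or 1 times). Data can be lost under this semantics.
 2. At least once: each record is processed at least once (1 or more times). No data is lost, but records can be duplicated.
 3. Exactly once: each record is processed exactly once, with no data lost and no record processed more than once. This is the semantics everyone wants, and also the hardest to implement.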

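Without fault tolerance, data can be lost. The receiver keeps receiving data, and if the executor suddenly dies (or the driver dies and tells the executors to shut down) after ZooKeeper has been notified that the data was received but before the data has been processed, the data buffered in memory is lost. To address this, Spark 1.2 introduced the WAL (write-ahead log). With the WAL enabled, change the storage level of the data fetched by the receiver to StorageLevel.MEMORY_AND_DISK_SER_2.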

 // Drawback: the application cannot manage the offsets of the consumed topic partitions itself
 // Advantage: with the WAL enabled, at-least-once semantics are guaranteed
 val stream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](
     ssc,kafkaParams,map,StorageLevel.MEMORY_AND_DISK_SER_2)
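
For context, the WAL setting has to be in place before the StreamingContext is created, and the receiver WAL is written under the checkpoint directory. The fragment below is a minimal configuration sketch under those assumptions. It needs the Spark Streaming dependency on the classpath, and the object name, application name, batch interval and checkpoint path are placeholders rather than values from the original article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalConfigSketch {
  /** Builds a StreamingContext with the receiver write-ahead log enabled.
    * The master URL is expected to come from spark-submit. */
  def createContext(checkpointDir: String): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("wal-receiver-sketch") // hypothetical application name
      .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable the WAL
    val ssc = new StreamingContext(conf, Seconds(5)) // batch interval chosen arbitrarily
    // The WAL files are stored under the checkpoint directory,
    // so it should be on a fault-tolerant file system such as HDFS.
    ssc.checkpoint(checkpointDir)
    ssc
  }
}
```

A receiver stream such as the `KafkaUtils.createStream` call above would then be created on the returned context.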

1. Reading data from Kafka

1. The driver plans where the receivers run

org.apache.spark.streaming.StreamingContext#start starts the JobScheduler instance:

 // private[streaming] val scheduler = new JobScheduler(this)
 
 // Start the streaming scheduler in a new thread, so that thread local properties
 // like call sites and job groups can be reset without affecting those of the
 // current thread.
 ThreadUtils.runInNewThread("streaming-start") { // the function body runs in a separate daemon thread
   sparkContext.setCallSite(startSite.get)
   sparkContext.clearJobGroup()
   sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
   // call the scheduler's start method
   scheduler.start()
 }
 state = StreamingContextState.ACTIVE

The source of org.apache.spark.streaming.scheduler.JobScheduler#start is as follows:

 def start(): Unit = synchronized {
   if (eventLoop != null) return // scheduler has already been started
 
   logDebug("Starting JobScheduler")
   eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
     override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
 
     override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
   }
   eventLoop.start()
 
   // attach rate controllers of input streams to receive batch completion updates
   for {
     inputDStream <- ssc.graph.getInputStreams
     rateController <- inputDStream.rateController
   } ssc.addStreamingListener(rateController)
 
   listenerBus.start(ssc.sparkContext)
   receiverTracker = new ReceiverTracker(ssc)
   inputInfoTracker = new InputInfoTracker(ssc)
   receiverTracker.start()
   jobGenerator.start()
   logInfo("Started JobScheduler")
 }

The class documentation of ReceiverTracker reads:

This class manages the execution of the receivers of ReceiverInputDStreams. Instance of this class must be created after all input streams have been added and StreamingContext.start() has been called because it needs the final set of input streams at the time of instantiation.

Its start method is as follows:

 /** Start the endpoint and receiver execution thread. */
 def start(): Unit = synchronized {
   if (isTrackerStarted) {
     throw new SparkException("ReceiverTracker already started")
   }
 
   if (!receiverInputStreams.isEmpty) {
     // set up the RPC endpoint
     endpoint = ssc.env.rpcEnv.setupEndpoint( // note: this endpoint lives on the driver
       "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
     // send the command to launch the receivers from the driver node
     if (!skipReceiverLaunch) launchReceivers()
     logInfo("ReceiverTracker started")
     trackerState = Started
   }
 }
 
 /**
  * Get the receivers from the ReceiverInputDStreams, distributes them to the
  * worker nodes as a parallel collection, and runs them.
  */
 private def launchReceivers(): Unit = {
   val receivers = receiverInputStreams.map(nis => {
     // KafkaReceiver when the WAL is disabled, ReliableKafkaReceiver when it is enabled
     val rcvr = nis.getReceiver()
     rcvr.setReceiverId(nis.id)
     rcvr
   })
   // run a dummy job to make sure all slave nodes have registered,
   // so that the receivers are not all scheduled on the same local node
   runDummySparkJob()
 
   logInfo("Starting " + receivers.length + " receivers")
   endpoint.send(StartAllReceivers(receivers)) // ask the driver endpoint to dispatch the command that starts the receivers
 }

On the driver side, StartAllReceivers is handled as follows:

 override def receive: PartialFunction[Any, Unit] = {
   // Local messages
   case StartAllReceivers(receivers) =>
     // schedule the receivers onto executors
     val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
     for (receiver <- receivers) {
       val executors = scheduledLocations(receiver.streamId)
       updateReceiverScheduledExecutors(receiver.streamId, executors)
       receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
       startReceiver(receiver, executors)
     }
 ……
 }

The source of getExecutors is as follows:

 /**
  * Get the list of executors excluding driver
  */
 // In local mode, return the location of the single local executor;
 // in cluster mode (e.g. YARN), return the executors that are not the driver
 private def getExecutors: Seq[ExecutorCacheTaskLocation] = {
   if (ssc.sc.isLocal) { // running in local mode
     val blockManagerId = ssc.sparkContext.env.blockManager.blockManagerId
     Seq(ExecutorCacheTaskLocation(blockManagerId.host, blockManagerId.executorId))
   } else { // in cluster mode (e.g. YARN), filter out the driver's entry
     ssc.sparkContext.env.blockManager.master.getMemoryStatus.filter { case (blockManagerId, _) =>
       blockManagerId.executorId != SparkContext.DRIVER_IDENTIFIER // Ignore the driver location
     }.map { case (blockManagerId, _) =>
       ExecutorCacheTaskLocation(blockManagerId.host, blockManagerId.executorId)
     }.toSeq
   }
 }

The documentation of org.apache.spark.streaming.scheduler.ReceiverSchedulingPolicy#scheduleReceivers reads:

 Try our best to schedule receivers with evenly distributed. However, if the preferredLocations of receivers are not even, we may not be able to schedule them evenly because we have to respect them. Here is the approach to schedule executors:
 First, schedule all the receivers with preferred locations (hosts), evenly among the executors running on those host.
 Then, schedule all other receivers evenly among all the executors such that overall distribution over all the receivers is even.
 This method is called when we start to launch receivers at the first time.
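 In other words, this method makes sure that receivers are spread evenly across the worker nodes. It follows two rules:
 1. Receivers that have a preferred location are assigned to executors on those nodes.
 2. The remaining unassigned receivers are spread evenly over all worker nodes.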
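
The two rules above can be sketched in a few lines of plain Scala. This is a simplified illustration, not Spark's implementation: the types Exec and Rcv and the function schedule are hypothetical, and each receiver gets exactly one executor, whereas the real policy returns a list of candidate locations per receiver.

```scala
object ReceiverSchedulingSketch {
  // Hypothetical, simplified stand-ins for ExecutorCacheTaskLocation and Receiver.
  final case class Exec(host: String, id: String)
  final case class Rcv(streamId: Int, preferredHost: Option[String])

  /** Assign every receiver to one executor while keeping the per-executor load even. */
  def schedule(receivers: Seq[Rcv], executors: Seq[Exec]): Map[Int, Exec] = {
    require(executors.nonEmpty, "need at least one executor")
    val load = scala.collection.mutable.Map(executors.map(_ -> 0): _*)
    val assignment = scala.collection.mutable.LinkedHashMap.empty[Int, Exec]
    val deferred = scala.collection.mutable.ArrayBuffer.empty[Rcv]

    def assign(r: Rcv, candidates: Seq[Exec]): Unit = {
      val target = candidates.minBy(load) // least-loaded candidate; ties go to the first one
      load(target) += 1
      assignment(r.streamId) = target
    }

    // Rule 1: receivers with a preferred host go to executors running on that host.
    receivers.foreach { r =>
      val onHost = r.preferredHost.toSeq.flatMap(h => executors.filter(_.host == h))
      if (onHost.nonEmpty) assign(r, onHost) else deferred += r
    }
    // Rule 2: everything else is spread over all executors.
    deferred.foreach(r => assign(r, executors))
    assignment.toMap
  }
}
```

Handling the receivers with preferences first matters: the unconstrained receivers are then used to even out whatever imbalance the preferences caused.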

org.apache.spark.streaming.scheduler.ReceiverTracker#updateReceiverScheduledExecutors updates the mapping from receiver ID to receiver tracking info. Its source is as follows:

 private def updateReceiverScheduledExecutors(
     receiverId: Int, scheduledLocations: Seq[TaskLocation]): Unit = {
   val newReceiverTrackingInfo = receiverTrackingInfos.get(receiverId) match {
     case Some(oldInfo) =>
       oldInfo.copy(state = ReceiverState.SCHEDULED,
         scheduledLocations = Some(scheduledLocations))
     case None =>
       ReceiverTrackingInfo(
         receiverId,
         ReceiverState.SCHEDULED,
         Some(scheduledLocations),
         runningExecutor = None)
   }
   receiverTrackingInfos.put(receiverId, newReceiverTrackingInfo)
 }

2. The driver submits a distributed job to start the receiver

startReceiver is responsible for starting a receiver. Its source is as follows:

  /**
  * Start a receiver along with its scheduled executors
  */
 private def startReceiver(
     receiver: Receiver[_],
     scheduledLocations: Seq[TaskLocation]): Unit = {
   def shouldStartReceiver: Boolean = {
     // It's okay to start when trackerState is Initialized or Started
     !(isTrackerStopping || isTrackerStopped)
   }
 
   val receiverId = receiver.streamId
   if (!shouldStartReceiver) {
     onReceiverJobFinish(receiverId)
     return
   }
 
   val checkpointDirOption = Option(ssc.checkpointDir)
   val serializableHadoopConf =
     new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
 
   // Function that starts the receiver on a worker node
   val startReceiverFunc: Iterator[Receiver[_]] => Unit =
     (iterator: Iterator[Receiver[_]]) => {
       if (!iterator.hasNext) {
         throw new SparkException(
           "Could not start receiver as object not found.")
       }
       if (TaskContext.get().attemptNumber() == 0) {
         val receiver = iterator.next()
         assert(iterator.hasNext == false)
         val supervisor = new ReceiverSupervisorImpl(
           receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
         supervisor.start()
         supervisor.awaitTermination()
       } else {
         // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
       }
     }
 
   // Create the RDD using the scheduledLocations to run the receiver in a Spark job
   val receiverRDD: RDD[Receiver[_]] =
     if (scheduledLocations.isEmpty) {
       ssc.sc.makeRDD(Seq(receiver), 1)
     } else {
       val preferredLocations = scheduledLocations.map(_.toString).distinct
       ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
     }
   receiverRDD.setName(s"Receiver $receiverId")
   ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
   ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
   // Submit the distributed job that starts the receiver
   val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
     receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
   // We will keep restarting the receiver job until ReceiverTracker is stopped
   future.onComplete {
     case Success(_) =>
       if (!shouldStartReceiver) {
         onReceiverJobFinish(receiverId)
       } else {
         logInfo(s"Restarting Receiver $receiverId")
         self.send(RestartReceiver(receiver))
       }
     case Failure(e) =>
       if (!shouldStartReceiver) {
         onReceiverJobFinish(receiverId)
       } else {
         logError("Receiver has been stopped. Try to restart it.", e)
         logInfo(s"Restarting Receiver $receiverId")
         self.send(RestartReceiver(receiver))
       }
   }(submitJobThreadPool)
   logInfo(s"Receiver ${receiver.streamId} started")
 }
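
The future.onComplete callback above amounts to a "restart until the tracker stops" loop. A minimal sketch of that control flow with plain scala.concurrent futures follows. The names runUntilStopped, job and shouldRestart are hypothetical and are not Spark APIs; the sketch also restarts locally instead of sending a RestartReceiver message.

```scala
import scala.concurrent.{ExecutionContext, Future, Promise}

object RestartLoopSketch {
  /** Runs `job` repeatedly; after each run (success or failure) asks `shouldRestart`.
    * The returned future completes with the number of the last run. */
  def runUntilStopped(job: Int => Unit, shouldRestart: Int => Boolean)
                     (implicit ec: ExecutionContext): Future[Int] = {
    val finished = Promise[Int]()

    def launch(run: Int): Unit =
      Future(job(run)).onComplete { _ =>
        // Like the Success and Failure branches above: both outcomes lead to a restart
        // unless the owner is shutting down.
        if (shouldRestart(run)) launch(run + 1) else finished.success(run)
      }

    launch(1)
    finished.future
  }
}
```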

3. The worker node starts the receiver supervisor

The start method of org.apache.spark.streaming.receiver.ReceiverSupervisorImpl is as follows:

  /** Start the supervisor */
 def start() {
   onStart()
   startReceiver()
 }
 override protected def onStart() { // start the BlockGenerator services
   registeredBlockGenerators.foreach { _.start() }
 }
 // The startReceiver method is as follows:
 /** Start receiver */
 def startReceiver(): Unit = synchronized {
   try {
     if (onReceiverStart()) { // the receiver was registered successfully
       logInfo("Starting receiver")
       receiverState = Started
       receiver.onStart() // start the receiver
       logInfo("Called receiver onStart")
     } else {
       // The driver refused us
       stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
     }
   } catch {
     case NonFatal(t) =>
       stop("Error starting receiver " + streamId, Some(t))
   }
 }

4. Register the receiver with the driver

 override protected def onReceiverStart(): Boolean = {
   val msg = RegisterReceiver(
     streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
   trackerEndpoint.askWithRetry[Boolean](msg)
 }

Briefly, the driver's job here is to bring the receiver under the management of org.apache.spark.streaming.scheduler.ReceiverTracker. The mapping from stream ID to ReceiverTrackingInfo is kept in receiverTrackingInfos.

The key code of org.apache.spark.streaming.scheduler.ReceiverTracker#registerReceiver is as follows:

  val name = s"${typ}-${streamId}"
 val receiverTrackingInfo = ReceiverTrackingInfo(
   streamId,
   ReceiverState.ACTIVE,
   scheduledLocations = None,
   runningExecutor = Some(ExecutorCacheTaskLocation(host, executorId)),
   name = Some(name),
   endpoint = Some(receiverEndpoint))
 receiverTrackingInfos.put(streamId, receiverTrackingInfo)
 listenerBus.post(StreamingListenerReceiverStarted(receiverTrackingInfo.toReceiverInfo))

5. Start the receiver thread

Because we enabled the WAL, the receiver here is an instance of ReliableKafkaReceiver, so receiver.onStart is org.apache.spark.streaming.kafka.ReliableKafkaReceiver#onStart. Its source is as follows:

  override def onStart(): Unit = {
   logInfo(s"Starting Kafka Consumer Stream with group: $groupId")
 
   // Initialize the topic-partition / offset hash map.
   // 1. Tracks the mapping from each consumed topic-partition to its offset
   topicPartitionOffsetMap = new mutable.HashMap[TopicAndPartition, Long]
 
   // Initialize the stream block id / offset snapshot hash map.
   // 2. Tracks the mapping from block ID to the partition-offset snapshot
   blockOffsetMap = new ConcurrentHashMap[StreamBlockId, Map[TopicAndPartition, Long]]()
 
   // Initialize the block generator for storing Kafka message.
   // 3. The block generator that stores Kafka messages. Its argument is a GeneratedBlockHandler
   //    instance, a listener for block generator events.
   //    BlockGenerator: generates batches of objects received by a
   //    org.apache.spark.streaming.receiver.Receiver and puts them into appropriately named blocks
   //    at regular intervals. This class starts two threads, one to periodically start a new batch
   //    and prepare the previous batch as a block, the other to push the blocks into the block manager.
   blockGenerator = supervisor.createBlockGenerator(new GeneratedBlockHandler)
   // 4. Turn off the consumer's automatic offset commit; AUTO_OFFSET_COMMIT should be false
   if (kafkaParams.contains(AUTO_OFFSET_COMMIT) && kafkaParams(AUTO_OFFSET_COMMIT) == "true") {
     logWarning(s"$AUTO_OFFSET_COMMIT should be set to false in ReliableKafkaReceiver, " +
       "otherwise we will manually set it to false to turn off auto offset commit in Kafka")
   }
 
   val props = new Properties()
   kafkaParams.foreach(param => props.put(param._1, param._2))
   // Manually set "auto.commit.enable" to "false" no matter user explicitly set it to true,
   // we have to make sure this property is set to false to turn off auto commit mechanism in Kafka.
   props.setProperty(AUTO_OFFSET_COMMIT, "false")
 
   val consumerConfig = new ConsumerConfig(props)
 
   assert(!consumerConfig.autoCommitEnable)
 
   logInfo(s"Connecting to Zookeeper: ${consumerConfig.zkConnect}")
   // 5. Initialize the consumer; consumerConnector is a ZookeeperConsumerConnector instance
   consumerConnector = Consumer.create(consumerConfig)
   logInfo(s"Connected to Zookeeper: ${consumerConfig.zkConnect}")
   // 6. Initialize the ZooKeeper client
   zkClient = new ZkClient(consumerConfig.zkConnect, consumerConfig.zkSessionTimeoutMs,
     consumerConfig.zkConnectionTimeoutMs, ZKStringSerializer)
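   // 7. Create a fixed-size thread pool to handle the message streams. The pool size is the sum of
   //    the per-topic counts in topics (one thread per consumed partition), and every thread name
   //    gets the given prefix. Internally this is a ThreadPoolExecutor whose thread factory comes from Guava.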
   messageHandlerThreadPool = ThreadUtils.newDaemonFixedThreadPool(
     topics.values.sum, "KafkaMessageHandler")
   // 8. Start the two threads inside the BlockGenerator
   blockGenerator.start()
 
   // 9. Create the message streams
   val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
     .newInstance(consumerConfig.props)
     .asInstanceOf[Decoder[K]]
 
   val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
     .newInstance(consumerConfig.props)
     .asInstanceOf[Decoder[V]]
  
   val topicMessageStreams = consumerConnector.createMessageStreams(
     topics, keyDecoder, valueDecoder)
   // 10. Submit one MessageHandler per stream to the thread pool for execution
   topicMessageStreams.values.foreach { streams =>
     streams.foreach { stream =>
       messageHandlerThreadPool.submit(new MessageHandler(stream))
     }
   }
 }
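
Step 7 relies on a fixed pool of named daemon threads. The listing uses Spark's ThreadUtils helper; the sketch below shows the same idea with only the JDK executors, so the function daemonFixedPool and its hand-written ThreadFactory are illustrative assumptions rather than Spark's code.

```scala
import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object DaemonPoolSketch {
  /** Fixed-size pool whose threads are daemons named "<prefix>-<n>". */
  def daemonFixedPool(nThreads: Int, prefix: String): ExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.getAndIncrement()}")
        t.setDaemon(true) // daemon threads do not keep the JVM alive on shutdown
        t
      }
    }
    Executors.newFixedThreadPool(nThreads, factory)
  }
}
```

With topics = Map("a" -> 2, "b" -> 1), the pool size would be topics.values.sum = 3, which gives one thread per KafkaStream submitted in step 10.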

In step 9, the message streams are created by kafka.consumer.ZookeeperConsumerConnector#createMessageStreams, shown below:

 def createMessageStreams[K,V](topicCountMap: Map[String,Int], keyDecoder: Decoder[K], valueDecoder: Decoder[V])
     : Map[String, List[KafkaStream[K,V]]] = {
   if (messageStreamCreated.getAndSet(true))
     throw new MessageStreamsExistException(this.getClass.getSimpleName +
                                  " can create message streams at most once",null)
   consume(topicCountMap, keyDecoder, valueDecoder)
 }

It calls the consume method, whose source is as follows:

def consume[K, V](topicCountMap: scala.collection.Map[String,Int], keyDecoder: Decoder[K], valueDecoder: Decoder[V])
    : Map[String,List[KafkaStream[K,V]]] = {
  debug("entering consume ")
  if (topicCountMap == null)
    throw new RuntimeException("topicCountMap is null")
  // 1. Initialize topicCount
  val topicCount = TopicCount.constructTopicCount(consumerIdString, topicCountMap)
  // 2. Get the mapping from each topic to its set of thread IDs
  val topicThreadIds = topicCount.getConsumerThreadIdsPerTopic

  // 3. make a list of (queue,stream) pairs, one pair for each threadId
  val queuesAndStreams = topicThreadIds.values.map(threadIdSet =>
    threadIdSet.map(_ => {
      val queue =  new LinkedBlockingQueue[FetchedDataChunk](config.queuedMaxMessages)
      val stream = new KafkaStream[K,V](
        queue, config.consumerTimeoutMs, keyDecoder, valueDecoder, config.clientId)
      (queue, stream)
    })
  ).flatten.toList
  // 4. Get the ZooKeeper paths for this groupId
  val dirs = new ZKGroupDirs(config.groupId)
  // 5. Register the consumer under its group in ZooKeeper
  registerConsumerInZK(dirs, consumerIdString, topicCount)
  // 6. Reinitialize the consumer
  reinitializeConsumer(topicCount, queuesAndStreams)
  // 7. Return the streams
  loadBalancerListener.kafkaMessageAndMetadataStreams.asInstanceOf[Map[String, List[KafkaStream[K,V]]]]
}

6. The consumer consumes data from Kafka

The kafka.consumer.ZookeeperConsumerConnector#consume method contains the following steps:

   // Build the list of (queue, stream) pairs, one per threadId
   val queuesAndStreams = topicThreadIds.values.map(threadIdSet =>
     threadIdSet.map(_ => {
       val queue =  new LinkedBlockingQueue[FetchedDataChunk](config.queuedMaxMessages)
       val stream = new KafkaStream[K,V](
         queue, config.consumerTimeoutMs, keyDecoder, valueDecoder, config.clientId)
       (queue, stream)
     })
   ).flatten.toList
   // Get the ZooKeeper paths for this groupId
   val dirs = new ZKGroupDirs(config.groupId)
   // Register the consumer under its group in ZooKeeper
   registerConsumerInZK(dirs, consumerIdString, topicCount)
   // Reinitialize the consumer
   reinitializeConsumer(topicCount, queuesAndStreams)

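In the code above, each newly created queue (a LinkedBlockingQueue instance) is passed to the KafkaStream constructor, where an iterator takes data from it. The queue is also paired with its stream into a list of Tuple2[LinkedBlockingQueue[FetchedDataChunk], KafkaStream[K,V]], which is then passed to reinitializeConsumer. The source of kafka.consumer.ZookeeperConsumerConnector#reinitializeConsumer is as follows: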

 private def reinitializeConsumer[K,V](
     topicCount: TopicCount,
     queuesAndStreams: List[(LinkedBlockingQueue[FetchedDataChunk],KafkaStream[K,V])]) {
   // 1. Get this group's paths in ZooKeeper
   val dirs = new ZKGroupDirs(config.groupId)
 
   // listener to consumer and partition changes
   // 2. Initialize loadBalancerListener, which keeps watching for consumer and partition changes
   if (loadBalancerListener == null) {
     val topicStreamsMap = new mutable.HashMap[String,List[KafkaStream[K,V]]]
     loadBalancerListener = new ZKRebalancerListener(
       config.groupId, consumerIdString, topicStreamsMap.asInstanceOf[scala.collection.mutable.Map[String, List[KafkaStream[_,_]]]])
   }
 
   // create listener for session expired event if not exist yet
   // 3. Listener for session expiration; when a new session is established it notifies the load balancer
   if (sessionExpirationListener == null)
     sessionExpirationListener = new ZKSessionExpireListener(
       dirs, consumerIdString, topicCount, loadBalancerListener)
 
   // create listener for topic partition change event if not exist yet
   // 4. Initialize the ZKTopicPartitionChangeListener, which notifies the load balancer when a topic's partitions change
   if (topicPartitionChangeListener == null)
     topicPartitionChangeListener = new ZKTopicPartitionChangeListener(loadBalancerListener)
   // 5. Transform queuesAndStreams and add the result to loadBalancerListener.kafkaMessageAndMetadataStreams
   val topicStreamsMap = loadBalancerListener.kafkaMessageAndMetadataStreams
 
   // map of {topic -> Set(thread-1, thread-2, ...)}
   val consumerThreadIdsPerTopic: Map[String, Set[ConsumerThreadId]] =
     topicCount.getConsumerThreadIdsPerTopic
 
   val allQueuesAndStreams = topicCount match {
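     case wildTopicCount: WildcardTopicCount => // taken for wildcard (filter-based) subscriptions; a topicCountMap, as passed by consume above, yields a StaticTopicCount instead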
       /*
        * Wild-card consumption streams share the same queues, so we need to
        * duplicate the list for the subsequent zip operation.
        */
       (1 to consumerThreadIdsPerTopic.keySet.size).flatMap(_ => queuesAndStreams).toList
     case statTopicCount: StaticTopicCount =>
       queuesAndStreams
   }
 
   val topicThreadIds = consumerThreadIdsPerTopic.map {
     case(topic, threadIds) =>
       threadIds.map((topic, _))
   }.flatten
 
   require(topicThreadIds.size == allQueuesAndStreams.size,
     "Mismatch between thread ID count (%d) and queue count (%d)"
     .format(topicThreadIds.size, allQueuesAndStreams.size))
   val threadQueueStreamPairs = topicThreadIds.zip(allQueuesAndStreams)
 
   threadQueueStreamPairs.foreach(e => {
     val topicThreadId = e._1
     val q = e._2._1
     topicThreadIdAndQueues.put(topicThreadId, q)
     debug("Adding topicThreadId %s and queue %s to topicThreadIdAndQueues data structure".format(topicThreadId, q.toString))
     newGauge(
       "FetchQueueSize",
       new Gauge[Int] {
         def value = q.size
       },
       Map("clientId" -> config.clientId,
         "topic" -> topicThreadId._1,
         "threadId" -> topicThreadId._2.threadId.toString)
     )
   })
 
   val groupedByTopic = threadQueueStreamPairs.groupBy(_._1._1)
   groupedByTopic.foreach(e => {
     val topic = e._1
     val streams = e._2.map(_._2._2).toList
     topicStreamsMap += (topic -> streams)
     debug("adding topic %s and %d streams to map.".format(topic, streams.size))
   })
 
   // listener to consumer and partition changes
   // 6. Register sessionExpirationListener with zkClient
   zkClient.subscribeStateChanges(sessionExpirationListener)
   // 7. Register loadBalancerListener with zkClient
   zkClient.subscribeChildChanges(dirs.consumerRegistryDir, loadBalancerListener)
   // For every topic, register topicPartitionChangeListener with zkClient
   topicStreamsMap.foreach { topicAndStreams =>
     // register on broker partition path changes
     val topicPath = BrokerTopicsPath + "/" + topicAndStreams._1
     zkClient.subscribeDataChanges(topicPath, topicPartitionChangeListener)
   }
 
   // explicitly trigger load balancing for this consumer
   // 8. Trigger a synchronous rebalance through loadBalancerListener
   loadBalancerListener.syncedRebalance()
 }
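
As noted before this listing, every queue is shared by two parties: a fetcher puts chunks into it and the stream's iterator takes them out. The following sketch shows that hand-off with a bounded LinkedBlockingQueue. Chunk and ChunkStream are hypothetical stand-ins for FetchedDataChunk and KafkaStream, and ending the stream with a sentinel element is an assumption of this sketch.

```scala
import java.util.concurrent.LinkedBlockingQueue

object QueueHandoffSketch {
  // Hypothetical stand-in for FetchedDataChunk.
  final case class Chunk(partition: Int, payload: String)

  /** Consumer side: exposes the queue as a blocking iterator that ends at `sentinel`. */
  final class ChunkStream(queue: LinkedBlockingQueue[Chunk], sentinel: Chunk)
      extends Iterator[Chunk] {
    private var nextChunk: Option[Chunk] = None
    private var finished = false

    def hasNext: Boolean = {
      if (!finished && nextChunk.isEmpty) {
        val c = queue.take() // blocks until the producer has put something
        if (c == sentinel) finished = true else nextChunk = Some(c)
      }
      nextChunk.isDefined
    }

    def next(): Chunk = {
      if (!hasNext) throw new NoSuchElementException("stream closed")
      val c = nextChunk.get
      nextChunk = None
      c
    }
  }

  /** Producer thread fills the queue while the caller drains it through the stream. */
  def demo(): List[String] = {
    val queue = new LinkedBlockingQueue[Chunk](2) // bounded, like config.queuedMaxMessages
    val shutdown = Chunk(-1, "")
    val stream = new ChunkStream(queue, shutdown)
    val producer = new Thread(() => {
      (1 to 3).foreach(i => queue.put(Chunk(0, s"msg-$i"))) // put blocks when the queue is full
      queue.put(shutdown)
    })
    producer.start()
    val out = stream.map(_.payload).toList
    producer.join()
    out
  }
}
```

The bounded capacity provides back-pressure: a fetcher that runs ahead of the message handlers simply blocks in put until the consumer catches up.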

Focus on step 8, the synchronous rebalance through loadBalancerListener. The source of kafka.consumer.ZookeeperConsumerConnector.ZKRebalancerListener#syncedRebalance is as follows:

 def syncedRebalance() {
   rebalanceLock synchronized {
     rebalanceTimer.time {
       if(isShuttingDown.get())  { // if the ZookeeperConsumerConnector has already shut down, return immediately
         return
       } else {
         for (i <- 0 until config.rebalanceMaxRetries) { // 4 by default
           info("begin rebalancing consumer " + consumerIdString + " try #" + i)
           var done = false
           var cluster: Cluster = null
           try {
             // 1. Build a Cluster object via zkClient; it holds the list of brokers (each broker's ID and its path in ZooKeeper)
             cluster = getCluster(zkClient) 
             // 2. Perform the rebalance against the cluster
             done = rebalance(cluster)
           } catch {
             case e: Throwable =>
               /** occasionally, we may hit a ZK exception because the ZK state is changing while we are iterating.
                 * For example, a ZK node can disappear between the time we get all children and the time we try to get
                 * the value of a child. Just let this go since another rebalance will be triggered.
                 **/
               info("exception during rebalance ", e)
           }
           info("end rebalancing consumer " + consumerIdString + " try #" + i)
           if (done) {
             return
           } else {
             /* Here the cache is at a risk of being stale. To take future rebalancing decisions correctly, we should
              * clear the cache */
             info("Rebalancing attempt failed. Clearing the cache before the next rebalancing operation is triggered")
           }
           // stop all fetchers and clear all the queues to avoid data duplication
           closeFetchersForQueues(cluster, kafkaMessageAndMetadataStreams, topicThreadIdAndQueues.map(q => q._2))
           Thread.sleep(config.rebalanceBackoffMs)
         }
       }
     }
   }
 
   throw new ConsumerRebalanceFailedException(consumerIdString + " can't rebalance after " + config.rebalanceMaxRetries +" retries")
 }
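
Stripped of the ZooKeeper details, syncedRebalance is a bounded retry loop with a fixed backoff that throws once the retry budget is used up. A minimal sketch of that pattern follows; retryWithBackoff and its parameters are hypothetical names, with maxRetries and backoffMs playing the roles of config.rebalanceMaxRetries and config.rebalanceBackoffMs.

```scala
object RetrySketch {
  /** Runs `attempt` up to `maxRetries` times, sleeping `backoffMs` after each failed try.
    * Returns how many attempts were needed, or throws once the budget is exhausted. */
  def retryWithBackoff(maxRetries: Int, backoffMs: Long)(attempt: Int => Boolean): Int = {
    var i = 0
    while (i < maxRetries) {
      // An exception counts as a failed attempt, as in the catch block of the listing.
      val done = try attempt(i) catch { case scala.util.control.NonFatal(_) => false }
      if (done) return i + 1
      Thread.sleep(backoffMs)
      i += 1
    }
    throw new IllegalStateException(s"can't complete after $maxRetries retries")
  }
}
```

Unlike the listing, which catches Throwable, the sketch only swallows non-fatal exceptions; that is a deliberate deviation so that errors such as OutOfMemoryError still propagate.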

Focus on step 2, the rebalance against the cluster. The source of kafka.consumer.ZookeeperConsumerConnector.ZKRebalancerListener#rebalance is as follows:

  private def rebalance(cluster: Cluster): Boolean = {
   // 1. Get the mapping from topic to consumer thread IDs for this consumer in the group
   val myTopicThreadIdsMap = TopicCount.constructTopicCount(
     group, consumerIdString, zkClient, config.excludeInternalTopics).getConsumerThreadIdsPerTopic
   // 2. Get all available brokers in the Kafka cluster
   val brokers = getAllBrokersInCluster(zkClient)
   if (brokers.size == 0) { // no brokers available: subscribe the listener and return true
     // This can happen in a rare case when there are no brokers available in the cluster when the consumer is started.
     // We log an warning and register for child changes on brokers/id so that rebalance can be triggered when the brokers
     // are up.
     warn("no brokers found when trying to rebalance.")
     zkClient.subscribeChildChanges(ZkUtils.BrokerIdsPath, loadBalancerListener)
     true
   }
   else {
     /**
      * fetchers must be stopped to avoid data duplication, since if the current
      * rebalancing attempt fails, the partitions that are released could be owned by another consumer.
      * But if we don't stop the fetchers first, this consumer would continue returning data for released
      * partitions in parallel. So, not stopping the fetchers leads to duplicate data.
      */
     // 3. Preparation before the rebalance
     // 3.1 Close the existing fetcher connections
     closeFetchers(cluster, kafkaMessageAndMetadataStreams, myTopicThreadIdsMap)
     // 3.2 Release partition ownership (mainly deleting the owner nodes in ZooKeeper and dropping the in-memory topic-to-fetcher associations)
     releasePartitionOwnership(topicRegistry)
     // 3.3 Reassign the partitions to fetchers (consumer threads)
     val assignmentContext = new AssignmentContext(group, consumerIdString, config.excludeInternalTopics, zkClient)
     val partitionOwnershipDecision = partitionAssignor.assign(assignmentContext)
     val currentTopicRegistry = new Pool[String, Pool[Int, PartitionTopicInfo]](
       valueFactory = Some((topic: String) => new Pool[Int, PartitionTopicInfo]))
 
     // fetch current offsets for all topic-partitions
     // 3.4 Fetch the current offsets of the assigned partitions; the offset here is the next offset the consumer will consume
     val topicPartitions = partitionOwnershipDecision.keySet.toSeq
 
     val offsetFetchResponseOpt = fetchOffsets(topicPartitions)
 
     if (isShuttingDown.get || !offsetFetchResponseOpt.isDefined)
       false
     else {
       // 3.5 Update the partition-to-fetcher mapping
       val offsetFetchResponse = offsetFetchResponseOpt.get
       topicPartitions.foreach(topicAndPartition => {
         val (topic, partition) = topicAndPartition.asTuple
         // requestInfo is a member of OffsetFetchResponse, of type Map[TopicAndPartition, OffsetMetadataAndError]
         val offset = offsetFetchResponse.requestInfo(topicAndPartition).offset
         val threadId = partitionOwnershipDecision(topicAndPartition)
         addPartitionTopicInfo(currentTopicRegistry, partition, topic, offset, threadId)
       })
 
       /**
        * move the partition ownership here, since that can be used to indicate a truly successful rebalancing attempt
        * A rebalancing attempt is completed successfully only after the fetchers have been started correctly
        */
       if(reflectPartitionOwnershipDecision(partitionOwnershipDecision)) {
         allTopicsOwnedPartitionsCount = partitionOwnershipDecision.size
 
         partitionOwnershipDecision.view.groupBy { case(topicPartition, consumerThreadId) => topicPartition.topic }
                                   .foreach { case (topic, partitionThreadPairs) =>
           newGauge("OwnedPartitionsCount",
             new Gauge[Int] {
               def value() = partitionThreadPairs.size
             },
             ownedPartitionsCountMetricTags(topic))
         }
         // 3.6 Replace the old topic registry with the new one
         topicRegistry = currentTopicRegistry
         // 4. Update the fetcher
         updateFetcher(cluster)
         true
       } else {
         false
       }
     }
   }
 }

The source of addPartitionTopicInfo is as follows:

 private def addPartitionTopicInfo(currentTopicRegistry: Pool[String, Pool[Int, PartitionTopicInfo]],
                                     partition: Int, topic: String,
                                     offset: Long, consumerThreadId: ConsumerThreadId) {
     // If the map has no entry for the key, valueFactory creates one; the corresponding value is returned
     val partTopicInfoMap = currentTopicRegistry.getAndMaybePut(topic)
 
     val queue = topicThreadIdAndQueues.get((topic, consumerThreadId))
     val consumedOffset = new AtomicLong(offset)
     val fetchedOffset = new AtomicLong(offset)
     val partTopicInfo = new PartitionTopicInfo(topic,
                                                partition,
                                                queue,
                                                consumedOffset,
                                                fetchedOffset,
                                                new AtomicInteger(config.fetchMessageMaxBytes),
                                                config.clientId)
     // 1. Register it in the new topic registry, i.e. record the partition-to-fetcher relationship
     partTopicInfoMap.put(partition, partTopicInfo)
     debug(partTopicInfo + " selected new offset " + offset)
     // 2. Record the consumer's already consumed (checkpointed) offset
     checkpointedZkOffsets.put(TopicAndPartition(topic, partition), offset)
   }

Step 4, updating the fetcher, has the following source:

 private def updateFetcher(cluster: Cluster) {
   // update partitions for fetcher
   var allPartitionInfos : List[PartitionTopicInfo] = Nil
   for (partitionInfos <- topicRegistry.values)
     for (partition <- partitionInfos.values)
       allPartitionInfos ::= partition
   info("Consumer " + consumerIdString + " selected partitions : " +
     allPartitionInfos.sortWith((s,t) => s.partitionId < t.partitionId).map(_.toString).mkString(","))
 
   fetcher match {
     case Some(f) =>
       f.startConnections(allPartitionInfos, cluster)
     case None =>
   }
 }

Here f.startConnections performs the actual update. This introduces a new class, the fetcher manager kafka.consumer.ConsumerFetcherManager.

The source of kafka.consumer.ConsumerFetcherManager#startConnections is as follows:

  def startConnections(topicInfos: Iterable[PartitionTopicInfo], cluster: Cluster) {
   // Once a partition's leader broker is available, LeaderFinderThread adds a fetcher for that leader
   leaderFinderThread = new LeaderFinderThread(consumerIdString + "-leader-finder-thread")
   leaderFinderThread.start()
 
   inLock(lock) {
     // Update the ConsumerFetcherManager member variables
     partitionMap = topicInfos.map(tpi => (TopicAndPartition(tpi.topic, tpi.partitionId), tpi)).toMap
     this.cluster = cluster
     noLeaderPartitionSet ++= topicInfos.map(tpi => TopicAndPartition(tpi.topic, tpi.partitionId))
     cond.signalAll()
   }
 }

ConsumerFetcherManager holds a LeaderFinderThread instance. Its parent class is kafka.utils.ShutdownableThread, whose run method is as follows:

 override def run(): Unit = {
   info("Starting ")
   try{
     while(isRunning.get()){
       doWork()
     }
   } catch{
     case e: Throwable =>
       if(isRunning.get())
         error("Error due to ", e)
   }
   shutdownLatch.countDown()
   info("Stopped ")
 }
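
The run method above is a small template-method pattern: the base class owns the loop and the shutdown handshake, and subclasses only supply doWork. A self-contained sketch of the same shape is shown below. LoopThread is a hypothetical minimal analogue and omits the logging and interrupt handling of the real class.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

object LoopThreadSketch {
  /** Hypothetical minimal analogue of a shutdownable worker thread. */
  abstract class LoopThread(name: String) extends Thread(name) {
    private val running = new AtomicBoolean(true)
    private val stopped = new CountDownLatch(1)

    /** One unit of work; called repeatedly until shutdown is requested. */
    def doWork(): Unit

    override def run(): Unit = {
      try { while (running.get()) doWork() }
      finally stopped.countDown() // always signal termination, even if doWork throws
    }

    /** Ask the loop to stop and wait until the current doWork call has returned. */
    def shutdown(): Unit = { running.set(false); stopped.await() }
  }

  /** Starts a worker, waits for at least one iteration, shuts it down, returns the tick count. */
  def demo(): Int = {
    val ticks = new AtomicInteger(0)
    val started = new CountDownLatch(1)
    val worker = new LoopThread("sketch-worker") {
      def doWork(): Unit = { ticks.incrementAndGet(); started.countDown(); Thread.sleep(1) }
    }
    worker.start()
    started.await()
    worker.shutdown()
    ticks.get()
  }
}
```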

doWork is an abstract method; the LeaderFinderThread subclass implements it as follows:

  // thread responsible for adding the fetcher to the right broker when leader is available
 override def doWork() {
   // 1. Get the mapping from partition to leader broker
   val leaderForPartitionsMap = new HashMap[TopicAndPartition, Broker]
   lock.lock()
   try {
     while (noLeaderPartitionSet.isEmpty) { // this field is populated in startConnections
       trace("No partition for leader election.")
       cond.await()
     }
 
     trace("Partitions without leader %s".format(noLeaderPartitionSet))
     val brokers = getAllBrokersInCluster(zkClient) // get all available brokers
     // Get the sequence of kafka.api.TopicMetadata, which holds each topic's partition IDs, ISR, leader and replicas
     val topicsMetadata = ClientUtils.fetchTopicMetadata(noLeaderPartitionSet.map(m => m.topic).toSet,
                                                         brokers,
                                                         config.clientId,
                                                         config.socketTimeoutMs,
                                                         correlationId.getAndIncrement).topicsMetadata
     if(logger.isDebugEnabled) topicsMetadata.foreach(topicMetadata => debug(topicMetadata.toString()))
     // 2. Use the partition-to-leader information to update noLeaderPartitionSet and leaderForPartitionsMap:
     //    noLeaderPartitionSet holds the partitions whose leader is still unknown, and
     //    leaderForPartitionsMap holds the partitions whose leader has been determined.
     topicsMetadata.foreach { tmd =>
       val topic = tmd.topic
       tmd.partitionsMetadata.foreach { pmd =>
         val topicAndPartition = TopicAndPartition(topic, pmd.partitionId)
         if(pmd.leader.isDefined && noLeaderPartitionSet.contains(topicAndPartition)) {
           val leaderBroker = pmd.leader.get
           leaderForPartitionsMap.put(topicAndPartition, leaderBroker)
           noLeaderPartitionSet -= topicAndPartition
         }
       }
     }
   } catch {
     case t: Throwable => {
         if (!isRunning.get())
           throw t /* If this thread is stopped, propagate this exception to kill the thread. */
         else
           warn("Failed to find leader for %s".format(noLeaderPartitionSet), t)
       }
   } finally {
     lock.unlock()
   }
 
   try {
     // 3. Assign fetchers to the partitions
     addFetcherForPartitions(leaderForPartitionsMap.map{
       case (topicAndPartition, broker) =>
         topicAndPartition -> BrokerAndInitialOffset(broker, partitionMap(topicAndPartition).getFetchOffset())}
     )
   } catch {
     case t: Throwable => {
       if (!isRunning.get())
         throw t /* If this thread is stopped, propagate this exception to kill the thread. */
       else {
         warn("Failed to add leader for partitions %s; will retry".format(leaderForPartitionsMap.keySet.mkString(",")), t)
         lock.lock()
         noLeaderPartitionSet ++= leaderForPartitionsMap.keySet
         lock.unlock()
       }
     }
   }
   // 4. Shut down idle fetcher threads
   shutdownIdleFetcherThreads()
   Thread.sleep(config.refreshLeaderBackoffMs)
 }

Focus on step 3, assigning fetchers to partitions. The source of addFetcherForPartitions is as follows:

  def addFetcherForPartitions(partitionAndOffsets: Map[TopicAndPartition, BrokerAndInitialOffset]) {
   mapLock synchronized {
     // Group the partitions by fetcher (broker plus fetcher ID)
     val partitionsPerFetcher = partitionAndOffsets.groupBy{ case(topicAndPartition, brokerAndInitialOffset) =>
       BrokerAndFetcherId(brokerAndInitialOffset.broker, getFetcherId(topicAndPartition.topic, topicAndPartition.partition))}
     for ((brokerAndFetcherId, partitionAndOffsets) <- partitionsPerFetcher) {
 
       var fetcherThread: AbstractFetcherThread = null
       fetcherThreadMap.get(brokerAndFetcherId) match {
         case Some(f) => fetcherThread = f
         case None =>
           // Create the fetcher for this brokerAndFetcherId and start it
           fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
           fetcherThreadMap.put(brokerAndFetcherId, fetcherThread)
           fetcherThread.start
       }
 
       fetcherThreadMap(brokerAndFetcherId).addPartitions(partitionAndOffsets.map { case (topicAndPartition, brokerAndInitOffset) =>
         topicAndPartition -> brokerAndInitOffset.initOffset
       })
     }
   }
 
   info("Added fetcher for partitions %s".format(partitionAndOffsets.map{ case (topicAndPartition, brokerAndInitialOffset) =>
     "[" + topicAndPartition + ", initOffset " + brokerAndInitialOffset.initOffset + " to broker " + brokerAndInitialOffset.broker + "] "}))
 }
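
The core of addFetcherForPartitions is the groupBy that buckets partitions by (leader broker, fetcher ID), so that one fetcher thread serves every partition in its bucket. The sketch below reproduces only that grouping step. The case classes are hypothetical, and the hash used in fetcherId is an assumed scheme for illustration; the listing does not show how getFetcherId is computed.

```scala
object FetcherGroupingSketch {
  // Hypothetical simplified stand-ins for TopicAndPartition and BrokerAndFetcherId.
  final case class TopicPartition(topic: String, partition: Int)
  final case class BrokerAndFetcher(brokerId: Int, fetcherId: Int)

  /** Assumed hashing scheme: spreads partitions over `numFetchers` fetchers per broker. */
  def fetcherId(tp: TopicPartition, numFetchers: Int): Int =
    Math.floorMod(31 * tp.topic.hashCode + tp.partition, numFetchers)

  /** Buckets partitions by (leader broker, fetcher ID); one fetcher thread per bucket. */
  def group(leaders: Map[TopicPartition, Int],
            numFetchers: Int): Map[BrokerAndFetcher, Set[TopicPartition]] =
    leaders
      .groupBy { case (tp, brokerId) => BrokerAndFetcher(brokerId, fetcherId(tp, numFetchers)) }
      .map { case (key, bucket) => key -> bucket.keySet }
}
```

With a single fetcher per broker, the number of buckets equals the number of distinct leader brokers, which is why a consumer typically ends up with one fetcher thread per broker it reads from.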

The source of kafka.consumer.ConsumerFetcherManager#createFetcherThread is as follows:

 override def createFetcherThread(fetcherId: Int, sourceBroker: Broker): AbstractFetcherThread = {
   new ConsumerFetcherThread(
     "ConsumerFetcherThread-%s-%d-%d".format(consumerIdString, fetcherId, sourceBroker.id),
     config, sourceBroker, partitionMap, this)
 }

First, the constructor declaration of ConsumerFetcherThread:

 class ConsumerFetcherThread(name: String,
                             val config: ConsumerConfig,
                             sourceBroker: Broker,
                             partitionMap: Map[TopicAndPartition, PartitionTopicInfo],
                             val consumerFetcherManager: ConsumerFetcherManager)
         extends AbstractFetcherThread(name = name, 
                                       clientId = config.clientId,
                                       sourceBroker = sourceBroker,
                                       socketTimeout = config.socketTimeoutMs,
                                       socketBufferSize = config.socketReceiveBufferBytes,
                                       fetchSize = config.fetchMessageMaxBytes,
                                       fetcherBrokerId = Request.OrdinaryConsumerId,
                                       maxWait = config.fetchWaitMaxMs,
                                       minBytes = config.fetchMinBytes,
                                       isInterruptible = true)

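Note that the values of partitionMap are PartitionTopicInfo objects, each of which wraps the BlockingQueue[FetchedDataChunk] instance that stores the fetch results. The run method is the kafka.utils.ShutdownableThread#run method we have already seen, so we mainly look at how the subclass overrides doWork: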

 override def doWork() {
   inLock(partitionMapLock) { // acquire the lock, run the body, release the lock
     if (partitionMap.isEmpty) // nothing to fetch: wait up to 200 ms before continuing
       partitionMapCond.await(200L, TimeUnit.MILLISECONDS)
     partitionMap.foreach { // add every partition's fetch info to fetchRequestBuilder
       case((topicAndPartition, offset)) =>
         fetchRequestBuilder.addFetch(topicAndPartition.topic, topicAndPartition.partition,
                          offset, fetchSize)
     }
   }
   // build the batched FetchRequest object
   val fetchRequest = fetchRequestBuilder.build()
   // process the FetchRequest
   if (!fetchRequest.requestInfo.isEmpty)
     processFetchRequest(fetchRequest)
 }
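
The lock-then-wait pattern in doWork can be tried out in isolation. The following is a minimal sketch, not Kafka's code: TopicPartition, FetchEntry, addPartition and buildRequest are hypothetical names introduced only for illustration, and the 200 ms idle wait mirrors the constant in the listing above.

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable

object FetchLoopSketch {
  // Hypothetical stand-in for Kafka's TopicAndPartition.
  final case class TopicPartition(topic: String, partition: Int)
  // Hypothetical stand-in for one entry of a fetch request.
  final case class FetchEntry(tp: TopicPartition, offset: Long, fetchSize: Int)

  private val lock = new ReentrantLock()
  private val notEmpty = lock.newCondition()
  private val partitionMap = mutable.HashMap.empty[TopicPartition, Long]

  // Run `body` while holding `lock`, always releasing it afterwards.
  private def inLock[T](body: => T): T = {
    lock.lock()
    try body finally lock.unlock()
  }

  // Register a partition to fetch and wake up a waiting loop iteration.
  def addPartition(tp: TopicPartition, offset: Long): Unit = inLock {
    partitionMap.put(tp, offset)
    notEmpty.signalAll()
  }

  // One pass of the loop: wait briefly if idle, then snapshot the pending fetches.
  def buildRequest(fetchSize: Int, idleWaitMs: Long = 200L): Seq[FetchEntry] = inLock {
    if (partitionMap.isEmpty) {
      notEmpty.await(idleWaitMs, TimeUnit.MILLISECONDS)
    }
    partitionMap.toSeq.map { case (tp, offset) => FetchEntry(tp, offset, fetchSize) }
  }
}
```

As in the listing, the lock is held only while the request is assembled; the network call that follows would run outside it, so other threads can keep updating the partition map.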

The source of kafka.server.AbstractFetcherThread#processFetchRequest is as follows:

  private def processFetchRequest(fetchRequest: FetchRequest) {
   val partitionsWithError = new mutable.HashSet[TopicAndPartition]
   var response: FetchResponse = null
   try {
     trace("Issuing to broker %d of fetch request %s".format(sourceBroker.id, fetchRequest))
     // Send the request and get the response.
     // simpleConsumer is a SimpleConsumer instance, which was described earlier.
     response = simpleConsumer.fetch(fetchRequest)
   } catch {
     case t: Throwable =>
       if (isRunning.get) {
         warn("Error in fetch %s. Possible cause: %s".format(fetchRequest, t.toString))
         partitionMapLock synchronized {
           partitionsWithError ++= partitionMap.keys
         }
       }
   }
   fetcherStats.requestRate.mark()
 
   if (response != null) {
     // process fetched data
     inLock(partitionMapLock) { // acquire the lock, process the response, release the lock
       response.data.foreach {
         case(topicAndPartition, partitionData) =>
           val (topic, partitionId) = topicAndPartition.asTuple
           val currentOffset = partitionMap.get(topicAndPartition)
           // we append to the log if the current offset is defined and it is the same as the offset requested during fetch
           if (currentOffset.isDefined && fetchRequest.requestInfo(topicAndPartition).offset == currentOffset.get) {
             partitionData.error match { // choose the handling branch according to the returned error code
               case ErrorMapping.NoError => // success, no error
                 try {
                   val messages = partitionData.messages.asInstanceOf[ByteBufferMessageSet]
                   val validBytes = messages.validBytes
                   val newOffset = messages.shallowIterator.toSeq.lastOption match {
                     case Some(m: MessageAndOffset) => m.nextOffset
                     case None => currentOffset.get
                   }
                   partitionMap.put(topicAndPartition, newOffset)
                   fetcherLagStats.getFetcherLagStats(topic, partitionId).lag = partitionData.hw - newOffset
                   fetcherStats.byteRate.mark(validBytes)
                   // Once we hand off the partition data to the subclass, we can't mess with it any more in this thread
                   processPartitionData(topicAndPartition, currentOffset.get, partitionData)
                 } catch {
                   case ime: InvalidMessageException => // the fetched messages are invalid or incomplete
                     // we log the error and continue. This ensures two things
                     // 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread down and cause other topic partition to also lag
                     // 2. If the message is corrupt due to a transient state in the log (truncation, partial writes can cause this), we simply continue and
                     //    should get fixed in the subsequent fetches
                     logger.error("Found invalid messages during fetch for partition [" + topic + "," + partitionId + "] offset " + currentOffset.get + " error " + ime.getMessage)
                   case e: Throwable =>
                     throw new KafkaException("error processing data for partition [%s,%d] offset %d"
                                              .format(topic, partitionId, currentOffset.get), e)
                 }
               case ErrorMapping.OffsetOutOfRangeCode => // offset out of range error
                 try {
                   val newOffset = handleOffsetOutOfRange(topicAndPartition)
                   partitionMap.put(topicAndPartition, newOffset)
                   error("Current offset %d for partition [%s,%d] out of range; reset offset to %d"
                     .format(currentOffset.get, topic, partitionId, newOffset))
                 } catch {
                   case e: Throwable =>
                     error("Error getting offset for partition [%s,%d] to broker %d".format(topic, partitionId, sourceBroker.id), e)
                     partitionsWithError += topicAndPartition
                 }
               case _ =>
                 if (isRunning.get) {
                   error("Error for partition [%s,%d] to broker %d:%s".format(topic, partitionId, sourceBroker.id,
                     ErrorMapping.exceptionFor(partitionData.error).getClass))
                   partitionsWithError += topicAndPartition
                 }
             }
           }
       }
     }
   }
 
   if(partitionsWithError.size > 0) {
     debug("handling partitions with error for %s".format(partitionsWithError))
     handlePartitionsWithErrors(partitionsWithError)
   }
 }
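
The offset bookkeeping in the NoError branch comes down to two small calculations. Here is a minimal sketch of them, assuming a hypothetical Msg class in place of Kafka's MessageAndOffset; the names advance and lag are illustrative only.

```scala
object NextOffsetSketch {
  // Hypothetical stand-in for kafka.message.MessageAndOffset.
  final case class Msg(offset: Long) {
    def nextOffset: Long = offset + 1
  }

  // The offset to request next: one past the last fetched message,
  // or the unchanged current offset when the fetch returned nothing.
  def advance(currentOffset: Long, fetched: Seq[Msg]): Long =
    fetched.lastOption.map(_.nextOffset).getOrElse(currentOffset)

  // Lag as computed in the listing: high watermark minus the new offset.
  def lag(highWatermark: Long, newOffset: Long): Long =
    highWatermark - newOffset
}
```

For instance, with a current offset of 10 and fetched messages at offsets 10, 11 and 12, advance yields 13; with a high watermark of 20 the lag is then 7.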

The source of processPartitionData, which handles the returned messages, is as follows:

  // process fetched data
 def processPartitionData(topicAndPartition: TopicAndPartition, fetchOffset: Long, partitionData: FetchResponsePartitionData) {
   // partitionMap is a member variable passed in through the constructor
   val pti = partitionMap(topicAndPartition)
   if (pti.getFetchOffset != fetchOffset)
     throw new RuntimeException("Offset doesn't match for partition [%s,%d] pti offset: %d fetch offset: %d"
                               .format(topicAndPartition.topic, topicAndPartition.partition, pti.getFetchOffset, fetchOffset))
   // enqueue the data
   pti.enqueue(partitionData.messages.asInstanceOf[ByteBufferMessageSet])
 }

As you can see, this is where the messages fetched from the leader are finally put into the BlockingQueue[FetchedDataChunk], a blocking buffer queue.

7. KafkaStream takes data from the queue in a blocking fashion

KafkaStream is backed by a LinkedBlockingQueue. It likewise returns an iterator, kafka.consumer.ConsumerIterator, for traversing the data in the stream. The main source of kafka.consumer.ConsumerIterator is as follows:

 // check whether there is a next element
 def hasNext(): Boolean = {
   if(state == FAILED)
     throw new IllegalStateException("Iterator is in failed state")
   state match {
     case DONE => false
     case READY => true
     case _ => maybeComputeNext()
   }
 }
 // get the next element (parent-class implementation)
 def next(): T = {
   if(!hasNext())
     throw new NoSuchElementException()
   state = NOT_READY
   if(nextItem == null)
     throw new IllegalStateException("Expected item but none found.")
   nextItem
 }
 // get the next element (overridden in the subclass ConsumerIterator)
 override def next(): MessageAndMetadata[K, V] = {
   val item = super.next() // call the parent-class implementation
   if(consumedOffset < 0)
     throw new KafkaException("Offset returned by the message set is invalid %d".format(consumedOffset))
   currentTopicInfo.resetConsumeOffset(consumedOffset)
   val topic = currentTopicInfo.topic
   trace("Setting %s consumed offset to %d".format(topic, consumedOffset))
   consumerTopicStats.getConsumerTopicStats(topic).messageRate.mark()
   consumerTopicStats.getConsumerAllTopicStats().messageRate.mark()
   item
 }
 // there may be a next element: try to compute it
 def maybeComputeNext(): Boolean = {
   state = FAILED
   nextItem = makeNext()
   if(state == DONE) {
     false
   } else {
     state = READY
     true
   }
 }
 // create the next element; implemented in the subclass ConsumerIterator
 protected def makeNext(): MessageAndMetadata[K, V] = {
   // channel is a LinkedBlockingQueue instance: it is the queue member variable of KafkaStream
   var currentDataChunk: FetchedDataChunk = null
   // if we don't have an iterator, get one
   var localCurrent = current.get() 
   // if there is no iterator yet, or it has no more elements, take a chunk from channel
   if(localCurrent == null || !localCurrent.hasNext) {
     // remove and return the head of the queue
     if (consumerTimeoutMs < 0)
       currentDataChunk = channel.take // blocking call: waits until an element is available
     else {
       currentDataChunk = channel.poll(consumerTimeoutMs,  TimeUnit.MILLISECONDS) // blocking call with a timeout: returns null if the wait times out
       if (currentDataChunk == null) { // no data: reset the state to NOT_READY
         // reset state to make the iterator re-iterable
         resetState()
         throw new ConsumerTimeoutException
       }
     }
     // shutdown command
     if(currentDataChunk eq ZookeeperConsumerConnector.shutdownCommand) {
       debug("Received the shutdown command")
       return allDone // sets the state to DONE and returns null
     } else {
       currentTopicInfo = currentDataChunk.topicInfo
       val cdcFetchOffset = currentDataChunk.fetchOffset
       val ctiConsumeOffset = currentTopicInfo.getConsumeOffset
       if (ctiConsumeOffset < cdcFetchOffset) {
         error("consumed offset: %d doesn't match fetch offset: %d for %s;\n Consumer may lose data"
           .format(ctiConsumeOffset, cdcFetchOffset, currentTopicInfo))
         currentTopicInfo.resetConsumeOffset(cdcFetchOffset)
       }
       localCurrent = currentDataChunk.messages.iterator
 
       current.set(localCurrent)
     }
     // if we just updated the current chunk and it is empty that means the fetch size is too small!
     if(currentDataChunk.messages.validBytes == 0)
       throw new MessageSizeTooLargeException("Found a message larger than the maximum fetch size of this consumer on topic " +
                                              "%s partition %d at fetch offset %d. Increase the fetch size, or decrease the maximum message size the broker will allow."
                                              .format(currentDataChunk.topicInfo.topic, currentDataChunk.topicInfo.partitionId, currentDataChunk.fetchOffset))
   }
   var item = localCurrent.next()
   // reject the messages that have already been consumed
   while (item.offset < currentTopicInfo.getConsumeOffset && localCurrent.hasNext) {
     item = localCurrent.next()
   }
   consumedOffset = item.nextOffset
 
   item.message.ensureValid() // validate checksum of message to ensure it is valid
   // return the Kafka message wrapped together with its metadata
   new MessageAndMetadata(currentTopicInfo.topic, currentTopicInfo.partitionId, item.message, item.offset, keyDecoder, valueDecoder)
 }
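
The interplay of hasNext, next, maybeComputeNext and makeNext is a small state machine over a blocking queue. The following is a simplified sketch of that idea, not Kafka's implementation: Chunk and ChunkIterator are hypothetical names, the sentinel chunk plays the role of shutdownCommand, and an empty chunk is simply skipped here, whereas the listing above raises MessageSizeTooLargeException.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit, TimeoutException}

object BlockingIteratorSketch {
  sealed trait State
  case object Done extends State
  case object Ready extends State
  case object NotReady extends State
  case object Failed extends State

  // Hypothetical stand-in for FetchedDataChunk: a batch of items.
  final class Chunk[A](val items: List[A])

  // Iterates over chunks taken from a blocking queue. `shutdown` is a sentinel
  // chunk, compared by reference, that ends the iteration.
  final class ChunkIterator[A](
      channel: LinkedBlockingQueue[Chunk[A]],
      shutdown: Chunk[A],
      timeoutMs: Long) extends Iterator[A] {

    private var state: State = NotReady
    private var nextItem: Option[A] = None
    private var current: Iterator[A] = Iterator.empty

    override def hasNext: Boolean = state match {
      case Failed => throw new IllegalStateException("Iterator is in failed state")
      case Done => false
      case Ready => true
      case NotReady => maybeComputeNext()
    }

    override def next(): A = {
      if (!hasNext) throw new NoSuchElementException
      state = NotReady
      nextItem.getOrElse(throw new IllegalStateException("Expected item but none found."))
    }

    private def maybeComputeNext(): Boolean = {
      state = Failed // remains Failed if makeNext throws unexpectedly
      nextItem = makeNext()
      if (state == Done) false else { state = Ready; true }
    }

    private def makeNext(): Option[A] = {
      while (!current.hasNext) {
        val chunk =
          if (timeoutMs < 0) channel.take() // wait indefinitely
          else channel.poll(timeoutMs, TimeUnit.MILLISECONDS) // null on timeout
        if (chunk == null) {
          state = NotReady // keep the iterator usable after a timeout
          throw new TimeoutException("no chunk arrived within the timeout")
        }
        if (chunk eq shutdown) { state = Done; return None }
        current = chunk.items.iterator
      }
      Some(current.next())
    }
  }
}
```

Setting the state to Failed before computing the next element means that an unexpected exception leaves the iterator permanently failed, while the timeout path deliberately resets it to NotReady so the caller can retry.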

2. Caching the consumed data in the WAL

Next, look at the code for step 10 of org.apache.spark.streaming.kafka.ReliableKafkaReceiver#onStart:

   // 10. Submit the pending MessageHandlers to the thread pool for execution
   topicMessageStreams.values.foreach { streams =>
     streams.foreach { stream =>
       messageHandlerThreadPool.submit(new MessageHandler(stream))
     }
   }

MessageHandler is a Runnable whose run method is as follows:

 override def run(): Unit = {
   while (!isStopped) {
     try {
       // 1. Get the ConsumerIterator
       val streamIterator = stream.iterator()
       // 2. Iterate over the messages, storing each message with its metadata
       while (streamIterator.hasNext) {
         storeMessageAndMetadata(streamIterator.next)
       }
     } catch {
       case e: Exception =>
         reportError("Error handling message", e)
     }
   }
 }

The key method in step 2, org.apache.spark.streaming.kafka.ReliableKafkaReceiver#storeMessageAndMetadata, is as follows:

 /** Store a Kafka message and the associated metadata as a tuple. */
 private def storeMessageAndMetadata(
     msgAndMetadata: MessageAndMetadata[K, V]): Unit = {
   val topicAndPartition = TopicAndPartition(msgAndMetadata.topic, msgAndMetadata.partition)
   val data = (msgAndMetadata.key, msgAndMetadata.message)
   val metadata = (topicAndPartition, msgAndMetadata.offset)
   // add the data to the block
   blockGenerator.addDataWithCallback(data, metadata)
 }

The source of addDataWithCallback is as follows:

 /**
  * Push a single data item into the buffer. After buffering the data, the
  * `BlockGeneratorListener.onAddData` callback will be called.
  */
 def addDataWithCallback(data: Any, metadata: Any): Unit = {
   if (state == Active) {
     waitToPush()
     synchronized {
       if (state == Active) {
         // 1. Put the data into the buffer so that the processing thread can pick it up
         currentBuffer += data
         // 2. As seen when the receiver thread was started, listener is the GeneratedBlockHandler instance
         listener.onAddData(data, metadata)
       } else {
         throw new SparkException(
           "Cannot add data as BlockGenerator has not been started or has been stopped")
       }
     }
   } else {
     throw new SparkException(
       "Cannot add data as BlockGenerator has not been started or has been stopped")
   }
 }

Step 2 is the simpler one, so look at it first. The source of org.apache.spark.streaming.kafka.ReliableKafkaReceiver.GeneratedBlockHandler#onAddData is as follows:

 def onAddData(data: Any, metadata: Any): Unit = {
   // Update the offset of the data that was added to the generator
   if (metadata != null) {
     val (topicAndPartition, offset) = metadata.asInstanceOf[(TopicAndPartition, Long)]
     updateOffset(topicAndPartition, offset)
   }
 }
 // The updateOffset called here is org.apache.spark.streaming.kafka.ReliableKafkaReceiver#updateOffset:
 /** Update stored offset */
 private def updateOffset(topicAndPartition: TopicAndPartition, offset: Long): Unit = {
   topicPartitionOffsetMap.put(topicAndPartition, offset)
 }

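Step 1 works as follows. BlockGenerator has a timer that periodically (every 200 ms) checks whether currentBuffer is empty. If it is not, the timer swaps the buffer out, wraps it in a Block, and puts that on the queue of blocks waiting to be pushed. A second thread watches this queue continuously and calls pushBlock whenever data is available. The timer is defined as follows: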

 private val blockIntervalTimer =
   new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
 
 // updateCurrentBuffer is as follows
 /** Change the buffer to which single records are added to. */
 private def updateCurrentBuffer(time: Long): Unit = {
   try {
     var newBlock: Block = null
     synchronized {
       if (currentBuffer.nonEmpty) {
         val newBlockBuffer = currentBuffer
         currentBuffer = new ArrayBuffer[Any]
         val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
         listener.onGenerateBlock(blockId)
         newBlock = new Block(blockId, newBlockBuffer)
       }
     }
 
     if (newBlock != null) {
       blocksForPushing.put(newBlock)  // put is blocking when queue is full
     }
   } catch {
     case ie: InterruptedException =>
       logInfo("Block updating timer thread was interrupted")
     case e: Exception =>
       reportError("Error in block updating thread", e)
   }
 }
 
 // listener.onGenerateBlock(blockId) is as follows:
 def onGenerateBlock(blockId: StreamBlockId): Unit = {
   // Remember the offsets of topics/partitions when a block has been generated
   rememberBlockOffsets(blockId)
 }
 // rememberBlockOffsets is as follows:
 /**
  * Remember the current offsets for each topic and partition. This is called when a block is
  * generated.
  */
 private def rememberBlockOffsets(blockId: StreamBlockId): Unit = {
   // Get a snapshot of current offset map and store with related block id.
   val offsetSnapshot = topicPartitionOffsetMap.toMap
   blockOffsetMap.put(blockId, offsetSnapshot)
   topicPartitionOffsetMap.clear()
 }
 // As shown, the method stores a snapshot of the topic-partition -> offset map under the block id
 // (the block -> offsets mapping) and then clears the topic-partition -> offset map.
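
The buffer swap and the offset snapshot happen under the same lock, which is what ties a block to the offsets it contains. The sketch below illustrates that relationship under simplifying assumptions: records are plain strings, block ids are Long values, and add, cutBlock and offsetsOf are hypothetical names, not Spark APIs.

```scala
import scala.collection.mutable

object BlockOffsetSketch {
  // Hypothetical stand-in for Kafka's TopicAndPartition.
  final case class TopicPartition(topic: String, partition: Int)

  private var currentBuffer = mutable.ArrayBuffer.empty[String]
  private val latestOffsets = mutable.HashMap.empty[TopicPartition, Long]
  private val blockOffsets = mutable.HashMap.empty[Long, Map[TopicPartition, Long]]

  // Called for every record: buffer it and remember the latest offset per partition.
  def add(record: String, tp: TopicPartition, offset: Long): Unit = synchronized {
    currentBuffer += record
    latestOffsets.put(tp, offset)
  }

  // Called by the timer: swap the buffer out and freeze the offsets that belong to it.
  def cutBlock(blockId: Long): Option[Seq[String]] = synchronized {
    if (currentBuffer.isEmpty) None
    else {
      val block = currentBuffer
      currentBuffer = mutable.ArrayBuffer.empty[String]
      blockOffsets.put(blockId, latestOffsets.toMap) // immutable snapshot
      latestOffsets.clear()
      Some(block.toList)
    }
  }

  // Offsets recorded for a block, to be committed once the block is stored.
  def offsetsOf(blockId: Long): Option[Map[TopicPartition, Long]] = synchronized {
    blockOffsets.get(blockId)
  }
}
```

Because both steps share one lock, a record added after the swap cannot leak its offset into the snapshot of the previous block.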

blocksForPushing is a bounded blocking queue that another thread polls continuously.

  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
 private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
 
 /** Keep pushing blocks to the BlockManager. */
 // This method keeps polling the blocksForPushing queue and handles each block that is ready to be pushed.
 private def keepPushingBlocks() {
   logInfo("Started block pushing thread")
 
   def areBlocksBeingGenerated: Boolean = synchronized {
     state != StoppedGeneratingBlocks
   }
 
   try {
     // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
     while (areBlocksBeingGenerated) { // keep looping until block generation is stopped
       // timed poll: remove and return the head, or give up after 10 ms
       Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
         case Some(block) => pushBlock(block) // process the block if there is one
         case None =>
       }
     }
 
     // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
     logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
     while (!blocksForPushing.isEmpty) { // keep processing while the queue still holds data
       val block = blocksForPushing.take() // a blocking call, but it returns immediately because the queue is not empty
       logDebug(s"Pushing block $block")
       pushBlock(block) // process the block
       logInfo("Blocks left to push " + blocksForPushing.size())
     }
     logInfo("Stopped block pushing thread")
   } catch {
     case ie: InterruptedException =>
       logInfo("Block pushing thread was interrupted")
     case e: Exception =>
       reportError("Error in block pushing thread", e)
   }
 }
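
The two phases of keepPushingBlocks, a timed poll while blocks are still being generated and a final drain afterwards, can be sketched independently of Spark. In this illustration keepPushing, the generating flag and the push callback are hypothetical names; the 10 ms poll interval is taken from the listing above.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object PushLoopSketch {
  // Polls `queue` while `generating` is true, then drains whatever is left.
  // `push` is a hypothetical callback standing in for pushBlock.
  def keepPushing[A <: AnyRef](
      queue: ArrayBlockingQueue[A],
      generating: AtomicBoolean)(push: A => Unit): Unit = {
    while (generating.get) {
      // A timed poll lets the loop re-check the flag at least every 10 ms.
      Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(push)
    }
    // No producer is left at this point, so a non-blocking poll is enough to drain.
    var block = queue.poll()
    while (block != null) {
      push(block)
      block = queue.poll()
    }
  }
}
```

A timed poll is used rather than take() in the first phase so that the thread notices the stop flag even when no block ever arrives.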

The source of pushBlock is as follows:

 private def pushBlock(block: Block) {
   listener.onPushBlock(block.id, block.buffer)
   logInfo("Pushed block " + block.id)
 }

The onPushBlock method of the listener it calls (org.apache.spark.streaming.kafka.ReliableKafkaReceiver.GeneratedBlockHandler) is as follows:

 def onPushBlock(blockId: StreamBlockId, arrayBuffer: mutable.ArrayBuffer[_]): Unit = {
   // Store block and commit the blocks offset
   storeBlockAndCommitOffset(blockId, arrayBuffer)
 }

The code of storeBlockAndCommitOffset is as follows:

 /**
  * Store the ready-to-be-stored block and commit the related offsets to zookeeper. This method
  * will try a fixed number of times to push the block. If the push fails, the receiver is stopped.
  */
 private def storeBlockAndCommitOffset(
     blockId: StreamBlockId, arrayBuffer: mutable.ArrayBuffer[_]): Unit = {
   var count = 0
   var pushed = false
   var exception: Exception = null
   while (!pushed && count <= 3) { // up to 4 attempts in total: the first try plus 3 retries
     try {
       store(arrayBuffer.asInstanceOf[mutable.ArrayBuffer[(K, V)]])
       pushed = true
     } catch {
       case ex: Exception =>
         count += 1
         exception = ex
     }
   }
   if (pushed) { // the block has been pushed
     // commit the offsets
     Option(blockOffsetMap.get(blockId)).foreach(commitOffset)
     // once the block is in the BlockManager, the block -> (topic-partition -> offset) mapping is no longer kept
     blockOffsetMap.remove(blockId)
   } else {
     stop("Error while storing block into Spark", exception)
   }
 }
 // The source of commitOffset is as follows:
 /**
  * Commit the offset of Kafka's topic/partition, the commit mechanism follow Kafka 0.8.x's
  * metadata schema in Zookeeper.
  */
 private def commitOffset(offsetMap: Map[TopicAndPartition, Long]): Unit = {
   if (zkClient == null) {
     val thrown = new IllegalStateException("Zookeeper client is unexpectedly null")
     stop("Zookeeper client is not initialized before commit offsets to ZK", thrown)
     return
   }
 
   for ((topicAndPart, offset) <- offsetMap) {
     try {
       // get the ZooKeeper directory that holds this consumer group's offsets for the topic
       val topicDirs = new ZKGroupTopicDirs(groupId, topicAndPart.topic)
       val zkPath = s"${topicDirs.consumerOffsetDir}/${topicAndPart.partition}"
       // update the consumed offset of this topic-partition for the consumer
       ZkUtils.updatePersistentPath(zkClient, zkPath, offset.toString)
     } catch {
       case e: Exception =>
         logWarning(s"Exception during commit offset $offset for topic" +
           s"${topicAndPart.topic}, partition ${topicAndPart.partition}", e)
     }
 
     logInfo(s"Committed offset $offset for topic ${topicAndPart.topic}, " +
       s"partition ${topicAndPart.partition}")
   }
 }
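
The loop condition count <= 3 in storeBlockAndCommitOffset is easy to misread, so here is a small sketch of the same bounded-retry shape that also reports how many attempts were made. The helper withRetries is a hypothetical name used for illustration; it is not part of Spark.

```scala
import scala.util.{Failure, Try}

object RetrySketch {
  // Runs `op` until it succeeds or `maxRetries` retries have been used up.
  // Returns the last result together with the number of attempts made.
  def withRetries[A](maxRetries: Int)(op: () => A): (Try[A], Int) = {
    var attempts = 0
    var result: Try[A] = Failure(new IllegalStateException("not attempted"))
    while (result.isFailure && attempts <= maxRetries) {
      attempts += 1
      result = Try(op()) // Try captures non-fatal exceptions as Failure
    }
    (result, attempts)
  }
}
```

With maxRetries set to 3, an operation that always fails is attempted four times before the loop gives up, which matches the behaviour of the listing above: one initial attempt followed by three retries.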

The key method store is as follows:

 /** Store an ArrayBuffer of received data as a data block into Spark's memory. */
 def store(dataBuffer: ArrayBuffer[T]) {
   supervisor.pushArrayBuffer(dataBuffer, None, None)
 }

It calls the pushArrayBuffer method of supervisor (an org.apache.spark.streaming.receiver.ReceiverSupervisorImpl instance), which does the following:

 /** Store an ArrayBuffer of received data as a data block into Spark's memory. */
 def pushArrayBuffer(
     arrayBuffer: ArrayBuffer[_],
     metadataOption: Option[Any],
     blockIdOption: Option[StreamBlockId]
   ) {
   pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
 }

The source of org.apache.spark.streaming.receiver.ReceiverSupervisorImpl#pushAndReportBlock is as follows:

 /** Store block and report it to driver */
 def pushAndReportBlock(
     receivedBlock: ReceivedBlock,
     metadataOption: Option[Any],
     blockIdOption: Option[StreamBlockId]
   ) {
   // 1. Prepare the blockId, the time and related information
   val blockId = blockIdOption.getOrElse(nextBlockId)
   val time = System.currentTimeMillis
   // 2. Store the block
   val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
   logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
   // 3. Get the number of message records stored
   val numRecords = blockStoreResult.numRecords
   // 4. Notify trackerEndpoint that a block has been added, which updates the driver's WAL
   val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
   trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
   logDebug(s"Reported block $blockId")
 }

receivedBlockHandler is assigned as follows:

  private val receivedBlockHandler: ReceivedBlockHandler = {
   if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
     if (checkpointDirOption.isEmpty) {
       throw new SparkException(
         "Cannot enable receiver write-ahead log without checkpoint directory set. " +
           "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
           "See documentation for more details.")
     }
     // The WAL is enabled and the checkpoint dir is set, so a WriteAheadLogBasedBlockHandler is returned here. It holds the BlockManager, streamId, storage level, conf, checkpoint dir and related information.
     new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
       receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
   } else {
     new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
   }
 }

The source of the storeBlock method of the ReceivedBlockHandler returned in this case, WriteAheadLogBasedBlockHandler, is as follows:

 /**
  * This implementation stores the block into the block manager as well as a write ahead log.
  * It does this in parallel, using Scala Futures, and returns only after the block has
  * been stored in both places.
  */
 // Stores the block in the BlockManager and in the write ahead log in parallel, using Scala Futures, and returns once both writes have completed.
 def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
 
   var numRecords = None: Option[Long]
   // Serialize the block so that it can be inserted into both
   // 1. Serialize the ReceivedBlock into bytes (no compression)
   val serializedBlock = block match { // serializedBlock holds the serialized result
     case ArrayBufferBlock(arrayBuffer) => // this is the branch taken here
       numRecords = Some(arrayBuffer.size.toLong)
       blockManager.dataSerialize(blockId, arrayBuffer.iterator)
     case IteratorBlock(iterator) =>
       val countIterator = new CountingIterator(iterator)
       val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
       numRecords = countIterator.count
       serializedBlock
     case ByteBufferBlock(byteBuffer) =>
       byteBuffer
     case _ =>
       throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
   }
 
   // 2. Store the block in block manager
   val storeInBlockManagerFuture = Future {
     val putResult =
       blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
     if (!putResult.map { _._1 }.contains(blockId)) {
       throw new SparkException(
         s"Could not store $blockId to block manager with storage level $storageLevel")
     }
   }
 
   // 3. Store the block in write ahead log
   val storeInWriteAheadLogFuture = Future {
     writeAheadLog.write(serializedBlock, clock.getTimeMillis())
   }
 
   // 4. Combine the futures, wait for both to complete, and return the write ahead log record handle
   val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
   // Wait for the futures to complete. The default timeout is 30 s; set spark.streaming.receiver.blockStoreTimeout to change it.
   val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
   // return the information about the stored block
   WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
 }
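
The core of storeBlock is the combination of two futures with zip followed by a bounded wait. A minimal, self-contained sketch of just that pattern is shown below; storeBoth and its two callbacks are hypothetical names, and the handle type H stands in for the WAL record handle.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.FiniteDuration

object ParallelStoreSketch {
  // Runs both writes concurrently and returns the log handle only after both succeed.
  // `storeInMemory` and `writeToLog` are hypothetical callbacks.
  def storeBoth[H](
      storeInMemory: () => Unit,
      writeToLog: () => H,
      timeout: FiniteDuration)(implicit ec: ExecutionContext): H = {
    val memoryFuture = Future(storeInMemory())
    val logFuture = Future(writeToLog())
    // zip fails if either side fails; map keeps only the log handle.
    val combined = memoryFuture.zip(logFuture).map(_._2)
    // Blocks the caller; throws if a write failed or the timeout elapsed.
    Await.result(combined, timeout)
  }
}
```

Both futures are created before they are combined, so the two writes start concurrently; creating the second future inside a flatMap on the first would serialize them instead.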

3. Sending the WAL block information to the driver

Note the WriteAheadLogBasedStoreResult instance: it is used later when the RDD is processed. The code in org.apache.spark.streaming.receiver.ReceiverSupervisorImpl#pushAndReportBlock that notifies the driver with AddBlock is as follows:

   // 4. Notify trackerEndpoint that a block has been added, which updates the driver's WAL
   val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
   trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
   logDebug(s"Reported block $blockId")

4. The driver writes the WAL block information to its own WAL

Skipping the RPC in between, go straight to org.apache.spark.streaming.scheduler.ReceiverTracker.ReceiverTrackerEndpoint#receiveAndReply on the driver side:

 case AddBlock(receivedBlockInfo) =>
   if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
     walBatchingThreadPool.execute(new Runnable {
       override def run(): Unit = Utils.tryLogNonFatalError {
         if (active) {
           context.reply(addBlock(receivedBlockInfo))
         } else {
           throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
         }
       }
     })
   } else {
     context.reply(addBlock(receivedBlockInfo))
   }

The source of the addBlock method is as follows:

 /** Add new blocks for the given stream */
 private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
   receivedBlockTracker.addBlock(receivedBlockInfo)
 }

The source of org.apache.spark.streaming.scheduler.ReceivedBlockTracker#addBlock is as follows:

 /** Add received block. This event will get written to the write ahead log (if enabled). */
 def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
   try {
     val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
     if (writeResult) {
       synchronized {
         getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
       }
       logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
         s"block ${receivedBlockInfo.blockStoreResult.blockId}")
     } else {
       logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
         s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
     }
     writeResult
   } catch {
     case NonFatal(e) =>
       logError(s"Error adding block $receivedBlockInfo", e)
       false
   }
 }
 /** Write an update to the tracker to the write ahead log */
 private def writeToLog(record: ReceivedBlockTrackerLogEvent): Boolean = {
   if (isWriteAheadLogEnabled) {
     logTrace(s"Writing record: $record")
     try {
       writeAheadLogOption.get.write(ByteBuffer.wrap(Utils.serialize(record)),
         clock.getTimeMillis())
       true
     } catch {
       case NonFatal(e) =>
         logWarning(s"Exception thrown while writing record: $record to the WriteAheadLog.", e)
         false
     }
   } else {
     true
   }
 }
 /** Get the queue of received blocks belonging to a particular stream */
 private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
   streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
 }

The code above writes a BlockAdditionEvent to the WAL and updates the queues, which are held in a mutable.HashMap[Int, ReceivedBlockQueue] that maps each streamId to its unallocated blocks.
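
The per-stream queue of unallocated blocks can be illustrated with plain Scala collections. The sketch below leaves out the WAL write and uses hypothetical names (BlockInfo, allocateToBatch, blocksOfBatch) with a Long batch time; it shows how blocks wait in their stream's queue and how the whole queue is later moved to a batch.

```scala
import scala.collection.mutable

object BlockTrackerSketch {
  // Hypothetical stand-in for ReceivedBlockInfo.
  final case class BlockInfo(streamId: Int, blockId: String)

  private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]
  private val allocated = mutable.HashMap.empty[Long, Map[Int, Seq[BlockInfo]]]

  // Create the queue for a stream on first use.
  private def queueOf(streamId: Int): mutable.Queue[BlockInfo] =
    unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[BlockInfo])

  // Receiver side: a reported block waits in its stream's queue.
  def addBlock(info: BlockInfo): Unit = synchronized {
    queueOf(info.streamId) += info
  }

  // Batch side: move everything that is waiting into the given batch.
  def allocateToBatch(batchTime: Long, streamIds: Seq[Int]): Unit = synchronized {
    val perStream: Map[Int, Seq[BlockInfo]] =
      streamIds.map { id => id -> queueOf(id).dequeueAll(_ => true).toList }.toMap
    allocated.put(batchTime, perStream)
  }

  // Blocks allocated to a batch, or an empty map for an unknown batch.
  def blocksOfBatch(batchTime: Long): Map[Int, Seq[BlockInfo]] = synchronized {
    allocated.getOrElse(batchTime, Map.empty[Int, Seq[BlockInfo]])
  }
}
```

Draining with dequeueAll under the same lock as addBlock ensures that each block ends up in exactly one batch.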

5. Reading data from the WAL RDD

The source of createStream is as follows:

 /**
  * Create an input stream that pulls messages from Kafka Brokers.
  * @param ssc         StreamingContext object
  * @param kafkaParams Map of kafka configuration parameters,
  *                    see http://kafka.apache.org/08/configuration.html
  * @param topics      Map of (topic_name -> numPartitions) to consume. Each partition is consumed
  *                    in its own thread.
  * @param storageLevel Storage level to use for storing the received objects
  * @tparam K type of Kafka message key
  * @tparam V type of Kafka message value
  * @tparam U type of Kafka message key decoder
  * @tparam T type of Kafka message value decoder
  * @return DStream of (Kafka message key, Kafka message value)
  */
 def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
     ssc: StreamingContext,
     kafkaParams: Map[String, String],
     topics: Map[String, Int],
     storageLevel: StorageLevel
   ): ReceiverInputDStream[(K, V)] = {
   // The WAL can be enabled by setting spark.streaming.receiver.writeAheadLog.enable to true
   val walEnabled = WriteAheadLogUtils.enableReceiverLog(ssc.conf)
   new KafkaInputDStream[K, V, U, T](ssc, kafkaParams, topics, walEnabled, storageLevel)
 }

This creates a KafkaInputDStream object:

 /**
  * Input stream that pulls messages from a Kafka Broker.
  *
  * @param kafkaParams Map of kafka configuration parameters.
  *                    See: http://kafka.apache.org/configuration.html
  * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
  * in its own thread.
  * @param storageLevel RDD storage level.
  */
 private[streaming]
 class KafkaInputDStream[
   K: ClassTag,
   V: ClassTag,
   U <: Decoder[_]: ClassTag,
   T <: Decoder[_]: ClassTag](
     ssc_ : StreamingContext,
     kafkaParams: Map[String, String],
     topics: Map[String, Int],
     useReliableReceiver: Boolean,
     storageLevel: StorageLevel
   ) extends ReceiverInputDStream[(K, V)](ssc_) with Logging {
 
   def getReceiver(): Receiver[(K, V)] = {
     if (!useReliableReceiver) { // WAL not enabled: use a KafkaReceiver
       new KafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel)
     } else { // WAL enabled: use a ReliableKafkaReceiver
       new ReliableKafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel)
     }
   }
 }

org.apache.spark.streaming.kafka.KafkaInputDStream inherits the compute method from its parent class:

  /**
  * Generates RDDs with blocks received by the receiver of this stream. */
 override def compute(validTime: Time): Option[RDD[T]] = {
   val blockRDD = {
 
     if (validTime < graph.startTime) {
       // If this is called for any time before the start time of the context,
       // then this returns an empty RDD. This may happen when recovering from a
       // driver failure without any write ahead log to recover pre-failure data.
       new BlockRDD[T](ssc.sc, Array.empty)
     } else {
       // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
       // for this batch
       val receiverTracker = ssc.scheduler.receiverTracker
       val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
 
       // Register the input blocks information into InputInfoTracker
       val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
       ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
 
       // Create the BlockRDD
       createBlockRDD(validTime, blockInfos)
     }
   }
   Some(blockRDD)
 }

getBlocksOfBatch is as follows:

 /** Get the blocks for the given batch and all input streams. */
 def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
   receivedBlockTracker.getBlocksOfBatch(batchTime)
 }
 // which calls:
 /** Get the blocks allocated to the given batch. */
 def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
   timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
 }

6. JobGenerator allocates the WAL blocks to a batch and generates jobs

1. Retrieving the WAL block information

org.apache.spark.streaming.scheduler.JobGenerator declares a timer:

 // the timer creates a GenerateJobs event every batch interval and posts it to the eventLoop blocking queue
 private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
   longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

The EventLoop is instantiated as follows:

 eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
   override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
 
   override protected def onError(e: Throwable): Unit = {
     jobScheduler.reportError("Error in job generator", e)
   }
 }
 eventLoop.start()

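EventLoop defines a LinkedBlockingDeque (a double-ended blocking queue) and a daemon thread. The daemon thread keeps taking events from the deque in a blocking fashion. Whenever it gets one, it calls onReceive, which here is the processEvent method: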

 /** Processes all events */
 private def processEvent(event: JobGeneratorEvent) {
   logDebug("Got event " + event)
   event match {
     case GenerateJobs(time) => generateJobs(time)
     case ClearMetadata(time) => clearMetadata(time)
     case DoCheckpoint(time, clearCheckpointDataLater) =>
       doCheckpoint(time, clearCheckpointDataLater)
     case ClearCheckpointData(time) => clearCheckpointData(time)
   }
 }
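
The event-loop structure described above, a blocking deque drained by a daemon thread, fits in a few lines. This is a minimal sketch under simplifying assumptions, not Spark's EventLoop class: SimpleEventLoop is a hypothetical name, and there is no onError hook, so an exception thrown by onReceive would end the worker thread.

```scala
import java.util.concurrent.LinkedBlockingDeque

object EventLoopSketch {
  // A minimal event loop: `post` enqueues, a daemon thread dequeues and calls `onReceive`.
  final class SimpleEventLoop[E](name: String)(onReceive: E => Unit) {
    private val queue = new LinkedBlockingDeque[E]()
    @volatile private var stopped = false

    private val worker = new Thread(name) {
      override def run(): Unit = {
        try {
          while (!stopped) onReceive(queue.take()) // blocks until an event arrives
        } catch {
          case _: InterruptedException => // stop() interrupts a blocked take()
        }
      }
    }
    worker.setDaemon(true) // do not keep the JVM alive for this thread

    def start(): Unit = worker.start()
    def post(event: E): Unit = queue.put(event)
    def stop(): Unit = { stopped = true; worker.interrupt() }
  }
}
```

Because a single thread consumes the deque, events are handled one at a time in the order they were posted, which is why a GenerateJobs event posted by the timer is processed before a DoCheckpoint event posted after it.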

Since the event is GenerateJobs, the generateJobs method is called next:

 /** Generate jobs and perform checkpoint for the given `time`.  */
 private def generateJobs(time: Time) {
   // Set the SparkEnv in this thread, so that job generation code can access the environment
   // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
   // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
   SparkEnv.set(ssc.env)
   Try {
     // 1. Allocate the WAL block info to this batch (the worker nodes send this block info
     //    to the driver after caching the data in the WAL)
     jobScheduler.receiverTracker.allocateBlocksToBatch(time)
     // 2. Generate jobs from the allocated blocks
     graph.generateJobs(time) // generate jobs using allocated block
   } match {
     case Success(jobs) =>
       val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
       jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
     case Failure(e) =>
       jobScheduler.reportError("Error generating jobs for time " + time, e)
   }
   // Post a DoCheckpoint event to save a checkpoint: write the new checkpoint data to HDFS and delete the old checkpoint data
   eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
 }

The org.apache.spark.streaming.scheduler.ReceiverTracker#allocateBlocksToBatch method called in step 1 is:

 /** Allocate all unallocated blocks to the given batch. */
 def allocateBlocksToBatch(batchTime: Time): Unit = {
   if (receiverInputStreams.nonEmpty) {
     receivedBlockTracker.allocateBlocksToBatch(batchTime)
   }
 }

It delegates to org.apache.spark.streaming.scheduler.ReceivedBlockTracker#allocateBlocksToBatch:

 def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
   if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
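     // For each input stream, drain the queue of unallocated blocks for its streamId, yielding
     // (streamId, Seq[ReceivedBlockInfo]) pairs. By this point the data has in fact already
     // been read by the receivers.
     // Build the mapping from streamId to its WAL blocks
     val streamIdToBlocks = streamIds.map { streamId =>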
         (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
     }.toMap
     val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
     if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
       timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
       lastAllocatedBatchTime = batchTime
     } else {
       logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
     }
   } else {
     // This situation occurs when:
     // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
     // possibly processed batch job or half-processed batch job need to be processed again,
     // so the batchTime will be equal to lastAllocatedBatchTime.
     // 2. Slow checkpointing makes recovered batch time older than WAL recovered
     // lastAllocatedBatchTime.
     // This situation will only occurs in recovery time.
     logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
   }
 }

Here getReceivedBlockQueue is:

 /** Get the queue of received blocks belonging to a particular stream */
 private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
   streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
 }

As shown, the block information sent by the worker nodes has now been taken out of the per-stream queues and assigned to the batch.
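
The allocation logic above can be condensed into a small, self-contained model. This is a sketch under simplifying assumptions, not the Spark class: MiniBlockTracker is a hypothetical name, block info is reduced to a String id, batch time to a Long, and the write-ahead-log step (writeToLog) is left out.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of ReceivedBlockTracker (no WAL write, String block ids).
class MiniBlockTracker(streamIds: Seq[Int]) {
  private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[String]]
  private val timeToAllocatedBlocks = mutable.HashMap.empty[Long, Map[Int, Seq[String]]]
  private var lastAllocatedBatchTime: Option[Long] = None

  private def queueFor(streamId: Int): mutable.Queue[String] =
    unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[String])

  /** Called when a receiver reports a newly stored block for a stream. */
  def addBlock(streamId: Int, blockId: String): Unit = synchronized {
    queueFor(streamId).enqueue(blockId)
  }

  /** Moves every unallocated block into the given batch; returns false for a stale batch time. */
  def allocateBlocksToBatch(batchTime: Long): Boolean = synchronized {
    if (lastAllocatedBatchTime.forall(batchTime > _)) {
      val streamIdToBlocks: Map[Int, Seq[String]] = streamIds.map { id =>
        id -> queueFor(id).dequeueAll(_ => true).toList // drain the whole queue
      }.toMap
      timeToAllocatedBlocks.put(batchTime, streamIdToBlocks)
      lastAllocatedBatchTime = Some(batchTime)
      true
    } else {
      false
    }
  }

  /** Returns the blocks allocated to the given batch, or an empty map if there are none. */
  def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[String]] = synchronized {
    timeToAllocatedBlocks.getOrElse(batchTime, Map.empty[Int, Seq[String]])
  }
}
```

A usage example: after `addBlock(0, "a")` and `addBlock(0, "b")`, a call to `allocateBlocksToBatch(1000L)` assigns both blocks to the batch at time 1000, and blocks that arrive later wait in the queue until the next batch time is allocated. A repeated call with the same batch time allocates nothing, which corresponds to the else branch in the Spark code.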

2. Creating an RDD from the WAL blocks

The source of org.apache.spark.streaming.dstream.ReceiverInputDStream#createBlockRDD is:

 private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {
 
   if (blockInfos.nonEmpty) {
     val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray
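     // If every block already has a WriteAheadLogRecordHandle, create a WriteAheadLogBackedBlockRDD;
     // otherwise create a BlockRDD.
     // A WriteAheadLogRecordHandle is an entry descriptor tied to the WAL. Its implementation
     // FileBasedWriteAheadLogSegment holds the path, offset and length of the WAL segment. When
     // the RDD actually needs the data, it reads it from the WAL using these handles.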
     // Are WAL record handles present with all the blocks
     val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }
 
     if (areWALRecordHandlesPresent) {
       // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
       val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
       val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
       new WriteAheadLogBackedBlockRDD[T](
         ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
     } else {
       // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
       // others then that is unexpected and log a warning accordingly.
       if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
         if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
           logError("Some blocks do not have Write Ahead Log information; " +
             "this is unexpected and data may not be recoverable after driver failures")
         } else {
           logWarning("Some blocks have Write Ahead Log information; this is unexpected")
         }
       }
       val validBlockIds = blockIds.filter { id =>
         ssc.sparkContext.env.blockManager.master.contains(id)
       }
       if (validBlockIds.size != blockIds.size) {
         logWarning("Some blocks could not be recovered as they were not found in memory. " +
           "To prevent such data loss, enabled Write Ahead Log (see programming guide " +
           "for more details.")
       }
       new BlockRDD[T](ssc.sc, validBlockIds)
     }
   } else {
     // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
     // according to the configuration
     if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
       new WriteAheadLogBackedBlockRDD[T](
         ssc.sparkContext, Array.empty, Array.empty, Array.empty)
     } else {
       new BlockRDD[T](ssc.sc, Array.empty)
     }
   }
 }

The source of org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD#compute is:

  /**
  * Gets the partition data by getting the corresponding block from the block manager.
  * If the block does not exist, then the data is read from the corresponding record
  * in write ahead log files.
  */
 override def compute(split: Partition, context: TaskContext): Iterator[T] = {
   assertValid()
   val hadoopConf = broadcastedHadoopConf.value
   val blockManager = SparkEnv.get.blockManager
   val partition = split.asInstanceOf[WriteAheadLogBackedBlockRDDPartition]
   val blockId = partition.blockId
 
   def getBlockFromBlockManager(): Option[Iterator[T]] = {
     blockManager.get(blockId).map(_.data.asInstanceOf[Iterator[T]])
   }
 
   def getBlockFromWriteAheadLog(): Iterator[T] = {
     var dataRead: ByteBuffer = null
     var writeAheadLog: WriteAheadLog = null
     try {
       // The WriteAheadLogUtils.createLog*** method needs a directory to create a
       // WriteAheadLog object as the default FileBasedWriteAheadLog needs a directory for
       // writing log data. However, the directory is not needed if data needs to be read, hence
       // a dummy path is provided to satisfy the method parameter requirements.
       // FileBasedWriteAheadLog will not create any file or directory at that path.
       // FileBasedWriteAheadLog will not create any file or directory at that path. Also,
       // this dummy directory should not already exist otherwise the WAL will try to recover
       // past events from the directory and throw errors.
       val nonExistentDirectory = new File(
         System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString).getAbsolutePath
       writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
         SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
       dataRead = writeAheadLog.read(partition.walRecordHandle)
     } catch {
       case NonFatal(e) =>
         throw new SparkException(
           s"Could not read data from write ahead log record ${partition.walRecordHandle}", e)
     } finally {
       if (writeAheadLog != null) {
         writeAheadLog.close()
         writeAheadLog = null
       }
     }
     if (dataRead == null) {
       throw new SparkException(
         s"Could not read data from write ahead log record ${partition.walRecordHandle}, " +
           s"read returned null")
     }
     logInfo(s"Read partition data of $this from write ahead log, record handle " +
       partition.walRecordHandle)
     if (storeInBlockManager) {
       blockManager.putBytes(blockId, dataRead, storageLevel)
       logDebug(s"Stored partition data of $this into block manager with level $storageLevel")
       dataRead.rewind()
     }
     blockManager.dataDeserialize(blockId, dataRead).asInstanceOf[Iterator[T]]
   }
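   // If partition.isBlockIdValid is true, the block data is expected to be held by the executors
   if (partition.isBlockIdValid) {
     // First try to read the data from the executors through the BlockManager; if it is not
     // there, read it from the WAL.
     // Does the BlockManager read from memory or from disk? It gets the block from a local or a
     // remote store. A local read can come from memory or from disk; a remote fetch is
     // synchronous, i.e. the call blocks until the block has been fetched.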
     getBlockFromBlockManager().getOrElse { getBlockFromWriteAheadLog() }
   } else {
     getBlockFromWriteAheadLog()
   }
 }
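
The read path of compute, "block manager first, WAL as the fallback", can be illustrated with a small in-memory model. All types below (MiniBlockStore, MiniWal, WalHandle, WalPartition, WalBackedRead) are hypothetical stand-ins for the BlockManager, the write-ahead log and the RDD partition, with records reduced to strings.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for the BlockManager.
final class MiniBlockStore {
  private val blocks = mutable.HashMap.empty[String, Vector[String]]
  def get(id: String): Option[Vector[String]] = blocks.get(id)
  def put(id: String, data: Vector[String]): Unit = { blocks.put(id, data); () }
  def remove(id: String): Unit = { blocks.remove(id); () }
}

// Hypothetical counterpart of a WAL record handle: where the record lives in the log.
final case class WalHandle(path: String, offset: Long, length: Int)

// Hypothetical stand-in for the write-ahead log, keyed by record handle.
final class MiniWal(records: Map[WalHandle, Vector[String]]) {
  def read(handle: WalHandle): Vector[String] =
    records.getOrElse(handle,
      throw new IllegalStateException(s"Could not read data from write ahead log record $handle"))
}

final case class WalPartition(blockId: String, isBlockIdValid: Boolean, handle: WalHandle)

object WalBackedRead {
  def compute(
      partition: WalPartition,
      store: MiniBlockStore,
      wal: MiniWal,
      storeInBlockManager: Boolean): Iterator[String] = {

    def fromWal(): Vector[String] = {
      val data = wal.read(partition.handle)
      // Optionally cache the recovered block so later reads skip the WAL
      if (storeInBlockManager) store.put(partition.blockId, data)
      data
    }

    val data =
      if (partition.isBlockIdValid) store.get(partition.blockId).getOrElse(fromWal())
      else fromWal()
    data.iterator
  }
}
```

The `getOrElse(fromWal())` call takes its argument by name, so the WAL is read only when the block store has no copy, which is the same lazy fallback the Spark code expresses with `getBlockFromBlockManager().getOrElse { getBlockFromWriteAheadLog() }`.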

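This completes the walkthrough: starting the receiver, the receiver receiving data and saving it as WAL blocks, the driver receiving the WAL block information, and finally Spark Streaming reading the data through the WAL-backed RDD.

V. Spark Streaming receiving Kafka messages, part 5: summary of the Spark Streaming and Kafka integration

Summary of how Spark Streaming and Kafka ensure that messages are not lost

1. Connecting to Kafka

Parts 1 to 4 above all dealt with how Spark Streaming consumes Kafka messages: the two integration approaches, and how Spark Streaming cooperates with Kafka to receive data and turn it into RDDs.

There are two main approaches:

1. Distributed receivers

The receiver-based approach uses Kafka's high-level consumer API. Receivers running in the executor processes continuously pull messages and store them both in executor memory and in a write-ahead log (WAL) on HDFS. Once a message has been written to the WAL, the offset in ZooKeeper is updated automatically. This guarantees at-least-once semantics but not exactly-once semantics: although the WAL ensures that messages are not lost, a message may be written to the WAL and the subsequent update of the consumer's offset in ZooKeeper may then fail. In that case the consumer resumes from the previous offset and fetches from Kafka a second time the data that is already saved in the WAL. This approach also stores data redundantly: one copy in the WAL and one in the BlockManager, where the BlockManager may use StorageLevel.MEMORY_AND_DISK_SER_2, i.e. serialized blocks kept in memory, spilled to disk when memory is insufficient, and replicated on two nodes. The redundancy considerably lowers throughput and the utilization of memory and disk. Today the direct stream approach described next is used almost everywhere.

2. Direct stream

The direct stream approach uses Kafka's simple consumer API, which greatly simplifies the process of fetching messages. Executors no longer read from Kafka continuously, and both the receiver and the WAL are eliminated. Another improvement is that Kafka partitions map one-to-one to RDD partitions and the user can control the offsets of each topic-partition, which makes the program more controllable. For each batch, the driver only needs to obtain the offset range of that batch from Kafka, and the executors then read exactly the messages in that range. Because an offset uniquely identifies a message within a Kafka topic-partition, and the offsets are tracked only by the streaming program itself, the inconsistency of the receiver-based approach disappears and exactly-once semantics can be guaranteed for the data each batch reads. However, since the simple consumer API is used, we have to manage the offsets ourselves. Otherwise, once the program crashes, the whole stream can only resume from the earliest or the latest offset, which is clearly not safe.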
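
The per-batch bookkeeping of the direct approach can be sketched as follows: the driver turns "current position" and "latest available offset" into one offset range per Kafka partition, and each range becomes one RDD partition. The types and names here (TopicPartition, OffsetRange, DirectStreamPlanner) are hypothetical stand-ins written for this example rather than the Spark or Kafka classes, and starting from offset 0 for an unknown partition is an assumption.

```scala
// Hypothetical stand-ins; not the Kafka or Spark classes of the same name.
final case class TopicPartition(topic: String, partition: Int)

final case class OffsetRange(tp: TopicPartition, fromOffset: Long, untilOffset: Long) {
  /** Number of messages in the half-open range [fromOffset, untilOffset). */
  def count: Long = untilOffset - fromOffset
}

object DirectStreamPlanner {
  /** Driver side: one range per Kafka partition, from the current position to the latest offset. */
  def planBatch(
      current: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long]): Seq[OffsetRange] =
    latest.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, until) =>
        val from = current.getOrElse(tp, 0L) // assumption: unknown partitions start at 0
        OffsetRange(tp, from, math.max(from, until))
      }

  /** After the batch has been processed, the next batch starts where this one ended. */
  def advance(ranges: Seq[OffsetRange]): Map[TopicPartition, Long] =
    ranges.map(r => r.tp -> r.untilOffset).toMap
}
```

Since a range is fully determined by its two offsets, re-running a failed batch reads exactly the same messages again, and persisting the map returned by `advance` is the "manage offsets ourselves" step mentioned above.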

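2. How do we make sure processing results are not lost?

There are two main options:

2.1. Design the output operations to be idempotent, so that on top of at-least-once semantics no data is lost and replayed data does no harm.

2.2. For results produced by shuffles or aggregations, build on exactly-once consumption and update the processing result and the offset together. This is usually done with a transaction.

In practice, transaction support mostly means traditional databases. Cache systems, which aim for simpler and faster access, implement transactions only in a very simple or partial form even when they offer them; Redis transactions, for example, do not support rollback. Our code therefore needs its own design to make sure the transaction executes correctly.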
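
Both options can be illustrated with an in-memory model. This is a sketch under simplifying assumptions (a single partition, state held in memory, a simulated crash flag); Record, IdempotentSink and TransactionalStore are hypothetical names and not part of any library.

```scala
import scala.collection.mutable

final case class Record(topic: String, partition: Int, offset: Long, value: Int)

// Option 2.1: an idempotent sink. Rows are keyed by (topic, partition, offset),
// so writing a replayed record overwrites the existing row instead of adding one.
final class IdempotentSink {
  private val rows = mutable.HashMap.empty[(String, Int, Long), Int]
  def write(batch: Seq[Record]): Unit =
    batch.foreach(r => rows.put((r.topic, r.partition, r.offset), r.value))
  def size: Int = rows.size
  def total: Int = rows.values.sum
}

// Option 2.2: the aggregate and the offset are replaced together or not at all.
// Assumption: one partition, so a single "next offset" is enough.
final class TransactionalStore {
  private var state: (Int, Long) = (0, 0L) // (running sum, next offset to read)

  def snapshot: (Int, Long) = synchronized(state)

  /** Returns true if new data was committed, false if the batch was entirely a replay. */
  def commit(batch: Seq[Record], failBeforeCommit: Boolean = false): Boolean = synchronized {
    val (sum, nextOffset) = state
    val fresh = batch.filter(_.offset >= nextOffset) // drop records already accounted for
    if (fresh.isEmpty) {
      false
    } else {
      val candidate = (sum + fresh.map(_.value).sum, fresh.map(_.offset).max + 1)
      // Simulated crash: nothing has been published yet, so the old state stays intact
      if (failBeforeCommit) throw new RuntimeException("simulated crash before commit")
      state = candidate // result and offset become visible in one step
      true
    }
  }
}
```

With the idempotent sink, writing the same batch twice leaves the same rows behind. With the transactional store, a crash before the commit leaves both the sum and the offset unchanged, so a retry of the same batch is applied once. In a real database the single assignment would correspond to updating the result table and the offset table inside one transaction.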

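3. How does distributed RDD computation ensure correctness and consistency?

In other words, how does distributed RDD computation make sure that each computation takes effect exactly once? A later series of source code analyses will examine how Spark performs distributed computation.

Chapter 10. Optimization

I. Spark cluster optimization

Only those who are full of confidence can be confident anywhere, immerse themselves in life, and know their own will.

1. Preface

Our company recently set up a small production cluster dedicated to running Spark jobs. Occasionally, however, excessive load on the NameNode (nn) or the DataNodes (dn) causes a job's checkpoint operation to fail, which in turn causes the Spark streaming job to fail. This article records how the Spark jobs were optimized at the application level in order to relieve the cluster.

2. Cluster usage

The directories that hold data, and their usage, are as follows:

Directory	Description	Size	File count	Share of file count	Share of data size
/user/root/.sparkStaging/applicationIdxxx	Spark job configuration and required jars	5G	about 1k	about 20%	about 100%
/tmp/checkpoint/xxx/{commits|metadata|offsets|sources}	Checkpoint files; commits and offsets change frequently	2M	about 4k	about 80%	about 0%

The .sparkStaging directory rarely changes, so only its size needs to be optimized.

The checkpoint directory sees files created and deleted frequently, so it is considered from two angles: how often checkpoints are generated and how many are retained.

3. Optimizing the .sparkStaging directory

The files under /user/root/.sparkStaging are the files a Spark job depends on. The jars can be uploaded once to a designated directory, which avoids or reduces repeated jar uploads and therefore shortens the time a job waits before it starts.

Add the following to Spark's configuration file spark-defaults.conf: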

spark.yarn.archive=hdfs://hdfscluster/user/hadoop/jars
spark.yarn.preserve.staging.files=false

1. Parameter description

Property Name	Default	Meaning
spark.yarn.archive	(none)	An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
spark.yarn.preserve.staging.files	false	Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.

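4. Checkpoint optimization

First, a look at what the checkpoint files mean.

1. Checkpoint file description

offsets directory - a write-ahead log that records the offsets contained in each batch. To make sure a given batch always contains the same data, the offsets are written to this log before any processing takes place. Thus the Nth entry in this log indicates the data currently being processed, and the (N-1)th entry indicates which offsets have been durably committed to the sink.
commits directory - a log of the IDs of completed batches. It is used to check whether a batch has been fully processed and its output committed to the sink, so that it does not need to be processed again. It is used, for example, during a restart to help identify which batch to run next.
metadata file - metadata associated with the whole query; it contains only the unique ID of the StreamingQuery
sources directory - stores the starting offset information

The checkpoint is optimized from two angles below.

First, the mechanism that triggers checkpoints.

2. The trigger mechanism

A Trigger is the policy that indicates how often a StreamingQuery should produce results.

Trigger has three implementations:

OneTimeTrigger - A Trigger that processes only one batch of data in a streaming query then terminates the query.
ProcessingTime - A trigger that runs a query periodically based on the processing time. If interval is 0, the query will run as fast as possible. By default, the trigger is ProcessingTime with interval=0.
ContinuousTrigger - A Trigger that continuously processes streaming data, asynchronously checkpointing at the specified interval.

Specifying an interval for ProcessingTime, or using a ContinuousTrigger with a specified interval, fixes the period at which checkpoints are generated. This keeps checkpoints from being generated too frequently and reduces the pressure on the NameNode of a small cluster that runs many jobs.

Second, the checkpoint retention mechanism.

3. Retention mechanism

spark.sql.streaming.minBatchesToRetain - the minimum number of batches that must be retained and made recoverable; the default is 100

The number of retained batches can be lowered, for example to 20, which reduces the total number of small checkpoint files to about 20% of the original.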
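
The 20% figure follows from simple arithmetic, sketched below. The model is an assumption made for illustration: it counts one offsets file and one commits file per retained batch and ignores the metadata file, the sources directory and any delay in cleanup. CheckpointRetention and its methods are hypothetical names.

```scala
object CheckpointRetention {
  /** Assumed steady state: one offsets file plus one commits file per retained batch. */
  def retainedFiles(minBatchesToRetain: Int): Int = 2 * minBatchesToRetain

  /** File count under the new setting as a fraction of the count under the old setting. */
  def ratio(newSetting: Int, oldSetting: Int): Double =
    retainedFiles(newSetting).toDouble / retainedFiles(oldSetting)
}
```

Under this model the default of 100 keeps about 200 small files per query and a setting of 20 keeps about 40, so `ratio(20, 100)` is 0.2. With many queries checkpointing to the same HDFS cluster, the absolute saving scales with the number of queries.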

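5. Verifying the checkpoint parameters

This mainly verifies the trigger mechanism and the retention mechanism.

1. Verifying the trigger mechanism

Behavior without a trigger

Before a trigger was set, a screenshot of the batch submission times of the Spark Structured Streaming query showed the following:

The query for each batch is submitted with no discernible periodicity.

Code that sets the trigger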
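
A minimal sketch of setting a 5-second processing-time trigger on a Structured Streaming query. The console sink, the checkpoint path, the application name and the object name TriggerExample are assumptions made for illustration and are not taken from the original job; the broker address and topic are the placeholders used elsewhere in this article. Running it requires a Spark installation with the spark-sql-kafka package and a reachable Kafka broker.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-example").getOrCreate()

    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "xxx:9092") // placeholder broker address
      .option("subscribe", "test-name")
      .load()

    val query = source
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("console") // assumed sink, for illustration only
      .option("checkpointLocation", "/tmp/checkpoint/trigger-example") // assumed path
      .trigger(Trigger.ProcessingTime("5 seconds")) // start one micro-batch every 5 s
      .start()

    query.awaitTermination()
  }
}
```

Without the `.trigger(...)` call the query uses the default ProcessingTime trigger with interval 0 and starts the next micro-batch as soon as the previous one finishes.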

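Effect of the trigger

After the trigger was set, a screenshot of the batch submission times showed the following:

The query for each batch is now submitted at a regular interval, once every 5 s, so the trigger setting has taken effect.

Note that if messages cannot be consumed promptly, they pile up. Structured Streaming currently has no backpressure mechanism equivalent to the one in Spark Streaming. To keep a single batch from reading too much source data, and to avoid data skew or an unrecoverable OutOfMemory error, the maxOffsetsPerTrigger option can be used to cap the number of messages fetched in a single batch.

An example: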

spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "xxx:9092")
    .option("subscribe", "test-name")
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 1)
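    // Kafka consumer settings must carry the "kafka." prefix, so un-prefixed "group.id" and
    // "auto.offset.reset" options have no effect; the start position comes from startingOffsets.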
    .load()

2. Verifying the retention mechanism

Behavior of the default retention policy

Spark job submission parameters

#!/bin/bash
spark-submit \
--class zd.Example \
--master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3,org.apache.kafka:kafka-clients:2.0.0 \
--repositories http://maven.aliyun.com/nexus/content/groups/public/ \
/root/spark-test-1.0-SNAPSHOT.jar

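As the screenshot of the checkpoint directory showed, offsets and commits each end up retaining at least 100 files.

Changing the retention policy

The checkpoint retention policy can be changed through the job submission parameters.

Add --conf spark.sql.streaming.minBatchesToRetain=2. The complete script is: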

#!/bin/bash
spark-submit \
--class zd.Example \
--master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3,org.apache.kafka:kafka-clients:2.0.0 \
--repositories http://maven.aliyun.com/nexus/content/groups/public/ \
--conf spark.sql.streaming.minBatchesToRetain=2 \
/root/spark-test-1.0-SNAPSHOT.jar

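Effect of the modified retention policy

A screenshot of the checkpoint directory after the change showed the following:

That is, the checkpoint retention parameter has taken effect.

3. Summary

In summary, setting a trigger controls the interval at which the query for each batch is submitted, and lowering the minimum number of batches retained for the checkpoint reduces the number of small checkpoint files that are kept.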
