Spark Source Code Analysis: DAG Division and Submission in the DAGScheduler


1. Spark Runtime Architecture

At a high level, the Spark runtime architecture works as follows:
RDDs depend on one another, and these dependencies form a directed acyclic graph (DAG). The DAGScheduler divides this DAG into stages using a simple rule: walk backwards from the final RDD; a narrow dependency is folded into the current stage, while a wide (shuffle) dependency marks a stage boundary. Once the stages are determined, the DAGScheduler builds a TaskSet for each stage and submits it to the TaskScheduler, which handles the actual task scheduling and launches the tasks on the Worker nodes.
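As a concrete illustration (a made-up example, not taken from the Spark sources), the lineage below splits into exactly two stages: flatMap and map are narrow dependencies and stay in the first stage, reduceByKey introduces a ShuffleDependency and therefore starts a new stage, and collect is the action that triggers the job.

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-split").setMaster("local[*]"))

    val counts = sc.parallelize(Seq("a b a", "b c"), numSlices = 2)
      .flatMap(_.split(" "))            // narrow dependency: stays in the first stage
      .map(word => (word, 1))           // narrow dependency: stays in the first stage
      .reduceByKey(_ + _)               // wide (shuffle) dependency: stage boundary
      .filter { case (_, n) => n > 1 }  // narrow dependency: stays in the second stage

    counts.collect().foreach(println)   // the action: triggers SparkContext.runJob for this DAG
    sc.stop()
  }
}

Run with local[*], the Spark UI shows the first stage ending at the shuffle write for reduceByKey and the second stage reading that shuffle and finishing with collect.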




2. Source Code Analysis: DAG Division in the DAGScheduler
When an action (such as collect) is triggered on an RDD, SparkContext.runJob is executed. SparkContext.runJob calls DAGScheduler.runJob, which ultimately calls DAGScheduler.submitJob.
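Before diving into submitJob, a small (hypothetical) example of that entry point: every action is a thin wrapper around SparkContext.runJob, which forwards to DAGScheduler.runJob and finally to DAGScheduler.submitJob.

import org.apache.spark.{SparkConf, SparkContext}

object RunJobEntryPoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("run-job-entry").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // collect(), count(), etc. are thin wrappers over SparkContext.runJob.
    // Calling runJob directly makes the entry point explicit: one function is applied
    // to each partition, and the results are gathered per partition index.
    val perPartitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
    println(perPartitionSums.mkString(", "))   // four partial sums, one per partition

    sc.stop()
  }
}

With the entry point in mind, here is DAGScheduler.submitJob: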
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted message to eventProcessLoop
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}

In DAGScheduler.submitJob, a JobSubmitted message is posted to the eventProcessLoop object; eventProcessLoop is an instance of DAGSchedulerEventProcessLoop:

private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

DAGSchedulerEventProcessLoop receives the various events and handles them; the handling logic lives in its doOnReceive method:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // Job submission
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)
  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)
  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)
  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()
  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)
  case ExecutorLost(execId) =>
    dagScheduler.handleExecutorLost(execId, fetchFailed = false)
  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)
  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)
  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)
  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

DAGSchedulerEventProcessLoop can be thought of as the DAGScheduler's external interface: it hides the internal implementation details. Whether a message originates inside or outside the DAGScheduler, it flows through the same handling code, which keeps the logic clear and the handling uniform.
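Under the hood this is the classic event-loop pattern: producers post events onto a blocking queue, and a single dispatcher thread takes them off and handles them one at a time. The sketch below is a simplified, self-contained illustration of that pattern; it is not Spark's actual EventLoop class, which additionally deals with errors, interruption, and orderly shutdown.

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

// Minimal event-loop sketch: callers post events, one daemon thread dispatches them.
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped.get) {
        val event = eventQueue.take()  // blocks until an event arrives
        onReceive(event)               // dispatch, analogous to doOnReceive
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = stopped.set(true)
  def post(event: E): Unit = eventQueue.put(event)

  // Subclasses supply the dispatch logic, e.g. a pattern match over event types.
  protected def onReceive(event: E): Unit
}

object EventLoopDemo {
  def main(args: Array[String]): Unit = {
    val loop = new SimpleEventLoop[String]("demo-loop") {
      protected def onReceive(event: String): Unit = println(s"handling: $event")
    }
    loop.start()
    loop.post("JobSubmitted")
    Thread.sleep(100)  // give the daemon dispatcher thread time to drain the queue
    loop.stop()
  }
}

DAGSchedulerEventProcessLoop fills the role of the concrete subclass: it hands each event to doOnReceive, which pattern-matches on the event type as shown above.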

Next comes the stage division itself. handleJobSubmitted first creates the final ResultStage:

try {
  // Creating the new stage may fail, for example if an HDFS file the job depends on has been deleted
  finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}

It then calls submitStage, which drives the stage division and submission.




Starting from the finalRDD, the scheduler walks the RDD's parent dependencies and checks their type: for a narrow dependency the parent RDD is pushed onto a stack so the walk can continue backwards, while a wide dependency becomes the boundary of a parent stage.

Here is how the source does it:

private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]   // parent stages that still need to be computed
  val visited = new HashSet[RDD[_]]  // RDDs that have already been visited
  // Maintain an explicit stack instead of recursing, to avoid a StackOverflowError on long lineages
  val waitingForVisit = new Stack[RDD[_]]

  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage               // wide dependency: record it as a parent stage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd) // narrow dependency: keep walking backwards
          }
        }
      }
    }
  }

  // Start the backwards traversal from the stage's own RDD
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}

getMissingParentStages takes the current stage and returns its parent stages. The parent stages themselves come from getShuffleMapStage, which ultimately calls newOrUsedShuffleStage to return a ShuffleMapStage:

private def newOrUsedShuffleStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // This shuffle has already been computed: recover the map output locations from the MapOutputTracker
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}

Now that the parent stages have been worked out, let's look at the stage submission logic:

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // No missing parent stages: submit this stage's tasks
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // Missing parent stages: recursively submit them first
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

The submission flow is simple: first look up the current stage's missing parent stages. If there are none, the current stage is a starting stage and is handed to submitMissingTasks; if there are missing parents, submitStage is called recursively to submit them first, and the current stage is parked in waitingStages until they complete.
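To make the recursion concrete, here is a hypothetical job with two shuffles (the stage numbers in the comments are illustrative). submitStage on the final ResultStage finds its ShuffleMapStage parent missing, recurses into it, finds its parent missing in turn, and only the stage with no missing parents is actually submitted; the others wait in waitingStages.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitOrderExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("submit-order").setMaster("local[*]"))

    val pairs   = sc.parallelize(1 to 1000, 8).map(i => (i % 10, i)) // Stage 0 (ShuffleMapStage)
    val sums    = pairs.reduceByKey(_ + _)                           // shuffle #1
    val regroup = sums.map { case (k, v) => (v % 3, k) }             // Stage 1 (ShuffleMapStage)
    val counts  = regroup.groupByKey()                               // shuffle #2
    counts.collect()                                                 // Stage 2 (ResultStage), triggers the job

    // Submission order driven by submitStage's recursion:
    //   submitStage(Stage 2) -> missing = [Stage 1] -> submitStage(Stage 1)
    //   submitStage(Stage 1) -> missing = [Stage 0] -> submitStage(Stage 0)
    //   submitStage(Stage 0) -> missing = []        -> submitMissingTasks(Stage 0)
    // Stages 1 and 2 wait in waitingStages and are submitted as their parents finish.
    sc.stop()
  }
}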

That covers DAG division and stage submission in the DAGScheduler. A follow-up article will look at how these stages are packaged into TaskSets, handed to the TaskScheduler, and how the TaskScheduler schedules them.
