1. Analysis of updateStateByKey
1.1 A usage example of updateStateByKey
First, let's look at an example that uses the updateStateByKey function:
object UpdateStateByKeyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")
    val ssc = new StreamingContext(conf, Seconds(20))
    // Checkpointing must be enabled before updateStateByKey can be used.
    ssc.checkpoint("/checkpoint/")
    val socketLines = ssc.socketTextStream("localhost", 9999)
    socketLines.flatMap(_.split(",")).map(word => (word, 1))
      .updateStateByKey((currValues: Seq[Int], preValue: Option[Int]) => {
        val currValue = currValues.sum // sum the values seen in the current batch
        Some(currValue + preValue.getOrElse(0)) // add the batch sum to the historical state
      }).print()
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
The code is straightforward, and the key points are annotated with comments. To try it out, feed comma-separated words into port 9999, for example with nc -lk 9999.
1.2 Source code analysis of updateStateByKey
We know that map returns a MappedDStream, and MappedDStream has no updateStateByKey method; neither does its parent class DStream. However, the companion object of DStream provides an implicit conversion:
implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
  new PairDStreamFunctions[K, V](stream)
}
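This is the standard "enrich my library" pattern: any DStream of key-value pairs is silently wrapped in PairDStreamFunctions, which is where updateStateByKey lives. Here is a minimal, Spark-free sketch of the same mechanism (MiniStream and PairFunctions are hypothetical names used only for illustration):

import scala.language.implicitConversions

object ImplicitEnrichmentSketch {
  // Hypothetical stand-in for DStream[T].
  class MiniStream[T](val data: Seq[T])

  // Hypothetical stand-in for PairDStreamFunctions: extra methods for pair streams.
  class PairFunctions[K, V](stream: MiniStream[(K, V)]) {
    def groupValuesByKey: Map[K, Seq[V]] =
      stream.data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  }

  // The implicit conversion, analogous to toPairDStreamFunctions.
  implicit def toPairFunctions[K, V](s: MiniStream[(K, V)]): PairFunctions[K, V] =
    new PairFunctions(s)

  def main(args: Array[String]): Unit = {
    val s = new MiniStream(Seq(("a", 1), ("b", 2), ("a", 3)))
    // groupValuesByKey is not defined on MiniStream, yet this compiles,
    // because the compiler inserts the implicit conversion.
    println(s.groupValuesByKey) // Map(a -> List(1, 3), b -> List(2))
  }
}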
The source of updateStateByKey in PairDStreamFunctions is as follows:
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
Here updateFunc is the function you pass in: Seq[V] holds all the values received for a key in the current batch, Option[S] is the key's previous state, and the return value is the key's new state.
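A quick plain-Scala check of these semantics (no Spark needed, pasteable into the REPL) shows how state accumulates across batches:

val update: (Seq[Int], Option[Int]) => Option[Int] =
  (currValues, preValue) => Some(currValues.sum + preValue.getOrElse(0))

val afterBatch1 = update(Seq(1, 1, 1), None)     // Some(3): no previous state yet
val afterBatch2 = update(Seq(1, 1), afterBatch1) // Some(5): previous state is added in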
Ultimately this delegates to the overload below:
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}
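Note the signature change: the public overload takes a per-key function, while this one takes a per-partition Iterator function. A sketch of the kind of adapter that bridges the two (written here from the signatures above, not copied from Spark's source):

def lift[K, V, S](f: (Seq[V], Option[S]) => Option[S])
    : Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
  iter => iter.flatMap { case (k, values, prevState) =>
    // Keep the key only if the per-key function returns a new state.
    f(values, prevState).map(newState => (k, newState))
  }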
This overload creates a new StateDStream. In StateDStream's compute method, it first fetches the RDD produced for the previous batch (the cumulative word counts from program start up to the previous batch), then fetches the RDD produced by StateDStream's parent for the current batch (this batch's word counts); these are prevStateRDD and parentRDD respectively. It then calls computeUsingPreviousRDD:
private[this] def computeUsingPreviousRDD(
    parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]) = {
  // Define the function for the mapPartition operation on cogrouped RDD;
  // first map the cogrouped tuple to tuples of required type,
  // and then apply the update function
  val updateFuncLocal = updateFunc
  val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
    val i = iterator.map { t =>
      val itr = t._2._2.iterator
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      (t._1, t._2._1.toSeq, headOption)
    }
    updateFuncLocal(i)
  }
  val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
  val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
  Some(stateRDD)
}
The two RDDs are cogrouped, and the function passed to updateStateByKey is applied to each key. The catch is that this cogroup runs over the entire state on every batch, so the cost grows with the total number of keys ever seen rather than with the size of the current batch; this is the main performance weakness of updateStateByKey.
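A plain-Scala simulation of one such batch makes the cost visible: like cogroup, it traverses the union of state keys and batch keys, so even keys with no new data are touched (Map stands in for the RDDs; all names are illustrative only):

def oneBatch[K, V, S](state: Map[K, S], batch: Seq[(K, V)],
                      f: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
  val grouped = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  // Like cogroup, every key present on either side must be visited...
  val allKeys = state.keySet ++ grouped.keySet
  allKeys.flatMap { k =>
    // ...including keys that received no data this batch (empty Seq of new values).
    f(grouped.getOrElse(k, Seq.empty[V]), state.get(k)).map(k -> _)
  }.toMap
}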
2. Analysis of mapWithState
2.1 A usage example of mapWithState:
object StatefulNetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Initial state RDD for mapWithState operation
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using mapWithState
    // This will give a DStream made of state (which is the cumulative count of the words)
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      val output = (word, sum)
      state.update(sum)
      output
    }

    val stateDstream = wordDstream.mapWithState(
      StateSpec.function(mappingFunc).initialState(initialRDD))
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
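Unlike updateStateByKey's update function, the mapping function here receives a State[Int] handle that it can both read and mutate, and its return value is an output element rather than the new state. A minimal mock of the State contract shows the difference (the real class is org.apache.spark.streaming.State; MockState below is a hypothetical stand-in for illustration):

class MockState[S](private var underlying: Option[S]) {
  def getOption: Option[S] = underlying                        // read current state, if any
  def update(newState: S): Unit = underlying = Some(newState)  // overwrite the state
  def exists: Boolean = underlying.isDefined
}

// Driving the example's mappingFunc logic by hand:
val st = new MockState[Int](None)
val sum = Some(1).getOrElse(0) + st.getOption.getOrElse(0) // 1
st.update(sum) // state is now Some(1); the next batch would see it via getOption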
2.2 Source code analysis of mapWithState
The mapWithState method itself simply wraps the stream and the given StateSpec in a MapWithStateDStreamImpl:
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
MapWithStateDStreamImpl creates an InternalMapWithStateDStream object called internalStream, and MapWithStateDStreamImpl's compute method delegates to internalStream's getOrCompute method:
/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }
}
InternalMapWithStateDStream does not define a getOrCompute method of its own; the call resolves to getOrCompute on its parent class DStream, which in turn ends up invoking InternalMapWithStateDStream's compute method:
/** Method that generates a RDD for the given time */
override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
  // Get the previous state or create a new empty state RDD
  val prevStateRDD = getOrCompute(validTime - slideDuration) match {
    case Some(rdd) =>
      if (rdd.partitioner != Some(partitioner)) {
        // If the RDD is not partitioned the right way, let us repartition it using the
        // partition index as the key. This is to ensure that state RDD is always partitioned
        // before creating another state RDD using it
        MapWithStateRDD.createFromRDD[K, V, S, E](
          rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
      } else {
        rdd
      }
    case None =>
      MapWithStateRDD.createFromPairRDD[K, V, S, E](
        spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
        partitioner,
        validTime
      )
  }

  // Compute the new state RDD with previous state RDD and partitioned data RDD
  // Even if there is no data RDD, use an empty one to create a new state RDD
  val dataRDD = parent.getOrCompute(validTime).getOrElse {
    context.sparkContext.emptyRDD[(K, V)]
  }
  val partitionedDataRDD = dataRDD.partitionBy(partitioner)
  val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
    (validTime - interval).milliseconds
  }
  Some(new MapWithStateRDD(
    prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
}
Given the batch time, this generates a MapWithStateRDD. It first obtains the previous state RDD (prevStateRDD) and the current batch's data RDD (dataRDD), then repartitions dataRDD with the state RDD's partitioner to get partitionedDataRDD. Finally, prevStateRDD, partitionedDataRDD, and the user-defined mappingFunction are passed to a newly created MapWithStateRDD, which is returned.
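Note the recursion hidden in the first step: the state at time t is built from the state at t - slideDuration, so each batch's state RDD depends on the previous one. Sketched in plain Scala, with Long timestamps and a Map standing in for the state RDD (all names illustrative):

def stateAt(t: Long, slide: Long, firstBatch: Long,
            batchData: Long => Seq[(String, Int)]): Map[String, Int] = {
  // "getOrCompute(validTime - slideDuration)": recurse back until before the first batch.
  val prev =
    if (t < firstBatch) Map.empty[String, Int]
    else stateAt(t - slide, slide, firstBatch, batchData)
  // Fold this batch's data into the previous state (word-count semantics).
  batchData(t).foldLeft(prev) { case (m, (k, v)) =>
    m.updated(k, m.getOrElse(k, 0) + v)
  }
}

In Spark the recursion bottoms out immediately in practice, because DStream.getOrCompute caches the RDD generated for each batch time.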
Now let's look at the compute method of MapWithStateRDD:
override def compute(
    partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {
  val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
  val prevStateRDDIterator = prevStateRDD.iterator(
    stateRDDPartition.previousSessionRDDPartition, context)
  val dataIterator = partitionedDataRDD.iterator(
    stateRDDPartition.partitionedDataRDDPartition, context)

  // prevRecord holds the state record of the corresponding previous partition, if one exists
  val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
  val newRecord = MapWithStateRDDRecord.updateRecordWithData(
    prevRecord,
    dataIterator,
    mappingFunction,
    batchTime,
    timeoutThresholdTime,
    removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
  )
  Iterator(newRecord)
}
A MapWithStateRDDRecord corresponds to a single partition of a MapWithStateRDD:
private[streaming] case class MapWithStateRDDRecord[K, S, E](
    var stateMap: StateMap[K, S], var mappedData: Seq[E])
Here stateMap stores the state for every key, while mappedData stores the values returned by the mapping function.
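The two fields travel in different directions: stateMap is carried forward into the next batch's state RDD, while mappedData is what MapWithStateDStreamImpl.compute (shown earlier) flatMaps out as the output DStream. In miniature (MiniRecord is a hypothetical illustration, not Spark code):

case class MiniRecord[K, S, E](stateMap: Map[K, S], mappedData: Seq[E])

val rec = MiniRecord(stateMap = Map("hello" -> 3), mappedData = Seq(("hello", 3)))
// rec.mappedData -> emitted downstream now
// rec.stateMap   -> carried forward to the next batch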
Let's look at MapWithStateRDDRecord's updateRecordWithData method:
def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
    dataIterator: Iterator[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long],
    removeTimedoutData: Boolean
  ): MapWithStateRDDRecord[K, S, E] = {
  // Create a new state map by copying the previous record's map (if it exists);
  // otherwise create an empty StateMap
  val newStateMap = prevRecord.map { _.stateMap.copy() }.getOrElse { new EmptyStateMap[K, S]() }

  val mappedData = new ArrayBuffer[E]
  // Reusable wrapper that exposes the State[S] API to the mapping function
  val wrappedState = new StateImpl[S]()

  // Call the mapping function on each record in the data iterator, and accordingly
  // update the states touched, and collect the data returned by the mapping function
  dataIterator.foreach { case (key, value) =>
    // Fetch the current state for this key
    wrappedState.wrap(newStateMap.get(key))
    // Invoke the mappingFunction to obtain the mapped value
    val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
    // Maintain newStateMap according to what the mapping function did with the state
    if (wrappedState.isRemoved) {
      newStateMap.remove(key)
    } else if (wrappedState.isUpdated
        || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
      newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
    }
    mappedData ++= returned
  }

  // Get the timed out state records, call the mapping function on each and collect the
  // data returned
  if (removeTimedoutData && timeoutThresholdTime.isDefined) {
    newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
      wrappedState.wrapTimingOutState(state)
      val returned = mappingFunction(batchTime, key, None, wrappedState)
      mappedData ++= returned
      newStateMap.remove(key)
    }
  }

  MapWithStateRDDRecord(newStateMap, mappedData)
}
Finally, the resulting MapWithStateRDDRecord is handed back to MapWithStateRDD's compute function, which wraps it in an Iterator and returns it. Since each batch only touches the keys that actually received data (plus any timed-out keys), mapWithState avoids the full-state cogroup that makes updateStateByKey expensive.
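To see the whole per-batch flow end to end, here is a plain-Scala simulation of updateRecordWithData's logic, with a mutable Map playing the role of StateMap and update timestamps tracked for the timeout scan. All names are hypothetical, and the mapping function is modeled functionally (returning the output and the new state) instead of mutating a State handle; this is a sketch of the logic above, not Spark code:

import scala.collection.mutable

def simulateBatch[K, V, S, E](
    prevState: Map[K, (S, Long)],                      // state + last-updated time per key
    batch: Seq[(K, V)],
    mappingFunction: (Long, K, Option[V], Option[S]) => (Option[E], Option[S]),
    batchTime: Long,
    timeoutThresholdTime: Option[Long]): (Map[K, (S, Long)], Seq[E]) = {

  val newStateMap = mutable.Map(prevState.toSeq: _*)   // copy of the previous state map
  val mappedData = mutable.ArrayBuffer[E]()

  // Only keys present in this batch are visited (unlike updateStateByKey's cogroup).
  batch.foreach { case (key, value) =>
    val (out, newState) =
      mappingFunction(batchTime, key, Some(value), newStateMap.get(key).map(_._1))
    newState match {
      case Some(s) => newStateMap(key) = (s, batchTime) // like state.update(...)
      case None    => newStateMap.remove(key)           // like state.remove()
    }
    mappedData ++= out
  }

  // Timeout scan: expire keys not updated since the threshold, calling the
  // mapping function once more with no value, then dropping the key.
  timeoutThresholdTime.foreach { threshold =>
    newStateMap.filter(_._2._2 < threshold).keys.toSeq.foreach { key =>
      val (out, _) = mappingFunction(batchTime, key, None, newStateMap.get(key).map(_._1))
      mappedData ++= out
      newStateMap.remove(key)
    }
  }

  (newStateMap.toMap, mappedData.toSeq)
}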