Akka (22): Stream: Real-time control: dynamic pipeline connections with MergeHub, BroadcastHub and PartitionHub

This article is a repost. 2017-08-31 09:53. Categories: Scala / Programming languages / Akka

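In practice we often meet scenarios like these: there is a fixed data source (a Source) to which we want to attach an arbitrary number of downstream subscribers depending on the state of the running program, or we need to push several data streams into one fixed stream endpoint (a Sink) at runtime. Both call for a junction, either merging (Merge) or fan-out (Broadcast), whose connections can be made dynamically. According to the akka-stream documentation, complex one-to-many, many-to-one or many-to-many stream components must be designed with GraphDSL and produce a Graph. As mentioned in earlier posts, a Graph is a blueprint for a computation in which every processing step has to be specified up front, so dynamic pipeline connections should not be possible. To cover exactly this need, however, akka-stream provides MergeHub, BroadcastHub and PartitionHub: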

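1. MergeHub: a many-to-one merging junction. It supports a dynamic set of upstream publishers.

2. BroadcastHub: a one-to-many fan-out junction. It supports a dynamic set of downstream subscribers.

3. PartitionHub: also a one-to-many fan-out junction, but it uses a function to choose the destination of each element.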

The MergeHub object has a source function:

    /**
     * Creates a [[Source]] that emits elements merged from a dynamic set of producers. After the [[Source]] returned
     * by this method is materialized, it returns a [[Sink]] as a materialized value. This [[Sink]] can be materialized
     * arbitrary many times and each of the materializations will feed the elements into the original [[Source]].
     *
     * Every new materialization of the [[Source]] results in a new, independent hub, which materializes to its own
     * [[Sink]] for feeding that materialization.
     *
     * If one of the inputs fails the [[Sink]], the [[Source]] is failed in turn (possibly jumping over already buffered
     * elements). Completed [[Sink]]s are simply removed. Once the [[Source]] is cancelled, the Hub is considered closed
     * and any new producers using the [[Sink]] will be cancelled.
     *
     * @param perProducerBufferSize Buffer space used per producer. Default value is 16.
     */
    def source[T](perProducerBufferSize: Int): Source[T, Sink[T, NotUsed]] =
      Source.fromGraph(new MergeHub[T](perProducerBufferSize))

MergeHub.source returns a Source[T, Sink[T, NotUsed]]. In essence, a MergeHub is a shared Sink, as shown below:

    val fixedSink = Sink.foreach(println)
    // the element type is given explicitly so that it matches the annotated Sink[Any, NotUsed]
    val sinkGraph: RunnableGraph[Sink[Any, NotUsed]] =
      MergeHub.source[Any](perProducerBufferSize = 16).to(fixedSink)
    val inGate: Sink[Any, NotUsed] = sinkGraph.run()   // common input

    // now connect any number of sources
    val (killSwitch, _) = (Source(Stream.from(0)).delay(1.second, DelayOverflowStrategy.backpressure)
      .viaMat(KillSwitches.single)(Keep.right).toMat(inGate)(Keep.both)).run()

    val (killSwitch2, _) = (Source(List("a", "b", "c", "d", "e")).delay(2.second, DelayOverflowStrategy.backpressure)
      .viaMat(KillSwitches.single)(Keep.right).toMat(inGate)(Keep.both)).run()

    val (killSwitch3, _) = (Source(List("AA", "BB", "CC", "DD", "EE")).delay(3.second, DelayOverflowStrategy.backpressure)
      .viaMat(KillSwitches.single)(Keep.right).toMat(inGate)(Keep.both)).run()

    scala.io.StdIn.readLine()
    killSwitch.shutdown()
    killSwitch2.shutdown()
    killSwitch3.shutdown()
    actorSys.terminate()
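The merging behaviour can also be checked with finite streams. The following is a minimal sketch, not the author's code: it assumes Akka 2.6.x and a scala-cli style script (the `//> using dep` line and the version number are assumptions), and it adds `take(6)` only so that the hub's Source, which otherwise never completes on its own, finishes and its elements can be collected.

```scala
//> using dep "com.typesafe.akka::akka-stream:2.6.20"
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, MergeHub, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("merge-hub-sketch")
implicit val mat: ActorMaterializer = ActorMaterializer()

// Materialize the hub once: we get the shared Sink plus a Future of the collected elements.
val (inGate, collected) =
  MergeHub.source[String](perProducerBufferSize = 16)
    .take(6) // stop after six elements so that Sink.seq can complete
    .toMat(Sink.seq)(Keep.both)
    .run()

// Attach two independent producers to the same shared Sink.
val attached1 = Source(List("a", "b", "c")).runWith(inGate)
val attached2 = Source(List("A", "B", "C")).runWith(inGate)

// All six elements arrive; the interleaving between the two producers is not deterministic.
val result = Await.result(collected, 5.seconds)
println(result)

system.terminate()
```

Each `runWith(inGate)` is a separate materialization of the same Sink, which is the reason the hub can accept producers that appear after it has started.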

Similarly, a BroadcastHub is a shared Source to which any number of downstream subscribers can connect. Here is the definition of BroadcastHub.sink:

    /**
     * Creates a [[Sink]] that receives elements from its upstream producer and broadcasts them to a dynamic set
     * of consumers. After the [[Sink]] returned by this method is materialized, it returns a [[Source]] as materialized
     * value. This [[Source]] can be materialized an arbitrary number of times and each materialization will receive the
     * broadcast elements from the original [[Sink]].
     *
     * Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own
     * [[Source]] for consuming the [[Sink]] of that materialization.
     *
     * If the original [[Sink]] is failed, then the failure is immediately propagated to all of its materialized
     * [[Source]]s (possibly jumping over already buffered elements). If the original [[Sink]] is completed, then
     * all corresponding [[Source]]s are completed. Both failure and normal completion is "remembered" and later
     * materializations of the [[Source]] will see the same (failure or completion) state. [[Source]]s that are
     * cancelled are simply removed from the dynamic set of consumers.
     *
     * @param bufferSize Buffer size used by the producer. Gives an upper bound on how "far" from each other two
     *                   concurrent consumers can be in terms of element. If this buffer is full, the producer
     *                   is backpressured. Must be a power of two and less than 4096.
     */
    def sink[T](bufferSize: Int): Sink[T, Source[T, NotUsed]] =
      Sink.fromGraph(new BroadcastHub[T](bufferSize))

BroadcastHub.sink returns a Sink[T, Source[T, NotUsed]], that is, a shared Source to which any number of downstream consumers can attach:

    val killAll = KillSwitches.shared("terminator")
    val fixedSource = Source(Stream.from(100)).delay(1.second, DelayOverflowStrategy.backpressure)
    val sourceGraph = fixedSource.via(killAll.flow)
      .toMat(BroadcastHub.sink(bufferSize = 16))(Keep.right).async
    val outPort = sourceGraph.run()  // shared source

    // now connect any number of sinks to outPort
    outPort.to(Sink.foreach { c => println(s"A: $c") }).run()
    outPort.to(Sink.foreach { c => println(s"B: $c") }).run()
    outPort.to(Sink.foreach { c => println(s"C: $c") }).run()

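Another option is to connect a MergeHub and a BroadcastHub back to back, forming a many-to-many shape. In theory this acts as a distribution hub that lets any number of upstream publishers and downstream subscribers connect. We first wire the two together to obtain a Sink and a Source: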

    val (sink, source) = MergeHub.source[Int](perProducerBufferSize = 16)
      .toMat(BroadcastHub.sink(bufferSize = 16))(Keep.both).run()

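In theory we can now connect to sink and source freely. There is one special case, though: while there is no downstream subscriber at all, none of the upstream producers can send any data. This is caused by backpressure: the two hubs form one composite junction, and when the downstream side cannot keep up, the upstream publishers are throttled through backpressure. We can install a drain to keep the upstream publishers pushing data normally: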

  source.runWith(Sink.ignore)

This way the upstream producers keep working normally even when there is no downstream subscriber.

Now we can use Flow.fromSinkAndSource(sink, source) to build a Flow[I, O, ?]:

    def fromSinkAndSource[I, O](sink: Graph[SinkShape[I], _], source: Graph[SourceShape[O], _]): Flow[I, O, NotUsed] =
      fromSinkAndSourceMat(sink, source)(Keep.none)

We can also bring in KillSwitches.singleBidi, which was mentioned in the previous post:

    val channel: Flow[Int, Int, UniqueKillSwitch] =
      Flow.fromSinkAndSource(sink, source)
        .joinMat(KillSwitches.singleBidi[Int, Int])(Keep.right)
        .backpressureTimeout(3.seconds)

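The backpressureTimeout above ensures that any downstream subscriber that blocks for longer than the timeout is forcibly terminated: its stream is failed with a TimeoutException. The definition is as follows: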

    /**
     * If the time between the emission of an element and the following downstream demand exceeds the provided timeout,
     * the stream is failed with a [[scala.concurrent.TimeoutException]]. The timeout is checked periodically,
     * so the resolution of the check is one period (equals to timeout value).
     *
     * '''Emits when''' upstream emits an element
     *
     * '''Backpressures when''' downstream backpressures
     *
     * '''Completes when''' upstream completes or fails if timeout elapses between element emission and downstream demand.
     *
     * '''Cancels when''' downstream cancels
     */
    def backpressureTimeout(timeout: FiniteDuration): Repr[Out] =
      via(new Timers.BackpressureTimeout[Out](timeout))

With that in place, we can use channel as a Flow:

    val killChannel1 = fixedSource.viaMat(channel)(Keep.right).to(fixedSink).run()
    val killChannel2 = Source.repeat(888)
      .delay(2.second, DelayOverflowStrategy.backpressure)
      .viaMat(channel)(Keep.right).to(fixedSink).run()

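As mentioned above, PartitionHub is a special kind of BroadcastHub. It also fans out, but it uses a function to select the downstream subscriber for each element. This can be seen from the signature of PartitionHub.sink: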

    def sink[T](partitioner: (Int, T) ⇒ Int, startAfterNrOfConsumers: Int,
                bufferSize: Int = defaultBufferSize): Sink[T, Source[T, NotUsed]] =
      statefulSink(() ⇒ (info, elem) ⇒ info.consumerIdByIdx(partitioner(info.size, elem)), startAfterNrOfConsumers, bufferSize)

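As the signature shows, the partitioner takes the current number of consumers and an element and returns the index of the chosen consumer. In fact, sink simply calls statefulSink with this partitioner fixed inside a factory function: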

    /**
     * This `statefulSink` should be used when there is a need to keep mutable state in the partition function,
     * e.g. for implementing round-robin or sticky session kind of routing. If state is not needed the [[#sink]] can
     * be more convenient to use.
     *
     * @param partitioner Function that decides where to route an element. It is a factory of a function
     *   to be able to hold stateful variables that are unique for each materialization. The function
     *   takes two parameters; the first is information about active consumers, including an array of consumer
     *   identifiers and the second is the stream element. The function should return the selected consumer
     *   identifier for the given element. The function will never be called when there are no active consumers,
     *   i.e. there is always at least one element in the array of identifiers.
     * @param startAfterNrOfConsumers Elements are buffered until this number of consumers have been connected.
     *   This is only used initially when the stage is starting up, i.e. it is not honored when consumers have
     *   been removed (canceled).
     * @param bufferSize Total number of elements that can be buffered. If this buffer is full, the producer
     *   is backpressured.
     */
    @ApiMayChange def statefulSink[T](partitioner: () ⇒ (ConsumerInfo, T) ⇒ Long, startAfterNrOfConsumers: Int,
                                      bufferSize: Int = defaultBufferSize): Sink[T, Source[T, NotUsed]] =
      Sink.fromGraph(new PartitionHub[T](partitioner, startAfterNrOfConsumers, bufferSize))
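The way sink delegates to statefulSink can be mimicked in plain Scala, without Akka. The sketch below is an illustration only: `Consumers` is a hypothetical, simplified stand-in for `PartitionHub.ConsumerInfo`, and `liftStateless` is a made-up name for the wrapping that `sink` performs.

```scala
// Hypothetical stand-in for PartitionHub.ConsumerInfo: just an ordered list of consumer ids.
final case class Consumers(ids: Vector[Long]) {
  def size: Int = ids.size
  def consumerIdByIdx(idx: Int): Long = ids(idx)
}

// Shape of the function that the stateful variant expects: (consumer info, element) => consumer id.
type Partitioner[T] = (Consumers, T) => Long

// Wrap a stateless (size, element) => index function in a factory,
// translating the returned index into a consumer id.
def liftStateless[T](partitioner: (Int, T) => Int): () => Partitioner[T] =
  () => (info, elem) => info.consumerIdByIdx(partitioner(info.size, elem))

val consumers = Consumers(Vector(101L, 102L, 103L))
val route: Partitioner[Int] = liftStateless[Int]((size, elem) => elem % size)()

// Elements 0..5 are spread over the three consumer ids by elem % 3.
val picked = (0 to 5).map(elem => route(consumers, elem))
println(picked) // Vector(101, 102, 103, 101, 102, 103)
```

The factory indirection costs nothing in the stateless case, but it gives every materialization its own fresh function, which matters as soon as the partitioner keeps mutable state.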

As with BroadcastHub, we first build a shared data producer and then connect it to a PartitionHub, forming a channel towards the downstream ends that any downstream subscriber can attach to:

    // kill switch for interrupting the stream
    val killAll = KillSwitches.shared("terminator")
    // fix a producer
    val fixedSource = Source.tick(1.second, 1.second, "message")
      .zipWith(Source(1 to 100))((a, b) => s"$a-$b")
    // connect to PartitionHub, which uses a function to select the sink
    val sourceGraph = fixedSource.via(killAll.flow).toMat(PartitionHub.sink(
      // abs is applied after the modulo so that the index stays in range even for Int.MinValue
      (size, elem) => math.abs(elem.hashCode % size),
      startAfterNrOfConsumers = 2, bufferSize = 256))(Keep.right)
    // materialize the source
    val fromSource = sourceGraph.run()
    // connect to fixedSource freely
    fromSource.runForeach(msg => println("subs1: " + msg))
    fromSource.runForeach(msg => println("subs2: " + msg))

    scala.io.StdIn.readLine()
    killAll.shutdown()
    actorSys.terminate()
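One detail is worth checking in any hash-based partitioner: `math.abs(h) % size` can become negative when `h` is `Int.MinValue`, because `math.abs(Int.MinValue)` is still `Int.MinValue`, whereas `math.abs(h % size)` always stays within `[0, size)`. A minimal plain-Scala sketch of the difference (the helper names are hypothetical):

```scala
// Hypothetical helpers: map a hash code to a consumer index for `size` consumers.
def naiveIndex(hash: Int, size: Int): Int = math.abs(hash) % size
def safeIndex(hash: Int, size: Int): Int = math.abs(hash % size)

val size = 3

// math.abs(Int.MinValue) overflows and stays negative, so the naive index is negative too.
val naive = naiveIndex(Int.MinValue, size)
val safe = safeIndex(Int.MinValue, size)

// The safe variant stays in range for the extreme values as well as ordinary ones.
val hashes = Seq(Int.MinValue, -7, 0, 7, Int.MaxValue)
val allInRange = hashes.forall { h =>
  val idx = safeIndex(h, size)
  idx >= 0 && idx < size
}

println(s"naive = $naive, safe = $safe, allInRange = $allInRange")
// prints: naive = -2, safe = 2, allInRange = true
```

A negative index would be passed on as a consumer position, so taking the absolute value after the modulo is the safer ordering.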

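As shown, the partitioner function decides which of the downstream subscribers each upstream element flows to. In that sense PartitionHub is also a kind of router. The following example implements a Router-like round-robin delivery strategy: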

    // partitioner function
    def roundRobin(): (PartitionHub.ConsumerInfo, String) => Long = {
      var i = -1L

      (info, elem) => {
        i += 1
        info.consumerIdByIdx((i % info.size).toInt)
      }
    }

    val roundRobinGraph = fixedSource.via(killAll.flow).toMat(PartitionHub.statefulSink(
      () => roundRobin(), startAfterNrOfConsumers = 2, bufferSize = 256))(Keep.right)
    val roundRobinSource = roundRobinGraph.run()

    roundRobinSource.runForeach(msg => println("roundRobin1: " + msg))
    roundRobinSource.runForeach(msg => println("roundRobin2: " + msg))

In the example above, the destination of each element is determined by the roundRobin function.
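A bounded variant makes the round-robin distribution easy to verify. This is a sketch under assumptions rather than the author's code: Akka 2.6.x run as a scala-cli style script (the `//> using dep` line and the version number are assumptions), a finite `Source(0 until 10)` instead of the ticking source, and `Sink.seq` to collect what each consumer receives.

```scala
//> using dep "com.typesafe.akka::akka-stream:2.6.20"
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{PartitionHub, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("round-robin-sketch")
implicit val mat: ActorMaterializer = ActorMaterializer()

// Same idea as the roundRobin function above, for Int elements:
// a fresh counter per materialization, advanced once per element.
def roundRobin(): (PartitionHub.ConsumerInfo, Int) => Long = {
  var i = -1L
  (info, _) => {
    i += 1
    info.consumerIdByIdx((i % info.size).toInt)
  }
}

// Elements are buffered until two consumers have attached, so nothing is routed early.
val hubSource = Source(0 until 10).runWith(
  PartitionHub.statefulSink[Int](() => roundRobin(), startAfterNrOfConsumers = 2, bufferSize = 8))

val first = hubSource.runWith(Sink.seq)
val second = hubSource.runWith(Sink.seq)

// One consumer receives the even elements and the other the odd ones;
// which consumer gets which half depends on the order in which they registered.
val results = Set(Await.result(first, 5.seconds), Await.result(second, 5.seconds))
println(results)

system.terminate()
```

Because the counter lives inside the function returned by `roundRobin()`, each materialization of the hub starts its own rotation from the first consumer.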

In the next example, data flows to the fastest subscriber:

    val producer = Source(0 until 100)

    // ConsumerInfo.queueSize is the approximate number of buffered elements for a consumer.
    // Note that this is a moving target since the elements are consumed concurrently.
    val runnableGraph: RunnableGraph[Source[Int, NotUsed]] =
      producer.via(killAll.flow).toMat(PartitionHub.statefulSink(
        () => (info, elem) => info.consumerIds.minBy(id => info.queueSize(id)),
        startAfterNrOfConsumers = 2, bufferSize = 16))(Keep.right)

    val fromProducer: Source[Int, NotUsed] = runnableGraph.run()

    fromProducer.runForeach(msg => println("fast1: " + msg))
    fromProducer.throttle(10, 100.millis, 10, ThrottleMode.Shaping)
      .runForeach(msg => println("fast2: " + msg))

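In this example the partitioner decides where each element goes based on the number of elements buffered for each downstream consumer (queueSize): the larger the queueSize, the slower that subscriber.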
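The selection rule itself can be tried out without Akka. The following plain-Scala sketch is a deliberately simplified model, not how the hub is implemented: queue sizes are kept in a Map, the consumer ids 10 and 20 are arbitrary, and the "fast" consumer is assumed to empty its buffer after every element while the "slow" one never drains.

```scala
// Pick the consumer with the fewest buffered elements (the first one wins on ties).
def leastLoaded(ids: Seq[Long], queueSize: Long => Int): Long = ids.minBy(queueSize)

val slow = 10L // hypothetical consumer that never drains its buffer
val fast = 20L // hypothetical consumer that drains its buffer after every element
val ids = Seq(slow, fast)

// Route six elements, tracking the simulated buffer sizes along the way.
val (finalQueues, routed) =
  (1 to 6).foldLeft((Map(slow -> 0, fast -> 0), Vector.empty[Long])) {
    case ((queues, acc), _) =>
      val target = leastLoaded(ids, queues)
      val enqueued = queues.updated(target, queues(target) + 1)
      val drained = enqueued.updated(fast, 0) // the fast consumer catches up immediately
      (drained, acc :+ target)
  }

println(routed)      // Vector(10, 20, 20, 20, 20, 20)
println(finalQueues) // Map(10 -> 1, 20 -> 0)
```

In this model the slow consumer receives one element and is then skipped for as long as its buffer stays non-empty, which mirrors the intent of routing by `queueSize`.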

Below is the complete source code for the MergeHub and BroadcastHub demonstrations above:

    import akka.NotUsed
    import akka.stream.scaladsl._
    import akka.stream._
    import akka.actor._
    import scala.concurrent.duration._

    object HubsDemo extends App {
      implicit val actorSys = ActorSystem("sys")
      implicit val ec = actorSys.dispatcher
      implicit val mat = ActorMaterializer(
        ActorMaterializerSettings(actorSys)
          .withInputBuffer(16, 16)
      )

      val fixedSink = Sink.foreach(println)
      // the element type is given explicitly so that it matches the annotated Sink[Any, NotUsed]
      val sinkGraph: RunnableGraph[Sink[Any, NotUsed]] =
        MergeHub.source[Any](perProducerBufferSize = 16).to(fixedSink).async
      val inGate: Sink[Any, NotUsed] = sinkGraph.run()   // common input

      // now connect any number of sources
      val (killSwitch, _) = (Source(Stream.from(0)).delay(1.second, DelayOverflowStrategy.backpressure)
        .viaMat(KillSwitches.single)(Keep.right).toMat(inGate)(Keep.both)).run()

      val (killSwitch2, _) = (Source(List("a", "b", "c", "d", "e")).delay(2.second, DelayOverflowStrategy.backpressure)
        .viaMat(KillSwitches.single)(Keep.right).toMat(inGate)(Keep.both)).run()

      val (killSwitch3, _) = (Source(List("AA", "BB", "CC", "DD", "EE")).delay(3.second, DelayOverflowStrategy.backpressure)
        .viaMat(KillSwitches.single)(Keep.right).toMat(inGate)(Keep.both)).run()

      val killAll = KillSwitches.shared("terminator")
      val fixedSource = Source(Stream.from(100)).delay(1.second, DelayOverflowStrategy.backpressure)
      val sourceGraph = fixedSource.via(killAll.flow)
        .toMat(BroadcastHub.sink(bufferSize = 16))(Keep.right).async
      val outPort = sourceGraph.run()  // shared source

      // now connect any number of sinks to outPort
      outPort.to(Sink.foreach { c => println(s"A: $c") }).run()
      outPort.to(Sink.foreach { c => println(s"B: $c") }).run()
      outPort.to(Sink.foreach { c => println(s"C: $c") }).run()

      val (sink, source) = MergeHub.source[Int](perProducerBufferSize = 16)
        .toMat(BroadcastHub.sink(bufferSize = 16))(Keep.both).run()

      source.runWith(Sink.ignore)

      val channel: Flow[Int, Int, UniqueKillSwitch] =
        Flow.fromSinkAndSource(sink, source)
          .joinMat(KillSwitches.singleBidi[Int, Int])(Keep.right)
          .backpressureTimeout(3.seconds)

      val killChannel1 = fixedSource.viaMat(channel)(Keep.right).to(fixedSink).run()
      val killChannel2 = Source.repeat(888)
        .delay(2.second, DelayOverflowStrategy.backpressure)
        .viaMat(channel)(Keep.right).to(fixedSink).run()

      scala.io.StdIn.readLine()
      killSwitch.shutdown()
      killSwitch2.shutdown()
      killSwitch3.shutdown()
      killAll.shutdown()
      killChannel1.shutdown()
      killChannel2.shutdown()
      scala.io.StdIn.readLine()
      actorSys.terminate()
    }

Below is the complete source code for the PartitionHub demonstration:

    import akka.NotUsed
    import akka.stream.scaladsl._
    import akka.stream._
    import akka.actor._
    import scala.concurrent.duration._

    object PartitionHubDemo extends App {
      implicit val actorSys = ActorSystem("sys")
      implicit val ec = actorSys.dispatcher
      implicit val mat = ActorMaterializer(
        ActorMaterializerSettings(actorSys)
          .withInputBuffer(16, 16)
      )

      // kill switch for interrupting the stream
      val killAll = KillSwitches.shared("terminator")
      // fix a producer
      val fixedSource = Source.tick(1.second, 1.second, "message")
        .zipWith(Source(1 to 100))((a, b) => s"$a-$b")
      // connect to PartitionHub, which uses a function to select the sink
      val sourceGraph = fixedSource.via(killAll.flow).toMat(PartitionHub.sink(
        // abs is applied after the modulo so that the index stays in range even for Int.MinValue
        (size, elem) => math.abs(elem.hashCode % size),
        startAfterNrOfConsumers = 2, bufferSize = 256))(Keep.right)
      // materialize the source
      val fromSource = sourceGraph.run()
      // connect to fixedSource freely
      fromSource.runForeach(msg => println("subs1: " + msg))
      fromSource.runForeach(msg => println("subs2: " + msg))

      // partitioner function
      def roundRobin(): (PartitionHub.ConsumerInfo, String) => Long = {
        var i = -1L

        (info, elem) => {
          i += 1
          info.consumerIdByIdx((i % info.size).toInt)
        }
      }

      val roundRobinGraph = fixedSource.via(killAll.flow).toMat(PartitionHub.statefulSink(
        () => roundRobin(), startAfterNrOfConsumers = 2, bufferSize = 256))(Keep.right)
      val roundRobinSource = roundRobinGraph.run()

      roundRobinSource.runForeach(msg => println("roundRobin1: " + msg))
      roundRobinSource.runForeach(msg => println("roundRobin2: " + msg))

      val producer = Source(0 until 100)

      // ConsumerInfo.queueSize is the approximate number of buffered elements for a consumer.
      // Note that this is a moving target since the elements are consumed concurrently.
      val runnableGraph: RunnableGraph[Source[Int, NotUsed]] =
        producer.via(killAll.flow).toMat(PartitionHub.statefulSink(
          () => (info, elem) => info.consumerIds.minBy(id => info.queueSize(id)),
          startAfterNrOfConsumers = 2, bufferSize = 16))(Keep.right)

      val fromProducer: Source[Int, NotUsed] = runnableGraph.run()

      fromProducer.runForeach(msg => println("fast1: " + msg))
      fromProducer.throttle(10, 100.millis, 10, ThrottleMode.Shaping)
        .runForeach(msg => println("fast2: " + msg))

      scala.io.StdIn.readLine()
      killAll.shutdown()
      actorSys.terminate()
    }
