Kafka is a popular message middleware in today's big data ecosystem. Data in Kafka is organized into topics, and each topic is in turn made up of one or more partitions. Sometimes, when consuming data from Kafka, we want the data across all partitions of a topic to be globally ordered. The usual way to guarantee this is to give the topic exactly one partition, but then consumption has no parallelism, which hurts throughput.
Flink, however, can guarantee a globally ordered view of the Kafka data while still consuming with multiple parallel consumers, and this is an effect of Flink's time semantics. When Flink creates a Kafka source, it can attach a timestamp to every record and generate corresponding watermarks. Event time thus imposes a global, time-based order on the Kafka data, and as long as Flink processes the records by event time, that global order is preserved.
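As a minimal end-to-end sketch of this setup (the broker address, group id, topic name "myTopic", and the record format "epochMillis,payload" with strictly ascending timestamps per partition are all illustrative assumptions, not from the original post):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

public class KafkaEventTimeJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Process records by event time, not by arrival order.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("group.id", "event-time-demo");         // assumed group id

        // Each record is assumed to be an "epochMillis,payload" string, with the
        // leading timestamp strictly ascending within each partition.
        FlinkKafkaConsumer09<String> kafkaSource =
                new FlinkKafkaConsumer09<>("myTopic", new SimpleStringSchema(), props);

        // Attaching the assigner to the consumer itself makes Flink generate one
        // watermark per Kafka partition and merge them, so parallel consumption
        // does not break the event-time order.
        kafkaSource.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
            @Override
            public long extractAscendingTimestamp(String record) {
                return Long.parseLong(record.split(",", 2)[0]);
            }
        });

        DataStream<String> stream = env.addSource(kafkaSource);
        stream.print();

        env.execute("per-partition watermarks demo");
    }
}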
The following section is taken from the Flink official documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_timestamps_watermarks.html
Timestamps per Kafka Partition
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
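For example, if three partitions currently have watermarks 10:03, 10:05, and 10:01, the merged watermark emitted downstream is 10:01, i.e. the minimum of the per-partition watermarks.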
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks.
The example below shows how to use per-Kafka-partition watermark generation; in this setup, watermarks propagate through the streaming dataflow the same way they are merged on stream shuffles.
FlinkKafkaConsumer09<MyType> kafkaSource = new FlinkKafkaConsumer09<>("myTopic", schema, props);

// Attach the assigner to the consumer itself, so that watermarks are generated
// per Kafka partition inside the consumer.
kafkaSource.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MyType>() {
    @Override
    public long extractAscendingTimestamp(MyType element) {
        return element.eventTimestamp();
    }
});
DataStream<MyType> stream = env.addSource(kafkaSource);
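If the timestamps in each partition are only bounded out-of-order rather than strictly ascending (the other per-partition pattern mentioned above), the same consumer-level assignment works with a BoundedOutOfOrdernessTimestampExtractor. A sketch, assuming a 10-second bound and the same MyType, schema, and props as in the documentation's example:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

FlinkKafkaConsumer09<MyType> kafkaSource = new FlinkKafkaConsumer09<>("myTopic", schema, props);

// Each per-partition watermark lags the maximum timestamp seen in that
// partition by the chosen bound (10 seconds here).
kafkaSource.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<MyType>(Time.seconds(10)) {
        @Override
        public long extractTimestamp(MyType element) {
            return element.eventTimestamp();
        }
    });

DataStream<MyType> stream = env.addSource(kafkaSource);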