1 問題背景

Flume向kafka發布數據時，發現kafka接收到的數據總是在一個partition中，而我們希望發布來的數據在所有的partition平均分布

2 解決辦法

Flume的官方文檔是這么說的：

Kafka Sink uses the topic and key properties from the FlumeEvent headers to send events to Kafka. If topic exists in the headers, the event will be sent to that specific topic, overriding the topic configured for the Sink. If key exists in the headers, the key will used by Kafka to partition the data between the topic partitions. Events with same key will be sent to the same partition. If the key is null, events will be sent to random partitions.

其實以上文檔中說的很清楚了，kafka-sink是從header里的key參數來確定將數據發到kafka的哪個分區中。如果為null，那么就會隨機發布至分區中。但我測試的結果是flume發布的數據會發布到一個分區中的。

所以，我們需要向header中寫上隨機的key，然后數據才會真正的向kafka分區進行隨機發布。

我們的辦法是，向flume添加攔截器，官方文檔說有一個UUID Interceptor，會為每個event的head添加一個隨機唯一的key。其實我們直接用這個即可。

在flume添加的配置文件如下：

hiveview.sources.tailSource.interceptors = i2

hiveview.sources.tailSource.interceptors.i2.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

hiveview.sources.tailSource.interceptors.i2.headerName=key

hiveview.sources.tailSource.interceptors.i2.preserveExisting=false

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 docker 單kafka ,多分區使用Flume往kafka和hdfs里同時寫數據 Kafka系列四之多分區如何保證數據的有序性 flume 讀取kafka 數據 flume從Kafka消費數據到HDFS Flume同時輸出數據到HDFS和kafka flume斷點續傳（防止重復消費）的解決方案和flume 向hdfs sink寫數據小文件過多問題 Flume采集文件數據到Kafka 使用Flume消費Kafka數據到HDFS MySQL數據實時增量同步到Kafka - Flume