Topic 與 Partition

本文轉載自查看原文 2016-04-06 22:05 2625 kafka學習

Topic在邏輯上可以被認為是一個queue隊列，每條消息都必須指定它的topic，可以簡單理解為必須指明把這條消息放進哪個queue里。為了使得Kafka的吞吐率可以水平擴展，物理上把topic分成一個或多個partition，每個partition在物理上對應一個文件夾，該文件夾下存儲這個partition的所有消息和索引文件。
　　
　　每個日志文件都是“log entries”序列，每一個log entry包含一個4字節整型數（值為N），其后跟N個字節的消息體。每條消息都有一個當前partition下唯一的64字節的offset，它指明了這條消息的起始位置。磁盤上存儲的消費格式如下：
　　message length ： 4 bytes (value: 1+4+n)
　　“magic” value ： 1 byte
　　crc ： 4 bytes
　　payload ： n bytes
　　這個“log entries”並非由一個文件構成，而是分成多個segment，每個segment名為該segment第一條消息的offset和“.kafka”組成。另外會有一個索引文件，它標明了每個segment下包含的log entry的offset范圍，
　　因為每條消息都被append到該partition中，是順序寫磁盤，因此效率非常高（經驗證，順序寫磁盤效率比隨機寫內存還要高，這是Kafka高吞吐率的一個很重要的保證）。
　　
　　每一條消息被發送到broker時，會根據paritition規則選擇被存儲到哪一個partition。如果partition規則設置的合理，所有消息可以均勻分布到不同的partition里，這樣就實現了水平擴展。（如果一個topic對應一個文件，那這個文件所在的機器I/O將會成為這個 topic的性能瓶頸，而partition解決了這個問題）。在創建topic時可以在$KAFKA_HOME/config/server.properties中指定這個partition的數量(如下所示)，當然也可以在topic創建之后去修改parition數量。

# The default number of log partitions per topic. More partitions allow greater # parallelism for consumption, but this will also result in more files across # the brokers. num.partitions=3

　　在發送一條消息時，可以指定這條消息的key，producer根據這個key和partition機制來判斷將這條消息發送到哪個 parition。paritition機制可以通過指定producer的paritition. class這一參數來指定，該class必須實現kafka.producer.Partitioner接口。本例中如果key可以被解析為整數則將對應的整數與partition總數取余，該消息會被發送到該數對應的partition。（每個parition都會有個序號）

import kafka.producer.Partitioner; import kafka.utils.VerifiableProperties; public class JasonPartitioner<T> implements Partitioner { public JasonPartitioner(VerifiableProperties verifiableProperties) {} @Override public int partition(Object key, int numPartitions) { try { int partitionNum = Integer.parseInt((String) key); return Math.abs(Integer.parseInt((String) key) % numPartitions); } catch (Exception e) { return Math.abs(key.hashCode() % numPartitions); } } }

　
　　如果將上例中的class作為partition.class，並通過如下代碼發送20條消息（key分別為0，1，2，3）至topic2（包含4個partition）。　　

public void sendMessage() throws InterruptedException{ 　　for(int i = 1; i <= 5; i++){ 　　 List messageList = new ArrayList<KeyedMessage<String, String>>(); 　　 for(int j = 0; j < 4; j++）{ 　　 messageList.add(new KeyedMessage<String, String>("topic2", j+"", "The " + i + " message for key " + j)); 　　 } 　　 producer.send(messageList); } 　　producer.close(); }

　　則key相同的消息會被發送並存儲到同一個partition里，而且key的序號正好和partition序號相同。（partition序號從0開始，本例中的key也正好從0開始）。如下圖所示。
　　
　　對於傳統的message queue而言，一般會刪除已經被消費的消息，而Kafka集群會保留所有的消息，無論其被消費與否。當然，因為磁盤限制，不可能永久保留所有數據（實際上也沒必要），因此Kafka提供兩種策略去刪除舊數據。一是基於時間，二是基於partition文件大小。例如可以通過配置$KAFKA_HOME/config/server.properties，讓Kafka刪除一周前的數據，也可通過配置讓Kafka在partition文件超過1GB時刪除舊數據，如下所示。

　　############################# Log Retention Policy ############################# # The following configurations control the disposal of log segments. The policy can # be set to delete segments after a period of time, or after a given size has accumulated. # A segment will be deleted whenever *either* of these criteria are met. Deletion always happens # from the end of the log. # The minimum age of a log file to be eligible for deletion log.retention.hours=168 # A size-based retention policy for logs. Segments are pruned from the log as long as the remaining # segments don't drop below log.retention.bytes. #log.retention.bytes=1073741824 # The maximum size of a log segment file. When this size is reached a new log segment will be created. log.segment.bytes=1073741824 # The interval at which log segments are checked to see if they can be deleted according # to the retention policies log.retention.check.interval.ms=300000 # By default the log cleaner is disabled and the log retention policy will default to #just delete segments after their retention expires. # If log.cleaner.enable=true is set the cleaner will be enabled and individual logs #can then be marked for log compaction. log.cleaner.enable=false

　　這里要注意，因為Kafka讀取特定消息的時間復雜度為O(1)，即與文件大小無關，所以這里刪除文件與Kafka性能無關，選擇怎樣的刪除策略只與磁盤以及具體的需求有關。另外，Kafka會為每一個consumer group保留一些metadata信息—當前消費的消息的position，也即offset。這個offset由consumer控制。正常情況下 consumer會在消費完一條消息后線性增加這個offset。當然，consumer也可將offset設成一個較小的值，重新消費一些消息。因為 offet由consumer控制，所以Kafka broker是無狀態的，它不需要標記哪些消息被哪些consumer過，不需要通過broker去保證同一個consumer group只有一個consumer能消費某一條消息，因此也就不需要鎖機制，這也為Kafka的高吞吐率提供了有力保障。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 kafka學習(四)-Topic & Partition Kafka的Topic、Partition和Message 理解 Kafka 中的 Topic 和 Partition Kafka topic中的partition的leader選舉 kafka 修改partition，刪除topic，查詢offset SpringBoot集成kafka實戰 (topic、partition、offset) kafka的log存儲解析——topic的分區partition分段segment以及索引等 kafka的log存儲解析——topic的分區partition分段segment以及索引等 [bigdata] kafka基本命令 -- 遷移topic partition到指定的broker kafka topic消息分配partition規則（Java源碼）