Kafka集群的安裝和使用

本文轉載自查看原文 2016-07-21 14:52 16123 sum_Big Data

（有點像流水賬，好記性不如爛筆頭，權記於此以備忘）

總結：

概念：Broker、Producer、Consumer、Consumer Group、Topic、Partition、Replication

特點：分布式、高可用、高吞吐量、數據只在Partition內有序等

高寫性能的原因：分Partition以支持高並發讀寫，順序追加寫

Kafka是一種高吞吐量的分布式發布訂閱的消息隊列系統，原本開發自LinkedIn，用作LinkedIn的活動流（ActivityStream）和運營數據處理管道（Pipeline）的基礎。現在它已被多家不同類型的公司作為多種類型的數據管道和消息系統使用。

1 Kafka消息隊列簡介

1.1 基本術語

Broker

Kafka集群包含一個或多個服務器，這種服務器被稱為broker[5]
Topic

每條發布到Kafka集群的消息都有一個類別，這個類別被稱為Topic。（物理上不同Topic的消息分開存儲，邏輯上一個Topic的消息雖然保存於一個或多個broker上但用戶只需指定消息的Topic即可生產或消費數據而不必關心數據存於何處）
Partition

Partition是物理上的概念，每個Topic包含一個或多個Partition.
Producer

負責發布消息到Kafka broker
Consumer

消息消費者，向Kafka broker讀取消息的客戶端。
Consumer Group

每個Consumer屬於一個特定的Consumer Group（可為每個Consumer指定group name，若不指定group name則屬於默認的group）。

1.2 消息隊列

1.2.1 基本特性

可擴展
- 在不需要下線的情況下進行擴容
- 數據流分區(partition)存儲在多個機器上
高性能
- 單個broker就能服務上千客戶端
- 單個broker每秒種讀/寫可達每秒幾百兆字節
- 多個brokers組成的集群將達到非常強的吞吐能力
- 性能穩定，無論數據多大
- Kafka在底層摒棄了Java堆緩存機制，采用了操作系統級別的頁緩存，同時將隨機寫操作改為順序寫，再結合Zero-Copy的特性極大地改善了IO性能。
持久存儲
- 存儲在磁盤上
- 冗余備份到其他服務器上以防止丟失

1.2.2 消息格式

一個topic對應一種消息格式，因此消息用topic分類
一個topic代表的消息有1個或者多個patition(s)組成
一個partition中
- 一個partition應該存放在一到多個server上
  - 如果只有一個server，就沒有冗余備份，是單機而不是集群
  - 如果有多個server
    - 一個server為leader，其他servers為followers；leader需要接受讀寫請求；followers僅作冗余備份；leader出現故障，會自動選舉一個follower作為leader，保證服務不中斷；每個server都可能扮演一些partitions的leader和其它partitions的follower角色，這樣整個集群就會達到負載均衡的效果
- 消息按順序存放，順序不可變
- 只能追加消息，不能插入
- 每個消息都有一個offset，用作消息ID, 在一個partition中唯一
- offset由consumer保存和管理，因此讀取順序實際上是完全有consumer決定的，不一定是線性的
- 消息有超時日期，過期則刪除

1.2.3 生產者 producer

producer將消息寫入kafka
寫入要指定topic和partition
消息如何分到不同的partition，算法由producer指定

1.2.4 消費者 consumer

consumer讀取消息並作處理
consumer group
- 這個概念的引入為了支持兩種場景：每條消息分發一個消費者，每條消息廣播給消費組的所有消費者
- 多個consumer group訂閱一個topic，該topci的消息廣播給所有consumer group
- 一條消息發送到一個consumer group后，只能由該group的一個consumer接收和使用
- 一個group中的每個consumer各自對應一個partition可以帶來如下好處
  - 可以按照partition的數目進行並發處理
  - 每個partition都只有一個consumer讀取，因而保證了消息被處理的順序是按照partition的存放順序進行，注意這個順序受到producer存放消息的算法影響

一個Consumer可以有多個線程進行消費，線程數應不多於topic的partition數，因為對於一個包含一或多消費線程的consumer group來說，一個partition只能分給其中的一個消費線程消費，且讓盡可能多的線程能分配到partition（不過實際上真正去消費的線程及線程數還是由線程池的調度機制來決定）。這樣如果線程數比partition數多，那么單射分配也會有多出的線程，它們就不會消費到任何一個partition的數據而空轉耗資源。
如果consumer從多個partition讀到數據，不保證數據間的順序性，kafka只保證在一個partition上數據是有序的，但多個partition，根據你讀的順序會有不同
增減consumer，broker，partition會導致rebalance，所以rebalance后consumer對應的partition會發生變化

2. 安裝和使用

以kafka_2.11-0.10.0.0為例。

下載解壓后，進入kafka_2.11-0.10.0.0/

2.1 啟動Zookeeper

測試時可以使用Kafka附帶的Zookeeper：

啟動： ./bin/zookeeper-server-start.sh config/zookeeper.properties & ，config/zookeeper.properties是Zookeeper的配置文件。

結束： ./bin/zookeeper-server-stop.sh

不過最好自己搭建一個Zookeeper集群，提高可用性和可靠性。詳見：Zookeeper的安裝和使用——MarchOn

2.2 啟動Kafka服務器

2.2.1 配置文件

配置config/server.properties文件，一般需要配置如下字段，其他按默認即可：

broker.id：    　　　　　　每一個broker在集群中的唯一表示，要求是正數
listeners（效果同之前的版本的host.name及port）：注意綁定host.name，否則可能出現莫名其妙的錯誤如consumer找不到broker。這個host.name是Kafka的server的機器名字，會注冊到Zookeeper中
log.dirs：    　　　　　　 kafka數據的存放地址，多個地址的話用逗號分割,多個目錄分布在不同磁盤上可以提高讀寫性能
log.retention.hours：   　數據文件保留多長時間， 存儲的最大時間超過這個時間會根據log.cleanup.policy設置數據清除策略
zookeeper.connect：   　　指定ZooKeeper的connect string，以hostname:port的形式，可有多個以逗號分隔，如hostname1:port1,hostname2:port2,hostname3:port3，還可有路徑，如：hostname1:port1,hostname2:port2,hostname3:port3/kafka，注意要事先在zk中創建/kafka節點，否則會報出錯誤：java.lang.IllegalArgumentException: Path length must be > 0

所有參數的含義及配置可參考：http://orchome.com/12、http://blog.csdn.net/lizhitao/article/details/25667831

一個配置示例如下：

1 # Licensed to the Apache Software Foundation (ASF) under one or more
2 # contributor license agreements. See the NOTICE file distributed with
3 # this work for additional information regarding copyright ownership.
4 # The ASF licenses this file to You under the Apache License, Version 2.0
5 # (the "License"); you may not use this file except in compliance with
6 # the License. You may obtain a copy of the License at
7 #
8 # http://www.apache.org/licenses/LICENSE-2.0
9 #
10 # Unless required by applicable law or agreed to in writing, software
11 # distributed under the License is distributed on an "AS IS" BASIS,
12 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 # See the License for the specific language governing permissions and
14 # limitations under the License.
15 # see kafka.server.KafkaConfig for additional details and defaults
16
17 ############################# Server Basics #############################
18
19 # The id of the broker. This must be set to a unique integer for each broker.
20 broker.id=1
21
22 ############################# Socket Server Settings #############################
23
24 # The address the socket server listens on. It will get the value returned from
25 # java.net.InetAddress.getCanonicalHostName() if not configured.
26 # FORMAT:
27 # listeners = security_protocol://host_name:port
28 # EXAMPLE:
29 # listeners = PLAINTEXT://your.host.name:9092
30 listeners=PLAINTEXT://192.168.6.128:9092
31
32 # Hostname and port the broker will advertise to producers and consumers. If not set,
33 # it uses the value for "listeners" if configured. Otherwise, it will use the value
34 # returned from java.net.InetAddress.getCanonicalHostName().
35 #advertised.listeners=PLAINTEXT://your.host.name:9092
36
37 # The number of threads handling network requests
38 num.network.threads=3
39
40 # The number of threads doing disk I/O
41 num.io.threads=8
42
43 # The send buffer (SO_SNDBUF) used by the socket server
44 socket.send.buffer.bytes=102400
45
46 # The receive buffer (SO_RCVBUF) used by the socket server
47 socket.receive.buffer.bytes=102400
48
49 # The maximum size of a request that the socket server will accept (protection against OOM)
50 socket.request.max.bytes=104857600
51
52
53 ############################# Log Basics #############################
54
55 # A comma seperated list of directories under which to store log files
56 log.dirs=/usr/local/kafka/kafka_2.11-0.10.0.0/kfk_data/
57
58 # The default number of log partitions per topic. More partitions allow greater
59 # parallelism for consumption, but this will also result in more files across
60 # the brokers.
61 num.partitions=2
62 auto.create.topics.enable=false
63
64 # The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
65 # This value is recommended to be increased for installations with data dirs located in RAID array.
66 num.recovery.threads.per.data.dir=1
67
68 ############################# Log Flush Policy #############################
69
70 # Messages are immediately written to the filesystem but by default we only fsync() to sync
71 # the OS cache lazily. The following configurations control the flush of data to disk.
72 # There are a few important trade-offs here:
73 # 1. Durability: Unflushed data may be lost if you are not using replication.
74 # 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
75 # 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.
76 # The settings below allow one to configure the flush policy to flush data after a period of time or
77 # every N messages (or both). This can be done globally and overridden on a per-topic basis.
78
79 # The number of messages to accept before forcing a flush of data to disk
80 #log.flush.interval.messages=10000
81
82 # The maximum amount of time a message can sit in a log before we force a flush
83 #log.flush.interval.ms=1000
84
85 ############################# Log Retention Policy #############################
86
87 # The following configurations control the disposal of log segments. The policy can
88 # be set to delete segments after a period of time, or after a given size has accumulated.
89 # A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
90 # from the end of the log.
91
92 # The minimum age of a log file to be eligible for deletion
93 log.retention.hours=4
94
95 # A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
96 # segments don't drop below log.retention.bytes.
97 #log.retention.bytes=1073741824
98
99 # The maximum size of a log segment file. When this size is reached a new log segment will be created.
100 log.segment.bytes=1073741824
101
102 # The interval at which log segments are checked to see if they can be deleted according
103 # to the retention policies
104 log.retention.check.interval.ms=300000
105
106 ############################# Zookeeper #############################
107
108 # Zookeeper connection string (see zookeeper docs for details).
109 # This is a comma separated host:port pairs, each corresponding to a zk
110 # server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
111 # You can also append an optional chroot string to the urls to specify the
112 # root directory for all kafka znodes.
113 zookeeper.connect=192.168.6.131:2181,192.168.6.132:2181,192.168.6.133:2181
114
115 # Timeout in ms for connecting to zookeeper
116 zookeeper.connection.timeout.ms=6000

View Code

注意auto.create.topics.enable字段，若為true則如果producer寫入某個不存在的topic時會自動創建該topic，若為false則需要事先創建否則會報錯：failed after 3 retries。

2.2.2 命令

啟動： bin/kafka-server-start.sh config/server.properties ，生產環境最好以守護程序啟動：nohup &

結束： bin/kafka-server-stop.sh

2.2.3 Kafka在Zookeeper中的存儲結構

若上述的zookeeper.connect的值沒有路徑，則為根路徑，啟動Zookeeper和Kafka，命令行連接Zookeeper后，用 get / 命令可發現有 consumers、config、controller、admin、brokers、zookeeper、controller_epoch 這幾個目錄。

其結構如下：（具體可參考：apache kafka系列之在zookeeper中存儲結構）

2.3 使用

kafka本身是和zookeeper相連的，而對應producer和consumer的狀態保存也都是通過zookeeper完成的。對Kafka的各種操作通過其所連接的Zookeeper完成。

2.3.1 命令行客戶端

創建topic： bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

列出所有topic： bin/kafka-topics.sh --list --zookeeper localhost:2181

查看topic信息（包括分區、副本情況等）： kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic ，會列出分區數、副本數、副本leader節點、副本節點、活着的副本節點

往某topic生產消息： bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

從某topic消費消息： bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning （默認用一個線程消費指定topic的所有分區的數據）

刪除某個Kafka groupid：連接Zookeeper后用rmr命令，如刪除名為JSI的消費組： rmr /consumers/JSI

查看消費進度：

./bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group test-mirror-consumer-zsm --zkconnect ec2-12345.cn-north-1.compute.amazonaws.com.cn:2181/kafka/blink/0822 --topic GPS2
    各參數：
    --group指MirrorMaker消費源集群時指定的group.id
    -zkconnect指源集群的zookeeper地址
    --topic指定查的topic，沒指定則返回所有topic的消費情況

2.3.2 Java客戶端

1、Topic操作：

 1 import kafka.admin.DeleteTopicCommand;
 2 import kafka.admin.TopicCommand;
 3 
 4 /**
 5  * @author zsm
 6  * @date 2016年9月27日 上午10:26:42
 7  * @version 1.0
 8  * @parameter
 9  * @since
10  * @return
11  */
12 public class JTopic {
13     public static void createTopic(String zkAddr, String topicName, int partition, int replication) {
14         String[] options = new String[] { "--create", "--zookeeper", zkAddr, "--topic", topicName, "--partitions",
15                 partition + "", "--replication-factor", replication + "" };
16         TopicCommand.main(options);
17     }
18 
19     public static void listTopic(String zkAddr) {
20         String[] options = new String[] { "--list", "--zookeeper", zkAddr };
21         TopicCommand.main(options);
22     }
23 
24     public static void describeTopic(String zkAddr, String topicName) {
25         String[] options = new String[] { "--describe", "--zookeeper", zkAddr, "--topic", topicName, };
26         TopicCommand.main(options);
27     }
28 
29     public static void alterTopic(String zkAddr, String topicName) {
30         String[] options = new String[] { "--alter", "--zookeeper", zkAddr, "--topic", topicName, "--partitions", "5" };
31         TopicCommand.main(options);
32     }
33 
34     // 通過刪除zk里面對應的路徑來實現刪除topic的功能,只會刪除zk里面的信息，Kafka上真實的數據並沒有刪除
35     public static void deleteTopic(String zkAddr, String topicName) {
36         String[] options = new String[] { "--zookeeper", zkAddr, "--topic", topicName };
37         DeleteTopicCommand.main(options);
38     }
39 
40     public static void main(String[] args) {
41         // TODO Auto-generated method stub
42 
43         String myTestTopic = "ZsmTestTopic";
44         int myPartition = 4;
45         int myreplication = 1;
46 
47         //createTopic(ConfigureAPI.KafkaProperties.ZK, myTestTopic, myPartition, myreplication);
48         // listTopic(ConfigureAPI.KafkaProperties.ZK);
49         describeTopic(ConfigureAPI.KafkaProperties.ZK, myTestTopic);
50         // alterTopic(ConfigureAPI.KafkaProperties.ZK, myTestTopic);
51         // deleteTopic(ConfigureAPI.KafkaProperties.ZK, myTestTopic);
52     }
53 
54 }

View Code

2、寫：（寫時可以指定key以供Kafka根據key將數據寫入某個分區，若無指定，則幾乎就是隨機找一個分區發送無key的消息，然后把這個分區號加入到緩存中以備后面直接使用——當然，Kafka本身也會清空該緩存（默認每10分鍾或每次請求topic元數據時））

  1 package com.zsm.kfkdemo;
  2 
  3 import java.util.ArrayList;
  4 import java.util.List;
  5 import java.util.Properties;
  6 
  7 import com.zsm.kfkdemo.ConfigureAPI.KafkaProperties;
  8 
  9 import kafka.javaapi.producer.Producer;
 10 import kafka.producer.KeyedMessage;
 11 import kafka.producer.ProducerConfig;
 12 
 13 /**
 14  * 可以指定規則(key和分區函數)以讓消息寫到特定分區：
 15  * <p>
 16  * 1、若發送的消息沒有指定key則Kafka會隨機選擇一個分區
 17  * </p>
 18  * <p>
 19  * 2、否則，若指定了分區函數(通過partitioner.class)則該函數以key為參數確定寫到哪個分區
 20  * </p>
 21  * <p>
 22  * 3、否則，Kafka根據hash(key)%partitionNum確定寫到哪個分區
 23  * </p>
 24  * 
 25  * @author zsm
 26  * @date 2016年9月27日 上午10:26:42
 27  * @version 1.0
 28  * @parameter
 29  * @since
 30  * @return
 31  */
 32 public class JProducer extends Thread {
 33     private Producer<String, String> producer;
 34     private String topic;
 35     private final int SLEEP = 10;
 36     private final int msgNum = 1000;
 37 
 38     public JProducer(String topic) {
 39         Properties props = new Properties();
 40         props.put("metadata.broker.list", KafkaProperties.BROKER_LIST);// 如192.168.6.127:9092,192.168.6.128:9092
 41         // request.required.acks
 42         // 0, which means that the producer never waits for an acknowledgement from the broker (the same behavior as 0.7). This option provides the lowest latency but the weakest durability guarantees
 43         // (some data will be lost when a server fails).
 44         // 1, which means that the producer gets an acknowledgement after the leader replica has received the data. This option provides better durability as the client waits until the server
 45         // acknowledges the request as successful (only messages that were written to the now-dead leader but not yet replicated will be lost).
 46         // -1, which means that the producer gets an acknowledgement after all in-sync replicas have received the data. This option provides the best durability, we guarantee that no messages will be
 47         // lost as long as at least one in sync replica remains.
 48         props.put("request.required.acks", "-1");
 49         // 配置value的序列化類
 50         props.put("serializer.class", "kafka.serializer.StringEncoder");
 51         // 配置key的序列化類
 52         props.put("key.serializer.class", "kafka.serializer.StringEncoder");
 53         // 提供自定義的分區函數將消息寫到分區上，未指定的話Kafka根據hash(messageKey)%partitionNum確定寫到哪個分區
 54         props.put("partitioner.class", "com.zsm.kfkdemo.MyPartitioner");
 55         producer = new Producer<String, String>(new ProducerConfig(props));
 56         this.topic = topic;
 57     }
 58 
 59     @Override
 60     public void run() {
 61         boolean isBatchWriteMode = true;
 62         System.out.println("isBatchWriteMode: " + isBatchWriteMode);
 63         if (isBatchWriteMode) {
 64             // 批量發送
 65             int batchSize = 100;
 66             List<KeyedMessage<String, String>> msgList = new ArrayList<KeyedMessage<String, String>>(batchSize);
 67             for (int i = 0; i < msgNum; i++) {
 68                 String msg = "Message_" + i;
 69                 msgList.add(new KeyedMessage<String, String>(topic, i + "", msg));
 70                 // msgList.add(new KeyedMessage<String, String>(topic, msg));//未指定key，Kafka會自動選擇一個分區
 71                 if (i % batchSize == 0) {
 72                     producer.send(msgList);
 73                     System.out.println("Send->[" + msgList + "]");
 74                     msgList.clear();
 75                     try {
 76                         sleep(SLEEP);
 77                     } catch (Exception ex) {
 78                         ex.printStackTrace();
 79                     }
 80                 }
 81             }
 82             producer.send(msgList);
 83         } else {
 84             // 單個發送
 85             for (int i = 0; i < msgNum; i++) {
 86                 KeyedMessage<String, String> msg = new KeyedMessage<String, String>(topic, i + "", "Message_" + i);
 87                 // KeyedMessage<String, String> msg = new KeyedMessage<String, String>(topic, "Message_" + i);//未指定key，Kafka會自動選擇一個分區
 88                 producer.send(msg);
 89                 System.out.println("Send->[" + msg + "]");
 90                 try {
 91                     sleep(SLEEP);
 92                 } catch (Exception ex) {
 93                     ex.printStackTrace();
 94                 }
 95             }
 96         }
 97 
 98         System.out.println("send done");
 99     }
100 
101     public static void main(String[] args) {
102         JProducer pro = new JProducer(KafkaProperties.TOPIC);
103         pro.start();
104     }
105 }

View Code

3、讀：（對於Consumer，需要注意 auto.commit.enable 和 auto.offset.reset 這兩個字段）

 1 package com.zsm.kfkdemo;
 2 
 3 import java.text.MessageFormat;
 4 import java.util.HashMap;
 5 import java.util.List;
 6 import java.util.Map;
 7 import java.util.Properties;
 8 
 9 import com.zsm.kfkdemo.ConfigureAPI.KafkaProperties;
10 
11 import kafka.consumer.Consumer;
12 import kafka.consumer.ConsumerConfig;
13 import kafka.consumer.ConsumerIterator;
14 import kafka.consumer.KafkaStream;
15 import kafka.javaapi.consumer.ConsumerConnector;
16 import kafka.message.MessageAndMetadata;
17 
18 /**
19  * 同一consumer group的多線程消費可以兩種方法實現：
20  * <p>
21  * 1、實現單線程客戶端，啟動多個去消費
22  * </p>
23  * <p>
24  * 2、在客戶端的createMessageStreams里為topic指定大於1的線程數，再啟動多個線程處理每個stream
25  * </p>
26  * 
27  * @author zsm
28  * @date 2016年9月27日 上午10:26:42
29  * @version 1.0
30  * @parameter
31  * @since
32  * @return
33  */
34 public class JConsumer extends Thread {
35 
36     private ConsumerConnector consumer;
37     private String topic;
38     private final int SLEEP = 20;
39 
40     public JConsumer(String topic) {
41         consumer = Consumer.createJavaConsumerConnector(this.consumerConfig());
42         this.topic = topic;
43     }
44 
45     private ConsumerConfig consumerConfig() {
46         Properties props = new Properties();
47         props.put("zookeeper.connect", KafkaProperties.ZK);
48         props.put("group.id", KafkaProperties.GROUP_ID);
49         props.put("auto.commit.enable", "true");// 默認為true，讓consumer定期commit offset，zookeeper會將offset持久化，否則只在內存，若故障則再消費時會從最后一次保存的offset開始
50         props.put("auto.commit.interval.ms", KafkaProperties.INTERVAL + "");// 經過INTERVAL時間提交一次offset
51         props.put("auto.offset.reset", "largest");// What to do when there is no initial offset in ZooKeeper or if an offset is out of range
52         props.put("zookeeper.session.timeout.ms", KafkaProperties.TIMEOUT + "");
53         props.put("zookeeper.sync.time.ms", "200");
54         return new ConsumerConfig(props);
55     }
56 
57     @Override
58     public void run() {
59         Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
60         topicCountMap.put(topic, new Integer(1));// 線程數
61         Map<String, List<KafkaStream<byte[], byte[]>>> streams = consumer.createMessageStreams(topicCountMap);
62         KafkaStream<byte[], byte[]> stream = streams.get(topic).get(0);// 若上面設了多個線程去消費，則這里需為每個stream開個線程做如下的處理
63 
64         ConsumerIterator<byte[], byte[]> it = stream.iterator();
65         MessageAndMetadata<byte[], byte[]> messageAndMetaData = null;
66         while (it.hasNext()) {
67             messageAndMetaData = it.next();
68             System.out.println(MessageFormat.format("Receive->[ message:{0} , key:{1} , partition:{2} , offset:{3} ]",
69                     new String(messageAndMetaData.message()), new String(messageAndMetaData.key()),
70                     messageAndMetaData.partition() + "", messageAndMetaData.offset() + ""));
71             try {
72                 sleep(SLEEP);
73             } catch (Exception ex) {
74                 ex.printStackTrace();
75             }
76         }
77     }
78 
79     public static void main(String[] args) {
80         JConsumer con = new JConsumer(KafkaProperties.TOPIC);
81         con.start();
82     }
83 }

View Code

與Kafka相關的Maven依賴：

 1         <dependency>
 2             <groupId>org.apache.kafka</groupId>
 3             <artifactId>kafka_2.9.2</artifactId>
 4             <version>0.8.1.1</version>
 5             <exclusions>
 6                 <exclusion>
 7                     <groupId>com.sun.jmx</groupId>
 8                     <artifactId>jmxri</artifactId>
 9                 </exclusion>
10                 <exclusion>
11                     <groupId>com.sun.jdmk</groupId>
12                     <artifactId>jmxtools</artifactId>
13                 </exclusion>
14                 <exclusion>
15                     <groupId>javax.jms</groupId>
16                     <artifactId>jms</artifactId>
17                 </exclusion>
18             </exclusions>
19         </dependency>

View Code

3 MirrorMaker

Kafka自身提供的MirrorMaker工具用於把一個集群的數據同步到另一集群，其原理就是對源集群消費、對目標集群生產。

運行時需要指定源集群的Zookeeper地址（pull模式）或目標集群的Broker列表（push模式）。

3.1 使用

運行 ./kafka-run-class.sh kafka.tools.MirrorMaker --help 查看使用說明，如下：

 1 Option                                  Description                            
 2 ------                                  -----------                            
 3 --blacklist <Java regex (String)>       Blacklist of topics to mirror.         
 4 --consumer.config <config file>         Consumer config to consume from a      
 5                                           source cluster. You may specify      
 6                                           multiple of these.                   
 7 --help                                  Print this message.                    
 8 --num.producers <Integer: Number of     Number of producer instances (default: 
 9   producers>                              1)                                   
10 --num.streams <Integer: Number of       Number of consumption streams.         
11   threads>                                (default: 1)                         
12 --producer.config <config file>         Embedded producer config.              
13 --queue.size <Integer: Queue size in    Number of messages that are buffered   
14   terms of number of messages>            between the consumer and producer    
15                                           (default: 10000)                     
16 --whitelist <Java regex (String)>       Whitelist of topics to mirror.

View Code

3.2 啟動

./bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config  zsmSourceClusterConsumer.config  --num.streams 2 --producer.config zsmTargetClusterProducer.config --whitelist="ds*"
    --consumer.config所指定的文件里至少需要有zookeeper.connect、group.id兩字段
    --producer.config至少需要有metadata.broker.list字段，指定目標集群的brooker列表
    --whitelist指定要同步的topic

可以用2.3.1所說的查看消費進度來查看對原集群的同步狀況（即消費狀況）。

4 Kafka監控工具（KafkaOffsetMonitor）

可以借助KafkaOffsetMonitor來圖形化展示Kafka的broker節點、topic、consumer及offset等信息。

以KafkaOffsetMonitor-assembly-0.2.0.jar為例，下載后執行：

#!/bin/bash
java -Xms512M -Xmx512M -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -cp KafkaOffsetMonitor-assembly-0.2.0.jar \ com.quantifind.kafka.offsetapp.OffsetGetterWeb \ --zk 192.168.5.131:2181,192.168.6.132:2181,192.168.6.133:2181 \ --port 8087 \ --refresh 10.seconds \ --retain 1.days 1>./zsm-logs/stdout.log 2>./zsm-logs/stderr.log &

其中，zk按照host1:port1,host2:port2…的格式去寫即可，port為開啟web界面的端口號，refresh為刷新時間，retain為數據保留時間（單位seconds, minutes, hours, days）

5 Kafka集群管理工具（Kafka Manager）

kafka-manager是yahoo開源出來的項目，屬於商業級別的工具用Scala編寫。

這個管理工具可以很容易地發現分布在集群中的哪些topic分布不均勻，或者是分區在整個集群分布不均勻的的情況。它支持管理多個集群、選擇副本、副本重新分配以及創建Topic。同時，這個管理工具也是一個非常好的可以快速瀏覽這個集群的工具。

此工具以集群的方式運行，需要Zookeeper。

參考資料：http://hengyunabc.github.io/kafka-manager-install/

5.1 安裝

需要從Github下載源碼並安裝sbt工具編譯生成安裝包，生成的時間很長且不知為何一直出錯，所以這里用網友已編譯好的包（備份鏈接）。

包為kafka-manager-1.0-SNAPSHOT.zip

>解壓：

unzip kafka-manager-1.0-SNAPSHOT.zip

>配置conf/application.conf里的kafka-manager.zkhosts：

kafka-manager.zkhosts="192.168.6.131:2181,192.168.6.132:2181,192.168.6.133:2181"

>啟動：

./bin/kafka-manager -Dconfig.file=conf/application.conf （啟動后在Zookeeper根目錄下可發現增加了kafka-manager目錄）

默認是9000端口，要使用其他端口可以在命令行指定http.port，此外kafka-manager.zkhosts也可以在命令行指定，如：

./bin/kafka-manager -Dhttp.port=9001 -Dkafka-manager.zkhosts="192.168.6.131:2181,192.168.6.132:2181,192.168.6.133:2181"

5.2 使用

訪問web頁面，在Cluster->Add Cluster，輸入要監控的Kafka集群的Zookeeper即可。

6 進階

在當前的kafka版本實現中，對於zookeeper的所有操作都是由kafka controller來完成的（serially的方式）
offset管理：kafka會記錄offset到zk中。但是，zk client api對zk的頻繁寫入是一個低效的操作。0.8.2 kafka引入了native offset storage，將offset管理從zk移出，並且可以做到水平擴展。其原理就是利用了kafka的compacted topic，offset以consumer group,topic與partion的組合作為key直接提交到compacted topic中。同時Kafka又在內存中維護了三元組來維護最新的offset信息，consumer來取最新offset信息時直接從內存拿即可。當然，kafka允許你快速checkpoint最新的offset信息到磁盤上。
如何確定分區數：分區數的確定與硬件、軟件、負載情況等都有關，要視具體情況而定，不過依然可以遵循一定的步驟來嘗試確定分區數：創建一個只有1個分區的topic，然后測試這個topic的producer吞吐量和consumer吞吐量。假設它們的值分別是Tp和Tc，單位是MB/s。然后假設總的目標吞吐量是Tt，那么分區數 = Tt / max(Tp, Tc)
Kafka Exactly Once語義與事務機制：http://www.jasongj.com/kafka/transaction/

7 參考資料

1、http://www.cnblogs.com/fanweiwei/p/3689034.html（Kafka的使用）

2、http://orchome.com/12（Broker的配置）

3、http://blog.csdn.net/lizhitao/article/details/25667831（Broker的配置）

4、http://www.jasongj.com/2015/01/02/Kafka%E6%B7%B1%E5%BA%A6%E8%A7%A3%E6%9E%90/（進階——Kafka深度解析）

5、http://www.cnblogs.com/huxi2b/p/4757098.html?utm_source=tuicool&utm_medium=referral（如何確定分區數、key、consumer線程數）

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Kafka集群的安裝和部署 kafka集群的安裝和配置 kafka系列一、kafka安裝及部署、集群搭建 docker安裝kafka+kafka-manager集群 linux環境安裝kafka集群 zookeeper、kafka集群安裝部署 Kafka集群管理工具kafka-manager的安裝使用 kafka_2.12-2.5.0 集群安裝 centos7下kafka集群安裝部署 kafka的安裝及基本使用