Introduction to Kafka and Its Components [Repost]


Kafka: A Distributed Publish-Subscribe Messaging System
Note: this article is translated from the official documentation.

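1. Introduction
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

In other words, Kafka is a distributed publish-subscribe messaging system whose design sets it apart from conventional messaging systems.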

Kafka maintains feeds of messages in categories called topics.
We’ll call processes that publish messages to a Kafka topic producers.
We’ll call processes that subscribe to topics and process the feed of published messages consumers.
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
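1) The message feeds that Kafka maintains are organized into categories called topics.
2) A process that publishes messages is called a producer.
3) A process that subscribes to and consumes messages is called a consumer.
4) Kafka runs as a cluster of one or more servers, each of which is called a broker.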


2. Components
Topics and Logs
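A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:

[Figure: the partitioned log of a topic]

A topic can be thought of as a category of messages. The Kafka cluster splits each topic into multiple partitions, laid out logically as in the figure above.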

Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.

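Each partition is an ordered, immutable sequence of messages, stored as an append-only log file. Any message published to the partition is appended directly to the end of that log. A message's position in the log is called its offset, a long integer that uniquely identifies the message within the partition.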
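
The append-only behaviour described above can be sketched with a toy in-memory model. This is only an illustration: the class name PartitionLog and its methods are hypothetical and are not part of any Kafka API.

```python
class PartitionLog:
    """Toy in-memory model of one partition: an append-only list of messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message at the tail and return the offset assigned to it."""
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset):
        """Return the message stored at the given offset."""
        return self._messages[offset]


log = PartitionLog()
offsets = [log.append(m) for m in ("a", "b", "c")]
print(offsets)      # sequential offsets: [0, 1, 2]
print(log.read(1))  # the second message: b
```

Existing entries are never modified; the offset is simply the position at which a message was appended.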

The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size so retaining lots of data is not a problem.

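The Kafka cluster retains every published message for a configurable period, whether or not it has been consumed. For example, with log retention set to two days, a message stays available for two days after it is published (not after it is consumed) and is then discarded to free disk space. Kafka's performance is effectively constant with respect to data size, so retaining a large amount of data is not a problem.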
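
A minimal sketch of time-based retention, assuming a toy in-memory log with explicit timestamps. The names RetainedLog, expire and available_offsets are hypothetical; a real broker manages retention through its own configuration and on-disk files.

```python
from collections import deque


class RetainedLog:
    """Toy log that drops messages by age, regardless of consumption."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._entries = deque()  # (offset, publish_time, message)
        self._next_offset = 0

    def append(self, message, now):
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def expire(self, now):
        """Discard entries older than the retention window."""
        while self._entries and now - self._entries[0][1] > self.retention_seconds:
            self._entries.popleft()

    def available_offsets(self):
        return [offset for offset, _, _ in self._entries]


TWO_DAYS = 2 * 24 * 60 * 60
log = RetainedLog(TWO_DAYS)
log.append("old", now=0)
log.append("new", now=TWO_DAYS)
log.expire(now=TWO_DAYS + 1)
print(log.available_offsets())  # only the newer message remains: [1]
```

Note that the age is measured from publication time, and that offsets are never reused after older messages are discarded.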

In fact the only metadata retained on a per-consumer basis is the position of the consumer in the log, called the “offset”. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads messages, but in fact the position is controlled by the consumer and it can consume messages in any order it likes. For example a consumer can reset to an older offset to reprocess.

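The only metadata kept per consumer is its offset, that is, its position in the log. The consumer controls this value: normally the offset advances linearly, so messages are consumed in sequence, but the consumer can set it to any position. For example, a consumer can reset to an older offset to reprocess messages.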
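
The idea that the consumer owns its position can be sketched as follows. SimpleConsumer, poll and seek are hypothetical names for a toy model over a plain Python list, not the interface of a real Kafka client.

```python
class SimpleConsumer:
    """Toy consumer whose only state is its offset into a shared log."""

    def __init__(self, log):
        self.log = log    # a list of messages, indexed by offset
        self.offset = 0   # position of the next message to read

    def poll(self):
        """Return the next message and advance, or None at the end of the log."""
        if self.offset >= len(self.log):
            return None
        message = self.log[self.offset]
        self.offset += 1
        return message

    def seek(self, offset):
        """Move to an arbitrary position, for example to reprocess messages."""
        self.offset = offset


log = ["m0", "m1", "m2"]
consumer = SimpleConsumer(log)
first_pass = [consumer.poll() for _ in range(3)]
consumer.seek(1)                      # rewind to an older offset
replayed = [consumer.poll() for _ in range(2)]
other = SimpleConsumer(log)           # a second consumer is unaffected
print(first_pass, replayed, other.poll())
```

Because reading never changes the log itself, a second consumer starting from offset 0 sees the same data, which is why consumers are cheap to add and remove.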

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.

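These properties make Kafka consumers very lightweight: they can come and go with little impact on the cluster or on other consumers. For example, you can use the command-line tools to "tail" the contents of any topic without changing what any existing consumer consumes.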

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

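Partitioning the log serves several purposes. First, it lets a topic grow beyond what a single server can hold: each individual partition must fit on the servers that host it, but a topic can have many partitions and can therefore handle an arbitrary amount of data. Second, partitions are the unit of parallelism.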

Partition
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

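Within a Kafka cluster, the partitions of a topic are distributed across multiple servers, and each server handles reads and writes for its share of the partitions. Each partition can be replicated to a configurable number of servers for fault tolerance.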

Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

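Each partition has one leader and zero or more followers. The leader handles all read and write requests for the partition, while the followers passively replicate it and stand by as candidates. If the leader fails, one of the followers automatically becomes the new leader. Each server leads some partitions and follows others, which balances load across the cluster.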
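
A toy sketch of leader failover for a single partition. The class PartitionReplicas and the broker names are hypothetical, and promoting the first remaining follower is an arbitrary assumption: the text above does not say how the new leader is chosen.

```python
class PartitionReplicas:
    """Toy model of one partition's replica set: one leader, N-1 followers."""

    def __init__(self, brokers):
        self.leader = brokers[0]
        self.followers = list(brokers[1:])

    def fail(self, broker):
        """Remove a failed broker; promote a follower if it was the leader."""
        if broker == self.leader:
            if not self.followers:
                raise RuntimeError("no replica left to take over")
            # Assumption: promote the first remaining follower.
            self.leader = self.followers.pop(0)
        elif broker in self.followers:
            self.followers.remove(broker)


p0 = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
p0.fail("broker-1")
print(p0.leader, p0.followers)
```

After the failure, reads and writes for this partition would go to the promoted broker, while the remaining follower keeps replicating.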

Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message). More on the use of partitioning in a second.

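Producers send messages to the topics of their choice, and the producer decides which partition each message goes to, for example round-robin for simple load balancing, or a partition function based on some key in the message.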
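
The two partitioning strategies mentioned above can be sketched like this. Both helper functions are hypothetical, and zlib.crc32 is used only as a convenient stable hash; it is not claimed to be the hash a real Kafka producer uses.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Return a function that cycles through partitions to spread load."""
    counter = itertools.count()

    def choose(_key=None):
        return next(counter) % num_partitions

    return choose


def key_partitioner(num_partitions):
    """Return a function that maps equal keys to the same partition."""

    def choose(key):
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose


rr = round_robin_partitioner(3)
rr_choices = [rr() for _ in range(6)]
print(rr_choices)  # cycles through the partitions: [0, 1, 2, 0, 1, 2]

by_key = key_partitioner(3)
print(by_key("user-42") == by_key("user-42"))  # same key, same partition
```

Round-robin only balances load, whereas key-based partitioning keeps all messages for one key in one partition, which matters for the ordering guarantees discussed later.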

Consumers
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these—the consumer group.

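Messaging traditionally follows two models: queuing and publish-subscribe. In the queuing model, a pool of consumers reads from a server and each message goes to exactly one of them. In the publish-subscribe model, each message is broadcast to all consumers. Kafka's consumer group abstraction generalizes both.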

Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

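Every consumer belongs to a consumer group. A group that subscribes to a topic receives every message published to that topic, and each message is delivered to only one consumer instance within the group. Consumer instances can run in separate processes or on separate machines.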

If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all consumers share the same group, this behaves like a traditional queue: messages are load-balanced across the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

If every consumer has a different group, this behaves like publish-subscribe: every message is broadcast to all consumers.

More commonly, however, we have found that topics have a small number of consumer groups, one for each “logical subscriber”. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

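In practice, however, most topics have only a small number of consumer groups, one per logical subscriber. Each group consists of many consumer instances for scalability and fault tolerance. This is still publish-subscribe, except that the subscriber is a cluster of consumers rather than a single process.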
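
How one abstraction covers both models can be sketched with a toy delivery function. The function deliver and the group and consumer names are hypothetical; picking the consumer round-robin inside each group is a simplifying assumption made only for this illustration, and consumer names are assumed to be unique across groups.

```python
import itertools


def deliver(messages, groups):
    """Deliver each message to exactly one consumer in every group.

    groups maps a group name to the list of its consumer names.
    Returns a dict mapping each consumer name to the messages it received.
    """
    received = {c: [] for members in groups.values() for c in members}
    pickers = {g: itertools.cycle(members) for g, members in groups.items()}
    for message in messages:
        for group in groups:
            received[next(pickers[group])].append(message)
    return received


messages = ["m0", "m1", "m2", "m3"]

# One shared group: behaves like a queue, each message goes to one consumer.
queue_like = deliver(messages, {"g": ["c1", "c2"]})

# One group per consumer: behaves like publish-subscribe, everyone gets everything.
pubsub_like = deliver(messages, {"g1": ["c1"], "g2": ["c2"]})

print(queue_like)
print(pubsub_like)
```

The mixed case from the text, a few groups with several consumers each, is just a call with more than one group and more than one member per group.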

Kafka has stronger ordering guarantees than a traditional messaging system, too.

Kafka also provides stronger ordering guarantees than a traditional messaging system.

A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

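A traditional queue keeps messages in order on the server, and when multiple consumers read from the queue the server hands messages out in that order. Delivery to the consumers is asynchronous, however, so messages may arrive at different consumers out of order. In effect, ordering is lost under parallel consumption. Messaging systems often work around this with an "exclusive consumer" that allows only one process to consume from a queue, which of course removes any parallelism.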

 

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

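Kafka does better. Because topics have a built-in unit of parallelism, the partition, Kafka can provide both ordering guarantees and load balancing over a pool of consumers. It does this by assigning the partitions of a topic to the consumers in a consumer group, so that each partition is consumed by exactly one consumer in the group. That consumer is the only reader of its partition and consumes the data in order. Because there are many partitions, the load is still balanced across many consumer instances. Note, however, that a consumer group cannot have more consumer instances than partitions.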
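
The partition-to-consumer assignment described above can be sketched as a simple function. assign_partitions is a hypothetical helper, and dealing partitions out round-robin is an assumption chosen for brevity, not a statement about the strategy Kafka itself applies. In this toy model a surplus consumer simply ends up with no partitions.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
print(balanced)  # each consumer owns two partitions

surplus = assign_partitions([0, 1], ["c1", "c2", "c3"])
print(surplus)   # the third consumer has nothing to read
```

Since each partition has a single owner within the group, that owner reads it in offset order, while the group as a whole still processes partitions in parallel.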

Kafka only provides a total order over messages within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

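Kafka guarantees ordering only within a partition, not across the partitions of a topic. Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications. If you need a total order over all messages in a topic, use a topic with a single partition, which also means only one consumer process per consumer group.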

Guarantees
At a high level, Kafka gives the following guarantees:
1. Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
2. A consumer instance sees messages in the order they are stored in the log.
3. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.

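In summary, Kafka guarantees the following:
1) Messages that a producer sends to a particular topic partition are appended to the log in the order they are sent.
2) A consumer instance sees messages in the order they are stored in the log.
3) For a topic with replication factor N, up to N-1 server failures are tolerated without losing any message committed to the log.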
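
The N-1 figure in the third guarantee follows from simple counting: a committed message is assumed to exist on all N replicas, so it survives as long as at least one of them is still alive. The sketch below checks this exhaustively for N = 3; the helper has_surviving_replica and the broker names are hypothetical.

```python
from itertools import combinations


def has_surviving_replica(replicas, failed):
    """True if at least one broker holding the partition is still alive."""
    return any(broker not in failed for broker in replicas)


replicas = ["b1", "b2", "b3"]  # replication factor N = 3
n = len(replicas)

# Every possible combination of N-1 failures leaves one replica alive.
tolerates_n_minus_1 = all(
    has_surviving_replica(replicas, set(failed))
    for failed in combinations(replicas, n - 1)
)

# Losing all N replicas leaves nothing to serve the data.
tolerates_n = has_surviving_replica(replicas, set(replicas))

print(tolerates_n_minus_1, tolerates_n)  # True False
```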
————————————————
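Copyright notice: This is an original article by CSDN blogger 李小靜, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/a568078283/article/details/51464524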

