Kafka 0.10.0 Official Documentation Translation (Part 1): Getting Started


1.1 Introduction
Kafka is a distributed streaming platform. What exactly does that mean?


We think of a streaming platform as having three key capabilities:
  It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  It lets you store streams of records in a fault-tolerant way.
  It lets you process streams of records as they occur.



What is Kafka good for?
  It gets used for two broad classes of application:
  Building real-time streaming data pipelines that reliably get data between systems or applications
  Building real-time streaming applications that transform or react to the streams of data


  To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:
  Kafka is run as a cluster on one or more servers.
  The Kafka cluster stores streams of records in categories called topics.
  Each record consists of a key, a value, and a timestamp.
    
  Kafka has four core APIs:
  The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams (see the sketch after this list).
  The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
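  To make the Streams API concrete, here is a minimal sketch (not the official example) of a stream processor that uppercases values from one topic into another. It uses the StreamsBuilder API from newer client releases; the 0.10.0 release this document covers shipped the older KStreamBuilder. The application id, broker address, and topic names are illustrative assumptions.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // Transform each record's value and write the result to an output topic.
        input.mapValues(v -> v.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}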


  In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.


Topics and Logs
  Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.

  A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. A hedged topic-creation sketch follows.
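  As an illustration, topics can be created programmatically with the Java AdminClient. Note this client was added after the 0.10.0 release this document covers (the 0.10 era used the kafka-topics.sh script instead); the topic name, partition count, and replication factor below are assumptions.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (illustrative values)
            NewTopic topic = new NewTopic("my-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}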
  For each topic, the Kafka cluster maintains a partitioned log that looks like this:
  [Figure: anatomy of a topic; each partition is an append-only sequence of records numbered by offset]

  Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

  The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.

  In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from "now", as the sketch below shows.
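  A minimal sketch of this offset control with the Java consumer; the broker address, topic name, partition, and offset value are illustrative assumptions.

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class OffsetControlExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("my-topic", 0);
            List<TopicPartition> partitions = Collections.singletonList(p0);
            consumer.assign(partitions);          // manual assignment, no group rebalancing
            consumer.seekToBeginning(partitions); // replay everything still retained
            // consumer.seek(p0, 1234L);          // or jump to a specific offset
            // consumer.seekToEnd(partitions);    // or skip ahead and consume from "now"
        }
    }
}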
  This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
  The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.


Distribution
  The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

  Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.

Producers
  Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record). More on the use of partitioning in a second! A minimal producer sketch follows.
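  Here is a minimal, hedged sketch of the Java Producer API showing both styles: a keyed record (all records with the same key land in the same partition) and an unkeyed record (left to the default partitioner). The broker address, topic name, keys, and values are illustrative assumptions.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed: records with key "user-42" always go to the same partition.
            producer.send(new ProducerRecord<>("my-topic", "user-42", "clicked"));
            // Unkeyed: the default partitioner spreads records across partitions.
            producer.send(new ProducerRecord<>("my-topic", "some event payload"));
        }
    }
}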

Consumers
  Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

  If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
  If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
  

  [Figure] A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
  More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process. A minimal consumer-group sketch follows.
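  As a hedged sketch, joining a consumer group only requires setting group.id and subscribing; the group name, topic, and broker address below are illustrative assumptions, and poll(Duration) is the newer client signature (the 0.10 client took a long timeout instead).

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "logical-subscriber-a"); // instances sharing this name split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
            }
        }
    }
}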

  The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

  Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Guarantees
  At a high level Kafka gives the following guarantees:
  Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  A consumer instance sees records in the order they are stored in the log.
  For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log. A small config sketch related to these durability guarantees follows this list.
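  As a hedged aside, the producer-side settings most related to these durability guarantees look like this; the values are illustrative and would be added to the producer Properties shown earlier.

// Wait until the full set of in-sync replicas has acknowledged the write
// before considering a record committed; this pairs with the replication
// factor N chosen at topic creation.
props.put("acks", "all");
// Retry transient send failures instead of dropping the record.
props.put("retries", 3);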

More details on these guarantees are given in the design section of the documentation.

 

