Kafka Tutorial (1): Getting to Know Kafka


Message Queues (MQ)

A message queue is exactly what the name says: messages plus a queue. It is a container for transporting messages, providing produce and consume APIs to store and retrieve them.

Message queues come in two flavors: point-to-point (p2p) and publish/subscribe (pub/sub).

What they share: produced messages go into a queue, and consumers fetch messages from that queue.

Where they differ: in the p2p model a message can be consumed only once, and after consumption it is gone, like a phone call;

    in the pub/sub model a message can be consumed N times, and by multiple consumers at once, like a WeChat official account.

 

Kafka Overview

Kafka is a publish/subscribe messaging system with the following characteristics:

High throughput: supports producing and consuming messages at a rate of millions per second.

Durability: a complete message-storage mechanism keeps messages safe and durable.

Distributed: built for distributed scaling and fault tolerance; Kafka replicates data to several other servers, and if one server dies it automatically fails over to another.

 

Kafka is also a piece of message-oriented middleware;

it is commonly used to process high-volume activity data such as logins and page views.

 

Kafka Components

The Kafka service itself

topic: a topic represents a category of messages, e.g. sports or entertainment

broker: a message broker, i.e. one node of the cluster; it stores the data, and a topic can be stored across brokers in partitions

partition: a physical grouping of a topic; one topic is split into n partitions across the brokers

message: each message is assigned to one partition, which requires a mapping scheme (a sketch follows below)
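A minimal sketch of that mapping, modeled on the idea behind the Java client's default partitioner (which hashes the raw key bytes with murmur2 and takes the result modulo the partition count). The helper below is illustrative only, with String.hashCode() standing in for murmur2:

class PartitionMapping {
    // Illustrative only: maps a message key onto one of numPartitions partitions.
    // The real Java client hashes the raw key bytes with murmur2; String.hashCode()
    // stands in for that here.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions; // mask the sign bit, then mod
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition, so per-key ordering holds.
        System.out.println(partitionFor("user-42", 3));
    }
}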

 

Around the Kafka service

producer: the message producer

consumer: the message consumer

zookeeper: coordinates Kafka's normal operation

 

Broker Configuration

One broker is one Kafka server instance; it is configured through the Kafka configuration file server.properties.

1. To cut down on disk writes, Kafka first buffers messages and only flushes them to disk once a certain number of messages has accumulated or a certain amount of time has passed.

The corresponding settings:

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk. 
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000  <=========

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000  <=========

 

2. Messages are deleted automatically after a retention period; the default is 7 days (168 hours).

The corresponding settings:

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168  

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according 
# to the retention policies
log.retention.check.interval.ms=300000

# By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false
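
The broker-wide retention can also be overridden per topic. A minimal sketch using the Java AdminClient's incrementalAlterConfigs (available since Kafka 2.3); the broker address and the topic name demo-topic are assumptions:

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionOverrideSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic-level retention.ms overrides the broker-wide log.retention.hours.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "86400000"),  // keep 1 day instead of 7
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}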

 

Producer Configuration

The message producer; configuration file: producer.properties

1. partitioner.class: lets you specify a custom partitioning method, i.e. an algorithm you wrote yourself.

2. producer.type=sync: whether messages are sent synchronously or asynchronously. Synchronous means a message is sent and its acknowledgement received before the next one goes out; asynchronous means fire-and-forget.

3. Asynchronous sending supports batching, which raises send efficiency: messages are first buffered in memory and then sent out in one batch. The relevant parameters are queue.buffering.max.ms= and queue.buffering.max.messages=; reportedly the defaults are 5000 and 10000. (A sketch follows below.)
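
Note that producer.type and the queue.buffering.* settings belong to the legacy Scala producer; in the modern Java client the closest analogues are linger.ms and batch.size, and sync vs. async is expressed by blocking on the returned Future or passing a Callback. A minimal sketch under those assumptions (the broker address and topic name are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Modern analogues of the old queue.buffering.* batching knobs:
        props.put("linger.ms", "5");       // wait up to 5 ms to fill a batch
        props.put("batch.size", "16384");  // per-partition batch buffer, in bytes

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("demo-topic", "key", "hello");  // assumed topic

            // "sync": block on the Future, i.e. wait for the broker's ack before continuing
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("sync send -> partition %d, offset %d%n",
                    meta.partition(), meta.offset());

            // "async": fire, and handle the result in a callback
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();
            });
        }
    }
}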

 

Consumer Configuration

Configuration file: consumer.properties

1. group.id=test-consumer-group: every consumer belongs to some group; this sets the group id.

2. How Kafka hands out messages depends on grouping:

Across groups: different groups consume the same data, without affecting one another.

Within a group: members share the data. Two consumers in the same group cannot consume the same partition of a topic at the same time, but they can consume different partitions of that topic at the same time.

  // Hence, for a given topic, a single group should not have more members than the topic has partitions; the extra consumers would sit idle.

3. One consumer can run multiple threads, and each thread behaves like a separate consumer.

(This is how Kafka implements both broadcast (deliver a topic's messages to every consumer) and unicast (deliver to exactly one consumer).
One topic can have multiple consumer groups. To broadcast, just give every consumer its own group.
To unicast, put all consumers into the same group. Consumer groups also let you group consumers freely without sending the same messages to several topics. A sketch follows below.)
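
A minimal sketch with the modern Java client, assuming a broker at localhost:9092 and a topic named demo-topic. Start two copies with the same group.id and they split the partitions between them (unicast); give each copy its own group.id and both receive every message (broadcast):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "test-consumer-group");      // the group id discussed above
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));  // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}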

 

partition

At the storage layer each partition is an append-only log file; new messages are appended at the tail of the file.

Each message's position in the log file is called its offset.

More partitions means room for more consumers, which effectively raises the concurrency of consumption.

Add topics to separate lines of business; add partitions when data volume grows. (A sketch follows below.)
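
A minimal sketch of creating a topic with a chosen partition count via the Java AdminClient; the broker address, topic name, and the counts (3 partitions, replication factor 2) are all illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2 (both values are illustrative)
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}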

 

message

Three attributes:

offset: a long; the message's sequence number within its partition, effectively its id

MessageSize: an int32; the size of the message in bytes

data: the actual payload of the message
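
On the consumer side these fields surface as accessors on the Java client's ConsumerRecord; a minimal, illustrative helper (the class and method names are made up):

import org.apache.kafka.clients.consumer.ConsumerRecord;

class MessageFields {
    // Maps the three message attributes above onto the Java client's accessors.
    static void inspect(ConsumerRecord<String, String> record) {
        long offset = record.offset();                  // sequence number within the partition
        int messageSize = record.serializedValueSize(); // payload size in bytes (-1 for a null value)
        String data = record.value();                   // the payload itself
        System.out.printf("offset=%d size=%d data=%s%n", offset, messageSize, data);
    }
}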

 

Broker Configuration in Detail

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

##################################################################################
#  A broker is one deployed Kafka instance. In a Kafka cluster, every Kafka server
#  must have a broker.id, and that id must be unique and must be an integer.
##################################################################################
broker.id=10

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = security_protocol://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

##################################################################################
# The number of threads handling network requests (default: 3)
##################################################################################
num.network.threads=3
##################################################################################
# The number of threads doing disk I/O (default: 8)
##################################################################################
num.io.threads=8

##################################################################################
# The send buffer (SO_SNDBUF) used by the socket server (default: 100 KB)
##################################################################################
socket.send.buffer.bytes=102400

##################################################################################
# The receive buffer (SO_RCVBUF) used by the socket server (default: 100 KB)
##################################################################################
socket.receive.buffer.bytes=102400

##################################################################################
# The maximum size of a single request that the socket server will accept,
# as protection against OOM (out of memory); default: 100 MB
##################################################################################
socket.request.max.bytes=104857600

############################# Log Basics (Kafka's data is called the "log") #############################

##################################################################################
# A comma separated list of directories under which to store log files,
# i.e. where Kafka stores the data it receives
##################################################################################
log.dirs=/home/uplooking/data/kafka

##################################################################################
# The default number of log partitions per topic (default: 1). More partitions allow
# greater parallelism for consumption, but this will also result in more files across
# the brokers.
# (A partition is distributed storage: one dataset is split into several pieces,
# i.e. divided into chunks/partitions.)
##################################################################################
num.partitions=1

##################################################################################
# The number of threads per data directory to be used for log recovery at startup
# and flushing at shutdown. This value is recommended to be increased for
# installations with data dirs located in a RAID array.
##################################################################################
num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# Kafka's flush policy is driven only by message count and by time interval; there is no
# size-based option. You may set either one or both; out of the box both are commented
# out, which leaves flushing up to the OS.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always
# happens from the end of the log.

# The minimum age of a log file to be eligible for deletion
# (time-based policy; default: keep for 7 days)
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the
# remaining segments don't drop below log.retention.bytes (here 1 GB).
#log.retention.bytes=1073741824

# The maximum size of a log segment file (here 1 GB). When this size is reached,
# a new log segment is created (this is the segmenting policy).
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies (300000 ms = 5 minutes)
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=uplooking01:2181,uplooking02:2181,uplooking03:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

 

