Burrow: A Kafka Consumer Lag Monitoring Tool


Kafka itself does not ship a good way to monitor consumer lag. Brokers come with scripts such as kafka-topics.sh, kafka-consumer-groups.sh, and kafka-console-consumer.sh, but on large production clusters, collecting lag with shell scripts is very unreliable.


Introduction to Burrow

Burrow is actively developed by the Streaming SRE team in LinkedIn's data infrastructure organization. It is written in Go, released under the Apache License, and hosted on GitHub.

It collects information about a cluster's consumer groups and computes a single status for each group, telling us whether the group is healthy, falling behind, slowing down, or has stopped working entirely. It does this without requiring thresholds on group progress, although users can still obtain the number of delayed messages from it.

Burrow's Design


Burrow automatically monitors all consumers and every partition they consume. It reads consumer offsets by consuming Kafka's special internal offsets topic, and then serves the consumer information as a centralized service, separate from any single consumer. Consumer status is determined by evaluating the consumer's behavior over a sliding window.

This information is broken down into a status per partition and then aggregated into a single status for the consumer. A group can be OK, in a WARNING state (the consumer is working but falling behind on messages), or in an ERROR state (the consumer has stopped consuming or is offline). The status can be fetched from Burrow with a simple HTTP request, and Burrow can also check periodically and push notifications out via email or to a separate HTTP endpoint (for example, a monitoring or alerting system).
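The sliding-window evaluation can be sketched roughly as follows. This is a simplified illustration under assumed data structures, not Burrow's actual implementation: it classifies a single partition from a window of (offset, lag) observations, using rules similar to the ones described above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    offset: int  # committed consumer offset
    lag: int     # broker head offset minus committed offset

def partition_status(window: List[Observation]) -> str:
    """Classify one partition from a sliding window of offset commits.

    A simplified version of the rules Burrow describes:
      - STALL if offsets are committed but not moving and lag is non-zero
      - WARN  if offsets are moving but lag keeps increasing
      - OK    otherwise
    (Real Burrow also uses commit timestamps to detect groups that have
    stopped committing entirely, which this sketch cannot distinguish.)
    """
    if not window:
        return "ERR"  # no commits observed at all
    offsets = [o.offset for o in window]
    lags = [o.lag for o in window]
    moving = offsets[-1] > offsets[0]
    lag_rising = all(lags[i] < lags[i + 1] for i in range(len(lags) - 1))
    if not moving and lags[-1] > 0:
        return "STALL"
    if moving and lag_rising:
        return "WARN"
    return "OK"
```

A healthy group shows moving offsets with flat or shrinking lag; the same window with frozen offsets and non-zero lag flips the partition to STALL.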

By monitoring how far consumers lag behind, Burrow monitors application health, and it can watch multiple Kafka clusters at once. Its HTTP endpoints for reporting information about the Kafka clusters and consumers are separate from the lag status itself, which is useful for applications that manage a Kafka cluster but cannot run a Java Kafka client.

Burrow Installation and Releases

Burrow is written in Go, and version 1.1 has been released (the assets below are from a later 1.2.2 release).
Docker images are also provided.

Burrow_1.2.2_checksums.txt              297 Bytes

Burrow_1.2.2_darwin_amd64.tar.gz        4.25 MB

Burrow_1.1.0_linux_amd64.tar.gz         3.22 MB (for CentOS 6)

Burrow_1.2.2_linux_amd64.tar.gz         4.31 MB (for CentOS 7; requires GLIBC >= 2.14)

Burrow_1.2.2_windows_amd64.tar.gz        4 MB

Source code (zip)

Source code (tar.gz)

This release contains several important fixes for issues found in the initial 1.0.0 release, including:

  • Support for Kafka 1.0 and newer (#306)
  • Fixed Zookeeper watch handling (#328)

It also includes several minor feature updates:

  • Store a ring of recent broker offsets to avoid false alerts on stalled partitions
  • Add a configurable notification interval
  • Add support for configuration via environment variables
  • Support configurable queue depths in the storage module
Changelog - version 1.2
[d244fce922] - Bump sarama to 1.20.1 (Vlad Gorodetsky)
[793430d249] - Golang 1.9.x is no longer supported (Vlad Gorodetsky)
[735fcb7c82] - Replace deprecated megacheck with staticcheck (Vlad Gorodetsky)
[3d49b2588b] - Link the README to the Compose file in the project (Jordan Moore)
[3a59b36d94] - Tests fixed (Mikhail Chugunkov)
[6684c5e4db] - Added unit test for v3 value decoding (Mikhail Chugunkov)
[10d4dc39eb] - Added v3 messages protocol support (Mikhail Chugunkov)
[d6b075b781] - Replace deprecated MAINTAINER directive with a label (Vlad Gorodetsky)
[52606499a6] - Refactor parseKafkaVersion to reduce method complexity (gocyclo) (Vlad Gorodetsky)
[b0440f9dea] - Add gcc to build zstd (Vlad Gorodetsky)
[6898a8de26] - Add libc-dev to build zstd (Vlad Gorodetsky)
[b81089aada] - Add support for Kafka 2.1.0 (Vlad Gorodetsky)
[cb004f9405] - Build with Go 1.11 (Vlad Gorodetsky)
[679a95fb38] - Fix golint import path (golint fixer)
[f88bb7d3a8] - Update docker-compose Readme section with working url. (Daniel Wojda)
[3f888cdb2d] - Upgrade sarama to support Kafka 2.0.0 (#440) (daniel)
[1150f6fef9] - Support linux/arm64 using Dup3() instead of Dup2() (Mpampis Kostas)
[1b65b4b2f2] - Add support for Kafka 1.1.0 (#403) (Vlad Gorodetsky)
[74b309fc8d] - code coverage for newly added lines (Clemens Valiente)
[279c75375c] - accidentally reverted this (Clemens Valiente)
[192878c69c] - gofmt (Clemens Valiente)
[33bc8defcd] - make first regex test case a proper match everything (Clemens Valiente)
[279b256b27] - only set whitelist / blacklist if it's not empty string (Clemens Valiente)
[b48d30d18c] - naming (Clemens Valiente)
[7d6c6ccb03] - variable naming (Clemens Valiente)
[4e051e973f] - add tests (Clemens Valiente)
[545bec66d0] - add blacklist for memory store (Clemens Valiente)
[07af26d2f1] - Updated burrow endpoint in README : #401 (Ratish Ravindran)
[fecab1ea88] - pass custom headers to http notifications. (#357) (vixns)

Changelog - version 1.1
fecab1e pass custom headers to http notifications. (#357)
7c0b8b1 Add minimum-complete config for the evaluator (#388)
dc4cb84 Fix mail template (#369)
e2216d7 Fetch goreleaser via curl instead of 'go get' as compilation only works in 1.10 (#387)
f3659d1 Add a send-interval configuration parameter (#364)
3e488a2 Allow env vars to be used for configuration (#363)
b7428c9 Fix typo in slack close (#361)
5b546cc Create the broker offset rings earlier (#360)
61f097a Metadata refresh on detecting a deleted topic must not be for that topic (#359)
b890885 Make inmemory module request channel's size configurable (#352)
9911709 Update sarama to support 10.2.1 too. (#345)
a1bdcde Adjusting docker build to be self-contained (#344)
a91cf4d Fix an incorrect cast from #338 and add a test to cover it (#340)
389ef47 Store broker offset history (#338)
1a60efe Fix alert closing (#334)
b75a6f3 Fix typo in Cluster reference
cacf05e Reject offsets that are older than the group expiration time (#330)
b6184ff Fix typo in the config checked for TLS no-verify #316 (#329)
3b765ea Sync Gopkg.lock with Gopkg.toml (#312)
e47ec4c Fix ZK watch problem (#328)
846d785 Assume backward-compatible consumer protocol version (fix #313) (#327)
e3a1493 Update sarama to support Kafka 1.0.0 (#306)
946a425 Fixing requests for StorageFetchConsumersForTopic (#310)
52e3e5d Update burrow.toml (#300)
3a4372f Upgrade sarama dependency to support Kafka 0.11.0 (#297)
8993eb7 Fix goreleaser condition (#299)
d088c99 Add gitter webhook to travis config (#296)
08e9328 Merge branch 'gitter-badger-gitter-badge'
76db0a9 Fix positioning
dddd0ea Add Gitter badge

You can install Burrow either by building from source or by using the official binary packages.

The binary package approach is recommended here.

Burrow keeps no local state; it is a CPU-intensive and network-I/O-intensive application.

Installation steps

# wget https://github.com/linkedin/Burrow/releases/download/v1.1.0/Burrow_1.1.0_linux_amd64.tar.gz 
# mkdir burrow
# tar -xf Burrow_1.1.0_linux_amd64.tar.gz -C burrow
# cp burrow/burrow /usr/bin/
# mkdir /etc/burrow
# cp burrow/config/* /etc/burrow/
# install the init script shown later as /etc/init.d/burrow, then:
# chkconfig --add burrow
# /etc/init.d/burrow start

Configuration file

[general]
pidfile="/var/run/burrow.pid"
stdout-logfile="/var/log/burrow.log"
access-control-allow-origin="mysite.example.com"

[logging]
filename="/var/log/burrow.log"
level="info"
maxsize=512
maxbackups=30
maxage=10
use-localtime=true
use-compression=true

[zookeeper]
servers=[ "test1.localhost:2181","test2.localhost:2181" ]
timeout=6
root-path="/burrow"

[client-profile.prod]
client-id="burrow-lagchecker"
kafka-version="0.10.0"

[cluster.production]
class-name="kafka"
servers=[ "test1.localhost:9092","test2.localhost:9092" ]
client-profile="prod"
topic-refresh=180
offset-refresh=30

[consumer.production_kafka]
class-name="kafka"
cluster="production"
servers=[ "test1.localhost:9092","test2.localhost:9092" ]
client-profile="prod"
start-latest=false
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"
group-whitelist=""

[consumer.production_consumer_zk]
class-name="kafka_zk"
cluster="production"
servers=[ "test1.localhost:2181","test2.localhost:2181" ]
#zookeeper-path="/"
# If specified, this is the root of the Kafka cluster metadata in the Zookeeper ensemble. If not specified, the root path is used.
zookeeper-timeout=30
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"
group-whitelist=""

[httpserver.default]
address=":8000"

[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1

#[notifier.default]
#class-name="http"
#url-open="http://127.0.0.1:1467/v1/event"
#interval=60
#timeout=5
#keepalive=30
#extras={ api_key="REDACTED", app="burrow", tier="STG", fabric="mydc" }
#template-open="/etc/burrow/default-http-post.tmpl"
#template-close="/etc/burrow/default-http-delete.tmpl"
#method-close="DELETE"
#send-close=false
##send-close=true
#threshold=1

Init script (RHEL 6 / CentOS 6):

#!/bin/bash
#
# Comments to support chkconfig
# chkconfig: - 98 02
# description: Burrow is a Kafka consumer lag checking tool from LinkedIn, Inc.
#
# Source function library.
. /etc/init.d/functions

### Default variables
prog_name="burrow"
prog_path="/usr/bin/${prog_name}"
pidfile="/var/run/${prog_name}.pid"
options="-config-dir /etc/burrow/"

# Check if requirements are met
[ -x "${prog_path}" ] || exit 1

RETVAL=0

start(){
  echo -n $"Starting $prog_name: "
  #pidfileofproc $prog_name
  #killproc $prog_path
  PID=$(pidofproc -p $pidfile $prog_name)
  #daemon $prog_path $options

  if [ -z "$PID" ]; then
    $prog_path $options > /dev/null 2>&1 &
    [ ! -e $pidfile ] && sleep 1
  fi

  [ -z "$PID" ] && PID=$(pidof ${prog_path})
  if [ -f "$pidfile" -a -d "/proc/$PID" ]; then
    #RETVAL=$?
    RETVAL=0
    #[ ! -z "${PID}" ] && echo ${PID} > ${pidfile}
    echo_success
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
  else
    RETVAL=1
    echo_failure
  fi
    
  echo
  return $RETVAL
}

stop(){
  echo -n $"Shutting down $prog_name: "
  killproc -p ${pidfile} $prog_name
  RETVAL=$?
  echo
  [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
  return $RETVAL
}

restart() {
  stop
  start
}

case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    restart
    ;;
  status)
    status $prog_path
    RETVAL=$?
    ;;
  *)
    echo $"Usage: $0 {start|stop|restart|status}"
    RETVAL=1
esac

exit $RETVAL

systemd unit (RHEL 7 / CentOS 7):

[Unit]
Description=Burrow - Kafka consumer LAG Monitor
After=network.target


[Service]
Type=simple
RestartSec=20s
ExecStart=/usr/bin/burrow --config-dir /etc/burrow
PIDFile=/var/run/burrow.pid
User=burrow
Group=burrow
Restart=on-abnormal


[Install]
WantedBy=multi-user.target

The default configuration file is burrow.toml.


Usage

List consumer groups

GET /v3/kafka/(cluster)/consumer

Burrow's endpoints all return JSON objects, which makes the output easy to collect and post-process.
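Because every endpoint returns JSON, a collector needs only a few lines of code to consume it. The sketch below parses a consumer-list response; the sample payload follows the general shape of Burrow's v3 responses, but the exact field names should be verified against your Burrow version.

```python
import json

# Hypothetical sample of what GET /v3/kafka/<cluster>/consumer returns;
# verify the field names against the Burrow version you run.
sample = '''
{
  "error": false,
  "message": "consumer list returned",
  "consumers": ["console-consumer-1234", "my-app-group"]
}
'''

payload = json.loads(sample)
if not payload["error"]:
    for group in payload["consumers"]:
        print(group)
```

In production the `sample` string would come from an HTTP GET against Burrow's `httpserver` address (`:8000` in the configuration above).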

Get the status or consumption lag of a specific consumer group

GET /v3/kafka/(cluster)/consumer/(group)/status
GET /v3/kafka/(cluster)/consumer/(group)/lag

The consumer group health states returned by these endpoints have the following meanings:

NOTFOUND – the consumer group was not found
OK       – the consumer group is healthy
WARN     – the group is in a warning state: offsets are moving but lag is increasing
ERR      – the group is in an error state: offsets have stopped for one or more partitions but lag is non-zero
STOP     – the group has stopped: offsets have not been committed in a long period of time
STALL    – the group is stalled: offsets are being committed, but they are not changing and lag is non-zero
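When feeding these states into an alerting pipeline, it helps to map them to severities. A minimal sketch; the numeric levels and alerting policy here are an assumption for illustration, not something Burrow prescribes:

```python
# Illustrative severity levels for Burrow consumer-group states;
# these numbers are our own convention, not part of Burrow.
SEVERITY = {
    "NOTFOUND": 1,
    "OK": 0,
    "WARN": 1,
    "ERR": 2,
    "STOP": 2,
    "STALL": 2,
}

def should_alert(status: str) -> bool:
    """Alert on anything above OK; unknown states alert defensively."""
    return SEVERITY.get(status, 2) > 0
```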

List topics

GET /v3/kafka/(cluster)/topic

Get offset information for a specific topic

GET /v3/kafka/(cluster)/topic/(topic)
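Lag itself is simply the difference between a partition's log-end offset (as returned by the topic endpoint) and the group's committed offset. A toy calculation, assuming both lists are indexed by partition number:

```python
def compute_lag(topic_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus committed offset.

    Negative values (a commit racing ahead of the fetched end offset,
    which can happen transiently) are clamped to zero.
    """
    return [max(end - committed, 0)
            for end, committed in zip(topic_offsets, committed_offsets)]

# e.g. a 3-partition topic
lags = compute_lag([100, 250, 80], [90, 250, 75])
```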

