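I. Write an alert that monitors a Kafka topic and fires when its per-second message count exceeds a given threshold
1. The Grafana dashboard shows the specific metrics being monitored.
2. The live scraped data can be found on the Prometheus web page.
3. Write an alert rule file based on the data Prometheus scrapes.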
# pwd
/usr/local/prometheus-2.6.1.linux-amd64
# mkdir rules
# cat rules/kafka.yml
groups:
  - name: kafka.rules
    rules:
      - alert: TopicConsumerTrafficPerMinute
        expr: kafka_topic_partition_current_offset{topic="superman"} > 2000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: {{ $labels.topic }} consumption is too high"
          description: "{{ $labels.instance }}: {{ $labels.job }}: {{ $labels.partition }}: {{ $labels.topic }} partition offset is above 2000 (current value: {{ $value }})"
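The expression above compares the raw partition offset with 2000. That offset is a counter that only grows, so once it passes 2000 the alert stays on. If the goal is the per-second rate mentioned in the heading, one possible variant wraps the metric in rate(). This is only a sketch: the file name kafka_rate.yml, the alert name KafkaTopicMessageRateHigh, the RULES_DIR variable and the 2000 messages/s threshold are all assumptions, and the 1m window assumes a scrape interval well below one minute.

```shell
#!/bin/sh
# Hypothetical location; the article keeps rule files in ./rules
# under the Prometheus directory.
RULES_DIR="${RULES_DIR:-./rules}"
mkdir -p "$RULES_DIR"

# Quoted heredoc so the shell leaves {{ ... }} and $labels untouched.
cat > "$RULES_DIR/kafka_rate.yml" <<'EOF'
groups:
  - name: kafka_rate.rules
    rules:
      - alert: KafkaTopicMessageRateHigh
        # rate() turns the ever-growing offset into messages per second,
        # averaged over the last minute; sum by (topic) adds up the partitions.
        expr: sum by (topic) (rate(kafka_topic_partition_current_offset{topic="superman"}[1m])) > 2000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Topic {{ $labels.topic }} message rate is high"
          description: "Topic {{ $labels.topic }} receives more than 2000 messages/s (current value: {{ $value }})"
EOF

# Validate only if promtool is on PATH (it ships in the Prometheus tarball).
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_DIR/kafka_rate.yml"
fi
```

Because the file lands in the rules directory, the "rules/*.yml" glob in prometheus.yml would pick it up without further changes.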
4. Edit the prometheus.yml configuration file
# cat prometheus.yml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
5. Restart Prometheus
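A full restart works, but Prometheus also re-reads prometheus.yml and the rule files when it receives SIGHUP. A minimal sketch of that alternative, assuming the process is named exactly prometheus; the helper name reload_prometheus is made up for illustration.

```shell
# Ask a running Prometheus to reload its configuration and rule files.
reload_prometheus() {
  # Use the PID passed as $1, otherwise look up a process named exactly "prometheus".
  pid="${1:-$(pgrep -x prometheus | head -n 1)}"
  if [ -z "$pid" ]; then
    echo "no prometheus process found" >&2
    return 1
  fi
  # SIGHUP triggers a reload; the process keeps running.
  kill -HUP "$pid"
}
```

Call it as reload_prometheus, or pass a PID explicitly when several instances run on the same host. If the new configuration is invalid, Prometheus keeps the old one and logs an error, so the log is worth checking afterwards.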
6. Check the Alerts page in Prometheus
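The same information can be read from the command line, since Prometheus exposes pending and firing alerts as the synthetic ALERTS series. A sketch, assuming Prometheus listens on the default localhost:9090; PROM_URL and the helper name firing_alerts are hypothetical.

```shell
# Base URL of the Prometheus server (assumed default address and port).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the currently firing alerts as JSON via the instant-query API.
firing_alerts() {
  # -f: fail on HTTP errors, -sS: no progress bar but keep error messages,
  # -G with --data-urlencode: send the PromQL expression as an encoded query string.
  curl -fsS --max-time 5 -G "$PROM_URL/api/v1/query" \
    --data-urlencode 'query=ALERTS{alertstate="firing"}'
}
```

Running firing_alerts prints the raw JSON response; an empty "result" array means nothing is firing at the moment.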
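II. Write an alert that fires when the lag of a topic for a given consumer group exceeds a threshold (same steps as above)
Write the alert configuration based on the information above.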
# cd /usr/local/prometheus-2.6.1.linux-amd64/rules/
# cat kafka_lag.yml
groups:
  - name: kafka_rules
    rules:
      - alert: ConsumerGroupTopicLagPerMinute
        expr: kafka_consumergroup_lag{consumergroup="mygroup"} > 20
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }}: {{ $labels.topic }} consumption is lagging"
          description: "{{ $labels.consumergroup }}: {{ $labels.job }}: {{ $labels.partition }}: {{ $labels.topic }} consumption is lagging (current value: {{ $value }})"
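A lag rule like this can be exercised offline with promtool test rules, which should be available in the 2.6.1 tarball since rule unit testing arrived in Prometheus 2.5. The sketch below works in a scratch directory on a simplified copy of the rule: the alert name KafkaConsumerLag, the shortened summary text, the topic and partition label values, and the sample series are all assumptions made for the test.

```shell
#!/bin/sh
# Scratch directory so the real rules/ directory is left alone.
WORK="$(mktemp -d)"

# Simplified copy of the lag rule with a hypothetical ASCII alert name.
cat > "$WORK/kafka_lag.yml" <<'EOF'
groups:
  - name: kafka_rules
    rules:
      - alert: KafkaConsumerLag
        expr: kafka_consumergroup_lag{consumergroup="mygroup"} > 20
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.consumergroup }} lags on {{ $labels.topic }} (current value: {{ $value }})"
EOF

# Test case: lag is 0 at t=0, then 50 at every later minute.
cat > "$WORK/kafka_lag_test.yml" <<'EOF'
rule_files:
  - kafka_lag.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'kafka_consumergroup_lag{consumergroup="mygroup",topic="superman",partition="0"}'
        values: '0 50 50 50 50'
    alert_rule_test:
      # By 3m the lag has been above 20 for longer than the 1m "for" window.
      - eval_time: 3m
        alertname: KafkaConsumerLag
        exp_alerts:
          - exp_labels:
              severity: warning
              consumergroup: mygroup
              topic: superman
              partition: "0"
            exp_annotations:
              summary: "mygroup lags on superman (current value: 50)"
EOF

# Run from inside the scratch directory so the relative rule_files path resolves.
if command -v promtool >/dev/null 2>&1; then
  (cd "$WORK" && promtool test rules kafka_lag_test.yml)
fi
```

Changing the sample values to stay at or below 20 and expecting no alerts is a quick way to confirm the threshold behaves as intended before the rule goes live.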
Restart Prometheus