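How Prometheus raises an alert
- Suppose server A has been out of contact with the Prometheus server for more than one minute, which matches the liveness alert rule.
- Prometheus notifies Alertmanager that server A is unreachable.
- Alertmanager calls the DingTalk alert plugin to send the alert.
- The DingTalk robot posts a message in the group chat.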
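To make the last two steps more concrete, here is a minimal sketch of what happens between Alertmanager and DingTalk: a webhook payload comes in and is turned into a DingTalk markdown message. This is not the plugin's actual code. The function name `to_dingtalk_message`, the message layout, and the sample payload (including the instance name `server-a:9100`) are assumptions made for illustration; only the general shape of the Alertmanager webhook payload and of a DingTalk markdown message is taken as given.

```python
import json


def to_dingtalk_message(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into a DingTalk markdown message.

    Hypothetical helper: the real plugin uses its own templates.
    """
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "- [{status}] {name}: {desc}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                desc=annotations.get("description", ""),
            )
        )
    title = "{} alert(s), group status: {}".format(
        len(lines), payload.get("status", "unknown")
    )
    text = "\n".join(["### " + title] + lines)
    return {"msgtype": "markdown", "markdown": {"title": title, "text": text}}


# Hypothetical payload for the "server A is unreachable" case.
sample_payload = {
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "node",
                "instance": "server-a:9100",
                "severity": "critical",
            },
            "annotations": {
                "description": "server-a:9100 has been unreachable for more than 1 minute"
            },
        }
    ],
}

message = to_dingtalk_message(sample_payload)
print(json.dumps(message, ensure_ascii=False, indent=2))
```

The point of the sketch is only the direction of data flow: Prometheus decides that an alert fires, Alertmanager groups it and posts JSON to the plugin, and the plugin reshapes that JSON for the DingTalk robot.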
Nodes
- 172.50.13.101: Prometheus server
- 172.50.13.102: Alertmanager and the DingTalk alert plugin
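Configuring Alertmanager
First, download Alertmanager from the official repository on GitHub: prometheus/alertmanager.
Installation is simple: just extract the archive.
Contents of alertmanager.yml (the url under receivers must be the URL of the DingTalk alert plugin):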
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2h
  receiver: webhook
receivers:
  - name: webhook
    webhook_configs:
      - url: 'http://172.50.13.102:8060/dingtalk/webhook1/send'
        send_resolved: true
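Once Alertmanager is running with this file, one way to check the route and receiver without waiting for a real outage is to push a hand-made alert into Alertmanager's v2 API. The sketch below assumes Alertmanager is reachable at `http://172.50.13.102:18081` (the listen address used in the supervisor configuration in this article). The helper names `build_test_alert` and `send_alerts` and the alert name `manual_test` are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumption: the Alertmanager address used elsewhere in this article.
ALERTMANAGER_URL = "http://172.50.13.102:18081"


def build_test_alert(alertname: str, instance: str, severity: str = "critical") -> list:
    """Build the JSON body for POST /api/v2/alerts (a list of alerts)."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {
            "labels": {
                "alertname": alertname,
                "instance": instance,
                "severity": severity,
            },
            "annotations": {
                "summary": instance + " test alert",
                "description": "Manual test alert, safe to ignore.",
            },
            "startsAt": now,
        }
    ]


def send_alerts(alerts: list, base_url: str = ALERTMANAGER_URL) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        url=base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


alerts = build_test_alert("manual_test", "test-host:9100")
print(json.dumps(alerts, indent=2))
# send_alerts(alerts)  # uncomment on a host that can reach Alertmanager
```

With `group_wait: 10s` as configured above, the notification should reach the webhook roughly ten seconds after the alert is accepted.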
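Configuring the DingTalk alert plugin
Plugin download: the Releases page of timonwong/prometheus-webhook-dingtalk on GitHub.
If you only need it to work, extracting the archive is enough; the configuration file does not need to be changed.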
Configuring the supervisor daemon
vim /etc/supervisord.d/prometheus.ini
[program:alertmanager]
command=/usr/local/prometheus/alertmanager/alertmanager --storage.path="/home/data/prometheus/alertmanager/" --web.listen-address=":18081" --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --data.retention=120h --web.external-url=http://172.50.13.102:18081
directory=/usr/local/prometheus/alertmanager
autostart=true
startsecs=10
startretries=3
autorestart=true
[program:dingtalk]
command=/usr/local/prometheus/dingtalk/prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxx"
directory=/usr/local/prometheus/dingtalk
autostart=true
startsecs=10
startretries=3
autorestart=true
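Startup parameters:
- alertmanager:
  - storage.path: where Alertmanager stores its data
  - web.listen-address: the address and port to listen on
  - config.file: path to alertmanager.yml
  - data.retention: how long data is kept
  - web.external-url: the URL at which Alertmanager's web UI is reachable from outside; it is used for the links back to Alertmanager
- dingtalk:
  - ding.profile: replace the part after webhook1= with the actual webhook URL of your DingTalk robot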
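The profile name on the left of `ding.profile` is also what appears in the URL that alertmanager.yml points at, so the two have to agree. The sketch below illustrates that relationship under the assumption that the plugin serves each profile at `/dingtalk/<profile>/send` on port 8060, as the URL in this article's alertmanager.yml suggests. The helper names `parse_ding_profile` and `plugin_url` are hypothetical.

```python
def parse_ding_profile(value: str) -> tuple:
    """Split 'name=webhook-url' at the first '=' only.

    The webhook URL itself contains '=' (access_token=...), so a plain
    split on every '=' would break it.
    """
    name, _, webhook = value.partition("=")
    if not name or not webhook:
        raise ValueError("expected the form name=webhook-url, got: " + value)
    return name, webhook


def plugin_url(host: str, profile_name: str, port: int = 8060) -> str:
    """Build the URL that Alertmanager's webhook_configs should use."""
    return "http://{}:{}/dingtalk/{}/send".format(host, port, profile_name)


profile_name, robot_webhook = parse_ding_profile(
    "webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxx"
)
print(profile_name)
print(robot_webhook)
print(plugin_url("172.50.13.102", profile_name))
```

If the profile were renamed, for example to `ops`, the url in alertmanager.yml would need to change to `.../dingtalk/ops/send` as well.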
Connecting Prometheus to Alertmanager
The relevant parts of prometheus.yml:
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.50.13.102:18081']
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alertrules/*_rules.yml"
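targets is the address of Alertmanager. rule_files lists the alert rule files; here it matches every file whose name ends in "_rules.yml" in the alertrules directory, which sits next to prometheus.yml.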
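As a rough illustration of which files that pattern picks up, the sketch below checks a few candidate paths against `alertrules/*_rules.yml`. It only imitates the matching in Python and assumes that `*` does not cross directory boundaries; the candidate file names other than live_rules.yml and perf_rules.yml are invented for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

RULES_DIR = PurePosixPath("alertrules")
NAME_PATTERN = "*_rules.yml"


def is_loaded(path: str) -> bool:
    """True if the path sits directly in alertrules/ and ends in _rules.yml."""
    candidate = PurePosixPath(path)
    return candidate.parent == RULES_DIR and fnmatchcase(candidate.name, NAME_PATTERN)


candidates = [
    "alertrules/live_rules.yml",
    "alertrules/perf_rules.yml",
    "alertrules/notes.yml",              # wrong suffix
    "alertrules/live_rules.yaml",        # wrong extension
    "alertrules/archive/old_rules.yml",  # nested one level too deep
    "first_rules.yml",                   # not inside alertrules/
]

for path in candidates:
    print("{:40s} {}".format(path, "loaded" if is_loaded(path) else "ignored"))
```

In practice this means a new rule file only needs the right suffix and location; prometheus.yml does not have to be edited for each one.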
Liveness alert rule:
- alertrules/live_rules.yml
groups:
  - name: UP
    rules:
      - alert: node
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} has been unreachable for more than 1 minute!"
          summary: "{{ $labels.instance }} down"
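The `for: 1m` clause is what keeps a single failed scrape from paging anyone: the alert first becomes pending and only fires once the expression has stayed true for the whole duration. The following is a simplified simulation of that behaviour, not Prometheus code. The 15-second evaluation interval and the sample series are assumptions chosen for the example, and `alert_states` is a hypothetical name.

```python
def alert_states(up_samples, for_seconds, interval_seconds):
    """Return (time, state) pairs for a rule of the form `up == 0` with a `for` clause.

    States follow the usual progression: inactive -> pending -> firing.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, up in enumerate(up_samples):
        now = index * interval_seconds
        if up == 0:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # Any healthy sample resets the timer.
            active_since = None
            state = "inactive"
        states.append((now, state))
    return states


# Assumed series: the target is up, goes down for five evaluations, then recovers.
samples = [1, 0, 0, 0, 0, 0, 1]
states = alert_states(samples, for_seconds=60, interval_seconds=15)
for moment, state in states:
    print("t={:>3}s  {}".format(moment, state))
```

In this run the condition first holds at t=15s, so the alert is pending until t=75s, fires at that evaluation, and returns to inactive once the target reports `up == 1` again.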
Load alert rules (memory usage, disk usage, and CPU usage):
- alertrules/perf_rules.yml
groups:
  - name: mem_product
    rules:
      - alert: mem_product
        # "生產服務器" is the job name as configured in prometheus.yml (it means "production servers").
        expr: (1 - (node_memory_MemAvailable_bytes{job="生產服務器"} / (node_memory_MemTotal_bytes{job="生產服務器"}))) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} memory usage has been above 90% for 5 minutes!"
          summary: "{{ $labels.instance }} memory usage too high!"
  - name: disk
    rules:
      - alert: disk
        expr: (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"})) > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} disk usage has been above 95% for 5 minutes!"
          summary: "{{ $labels.instance }} disk usage too high!"
  - name: cpu
    rules:
      - alert: cpu
        # Aggregate by job as well as instance; otherwise the job label is dropped
        # and {{ $labels.job }} in the description renders empty.
        expr: ((1 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (job, instance) / sum(increase(node_cpu_seconds_total[5m])) by (job, instance)) * 100) > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} CPU usage has been above 70% for 5 minutes!"
          summary: "{{ $labels.instance }} CPU usage too high!"
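The three expressions above are dense, so it can help to see the arithmetic with plain numbers. The sketch below restates each formula as a small Python function and feeds it made-up values; the function names and all sample figures are assumptions for illustration, and the thresholds are the ones used in the rules.

```python
def memory_used_percent(mem_available: float, mem_total: float) -> float:
    """(1 - MemAvailable / MemTotal) * 100"""
    return (1 - mem_available / mem_total) * 100


def disk_used_percent(size: float, free: float, avail: float) -> float:
    """used * 100 / (avail + used), where used = size - free.

    The denominator uses 'avail' rather than 'free', so blocks reserved
    for root are not counted as usable space.
    """
    used = size - free
    return used * 100 / (avail + used)


def cpu_used_percent(idle_increase: float, total_increase: float) -> float:
    """(1 - idle CPU seconds / total CPU seconds) * 100 over the same window."""
    return (1 - idle_increase / total_increase) * 100


# Made-up sample values (GiB for memory and disk, CPU seconds for a 5m window).
memory = memory_used_percent(mem_available=1, mem_total=16)
disk = disk_used_percent(size=100, free=8, avail=3)
cpu = cpu_used_percent(idle_increase=75, total_increase=300)

print("memory {:.2f}%  alert: {}".format(memory, memory > 90))
print("disk   {:.2f}%  alert: {}".format(disk, disk > 95))
print("cpu    {:.2f}%  alert: {}".format(cpu, cpu > 70))
```

Each of these sample values crosses its threshold, but the alert would still only fire after the condition has held for the 5 minutes required by `for: 5m`.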