I. Introduction to Prometheus Alert Management
Prometheus alert management is split into two parts. Alerting rules are configured on the Prometheus server; when a rule fires, the Prometheus server sends alerts to Alertmanager. Alertmanager then manages those alerts, including silencing, inhibition, and grouping, and delivers notifications via email, WeChat, DingTalk, Slack, and other channels.
The main steps to set up alerting and notification are:
Install and configure Alertmanager;
Configure Prometheus to talk to Alertmanager;
Create alerting rules in Prometheus.
1. Core Concepts of the Alertmanager Module
Alertmanager official documentation:
https://prometheus.io/docs/alerting/alertmanager/
Grouping: grouping categorizes alerts of a similar nature into a single notification. This is especially useful during large outages, when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.
Inhibition: suppressing certain alerts when other, specific alerts are already firing (configured via inhibit_rules).
Silences: a silence is a straightforward way to mute alerts for a given period. Silences are configured with matchers, just like the routing tree: incoming alerts are checked against a silence's equality or regular-expression matchers, and if they match, no notifications are sent for them.
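As a sketch of how inhibition is wired up, an inhibit_rules block like the following (label names and values are illustrative, not from this setup) would mute warning-level alerts while a critical alert sharing the same instance label is firing:

```yaml
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['instance']    # only inhibit when source and target carry the same instance label
```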
II. Installing Alertmanager and Email Alerts
1. Installation
Host installation
[root@node1 ~]# wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
[root@node1 prometheus]# tar xf alertmanager-0.20.0.linux-amd64.tar.gz
[root@node1 prometheus]# cd alertmanager-0.20.0.linux-amd64
[root@node1 alertmanager-0.20.0.linux-amd64]# ./alertmanager --version
Modify the configuration as needed and run the binary.
Docker deployment
[root@node1 ~]# docker pull prom/alertmanager
[root@node1 ~]# docker inspect prom/alertmanager
# Prepare the configuration file alertmanager.yml and place it under /opt/prometheus/alertmanager/:
[root@node1 ~]# docker exec -it alertmanager_tmp cat /etc/alertmanager/alertmanager.yml
[root@node1 ~]# mkdir /opt/prometheus/alertmanager/
[root@node1 alertmanager]# vim alertmanager.yml
[root@node1 alertmanager]# docker run -d --name alertmanager -p 9093:9093 -v /opt/prometheus/alertmanager/:/etc/alertmanager/ prom/alertmanager    # run the container
Open the Alertmanager web UI in a browser:
http://192.168.42.133:9093/#/alerts
2. Email Alert Configuration
Official documentation for webhook notification configuration:
https://prometheus.io/docs/alerting/configuration/#webhook_config
(1) Configure Alertmanager notifications
[root@node1 ~]# cd /opt/prometheus/alertmanager/
[root@node1 alertmanager]# vim alertmanager.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xcm_xxxx@163.com'
  smtp_auth_username: 'xcm_xxxx@163.com'
  smtp_auth_password: 'xxxxxxx'
  smtp_require_tls: false
route:
  receiver: 'mail_163'
receivers:
- name: 'mail_163'
  email_configs:
  - to: '9933xxxxx@qq.com'
[root@node1 alertmanager]# docker restart alertmanager
(2) Configure Prometheus and add an alerting rule
[root@node1 prometheus]# pwd
/opt/prometheus/prometheus
[root@node1 prometheus]# vim rules/node1_alerts.yml
groups:
- name: node1_alerts
  rules:
  - alert: HighNodeCpu
    expr: instance:node_cpu:avg_rate1m > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: High Node CPU for 1 hour
      console: This is a Test
[root@node1 prometheus]# vim prometheus.yml
rule_files:
  - "rules/node1_rules.yml"
  - "rules/*_alerts.yml"    # add the alerting rule files
[root@node1 prometheus]# docker restart prometheus-server

(3) Configure Prometheus to point at Alertmanager
[root@node1 prometheus]# vim prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.42.133:9093
(4) Test the alert
[root@master ~]# wget https://cdn.pmylund.com/files/tools/cpuburn/linux/cpuburn-1.0-amd64.tar.gz
[root@master cpuburn]# ./cpuburn
Check the Prometheus alerts page:

Check the Alertmanager web UI (http://192.168.42.133:9093/#/alerts): you can see the alert has been dispatched, and the mailbox receives it as shown below.

Note: the alert message template can be customized in alertmanager.yml:
templates:    # at the same level as global
- 'template/*.tmpl'
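As an illustrative sketch only (the template name and file path are assumptions, not taken from this setup), a file under template/ could define a custom email subject:

```
{{ define "email.custom.subject" }}[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} on {{ .CommonLabels.instance }}{{ end }}
```

A receiver can then reference it from its email_configs by setting a Subject header to '{{ template "email.custom.subject" . }}'.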
III. Additional Alerting Rules (node disk, node targets, Prometheus, systemd)
Alert when a disk will fill within 7 days
groups:
- name: node1_alerts
  rules:
  - alert: DiskWillFillIn7Days
    expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 7*24*3600) < 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Disk on {{ $labels.instance }} will fill in approximately 7 days.
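predict_linear() fits a least-squares line over the samples in the range window and extrapolates it forward; the rule fires when the projected free bytes seven days out drops below zero. A minimal Python sketch of the idea (not Prometheus's actual implementation, and the sample numbers are illustrative):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit of value against timestamp, extrapolated
    seconds_ahead past the newest sample -- a sketch of what PromQL's
    predict_linear() computes over a range vector."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Root filesystem losing ~1 GiB per hour with ~5 GiB free at the start,
# sampled every 10 minutes over a 1h window (illustrative numbers).
samples = [(i * 600, 5 * 2**30 - i * (2**30 / 6)) for i in range(7)]
```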
Alert when a monitored instance (target) is down
groups:
- name: node1_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 10s
    labels:
      severity: critical
    annotations:
      summary: Host {{ $labels.instance }} of {{ $labels.job }} is Down!
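Every rule above uses a for: clause: the expression must stay true continuously for that long before the alert moves from pending to firing. A small Python sketch of that state machine (simplified; timings are illustrative):

```python
def alert_states(series, for_seconds, step):
    """Sketch of Prometheus's `for:` clause. `series` is the alert
    expression sampled as booleans every `step` seconds; the alert
    fires only once the expression has held true for `for_seconds`,
    sitting in 'pending' until then and resetting whenever it clears."""
    held, states = 0, []
    for ok in series:
        held = held + step if ok else 0
        if not ok:
            states.append("inactive")
        elif held >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states
```

With for: 10s and 5-second samples, three consecutive true evaluations yield pending, firing, firing; a single false sample in between resets the countdown.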
Alert when Prometheus fails to reload its configuration or loses its connection to Alertmanager:
[root@node1 rules]# vim prometheus_alerts.yml
groups:
- name: prometheus_alerts
  rules:
  - alert: PrometheusConfigReloadFailed
    expr: prometheus_config_last_reload_successful == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Reloading Prometheus config has failed on {{ $labels.instance }}.
  - alert: PrometheusNotConnectedToAlertmanagers
    expr: prometheus_notifications_alertmanagers_discovered < 2
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Prometheus {{ $labels.instance }} is not connected to some Alertmanagers.
Alert when a systemd-managed service goes down (the node_systemd_unit_state metric requires node_exporter's systemd collector, which is disabled by default, to be enabled)
groups:
- name: service_alerts
  rules:
  - alert: NodeServiceDown
    expr: node_systemd_unit_state{state="active"} != 1
    for: 40s
    labels:
      severity: critical
    annotations:
      summary: Service {{ $labels.name }} on {{ $labels.instance }} is no longer active!

IV. Alertmanager Routing Configuration
The route block defines the alert distribution policy. It is a tree structure, matched depth-first, left to right.
[root@node1 alertmanager]# vim alertmanager.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xcm_xxx@163.com'
  smtp_auth_username: 'xcm_xxx@163.com'
  smtp_auth_password: 'xxxx'
  smtp_require_tls: false
route:
  group_by: ['instance']    # grouping key: alerts are grouped by these labels; with several labels, all must match
  group_wait: 30s           # how long to wait before sending the first notification for a new group
  group_interval: 5m        # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 3h       # how long to wait before re-sending a notification that was already delivered
  receiver: mail_qq         # default receiver; required
  routes:
  - match:
      severity: critical
    receiver: mail_163
  - match_re:
      severity: ^(warning|critical)$
    receiver: mail_qq
receivers:
- name: 'mail_qq'
  email_configs:
  - to: '9933xxxx@qq.com'
- name: 'mail_163'
  email_configs:
  - to: 'xcm_xxx@163.com'
[root@node1 alertmanager]# docker restart alertmanager
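The depth-first, first-match-wins walk described above can be sketched in Python (deliberately simplified: it ignores Alertmanager's continue flag and the inheritance of grouping settings from parent routes):

```python
import re

def match_route(route, labels):
    """Return the receiver for an alert's labels, walking the route tree
    depth-first, left to right: the first matching child wins; otherwise
    fall back to the current node's own receiver."""
    for child in route.get("routes", []):
        eq = all(labels.get(k) == v for k, v in child.get("match", {}).items())
        rx = all(re.fullmatch(v, labels.get(k, ""))
                 for k, v in child.get("match_re", {}).items())
        if eq and rx:
            return match_route(child, labels)
    return route["receiver"]

# The routing tree from the config above, as a dict.
route = {
    "receiver": "mail_qq",
    "routes": [
        {"receiver": "mail_163", "match": {"severity": "critical"}},
        {"receiver": "mail_qq", "match_re": {"severity": "^(warning|critical)$"}},
    ],
}

match_route(route, {"severity": "critical"})  # "mail_163": first matching child wins
match_route(route, {"severity": "page"})      # "mail_qq": falls back to the root receiver
```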
V. Alertmanager Silence Configuration
Similar to Zabbix's maintenance mode: alerts covered by a silence do not send notifications during the configured time window.
Setting silences in the web UI

Setting silences with the command-line tool (amtool)
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence add alertname="InstanceDown" -c "silence InstanceDown alerts"
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence add alertname="InstanceDown" job=~".*CADvisor.*" -c "silence cAdvisor InstanceDown alerts"
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence query

[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence expire 840158fb-2185-4568-b6c8-413ceaf7d3a5
[root@node1 ~]# /bin/amtool --help    # list command options
Note: if --alertmanager.url is not given, amtool looks it up in $HOME/.config/amtool/config.yml or /etc/amtool/config.yml.
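To avoid repeating --alertmanager.url on every invocation, amtool can read defaults from that config file; a sketch (values are illustrative):

```yaml
# $HOME/.config/amtool/config.yml (or /etc/amtool/config.yml)
alertmanager.url: "http://192.168.42.133:9093"
author: ops@example.com    # default author recorded on new silences
comment_required: true     # refuse to create silences without a -c comment
```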