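Previous post: Installing Prometheus from the binary release
Next, we extend the monitoring pipeline so that alerts are delivered to WeChat Work (企業微信).
Look up your enterprise ID (corp ID) first; it is needed later in the configuration file.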
IV. Install Alertmanager
1. Prepare the installation package
-- Choose the Linux tar package from the link given above
alertmanager-0.22.2.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
2. After the download finishes, extract the archive and copy it under /usr/local/prometheus so everything is managed in one place
[root@zhoujt prometheus]# tar -zxvf alertmanager-0.22.2.linux-amd64.tar.gz
[root@zhoujt prometheus]# cp -r alertmanager-0.22.2.linux-amd64 /usr/local/prometheus/alertmanager
[root@zhoujt prometheus]# cd /usr/local/prometheus/alertmanager/
[root@zhoujt alertmanager]# ls
alertmanager  alertmanager.yml  amtool  LICENSE  NOTICE
[root@zhoujt alertmanager]# ./alertmanager --version
alertmanager, version 0.22.2 (branch: HEAD, revision: 44f8adc06af5101ad64bd8b9c8b18273f2922051)
  build user:       root@b595c7f32520
  build date:       20210602-07:50:37
  go version:       go1.16.4
  platform:         linux/amd64
3. Configure Alertmanager
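[zhoujt@zhoujt alertmanager]$ cat alertmanager.yml
global:
  # An alert that has not been updated for 5 minutes is treated as resolved
  resolve_timeout: 5m
  # SMTP settings
  # smtp_smarthost: smtp.263.net:587
  # smtp_from: no-reply@xxx.com
  # smtp_auth_username: no-reply@xxx.com
  # smtp_auth_password: xxx

# Root node of the routing tree; every incoming alert starts here
route:
  group_by: ['alertname']   # Alerts that share these labels are grouped together
  group_wait: 10s           # Delay before the first notification for a new group
  group_interval: 10s       # Wait before notifying about new alerts added to an existing group
  repeat_interval: 1m       # Interval for re-sending a notification that is still firing
  receiver: 'wechat'        # Receiver that gets the alerts

receivers:                  # Alert receivers
- name: 'wechat'            # Receiver name; must match route.receiver
  wechat_configs:
  - corp_id: 'wwfaxxxxxxxxxxxx'                  # WeChat Work corp ID (My Enterprise > Enterprise Info)
    to_party: '1'                                # Department (party) ID that receives the alerts
    to_user: '1'                                 # User ID that receives the alerts
    agent_id: '1000002'                          # ID of the application you created
    api_secret: 'o22cBPAm3xxxxxxxxxxxxxxxxxxx'   # Application secret
    send_resolved: true                          # Also notify when an alert is resolved

# Inhibition rules. With threshold alerts, anything that reaches the higher
# severity has also reached warning, so there is no need to send both.
inhibit_rules:
- source_match:
    severity: 'major'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']

templates:                  # Alert message templates
- '/usr/local/prometheus/alertmanager/*.tmpl'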
Once the file is written, use the bundled amtool to check its syntax
[zhoujiangtao@root alertmanager]$ ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 1 templates
  SUCCESS
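amtool only validates the syntax of the file; it does not tell you whether corp_id and api_secret are accepted by WeChat Work. A rough way to check them by hand is sketched below. It assumes the WeChat Work token endpoint `gettoken` under https://qyapi.weixin.qq.com/cgi-bin, and the helper names `wechat_token_url` and `wechat_check` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Alertmanager.
# Assumption: WeChat Work issues access tokens at <base>/gettoken.
WECHAT_API="https://qyapi.weixin.qq.com/cgi-bin"

# wechat_token_url CORP_ID API_SECRET: print the token request URL.
wechat_token_url() {
    printf '%s/gettoken?corpid=%s&corpsecret=%s\n' "$WECHAT_API" "$1" "$2"
}

# wechat_check CORP_ID API_SECRET: request a token and print the raw JSON reply.
wechat_check() {
    curl -sS "$(wechat_token_url "$1" "$2")"
}
```

Usage: `wechat_check 'wwfaxxxxxxxxxxxx' 'o22cBPAm3xxxxxxxxxxxxxxxxxxx'` with your real values. A reply carrying an error code of 0 and an access token should mean the pair is valid.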
4. Configure the alert message template
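Note: all of these configuration files must be saved as UTF-8, otherwise the service will not start
- file filename # show the file type and encoding
UTF-8 Unicode text
- :set fileencoding=utf-8 # in vim, convert the file to UTF-8 before saving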
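If a file turns out not to be UTF-8, it can also be converted outside the editor. The sketch below uses iconv; the function names `is_utf8` and `to_utf8` are invented for this example, and you have to know the source encoding (for instance GBK) yourself, because iconv cannot guess it.

```shell
#!/bin/sh
# is_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# to_utf8 SOURCE_ENCODING FILE: rewrite FILE in place as UTF-8.
to_utf8() {
    tmp=$(mktemp) || return 1
    if iconv -f "$1" -t UTF-8 "$2" >"$tmp"; then
        cat "$tmp" >"$2"    # overwrite the contents, keep owner and mode
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage: `is_utf8 wechat.tmpl || to_utf8 GBK wechat.tmpl`, where wechat.tmpl stands for whatever template file you created.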
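P.S. Do not change the date string inside the template. It is not an arbitrary timestamp but the reference time that Go uses to describe layouts: January 2, 3:04:05 PM, 2006, time zone -0700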
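To see why that particular string is special, the sketch below prints Go's reference instant (Unix time 1136239445) in MST using the strftime pattern that corresponds to the layout "2006-01-02 15:04:05". It assumes GNU date, and `go_reference_time` is a name made up for this example.

```shell
#!/bin/sh
# Unix timestamp of Go's reference time, Mon Jan 2 15:04:05 MST 2006.
GO_REF_EPOCH=1136239445

# go_reference_time: print the reference instant in MST (UTC-7) with the
# strftime equivalent of the Go layout "2006-01-02 15:04:05".
# Assumes GNU date, which accepts the -d @SECONDS form.
go_reference_time() {
    TZ=MST7 date -d "@$GO_REF_EPOCH" '+%Y-%m-%d %H:%M:%S'
}

go_reference_time   # prints 2006-01-02 15:04:05
```

In other words, the layout string is simply what the reference instant looks like in the desired format, which is why editing its digits changes the format rather than the time.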
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
======== Alert firing ========
Alert name: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
Instance: {{ $alert.Labels.instance }} {{ $alert.Labels.device }}
Details: {{ $alert.Annotations.summary }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
========== END ==========
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
======== Alert resolved ========
Alert name: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
Instance: {{ $alert.Labels.instance }}
Details: {{ $alert.Annotations.summary }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
========== END ==========
{{- end }}
{{- end }}
{{- end }}
5. Test that alerting works; start by writing the alert rules
groups:
- name: mem-rule
  rules:
  - alert: "Memory alert"
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 10
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Service: {{$labels.alertname}} memory alert"
      description: "{{ $labels.alertname }} memory utilization is above 10%"
      value: "{{ $value }}"

- name: node-up
  rules:
  - alert: "Node status"
    expr: up{job="node-exporter"} == 0   # For testing, compare against 1 instead when stopping the node is inconvenient
    for: 5s
    labels:
      severity: ERROR
      level: error
    annotations:
      summary: "{{ $labels.instance }} has been down for 15s!"
      description: "{{ $labels.instance }} anomaly detected! Please pay close attention!!!"
      value: "{{ $value }}"

- name: node_health
  rules:
  - alert: HighMemoryUsage
    # MemAvailable / MemTotal is the available fraction, so this fires while more than 90% of memory is free
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.9
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: High memory usage
  # The two HighDiskUsage rules compare the free fraction of / and differ only
  # in threshold and severity, which makes them a test pair for the inhibit rule
  - alert: HighDiskUsage
    expr: node_filesystem_free_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'} > 0.7
    for: 1m
    labels:
      severity: major
    annotations:
      summary: High Disk usage
  - alert: HighDiskUsage
    expr: node_filesystem_free_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'} > 0.71
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: High Disk usage
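The memory expression in the first rule is plain arithmetic over values that node_exporter reads from /proc/meminfo on Linux. As a sanity check, the same formula can be evaluated locally. This is only a sketch: `mem_used_percent` is a name invented here, and it reads meminfo-style text from stdin.

```shell
#!/bin/sh
# mem_used_percent: read /proc/meminfo-style text on stdin and print
# (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100.
mem_used_percent() {
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END {
            if (total > 0)
                printf "%.2f\n", (total - (free + buffers + cached)) / total * 100
        }
    '
}
```

Usage: `mem_used_percent < /proc/meminfo`. If the printed value is above the rule's threshold of 10, the rule should enter the pending state and fire after its 30s `for` period.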
6. Create systemd units for both services so they start on boot and are easy to manage
[zhoujt@zhoujt rules]$ cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network.target

[Service]
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
ExecReload=/bin/kill -s HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

[zhoujt@zhoujt prometheus]$ cat /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
# With Type=notify the service keeps restarting
Type=simple
User=prometheus
# --storage.tsdb.path is optional; by default data goes to ./data under the working directory
# --web.enable-lifecycle allows reloading Prometheus over HTTP; without it every config change needs a restart, which is not ideal
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/home/prometheus/prometheus-data --web.enable-lifecycle
#ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/home/prometheus/prometheus-date --web.listen-address=:9099
Restart=on-failure

[Install]
WantedBy=multi-user.target
7. Edit the Prometheus configuration file so that it uses Alertmanager
# Alertmanager configuration
#alerting:
#  alertmanagers:
#  - static_configs:
#    - targets:
#      - 127.0.0.1:9093
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
8. The basic configuration is done; start the services and check the ports
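- Reload Prometheus: curl -X POST http://localhost:9090/-/reload
  or: systemctl reload prometheus (works only if the unit file defines ExecReload)
- Start Alertmanager: systemctl enable alertmanager && systemctl start alertmanager

Check the listening ports (for example with netstat -tlnp):
tcp6       0      0 :::9090                 :::*                    LISTEN      11401/prometheus
tcp6       0      0 :::9093                 :::*                    LISTEN      30974/alertmanager
tcp6       0      0 :::9094                 :::*                    LISTEN      30974/alertmanager
Open ports 9090 and 9093 in a browser to see the current state of each service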
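The port check can be scripted as well. The sketch below scans `netstat -tln` style output for a listening port; `port_listening` is a name made up for this example. In the listing, 9090 is Prometheus, 9093 is the Alertmanager web and API port, and 9094 is Alertmanager's default cluster port.

```shell
#!/bin/sh
# port_listening PORT: read `netstat -tln` style output on stdin and succeed
# if some LISTEN line has a local address ending in :PORT.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            n = split($4, parts, ":")      # ":::9093" -> last piece is the port
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `netstat -tln | port_listening 9093 && echo "alertmanager is listening"`.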
9. The services are up and running
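With both services up, the notification path can also be exercised directly by posting a hand-made alert to Alertmanager, without waiting for a rule to fire. This is a sketch only: it assumes Alertmanager is reachable on localhost:9093 and uses the v2 API endpoint /api/v2/alerts; the function names are invented for this example.

```shell
#!/bin/sh
# alert_payload NAME SEVERITY INSTANCE: print a one-element JSON array of
# alerts. Values must not contain quotes or backslashes, because no JSON
# escaping is done here.
alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2" "$3"
}

# send_test_alert NAME SEVERITY INSTANCE: POST the payload to Alertmanager.
# Assumption: Alertmanager listens on localhost:9093.
send_test_alert() {
    alert_payload "$1" "$2" "$3" |
        curl -sS -X POST -H 'Content-Type: application/json' \
             --data @- http://localhost:9093/api/v2/alerts
}
```

Usage: `send_test_alert ManualTest warning testhost:9100`. Because no end time is given, the alert should be treated as resolved once resolve_timeout passes without it being sent again, so a recovery message is expected a few minutes later when send_resolved is true.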
10. To test, change a few parameters in the rules so that an alert fires.
When an alert fires:
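Besides checking whether a node is alive, many other metrics can be monitored, for example CPU load, memory usage, disk space, and network load. Each of these can be turned into an alert rule by writing a PromQL expression that compares a value against a threshold, which gives you the range of alerts needed in day-to-day operations.
That completes WeChat Work alerting. Email alerting can be configured later; its settings are present in the configuration file but commented out.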