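1. Overview
1.1 Topology diagram
1.2 Glossary
Grafana: visualizes monitoring data showing how the containers are running.
Prometheus: an open-source systems monitoring and alerting toolkit.
Alertmanager: a standalone component that receives and processes alerts sent by the Prometheus server (or by other client programs).
cAdvisor: collects information about every container running on a machine, and also provides a basic query UI and an HTTP interface that Prometheus can scrape.
2. Deploy Grafana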
docker run -d -p 5000:3000 \
-v /home/grafana:/var/lib/grafana \
--name grafana grafana/grafana:latest
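Once the container is running, it can be worth confirming that Grafana answers on the published port before moving on. The following is a minimal sketch, assuming curl is installed on the host; the function names grafana_health_url and grafana_is_up are hypothetical helpers introduced here, while /api/health is Grafana's health endpoint and 5000 is the host port from the command above.

```shell
#!/bin/sh
# Hypothetical helper: build the health-check URL for a Grafana host and port.
grafana_health_url() {
    # $1 = host, $2 = published host port (5000 in the docker run command above)
    printf 'http://%s:%s/api/health\n' "$1" "$2"
}

# Hypothetical helper: returns non-zero on connection errors or HTTP error statuses.
grafana_is_up() {
    curl -fsS -o /dev/null "$(grafana_health_url "$1" "$2")"
}
```

Example use on the Grafana host: grafana_is_up 127.0.0.1 5000 && echo "Grafana is up"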
3. Deploy cAdvisor
Deploy it on every node that is to be monitored.
docker run -d \
-v /:/rootfs:ro \
-v /var/run:/var/run:ro \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
-v /dev/disk/:/dev/disk:ro \
-p 8888:8080 \
--detach=true \
--name=cadvisor \
--restart=always \
google/cadvisor:latest
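Before pointing Prometheus at a node, it may help to check that cAdvisor is actually exposing container metrics on port 8888. The sketch below counts the samples of one metric in the text exposition format; count_metric_samples is a hypothetical helper name, and the metric name is one of those used by the alert rules in section 5.4.

```shell
#!/bin/sh
# Hypothetical helper: count samples of one metric in Prometheus text format read from stdin.
count_metric_samples() {
    # $1 = metric name. A sample line starts with the name followed by '{' or a space,
    # so '# HELP' and '# TYPE' comment lines are not counted.
    grep -c "^$1[{ ]"
}
```

Example use against one of the monitored nodes: curl -s http://10.0.0.194:8888/metrics | count_metric_samples container_cpu_usage_seconds_total. A result of 0 suggests the endpoint is reachable but the metric is missing.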
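4. Deploy Alertmanager
4.1 Deployment
tar xf alertmanager-0.21.0.linux-amd64.tar.gz -C /home/
Home directory: /home/alertmanager-0.21.0 (the archive unpacks to alertmanager-0.21.0.linux-amd64, so this path assumes the directory was renamed)
4.2 Create the startup script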
[root@autodeploy alertmanager-0.21.0]# cat start.sh
#!/bin/bash
# Stop any running Alertmanager first; pgrep -x matches the exact process name only.
pid=$(pgrep -x alertmanager)
[ -n "$pid" ] && kill -9 $pid
nohup ./alertmanager --config.file=alertmanager.yml --storage.path=data --log.level=debug &
4.3 Configuration file
[root@autodeploy alertmanager-0.21.0]# vim alertmanager.yml
global:
  resolve_timeout: 5m
templates:
  - '/home/alertmanager-0.21.0/rules/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'wechat'
receivers:
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'ww f53808cd2d0'
        agent_id: '10 0003'
        api_secret: 't6pdo4QRF4z_EyZWDXNlLRq-2Ahmtefu3Wt99uKyw'
        to_user: '@all'
        send_resolved: true
4.4 Alert message template
[root@autodeploy alertmanager-0.21.0]# cat rules/weixin.tmpl
{{ define "wechat.default.message" }}
{{ range $i, $alert := .Alerts }}
======== Monitoring alert ========
Status: {{ .Status }}
Severity: {{ $alert.Labels.severity }}
Alert type: {{ $alert.Labels.alertname }}
Application: {{ $alert.Labels.name }}
Host: {{ $alert.Labels.instance }}
Details: {{ $alert.Annotations.description }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
======== end ========
{{ end }}
{{ end }}
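The 28800e9 passed to .StartsAt.Add in the template is a duration in nanoseconds, the unit Go uses for durations. It is presumably there to shift a UTC start time to UTC+8 before formatting. The arithmetic can be checked with a small sketch; ns_to_hours and hours_to_ns are hypothetical helper names, and the second one is only meant for adapting the offset to another time zone.

```shell
#!/bin/sh
# Hypothetical helper: convert a whole number of nanoseconds to whole hours.
ns_to_hours() {
    echo $(( $1 / 1000000000 / 3600 ))
}

# Hypothetical helper: the inverse direction, whole hours to nanoseconds.
hours_to_ns() {
    echo $(( $1 * 3600 * 1000000000 ))
}

ns_to_hours 28800000000000   # 28800e9 written out in full; prints 8
```

For a different offset, hours_to_ns gives the value to put in the template, for example hours_to_ns 9 for UTC+9.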
4.5 Start the alerting service
[root@autodeploy alertmanager-0.21.0]# sh start.sh
[root@autodeploy alertmanager-0.21.0]# tail -f nohup.out
4.6 Access
http://ip:9093
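To check the WeChat route end to end without waiting for a real rule to fire, a test alert can be posted to Alertmanager directly. This is a rough sketch, assuming curl is available and that the v2 API (/api/v2/alerts) is enabled, as it is by default in 0.21.0. The function names and the alert values are made up for illustration, and the label and annotation keys mirror the ones the template in 4.4 reads.

```shell
#!/bin/sh
# Hypothetical helper: print a one-alert JSON payload for Alertmanager's v2 API.
build_test_alert() {
    # $1 = alertname, $2 = severity, $3 = description
    # (values are inserted verbatim, so they must not contain double quotes)
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"description":"%s"}}]\n' "$1" "$2" "$3"
}

# Hypothetical helper: POST the payload to the Alertmanager at $1 (host:port).
send_test_alert() {
    build_test_alert "$2" "$3" "$4" |
        curl -fsS -H 'Content-Type: application/json' --data @- "http://$1/api/v2/alerts"
}
```

Example use with the Alertmanager address from the Prometheus configuration: send_test_alert 10.0.0.189:9093 ManualTest critical "manual test alert". The alert should then appear in the web UI and, after group_wait, in WeChat.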
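5. Deploy Prometheus
5.1 Deployment
tar xf prometheus-2.25.2.linux-amd64.tar.gz -C /home/
Home directory: /home/prometheus-2.25.2 (the archive unpacks to prometheus-2.25.2.linux-amd64, so this path assumes the directory was renamed)
5.2 Create the startup script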
[root@autodeploy prometheus-2.25.2]# cat start.sh
#!/bin/bash
# Stop any running Prometheus first; pgrep -x matches the exact process name only.
pid=$(pgrep -x prometheus)
[ -n "$pid" ] && kill -9 $pid
nohup ./prometheus --config.file=prometheus.yml &
5.3 Edit the configuration file
[root@autodeploy prometheus-2.25.2]# vim prometheus.yml
# Alertmanager configuration: where alerts are sent
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['10.0.0.189:9093']
# Directory containing the alerting rules
rule_files:
  - "rules/*"
# Scrape targets on each node
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
  # Despite its name, this job scrapes the cAdvisor instances published on port 8888.
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - "10.0.0.194:8888"
          - "10.0.0.195:8888"
          - "10.0.0.196:8888"
5.4 Alert trigger rules
1) Alert when a container's outbound or inbound bandwidth exceeds 50 MiB/s
[root@autodeploy rules]# cat container.network.yml
groups:
  - name: container.rules.network
    rules:
      # Outbound traffic is what the container transmits.
      - alert: Container-network-output-alarm
        expr: sum by (name,instance) (irate(container_network_transmit_bytes_total{instance=~"10.0.0.*:8888",name=~".+",image!=""}[3m])) /1024/1024 >50
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Container outbound bandwidth is above 50 MiB/s"
      # Inbound traffic is what the container receives.
      - alert: Container-network-input-alarm
        expr: sum by (name,instance) (irate(container_network_receive_bytes_total{instance=~"10.0.0.*:8888",name=~".+",image!=""}[3m])) /1024/1024 >50
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Container inbound bandwidth is above 50 MiB/s"
2) Alert when container memory exceeds 1000 MB (1,000,000,000 bytes)
[root@autodeploy rules]# cat container.memory.yml
groups:
  - name: container.rules.memory
    rules:
      - alert: Container-memory-alarm
        expr: sum(container_memory_rss{instance=~"10.0.0.*:8888",name=~".+"}) by (name,instance) > 1000000000
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Container RSS memory is above 1000 MB"
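Note that the rules use two different unit conventions: the network rule divides by 1024 twice (binary mebibytes), while the memory threshold is a round decimal byte count. A quick conversion shows how far apart the two are. This is only an arithmetic sketch, with bytes_to_mib and mib_to_bytes as hypothetical helper names and integer division rounding down.

```shell
#!/bin/sh
# Hypothetical helper: bytes to whole MiB (integer division, rounds down).
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Hypothetical helper: MiB to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

bytes_to_mib 1000000000   # the memory threshold above; prints 953
mib_to_bytes 1000         # bytes in 1000 MiB; prints 1048576000
```

If the intent is an exact 1000 MiB limit, the value printed by mib_to_bytes 1000 could be used as the threshold instead.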
3) Alert when container CPU utilization exceeds 60%
[root@autodeploy rules]# cat container.CPUutilization.yml
groups:
  - name: container.rules.CPUutilization
    rules:
      - alert: Container-CPUutilization-alarm
        expr: sum(irate(container_cpu_usage_seconds_total{instance=~"10.0.0.*:8888",name=~".+",image!=""}[1m])) without (cpu) * 100 > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Container CPU utilization is above 60%"
To test the alert, put CPU load on a container by running the following inside it:
echo "scale=5000; 4*a(1)" | bc -l -q
4) Alert when cache usage exceeds 500 MB (500,000,000 bytes)
[root@autodeploy rules]# cat container.cache.yml
groups:
  - name: container.rules.cache
    rules:
      - alert: Container-cache-alarm
        expr: sum(container_memory_cache{instance=~"10.0.0.*:8888",name=~".+"}) by (name,instance) >500000000
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Container cache usage is above 500 MB"
5.5 Start the monitoring service
[root@autodeploy prometheus-2.25.2]# sh start.sh
[root@autodeploy prometheus-2.25.2]# tail -f nohup.out
5.6 Access
http://ip:9090
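Besides opening the web UI, the scrape status of the cAdvisor targets can be checked from the command line through the Prometheus HTTP API. The sketch below counts healthy targets in the response of /api/v1/targets. count_up_targets is a hypothetical helper, and the grep pattern assumes the compact JSON that Prometheus returns; a JSON-aware tool such as jq would be more robust if it is available.

```shell
#!/bin/sh
# Hypothetical helper: count targets reported as healthy in /api/v1/targets JSON read from stdin.
count_up_targets() {
    # grep -o prints each match on its own line; wc -l counts them.
    grep -o '"health":"up"' | wc -l | tr -d ' '
}
```

Example use, with PROM_HOST standing in for the Prometheus server address: curl -s "http://$PROM_HOST:9090/api/v1/targets" | count_up_targets. With the configuration in 5.3, three healthy cAdvisor targets would be expected.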

6. WeChat alert display
Recommended Grafana dashboard template: ID 11600