- The task here is to send an alert email with Alertmanager.
- The environment uses the pre-compiled binary releases of the Prometheus components.
- We will monitor one node's memory and send an email alert when usage exceeds 2% (a deliberately low threshold, for testing).
Environment preparation
Download the binaries from https://prometheus.io/download/:
https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
Extract them:
/root/
├── alertmanager -> alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64.tar.gz
├── node_exporter-0.15.2.linux-amd64
├── node_exporter-0.15.2.linux-amd64.tar.gz
├── prometheus -> prometheus-2.0.0.linux-amd64
├── prometheus-2.0.0.linux-amd64
└── prometheus-2.0.0.linux-amd64.tar.gz
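The tree above can be reproduced roughly as follows (a sketch; the symlinks are only for convenience so the rest of the article can use short paths):

cd /root
wget https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
tar xzf prometheus-2.0.0.linux-amd64.tar.gz
tar xzf alertmanager-0.12.0.linux-amd64.tar.gz
tar xzf node_exporter-0.15.2.linux-amd64.tar.gz
# convenience symlinks
ln -s prometheus-2.0.0.linux-amd64 prometheus
ln -s alertmanager-0.12.0.linux-amd64 alertmanager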
Experiment architecture
Configure Alertmanager
Create alert.yml:
[root@n1 alertmanager]# ls
alertmanager alert.yml amtool data LICENSE NOTICE simple.yml
alert.yml defines who sends the notification, for what event, to whom, and how it is delivered.
cat alert.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'maotai@163.com'
  smtp_auth_username: 'maotai@163.com'
  smtp_auth_password: '123456'

templates:
  - '/root/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'maotai@foxmail.com'
- Once configured, start it:
./alertmanager -config.file=./alert.yml
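A quick sanity check that Alertmanager is up, assuming it listens on its default port 9093 and exposes the v1 status endpoint:

# should return JSON with config and version info
curl -s http://localhost:9093/api/v1/status
# or simply open http://<host>:9093 in a browser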
Configure Prometheus
Alert rule configuration: rule.yml (it will be referenced from prometheus.yml)
Send an email alert when memory usage is above 2% (for testing):
$ cat rule.yml
groups:
- name: test-rule
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached)) / node_memory_MemTotal * 100 > 2
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: High memory usage detected"
      description: "{{ $labels.instance }}: Memory usage is above 2% (current value: {{ $value }})"
The key is this expression:
(node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 2
labels attaches extra labels to this rule.
annotations (the alert description) becomes the body of the notification.
Where do the metric keys (node_memory_MemTotal / node_memory_Buffers / node_memory_Cached) come from? This is covered further below; the rule file can also be validated right away, as in the sketch that follows.
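Before wiring the rule into Prometheus, it can be checked with promtool, which ships in the Prometheus tarball (a sketch, assuming rule.yml sits in /root/prometheus; the exact success message may vary by version):

cd /root/prometheus
./promtool check rules rule.yml
# on success it reports the number of rules found, e.g. "SUCCESS: 1 rules found"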
prometheus.yml configuration
- Add a job for node_exporter.
- Add the alerting rules under rule_files; the rule_files section references rule.yml.
$ cat prometheus.yml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

rule_files:
  - /root/prometheus/rule.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.14.11:9090']
  - job_name: linux
    static_configs:
      - targets: ['192.168.14.11:9100']
        labels:
          instance: db1
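The configuration can be validated and Prometheus started roughly like this (a sketch; Prometheus 2.x uses double-dash flags):

cd /root/prometheus
./promtool check config prometheus.yml
./prometheus --config.file=prometheus.yml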
After configuration, start Prometheus and open its web UI; the node target should now be visible.
View the metrics exposed by node_exporter.
View the Alerts page to see the state of the alerting rule.
The keys used in the expressions can be found there (provided the corresponding exporter is installed); write the alerting expressions against these keys.
Check the received email.
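The metric keys can also be pulled straight from node_exporter's /metrics endpoint instead of the web UI (a sketch, using the node from this setup):

curl -s http://192.168.14.11:9100/metrics | grep '^node_memory'
# e.g. node_memory_MemTotal, node_memory_MemFree, node_memory_Buffers, node_memory_Cached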
WeChat alert configuration
global:
  # The smarthost and SMTP sender used for mail notifications.
  resolve_timeout: 6m
  smtp_smarthost: 'x.x.x.x:25'
  smtp_from: 'maomao@qq.com'
  smtp_auth_username: 'maomao'
  smtp_auth_password: 'maomao@qq.com'
  smtp_require_tls: false
  # The auth token for Hipchat.
  hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat.
  hipchat_api_url: 'https://123'
  wechat_api_url: "https://123"
  wechat_api_secret: "123"
  wechat_api_corp_id: "123"

# The directory from which notification templates are read.
templates:
  - 'templates/*.tmpl'
# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 3s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 1h
  # A default receiver
  receiver: maotai
  routes:
    - match:
        job: "11"
        #service: "node_exporter"
      routes:
        - match:
            status: yellow
          receiver: maotai
        - match:
            status: orange
          receiver: berlin
# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_match:
      service: 'up'
    target_match:
      service: 'mysql'
    # Apply inhibition if the instance label is the same.
    equal: ["instance"]
  - source_match:
      service: "mysql"
    target_match:
      service: "mysql-query"
    equal: ['instance']
  - source_match:
      service: "A"
    target_match:
      service: "B"
    equal: ["instance"]
  - source_match:
      service: "B"
    target_match:
      service: "C"
    equal: ["instance"]
receivers:
  - name: 'maotai'
    email_configs:
      - to: 'maotai@qq.com'
        send_resolved: true
        html: '{{ template "email.default.html" . }}'
        headers: { Subject: "[mail] Tech department monitoring alert email (test)" }
  - name: "berlin"
    wechat_configs:
      - send_resolved: true
        to_user: "@all"
        to_party: ""
        to_tag: ""
        agent_id: "1"
        corp_id: "xxxxxxx"
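To exercise the routing and the email/WeChat receivers without waiting for a real rule to fire, an alert can be pushed to Alertmanager's v1 API by hand (a sketch; the labels here are hypothetical and only need to match one of the routes above):

curl -XPOST -H "Content-Type: application/json" http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {"alertname": "TestAlert", "job": "11", "status": "orange"},
    "annotations": {"summary": "manual test alert"}
  }
]'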