Thanks to the original author for sharing: http://bjbsair.com/2020-04-07/tech-info/30650.html
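1. Specify the alerting service and rule files
Tell Prometheus which alert management service to send alerts to and which alerting rule files to load. Here the alert service (Alertmanager) is deployed in Kubernetes and exposed under the service name alertmanager on port 9093. The alerting rules are all .yml files under the "/etc/prometheus/rules/" directory.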
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alert service (Alertmanager) that receives the alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
# Alerting rule files to load
rule_files:
  - "/etc/prometheus/rules/*.yml"
  # - "second_rules.yml"
# Scrape configurations, one job per monitored endpoint.
# The first job scrapes Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter-np:9121']
  - job_name: 'node'
    static_configs:
      - targets: ['prometheus-prometheus-node-exporter:9100']
  - job_name: 'windows-node-001'
    static_configs:
      - targets: ['10.0.32.148:9182']
  - job_name: 'windows-node-002'
    static_configs:
      - targets: ['10.0.34.4:9182']
  - job_name: 'rabbit'
    static_configs:
      - targets: ['prom-rabbit-prometheus-rabbitmq-exporter:9419']
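2. Define the alerting rules
Define the alerting rules that Prometheus evaluates to decide when to send an alert to the alert service. The rule below reports every instance that is not up, so recipients learn which instances failed to start.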
# rules
groups:
  - name: node-rules
    rules:
      - alert: InstanceDown # alert name
        expr: up == 0 # condition that triggers the alert
        for: 3s # how long the condition must hold before the alert is sent
        labels: # labels attached to the alert
          team: k8s
        annotations: # alert message
          summary: "{{$labels.instance}}: has been down"
          description: "{{$labels.instance}}: job {{$labels.job}} has been down"
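Two details of this rule interact with the other files. The inhibit rule in section 3 matches on a `severity` label, which this rule does not set, and `for: 3s` is shorter than the 15s `evaluation_interval` from section 1, so the alert would typically move from pending to firing on the very next evaluation. The following is only a sketch of one possible variant; the `severity: critical` label and the one-minute hold are illustrative assumptions, not values from the original setup.

```yaml
# Illustrative variant of the InstanceDown rule.
# The severity label and the 1m hold are assumptions, not part of the original.
groups:
  - name: node-rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m                 # assumed: target must stay down for 1 minute before firing
        labels:
          team: k8s
          severity: critical    # assumed label, so severity-based inhibit rules can match
        annotations:
          summary: "{{ $labels.instance }}: has been down"
          description: "{{ $labels.instance }}: job {{ $labels.job }} has been down for more than 1 minute"
```

A longer `for` trades slower notification for fewer alerts caused by a single failed scrape; the right value depends on how quickly an outage has to be noticed.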
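3. Configure alert routing and receivers
Alerts are delivered by email here: when the alert service receives an alert, it emails the alert to the configured recipient.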
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25' # SMTP server (host:port) used to send the mail
  smtp_from: 'xxx@163.com' # sender address
  smtp_auth_username: 'xxx' # mailbox user name
  smtp_auth_password: 'SYNUNQBZMIWUQXGZ' # mailbox password or authorization code
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxxx@aliyun.com' # address that receives the alerts
        headers: { Subject: "[WARN] Alert email" } # subject of the alert email
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
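The rule in section 2 attaches the label `team: k8s`, which the routing tree could use to send those alerts to a dedicated receiver instead of the default one. What follows is a minimal sketch under stated assumptions: the receiver name `k8s-email` and the address `k8s-team@example.com` are hypothetical placeholders, and the `global` SMTP settings are assumed to stay as configured above. `match` follows the older syntax also used by `source_match`; newer Alertmanager releases prefer `matchers`.

```yaml
# Sketch only: route alerts carrying team=k8s to their own receiver.
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'                # default receiver, as in the original
  routes:
    - match:
        team: k8s                  # label set by the InstanceDown rule
      receiver: 'k8s-email'        # hypothetical receiver name
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxxx@aliyun.com'
  - name: 'k8s-email'
    email_configs:
      - to: 'k8s-team@example.com' # placeholder address
```

Alerts that match no child route still fall through to the default `email` receiver, so the original behaviour is kept for everything without the `team: k8s` label.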
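4. Verification
Of the instances Prometheus monitors in this setup, redis and windows-node-002 did not start normally, so under the alerting rule above an alert for each of them should be sent to the recipient's mailbox.
The alert received in the recipient's mailbox is shown below.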