alertmanager
Alertmanager can be deployed on a remote server.
Alerting mechanism
Define your monitoring rules in Prometheus, i.e. configure a trigger: when some value crosses the configured threshold, an alert fires. Prometheus pushes the firing alerts to Alertmanager, which runs them through its processing pipeline and then delivers the notifications to the recipients.
Installation
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxf alertmanager-0.19.0.linux-amd64.tar.gz
mv alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager && cd /usr/local/alertmanager && ls
Configuration file
cat alertmanager.yml
global:
  resolve_timeout: 5m        ## global setting: how long to wait before marking an alert as resolved
route:
  group_by: ['alertname']    ## which label(s) Alertmanager uses to group alerts
  group_wait: 10s            ## after the first alert of a group arrives, wait 10s for others and send them together
  group_interval: 10s        ## interval between notifications for new alerts added to an existing group
  repeat_interval: 1h        ## how often a still-firing alert is re-sent; 1 hour here
  receiver: 'web.hook'       ## default receiver
## receiver definitions
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
## inhibition rules (alert convergence)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Email receiver configuration
cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'   # SMTP server address
  smtp_from: 'xxx@163.com'            # sender address
  smtp_auth_username: 'xxx@163.com'   # auth username
  smtp_auth_password: 'xxxx'          # auth password
  smtp_require_tls: false             # disable TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'                   # name of the receiver to use for alerts
receivers:
  - name: 'email'                     # receiver name
    email_configs:                    # email settings
      - to: 'xx@xxx.com'              # recipient
Check the configuration file
./amtool check-config alertmanager.yml
Configure as a systemd service
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
EOF
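After writing the unit file, reload systemd and start Alertmanager (a typical sequence; enabling it on boot is optional):
systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager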
Integration with Prometheus
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 127.0.0.1:9093    ## Alertmanager address
rule_files:
  - "rules/*.yml"             ## files containing the alerting rules
Configure alerting rules
The alert rules live in /usr/local/prometheus/rules
/usr/local/prometheus/rules]# cat example.yml
groups:
- name: exports.rules        ## group name for these rules; they all check whether an exporter instance is up
  rules:
  - alert: CollectorDown     ## alert name
    expr: up == 0            ## alert expression: watch the up metric; 0 means the monitored target is down, then do the steps below
    for: 1m                  ## fire only after the condition has held for one minute
    labels:                  ## alert severity label
      severity: ERROR
    annotations:             ## how the notification text is rendered; it uses the {{ $labels.instance }} / {{ $labels.job }} values
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"
Template variables used above:
{{ $labels.instance }}   # the instance label of the up metric
{{ $labels.job }}        # the job label of the up metric
Alerts with the same alert name, i.e. the same alertname (grouped according to the alert field in the rule file), are merged into a single email.
Alert routing
The routing policy is set in the Alertmanager configuration file:
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1m
receiver: 'email'
Alert routing example
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
route:
  receiver: 'default-receiver'       ## default receiver, used when no sub-route matches
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]     ## grouping labels
  routes:                            ## sub-routes
  - receiver: 'database-pager'       ## receiver for this sub-route
    group_wait: 10s                  ## grouping settings
    match_re:                        ## regex match
      service: mysql|cassandra       ## match alerts whose service label is mysql or cassandra
  - receiver: 'frontend-pager'       ## receiver name
    group_by: [product, environment] ## grouping labels
    match:                           ## exact match
      team: frontend                 ## match alerts whose team label is frontend
receivers:                           ## receiver definitions
- name: 'default-receiver'           ## receiver name
  email_configs:                     ## email settings
  - to: 'xxx@xx.com'                 ## recipient, and likewise below
- name: 'database-pager'
  email_configs:
  - to: 'xxx@xx.com'
- name: 'frontend-pager'
  email_configs:
  - to: 'xxx@xx.com'
Alert convergence
Convergence means compressing the number of alert emails as much as possible so that key information is not drowned out. Alertmanager has several convergence mechanisms, the main ones being grouping, inhibition and silences. When Alertmanager receives alerts it first groups them, then puts them into the notification queue, where inhibition and silencing are applied to the pending notifications, and finally routes them to the different receivers according to the route tree.
Mechanism    Description
Grouping (group)    merge alerts of a similar nature into a single notification
Inhibition (Inhibition)    once an alert has fired, stop sending other alerts caused by it
Silences (Silences)    a simple mechanism to mute notifications for a specific period of time
Grouping: alerts are grouped by alert name; if several alerts share the same alert name they are merged into one email.
The alert name being matched is the one defined in the Prometheus alerting rules, e.g. in
/usr/local/prometheus/rules/*.yml
- alert: NodeDown
Inhibition: eliminates redundant alerts; configured in Alertmanager:
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['instance']
When an alert with severity critical is received, it inhibits alerts with severity warning. These severity levels are the ones you define when writing the rules, and the last line (equal) lists the labels that must match for the inhibition to apply; here only instance is kept. The simplest example: what does Alertmanager do when it receives a critical alert and then a warning alert with the same instance value?
Say we monitor nginx: the alert for nginx being down has severity warning, and the alert for the host being down has severity critical. If the server running nginx dies, nginx obviously dies with it. Prometheus notices and sends two alerts to Alertmanager: one for the host and one for nginx. Alertmanager sees that one is critical and one is warning and that their instance labels match, i.e. they happened on the same machine, so it only sends out the critical alert and the warning one is inhibited. What we receive is just the "server is down" notification.
Silences:
A mechanism for muting notifications for a specific period of time, using label matchers to select the alerts that should not be sent. For example, if servers will be under maintenance on a given day and may be rebooted, plenty of alerts would fire during that window; configure a silence for that period so those alerts are not sent.
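Silences can be created in the Alertmanager web UI or with amtool. A minimal sketch (the matcher and duration are examples; adjust to your own labels):
amtool silence add instance="192.168.1.145:9100" --comment="maintenance window" --duration=2h --alertmanager.url=http://127.0.0.1:9093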
Alert example
Monitoring memory usage
PromQL:
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )* 100 > 80
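Before putting the expression into a rule you can sanity-check it against the Prometheus HTTP API (a quick sketch; assumes Prometheus listens on localhost:9090):
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80'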
Write the rule:
cd /usr/local/prometheus/rules
cat memory.yml
groups:
- name: memory_rules
  rules:
  - alert: MemoryUsageHigh
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80   # the alert fires while this expression returns data
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage on {{ $labels.instance }} is high"
      description: "Memory usage on {{ $labels.instance }} is high, current usage is {{ $value }}%"
Configure alert routing
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 5m
receiver: 'default-receiver'
routes:
- group_by: ['mysql']
group_wait: 10s
group_interval: 10s
repeat_interval: 5m
receiver: 'mysql-pager'
match_re:
job: mysql
receivers:
- name: 'default-receiver'
email_configs:
- to: 'xxx@xx.com'
- name: 'mysql-pager'
email_configs:
- to: 'xxx@xx.cn'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['instance']
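To check that the sub-route really catches MySQL alerts, you can fire a synthetic alert by hand with amtool (a sketch; the label values are examples):
amtool alert add alertname=TestRoute job=mysql instance=test:9104 --alertmanager.url=http://127.0.0.1:9093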
DingTalk alerting
Build the DingTalk webhook plugin
# Install the Go environment
wget -c https://storage.googleapis.com/golang/go1.8.3.linux-amd64.tar.gz
tar -C /usr/local/ -zxvf go1.8.3.linux-amd64.tar.gz
mkdir -p /home/gocode
cat << 'EOF' >> /etc/profile
export GOROOT=/usr/local/go    # path where Go is installed
export GOPATH=/home/gocode     # default workspace for Go packages
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
EOF
source /etc/profile
----------------------------------------
# Install the DingTalk plugin
cd /home/gocode/
mkdir -p src/github.com/timonwong/
cd /home/gocode/src/github.com/timonwong/
git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
cd prometheus-webhook-dingtalk
make
# Successful build output
[root@mini-install prometheus-webhook-dingtalk]# make
>> formatting code
>> building binaries
> prometheus-webhook-dingtalk
>> checking code style
>> running tests
? github.com/timonwong/prometheus-webhook-dingtalk/chilog [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/cmd/prometheus-webhook-dingtalk [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/models [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/notifier [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/template [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/template/internal/deftmpl [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/webrouter [no test files]
# Create a symlink
ln -s /home/gocode/src/github.com/timonwong/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk /usr/local/bin/prometheus-webhook-dingtalk
## Check
prometheus-webhook-dingtalk --help
usage: prometheus-webhook-dingtalk --ding.profile=DING.PROFILE [<flags>]
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--web.listen-address=":8060"
The address to listen on for web interface.
--ding.profile=DING.PROFILE ...
Custom DingTalk profile (can be given multiple times, <profile>=<dingtalk-url>).
--ding.timeout=5s Timeout for invoking DingTalk webhook.
--template.file="" Customized template file (see template/default.tmpl for example)
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
--version Show application version.
Start the DingTalk plugin
Start the plugin with the webhook URL of the DingTalk robot you created:
prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=OOOOOOXXXXXXOXOXOX9b46d54e780d43b98a1951489e3a0a5b1c6b48e891e86bd"
# Note: several webhook profiles can be configured; the profile name is what the alert URL refers to later
# About the --ding.profile parameter: to support sending alert messages to several DingTalk custom robots at the same time, --ding.profile can be specified multiple times on the command line, for example:
prometheus-webhook-dingtalk \
--ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx" \
--ding.profile="webhook2=https://oapi.dingtalk.com/robot/send?access_token=yyyyyyyyyyy"
This defines two webhooks, webhook1 and webhook2, used to send alert messages to different DingTalk groups.
Then add the corresponding receivers to the Alertmanager configuration (note the url below):
receivers:
- name: send_to_dingding_webhook1
webhook_configs:
- send_resolved: false
url: http://localhost:8060/dingtalk/webhook1/send
- name: send_to_dingding_webhook2
webhook_configs:
- send_resolved: false
url: http://localhost:8060/dingtalk/webhook2/send
## Configure the DingTalk plugin as a systemd service
cat > /usr/lib/systemd/system/dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
[Service]
Restart=on-failure
ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXOOOOOOO0d43b98a1951489e3a0a5b1c6b48e891e86bd"
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl status dingtalk may report an error at this point; ignore it and just run systemctl start dingtalk
## Check the listening port
[root@mini-install system]# ss -tanlp | grep 80
LISTEN 0 128 :::8060 :::* users:(("prometheus-webh",pid=18541,fd=3))
## Quick test
curl -H "Content-Type: application/json" -d '{ "version": "4", "status": "firing", "description":"description_content"}' http://localhost:8060/dingtalk/webhook/send
## Data format Alertmanager passes to the webhook
The webhook receiver allows configuring a generic receiver:
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>
# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]
The Alertmanager will send HTTP POST requests in the following JSON format to the configured endpoint:
{
"version": "4",
"groupKey": <string>, // key identifying the group of alerts (e.g. to deduplicate)
"status": "<resolved|firing>",
"receiver": <string>,
"groupLabels": <object>,
"commonLabels": <object>,
"commonAnnotations": <object>,
"externalURL": <string>, // backlink to the Alertmanager.
"alerts": [
{
"status": "<resolved|firing>",
"labels": <object>,
"annotations": <object>,
"startsAt": "<rfc3339>",
"endsAt": "<rfc3339>",
"generatorURL": <string> // identifies the entity that caused the alert
},
...
]
}
alertmanager
Configuration
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.19.0.linux-amd64.tar.gz
ln -sv `pwd`/alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager
# Configure as a systemd service
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager
[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
EOF
Run systemctl daemon-reload, then start the service
# Edit the configuration file
cd /usr/local/alertmanager
vim alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/webhook/send'
Integration with Prometheus
pwd
/usr/local/prometheus
mkdir rules && cd !$
cat example.yml
groups:
- name: exports.rules        ## group name for these rules; they all check whether an exporter instance is up
  rules:
  - alert: CollectorDown     ## alert name
    expr: up == 0            ## alert expression: watch the up metric; if it equals 0, take the steps below
    for: 1m                  ## fire only after the condition has held for one minute
    labels:                  ## alert severity label
      severity: ERROR
    annotations:             ## how the notification text is rendered; it uses the {{ $labels.instance }} / {{ $labels.job }} values
      summary: "The collector on instance {{ $labels.instance }} is down!!"
      description: "The collector on instance {{ $labels.instance }} (job {{ $labels.job }}) has been down for one minute!!"
cat prometheus.yml
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- 127.0.0.1:9093
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/*.yml"
# - "first_rules.yml"
# - "second_rules.yml"
## Start all the services
Once every target reports as up, shut one node down to see the effect.
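A typical startup sequence (the alertmanager and dingtalk unit names follow the service files above; running Prometheus under systemd with a prometheus unit is an assumption):
systemctl daemon-reload
systemctl restart prometheus
systemctl start alertmanager
systemctl start dingtalk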
References:
https://blog.rj-bai.com/post/158.html
DingTalk plugin author:
https://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
https://github.com/timonwong/prometheus-webhook-dingtalk
DingTalk plugin build: https://blog.51cto.com/9406836/2419876
http://ylzheng.com/2018/03/01/alertmanager-webhook-dingtalk/
DingTalk alerting, Python version
The data format DingTalk accepts is fairly strict (so I'm told). To use DingTalk's markdown message format, I wrote my own small API that reshapes the data Alertmanager sends and forwards it to the DingTalk robot.
The Python webhook server:
import os
import json
import requests
import arrow
from flask import Flask
from flask import request

app = Flask(__name__)

@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'

def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)

def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        # build a markdown message from the alert's labels and annotations
        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alert source: prometheus_alertmanager \n" +
                        "**Severity**: %s \n\n" % output['labels']['status'] +
                        "**Alert name**: %s \n\n" % output['labels']['alertname'] +
                        "**Instance**: %s \n\n" % output['labels']['instance'] +
                        "**Details**: %s \n\n" % output['annotations']['summary'] +
                        "**Started at**: %s \n\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s \n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8060)
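To run the script directly for a quick test (assuming Python 3 with flask, requests and arrow installed), export the robot token first:
export ROBOT_TOKEN=xxxxxxxxxxxxxxxx
python3 main.py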
Package the program as a container
# working directory
# tree
.
├── Dockerfile
└── main.py
main.py is the Flask code above
#cat Dockerfile
FROM tiangolo/uwsgi-nginx-flask:python3.7
# set the DingTalk robot token as an environment variable
ENV ROBOT_TOKEN 47f07271e8a24b6a63486bBSJDFKj346556jhjk9892fk545jjf234jFJ89489JFKSDLF2KgfhsJK234
RUN pip install requests flask arrow -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir
COPY main.py /app
EXPOSE 80
## Build the image
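For example, from the working directory above (the image name dingding matches the docker run command below):
docker build -t dingding .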
## Run the container
docker run -d --restart=always -p 8060:80 dingding
## Test
curl localhost:8060
welcome to use prometheus alertmanager dingtalk webhook server!
Test data:
[root@t1 ~]# cat data.json
{
"version": "3",
"status": "firing",
"receiver": "jdhf",
"alerts": [
{
"labels": {'instance':"192.168.1.145:9100",'alertname':"home目錄可用量", 'status':"嚴重告警"},
"annotations": {'summary': "實例在root掛載點磁盤可用量小於4G!, 當前可用: 2G"}
}
]
}
curl 127.0.0.1:8060 -X POST -d @data.json --header "Content-Type: application/json"
# The test fails
This is because the data Alertmanager sends to the DingTalk bridge contains extra fields that our hand-written test payload lacks. To make this test succeed you would need to modify main.py and drop the start time and end time fields.
## Errors
# DingTalk group notification returns {"errcode":310000,"errmsg":"keywords not in content"} - how to fix it
The custom keyword in the DingTalk robot's security settings is not configured, or the public IP has not been added to the whitelist.
#######################
## At this point the Alertmanager configuration file is:
/usr/local/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- send_resolved: true
url: 'http://localhost:8060'
#url: 'http://localhost:8060/dingtalk/webhook/send'
The modified Python version:
import os
import json
import requests
import arrow
from flask import Flask
from flask import request

app = Flask(__name__)

@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'

def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)

def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        # the pod label may be called pod or pod_name depending on the source
        try:
            pod_name = output['labels']['pod']
        except KeyError:
            try:
                pod_name = output['labels']['pod_name']
            except KeyError:
                pod_name = 'null'
        try:
            namespace = output['labels']['namespace']
        except KeyError:
            namespace = 'null'
        # prefer the message annotation, fall back to description
        try:
            message = output['annotations']['message']
        except KeyError:
            try:
                message = output['annotations']['description']
            except KeyError:
                message = 'null'
        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alert source: prometheus_alert \n" +
                        "**Severity**: %s \n\n" % output['labels']['severity'] +
                        "**Alert name**: %s \n\n" % output['labels']['alertname'] +
                        "**Pod**: %s \n\n" % pod_name +
                        "**Namespace**: %s \n\n" % namespace +
                        "**Details**: %s \n\n" % message +
                        "**Status**: %s \n\n" % output['status'] +
                        "**Started at**: %s \n\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s \n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
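Note that this modified version listens on port 5000 directly instead of sitting behind the container's 8060->80 mapping, so if you run it this way the webhook url in the alertmanager.yml shown above should point at port 5000, and a quick manual test looks like:
curl 127.0.0.1:5000 -X POST -d @data.json --header "Content-Type: application/json"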