【Best Practices】prometheus + grafana + Alertmanager: a basic walkthrough

Reposted article, originally published 2020-08-11 15:45.

Source: bilibili

【0】Preliminaries

Learning objectives

Requirements

Lab environment (must read)

prometheus server: 192.168.175.131:9090

grafana / alertmanager server: 192.168.175.130:3000 (grafana), 192.168.175.130:9093 (alertmanager)

Monitored server: 192.168.175.129

prometheus default port: 9090

node_exporter default port: 9100

grafana default port: 3000

mysqld_exporter default port: 9104

alertmanager default port: 9093

【1】Downloading and installing prometheus

【1.1】Download

Official site: https://prometheus.io/download/#mysqld_exporter

linux:

mkdir /soft
cd /soft 
wget https://github.com/prometheus/prometheus/releases/download/v2.20.1/prometheus-2.20.1.linux-amd64.tar.gz

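【1.2】Install

#【1】Install Go (optional: the release tarball ships prebuilt binaries, so Go is only needed when building from source)
yum -y install go

#【2】Install the prometheus server
cd /soft
tar -zxf prometheus-2.20.1.linux-amd64.tar.gz
ln -s prometheus-2.20.1.linux-amd64 prometheus
cd prometheus


#【3】Start

# Start prometheus with --web.enable-lifecycle so that configuration changes can be applied without restarting the process
# With that flag set, curl -X POST http://localhost:9090/-/reload reloads the configuration file online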

nohup ./prometheus --config.file=./prometheus.yml --web.enable-lifecycle &
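The reload workflow above can be wrapped in a small helper. This is a minimal sketch rather than part of the original setup: the function names reload_url and safe_reload are hypothetical, and it assumes curl and promtool are on the PATH (promtool ships in the same tarball as prometheus). It validates the configuration first so that a broken file is not submitted for reload.

```shell
#!/bin/sh
# Hypothetical helper names; only the /-/reload endpoint and
# `promtool check config` come from the surrounding text.

# Build the lifecycle reload URL for a prometheus server (defaults: localhost:9090).
reload_url() {
  printf 'http://%s:%s/-/reload\n' "${1:-localhost}" "${2:-9090}"
}

# Validate the config file with promtool (when available), then trigger the reload.
# Requires prometheus to have been started with --web.enable-lifecycle.
safe_reload() {
  config_file="$1"
  if command -v promtool >/dev/null 2>&1; then
    promtool check config "$config_file" || return 1
  fi
  curl -fsS -X POST "$(reload_url "$2" "$3")"
}
```

Usage, run on the prometheus server: safe_reload /soft/prometheus/prometheus.yml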

Configure and verify prometheus.yml

Everything below the "####down is add info####" marker is newly added. To match the plan above, the scrape targets for mysqld_exporter and node_exporter are configured up front.

vim prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

####down is add info #############
  - job_name: 'agent_linux'
    static_configs:
    - targets: ['192.168.175.129:9100']
      labels:
        name: linux_db1

  - job_name: 'agent_mysql'
    static_configs:
    - targets: ['192.168.175.129:9104']
      labels:
        name: mysql_db1

Check the configuration file for syntax and reference errors: ./promtool check config prometheus.yml

Extension: wrap it as a system service (optional)

vi /usr/lib/systemd/system/prometheus.service

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]

# When Type is set to notify, the service keeps restarting
#User=prometheus
Type=simple
ExecStart=/soft/prometheus/prometheus --config.file=/soft/prometheus/prometheus.yml --web.enable-lifecycle
Restart=on-failure

[Install]
WantedBy=multi-user.target

After writing the unit file, reload systemd, then start prometheus as a service

systemctl daemon-reload

pkill prometheus

systemctl start prometheus

systemctl status prometheus

systemctl enable prometheus

【1.3】Verify

ps -ef|grep prome
netstat -anp|grep 9090

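Enter the IP and port (192.168.175.131:9090) in a browser to open the web UI

【1.4】Collector information

http://192.168.175.131:9090/metrics

From another machine, use the server IP rather than localhost

As shown above, prometheus scrapes itself by default (the 'prometheus' job). Its own metrics are visible at http://192.168.175.131:9090/metrics

【1.5】Viewing prometheus monitoring data and graphs

The buttons here switch between two display modes: one shows raw values, the other a graph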

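【2】Installing node_exporter to monitor a Linux host

Install on 192.168.175.129. The node_exporter default port is 9100.

【2.1】What is node_exporter

Suppose you have a server and want its runtime metrics, such as current CPU load, system load, memory consumption, disk usage, and network I/O.

You can run node_exporter on that server. It collects these metrics and exposes them over an HTTP endpoint that you, and prometheus, can query. Let's try it.

【2.2】Download node_exporter

Official site: https://github.com/prometheus/node_exporter/releases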

linux:

mkdir /soft
cd /soft
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz

【2.3】Install and start

sudo tar -zxf node_exporter-1.0.1.linux-amd64.tar.gz
ln -s node_exporter-1.0.1.linux-amd64 node_exporter
cd node_exporter-1.0.1.linux-amd64
nohup ./node_exporter &

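On a successful start, node_exporter prints its startup log. Watch the lines that follow for errors.

【2.4】Verify

（1）Verify with curl

curl http://192.168.175.129:9100/metrics

If metrics come back, the exporter is working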
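Checking the /metrics output by eye gets tedious, so one value can be pulled out of it instead. The sketch below assumes the Prometheus text exposition format (one "name value" pair per line) and uses node_load1 as an example series; the function name metric_value is hypothetical, and the helper only handles series without labels.

```shell
#!/bin/sh
# Print the value of one un-labelled series from Prometheus text
# exposition format read on stdin. Comment lines (# HELP / # TYPE)
# are skipped because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Usage: curl -s http://192.168.175.129:9100/metrics | metric_value node_load1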

（2）Verify the process and port

ps -ef|grep node_exporter
netstat -anp|grep 9100

（3）Verify in the prometheus UI

192.168.175.131:9090

Status => Targets

【2.5】Wrap it as a system service (optional)

vi /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
#User=prometheus
ExecStart=/soft/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Then apply the unit file with the following systemctl commands
systemctl daemon-reload
pkill node_exporter
systemctl start node_exporter
systemctl status node_exporter
systemctl enable node_exporter

【2.6】Reloading the prometheus configuration online

curl -X POST http://localhost:9090/-/reload

【3】Collecting MySQL metrics with mysqld_exporter

Install on 192.168.175.129. The mysqld_exporter default port is 9104.

【3.1】Download

Official site: https://prometheus.io/download/#mysqld_exporter

linux:

mkdir /soft
cd /soft
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz

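【3.2】Extract, configure, start, verify

tar -zxf mysqld_exporter-0.12.1.linux-amd64.tar.gz
cd mysqld_exporter-0.12.1.linux-amd64/
vim mysqld_exporter.cnf
# The account in this file must be a MySQL account whose host is localhost; here I log in as root@localhost
# To use a dedicated account instead, it needs these privileges: grant select,replication client,process on *.* to mysql_monitor@'localhost' identified by '123456';
# (on MySQL 8.0, create the account with CREATE USER first, because GRANT ... IDENTIFIED BY was removed)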

[client]
user=root
password=123456


nohup ./mysqld_exporter --config.my-cnf ./mysqld_exporter.cnf  &

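If the exporter is started without this flag it fails, because the default configuration file is ~/.my.cnf of the user running it (/root/.my.cnf here). With a binary install that file does not exist, so create a configuration file as above and pass its path at startup.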
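Because the configuration file holds a database password, it is worth creating it with restrictive permissions. The following is a minimal sketch under that assumption: the helper name write_exporter_cnf is hypothetical, and the account shown in the usage line is the mysql_monitor example from the GRANT statement above.

```shell
#!/bin/sh
# Hypothetical helper: write a MySQL client config for mysqld_exporter.
# umask 077 makes a newly created file readable by its owner only;
# the permissions of an already existing file are left unchanged.
write_exporter_cnf() {
  (
    umask 077
    cat > "$1" <<EOF
[client]
user=$2
password=$3
EOF
  )
}
```

Usage: write_exporter_cnf ./mysqld_exporter.cnf mysql_monitor 123456, then start the exporter with --config.my-cnf ./mysqld_exporter.cnf as shown earlier.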

【3.3】Verify

（1）Verify by URL

http://192.168.175.129:9104/metrics

（2）Verify in the prometheus UI

【4】grafana

Install on 192.168.175.130. The default port is 3000.

【4.1】Download and install

Official site: https://grafana.com/grafana/download

Installation guide: https://grafana.com/docs/grafana/latest/installation/rpm/

（1）Download and install

cd /soft 
wget https://dl.grafana.com/oss/release/grafana-7.1.3.linux-amd64.tar.gz
tar -zxf grafana-7.1.3.linux-amd64.tar.gz
cd grafana-7.1.3

（2）Start

nohup ./bin/grafana-server web &

（3）Verify

netstat -anp|grep -E "3000"

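（4）Related information

Default port: 3000
Default log (package install; a tarball install keeps it under the extracted directory's data/ folder): /var/log/grafana/grafana.log
Default database file (package install): /var/lib/grafana/grafana.db
Default web account/password: admin/admin

【4.2】Log in to grafana

Open 192.168.175.130:3000 in a browser

The default username and password are both admin. On first login, grafana asks you to change the password for security reasons.

The page shown after logging in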

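【4.3】Add the prometheus data source to grafana

Click Add data source, then choose Prometheus

Enter the address and port of the prometheus server (http://192.168.175.131:9090 here)

Scroll to the bottom of the page and click Save & Test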
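The same data source can also be created without the UI through grafana's HTTP API (POST /api/datasources). This is a sketch under assumptions, not the method used in this walkthrough: the helper names are hypothetical, and it assumes the default admin/admin login is still valid, which stops being true once the password has been changed.

```shell
#!/bin/sh
# Build the JSON body for grafana's POST /api/datasources endpoint.
# $1 = data source name, $2 = prometheus URL
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1" "$2"
}

# Create the data source. $3 is the grafana host:port
# (assumed default: the grafana server of this lab).
add_prometheus_datasource() {
  curl -fsS -u admin:admin -H 'Content-Type: application/json' \
    -X POST "http://${3:-192.168.175.130:3000}/api/datasources" \
    -d "$(datasource_payload "$1" "$2")"
}
```

Usage: add_prometheus_datasource Prometheus http://192.168.175.131:9090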

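【4.4】Add a dashboard manually

Select the data source, enter the metric to display, and the panel fills with data.

Spend some time with the options panel on the right and the query editor below

When finished, click Save in the upper right. The saved dashboard is listed the next time we open grafana

Click the dashboard to see the graph created earlier

Series can also be filtered by instance or job name

More panels can be added to the same dashboard

【4.5】Import an official template

Official site: https://grafana.com/grafana/dashboards

Then import template 9777

The dashboard data now appears.

The template dashboard can also be customized

The dashboard settings show some details. The variables are the filter drop-downs; here they filter on the instance values of our targets

Panels can also be edited directly

Click Apply in the upper right and the change shows on the dashboard as well

【4.6】Recommended Linux template: 9276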

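【4.7】Why panels show "No data" instead of a graph

Several panels in the 【4.6】 dashboard show no data

Possible causes:

（1）There really is no data

（2）The server time does not match the client/browser time

（3）The PromQL expression is wrong

How to verify and fix:

（1）Time mismatch

Even when clocks are off, the skew is usually a few seconds or at most a few hours. Widen the time range to 2 days, 7 days, or more; if there is still nothing, time is not the cause

（2）Missing data, or an incorrect PromQL expression

Take the expression from the dashboard panel

Run it directly under prometheus => Graph, substituting the variable values and the network device name (mine is ens33). It returns data, so the problem is that the device name in the panel does not match the actual NIC name on the monitored machine

Go back and edit the expression in the panel; I changed the device name to the monitored machine's NIC name

Then click Apply in the upper right

The panel now shows data. Done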
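The check against prometheus => Graph can also be done from the command line through the prometheus HTTP API (/api/v1/query). The sketch below is illustrative: the helper names are hypothetical, and the expression (a receive rate over node_network_receive_bytes_total with a 5m window) is an assumed example, not the exact expression of the dashboard panel.

```shell
#!/bin/sh
# Build a receive-rate expression for one instance and one network device.
# $1 = instance label (host:port), $2 = device name such as ens33
nic_rate_expr() {
  printf 'rate(node_network_receive_bytes_total{instance="%s",device="%s"}[5m])\n' "$1" "$2"
}

# Run an instant query and print the raw JSON response.
# $1 = PromQL expression, $2 = prometheus host:port (assumed default: this lab's server)
prom_query() {
  curl -fsS -G "http://${2:-192.168.175.131:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

Usage: prom_query "$(nic_rate_expr 192.168.175.129:9100 ens33)". An empty "result":[] in the response means no series matched, which points at a wrong label value such as the device name.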

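【5】MySQL dashboard monitoring

【5.1】Download the JSON template

GitHub download: https://github.com/percona/grafana-dashboards/tree/master/dashboards

Official dashboards: https://grafana.com/grafana/dashboards?dataSource=prometheus&search=mysql

However, the dashboards on the official site often do not match the exporter, so they end up full of "No data" panels, as in 【4.7】

Here we download MySQL_Overview.json

https://github.com/percona/grafana-dashboards/blob/master/dashboards/MySQL_Overview.json

Copy its contents

【5.2】Apply the JSON template

Paste the JSON from 【5.1】 into the dashboard import dialog

【Final result】

One pitfall: this dashboard expects a data source named Prometheus, and the name is case-sensitive. If our data source has a different name, either change the data source in each panel or rename the data source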

【6】prometheus alerting configuration

【6.1】Check the current state of each job

http://192.168.175.131:9090/targets

All targets are UP.

【6.2】Configure the rule file and prometheus.yml

（1）Go back to the prometheus server (192.168.175.131)

（2）Enter the configuration directory /soft/prometheus

（3）View and edit prometheus.yml

cd /soft/prometheus

vim prometheus.yml and uncomment the first entry under rule_files, so that the configuration references a file named first_rules.yml

（4）Create and edit first_rules.yml

cd /soft/prometheus

vim first_rules.yml

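groups:
- name: simulator-alert-rule  # group name
  rules:
  - alert: HttpSimulatorDown # alert name; best kept unique
    expr: sum(up{job="agent_linux"}) == 0  # alert expression: checks whether the agent_linux target is reachable
    for: 1m  # duration: the alert fires only after the expression has held for 1 minute, which filters out transient failures
    labels:
      severity: critical
    annotations:
      summary: Linux node status is {{ humanize $value}}% for 1m  # alert description

The key expression

sum(up{job="agent_linux"}) == 0 checks whether the target under agent_linux, that is 192.168.175.129:9100, is reachable: up is 1 when the last scrape succeeded and 0 when it failed.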
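Beyond a syntax check, the rule's behaviour can be exercised with promtool's rule unit tests (promtool test rules). The sketch below writes a test file next to first_rules.yml. The file name first_rules_test.yml and the helper name are hypothetical, and the expected annotation assumes that humanize renders a value of 0 as "0"; adjust it if promtool reports a different string.

```shell
#!/bin/sh
# Write a promtool unit test for the HttpSimulatorDown rule into directory $1.
write_rule_test() {
  cat > "$1/first_rules_test.yml" <<'EOF'
rule_files:
  - first_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    # The target reports up=0 (scrape failed) at 0m, 1m and 2m.
    input_series:
      - series: 'up{job="agent_linux", instance="192.168.175.129:9100"}'
        values: '0 0 0'
    alert_rule_test:
      # By 2m the expression has held for at least "for: 1m",
      # so the alert is expected to be firing.
      - eval_time: 2m
        alertname: HttpSimulatorDown
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Linux node status is 0% for 1m"
EOF
}
```

Usage: write_rule_test /soft/prometheus, then cd /soft/prometheus and run ./promtool test rules first_rules_test.yml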

The promtool command-line tool bundled with prometheus can check that the rule syntax is correct

　　./promtool check rules first_rules.yml

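（5）Reload the configuration file

curl -X POST http://localhost:9090/-/reload

【6.3】Check config, rules, and alerts

（1）rules

Visit the URL directly or click Status => Rules. The rule is loaded, its state is OK, and the alert has not been triggered

（2）config

Visit the URL directly or click Status => Configuration. The page confirms that the online reload took effect.

【6.4】Verify the alert

（1）Check 192.168.175.131:9090/alerts

Open /alerts on the prometheus server to see the current alert state:

The three states mean:

Inactive: the alert has not been triggered

Pending: the expression already matches, but has not yet held for the duration given by for. In our case, the agent_linux target 192.168.175.129:9100 is already unreachable, but not yet for a full minute

Firing: the alert is active and is sent out

（2）Stop the target under agent_linux

That is, stop the node_exporter process listening on 192.168.175.129:9100

Check the page again: the alert is now Pending, and after one minute it becomes Firing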
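The Pending to Firing transition can also be watched from a shell through the prometheus HTTP API endpoint /api/v1/alerts, which lists the currently active alerts. A minimal sketch follows; the helper name alert_states is hypothetical, and plain grep/sed is used only to keep it dependency-free (a JSON tool such as jq is more robust).

```shell
#!/bin/sh
# Extract the "state" values (pending / firing) from the JSON
# returned by /api/v1/alerts, one per line.
alert_states() {
  grep -o '"state":"[a-z]*"' | sed 's/.*:"\(.*\)"/\1/'
}
```

Usage: curl -s http://192.168.175.131:9090/api/v1/alerts | alert_states. No output means no alert is currently pending or firing.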

【7】Alertmanager

Install on 192.168.175.130. The default port is 9093.

【7.1】Download and install Alertmanager

（1）Download

Download page: https://prometheus.io/download/#alertmanager

linux:

# Download

cd /soft
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz

# Extract

tar -zxf alertmanager-0.21.0.linux-amd64.tar.gz
ln -s  alertmanager-0.21.0.linux-amd64  alertmanager
cd alertmanager

【7.2】Build the configuration file (email)

(Remember to replace the SMTP settings with your own)

[root@DB2 alertmanager]# vi alertmanager.yml 
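global:
  # How long to wait before declaring an alert resolved when it is no longer reported
  resolve_timeout: 5m
  # Email delivery settings
  smtp_smarthost: 'smtp.qq.com:25'  # must be given as host:port
  smtp_from: '815202984@qq.com'
  smtp_auth_username: '815202984@qq.com'
  smtp_auth_password: 'xxxxxx'
  smtp_require_tls: false  # disable TLS
# Root route that every incoming alert enters; it defines the dispatch policy
route:
  # Labels used to regroup incoming alerts. For example, many alerts carrying cluster=A and alertname=LatencyHigh are batched into a single group
  group_by: ['alertname', 'cluster']
  # When a new alert group is created, wait at least group_wait before sending the first notification, so that several alerts of the same group can be collected and sent together
  group_wait: 30s

  # After the first notification, wait group_interval before sending a notification about new alerts added to the group
  group_interval: 5m

  # If a notification was already sent successfully, wait repeat_interval before sending it again
  repeat_interval: 5m

  # Default receiver: alerts not matched by any child route go to this receiver
  receiver: default

receivers:
- name: 'default'  # custom name, referenced by receiver: default
  email_configs:   # email notification settings
  - to: '815202984@qq.com,123456' # recipients
    send_resolved: true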

Check the configuration file for errors with:

[root@DB2 alertmanager]# ./amtool check-config alertmanager.yml

Production reference

Note: alert template paths should be absolute

templates:
- '/data/prometheus/alertmanager/wechat.tmpl'
- '/data/prometheus/alertmanager/email.tmpl'

global:
  resolve_timeout: 5m
  smtp_smarthost: '1.1.1.1:25'  # mail server address
  smtp_from: '123@qq.com'  # sender address
  smtp_auth_username: '123' # mailbox login account; normally matches the sender above
  smtp_auth_password: '123456'
  smtp_require_tls: false

templates:
  - '/data/prometheus/alertmanager/wechat.tmpl'
  - '/data/prometheus/alertmanager/email.tmpl'
route:
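  group_by: ['instance'] # merge alerts of a similar nature into a single notification
  group_wait: 10s  # on a new alert group, wait 10s for further alerts and send them together
  group_interval: 10s # when new alerts join a group that was already notified, wait this long before sending; group_wait no longer applies
  repeat_interval: 10m # interval between repeated notifications; 10m or 30m is suggested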
  receiver: 'wechat'
  routes:
  - receiver: 'happy'
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 10m
    match_re:
      job: ^快樂子公司.*$   # alerts whose job name starts with 快樂子公司
  - receiver: 'wechat'
    continue: true

  - receiver: 'default-receiver'
    continue: true

#  - receiver: 'test_dba'
#    group_wait: 10s
#    group_interval: 10s
#    repeat_interval: 10m
#    match:
#      job: 大連娛網_mssql


receivers:
  - name: 'default-receiver'
    email_configs:
    - to: '123@qq.com,456@qq.com'
      send_resolved: true
      html: '{{ template "email.html" .}}'
      headers: { Subject: 'prometheus alert' }

  - name: 'wechat'
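    wechat_configs: # WeChat Work (enterprise WeChat) notification settings
    - send_resolved: true
      to_party: '2' # ID of the receiving department
#      to_user: 'abc|efg|xyz' # IDs of the receiving users
#      to_user: '@all' # send to all users
      agent_id: '1000003' # (WeChat Work --> custom app --> AgentId)
      corp_id: 'qwer' # company info (My Company --> CorpId, at the bottom)
      api_secret: 'xx--qq' # (WeChat Work --> custom app --> Secret)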
      api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      message: '{{ template "wechat.default.message" . }}'

  - name: 'happy'
    email_configs:
    - to: 'hayyp@company.com'
      send_resolved: true
      html: '{{ template "email.html" .}}'
      headers: { Subject: 'prometheus alert' }


inhibit_rules:
  - source_match:  # while an alert matching this is firing, other alerts are inhibited
      severity: 'critical'
    equal: ['id', 'instance']

（3）Start

Configure it as a system service

vim /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
#User=prometheus
ExecStart=/soft/alertmanager/alertmanager  --config.file=/soft/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start it through systemd

systemctl daemon-reload

systemctl start alertmanager

systemctl enable alertmanager

systemctl status alertmanager

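The service started successfully and listens on port 9093

【7.3】Verify that it started and is reachable

The decisive check is opening the web UI. If the page at the address below loads, it is working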

http://192.168.175.130:9093/

【8】Integrating prometheus with alertmanager

【8.1】Edit prometheus.yml

The main change is in the alerting section: set targets to the address and port of our alertmanager server (192.168.175.130:9093)
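For reference, the alerting section described above can be printed with a small helper and compared against prometheus.yml. This is a sketch: the function name alerting_block is hypothetical, and the layout mirrors the default prometheus.yml shown in 【1.2】 with the commented-out target replaced.

```shell
#!/bin/sh
# Print the alerting section of prometheus.yml for one
# alertmanager address given as host:port.
alerting_block() {
  cat <<EOF
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - $1
EOF
}
```

Usage: alerting_block 192.168.175.130:9093. After editing prometheus.yml to match, ./promtool check config prometheus.yml confirms the file is still valid before reloading.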

Then run the online reload command on the prometheus server

curl -X POST http://localhost:9090/-/reload

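【8.2】Email alerts

Because we stopped the agent_linux target 192.168.175.129:9100 in 【6.4】, the alert email arrives as soon as prometheus is connected to alertmanager.

Then we start the node again

and the alert clears in prometheus as well

The resolved notification, however, is not sent right away. The cause is two settings in our alertmanager configuration; together they made it take about 10 minutes in this test

  # After the first notification, wait group_interval before sending a notification about new alerts added to the group
  group_interval: 5m
  # If a notification was already sent successfully, wait repeat_interval before sending it again
  repeat_interval: 5m

The resolved notification arrived too. Done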

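【9】WeChat Work (enterprise WeChat) alerts

【9.1】WeChat Work admin setup

Registration: https://work.weixin.qq.com/

Create an application (any name will do) and note down its IDs; they are needed later.

Note down the company ID (CorpId) as well

【9.2】Rework alertmanager.yml to add a WeChat Work receiver

Official reference: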

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The API key to use when talking to the WeChat API.
[ api_secret: <secret> | default = global.wechat_api_secret ]

# The WeChat API URL.
[ api_url: <string> | default = global.wechat_api_url ]

# The corp id for authentication.
[ corp_id: <string> | default = global.wechat_api_corp_id ]

# API request data as defined by the WeChat API.
[ message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: <string> | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: <string> | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}' ]

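to_party can be used to address a group (department); then simply add the relevant people to that department in WeChat Work.

My working configuration: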

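global:
  # How long to wait before declaring an alert resolved when it is no longer reported
  resolve_timeout: 5m
  # Email delivery settings
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: '815202984@qq.com'
  smtp_auth_username: '815202984@qq.com'
  smtp_auth_password: 'a123456!'
  smtp_require_tls: false  # disable TLS


templates:
  - 'test.tmpl'
# Root route that every incoming alert enters; it defines the dispatch policy
route:
  # Labels used to regroup incoming alerts. For example, many alerts carrying cluster=A and alertname=LatencyHigh are batched into a single group
  group_by: ['alertname', 'cluster']
  # When a new alert group is created, wait at least group_wait before sending the first notification, so that several alerts of the same group can be collected and sent together
  group_wait: 30s

  # After the first notification, wait group_interval before sending a notification about new alerts added to the group
  group_interval: 10s

  # If a notification was already sent successfully, wait repeat_interval before sending it again
  repeat_interval: 10s

  # Default receiver: alerts not matched by any child route go to this receiver
  #receiver: "default"
  receiver: "wechat"

receivers:
- name: 'default'  # custom name, referenced by receiver: default
  email_configs:   # email notification settings
  - to: '815202984@qq.com'
    send_resolved: true

- name: 'wechat'
  wechat_configs:
    - send_resolved: true
      agent_id: '1000002'  # application ID
      to_user: 'GuoChaoQun|Zhangsan' # receiving member accounts
      api_secret: 'xxx' # application secret
      corp_id: 'xxx' # WeChat Work company ID

The member accounts must be the account IDs shown in the WeChat Work admin console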

[root@DB2 alertmanager]# ./amtool check-config alertmanager.yml

【9.3】Build the alert template test.tmpl

cd /soft/alertmanager

vim test.tmpl

{{ define "wechat.default.message" }}
{{ range .Alerts }}
========Monitoring alert==========
Alert status: {{   .Status }}
Alert severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Alert application: {{ .Annotations.summary }}
Alert host: {{ .Labels.instance }}
Alert details: {{ .Annotations.description }}
Trigger value: {{ .Annotations.value }}
Alert time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end=============
{{ end }}
{{ end }}

Check once more with the command below. The setup is only correct if the output lists the template.

[root@DB2 alertmanager]# ./amtool check-config alertmanager.yml

【9.4】Build the alert rule file on the prometheus server

vim first_rules.yml

groups:
- name: node-alert-rule
  rules:
  - alert: "監控節點宕機"  # "monitored node down"
    expr: sum(up{job="agent_linux"}) == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service: {{$labels.alertname}} monitored node down alert"
      description: "{{ $labels.alertname }} monitored node is down"
      value: "{{ $value }}"

Verify with ./promtool check rules first_rules.yml

【9.5】Test

After the changes, remember to restart the Alertmanager service
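Instead of stopping node_exporter again, the notification path can be smoke-tested by posting an alert straight to Alertmanager's v2 API (POST /api/v2/alerts). This is a sketch under assumptions: the helper names, the alert name TestAlert, and the annotation texts are made up for the test, and the annotation keys match the fields used by the test.tmpl template above.

```shell
#!/bin/sh
# Build a one-alert JSON payload for alertmanager's POST /api/v2/alerts endpoint.
# $1 = value for the instance label
test_alert_payload() {
  printf '[{"labels":{"alertname":"TestAlert","severity":"critical","instance":"%s"},"annotations":{"summary":"manual test alert","description":"sent with curl","value":"0"}}]\n' "$1"
}

# Post the test alert. $2 is the alertmanager host:port
# (assumed default: the alertmanager server of this lab).
send_test_alert() {
  curl -fsS -H 'Content-Type: application/json' \
    -X POST "http://${2:-192.168.175.130:9093}/api/v2/alerts" \
    -d "$(test_alert_payload "$1")"
}
```

Usage: send_test_alert 192.168.175.129:9100. The notification should reach the default receiver of the route once group_wait has elapsed.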

【Startup parameter best practices】

（1）prometheus

/data/prometheus/prometheus/prometheus --config.file=/data/prometheus/prometheus/prometheus.yml \
--web.read-timeout=5m --web.max-connections=512 --storage.tsdb.retention.time=30d \
--storage.tsdb.path=/data/prometheus/prometheus/data/ --query.timeout=2m --query.max-concurrency=20 \
--web.listen-address=0.0.0.0:9090 --web.enable-lifecycle --web.enable-admin-api

（2）alertmanager

/data/prometheus/alertmanager/alertmanager --log.level=debug  --config.file=/data/prometheus/alertmanager/alertmanager.yml >> /data/prometheus/alertmanager/alertmanager.log  2>&1

（3）mysqld_exporter

nohup  mysqld_exporter  --config.my-cnf=/etc/my.cnf --collect.info_schema.tables  --collect.info_schema.innodb_metrics  --collect.auto_increment.columns  >>/var/log/messages 2>&1 &

（4）node_exporter

nohup  node_exporter    --collector.meminfo_numa --collector.processes  >>/var/log/messages 2>&1  &

（5）redis_exporter

nohup  redis_exporter   -web.listen-address :9121 -redis.addr 127.0.0.1:6379  -redis.password asdGLvQYeW   >>/var/log/messages 2>&1 &

（6）grafana

/usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile=/var/run/grafana/grafana-server.pid \
--packaging=rpm cfg:default.paths.logs=/var/log/grafana \
cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning

（7）windows_exporter

.\windows_exporter.exe --collectors.enabled "cpu,cs,logical_disk,net,os,service,system,textfile,mssql" --collector.service.services-where "Name='windows_exporter'"

【prometheus + Alertmanager】Reference alert rules for the monitoring stack itself

groups:
- name: prometheus告警規則  # "prometheus alert rules"
  rules:
  - alert: 采集服務down  # "scrape target down"
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus target missing (instance {{ $labels.instance }})"
      #description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: 采集服務整組down  # "all targets of a job down"
    expr: sum by (job) (up) == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus all targets missing (job {{ $labels.job }})"
      description: "A Prometheus job does not have living target anymore.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: prometheus配置加載失敗  # "prometheus configuration reload failed"
    expr: prometheus_config_last_reload_successful != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus configuration reload failure (instance {{ $labels.instance }})"
      description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: Alertmanager配置加載失敗  # "Alertmanager configuration reload failed"
    expr: alertmanager_config_last_reload_successful != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})"
      description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: Prometheus連接Alertmanager失敗  # "Prometheus cannot reach Alertmanager"
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus not connected to alertmanager (instance {{ $labels.instance }})"
      description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


  - alert: PrometheusRuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus rule evaluation failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus template text expansion failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus rule evaluation slow (instance {{ $labels.instance }})"
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"



  - alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus notifications backlog (instance {{ $labels.instance }})"
      description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus AlertManager notification failing (instance {{ $labels.instance }})"
      description: "Alertmanager is failing sending notifications\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus target empty (instance {{ $labels.instance }})"
      description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus target scraping slow (instance {{ $labels.instance }})"
      description: "Prometheus is scraping exporters slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


  - alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus target scrape duplicate (instance {{ $labels.instance }})"
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


  - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"




  - alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB compactions failed (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB head truncations failed (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB reload failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"



  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

【References】

WeChat Work alerting: https://www.cnblogs.com/guoxiangyue/p/11958522.html
