2. Prometheus Email Alert Configuration


1. Install and configure Alertmanager

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar -zxv -f alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local
cd /usr/local
mv alertmanager-0.20.0.linux-amd64/ alertmanager

2. Create the systemd unit file

vim /usr/lib/systemd/system/alertmanager.service 

[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alert-test.yml --storage.path=/usr/local/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target

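The Alertmanager installation directory ships with a default alertmanager.yml. You can create a separate configuration file instead and point to it at startup with --config.file.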

cd /usr/local/alertmanager
vim alert-test.yml
global:
  smtp_smarthost: 'smtp.qiye.aliyun.com:25'
  smtp_from: 'jump@tongchuangkeji.net'
  smtp_auth_username: 'jump@tongchuangkeji.net'
  smtp_auth_password: 'xxxx'
  smtp_require_tls: false
  
templates:
  - '/alertmanager/template/*.tmpl'
  
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver
  
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'liqilong@edspay.com'
    html: '{{ template "alert.html" . }}'
    headers: {Subject: "[WARN] Alert email test"}

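I initially used the company mailbox, but during later verification it kept failing with this error:

level=error ts=2019-01-26T06:21:59.062483579Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="*smtp.plainAuth failed: unencrypted connection"

I read reports from other people who hit the same problem and tried ports 25, 465, and 587, none of which helped. After switching to a 163 mailbox, sending worked immediately.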

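  • smtp_smarthost: the SMTP server address and port of the mailbox used to send mail;
  • smtp_auth_password: the authorization code of the sending mailbox, not its login password;
  • smtp_require_tls: defaults to true when unset; with true the author hit a starttls error, so it is set to false here for simplicity;
  • templates: the path to the email template files;
  • html (under receivers): names the template used for the email body; here the template is "alert.html", which is defined in one of the files under the template path;
  • headers: sets the email headers; here it sets the Subject line.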
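
Before starting the service, the configuration described above can be checked for syntax errors. The sketch below assumes the amtool binary that ships in the Alertmanager tarball sits in /usr/local/alertmanager; the function name check_am_config is made up for illustration, and the check is skipped when the binary is missing.

```shell
# Validate an Alertmanager config with amtool if the binary is present.
# $1 = config file to check
# $2 = path to amtool (assumed default: the install directory used in this guide)
check_am_config() {
  conf="$1"
  tool="${2:-/usr/local/alertmanager/amtool}"
  if [ -x "$tool" ]; then
    # amtool exits non-zero when the config does not parse
    "$tool" check-config "$conf"
  else
    echo "amtool not found at $tool; skipping check"
    return 0
  fi
}
```

Typical use would be `check_am_config /usr/local/alertmanager/alert-test.yml`.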

With the Aliyun enterprise mailbox, mail cannot be sent and the following errors are logged:

Jun 11 13:40:52 worker alertmanager: level=error ts=2020-06-11T05:40:52.638Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="*smtp.plainAuth auth: unencrypted connection" context_err="context deadline exceeded"
Jun 11 13:40:52 worker alertmanager: level=error ts=2020-06-11T05:40:52.638Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="*smtp.plainAuth auth: unencrypted connection"
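
The "unencrypted connection" message comes from Go's SMTP client, which refuses to send PLAIN credentials over a connection that is not encrypted; with smtp_require_tls: false the session can stay in plaintext. One combination that may be worth trying (a sketch, not a confirmed fix for Aliyun) is the submission port 587 together with smtp_require_tls: true. The host and addresses below are placeholders, and whether a given provider offers STARTTLS on port 587 is an assumption to verify.

```shell
# Write a candidate "global" section to a scratch file for review.
# smtp.example.com and alerts@example.com are placeholders, not real values.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
global:
  smtp_smarthost: 'smtp.example.com:587'   # submission port; STARTTLS expected
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'xxxx'
  smtp_require_tls: true                   # do not send mail without STARTTLS
EOF
cat "$conf"
```

If the server does not advertise STARTTLS on that port, Alertmanager should report an error at send time instead of sending credentials in the clear.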

3. Configure alert rules

Create rule.yml.

cd /usr/local/prometheus
vim rule.yml
groups:
- name: alert-rules.yml
  rules:
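  - alert: InstanceStatus # alert name
    expr: up{job="192.168.75.10"} == 0 # alert condition; job refers to job_name in prometheus.yml
    for: 10s # the condition must hold for 10s before the alert fires
    labels: # labels attached to the alert
      severity: "critical"
    annotations: # extra information about the alert; not used to identify it
      description: "Server {{ $labels.instance }} has been down for more than 20s"
      summary: "Server {{ $labels.instance }} running status"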

Specify the path to rule.yml in prometheus.yml:

cat prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093 # changed to localhost (added)

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/usr/local/prometheus/rule.yml" # added

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: '192.168.75.11'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090','localhost:9100']

  - job_name: '192.168.75.10'
    scrape_interval: 5s
    static_configs:
    - targets: ['192.168.75.10:9100']
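
Both files can be checked before Prometheus is restarted. This is a minimal sketch that assumes promtool (shipped in the Prometheus tarball) lives in the same /usr/local/prometheus directory as the files above; check_prometheus_files is a name invented for this example.

```shell
# Run promtool checks when the binary exists; otherwise report and continue.
# $1 = Prometheus install dir (assumed /usr/local/prometheus, as in this guide)
check_prometheus_files() {
  dir="${1:-/usr/local/prometheus}"
  if [ ! -x "$dir/promtool" ]; then
    echo "promtool not found in $dir; skipping checks"
    return 0
  fi
  # Check the rule file first, then the main config that references it.
  "$dir/promtool" check rules "$dir/rule.yml" &&
    "$dir/promtool" check config "$dir/prometheus.yml"
}
```

Calling `check_prometheus_files` with no argument uses the assumed default directory; a non-zero exit status means one of the files failed validation.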

Restart the Prometheus service:

chown -R prometheus:prometheus /usr/local/prometheus/rule.yml
systemctl restart prometheus

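4. Write the email template

Note: the file extension must be .tmpl so that it matches the templates pattern in the configuration.

mkdir -pv /alertmanager/template/ # the path must match the templates entry in the Alertmanager configuration file above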
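vim /alertmanager/template/alert.tmpl
{{ define "alert.html" }}
<table>
	<tr><td>Alert name</td><td>Start time</td></tr>
	{{ range $i, $alert := .Alerts }}
	<tr><td>{{ index $alert.Labels "alertname" }}</td><td>{{ $alert.StartsAt }}</td></tr>
	{{ end }}
</table>
{{ end }}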

Note: Alertmanager may fail at startup with the following error:

Jun 11 12:55:44 worker alertmanager: level=error ts=2020-06-11T04:55:44.744Z caller=main.go:236 msg="Unable to create data directory" err="mkdir data/: permission denied"
Jun 11 12:55:44 worker systemd: alertmanager.service: main process exited, code=exited, status=1/FAILURE
Jun 11 12:55:44 worker systemd: Unit alertmanager.service entered failed state.
Jun 11 12:55:44 worker systemd: alertmanager.service failed.
Jun 11 12:55:44 worker systemd: alertmanager.service holdoff time over, scheduling restart.
Jun 11 12:55:44 worker systemd: Stopped alertmanager.

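This happens because in newer versions --storage.path defaults to the relative directory data/ (resolved against the process's working directory), and the prometheus user has no permission to create a directory there, so startup fails.

Fix: set --storage.path explicitly in alertmanager.service to a directory the prometheus user can write to, as the unit file in step 2 does with /usr/local/alertmanager/data.

5. Start Alertmanager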

chown -R prometheus:prometheus /usr/local/alertmanager
systemctl daemon-reload
systemctl start alertmanager.service
systemctl status alertmanager.service
ss -tnl|grep 9093
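
Beyond checking that port 9093 is listening, the mail path can be exercised by pushing a hand-made alert into Alertmanager's v2 HTTP API. This is a sketch under assumptions: the function names are invented, the label values are illustrative, and Alertmanager is assumed to listen on localhost:9093 as configured above.

```shell
# Build a minimal JSON payload for Alertmanager's v2 alerts API.
# $1 = alert name, $2 = instance label (both illustrative values)
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"critical"}}]' "$1" "$2"
}

# Post the payload to a local Alertmanager (assumed at localhost:9093).
send_test_alert() {
  build_test_alert "$1" "$2" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "http://localhost:9093/api/v2/alerts"
}

# Example, to be run manually once Alertmanager is up:
# send_test_alert ManualTest 192.168.75.10:9100
```

With the route shown earlier, the notification for such an alert is held for group_wait (30s) before the email is attempted, so errors in the SMTP settings show up in the Alertmanager log shortly after that.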

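6. Verify the result

At this point the alert rule should be visible in the management UI.

Then stop the node_exporter service on node 192.168.75.10 and check the UI again.

The mailbox should then receive the alert email.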
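
How long the email takes depends on several intervals set in the files above. The arithmetic below is a rough worst-case estimate under simplifying assumptions (the failed scrape is detected at the next scrape, the rule group runs at the global evaluation_interval, and SMTP delivery time is ignored); it is not a guarantee of actual behaviour.

```shell
# Values taken from the configs in this guide, in seconds.
scrape_interval=5       # job '192.168.75.10' in prometheus.yml
evaluation_interval=15  # global evaluation_interval in prometheus.yml
for_duration=10         # 'for' in rule.yml
group_wait=30           # route.group_wait in alert-test.yml

# Time spent pending: 'for' rounded up to whole evaluation cycles.
pending=$(( (for_duration + evaluation_interval - 1) / evaluation_interval * evaluation_interval ))

# Worst case: wait for the next scrape, then the next evaluation,
# then the pending period, then group_wait in Alertmanager.
worst_case=$(( scrape_interval + evaluation_interval + pending + group_wait ))
echo "worst-case delay to first email: ${worst_case}s"
```

With these values the estimate is 5 + 15 + 15 + 30 = 65 seconds, so it is worth waiting a little over a minute after stopping node_exporter before concluding that mail delivery failed.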

