Using the Prometheus Alertmanager Component

Reposted article. 2020-08-09 19:38. Category: Monitoring (Prometheus+Grafana)

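Alert management

How Alertmanager works

Alertmanager handles alerts sent by clients (usually the Prometheus server, though other tools can send alerts too). It deduplicates and groups the alerts, then routes them to receivers such as email, SMS, or SaaS services (PagerDuty and the like). Alertmanager can also be used to manage alerts during maintenance.

Alerting rules are first written on the Prometheus server. The rules use metrics collected by exporters and trigger alerts on a given threshold or criterion. When a metric reaches the threshold or criterion, an alert is generated and pushed to Alertmanager, which receives it on an HTTP endpoint.
After receiving an alert, Alertmanager processes it and routes it according to its labels. Once the route has been determined, Alertmanager sends the alert to an external destination such as email or SMS.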

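Installing and configuring Alertmanager

See the documentation for details.

Configuring Alertmanager

Alertmanager is also configured through a YAML file, which consists mainly of three sections: global, route, and receivers.

A simple example (alertmanager.yml)

Send every received alert by email to another mailbox.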

global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.com'
  smtp_require_tls: false

route:
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'alerts@example.com'

templates:
- '/ups/app/monitor/alertmanager/template/*.tmpl'

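global: global settings, which provide default values for the other blocks.
templates: a list of file paths (wildcards allowed) from which alert templates are loaded. Because Alertmanager can send to many kinds of destinations, you need to be able to customize how an alert looks and what data it carries.
route: tells Alertmanager how to handle specific incoming alerts. Alerts are matched against rules and then acted on accordingly.
receivers: the destinations for alerts. Each receiver has a name and its associated configuration.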
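Because every route names a receiver, one quick sanity check on a configuration like this is that each receiver a route refers to is actually defined. The Python sketch below runs that check on a plain dict mirroring the YAML. The dict and the helper name undefined_receivers are illustrative assumptions, not part of Alertmanager; for real files, amtool check-config is the tool to use.

```python
def undefined_receivers(config):
    """Return receiver names used by routes but missing from `receivers`."""
    defined = {r["name"] for r in config.get("receivers", [])}
    missing = set()

    def walk(route):
        name = route.get("receiver")
        if name is not None and name not in defined:
            missing.add(name)
        for child in route.get("routes", []):
            walk(child)

    walk(config.get("route", {}))
    return sorted(missing)


# Dict form of the simple alertmanager.yml example (hand-written, not parsed).
config = {
    "global": {
        "smtp_smarthost": "localhost:25",
        "smtp_from": "alertmanager@example.com",
        "smtp_require_tls": False,
    },
    "route": {"receiver": "email"},
    "receivers": [
        {"name": "email", "email_configs": [{"to": "alerts@example.com"}]},
    ],
}

print(undefined_receivers(config))  # []
```

An empty list means every route points at a defined receiver.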

Configuring Alertmanager in Prometheus

Prometheus has to be told about Alertmanager. This is configured in the alerting block of prometheus.yml.

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - progs:9093  # port 9093 of the running Alertmanager node

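The alerting block holds the configuration that lets Prometheus find one or more Alertmanagers. Prometheus uses the same discovery mechanisms here as it does for scrape targets; the default configuration uses static_configs. As with a monitoring job, it specifies a list of targets, in this case the host name progs plus port 9093 (the Alertmanager default port). The list assumes that the Prometheus server can resolve the host name progs to an IP address and that Alertmanager is running on port 9093 of that host.

Monitoring Alertmanager

Alertmanager exposes metrics about itself, so it can be monitored like any other target. Create a job to monitor it: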

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['localhost:9093']

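This scrapes metrics from http://localhost:9093/metrics and collects a set of time series prefixed with alertmanager_. The data includes alert counts by state, counts of successful and failed notifications by receiver, and Alertmanager cluster status metrics.

Adding alert rules

Like recording rules, alerting rules are defined in YAML rule files that the Prometheus server loads. As an example, create a new file node_alerts.yml in the rules directory to hold the alerting rules for nodes.

The rule_files block in prometheus.yml lists the rule files to load: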

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"

The wildcards load every file in that directory whose name ends in _rules.yml or _alerts.yml.
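To see which file names these two patterns pick up, here is a small Python sketch using the standard fnmatch module. It only approximates the behaviour: Prometheus expands the patterns with Go's glob implementation, and the helper name is_loaded and the sample file names are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# The two patterns from the rule_files block.
PATTERNS = ["rules/*_rules.yml", "rules/*_alerts.yml"]


def is_loaded(path):
    """True if the path matches at least one rule_files pattern."""
    # Note: unlike Go's glob, fnmatch lets "*" match "/" as well, so this
    # sketch is only meant for flat file names directly under rules/.
    return any(fnmatchcase(path, pattern) for pattern in PATTERNS)


for name in ["rules/node_alerts.yml", "rules/node_rules.yml", "rules/node.yml"]:
    print(name, is_loaded(name))
```

The first two names are loaded; rules/node.yml is not, because it lacks the _rules or _alerts suffix.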

Adding the alert rule

vi rules/node_alerts.yml

groups:
- name: node_alerts
  rules:
  - alert: InstanceDown
    expr: up{job="node_exporter"} == 0
    for: 10s
    labels:
      severity: critical
    annotations:
      summary: Host {{ $labels.instance }} is down!

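The file above defines a rule group named node_alerts; the rules in the group are listed in its rules block.
Each rule gets its name from the alert clause, and alert names should be unique within a group.
The expression that triggers the alert is given in the expr clause.
The for clause controls how long the expression must stay true before the alert fires. In this example, up{job="node_exporter"} must equal 0 for 10 seconds before the alert fires, which reduces false positives and alerts on transient states.
labels decorate the alert with additional labels. Here a severity label with the value critical is added. An alert's labels, combined with its name, make up the alert's identity.
annotations decorate the alert with more descriptive information, such as a description or handling instructions. Here an annotation named summary is added to describe the alert.

Restart the Prometheus service to enable the new alerting rule.

Open the web UI at http://progs:9090/alerts to check the result.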

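Alert firing

Prometheus evaluates all rules at a fixed interval, controlled by the evaluation_interval parameter, which defaults to 1 minute. In each evaluation cycle, Prometheus runs the expression defined in every alerting rule and updates the alert state.

Alert states

Inactive: the alert is not active.
Pending: the alert expression is true, but the duration given in the for clause has not yet elapsed.
Firing: the alert expression is true and has stayed true for at least the for duration.

The Pending-to-Firing transition makes alerts more reliable and keeps them from flapping. An alert without a for clause goes straight from Inactive to Firing and needs only one evaluation cycle (evaluation_interval) to fire. An alert with a for clause first moves to Pending and then to Firing, so it needs at least two evaluation cycles to fire.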

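Alert lifecycle

A node's state can change at any time. Prometheus scrapes it once every scrape_interval, which is 15 seconds in this setup.
Alerting rules are evaluated against the scraped metrics once every evaluation_interval, also 15 seconds here.
When the alert expression becomes true (in the example above, when a node goes down), an alert is created in the Pending state and the for clause starts counting.
In the next evaluation cycle, if the expression is still true, Prometheus checks the for duration. Once it has been reached, the alert moves to Firing, and a notification is generated and pushed to Alertmanager.
If the expression is no longer true, Prometheus changes the alert's state from Pending to Inactive.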
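The lifecycle above can be sketched as a small state machine. This is a simplified model written for illustration, not Prometheus code: the function name simulate and its parameters are assumptions, and it takes a list of per-cycle expression results instead of evaluating real metrics.

```python
def simulate(expr_results, for_seconds, interval_seconds):
    """Return the alert state after each evaluation cycle."""
    states = []
    active_since = None  # time at which the expression first became true
    for cycle, is_true in enumerate(expr_results):
        now = cycle * interval_seconds
        if not is_true:
            # Expression is false: the alert drops back to inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # Fire once the expression has been true for the whole `for` duration.
        if now - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# for: 10s with a 15s evaluation interval, as in the example rule.
print(simulate([False, True, True, True, False], 10, 15))
# ['inactive', 'pending', 'firing', 'firing', 'inactive']
```

With a for duration of 0 the same function reports firing in the very first cycle in which the expression is true, matching the behaviour of a rule without a for clause.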

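Alerts in Alertmanager

Alerts and their states can be viewed in the web UI at http://progs:9090/alerts.

Alerts in the Firing state have already been pushed to Alertmanager and can be viewed through the Alertmanager API (http://progs:9093/api/v1/alerts):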

[root@progs config]# curl http://progs:9093/api/v1/alerts| python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1004  100  1004    0     0   172k      0 --:--:-- --:--:-- --:--:--  196k
{
    "data": [
        {
            "annotations": {
                "summary": "Host 192.168.10.181:9100 is down!"
            },
            "endsAt": "2020-08-08T02:37:48.046283806Z",
            "fingerprint": "2c28fee95b3f434c",
            "generatorURL": "http://progs:9090/graph?g0.expr=up%7Bjob%3D%22node_exporter%22%7D+%3D%3D+0&g0.tab=1",
            "labels": {
                "alertname": "InstanceDown",
                "hostname": "192.168.10.181",
                "instance": "192.168.10.181:9100",
                "job": "node_exporter",
                "severity": "critical"
            },
            "receivers": [
                "wechat"
            ],
            "startsAt": "2020-08-08T02:28:48.046283806Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "summary": "Host  is down!"
            },
            "endsAt": "2020-08-08T02:37:48.046283806Z",
            "fingerprint": "550e18fea3ef4d3d",
            "generatorURL": "http://progs:9090/graph?g0.expr=avg+by%28job%29+%28up%7Bjob%3D%22node_exporter%22%7D%29+%3C+0.75&g0.tab=1",
            "labels": {
                "alertname": "InstancesDown",
                "job": "node_exporter",
                "severity": "critical"
            },
            "receivers": [
                "wechat"
            ],
            "startsAt": "2020-08-08T02:28:48.046283806Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        }
    ],
    "status": "success"
}
[root@progs config]#

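For every alert in the Pending or Firing state, Prometheus also creates a time series named ALERTS, whose alertstate label records the state.

Adding templates

Templates are a way to use the labels and values of time series in alerts; they can be used in annotations and labels. Templates use standard Go template syntax and expose variables holding the labels and value of the time series: labels are available through the $labels variable and the metric value through $value.

Referencing time series values

To reference the instance label in the summary annotation, use {{ $labels.instance }}.
To reference the value of the time series, use {{ $value }}.
The humanize function converts a number into a more readable form using metric (SI) prefixes.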
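To give a feel for what humanize does to a value, here is a rough Python approximation. It is modelled loosely on the documented behaviour (SI prefixes and four significant digits) and is an assumption-laden sketch, not the Prometheus implementation; real output may differ in edge cases.

```python
import math


def humanize(value):
    """Format a number with an SI prefix, keeping four significant digits."""
    if value == 0 or math.isnan(value) or math.isinf(value):
        return f"{value:.4g}"
    if abs(value) >= 1:
        prefix = ""
        # Scale down by 1000 until the value drops below 1000.
        for candidate in ["k", "M", "G", "T", "P", "E", "Z", "Y"]:
            if abs(value) < 1000:
                break
            prefix = candidate
            value /= 1000
        return f"{value:.4g}{prefix}"
    prefix = ""
    # Scale up by 1000 until the value reaches 1.
    for candidate in ["m", "u", "n", "p", "f", "a", "z", "y"]:
        if abs(value) >= 1:
            break
        prefix = candidate
        value *= 1000
    return f"{value:.4g}{prefix}"


print(humanize(1234567))  # 1.235M
print(humanize(1500))     # 1.5k
print(humanize(0.005))    # 5m
```

In an annotation this corresponds to writing {{ $value | humanize }} instead of {{ $value }}.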

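Routing

Routing sends alerts with different attributes to different destinations. By default the routing tree is traversed in post-order: the child routes are tried first, and a route handles an alert itself only when none of its children match.

The following adds some routing configuration to alertmanager.yml: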

global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.com'
  smtp_require_tls: false

route:
  group_by: ['instance']
  group_wait: 30s
  
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'email'
  routes:
  - match:
      severity: critical
    receiver: pager
  - match_re:
      severity: ^(warning|critical)$
    receiver: support_team

receivers:
- name: 'email'
  email_configs:
  - to: 'alerts@example.com'
- name: 'support_team'
  email_configs:
  - to: 'support@example.com'
- name: 'pager'
  email_configs:
  - to: 'pager@example.com'

templates:
- '/ups/app/monitor/alertmanager/template/*.tmpl'

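group_by: controls how Alertmanager groups alerts. By default all alerts are placed in a single group.
- The example groups by the instance label, so all alerts from a given instance are grouped together.
- Grouping changes how Alertmanager handles alerts: when a new alert fires, Alertmanager waits for the period given in group_wait to see whether other alerts for the same group arrive before it sends the notification.
group_interval: after a notification has been sent, if new alerts arrive for the same group, Alertmanager waits for the period given in group_interval before sending them. This keeps a group from flooding the receiver with notifications.
repeat_interval: how long Alertmanager waits before re-sending a notification that has already been sent successfully while its alerts are still firing.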
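The effect of group_by can be pictured as bucketing alerts by the values of the listed labels. The Python sketch below shows only that bucketing step; it ignores the timers (group_wait, group_interval, repeat_interval), and the function name group_alerts, the HighLoad alert, and the instance values are made up for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert names by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        # The group key is the tuple of (label name, label value) pairs.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


# Hypothetical alerts from two instances.
alerts = [
    {"labels": {"alertname": "InstanceDown", "instance": "a:9100"}},
    {"labels": {"alertname": "HighLoad", "instance": "a:9100"}},
    {"labels": {"alertname": "InstanceDown", "instance": "b:9100"}},
]

for key, names in group_alerts(alerts, ["instance"]).items():
    print(key, names)
```

Both alerts from a:9100 land in one group and would be sent in a single notification, while the alert from b:9100 forms its own group.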

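The routes block

The routes clause lists branch routes. Alerts are sent to a given destination by label matching or regular expression matching. Every route is itself a branch, so it can define further branch routes of its own.

Label matching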

  - match:
      severity: critical
    receiver: pager

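This matches every alert whose severity label has the value critical and sends it to the pager receiver.

Route branches

A new routes block can be nested inside an existing route.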

  routes:
  - match:
      severity: critical
    receiver: pager
    routes:
    - match:
        service: application
      receiver: support_team

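When a new alert has the severity label set to critical and the service label set to application, both routes match and the alert is sent to the support_team receiver.

Regular expression matching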

  - match_re:
      severity: ^(warning|critical)$
    receiver: support_team

This matches a severity label whose value is warning or critical.
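The difference between match and match_re can be sketched in a few lines of Python. The helper name matches is an assumption, and Python's re module stands in for the Go regular expression engine that Alertmanager uses, so treat this as an illustration of the idea only. The regular expression has to cover the whole label value, which re.fullmatch mirrors here.

```python
import re


def matches(route, labels):
    """True if the labels satisfy the route's match and match_re entries."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:  # exact string comparison
            return False
    for name, pattern in route.get("match_re", {}).items():
        # The pattern must match the entire label value.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


pager = {"match": {"severity": "critical"}, "receiver": "pager"}
support = {"match_re": {"severity": "^(warning|critical)$"},
           "receiver": "support_team"}

print(matches(pager, {"severity": "warning"}))    # False
print(matches(support, {"severity": "warning"}))  # True
print(matches(support, {"severity": "info"}))     # False
```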

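Route traversal order

Routes are traversed in post-order by default: an alert descends into the children of a route it matches, and the route handles the alert itself only when none of its children match. Among sibling routes, the alert stops at the first one that matches. The continue option changes this by controlling whether a matched alert goes on to be tested against the following sibling routes. continue defaults to false, so traversal ends at the first match.

Background: tree traversal

Pre-order traversal can be remembered as root-left-right: visit the root first, then traverse the left subtree, then the right subtree. If the binary tree is empty, return.
Post-order traversal can be remembered as left-right-root: traverse the left subtree in post-order, then the right subtree in post-order, and visit the root last. If the binary tree is empty, return.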

  routes:
  - match:
      severity: critical
    receiver: pager
    continue: true

With continue: true, the alert is handled by this route (if it matches) and then goes on to be tested against the next sibling route.
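The effect of continue is easiest to see by walking a small routing tree. The Python sketch below follows the traversal rules described in the Alertmanager documentation as I understand them; it is a simplified model, and resolve, route_matches, and make_tree are illustrative names, not Alertmanager APIs. The tree reuses the receivers from the routing example.

```python
import re


def route_matches(route, labels):
    """True if every match / match_re entry of the route accepts the labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels):
    """Return the receivers that would be notified for an alert's labels."""
    if not route_matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = resolve(child, labels)
        receivers.extend(found)
        # Stop at the first matching child unless it sets continue: true.
        if found and not child.get("continue", False):
            break
    # No child matched: the current route handles the alert itself.
    return receivers or [route["receiver"]]


def make_tree(continue_first):
    return {
        "receiver": "email",
        "routes": [
            {"match": {"severity": "critical"}, "receiver": "pager",
             "continue": continue_first},
            {"match_re": {"severity": "^(warning|critical)$"},
             "receiver": "support_team"},
        ],
    }


critical = {"alertname": "InstanceDown", "severity": "critical"}
print(resolve(make_tree(False), critical))  # ['pager']
print(resolve(make_tree(True), critical))   # ['pager', 'support_team']
```

Without continue, a critical alert never reaches support_team because the pager route claims it first; with continue: true on the pager route, both receivers are notified.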

Receivers

Receivers specify the destinations that alerts are sent to.

receivers:
- name: 'pager'
  email_configs:
  - to: 'pager@example.com'
  slack_configs:
  - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE
    channel: '#monitoring'

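This adds a Slack receiver, which sends messages to a Slack workspace. In this example, any route that sends alerts to the pager receiver delivers them both to the #monitoring Slack channel and by email to pager@example.com.

Notification templates

Go templates can reference external template files, which avoids embedding long, complex strings in the configuration file.

Creating the template file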

cat > /ups/app/monitor/alertmanager/template/slack.tmpl <<'EOF'
{{ define "slack.example.text" }}{{ .CommonAnnotations.summary }}{{ end }}
EOF

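The define function creates a new template, closed by end, named slack.example.text. Its body holds the content that would otherwise be written directly in the text field.

Referencing the template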

  slack_configs:
  - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE
    channel: '#monitoring'
    text: '{{ template "slack.example.text" . }}'

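The template function names the template to use, and its rendered output fills the text field.

Silences and maintenance

silence: mutes alerts. When you know a service will be stopped for maintenance, you do not want alerts to fire. A silence covers this case by suppressing notifications for matching alerts during a set time window.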
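Conceptually, a silence is a set of label matchers plus a time window, and an alert is muted when all matchers match while the window is open. The Python sketch below models just that idea with equality matchers; the function name is_silenced and the dict layout are assumptions made for illustration and do not mirror Alertmanager's internal data structures.

```python
from datetime import datetime, timedelta, timezone


def is_silenced(labels, silence, now):
    """True if the silence is active at `now` and all its matchers match."""
    in_window = silence["starts_at"] <= now < silence["ends_at"]
    all_match = all(labels.get(name) == value
                    for name, value in silence["matchers"].items())
    return in_window and all_match


# A one-hour maintenance window (times are illustrative).
start = datetime(2020, 8, 8, 2, 0, tzinfo=timezone.utc)
silence = {
    "matchers": {"alertname": "InstancesGone", "service": "application"},
    "starts_at": start,
    "ends_at": start + timedelta(hours=1),
}
alert = {"alertname": "InstancesGone", "service": "application",
         "severity": "critical"}

print(is_silenced(alert, silence, start + timedelta(minutes=30)))  # True
print(is_silenced(alert, silence, start + timedelta(hours=2)))     # False
```

An alert carrying extra labels, such as severity here, is still muted, because only the labels named in the matchers are compared.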

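Ways to set a silence

The Alertmanager web console
The amtool command-line tool

Silence configuration steps

Setting a silence in the Alertmanager web console

Create a new silence

Configure its properties

Setting a silence with amtool

Usage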

usage: amtool [<flags>] <command> [<args> ...]

View and modify the current Alertmanager state.

Config File: The alertmanager tool will read a config file in YAML format from one of two default config locations:
$HOME/.config/amtool/config.yml or /etc/amtool/config.yml

All flags can be given in the config file, but the following are the suited for static configuration:

  alertmanager.url
  	Set a default alertmanager url for each request

  author
  	Set a default author value for new silences. If this argument is not
  	specified then the username will be used

  require-comment
  	Bool, whether to require a comment on silence creation. Defaults to true

  output
  	Set a default output type. Options are (simple, extended, json)

  date.format
  	Sets the output format for dates. Defaults to "2006-01-02 15:04:05 MST"

Flags:
  -h, --help           Show context-sensitive help (also try --help-long and --help-man).
      --date.format="2006-01-02 15:04:05 MST"  
                       Format of date output
  -v, --verbose        Verbose running information
      --alertmanager.url=ALERTMANAGER.URL  
                       Alertmanager to talk to
  -o, --output=simple  Output formatter (simple, extended, json)
      --timeout=30s    Timeout for the executed command
      --version        Show application version.

Commands:
  help [<command>...]
  alert
    query* [<flags>] [<matcher-groups>...]
    add [<flags>] [<labels>...]
  silence
    add [<flags>] [<matcher-groups>...]
    expire [<silence-ids>...]
    import [<flags>] [<input-file>]
    query* [<flags>] [<matcher-groups>...]
    update [<flags>] [<update-ids>...]
  check-config [<check-files>...]
  cluster
    show*
  config
    show*
    routes [<flags>]
      show*
      test [<flags>] [<labels>...]

amtool configuration file

Default configuration file paths: $HOME/.config/amtool/config.yml or /etc/amtool/config.yml

# Define the path that `amtool` can find your `alertmanager` instance
alertmanager.url: "http://progs:9093"

# Override the default author. (unset defaults to your username)
author: me@example.com

# Force amtool to give you an error if you don't include a comment on a silence
comment_required: true
comment: default

# Set a default output format. (unset defaults to simple)
output: extended

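Creating a silence

The following command adds a new silence on the Alertmanager at http://progs:9093. It matches alerts on two labels: the automatically populated alertname label, which holds the alert's name, and the service label. The command returns a silence ID.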

/ups/app/monitor/alertmanager/bin/amtool --comment=testing --alertmanager.url=http://progs:9093 silence add alertname=InstancesGone service=application

Adding a silence using the default config.yml file

/ups/app/monitor/alertmanager/bin/amtool silence add alertname=InstancesGone

Querying the current list of silences

/ups/app/monitor/alertmanager/bin/amtool --alertmanager.url=http://progs:9093 silence query
