Deploying Alertmanager for Email / DingTalk / WeChat Alerts
https://www.qikqiak.com/k8s-book/docs/57.AlertManager%E7%9A%84%E4%BD%BF%E7%94%A8.html
https://www.cnblogs.com/xiangsikai/p/11433276.html
1 Introduction
Alertmanager is mainly responsible for receiving the alerts sent by Prometheus. It supports a rich set of notification channels and makes it easy to deduplicate, throttle, and group alerts, which makes it a solid, modern alert notification system.


2 The main steps for setting up alerting and notifications are:
I. Deploy Alertmanager
II. Configure Prometheus to communicate with Alertmanager
III. Configure alerting
1. Point Prometheus at a rules directory
2. Store the alerting rules in a ConfigMap
3. Mount the ConfigMap into the container's rules directory
3 Deploy Alertmanager
3.1 Dynamically provisioned PV storage: alertmanager-pvc.yaml
[root@k8s-master alertmanager]# cat alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"

[root@k8s-master alertmanager]# kubectl get pv,pvc -n kube-system | grep alert
persistentvolume/kube-system-alertmanager-pvc-f36ec996-fdd3-4cdd-9735-423c5af1a8c9   2Gi   RWO   Delete   Bound   kube-system/alertmanager   managed-nfs-storage   4m57s
persistentvolumeclaim/alertmanager   Bound   kube-system-alertmanager-pvc-f36ec996-fdd3-4cdd-9735-423c5af1a8c9   2Gi   RWO   managed-nfs-storage   4m57s
[root@k8s-master alertmanager]#
3.2 Main configuration file: alert delivery settings in alertmanager-configmap.yaml
[root@k8s-master alertmanager]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # name of the config object
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # e-mail sender settings for alerts
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'w.jjwx@163.com'
      smtp_auth_username: 'w.jjwx@163.com'
      smtp_auth_password: '<password>'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "314144952@qq.com"
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m

[root@k8s-master alertmanager]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config created
[root@k8s-master alertmanager]# kubectl get cm -n kube-system | grep alert
alertmanager-config   1   12m
3.3 Deploy the core component: alertmanager-deployment.yaml (no changes needed)
[root@k8s-master alertmanager]# cat alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager

[root@k8s-master alertmanager]# kubectl get deploy,pod -n kube-system | grep alert
deployment.extensions/alertmanager    1/1   1   1         9m21s
pod/alertmanager-6778cc5b7c-2gc9v     2/2   Running   0   9m21s
[root@k8s-master alertmanager]#
3.4 Expose the service port
[root@k8s-master alertmanager]# cat alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  # type: "ClusterIP"
  type: "NodePort"

[root@k8s-master alertmanager]# kubectl get svc -n kube-system | grep alert
alertmanager   NodePort   10.111.52.119   <none>   80:32587/TCP   18h
[root@k8s-master alertmanager]#
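With the NodePort in place, the console described in the next section can be opened at http://<node-ip>:32587/ in a browser. As a quick sanity check from the shell (the node IP is a placeholder for any node in this cluster):

# quick reachability check against the Alertmanager NodePort
curl -sI http://<node-ip>:32587/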
3.5 The Alertmanager console

On this page we can filter, group, and so on. There are also two new concepts here: Inhibition and Silences.
- Inhibition: suppresses notifications for certain alerts when other alerts have already fired. For example, if an alert is firing that reports an entire cluster as unreachable, Alertmanager can be configured to mute all other alerts concerning that cluster. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual problem. Inhibition is configured through the configuration file shown above.
- Silences: a very simple way to ignore alerts for a given period of time. Silences are configured with matchers, similar to the routing tree. Incoming alerts are checked against the active silences for an equality or regular-expression match; if they match, no notification is sent to the receiver. (A minimal inhibition sketch follows this list.)
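As a rough illustration only (not part of this deployment), an inhibit_rules section in alertmanager.yml could look like the snippet below; the label names severity and instance are assumptions and should be adapted to your own labels:

# hypothetical inhibit_rules fragment for alertmanager.yml
inhibit_rules:
  - source_match:
      severity: 'critical'      # if a critical alert is already firing...
    target_match:
      severity: 'warning'       # ...suppress warning-level alerts...
    equal: ['instance']         # ...that share the same instance label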
4 Configure Prometheus to communicate with Alertmanager
Edit the prometheus-configmap.yaml configuration file and add the Alertmanager binding.
Adjust the alerting block at the end of the file and comment out whatever was there before:
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:80"]   # use your Alertmanager Service name; inside the cluster it is reached by Service name
Apply the configuration:
kubectl apply -f prometheus-configmap.yaml
Check in the web console that the configuration has taken effect.

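Besides the web console, the Prometheus HTTP API can confirm that the Alertmanager endpoint has been picked up; the host below is a placeholder for wherever your Prometheus is reachable:

# list the Alertmanagers that Prometheus is currently sending alerts to
curl -s http://<prometheus-host>:9090/api/v1/alertmanagers
# a healthy setup lists the alertmanager endpoint under "activeAlertmanagers"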
5 Configure alerting
5.1 Point Prometheus at the rules directory
Edit prometheus-configmap.yaml and add the rule file path:
# add: read rule files from this directory
rule_files:
- /etc/config/rules/*.rules

Apply the configuration:
kubectl apply -f prometheus-configmap.yaml
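Optionally, rule files can be syntax-checked with promtool (shipped with the Prometheus release) before they are loaded; the local file name below is just an example standing for the rules file created in the next step:

# validate rule syntax locally before applying the ConfigMap (file name is illustrative)
promtool check rules node.rules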
5.2 Store the alerting rules in a ConfigMap
Create a YAML file that stores the alerting rules in a ConfigMap:
# vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  # general rules
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  # node-level resource monitoring
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
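To sanity-check an expression before wiring it into a rule, you can evaluate it ad hoc against the Prometheus query API (the host is a placeholder; the same query can also be run in the expression browser):

# evaluate the memory-usage expression once; returns the current value per instance
curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100'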
5.3 Mount the ConfigMap into the container's rules directory
Modify the mount points; this reuses the dynamically provisioned Prometheus PV from the earlier deployment.
# vim prometheus-statefulset.yaml
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
            # add: mount the rules ConfigMap into the rules directory
            - name: prometheus-rules
              mountPath: /etc/config/rules
              subPath: ""
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        # add: volume backed by the prometheus-rules ConfigMap
        - name: prometheus-rules
          configMap:
            name: prometheus-rules

Create the ConfigMap and update the StatefulSet:
kubectl apply -f prometheus-rules.yaml
# if updating prometheus-statefulset fails, delete it first:
# kubectl delete -f prometheus-statefulset.yaml
kubectl apply -f prometheus-statefulset.yaml
5.4 Complete Prometheus configuration files
Dynamic PVC (NFS exports):
[root@k8s-master prometheus]# cat /etc/exports
/data/volumes/v1 10.6.76.0/24(rw,no_root_squash)
/data/volumes/v2 10.6.76.0/24(rw,no_root_squash)
/data/volumes/v3 10.6.76.0/24(rw,no_root_squash)
[root@k8s-master prometheus]#
prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  # general rules
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  # node-level resource monitoring
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
prometheus-rbac.yaml
# create a ServiceAccount and grant it permissions
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    # granted permissions
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
# bind the role
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system
prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
  selector:
    k8s-app: prometheus
prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    # - job_name: kubernetes-nodes
    #   scrape_interval: 30s
    #   static_configs:
    #   - targets:
    #     - 10.6.76.23:9100
    #     - 10.6.76.24:9100
    #     - 10.6.76.25:9100

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
    type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
        - name: prometheus-server-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
        - name: prometheus-server
          image: "prom/prometheus:v2.2.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          # based on 10 running nodes with 30 pods each
          resources:
            limits:
              cpu: 200m
              memory: 1000Mi
            requests:
              cpu: 200m
              memory: 1000Mi
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
            - name: prometheus-rules
              mountPath: /etc/config/rules
              subPath: ""
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: "16Gi"
Check that the services have started:
[root@k8s-master prometheus]# kubectl get pv,pvc,cm,pod,svc -n kube-system | grep prometheus
persistentvolume/kube-system-prometheus-data-prometheus-0-pvc-939cfe8c-427d-4731-91f3-b146dd8f61e2   16Gi   RWO   Delete   Bound   kube-system/prometheus-data-prometheus-0   managed-nfs-storage   2m14s
persistentvolumeclaim/prometheus-data-prometheus-0   Bound   kube-system-prometheus-data-prometheus-0-pvc-939cfe8c-427d-4731-91f3-b146dd8f61e2   16Gi   RWO   managed-nfs-storage   2m15s
configmap/prometheus-config   1   2m15s
configmap/prometheus-rules    2   2m15s
pod/prometheus-0   2/2   Running   0   2m15s
service/prometheus   NodePort   10.97.213.127   <none>   9090:32281/TCP   2m15s
5.5 Open Prometheus and check that the alert rules have been loaded



5.6 Open the Alertmanager web UI

5.7 Test the memory e-mail alert
Our memory threshold is set to 80%.

Looking at Prometheus, actual memory usage is only around 20%.

So let's lower the threshold to 10%.
[root@k8s-master prometheus]# vim prometheus-rules.yaml
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }} memory usage is above 10% (current value: {{ $value }})"
[root@k8s-master prometheus]# kubectl apply -f prometheus-rules.yaml
configmap/prometheus-rules configured

# The change did not hot-reload... probably I was just too impatient; the ConfigMap sync can take several minutes.
[root@k8s-master prometheus]# kubectl delete -f prometheus-statefulset.yaml
statefulset.apps "prometheus" deleted
[root@k8s-master prometheus]# kubectl apply -f prometheus-statefulset.yaml
statefulset.apps/prometheus created
[root@k8s-master prometheus]#
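As an alternative to recreating the StatefulSet: once the kubelet has synced the updated ConfigMap into the pod (typically within a minute or two), Prometheus can be told to re-read its config and rules explicitly, since the container was started with --web.enable-lifecycle. The ClusterIP below is the one from this deployment:

# trigger a reload after the mounted ConfigMap has synced
curl -X POST "http://10.97.213.127:9090/-/reload"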
The alert fires:


The page now shows the alert rule we just defined, along with its state. An alert goes through three states during its lifecycle:
- inactive: the alert is currently neither pending nor firing
- pending: the alert condition has been met but has not yet lasted for the configured threshold duration
- firing: the alert condition has lasted longer than the configured threshold duration
The pink (firing) entries have already been pushed to Alertmanager; only in this state is the notification actually sent.
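You can confirm that the alerts have arrived by querying the Alertmanager API (v0.14 serves the v1 API); the NodePort is the one exposed above, the node IP is a placeholder:

# list the alerts Alertmanager currently knows about
curl -s http://<node-ip>:32587/api/v1/alerts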
They are visible in Alertmanager as well.

In the QQ mailbox the alert e-mails are grouped into one message, which is nice.

However, I only have one master and two nodes, yet six alerts were fired.
My guess is this part:
[root@k8s-master prometheus]# vim prometheus-configmap.yaml
    # - job_name: kubernetes-nodes
    #   scrape_interval: 30s
    #   static_configs:
    #   - targets:
    #     - 10.6.76.23:9100
    #     - 10.6.76.24:9100
    #     - 10.6.76.25:9100
[root@k8s-master prometheus]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured

Sure enough, that was the cause: the static node_exporter targets duplicated the service-discovery targets, so every node was alerted on twice. Blindly copied configs can really bite you! The configuration file earlier in this document already has this commented out, so you won't hit the same pitfall.
The e-mail subject now reads: 3 alerts for alertname=NodeMemoryUsage
Let's silence one of the alerts; the default duration appears to be 2 hours.



過了5分鍾
Prometheus還是3個報警

But Alertmanager has applied the silence, and only 2 notifications are sent.
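The same kind of silence can also be created from the command line via the v1 API. A rough sketch, assuming you want to silence one instance for two hours; the instance value and timestamps are illustrative:

# create a silence through the Alertmanager v1 API (values are examples)
curl -s -X POST http://<node-ip>:32587/api/v1/silences -d '{
  "matchers": [
    {"name": "alertname", "value": "NodeMemoryUsage", "isRegex": false},
    {"name": "instance", "value": "10.6.76.24:9100", "isRegex": false}
  ],
  "startsAt": "2019-09-20T08:00:00Z",
  "endsAt": "2019-09-20T10:00:00Z",
  "createdBy": "ops",
  "comment": "silence one instance for 2 hours"
}'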


5.8 DingTalk alerts
https://www.qikqiak.com/k8s-book/docs/57.AlertManager%E7%9A%84%E4%BD%BF%E7%94%A8.html
Above we used AlertManager's built-in e-mail notifications. As mentioned, AlertManager supports many kinds of receivers, such as Slack and WeChat; the most flexible of them is of course the webhook. We can define a webhook to receive the alert payload, process it however we like inside the webhook service, and send whatever notification format we need.
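For reference, the payload AlertManager POSTs to a webhook receiver looks roughly like the trimmed sketch below; the field values are illustrative and taken from the log output shown further down:

{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": { "alertname": "NodeMemoryUsage" },
  "commonLabels": { "severity": "warning" },
  "externalURL": "http://alertmanager/",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "NodeMemoryUsage", "instance": "10.6.76.24:9100" },
      "annotations": { "summary": "Instance 10.6.76.24:9100 memory usage too high" },
      "startsAt": "2019-09-20T07:06:10Z"
    }
  ]
}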
Prepare a DingTalk robot


Unfortunately DingTalk happened to be in the middle of an upgrade and new robots could not be created, so we reused the one left over from Jenkins.
Configure the alert webhook
You can customize the alert payload to your own needs; see github.com/cnych/alertmanager-dingtalk-hook
The corresponding manifests are as follows (dingtalk-hook.yaml):
# cat dingtalk-hook.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: dingtalk-hook
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: dingtalk-hook
    spec:
      containers:
      - name: dingtalk-hook
        image: cnych/alertmanager-dingtalk-hook:v0.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5000
          name: http
        env:
        - name: ROBOT_TOKEN
          valueFrom:
            secretKeyRef:
              name: dingtalk-secret
              key: token
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
---
apiVersion: v1
kind: Service
metadata:
  name: dingtalk-hook
  namespace: kube-system
spec:
  selector:
    app: dingtalk-hook
  ports:
  - name: hook
    port: 5000
    targetPort: http
Note that we declared a ROBOT_TOKEN environment variable above. Since this is relatively sensitive information, we read it from a Secret object. Create a Secret named dingtalk-secret with the command below, then deploy the resources above:
[root@k8s-master alertmanager]# kubectl create secret generic dingtalk-secret --from-literal=token=17549607d838b3015d183384ffe53333b13df0a98563150df241535808e10781 -n kube-system
secret/dingtalk-secret created
[root@k8s-master alertmanager]# kubectl create -f dingtalk-hook.yaml
deployment.extensions/dingtalk-hook created
service/dingtalk-hook created
[root@k8s-master alertmanager]# kubectl get pod,deploy -n kube-system | grep dingtalk
pod/dingtalk-hook-686ddd6976-tp9g4    1/1   Running   0   3m18s
deployment.extensions/dingtalk-hook   1/1   1   1       3m18s
[root@k8s-master alertmanager]#
Once it is deployed, we can point AlertManager at the webhook by adding a route and a receiver to the configuration above:
# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # name of the config object
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # e-mail sender settings for alerts
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'w.jjwx@163.com'
      smtp_auth_username: 'w.jjwx@163.com'
      smtp_auth_password: '<password>'
      smtp_hello: '163.com'
      smtp_require_tls: false
    route:
      # labels used to regroup incoming alerts; e.g. alerts carrying cluster=A and alertname=LatencyHigh would be aggregated into one group
      group_by: ['job','alertname','severity']
      # after a new alert group is created, wait at least group_wait before the first notification, so several alerts for the same group can be collected and sent together
      group_wait: 30s
      # once the first notification has been sent, wait group_interval before sending notifications about new alerts in the group
      group_interval: 5m
      # if an alert has already been sent successfully, wait repeat_interval before resending it
      repeat_interval: 12h
      #group_interval: 5m
      #repeat_interval: 12h
      # default receiver: alerts not matched by any route go here
      receiver: default
      routes:
      - receiver: webhook
        match:
          alertname: NodeMemoryUsage   # route the memory alert to the webhook
    receivers:
    - name: 'default'
      email_configs:
      - to: '314144952@qq.com'
        send_resolved: true
    - name: 'webhook'
      webhook_configs:
      - url: 'http://dingtalk-hook:5000'
        send_resolved: true
Explanation of some of the route parameters (an example of overriding them on a child route follows this block):
route:
  # labels used to regroup incoming alerts; e.g. alerts carrying cluster=A and alertname=LatencyHigh would be aggregated into one group
  group_by: ['alertname', 'cluster']
  # after a new alert group is created, wait at least group_wait before the first notification, so several alerts for the same group can be collected and fired together
  group_wait: 30s
  # once the first notification has been sent, wait group_interval before sending notifications about new alerts for the group
  group_interval: 5m
  # if an alert has already been sent successfully, wait repeat_interval before resending it
  repeat_interval: 5m
  # default receiver: alerts not matched by any route are sent to the default receiver
  receiver: default
  # all of the attributes above are inherited by child routes and can be overridden on each child route
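For example (an illustrative fragment, not part of this deployment), a child route can override the inherited repeat_interval while everything else is inherited from the parent:

route:
  receiver: default
  repeat_interval: 12h
  routes:
  - receiver: webhook
    match:
      alertname: NodeMemoryUsage
    repeat_interval: 1h   # override: re-send this particular alert more often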
[root@k8s-master alertmanager]# kubectl get pod,svc -n kube-system | grep -E "prome|alert"
pod/alertmanager-6778cc5b7c-jtwbt   2/2   Running   3   3m43s
pod/prometheus-0                    2/2   Running   0   104m
service/alertmanager   NodePort   10.111.52.119   <none>   80:32587/TCP     23h
service/prometheus     NodePort   10.97.213.127   <none>   9090:32281/TCP   3h59m
[root@k8s-master alertmanager]#
Here we configured a receiver named webhook with the address http://dingtalk-hook:5000, which is the Service address of the DingTalk webhook receiver we deployed above.
We also add an alert rule about node filesystem usage. Note that the rule's labels must line up with the grouping, e.g. group_by: ['severity'], or whatever custom grouping you use, so that the alert is matched by the webhook route. For example, with two alerts alertname=CPU*** and alertname=MEM***, the webhook route matches one of them; the one left unmatched is delivered by the default receiver, which in our case is e-mail.

Update the AlertManager and Prometheus ConfigMap objects (delete, then re-create). Once the update has propagated, wait a moment and trigger a reload so that it takes effect:
curl -X POST "http://10.97.213.127:9090/-/reload"
Sending an alert
After a while the alert about the node fires. Because it matches the webhook route, it is delivered to the webhook receiver, i.e. the dingtalk-hook we defined above. Once it triggers, watch the Pod's logs:
[root@k8s-master alertmanager]# kubectl logs -f dingtalk-hook-686ddd6976-tp9g4 -n kube-system
 * Serving Flask app "app" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Sep/2019 07:52:39] "GET / HTTP/1.1" 200 -
10.254.2.1 - - [20/Sep/2019 07:54:05] "GET / HTTP/1.1" 200 -
10.254.2.1 - - [20/Sep/2019 07:54:05] "GET /favicon.ico HTTP/1.1" 404 -
10.254.1.176 - - [20/Sep/2019 08:31:11] "POST / HTTP/1.1" 200 -
10.254.2.1 - - [20/Sep/2019 09:16:07] "GET / HTTP/1.1" 200 -
{'receiver': 'webhook', 'status': 'firing', 'alerts': [{'status': 'firing', 'labels': {'addonmanager_kubernetes_io_mode': 'Reconcile', 'alertname': 'NodeMemoryUsage', 'instance': '10.6.76.24:9100', 'job': 'kubernetes-service-endpoints', 'kubernetes_io_cluster_service': 'true', 'kubernetes_io_name': 'NodeExporter', 'kubernetes_name': 'node-exporter', 'kubernetes_namespace': 'kube-system', 'severity': 'warning'}, 'annotations': {'description': '10.6.76.24:9100 memory usage is above 10% (current value: 15.369550162495912)', 'summary': 'Instance 10.6.76.24:9100 memory usage too high'}, 'startsAt': '2019-09-20T07:06:10.54216046Z', ………

5.9 WeChat Work (enterprise WeChat) alerts
https://www.cnblogs.com/xzkzzz/p/10211394.html
Configure WeChat Work
Log in to WeChat Work, go to App Management, click the Create App button, and fill in the application details:

Open the prometheus application that was just created and note the following information:

Add an alert message template
apiVersion: v1
kind: ConfigMap
metadata:
  name: wechat-tmpl
  namespace: kube-system
data:
  wechat.tmpl: |
    {{ define "wechat.default.message" }}
    {{ range .Alerts }}
    ========start==========
    Alerting program: prometheus_alert
    Severity: {{ .Labels.severity }}
    Alert name: {{ .Labels.alertname }}
    Affected host: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Details: {{ .Annotations.description }}
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    ========end==========
    {{ end }}
    {{ end }}
Mount the template into the Alertmanager pod at /etc/alertmanager-tmpl:
[root@k8s-master alertmanager]# tail -25 alertmanager-deployment.yaml
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: wechattmpl
              mountPath: /etc/alertmanager-tmpl
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
        - name: wechattmpl
          configMap:
            name: wechat-tmpl
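Note: for Alertmanager to actually load the template file, alertmanager.yml normally also needs a top-level templates entry pointing at the mounted directory. This line is not shown in the configuration further below, so treat it as an assumption to verify against your own setup:

# assumed addition to alertmanager.yml (top level), so the mounted template is loaded
templates:
- '/etc/alertmanager-tmpl/*.tmpl'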
Set up an alert
Add a new alert rule to the rules file (prometheus-rules.yaml in our setup):
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }} memory usage is above 10% (current value: {{ $value }})"

Configure WeChat delivery
Alertmanager supports WeChat Work notifications out of the box; see the official documentation for the available options. Our configuration is as follows:
[root@k8s-master alertmanager]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # name of the config object
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # e-mail sender settings for alerts
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'w.jjwx@163.com'
      smtp_auth_username: 'w.jjwx@163.com'
      smtp_auth_password: '***'
      smtp_hello: '163.com'
      smtp_require_tls: false
    route:
      # labels used to regroup incoming alerts; e.g. alerts carrying cluster=A and alertname=LatencyHigh would be aggregated into one group
      group_by: ['job','alertname','severity']
      # after a new alert group is created, wait at least group_wait before the first notification, so several alerts for the same group can be sent together
      group_wait: 30s
      # once the first notification has been sent, wait group_interval before sending notifications about new alerts in the group
      group_interval: 1m
      # if an alert has already been sent successfully, wait repeat_interval before resending it
      repeat_interval: 2m
      #group_interval: 5m
      #repeat_interval: 12h
      # default receiver: alerts not matched by any route go here
      receiver: default
      routes:
      - receiver: wechat
        match:
          alertname: NodeCPUUsage      # route the CPU alert to WeChat
      - receiver: webhook
        match:
          alertname: NodeMemoryUsage   # route the memory alert to the DingTalk webhook
    receivers:
    - name: 'default'
      email_configs:
      - to: '314144952@qq.com'
        send_resolved: true
    - name: 'webhook'
      webhook_configs:
      - url: 'http://dingtalk-hook:5000'
        send_resolved: true
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'wxd6b528f56d453***'
        to_party: '運維部'
        to_user: "@all"
        agent_id: '10000**'
        api_secret: 'tWDIGSDCIIo4zkh42hn4IhuxB-FPjx2Ui4E0Vqt***'
        send_resolved: true
wechat_configs parameter details
- send_resolved: whether to send a notification when an alert is resolved; defaults to false
- api_secret: the Secret of the application created in WeChat Work
- api_url: the WeChat API URL; the default is fine
- corp_id: the enterprise ID, found under WeChat Work -> My Enterprise, at the bottom of the page
- message: the alert message template; defaults to the template "wechat.default.message"
- agent_id: the agent_id of the application created in WeChat Work
- to_user: the user(s) that receive the message; @all sends to everyone
- to_party: the department that receives the message
Load the updated configuration files:
kubectl apply -f alertmanager-deployment.yaml
kubectl apply -f alertmanager-configmap.yaml
Alerts received:


