Prometheus monitoring on k8s (9) - Deploying Alertmanager for email/DingTalk/WeChat alerts


Deploying Alertmanager for email/DingTalk/WeChat alerts

https://www.qikqiak.com/k8s-book/docs/57.AlertManager%E7%9A%84%E4%BD%BF%E7%94%A8.html

https://www.cnblogs.com/xiangsikai/p/11433276.html

 

1 Overview

Alertmanager receives the alerts sent to it by Prometheus. It supports a rich set of notification channels and makes it easy to deduplicate, throttle, and group alert notifications, making it a capable, modern alert notification system.

2 The main steps to set up alerting and notification:


1. Deploy Alertmanager
2. Configure Prometheus to communicate with Alertmanager
3. Configure alerting
  1. Point Prometheus at a rules directory
  2. Store the alert rules in a ConfigMap
  3. Mount the ConfigMap into the container's rules directory

 

3 Deploy Alertmanager

3.1 PVC using dynamic PV provisioning: alertmanager-pvc.yaml

[root@k8s-master alertmanager]# cat alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"

[root@k8s-master alertmanager]# kubectl get pv,pvc -n kube-system | grep alert
persistentvolume/kube-system-alertmanager-pvc-f36ec996-fdd3-4cdd-9735-423c5af1a8c9   2Gi        RWO            Delete           Bound    kube-system/alertmanager                   managed-nfs-storage            4m57s
persistentvolumeclaim/alertmanager                   Bound     kube-system-alertmanager-pvc-f36ec996-fdd3-4cdd-9735-423c5af1a8c9   2Gi        RWO            managed-nfs-storage   4m57s
[root@k8s-master alertmanager]#
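
If the PVC stays in Pending instead of Bound, the following quick checks (assuming the managed-nfs-storage StorageClass from the earlier chapters is in place) usually show why:

# Inspect the claim's events and confirm the StorageClass exists
kubectl describe pvc alertmanager -n kube-system
kubectl get storageclass managed-nfs-storage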

 

 

3.2 Main configuration ConfigMap with alert delivery settings: alertmanager-configmap.yaml

[root@k8s-master alertmanager]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # ConfigMap name
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # Custom email (SMTP) settings for alert delivery
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'w.jjwx@163.com'
      smtp_auth_username: 'w.jjwx@163.com'
      smtp_auth_password: 'password'   # placeholder - use the real SMTP auth password

    receivers:
    - name: default-receiver
      email_configs:
      - to: "314144952@qq.com"

    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master alertmanager]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config created
[root@k8s-master alertmanager]#
[root@k8s-master alertmanager]# kubectl get cm -n kube-system| grep alert
alertmanager-config                  1      12m
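
Before (or after) applying the ConfigMap, the rendered alertmanager.yml can be validated offline. This is an optional sketch that assumes the amtool binary shipped with the Alertmanager release is available on the workstation:

# Dump the ConfigMap data and run it through amtool's syntax check
kubectl get configmap alertmanager-config -n kube-system \
  -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
amtool check-config /tmp/alertmanager.yml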

 

 

3.3 Deploy the core component: alertmanager-deployment.yaml (no changes needed)

[root@k8s-master alertmanager]# cat alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
[root@k8s-master alertmanager]# kubectl get deploy,pod -n kube-system| grep alert

deployment.extensions/alertmanager               1/1     1            1           9m21s
pod/alertmanager-6778cc5b7c-2gc9v             2/2     Running     0          9m21s
[root@k8s-master alertmanager]#
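
To confirm the container really loaded the configuration, a quick port-forward against the Deployment works; the /api/v1/status endpoint of the Alertmanager v0.14 HTTP API returns the running configuration as JSON (a sketch, run from the master):

kubectl -n kube-system port-forward deployment/alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v1/status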

 

 

3.4 Expose the Service port

[root@k8s-master alertmanager]# cat alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: "NodePort"   # changed from the default "ClusterIP"


[root@k8s-master alertmanager]# kubectl get svc -n kube-system| grep alert
alertmanager               NodePort    10.111.52.119    <none>        80:32587/TCP             18h
[root@k8s-master alertmanager]#

 

 

 

3.5 The Alertmanager console

 

 

 

 

On this page we can do things like filtering and grouping. It also introduces two new concepts: Inhibition and Silences.

  • Inhibition: suppressing notifications for certain alerts when other alerts have already fired. For example, if an alert is firing that reports an entire cluster as unreachable, Alertmanager can be configured to mute all other alerts concerning that cluster. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual problem. Inhibition is configured through the configuration file shown above.
  • Silences: a very simple way to ignore all alerts for a given period of time. Silences are configured with matchers, similar to the routing tree. Incoming alerts are checked against the active silences for an exact or regular-expression match; if they match, no notifications are sent for those alerts.

 

4 Configure Prometheus to communicate with Alertmanager

Edit the prometheus-configmap.yaml file and add the Alertmanager binding.

 

Modify the alerting block at the end; the earlier entries are commented out.

    alerting:
      alertmanagers:
      - static_configs:
          - targets: ["alertmanager:80"]   #### use the Alertmanager Service name here; inside the cluster it is reached via the Service name

 

Apply the configuration:

kubectl apply -f prometheus-configmap.yaml

 

Check in the Prometheus web console whether the configuration has taken effect.
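
Besides the web console, the binding can also be verified through the Prometheus HTTP API: /api/v1/alertmanagers lists every Alertmanager instance Prometheus will deliver alerts to (a quick sketch; replace <NodeIP> with any node address and use the prometheus Service NodePort, which is 32281 in this environment):

curl -s http://<NodeIP>:32281/api/v1/alertmanagers
# the activeAlertmanagers list should contain an entry pointing at the alertmanager Service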

 

 

 

 

 

5 Configure alerting

5.1 Point Prometheus at the rules directory

Edit prometheus-configmap.yaml and add the rule file location:

    # Added: load rule files from this directory
    rule_files:
    - /etc/config/rules/*.rules

 

 

 

Apply the configuration:

kubectl apply -f prometheus-configmap.yaml
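
Because the StatefulSet starts Prometheus with --web.enable-lifecycle and also runs a configmap-reload sidecar, the updated ConfigMap is normally picked up on its own once the kubelet syncs it (this can take a minute or two). It can also be forced through the lifecycle endpoint, as done later in this article:

# 10.97.213.127 is the prometheus Service ClusterIP in this environment; adjust to yours
curl -X POST "http://10.97.213.127:9090/-/reload"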

 

5.2 Store alert rules in a ConfigMap

Create a YAML file and store the alert rules in a ConfigMap.

 

#vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  # General-purpose rules
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error 
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  # Node resource monitoring
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80 
        for: 1m
        labels:
          severity: warning 
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"

      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"

      - alert: NodeCPUUsage    
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
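
The rule expressions can be syntax-checked before the ConfigMap is applied. A minimal sketch, assuming promtool (shipped with the Prometheus release) is available and the two groups have been saved to local files named after the ConfigMap keys:

# promtool reports parse errors and the number of rules found in each file
promtool check rules general.rules node.rules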

 

5.3 Mount the ConfigMap into the container's rules directory

Modify the mount points; this reuses the dynamic PV from the earlier Prometheus deployment.

#vim prometheus-statefulset.yaml
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
            # Added: mount the rules ConfigMap
            - name: prometheus-rules
              mountPath: /etc/config/rules
              subPath: ""
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        # Added: rules volume
        - name: prometheus-rules
          # Added: backed by a ConfigMap
          configMap:
            # Added: the ConfigMap name
            name: prometheus-rules

 

 

 

Create the ConfigMap and roll out the StatefulSet:
kubectl apply -f prometheus-rules.yaml
# If updating prometheus-statefulset fails, delete it first
#kubectl delete -f prometheus-statefulset.yaml 
kubectl apply -f prometheus-statefulset.yaml 
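
Once the prometheus-0 Pod is Running again, it is worth confirming that the rule files actually show up inside the container:

kubectl -n kube-system exec prometheus-0 -c prometheus-server -- ls /etc/config/rules
# expected output: general.rules  node.rules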

 

 

5.4 The complete Prometheus manifests

Dynamic PVC (NFS exports):

[root@k8s-master prometheus]# cat /etc/exports
 /data/volumes/v1  10.6.76.0/24(rw,no_root_squash)
 /data/volumes/v2  10.6.76.0/24(rw,no_root_squash)
 /data/volumes/v3  10.6.76.0/24(rw,no_root_squash)
[root@k8s-master prometheus]#

 

prometheus-rules.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  # General-purpose rules
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  # Node resource monitoring
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"

      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"

      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"

prometheus-rbac.yaml

apiVersion: v1
# Create a ServiceAccount and grant it permissions
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    # Resources granted
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
# Role binding
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

 

prometheus-service.yaml

kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
  selector:
    k8s-app: prometheus

prometheus-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
#
#    - job_name: kubernetes-nodes
#      scrape_interval: 30s
#      static_configs:
#      - targets:
#        - 10.6.76.23:9100
#        - 10.6.76.24:9100
#        - 10.6.76.25:9100

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    alerting:
      alertmanagers:
      - static_configs:
          - targets: ["alertmanager:80"]

 

prometheus-statefulset.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
   type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
        - name: prometheus-server-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi

        - name: prometheus-server
          image: "prom/prometheus:v2.2.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          # based on 10 running nodes with 30 pods each
          resources:
            limits:
              cpu: 200m
              memory: 1000Mi
            requests:
              cpu: 200m
              memory: 1000Mi

          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
            - name: prometheus-rules
              mountPath: /etc/config/rules
              subPath: ""

      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules

  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: "16Gi"

 

Check that the services are up:

[root@k8s-master prometheus]# kubectl get pv,pvc,cm,pod,svc -n kube-system  |grep prometheus

persistentvolume/kube-system-prometheus-data-prometheus-0-pvc-939cfe8c-427d-4731-91f3-b146dd8f61e2   16Gi       RWO            Delete           Bound    kube-system/prometheus-data-prometheus-0   managed-nfs-storage            2m14s
persistentvolumeclaim/prometheus-data-prometheus-0   Bound     kube-system-prometheus-data-prometheus-0-pvc-939cfe8c-427d-4731-91f3-b146dd8f61e2   16Gi       RWO            managed-nfs-storage   2m15s
configmap/prometheus-config                    1      2m15s
configmap/prometheus-rules                     2      2m15s

pod/prometheus-0                              2/2     Running     0          2m15s
service/prometheus                 NodePort    10.97.213.127    <none>        9090:32281/TCP           2m15s

 

 

5.5 Access Prometheus and check that the alert rules show up under Alerts

 

 

 

 

 

 

 

 

 

5.6 Access the Alertmanager backend

 

 

 

 

5.7 Test the memory email alert

Our memory threshold is set at 80%.

 

 

 

Looking at Prometheus, the actual usage is only about 20%.
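
The value in the screenshot can be reproduced in the Prometheus expression browser by running the rule's expression without the threshold:

100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100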

 

 

 

Change the threshold to 10%:

 

[root@k8s-master prometheus]# vim prometheus-rules.yaml
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 10% (current value: {{ $value }})"

[root@k8s-master prometheus]# kubectl apply  -f  prometheus-rules.yaml
configmap/prometheus-rules configured

# Hot reload did not seem to kick in... probably I was just impatient; it can take around five minutes

[root@k8s-master prometheus]# kubectl delete  -f  prometheus-statefulset.yaml
statefulset.apps "prometheus" deleted
[root@k8s-master prometheus]# kubectl apply  -f  prometheus-statefulset.yaml
statefulset.apps/prometheus created
[root@k8s-master prometheus]#

 

The alert fires.

 

 

 

 

 

The page now shows the alert rule we just defined, together with its state. An alert passes through three states during its lifecycle:

  • inactive: the alert is neither pending nor firing
  • pending: the expression has triggered but has not yet lasted for the configured threshold duration
  • firing: the expression has stayed triggered beyond the configured threshold duration

 

Pink (firing) entries have already been pushed to Alertmanager; it is only in this state that the notification is actually sent.

 

The alerts are also visible in Alertmanager.

 

 

 

 

In the QQ mailbox the alert emails are merged into one message, which is nice.

 

 

 

But I only have one master and two nodes, yet six alerts came out.

A guess: it is this part.

[root@k8s-master prometheus]# vim prometheus-configmap.yaml
    #- job_name: kubernetes-nodes
    #  scrape_interval: 30s
    #  static_configs:
    #  - targets:
    #    - 10.6.76.23:9100
    #    - 10.6.76.24:9100
    #    - 10.6.76.25:9100

[root@k8s-master prometheus]# kubectl apply  -f  prometheus-configmap.yaml
configmap/prometheus-config configured

 

 

Sure enough, the copied configuration was the trap!! Since the configuration file earlier in this article has already been fixed, you will not run into this pitfall.

3 alerts for alertname=NodeMemoryUsage

View in AlertManager

[3] Firing

Labels
alertname = NodeMemoryUsage
addonmanager_kubernetes_io_mode = Reconcile
instance = 10.6.76.23:9100
job = kubernetes-service-endpoints
kubernetes_io_cluster_service = true
kubernetes_io_name = NodeExporter
kubernetes_name = node-exporter
kubernetes_namespace = kube-system
severity = warning
Annotations
description = 10.6.76.23:9100 memory usage is above 10% (current value: 38.64800263090956)
summary = Instance 10.6.76.23:9100 memory usage is too high
Source

Labels
alertname = NodeMemoryUsage
addonmanager_kubernetes_io_mode = Reconcile
instance = 10.6.76.24:9100
job = kubernetes-service-endpoints
kubernetes_io_cluster_service = true
kubernetes_io_name = NodeExporter
kubernetes_name = node-exporter
kubernetes_namespace = kube-system
severity = warning
Annotations
description = 10.6.76.24:9100 memory usage is above 10% (current value: 10.784768758831163)
summary = Instance 10.6.76.24:9100 memory usage is too high
Source

Labels
alertname = NodeMemoryUsage
addonmanager_kubernetes_io_mode = Reconcile
instance = 10.6.76.25:9100
job = kubernetes-service-endpoints
kubernetes_io_cluster_service = true
kubernetes_io_name = NodeExporter
kubernetes_name = node-exporter
kubernetes_namespace = kube-system
severity = warning
Annotations
description = 10.6.76.25:9100 memory usage is above 10% (current value: 23.137839958556313)
summary = Instance 10.6.76.25:9100 memory usage is too high
Source

 

 

Silence one of the alerts; the default duration appears to be two hours.
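
Silences can also be created from the command line instead of the web UI. A sketch, assuming amtool is available and pointed at the NodePort exposed above (the matcher values are examples):

amtool --alertmanager.url=http://<NodeIP>:32587 silence add \
  alertname=NodeMemoryUsage instance=10.6.76.24:9100 \
  --duration=2h --comment="silenced for maintenance"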

 

 

 

 

 

 

 

過了5分鍾

Prometheus還是3個報警

 

 

 

But Alertmanager has applied the silence, and only 2 alerts were sent out.

 

 

 

 

 

 

5.8 DingTalk alerts

https://www.qikqiak.com/k8s-book/docs/57.AlertManager%E7%9A%84%E4%BD%BF%E7%94%A8.html

 

Above we used AlertManager's built-in email notification. As mentioned, AlertManager supports many kinds of receivers, such as Slack, WeChat and others. The most flexible approach is of course a webhook: we can define a webhook to receive the alert notifications and then process them in the webhook, sending whatever kind of message we want.

 

Prepare a DingTalk robot

 

 

 

 

 

 

Unfortunately this coincided with a DingTalk upgrade and new robots could not be created, so we reuse the one left over from our earlier Jenkins setup:

https://oapi.dingtalk.com/robot/send?access_token=17549607d838b3015d183384ffe53333b13df0a98563150df241535808e10781

 

Configure the alert hook

You can customize the alert payload to your own needs; see github.com/cnych/alertmanager-dingtalk-hook

The corresponding manifest is as follows (dingtalk-hook.yaml):

#cat dingtalk-hook.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: dingtalk-hook
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: dingtalk-hook
    spec:
      containers:
      - name: dingtalk-hook
        image: cnych/alertmanager-dingtalk-hook:v0.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5000
          name: http
        env:
        - name: ROBOT_TOKEN
          valueFrom:
            secretKeyRef:
              name: dingtalk-secret
              key: token
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi

---
apiVersion: v1
kind: Service
metadata:
  name: dingtalk-hook
  namespace: kube-system
spec:
  selector:
    app: dingtalk-hook
  ports:
  - name: hook
    port: 5000
    targetPort: http

 

 

Note that we declared a ROBOT_TOKEN environment variable above. Since this is relatively sensitive information, we read it from a Secret object. Create a Secret named dingtalk-secret with the command below, then deploy the resources above:

[root@k8s-master alertmanager]# kubectl create secret generic dingtalk-secret --from-literal=token=17549607d838b3015d183384ffe53333b13df0a98563150df241535808e10781 -n kube-system
secret/dingtalk-secret created
[root@k8s-master alertmanager]# kubectl create -f dingtalk-hook.yaml
deployment.extensions/dingtalk-hook created
service/dingtalk-hook created
[root@k8s-master alertmanager]#
[root@k8s-master alertmanager]# kubectl get pod,deploy -n kube-system | grep dingtalk
pod/dingtalk-hook-686ddd6976-tp9g4            1/1     Running     0          3m18s

deployment.extensions/dingtalk-hook              1/1     1            1           3m18s
[root@k8s-master alertmanager]#
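
Before wiring the hook into AlertManager, it can be exercised directly with a hand-crafted payload shaped like AlertManager's webhook format (a rough sketch only; whether a trimmed payload is accepted depends on the hook's implementation):

kubectl -n kube-system port-forward svc/dingtalk-hook 5000:5000 &
curl -s -H 'Content-Type: application/json' http://localhost:5000 \
  -d '{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"test","description":"hand-crafted test payload"}}]}'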

 

 

 

Once deployed, we can configure a webhook in AlertManager by adding a route and receiver to the configuration above:

 

#cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # ConfigMap name
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # Custom email (SMTP) settings for alert delivery
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'w.jjwx@163.com'
      smtp_auth_username: 'w.jjwx@163.com'
      smtp_auth_password: 'password'   # placeholder - use the real SMTP auth password
      smtp_hello: '163.com'
      smtp_require_tls: false
    route:
      # The labels listed here re-group incoming alerts; for example, alerts carrying cluster=A and alertname=LatencyHigh will be aggregated into a single group
      group_by: ['job','alertname','severity']
      # After a new alert group is created, wait at least group_wait before sending the first notification; this gives the group time to collect several alerts so they can be sent together
      group_wait: 30s
      # After the first notification has been sent, wait group_interval before sending notifications about new alerts added to the group
      group_interval: 5m
      # If a notification has already been sent successfully, wait repeat_interval before sending it again
      repeat_interval: 12h

      #group_interval: 5m
      #repeat_interval: 12h

      receiver: default   # default receiver: alerts not matched by any route are sent here
      routes:
      - receiver: webhook
        match:
          alertname: NodeMemoryUsage   # match this memory alert

    receivers:
    - name: 'default'
      email_configs:
      - to: '314144952@qq.com'
        send_resolved: true
    - name: 'webhook'
      webhook_configs:
      - url: 'http://dingtalk-hook:5000'
        send_resolved: true

 

 

Some notes on the parameters:

    route:
      # The labels listed here re-group incoming alerts; for example, alerts carrying cluster=A and alertname=LatencyHigh will be aggregated into a single group
      group_by: ['alertname', 'cluster']
      # After a new alert group is created, wait at least group_wait before sending the first notification; this gives the group time to collect several alerts so they can be sent together
      group_wait: 30s

      # After the first notification has been sent, wait group_interval before sending notifications about new alerts added to the group
      group_interval: 5m

      # If a notification has already been sent successfully, wait repeat_interval before sending it again
      repeat_interval: 5m

      # Default receiver: if an alert is not matched by any route, it is sent to the default receiver
      receiver: default

      # All of the attributes above are inherited by child routes and can be overridden on each child route.

 

[root@k8s-master alertmanager]# kubectl get pod,svc   -n kube-system| grep  -E "prome|alert"
pod/alertmanager-6778cc5b7c-jtwbt             2/2     Running     3          3m43s

pod/prometheus-0                              2/2     Running     0          104m
service/alertmanager               NodePort    10.111.52.119    <none>        80:32587/TCP             23h
service/prometheus                 NodePort    10.97.213.127    <none>        9090:32281/TCP           3h59m
[root@k8s-master alertmanager]#

 

 

Here we configured a receiver named webhook with the URL http://dingtalk-hook:5000, which is simply the Service address of the DingTalk webhook receiver we deployed above.

We also add an alert rule for node filesystem usage. Note that routing is driven by the alert labels together with group_by: ['job','alertname','severity'] (or whatever grouping you customize). Alerts are matched against the routes by those labels: for example, with two alerts alertname=CPU*** and alertname=MEM***, the one matched by the webhook route goes to DingTalk, while the unmatched one falls back to the default receiver, which in our case is email.

 

 

 

 

 

Update the AlertManager and Prometheus ConfigMap objects (delete them first, then recreate), and after a short while run a reload so the update takes effect:

curl -X POST "http://10.97.213.127:9090/-/reload"

 

 

Trigger an alert

 

After a while the alert for the node filesystem fires. Because the alert carries a severity label, it is routed to the webhook receiver, i.e. the dingtalk-hook we defined above. Once it has fired, check the Pod's logs:

 

[root@k8s-master alertmanager]# kubectl logs -f  dingtalk-hook-686ddd6976-tp9g4 -n kube-system
 * Serving Flask app "app" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Sep/2019 07:52:39] "GET / HTTP/1.1" 200 -
10.254.2.1 - - [20/Sep/2019 07:54:05] "GET / HTTP/1.1" 200 -
10.254.2.1 - - [20/Sep/2019 07:54:05] "GET /favicon.ico HTTP/1.1" 404 -
10.254.1.176 - - [20/Sep/2019 08:31:11] "POST / HTTP/1.1" 200 -
10.254.2.1 - - [20/Sep/2019 09:16:07] "GET / HTTP/1.1" 200 -
{'receiver': 'webhook', 'status': 'firing', 'alerts': [{'status': 'firing', 'labels': {'addonmanager_kubernetes_io_mode': 'Reconcile', 'alertname': 'NodeMemoryUsage', 'instance': '10.6.76.24:9100', 'job': 'kubernetes-service-endpoints', 'kubernetes_io_cluster_service': 'true', 'kubernetes_io_name': 'NodeExporter', 'kubernetes_name': 'node-exporter', 'kubernetes_namespace': 'kube-system', 'severity': 'warning'}, 'annotations': {'description': '10.6.76.24:9100 memory usage is above 10% (current value: 15.369550162495912)', 'summary': 'Instance 10.6.76.24:9100 memory usage is too high'}, 'startsAt': '2019-09-20T07:06:10.54216046Z',
………

 

 

 

 

 

 

 

5.9 Enterprise WeChat (WeChat Work) alerts

https://www.cnblogs.com/xzkzzz/p/10211394.html

Configure WeChat Work

Log in to WeChat Work, go to App Management, click the Create App button, and fill in the application details:

 

 

 

Open the prometheus application we just created and collect the following information:

 

 

 

Add an alert message template

 

apiVersion: v1
kind: ConfigMap
metadata:
  name: wechat-tmpl
  namespace: kube-system
data:
  wechat.tmpl: |
    {{ define "wechat.default.message" }}
    {{ range .Alerts }}
    ========start==========
    Alerting program: prometheus_alert
    Severity: {{ .Labels.severity }}
    Alert name: {{ .Labels.alertname }}
    Instance: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    ========end==========
    {{ end }}
    {{ end }}

 

Mount the template into the Alertmanager Pod at /etc/alertmanager-tmpl.

 

[root@k8s-master alertmanager]# tail -25  alertmanager-deployment.yaml
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: wechattmpl
              mountPath: /etc/alertmanager-tmpl
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
        - name: wechattmpl
          configMap:
            name: wechat-tmpl
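
Mounting the template file alone is not enough; Alertmanager also has to be told to load it. A hedged sketch of the extra lines for alertmanager.yml (the templates key is standard Alertmanager configuration, and wechat_configs uses the wechat.default.message template by default, so defining that template in the mounted file overrides the built-in text):

    # add alongside global/route/receivers in alertmanager.yml
    templates:
    - '/etc/alertmanager-tmpl/*.tmpl'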

 

 

Set up an alert

Add a new alert rule to the rules file.

 

      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 10% (current value: {{ $value }})"

 

 

 

Configure WeChat delivery

 

Alertmanager supports WeChat alert notifications out of the box; see the official documentation for details. Our configuration is as follows:

 

[root@k8s-master alertmanager]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # ConfigMap name
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # Custom email (SMTP) settings for alert delivery
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'w.jjwx@163.com'
      smtp_auth_username: 'w.jjwx@163.com'
      smtp_auth_password: '***'
      smtp_hello: '163.com'
      smtp_require_tls: false
    route:
      # The labels listed here re-group incoming alerts; for example, alerts carrying cluster=A and alertname=LatencyHigh will be aggregated into a single group
      group_by: ['job','alertname','severity']
      # After a new alert group is created, wait at least group_wait before sending the first notification; this gives the group time to collect several alerts so they can be sent together
      group_wait: 30s
      # After the first notification has been sent, wait group_interval before sending notifications about new alerts added to the group
      group_interval: 1m
      # If a notification has already been sent successfully, wait repeat_interval before sending it again
      repeat_interval: 2m

      #group_interval: 5m
      #repeat_interval: 12h

      receiver: default   # default receiver: alerts not matched by any route are sent here
      routes:
      - receiver: wechat
        match:
          alertname:  NodeCPUUsage  # match the CPU alert

      - receiver: webhook
        match:
          alertname:  NodeMemoryUsage # match this memory alert

    receivers:
    - name: 'default'
      email_configs:
      - to: '314144952@qq.com'
        send_resolved: true
    - name: 'webhook'
      webhook_configs:
      - url: 'http://dingtalk-hook:5000'
        send_resolved: true

    - name: 'wechat'
      wechat_configs:
      - corp_id: 'wxd6b528f56d453***'
        to_party: '運維部'
        to_user: "@all"
        agent_id: '10000**'
        api_secret: 'tWDIGSDCIIo4zkh42hn4IhuxB-FPjx2Ui4E0Vqt***'
        send_resolved: true

 

 

wechat_configs parameters
  • send_resolved: whether to notify when an alert is resolved; default is false
  • api_secret: the Secret of the application created in WeChat Work
  • api_url: the WeChat API URL; the default is fine
  • corp_id: WeChat Work -> My Company -> the company ID at the bottom of the page
  • message: the alert message template; default is template "wechat.default.message"
  • agent_id: the agent_id of the application created in WeChat Work
  • to_user: users who receive the message; @all sends to everyone
  • to_party: the department that receives the message

 

 

Apply the configuration files:

kubectl  apply -f alertmanager-deployment.yaml
kubectl  apply -f alertmanager-configmap.yaml
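
If no WeChat message arrives, the Alertmanager container log usually shows the error returned by the WeChat API and is the first place to look:

kubectl -n kube-system logs deploy/alertmanager -c prometheus-alertmanager | tail -n 20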

 

Alerts are received

 

 

 

 

