Configuring Alertmanager on k8s to send alerts to QQ Mail



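1. Prometheus alert processing flow

1) The Prometheus server monitors an HTTP endpoint exposed by the target host (call it endpoint A) and scrapes its metrics at the interval defined by "scrape_interval" in the Prometheus configuration.

2) When endpoint A becomes unavailable, the server keeps trying to fetch data from it until "scrape_timeout" elapses, then gives up and marks the endpoint as "DOWN".

3) In parallel, Prometheus evaluates the alert rules periodically at the interval set by "evaluation_interval" (1m by default). When an evaluation finds endpoint A DOWN, that is, up == 0 is true, the alert is activated, enters the "PENDING" state, and the time it became active is recorded.

4) At the next rule evaluation, if up == 0 still holds, Prometheus checks whether the alert has been active longer than the "for" duration in the rule. If not, it waits for the next evaluation cycle; if so, the alert state changes to "FIRING" and Prometheus calls the Alertmanager API to send the alert data.

5) After receiving the alert data, Alertmanager groups the alerts and then waits for the configured "group_wait" time before sending the notification.

6) New alerts may join an existing alert group while it is waiting. If a notification for the group has already been sent successfully, the next notification for that group is sent after "group_interval". With email notifications, for example, alerts that belong to the same group are combined into a single email.

7) If the alerts in a group have not changed and a notification was already sent successfully, the same email is sent again after "repeat_interval". If the previous notification failed to send, the case is treated like step 6 and the notification is retried after "group_interval".

8) Finally, who receives an alert, which conditions select a given receiver, and how often each kind of alert is sent are all configured through Alertmanager's route rules.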
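
The PENDING to FIRING transition described in steps 3 and 4 can be sketched in Python. This is a simplified model for illustration, not Prometheus's actual implementation; the function name `alert_states` and the sample inputs are hypothetical.

```python
from typing import List, Optional


def alert_states(samples: List[bool], evaluation_interval: float,
                 for_duration: float) -> List[str]:
    """Return the alert state after each rule evaluation.

    samples[i] is True when the alert expression (for example up == 0)
    holds at evaluation i. Simplified model: the alert is PENDING from the
    first evaluation where the expression holds, and FIRING once it has
    held for at least for_duration seconds.
    """
    states = []
    active_since: Optional[float] = None
    for i, condition in enumerate(samples):
        now = i * evaluation_interval
        if not condition:
            # Expression no longer holds: the alert is reset.
            active_since = None
            states.append("INACTIVE")
            continue
        if active_since is None:
            active_since = now  # record when the alert became active
        if now - active_since >= for_duration:
            states.append("FIRING")
        else:
            states.append("PENDING")
    return states


# evaluation_interval: 1m and for: 2s, as used in the rules of this article
print(alert_states([False, True, True, False], 60, 2))
# ['INACTIVE', 'PENDING', 'FIRING', 'INACTIVE']
```

With a 1m evaluation interval, a "for" of 2s cannot be observed any sooner than the next evaluation, so in this model the alert fires roughly one minute after it first becomes active.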

2. Prometheus and Alertmanager configuration

2.1 Configure Alertmanager and the alerting rules

1) Create the Alertmanager configuration file

[root@k8s-master1 prometheus]# cat alertmanager-cm.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: '18665870472@163.com'
      smtp_auth_username: '18665870472'
      smtp_auth_password: 'GGCTEDQDVLKPCIID'
      smtp_require_tls: false
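    route:  # defines how alerts are routed
      group_by: [alertname]  # label(s) used to group alerts
      group_wait: 10s  # wait 10s before the first notification for a new group, so alerts of the same group go out together
      group_interval: 10s  # wait before notifying about new alerts added to a group that was already notified
      repeat_interval: 10m  # wait before re-sending an unchanged notification; limits duplicate emails (default is 4h)
      receiver: default-receiver  # who receives the alerts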
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '352972405@qq.com'
        send_resolved: true
        
[root@k8s-master1 prometheus]# kubectl apply -f alertmanager-cm.yaml
configmap/alertmanager created
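
The effect of `group_by: [alertname]` in the route above can be pictured with a small Python sketch: alerts that share the same values for the grouping labels end up in one bucket, and each bucket becomes one email. This is only an illustration of the idea, not Alertmanager's code; `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def group_alerts(alerts: List[Labels],
                 group_by: List[str]) -> Dict[Tuple[str, ...], List[Labels]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Labels]] = defaultdict(list)
    for labels in alerts:
        # Assumption for this sketch: a missing label counts as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "instance": "192.168.40.181:9100"},
    {"alertname": "InstanceDown", "instance": "192.168.40.182:9100"},
    {"alertname": "Pod_restarts", "namespace": "kube-system"},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Grouping by `alertname` alone puts both InstanceDown alerts into one notification; adding `instance` to `group_by` would split them into separate ones.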

2) Create the Prometheus configuration and alerting rules file

[root@k8s-master1 prometheus]# cat prometheus-alertmanager-cfg.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role:  node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name 
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:10249','192.168.40.181:10249','192.168.40.182:10249']
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:2379']
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 80%"
      - alert: kube-proxy CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 90%"
      - alert: scheduler CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 80%"
      - alert: scheduler CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 90%"
      - alert: controller-manager CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 80%"
      - alert: controller-manager CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 90%"
      - alert: apiserver CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 80%"
      - alert: apiserver CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 90%"
      - alert: etcd CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 80%"
      - alert: etcd CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} exceeds 90%"
      - alert: kube-state-metrics CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} exceeds 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: kube-state-metrics CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} exceeds 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: coredns CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} exceeds 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: coredns CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} exceeds 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: kube-proxy open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-schedule"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-schedule"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-etcd"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-etcd"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Add-on {{$labels.k8s_app}} ({{$labels.instance}}): more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "Add-on {{$labels.k8s_app}} ({{$labels.instance}}): more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: scheduler
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kubernetes-etcd
        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kube-dns
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Add-on {{$labels.k8s_app}} ({{$labels.instance}}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-controller-manager|kubernetes-apiserver"}[1m])) > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): TPS exceeds 1000"
          value: "{{ $value }}"
          threshold: "1000"
      - alert: Pod_restarts
        expr: kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Container {{$labels.container}} of pod {{$labels.pod}} in namespace {{$labels.namespace}} has restarted; this metric was collected from {{$labels.instance}}"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} of pod {{$labels.pod}} failed to start and is waiting"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} of pod {{$labels.pod}} has terminated"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): there is currently no leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): the leader has changed"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): service failure (proposals are failing)"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{$labels.job}} ({{$labels.instance}}): database size exceeds 10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{$labels.namespace}} ({{$labels.instance}}): endpoint {{$labels.endpoint}} is not ready"
          value: "{{ $value }}"
          threshold: "1"
    - name: Physical node status alerts
      rules:
      - alert: Physical node CPU usage
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} CPU usage is too high"
          description: "CPU usage on {{ $labels.instance }} exceeds 90% (current: [{{ $value }}]); investigation required"
      - alert: Physical node memory usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is too high"
          description: "Memory usage on {{ $labels.instance }} exceeds 90% (current: [{{ $value }}]); investigation required"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down"
          description: "{{ $labels.instance }}: scrape target is unreachable (up == 0)"
      - alert: Physical node disk I/O performance
        expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} disk I/O utilization is too high!"
          description: "{{$labels.instance}} disk I/O idle percentage is below 60% (current idle: {{$value}}%)"
      - alert: Inbound network bandwidth
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} inbound network bandwidth is too high!"
          description: "{{$labels.instance}} inbound network bandwidth has stayed above 100M over the last 5 minutes. RX rate: {{$value}}"
      - alert: Outbound network bandwidth
        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} outbound network bandwidth is too high!"
          description: "{{$labels.instance}} outbound network bandwidth has stayed above 100M over the last 5 minutes. TX rate: {{$value}}"
      - alert: TCP sessions
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} TCP_ESTABLISHED count is too high!"
          description: "{{$labels.instance}} TCP_ESTABLISHED count is above 1000 (current: {{$value}})"
      - alert: Disk capacity
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} disk partition usage is too high!"
          description: "{{$labels.mountpoint }} disk partition usage is above 80% (current: {{$value}}%)"
          
# Delete the previous configuration
[root@k8s-master1 prometheus]# kubectl delete -f prometheus-cfg.yaml
configmap "prometheus-config" deleted
# Apply the updated configuration
[root@k8s-master1 prometheus]# kubectl apply -f prometheus-alertmanager-cfg.yaml
configmap/prometheus-config created
[root@k8s-master1 prometheus]# kubectl get cm -n monitor-sa 
NAME                DATA   AGE
kube-root-ca.crt    1      14h
prometheus-config   2      29s
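
The `relabel_configs` entries in the file above rewrite target labels with regular expressions. A rough Python sketch of the `replace` action follows, under two assumptions that should be checked against the Prometheus documentation: the regex must match the whole joined value, and multiple source labels are joined with `;`. The helper `relabel_replace` is hypothetical and only converts Prometheus-style `${1}` / `$1` references into Python's own syntax.

```python
import re
from typing import List, Optional


def relabel_replace(source_values: List[str], regex: str, replacement: str,
                    separator: str = ";") -> Optional[str]:
    """Approximate a relabel 'replace' action.

    Returns the new target label value, or None when the regex does not
    match (in which case the target label would be left unchanged).
    """
    joined = separator.join(source_values)
    match = re.fullmatch(regex, joined)  # anchored at both ends
    if match is None:
        return None
    # Convert ${1} or $1 into Python's \g<1> group reference.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    return match.expand(template)


# kubernetes-node job: kubelet port 10250 becomes node-exporter port 9100
print(relabel_replace(["192.168.40.180:10250"], r"(.*):10250", "${1}:9100"))

# kubernetes-pods job: combine the address with the annotated port
# (the pod IP and port here are made-up sample values)
print(relabel_replace(["10.244.1.5:8080", "9153"],
                      r"([^:]+)(?::\d+)?;(\d+)", "$1:$2"))
```

The first call yields `192.168.40.180:9100` and the second `10.244.1.5:9153`, which shows why the node job ends up scraping node-exporter rather than the kubelet.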

3) Install Prometheus and Alertmanager

[root@k8s-master1 prometheus]# cat prometheus-alertmanager-deploy.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: k8s-node1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
        - name: localtime
          mountPath: /etc/localtime
      - name: alertmanager
        image: prom/alertmanager:v0.14.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage-volume
          hostPath:
           path: /data
           type: Directory
        - name: k8s-certs
          secret:
           secretName: etcd-certs
        - name: alertmanager-config
          configMap:
            name: alertmanager
        - name: alertmanager-storage
          hostPath:
           path: /data/alertmanager
           type: DirectoryOrCreate
        - name: localtime
          hostPath:
           path: /usr/share/zoneinfo/Asia/Shanghai
           
# Create the etcd-certs secret, which the Prometheus deployment needs
[root@k8s-master1 prometheus]# kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key  --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created

# Update the resource manifest
[root@k8s-master1 prometheus]# kubectl delete -f prometheus-deploy.yaml
deployment.apps "prometheus-server" deleted
[root@k8s-master1 prometheus]# kubectl apply -f prometheus-alertmanager-deploy.yaml
deployment.apps/prometheus-server created

# Check that Prometheus was deployed successfully
[root@k8s-master1 prometheus]# kubectl get pods -n monitor-sa | grep prometheus
prometheus-server-76dd9f8dc6-w9fct   2/2     Running   0          32s

4) Deploy a Service for Alertmanager so it can be reached from a browser

[root@k8s-master1 prometheus]# cat alertmanager-svc.yaml 
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
  name: alertmanager
  namespace: monitor-sa
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
  
[root@k8s-master1 prometheus]# kubectl apply -f alertmanager-svc.yaml
service/alertmanager created

# Check which node ports the services are mapped to
[root@k8s-master1 prometheus]# kubectl get svc -n monitor-sa
NAME           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
alertmanager   NodePort   10.102.118.253   <none>        9093:30066/TCP   41s
prometheus     NodePort   10.99.104.223    <none>        9090:32367/TCP   13h
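# Note: as shown above, the prometheus service is mapped to node port 32367 and the alertmanager service to node port 30066

# Open the Alertmanager web UI in a browser: http://192.168.40.180:30066/#/alerts

[Screenshot: image-20210712104342984]

View the alert email that was received:

[Screenshot: image-20210712104603045]

View the Prometheus targets:

[Screenshot: image-20210712104845051]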

2.2 Monitoring kube-scheduler

# Edit the kube-scheduler manifest
[root@k8s-master1 prometheus]# vim /etc/kubernetes/manifests/kube-scheduler.yaml

# Make the following changes
1) Change --bind-address=127.0.0.1 to --bind-address=192.168.40.180 # 192.168.40.180 is the IP of the control-plane node k8s-master1
2) Change the host field under httpGet: from 127.0.0.1 to 192.168.40.180 (two occurrences)
3) Remove --port=0

# Restart kubelet on each node
[root@k8s-node1 ~]# systemctl restart kubelet
[root@k8s-node2 ~]# systemctl restart kubelet

# The port is now listening on the host
[root@k8s-master1 prometheus]# ss -antulp | grep :10251
tcp    LISTEN     0      128      :::10251                :::*                   users:(("kube-scheduler",pid=36945,fd=7))

[Screenshot: image-20210712105711900]

2.3 Monitoring kube-controller-manager

# Edit the kube-controller-manager manifest
[root@k8s-master1 prometheus]# vim /etc/kubernetes/manifests/kube-controller-manager.yaml

# Make the following changes
1) Change --bind-address=127.0.0.1 to --bind-address=192.168.40.180 # 192.168.40.180 is the IP of the control-plane node k8s-master1
2) Change the host field under httpGet: from 127.0.0.1 to 192.168.40.180 (two occurrences)
3) Remove --port=0

# Restart kubelet on each node
[root@k8s-node1 ~]# systemctl restart kubelet
[root@k8s-node2 ~]# systemctl restart kubelet

# Check the component status
[root@k8s-master1 prometheus]# kubectl get cs 
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"}

[root@k8s-master1 prometheus]# ss -antulp | grep :10252
tcp    LISTEN     0      128      :::10252                :::*                   users:(("kube-controller",pid=41653,fd=7))

[Screenshot: image-20210712105949370]

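2.4 Monitoring kube-proxy

# By default kube-proxy's metrics port 10249 listens on 127.0.0.1, so it must be changed to listen on the node's address as shown below. For production clusters, make this change when installing k8s, which carries less risk.

# Change metricsBindAddress
[root@k8s-master1 prometheus]# kubectl edit configmap kube-proxy -n kube-system
metricsBindAddress: "0.0.0.0:10249"

# Restart kube-proxy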
[root@k8s-master1 prometheus]# kubectl get pods -n kube-system | grep kube-proxy |awk '{print $1}' | xargs kubectl delete pods -n kube-system

[root@k8s-master1 prometheus]# ss  -antulp |grep :10249
tcp    LISTEN     0      128      :::10249                :::*                   users:(("kube-proxy",pid=45896,fd=19))
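
Once port 10249 listens on the node address, Prometheus can scrape kube-proxy's metrics, which are served as plain text. The following Python sketch shows how one sample such as `process_open_fds` (used by the alert rules earlier) could be read from that text. The payload in `SAMPLE` is made up for illustration, `read_metric` is a hypothetical helper, and the parser assumes simple lines without timestamps.

```python
from typing import Optional

# Made-up excerpt in the Prometheus text exposition format.
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 19
process_virtual_memory_bytes 1.45e+08
"""


def read_metric(exposition_text: str, name: str) -> Optional[float]:
    """Return the first sample value for the metric `name`, or None."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: "<series> <value>" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            return float(value)
    return None


open_fds = read_metric(SAMPLE, "process_open_fds")
print(open_fds, open_fds is not None and open_fds > 600)
```

In a live cluster the text would come from the kube-proxy metrics endpoint on port 10249 instead of the `SAMPLE` string.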

[Screenshot: image-20210712110543869]

[Screenshot: image-20210712110601906]

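2.5 Viewing alerts

[Screenshot: image-20210712110705345]

[Screenshot: image-20210712110742586]

FIRING means that Prometheus has sent the alert to Alertmanager, where it now shows up as an alert. To see it, open the Alertmanager web UI in a browser at 192.168.40.180:30066:

[Screenshot: image-20210712110835355]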

2.6 Updating the configuration

# After changing any of the Prometheus configuration files, make the change take effect with kubectl, running the commands in this order:
# Note: do not do this in production
kubectl delete -f alertmanager-cm.yaml
kubectl apply -f alertmanager-cm.yaml
kubectl delete -f prometheus-alertmanager-cfg.yaml
kubectl apply -f prometheus-alertmanager-cfg.yaml
kubectl delete -f prometheus-alertmanager-deploy.yaml
kubectl apply -f prometheus-alertmanager-deploy.yaml
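
Because the Deployment above starts Prometheus with `--web.enable-lifecycle`, a gentler alternative may be to update only the ConfigMap with `kubectl apply` and then ask Prometheus to reload through its HTTP `/-/reload` endpoint. The Python sketch below illustrates that idea under assumptions: the base URL uses the NodePort 32367 shown earlier for this particular cluster, the helper names are hypothetical, and the updated ConfigMap must already have reached the mounted volume, which can take a short while.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the Prometheus reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def trigger_reload(base_url: str, timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Address and port are the values from this article's cluster.
    request = build_reload_request("http://192.168.40.180:32367")
    print(request.get_method(), request.full_url)
    # Uncomment on a machine that can reach the cluster:
    # print(trigger_reload("http://192.168.40.180:32367"))
```

This avoids deleting the Deployment, so the running pod and its collected data stay in place while the configuration is re-read.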

