Installing prometheus-operator monitoring on Kubernetes 1.13.1


I. Introduction to Prometheus

Prometheus is an open-source monitoring tool aimed at cloud-native applications. As the first monitoring tool to graduate from the CNCF, it carries high expectations from developers. Within the Kubernetes community, many regard Prometheus as the first-choice monitoring solution for containers and the project setting the standard for container monitoring. This article walks through how to quickly deploy a monitoring solution for Kubernetes.

II. Installation steps

1. Contents of /app/prometheus-operator/alertmanager.yaml; this file mainly configures the alert email sender and recipients

global:
   resolve_timeout: 5m
   http_config: {}
   smtp_hello:  'smtp.exmail.qq.com:25'
   smtp_from:  'lihaichun@netschina.com'
   smtp_smarthost:  'smtp.exmail.qq.com:25'
   smtp_auth_username:  'lihaichun@netschina.com'
   smtp_auth_password:  'XXXX'
   smtp_require_tls:  false
   pagerduty_url: https://events.pagerduty.com/v2/enqueue
   hipchat_api_url: https://api.hipchat.com/
   opsgenie_api_url: https://api.opsgenie.com/
   wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
   victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
# The root route on which each incoming alert enters.
route:
   # The labels by which incoming alerts are grouped together. For example,
   # multiple alerts coming in  for  cluster=A and alertname=LatencyHigh would
   # be batched into a single group.
   group_by: ['alertname', 'cluster', 'service']
   # When a  new  group of alerts is created by an incoming alert, wait at
   # least  'group_wait'  to send the initial notification.
   # This way ensures that you get multiple alerts  for  the same group that start
   # firing shortly after another are batched together on the first
   # notification.
   group_wait: 30s
   # When the first notification was sent, wait  'group_interval'  to send a batch
   # of  new  alerts that started firing  for  that group.
   group_interval: 30s
   # If an alert has successfully been sent, wait  'repeat_interval'  to
   # resend them.
   #repeat_interval: 20s
   repeat_interval: 12h
   # A  default  receiver
   # If an alert isn't caught by a route, send it to  default .
   receiver:  default
   # All the above attributes are inherited by all child routes and can
   # overwritten on each.
   # The child route trees.
   routes:
   - match:
       severity: critical
     receiver: email_alert
receivers:
- name:  'default'
   email_configs:
   - to :  'lihaichun@zhixueyun.com,zhujun@zhixueyun.com,ouyangluping@zhixueyun.com,tangjie@zhixueyun.com'
     send_resolved:  true
- name:  'email_alert'
   email_configs:
   - to :  'lihaichun@zhixueyun.com,zhujun@zhixueyun.com,ouyangluping@zhixueyun.com,tangjie@zhixueyun.com'
     send_resolved:  true
templates: []

 

2. Contents of /app/prometheus-operator/bundle.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
   name: prometheus-operator
roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
   name: prometheus-operator
subjects:
- kind: ServiceAccount
   name: prometheus-operator
   namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
   name: prometheus-operator
rules:
- apiGroups:
   - apiextensions.k8s.io
   resources:
   - customresourcedefinitions
   verbs:
   - '*'
- apiGroups:
   - monitoring.coreos.com
   resources:
   - alertmanagers
   - prometheuses
   - prometheuses/finalizers
   - alertmanagers/finalizers
   - servicemonitors
   - prometheusrules
   verbs:
   - '*'
- apiGroups:
   - apps
   resources:
   - statefulsets
   verbs:
   - '*'
- apiGroups:
   ""
   resources:
   - configmaps
   - secrets
   verbs:
   - '*'
- apiGroups:
   ""
   resources:
   - pods
   verbs:
   - list
   - delete
- apiGroups:
   ""
   resources:
   - services
   - endpoints
   verbs:
   - get
   - create
   - update
- apiGroups:
   ""
   resources:
   - nodes
   verbs:
   - list
   - watch
- apiGroups:
   ""
   resources:
   - namespaces
   verbs:
   - get
   - list
   - watch
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
   labels:
     k8s-app: prometheus-operator
   name: prometheus-operator
   namespace: monitoring
spec:
   replicas:  1
   selector:
     matchLabels:
       k8s-app: prometheus-operator
   template:
     metadata:
       labels:
         k8s-app: prometheus-operator
     spec:
       containers:
       - args:
         - --kubelet-service=kube-system/kubelet
          - --logtostderr=true
          - --config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
          - --prometheus-config-reloader=quay.io/coreos/prometheus-config-reloader:v0.27.0
          image: quay.io/coreos/prometheus-operator:v0.27.0
         name: prometheus-operator
         ports:
         - containerPort:  8080
           name: http
         resources:
           limits:
             cpu: 200m
             memory: 200Mi
           requests:
             cpu: 100m
             memory: 100Mi
         securityContext:
           allowPrivilegeEscalation:  false
           readOnlyRootFilesystem:  true
       nodeSelector:
         beta.kubernetes.io/os: linux
       securityContext:
         runAsNonRoot:  true
         runAsUser:  65534
       serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
   name: prometheus-operator
   namespace: monitoring

3. Contents of the /app/prometheus-operator/manifests directory

[root@iZbp1at8fph52evh70atb1Z prometheus-operator]# pwd
/app/prometheus-operator
[root@iZbp1at8fph52evh70atb1Z prometheus-operator]# ls
alertmanager.yaml  bundle.yaml  manifests
[root@iZbp1at8fph52evh70atb1Z prometheus-operator]# cd manifests/
[root@iZbp1at8fph52evh70atb1Z manifests]# ls
alertmanager-alertmanager.yaml              kube-state-metrics-service.yaml                      prometheus-clusterRoleBinding.yaml
alertmanager-serviceAccount.yaml            node-exporter-clusterRoleBinding.yaml                prometheus-clusterRole.yaml
alertmanager-serviceMonitor.yaml            node-exporter-clusterRole.yaml                       prometheus-prometheus.yaml
alertmanager-service.yaml                   node-exporter-daemonset.yaml                         prometheus-roleBindingConfig.yaml
grafana-dashboardDatasources.yaml           node-exporter-serviceAccount.yaml                    prometheus-roleBindingSpecificNamespaces.yaml
grafana-dashboardDefinitions.yaml           node-exporter-serviceMonitor.yaml                    prometheus-roleConfig.yaml
grafana-dashboardSources.yaml               node-exporter-service.yaml                           prometheus-roleSpecificNamespaces.yaml
grafana-deployment.yaml                     prometheus-rules.yaml
grafana-serviceAccount.yaml                 prometheus-adapter-clusterRoleBindingDelegator.yaml  prometheus-serviceAccount.yaml
grafana-service.yaml                        prometheus-adapter-clusterRoleBinding.yaml           prometheus-serviceMonitorApiserver.yaml
kube-state-metrics-clusterRoleBinding.yaml  prometheus-adapter-clusterRoleServerResources.yaml   prometheus-serviceMonitorCoreDNS.yaml
kube-state-metrics-clusterRole.yaml         prometheus-adapter-clusterRole.yaml                  prometheus-serviceMonitorKubeControllerManager.yaml
kube-state-metrics-deployment.yaml          prometheus-adapter-configMap.yaml                    prometheus-serviceMonitorKubelet.yaml
kube-state-metrics-roleBinding.yaml         prometheus-adapter-deployment.yaml                   prometheus-serviceMonitorKubeScheduler.yaml
kube-state-metrics-role.yaml                prometheus-adapter-roleBindingAuthReader.yaml        prometheus-serviceMonitor.yaml
kube-state-metrics-serviceAccount.yaml      prometheus-adapter-serviceAccount.yaml               prometheus-service.yaml
kube-state-metrics-serviceMonitor.yaml      prometheus-adapter-service.yaml                     

4. Contents of /app/prometheus-operator/manifests/prometheus-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
   labels:
     prometheus: k8s
     role: alert-rules
   name: prometheus-k8s-rules
   namespace: monitoring
spec:
   groups:
   - name: k8s.rules
     rules:
      - expr: |
          sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
        record: namespace:container_cpu_usage_seconds_total:sum_rate
      - expr: |
          sum by (namespace, pod_name, container_name) (
            rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
          )
        record: namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
      - expr: |
          sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by (namespace)
        record: namespace:container_memory_usage_bytes:sum
      - expr: |
          sum by (namespace, label_name) (
             sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace, pod_name)
           * on (namespace, pod_name) group_left(label_name)
             label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
          )
        record: namespace_name:container_cpu_usage_seconds_total:sum_rate
      - expr: |
          sum by (namespace, label_name) (
            sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by (pod_name, namespace)
          * on (namespace, pod_name) group_left(label_name)
            label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
          )
        record: namespace_name:container_memory_usage_bytes:sum
      - expr: |
          sum by (namespace, label_name) (
            sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) by (namespace, pod)
          * on (namespace, pod) group_left(label_name)
            label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
          )
        record: namespace_name:kube_pod_container_resource_requests_memory_bytes:sum
      - expr: |
          sum by (namespace, label_name) (
            sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) by (namespace, pod)
          * on (namespace, pod) group_left(label_name)
            label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
          )
        record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
   - name: kube-scheduler.rules
     rules:
     - expr: |
          histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.99"
       record: cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.99"
       record: cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.99, sum(rate(scheduler_binding_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.99"
       record: cluster_quantile:scheduler_binding_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.9"
       record: cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.9"
       record: cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.9, sum(rate(scheduler_binding_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.9"
       record: cluster_quantile:scheduler_binding_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.5"
       record: cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.5"
       record: cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile
     - expr: |
          histogram_quantile(0.5, sum(rate(scheduler_binding_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.5"
       record: cluster_quantile:scheduler_binding_latency:histogram_quantile
   - name: kube-apiserver.rules
     rules:
     - expr: |
          histogram_quantile(0.99, sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.99"
       record: cluster_quantile:apiserver_request_latencies:histogram_quantile
     - expr: |
          histogram_quantile(0.9, sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.9"
       record: cluster_quantile:apiserver_request_latencies:histogram_quantile
     - expr: |
          histogram_quantile(0.5, sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) without(instance, pod)) / 1e+06
       labels:
         quantile:  "0.5"
       record: cluster_quantile:apiserver_request_latencies:histogram_quantile
   - name: node.rules
     rules:
     - expr: sum(min(kube_pod_info) by (node))
       record:  ':kube_pod_info_node_count:'
     - expr: |
          max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
       record:  'node_namespace_pod:kube_pod_info:'
     - expr: |
         count by (node) (sum by (node, cpu) (
           node_cpu_seconds_total{job= "node-exporter" }
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         ))
       record: node:node_num_cpu:sum
     - expr: |
         1  - avg(rate(node_cpu_seconds_total{job= "node-exporter" ,mode= "idle" }[1m]))
       record: :node_cpu_utilisation:avg1m
     - expr: |
         1  - avg by (node) (
           rate(node_cpu_seconds_total{job= "node-exporter" ,mode= "idle" }[1m])
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:)
       record: node:node_cpu_utilisation:avg1m
     - expr: |
         sum(node_load1{job= "node-exporter" })
         /
         sum(node:node_num_cpu:sum)
       record:  ':node_cpu_saturation_load1:'
     - expr: |
         sum by (node) (
           node_load1{job= "node-exporter" }
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
         /
         node:node_num_cpu:sum
       record:  'node:node_cpu_saturation_load1:'
     - expr: |
         1  -
         sum(node_memory_MemFree_bytes{job= "node-exporter" } + node_memory_Cached_bytes{job= "node-exporter" } + node_memory_Buffers_bytes{job= "node-exporter" })
         /
         sum(node_memory_MemTotal_bytes{job= "node-exporter" })
       record:  ':node_memory_utilisation:'
     - expr: |
         sum(node_memory_MemFree_bytes{job= "node-exporter" } + node_memory_Cached_bytes{job= "node-exporter" } + node_memory_Buffers_bytes{job= "node-exporter" })
       record: :node_memory_MemFreeCachedBuffers_bytes:sum
     - expr: |
         sum(node_memory_MemTotal_bytes{job= "node-exporter" })
       record: :node_memory_MemTotal_bytes:sum
     - expr: |
         sum by (node) (
           (node_memory_MemFree_bytes{job= "node-exporter" } + node_memory_Cached_bytes{job= "node-exporter" } + node_memory_Buffers_bytes{job= "node-exporter" })
           * on (namespace, pod) group_left(node)
             node_namespace_pod:kube_pod_info:
         )
       record: node:node_memory_bytes_available:sum
     - expr: |
         sum by (node) (
           node_memory_MemTotal_bytes{job= "node-exporter" }
           * on (namespace, pod) group_left(node)
             node_namespace_pod:kube_pod_info:
         )
       record: node:node_memory_bytes_total:sum
     - expr: |
         (node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)
         /
         scalar(sum(node:node_memory_bytes_total:sum))
       record: node:node_memory_utilisation:ratio
     - expr: |
         1e3 * sum(
           (rate(node_vmstat_pgpgin{job= "node-exporter" }[1m])
          + rate(node_vmstat_pgpgout{job= "node-exporter" }[1m]))
         )
       record: :node_memory_swap_io_bytes:sum_rate
     - expr: |
         1  -
         sum by (node) (
           (node_memory_MemFree_bytes{job= "node-exporter" } + node_memory_Cached_bytes{job= "node-exporter" } + node_memory_Buffers_bytes{job= "node-exporter" })
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
         /
         sum by (node) (
           node_memory_MemTotal_bytes{job= "node-exporter" }
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
       record:  'node:node_memory_utilisation:'
     - expr: |
         1  - (node:node_memory_bytes_available:sum / node:node_memory_bytes_total:sum)
       record:  'node:node_memory_utilisation_2:'
     - expr: |
         1e3 * sum by (node) (
           (rate(node_vmstat_pgpgin{job= "node-exporter" }[1m])
          + rate(node_vmstat_pgpgout{job= "node-exporter" }[1m]))
          * on (namespace, pod) group_left(node)
            node_namespace_pod:kube_pod_info:
         )
       record: node:node_memory_swap_io_bytes:sum_rate
     - expr: |
         avg(irate(node_disk_io_time_seconds_total{job= "node-exporter" ,device=~ "nvme.+|rbd.+|sd.+|vd.+|xvd.+" }[1m]))
       record: :node_disk_utilisation:avg_irate
     - expr: |
         avg by (node) (
           irate(node_disk_io_time_seconds_total{job= "node-exporter" ,device=~ "nvme.+|rbd.+|sd.+|vd.+|xvd.+" }[1m])
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
       record: node:node_disk_utilisation:avg_irate
     - expr: |
         avg(irate(node_disk_io_time_weighted_seconds_total{job= "node-exporter" ,device=~ "nvme.+|rbd.+|sd.+|vd.+|xvd.+" }[1m]) / 1e3)
       record: :node_disk_saturation:avg_irate
     - expr: |
         avg by (node) (
           irate(node_disk_io_time_weighted_seconds_total{job= "node-exporter" ,device=~ "nvme.+|rbd.+|sd.+|vd.+|xvd.+" }[1m]) / 1e3
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
       record: node:node_disk_saturation:avg_irate
     - expr: |
         max by (namespace, pod, device) ((node_filesystem_size_bytes{fstype=~ "ext[234]|btrfs|xfs|zfs" }
         - node_filesystem_avail_bytes{fstype=~ "ext[234]|btrfs|xfs|zfs" })
         / node_filesystem_size_bytes{fstype=~ "ext[234]|btrfs|xfs|zfs" })
       record:  'node:node_filesystem_usage:'
     - expr: |
         max by (namespace, pod, device) (node_filesystem_avail_bytes{fstype=~ "ext[234]|btrfs|xfs|zfs" } / node_filesystem_size_bytes{fstype=~ "ext[234]|btrfs|xfs|zfs" })
       record:  'node:node_filesystem_avail:'
     - expr: |
         sum(irate(node_network_receive_bytes_total{job= "node-exporter" ,device= "eth0" }[1m])) +
         sum(irate(node_network_transmit_bytes_total{job= "node-exporter" ,device= "eth0" }[1m]))
       record: :node_net_utilisation:sum_irate
     - expr: |
         sum by (node) (
           (irate(node_network_receive_bytes_total{job= "node-exporter" ,device= "eth0" }[1m]) +
           irate(node_network_transmit_bytes_total{job= "node-exporter" ,device= "eth0" }[1m]))
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
       record: node:node_net_utilisation:sum_irate
     - expr: |
         sum(irate(node_network_receive_drop_total{job= "node-exporter" ,device= "eth0" }[1m])) +
         sum(irate(node_network_transmit_drop_total{job= "node-exporter" ,device= "eth0" }[1m]))
       record: :node_net_saturation:sum_irate
     - expr: |
         sum by (node) (
           (irate(node_network_receive_drop_total{job= "node-exporter" ,device= "eth0" }[1m]) +
           irate(node_network_transmit_drop_total{job= "node-exporter" ,device= "eth0" }[1m]))
         * on (namespace, pod) group_left(node)
           node_namespace_pod:kube_pod_info:
         )
       record: node:node_net_saturation:sum_irate
   - name: kube-prometheus-node-recording.rules
     rules:
     - expr: sum(rate(node_cpu{mode!= "idle" ,mode!= "iowait" }[3m])) BY (instance)
       record: instance:node_cpu:rate:sum
     - expr: sum((node_filesystem_size{mountpoint= "/" } - node_filesystem_free{mountpoint= "/" }))
         BY (instance)
       record: instance:node_filesystem_usage:sum
     - expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
       record: instance:node_network_receive_bytes:rate:sum
     - expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
       record: instance:node_network_transmit_bytes:rate:sum
     - expr: sum(rate(node_cpu{mode!= "idle" ,mode!= "iowait" }[5m])) WITHOUT (cpu, mode)
         / ON(instance) GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
       record: instance:node_cpu:ratio
     - expr: sum(rate(node_cpu{mode!= "idle" ,mode!= "iowait" }[5m]))
       record: cluster:node_cpu:sum_rate5m
     - expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu))
       record: cluster:node_cpu:ratio
   - name: kubernetes-absent
     rules:
     - alert: AlertmanagerDown
       annotations:
         message: k8s-master- 10.80 . 154.143  Alertmanager has disappeared from Prometheus target discovery.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-alertmanagerdown
       expr: |
         absent(up{job= "alertmanager-main" } ==  1 )
       for : 1m
       labels:
         severity: critical
     - alert: KubeAPIDown
       annotations:
         message: k8s-master- 10.80 . 154.143  KubeAPI has disappeared from Prometheus target discovery.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown
       expr: |
         absent(up{job= "apiserver" } ==  1 )
       for : 1m
       labels:
         severity: critical
     - alert: KubeStateMetricsDown
       annotations:
         message: k8s-master- 10.80 . 154.143  KubeStateMetrics has disappeared from Prometheus target discovery.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricsdown
       expr: |
         absent(up{job= "kube-state-metrics" } ==  1 )
       for : 1m
       labels:
         severity: critical
     - alert: KubeletDown
       annotations:
         message: k8s-master- 10.80 . 154.143  Kubelet has disappeared from Prometheus target discovery.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
       expr: |
         absent(up{job= "kubelet" } ==  1 )
       for : 1m
       labels:
         severity: critical
     - alert: NodeExporterDown
       annotations:
         message: k8s-master- 10.80 . 154.143  NodeExporter has disappeared from Prometheus target discovery.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeexporterdown
       expr: |
         absent(up{job= "node-exporter" } ==  1 )
       for : 1m
       labels:
         severity: critical
     - alert: PrometheusDown
       annotations:
         message: k8s-master- 10.80 . 154.143  Prometheus has disappeared from Prometheus target discovery.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusdown
       expr: |
         absent(up{job= "prometheus-k8s" } ==  1 )
       for : 1m
       labels:
         severity: critical
   - name: kubernetes-apps
     rules:
     - alert: KubePodCrashLooping
       annotations:
         message: k8s-master- 10.80 . 154.143  Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
           }}) is restarting {{ printf  "%.2f"  $value }} times /  5  minutes.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
       expr: |
          rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0
       for : 1m
       labels:
         severity: critical
     - alert: KubePodNotReady
       annotations:
         message: k8s-master- 10.80 . 154.143  Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready
           state  for  longer than an hour.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready
       expr: |
         sum by (namespace, pod) (kube_pod_status_phase{job= "kube-state-metrics" , phase=~ "Pending|Unknown" }) >  0
       for : 1m
       labels:
         severity: critical
     - alert: KubeDeploymentGenerationMismatch
       annotations:
         message: k8s-master- 10.80 . 154.143  Deployment generation  for  {{ $labels.namespace }}/{{ $labels.deployment
           }} does not match,  this  indicates that the Deployment has failed but has
           not been rolled back.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch
       expr: |
         kube_deployment_status_observed_generation{job= "kube-state-metrics" }
           !=
         kube_deployment_metadata_generation{job= "kube-state-metrics" }
       for : 1m
       labels:
         severity: critical
     - alert: KubeDeploymentReplicasMismatch
       annotations:
         message: k8s-master- 10.80 . 154.143  Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not
           matched the expected number of replicas  for  longer than an hour.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch
       expr: |
         kube_deployment_spec_replicas{job= "kube-state-metrics" }
           !=
         kube_deployment_status_replicas_available{job= "kube-state-metrics" }
       for : 1m
       labels:
         severity: critical
     - alert: KubeStatefulSetReplicasMismatch
       annotations:
         message: k8s-master- 10.80 . 154.143  StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has
           not matched the expected number of replicas  for  longer than  15  minutes.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch
       expr: |
         kube_statefulset_status_replicas_ready{job= "kube-state-metrics" }
           !=
         kube_statefulset_status_replicas{job= "kube-state-metrics" }
       for : 1m
       labels:
         severity: critical
     - alert: KubeStatefulSetGenerationMismatch
       annotations:
         message: k8s-master- 10.80 . 154.143  StatefulSet generation  for  {{ $labels.namespace }}/{{ $labels.statefulset
           }} does not match,  this  indicates that the StatefulSet has failed but has
           not been rolled back.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch
       expr: |
         kube_statefulset_status_observed_generation{job= "kube-state-metrics" }
           !=
         kube_statefulset_metadata_generation{job= "kube-state-metrics" }
       for : 1m
       labels:
         severity: critical
     - alert: KubeStatefulSetUpdateNotRolledOut
       annotations:
         message: k8s-master- 10.80 . 154.143  StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update
           has not been rolled out.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout
       expr: |
         max without (revision) (
           kube_statefulset_status_current_revision{job= "kube-state-metrics" }
             unless
           kube_statefulset_status_update_revision{job= "kube-state-metrics" }
         )
           *
         (
           kube_statefulset_replicas{job= "kube-state-metrics" }
             !=
           kube_statefulset_status_replicas_updated{job= "kube-state-metrics" }
         )
       for : 1m
       labels:
         severity: critical
     - alert: KubeDaemonSetRolloutStuck
       annotations:
         message: k8s-master- 10.80 . 154.143  Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace
           }}/{{ $labels.daemonset }} are scheduled and ready.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck
       expr: |
         kube_daemonset_status_number_ready{job= "kube-state-metrics" }
           /
          kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100
       for : 1m
       labels:
         severity: critical
     - alert: KubeDaemonSetNotScheduled
       annotations:
         message: k8s-master- 10.80 . 154.143  '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
           }} are not scheduled.'
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled
       expr: |
         kube_daemonset_status_desired_number_scheduled{job= "kube-state-metrics" }
           -
         kube_daemonset_status_current_number_scheduled{job= "kube-state-metrics" } >  0
       for : 1m
       labels:
         severity: warning
     - alert: KubeDaemonSetMisScheduled
       annotations:
         message: k8s-master- 10.80 . 154.143  '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
           }} are running where they are not supposed to run.'
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled
       expr: |
         kube_daemonset_status_number_misscheduled{job= "kube-state-metrics" } >  0
       for : 1m
       labels:
         severity: warning
     - alert: KubeCronJobRunning
       annotations:
         message: k8s-master- 10.80 . 154.143  CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more
           than 1h to complete.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecronjobrunning
       expr: |
         time() - kube_cronjob_next_schedule_time{job= "kube-state-metrics" } >  3600
       for : 1m
       labels:
         severity: warning
     - alert: KubeJobCompletion
       annotations:
         message: k8s-master- 10.80 . 154.143  Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
           than one hour to complete.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion
       expr: |
         kube_job_spec_completions{job= "kube-state-metrics" } - kube_job_status_succeeded{job= "kube-state-metrics" }  >  0
       for : 1m
       labels:
         severity: warning
     - alert: KubeJobFailed
       annotations:
         message: k8s-master- 10.80 . 154.143  Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed
       expr: |
         kube_job_status_failed{job= "kube-state-metrics" }  >  0
       for : 1m
       labels:
         severity: warning
   - name: kubernetes-resources
     rules:
     - alert: KubeCPUOvercommit
       annotations:
         message: k8s-master- 10.80 . 154.143  'Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. '
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
       expr: |
         sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
           /
         sum(node:node_num_cpu:sum)
           >
         (count(node:node_num_cpu:sum)- 1 ) / count(node:node_num_cpu:sum)
       for : 1m
       labels:
         severity: info
 
 
 
 
      - alert: zxyKubeCPUOvercommit
        annotations:
          message: k8s-master-10.80.154.143 'Container CPU usage is above 100%; current value {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
          runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
        expr: |
          round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image != ""}[1m])) by (pod_name, container_name, namespace), "pod", "", "pod_name"), "container", "", "container_name")
            /
          ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace))
            >
          100
        for: 1m
        labels:
          severity: critical
 
 
      - alert: zxyKubeMemoryOvercommit
        annotations:
          message: k8s-master-10.80.154.143 'Container memory usage is above 100%; current value {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
          runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
        expr: |
          round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image != ""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name")
            /
          ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace))
            >
          100
        for: 1m
        labels:
          severity: critical
 
 
 
 
 
 
     - alert: KubeMemOvercommit
       annotations:
         message: k8s-master- 10.80 . 154.143  Cluster has overcommitted memory resource requests  for  Pods and cannot
           tolerate node failure.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
       expr: |
         sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
           /
         sum(node_memory_MemTotal_bytes)
           >
         (count(node:node_num_cpu:sum)- 1 )
           /
         count(node:node_num_cpu:sum)
       for : 1m
       labels:
         severity: warning
     - alert: KubeCPUOvercommit
       annotations:
         message: k8s-master- 10.80 . 154.143  Cluster has overcommitted CPU resource requests  for  Namespaces.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
       expr: |
         sum(kube_resourcequota{job= "kube-state-metrics" , type= "hard" , resource= "requests.cpu" })
           /
         sum(node:node_num_cpu:sum)
            > 1.5
       for : 1m
       labels:
         severity: warning
     - alert: KubeMemOvercommit
       annotations:
         message: k8s-master- 10.80 . 154.143  Cluster has overcommitted memory resource requests  for  Namespaces.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
       expr: |
         sum(kube_resourcequota{job= "kube-state-metrics" , type= "hard" , resource= "requests.memory" })
           /
         sum(node_memory_MemTotal_bytes{job= "node-exporter" })
            > 1.5
       for : 1m
       labels:
         severity: warning
     - alert: KubeQuotaExceeded
       annotations:
         message: k8s-master- 10.80 . 154.143  Namespace {{ $labels.namespace }} is using {{ printf  "%0.0f"  $value
           }}% of its {{ $labels.resource }} quota.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded
       expr: |
         100  * kube_resourcequota{job= "kube-state-metrics" , type= "used" }
           / ignoring(instance, job, type)
         (kube_resourcequota{job= "kube-state-metrics" , type= "hard" } >  0 )
            > 90
       for : 1m
       labels:
         severity: warning
     - alert: CPUThrottlingHigh
       annotations:
         message: k8s-master- 10.80 . 154.143  '{{ printf  "%0.0f"  $value }}% throttling of CPU in namespace {{ $labels.namespace
           }}  for  container {{ $labels.container_name }} in pod {{ $labels.pod_name
           }}.'
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
       expr: " 100  * sum(increase(container_cpu_cfs_throttled_periods_total{}[5m]))
         by (container_name, pod_name, namespace) \n  / \nsum(increase(container_cpu_cfs_periods_total{}[5m]))
         by (container_name, pod_name, namespace)\n  >  99  \n"
       for : 1m
       labels:
         severity: warning
   - name: kubernetes-storage
     rules:
     - alert: KubePersistentVolumeUsageCritical
       annotations:
         message: k8s-master- 10.80 . 154.143  The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
           }} in Namespace {{ $labels.namespace }} is only {{ printf  "%0.2f"  $value
           }}% free.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeusagecritical
       expr: |
         100  * kubelet_volume_stats_available_bytes{job= "kubelet" }
           /
         kubelet_volume_stats_capacity_bytes{job= "kubelet" }
            < 3
       for : 1m
       labels:
         severity: critical
     - alert: KubePersistentVolumeFullInFourDays
       annotations:
         message: k8s-master- 10.80 . 154.143  Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim
           }} in Namespace {{ $labels.namespace }} is expected to fill up within four
           days. Currently {{ printf  "%0.2f"  $value }}% is available.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefullinfourdays
       expr: |
         100  * (
           kubelet_volume_stats_available_bytes{job= "kubelet" }
             /
           kubelet_volume_stats_capacity_bytes{job= "kubelet" }
         ) <  15
         and
          predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
       for : 1m
       labels:
         severity: critical
     - alert: KubePersistentVolumeErrors
       annotations:
         message: k8s-master- 10.80 . 154.143  The persistent volume {{ $labels.persistentvolume }} has status {{
           $labels.phase }}.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors
       expr: |
         kube_persistentvolume_status_phase{phase=~ "Failed|Pending" ,job= "kube-state-metrics" } >  0
       for : 1m
       labels:
         severity: critical
   - name: kubernetes-system
     rules:
     - alert: KubeNodeNotReady
       annotations:
         message: k8s-master- 10.80 . 154.143  '{{ $labels.node }} has been unready for more than an hour.'
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready
       expr: |
         kube_node_status_condition{job= "kube-state-metrics" ,condition= "Ready" ,status= "true" } ==  0
       for : 1m
       labels:
         severity: warning
     - alert: KubeVersionMismatch
       annotations:
         message: k8s-master- 10.80 . 154.143  There are {{ $value }} different versions of Kubernetes components
           running.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch
       expr: |
         count(count(kubernetes_build_info{job!= "kube-dns" }) by (gitVersion)) >  1
       for : 1m
       labels:
         severity: warning
     - alert: KubeClientErrors
       annotations:
         message: k8s-master- 10.80 . 154.143  Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance
           }} ' is experiencing {{ printf "%0.0f" $value }}% errors.'
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors
       expr: |
         (sum(rate(rest_client_requests_total{code!~ "2..|404" }[5m])) by (instance, job)
           /
         sum(rate(rest_client_requests_total[5m])) by (instance, job))
          * 100 > 1
       for : 1m
       labels:
         severity: warning
     - alert: KubeClientErrors
       annotations:
         message: k8s-master- 10.80 . 154.143  Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance
           }}' is experiencing {{ printf  "%0.0f"  $value }} errors / second.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors
       expr: |
         sum(rate(ksm_scrape_error_total{job= "kube-state-metrics" }[5m])) by (instance, job) >  0.1
       for : 1m
       labels:
         severity: warning
     - alert: KubeletTooManyPods
       annotations:
         message: k8s-master- 10.80 . 154.143  Kubelet {{ $labels.instance }} is running {{ $value }} Pods, close
           to the limit of  110 .
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods
       expr: |
          kubelet_running_pod_count{job="kubelet"} > 110 * 0.9
       for : 1m
       labels:
         severity: warning
     - alert: KubeAPILatencyHigh
       annotations:
         message: k8s-master- 10.80 . 154.143  The API server has a 99th percentile latency of {{ $value }} seconds
           for  {{ $labels.verb }} {{ $labels.resource }}.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh
       expr: |
         cluster_quantile:apiserver_request_latencies:histogram_quantile{job= "apiserver" ,quantile= "0.99" ,subresource!= "log" ,verb!~ "^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$" } >  1
       for : 1m
       labels:
         severity: warning
     - alert: KubeAPILatencyHigh
       annotations:
         message: k8s-master- 10.80 . 154.143  The API server has a 99th percentile latency of {{ $value }} seconds
           for  {{ $labels.verb }} {{ $labels.resource }}.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh
       expr: |
         cluster_quantile:apiserver_request_latencies:histogram_quantile{job= "apiserver" ,quantile= "0.99" ,subresource!= "log" ,verb!~ "^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$" } >  4
       for : 1m
       labels:
         severity: critical
     - alert: KubeAPIErrorsHigh
       annotations:
         message: k8s-master- 10.80 . 154.143  API server is returning errors  for  {{ $value }}% of requests.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh
       expr: |
         sum(rate(apiserver_request_count{job= "apiserver" ,code=~ "^(?:5..)$" }[5m])) without(instance, pod)
           /
          sum(rate(apiserver_request_count{job="apiserver"}[5m])) without(instance, pod) * 100 > 10
       for : 1m
       labels:
         severity: critical
     - alert: KubeAPIErrorsHigh
       annotations:
         message: k8s-master- 10.80 . 154.143  API server is returning errors  for  {{ $value }}% of requests.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh
       expr: |
         sum(rate(apiserver_request_count{job= "apiserver" ,code=~ "^(?:5..)$" }[5m])) without(instance, pod)
           /
          sum(rate(apiserver_request_count{job="apiserver"}[5m])) without(instance, pod) * 100 > 5
       for : 1m
       labels:
         severity: warning
     - alert: KubeClientCertificateExpiration
       annotations:
         message: k8s-master- 10.80 . 154.143  Kubernetes API certificate is expiring in less than  7  days.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration
       expr: |
         histogram_quantile( 0.01 , sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job= "apiserver" }[5m]))) <  604800
       labels:
         severity: warning
     - alert: KubeClientCertificateExpiration
       annotations:
         message: k8s-master- 10.80 . 154.143  Kubernetes API certificate is expiring in less than  24  hours.
         runbook_url: https: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration
       expr: |
         histogram_quantile( 0.01 , sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job= "apiserver" }[5m]))) <  86400
       labels:
         severity: critical
   - name: alertmanager.rules
     rules:
     - alert: AlertmanagerConfigInconsistent
       annotations:
         message: k8s-master- 10.80 . 154.143  The configuration of the instances of the Alertmanager cluster `{{$labels.service}}`
           are out of sync.
       expr: |
          count_values("config_hash", alertmanager_config_hash{job="alertmanager-main"}) BY (service) / ON(service) GROUP_LEFT() label_replace(prometheus_operator_spec_replicas{job="prometheus-operator",controller="alertmanager"}, "service", "alertmanager-$1", "name", "(.*)") != 1
       for : 1m
       labels:
         severity: critical
     - alert: AlertmanagerFailedReload
       annotations:
         message: k8s-master- 10.80 . 154.143  Reloading Alertmanager's configuration has failed  for  {{ $labels.namespace
           }}/{{ $labels.pod}}.
       expr: |
         alertmanager_config_last_reload_successful{job= "alertmanager-main" } ==  0
       for : 1m
       labels:
         severity: warning
     - alert: AlertmanagerMembersInconsistent
       annotations:
         message: k8s-master- 10.80 . 154.143  Alertmanager has not found all other members of the cluster.
       expr: |
         alertmanager_cluster_members{job= "alertmanager-main" }
           != on (service) GROUP_LEFT()
         count by (service) (alertmanager_cluster_members{job= "alertmanager-main" })
       for : 1m
       labels:
         severity: critical
   - name: general.rules
     rules:
     - alert: TargetDown
       annotations:
         message: k8s-master- 10.80 . 154.143  '{{ $value }}% of the {{ $labels.job }} targets are down.'
       expr:  100  * (count(up ==  0 ) BY (job) / count(up) BY (job)) >  10
       for : 1m
       labels:
         severity: warning
   - name: kube-prometheus-node-alerting.rules
     rules:
     - alert: NodeDiskRunningFull
       annotations:
         message: k8s-master- 10.80 . 154.143  Device {{ $labels.device }} of node-exporter {{ $labels.namespace
           }}/{{ $labels.pod }} will be full within the next  24  hours.
       expr: |
          (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
       for : 1m
       labels:
         severity: warning
     - alert: NodeDiskRunningFull
       annotations:
         message: k8s-master- 10.80 . 154.143  Device {{ $labels.device }} of node-exporter {{ $labels.namespace
           }}/{{ $labels.pod }} will be full within the next  2  hours.
       expr: |
          (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
       for : 1m
       labels:
         severity: critical
   - name: prometheus.rules
     rules:
     - alert: PrometheusConfigReloadFailed
       annotations:
         description: Reloading Prometheus' configuration has failed  for  {{$labels.namespace}}/{{$labels.pod}}
         summary: Reloading Prometheus' configuration failed
       expr: |
         prometheus_config_last_reload_successful{job= "prometheus-k8s" } ==  0
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusNotificationQueueRunningFull
       annotations:
         description: Prometheus' alert notification queue is running full  for  {{$labels.namespace}}/{{
           $labels.pod}}
         summary: Prometheus' alert notification queue is running full
       expr: |
          predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s"}[5m], 60 * 30) > prometheus_notifications_queue_capacity{job="prometheus-k8s"}
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusErrorSendingAlerts
       annotations:
         description: Errors  while  sending alerts from Prometheus {{$labels.namespace}}/{{
           $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
         summary: Errors  while  sending alert from Prometheus
       expr: |
         rate(prometheus_notifications_errors_total{job= "prometheus-k8s" }[5m]) / rate(prometheus_notifications_sent_total{job= "prometheus-k8s" }[5m]) >  0.01
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusErrorSendingAlerts
       annotations:
         description: Errors  while  sending alerts from Prometheus {{$labels.namespace}}/{{
           $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
         summary: Errors  while  sending alerts from Prometheus
       expr: |
         rate(prometheus_notifications_errors_total{job= "prometheus-k8s" }[5m]) / rate(prometheus_notifications_sent_total{job= "prometheus-k8s" }[5m]) >  0.03
       for : 1m
       labels:
         severity: critical
     - alert: PrometheusNotConnectedToAlertmanagers
       annotations:
         description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected
           to any Alertmanagers
         summary: Prometheus is not connected to any Alertmanagers
       expr: |
         prometheus_notifications_alertmanagers_discovered{job= "prometheus-k8s" } <  1
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusTSDBReloadsFailing
       annotations:
         description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
           reload failures over the last four hours.'
         summary: Prometheus has issues reloading data blocks from disk
       expr: |
         increase(prometheus_tsdb_reloads_failures_total{job= "prometheus-k8s" }[2h]) >  0
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusTSDBCompactionsFailing
       annotations:
         description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
           compaction failures over the last four hours.'
         summary: Prometheus has issues compacting sample blocks
       expr: |
         increase(prometheus_tsdb_compactions_failed_total{job= "prometheus-k8s" }[2h]) >  0
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusTSDBWALCorruptions
       annotations:
         description: '{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead
           log (WAL).'
         summary: Prometheus write-ahead log is corrupted
       expr: |
         tsdb_wal_corruptions_total{job= "prometheus-k8s" } >  0
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusNotIngestingSamples
       annotations:
         description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting
           samples.
         summary: Prometheus isn't ingesting samples
       expr: |
         rate(prometheus_tsdb_head_samples_appended_total{job= "prometheus-k8s" }[5m]) <=  0
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusTargetScrapesDuplicate
       annotations:
         description: '{{$labels.namespace}}/{{$labels.pod}} has many samples rejected
           due to duplicate timestamps but different values'
         summary: Prometheus has many samples rejected
       expr: |
         increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{job= "prometheus-k8s" }[5m]) >  0
       for : 1m
       labels:
         severity: warning
   - name: prometheus-operator
     rules:
     - alert: PrometheusOperatorReconcileErrors
       annotations:
         message: k8s-master- 10.80 . 154.143  Errors  while  reconciling {{ $labels.controller }} in {{ $labels.namespace
           }} Namespace.
       expr: |
         rate(prometheus_operator_reconcile_errors_total{job= "prometheus-operator" }[5m]) >  0.1
       for : 1m
       labels:
         severity: warning
     - alert: PrometheusOperatorNodeLookupErrors
       annotations:
         message: k8s-master- 10.80 . 154.143  Errors  while  reconciling Prometheus in {{ $labels.namespace }} Namespace.
       expr: |
         rate(prometheus_operator_node_address_lookup_errors_total{job= "prometheus-operator" }[5m]) >  0.1
       for : 1m
       labels:
         severity: warning

for: 1m means the alert's expression must stay true for 1 minute before the alert starts firing; with for: 1h it would have to persist for a full hour before firing. How often the email for a still-firing alert is re-sent is governed by repeat_interval in alertmanager.yaml (12h in the configuration above).
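
A minimal sketch of where the two settings live (the alert below is illustrative and not part of the shipped rules):

# prometheus-rules.yaml: the expression must stay true for the whole 'for' window
- alert: ExampleAlert
  expr: up{job="node-exporter"} == 0
  for: 1m              # pending for 1 minute, then firing
  labels:
    severity: warning

# alertmanager.yaml: how often a still-firing alert is re-notified
route:
  repeat_interval: 12h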

 

The following two alert rules were added deliberately so that an email notification goes out when a container's memory or CPU usage reaches 85% of its limit.

- alert: zxyKubeCPUOvercommit
   annotations:
      message: 'Container CPU usage is above 85%; current value {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
   expr: |
      round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image != ""}[1m])) by (pod_name, container_name, namespace), "pod", "", "pod_name"), "container", "", "container_name")
        /
      ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace))
        >
      85
   for: 1m
   labels:
      severity: critical
- alert: zxyKubeMemoryOvercommit
   annotations:
      message: 'Container memory usage is above 85%; current value {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
   expr: |
      round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image != ""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name")
        /
      ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace))
        >
      85
   for: 1m
   labels:
      severity: critical

You can also define alerts of your own; the expressions Grafana uses for its dashboards are a useful reference, for example the queries below (a sketch of wrapping one of them into its own PrometheusRule follows the list).

sum(label_replace(container_memory_usage_bytes{namespace="$namespace", pod_name="$pod", container_name!="POD", container_name!=""}, "container", "$1", "container_name", "(.*)")) by (container)
sum(kube_pod_container_resource_requests_memory_bytes{namespace="$namespace", pod="$pod"}) by (container)
sum(label_replace(container_memory_usage_bytes{namespace="$namespace", pod_name="$pod"}, "container", "$1", "container_name", "(.*)")) by (container) / sum(kube_pod_container_resource_requests_memory_bytes{namespace="$namespace", pod="$pod"}) by (container)
sum(kube_pod_container_resource_limits_memory_bytes{namespace="$namespace", pod="$pod", container!=""}) by (container)
 
sum(label_replace(container_memory_usage_bytes{namespace="$namespace", pod_name="$pod", container_name!=""}, "container", "$1", "container_name", "(.*)")) by (container) / sum(kube_pod_container_resource_limits_memory_bytes{namespace="$namespace", pod="$pod"}) by (container)
 
round( 100  * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name !=  "POD" , image != "" }[1m])) by (pod_name, container_name, namespace) ,  "pod" "" "pod_name" ),  "container" "" "container_name" ) / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace)) >  75
round( 100  * label_join(label_join(sum(container_memory_usage_bytes{container_name !=  "POD" , image != "" }) by (container_name, pod_name, namespace),  "pod" "" "pod_name" ),  "container" "" "container_name" ) / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) >  75
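Any of these expressions can also be tested outside Grafana by querying Prometheus' HTTP API directly. A quick sketch against the prometheus-k8s NodePort from step 6 (replace the node IP and the $-placeholders with real values):

# instant query via the Prometheus HTTP API
curl -s 'http://120.27.159.108:30001/api/v1/query' \
  --data-urlencode 'query=sum(kube_pod_container_resource_limits_memory_bytes{namespace="monitoring"}) by (pod, container)'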

 

 

 

Download URL:

https://zxytest.zhixueyun.com/installer/prometheus-operator.zip

 

4. Startup commands

kubectl create namespace monitoring
kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=/app/prometheus-operator/alertmanager.yaml -n monitoring

#Prefix every message so the alert shows which environment it came from, e.g. zxy9.zhixueyun.com

sed -i 's/message: /message: zxy9.zhixueyun.com /g' /app/prometheus-operator/prometheus-rules.yaml

#Note: apply bundle.yaml first, otherwise the services under manifests will fail to start

kubectl create -f /app/prometheus-operator/bundle.yaml

kubectl create -f /app/prometheus-operator/manifests
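After the manifests are applied, it is worth verifying that everything in the monitoring namespace comes up and that the secret really contains the edited Alertmanager config (a sketch; pod names will differ per cluster):

kubectl get pods -n monitoring
# decode the config that Alertmanager will actually load
kubectl get secret alertmanager-main -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d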

 

5. Delete commands

kubectl delete secret alertmanager-main -n monitoring

kubectl delete -f /app/prometheus-operator/manifests
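To tear the stack down completely rather than just the manifests, the operator and the namespace can be removed as well (a sketch, assuming nothing else runs in the monitoring namespace):

kubectl delete -f /app/prometheus-operator/bundle.yaml
kubectl delete namespace monitoring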

 

6. Testing

[root@iZbp1at8fph52evh70atb1Z manifests]# kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       NodePort    10.254.71.140   <none>        9093:30093/TCP      6m55s
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,6783/TCP   6m51s
grafana                 NodePort    10.254.83.196   <none>        3000:30000/TCP      6m55s
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP   6m55s
node-exporter           ClusterIP   None            <none>        9100/TCP            6m55s
prometheus-adapter      ClusterIP   10.254.92.97    <none>        443/TCP             6m55s
prometheus-k8s          NodePort    10.254.148.92   <none>        9090:30001/TCP      6m55s
prometheus-operated     ClusterIP   None            <none>        9090/TCP            6m44s
prometheus-operator     ClusterIP   None            <none>        8080/TCP            7h48m

 

Grafana URL: http://120.27.159.108:30000

prometheus-k8s URL: http://120.27.159.108:30001

The Alerts tab in Prometheus shows the current alerting rules: red rules are currently firing, green rules are healthy. prometheus-operator creates a set of default alerting rules automatically.

 

Alert e-mail:

 

To suppress alerts, open Alertmanager at http://120.27.159.108:30093 and click Silence to define what should be silenced.
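Silences can also be created from the command line with amtool, which ships with Alertmanager; a sketch using the NodePort above and one of the custom alerts as an example:

amtool silence add alertname="zxyKubeCPUOvercommit" \
  --alertmanager.url=http://120.27.159.108:30093 \
  --author="lihaichun" --comment="planned maintenance" --duration="2h"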

 

7. Alert analysis; open http://120.27.159.108:30001/alerts

alert: KubeCPUOvercommit

 

sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum) / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)

The value of this expression is the total CPU cores requested across all namespaces divided by the total CPU cores of the Kubernetes nodes; the right-hand side (count - 1) / count is the share of capacity that remains if one node is lost, so the alert fires when the requested CPU could no longer be rescheduled after a single node failure.

 

The value of sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum) is the total CPU requests summed over all namespaces.

 

namespace_name:kube_pod_container_resource_requests_cpu_cores:sum is the total CPU requests per namespace.

 

kube_pod_container_resource_requests_cpu_cores is the CPU request of each pod container.

 

 

node:node_num_cpu:sum is the total number of CPU cores of each Kubernetes node.

Open http://120.27.159.108:30001/graph and enter kube_pod_container_resource_limits_memory_bytes to query the memory limit of every pod.

 

8. Disk space alert configuration: alert when usage exceeds 85%

- expr: |
     max by (namespace, pod, device) ((node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}
     - node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
     / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
   record: 'node:node_filesystem_usage:'
- expr: |
     max by (namespace, pod, device) (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
   record: 'node:node_filesystem_avail:'
 
- alert: NodeDiskRunningFull
   annotations:
     message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace
       }}/{{ $labels.pod }} will be full within the next 24 hours.
   expr: |
     (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
   for: 30m
   labels:
     severity: warning
- alert: NodeDiskRunningFull
   annotations:
     message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace
       }}/{{ $labels.pod }} will be full within the next 2 hours.
   expr: |
     (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
   for: 10m
   labels:
     severity: critical
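Once these rules are loaded, the recording rule can be sanity-checked from the graph page or the HTTP API; a sketch that lists the five fullest filesystems:

topk(5, node:node_filesystem_usage:)

curl -s 'http://120.27.159.108:30001/api/v1/query' --data-urlencode 'query=topk(5, node:node_filesystem_usage:)'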

 

9. If node-exporter fails to start with the following error

[root@iZbp14qk2dtp82q129jrzqZ manifests]# kubectl logs node-exporter-9kg72 -n monitoring -c kube-rbac-proxy
I0308 06:29:35.477100   19438 main.go:209] Generating self signed cert as no cert is provided
log: exiting because of error: log: cannot create log: open /tmp/kube-rbac-proxy.iZbp1hkg813np4ep5cuakvZ.unknownuser.log.INFO.20190308-062935.19438: permission denied

then modify node-exporter-daemonset.yaml so that the pod runs as root:

runAsNonRoot: false
runAsUser: 0
apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
   labels:
     app: node-exporter
   name: node-exporter
   namespace: monitoring
spec:
   selector:
     matchLabels:
       app: node-exporter
   template:
     metadata:
       labels:
         app: node-exporter
     spec:
       containers:
       - args:
          - --web.listen-address=127.0.0.1:9100
         - --path.procfs=/host/proc
         - --path.sysfs=/host/sys
         - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
         - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
          image: quay.io/prometheus/node-exporter:v0.16.0
         name: node-exporter
         resources:
           limits:
             cpu: 250m
             memory: 180Mi
           requests:
             cpu: 102m
             memory: 180Mi
         volumeMounts:
         - mountPath: /host/proc
           name: proc
           readOnly:  false
         - mountPath: /host/sys
           name: sys
           readOnly:  false
         - mountPath: /host/root
           mountPropagation: HostToContainer
           name: root
           readOnly:  true
       - args:
          - --secure-listen-address=$(IP):9100
          - --upstream=http://127.0.0.1:9100/
         env:
         - name: IP
           valueFrom:
             fieldRef:
               fieldPath: status.podIP
          image: quay.io/coreos/kube-rbac-proxy:v0.4.0
         name: kube-rbac-proxy
         ports:
         - containerPort:  9100
           hostPort:  9100
           name: https
         resources:
           limits:
             cpu: 20m
             memory: 40Mi
           requests:
             cpu: 10m
             memory: 20Mi
       hostNetwork:  true
       hostPID:  true
       nodeSelector:
         beta.kubernetes.io/os: linux
       securityContext:
         runAsNonRoot:  false
         runAsUser:  0
       serviceAccountName: node-exporter
       tolerations:
       - effect: NoSchedule
         key: node-role.kubernetes.io/master
       volumes:
       - hostPath:
           path: /proc
         name: proc
       - hostPath:
           path: /sys
         name: sys
       - hostPath:
           path: /
         name: root
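After editing, roll the change out and check that the DaemonSet pods restart cleanly (a sketch; the path assumes the file lives under the manifests directory from step 4):

kubectl apply -f /app/prometheus-operator/manifests/node-exporter-daemonset.yaml
kubectl rollout status daemonset/node-exporter -n monitoring
kubectl logs -l app=node-exporter -n monitoring -c kube-rbac-proxy --tail=5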

 

10. It is best to pin both alertmanager-alertmanager.yaml and prometheus-prometheus.yaml to the master node via nodeName: k8s_master_ip

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
   labels:
     prometheus: k8s
   name: k8s
   namespace: monitoring
spec:
   alerting:
     alertmanagers:
     - name: alertmanager-main
       namespace: monitoring
       port: web
   baseImage: quay.io/prometheus/prometheus
   nodeName: 10.80.154.143
   #nodeSelector:
     #beta.kubernetes.io/os: linux
   replicas:  2
   resources:
     requests:
       memory: 600Mi
   ruleSelector:
     matchLabels:
       prometheus: k8s
       role: alert-rules
   securityContext:
     fsGroup:  2000
     runAsNonRoot:  true
     runAsUser:  1000
   serviceAccountName: prometheus-k8s
   serviceMonitorNamespaceSelector: {}
   serviceMonitorSelector: {}
   version: v2.5.0
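If the operator version in use does not honour nodeName in the Prometheus spec, the same pinning can be done with the nodeSelector field that is commented out above, using the master's hostname label (a sketch; check the actual label value with kubectl get nodes --show-labels):

   nodeSelector:
     kubernetes.io/hostname: 10.80.154.143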

11. Note that kind: ReplicationController should be changed to kind: Deployment: with a ReplicationController, container_memory_usage_bytes{container_name!="POD",image!=""} does not report the pod's memory usage accurately, while with a Deployment it does.

For example, when the kind is ReplicationController, the memory usage reported by kubectl top po for the pod does not match the value observed by Prometheus.
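A quick way to see the discrepancy is to compare the cAdvisor usage metric with the working-set metric that kubectl top is based on; a sketch query, with the pod name as a placeholder:

container_memory_usage_bytes{pod_name="<pod>", container_name!="POD", image!=""}
container_memory_working_set_bytes{pod_name="<pod>", container_name!="POD", image!=""}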

 

Reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus

