k8s監控終極解決方案(kube-prometheus安裝及配置-包含定制化配置)


 

>>> 目錄 <<< 


一、概述
二、結構分析
三、Prometheus配置文件修改
四、添加外部監控
五、Scheduler和Controller配置
六、Alertmanager配置
七、監控數據持久化
八、Grafana儀表板配置
九、匯總

十、定制化配置


 一、概述

 

首先Prometheus整體監控結構略微復雜,一個個部署並不簡單。另外監控Kubernetes就需要訪問內部數據,必定需要進行認證、鑒權、准入控制,

那么這一整套下來將變得難上加難,而且還需要花費一定的時間,如果你沒有特別高的要求,還是建議選用開源比較好的一些方案。

在k8s初期使用 Heapster+cAdvisor方式監控,這是Prometheus Operator出現之前的k8s監控方案。后來出現了Prometheus Operator,但是目前Prometheus Operator已經不包含完整功能,完整的解決方案已經變為kube-prometheus。項目地址為: https://github.com/coreos/kube-prometheus

關於這個kube-prometheus目前應該是開源最好的方案了,該存儲庫收集Kubernetes清單,Grafana儀表板和Prometheus規則,以及文檔和腳本,以使用Prometheus Operator 通過Prometheus提供易於操作的端到端Kubernetes集群監視。以容器的方式部署到k8s集群,而且還可以自定義配置,非常的方便。

 

上圖是Prometheus-Operator官方提供的架構圖,其中Operator是最核心的部分,作為一個控制器,他會去創建PrometheusServiceMonitorAlertManager以及PrometheusRule4個CRD資源對象,然后會一直監控並維持這4個資源對象的狀態。

其中創建的prometheus這種資源對象就是作為Prometheus Server存在,而ServiceMonitor就是exporter的各種抽象,是用來提供專門提供metrics數據接口的工具,Prometheus就是通過ServiceMonitor提供的metrics數據接口去 pull 數據的,當然alertmanager這種資源對象就是對應的AlertManager的抽象,而PrometheusRule是用來被Prometheus實例使用的報警規則文件。

這樣我們要在集群中監控什么數據,就變成了直接去操作 Kubernetes 集群的資源對象了,是不是方便很多了。上圖中的 Service 和 ServiceMonitor 都是 Kubernetes 的資源,一個 ServiceMonitor 可以通過 labelSelector 的方式去匹配一類 Service,Prometheus 也可以通過 labelSelector 去匹配多個ServiceMonitor。

下面是組件的具體介紹:

  • Operator: Operator 資源會根據自定義資源(Custom Resource Definition / CRDs)來部署和管理 Prometheus Server,同時監控這些自定義資源事件的變化來做相應的處理,是整個系統的控制中心。
  • Prometheus: Prometheus 資源是聲明性地描述 Prometheus 部署的期望狀態。
  • Prometheus Server: Operator 根據自定義資源 Prometheus 類型中定義的內容而部署的 Prometheus Server 集群,這些自定義資源可以看作是用來管理 Prometheus Server 集群的 StatefulSets 資源。
  • ServiceMonitor: ServiceMonitor 也是一個自定義資源,它描述了一組被 Prometheus 監控的 targets 列表。該資源通過 Labels 來選取對應的 Service Endpoint,讓 Prometheus Server 通過選取的 Service 來獲取 Metrics 信息。
  • Service: Service 資源主要用來對應 Kubernetes 集群中的 Metrics Server Pod,來提供給 ServiceMonitor 選取讓 Prometheus Server 來獲取信息。簡單的說就是 Prometheus 監控的對象,例如 Node Exporter Service、Mysql Exporter Service 等等。
  • Alertmanager: Alertmanager 也是一個自定義資源類型,由 Operator 根據資源描述內容來部署 Alertmanager 集群。

kube-prometheus包含的組件:

  • Prometheus Operator
  • 高可用的 Prometheus    默認會部署2個pod 
  • 高可用的 Alertmanager
  • Prometheus node-exporter
  • Prometheus Adapter for Kubernetes Metrics APIs
  • kube-state-metrics
  • Grafana

二、結構分析

kube-prometheus相關部署文件在manifests目錄中,共65個yaml,其中setup文件夾中包含所有自定義資源配置CustomResourceDefinition(一般不用修改,也不要輕易修改),所以部署時必須先執行這個文件夾。

其中包括告警(Alertmanager)、監控(Prometheus)、監控項(PrometheusRule)這三類資源定義,所以如果你想直接在k8s中修改對應控制器配置是沒有用的(比如kubectl edit sts prometheus-k8s -n monitoring) 。

這里yaml文件看着很多,只要我們梳理一下就會很容易理解了,首先分為7個組件prometheus-operator、prometheus-adapter、prometheus、alertmanager、grafana、kube-state-metrics、node-exporter,

然后每個組件都會定義控制器、配置文件、集群權限、訪問配置、監控配置, 但是我們一般只需要進行自定義告警配置和監控項,這樣一篩選發現只需要修改幾個文件即可。

[root@ymt108 manifests]# tree
.
├── alertmanager-alertmanager.yaml
├── alertmanager-secret.yaml    # 告警配置
├── alertmanager-serviceAccount.yaml
├── alertmanager-serviceMonitor.yaml
├── alertmanager-service.yaml
├── grafana-dashboardDatasources.yaml
├── grafana-dashboardDefinitions.yaml
├── grafana-dashboardSources.yaml
├── grafana-deployment.yaml
├── grafana-serviceAccount.yaml
├── grafana-serviceMonitor.yaml
├── grafana-service.yaml
├── kube-state-metrics-clusterRoleBinding.yaml
├── kube-state-metrics-clusterRole.yaml
├── kube-state-metrics-deployment.yaml
├── kube-state-metrics-roleBinding.yaml
├── kube-state-metrics-role.yaml
├── kube-state-metrics-serviceAccount.yaml
├── kube-state-metrics-serviceMonitor.yaml
├── kube-state-metrics-service.yaml
├── node-exporter-clusterRoleBinding.yaml
├── node-exporter-clusterRole.yaml
├── node-exporter-daemonset.yaml
├── node-exporter-serviceAccount.yaml
├── node-exporter-serviceMonitor.yaml
├── node-exporter-service.yaml
├── prometheus-adapter-apiService.yaml
├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
├── prometheus-adapter-clusterRoleBindingDelegator.yaml
├── prometheus-adapter-clusterRoleBinding.yaml
├── prometheus-adapter-clusterRoleServerResources.yaml
├── prometheus-adapter-clusterRole.yaml
├── prometheus-adapter-configMap.yaml
├── prometheus-adapter-deployment.yaml
├── prometheus-adapter-roleBindingAuthReader.yaml
├── prometheus-adapter-serviceAccount.yaml
├── prometheus-adapter-service.yaml
├── prometheus-clusterRoleBinding.yaml
├── prometheus-clusterRole.yaml
├── prometheus-operator-serviceMonitor.yaml
├── prometheus-prometheus.yaml  # 監控配置
├── prometheus-roleBindingConfig.yaml
├── prometheus-roleBindingSpecificNamespaces.yaml
├── prometheus-roleConfig.yaml
├── prometheus-roleSpecificNamespaces.yaml
├── prometheus-rules.yaml  # 默認監控項
├── prometheus-serviceAccount.yaml
├── prometheus-serviceMonitorApiserver.yaml
├── prometheus-serviceMonitorCoreDNS.yaml
├── prometheus-serviceMonitorKubeControllerManager.yaml
├── prometheus-serviceMonitorKubelet.yaml
├── prometheus-serviceMonitorKubeScheduler.yaml
├── prometheus-serviceMonitor.yaml
├── prometheus-service.yaml
└── setup
    ├── 0namespace-namespace.yaml
    ├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
    ├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml
    ├── prometheus-operator-0prometheusCustomResourceDefinition.yaml
    ├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
    ├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
    ├── prometheus-operator-clusterRoleBinding.yaml
    ├── prometheus-operator-clusterRole.yaml
    ├── prometheus-operator-deployment.yaml
    ├── prometheus-operator-serviceAccount.yaml
    └── prometheus-operator-service.yaml

1 directories, 65 files

 


三、Prometheus配置文件修改

為了保留原始文件,建議先備份一份kube-prometheus原文件。

1、修改prometheus-prometheus.yaml,以下為修改后並經生產環境驗證過后的配置文件,根據自己的k8s集群環境進行修改

1)replicas:根據項目情況調整副本數

2)retention:修改Prometheus數據保留期限,默認值為“24h”,並且必須與正則表達式“ [0-9] +(ms | s | m | h | d | w | y)”匹配。

3)additionalScrapeConfigs:增加額外監控項配置,具體配置查看第五部分“添加k8s外部監控”。 

4)storage:volumeClaimTemplate:   prometheus數據持久化配置,避免prometheus重啟造成數據丟失,此處使用阿里雲Nas創建的 StorageClass,創建StorageClass相關文章https://www.cnblogs.com/lidong94/p/14518362.html

5)nodeSelector && tolerations :節點選擇器和污點,因為prometheus查詢數據量大會導致k8s節點內存升高,如果所在節點上部署着線上服務會受到影響,故獨立將prometheus放在特定的節點中(如果不需要可以刪除)

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.20.0
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  retention: 30d #修改Prometheus數據保留期限,默認值為24h,此處為保留30d,
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: alicloud-nas
        resources:
          requests:
            storage: 100Gi
  nodeSelector:
    zone: problem
  tolerations:
  - effect: NoSchedule
    key: forever
    operator: Equal
    value: problem
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 1 #根據項目情況調整副本數,默認是2個 
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.20.0

 


2、PrometheusRule配置文件修改

首先查看默認監控項配置prometheus-rules.yaml,其中包括76個告警項,基本覆蓋了k8s常用監控點。

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter.rules
    rules:
    - expr: |
        count without (cpu) (
          count without (mode) (
            node_cpu_seconds_total{job="node-exporter"}
          )
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu, mode) (
          rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
        )
      record: instance:node_cpu_utilisation:rate1m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          node_memory_MemAvailable_bytes{job="node-exporter"}
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[1m])
      record: instance:node_vmstat_pgmajfault:rate1m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+"}[1m])
      record: instance_device:node_disk_io_time_seconds:rate1m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+"}[1m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate1m
  - name: kube-apiserver.rules
    rules:
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1d]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1d]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[1d]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[1d]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1d]))
      labels:
        verb: read
      record: apiserver_request:burnrate1d
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1h]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1h]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[1h]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[1h]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1h]))
      labels:
        verb: read
      record: apiserver_request:burnrate1h
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[2h]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[2h]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[2h]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[2h]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[2h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[2h]))
      labels:
        verb: read
      record: apiserver_request:burnrate2h
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30m]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[30m]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[30m]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[30m]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m]))
      labels:
        verb: read
      record: apiserver_request:burnrate30m
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[3d]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[3d]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[3d]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[3d]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[3d]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[3d]))
      labels:
        verb: read
      record: apiserver_request:burnrate3d
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[5m]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[5m]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[5m]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[5m]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[5m]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m]))
      labels:
        verb: read
      record: apiserver_request:burnrate5m
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h]))
            -
            (
              (
                sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[6h]))
                or
                vector(0)
              )
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[6h]))
              +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[6h]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[6h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[6h]))
      labels:
        verb: read
      record: apiserver_request:burnrate6h
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[1d]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1d]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
      labels:
        verb: write
      record: apiserver_request:burnrate1d
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[1h]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
      labels:
        verb: write
      record: apiserver_request:burnrate1h
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[2h]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[2h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
      labels:
        verb: write
      record: apiserver_request:burnrate2h
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[30m]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30m]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
      labels:
        verb: write
      record: apiserver_request:burnrate30m
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[3d]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[3d]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
      labels:
        verb: write
      record: apiserver_request:burnrate3d
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[5m]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[5m]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
      labels:
        verb: write
      record: apiserver_request:burnrate5m
    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
            -
            sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",le="1"}[6h]))
          )
          +
          sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[6h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
      labels:
        verb: write
      record: apiserver_request:burnrate6h
    - expr: |
        sum by (code,resource) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m]))
      labels:
        verb: read
      record: code_resource:apiserver_request_total:rate5m
    - expr: |
        sum by (code,resource) (rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
      labels:
        verb: write
      record: code_resource:apiserver_request_total:rate5m
    - expr: |
        histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET"}[5m]))) > 0
      labels:
        quantile: "0.99"
        verb: read
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0
      labels:
        quantile: "0.99"
        verb: write
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        sum(rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod)
        /
        sum(rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod)
      record: cluster:apiserver_request_duration_seconds:mean5m
    - expr: |
        histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
  - interval: 3m
    name: kube-apiserver-availability.rules
    rules:
    - expr: |
        1 - (
          (
            # write too slow
            sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
            -
            sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
          ) +
          (
            # read too slow
            sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
            -
            (
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d]))
                or
                vector(0)
              )
              +
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
              +
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
            )
          ) +
          # errors
          sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
        )
        /
        sum(code:apiserver_request_total:increase30d)
      labels:
        verb: all
      record: apiserver_request:availability30d
    - expr: |
        1 - (
          sum(increase(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30d]))
          -
          (
            # too slow
            (
              sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d]))
              or
              vector(0)
            )
            +
            sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
            +
            sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
          )
          +
          # errors
          sum(code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
        )
        /
        sum(code:apiserver_request_total:increase30d{verb="read"})
      labels:
        verb: read
      record: apiserver_request:availability30d
    - expr: |
        1 - (
          (
            # too slow
            sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
            -
            sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
          )
          +
          # errors
          sum(code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
        )
        /
        sum(code:apiserver_request_total:increase30d{verb="write"})
      labels:
        verb: write
      record: apiserver_request:availability30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="LIST",code=~"2.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="GET",code=~"2.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="POST",code=~"2.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PUT",code=~"2.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PATCH",code=~"2.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="DELETE",code=~"2.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="LIST",code=~"3.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="GET",code=~"3.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="POST",code=~"3.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PUT",code=~"3.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PATCH",code=~"3.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="DELETE",code=~"3.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="LIST",code=~"4.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="GET",code=~"4.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="POST",code=~"4.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PUT",code=~"4.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PATCH",code=~"4.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="DELETE",code=~"4.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="LIST",code=~"5.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="GET",code=~"5.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="POST",code=~"5.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PUT",code=~"5.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="PATCH",code=~"5.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code, verb) (increase(apiserver_request_total{job="apiserver",verb="DELETE",code=~"5.."}[30d]))
      record: code_verb:apiserver_request_total:increase30d
    - expr: |
        sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"})
      labels:
        verb: read
      record: code:apiserver_request_total:increase30d
    - expr: |
        sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
      labels:
        verb: write
      record: code:apiserver_request_total:increase30d
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (cluster, namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])
        ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_working_set_bytes
    - expr: |
        container_memory_rss{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_rss
    - expr: |
        container_memory_cache{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_cache
    - expr: |
        container_memory_swap{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_swap
    - expr: |
        sum(container_memory_usage_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}) by (namespace)
      record: namespace:container_memory_usage_bytes:sum
    - expr: |
        sum by (namespace) (
            sum by (namespace, pod) (
                max by (namespace, pod, container) (
                    kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}
                ) * on(namespace, pod) group_left() max by (namespace, pod) (
                    kube_pod_status_phase{phase=~"Pending|Running"} == 1
                )
            )
        )
      record: namespace:kube_pod_container_resource_requests_memory_bytes:sum
    - expr: |
        sum by (namespace) (
            sum by (namespace, pod) (
                max by (namespace, pod, container) (
                    kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
                ) * on(namespace, pod) group_left() max by (namespace, pod) (
                  kube_pod_status_phase{phase=~"Pending|Running"} == 1
                )
            )
        )
      record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
    - expr: |
        max by (cluster, namespace, workload, pod) (
          label_replace(
            label_replace(
              kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
              "replicaset", "$1", "owner_name", "(.*)"
            ) * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (
              1, max by (replicaset, namespace, owner_name) (
                kube_replicaset_owner{job="kube-state-metrics"}
              )
            ),
            "workload", "$1", "owner_name", "(.*)"
          )
        )
      labels:
        workload_type: deployment
      record: namespace_workload_pod:kube_pod_owner:relabel
    - expr: |
        max by (cluster, namespace, workload, pod) (
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        )
      labels:
        workload_type: daemonset
      record: namespace_workload_pod:kube_pod_owner:relabel
    - expr: |
        max by (cluster, namespace, workload, pod) (
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        )
      labels:
        workload_type: statefulset
      record: namespace_workload_pod:kube_pod_owner:relabel
  - name: kube-scheduler.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
  - name: node.rules
    rules:
    - expr: |
        sum(min(kube_pod_info{node!=""}) by (cluster, node))
      record: ':kube_pod_info_node_count:'
    - expr: |
        topk by(namespace, pod) (1,
          max by (node, namespace, pod) (
            label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")
        ))
      record: 'node_namespace_pod:kube_pod_info:'
    - expr: |
        count by (cluster, node) (sum by (node, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        ))
      record: node:node_num_cpu:sum
    - expr: |
        sum(
          node_memory_MemAvailable_bytes{job="node-exporter"} or
          (
            node_memory_Buffers_bytes{job="node-exporter"} +
            node_memory_Cached_bytes{job="node-exporter"} +
            node_memory_MemFree_bytes{job="node-exporter"} +
            node_memory_Slab_bytes{job="node-exporter"}
          )
        ) by (cluster)
      record: :node_memory_MemAvailable_bytes:sum
  - name: kubelet.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (instance, le) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
      labels:
        quantile: "0.99"
      record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (instance, le) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
      labels:
        quantile: "0.9"
      record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (instance, le) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
      labels:
        quantile: "0.5"
      record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
  - name: kube-prometheus-node-recording.rules
    rules:
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance)
      record: instance:node_cpu:rate:sum
    - expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
      record: instance:node_network_receive_bytes:rate:sum
    - expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
      record: instance:node_network_transmit_bytes:rate:sum
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)
      record: instance:node_cpu:ratio
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m]))
      record: cluster:node_cpu:sum_rate5m
    - expr: cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu))
      record: cluster:node_cpu:ratio
  - name: kube-prometheus-general.rules
    rules:
    - expr: count without(instance, pod, node) (up == 1)
      record: count:up1
    - expr: count without(instance, pod, node) (up == 0)
      record: count:up0
  - name: kube-state-metrics
    rules:
    - alert: KubeStateMetricsListErrors
      annotations:
        description: kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubestatemetricslisterrors
        summary: kube-state-metrics is experiencing errors in list operations.
      expr: |
        (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
          /
        sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])))
        > 0.01
      for: 15m
      labels:
        severity: critical
    - alert: KubeStateMetricsWatchErrors
      annotations:
        description: kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubestatemetricswatcherrors
        summary: kube-state-metrics is experiencing errors in watch operations.
      expr: |
        (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m]))
          /
        sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m])))
        > 0.01
      for: 15m
      labels:
        severity: critical
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 5% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 3% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 24 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 4 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 5% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 3% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodenetworkreceiveerrs
        summary: Network interface is reporting many receive errors.
      expr: |
        rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodenetworktransmiterrs
        summary: Network interface is reporting many transmit errors.
      expr: |
        rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeHighNumberConntrackEntriesUsed
      annotations:
        description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodehighnumberconntrackentriesused
        summary: Number of conntrack are getting close to the limit.
      expr: |
        (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
      labels:
        severity: warning
    - alert: NodeTextFileCollectorScrapeError
      annotations:
        description: Node Exporter text file collector failed to scrape.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodetextfilecollectorscrapeerror
        summary: Node Exporter text file collector failed to scrape.
      expr: |
        node_textfile_scrape_error{job="node-exporter"} == 1
      labels:
        severity: warning
    - alert: NodeClockSkewDetected
      annotations:
        message: Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodeclockskewdetected
        summary: Clock skew detected.
      expr: |
        (
          node_timex_offset_seconds > 0.05
        and
          deriv(node_timex_offset_seconds[5m]) >= 0
        )
        or
        (
          node_timex_offset_seconds < -0.05
        and
          deriv(node_timex_offset_seconds[5m]) <= 0
        )
      for: 10m
      labels:
        severity: warning
    - alert: NodeClockNotSynchronising
      annotations:
        message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodeclocknotsynchronising
        summary: Clock not synchronising.
      expr: |
        min_over_time(node_timex_sync_status[5m]) == 0
        and
        node_timex_maxerror_seconds >= 16
      for: 10m
      labels:
        severity: warning
    - alert: NodeRAIDDegraded
      annotations:
        description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/noderaiddegraded
        summary: RAID Array is degraded
      expr: |
        node_md_disks_required - ignoring (state) (node_md_disks{state="active"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: NodeRAIDDiskFailure
      annotations:
        description: At least one device in RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/noderaiddiskfailure
        summary: Failed device in RAID array
      expr: |
        node_md_disks{state="fail"} > 0
      labels:
        severity: warning
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorListErrors
      annotations:
        description: Errors while performing List operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatorlisterrors
        summary: Errors while performing list operations in controller.
      expr: |
        (sum by (controller,namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator",namespace="monitoring"}[10m])) / sum by (controller,namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator",namespace="monitoring"}[10m]))) > 0.4
      for: 15m
      labels:
        severity: warning
    - alert: PrometheusOperatorWatchErrors
      annotations:
        description: Errors while performing watch operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatorwatcherrors
        summary: Errors while performing watch operations in controller.
      expr: |
        (sum by (controller,namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator",namespace="monitoring"}[10m])) / sum by (controller,namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator",namespace="monitoring"}[10m]))) > 0.4
      for: 15m
      labels:
        severity: warning
    - alert: PrometheusOperatorSyncFailed
      annotations:
        description: Controller {{ $labels.controller }} in {{ $labels.namespace }} namespace fails to reconcile {{ $value }} objects.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatorsyncfailed
        summary: Last controller reconciliation failed
      expr: |
        min_over_time(prometheus_operator_syncs{status="failed",job="prometheus-operator",namespace="monitoring"}[5m]) > 0
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        description: '{{ $value | humanizePercentage }} of reconciling operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatorreconcileerrors
        summary: Errors while reconciling controller.
      expr: |
        (sum by (controller,namespace) (rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]))) / (sum by (controller,namespace) (rate(prometheus_operator_reconcile_operations_total{job="prometheus-operator",namespace="monitoring"}[5m]))) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        description: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatornodelookuperrors
        summary: Errors while reconciling Prometheus.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNotReady
      annotations:
        description: Prometheus operator in {{ $labels.namespace }} namespace isn't ready to reconcile {{ $labels.controller }} resources.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatornotready
        summary: Prometheus operator not ready
      expr: |
        min by(namespace, controller) (max_over_time(prometheus_operator_ready{job="prometheus-operator",namespace="monitoring"}[5m]) == 0)
      for: 5m
      labels:
        severity: warning
    - alert: PrometheusOperatorRejectedResources
      annotations:
        description: Prometheus operator in {{ $labels.namespace }} namespace rejected {{ printf "%0.0f" $value }} {{ $labels.controller }}/{{ $labels.resource }} resources.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusoperatorrejectedresources
        summary: Resources rejected by Prometheus operator
      expr: |
        min_over_time(prometheus_operator_managed_resources{state="rejected",job="prometheus-operator",namespace="monitoring"}[5m]) > 0
      for: 5m
      labels:
        severity: warning
  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping
      annotations:
        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubepodcrashlooping
        summary: Pod is crash looping.
      expr: |
        rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[5m]) * 60 * 5 > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubePodNotReady
      annotations:
        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubepodnotready
        summary: Pod has been in a non-ready state for more than 15 minutes.
      expr: |
        sum by (namespace, pod) (
          max by(namespace, pod) (
            kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}
          ) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (
            1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})
          )
        ) > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeDeploymentGenerationMismatch
      annotations:
        description: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubedeploymentgenerationmismatch
        summary: Deployment generation mismatch due to possible roll-back
      expr: |
        kube_deployment_status_observed_generation{job="kube-state-metrics"}
          !=
        kube_deployment_metadata_generation{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: warning
    - alert: KubeDeploymentReplicasMismatch
      annotations:
        description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubedeploymentreplicasmismatch
        summary: Deployment has not matched the expected number of replicas.
      expr: |
        (
          kube_deployment_spec_replicas{job="kube-state-metrics"}
            !=
          kube_deployment_status_replicas_available{job="kube-state-metrics"}
        ) and (
          changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m])
            ==
          0
        )
      for: 15m
      labels:
        severity: warning
    - alert: KubeStatefulSetReplicasMismatch
      annotations:
        description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubestatefulsetreplicasmismatch
        summary: Deployment has not matched the expected number of replicas.
      expr: |
        (
          kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
            !=
          kube_statefulset_status_replicas{job="kube-state-metrics"}
        ) and (
          changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m])
            ==
          0
        )
      for: 15m
      labels:
        severity: warning
    - alert: KubeStatefulSetGenerationMismatch
      annotations:
        description: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubestatefulsetgenerationmismatch
        summary: StatefulSet generation mismatch due to possible roll-back
      expr: |
        kube_statefulset_status_observed_generation{job="kube-state-metrics"}
          !=
        kube_statefulset_metadata_generation{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: warning
    - alert: KubeStatefulSetUpdateNotRolledOut
      annotations:
        description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubestatefulsetupdatenotrolledout
        summary: StatefulSet update has not been rolled out.
      expr: |
        (
          max without (revision) (
            kube_statefulset_status_current_revision{job="kube-state-metrics"}
              unless
            kube_statefulset_status_update_revision{job="kube-state-metrics"}
          )
            *
          (
            kube_statefulset_replicas{job="kube-state-metrics"}
              !=
            kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
          )
        )  and (
          changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m])
            ==
          0
        )
      for: 15m
      labels:
        severity: warning
    - alert: KubeDaemonSetRolloutStuck
      annotations:
        description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubedaemonsetrolloutstuck
        summary: DaemonSet rollout is stuck.
      expr: |
        (
          (
            kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
             !=
            kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          ) or (
            kube_daemonset_status_number_misscheduled{job="kube-state-metrics"}
             !=
            0
          ) or (
            kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}
             !=
            kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          ) or (
            kube_daemonset_status_number_available{job="kube-state-metrics"}
             !=
            kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          )
        ) and (
          changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}[5m])
            ==
          0
        )
      for: 15m
      labels:
        severity: warning
    - alert: KubeContainerWaiting
      annotations:
        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubecontainerwaiting
        summary: Pod container waiting longer than 1 hour
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0
      for: 1h
      labels:
        severity: warning
    - alert: KubeDaemonSetNotScheduled
      annotations:
        description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubedaemonsetnotscheduled
        summary: DaemonSet pods are not scheduled.
      expr: |
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          -
        kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0
      for: 10m
      labels:
        severity: warning
    - alert: KubeDaemonSetMisScheduled
      annotations:
        description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubedaemonsetmisscheduled
        summary: DaemonSet pods are misscheduled.
      expr: |
        kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeJobCompletion
      annotations:
        description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubejobcompletion
        summary: Job did not complete in time
      expr: |
        kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"}  > 0
      for: 12h
      labels:
        severity: warning
    - alert: KubeJobFailed
      annotations:
        description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubejobfailed
        summary: Job failed to complete.
      expr: |
        kube_job_failed{job="kube-state-metrics"}  > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeHpaReplicasMismatch
      annotations:
        description: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubehpareplicasmismatch
        summary: HPA has not matched descired number of replicas.
      expr: |
        (kube_hpa_status_desired_replicas{job="kube-state-metrics"}
          !=
        kube_hpa_status_current_replicas{job="kube-state-metrics"})
          and
        changes(kube_hpa_status_current_replicas[15m]) == 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeHpaMaxedOut
      annotations:
        description: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubehpamaxedout
        summary: HPA is running at max replicas
      expr: |
        kube_hpa_status_current_replicas{job="kube-state-metrics"}
          ==
        kube_hpa_spec_max_replicas{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: warning
  - name: kubernetes-resources
    rules:
    - alert: KubeCPUOvercommit
      annotations:
        description: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubecpuovercommit
        summary: Cluster has overcommitted CPU resource requests.
      expr: |
        sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{})
          /
        sum(kube_node_status_allocatable_cpu_cores)
          >
        (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
      for: 5m
      labels:
        severity: warning
    - alert: KubeMemoryOvercommit
      annotations:
        description: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubememoryovercommit
        summary: Cluster has overcommitted memory resource requests.
      expr: |
        sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{})
          /
        sum(kube_node_status_allocatable_memory_bytes)
          >
        (count(kube_node_status_allocatable_memory_bytes)-1)
          /
        count(kube_node_status_allocatable_memory_bytes)
      for: 5m
      labels:
        severity: warning
    - alert: KubeCPUQuotaOvercommit
      annotations:
        description: Cluster has overcommitted CPU resource requests for Namespaces.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubecpuquotaovercommit
        summary: Cluster has overcommitted CPU resource requests.
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
          /
        sum(kube_node_status_allocatable_cpu_cores)
          > 1.5
      for: 5m
      labels:
        severity: warning
    - alert: KubeMemoryQuotaOvercommit
      annotations:
        description: Cluster has overcommitted memory resource requests for Namespaces.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubememoryquotaovercommit
        summary: Cluster has overcommitted memory resource requests.
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
          /
        sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"})
          > 1.5
      for: 5m
      labels:
        severity: warning
    - alert: KubeQuotaAlmostFull
      annotations:
        description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubequotaalmostfull
        summary: Namespace quota is going to be full.
      expr: |
        kube_resourcequota{job="kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
          > 0.9 < 1
      for: 15m
      labels:
        severity: info
    - alert: KubeQuotaFullyUsed
      annotations:
        description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubequotafullyused
        summary: Namespace quota is fully used.
      expr: |
        kube_resourcequota{job="kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
          == 1
      for: 15m
      labels:
        severity: info
    - alert: KubeQuotaExceeded
      annotations:
        description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubequotaexceeded
        summary: Namespace quota has exceeded the limits.
      expr: |
        kube_resourcequota{job="kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
          > 1
      for: 15m
      labels:
        severity: warning
    - alert: CPUThrottlingHigh
      annotations:
        description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
        summary: Processes experience elevated CPU throttling.
      expr: |
        sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
          /
        sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
          > ( 25 / 100 )
      for: 15m
      labels:
        severity: info
  - name: kubernetes-storage
    rules:
    - alert: KubePersistentVolumeFillingUp
      annotations:
        description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubepersistentvolumefillingup
        summary: PersistentVolume is filling up.
      expr: |
        kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet", metrics_path="/metrics"}
          < 0.03
      for: 1m
      labels:
        severity: critical
    - alert: KubePersistentVolumeFillingUp
      annotations:
        description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubepersistentvolumefillingup
        summary: PersistentVolume is filling up.
      expr: |
        (
          kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"}
            /
          kubelet_volume_stats_capacity_bytes{job="kubelet", metrics_path="/metrics"}
        ) < 0.15
        and
        predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
      for: 1h
      labels:
        severity: warning
    - alert: KubePersistentVolumeErrors
      annotations:
        description: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubepersistentvolumeerrors
        summary: PersistentVolume is having issues with provisioning.
      expr: |
        kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
      for: 5m
      labels:
        severity: critical
  - name: kubernetes-system
    rules:
    - alert: KubeVersionMismatch
      annotations:
        description: There are {{ $value }} different semantic versions of Kubernetes components running.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeversionmismatch
        summary: Different semantic versions of Kubernetes components running.
      expr: |
        count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*).*"))) > 1
      for: 15m
      labels:
        severity: warning
    - alert: KubeClientErrors
      annotations:
        description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeclienterrors
        summary: Kubernetes API server client is experiencing errors.
      expr: |
        (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
          /
        sum(rate(rest_client_requests_total[5m])) by (instance, job))
        > 0.01
      for: 15m
      labels:
        severity: warning
  - name: kube-apiserver-slos
    rules:
    - alert: KubeAPIErrorBudgetBurn
      annotations:
        description: The API server is burning too much error budget.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeapierrorbudgetburn
        summary: The API server is burning too much error budget.
      expr: |
        sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
        and
        sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
      for: 2m
      labels:
        long: 1h
        severity: critical
        short: 5m
    - alert: KubeAPIErrorBudgetBurn
      annotations:
        description: The API server is burning too much error budget.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeapierrorbudgetburn
        summary: The API server is burning too much error budget.
      expr: |
        sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)
        and
        sum(apiserver_request:burnrate30m) > (6.00 * 0.01000)
      for: 15m
      labels:
        long: 6h
        severity: critical
        short: 30m
    - alert: KubeAPIErrorBudgetBurn
      annotations:
        description: The API server is burning too much error budget.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeapierrorbudgetburn
        summary: The API server is burning too much error budget.
      expr: |
        sum(apiserver_request:burnrate1d) > (3.00 * 0.01000)
        and
        sum(apiserver_request:burnrate2h) > (3.00 * 0.01000)
      for: 1h
      labels:
        long: 1d
        severity: warning
        short: 2h
    - alert: KubeAPIErrorBudgetBurn
      annotations:
        description: The API server is burning too much error budget.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeapierrorbudgetburn
        summary: The API server is burning too much error budget.
      expr: |
        sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)
        and
        sum(apiserver_request:burnrate6h) > (1.00 * 0.01000)
      for: 3h
      labels:
        long: 3d
        severity: warning
        short: 6h
  - name: kubernetes-system-apiserver
    rules:
    - alert: KubeClientCertificateExpiration
      annotations:
        description: A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeclientcertificateexpiration
        summary: Client certificate is about to expire.
      expr: |
        apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
      labels:
        severity: warning
    - alert: KubeClientCertificateExpiration
      annotations:
        description: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeclientcertificateexpiration
        summary: Client certificate is about to expire.
      expr: |
        apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
      labels:
        severity: critical
    - alert: AggregatedAPIErrors
      annotations:
        description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/aggregatedapierrors
        summary: An aggregated API has reported errors.
      expr: |
        sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2
      labels:
        severity: warning
    - alert: AggregatedAPIDown
      annotations:
        description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been only {{ $value | humanize }}% available over the last 10m.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/aggregatedapidown
        summary: An aggregated API is down.
      expr: |
        (1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
      for: 5m
      labels:
        severity: warning
    - alert: KubeAPIDown
      annotations:
        description: KubeAPI has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeapidown
        summary: Target disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="apiserver"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: kubernetes-system-kubelet
    rules:
    - alert: KubeNodeNotReady
      annotations:
        description: '{{ $labels.node }} has been unready for more than 15 minutes.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubenodenotready
        summary: Node is not ready.
      expr: |
        kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeNodeUnreachable
      annotations:
        description: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubenodeunreachable
        summary: Node is unreachable.
      expr: |
        (kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} unless ignoring(key,value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1
      for: 15m
      labels:
        severity: warning
    - alert: KubeletTooManyPods
      annotations:
        description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubelettoomanypods
        summary: Kubelet is running at capacity.
      expr: |
        count by(node) (
          (kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster) group_left(node) topk by(instance,pod,namespace,cluster) (1, kube_pod_info{job="kube-state-metrics"})
        )
        /
        max by(node) (
          kube_node_status_capacity_pods{job="kube-state-metrics"} != 1
        ) > 0.95
      for: 15m
      labels:
        severity: warning
    - alert: KubeNodeReadinessFlapping
      annotations:
        description: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubenodereadinessflapping
        summary: Node readiness status is flapping.
      expr: |
        sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2
      for: 15m
      labels:
        severity: warning
    - alert: KubeletPlegDurationHigh
      annotations:
        description: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletplegdurationhigh
        summary: Kubelet Pod Lifecycle Event Generator is taking too long to relist.
      expr: |
        node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10
      for: 5m
      labels:
        severity: warning
    - alert: KubeletPodStartUpLatencyHigh
      annotations:
        description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletpodstartuplatencyhigh
        summary: Kubelet Pod startup latency is too high.
      expr: |
        histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60
      for: 15m
      labels:
        severity: warning
    - alert: KubeletClientCertificateExpiration
      annotations:
        description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletclientcertificateexpiration
        summary: Kubelet client certificate is about to expire.
      expr: |
        kubelet_certificate_manager_client_ttl_seconds < 604800
      labels:
        severity: warning
    - alert: KubeletClientCertificateExpiration
      annotations:
        description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletclientcertificateexpiration
        summary: Kubelet client certificate is about to expire.
      expr: |
        kubelet_certificate_manager_client_ttl_seconds < 86400
      labels:
        severity: critical
    - alert: KubeletServerCertificateExpiration
      annotations:
        description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletservercertificateexpiration
        summary: Kubelet server certificate is about to expire.
      expr: |
        kubelet_certificate_manager_server_ttl_seconds < 604800
      labels:
        severity: warning
    - alert: KubeletServerCertificateExpiration
      annotations:
        description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletservercertificateexpiration
        summary: Kubelet server certificate is about to expire.
      expr: |
        kubelet_certificate_manager_server_ttl_seconds < 86400
      labels:
        severity: critical
    - alert: KubeletClientCertificateRenewalErrors
      annotations:
        description: Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 5 minutes).
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletclientcertificaterenewalerrors
        summary: Kubelet has failed to renew its client certificate.
      expr: |
        increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeletServerCertificateRenewalErrors
      annotations:
        description: Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletservercertificaterenewalerrors
        summary: Kubelet has failed to renew its server certificate.
      expr: |
        increase(kubelet_server_expiration_renew_errors[5m]) > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeletDown
      annotations:
        description: Kubelet has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeletdown
        summary: Target disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="kubelet", metrics_path="/metrics"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: kubernetes-system-scheduler
    rules:
    - alert: KubeSchedulerDown
      annotations:
        description: KubeScheduler has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubeschedulerdown
        summary: Target disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="kube-scheduler"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: kubernetes-system-controller-manager
    rules:
    - alert: KubeControllerManagerDown
      annotations:
        description: KubeControllerManager has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubecontrollermanagerdown
        summary: Target disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="kube-controller-manager"} == 1)
      for: 15m
      labels:
        severity: critical
  - name: prometheus
    rules:
    - alert: PrometheusBadConfig
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to reload its configuration.
        summary: Failed Prometheus configuration reload.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(prometheus_config_last_reload_successful{job="prometheus-k8s",namespace="monitoring"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
    - alert: PrometheusNotificationQueueRunningFull
      annotations:
        description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}} is running full.
        summary: Prometheus alert notification queue predicted to run full in less than 30m.
      expr: |
        # Without min_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        (
          predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s",namespace="monitoring"}[5m], 60 * 30)
        >
          min_over_time(prometheus_notifications_queue_capacity{job="prometheus-k8s",namespace="monitoring"}[5m])
        )
      for: 15m
      labels:
        severity: warning
    - alert: PrometheusErrorSendingAlertsToSomeAlertmanagers
      annotations:
        description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.'
        summary: Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.
      expr: |
        (
          rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m])
        /
          rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])
        )
        * 100
        > 1
      for: 15m
      labels:
        severity: warning
    - alert: PrometheusErrorSendingAlertsToAnyAlertmanager
      annotations:
        description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.'
        summary: Prometheus encounters more than 3% errors sending alerts to any Alertmanager.
      expr: |
        min without(alertmanager) (
          rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m])
        /
          rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])
        )
        * 100
        > 3
      for: 15m
      labels:
        severity: critical
    - alert: PrometheusNotConnectedToAlertmanagers
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected to any Alertmanagers.
        summary: Prometheus is not connected to any Alertmanagers.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s",namespace="monitoring"}[5m]) < 1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusTSDBReloadsFailing
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} reload failures over the last 3h.
        summary: Prometheus has issues reloading blocks from disk.
      expr: |
        increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0
      for: 4h
      labels:
        severity: warning
    - alert: PrometheusTSDBCompactionsFailing
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} compaction failures over the last 3h.
        summary: Prometheus has issues compacting blocks.
      expr: |
        increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0
      for: 4h
      labels:
        severity: warning
    - alert: PrometheusNotIngestingSamples
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples.
        summary: Prometheus is not ingesting samples.
      expr: |
        rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m]) <= 0
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusDuplicateTimestamps
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value  }} samples/s with different values but duplicated timestamp.
        summary: Prometheus is dropping samples with duplicate timestamps.
      expr: |
        rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOutOfOrderTimestamps
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value  }} samples/s with timestamps arriving out of order.
        summary: Prometheus drops samples with out-of-order timestamps.
      expr: |
        rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusRemoteStorageFailures
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name}}:{{ $labels.url }}
        summary: Prometheus fails to send samples to remote storage.
      expr: |
        (
          rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
        /
          (
            rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
          +
            rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
          )
        )
        * 100
        > 1
      for: 15m
      labels:
        severity: critical
    - alert: PrometheusRemoteWriteBehind
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}.
        summary: Prometheus remote write is behind.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        (
          max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])
        - on(job, instance) group_right
          max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])
        )
        > 120
      for: 15m
      labels:
        severity: critical
    - alert: PrometheusRemoteWriteDesiredShards
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name}}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-k8s",namespace="monitoring"}` $labels.instance | query | first | value }}.
        summary: Prometheus remote write desired shards calculation wants to run more than configured max shards.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        (
          max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-k8s",namespace="monitoring"}[5m])
        >
          max_over_time(prometheus_remote_storage_shards_max{job="prometheus-k8s",namespace="monitoring"}[5m])
        )
      for: 15m
      labels:
        severity: warning
    - alert: PrometheusRuleFailures
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
        summary: Prometheus is failing rule evaluations.
      expr: |
        increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: critical
    - alert: PrometheusMissingRuleEvaluations
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.
        summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
      expr: |
        increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: warning
    - alert: PrometheusTargetLimitHit
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has dropped {{ printf "%.0f" $value }} targets because the number of targets exceeded the configured target_limit.
        summary: Prometheus has dropped targets because some scrape configs have exceeded the targets limit.
      expr: |
        increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: warning
  - name: alertmanager.rules
    rules:
    - alert: AlertmanagerConfigInconsistent
      annotations:
        message: |
          The configuration of the instances of the Alertmanager cluster `{{ $labels.namespace }}/{{ $labels.service }}` are out of sync.
          {{ range printf "alertmanager_config_hash{namespace=\"%s\",service=\"%s\"}" $labels.namespace $labels.service | query }}
          Configuration hash for pod {{ .Labels.pod }} is "{{ printf "%.f" .Value }}"
          {{ end }}
      expr: |
        count by(namespace,service) (count_values by(namespace,service) ("config_hash", alertmanager_config_hash{job="alertmanager-main",namespace="monitoring"})) != 1
      for: 5m
      labels:
        severity: critical
    - alert: AlertmanagerFailedReload
      annotations:
        message: Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}.
      expr: |
        alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"} == 0
      for: 10m
      labels:
        severity: warning
    - alert: AlertmanagerMembersInconsistent
      annotations:
        message: Alertmanager has not found all other members of the cluster.
      expr: |
        alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"}
          != on (service) GROUP_LEFT()
        count by (service) (alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"})
      for: 5m
      labels:
        severity: critical
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.'
      expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10
      for: 10m
      labels:
        severity: warning
    - alert: Watchdog
      annotations:
        message: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
      expr: vector(1)
      labels:
        severity: none
  - name: node-network
    rules:
    - alert: NodeNetworkInterfaceFlapping
      annotations:
        message: Network interface "{{ $labels.device }}" changing it's up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}"
      expr: |
        changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
      for: 2m
      labels:
        severity: warning
prometheus-rules.yaml

 

創建kube-prometheus
kubectl apply -f manifests/setup kubectl apply -f manifests/
 

四、添加外部監控

一個項目開始可能很難實現全部容器化,比如數據庫、CDH集群。但是我們依然需要監控他們,如果分成兩套prometheus不利於管理,所以我們統一添加這些監控到kube-prometheus中。

那么接下來我們新建prometheus-additional.yaml文件,添加額外監控組件配置scrape_configs。

- job_name: 'k8s_sercice_metrics'
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [xxx-online, xxx-sandbox] ##如果你的服務部署在除kube-system、default、monitoring這三個namaspace之外,此處直接寫是不生效的,具體修改請看下文第十步定制化配置
  metrics_path: /swoole/stats
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_container_name]
    regex: micro-service-article.*;micro-service-article.*
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace

- job_name: 'ingress-nginx-endpoints'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - kube-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_service_name]
    regex: prometheus-server
    action: drop

 

然后我們需要將這些監控配置以secret資源類型存儲到k8s集群中。
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring
kubectl apply -f additional-scrape-configs.yaml

 

 


五、Scheduler和Controller配置     

 

展開Status菜單,查看targets,可以看到只有圖中兩個監控任務沒有對應的目標,這和serviceMonitor資源對象有關。

查看yaml文件prometheus-serviceMonitorKubeScheduler,selector匹配的是service的標簽,但是kube-system namespace中並沒有k8s-app=kube-scheduler的service

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s  # 每30s獲取一次信息
    port: http-metrics  # 對應service的端口名
  jobLabel: k8s-app
  namespaceSelector:  # 表示去匹配某一命名空間中的service,如果想從所有的namespace中匹配用any: true
    matchNames:
    - kube-system
  selector:  # 匹配的 Service 的labels,如果使用mathLabels,則下面的所有標簽都匹配時才會匹配該service,如果使用matchExpressions,則至少匹配一個標簽的service都會被選擇
    matchLabels:
      k8s-app: kube-scheduler

新建prometheus-kubeSchedulerService.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler #與servicemonitor中的selector匹配
spec:
  selector: 
    component: kube-scheduler # 與scheduler的pod標簽一直
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP

同理新建prometheus-kubeControllerManagerService.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP

六、Alertmanager配置

監控和告警項已經配置好了,那么接下來我們將進行alertmanager告警配置了。

常用的接收方式就是郵件了,但這里我們將使用飛書和郵件進行接收,(飛書和企業微信道理相同),所以需要自行開發一個連接飛書的應用websocket,進行消息轉發和處理。

當然,你也可以直接配置微信號和消息模板

global:

  resolve_timeout: 5m #5m內沒有收到告警標記為已恢復

  smtp_smarthost: 'smtp.exmail.qq.com:465'

  smtp_from: '發件箱名稱'

  smtp_auth_username: '發件箱名稱'

  smtp_auth_password: '發件箱密碼'

  smtp_require_tls: false

 

templates:

- '/usr/local/alertmanager/template/*.tmpl'  郵箱模板

 

route:

  receiver: default

  group_wait: 10s #如果在等待時間內當前group接收到了新的告警,這些告警將會合並為一個通知向receiver發送。

  group_interval: 5m #用於定義相同的Gourp之間發送告警通知的時間間隔。

  repeat_interval: 5m #如果一個報警信息已經發送成功了,等待'repeat_interval'時間來重新發送他們

  group_by: [alertname] #將多條告警合合並為一個通知,如果滿足group_by中定義標簽名稱,那么這些告警將會合並為一個通知發送給接收器。

  routes:

  - receiver: webhook

    group_wait: 10s

    match:

      status: warning

#配置接收器

receivers:

- name: webhook

  webhook_configs:

  - url: http://192.168.116.57:5004 #此處為自行開發的websocket地址,接收alertmanager告警並處理及發送到飛書或者企業微信

    send_resolved: true

- name: default

  email_configs:

  - to: '收件箱名稱'

    send_resolved: true

    html: '{{ template "emai.html" . }}' # 模板

    headers: { Subject: "{{ .CommonAnnotations.summary }}" }

 

#抑制配置

#inhibit_rules:- source_match:

 #   status: 'critical'

  #target_match:

   # status: 'warning'

  # Apply inhibition if the alertname is the same.

  #equal: ['alertname' ]

然后我們需要將alertmanager配置以secret資源類型存儲到k8s集群中。

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring

 


七、監控數據持久化

默認Prometheus和Grafana不做數據持久化,那么服務重啟以后配置的Dashboard、賬號密碼、監控數據等信息將會丟失,所以做數據持久化也是很有必要的。

原始的數據是以 emptyDir 形式存放在pod里面,生命周期與pod相同,出現問題時,容器重啟,監控相關的數據就全部消失了。

這里我們通過 storageclass 來做數據持久化,具體配置可以參考:Kubernetes實戰總結 - 動態存儲管理StorageClass

對於Grafana來說,我們需要在grafana-deployment.yaml中增加PVC配置,並且更改volumes.grafana-storage。

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    volume.beta.kubernetes.io/storage-class: "managed-nfs-storage"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      #- image: grafana/grafana:6.4.3
      - image: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/grafana:6.4.3
        name: grafana
        ports:
        - containerPort: 3000
          name: http
        readinessProbe:
          httpGet:
            path: /api/health
            port: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-storage
          readOnly: false
        ...
volumes: # - emptyDir: {} # name: grafana-storage - name: grafana-storage persistentVolumeClaim: claimName: grafana ...

對於Prometheus來說,我們只需要在prometheus-prometheus.yaml中增加volumeClaimTemplate配置即可。

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  #baseImage: quay.io/prometheus/prometheus
  baseImage: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/prometheus
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  retention: 15d
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: alicloud-nas
        resources:
          requests:
            storage: 100Gi
  ...

九、Grafana儀表板配置

前面我們把監控和告警已經配置好了,那接下來就剩展示了。打開grafana  ->  點擊添加按鈕 ->Import ->Upload .json file,導入監控儀表板。

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "This dashboard provides cluster admins with the ability to monitor nodes and identify workload bottlenecks. It can be deployed with PSPs enabled using the following helm chart - https://github.com/pivotal-cf/charts-grafana",
  "editable": true,
  "gnetId": 10000,
  "graphTooltip": 0,
  "id": 102,
  "iteration": 1597137794957,
  "links": [],
  "panels": [
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 34,
      "panels": [],
      "repeat": null,
      "title": "Summary",
      "type": "row"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": true,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "editable": true,
      "error": false,
      "format": "percent",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": true,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 5,
        "w": 8,
        "x": 0,
        "y": 1
      },
      "height": "180px",
      "id": 4,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}) / sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"}) * 100",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "65, 90",
      "title": "Cluster memory usage",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": true,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "percent",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": true,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 5,
        "w": 8,
        "x": 8,
        "y": 1
      },
      "height": "180px",
      "id": 6,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) / sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"}) * 100",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "65, 90",
      "title": "Cluster CPU usage",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": true,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "percent",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": true,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 5,
        "w": 8,
        "x": 16,
        "y": 1
      },
      "height": "180px",
      "id": 7,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (container_fs_usage_bytes{id=\"/\"}) / sum (container_fs_limit_bytes{id=\"/\"}) * 100",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "",
          "metric": "",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "65, 90",
      "title": "Cluster filesystem usage",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "bytes",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 3,
        "w": 4,
        "x": 0,
        "y": 6
      },
      "height": "1px",
      "id": 9,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "20%",
      "prefix": "",
      "prefixFontSize": "20%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"})",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "",
      "title": "Used",
      "type": "singlestat",
      "valueFontSize": "50%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "bytes",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 3,
        "w": 4,
        "x": 4,
        "y": 6
      },
      "height": "1px",
      "id": 10,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"})",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "",
      "title": "Total",
      "type": "singlestat",
      "valueFontSize": "50%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "none",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 3,
        "w": 4,
        "x": 8,
        "y": 6
      },
      "height": "1px",
      "id": 11,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": " cores",
      "postfixFontSize": "30%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "",
      "title": "Used",
      "type": "singlestat",
      "valueFontSize": "50%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "none",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 3,
        "w": 4,
        "x": 12,
        "y": 6
      },
      "height": "1px",
      "id": 12,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": " cores",
      "postfixFontSize": "30%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"})",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "",
      "title": "Total",
      "type": "singlestat",
      "valueFontSize": "50%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "bytes",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 3,
        "w": 4,
        "x": 16,
        "y": 6
      },
      "height": "1px",
      "id": 13,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (container_fs_usage_bytes{id=\"/\"})",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "",
      "title": "Used",
      "type": "singlestat",
      "valueFontSize": "50%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(245, 54, 54, 0.9)"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "format": "bytes",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 3,
        "w": 4,
        "x": 20,
        "y": 6
      },
      "height": "1px",
      "id": 14,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum (container_fs_limit_bytes{id=\"/\"})",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": "",
      "title": "Total",
      "type": "singlestat",
      "valueFontSize": "50%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 9
      },
      "id": 35,
      "panels": [],
      "repeat": null,
      "title": "Memory",
      "type": "row"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "fill": 0,
      "fillGradient": 0,
      "grid": {},
      "gridPos": {
        "h": 7,
        "w": 24,
        "x": 0,
        "y": 10
      },
      "id": 25,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": false,
        "max": false,
        "min": false,
        "rightSide": true,
        "show": true,
        "sideWidth": 200,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "connected",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": true,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum (container_memory_working_set_bytes{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}) by (pod)",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "{{ pod }}",
          "metric": "container_memory_usage:sort_desc",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Pods memory usage",
      "tooltip": {
        "msResolution": false,
        "shared": true,
        "sort": 2,
        "value_type": "cumulative"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "bytes",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 17
      },
      "id": 37,
      "panels": [],
      "title": "CPU",
      "type": "row"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 3,
      "editable": true,
      "error": false,
      "fill": 0,
      "fillGradient": 0,
      "grid": {},
      "gridPos": {
        "h": 7,
        "w": 24,
        "x": 0,
        "y": 18
      },
      "height": "",
      "id": 17,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": true,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "connected",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": true,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum (rate (container_cpu_usage_seconds_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "{{ pod }}",
          "metric": "container_cpu",
          "refId": "A",
          "step": 10
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Pods CPU usage",
      "tooltip": {
        "msResolution": true,
        "shared": true,
        "sort": 2,
        "value_type": "cumulative"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "none",
          "label": "cores",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 25
      },
      "id": 33,
      "panels": [],
      "repeat": null,
      "title": "Network I/O",
      "type": "row"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "fill": 1,
      "fillGradient": 0,
      "grid": {},
      "gridPos": {
        "h": 7,
        "w": 24,
        "x": 0,
        "y": 26
      },
      "id": 16,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": true,
        "show": true,
        "sideWidth": 200,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "connected",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum (rate (container_network_receive_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "-> {{ pod }}",
          "metric": "network",
          "refId": "A",
          "step": 10
        },
        {
          "expr": "- sum (rate (container_network_transmit_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "<- {{ pod }}",
          "metric": "network",
          "refId": "B",
          "step": 10
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Pods network I/O",
      "tooltip": {
        "msResolution": false,
        "shared": true,
        "sort": 2,
        "value_type": "cumulative"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "Bps",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "fill": 1,
      "fillGradient": 0,
      "grid": {},
      "gridPos": {
        "h": 5,
        "w": 24,
        "x": 0,
        "y": 33
      },
      "height": "200px",
      "id": 32,
      "legend": {
        "alignAsTable": false,
        "avg": true,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": false,
        "show": false,
        "sideWidth": 200,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "connected",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "Received",
          "metric": "network",
          "refId": "A",
          "step": 10
        },
        {
          "expr": "- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
          "format": "time_series",
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "Sent",
          "metric": "network",
          "refId": "B",
          "step": 10
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Network I/O pressure",
      "tooltip": {
        "msResolution": false,
        "shared": true,
        "sort": 0,
        "value_type": "cumulative"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "Bps",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "Bps",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "refresh": "10s",
  "schemaVersion": 20,
  "style": "dark",
  "tags": [
    "Prometheus",
    "Kubernetes"
  ],
  "templating": {
    "list": [
      {
        "auto": true,
        "auto_count": 20,
        "auto_min": "2m",
        "current": {
          "text": "auto",
          "value": "$__auto_interval_interval"
        },
        "hide": 2,
        "label": null,
        "name": "interval",
        "options": [
          {
            "selected": true,
            "text": "auto",
            "value": "$__auto_interval_interval"
          },
          {
            "selected": false,
            "text": "1m",
            "value": "1m"
          },
          {
            "selected": false,
            "text": "10m",
            "value": "10m"
          },
          {
            "selected": false,
            "text": "30m",
            "value": "30m"
          },
          {
            "selected": false,
            "text": "1h",
            "value": "1h"
          },
          {
            "selected": false,
            "text": "6h",
            "value": "6h"
          },
          {
            "selected": false,
            "text": "12h",
            "value": "12h"
          },
          {
            "selected": false,
            "text": "1d",
            "value": "1d"
          },
          {
            "selected": false,
            "text": "7d",
            "value": "7d"
          },
          {
            "selected": false,
            "text": "14d",
            "value": "14d"
          },
          {
            "selected": false,
            "text": "30d",
            "value": "30d"
          }
        ],
        "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d",
        "refresh": 2,
        "skipUrlSync": false,
        "type": "interval"
      },
      {
        "current": {
          "text": "prometheus",
          "value": "prometheus"
        },
        "hide": 0,
        "includeAll": false,
        "label": null,
        "multi": false,
        "name": "datasource",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      },
      {
        "allValue": ".*",
        "current": {
          "text": "All",
          "value": "$__all"
        },
        "datasource": "prometheus",
        "definition": "",
        "hide": 0,
        "includeAll": true,
        "label": null,
        "multi": false,
        "name": "Node",
        "options": [],
        "query": "label_values(kubernetes_io_hostname)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      }
    ]
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
     "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "browser",
  "title": "育苗通K8S集群監控",
  "uid": "6KoW2MIGk",
  "version": 15
}
View Code

 

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "【中文版本】2020.06.28更新,增加整體資源展示!支持 Grafana6&7,Node Exporter v0.16及以上的版本,優化重要指標展示。包含整體資源展示與資源明細圖表:CPU 內存 磁盤 IO 網絡等監控指標。https://github.com/starsliao/Prometheus",
  "editable": true,
  "gnetId": 8919,
  "graphTooltip": 0,
  "id": 72,
  "iteration": 1597137684806,
  "links": [
    {
      "icon": "external link",
      "tags": [],
      "targetBlank": true,
      "title": "更新node_exporter",
      "tooltip": "",
      "type": "link",
      "url": "https://github.com/prometheus/node_exporter/releases"
    },
    {
      "icon": "external link",
      "tags": [],
      "targetBlank": true,
      "title": "更新當前儀表板",
      "tooltip": "",
      "type": "link",
      "url": "https://grafana.com/dashboards/8919"
    },
    {
      "icon": "external link",
      "tags": [],
      "targetBlank": true,
      "title": "StarsL.cn",
      "tooltip": "",
      "type": "link",
      "url": "https://starsl.cn"
    },
    {
      "asDropdown": true,
      "icon": "external link",
      "tags": [],
      "targetBlank": true,
      "title": "",
      "type": "dashboards"
    }
  ],
  "panels": [
    {
      "collapsed": false,
      "datasource": "prometheus",
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 187,
      "panels": [],
      "title": "資源總覽(關聯JOB項)當前選中主機:【$show_hostname】實例:$node",
      "type": "row"
    },
    {
      "columns": [],
      "datasource": "prometheus",
      "description": "分區使用率、磁盤讀取、磁盤寫入、下載帶寬、上傳帶寬,如果有多個網卡或者多個分區,是采集的使用率最高的網卡或者分區的數值。",
      "fontSize": "100%",
      "gridPos": {
        "h": 12,
        "w": 24,
        "x": 0,
        "y": 1
      },
      "id": 185,
      "options": {},
      "pageSize": 10,
      "showHeader": true,
      "sort": {
        "col": 5,
        "desc": false
      },
      "styles": [
        {
          "alias": "主機名",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 1,
          "link": false,
          "linkTooltip": "",
          "linkUrl": "",
          "mappingType": 1,
          "pattern": "nodename",
          "thresholds": [],
          "type": "string",
          "unit": "bytes"
        },
        {
          "alias": "IP(鏈接到明細)",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "link": true,
          "linkTargetBlank": false,
          "linkTooltip": "瀏覽主機明細",
          "linkUrl": "/d/9CWBz0bik/node-exporter?orgId=1&var-job=${job}&var-hostname=All&var-node=${__cell}&var-device=All",
          "mappingType": 1,
          "pattern": "instance",
          "thresholds": [],
          "type": "number",
          "unit": "short"
        },
        {
          "alias": "內存",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "link": false,
          "mappingType": 1,
          "pattern": "Value #B",
          "thresholds": [],
          "type": "number",
          "unit": "bytes"
        },
        {
          "alias": "CPU核",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": null,
          "mappingType": 1,
          "pattern": "Value #C",
          "thresholds": [],
          "type": "number",
          "unit": "short"
        },
        {
          "alias": " 運行時間",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #D",
          "thresholds": [],
          "type": "number",
          "unit": "s"
        },
        {
          "alias": "分區使用率*",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #E",
          "thresholds": [
            "70",
            "85"
          ],
          "type": "number",
          "unit": "percent"
        },
        {
          "alias": "CPU使用率",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #F",
          "thresholds": [
            "70",
            "85"
          ],
          "type": "number",
          "unit": "percent"
        },
        {
          "alias": "內存使用率",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #G",
          "thresholds": [
            "70",
            "85"
          ],
          "type": "number",
          "unit": "percent"
        },
        {
          "alias": "磁盤讀取*",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #H",
          "thresholds": [
            "10485760",
            "20485760"
          ],
          "type": "number",
          "unit": "Bps"
        },
        {
          "alias": "磁盤寫入*",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #I",
          "thresholds": [
            "10485760",
            "20485760"
          ],
          "type": "number",
          "unit": "Bps"
        },
        {
          "alias": "下載帶寬*",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #J",
          "thresholds": [
            "30485760",
            "104857600"
          ],
          "type": "number",
          "unit": "bps"
        },
        {
          "alias": "上傳帶寬*",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #K",
          "thresholds": [
            "30485760",
            "104857600"
          ],
          "type": "number",
          "unit": "bps"
        },
        {
          "alias": "5m負載",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "Value #L",
          "thresholds": [],
          "type": "number",
          "unit": "short"
        },
        {
          "alias": "",
          "align": "right",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "decimals": 2,
          "pattern": "/.*/",
          "thresholds": [],
          "type": "hidden",
          "unit": "short"
        }
      ],
      "targets": [
        {
          "expr": "node_uname_info{job=~\"$job\"} - 0",
          "format": "table",
          "instant": true,
          "interval": "",
          "legendFormat": "主機名",
          "refId": "A"
        },
        {
          "expr": "sum(time() - node_boot_time_seconds{job=~\"$job\"})by(instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "運行時間",
          "refId": "D"
        },
        {
          "expr": "node_memory_MemTotal_bytes{job=~\"$job\"} - 0",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "總內存",
          "refId": "B"
        },
        {
          "expr": "count(node_cpu_seconds_total{job=~\"$job\",mode='system'}) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "總核數",
          "refId": "C"
        },
        {
          "expr": "node_load5{job=~\"$job\"}",
          "format": "table",
          "instant": true,
          "interval": "",
          "legendFormat": "5分鍾負載",
          "refId": "L"
        },
        {
          "expr": "(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "CPU使用率",
          "refId": "F"
        },
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / (node_memory_MemTotal_bytes{job=~\"$job\"})))* 100",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "內存使用率",
          "refId": "G"
        },
        {
          "expr": "max((node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}) *100/(node_filesystem_avail_bytes {job=~\"$job\",fstype=~\"ext.?|xfs\"}+(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"})))by(instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "分區使用率",
          "refId": "E"
        },
        {
          "expr": "max(irate(node_disk_read_bytes_total{job=~\"$job\"}[5m])) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "最大讀取",
          "refId": "H"
        },
        {
          "expr": "max(irate(node_disk_written_bytes_total{job=~\"$job\"}[5m])) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "最大寫入",
          "refId": "I"
        },
        {
          "expr": "max(irate(node_network_receive_bytes_total{job=~\"$job\"}[5m])*8) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "下載帶寬",
          "refId": "J"
        },
        {
          "expr": "max(irate(node_network_transmit_bytes_total{job=~\"$job\"}[5m])*8) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "上傳帶寬",
          "refId": "K"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "服務器資源總覽表(每頁10行)",
      "transform": "table",
      "type": "table"
    },
    {
      "aliasColors": {
        "192.168.200.241:9100_Total": "dark-red",
        "Idle - Waiting for something to happen": "#052B51",
        "guest": "#9AC48A",
        "idle": "#052B51",
        "iowait": "#EAB839",
        "irq": "#BF1B00",
        "nice": "#C15C17",
        "sdb_每秒I/O操作%": "#d683ce",
        "softirq": "#E24D42",
        "steal": "#FCE2DE",
        "system": "#508642",
        "user": "#5195CE",
        "磁盤花費在I/O操作占比": "#ba43a9"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": null,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 13
      },
      "hiddenSeries": false,
      "id": 191,
      "legend": {
        "alignAsTable": false,
        "avg": false,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": false,
        "min": false,
        "rightSide": false,
        "show": true,
        "sideWidth": null,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "maxPerRow": 6,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "repeat": null,
      "seriesOverrides": [
        {
          "alias": "總平均使用率",
          "lines": false,
          "pointradius": 1,
          "points": true,
          "yaxis": 2
        },
        {
          "alias": "總核數",
          "color": "#C4162A"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "count(node_cpu_seconds_total{job=~\"$job\", mode='system'})",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總核數",
          "refId": "B",
          "step": 240
        },
        {
          "expr": "sum(node_load5{job=~\"$job\"})",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總5分鍾負載",
          "refId": "A",
          "step": 240
        },
        {
          "expr": "avg(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100",
          "format": "time_series",
          "hide": false,
          "interval": "30m",
          "intervalFactor": 1,
          "legendFormat": "總平均使用率",
          "refId": "F",
          "step": 240
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "$job:整體總負載與整體平均CPU使用率",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": null,
          "format": "short",
          "label": "總負載",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "decimals": 0,
          "format": "percent",
          "label": "平均使用率",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "192.168.200.241:9100_總內存": "dark-red",
        "內存_Avaliable": "#6ED0E0",
        "內存_Cached": "#EF843C",
        "內存_Free": "#629E51",
        "內存_Total": "#6d1f62",
        "內存_Used": "#eab839",
        "可用": "#9ac48a",
        "總內存": "#bf1b00"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 1,
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 13
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 195,
      "legend": {
        "alignAsTable": false,
        "avg": false,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": false,
        "show": true,
        "sort": "current",
        "sortDesc": false,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "總內存",
          "color": "#C4162A",
          "fill": 0
        },
        {
          "alias": "總平均使用率",
          "lines": false,
          "pointradius": 1,
          "points": true,
          "yaxis": 2
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"})",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總內存",
          "refId": "A",
          "step": 4
        },
        {
          "expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"})",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總已用",
          "refId": "B",
          "step": 4
        },
        {
          "expr": "(sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"}) / sum(node_memory_MemTotal_bytes{job=~\"$job\"}))*100",
          "format": "time_series",
          "hide": false,
          "interval": "30m",
          "intervalFactor": 1,
          "legendFormat": "總平均使用率",
          "refId": "H"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "$job:整體總內存與整體平均內存使用率",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": null,
          "format": "bytes",
          "label": "總內存量",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "decimals": null,
          "format": "percent",
          "label": "平均使用率",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 1,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 13
      },
      "hiddenSeries": false,
      "id": 197,
      "legend": {
        "alignAsTable": false,
        "avg": false,
        "current": true,
        "hideEmpty": false,
        "hideZero": false,
        "max": false,
        "min": false,
        "rightSide": false,
        "show": true,
        "sideWidth": null,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "總平均使用率",
          "lines": false,
          "pointradius": 1,
          "points": true,
          "yaxis": 2
        },
        {
          "alias": "總磁盤量",
          "color": "#C4162A"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總磁盤量",
          "refId": "E"
        },
        {
          "expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總使用量",
          "refId": "C"
        },
        {
          "expr": "(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))))",
          "format": "time_series",
          "instant": false,
          "interval": "30m",
          "intervalFactor": 1,
          "legendFormat": "總平均使用率",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "$job:整體總磁盤與整體平均磁盤使用率",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": 1,
          "format": "bytes",
          "label": "總磁盤量",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "decimals": null,
          "format": "percent",
          "label": "平均使用率",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "collapsed": false,
      "datasource": "prometheus",
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 21
      },
      "id": 189,
      "panels": [],
      "title": "資源明細:【$show_hostname】",
      "type": "row"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorPostfix": false,
      "colorPrefix": false,
      "colorValue": true,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "datasource": "prometheus",
      "decimals": 0,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "format": "s",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "threshcisLabels": false,
        "threshcisMarkers": true
      },
      "gridPos": {
        "h": 2,
        "w": 2,
        "x": 0,
        "y": 22
      },
      "hideTimeOverride": true,
      "id": 15,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "null",
      "nullText": null,
      "options": {},
      "pluginVersion": "6.4.2",
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "avg(time() - node_boot_time_seconds{instance=~\"$node\"})",
          "format": "time_series",
          "hide": false,
          "instant": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 40
        }
      ],
      "threshciss": "1,2",
      "thresholds": "1,3",
      "title": "運行時間",
      "type": "singlestat",
      "valueFontSize": "70%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "datasource": "prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {},
          "decimals": 2,
          "displayName": "",
          "mappings": [
            {
              "from": "",
              "id": 1,
              "operator": "",
              "text": "N/A",
              "to": "",
              "type": 1,
              "value": "0"
            }
          ],
          "max": 100,
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 70
              },
              {
                "color": "#EAB839",
                "value": 90
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 6,
        "w": 3,
        "x": 2,
        "y": 22
      },
      "id": 177,
      "options": {
        "displayMode": "lcd",
        "fieldOptions": {
          "calcs": [
            "last"
          ],
          "defaults": {
            "decimals": 1,
            "mappings": [
              {
                "from": "",
                "id": 1,
                "operator": "",
                "text": "N/A",
                "to": "",
                "type": 1,
                "value": "0"
              }
            ],
            "max": 100,
            "min": 0.1,
            "thresholds": {
              "0": {
                "color": "green",
                "value": null
              },
              "1": {
                "color": "red",
                "value": 80
              },
              "mode": "absolute",
              "steps": [
                {
                  "color": "green",
                  "value": null
                },
                {
                  "color": "#EAB839",
                  "value": 70
                },
                {
                  "color": "red",
                  "value": 90
                }
              ]
            },
            "unit": "percent"
          },
          "override": {},
          "overrides": [],
          "values": false
        },
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "values": false
        },
        "showUnfilled": true
      },
      "pluginVersion": "6.4.3",
      "targets": [
        {
          "expr": "100 - (avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) * 100)",
          "instant": true,
          "interval": "",
          "legendFormat": "總CPU使用率",
          "refId": "A"
        },
        {
          "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100",
          "hide": true,
          "instant": true,
          "interval": "",
          "legendFormat": "IOwait使用率",
          "refId": "C"
        },
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100",
          "instant": true,
          "interval": "",
          "legendFormat": "內存使用率",
          "refId": "B"
        },
        {
          "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"})*100 /(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}))",
          "hide": false,
          "instant": true,
          "interval": "",
          "legendFormat": "最大分區({{mountpoint}})使用率",
          "refId": "D"
        },
        {
          "expr": "(1 - ((node_memory_SwapFree_bytes{instance=~\"$node\"} + 1)/ (node_memory_SwapTotal_bytes{instance=~\"$node\"} + 1))) * 100",
          "instant": true,
          "legendFormat": "交換分區使用率",
          "refId": "F"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "",
      "type": "bargauge"
    },
    {
      "columns": [],
      "datasource": "prometheus",
      "description": "本看板中的:磁盤總量、使用量、可用量、使用率保持和df命令的Size、Used、Avail、Use% 列的值一致,並且Use%的值會四舍五入保留一位小數,會更加准確。\n\n注:df中Use%算法為:(size - free) * 100 / (avail + (size - free)),結果是整除則為該值,非整除則為該值+1,結果的單位是%。\n參考df命令源碼:",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fontSize": "100%",
      "gridPos": {
        "h": 6,
        "w": 10,
        "x": 5,
        "y": 22
      },
      "id": 181,
      "links": [
        {
          "targetBlank": true,
          "title": "https://github.com/coreutils/coreutils/blob/master/src/df.c",
          "url": "https://github.com/coreutils/coreutils/blob/master/src/df.c"
        }
      ],
      "options": {},
      "pageSize": null,
      "scroll": true,
      "showHeader": true,
      "sort": {
        "col": 6,
        "desc": false
      },
      "styles": [
        {
          "alias": "分區",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "mappingType": 1,
          "pattern": "mountpoint",
          "thresholds": [
            ""
          ],
          "type": "string",
          "unit": "bytes"
        },
        {
          "alias": "可用空間",
          "align": "auto",
          "colorMode": "value",
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 1,
          "mappingType": 1,
          "pattern": "Value #A",
          "thresholds": [
            "10000000000",
            "20000000000"
          ],
          "type": "number",
          "unit": "bytes"
        },
        {
          "alias": "使用率",
          "align": "auto",
          "colorMode": "cell",
          "colors": [
            "rgba(50, 172, 45, 0.97)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(245, 54, 54, 0.9)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 1,
          "mappingType": 1,
          "pattern": "Value #B",
          "thresholds": [
            "70",
            "85"
          ],
          "type": "number",
          "unit": "percent"
        },
        {
          "alias": "總空間",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 0,
          "link": false,
          "mappingType": 1,
          "pattern": "Value #C",
          "thresholds": [],
          "type": "number",
          "unit": "bytes"
        },
        {
          "alias": "文件系統",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "link": false,
          "mappingType": 1,
          "pattern": "fstype",
          "thresholds": [],
          "type": "string",
          "unit": "short"
        },
        {
          "alias": "設備名",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "dateFormat": "YYYY-MM-DD HH:mm:ss",
          "decimals": 2,
          "link": false,
          "mappingType": 1,
          "pattern": "device",
          "preserveFormat": false,
          "sanitize": false,
          "thresholds": [],
          "type": "string",
          "unit": "short"
        },
        {
          "alias": "",
          "align": "auto",
          "colorMode": null,
          "colors": [
            "rgba(245, 54, 54, 0.9)",
            "rgba(237, 129, 40, 0.89)",
            "rgba(50, 172, 45, 0.97)"
          ],
          "decimals": 2,
          "pattern": "/.*/",
          "preserveFormat": true,
          "sanitize": false,
          "thresholds": [],
          "type": "hidden",
          "unit": "short"
        }
      ],
      "targets": [
        {
          "expr": "node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總量",
          "refId": "C"
        },
        {
          "expr": "node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "10s",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A"
        },
        {
          "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))",
          "format": "table",
          "hide": false,
          "instant": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "B"
        }
      ],
      "title": "【$show_hostname】:各分區可用空間(EXT.*/XFS)",
      "transform": "table",
      "type": "table"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": true,
      "colors": [
        "rgba(50, 172, 45, 0.97)",
        "rgba(237, 129, 40, 0.89)",
        "#d44a3a"
      ],
      "datasource": "prometheus",
      "decimals": 2,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "format": "percent",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 2,
        "w": 2,
        "x": 15,
        "y": 22
      },
      "id": 20,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "options": {},
      "pluginVersion": "6.4.2",
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": true,
        "lineColor": "#3274D9",
        "show": true,
        "ymax": null,
        "ymin": null
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 20
        }
      ],
      "thresholds": "20,50",
      "timeFrom": null,
      "timeShift": null,
      "title": "CPU iowait",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "avg"
    },
    {
      "aliasColors": {
        "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in": "light-red",
        "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in下載": "green",
        "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_out上傳": "yellow",
        "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_in下載": "purple",
        "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out": "purple",
        "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out上傳": "blue"
      },
      "bars": true,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "editable": true,
      "error": false,
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "grid": {},
      "gridPos": {
        "h": 6,
        "w": 7,
        "x": 17,
        "y": 22
      },
      "hiddenSeries": false,
      "id": 183,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": false,
        "show": false,
        "sort": "current",
        "sortDesc": true,
        "total": true,
        "values": true
      },
      "lines": false,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null as zero",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 1,
      "points": false,
      "renderer": "flot",
      "repeat": null,
      "seriesOverrides": [
        {
          "alias": "/.*_out上傳$/",
          "transform": "negative-Y"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "increase(node_network_receive_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])",
          "interval": "60m",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_in下載",
          "metric": "",
          "refId": "A",
          "step": 600,
          "target": ""
        },
        {
          "expr": "increase(node_network_transmit_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])",
          "hide": false,
          "interval": "60m",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_out上傳",
          "refId": "B",
          "step": 600
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "每小時流量$device",
      "tooltip": {
        "msResolution": false,
        "shared": true,
        "sort": 0,
        "value_type": "cumulative"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "bytes",
          "label": "上傳(-)/下載(+)",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorPostfix": false,
      "colorValue": true,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "datasource": "prometheus",
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "format": "short",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 2,
        "w": 2,
        "x": 0,
        "y": 24
      },
      "id": 14,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "maxPerRow": 6,
      "nullPointMode": "null",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "count(node_cpu_seconds_total{instance=~\"$node\", mode='system'})",
          "format": "time_series",
          "instant": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 20
        }
      ],
      "thresholds": "1,2",
      "title": "CPU 核數",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorPostfix": false,
      "colorValue": true,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "datasource": "prometheus",
      "decimals": null,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "format": "short",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 2,
        "w": 2,
        "x": 15,
        "y": 24
      },
      "id": 179,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "maxPerRow": 6,
      "nullPointMode": "null",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "avg(node_filesystem_files_free{instance=~\"$node\",mountpoint=\"$maxmount\",fstype=~\"ext.?|xfs\"})",
          "format": "time_series",
          "instant": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 20
        }
      ],
      "thresholds": "100000,1000000",
      "title": "剩余節點數:$maxmount ",
      "type": "singlestat",
      "valueFontSize": "70%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": true,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "datasource": "prometheus",
      "decimals": 0,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "format": "bytes",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 2,
        "w": 2,
        "x": 0,
        "y": 26
      },
      "id": 75,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "maxPerRow": 6,
      "nullPointMode": "null",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "70%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "sum(node_memory_MemTotal_bytes{instance=~\"$node\"})",
          "format": "time_series",
          "instant": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "A",
          "step": 20
        }
      ],
      "thresholds": "2,3",
      "title": "總內存",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorPostfix": false,
      "colorValue": true,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "datasource": "prometheus",
      "decimals": null,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "format": "locale",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 2,
        "w": 2,
        "x": 15,
        "y": 26
      },
      "id": 178,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "maxPerRow": 6,
      "nullPointMode": "null",
      "nullText": null,
      "options": {},
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "avg(node_filefd_maximum{instance=~\"$node\"})",
          "format": "time_series",
          "instant": true,
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A",
          "step": 20
        }
      ],
      "thresholds": "1024,10000",
      "title": "總文件描述符",
      "type": "singlestat",
      "valueFontSize": "70%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "aliasColors": {
        "192.168.200.241:9100_Total": "dark-red",
        "Idle - Waiting for something to happen": "#052B51",
        "guest": "#9AC48A",
        "idle": "#052B51",
        "iowait": "#EAB839",
        "irq": "#BF1B00",
        "nice": "#C15C17",
        "sdb_每秒I/O操作%": "#d683ce",
        "softirq": "#E24D42",
        "steal": "#FCE2DE",
        "system": "#508642",
        "user": "#5195CE",
        "磁盤花費在I/O操作占比": "#ba43a9"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 28
      },
      "hiddenSeries": false,
      "id": 7,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "rightSide": false,
        "show": true,
        "sideWidth": null,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "maxPerRow": 6,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "repeat": null,
      "seriesOverrides": [
        {
          "alias": "/.*總使用率/",
          "color": "#C4162A",
          "fill": 0
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"system\"}[5m])) by (instance) *100",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "系統使用率",
          "refId": "A",
          "step": 20
        },
        {
          "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"user\"}[5m])) by (instance) *100",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "用戶使用率",
          "refId": "B",
          "step": 240
        },
        {
          "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) by (instance) *100",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "磁盤IO使用率",
          "refId": "D",
          "step": 240
        },
        {
          "expr": "(1 - avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) by (instance))*100",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總使用率",
          "refId": "F",
          "step": 240
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "CPU使用率",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": 0,
          "format": "percent",
          "label": "",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "192.168.200.241:9100_總內存": "dark-red",
        "使用率": "yellow",
        "內存_Avaliable": "#6ED0E0",
        "內存_Cached": "#EF843C",
        "內存_Free": "#629E51",
        "內存_Total": "#6d1f62",
        "內存_Used": "#eab839",
        "可用": "#9ac48a",
        "總內存": "#bf1b00"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 28
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 156,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "rightSide": false,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "總內存",
          "color": "#C4162A",
          "fill": 0
        },
        {
          "alias": "使用率",
          "color": "rgb(0, 209, 255)",
          "lines": false,
          "pointradius": 1,
          "points": true,
          "yaxis": 2
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"}",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "總內存",
          "refId": "A",
          "step": 4
        },
        {
          "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemAvailable_bytes{instance=~\"$node\"}",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "已用",
          "refId": "B",
          "step": 4
        },
        {
          "expr": "node_memory_MemAvailable_bytes{instance=~\"$node\"}",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "可用",
          "refId": "F",
          "step": 4
        },
        {
          "expr": "node_memory_Buffers_bytes{instance=~\"$node\"}",
          "format": "time_series",
          "hide": true,
          "intervalFactor": 1,
          "legendFormat": "內存_Buffers",
          "refId": "D",
          "step": 4
        },
        {
          "expr": "node_memory_MemFree_bytes{instance=~\"$node\"}",
          "format": "time_series",
          "hide": true,
          "intervalFactor": 1,
          "legendFormat": "內存_Free",
          "refId": "C",
          "step": 4
        },
        {
          "expr": "node_memory_Cached_bytes{instance=~\"$node\"}",
          "format": "time_series",
          "hide": true,
          "intervalFactor": 1,
          "legendFormat": "內存_Cached",
          "refId": "E",
          "step": 4
        },
        {
          "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - (node_memory_Cached_bytes{instance=~\"$node\"} + node_memory_Buffers_bytes{instance=~\"$node\"} + node_memory_MemFree_bytes{instance=~\"$node\"})",
          "format": "time_series",
          "hide": true,
          "intervalFactor": 1,
          "refId": "G"
        },
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100",
          "format": "time_series",
          "hide": false,
          "interval": "30m",
          "intervalFactor": 10,
          "legendFormat": "使用率",
          "refId": "H"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "內存信息",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "bytes",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "format": "percent",
          "label": "內存使用率",
          "logBase": 1,
          "max": "100",
          "min": "0",
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "192.168.10.227:9100_em1_in下載": "super-light-green",
        "192.168.10.227:9100_em1_out上傳": "dark-blue"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 28
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 157,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "rightSide": false,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/.*_out上傳$/",
          "transform": "negative-Y"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "irate(node_network_receive_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_in下載",
          "refId": "A",
          "step": 4
        },
        {
          "expr": "irate(node_network_transmit_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_out上傳",
          "refId": "B",
          "step": 4
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "每秒網絡帶寬使用$device",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "bps",
          "label": "上傳(-)/下載(+)",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "15分鍾": "#6ED0E0",
        "1分鍾": "#BF1B00",
        "5分鍾": "#CCA300"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "editable": true,
      "error": false,
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 1,
      "grid": {},
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 36
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 13,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "rightSide": false,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "maxPerRow": 6,
      "nullPointMode": "null as zero",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "repeat": null,
      "seriesOverrides": [
        {
          "alias": "/.*總核數/",
          "color": "#C4162A"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "node_load1{instance=~\"$node\"}",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "1分鍾負載",
          "metric": "",
          "refId": "A",
          "step": 20,
          "target": ""
        },
        {
          "expr": "node_load5{instance=~\"$node\"}",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "5分鍾負載",
          "refId": "B",
          "step": 20
        },
        {
          "expr": "node_load15{instance=~\"$node\"}",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "15分鍾負載",
          "refId": "C",
          "step": 20
        },
        {
          "expr": " sum(count(node_cpu_seconds_total{instance=~\"$node\", mode='system'}) by (cpu,instance)) by(instance)",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "CPU總核數",
          "refId": "D",
          "step": 20
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "系統平均負載",
      "tooltip": {
        "msResolution": false,
        "shared": true,
        "sort": 2,
        "value_type": "cumulative"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "vda_write": "#6ED0E0"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "description": "Read bytes 每個磁盤分區每秒讀取的比特數\nWritten bytes 每個磁盤分區每秒寫入的比特數",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 1,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 36
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 168,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/.*_讀取$/",
          "transform": "negative-Y"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "irate(node_disk_read_bytes_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_讀取",
          "refId": "A",
          "step": 10
        },
        {
          "expr": "irate(node_disk_written_bytes_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_寫入",
          "refId": "B",
          "step": 10
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "每秒磁盤讀寫容量",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": null,
          "format": "Bps",
          "label": "讀取(-)/寫入(+)",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 1,
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 36
      },
      "hiddenSeries": false,
      "id": 174,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "rightSide": false,
        "show": true,
        "sideWidth": null,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/Inodes.*/",
          "yaxis": 2
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{mountpoint}}",
          "refId": "A"
        },
        {
          "expr": "node_filesystem_files_free{instance=~'$node',fstype=~\"ext.?|xfs\"} / node_filesystem_files{instance=~'$node',fstype=~\"ext.?|xfs\"}",
          "hide": true,
          "interval": "",
          "legendFormat": "Inodes:{{instance}}:{{mountpoint}}",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "磁盤使用率",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": null,
          "format": "percent",
          "label": "",
          "logBase": 1,
          "max": "100",
          "min": "0",
          "show": true
        },
        {
          "decimals": 2,
          "format": "percentunit",
          "label": null,
          "logBase": 1,
          "max": "1",
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "vda_write": "#6ED0E0"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "description": "Reads completed: 每個磁盤分區每秒讀完成次數\n\nWrites completed: 每個磁盤分區每秒寫完成次數\n\nIO now 每個磁盤分區每秒正在處理的輸入/輸出請求數",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 8,
        "x": 0,
        "y": 44
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 161,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/.*_讀取$/",
          "transform": "negative-Y"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_讀取",
          "refId": "A",
          "step": 10
        },
        {
          "expr": "irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_寫入",
          "refId": "B",
          "step": 10
        },
        {
          "expr": "node_disk_io_now{instance=~\"$node\"}",
          "format": "time_series",
          "hide": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}",
          "refId": "C"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "磁盤讀寫速率(IOPS)",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": null,
          "format": "iops",
          "label": "讀取(-)/寫入(+)I/O ops/sec",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "Idle - Waiting for something to happen": "#052B51",
        "guest": "#9AC48A",
        "idle": "#052B51",
        "iowait": "#EAB839",
        "irq": "#BF1B00",
        "nice": "#C15C17",
        "sdb_每秒I/O操作%": "#d683ce",
        "softirq": "#E24D42",
        "steal": "#FCE2DE",
        "system": "#508642",
        "user": "#5195CE",
        "磁盤花費在I/O操作占比": "#ba43a9"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": null,
      "description": "每一秒鍾的自然時間內,花費在I/O上的耗時。(wall-clock time)\n\nnode_disk_io_time_seconds_total:\n磁盤花費在輸入/輸出操作上的秒數。該值為累加值。(Milliseconds Spent Doing I/Os)\n\nirate(node_disk_io_time_seconds_total[1m]):\n計算每秒的速率:(last值-last前一個值)/時間戳差值,即:1秒鍾內磁盤花費在I/O操作的時間占比。",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 8,
        "x": 8,
        "y": 44
      },
      "hiddenSeries": false,
      "id": 175,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": false,
        "rightSide": false,
        "show": true,
        "sideWidth": null,
        "sort": null,
        "sortDesc": null,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "maxPerRow": 6,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_每秒I/O操作%",
          "refId": "C"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "每1秒內I/O操作耗時占比",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "decimals": null,
          "format": "percentunit",
          "label": "",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "vda": "#6ED0E0"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "description": "Read time seconds 每個磁盤分區讀操作花費的秒數\n\nWrite time seconds 每個磁盤分區寫操作花費的秒數\n\nIO time seconds 每個磁盤分區輸入/輸出操作花費的秒數\n\nIO time weighted seconds每個磁盤分區輸入/輸出操作花費的加權秒數",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 1,
      "gridPos": {
        "h": 9,
        "w": 8,
        "x": 16,
        "y": 44
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 160,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": true,
        "show": true,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null as zero",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/,*_讀取$/",
          "transform": "negative-Y"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "irate(node_disk_read_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_讀取",
          "refId": "B"
        },
        {
          "expr": "irate(node_disk_write_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_寫入",
          "refId": "C"
        },
        {
          "expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}",
          "refId": "A",
          "step": 10
        },
        {
          "expr": "irate(node_disk_io_time_weighted_seconds_total{instance=~\"$node\"}[5m])",
          "format": "time_series",
          "hide": true,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{device}}_加權",
          "refId": "D"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "每次IO讀寫的耗時(參考:小於100ms)(beta)",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "s",
          "label": "讀取(-)/寫入(+)",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "192.168.200.241:9100_TCP_alloc": "semi-dark-blue",
        "TCP": "#6ED0E0",
        "TCP_alloc": "blue"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "decimals": 2,
      "description": "Sockets_used - 已使用的所有協議套接字總量\n\nCurrEstab - 當前狀態為 ESTABLISHED 或 CLOSE-WAIT 的 TCP 連接數\n\nTCP_alloc - 已分配(已建立、已申請到sk_buff)的TCP套接字數量\n\nTCP_tw - 等待關閉的TCP連接數\n\nUDP_inuse - 正在使用的 UDP 套接字數量\n\nRetransSegs - TCP 重傳報文數\n\nOutSegs - TCP 發送的報文數\n\nInSegs - TCP 接收的報文數",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 16,
        "x": 0,
        "y": 53
      },
      "height": "300",
      "hiddenSeries": false,
      "id": 158,
      "interval": "",
      "legend": {
        "alignAsTable": true,
        "avg": false,
        "current": true,
        "hideEmpty": true,
        "hideZero": true,
        "max": true,
        "min": false,
        "rightSide": true,
        "show": true,
        "sideWidth": null,
        "sort": "current",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/.*Sockets_used/",
          "color": "#E02F44",
          "lines": false,
          "pointradius": 1,
          "points": true,
          "yaxis": 2
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "node_netstat_Tcp_CurrEstab{instance=~'$node'}",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "CurrEstab",
          "refId": "A",
          "step": 20
        },
        {
          "expr": "node_sockstat_TCP_tw{instance=~'$node'}",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "TCP_tw",
          "refId": "D"
        },
        {
          "expr": "node_sockstat_sockets_used{instance=~'$node'}",
          "hide": false,
          "interval": "30m",
          "intervalFactor": 1,
          "legendFormat": "Sockets_used",
          "refId": "B"
        },
        {
          "expr": "node_sockstat_UDP_inuse{instance=~'$node'}",
          "interval": "",
          "legendFormat": "UDP_inuse",
          "refId": "C"
        },
        {
          "expr": "node_sockstat_TCP_alloc{instance=~'$node'}",
          "interval": "",
          "legendFormat": "TCP_alloc",
          "refId": "E"
        },
        {
          "expr": "irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])",
          "hide": true,
          "interval": "",
          "legendFormat": "{{instance}}_Tcp_PassiveOpens",
          "refId": "G"
        },
        {
          "expr": "irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])",
          "hide": true,
          "interval": "",
          "legendFormat": "{{instance}}_Tcp_ActiveOpens",
          "refId": "F"
        },
        {
          "expr": "irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])",
          "interval": "",
          "legendFormat": "Tcp_InSegs",
          "refId": "H"
        },
        {
          "expr": "irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])",
          "interval": "",
          "legendFormat": "Tcp_OutSegs",
          "refId": "I"
        },
        {
          "expr": "irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])",
          "hide": false,
          "interval": "",
          "legendFormat": "Tcp_RetransSegs",
          "refId": "J"
        },
        {
          "expr": "irate(node_netstat_TcpExt_ListenDrops{instance=~'$node'}[5m])",
          "hide": true,
          "interval": "",
          "legendFormat": "",
          "refId": "K"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "網絡Socket連接信息",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "transformations": [],
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": "已使用的所有協議套接字總量",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "filefd_192.168.200.241:9100": "super-light-green",
        "switches_192.168.200.241:9100": "semi-dark-red",
        "使用的文件描述符_10.118.72.128:9100": "red",
        "每秒上下文切換次數_10.118.71.245:9100": "yellow",
        "每秒上下文切換次數_10.118.72.128:9100": "yellow"
      },
      "bars": false,
      "cacheTimeout": null,
      "dashLength": 10,
      "dashes": false,
      "datasource": "prometheus",
      "description": "",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 0,
      "fillGradient": 1,
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 53
      },
      "hiddenSeries": false,
      "hideTimeOverride": false,
      "id": 16,
      "legend": {
        "alignAsTable": false,
        "avg": false,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": false,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pluginVersion": "6.4.2",
      "pointradius": 1,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/每秒上下文切換次數.*/",
          "color": "#FADE2A",
          "lines": false,
          "pointradius": 1,
          "points": true,
          "yaxis": 2
        },
        {
          "alias": "/使用的文件描述符.*/",
          "color": "#F2495C"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "node_filefd_allocated{instance=~\"$node\"}",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "intervalFactor": 5,
          "legendFormat": "使用的文件描述符",
          "refId": "B"
        },
        {
          "expr": "irate(node_context_switches_total{instance=~\"$node\"}[5m])",
          "interval": "",
          "intervalFactor": 5,
          "legendFormat": "每秒上下文切換次數",
          "refId": "A"
        },
        {
          "expr": "  (node_filefd_allocated{instance=~\"$node\"}/node_filefd_maximum{instance=~\"$node\"}) *100",
          "format": "time_series",
          "hide": true,
          "instant": false,
          "interval": "",
          "intervalFactor": 5,
          "legendFormat": "使用的文件描述符占比_{{instance}}",
          "refId": "C"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "打開的文件描述符(左 )/每秒上下文切換次數(右)",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "使用的文件描述符",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": "context_switches",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "refresh": "",
  "schemaVersion": 20,
  "style": "dark",
  "tags": [
    "Prometheus",
    "node_exporter"
  ],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {
          "tags": [],
          "text": "node-exporter",
          "value": "node-exporter"
        },
        "datasource": "prometheus",
        "definition": "label_values(node_uname_info, job)",
        "hide": 0,
        "includeAll": false,
        "index": -1,
        "label": "JOB",
        "multi": false,
        "name": "job",
        "options": [],
        "query": "label_values(node_uname_info, job)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 5,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "text": "All",
          "value": "$__all"
        },
        "datasource": "prometheus",
        "definition": "label_values(node_uname_info{job=~\"$job\"}, nodename)",
        "hide": 0,
        "includeAll": true,
        "index": -1,
        "label": "主機名",
        "multi": false,
        "name": "hostname",
        "options": [],
        "query": "label_values(node_uname_info{job=~\"$job\"}, nodename)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 5,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allFormat": "glob",
        "allValue": null,
        "current": {
          "text": "ymt108",
          "value": "ymt108"
        },
        "datasource": "prometheus",
        "definition": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)",
        "hide": 0,
        "includeAll": false,
        "index": -1,
        "label": "Instance",
        "multi": true,
        "multiFormat": "regex values",
        "name": "node",
        "options": [],
        "query": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 5,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allFormat": "glob",
        "allValue": null,
        "current": {
          "text": "All",
          "value": "$__all"
        },
        "datasource": "prometheus",
        "definition": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)",
        "hide": 0,
        "includeAll": true,
        "index": -1,
        "label": "網卡",
        "multi": true,
        "multiFormat": "regex values",
        "name": "device",
        "options": [],
        "query": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "text": "/",
          "value": "/"
        },
        "datasource": "prometheus",
        "definition": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))",
        "hide": 2,
        "includeAll": false,
        "index": -1,
        "label": "最大掛載目錄",
        "multi": false,
        "name": "maxmount",
        "options": [],
        "query": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))",
        "refresh": 2,
        "regex": "/.*\\\"(.*)\\\".*/",
        "skipUrlSync": false,
        "sort": 5,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "text": "ymt108",
          "value": "ymt108"
        },
        "datasource": "prometheus",
        "definition": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)",
        "hide": 2,
        "includeAll": false,
        "index": -1,
        "label": "展示使用的主機名",
        "multi": false,
        "name": "show_hostname",
        "options": [],
        "query": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 5,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      }
    ]
  },
  "time": {
    "from": "now-12h",
    "to": "now"
  },
  "timepicker": {
    "hidden": false,
    "now": true,
    "refresh_intervals": [
      "15s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "browser",
  "title": "育苗通Node資源監控",
  "uid": "hb7fSE0Zz",
  "version": 11
}
View Code

當然,默認還內置了很多k8s相關的資源監控模板。

更多監控模板:https://grafana.com/grafana/dashboards


九、匯總

特殊說明1:我們還可以自定義etcd監控,詳情可參考:https://www.jianshu.com/p/2fbbe767870d

  • 第一步:建立一個 ServiceMonitor 對象,用於 Prometheus 添加監控項;
  • 第二步:為 ServiceMonitor 對象關聯 metrics 數據接口的一個 Service 對象;
  • 第三步:確保 Service 對象可以正確獲取到 metrics 數據。

特殊說明2:部署kube-prometheus可能會出現無法連接apiserver問題,詳情可參考:【解決】Error from server (ServiceUnavailable): the server is currently unable to handle the request

當我們完成了所有配置, 那接下來還需要整理一下,編寫升級腳本upgrade.sh,方便之后部署,以及修改更新。

#!/bin/sh

# deploy kubernetes service
kubectl apply -f prometheus-kubeControllerManagerService.yaml
kubectl apply -f prometheus-kubeSchedulerService.yaml

# upgrade alertmanager configuration
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring --dry-run -oyaml > alertmanager-secret.yaml
kubectl apply -f alertmanager-secret.yaml

# upgrade scrape configs
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring --dry-run -oyaml > additional-scrape-configs.yaml
kubectl apply -f additional-scrape-configs.yaml

# upgrade prometheus rules
kubectl apply -f prometheus-additional-rules.yaml
kubectl apply -f prometheus-rules.yaml

# upgrade prometheus configuration
kubectl apply -f prometheus-prometheus.yaml

# upgrade grafana configuration
kubectl apply -f grafana-deployment.yaml

 

 十、定制化配置

1、namespace定制化

如果使用原生kube-prometheus 安裝的 kube-prometheus 只會監控 default kube-system monitoring(kube-prometheus 自己創建的ns)三個命名空間,而如果想要添加其他的命名空間,就需要定制化kube-prometheus

定制步驟

自動化腳本如下

# 所有操作均使用root賬戶
#安裝golang並設置環境變量
dnf install -y git
curl -O https://dl.google.com/go/go1.15.2.linux-amd64.tar.gz
tar -xf go1.15.2.linux-amd64.tar.gz
mv go /usr/local
echo >> "export PATH=$PATH:/usr/local/go/bin" ~/.bashrc
echo >> "export PATH=$PATH:/root/go/bin" ~/.bashrc
echo >> "export GOPATH=/root/go" ~/.bashrc
echo >> "export GO111MODULE="on"" ~/.bashrc  
source ~/.bashrc
# 安裝jsonnet 和 jb 工具 定制化 kube-prometheus 需要用到
#安裝json-bundle
go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb
#安裝jsonnet
go get github.com/google/go-jsonnet/cmd/jsonnet
#clone kube-prometheus
git clone https://github.com/prometheus-operator/kube-prometheus.git
mkdir my-kube-prometheus
# 把clone下來的代碼復制一遍,因為定制化代碼,不會生成CRD,但是會把manifests下的文件都刪掉
cp -r kube-prometheus/* my-kube-prometheus
cd my-kube-prometheus
# 安裝必要的jsonnet依賴庫
jb install github.com/prometheus-operator/kube-prometheus/jsonnet/kube-prometheus
jb update
cat add-namespace.yaml << EOF
local kp = (import 'kube-prometheus/kube-prometheus.libsonnet') + {
  _config+:: {
    namespace: 'monitoring',

    prometheus+:: {
      namespaces+: ['default', 'kube-system','monitoring','rook-ceph'],
    },
  },
};

{ ['00namespace-' + name]: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ ['0prometheus-operator-' + name]: kp.prometheusOperator[name] for name in std.objectFields(kp.prometheusOperator) } +
{ ['node-exporter-' + name]: kp.nodeExporter[name] for name in std.objectFields(kp.nodeExporter) } +
{ ['kube-state-metrics-' + name]: kp.kubeStateMetrics[name] for name in std.objectFields(kp.kubeStateMetrics) } +
{ ['alertmanager-' + name]: kp.alertmanager[name] for name in std.objectFields(kp.alertmanager) } +
{ ['prometheus-' + name]: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ ['grafana-' + name]: kp.grafana[name] for name in std.objectFields(kp.grafana) }
EOF

如果要除了添加對應的namespace 還要添加對應點 servicemonitor,參看這里

生成客戶化的kube-prometheus

./build.sh add-namespace.yaml
然后就可以看到

[root@k8smaster my-kube-prometheus]# ./build.sh add-namespace.jsonnet 
+ set -o pipefail
++ pwd
+ PATH=/root/my-kube-prometheus/tmp/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/go/bin:/usr/local/go/bin:/usr/local/go/bin:/usr/local/go/bin:/root/go/bin
+ rm -rf manifests
+ mkdir -p manifests/setup
+ jsonnet -J vendor -m manifests add-namespace.jsonnet
+ xargs '-I{}' sh -c 'cat {} | gojsontoyaml > {}.yaml' -- '{}'
+ find manifests -type f '!' -name '*.yaml' -delete
+ rm -f kustomization

 

最后通過對比,發現生成出來的文件,只有

  • prometheus-roleSpecificNamespaces.yaml
  • prometheus-roleBindingSpecificNamespaces.yaml 發生了變化

變化內容如下
prometheus-roleBindingSpecificNamespaces.yaml

- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: prometheus-k8s
    namespace: default
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: prometheus-k8s
  subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: prometheus-k8s
    namespace: kube-system
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: prometheus-k8s
  subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: prometheus-k8s
    namespace: monitoring
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: prometheus-k8s
  subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: prometheus-k8s
    namespace: rook-ceph
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: prometheus-k8s
  subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: monitoring

 

prometheus-roleSpecificNamespaces.yaml

- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: prometheus-k8s
    namespace: default
  rules:
  - apiGroups:
    - ""
    resources:
    - services
    - endpoints
    - pods
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - extensions
    resources:
    - ingresses
    verbs:
    - get
    - list
    - watch
- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: prometheus-k8s
    namespace: kube-system
  rules:
  - apiGroups:
    - ""
    resources:
    - services
    - endpoints
    - pods
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - extensions
    resources:
    - ingresses
    verbs:
    - get
    - list
    - watch
- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: prometheus-k8s
    namespace: monitoring
  rules:
  - apiGroups:
    - ""
    resources:
    - services
    - endpoints
    - pods
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - extensions
    resources:
    - ingresses
    verbs:
    - get
    - list
    - watch
- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: prometheus-k8s
    namespace: rook-ceph
  rules:
  - apiGroups:
    - ""
    resources:
    - services
    - endpoints
    - pods
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - extensions
    resources:
    - ingresses
    verbs:
    - get
    - list
    - watch

 ps:不執行腳本也可以直接修改這兩個文件,修改完記得apply

2、修改Prometheus 配置項

 

在容器化編排工具 Kubernetes 集群中配置Prometheus 服務時,一般使用 kube-prometheus 這個項目進行部署。Kube-prometheus 的使用方式,在項目的說明文檔中都有介紹,這里就不贅述了。
Kube-prometheus 項目中對其他資源對象的管理是通過 Prometheus-Operator 項目實例來實現的;所以在管理和更新 Prometheus 服務是需要通過 Operator 來實現。
我們在使用 Kube-prometheus 項目配置 Prometheus 監控后,難免會進行一些調整,增加一些自定義的配置,這時候我們就需要更深入了解 Kube-prometheus 項目是如何配置 Prometheus 服務,並給我們預設什么樣的調整配置方式。

Prometheus 服務配置文件添加配置項

kube-prometheus 項目給我們提供了一份默認的 Prometheus 配置文件內容;在默認的環境中,這份默認配置已經很夠用的了。
在項目維護過程中,我們難免會增加新的監控對象(Target);這是我們改如果修改 Prometheus 服務的配置文件,來實現目的吶?

1.1、修改Prometheus 服務配置文件對象(錯誤)

Prometheus 服務的配置文件由 prometheus-k8s 名稱的 Secret 對象存儲;一般情況下,只需要修改這個對象,既可以完成配置項修改的操作。但是這個對象無法修改,所以這個方式不可行。

1.2、修改Prometheus 服務的Statefulset 對象編輯文件(錯誤)

配置文件不能修改,那就新創建一個配置文件對象;讓 Prometheus 服務引用新的配置文件對象;也可以實現更新配置信息的目的。
修改 prometheus-k8s 名稱的 statefulset 對象,使其掛載的配置文件對象為新的配置文件。但是發現 statefulset 對象也是不能修改的;所以這個方式不可行。

1.3、官方提供的修改方式

Kube-prometheus 項目提供的修改 Prometheus 配置文件的方式是新編輯一個附加配置文件。
https://github.com/prometheus-operator/prometheus-operator/tree/master/example/additional-scrape-configs

具體實現查看第五步,其過程主要包括

  1. 編寫一個附加配置文件(prometheus-additional.yaml),用於擴展 Prometheus 配置。
  2. 利用附加配置文件生產一個 Secret 配置對象(additional-scrape-configs.yaml)。
  3. 修改 Prometheus CRD 對象實例,添加附加配置項的配置信息(prometheus.yaml)。
  4. 其他關於角色、綁定和賬號的配置一般可以忽略。

 

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM