Deploying Kubernetes Monitoring with kube-prometheus (Latest Version)


The Kubernetes release line has reached 1.20.x. I used the holiday break to build a cluster on the latest k8s v1.20.2; as of this writing, the newest upstream release is already v1.20.4.

1. Overview

1.1 Ways to deploy Prometheus monitoring in Kubernetes

There are three common ways to deploy Prometheus monitoring in a Kubernetes cluster:

  • Manual deployment from plain YAML manifests
  • Deployment via the Prometheus Operator
  • Deployment via a Helm chart

1.2 What is the Prometheus Operator?

The Prometheus Operator is essentially a set of custom resource definitions (CRDs) plus a controller that implements them. The Operator watches these custom resources for changes and, based on their definitions, automates the management of Prometheus Server itself and of its configuration. The Operator's architecture is shown below.

Image source:

https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/master/Documentation/user-guides/images/architecture.png

1.3 Why use the Prometheus Operator?

Prometheus itself exposes no API for managing its configuration (in particular, scrape targets and alerting rules), nor any convenient way to manage multiple instances, so this work has traditionally required custom code or scripts. To reduce that operational complexity, CoreOS pioneered the Operator concept, first releasing the etcd Operator for running and managing etcd on Kubernetes, and following up with the Prometheus Operator.
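
For example, once the Operator is installed, adding a scrape target means creating a ServiceMonitor object instead of editing prometheus.yml by hand. A minimal sketch (the application name, label, and port name here are hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical application
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app         # scrape Services carrying this label
  endpoints:
  - port: web                  # named port on the target Service
    interval: 30s              # scrape interval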

1.4 About the kube-prometheus project

Official prometheus-operator repository: https://github.com/prometheus-operator/prometheus-operator
Official kube-prometheus repository: https://github.com/prometheus-operator/kube-prometheus

Relationship between the two: the former contains only the Prometheus Operator itself, while the latter bundles the Operator together with deployment manifests for the rest of the Prometheus stack and a set of ready-made monitoring configurations. Concretely, kube-prometheus includes the following components:

  • The Prometheus Operator: manages the stack via CRD-defined resource objects
  • Highly available Prometheus: a highly available Prometheus deployment
  • Highly available Alertmanager: a highly available alerting component
  • Prometheus node-exporter: host-level metrics collection
  • Prometheus Adapter for Kubernetes Metrics APIs: exposes custom metrics (for example, autoscaling an application based on nginx request rate)
  • kube-state-metrics: state metrics for Kubernetes resource objects
  • Grafana: dashboards and visualization

2. Environment

The cluster in this post was built with kubeadm, running v1.20.2, with one master and two worker nodes.

Persistent storage is provided by a dynamic StorageClass backed by ceph-rbd.

➜  kubectl version -o yaml
clientVersion:
  buildDate: "2020-12-08T17:59:43Z"
  compiler: gc
  gitCommit: af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38
  gitTreeState: clean
  gitVersion: v1.20.0
  goVersion: go1.15.5
  major: "1"
  minor: "20"
  platform: darwin/amd64
serverVersion:
  buildDate: "2021-01-13T13:20:00Z"
  compiler: gc
  gitCommit: faecb196815e248d3ecfb03c680a4507229c2a56
  gitTreeState: clean
  gitVersion: v1.20.2
  goVersion: go1.15.5
  major: "1"
  minor: "20"
  platform: linux/amd64
➜  kubectl get nodes                                     
NAME       STATUS   ROLES                  AGE   VERSION
k8s-m-01   Ready    control-plane,master   11d    v1.20.2
k8s-n-01   Ready    <none>                 11d    v1.20.2
k8s-n-02   Ready    <none>                 11d    v1.20.2
➜  manifests kubectl get sc                                              
NAME                            PROVISIONER                                   RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
dynamic-ceph-rbd (default)      ceph.com/rbd                                  Delete          Immediate           false                  7d23h

Per kube-prometheus's compatibility matrix (https://github.com/prometheus-operator/kube-prometheus#kubernetes-compatibility-matrix), the latest release-0.7 is the branch to deploy on Kubernetes 1.20:

kube-prometheus stack   Kubernetes 1.16   Kubernetes 1.17   Kubernetes 1.18   Kubernetes 1.19   Kubernetes 1.20
release-0.4             ✔ (v1.16.5+)      ✔                 -                 -                 -
release-0.5             -                 -                 ✔                 -                 -
release-0.6             -                 -                 ✔                 ✔                 -
release-0.7             -                 -                 -                 ✔                 ✔
HEAD                    -                 -                 -                 ✔                 ✔

3. Preparing the manifests

Fetch the release-0.7 branch from the official repository, or download the tagged archive directly:

➜  git clone https://github.com/prometheus-operator/kube-prometheus.git
➜  cd kube-prometheus
➜  git checkout release-0.7
or:
➜  wget -c https://github.com/prometheus-operator/kube-prometheus/archive/v0.7.0.zip

The download contains a large number of files, so it helps to sort the related YAML files into per-component directories:

➜  cd manifests
➜  mkdir -p serviceMonitor prometheus adapter node-exporter kube-state-metrics grafana alertmanager operator other
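
One way to sort the files (a sketch that assumes the release-0.7 layout, where the namespace, CRDs, and operator Deployment ship in manifests/setup/; the ServiceMonitor files are moved first so the broader wildcards below don't catch them):

➜  mv *-serviceMonitor*.yaml serviceMonitor/
➜  mv setup/* operator/ && rmdir setup
➜  mv prometheus-adapter-*.yaml adapter/
➜  mv alertmanager-*.yaml alertmanager/
➜  mv grafana-*.yaml grafana/
➜  mv kube-state-metrics-*.yaml kube-state-metrics/
➜  mv node-exporter-*.yaml node-exporter/
➜  mv prometheus-*.yaml prometheus/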

The final layout:

➜  manifests tree .
.
├── adapter
│   ├── prometheus-adapter-apiService.yaml
│   ├── prometheus-adapter-clusterRole.yaml
│   ├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
│   ├── prometheus-adapter-clusterRoleBinding.yaml
│   ├── prometheus-adapter-clusterRoleBindingDelegator.yaml
│   ├── prometheus-adapter-clusterRoleServerResources.yaml
│   ├── prometheus-adapter-configMap.yaml
│   ├── prometheus-adapter-deployment.yaml
│   ├── prometheus-adapter-roleBindingAuthReader.yaml
│   ├── prometheus-adapter-service.yaml
│   └── prometheus-adapter-serviceAccount.yaml
├── alertmanager
│   ├── alertmanager-alertmanager.yaml
│   ├── alertmanager-secret.yaml
│   ├── alertmanager-service.yaml
│   └── alertmanager-serviceAccount.yaml
├── grafana
│   ├── grafana-dashboardDatasources.yaml
│   ├── grafana-dashboardDefinitions.yaml
│   ├── grafana-dashboardSources.yaml
│   ├── grafana-deployment.yaml
│   ├── grafana-service.yaml
│   └── grafana-serviceAccount.yaml
├── kube-state-metrics
│   ├── kube-state-metrics-clusterRole.yaml
│   ├── kube-state-metrics-clusterRoleBinding.yaml
│   ├── kube-state-metrics-deployment.yaml
│   ├── kube-state-metrics-service.yaml
│   └── kube-state-metrics-serviceAccount.yaml
├── node-exporter
│   ├── node-exporter-clusterRole.yaml
│   ├── node-exporter-clusterRoleBinding.yaml
│   ├── node-exporter-daemonset.yaml
│   ├── node-exporter-service.yaml
│   └── node-exporter-serviceAccount.yaml
├── operator
│   ├── 0namespace-namespace.yaml
│   ├── prometheus-operator-0alertmanagerConfigCustomResourceDefinition.yaml
│   ├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
│   ├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml
│   ├── prometheus-operator-0probeCustomResourceDefinition.yaml
│   ├── prometheus-operator-0prometheusCustomResourceDefinition.yaml
│   ├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
│   ├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
│   ├── prometheus-operator-0thanosrulerCustomResourceDefinition.yaml
│   ├── prometheus-operator-clusterRole.yaml
│   ├── prometheus-operator-clusterRoleBinding.yaml
│   ├── prometheus-operator-deployment.yaml
│   ├── prometheus-operator-service.yaml
│   └── prometheus-operator-serviceAccount.yaml
├── other
├── prometheus
│   ├── prometheus-clusterRole.yaml
│   ├── prometheus-clusterRoleBinding.yaml
│   ├── prometheus-prometheus.yaml
│   ├── prometheus-roleBindingConfig.yaml
│   ├── prometheus-roleBindingSpecificNamespaces.yaml
│   ├── prometheus-roleConfig.yaml
│   ├── prometheus-roleSpecificNamespaces.yaml
│   ├── prometheus-rules.yaml
│   ├── prometheus-service.yaml
│   └── prometheus-serviceAccount.yaml
└── serviceMonitor
    ├── alertmanager-serviceMonitor.yaml
    ├── grafana-serviceMonitor.yaml
    ├── kube-state-metrics-serviceMonitor.yaml
    ├── node-exporter-serviceMonitor.yaml
    ├── prometheus-adapter-serviceMonitor.yaml
    ├── prometheus-operator-serviceMonitor.yaml
    ├── prometheus-serviceMonitor.yaml
    ├── prometheus-serviceMonitorApiserver.yaml
    ├── prometheus-serviceMonitorCoreDNS.yaml
    ├── prometheus-serviceMonitorKubeControllerManager.yaml
    ├── prometheus-serviceMonitorKubeScheduler.yaml
    └── prometheus-serviceMonitorKubelet.yaml

9 directories, 67 files

Next, edit the manifests to give Prometheus and Grafana persistent storage.

manifests/prometheus/prometheus-prometheus.yaml

...
  serviceMonitorSelector: {}
  version: v2.22.1
  retention: 3d
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: dynamic-ceph-rbd
        resources:
          requests:
            storage: 5Gi
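
The Operator turns this volumeClaimTemplate into one PVC per Prometheus replica (claim names follow the StatefulSet convention), so after deployment you should see claims bound against the ceph-rbd StorageClass:

➜  kubectl get pvc -n monitoring
# expect e.g. prometheus-k8s-db-prometheus-k8s-0   Bound   ...   5Gi   dynamic-ceph-rbd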

manifests/grafana/grafana-deployment.yaml

...
      serviceAccountName: grafana
      volumes:
#      - emptyDir: {}
#        name: grafana-storage
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-data

Add a PVC for Grafana in manifests/other/grafana-pvc.yaml:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: grafana-data
  namespace: monitoring
spec:
  storageClassName: dynamic-ceph-rbd
  accessModes:
    - ReadWriteOnce    # RBD volumes do not support ReadWriteMany; RWO is enough for the single-replica Grafana
  resources:
    requests:
      storage: 5Gi

4. Deployment

Apply the manifests:

➜  kubectl create -f other/grafana-pvc.yaml 
➜  kubectl create -f operator/
➜  kubectl create -f adapter/ -f alertmanager/ -f grafana/ -f kube-state-metrics/ -f node-exporter/ -f prometheus/ -f serviceMonitor/ 
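
The operator/ directory must go first because it registers the CRDs that everything else depends on. If the last command fails with "no matches for kind" errors, the CRDs were not yet established; a quick check before re-running it (CRD names as installed by release-0.7):

➜  kubectl wait --for=condition=Established crd/prometheuses.monitoring.coreos.com crd/servicemonitors.monitoring.coreos.com
➜  kubectl get crd | grep monitoring.coreos.com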

Check the status:

➜  kubectl get po,svc -n monitoring 
NAME                                       READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                    2/2     Running   0          15m
pod/alertmanager-main-1                    2/2     Running   0          10m
pod/alertmanager-main-2                    2/2     Running   0          15m
pod/grafana-d69dcf947-wnspk                1/1     Running   0          22m
pod/kube-state-metrics-587bfd4f97-bffqv    3/3     Running   0          22m
pod/node-exporter-2vvhv                    2/2     Running   0          22m
pod/node-exporter-7nsz5                    2/2     Running   0          22m
pod/node-exporter-wggpp                    2/2     Running   0          22m
pod/prometheus-adapter-69b8496df6-cjw6w    1/1     Running   0          23m
pod/prometheus-k8s-0                       2/2     Running   1          75s
pod/prometheus-k8s-1                       2/2     Running   0          9m33s
pod/prometheus-operator-7649c7454f-nhl72   2/2     Running   0          28m

NAME                            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main       ClusterIP   10.1.189.238   <none>        9093/TCP                     23m
service/alertmanager-operated   ClusterIP   None           <none>        9093/TCP,9094/TCP,9094/UDP   23m
service/grafana                 ClusterIP   10.1.29.30     <none>        3000/TCP                     23m
service/kube-state-metrics      ClusterIP   None           <none>        8443/TCP,9443/TCP            23m
service/node-exporter           ClusterIP   None           <none>        9100/TCP                     23m
service/prometheus-adapter      ClusterIP   10.1.75.64     <none>        443/TCP                      23m
service/prometheus-k8s          ClusterIP   10.1.111.121   <none>        9090/TCP                     23m
service/prometheus-operated     ClusterIP   None           <none>        9090/TCP                     14m
service/prometheus-operator     ClusterIP   None           <none>        8443/TCP                     28m
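
For a quick look at the UIs before the Ingress below is in place, kubectl port-forward is enough:

➜  kubectl -n monitoring port-forward svc/grafana 3000:3000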

Next, create an Ingress for Prometheus, Grafana, and Alertmanager:

manifests/other/ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prom-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
    prometheus.io/http_probe: "true"
spec:
  rules:
  - host: alert.k8s-1.20.2.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
  - host: grafana.k8s-1.20.2.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
  - host: prom.k8s-1.20.2.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
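
The three hostnames must resolve to the ingress controller before the UIs are reachable. On a lab cluster, client-side /etc/hosts entries are enough (172.16.1.71 is this cluster's master node; that the ingress controller is exposed there is an assumption — substitute your own ingress address):

172.16.1.71  alert.k8s-1.20.2.com grafana.k8s-1.20.2.com prom.k8s-1.20.2.com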

5. Fixing the ControllerManager and Scheduler monitoring

After a default install, the Prometheus UI shows three firing alerts: Watchdog, KubeControllerManagerDown, and KubeSchedulerDown.

Watchdog is expected to fire. Its purpose: if Alertmanager or Prometheus itself goes down, no alert can be delivered at all, so the usual practice is either to monitor Prometheus from a second system or to keep one always-firing alert whose silence signals that the alerting pipeline is broken (a "dead man's switch"). The Prometheus Operator already accounts for this and ships such a watchdog as a built-in self-check.

To disable it, delete or comment out the Watchdog block in manifests/prometheus/prometheus-rules.yaml:

...
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        message: 'xxx'
      expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10
      for: 10m
      labels:
        severity: warning
#    - alert: Watchdog
#      annotations:
#        message: |
#          This is an alert meant to ensure that the entire alerting pipeline is functional.
#          This alert is always firing, therefore it should always be firing in Alertmanager
#          and always fire against a receiver. There are integrations with various notification
#          mechanisms that send a notification when this alert is not firing. For example the
#          "DeadMansSnitch" integration in PagerDuty.
#      expr: vector(1)
#      labels:
#        severity: none

Fixing KubeControllerManagerDown and KubeSchedulerDown

The cause: prometheus-serviceMonitorKubeControllerManager.yaml selects a Service with the labels below, but a kubeadm-installed cluster does not create any Service for the kube-controller-manager component by default:

  selector:
    matchLabels:
      k8s-app: kube-controller-manager
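
You can confirm that no matching Service exists:

➜  kubectl get svc -n kube-system -l k8s-app=kube-controller-manager
No resources found in kube-system namespace.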

First, change kube-controller-manager's bind address so its metrics port is reachable from the pod network (the kubelet recreates the static pod automatically once the manifest is saved):

# vim /etc/kubernetes/manifests/kube-controller-manager.yaml
...
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=0.0.0.0
# netstat -lntup|grep kube-contro                                      
tcp6       0      0 :::10257                :::*                    LISTEN      38818/kube-controll

Then create a Service and a matching Endpoints object for the ServiceMonitor to discover:

other/kube-controller-manager-svc-ep.yaml

apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP

---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
subsets:
- addresses:
  - ip: 172.16.1.71
  ports:
    - name: https-metrics
      port: 10257
      protocol: TCP

kube-scheduler is handled the same way. Change its bind address:

# vim /etc/kubernetes/manifests/kube-scheduler.yaml
...
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=0.0.0.0
# netstat -lntup|grep kube-sched
tcp6       0      0 :::10259                :::*                    LISTEN      100095/kube-schedul

Then create the Service and Endpoints for its ServiceMonitor:

other/kube-scheduler-svc-ep.yaml

apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP

---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
subsets:
- addresses:
  - ip: 172.16.1.71
  ports:
    - name: https-metrics
      port: 10259
      protocol: TCP
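
Apply both files and confirm that the Endpoints exist; Prometheus should pick up the two new targets within a scrape interval or two:

➜  kubectl apply -f other/kube-controller-manager-svc-ep.yaml -f other/kube-scheduler-svc-ep.yaml
➜  kubectl get ep -n kube-system kube-controller-manager kube-scheduler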

Check the Prometheus alerts page again: the two alerts are resolved.

Log in to Grafana to browse the bundled dashboards.

That completes a basic Kubernetes monitoring deployment with kube-prometheus. In later posts I'll cover custom monitoring and alerting rules, alert notification, high availability, and deployment at scale.

Reference: https://github.com/prometheus-operator/kube-prometheus

