1. Steps and notes (prerequisite: the cluster itself is covered in the deployment guide)
- etcd clusters normally run with HTTPS client authentication enabled, so scraping etcd requires the matching certificates
- Create a secret from the etcd certificates
- Mount that secret into Prometheus
- Create a ServiceMonitor object for etcd (matching Services in the kube-system namespace that carry the k8s-app=etcd label)
- Create a Service that associates the monitored etcd endpoints
2. Hands-on steps (default etcd certificate path: /etc/kubernetes/pki/etcd/)
Step 1. Create the etcd secret
cd /etc/kubernetes/pki/etcd/
kubectl create secret generic etcd-certs --from-file=healthcheck-client.crt --from-file=healthcheck-client.key --from-file=ca.crt -n monitoring
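Before wiring the secret into Prometheus it is worth confirming that all three files landed in it; a quick check (assumes kubectl access to the cluster, names as in the command above):

```shell
# list the keys stored in the etcd-certs secret; expect ca.crt,
# healthcheck-client.crt and healthcheck-client.key to appear
kubectl describe secret etcd-certs -n monitoring
```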
Step 2. Add the secret to the Prometheus object named k8s (kubectl edit prometheus k8s -n monitoring, or edit the YAML file and re-apply it)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 2
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
Step 3. Create the ServiceMonitor object
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
Step 4. Create a Service with hand-written Endpoints (etcd may be deployed outside the Kubernetes cluster, so the Endpoints are defined manually instead of via a selector)
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 1.1.1.11
  - ip: 1.1.1.12
  - ip: 1.1.1.13
    nodeName: etcd-master
  ports:
  - name: port
    port: 2379
    protocol: TCP
At this point the etcd metrics should be visible on the Prometheus web UI.
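If the targets stay down, the scrape can be reproduced by hand from a node that holds the client certificates; a sketch using curl against the first member from the example above (substitute your own member IP):

```shell
# query the etcd metrics endpoint the way Prometheus does,
# presenting the healthcheck client certificate (paths as in Step 1)
cd /etc/kubernetes/pki/etcd/
curl --cacert ca.crt \
     --cert healthcheck-client.crt \
     --key healthcheck-client.key \
     https://1.1.1.11:2379/metrics | head
```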
If the targets report a connection refused error, edit the etcd.yaml file under /etc/kubernetes/manifests:
Option 1: --listen-client-urls=https://0.0.0.0:2379 (listen on all interfaces)
Option 2: --listen-client-urls=https://127.0.0.1:2379,https://1.1.1.11:2379 (listen on loopback plus the node's own IP; 1.1.1.11 in this example)
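For orientation, the flag lives in the command list of the etcd static-pod manifest; a trimmed excerpt (all other flags stay unchanged, and 1.1.1.11 follows the example endpoints above). The kubelet re-reads the manifest and restarts etcd automatically after the file is saved:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (excerpt)
spec:
  containers:
  - command:
    - etcd
    - --listen-client-urls=https://127.0.0.1:2379,https://1.1.1.11:2379
```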
3. Creating custom alerts
- Once a PrometheusRule resource is created, the matching alert rule file is generated inside the Prometheus pods
- Note: the labels here must match the ruleSelector of the Prometheus object (prometheus: k8s, role: alert-rules above)
- Alert rule: the cluster is considered available as long as more than half of the etcd members are up; otherwise an alert fires
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd-exporter.rules
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 3m
      labels:
        severity: critical
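The expression fires when the number of down members exceeds half the member count minus one, i.e. when one more failure would cost quorum (or quorum is already lost). A standalone arithmetic sketch of that threshold, no cluster needed; fires is a hypothetical helper mirroring the PromQL comparison:

```shell
# fires TOTAL DOWN -> exit 0 when the alert condition
#   count(up == 0) > count(up) / 2 - 1
# holds for TOTAL members with DOWN of them down
fires() {
  awk -v t="$1" -v d="$2" 'BEGIN { exit !(d > t/2 - 1) }'
}
fires 3 1 && echo "3 members, 1 down: alert"   # 2 up is bare quorum, one more loss breaks it
fires 5 1 || echo "5 members, 1 down: no alert" # 4 up, can still lose one safely
fires 5 2 && echo "5 members, 2 down: alert"   # 3 up is bare quorum
```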