===============================================
2021/4/10_4th revision ccb_warlock
Change log:
2021/4/10:
1. Added a screenshot of the email received when Alertmanager fires an alert;
2. Added descriptions of what cAdvisor, kube-state-metrics, Prometheus, Grafana and Alertmanager each do;
2021/2/16:
1. Added a guide for obtaining the kube-state-metrics image;
2. Completed the Prometheus and Grafana sections;
2021/2/15:
1. Added Grafana and some headings;
===============================================
In a container monitoring solution I put together years ago (https://www.cnblogs.com/straycats/p/9281889.html), I ran cAdvisor, Prometheus and Grafana on Docker Swarm to monitor both containers and their hosts. Because I know Docker, last month I was again handed the ops job of delivering a monitoring system, this time on Kubernetes, so I needed to get the same stack running there.
While trying the Grafana demos, I learned that for monitoring Kubernetes resources there is a service better suited than cAdvisor alone: kube-state-metrics.
cAdvisor: collects OS and container (Docker) metrics; already built into the kubelet
kube-state-metrics: collects metrics about Kubernetes objects
Prometheus: scrapes and stores the data
Grafana: visualizes data queried from a backing service such as Prometheus
Two example queries below show how this split looks in practice.
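Both are the kind of PromQL you will later run in Prometheus or Grafana; a minimal sketch, assuming the scrape jobs configured later in this post. The metric names come from cAdvisor and kube-state-metrics respectively, but the exact label set depends on your relabel configuration:
# Per-container working-set memory, collected by cAdvisor via the kubelet
sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod, container)
# Pod counts per phase, collected by kube-state-metrics
sum(kube_pod_status_phase) by (phase)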
1. Deploy Kubernetes
For CentOS 7 you can follow: https://www.cnblogs.com/straycats/p/14322995.html
PS. The version deployed when this tutorial was written was v1.20.1.
2. Create the namespace
kubectl create namespace monit
3. Deploy cAdvisor
Since we are running on Kubernetes and the kubelet already has cAdvisor built in, no extra installation is needed; the kubelet's built-in endpoint is used directly.
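To confirm the kubelet's built-in cAdvisor endpoint is reachable, you can pull a node's metrics through the apiserver proxy; a minimal sketch, assuming kubectl is configured for this cluster and your account may read node proxy subresources:
# Pick the first node and dump the start of its cAdvisor metrics
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" | head -n 20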
4. Deploy kube-state-metrics
4.1 Create the manifest
# Create the directory
mkdir -p /opt/yaml
# Create the manifest
vi /opt/yaml/kube-state-metrics.yaml
Save the following content into kube-state-metrics.yaml, then write and quit (:wq).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.0.0-beta
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - list
  - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.0.0-beta
  name: kube-state-metrics
  namespace: monit
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.0.0-beta
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monit
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.0.0-beta
  name: kube-state-metrics
  namespace: monit
spec:
  type: NodePort
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    #nodePort: 30001
  - name: telemetry
    port: 8081
    targetPort: telemetry
    #nodePort: 30002
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.0.0-beta
  name: kube-state-metrics
  namespace: monit
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.0.0-beta
    spec:
      containers:
      - image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-beta
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        beta.kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
PS. For how to obtain the kube-state-metrics image, see: https://www.cnblogs.com/straycats/p/14405513.html
4.2 Deploy
# Apply the manifest
cd /opt/yaml
kubectl apply -f kube-state-metrics.yaml
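A quick sanity check after applying; a sketch using the label selector and Service name from the YAML above:
# Confirm the deployment, pod and service are up
kubectl -n monit get deploy,pod,svc -l app.kubernetes.io/name=kube-state-metrics
# Forward the metrics port locally and peek at the exposed metrics
kubectl -n monit port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | head -n 20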
5. Deploy Prometheus
5.1 Create the data persistence directory
mkdir -p /opt/vol/prometheus/data
5.2 Create the manifest
# Create the directory
mkdir -p /opt/yaml
# Create the manifest
vi /opt/yaml/prometheus.yaml
Save the following content into prometheus.yaml, then write and quit (:wq).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: warlock
  namespace: monit
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - services/proxy
  - endpoints
  - endpoints/proxy
  - pods
  - pods/proxy
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: warlock
  namespace: monit
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: warlock
  namespace: monit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: warlock
subjects:
- kind: ServiceAccount
  name: warlock
  namespace: monit
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monit
  labels:
    app: prometheus-service
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    # NOTE: 9090 is below the default NodePort range (30000-32767); this assumes the
    # kube-apiserver's --service-node-port-range has been extended to include it.
    # The same applies to the Grafana (3000) and Alertmanager (9093) Services below.
    nodePort: 9090
  selector:
    app: prometheus-deployment
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monit
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager-service:9093
    rule_files:
    - "node.yml"
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'k8s-cadvisor'
      metrics_path: /metrics/cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      metric_relabel_configs:
      - source_labels: [instance]
        separator: ;
        regex: (.+)
        target_label: node
        replacement: $1
        action: replace
      - source_labels: [pod_name]
        separator: ;
        regex: (.+)
        target_label: pod
        replacement: $1
        action: replace
      - source_labels: [container_name]
        separator: ;
        regex: (.+)
        target_label: container
        replacement: $1
        action: replace
    - job_name: kube-state-metrics
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - monit
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        regex: kube-state-metrics
        replacement: $1
        action: keep
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: k8s_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: k8s_sname
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-node
  namespace: monit
data:
  node.yml: |
    groups:
    - name: node
      rules:
      - alert: PrometheusEndpointDown
        expr: up == 0
        for: 10s
        labels:
          source: prometheus
        annotations:
          title: "Endpoint({{$labels.instance}}) Down"
          content: "The endpoint({{$labels.instance}}) of target({{$labels.job}}) has been down for more than 10 seconds."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-deployment
  template:
    metadata:
      labels:
        app: prometheus-deployment
    spec:
      serviceAccountName: warlock
      securityContext:
        runAsUser: 0
      volumes:
      - name: config
        projected:
          sources:
          - configMap:
              name: prometheus-config
          - configMap:
              name: prometheus-node
      - name: data-vol
        hostPath:
          path: /opt/vol/prometheus/data
      containers:
      - name: prometheus
        image: prom/prometheus:v2.24.1
        imagePullPolicy: IfNotPresent # Always
        env:
        - name: TZ
          value: "Asia/Shanghai"
        volumeMounts:
        - name: config
          mountPath: "/etc/prometheus/prometheus.yml"
          subPath: prometheus.yml
          readOnly: true
        - name: config
          mountPath: "/etc/prometheus/node.yml"
          subPath: node.yml
          readOnly: true
        - name: data-vol
          mountPath: /prometheus
        ports:
        - containerPort: 9090
5.3 Deploy
# Apply the manifest
cd /opt/yaml
kubectl apply -f prometheus.yaml
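After applying, you can check the pod, validate the rendered configuration with promtool (it ships inside the prom/prometheus image), and list the discovered scrape targets; a sketch using the Deployment/Service names from the YAML above:
# Confirm the pod is running
kubectl -n monit get pod -l app=prometheus-deployment
# Validate prometheus.yml and the rule file it references
kubectl -n monit exec deploy/prometheus-deployment -- promtool check config /etc/prometheus/prometheus.yml
# List scrape targets through a local port-forward
kubectl -n monit port-forward svc/prometheus-service 9090:9090 &
curl -s http://localhost:9090/api/v1/targets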
6. Deploy Grafana
6.1 Create the data persistence directory
mkdir -p /opt/vol/grafana
6.2 Create the manifest
# Create the directory
mkdir -p /opt/yaml
# Create the manifest
vi /opt/yaml/grafana.yaml
Save the following content into grafana.yaml, then write and quit (:wq).
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monit
  labels:
    app: grafana-service
spec:
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 3000
  selector:
    app: grafana-deployment
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  namespace: monit
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/opt/vol/grafana"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monit
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "10Gi"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-deployment
  namespace: monit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-deployment
  template:
    metadata:
      labels:
        app: grafana-deployment
    spec:
      volumes:
      - name: grafana-pvc
        persistentVolumeClaim:
          claimName: grafana-pvc
      containers:
      - name: grafana
        image: grafana/grafana:7.4.1
        imagePullPolicy: IfNotPresent # Always
        env:
        - name: TZ
          value: "Asia/Shanghai"
        volumeMounts:
        - name: grafana-pvc
          mountPath: /var/lib/grafana
        ports:
        - containerPort: 3000
      initContainers:
      - name: init-chown-data
        image: busybox:1.33.0
        imagePullPolicy: IfNotPresent # Always
        command: ["chown", "-R", "472:472", "/var/lib/grafana"]
        volumeMounts:
        - name: grafana-pvc
          mountPath: /var/lib/grafana
6.3 Deploy
# Apply the manifest
cd /opt/yaml
kubectl apply -f grafana.yaml
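To confirm Grafana is up and find the address to open in a browser; a sketch, assuming the NodePort 3000 from the Service above is allowed by your apiserver's --service-node-port-range:
# Confirm the pod and service are up
kubectl -n monit get pod -l app=grafana-deployment
kubectl -n monit get svc grafana-service
# Find a node IP; Grafana is then reachable at http://<node-ip>:3000
kubectl get nodes -o wide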
6.4 Log in to Grafana
Log in with the default username/password (admin/admin).
6.5 Configure the data source
1) Go to the data sources page and click "Add data source"
2) Select "Prometheus"
3) Fill in the URL of the Prometheus service and click "Save & Test" (Grafana reports success if the service is reachable)
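Since Grafana runs in the same monit namespace as the prometheus-service Service defined earlier, the cluster-internal service URL is the simplest choice; a sketch based on the Service name and port above, assuming the default cluster.local cluster domain for the long form:
# Data source URL when Grafana and Prometheus share a namespace
http://prometheus-service:9090
# Fully qualified form, usable from any namespace
http://prometheus-service.monit.svc.cluster.local:9090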
6.6 Import the dashboard
Dashboard template used: https://grafana.com/grafana/dashboards/13105
The dashboard's author built it against kube-state-metrics v1.9.7, which, according to the documentation (https://github.com/kubernetes/kube-state-metrics), is only fully compatible with Kubernetes up to 1.16.
For Kubernetes 1.17 and later we are running kube-state-metrics v2.0.0-beta, and the 2.x release notes show that a number of metric names and labels were changed.
Importing the original dashboard as-is therefore leaves many panels empty, so I adjusted the affected queries to the new metric names, after which the data shows up as expected (a typical query change is shown below). The steps below use the modified dashboard.
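As one hedged example of the kind of change involved (based on the kube-state-metrics 2.0 release notes; verify against the metrics your cluster actually exposes), the per-resource request/limit metrics were collapsed into a single metric with a resource label:
# kube-state-metrics 1.x style
sum(kube_pod_container_resource_requests_cpu_cores) by (namespace)
# kube-state-metrics 2.x equivalent
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)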
1) Download the dashboard file
Link: https://pan.baidu.com/s/1BYnaczAeIRuJAK6LI8T7GQ
Extraction code: vvcp
2) Import the file
3) View the dashboard
7. Deploy Alertmanager

7.2 Create the manifest
# Create the directory
mkdir -p /opt/yaml
# Create the manifest
vi /opt/yaml/alertmanager.yaml
Adjust the mail-related settings (SMTP server, sender, auth code, recipient) to your own, then save the following content into alertmanager.yaml and write and quit (:wq).
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monit
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: '<smtp-server:port>'
      smtp_from: '<sender-email>'
      smtp_auth_username: '<sender-email>'
      smtp_auth_password: '<email-auth-code>'
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 30s
      repeat_interval: 1h
      receiver: 'mail'
    receivers:
    - name: 'mail'
      email_configs:
      - to: '<recipient-email>'
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-service
  namespace: monit
  labels:
    app: alertmanager-service
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  type: NodePort
  ports:
  - port: 9093
    targetPort: 9093
    nodePort: 9093
  selector:
    app: alertmanager-deployment
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-deployment
  namespace: monit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-deployment
  template:
    metadata:
      labels:
        app: alertmanager-deployment
    spec:
      volumes:
      - name: config
        configMap:
          name: alertmanager-config
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.21.0
        imagePullPolicy: IfNotPresent # Always
        env:
        - name: TZ
          value: "Asia/Shanghai"
        volumeMounts:
        - name: config
          mountPath: "/etc/alertmanager"
          readOnly: true
        ports:
        - containerPort: 9093
7.3 Deploy
# Apply the manifest
cd /opt/yaml
kubectl apply -f alertmanager.yaml
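A quick check that Alertmanager started, that the mounted configuration parses (amtool ships inside the prom/alertmanager image), and that Prometheus has picked it up; a sketch using the names from the manifests above:
# Confirm the pod is running
kubectl -n monit get pod -l app=alertmanager-deployment
# Validate the mounted alertmanager.yml
kubectl -n monit exec deploy/alertmanager-deployment -- amtool check-config /etc/alertmanager/alertmanager.yml
# Ask Prometheus which alertmanagers it is sending to
kubectl -n monit port-forward svc/prometheus-service 9090:9090 &
curl -s http://localhost:9090/api/v1/alertmanagers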
7.4 Simulate an alert
The Prometheus configuration above already contains one alert rule: if any scrape target goes down, fire an alert.
Next we trigger that rule by switching kube-state-metrics to an image tag that does not exist.
# Make a copy of the manifest for the experiment
cd /opt/yaml
cp kube-state-metrics.yaml kube-state-metrics-test.yaml
# Swap in a non-existent image tag
cd /opt/yaml
sed -i 's/kube-state-metrics:v2.0.0-beta$/kube-state-metrics:abcd/g' kube-state-metrics-test.yaml
# Redeploy kube-state-metrics with the broken image
cd /opt/yaml
kubectl delete -f kube-state-metrics.yaml
kubectl create -f kube-state-metrics-test.yaml
Because no image with the tag abcd can be pulled, the service's pod never starts, which triggers the alert rule.
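You can also watch the failure and the resulting alert from the command line; a sketch, with the pod label and alert name taken from the manifests and rule file above:
# The pod should be stuck in ErrImagePull / ImagePullBackOff
kubectl -n monit get pod -l app.kubernetes.io/name=kube-state-metrics
# List active alerts from Prometheus (PrometheusEndpointDown should appear once it fires)
kubectl -n monit port-forward svc/prometheus-service 9090:9090 &
curl -s http://localhost:9090/api/v1/alerts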
The recipient mailbox then receives an alert email like the following:
With this, the demo for monitoring containers and cluster metrics is essentially complete. To use it in a real project, you will still need to choose the metrics and alert rules that fit your actual requirements.