一、Introduction
Operator is a pattern developed by CoreOS: a controller that extends the Kubernetes API for a specific application and is used to create, configure, and manage complex stateful applications such as databases and monitoring systems. Prometheus Operator is one of the most important projects built on this pattern.
Its architecture is shown in the following diagram:
The core component is the Operator itself. It creates the four CRDs Prometheus, ServiceMonitor, AlertManager, and PrometheusRule, and then continuously watches and reconciles objects of these four kinds.
- Prometheus: an abstraction of a Prometheus server
- ServiceMonitor: an abstraction of the various exporters (scrape targets)
- AlertManager: an abstraction of the Prometheus Alertmanager
- PrometheusRule: the file that defines alerting rules
In the diagram above, Service and ServiceMonitor are both Kubernetes resources: a ServiceMonitor selects a class of Services through a labelSelector, and Prometheus in turn selects multiple ServiceMonitors through a labelSelector. A minimal sketch of this relationship follows.
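For illustration only, here is a minimal, hypothetical Service/ServiceMonitor pair (the names, namespace, and the label app: example-app are made up) showing how the labelSelector ties the two objects together:
apiVersion: v1
kind: Service
metadata:
  name: example-app            # hypothetical Service
  namespace: default
  labels:
    app: example-app           # label the ServiceMonitor will match on
spec:
  selector:
    app: example-app
  ports:
  - name: metrics              # port name referenced by the ServiceMonitor
    port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app         # matches the Service above by label
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - port: metrics              # must match the Service port name
    interval: 30s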
二、Installation
Watch out for version-compatibility pitfalls: check the compatibility matrix on GitHub first and download the kube-prometheus release that matches your cluster version.
We install from source. First clone the repository:
# git clone https://github.com/coreos/kube-prometheus.git
Enter kube-prometheus/manifests/setup and create the CRD objects directly:
# cd kube-prometheus/manifests/setup
# kubectl apply -f .
Then create the resource manifests in the parent directory:
# cd kube-prometheus/manifests
# kubectl apply -f .
You can see the following CRD objects were created:
# kubectl get crd | grep coreos
alertmanagers.monitoring.coreos.com 2019-12-02T03:03:37Z
podmonitors.monitoring.coreos.com 2019-12-02T03:03:37Z
prometheuses.monitoring.coreos.com 2019-12-02T03:03:37Z
prometheusrules.monitoring.coreos.com 2019-12-02T03:03:37Z
servicemonitors.monitoring.coreos.com 2019-12-02T03:03:37Z
Check the created Pods:
# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 2m37s
alertmanager-main-1 2/2 Running 0 2m37s
alertmanager-main-2 2/2 Running 0 2m37s
grafana-77978cbbdc-886cc 1/1 Running 0 2m46s
kube-state-metrics-7f6d7b46b4-vrs8t 3/3 Running 0 2m45s
node-exporter-5552n 2/2 Running 0 2m45s
node-exporter-6snb7 2/2 Running 0 2m45s
prometheus-adapter-68698bc948-6s5f2 1/1 Running 0 2m45s
prometheus-k8s-0 3/3 Running 1 2m27s
prometheus-k8s-1 3/3 Running 1 2m27s
prometheus-operator-6685db5c6-4tdhp 1/1 Running 0 2m52s
Check the created Services:
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-main ClusterIP 10.68.97.247 <none> 9093/TCP 3m51s
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 3m41s
grafana ClusterIP 10.68.234.173 <none> 3000/TCP 3m50s
kube-state-metrics ClusterIP None <none> 8443/TCP,9443/TCP 3m50s
node-exporter ClusterIP None <none> 9100/TCP 3m50s
prometheus-adapter ClusterIP 10.68.109.201 <none> 443/TCP 3m50s
prometheus-k8s ClusterIP 10.68.9.232 <none> 9090/TCP 3m50s
prometheus-operated ClusterIP None <none> 9090/TCP 3m31s
prometheus-operator ClusterIP None <none> 8080/TCP 3m57s
As we can see, prometheus-k8s and grafana, the two Services we use most, are of type ClusterIP. To access them from outside the cluster, we can either change them to NodePort or expose them through an Ingress. For example, with an Ingress:
prometheus-ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "traefik"
spec:
  rules:
  - host: prometheus.joker.com
    http:
      paths:
      - path:
        backend:
          serviceName: prometheus-k8s
          servicePort: 9090
grafana-ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "traefik"
spec:
  rules:
  - host: grafana.joker.com
    http:
      paths:
      - path:
        backend:
          serviceName: grafana
          servicePort: 3000
Since we have no registered domain here, we will simply use the NodePort type instead; a hedged patch example follows, and after the change the Services look like this:
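As a sketch, one way to switch the two Services to NodePort is kubectl patch (kubectl edit works just as well); the Service names below are the kube-prometheus defaults:
$ kubectl -n monitoring patch svc grafana -p '{"spec": {"type": "NodePort"}}'
$ kubectl -n monitoring patch svc prometheus-k8s -p '{"spec": {"type": "NodePort"}}'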
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 10.68.234.173 <none> 3000:39807/TCP 3h1m
prometheus-k8s NodePort 10.68.9.232 <none> 9090:20547/TCP 3h1m
Then Prometheus and Grafana can be reached from a browser as normal.
三、Configuration
3.1、Monitoring cluster resources
Most of the targets are already scraped correctly; only two or three are missing their monitoring targets, for example the kube-controller-manager and kube-scheduler system components. This comes down to how the ServiceMonitor is defined, so let's first look at the ServiceMonitor definition for kube-scheduler: (prometheus-serviceMonitorKubeScheduler.yaml)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s        # scrape every 30s
    port: http-metrics   # name of the Service port to scrape
  jobLabel: k8s-app
  namespaceSelector:     # which namespaces to look for Services in; use "any: true" to match every namespace
    matchNames:
    - kube-system
  selector:              # labels of the Services to match; with matchLabels a Service must carry all listed labels, with matchExpressions matching a single expression is enough
    matchLabels:
      k8s-app: kube-scheduler
This is a typical ServiceMonitor declaration. Through selector.matchLabels it matches Services in the kube-system namespace that carry the label k8s-app=kube-scheduler, but no such Service exists in the cluster yet, so we need to create one by hand: (prometheus-kubeSchedulerService.yaml)
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
Port 10251 is where the kube-scheduler component exposes its metrics; 10252 is where the kube-controller-manager component exposes its monitoring data.
The most important parts above are labels and selector: the labels section must stay consistent with the selector in the ServiceMonitor object, while the Service's own selector is component=kube-scheduler. Why that label? We can describe the kube-scheduler Pod to find out:
$ kubectl describe pod kube-scheduler-master -n kube-system
Name: kube-scheduler-master
Namespace: kube-system
Node: master/10.151.30.57
Start Time: Sun, 05 Aug 2018 18:13:32 +0800
Labels: component=kube-scheduler
tier=control-plane
......
The Pod carries two labels, component=kube-scheduler and tier=control-plane. The former is the more specific of the two, so it is the better choice; with it, the Service we defined above is associated with the Pod. Now just create the Service:
$ kubectl create -f prometheus-kubeSchedulerService.yaml
$ kubectl get svc -n kube-system -l k8s-app=kube-scheduler
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-scheduler ClusterIP 10.102.119.231 <none> 10251/TCP 18m
After creating it, wait a short while and check the kube-scheduler target on the Prometheus targets page (screenshot: prometheus kube-scheduler target error):
The target is now discovered, but scraping fails. The error occurs because this cluster was set up with kubeadm, where kube-scheduler binds to 127.0.0.1 by default, while Prometheus tries to reach it via the node IP, so the connection is refused. The fix is to make kube-scheduler bind to 0.0.0.0. Since kube-scheduler runs as a static Pod, we only need to edit the corresponding YAML file in the static Pod manifest directory:
$ ls /etc/kubernetes/manifests/
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
Change the --address flag in the command section of kube-scheduler.yaml to 0.0.0.0:
containers:
- command:
  - kube-scheduler
  - --leader-elect=true
  - --kubeconfig=/etc/kubernetes/scheduler.conf
  - --address=0.0.0.0
After the change, move the file out of the static Pod directory, wait a moment, and move it back; kubelet then recreates the Pod with the new flags automatically (a sketch is shown below). Afterwards, check in Prometheus whether the kube-scheduler target is healthy again (screenshot: prometheus-operator kube-scheduler target up).
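As a rough sketch (the paths are the kubeadm defaults shown above; adjust them to your environment):
$ mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
# wait for the static Pod to disappear, then move the manifest back
$ mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/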
You can follow the same steps to fix the monitoring of the kube-controller-manager component.
3.2、Monitoring resources outside the cluster
Quite often not everything is deployed inside the cluster; components such as etcd or kube-scheduler may run outside it. The monitoring workflow is roughly the same as above; the only difference is that when defining the Service, we also have to define its Endpoints ourselves.
3.2.1、Monitoring kube-scheduler
(1)、Define the Service and Endpoints
prometheus-KubeSchedulerService.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
subsets:
- addresses:
  - ip: 172.16.0.33
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
(2)、Define the ServiceMonitor
prometheus-serviceMonitorKubeScheduler.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-scheduler
  namespace: monitoring
  labels:
    k8s-app: kube-scheduler
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler
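Apply both manifests (a sketch of the obvious apply step, using the file names given above):
$ kubectl apply -f prometheus-KubeSchedulerService.yaml
$ kubectl apply -f prometheus-serviceMonitorKubeScheduler.yaml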
Then the kube-scheduler target shows up as monitored in Prometheus:
3.2.2、Monitoring kube-controller-manager
(1)、Configure the Service and Endpoints
prometheus-KubeControllerManagerService.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
subsets:
- addresses:
  - ip: 172.16.0.33
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
(2)、Configure the ServiceMonitor
prometheus-serviceMonitorKubeControllerManager.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
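Apply the two manifests in the same way (sketch, using the file names given above):
$ kubectl apply -f prometheus-KubeControllerManagerService.yaml
$ kubectl apply -f prometheus-serviceMonitorKubeControllerManager.yaml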
3.2.3、Monitoring etcd
In most setups etcd requires TLS client authentication, so the first step is to store the certificates it uses in the cluster.
(Adjust the paths to where your cluster's certificates actually live.)
kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
Then reference the etcd-certs object created above in the prometheus resource object; simply update the resource in place:
# kubectl edit prometheus k8s -n monitoring
Add the following secrets attribute:
  nodeSelector:
    beta.kubernetes.io/os: linux
  replicas: 2
  secrets:
  - etcd-certs
After the update, the etcd certificate files created above become available inside the Prometheus Pods; we can enter a Pod to check the exact path:
# kubectl exec -it prometheus-k8s-0 -n monitoring -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt healthcheck-client.crt healthcheck-client.key
/prometheus $
(1)、Create the ServiceMonitor
prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k8s-etcd
  namespace: monitoring
  labels:
    k8s-app: k8s-etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: k8s-etcd
  namespaceSelector:
    matchNames:
    - kube-system
Here we created a ServiceMonitor named k8s-etcd in the monitoring namespace. Its basic attributes are the same as in the previous sections: it matches Services in the kube-system namespace that carry the label k8s-app=k8s-etcd, and jobLabel names the label used to derive the job name. What differs is the endpoints section, which now carries the certificates needed to reach etcd. Many scrape parameters can be set under endpoints, such as relabel rules and proxyUrl; tlsConfig configures TLS for the scrape endpoint, and because the serverName may not match what was signed into the etcd certificate, insecureSkipVerify=true is added.
Then create this manifest:
# kubectl apply -f prometheus-serviceMonitorEtcd.yaml
(2)、Create the Service and Endpoints
apiVersion: v1
kind: Service
metadata:
  name: k8s-etcd
  namespace: kube-system
  labels:
    k8s-app: k8s-etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: k8s-etcd
  namespace: kube-system
  labels:
    k8s-app: k8s-etcd
subsets:
- addresses:
  - ip: 172.16.0.33
  ports:
  - name: port
    port: 2379
    protocol: TCP
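Save the manifest and apply it; the file name prometheus-etcdService.yaml below is only an assumed name, since the original text does not give one:
$ kubectl apply -f prometheus-etcdService.yaml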
Once the etcd target is up, import dashboard 3070 in Grafana.
3.3、Configuring alerting rules (PrometheusRule)
After we create a PrometheusRule resource object, a corresponding rule file is generated automatically under the prometheus-k8s-rulefiles-0 directory mentioned above.
For example, the following configures alerting rules for etcd:
prometheus-etcdRule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 3m
      labels:
        severity: critical
Then create this manifest:
# kubectl apply -f prometheus-etcdRule.yaml
prometheusrule.monitoring.coreos.com/etcd-rules created
Refresh the Prometheus rules page and the new rule is already in effect.
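If you want to double-check from inside the Pod, the generated rule file should appear under the rulefiles directory; the exact mount path below is an assumption based on the kube-prometheus defaults:
$ kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/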
3.4、Configuring alert notifications
First change the alertmanager-main Service to a NodePort Service as well; after the change we can inspect the AlertManager configuration under the Status page of its web UI:
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-main NodePort 10.68.97.247 <none> 9093:21936/TCP 5h31m
Then open it in a browser:
This configuration actually comes from the alertmanager-secret.yaml file we applied earlier from the kube-prometheus/manifests directory:
apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
We can base64-decode the value of alertmanager.yaml:
# echo "Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==" | base64 -d
"global":
"resolve_timeout": "5m"
"receivers":
- "name": "null"
"route":
"group_by":
- "job"
"group_interval": "5m"
"group_wait": "30s"
"receiver": "null"
"repeat_interval": "12h"
"routes":
- "match":
"alertname": "Watchdog"
"receiver": "null"
This matches what we saw in the web UI.
To configure notification receivers, edit this template, for example:
alertmanager.yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'fmbankops@163.com'
  smtp_auth_username: 'fmbankops@163.com'
  smtp_auth_password: '<mailbox password>'
  smtp_hello: '163.com'
  smtp_require_tls: false
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - receiver: webhook
    match:
      alertname: CoreDNSDown
receivers:
- name: 'default'
  email_configs:
  - to: '517554016@qq.com'
    send_resolved: true
- name: 'webhook'
  webhook_configs:
  - url: 'http://dingtalk-hook.kube-ops:5000'  # our custom webhook
    send_resolved: true
Then replace the Secret object:
# delete the old secret first
$ kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
secret "alertmanager-main" created
After that, alert notifications will be delivered:
四、Advanced configuration
4.1、Auto-discovery configuration
In practice we deploy a large number of Services and Pods; adding monitoring for each one by hand would be repetitive and time-consuming, which is where auto-discovery comes in. When we set up Prometheus manually we already configured Service auto-discovery; the relevant configuration looks like this:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
For a Service to be discovered automatically, it only needs the annotation prometheus.io/scrape: "true" in its manifest.
Save the snippet above as prometheus-additional.yaml, then create a Secret from that file:
# kubectl -n monitoring create secret generic additional-config --from-file=prometheus-additional.yaml
secret/additional-config created
Then reference this Secret in the Prometheus manifest:
cat prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-config
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
Then update the Prometheus configuration:
# kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured
然后我們查看prometheus的日志,發現很多錯誤:
# kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
The log messages point to a permission problem. In Kubernetes, permission problems usually mean an RBAC misconfiguration. Looking at the Prometheus manifest, we can see it runs under a ServiceAccount called prometheus-k8s (a quick check is sketched below):
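As a sketch, the ServiceAccount in use can be read straight from the Prometheus object (the value matches the serviceAccountName field shown in the manifest above):
$ kubectl -n monitoring get prometheus k8s -o jsonpath='{.spec.serviceAccountName}'
prometheus-k8s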
That ServiceAccount is bound to a ClusterRole, also called prometheus-k8s:
# kubectl get clusterrole prometheus-k8s -n monitoring -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"prometheus-k8s"},"rules":[{"apiGroups":[""],"resources":["nodes/metrics"],"verbs":["get"]},{"nonResourceURLs":["/metrics"],"verbs":["get"]}]}
  creationTimestamp: "2019-12-02T03:03:44Z"
  name: prometheus-k8s
  resourceVersion: "1128592"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/prometheus-k8s
  uid: 4f87ca47-7769-432b-b96a-1b826b28003d
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
As we can see, this ClusterRole has no permissions on Services, Pods, or Endpoints, so let's extend it.
prometheus-clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  - pods
  - services
  - endpoints
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
Then update this manifest:
# kubectl apply -f prometheus-clusterRole.yaml
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured
After a short while the auto-discovered targets appear in Prometheus.
Note: for auto-discovery to work, the Service needs the annotation prometheus.io/scrape: "true", and the application must expose metrics through an exporter. For example, the following Redis deployment bundles a redis_exporter sidecar:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis
  namespace: kube-ops
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      - name: redis-exporter
        image: oliver006/redis_exporter:latest
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9121
---
kind: Service
apiVersion: v1
metadata:
  name: redis
  namespace: kube-ops
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9121"
spec:
  selector:
    app: redis
  ports:
  - name: redis
    port: 6379
    targetPort: 6379
  - name: prom
    port: 9121
    targetPort: 9121
4.2、Persistent storage
If you deploy straight from the git clone without any changes, Prometheus runs as a StatefulSet but uses an emptyDir volume, so its data is lost whenever a Pod is deleted or recreated. In a real environment we need persistent storage. First create a StorageClass; we keep using NFS here (for setting up NFS-backed storage, see the StorageClass part of the persistent-storage chapter).
Create the StorageClass:
prometheus-storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-storage
provisioner: rookieops/nfs
The provisioner must be exactly the name we specified when deploying nfs-client-provisioner; it cannot be changed arbitrarily.
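A quick sanity check after applying it (sketch; the StorageClass name is the one defined above):
$ kubectl get storageclass prometheus-storage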
Then add a storage section to the Prometheus manifest:
prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: prometheus-storage
        resources:
          requests:
            storage: 20Gi
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-config
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
With that, persistence works as expected; it is best to make this change at the very beginning of the deployment.
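To verify, the volumeClaimTemplate should have produced one PVC per Prometheus replica, bound through the StorageClass defined above (a sketch; check the names in your own cluster):
$ kubectl -n monitoring get pvc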