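We run an Alibaba Cloud managed K8s cluster, version 1.21, with kube-prometheus 0.9. If you also use Alibaba Cloud managed ACK, file a ticket in advance to have authorization management enabled; otherwise the installation will fail because RoleBinding cannot be found.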
Reference documents:
http://www.servicemesher.com/blog/prometheus-operator-manual/
https://github.com/coreos/prometheus-operator
https://github.com/coreos/kube-prometheus
https://www.cnblogs.com/twobrother/p/11165417.html
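1. Overview
1.1 Ways to deploy Prometheus monitoring on k8s
There are generally three ways to deploy Prometheus monitoring on k8s:
- Manual deployment with YAML
- Deployment with the operator
- Deployment with a helm chart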
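1.2 What is Prometheus Operator
Prometheus Operator is essentially a set of user-defined CRD resources together with a Controller implementation. It watches these custom resources for changes and, based on their definitions, automates the management of Prometheus Server itself and of its configuration.
(Figure: Prometheus Operator architecture diagram.)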
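Before configuring prometheus-operator to monitor the JVM, we need to understand its four CRD components. They work as follows:
Prometheus: the Operator deploys a Prometheus Server cluster according to what is described in a custom resource of kind: Prometheus. You can think of this custom resource as a special-purpose StatefulSet for managing Prometheus Server.
ServiceMonitor: a Kubernetes custom resource (a CRD, like kind: Prometheus) that describes the target list for Prometheus Server. The Operator watches this resource, dynamically updates Prometheus Server's scrape targets, and has Prometheus Server reload its configuration (Prometheus exposes an HTTP endpoint for this, /-/reload). The resource uses a selector to pick the endpoints of matching Services by label, and Prometheus Server pulls metrics through those Services. The target must serve its metrics in the Prometheus metrics format at an HTTP URL; the ServiceMonitor can also specify the target's metrics URL path.
Alertmanager: Prometheus Operator manages and deploys not only Prometheus Server but also Alertmanager. It is likewise described by a custom resource, kind: Alertmanager, and the Operator deploys an Alertmanager cluster according to that description.
PrometheusRule: with plain Prometheus, you create the alerting rule files by hand and declare them in the Prometheus configuration so that they are loaded. With Prometheus Operator, alerting rules also become a resource created declaratively through the Kubernetes API. Once a rule has been created, the Prometheus resource selects the PrometheusRule objects it should load through ruleSelector label matching, just as it selects ServiceMonitors.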
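To make the relationship between these resources concrete, here is a minimal sketch of a Prometheus resource picking up a ServiceMonitor and a PrometheusRule by label. Everything application-specific is an assumption made for illustration: the jvm-app name, the metrics port name, the /actuator/prometheus path, the team: demo label, and the alert itself. The selectors in your own kube-prometheus release may differ (they can be empty, which matches everything in the selected namespaces).

```yaml
# Excerpt of a Prometheus resource: the two selectors decide which
# ServiceMonitor and PrometheusRule objects get loaded.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector:
    matchLabels:
      team: demo              # hypothetical label
  ruleSelector:
    matchLabels:
      team: demo              # hypothetical label
---
# Scrape every Service labelled app=jvm-app in the default namespace.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jvm-app               # hypothetical application
  namespace: monitoring
  labels:
    team: demo                # must match serviceMonitorSelector above
spec:
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: jvm-app            # label on the Service, not on the Pod
  endpoints:
  - port: metrics             # name of the Service port (assumed)
    path: /actuator/prometheus  # assumed metrics path
    interval: 30s
---
# Alert when the target stops answering scrapes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jvm-app-rules
  namespace: monitoring
  labels:
    team: demo                # must match ruleSelector above
spec:
  groups:
  - name: jvm-app.rules
    rules:
    - alert: JvmAppDown
      # Assumes the Service is also named jvm-app, since the job label
      # defaults to the Service name when jobLabel is not set.
      expr: up{job="jvm-app"} == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        message: jvm-app target has been down for 5 minutes.
```

The point of the sketch is the label chain: Prometheus selects ServiceMonitors and PrometheusRules by their metadata labels, and the ServiceMonitor in turn selects Services by theirs.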
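2. Installation and deployment
1. Download the deployment package
wget -c https://github.com/prometheus-operator/kube-prometheus/archive/v0.7.0.zip
The URL above fetches the v0.7.0 archive; change the tag to the release you intend to deploy (0.9 in our case).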
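2. Modify the files
Metrics ports for kubelet: 10250 is HTTPS, 10255 is HTTP.
Metrics ports for kube-scheduler: 10259 is HTTPS, 10251 is HTTP.
Metrics ports for kube-controller-manager: 10257 is HTTPS, 10252 is HTTP.
Test: on the host, curl the relevant port's /metrics path to fetch the metrics (the HTTP ports answer only if the component has its insecure or read-only port enabled). For example, to fetch the kubelet metrics: curl http://127.0.0.1:10255/metrics
The files to change are:
- kubernetes-serviceMonitorKubeScheduler.yaml
- kubernetes-serviceMonitorKubeControllerManager.yaml
- kubernetes-serviceMonitorKubelet.yaml
By default these YAML files scrape over the HTTPS port (10250 in the kubelet case), so change port to http-metrics and, likewise, scheme to http.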
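As a rough sketch of that change, the kube-scheduler ServiceMonitor might end up looking like the following. The surrounding fields are reproduced from memory of the upstream manifest and may differ in your release; only port and scheme are the edits the text calls for. Dropping the HTTPS-only settings (bearerTokenFile, tlsConfig) is an assumption on my part about what a plain HTTP endpoint needs.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-scheduler
  namespace: monitoring
  labels:
    k8s-app: kube-scheduler
spec:
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler   # must match the labels on the Service
  endpoints:
  - interval: 30s
    port: http-metrics          # was: https-metrics
    scheme: http                # was: https
    # bearerTokenFile and tlsConfig from the HTTPS variant are omitted
    # here on the assumption that the HTTP port needs neither.
```

The port value is the name of a port on the selected Service, not a number, so a Service exposing a port called http-metrics has to exist for this endpoint to resolve.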
Reference: https://www.cnblogs.com/xinbat/p/15116903.html
3. Deploy
# cd kube-prometheus/manifests/setup
# kubectl apply -f .
# cd ..
# kubectl apply -f .
Create an ingress for prometheus, grafana, and alertmanager:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-alertmangaer-grafana-ingress
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/connection-proxy-header: "keep-alive"
    nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
    nginx.ingress.kubernetes.io/proxy-body-size: 80m
spec:
  tls:
  - hosts:
    - 'prometheus.xxx.com'
    secretName: xxx-com-secret
  - hosts:
    - 'grafana.xxx.com'
    secretName: xxx-com-secret
  - hosts:
    - 'alertmanager.xxx.com'
    secretName: xxx-com-secret
  rules:
  - host: prometheus.xxx.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s
          servicePort: 9090
  - host: grafana.xxx.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
  - host: alertmanager.xxx.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager-main
          servicePort: 9093
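The extensions/v1beta1 Ingress API still works on 1.21 but is no longer served from Kubernetes 1.22 onward. If you deploy on a newer cluster, the same rule can be written against networking.k8s.io/v1. The sketch below covers only the Prometheus host and keeps a single annotation for brevity; the ingressClassName value nginx and the resource name prometheus-ingress are assumptions about your environment.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress        # hypothetical name
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
spec:
  ingressClassName: nginx         # assumed ingress class
  tls:
  - hosts:
    - prometheus.xxx.com
    secretName: xxx-com-secret
  rules:
  - host: prometheus.xxx.com
    http:
      paths:
      - path: /
        pathType: Prefix          # required in networking.k8s.io/v1
        backend:
          service:                # replaces serviceName / servicePort
            name: prometheus-k8s
            port:
              number: 9090
```

The grafana and alertmanager hosts follow the same pattern with their own service name and port number.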
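Fixing the Watchdog, ControllerManager, and Scheduler alerts
Watchdog is a normal alert. Its purpose is this: if Alertmanager or Prometheus itself goes down, no alerts can be sent at all, so people usually either monitor Prometheus with a second monitoring system or define an alert that fires continuously. The day that notification stops arriving, you know the monitoring itself is broken. Prometheus Operator has already taken this into account and ships a Watchdog alert to monitor itself.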
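One way to use Watchdog as the heartbeat described above is to route it to a dedicated receiver that expects regular notifications. The Alertmanager configuration below is only a sketch: the receiver names and the webhook URL are hypothetical, and the external service that raises an alarm when the pings stop is assumed to exist on your side.

```yaml
route:
  receiver: default
  routes:
  # Send the always-firing Watchdog alert to its own receiver and
  # repeat it often, so that silence means the pipeline is broken.
  - match:
      alertname: Watchdog
    receiver: watchdog-heartbeat    # hypothetical receiver name
    repeat_interval: 5m
receivers:
- name: default
- name: watchdog-heartbeat
  webhook_configs:
  - url: https://heartbeat.example.com/ping   # hypothetical endpoint
```

With this routing, Watchdog stops cluttering the default receiver while still serving its purpose.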
To turn it off, delete or comment out the Watchdog section:
prometheus-rules.yaml
...
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        message: 'xxx'
      expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10
      for: 10m
      labels:
        severity: warning
    # - alert: Watchdog
    #   annotations:
    #     message: |
    #       This is an alert meant to ensure that the entire alerting pipeline is functional.
    #       This alert is always firing, therefore it should always be firing in Alertmanager
    #       and always fire against a receiver. There are integrations with various notification
    #       mechanisms that send a notification when this alert is not firing. For example the
    #       "DeadMansSnitch" integration in PagerDuty.
    #   expr: vector(1)
    #   labels:
    #     severity: none
Watchdog is a pure rule (expr: vector(1)) with no scrape target behind it, so there is no ServiceMonitor to delete along with it.
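Resolving KubeControllerManagerDown and KubeSchedulerDown
The cause is that prometheus-serviceMonitorKubeControllerManager.yaml contains the selector below, but a default cluster installation does not create a Service for the kube-controller-manager system component:
selector:
  matchLabels:
    k8s-app: kube-controller-manager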
Change the listen address of kube-controller-manager:
# vim /etc/kubernetes/manifests/kube-controller-manager.yaml
...
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=0.0.0.0
# netstat -lntup|grep kube-contro
tcp6       0      0 :::10257                :::*                    LISTEN      38818/kube-controll
Create prometheus-kube-controller-manager-service.yaml and prometheus-kube-scheduler-service.yaml so that the ServiceMonitors have Services to select:
# cat prometheus-kube-controller-manager-service.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
# cat prometheus-kube-scheduler-service.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
10251 is the port that serves the kube-scheduler metrics, and 10252 is the port that serves the kube-controller-manager metrics.
A note on the labels and selector sections above: the Service's labels must match the selector in the corresponding ServiceMonitor object, while the Service's own selector is set to component=kube-scheduler. Why that label? Describe the kube-scheduler Pod (and the kube-controller-manager Pod) to see:
# kubectl describe pod kube-scheduler-k8s-master -n kube-system
Name:                 kube-scheduler-k8s-master
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 k8s-master/10.6.76.25
Start Time:           Thu, 29 Aug 2019 09:21:01 +0800
Labels:               component=kube-scheduler
                      tier=control-plane
# kubectl describe pod kube-controller-manager-k8s-master -n kube-system
Name:                 kube-controller-manager-k8s-master
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 k8s-master/10.6.76.25
Start Time:           Thu, 29 Aug 2019 09:21:01 +0800
Labels:               component=kube-controller-manager
                      tier=control-plane
Access through the ingress in a browser
https://prometheus.xxx.com/
https://alertmanager.xxx.com/
https://grafana.xxx.com/
The default Grafana username and password are admin / admin; you have to reset the password on first login.