Monitoring solutions
| Solution | Alerting | Characteristics | Suitable for |
| --- | --- | --- | --- |
| cAdvisor+Heapster+InfluxDB+Grafana | Y | Simple | Container monitoring |
| cAdvisor/exporter+Prometheus+Grafana | Y | Highly extensible | Full monitoring of containers, applications, and hosts |
Prometheus+Grafana is a rising star among monitoring and alerting solutions.
Various exporters collect metrics across different dimensions and expose them in the data format Prometheus understands; Prometheus pulls the data periodically, Grafana visualizes it, and AlertManager raises alerts on abnormal conditions.
- cAdvisor collects container- and Pod-level performance metrics and exposes them on a /metrics endpoint that Prometheus scrapes.
- prometheus-node-exporter collects host performance metrics and exposes them on a /metrics endpoint that Prometheus scrapes.
- Application metrics are exposed by the processes running inside the containers themselves (the application implements the metrics endpoint and adds the annotations agreed with the platform; the platform configures Prometheus to scrape targets based on those annotations — see the annotation sketch after this list).
- kube-state-metrics collects state metrics for Kubernetes resource objects and exposes them on a /metrics endpoint that Prometheus scrapes.
- etcd, kubelet, kube-apiserver, kube-controller-manager, and kube-scheduler expose their own /metrics endpoints, which provide cluster-related metrics for each node.
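As a sketch of the annotation convention for application metrics: the exact annotations depend on the relabel rules in prometheus-configmap.yaml, but the widely used prometheus.io/* convention looks like this (the Pod name, port, and image below are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                        # hypothetical example Pod
  annotations:
    prometheus.io/scrape: "true"        # opt this Pod in for scraping
    prometheus.io/port: "8080"          # port where the app serves metrics
    prometheus.io/path: "/metrics"      # metrics path (defaults to /metrics)
spec:
  containers:
  - name: demo-app
    image: example/demo-app:latest      # placeholder image
    ports:
    - containerPort: 8080
```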
Implementation approach
| Monitoring metric | Implementation | Example |
| --- | --- | --- |
| Pod performance | cAdvisor | Container CPU and memory utilization |
| Node performance | node-exporter | Node CPU and memory utilization |
| K8S resource objects | kube-state-metrics | Pod/Deployment/Service |
Deploying Prometheus in k8s
Official site: https://prometheus.io
Download the YAML files: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
Modify the YAML files
```
# Use NFS storage
[root@localhost prometheus]# kubectl get storageclass
NAME                  PROVISIONER      AGE
managed-nfs-storage   fuseim.pri/ifs   9d
[root@localhost prometheus]# sed -i s/standard/managed-nfs-storage/ prometheus-statefulset.yaml
```
```
# Change the Service type to NodePort
[root@localhost prometheus]# vim prometheus-service.yaml
...
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    k8s-app: prometheus
```
Start Prometheus
```
[root@localhost prometheus]# kubectl apply -f prometheus-rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
[root@localhost prometheus]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config created
[root@localhost prometheus]# kubectl apply -f prometheus-statefulset.yaml
statefulset.apps/prometheus created
[root@localhost prometheus]# vim prometheus-service.yaml
[root@localhost prometheus]# kubectl apply -f prometheus-service.yaml
service/prometheus created
```
Check the deployment
```
[root@localhost prometheus]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5b8c57999b-z9jh8                1/1     Running   1          16d
pod/kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          16d
pod/prometheus-0                            2/2     Running   0          2m40s
NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    16d
service/kubernetes-dashboard   NodePort    10.0.0.84    <none>        443:30001/TCP    16d
service/prometheus             NodePort    10.0.0.89    <none>        9090:41782/TCP   39s
[root@localhost prometheus]# kubectl get pv,pvc -n kube-system
NAME                                                                                                  CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS          REASON   AGE
persistentvolume/kube-system-prometheus-data-prometheus-0-pvc-0e92f36c-8d9e-11e9-b018-525400828c1f    16Gi       RWO            Delete           Bound    kube-system/prometheus-data-prometheus-0   managed-nfs-storage            25m
NAME                                                 STATUS   VOLUME                                                                               CAPACITY   ACCESS MODES   STORAGECLASS          AGE
persistentvolumeclaim/prometheus-data-prometheus-0   Bound    kube-system-prometheus-data-prometheus-0-pvc-0e92f36c-8d9e-11e9-b018-525400828c1f   16Gi       RWO            managed-nfs-storage   25m
```
Access
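Since the Service type is NodePort, the Prometheus UI should be reachable at http://<NodeIP>:<NodePort> — with the output above that is port 41782 on any node. The allocated port can be confirmed with:

```
kubectl get svc prometheus -n kube-system
```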
Deploy Grafana
```
[root@localhost prometheus]# cat grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
  - metadata:
      name: grafana-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30007
  selector:
    app: grafana
[root@localhost prometheus]# kubectl apply -f grafana.yaml
statefulset.apps/grafana created
service/grafana created
[root@localhost prometheus]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5b8c57999b-z9jh8                1/1     Running   1          17d
pod/grafana-0                               1/1     Running   0          45s
pod/kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          17d
pod/prometheus-0                            2/2     Running   0          25h
NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/grafana                NodePort    10.0.0.78    <none>        80:30007/TCP     44s
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    17d
service/kubernetes-dashboard   NodePort    10.0.0.84    <none>        443:30001/TCP    17d
service/prometheus             NodePort    10.0.0.89    <none>        9090:41782/TCP   25h
```
Access
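Grafana is pinned to NodePort 30007 in the Service above, so the UI should be at http://<NodeIP>:30007 (the grafana/grafana image defaults to the admin/admin login). To chart the data, add a Prometheus data source; since both Services live in kube-system, the in-cluster URL below should work:

```
# Data source URL when Grafana and Prometheus are both in kube-system
http://prometheus.kube-system.svc:9090
```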
Monitoring Pods in the k8s cluster
The kubelet on each node exposes the built-in cAdvisor metrics endpoint, which provides performance metrics for all containers running on that node.
Exposed endpoints:
- http://NodeIP:10255/metrics/cadvisor (kubelet read-only HTTP port, if enabled)
- https://NodeIP:10250/metrics/cadvisor (kubelet authenticated HTTPS port)
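As a quick sanity check of the authenticated endpoint, one option is to reuse the token of the prometheus ServiceAccount created by prometheus-rbac.yaml (a sketch; it assumes that account may read nodes/metrics and that its token is stored in a Secret, as on the Kubernetes version used here):

```
TOKEN=$(kubectl -n kube-system get secret \
  $(kubectl -n kube-system get sa prometheus -o jsonpath='{.secrets[0].name}') \
  -o jsonpath='{.data.token}' | base64 -d)
curl -sk -H "Authorization: Bearer $TOKEN" https://<NodeIP>:10250/metrics/cadvisor | head
```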
Import a Grafana dashboard template
https://grafana.com/grafana/download
Cluster resource monitoring: dashboard ID 3119
Monitoring nodes in the k8s cluster
Documentation: https://prometheus.io/docs/guides/node-exporter/
GitHub:https://github.com/prometheus/node_exporter
Exporter list: https://prometheus.io/docs/instrumenting/exporters/
Deploy node_exporter on every node
```
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
```
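With the unit running, the exporter should answer on its default port 9100:

```
systemctl status node_exporter
curl -s localhost:9100/metrics | head
```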
Modify prometheus-configmap.yaml to add a scrape job for the node exporters (a sketch follows), then redeploy it.
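A minimal sketch of such a job, added under scrape_configs: in the prometheus.yml section of prometheus-configmap.yaml (the job name matches what the next step looks for; the node IPs are placeholders to replace with your own):

```yaml
    - job_name: kubernetes-nodes
      static_configs:
      - targets:
        - 192.168.0.11:9100   # placeholder node IPs
        - 192.168.0.12:9100
```

After kubectl apply -f prometheus-configmap.yaml, the reload sidecar in the prometheus Pod should pick up the change after a short delay.

Check in the Prometheus UI that the kubernetes-nodes targets are being scraped.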
Import a Grafana dashboard template
Cluster resource monitoring: dashboard ID 9276
Monitoring k8s resource objects
https://github.com/kubernetes/kube-state-metrics
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects. It is not focused on the health of the individual Kubernetes components, but rather on the health of the various objects inside, such as Deployments, Nodes, and Pods.
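A few of the metrics it exposes (names from the kube-state-metrics documentation), plus an example PromQL expression built on them:

```
kube_pod_status_phase                        # Pods by phase (Running, Pending, ...)
kube_deployment_spec_replicas                # desired replicas of each Deployment
kube_deployment_status_replicas_available    # replicas currently available
# Example query: Deployments that currently have unavailable replicas
kube_deployment_status_replicas_unavailable > 0
```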
```
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-rbac.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
role.rbac.authorization.k8s.io/kube-state-metrics-resizer created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
rolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@localhost prometheus]# vim kube-state-metrics-deployment.yaml
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-deployment.yaml
deployment.apps/kube-state-metrics created
configmap/kube-state-metrics-config created
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-service.yaml
service/kube-state-metrics created
```
Import a Grafana dashboard template
Cluster resource monitoring: dashboard ID 6417
Deploying Alertmanager in K8S
Deploy Alertmanager
```
[root@localhost prometheus]# sed -i s/standard/managed-nfs-storage/ alertmanager-pvc.yaml
[root@localhost prometheus]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config created
[root@localhost prometheus]# kubectl apply -f alertmanager-pvc.yaml
persistentvolumeclaim/alertmanager created
[root@localhost prometheus]# kubectl apply -f alertmanager-deployment.yaml
deployment.apps/alertmanager created
[root@localhost prometheus]# kubectl apply -f alertmanager-service.yaml
service/alertmanager created
[root@localhost prometheus]# kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
alertmanager-6b5bbd5bd4-lgjn8           2/2     Running   0          95s
coredns-5b8c57999b-z9jh8                1/1     Running   1          20d
grafana-0                               1/1     Running   3          2d22h
kube-state-metrics-f86fd9f4f-j4rdc      2/2     Running   0          3h2m
kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          20d
prometheus-0                            2/2     Running   0          4d
```
Configure Prometheus to talk to Alertmanager
```
[root@localhost prometheus]# vim prometheus-configmap.yaml
...
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
...
[root@localhost prometheus]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured
```
Configure alerting rules
Point Prometheus at the rules directory in prometheus-configmap.yaml.
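A minimal sketch of that setting inside the prometheus.yml section of prometheus-configmap.yaml, assuming the rules ConfigMap is mounted at /etc/config/rules as shown further below:

```yaml
rule_files:
- /etc/config/rules/*.rules
```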
Store the alert rules in a ConfigMap
```
[root@localhost prometheus]# cat prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
[root@localhost prometheus]# kubectl apply -f prometheus-rules.yaml
configmap/prometheus-rules created
```
Mount the ConfigMap into the container's rules directory
```
[root@localhost prometheus]# vim prometheus-statefulset.yaml
...
          volumeMounts:
          - name: config-volume
            mountPath: /etc/config
          - name: prometheus-data
            mountPath: /data
            subPath: ""
          - name: prometheus-rules
            mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
...
```
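After editing, re-apply the StatefulSet so the new rules volume is mounted:

```
kubectl apply -f prometheus-statefulset.yaml
```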
Add the Alertmanager alerting configuration
```
[root@localhost prometheus]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxxx@163.com'
      smtp_auth_username: 'xxxxx@163.com'
      smtp_auth_password: 'xxxxx'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "xxxxx@qq.com"
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@localhost prometheus]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config configured
```
Email alerts
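One way to verify the whole chain is to trip the InstanceDown rule defined above: stop node_exporter on one node, wait for the alert to stay pending for 1m plus Alertmanager's group_wait, and an email should arrive at the configured recipient.

```
systemctl stop node_exporter
# ...wait a few minutes, watch the Alerts page in the Prometheus UI, then restore it
systemctl start node_exporter
```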