Kubernetes Monitoring with Prometheus (Part 16)


Monitoring solutions

| Solution | Open source | Pros | Scope |
| cAdvisor + Heapster + InfluxDB + Grafana | Y | Simple | Container monitoring |
| cAdvisor/exporter + Prometheus + Grafana | Y | Good extensibility | Containers, applications, and hosts |

Prometheus + Grafana is the rising star among monitoring and alerting stacks.

Various exporters collect metrics along different dimensions and expose them in the format Prometheus understands; Prometheus pulls the data on a regular schedule, Grafana visualizes it, and AlertManager sends alerts when something goes wrong.

cAdvisor collects container and Pod performance metrics and exposes them on a /metrics endpoint for Prometheus to scrape.

prometheus-node-exporter collects host performance metrics and exposes them on a /metrics endpoint for Prometheus to scrape.

On the application side, processes inside the containers expose their own metrics (the application implements the metrics endpoint itself and adds the annotations agreed on with the platform; the platform then configures the Prometheus scrape based on those annotations), as sketched below.
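A minimal sketch of such an annotation convention, assuming the widely used prometheus.io/* keys; the exact keys are an assumption here and only take effect if the Prometheus scrape configuration is written to honor them:

# Hypothetical Pod opting in to scraping via the common prometheus.io/* annotations
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  annotations:
    prometheus.io/scrape: "true"    # opt this Pod in to scraping
    prometheus.io/port: "8080"      # port where the application serves metrics
    prometheus.io/path: "/metrics"  # metrics path (defaults to /metrics)
spec:
  containers:
  - name: demo-app
    image: demo-app:latest          # placeholder image
    ports:
    - containerPort: 8080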

kube-state-metrics collects state metrics for Kubernetes resource objects and exposes them on a /metrics endpoint for Prometheus to scrape.

etcd, kubelet, kube-apiserver, kube-controller-manager, and kube-scheduler each expose their own /metrics endpoint, which provides metrics about the Kubernetes control plane running on the node.
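As a quick sanity check, the apiserver's metrics can be pulled through kubectl (a minimal sketch; the other components listen on their own ports and may require authentication):

# Fetch the first few kube-apiserver metrics via the API
kubectl get --raw /metrics | head -n 20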

Implementation approach

| Metric category | Implementation | Examples |
| Pod performance | cAdvisor | Container CPU and memory utilization |
| Node performance | node-exporter | Node CPU and memory utilization |
| K8s resource objects | kube-state-metrics | Pod/Deployment/Service status |

Deploying Prometheus in Kubernetes

Official site: https://prometheus.io

Download the YAML manifests: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
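One way to fetch just those manifests (a sketch assuming git is available locally):

# Clone the Kubernetes repo and work from the Prometheus addon directory
git clone --depth 1 https://github.com/kubernetes/kubernetes.git
cd kubernetes/cluster/addons/prometheus
ls *.yaml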

Modify the YAML files

# Use the NFS storage class
[root@localhost prometheus]# kubectl get storageclass
NAME                  PROVISIONER      AGE
managed-nfs-storage   fuseim.pri/ifs   9d
[root@localhost prometheus]# sed -i s/standard/managed-nfs-storage/ prometheus-statefulset.yaml

# Change the Service to type NodePort
[root@localhost prometheus]# vim prometheus-service.yaml
...
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
  selector:
    k8s-app: prometheus

Start Prometheus

[root@localhost prometheus]# kubectl apply -f prometheus-rbac.yaml 
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
[root@localhost prometheus]# kubectl apply -f prometheus-configmap.yaml 
configmap/prometheus-config created
[root@localhost prometheus]# kubectl apply -f prometheus-statefulset.yaml 
statefulset.apps/prometheus created
[root@localhost prometheus]# vim prometheus-service.yaml 
[root@localhost prometheus]# kubectl apply -f prometheus-service.yaml
service/prometheus created

Check the result

[root@localhost prometheus]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5b8c57999b-z9jh8                1/1     Running   1          16d
pod/kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          16d
pod/prometheus-0                            2/2     Running   0          2m40s

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    16d
service/kubernetes-dashboard   NodePort    10.0.0.84    <none>        443:30001/TCP    16d
service/prometheus             NodePort    10.0.0.89    <none>        9090:41782/TCP   39s
[root@localhost prometheus]# kubectl get pv,pvc -n kube-system
NAME                                                                                                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS          REASON   AGE
persistentvolume/kube-system-prometheus-data-prometheus-0-pvc-0e92f36c-8d9e-11e9-b018-525400828c1f   16Gi       RWO            Delete           Bound    kube-system/prometheus-data-prometheus-0   managed-nfs-storage            25m

NAME                                                 STATUS   VOLUME                                                                              CAPACITY   ACCESS MODES   STORAGECLASS          AGE
persistentvolumeclaim/prometheus-data-prometheus-0   Bound    kube-system-prometheus-data-prometheus-0-pvc-0e92f36c-8d9e-11e9-b018-525400828c1f   16Gi       RWO            managed-nfs-storage   25m

Access the Prometheus UI through the NodePort (9090:41782 in the output above).

Deploy Grafana
[root@localhost prometheus]# cat grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
  - metadata:
      name: grafana-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30007
  selector:
    app: grafana
[root@localhost prometheus]# kubectl apply -f grafana.yaml
statefulset.apps/grafana created
service/grafana created
[root@localhost prometheus]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5b8c57999b-z9jh8                1/1     Running   1          17d
pod/grafana-0                               1/1     Running   0          45s
pod/kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          17d
pod/prometheus-0                            2/2     Running   0          25h

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/grafana                NodePort    10.0.0.78    <none>        80:30007/TCP     44s
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    17d
service/kubernetes-dashboard   NodePort    10.0.0.84    <none>        443:30001/TCP    17d
service/prometheus             NodePort    10.0.0.89    <none>        9090:41782/TCP   25h

Access the Grafana UI through NodePort 30007.

Monitoring Pods in the Kubernetes cluster

The kubelet on each node embeds cAdvisor and exposes its metrics endpoint, which provides performance metrics for every container running on that node.

Exposed endpoints:

http://NodeIP:10255/metrics/cadvisor (kubelet read-only port, if enabled)

https://NodeIP:10250/metrics/cadvisor (kubelet secure port, requires authentication)
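A quick manual check against the secure port (a sketch: it assumes a cluster that still auto-creates ServiceAccount token secrets, a prometheus ServiceAccount in kube-system with permission to reach the kubelet API, and -k to skip certificate verification):

# Read the prometheus ServiceAccount token
TOKEN=$(kubectl -n kube-system get secret \
  $(kubectl -n kube-system get sa prometheus -o jsonpath='{.secrets[0].name}') \
  -o jsonpath='{.data.token}' | base64 -d)

# Pull a few cAdvisor metrics from the kubelet secure port
curl -sk -H "Authorization: Bearer $TOKEN" https://NodeIP:10250/metrics/cadvisor | head -n 20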

Import a Grafana dashboard

https://grafana.com/grafana/download

Cluster resource monitoring dashboard ID: 3119

Monitoring Nodes in the Kubernetes cluster

Docs: https://prometheus.io/docs/guides/node-exporter/

GitHub: https://github.com/prometheus/node_exporter

Exporter list: https://prometheus.io/docs/instrumenting/exporters/

Deploy node_exporter on every node

wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz

tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
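To confirm the exporter is running, node_exporter serves metrics on port 9100 by default:

# node_exporter listens on :9100 by default
curl -s http://localhost:9100/metrics | head -n 10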

Modify prometheus-configmap.yaml to add a scrape job for the node exporters, then reapply it (see the sketch below).
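A minimal sketch of what that scrape job could look like; the job name, node IPs, and port are placeholder values, and the addon ConfigMap already contains other jobs, so merge rather than replace:

# Added under scrape_configs: in prometheus-configmap.yaml (illustrative values)
- job_name: kubernetes-nodes
  static_configs:
  - targets:
    - 192.168.0.11:9100   # replace with the real node IPs running node_exporter
    - 192.168.0.12:9100

After kubectl apply -f prometheus-configmap.yaml, the prometheus-0 pod's config-reload sidecar (the second container behind the 2/2 READY count) should pick up the change after a short delay.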

Check in the Prometheus UI whether the kubernetes-nodes targets are being collected.

Import a Grafana dashboard

Cluster resource monitoring dashboard ID: 9276

Monitoring Kubernetes resource objects

https://github.com/kubernetes/kube-state-metrics

kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects. It is not concerned with the health of the individual Kubernetes components, but rather with the health of the various objects inside the cluster, such as Deployments, Nodes, and Pods.
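Once those metrics are scraped, the object state can be queried directly in Prometheus. A couple of illustrative queries (the metric names come from kube-state-metrics; the label values are placeholders):

# Desired vs. available replicas of a Deployment
kube_deployment_spec_replicas{deployment="coredns"}
kube_deployment_status_replicas_available{deployment="coredns"}

# Number of Pods per phase across the cluster
sum(kube_pod_status_phase) by (phase)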

[root@localhost prometheus]# kubectl apply -f kube-state-metrics-rbac.yaml 
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
role.rbac.authorization.k8s.io/kube-state-metrics-resizer created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
rolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@localhost prometheus]# vim kube-state-metrics-deployment.yaml 
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-deployment.yaml
deployment.apps/kube-state-metrics created
configmap/kube-state-metrics-config created
[root@localhost prometheus]# kubectl apply -f kube-state-metrics-service.yaml 
service/kube-state-metrics created

Import a Grafana dashboard

Cluster resource monitoring dashboard ID: 6417

Deploying Alertmanager in Kubernetes

Deploy Alertmanager

[root@localhost prometheus]# sed -i s/standard/managed-nfs-storage/ alertmanager-pvc.yaml
[root@localhost prometheus]# kubectl apply -f  alertmanager-configmap.yaml 
configmap/alertmanager-config created
[root@localhost prometheus]# kubectl apply -f  alertmanager-pvc.yaml 
persistentvolumeclaim/alertmanager created
[root@localhost prometheus]# kubectl apply -f  alertmanager-deployment.yaml 
deployment.apps/alertmanager created
[root@localhost prometheus]# kubectl apply -f  alertmanager-service.yaml 
service/alertmanager created

[root@localhost prometheus]# kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
alertmanager-6b5bbd5bd4-lgjn8           2/2     Running   0          95s
coredns-5b8c57999b-z9jh8                1/1     Running   1          20d
grafana-0                               1/1     Running   3          2d22h
kube-state-metrics-f86fd9f4f-j4rdc      2/2     Running   0          3h2m
kubernetes-dashboard-644c96f9c6-bvw8w   1/1     Running   1          20d
prometheus-0                            2/2     Running   0          4d

Configure Prometheus to talk to Alertmanager

[root@localhost prometheus]# vim prometheus-configmap.yaml
...
    alerting:
      alertmanagers:
      - static_configs:
          - targets: ["alertmanager:80"]
[root@localhost prometheus]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured

Configure alerting

Point Prometheus at a rules directory (see the sketch below)
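A minimal sketch of the rule_files setting in the prometheus.yml held by prometheus-configmap.yaml; the /etc/config/rules path matches the volume mount added further down, and whether the addon ConfigMap already carries this line depends on its version:

rule_files:
- /etc/config/rules/*.rules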

Store the alert rules in a ConfigMap

[root@localhost prometheus]# cat prometheus-rules.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error 
        annotations:
          summary: "Instance {{ $labels.instance }} 停止工作"
          description: "{{ $labels.instance }} job {{ $labels.job }} 已經停止5分鍾以上."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80 
        for: 1m
        labels:
          severity: warning 
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分區使用率過高"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分區使用大於80% (當前值: {{ $value }})"

      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} 內存使用率過高"
          description: "{{ $labels.instance }}內存使用大於80% (當前值: {{ $value }})"

      - alert: NodeCPUUsage    
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU使用率過高"       
          description: "{{ $labels.instance }}CPU使用大於60% (當前值: {{ $value }})"

[root@localhost prometheus]# kubectl apply -f prometheus-rules.yaml
configmap/prometheus-rules created
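Before applying, the rule syntax can be validated with promtool (a sketch; it assumes promtool is installed locally and that the two groups are saved into standalone files named general.rules and node.rules):

# Validate rule files locally before loading them into Prometheus
promtool check rules general.rules node.rules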

Mount the ConfigMap into the container's rules directory

[root@localhost prometheus]# vim prometheus-statefulset.yaml
......
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
            - name: prometheus-rules
              mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
......

Add the alerting configuration to Alertmanager

[root@localhost prometheus]# cat alertmanager-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global: 
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxxx@163.com'
      smtp_auth_username: 'xxxxx@163.com'
      smtp_auth_password: 'xxxxx'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "xxxxx@qq.com"
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m

[root@localhost prometheus]# kubectl apply -f alertmanager-configmap.yaml
configmap/alertmanager-config configured
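To verify the whole path without waiting for a real failure, a synthetic alert can be pushed into Alertmanager with amtool, which ships in the Alertmanager release tarball (a sketch; the URL is a placeholder for the alertmanager Service address):

# Fire a test alert; an email should arrive at the default receiver
amtool alert add TestAlert severity=warning instance=test \
  --alertmanager.url=http://<alertmanager-service-ip>:80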

Email alerts

 

