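This setup implements: Grafana dashboards, Prometheus monitoring and alerting, and DingTalk alert notifications.
1. Understand what each service does
- Prometheus: an open-source system monitoring and alerting framework, inspired by Google's Borgmon monitoring system.
- AlertManager: handles alerts sent by client applications (such as the Prometheus server). It deduplicates and groups them, routes them to the correct receiver integration, and also handles silencing and inhibition of alerts.
- Node_Exporter: an exporter that collects resource metrics from a node; deploy it on every node that Prometheus monitors.
- prometheus-webhook-dingtalk: the DingTalk alerting plugin.
- grafana: monitoring visualization.
2. Create the prometheus directory to hold all monitoring files, plus machine information
There is only one server, 10.1.1.10, and it hosts every service. To monitor more machines, add a job to the configuration file and start a Node_Exporter service on each monitored host.
mkdir -p /data/prometheus  # all operations below are run from inside this directory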
cd /data/prometheus
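3. Create the Prometheus configuration file and data directory, which Prometheus reads at startup
mkdir -p prometheus/data  # relative to /data/prometheus, matching the docker-compose mounts in step 7
chmod 777 prometheus/data  # data directory for Prometheus; the container user must be able to write to it
vim prometheus/prometheus.yml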
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules
  scrape_timeout: 10s       # timeout for each scrape

# scrape configuration list
scrape_configs:
  - job_name: prometheus    # required; attached automatically as the job label, must be unique
    static_configs:
      - targets: ['10.1.1.10:9090']  # Prometheus IP and port
        labels:
          instance: prometheus       # label
  - job_name: ehospital-exploit-database
    static_configs:
      - targets: ['10.1.1.10:9100']
        labels:
          instance: ehospital-exploit-database

alerting:  # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - 10.1.1.10:9093  # address of the alerting module

rule_files:  # alert rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"
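Step 2 notes that monitoring an additional machine only needs one more job in this file. As a minimal sketch, the helper below prints such a stanza in the same shape as the jobs above. The function name scrape_job and the example host 10.1.1.11 are hypothetical, and the default port 9100 assumes the Node_Exporter port used in this article.

```shell
#!/bin/sh
# scrape_job NAME HOST [PORT]
# Prints a static scrape job for prometheus.yml, indented to sit under
# the scrape_configs: key. NAME is used as both the job name and the
# instance label, mirroring the jobs shown above.
# PORT defaults to 9100 (Node_Exporter).
scrape_job() {
  name=$1
  host=$2
  port=${3:-9100}
  printf '  - job_name: %s\n' "$name"
  printf '    static_configs:\n'
  printf "      - targets: ['%s:%s']\n" "$host" "$port"
  printf '        labels:\n'
  printf '          instance: %s\n' "$name"
}
```

For example, `scrape_job web-01 10.1.1.11` prints a block that can be pasted under scrape_configs. The prometheus container then has to be restarted so that the new job is picked up.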
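4. Create the alert rule file and the recording rule file, which the Prometheus configuration loads
4.1:
mkdir rules  # create the rules directory first
vim rules/alert-rules.yml  # general-purpose alert rules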
groups:
  - name: prometheus-alert
    rules:
      - alert: prometheus-down
        expr: prometheus:up == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-high
        expr: prometheus:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage is above {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has stayed above 80% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-iowait-high
        expr: prometheus:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait is above {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has stayed above 12% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      # note: prometheus:cpu:count is not defined in record-rules.yml; this rule returns no data until it is recorded
      - alert: prometheus-load-load1-high
        expr: (prometheus:load:load1) > (prometheus:cpu:count) * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-memory-high
        expr: prometheus:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} memory usage is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-high
        expr: prometheus:disk:used:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-count-high
        expr: prometheus:disk:read:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-write-count-high
        expr: prometheus:disk:write:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-mb-high
        expr: prometheus:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read throughput is above {{ $value }}"
          description: ""
          instance: "{{ $labels.instance }}"
          value: "{{ $value }}"
      - alert: prometheus-disk-write-mb-high
        expr: prometheus:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write throughput is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-filefd-allocated-percent-high
        expr: prometheus:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptors are above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-error-rate-high
        expr: prometheus:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-packet-rate-high
        expr: prometheus:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netout-packet-rate-high
        expr: prometheus:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-tcp-total-count-high
        expr: prometheus:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      # note: prometheus:process:zoom:total:count is not defined in record-rules.yml
      - alert: prometheus-process-zoom-total-count-high
        expr: prometheus:process:zoom:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count is above {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      # note: prometheus:time:offset is not defined in record-rules.yml
      - alert: prometheus-time-offset-high
        expr: prometheus:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
vim rules/record-rules.yml
groups:
  - name: prometheus-record
    rules:
      - expr: up{job!="prometheus"}
        record: prometheus:up
        labels:
          desc: "whether the node is online: 1 online, 0 offline"
          unit: " "
          job: "prometheus"
      - expr: time() - node_boot_time_seconds{}
        record: prometheus:node_uptime
        labels:
          desc: "node uptime"
          unit: "s"
          job: "prometheus"
      ##########################################################
      # cpu
      - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:total:percent
        labels:
          desc: "total CPU usage percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:idle:percent
        labels:
          desc: "CPU idle percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m]))) * 100
        record: prometheus:cpu:iowait:percent
        labels:
          desc: "CPU iowait percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m]))) * 100
        record: prometheus:cpu:system:percent
        labels:
          desc: "CPU system percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m]))) * 100
        record: prometheus:cpu:user:percent
        labels:
          desc: "CPU user percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
        record: prometheus:cpu:other:percent
        labels:
          desc: "CPU percentage of the node spent in other modes"
          unit: "%"
          job: "prometheus"
      ##########################################################
      # memory
      - expr: node_memory_MemTotal_bytes{job!="prometheus"}
        record: prometheus:memory:total
        labels:
          desc: "total memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:free
        labels:
          desc: "free memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:used
        labels:
          desc: "used memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
        record: prometheus:memory:actualused
        labels:
          desc: "memory actually in use on the node (total minus available)"
          unit: byte
          job: "prometheus"
      - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:used:percent
        labels:
          desc: "memory usage percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:free:percent
        labels:
          desc: "remaining memory percentage of the node"
          unit: "%"
          job: "prometheus"
      ##########################################################
      # load
      - expr: sum by (instance) (node_load1{job!="prometheus"})
        record: prometheus:load:load1
        labels:
          desc: "1-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load5{job!="prometheus"})
        record: prometheus:load:load5
        labels:
          desc: "5-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load15{job!="prometheus"})
        record: prometheus:load:load15
        labels:
          desc: "15-minute system load"
          unit: " "
          job: "prometheus"
      ##########################################################
      # disk
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:total
        labels:
          desc: "total disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:free
        labels:
          desc: "remaining disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:used
        labels:
          desc: "used disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:disk:used:percent
        labels:
          desc: "disk usage percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:read:count:rate
        labels:
          desc: "disk read operation rate of the node"
          unit: "ops/s"
          job: "prometheus"
      - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:write:count:rate
        labels:
          desc: "disk write operation rate of the node"
          unit: "ops/s"
          job: "prometheus"
      # the read rule uses the read counter and the write rule the written counter (the original had them swapped)
      - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:read:mb:rate
        labels:
          desc: "device read rate of the node in MB"
          unit: "MB/s"
          job: "prometheus"
      - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:write:mb:rate
        labels:
          desc: "device write rate of the node in MB"
          unit: "MB/s"
          job: "prometheus"
      ##########################################################
      # filesystem
      - expr: (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:filesystem:used:percent
        labels:
          desc: "inode usage percentage of the node"
          unit: "%"
          job: "prometheus"
      ##########################################################
      # filefd
      - expr: node_filefd_allocated{job!="prometheus"}
        record: prometheus:filefd_allocated:count
        labels:
          desc: "number of open file descriptors on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
        record: prometheus:filefd_allocated:percent
        labels:
          desc: "percentage of file descriptors in use on the node"
          unit: "%"
          job: "prometheus"
      ##########################################################
      # network
      # the *_bytes_total counters are in bytes, so multiply by 8 to match the bit/s unit
      - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netin:bit:rate
        labels:
          desc: "bits received per second, per NIC"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netout:bit:rate
        labels:
          desc: "bits sent per second, per NIC"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:packet:rate
        labels:
          desc: "packets received per second, per NIC"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:packet:rate
        labels:
          desc: "packets sent per second, per NIC"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:error:rate
        labels:
          desc: "receive errors detected by the device driver"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:error:rate
        labels:
          desc: "transmit errors detected by the device driver"
          unit: "packets/s"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="established"}
        record: prometheus:network:tcp:established:count
        labels:
          desc: "number of established connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
        record: prometheus:network:tcp:timewait:count
        labels:
          desc: "number of time_wait connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
        record: prometheus:network:tcp:total:count
        labels:
          desc: "total number of TCP connections on the node"
          unit: "count"
          job: "prometheus"
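Every alert expression in alert-rules.yml reads a recorded series named prometheus:..., so an alert whose series is never recorded silently returns no data. The sketch below cross-checks the two files with grep. The function name missing_records is hypothetical, and it assumes each expr and record sits on a single line.

```shell
#!/bin/sh
# missing_records ALERT_FILE RECORD_FILE
# Prints every prometheus:* series name used in an "expr:" line of
# ALERT_FILE that has no matching "record:" line in RECORD_FILE,
# one name per line.
missing_records() {
  alert_file=$1
  record_file=$2
  grep -E '^[[:space:]]*expr:' "$alert_file" |
    grep -oE 'prometheus(:[a-z0-9_]+)+' |
    sort -u |
    while read -r name; do
      # keep the name only when no record: line defines exactly this series
      grep -qE "record:[[:space:]]*${name}[[:space:]]*\$" "$record_file" ||
        printf '%s\n' "$name"
    done
}
```

Run as `missing_records rules/alert-rules.yml rules/record-rules.yml`. For the rules in this article it should report prometheus:cpu:count, prometheus:process:zoom:total:count, and prometheus:time:offset, which the recording rules above do not define.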
5. Create the Grafana data directory and configuration file, used by Grafana to store its data
mkdir -p grafana/grafana-storage
chmod 777 grafana/grafana-storage
The grafana.ini configuration file can be copied out of a Grafana container.
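One way to copy the file out, as a sketch: create a throwaway container from the image without starting it, copy the file with docker cp, then remove the container. The function name copy_grafana_ini and the container name grafana-ini-tmp are hypothetical. The source path /etc/grafana/grafana.ini is the same path that the docker-compose file in step 7 mounts over.

```shell
#!/bin/sh
# copy_grafana_ini DEST_DIR
# Creates (without starting) a temporary container from grafana/grafana,
# copies its default /etc/grafana/grafana.ini into DEST_DIR, and then
# removes the temporary container whether or not the copy succeeded.
copy_grafana_ini() {
  dest_dir=$1
  mkdir -p "$dest_dir" &&
    docker create --name grafana-ini-tmp grafana/grafana >/dev/null &&
    docker cp grafana-ini-tmp:/etc/grafana/grafana.ini "$dest_dir/grafana.ini"
  status=$?
  docker rm grafana-ini-tmp >/dev/null 2>&1
  return "$status"
}
```

From /data/prometheus, `copy_grafana_ini grafana` leaves the file at grafana/grafana.ini, where the compose file expects it.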
6. Create the Alertmanager configuration, used to send alerts to the webhook
mkdir alert
vim alert/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  group_by: [alertname]
  routes:
    - receiver: webhook
      group_wait: 10s
receivers:
  - name: webhook
    webhook_configs:
      - url: http://10.1.1.10:8060/dingtalk/webhook1/send
        send_resolved: true
The url field points to the address of the webhook service.
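To exercise this route without waiting for a real rule to fire, a hand-made alert can be posted to Alertmanager's /api/v2/alerts endpoint. The sketch below assumes curl is installed. The function names, the alert name manual-test, and the instance demo-host are hypothetical.

```shell
#!/bin/sh
# test_alert_json ALERTNAME INSTANCE
# Prints a one-element JSON array in the shape accepted by Alertmanager's
# POST /api/v2/alerts endpoint. Arguments are inserted verbatim, so they
# must not contain double quotes or backslashes.
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"info"},' "$1" "$2"
  printf '"annotations":{"summary":"manual test alert"}}]\n'
}

# send_test_alert HOST ALERTNAME INSTANCE
# Posts the payload to Alertmanager on port 9093, the port published in
# this article's docker-compose.yml.
send_test_alert() {
  test_alert_json "$2" "$3" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "http://$1:9093/api/v2/alerts"
}
```

Once the services from step 7 are running, `send_test_alert 10.1.1.10 manual-test demo-host` should produce a DingTalk message after the route's group_wait has elapsed.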
7. Write the docker-compose yml that starts the services
vim docker-compose.yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus
    restart: "always"
    ports:
      - "9090:9090"
    container_name: "prometheus"
    volumes:
      - "./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "./rules:/etc/prometheus/rules"
      - "./prometheus/data:/prometheus"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'  # config file path; matches the mount above
      - '--storage.tsdb.path=/prometheus'               # data path; matches the mount above

  # alerting module
  alertmanager:
    image: prom/alertmanager:latest
    restart: "always"
    ports:
      - "9093:9093"
    container_name: "alertmanager"
    volumes:
      - "./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml"

  # DingTalk plugin
  webhook:
    image: timonwong/prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - "8060:8060"
    container_name: "webhook"
    # the token identifies the DingTalk robot; replace the URL below with your DingTalk robot address
    # note: newer releases of this image may expect a config file instead of --ding.profile; pin an image tag that accepts this flag
    command:
      - '--ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=*'

  # web UI
  grafana:
    image: grafana/grafana
    restart: "always"
    ports:
      - "3000:3000"
    container_name: "grafana"
    volumes:
      - "./grafana/grafana.ini:/etc/grafana/grafana.ini"  # copy this config file out yourself
      - "./grafana/grafana-storage:/var/lib/grafana"
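Before wiring the robot address into the webhook service, it can help to confirm that the address itself works. The sketch below posts a plain text message straight to a DingTalk custom robot, assuming curl is installed. The function names are hypothetical, and the robot's own security settings (for example a required keyword) may still reject the message.

```shell
#!/bin/sh
# dingtalk_text_json MESSAGE
# Prints the JSON body of a DingTalk custom-robot text message.
# MESSAGE is inserted verbatim, so it must not contain double quotes
# or backslashes.
dingtalk_text_json() {
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$1"
}

# send_dingtalk_text ROBOT_URL MESSAGE
# ROBOT_URL is the full robot address, including its access_token.
send_dingtalk_text() {
  dingtalk_text_json "$2" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$1"
}
```

Call it with your own robot address as the first argument, quoted so that the shell does not interpret the `?` in the URL.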
7.2 Start
docker-compose -f docker-compose.yml up -d
8. Create node-exporter-compose.yml to start the collector service
vim node-exporter-compose.yml
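version: '3.2'
services:
  node-exporter:
    image: prom/node-exporter
    restart: "always"
    ports:
      - "9100:9100"
    container_name: "node-exporter"
    volumes:
      - "/proc:/host/proc:ro"
      - "/sys:/host/sys:ro"
      - "/:/rootfs:ro"
    command:  # point the exporter at the host paths mounted above
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

Save the file, then start it:
docker-compose -f node-exporter-compose.yml up -d
For each additional machine, create one copy of this file on that machine and start it. The local machine can be monitored the same way.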
9. Verify
docker ps -a  # check that the containers are running
netstat -nltp  # check that the ports are listening
Open ip:9090 in a browser.
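The same check can be scripted. As a sketch, the helper below requests one endpoint per service on the ports published in this article: the /-/healthy endpoints of Prometheus and Alertmanager, Grafana's /api/health, and Node_Exporter's /metrics. The function names are hypothetical, and curl is assumed to be installed.

```shell
#!/bin/sh
# health_urls HOST
# Prints one URL per service, using the ports published in this article.
health_urls() {
  printf 'http://%s:9090/-/healthy\n' "$1"   # Prometheus
  printf 'http://%s:9093/-/healthy\n' "$1"   # Alertmanager
  printf 'http://%s:3000/api/health\n' "$1"  # Grafana
  printf 'http://%s:9100/metrics\n' "$1"     # Node_Exporter
}

# check_stack HOST
# Requests each URL and prints OK or FAIL next to it.
check_stack() {
  health_urls "$1" | while read -r url; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      printf 'OK   %s\n' "$url"
    else
      printf 'FAIL %s\n' "$url"
    fi
  done
}
```

`check_stack 10.1.1.10` prints one line per service. A FAIL line usually means the container is not running or the port is blocked.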
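10. Configure Grafana
Result preview
# Download a monitoring dashboard template from the official site.
Plugin URL:
That completes the deployment. Thanks for reading; if you repost, please credit this article.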