Prometheus 告警規則

本文轉載自查看原文 2021-12-08 21:57 2388

Prometheus 告警規則

Prometheus 告警規則概念：

警報規則允許您根據 Prometheus 表達式語言表達式定義警報條件，並將有關觸發警報的通知發送到外部服務。每當警報表達式在給定的時間點產生一個或多個向量元素時，警報對於這些元素的標簽集算作活動。
類似於記錄規則, 告警規則(Alerting rule) 也定義在獨立的文件中, 而后由 Prometheus 在 rule_files 配置段中加載配置如下：

rule_files:
  - alerting_rules/*.yml          # 告警規則文件路徑

Prometheus指標含義：

選項	含義
- group	配置頂級，用於定義一個監控組
- name	規則名稱
- rules	規則
- alert	告警規則名稱
- expr	表達式基於PromQL表達式告警觸發條件，用於計算是否有時間序列滿足該條件
- for	評估等待時間，在等待時間狀態為pending，滿足時長告警狀態為firing，恢復則為inactive狀態
- lables	自定義標簽
- annotations	用於指定一組附加信息，比如用於描述告警詳細信息的文字等，annotations的內容在告警產生時會一同作為參數發送到Alertmanager。summary描述告警的概要信息，description用於描述告警的詳細信息。同時Alertmanager的UI也會根據這兩個標簽值，顯示告警信息。

1.1Prometheus 自我監控（25 條規則）

1.1.1 Prometheus 自我監控模板：

- alert: PrometheusJobMissing               // 告警規則名稱：
    expr: absent(up{job="prometheus"})      // 匹配規則,表達式：
    for: 0m                                 // 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:                                 // 定義當前告警規則級別
      severity: warning                     // 指定告警級別
    annotations:                            // 注釋 告警通知
	//調用標簽，具體指附加通知信息
      summary: Prometheus job missing (instance {{ $labels.instance }})     //自定義摘要
      description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"                     //自定義具體描述

1.1.2 Prometheus 目標丟失模板：

- alert: PrometheusTargetMissing           // 告警規則名稱
    expr: up == 0                          // 匹配規則,表達式
    for: 0m                                // 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:                                // 定義當前告警規則級別
      severity: critical                   // 指定告警級別
    annotations:                           // 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus target missing (instance {{ $labels.instance }})  // 自定義摘要 
      description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}" // 自定義具體描述

1.1.3 Prometheus 所有目標丟失模板：

  - alert: PrometheusAllTargetsMissing  // 告警規則名稱
    expr: count by (job) (up) == 0      // 匹配規則,表達式
    for: 0m                             // 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:                             // 定義當前告警規則級別
      severity: critical                // 指定告警級別
    annotations:                        // 注釋 告警通知
	  // 調用標簽具體指附加通知信息
      summary: Prometheus all targets missing (instance {{ $labels.instance }})  // 自定義摘要
      description: "A Prometheus job does not have living target anymore.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  // 自定義具體描述

1.1.4 Prometheus 配置重載失敗模板：

  - alert: PrometheusConfigurationReloadFailure           // 告警規則名稱
    expr: prometheus_config_last_reload_successful != 1   // 匹配規則,表達式
    for: 0m                                               // 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:                                               // 定義當前告警規則級別
      severity: warning                                   // 指定告警級別
    annotations:                                          // 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus configuration reload failure (instance {{ $labels.instance }})  // 自定義摘要
      description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  // 自定義具體描述

1.1.5 Prometheus 重啟太多模板：

- alert: PrometheusTooManyRestarts                      // 告警規則名稱
    expr:                           		  	//匹配規則,表達式changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
    for: 0m                                     // 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:                                     // 定義當前告警規則級別
      severity: warning							// 指定告警級別
    annotations:								 // 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus too many restarts (instance {{ $labels.instance }})		 // 自定義摘要
      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"		// 自定義具體描述

1.1.6 Prometheus AlertManager 配置重載失敗模板：

 - alert: PrometheusAlertmanagerConfigurationReloadFailure		// 告警規則名稱
    expr: alertmanager_config_last_reload_successful != 1		// 匹配規則,表達式
    for: 0m			// 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:			// 定義當前告警規則級別
      severity: warning			// 指定告警級別
    annotations:		// 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})		// 自定義摘要
      description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"		// 自定義具體描述

1.1.7 Prometheus AlertManager 配置未同步模板：

- alert: PrometheusAlertmanagerConfigNotSynced		// 告警規則名稱
    expr: count(count_values("config_hash", alertmanager_config_hash)) > 1		// 匹配規則,表達式
    for: 0m		// 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:		// 定義當前告警規則級別
      severity: warning		// 指定告警級別
    annotations:		// 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})	// 自定義摘要
      description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"	// 自定義具體描述

1.1.8 Prometheus AlertManager E2E dead man switch模板：

- alert: PrometheusAlertmanagerE2eDeadManSwitch	// 告警規則名稱
    expr: vector(1)	// 匹配規則,表達式
    for: 0m	// 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:	// 定義當前告警規則級別
      severity: critical	// 指定告警級別
    annotations:	// 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})	// 自定義摘要
      description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"	// 自定義具體描述

1.1.9 Prometheus 無法連接alertmanager模板：

  - alert: PrometheusNotConnectedToAlertmanager	// 告警規則組名稱
    expr: prometheus_notifications_alertmanagers_discovered < 1	// 匹配規則,表達式
    for: 0m	// 檢測持續時間,表示持續一分鍾獲取不到信息，則觸發報警。0表示不使用持續時間
    labels:	// 定義當前告警規則級別
      severity: critical	// 指定告警級別
    annotations:	// 注釋 告警通知
	// 調用標簽具體指附加通知信息
      summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})	// 自定義摘要
      description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"	// 自定義具體描述

1.1.10 Prometheus 規則評估失敗模板：

 - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.11 Prometheus 規則評估緩慢模板：

- alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.12 Prometheus 通知隊列積壓模板：

- alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus notifications backlog (instance {{ $labels.instance }})
      description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.13 Prometheus AlertManager 通知失敗

- alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
      description: "Alertmanager is failing sending notifications\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.14 Prometheus 目標為空模板：

- alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target empty (instance {{ $labels.instance }})
      description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.15 Prometheus目標抓取緩慢模板：

- alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scraping slow (instance {{ $labels.instance }})
      description: "Prometheus is scraping exporters slowly\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.16 Prometheus large scrape模板：

- alert: PrometheusLargeScrape
    expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus large scrape (instance {{ $labels.instance }})
      description: "Prometheus has many scrapes that exceed the sample limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.17 Prometheus目標抓取重復模板：

- alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.18 Prometheus TSDB 檢查點創建失敗模板：

 - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.19 Prometheus TSDB 檢查點刪除失敗模板：

 - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.20 Prometheus TSDB 壓縮失敗模板：

- alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.21 Prometheus TSDB 頭部截斷失敗模板：

 - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.22 Prometheus TSDB 重新加載失敗模板：

 - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.23 普羅米修斯 TSDB WAL 損壞模板：

 - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.24 Prometheus TSDB WAL 截斷失敗模板：

 - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2 主機和硬件：節點導出器（33條規則）

1.2.1. 主機內存不足模板：

節點內存已滿（剩余< 10%）
- alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.2 內存壓力下的主機內存模板：

節點內存壓力很大。主要頁面錯誤率高
 - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host memory under memory pressure (instance {{ $labels.instance }})
      description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.3 主機異常網絡吞吐量模板：

主機網絡接口可能接收太多數據 (> 100 MB/s)
 - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.4 主機異常網絡吞吐量模板：

主機網絡接口可能發送過多數據（> 100 MB/s）
 - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out (instance {{ $labels.instance }})
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.5 主機異常磁盤讀取率模板：

磁盤可能正在讀取太多數據（> 50 MB/s）
 - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate (instance {{ $labels.instance }})
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.6 主機異常磁盤寫入率模板：

磁盤可能正在寫入太多數據（> 50 MB/s）
- alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate (instance {{ $labels.instance }})
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.7 主機磁盤空間不足模板：

磁盤幾乎已滿（還剩 < 10%）
 - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.8 主機磁盤將在 24 小時內填滿模板：

以當前寫入速率，預計文件系統將在未來 24 小時內耗盡空間
- alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.9 主機不足 inode模板：

磁盤幾乎用完了可用的 inode（剩余 < 10%）
- alert: HostOutOfInodes
    expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes (instance {{ $labels.instance }})
      description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.10 主機 inode 將在 24 小時內填滿模板：

以當前寫入速率，預計文件系統將在未來 24 小時內耗盡 inode
 - alert: HostInodesWillFillIn24Hours
    expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{mountpoint="/rootfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.11 主機異常磁盤讀取延遲模板：

磁盤延遲增加（讀取操作 > 100 毫秒）
- alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.12 主機異常磁盤寫入延遲模板：

磁盤延遲增加（寫入操作 > 100 毫秒）
- alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.13 主機 CPU 負載高模板：

CPU 負載 > 80%
 - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.14 主機 CPU 竊取模板：

CPU 竊取率 > 10%
- alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.15 主機上下文切換模板：

節點上的上下文切換正在增長（> 1000 / s）
- alert: HostContextSwitching
    expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host context switching (instance {{ $labels.instance }})
      description: "Context switching is growing on node (> 1000 / s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.16. 主機交換已滿模板：

掉期已滿 (>80%)
 - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host swap is filling up (instance {{ $labels.instance }})
      description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.17. 主機 systemd 服務崩潰模板：

systemd 服務崩潰
- alert: HostSystemdServiceCrashed
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host systemd service crashed (instance {{ $labels.instance }})
      description: "systemd service crashed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.18 主機物理組件過熱模板：

systemd 服務崩潰
- alert: HostPhysicalComponentTooHot
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host physical component too hot (instance {{ $labels.instance }})
      description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.19. 主機節點超溫告警模板：

物理節點溫度告警觸發
 - alert: HostNodeOvertemperatureAlarm
    expr: node_hwmon_temp_crit_alarm_celsius == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host node overtemperature alarm (instance {{ $labels.instance }})
      description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.20 主機REID陣列處於非活動狀態模板：

由於一個或多個磁盤故障，RAID 陣列 {{ $labels.device }} 處於降級狀態。備用驅動器的數量不足以自動修復問題。
- alert: HostRaidArrayGotInactive
    expr: node_md_state{state="inactive"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host RAID array got inactive (instance {{ $labels.instance }})
      description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.21 主機RAID磁盤故障模板：

{{ $labels.instance }} 上的 RAID 陣列中至少有一個設備出現故障。數組 {{ $labels.md_device }} 需要注意，可能需要磁盤交換
- alert: HostRaidDiskFailure
    expr: node_md_disks{state="failed"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.22 主機內核版本偏差模板：

不同的內核版本正在運行
- alert: HostKernelVersionDeviations
    expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: Host kernel version deviations (instance {{ $labels.instance }})
      description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.23 檢測到主機 OOM 終止模板：

檢測到 OOM 殺死
 - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.24 檢測到主機 EDAC 可糾正錯誤模板：

在過去 5 分鍾內，主機 {{ $labels.instance }} 有 {{ printf "%.0f" $value }} 由 EDAC 報告的可糾正內存錯誤。
 - alert: HostEdacCorrectableErrorsDetected
    expr: increase(node_edac_correctable_errors_total[1m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.25. 檢測到主機 EDAC 無法糾正的錯誤模板：

在過去 5 分鍾內，主機 {{ $labels.instance }} 有 {{ printf "%.0f" $value }} 由 EDAC 報告的無法糾正的內存錯誤。
 - alert: HostEdacUncorrectableErrorsDetected
    expr: node_edac_uncorrectable_errors_total > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.26 主機網絡接收錯誤模板：

主機 {{ $labels.instance }} interface {{ $labels.device }} 在過去兩分鍾內遇到了 {{ printf "%.0f" $value }} 接收錯誤。
- alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Receive Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.27 主機網絡接口飽和模板：

“{{ $labels.instance }}”上的網絡接口“{{ $labels.device }}”正在過載。
- alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Host Network Interface Saturated (instance {{ $labels.instance }})
      description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.28 主機網絡綁定降級模板：

綁定“{{ $labels.device }}”在“{{ $labels.instance }}”上降級。
- alert: HostNetworkBondDegraded
    expr: (node_bonding_active - node_bonding_slaves) != 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Bond Degraded (instance {{ $labels.instance }})
      description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.29 主機連接限制模板：

conntrack數量接近極限
- alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host conntrack limit (instance {{ $labels.instance }})
      description: "The number of conntrack is approaching limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.30 主機時鍾偏差模板：

檢測到時鍾偏差。時鍾不同步。確保在此主機上正確配置了 NTP。
- alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock skew (instance {{ $labels.instance }})
      description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.31 模板主機時鍾不同步:

時鍾不同步。確保在此主機上配置了 NTP。
- alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising (instance {{ $labels.instance }})
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.32 主機需要重啟模板：

{{ $labels.instance }} 需要重新啟動。
 - alert: HostRequiresReboot
    expr: node_reboot_required > 0
    for: 4h
    labels:
      severity: info
    annotations:
      summary: Host requires reboot (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} requires a reboot.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3 Docker 容器： google/cAdvisor （7 條規則）

1.3.1 容器--被殺死模板：

一個容器消失了
 - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Container killed (instance {{ $labels.instance }})
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.2 容器--不存在模板：

容器不存在---5分鍾
- alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Container absent (instance {{ $labels.instance }})
      description: "A container is absent for 5 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.3 容器--CPU使用率模板：

容器CPU使用率在80%以上
- alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container CPU usage (instance {{ $labels.instance }})
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.4 容器--內存使用模板：

容器內存使用率在80%以上
- alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.5 容器--卷使用模板：

Container Volume 使用率超過 80%
 - alert: ContainerVolumeUsage
    expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume usage (instance {{ $labels.instance }})
      description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.6 容器--卷 IO 使用情況模板：

Container Volume IO 使用率在 80% 以上
 - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume IO usage (instance {{ $labels.instance }})
      description: "Container Volume IO usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.7 容器--高節流率模板：

容器被限制
- alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container high throttle rate (instance {{ $labels.instance }})
      description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4 黑盒： prometheus/blackbox_exporter （8 條規則）

1.4.1. 黑盒探測失敗模板：

探測失敗
 - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe failed (instance {{ $labels.instance }})
      description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.2 黑盒慢探針模板：

黑盒探測用了 1 秒完成
- alert: BlackboxSlowProbe
    expr: avg_over_time(probe_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox slow probe (instance {{ $labels.instance }})
      description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.3 黑盒探測 HTTP 失敗模板：

HTTP 狀態碼不是 200-399
- alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.4 Blackbox SSL 證書即將到期模板：

 - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.5 Blackbox SSL 證書即將到期模板：

SSL 證書 3 天后到期
 - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in 3 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.6 Blackbox SSL 證書已過期模板：

SSL 證書已過期
- alert: BlackboxSslCertificateExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
      description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.7 黑盒探測慢速 HTTP模板：

HTTP 請求耗時超過 1 秒
 - alert: BlackboxProbeSlowHttp
    expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
      description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.8 黑盒探測慢 ping模板：

Blackbox ping 耗時超過 1 秒
- alert: BlackboxProbeSlowPing
    expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox probe slow ping (instance {{ $labels.instance }})
      description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5 Windows 服務器： prometheus-community/windows_exporter （5 條規則）

1.5.1. Windows Server 收集器錯誤模板：

收集器 {{ $labels.collector }} 不成功
 - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Windows Server collector Error (instance {{ $labels.instance }})
      description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.2 Windows Server 服務狀態模板：

Windows 服務狀態不正常
- alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Windows Server service Status (instance {{ $labels.instance }})
      description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.3. Windows 服務器 CPU 使用率模板：

CPU使用率超過80%
- alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Windows Server CPU Usage (instance {{ $labels.instance }})
      description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.4. Windows Server 內存使用情況模板：

內存使用率超過 90%
 - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Windows Server memory Usage (instance {{ $labels.instance }})
      description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.5 Windows Server 磁盤空間使用情況模板：

磁盤使用率超過80%
- alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
      description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

VMware : pryorda/vmware_exporter （4 條規則）

1.6.1. 虛擬機內存警告模板：

{{ $labels.instance }} 上的高內存使用：{{ $value | printf "%.2f"}}%
 - alert: VirtualMachineMemoryWarning
    expr: vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Virtual Machine Memory Warning (instance {{ $labels.instance }})
      description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.6.2 虛擬機內存嚴重模板：

{{ $labels.instance }} 上的高內存使用：{{ $value | printf "%.2f"}}%
 - alert: VirtualMachineMemoryCritical
    expr: vmware_vm_mem_usage_average / 100 >= 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Virtual Machine Memory Critical (instance {{ $labels.instance }})
      description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.6.3 大量快照模板：

{{ $labels.instance }} 上的高快照數：{{ $value }}
- alert: HighNumberOfSnapshots
    expr: vmware_vm_snapshots > 3
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: High Number of Snapshots (instance {{ $labels.instance }})
      description: "High snapshots number on {{ $labels.instance }}: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.6.4 過時的快照模板：

{{ $labels.instance }} 上的過時快照：{{ $value | printf "%.0f"}} 天
- alert: OutdatedSnapshots
    expr: (time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Outdated Snapshots (instance {{ $labels.instance }})
      description: "Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \"%.0f\"}} days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7. Netdata：嵌入式導出器（9 條規則）

1.7.1. Netdata CPU占用率高模板：

Netdata 高 CPU 使用率 (> 80%)
 - alert: NetdataHighCpuUsage
    expr: rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m]) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high cpu usage (instance {{ $labels.instance }})
      description: "Netdata high CPU usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.2 主機 CPU 竊取模板：

CPU 竊取率 > 10%。嘈雜的鄰居正在扼殺 VM 性能，或者 Spot 實例可能信用不足。
- alert: HostCpuStealNoisyNeighbor
    expr: rate(netdata_cpu_cpu_percentage_average{dimension="steal"}[1m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.3 Netdata 內存占用高模板：

Netdata 高內存使用率 (> 80%)
- alert: NetdataHighMemoryUsage
    expr: 100 / netdata_system_ram_MB_average * netdata_system_ram_MB_average{dimension=~"free|cached"} < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high memory usage (instance {{ $labels.instance }})
      description: "Netdata high memory usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.4 Netdata 磁盤空間不足模板：

Netdata 磁盤空間不足 (> 80%)
- alert: NetdataLowDiskSpace
    expr: 100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~"avail|cached"} < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata low disk space (instance {{ $labels.instance }})
      description: "Netdata low disk space (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.5 Netdata 預測磁盤已滿模板：

Netdata 預測 24 小時內磁盤已滿
- alert: NetdataPredictedDiskFull
    expr: predict_linear(netdata_disk_space_GB_average{dimension=~"avail|cached"}[3h], 24 * 3600) < 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata predicted disk full (instance {{ $labels.instance }})
      description: "Netdata predicted disk full in 24 hours\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.6. Netdata MD 不匹配 cnt 未同步塊模板：

RAID 陣列有未同步的塊
- alert: NetdataMdMismatchCntUnsynchronizedBlocks
    expr: netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Netdata MD mismatch cnt unsynchronized blocks (instance {{ $labels.instance }})
      description: "RAID Array have unsynchronized blocks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.7. Netdata磁盤重新分配扇區模板：

磁盤上重新分配的扇區
- alert: NetdataDiskReallocatedSectors
    expr: increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Netdata disk reallocated sectors (instance {{ $labels.instance }})
      description: "Reallocated sectors on disk\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.8 Netdata磁盤當前掛起扇區模板：

磁盤當前掛起扇區
- alert: NetdataDiskCurrentPendingSector
    expr: netdata_smartd_log_current_pending_sector_count_sectors_average > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata disk current pending sector (instance {{ $labels.instance }})
      description: "Disk current pending sector\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.9 Netdata 報告無法糾正的磁盤扇區模板：

報告無法糾正的磁盤扇區
- alert: NetdataReportedUncorrectableDiskSectors
    expr: increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }})
      description: "Reported uncorrectable disk sectors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Prometheus alerts 各種告警規則 prometheus各種告警規則 prometheus各種告警規則 prometheus-告警規則 Prometheus 告警規則 prometheus告警規則分發服務 prometheus告警規則設置（二）【轉】 Prometheus 編寫告警規則案例 Prometheus中使用的告警規則 Prometheus自身的監控告警規則

Prometheus 告警規則

Prometheus 告警規則

Prometheus 告警規則概念：

Prometheus指標含義：

1.1Prometheus 自我監控 （25 條規則）

1.1.1 Prometheus 自我監控模板：

1.1.2 Prometheus 目標丟失模板：

1.1.3 Prometheus 所有目標丟失模板：

1.1.4 Prometheus 配置重載失敗模板：

1.1.5 Prometheus 重啟太多模板：

1.1.6 Prometheus AlertManager 配置重載失敗模板：

1.1.7 Prometheus AlertManager 配置未同步模板：

1.1.8 Prometheus AlertManager E2E dead man switch模板：

1.1.9 Prometheus 無法連接alertmanager模板：

1.1.10 Prometheus 規則評估失敗模板：

1.1.11 Prometheus 規則評估緩慢模板：

1.1.12 Prometheus 通知隊列積壓模板：

1.1.13 Prometheus AlertManager 通知失敗

1.1.14 Prometheus 目標為空模板：

1.1.15 Prometheus目標抓取緩慢模板：

1.1.16 Prometheus large scrape模板：

1.1.17 Prometheus目標抓取重復模板：

1.1.18 Prometheus TSDB 檢查點創建失敗模板：

1.1.19 Prometheus TSDB 檢查點刪除失敗模板：

1.1.20 Prometheus TSDB 壓縮失敗模板：

1.1.21 Prometheus TSDB 頭部截斷失敗模板：

1.1.22 Prometheus TSDB 重新加載失敗模板：

1.1.23 普羅米修斯 TSDB WAL 損壞模板：

1.1.24 Prometheus TSDB WAL 截斷失敗模板：

1.2 主機和硬件：節點導出器（33條規則）

1.2.1. 主機內存不足模板：

1.2.2 內存壓力下的主機內存模板：

1.2.3 主機異常網絡吞吐量模板：

1.2.4 主機異常網絡吞吐量模板：

1.2.5 主機異常磁盤讀取率模板：

1.2.6 主機異常磁盤寫入率模板：

1.2.7 主機磁盤空間不足模板：

1.2.8 主機磁盤將在 24 小時內填滿模板：

1.2.9 主機不足 inode模板：

1.2.10 主機 inode 將在 24 小時內填滿模板：

1.2.11 主機異常磁盤讀取延遲模板：

1.2.12 主機異常磁盤寫入延遲模板：

1.2.13 主機 CPU 負載高模板：

1.2.14 主機 CPU 竊取模板：

1.2.15 主機上下文切換模板：

1.2.16. 主機交換已滿模板：

1.2.17. 主機 systemd 服務崩潰模板：

1.2.18 主機物理組件過熱模板：

1.2.19. 主機節點超溫告警模板：

1.2.20 主機REID陣列處於非活動狀態模板：

1.2.21 主機RAID磁盤故障模板：

1.2.22 主機內核版本偏差模板：

1.2.23 檢測到主機 OOM 終止模板：

1.2.24 檢測到主機 EDAC 可糾正錯誤模板：

1.2.25. 檢測到主機 EDAC 無法糾正的錯誤模板：

1.2.26 主機網絡接收錯誤模板：

1.2.27 主機網絡接口飽和模板：

1.2.28 主機網絡綁定降級模板：

1.2.29 主機連接限制模板：

1.2.30 主機時鍾偏差模板：

1.2.31 模板主機時鍾不同步:

1.2.32 主機需要重啟模板：

1.3 Docker 容器： google/cAdvisor （7 條規則）

1.3.1 容器--被殺死模板：

1.3.2 容器--不存在模板：

1.3.3 容器--CPU使用率模板：

1.3.4 容器--內存使用模板：

1.3.5 容器--卷使用模板：

1.3.6 容器--卷 IO 使用情況模板：

1.3.7 容器--高節流率模板：

1.4 黑盒： prometheus/blackbox_exporter （8 條規則）

1.4.1. 黑盒探測失敗模板：

1.4.2 黑盒慢探針模板：

1.4.3 黑盒探測 HTTP 失敗模板：

1.4.4 Blackbox SSL 證書即將到期模板：

1.4.5 Blackbox SSL 證書即將到期模板：

1.4.6 Blackbox SSL 證書已過期模板：

1.4.7 黑盒探測慢速 HTTP模板：

1.4.8 黑盒探測慢 ping模板：

1.5 Windows 服務器： prometheus-community/windows_exporter （5 條規則）

1.1Prometheus 自我監控（25 條規則）

1.7. Netdata：嵌入式導出器（9 條規則）