Awesome Prometheus alerts


http://t.zoukankan.com/shoufu-p-14110485.html

Reposted from https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware

Collection of alerting rules


⚠️ Caution ⚠️

Alert thresholds depend on the nature of your applications.
Some queries on this page use arbitrary tolerance thresholds.

Building an efficient and battle-tested monitoring platform takes time. 😉
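
The rules are reproduced below as flattened one-liners; in a real deployment they live in rule files referenced from prometheus.yml and are wrapped in a group. A minimal sketch, assuming illustrative file and group names (only the first rule of section 1.1 is expanded here):

    # prometheus.yml (fragment): load rule files; the path is an example
    rule_files:
      - /etc/prometheus/rules/*.yml

    # /etc/prometheus/rules/prometheus-self-monitoring.yml: the first rule of
    # section 1.1, laid out as standard multi-line YAML
    groups:
      - name: prometheus-self-monitoring
        rules:
          - alert: PrometheusJobMissing
            expr: absent(up{job="prometheus"})
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: Prometheus job missing (instance {{ $labels.instance }})
              description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

Rule files can be validated with `promtool check rules <file>` before reloading Prometheus.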



 

  • 1.1. Prometheus self-monitoring (25 rules)

    • 1.1.1. Prometheus job missing

       A Prometheus job has disappeared[copy]

       

        - alert: PrometheusJobMissing expr: absent(up{job="prometheus"}) for: 5m labels: severity: warning annotations: summary: Prometheus job missing (instance {{ $labels.instance }}) description: A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.2. Prometheus target missing

       A Prometheus target has disappeared. An exporter might have crashed.

       

        - alert: PrometheusTargetMissing expr: up == 0 for: 5m labels: severity: critical annotations: summary: Prometheus target missing (instance {{ $labels.instance }}) description: A Prometheus target has disappeared. An exporter might have crashed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.3. Prometheus all targets missing

       A Prometheus job no longer has any living targets.

       

        - alert: PrometheusAllTargetsMissing expr: count by (job) (up) == 0 for: 5m labels: severity: critical annotations: summary: Prometheus all targets missing (instance {{ $labels.instance }}) description: A Prometheus job no longer has any living targets.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.4. Prometheus configuration reload failure

       Prometheus configuration reload error[copy]

       

        - alert: PrometheusConfigurationReloadFailure expr: prometheus_config_last_reload_successful != 1 for: 5m labels: severity: warning annotations: summary: Prometheus configuration reload failure (instance {{ $labels.instance }}) description: Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.5. Prometheus too many restarts

       Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.[copy]

       

        - alert: PrometheusTooManyRestarts expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2 for: 5m labels: severity: warning annotations: summary: Prometheus too many restarts (instance {{ $labels.instance }}) description: Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.6. Prometheus AlertManager configuration reload failure

       AlertManager configuration reload error[copy]

       

        - alert: PrometheusAlertmanagerConfigurationReloadFailure expr: alertmanager_config_last_reload_successful != 1 for: 5m labels: severity: warning annotations: summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }}) description: AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.7. Prometheus AlertManager config not synced

       Configurations of AlertManager cluster instances are out of sync[copy]

       

        - alert: PrometheusAlertmanagerConfigNotSynced expr: count(count_values("config_hash", alertmanager_config_hash)) > 1 for: 5m labels: severity: warning annotations: summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }}) description: Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.8. Prometheus AlertManager E2E dead man switch

       Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager (a routing sketch follows at the end of this section).

       

        - alert: PrometheusAlertmanagerE2eDeadManSwitch expr: vector(1) for: 5m labels: severity: critical annotations: summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }}) description: Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.9. Prometheus not connected to alertmanager

       Prometheus cannot connect to the alertmanager

       

        - alert: PrometheusNotConnectedToAlertmanager expr: prometheus_notifications_alertmanagers_discovered < 1 for: 5m labels: severity: critical annotations: summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }}) description: Prometheus cannot connect to the alertmanager\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.10. Prometheus rule evaluation failures

       Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.[copy]

       

        - alert: PrometheusRuleEvaluationFailures expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus rule evaluation failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.11. Prometheus template text expansion failures

       Prometheus encountered {{ $value }} template text expansion failures[copy]

       

        - alert: PrometheusTemplateTextExpansionFailures expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus template text expansion failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.12. Prometheus rule evaluation slow

       Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or an overly complex query.

       

        - alert: PrometheusRuleEvaluationSlow expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds for: 5m labels: severity: warning annotations: summary: Prometheus rule evaluation slow (instance {{ $labels.instance }}) description: Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or an overly complex query.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.13. Prometheus notifications backlog

       The Prometheus notification queue has not been empty for 10 minutes[copy]

       

        - alert: PrometheusNotificationsBacklog expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0 for: 5m labels: severity: warning annotations: summary: Prometheus notifications backlog (instance {{ $labels.instance }}) description: The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.14. Prometheus AlertManager notification failing

       Alertmanager is failing to send notifications

       

        - alert: PrometheusAlertmanagerNotificationFailing expr: rate(alertmanager_notifications_failed_total[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }}) description: Alertmanager is failing to send notifications\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.15. Prometheus target empty

       Prometheus has no target in service discovery[copy]

       

        - alert: PrometheusTargetEmpty expr: prometheus_sd_discovered_targets == 0 for: 5m labels: severity: critical annotations: summary: Prometheus target empty (instance {{ $labels.instance }}) description: Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.16. Prometheus target scraping slow

       Prometheus is scraping exporters slowly[copy]

       

        - alert: PrometheusTargetScrapingSlow expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60 for: 5m labels: severity: warning annotations: summary: Prometheus target scraping slow (instance {{ $labels.instance }}) description: Prometheus is scraping exporters slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.17. Prometheus large scrape

       Prometheus has many scrapes that exceed the sample limit[copy]

       

        - alert: PrometheusLargeScrape expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10 for: 5m labels: severity: warning annotations: summary: Prometheus large scrape (instance {{ $labels.instance }}) description: Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.18. Prometheus target scrape duplicate

       Prometheus has many samples rejected due to duplicate timestamps but different values[copy]

       

        - alert: PrometheusTargetScrapeDuplicate expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Prometheus target scrape duplicate (instance {{ $labels.instance }}) description: Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.19. Prometheus TSDB checkpoint creation failures

       Prometheus encountered {{ $value }} checkpoint creation failures[copy]

       

        - alert: PrometheusTsdbCheckpointCreationFailures expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.20. Prometheus TSDB checkpoint deletion failures

       Prometheus encountered {{ $value }} checkpoint deletion failures[copy]

       

        - alert: PrometheusTsdbCheckpointDeletionFailures expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.21. Prometheus TSDB compactions failed

       Prometheus encountered {{ $value }} TSDB compaction failures

       

        - alert: PrometheusTsdbCompactionsFailed expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB compaction failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.22. Prometheus TSDB head truncations failed

       Prometheus encountered {{ $value }} TSDB head truncation failures[copy]

       

        - alert: PrometheusTsdbHeadTruncationsFailed expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.23. Prometheus TSDB reload failures

       Prometheus encountered {{ $value }} TSDB reload failures[copy]

       

        - alert: PrometheusTsdbReloadFailures expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB reload failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.24. Prometheus TSDB WAL corruptions

       Prometheus encountered {{ $value }} TSDB WAL corruptions[copy]

       

        - alert: PrometheusTsdbWalCorruptions expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.1.25. Prometheus TSDB WAL truncations failed

       Prometheus encountered {{ $value }} TSDB WAL truncation failures[copy]

       

        - alert: PrometheusTsdbWalTruncationsFailed expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
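
The dead man's switch in rule 1.1.8 only has value if the always-firing alert is routed to something external that notices when it stops arriving. A minimal Alertmanager routing sketch, assuming a placeholder receiver name and webhook URL:

    # alertmanager.yml (fragment): route the DeadManSwitch alert to a heartbeat receiver
    route:
      receiver: default
      routes:
        - match:
            alertname: PrometheusAlertmanagerE2eDeadManSwitch
          receiver: deadman
          repeat_interval: 5m   # keep re-sending so the external service is pinged continuously
    receivers:
      - name: default
      - name: deadman
        webhook_configs:
          - url: https://heartbeat.example.com/ping   # placeholder heartbeat endpoint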

       


  • 1.2. Host and hardware : node-exporter (26 rules)

    • 1.2.1. Host out of memory

       Node memory is filling up (< 10% left)[copy]

       

        - alert: HostOutOfMemory expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 for: 5m labels: severity: warning annotations: summary: Host out of memory (instance {{ $labels.instance }}) description: Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.2. Host memory under memory pressure

       The node is under heavy memory pressure. High rate of major page faults[copy]

       

        - alert: HostMemoryUnderMemoryPressure expr: rate(node_vmstat_pgmajfault[1m]) > 1000 for: 5m labels: severity: warning annotations: summary: Host memory under memory pressure (instance {{ $labels.instance }}) description: The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.3. Host unusual network throughput in

       Host network interfaces are probably receiving too much data (> 100 MB/s)[copy]

       

        - alert: HostUnusualNetworkThroughputIn expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100 for: 5m labels: severity: warning annotations: summary: Host unusual network throughput in (instance {{ $labels.instance }}) description: Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.4. Host unusual network throughput out

       Host network interfaces are probably sending too much data (> 100 MB/s)[copy]

       

        - alert: HostUnusualNetworkThroughputOut expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100 for: 5m labels: severity: warning annotations: summary: Host unusual network throughput out (instance {{ $labels.instance }}) description: Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.5. Host unusual disk read rate

       Disk is probably reading too much data (> 50 MB/s)[copy]

       

        - alert: HostUnusualDiskReadRate expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50 for: 5m labels: severity: warning annotations: summary: Host unusual disk read rate (instance {{ $labels.instance }}) description: Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.6. Host unusual disk write rate

       Disk is probably writing too much data (> 50 MB/s)[copy]

       

        - alert: HostUnusualDiskWriteRate expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50 for: 5m labels: severity: warning annotations: summary: Host unusual disk write rate (instance {{ $labels.instance }}) description: Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.7. Host out of disk space

       Disk is almost full (< 10% left). A variant with filesystem label filters is sketched at the end of this section.

       

        # please add ignored mountpoints in node_exporter parameters like # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)" - alert: HostOutOfDiskSpace expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 for: 5m labels: severity: warning annotations: summary: Host out of disk space (instance {{ $labels.instance }}) description: Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.8. Host disk will fill in 4 hours

       Disk will fill in 4 hours at current write rate[copy]

       

        - alert: HostDiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0 for: 5m labels: severity: warning annotations: summary: Host disk will fill in 4 hours (instance {{ $labels.instance }}) description: Disk will fill in 4 hours at current write rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.9. Host out of inodes

       Disk is almost running out of available inodes (< 10% left)[copy]

       

        - alert: HostOutOfInodes expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint ="/rootfs"} * 100 < 10 for: 5m labels: severity: warning annotations: summary: Host out of inodes (instance {{ $labels.instance }}) description: Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.10. Host unusual disk read latency

       Disk latency is growing (read operations > 100ms)[copy]

       

        - alert: HostUnusualDiskReadLatency expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Host unusual disk read latency (instance {{ $labels.instance }}) description: Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.11. Host unusual disk write latency

       Disk latency is growing (write operations > 100ms)[copy]

       

        - alert: HostUnusualDiskWriteLatency expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Host unusual disk write latency (instance {{ $labels.instance }}) description: Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.12. Host high CPU load

       CPU load is > 80%[copy]

       

        - alert: HostHighCpuLoad expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Host high CPU load (instance {{ $labels.instance }}) description: CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.13. Host context switching

       Context switching is growing on node (> 1000 / s)[copy]

       

        # 1000 context switches is an arbitrary number. # Alert threshold depends on nature of application. # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58 - alert: HostContextSwitching expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000 for: 5m labels: severity: warning annotations: summary: Host context switching (instance {{ $labels.instance }}) description: Context switching is growing on node (> 1000 / s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.14. Host swap is filling up

       Swap is filling up (>80%)[copy]

       

        - alert: HostSwapIsFillingUp expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80 for: 5m labels: severity: warning annotations: summary: Host swap is filling up (instance {{ $labels.instance }}) description: Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.15. Host SystemD service crashed

       SystemD service crashed[copy]

       

        - alert: HostSystemdServiceCrashed expr: node_systemd_unit_state{state="failed"} == 1 for: 5m labels: severity: warning annotations: summary: Host SystemD service crashed (instance {{ $labels.instance }}) description: SystemD service crashed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.16. Host physical component too hot

       Physical hardware component too hot[copy]

       

        - alert: HostPhysicalComponentTooHot expr: node_hwmon_temp_celsius > 75 for: 5m labels: severity: warning annotations: summary: Host physical component too hot (instance {{ $labels.instance }}) description: Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.17. Host node overtemperature alarm

       Physical node temperature alarm triggered[copy]

       

        - alert: HostNodeOvertemperatureAlarm expr: node_hwmon_temp_alarm == 1 for: 5m labels: severity: critical annotations: summary: Host node overtemperature alarm (instance {{ $labels.instance }}) description: Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.18. Host RAID array got inactive

       RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

       

        - alert: HostRaidArrayGotInactive expr: node_md_state{state="inactive"} > 0 for: 5m labels: severity: critical annotations: summary: Host RAID array got inactive (instance {{ $labels.instance }}) description: RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.19. Host RAID disk failure

       At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap[copy]

       

        - alert: HostRaidDiskFailure expr: node_md_disks{state="failed"} > 0 for: 5m labels: severity: warning annotations: summary: Host RAID disk failure (instance {{ $labels.instance }}) description: At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.20. Host kernel version deviations

       Different kernel versions are running[copy]

       

        - alert: HostKernelVersionDeviations expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1 for: 5m labels: severity: warning annotations: summary: Host kernel version deviations (instance {{ $labels.instance }}) description: Different kernel versions are running\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.21. Host OOM kill detected

       OOM kill detected[copy]

       

        - alert: HostOomKillDetected expr: increase(node_vmstat_oom_kill[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Host OOM kill detected (instance {{ $labels.instance }}) description: OOM kill detected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.22. Host EDAC Correctable Errors detected

       {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.[copy]

       

        - alert: HostEdacCorrectableErrorsDetected expr: increase(node_edac_correctable_errors_total[5m]) > 0 for: 5m labels: severity: info annotations: summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }}) description: {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.23. Host EDAC Uncorrectable Errors detected

       {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.[copy]

       

        - alert: HostEdacUncorrectableErrorsDetected expr: node_edac_uncorrectable_errors_total > 0 for: 5m labels: severity: warning annotations: summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }}) description: {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.24. Host Network Receive Errors

       {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.[copy]

       

        - alert: HostNetworkReceiveErrors expr: increase(node_network_receive_errs_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Host Network Receive Errors (instance {{ $labels.instance }}) description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.25. Host Network Transmit Errors

       {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.[copy]

       

        - alert: HostNetworkTransmitErrors expr: increase(node_network_transmit_errs_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Host Network Transmit Errors (instance {{ $labels.instance }}) description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.2.26. Host Network Interface Saturated

       The network interface "{{ $labels.interface }}" on "{{ $labels.instance }}" is getting overloaded.[copy]

       

        - alert: HostNetworkInterfaceSaturated expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 for: 5m labels: severity: warning annotations: summary: Host Network Interface Saturated (instance {{ $labels.instance }}) description: The network interface "{{ $labels.interface }}" on "{{ $labels.instance }}" is getting overloaded.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
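
Most thresholds in this section (80% CPU, 10% disk, 1000 context switches per second, ...) are the arbitrary defaults mentioned in the caution note above and should be adjusted to the workload. The comment on rule 1.2.7 excludes pseudo filesystems in node_exporter itself; the same effect can be obtained in PromQL by filtering labels in the expression. A sketch of such a variant, with example fstype values that depend on your environment:

    - alert: HostOutOfDiskSpace
      expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|squashfs|overlay"} < 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host out of disk space (instance {{ $labels.instance }})
        description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"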

       


  • 1.3. Docker containers : google/cAdvisor (6 rules)

    • 1.3.1. Container killed

       A container has disappeared[copy]

       

        - alert: ContainerKilled expr: time() - container_last_seen > 60 for: 5m labels: severity: warning annotations: summary: Container killed (instance {{ $labels.instance }}) description: A container has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.3.2. Container CPU usage

       Container CPU usage is above 80%[copy]

       

        # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly. # If you want to exclude it from this alert, just use: container_cpu_usage_seconds_total{name!=""} - alert: ContainerCpuUsage expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container CPU usage (instance {{ $labels.instance }}) description: Container CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.3.3. Container Memory usage

       Container Memory usage is above 80%[copy]

       

        # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d - alert: ContainerMemoryUsage expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container Memory usage (instance {{ $labels.instance }}) description: Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.3.4. Container Volume usage

       Container Volume usage is above 80%[copy]

       

        - alert: ContainerVolumeUsage expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance)) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container Volume usage (instance {{ $labels.instance }}) description: Container Volume usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.3.5. Container Volume IO usage

       Container Volume IO usage is above 80%[copy]

       

        - alert: ContainerVolumeIoUsage expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container Volume IO usage (instance {{ $labels.instance }}) description: Container Volume IO usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.3.6. Container high throttle rate

       Container is being throttled[copy]

       

        - alert: ContainerHighThrottleRate expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1 for: 5m labels: severity: warning annotations: summary: Container high throttle rate (instance {{ $labels.instance }}) description: Container is being throttled\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
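
The container_* metrics in this section are exposed by cAdvisor, which must be running and scraped by Prometheus. A minimal Docker Compose sketch based on the cAdvisor quick-start; the image tag, published port, and mounts are assumptions that may need adjusting for your setup:

    # docker-compose.yml (fragment): run cAdvisor so container_* metrics exist
    services:
      cadvisor:
        image: gcr.io/cadvisor/cadvisor:latest   # assumed image; pin a version in production
        ports:
          - "8080:8080"                          # scrape target for Prometheus
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro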

       


  • 1.4. Blackbox : prometheus/blackbox_exporter (8 rules)

    • 1.4.1. Blackbox probe failed

       Probe failed[copy]

       

        - alert: BlackboxProbeFailed expr: probe_success == 0 for: 5m labels: severity: critical annotations: summary: Blackbox probe failed (instance {{ $labels.instance }}) description: Probe failed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.2. Blackbox slow probe

       Blackbox probe took more than 1s to complete[copy]

       

        - alert: BlackboxSlowProbe expr: avg_over_time(probe_duration_seconds[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Blackbox slow probe (instance {{ $labels.instance }}) description: Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.3. Blackbox probe HTTP failure

       HTTP status code is not 200-399[copy]

       

        - alert: BlackboxProbeHttpFailure expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400 for: 5m labels: severity: critical annotations: summary: Blackbox probe HTTP failure (instance {{ $labels.instance }}) description: HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.4. Blackbox SSL certificate will expire soon

       SSL certificate expires in 30 days[copy]

       

        - alert: BlackboxSslCertificateWillExpireSoon expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30 for: 5m labels: severity: warning annotations: summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }}) description: SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.5. Blackbox SSL certificate will expire soon

       SSL certificate expires in 3 days[copy]

       

        - alert: BlackboxSslCertificateWillExpireSoon expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3 for: 5m labels: severity: critical annotations: summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }}) description: SSL certificate expires in 3 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.6. Blackbox SSL certificate expired

       SSL certificate has expired already[copy]

       

        - alert: BlackboxSslCertificateExpired expr: probe_ssl_earliest_cert_expiry - time() <= 0 for: 5m labels: severity: critical annotations: summary: Blackbox SSL certificate expired (instance {{ $labels.instance }}) description: SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.7. Blackbox probe slow HTTP

       HTTP request took more than 1s[copy]

       

        - alert: BlackboxProbeSlowHttp expr: avg_over_time(probe_http_duration_seconds[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Blackbox probe slow HTTP (instance {{ $labels.instance }}) description: HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.4.8. Blackbox probe slow ping

       Blackbox ping took more than 1s[copy]

       

        - alert: BlackboxProbeSlowPing expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Blackbox probe slow ping (instance {{ $labels.instance }}) description: Blackbox ping took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
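
The probe_* metrics in this section come from blackbox_exporter, which needs a scrape job that passes each target as a parameter. A minimal sketch following the exporter's documented relabeling pattern; the module name, probed targets, and exporter address are examples:

    scrape_configs:
      - job_name: blackbox
        metrics_path: /probe
        params:
          module: [http_2xx]              # example module defined in blackbox.yml
        static_configs:
          - targets:
              - https://example.com       # endpoints to probe
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target  # pass the target as ?target=<...>
          - source_labels: [__param_target]
            target_label: instance        # keep the probed URL as the instance label
          - target_label: __address__
            replacement: blackbox-exporter:9115   # address of the blackbox_exporter itself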

       


  • 1.5. Windows Server : prometheus-community/windows_exporter (5 rules)

    • 1.5.1. Windows Server collector Error

       Collector {{ $labels.collector }} was not successful[copy]

       

        - alert: WindowsServerCollectorError expr: windows_exporter_collector_success == 0 for: 5m labels: severity: critical annotations: summary: Windows Server collector Error (instance {{ $labels.instance }}) description: Collector {{ $labels.collector }} was not successful\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.5.2. Windows Server service Status

       Windows Service state is not OK[copy]

       

        - alert: WindowsServerServiceStatus expr: windows_service_status{status="ok"} != 1 for: 5m labels: severity: critical annotations: summary: Windows Server service Status (instance {{ $labels.instance }}) description: Windows Service state is not OK\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.5.3. Windows Server CPU Usage

       CPU Usage is more than 80%[copy]

       

        - alert: WindowsServerCpuUsage expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Windows Server CPU Usage (instance {{ $labels.instance }}) description: CPU Usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.5.4. Windows Server memory Usage

       Memory usage is more than 90%[copy]

       

        - alert: WindowsServerMemoryUsage expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90 for: 5m labels: severity: warning annotations: summary: Windows Server memory Usage (instance {{ $labels.instance }}) description: Memory usage is more than 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 1.5.5. Windows Server disk Space Usage

       Disk usage is more than 80%[copy]

       

        - alert: WindowsServerDiskSpaceUsage expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80 for: 5m labels: severity: critical annotations: summary: Windows Server disk Space Usage (instance {{ $labels.instance }}) description: Disk usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2.1. MySQL : prometheus/mysqld_exporter (8 rules)

    • 2.1.1. MySQL down

       MySQL instance is down on {{ $labels.instance }}[copy]

       

        - alert: MysqlDown expr: mysql_up == 0 for: 5m labels: severity: critical annotations: summary: MySQL down (instance {{ $labels.instance }}) description: MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.2. MySQL too many connections

       More than 80% of MySQL connections are in use on {{ $labels.instance }}[copy]

       

        - alert: MysqlTooManyConnections expr: avg by (instance) (max_over_time(mysql_global_status_threads_connected[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80 for: 5m labels: severity: warning annotations: summary: MySQL too many connections (instance {{ $labels.instance }}) description: More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.3. MySQL high threads running

       More than 60% of MySQL connections are in running state on {{ $labels.instance }}[copy]

       

        - alert: MysqlHighThreadsRunning expr: avg by (instance) (max_over_time(mysql_global_status_threads_running[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60 for: 5m labels: severity: warning annotations: summary: MySQL high threads running (instance {{ $labels.instance }}) description: More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.4. MySQL Slave IO thread not running

       MySQL Slave IO thread not running on {{ $labels.instance }}[copy]

       

        - alert: MysqlSlaveIoThreadNotRunning expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0 for: 5m labels: severity: critical annotations: summary: MySQL Slave IO thread not running (instance {{ $labels.instance }}) description: MySQL Slave IO thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.5. MySQL Slave SQL thread not running

       MySQL Slave SQL thread not running on {{ $labels.instance }}[copy]

       

        - alert: MysqlSlaveSqlThreadNotRunning expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0 for: 5m labels: severity: critical annotations: summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }}) description: MySQL Slave SQL thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.6. MySQL Slave replication lag

       MySQL replication lag on {{ $labels.instance }}

       

        - alert: MysqlSlaveReplicationLag expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 300 for: 5m labels: severity: warning annotations: summary: MySQL Slave replication lag (instance {{ $labels.instance }}) description: MySQL replication lag on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.7. MySQL slow queries

       MySQL server has new slow queries.

       

        - alert: MysqlSlowQueries expr: rate(mysql_global_status_slow_queries[2m]) > 0 for: 5m labels: severity: warning annotations: summary: MySQL slow queries (instance {{ $labels.instance }}) description: MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.1.8. MySQL restarted

       MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.[copy]

       

        - alert: MysqlRestarted expr: mysql_global_status_uptime < 60 for: 5m labels: severity: warning annotations: summary: MySQL restarted (instance {{ $labels.instance }}) description: MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2.2. PostgreSQL : wrouesnel/postgres_exporter (25 rules)

    • 2.2.1. Postgresql down

       Postgresql instance is down[copy]

       

        - alert: PostgresqlDown expr: pg_up == 0 for: 5m labels: severity: critical annotations: summary: Postgresql down (instance {{ $labels.instance }}) description: Postgresql instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.2. Postgresql restarted

       Postgresql restarted[copy]

       

        - alert: PostgresqlRestarted expr: time() - pg_postmaster_start_time_seconds < 60 for: 5m labels: severity: critical annotations: summary: Postgresql restarted (instance {{ $labels.instance }}) description: Postgresql restarted\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.3. Postgresql exporter error

       Postgresql exporter is showing errors. A query may be buggy in query.yaml[copy]

       

        - alert: PostgresqlExporterError expr: pg_exporter_last_scrape_error > 0 for: 5m labels: severity: warning annotations: summary: Postgresql exporter error (instance {{ $labels.instance }}) description: Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.4. Postgresql replication lag

       PostgreSQL replication lag is going up (> 10s)[copy]

       

        - alert: PostgresqlReplicationLag expr: (pg_replication_lag) > 10 and ON(instance) (pg_replication_is_replica == 1) for: 5m labels: severity: warning annotations: summary: Postgresql replication lag (instance {{ $labels.instance }}) description: PostgreSQL replication lag is going up (> 10s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

     • 2.2.5. Postgresql table not vacuumed

       Table has not been vacuumed for 24 hours

       

        - alert: PostgresqlTableNotVacuumed expr: time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24 for: 5m labels: severity: warning annotations: summary: Postgresql table not vacuumed (instance {{ $labels.instance }}) description: Table has not been vacuumed for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.6. Postgresql table not analyzed

       Table has not been analyzed for 24 hours[copy]

       

        - alert: PostgresqlTableNotAnalyzed expr: time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24 for: 5m labels: severity: warning annotations: summary: Postgresql table not analyzed (instance {{ $labels.instance }}) description: Table has not been analyzed for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.7. Postgresql too many connections

       PostgreSQL instance has too many connections[copy]

       

        - alert: PostgresqlTooManyConnections expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9 for: 5m labels: severity: warning annotations: summary: Postgresql too many connections (instance {{ $labels.instance }}) description: PostgreSQL instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.8. Postgresql not enough connections

       PostgreSQL instance should have more connections (> 5)[copy]

       

        - alert: PostgresqlNotEnoughConnections expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5 for: 5m labels: severity: warning annotations: summary: Postgresql not enough connections (instance {{ $labels.instance }}) description: PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.9. Postgresql dead locks

       PostgreSQL has dead-locks[copy]

       

        - alert: PostgresqlDeadLocks expr: rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Postgresql dead locks (instance {{ $labels.instance }}) description: PostgreSQL has dead-locks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.10. Postgresql slow queries

       PostgreSQL executes slow queries[copy]

       

        - alert: PostgresqlSlowQueries expr: pg_slow_queries > 0 for: 5m labels: severity: warning annotations: summary: Postgresql slow queries (instance {{ $labels.instance }}) description: PostgreSQL executes slow queries\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.11. Postgresql high rollback rate

       Ratio of transactions being aborted compared to committed is > 2 %[copy]

       

        - alert: PostgresqlHighRollbackRate expr: rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02 for: 5m labels: severity: warning annotations: summary: Postgresql high rollback rate (instance {{ $labels.instance }}) description: Ratio of transactions being aborted compared to committed is > 2 %\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.12. Postgresql commit rate low

       Postgres seems to be processing very few transactions[copy]

       

        - alert: PostgresqlCommitRateLow expr: rate(pg_stat_database_xact_commit[1m]) < 10 for: 5m labels: severity: critical annotations: summary: Postgresql commit rate low (instance {{ $labels.instance }}) description: Postgres seems to be processing very few transactions\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.13. Postgresql low XID consumption

       Postgresql seems to be consuming transaction IDs very slowly[copy]

       

        - alert: PostgresqlLowXidConsumption expr: rate(pg_txid_current[1m]) < 5 for: 5m labels: severity: warning annotations: summary: Postgresql low XID consumption (instance {{ $labels.instance }}) description: Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

     • 2.2.14. Postgresql low XLOG consumption

       Postgres seems to be consuming XLOG very slowly[copy]

       

        - alert: PostgresqlLowXlogConsumption expr: rate(pg_xlog_position_bytes[1m]) < 100 for: 5m labels: severity: warning annotations: summary: Postgresql low XLOG consumption (instance {{ $labels.instance }}) description: Postgres seems to be consuming XLOG very slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.15. Postgresql WALE replication stopped

       WAL-E replication seems to be stopped[copy]

       

        - alert: PostgresqlWaleReplicationStopped expr: rate(pg_xlog_position_bytes[1m]) == 0 for: 5m labels: severity: critical annotations: summary: Postgresql WALE replication stopped (instance {{ $labels.instance }}) description: WAL-E replication seems to be stopped\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.16. Postgresql high rate statement timeout

       Postgres transactions showing high rate of statement timeouts[copy]

       

        - alert: PostgresqlHighRateStatementTimeout expr: rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3 for: 5m labels: severity: critical annotations: summary: Postgresql high rate statement timeout (instance {{ $labels.instance }}) description: Postgres transactions showing high rate of statement timeouts\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.17. Postgresql high rate deadlock

       Postgres detected deadlocks[copy]

       

        - alert: PostgresqlHighRateDeadlock expr: rate(postgresql_errors_total{type="deadlock_detected"}[1m]) * 60 > 1 for: 5m labels: severity: critical annotations: summary: Postgresql high rate deadlock (instance {{ $labels.instance }}) description: Postgres detected deadlocks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

     • 2.2.18. Postgresql replication lag bytes

       Postgres Replication lag (in bytes) is high[copy]

       

        - alert: PostgresqlReplicationLagBytes expr: (pg_xlog_position_bytes and pg_replication_is_replica == 0) - GROUP_RIGHT(instance) (pg_xlog_position_bytes and pg_replication_is_replica == 1) > 1e+09 for: 5m labels: severity: critical annotations: summary: Postgresql replication lag bytes (instance {{ $labels.instance }}) description: Postgres Replication lag (in bytes) is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.19. Postgresql unused replication slot

       Unused Replication Slots[copy]

       

        - alert: PostgresqlUnusedReplicationSlot expr: pg_replication_slots_active == 0 for: 5m labels: severity: warning annotations: summary: Postgresql unused replication slot (instance {{ $labels.instance }}) description: Unused Replication Slots\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.20. Postgresql too many dead tuples

       The number of PostgreSQL dead tuples is too large

       

        - alert: PostgresqlTooManyDeadTuples expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 unless ON(instance) (pg_replication_is_replica == 1) for: 5m labels: severity: warning annotations: summary: Postgresql too many dead tuples (instance {{ $labels.instance }}) description: The number of PostgreSQL dead tuples is too large\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.21. Postgresql split brain

       Split Brain, too many primary Postgresql databases in read-write mode[copy]

       

        - alert: PostgresqlSplitBrain expr: count(pg_replication_is_replica == 0) != 1 for: 5m labels: severity: critical annotations: summary: Postgresql split brain (instance {{ $labels.instance }}) description: Split Brain, too many primary Postgresql databases in read-write mode\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.22. Postgresql promoted node

       Postgresql standby server has been promoted as primary node[copy]

       

        - alert: PostgresqlPromotedNode expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Postgresql promoted node (instance {{ $labels.instance }}) description: Postgresql standby server has been promoted as primary node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.23. Postgresql configuration changed

       Postgres Database configuration change has occurred[copy]

       

        - alert: PostgresqlConfigurationChanged expr: {__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m for: 5m labels: severity: warning annotations: summary: Postgresql configuration changed (instance {{ $labels.instance }}) description: Postgres Database configuration change has occurred\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.24. Postgresql SSL compression active

       Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.[copy]

       

        - alert: PostgresqlSslCompressionActive expr: sum(pg_stat_ssl_compression) > 0 for: 5m labels: severity: critical annotations: summary: Postgresql SSL compression active (instance {{ $labels.instance }}) description: Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.2.25. Postgresql too many locks acquired

       Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.[copy]

       

        - alert: PostgresqlTooManyLocksAcquired expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 for: 5m labels: severity: critical annotations: summary: Postgresql too many locks acquired (instance {{ $labels.instance }}) description: Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2.3. SQL Server : Ozarklake/prometheus-mssql-exporter (2 rules)

    • 2.3.1. SQL Server down

       SQL Server instance is down

       

        - alert: SqlServerDown expr: mssql_up == 0 for: 5m labels: severity: critical annotations: summary: SQL Server down (instance {{ $labels.instance }}) description: SQL Server instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.3.2. SQL Server deadlock

       SQL Server is experiencing deadlocks.

       

        - alert: SqlServerDeadlock expr: rate(mssql_deadlocks[1m]) > 0 for: 5m labels: severity: warning annotations: summary: SQL Server deadlock (instance {{ $labels.instance }}) description: SQL Server is experiencing deadlocks.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2.4. PGBouncer : spreaker/prometheus-pgbouncer-exporter (3 rules)

     • 2.4.1. PGBouncer active connections

       PGBouncer pools are filling up[copy]

       

        - alert: PgbouncerActiveConnections expr: pgbouncer_pools_server_active_connections > 200 for: 5m labels: severity: warning annotations: summary: PGBouncer active connections (instance {{ $labels.instance }}) description: PGBouncer pools are filling up\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.4.2. PGBouncer errors

       PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.

       

        - alert: PgbouncerErrors expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10 for: 5m labels: severity: warning annotations: summary: PGBouncer errors (instance {{ $labels.instance }}) description: PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.4.3. PGBouncer max connections

       The number of PGBouncer client connections has reached max_client_conn.[copy]

       

        - alert: PgbouncerMaxConnections expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: PGBouncer max connections (instance {{ $labels.instance }}) description: The number of PGBouncer client connections has reached max_client_conn.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2.5. Redis : oliver006/redis_exporter (11 rules)

    • 2.5.1. Redis down

       Redis instance is down[copy]

       

        - alert: RedisDown expr: redis_up == 0 for: 5m labels: severity: critical annotations: summary: Redis down (instance {{ $labels.instance }}) description: Redis instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.2. Redis missing master

       Redis cluster has no node marked as master.[copy]

       

        - alert: RedisMissingMaster expr: count(redis_instance_info{role="master"}) == 0 for: 5m labels: severity: critical annotations: summary: Redis missing master (instance {{ $labels.instance }}) description: Redis cluster has no node marked as master.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.3. Redis too many masters

       Redis cluster has too many nodes marked as master.[copy]

       

        - alert: RedisTooManyMasters expr: count(redis_instance_info{role="master"}) > 1 for: 5m labels: severity: critical annotations: summary: Redis too many masters (instance {{ $labels.instance }}) description: Redis cluster has too many nodes marked as master.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.4. Redis disconnected slaves

       Redis is not replicating to all slaves. Consider reviewing the redis replication status.

       

        - alert: RedisDisconnectedSlaves expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1 for: 5m labels: severity: critical annotations: summary: Redis disconnected slaves (instance {{ $labels.instance }}) description: Redis is not replicating to all slaves. Consider reviewing the redis replication status.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.5. Redis replication broken

       Redis instance lost a slave[copy]

       

        - alert: RedisReplicationBroken expr: delta(redis_connected_slaves[1m]) < 0 for: 5m labels: severity: critical annotations: summary: Redis replication broken (instance {{ $labels.instance }}) description: Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.6. Redis cluster flapping

       Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).[copy]

       

        - alert: RedisClusterFlapping expr: changes(redis_connected_slaves[5m]) > 2 for: 5m labels: severity: critical annotations: summary: Redis cluster flapping (instance {{ $labels.instance }}) description: Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.7. Redis missing backup

       Redis has not been backed up for 24 hours[copy]

       

        - alert: RedisMissingBackup expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24 for: 5m labels: severity: critical annotations: summary: Redis missing backup (instance {{ $labels.instance }}) description: Redis has not been backed up for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.8. Redis out of memory

       Redis is running out of memory (> 90%)[copy]

       

        - alert: RedisOutOfMemory expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 for: 5m labels: severity: warning annotations: summary: Redis out of memory (instance {{ $labels.instance }}) description: Redis is running out of memory (> 90%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.9. Redis too many connections

       Redis instance has too many connections[copy]

       

        - alert: RedisTooManyConnections expr: redis_connected_clients > 100 for: 5m labels: severity: warning annotations: summary: Redis too many connections (instance {{ $labels.instance }}) description: Redis instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.10. Redis not enough connections

       Redis instance should have more connections (> 5)[copy]

       

        - alert: RedisNotEnoughConnections expr: redis_connected_clients < 5 for: 5m labels: severity: warning annotations: summary: Redis not enough connections (instance {{ $labels.instance }}) description: Redis instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.5.11. Redis rejected connections

       Some connections to Redis have been rejected[copy]

       

        - alert: RedisRejectedConnections expr: increase(redis_rejected_connections_total[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Redis rejected connections (instance {{ $labels.instance }}) description: Some connections to Redis have been rejected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2. 6. MongoDB : percona/mongodb_exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 2. 6. MongoDB : dcu/mongodb_exporter (10 rules)[copy all]

    • 2.6.1. MongoDB replication lag

       Mongodb replication lag is more than 10s[copy]

       

        - alert: MongodbReplicationLag expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 for: 5m labels: severity: critical annotations: summary: MongoDB replication lag (instance {{ $labels.instance }}) description: Mongodb replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.2. MongoDB replication Status 3

       MongoDB Replication set member either performs startup self-checks, or transitions from completing a rollback or resync[copy]

       

        - alert: MongodbReplicationStatus3 expr: mongodb_replset_member_state == 3 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 3 (instance {{ $labels.instance }}) description: MongoDB Replication set member either performs startup self-checks, or transitions from completing a rollback or resync\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.3. MongoDB replication Status 6

       MongoDB Replication set member, as seen from another member of the set, is not yet known[copy]

       

        - alert: MongodbReplicationStatus6 expr: mongodb_replset_member_state == 6 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 6 (instance {{ $labels.instance }}) description: MongoDB Replication set member, as seen from another member of the set, is not yet known\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.4. MongoDB replication Status 8

       MongoDB Replication set member, as seen from another member of the set, is unreachable[copy]

       

        - alert: MongodbReplicationStatus8 expr: mongodb_replset_member_state == 8 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 8 (instance {{ $labels.instance }}) description: MongoDB Replication set member, as seen from another member of the set, is unreachable\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.5. MongoDB replication Status 9

       MongoDB Replication set member is actively performing a rollback. Data is not available for reads[copy]

       

        - alert: MongodbReplicationStatus9 expr: mongodb_replset_member_state == 9 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 9 (instance {{ $labels.instance }}) description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.6. MongoDB replication Status 10

       MongoDB Replication set member was once in a replica set but was subsequently removed[copy]

       

        - alert: MongodbReplicationStatus10 expr: mongodb_replset_member_state == 10 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 10 (instance {{ $labels.instance }}) description: MongoDB Replication set member was once in a replica set but was subsequently removed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.7. MongoDB number cursors open

       Too many cursors opened by MongoDB for clients (> 10k)[copy]

       

        - alert: MongodbNumberCursorsOpen expr: mongodb_metrics_cursor_open{state="total_open"} > 10000 for: 5m labels: severity: warning annotations: summary: MongoDB number cursors open (instance {{ $labels.instance }}) description: Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.8. MongoDB cursors timeouts

       Too many cursors are timing out[copy]

       

        - alert: MongodbCursorsTimeouts expr: increase(mongodb_metrics_cursor_timed_out_total[10m]) > 100 for: 5m labels: severity: warning annotations: summary: MongoDB cursors timeouts (instance {{ $labels.instance }}) description: Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.9. MongoDB too many connections

       Too many connections[copy]

       

        - alert: MongodbTooManyConnections expr: mongodb_connections{state="current"} > 500 for: 5m labels: severity: warning annotations: summary: MongoDB too many connections (instance {{ $labels.instance }}) description: Too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.6.10. MongoDB virtual memory usage

       High memory usage[copy]

       

        - alert: MongodbVirtualMemoryUsage expr: (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3 for: 5m labels: severity: warning annotations: summary: MongoDB virtual memory usage (instance {{ $labels.instance }}) description: High memory usage\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2. 7. RabbitMQ (official exporter) : rabbitmq/rabbitmq-prometheus (9 rules)[copy all]

    • 2.7.1. Rabbitmq node down

       Less than 3 nodes running in RabbitMQ cluster[copy]

       

        - alert: RabbitmqNodeDown expr: sum(rabbitmq_build_info) < 3 for: 5m labels: severity: critical annotations: summary: Rabbitmq node down (instance {{ $labels.instance }}) description: Less than 3 nodes running in RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.2. Rabbitmq node not distributed

       Distribution link state is not 'up'[copy]

       

        - alert: RabbitmqNodeNotDistributed expr: erlang_vm_dist_node_state < 3 for: 5m labels: severity: critical annotations: summary: Rabbitmq node not distributed (instance {{ $labels.instance }}) description: Distribution link state is not 'up'\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.3. Rabbitmq instances different versions

       Running different versions of RabbitMQ in the same cluster can lead to failure.[copy]

       

        - alert: RabbitmqInstancesDifferentVersions expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1 for: 5m labels: severity: warning annotations: summary: Rabbitmq instances different versions (instance {{ $labels.instance }}) description: Running different versions of RabbitMQ in the same cluster can lead to failure.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.4. Rabbitmq memory high

       A node uses more than 90% of allocated RAM[copy]

       

        - alert: RabbitmqMemoryHigh expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90 for: 5m labels: severity: warning annotations: summary: Rabbitmq memory high (instance {{ $labels.instance }}) description: A node uses more than 90% of allocated RAM\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.5. Rabbitmq file descriptors usage

       A node uses more than 90% of file descriptors[copy]

       

        - alert: RabbitmqFileDescriptorsUsage expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90 for: 5m labels: severity: warning annotations: summary: Rabbitmq file descriptors usage (instance {{ $labels.instance }}) description: A node uses more than 90% of file descriptors\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.6. Rabbitmq too much unack

       Too many unacknowledged messages[copy]

       

        - alert: RabbitmqTooMuchUnack expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too much unack (instance {{ $labels.instance }}) description: Too many unacknowledged messages\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.7. Rabbitmq too much connections

       The total number of connections on a node is too high[copy]

       

        - alert: RabbitmqTooMuchConnections expr: rabbitmq_connections > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too much connections (instance {{ $labels.instance }}) description: The total number of connections on a node is too high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.8. Rabbitmq no queue consumer

       A queue has less than 1 consumer[copy]

       

        - alert: RabbitmqNoQueueConsumer expr: rabbitmq_queue_consumers < 1 for: 5m labels: severity: warning annotations: summary: Rabbitmq no queue consumer (instance {{ $labels.instance }}) description: A queue has less than 1 consumer\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.9. Rabbitmq unroutable messages

       A queue has unroutable messages[copy]

       

        - alert: RabbitmqUnroutableMessages expr: increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Rabbitmq unroutable messages (instance {{ $labels.instance }}) description: A queue has unroutable messages\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2. 7. RabbitMQ : kbudde/rabbitmq-exporter (11 rules)[copy all]

    • 2.7.1. Rabbitmq down

       RabbitMQ node down[copy]

       

        - alert: RabbitmqDown expr: rabbitmq_up == 0 for: 5m labels: severity: critical annotations: summary: Rabbitmq down (instance {{ $labels.instance }}) description: RabbitMQ node down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.2. Rabbitmq cluster down

       Less than 3 nodes running in RabbitMQ cluster[copy]

       

        - alert: RabbitmqClusterDown expr: sum(rabbitmq_running) < 3 for: 5m labels: severity: critical annotations: summary: Rabbitmq cluster down (instance {{ $labels.instance }}) description: Less than 3 nodes running in RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.3. Rabbitmq cluster partition

       Cluster partition[copy]

       

        - alert: RabbitmqClusterPartition expr: rabbitmq_partitions > 0 for: 5m labels: severity: critical annotations: summary: Rabbitmq cluster partition (instance {{ $labels.instance }}) description: Cluster partition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.4. Rabbitmq out of memory

       Memory available for RabbitMQ is low (< 10%)[copy]

       

        - alert: RabbitmqOutOfMemory expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90 for: 5m labels: severity: warning annotations: summary: Rabbitmq out of memory (instance {{ $labels.instance }}) description: Memory available for RabbitMQ is low (< 10%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.5. Rabbitmq too many connections

       RabbitMQ instance has too many connections (> 1000)[copy]

       

        - alert: RabbitmqTooManyConnections expr: rabbitmq_connectionsTotal > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too many connections (instance {{ $labels.instance }}) description: RabbitMQ instance has too many connections (> 1000)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.6. Rabbitmq dead letter queue filling up

       Dead letter queue is filling up (> 10 msgs)[copy]

       

        - alert: RabbitmqDeadLetterQueueFillingUp expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10 for: 5m labels: severity: critical annotations: summary: Rabbitmq dead letter queue filling up (instance {{ $labels.instance }}) description: Dead letter queue is filling up (> 10 msgs)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.7. Rabbitmq too many messages in queue

       Queue is filling up (> 1000 msgs)[copy]

       

        - alert: RabbitmqTooManyMessagesInQueue expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too many messages in queue (instance {{ $labels.instance }}) description: Queue is filling up (> 1000 msgs)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.8. Rabbitmq slow queue consuming

       Queue messages are consumed slowly (> 60s)[copy]

       

        - alert: RabbitmqSlowQueueConsuming expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60 for: 5m labels: severity: warning annotations: summary: Rabbitmq slow queue consuming (instance {{ $labels.instance }}) description: Queue messages are consumed slowly (> 60s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.9. Rabbitmq no consumer

       Queue has no consumer[copy]

       

        - alert: RabbitmqNoConsumer expr: rabbitmq_queue_consumers == 0 for: 5m labels: severity: critical annotations: summary: Rabbitmq no consumer (instance {{ $labels.instance }}) description: Queue has no consumer\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.10. Rabbitmq too many consumers

       Queue should have only 1 consumer[copy]

       

        - alert: RabbitmqTooManyConsumers expr: rabbitmq_queue_consumers > 1 for: 5m labels: severity: critical annotations: summary: Rabbitmq too many consumers (instance {{ $labels.instance }}) description: Queue should have only 1 consumer\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.7.11. Rabbitmq unactive exchange

       Exchange receives less than 5 msgs per second[copy]

       

        - alert: RabbitmqUnactiveExchange expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5 for: 5m labels: severity: warning annotations: summary: Rabbitmq unactive exchange (instance {{ $labels.instance }}) description: Exchange receives less than 5 msgs per second\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2. 8. Elasticsearch : justwatchcom/elasticsearch_exporter (13 rules)[copy all]

    • 2.8.1. Elasticsearch Heap Usage Too High

       The heap usage is over 90% for 5m[copy]

       

        - alert: ElasticsearchHeapUsageTooHigh expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90 for: 5m labels: severity: critical annotations: summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }}) description: The heap usage is over 90% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.2. Elasticsearch Heap Usage warning

       The heap usage is over 80% for 5m[copy]

       

        - alert: ElasticsearchHeapUsageWarning expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80 for: 5m labels: severity: warning annotations: summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }}) description: The heap usage is over 80% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.3. Elasticsearch disk space low

       The disk usage is over 80%[copy]

       

        - alert: ElasticsearchDiskSpaceLow expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20 for: 5m labels: severity: warning annotations: summary: Elasticsearch disk space low (instance {{ $labels.instance }}) description: The disk usage is over 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.4. Elasticsearch disk out of space

       The disk usage is over 90%[copy]

       

        - alert: ElasticsearchDiskOutOfSpace expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 for: 5m labels: severity: critical annotations: summary: Elasticsearch disk out of space (instance {{ $labels.instance }}) description: The disk usage is over 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.5. Elasticsearch Cluster Red

       Elastic Cluster Red status[copy]

       

        - alert: ElasticsearchClusterRed expr: elasticsearch_cluster_health_status{color="red"} == 1 for: 5m labels: severity: critical annotations: summary: Elasticsearch Cluster Red (instance {{ $labels.instance }}) description: Elastic Cluster Red status\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.6. Elasticsearch Cluster Yellow

       Elastic Cluster Yellow status[copy]

       

        - alert: ElasticsearchClusterYellow expr: elasticsearch_cluster_health_status{color="yellow"} == 1 for: 5m labels: severity: warning annotations: summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }}) description: Elastic Cluster Yellow status\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.7. Elasticsearch Healthy Nodes

       Number of healthy nodes is less than number_of_nodes (see the sketch below)[copy]

       

        - alert: ElasticsearchHealthyNodes expr: elasticsearch_cluster_health_number_of_nodes < number_of_nodes for: 5m labels: severity: critical annotations: summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }}) description: Number of healthy nodes is less than number_of_nodes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
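
       As written, the right-hand side number_of_nodes is parsed by PromQL as a metric name that normally does not exist, so the comparison never returns anything; the data-node rule below has the same pattern. A minimal sketch that pins the expected cluster size instead (3 is an assumed value, to be adjusted to your cluster):

        elasticsearch_cluster_health_number_of_nodes < 3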

       

    • 2.8.8. Elasticsearch Healthy Data Nodes

       Number of healthy data nodes is less than number_of_data_nodes[copy]

       

        - alert: ElasticsearchHealthyDataNodes expr: elasticsearch_cluster_health_number_of_data_nodes < number_of_data_nodes for: 5m labels: severity: critical annotations: summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }}) description: Number of healthy data nodes is less than number_of_data_nodes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.9. Elasticsearch relocation shards

       Number of relocation shards for 20 min[copy]

       

        - alert: ElasticsearchRelocationShards expr: elasticsearch_cluster_health_relocating_shards > 0 for: 5m labels: severity: critical annotations: summary: Elasticsearch relocation shards (instance {{ $labels.instance }}) description: Number of relocation shards for 20 min\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.10. Elasticsearch initializing shards

       Number of initializing shards for 10 min[copy]

       

        - alert: ElasticsearchInitializingShards expr: elasticsearch_cluster_health_initializing_shards > 0 for: 5m labels: severity: warning annotations: summary: Elasticsearch initializing shards (instance {{ $labels.instance }}) description: Number of initializing shards for 10 min\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.11. Elasticsearch unassigned shards

       Number of unassigned shards for 2 min[copy]

       

        - alert: ElasticsearchUnassignedShards expr: elasticsearch_cluster_health_unassigned_shards > 0 for: 5m labels: severity: critical annotations: summary: Elasticsearch unassigned shards (instance {{ $labels.instance }}) description: Number of unassigned shards for 2 min\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.12. Elasticsearch pending tasks

       Number of pending tasks for 10 min. Cluster works slowly.[copy]

       

        - alert: ElasticsearchPendingTasks expr: elasticsearch_cluster_health_number_of_pending_tasks > 0 for: 5m labels: severity: warning annotations: summary: Elasticsearch pending tasks (instance {{ $labels.instance }}) description: Number of pending tasks for 10 min. Cluster works slowly.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.8.13. Elasticsearch no new documents

       No new documents for 10 min![copy]

       

        - alert: ElasticsearchNoNewDocuments expr: rate(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1 for: 5m labels: severity: warning annotations: summary: Elasticsearch no new documents (instance {{ $labels.instance }}) description: No new documents for 10 min!\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2. 9. Cassandra : instaclustr/cassandra-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 2. 9. Cassandra : criteo/cassandra_exporter (18 rules)[copy all]

    • 2.9.1. Cassandra hints count

       Cassandra hints count has changed on {{ $labels.instance }}; some nodes may go down[copy]

       

        - alert: CassandraHintsCount expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3 for: 5m labels: severity: critical annotations: summary: Cassandra hints count (instance {{ $labels.instance }}) description: Cassandra hints count has changed on {{ $labels.instance }}; some nodes may go down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.2. Cassandra compaction task pending

       Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.[copy]

       

        - alert: CassandraCompactionTaskPending expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[30m]) > 100 for: 5m labels: severity: warning annotations: summary: Cassandra compaction task pending (instance {{ $labels.instance }}) description: Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.3. Cassandra viewwrite latency

       High viewwrite latency on {{ $labels.instance }} cassandra node[copy]

       

        - alert: CassandraViewwriteLatency expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000 for: 5m labels: severity: warning annotations: summary: Cassandra viewwrite latency (instance {{ $labels.instance }}) description: High viewwrite latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.4. Cassandra cool hacker

       Increase of Cassandra authentication failures[copy]

       

        - alert: CassandraCoolHacker expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5 for: 5m labels: severity: warning annotations: summary: Cassandra cool hacker (instance {{ $labels.instance }}) description: Increase of Cassandra authentication failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.5. Cassandra node down

       Cassandra node down[copy]

       

        - alert: CassandraNodeDown expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra node down (instance {{ $labels.instance }}) description: Cassandra node down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.6. Cassandra commitlog pending tasks

       Unexpected number of Cassandra commitlog pending tasks[copy]

       

        - alert: CassandraCommitlogPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15 for: 5m labels: severity: warning annotations: summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }}) description: Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.7. Cassandra compaction executor blocked tasks

       Some Cassandra compaction executor tasks are blocked[copy]

       

        - alert: CassandraCompactionExecutorBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0 for: 5m labels: severity: warning annotations: summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }}) description: Some Cassandra compaction executor tasks are blocked\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.8. Cassandra flush writer blocked tasks

       Some Cassandra flush writer tasks are blocked[copy]

       

        - alert: CassandraFlushWriterBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0 for: 5m labels: severity: warning annotations: summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }}) description: Some Cassandra flush writer tasks are blocked\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.9. Cassandra repair pending tasks

       Some Cassandra repair tasks are pending[copy]

       

        - alert: CassandraRepairPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2 for: 5m labels: severity: warning annotations: summary: Cassandra repair pending tasks (instance {{ $labels.instance }}) description: Some Cassandra repair tasks are pending\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.10. Cassandra repair blocked tasks

       Some Cassandra repair tasks are blocked[copy]

       

        - alert: CassandraRepairBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0 for: 5m labels: severity: warning annotations: summary: Cassandra repair blocked tasks (instance {{ $labels.instance }}) description: Some Cassandra repair tasks are blocked\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.11. Cassandra connection timeouts total

       Some connections between nodes are ending in timeout[copy]

       

        - alert: CassandraConnectionTimeoutsTotal expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5 for: 5m labels: severity: critical annotations: summary: Cassandra connection timeouts total (instance {{ $labels.instance }}) description: Some connections between nodes are ending in timeout\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.12. Cassandra storage exceptions

       Something is going wrong with cassandra storage[copy]

       

        - alert: CassandraStorageExceptions expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1 for: 5m labels: severity: critical annotations: summary: Cassandra storage exceptions (instance {{ $labels.instance }}) description: Something is going wrong with cassandra storage\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.13. Cassandra tombstone dump

       Too many tombstones scanned in queries[copy]

       

        - alert: CassandraTombstoneDump expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000 for: 5m labels: severity: critical annotations: summary: Cassandra tombstone dump (instance {{ $labels.instance }}) description: Too many tombstones scanned in queries\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.14. Cassandra client request unavailable write

       Write failures have occurred because too many nodes are unavailable[copy]

       

        - alert: CassandraClientRequestUnavailableWrite expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request unavailable write (instance {{ $labels.instance }}) description: Write failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.15. Cassandra client request unavailable read

       Read failures have occurred because too many nodes are unavailable[copy]

       

        - alert: CassandraClientRequestUnavailableRead expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request unavailable read (instance {{ $labels.instance }}) description: Read failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.16. Cassandra client request write failure

       A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]

       

        - alert: CassandraClientRequestWriteFailure expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request write failure (instance {{ $labels.instance }}) description: A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.17. Cassandra client request read failure

       A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]

       

        - alert: CassandraClientRequestReadFailure expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request read failure (instance {{ $labels.instance }}) description: A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.9.18. Cassandra cache hit rate key cache

       Key cache hit rate is below 85%[copy]

       

        - alert: CassandraCacheHitRateKeyCache expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85 for: 5m labels: severity: critical annotations: summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }}) description: Key cache hit rate is below 85%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 2. 10. Zookeeper : cloudflare/kafka_zookeeper_exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 2. 11. Kafka : danielqsj/kafka_exporter (2 rules)[copy all]

    • 2.11.1. Kafka topics replicas

       Kafka topic in-sync replica count is below 3[copy]

       

        - alert: KafkaTopicsReplicas expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 3 for: 5m labels: severity: critical annotations: summary: Kafka topics replicas (instance {{ $labels.instance }}) description: Kafka topic in-sync replica count is below 3\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 2.11.2. Kafka consumers group

       Kafka consumer group lag is too high (> 50); a per-topic variant is sketched below[copy]

       

        - alert: KafkaConsumersGroup expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50 for: 5m labels: severity: critical annotations: summary: Kafka consumers group (instance {{ $labels.instance }}) description: Kafka consumer group lag is too high (> 50)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
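
       The kafka_consumergroup_lag series from danielqsj/kafka_exporter also carries topic and partition labels, so the lag can be broken down per topic; a minimal sketch (the 50 threshold is the same arbitrary tolerance as above):

        sum(kafka_consumergroup_lag) by (consumergroup, topic) > 50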

       


  • 3. 1. Nginx : nginx-lua-prometheus (3 rules)[copy all]

    • 3.1.1. Nginx high HTTP 4xx error rate

       Too many HTTP requests with status 4xx (> 5%)[copy]

       

        - alert: NginxHighHttp4xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }}) description: Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.1.2. Nginx high HTTP 5xx error rate

       Too many HTTP requests with status 5xx (> 5%)[copy]

       

        - alert: NginxHighHttp5xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }}) description: Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.1.3. Nginx latency high

       Nginx p99 latency is higher than 10 seconds (see the histogram_quantile note below)[copy]

       

        - alert: NginxLatencyHigh expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node, le)) > 10 for: 5m labels: severity: warning annotations: summary: Nginx latency high (instance {{ $labels.instance }}) description: Nginx p99 latency is higher than 10 seconds\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
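
       histogram_quantile() can only compute a quantile if the le bucket label survives the aggregation, so the sum above keeps le alongside the grouping labels. A minimal sketch of the pattern:

        # without le the quantile has no buckets to work with and returns NaN:
        # histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node))
        # keeping le preserves the per-bucket breakdown:
        histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node, le))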

       


  • 3. 2. Apache : Lusitaniae/apache_exporter (3 rules)[copy all]

    • 3.2.1. Apache down

       Apache down[copy]

       

        - alert: ApacheDown expr: apache_up == 0 for: 5m labels: severity: critical annotations: summary: Apache down (instance {{ $labels.instance }}) description: Apache down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.2.2. Apache workers load

       Apache workers in busy state are approaching the max workers count; more than 80% of workers are busy on {{ $labels.instance }}[copy]

       

        - alert: ApacheWorkersLoad expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 for: 5m labels: severity: critical annotations: summary: Apache workers load (instance {{ $labels.instance }}) description: Apache workers in busy state are approaching the max workers count; more than 80% of workers are busy on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.2.3. Apache restart

       Apache has just been restarted, less than one minute ago.[copy]

       

        - alert: ApacheRestart expr: apache_uptime_seconds_total / 60 < 1 for: 5m labels: severity: warning annotations: summary: Apache restart (instance {{ $labels.instance }}) description: Apache has just been restarted, less than one minute ago.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 3. 3. HaProxy : Embedded exporter (HAProxy >= v2)

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 3. 3. HaProxy : prometheus/haproxy_exporter (HAProxy < v2) (16 rules)[copy all]

    • 3.3.1. HAProxy down

       HAProxy down[copy]

       

        - alert: HaproxyDown expr: haproxy_up == 0 for: 5m labels: severity: critical annotations: summary: HAProxy down (instance {{ $labels.instance }}) description: HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.2. HAProxy high HTTP 4xx error rate backend

       Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }} (the corrected PromQL forms for this group are sketched below)[copy]

       

        - alert: HaproxyHighHttp4xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
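
       Several of the flattened expressions in this HAProxy group omit punctuation that PromQL requires, so the rules in this group are written out in corrected form: aggregation arguments are parenthesised, rate() wraps the raw counter before any aggregation, and subqueries carry the trailing colon. A minimal sketch of each pattern (the [1m] window is an assumed choice wherever the original omitted a range):

        # aggregation arguments must be parenthesised
        sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]))
        # rate() needs a range vector of the raw counter; aggregate afterwards
        sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
        # a subquery needs the trailing colon: <expr>[<range>:<resolution>]
        avg_over_time((sum by (backend) (haproxy_server_max_sessions) / sum by (backend) (haproxy_server_limit_sessions))[2m:]) * 100 > 80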

       

    • 3.3.3. HAProxy high HTTP 5xx error rate backend

       Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]

       

        - alert: HaproxyHighHttp5xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.4. HAProxy high HTTP 4xx error rate server

       Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}[copy]

       

        - alert: HaproxyHighHttp4xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }}) description: Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.5. HAProxy high HTTP 5xx error rate server

       Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}[copy]

       

        - alert: HaproxyHighHttp5xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }}) description: Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.6. HAProxy server response errors

       Too many response errors to {{ $labels.server }} server (> 5%).[copy]

       

        - alert: HaproxyServerResponseErrors expr: sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy server response errors (instance {{ $labels.instance }}) description: Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.7. HAProxy backend connection errors

       Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.[copy]

       

        - alert: HaproxyBackendConnectionErrors expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100 for: 5m labels: severity: critical annotations: summary: HAProxy backend connection errors (instance {{ $labels.instance }}) description: Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.8. HAProxy server connection errors

       Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.[copy]

       

        - alert: HaproxyServerConnectionErrors expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100 for: 5m labels: severity: critical annotations: summary: HAProxy server connection errors (instance {{ $labels.instance }}) description: Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.9. HAProxy backend max active session

       HAproxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).[copy]

       

        - alert: HaproxyBackendMaxActiveSession expr: avg_over_time((sum by (backend) (haproxy_server_max_sessions) / sum by (backend) (haproxy_server_limit_sessions))[2m:]) * 100 > 80 for: 5m labels: severity: warning annotations: summary: HAProxy backend max active session (instance {{ $labels.instance }}) description: HAproxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.10. HAProxy pending requests

       Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]

       

        - alert: HaproxyPendingRequests expr: sum by (backend) (haproxy_backend_current_queue) > 0 for: 5m labels: severity: warning annotations: summary: HAProxy pending requests (instance {{ $labels.instance }}) description: Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.11. HAProxy HTTP slowing down

       Average request time is increasing[copy]

       

        - alert: HaproxyHttpSlowingDown expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 2 for: 5m labels: severity: warning annotations: summary: HAProxy HTTP slowing down (instance {{ $labels.instance }}) description: Average request time is increasing\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.12. HAProxy retry high

       High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]

       

        - alert: HaproxyRetryHigh expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10 for: 5m labels: severity: warning annotations: summary: HAProxy retry high (instance {{ $labels.instance }}) description: High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.13. HAProxy backend down

       HAProxy backend is down[copy]

       

        - alert: HaproxyBackendDown expr: haproxy_backend_up == 0 for: 5m labels: severity: critical annotations: summary: HAProxy backend down (instance {{ $labels.instance }}) description: HAProxy backend is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.14. HAProxy server down

       HAProxy server is down[copy]

       

        - alert: HaproxyServerDown expr: haproxy_server_up == 0 for: 5m labels: severity: critical annotations: summary: HAProxy server down (instance {{ $labels.instance }}) description: HAProxy server is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.15. HAProxy frontend security blocked requests

       HAProxy is blocking requests for security reasons[copy]

       

        - alert: HaproxyFrontendSecurityBlockedRequests expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[1m])) > 10 for: 5m labels: severity: warning annotations: summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }}) description: HAProxy is blocking requests for security reasons\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.3.16. HAProxy server healthcheck failure

       Some server healthchecks are failing on {{ $labels.server }}[copy]

       

        - alert: HaproxyServerHealthcheckFailure expr: increase(haproxy_server_check_failures_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: HAProxy server healthcheck failure (instance {{ $labels.instance }}) description: Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 3. 4. Traefik : Embedded exporter (3 rules)[copy all]

    • 3.4.1. Traefik backend down

       All Traefik backends are down[copy]

       

        - alert: TraefikBackendDown expr: count(traefik_backend_server_up) by (backend) == 0 for: 5m labels: severity: critical annotations: summary: Traefik backend down (instance {{ $labels.instance }}) description: All Traefik backends are down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.4.2. Traefik high HTTP 4xx error rate backend

       Traefik backend 4xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp4xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: Traefik backend 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.4.3. Traefik high HTTP 5xx error rate backend

       Traefik backend 5xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp5xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: Traefik backend 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 3. 4. Traefik : Embedded exporter v2 (3 rules)[copy all]

    • 3.4.1. Traefik service down

       All Traefik services are down[copy]

       

        - alert: TraefikServiceDown expr: count(traefik_service_server_up) by (service) == 0 for: 5m labels: severity: critical annotations: summary: Traefik service down (instance {{ $labels.instance }}) description: All Traefik services are down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.4.2. Traefik high HTTP 4xx error rate service

       Traefik service 4xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp4xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }}) description: Traefik service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 3.4.3. Traefik high HTTP 5xx error rate service

       Traefik service 5xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp5xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }}) description: Traefik service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 4. 1. PHP-FPM : bakins/php-fpm-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 4. 2. JVM : java-client (1 rule)[copy all]

    • 4.2.1. JVM memory filling up

       JVM memory is filling up (> 80%)[copy]

       

        - alert: JvmMemoryFillingUp expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8 for: 5m labels: severity: warning annotations: summary: JVM memory filling up (instance {{ $labels.instance }}) description: JVM memory is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 4. 3. Sidekiq : Strech/sidekiq-prometheus-exporter (2 rules)[copy all]

    • 4.3.1. Sidekiq queue size

       Sidekiq queue {{ $labels.name }} is growing[copy]

       

        - alert: SidekiqQueueSize expr: sidekiq_queue_size > 100 for: 5m labels: severity: warning annotations: summary: Sidekiq queue size (instance {{ $labels.instance }}) description: Sidekiq queue {{ $labels.name }} is growing\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 4.3.2. Sidekiq scheduling latency too high

       Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.[copy]

       

        - alert: SidekiqSchedulingLatencyTooHigh expr: max(sidekiq_queue_latency) > 120 for: 5m labels: severity: critical annotations: summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }}) description: Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 5. 1. Kubernetes : kube-state-metrics (32 rules)[copy all]

    • 5.1.1. Kubernetes Node ready

       Node {{ $labels.node }} has been unready for a long time[copy]

       

        - alert: KubernetesNodeReady expr: kube_node_status_condition{condition="Ready",status="true"} == 0 for: 5m labels: severity: critical annotations: summary: Kubernetes Node ready (instance {{ $labels.instance }}) description: Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.2. Kubernetes memory pressure

       {{ $labels.node }} has MemoryPressure condition[copy]

       

        - alert: KubernetesMemoryPressure expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 for: 5m labels: severity: critical annotations: summary: Kubernetes memory pressure (instance {{ $labels.instance }}) description: {{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.3. Kubernetes disk pressure

       {{ $labels.node }} has DiskPressure condition[copy]

       

        - alert: KubernetesDiskPressure expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1 for: 5m labels: severity: critical annotations: summary: Kubernetes disk pressure (instance {{ $labels.instance }}) description: {{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.4. Kubernetes out of disk

       {{ $labels.node }} has OutOfDisk condition[copy]

       

        - alert: KubernetesOutOfDisk expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1 for: 5m labels: severity: critical annotations: summary: Kubernetes out of disk (instance {{ $labels.instance }}) description: {{ $labels.node }} has OutOfDisk condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.5. Kubernetes out of capacity

       {{ $labels.node }} is out of capacity[copy]

       

        - alert: KubernetesOutOfCapacity expr: sum(kube_pod_info) by (node) / sum(kube_node_status_allocatable_pods) by (node) * 100 > 90 for: 5m labels: severity: warning annotations: summary: Kubernetes out of capacity (instance {{ $labels.instance }}) description: {{ $labels.node }} is out of capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.6. Kubernetes Job failed

       Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete[copy]

       

        - alert: KubernetesJobFailed expr: kube_job_status_failed > 0 for: 5m labels: severity: warning annotations: summary: Kubernetes Job failed (instance {{ $labels.instance }}) description: Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.7. Kubernetes CronJob suspended

       CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended[copy]

       

        - alert: KubernetesCronjobSuspended expr: kube_cronjob_spec_suspend != 0 for: 5m labels: severity: warning annotations: summary: Kubernetes CronJob suspended (instance {{ $labels.instance }}) description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.8. Kubernetes PersistentVolumeClaim pending

       PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending[copy]

       

        - alert: KubernetesPersistentvolumeclaimPending expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1 for: 5m labels: severity: warning annotations: summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }}) description: PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.9. Kubernetes Volume out of disk space

       Volume is almost full (< 10% left)[copy]

       

        - alert: KubernetesVolumeOutOfDiskSpace expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 for: 5m labels: severity: warning annotations: summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }}) description: Volume is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.10. Kubernetes Volume full in four days

       {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available. (See the worked projection example below.)[copy]

       

        - alert: KubernetesVolumeFullInFourDays expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0 for: 5m labels: severity: critical annotations: summary: Kubernetes Volume full in four days (instance {{ $labels.instance }}) description: {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
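
       predict_linear() fits a linear trend to the last 6 h of kubelet_volume_stats_available_bytes and extrapolates it 4 * 24 * 3600 = 345600 seconds (four days) ahead; the alert fires when that projection drops below zero, and the reported VALUE is the projected available bytes rather than a percentage. With illustrative numbers: if available space fell from 40 GiB to 30 GiB over the 6 h window (about -1.7 GiB/h), the four-day projection is roughly 30 GiB - 1.7 GiB/h * 96 h ≈ -130 GiB, well below zero, so the alert fires.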

       

    • 5.1.11. Kubernetes PersistentVolume error

       Persistent volume is in bad state[copy]

       

        - alert: KubernetesPersistentvolumeError expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }}) description: Persistent volume is in bad state\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.12. Kubernetes StatefulSet down

       A StatefulSet went down[copy]

       

        - alert: KubernetesStatefulsetDown expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1 for: 5m labels: severity: critical annotations: summary: Kubernetes StatefulSet down (instance {{ $labels.instance }}) description: A StatefulSet went down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.13. Kubernetes HPA scaling ability

       Pod is unable to scale[copy]

       

        - alert: KubernetesHpaScalingAbility expr: kube_hpa_status_condition{status="false", condition ="AbleToScale"} == 1 for: 5m labels: severity: warning annotations: summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }}) description: Pod is unable to scale\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.14. Kubernetes HPA metric availability

       HPA is not able to collect metrics[copy]

       

        - alert: KubernetesHpaMetricAvailability expr: kube_hpa_status_condition{status="false", condition="ScalingActive"} == 1 for: 5m labels: severity: warning annotations: summary: Kubernetes HPA metric availability (instance {{ $labels.instance }}) description: HPA is not able to collect metrics\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.15. Kubernetes HPA scale capability

       The maximum number of desired Pods has been hit[copy]

       

        - alert: KubernetesHpaScaleCapability expr: kube_hpa_status_desired_replicas >= kube_hpa_spec_max_replicas for: 5m labels: severity: warning annotations: summary: Kubernetes HPA scale capability (instance {{ $labels.instance }}) description: The maximum number of desired Pods has been hit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.16. Kubernetes Pod not healthy

       Pod has been in a non-ready state for longer than an hour.[copy]

       

        - alert: KubernetesPodNotHealthy expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes Pod not healthy (instance {{ $labels.instance }}) description: Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
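
       The [1h:] suffix is a PromQL subquery: the inner sum is re-evaluated over the last hour at the default resolution, and min_over_time then takes the minimum of those samples, so the alert only fires if the pod stayed in Pending, Unknown or Failed for the whole hour. The same idea as a standalone sketch with an explicit 5-minute subquery resolution:

        # Minimum, over the last hour sampled every 5 minutes, of the number of
        # pod series per (namespace, pod) in a non-ready phase
        min_over_time(
          sum by (namespace, pod) (
            kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}
          )[1h:5m]
        ) > 0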

       

    • 5.1.17. Kubernetes pod crash looping

       Pod {{ $labels.pod }} is crash looping[copy]

       

        - alert: KubernetesPodCrashLooping expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5 for: 5m labels: severity: warning annotations: summary: Kubernetes pod crash looping (instance {{ $labels.instance }}) description: Pod {{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
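
       rate(kube_pod_container_status_restarts_total[15m]) is restarts per second averaged over 15 minutes; multiplying by 60 * 5 rescales it to restarts per 5 minutes, so the threshold of 5 corresponds to roughly 15 restarts within the 15-minute window. A sketch of the same threshold written with increase(), which some find easier to read (equivalent up to rate()'s extrapolation):

        # Roughly: more than ~15 container restarts within the last 15 minutes
        increase(kube_pod_container_status_restarts_total[15m]) > 15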

       

    • 5.1.18. Kubernetes ReplicaSet mismatch

       ReplicaSet replicas mismatch[copy]

       

        - alert: KubernetesReplicassetMismatch expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas for: 5m labels: severity: warning annotations: summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }}) description: ReplicaSet replicas mismatch\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.19. Kubernetes Deployment replicas mismatch

       Deployment Replicas mismatch[copy]

       

        - alert: KubernetesDeploymentReplicasMismatch expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available for: 5m labels: severity: warning annotations: summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }}) description: Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.20. Kubernetes StatefulSet replicas mismatch

       A StatefulSet has not matched the expected number of replicas for longer than 5 minutes.[copy]

       

        - alert: KubernetesStatefulsetReplicasMismatch expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas for: 5m labels: severity: warning annotations: summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) description: A StatefulSet has not matched the expected number of replicas for longer than 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.21. Kubernetes Deployment generation mismatch

       A Deployment has failed but has not been rolled back.[copy]

       

        - alert: KubernetesDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 5m labels: severity: critical annotations: summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) description: A Deployment has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.22. Kubernetes StatefulSet generation mismatch

       A StatefulSet has failed but has not been rolled back.[copy]

       

        - alert: KubernetesStatefulsetGenerationMismatch expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation for: 5m labels: severity: critical annotations: summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) description: A StatefulSet has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.23. Kubernetes StatefulSet update not rolled out

       StatefulSet update has not been rolled out.[copy]

       

        - alert: KubernetesStatefulsetUpdateNotRolledOut expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) for: 5m labels: severity: critical annotations: summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) description: StatefulSet update has not been rolled out.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.24. Kubernetes DaemonSet rollout stuck

       Some Pods of DaemonSet are not scheduled or not ready[copy]

       

        - alert: KubernetesDaemonsetRolloutStuck expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) description: Some Pods of DaemonSet are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.25. Kubernetes DaemonSet misscheduled

       Some DaemonSet Pods are running where they are not supposed to run[copy]

       

        - alert: KubernetesDaemonsetMisscheduled expr: kube_daemonset_status_number_misscheduled > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) description: Some DaemonSet Pods are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.26. Kubernetes CronJob too long

       CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.[copy]

       

        - alert: KubernetesCronjobTooLong expr: time() - kube_cronjob_next_schedule_time > 3600 for: 5m labels: severity: warning annotations: summary: Kubernetes CronJob too long (instance {{ $labels.instance }}) description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.27. Kubernetes job completion

       Kubernetes Job failed to complete[copy]

       

        - alert: KubernetesJobCompletion expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes job completion (instance {{ $labels.instance }}) description: Kubernetes Job failed to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.28. Kubernetes API server errors

       Kubernetes API server is experiencing high error rate[copy]

       

        - alert: KubernetesApiServerErrors expr: sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3 for: 5m labels: severity: critical annotations: summary: Kubernetes API server errors (instance {{ $labels.instance }}) description: Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
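
       apiserver_request_count is the metric name used by older Kubernetes releases; recent versions expose apiserver_request_total instead. A sketch of the same 5xx error ratio against the newer name (use whichever metric your apiserver actually exports):

        # 5xx responses as a percentage of all apiserver requests over 2 minutes
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[2m]))
          / sum(rate(apiserver_request_total{job="apiserver"}[2m])) * 100 > 3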

       

    • 5.1.29. Kubernetes API client errors

       Kubernetes API client is experiencing high error rate[copy]

       

        - alert: KubernetesApiClientErrors expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 1 for: 5m labels: severity: critical annotations: summary: Kubernetes API client errors (instance {{ $labels.instance }}) description: Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.30. Kubernetes client certificate expires next week

       A client certificate used to authenticate to the apiserver is expiring next week.[copy]

       

        - alert: KubernetesClientCertificateExpiresNextWeek expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60 for: 5m labels: severity: warning annotations: summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }}) description: A client certificate used to authenticate to the apiserver is expiring next week.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.1.31. Kubernetes client certificate expires soon

       A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.[copy]

       

        - alert: KubernetesClientCertificateExpiresSoon expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60 for: 5m labels: severity: critical annotations: summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }}) description: A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
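
       Both certificate rules take histogram_quantile(0.01, ...) over apiserver_client_certificate_expiration_seconds_bucket, i.e. the 1st percentile of remaining client-certificate lifetime seen by the apiserver: the first rule warns below 7 days, this one turns critical below 24 hours. A sketch for checking the current value interactively, converted to days:

        # 1st percentile of remaining client certificate lifetime, in days
        histogram_quantile(0.01,
          sum by (job, le) (
            rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])
          )
        ) / 86400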

       

    • 5.1.32. Kubernetes API server latency

       Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.[copy]

       

        - alert: KubernetesApiServerLatency expr: histogram_quantile(0.99, sum(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}) WITHOUT (instance, resource)) / 1e+06 > 1 for: 5m labels: severity: warning annotations: summary: Kubernetes API server latency (instance {{ $labels.instance }}) description: Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
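
       The division by 1e+06 converts apiserver_request_latencies_bucket, which is recorded in microseconds, into seconds. Newer Kubernetes releases expose apiserver_request_duration_seconds_bucket (already in seconds) instead; a rough sketch of the same check against that metric, using the usual rate-then-quantile pattern:

        # 99th percentile apiserver request latency in seconds, per verb
        histogram_quantile(0.99,
          sum without (instance) (
            rate(apiserver_request_duration_seconds_bucket{verb!~"CONNECT|WATCH|WATCHLIST|PROXY"}[5m])
          )
        ) > 1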

       


  • 5. 2. Nomad : Embedded exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 5. 3. Consul : prometheus/consul_exporter (3 rules)[copy all]

    • 5.3.1. Consul service healthcheck failed

       Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`[copy]

       

        - alert: ConsulServiceHealthcheckFailed expr: consul_catalog_service_node_healthy == 0 for: 5m labels: severity: critical annotations: summary: Consul service healthcheck failed (instance {{ $labels.instance }}) description: Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.3.2. Consul missing master node

       Number of Consul raft peers should be at least 3 in order to preserve quorum.[copy]

       

        - alert: ConsulMissingMasterNode expr: consul_raft_peers < 3 for: 5m labels: severity: critical annotations: summary: Consul missing master node (instance {{ $labels.instance }}) description: Number of Consul raft peers should be at least 3 in order to preserve quorum.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.3.3. Consul agent unhealthy

       A Consul agent is down[copy]

       

        - alert: ConsulAgentUnhealthy expr: consul_health_node_status{status="critical"} == 1 for: 5m labels: severity: critical annotations: summary: Consul agent unhealthy (instance {{ $labels.instance }}) description: A Consul agent is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 5. 4. Etcd (13 rules)[copy all]

    • 5.4.1. Etcd insufficient Members

       Etcd cluster should have an odd number of members[copy]

       

        - alert: EtcdInsufficientMembers expr: count(etcd_server_id) % 2 == 0 for: 5m labels: severity: critical annotations: summary: Etcd insufficient Members (instance {{ $labels.instance }}) description: Etcd cluster should have an odd number of members\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
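
       An even count of reporting members usually means one member of an odd-sized cluster is gone: with n members etcd tolerates floor((n-1)/2) failures, so 3 members tolerate 1 and 5 tolerate 2. A sketch of a stricter variant that also flags clusters that have shrunk below 3 members entirely (it assumes every healthy member is scraped and exposes etcd_server_id):

        # Even member count (a member of an odd-sized cluster is missing)
        # or fewer than 3 members (no failure tolerance at all)
        count(etcd_server_id) % 2 == 0
          or
        count(etcd_server_id) < 3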

       

    • 5.4.2. Etcd no Leader

       Etcd cluster has no leader[copy]

       

        - alert: EtcdNoLeader expr: etcd_server_has_leader == 0 for: 5m labels: severity: critical annotations: summary: Etcd no Leader (instance {{ $labels.instance }}) description: Etcd cluster has no leader\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.3. Etcd high number of leader changes

       Etcd leader changed more than 3 times during the last hour[copy]

       

        - alert: EtcdHighNumberOfLeaderChanges expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3 for: 5m labels: severity: warning annotations: summary: Etcd high number of leader changes (instance {{ $labels.instance }}) description: Etcd leader changed more than 3 times during the last hour\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.4. Etcd high number of failed GRPC requests

       More than 1% GRPC request failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01 for: 5m labels: severity: warning annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: More than 1% GRPC request failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.5. Etcd high number of failed GRPC requests

       More than 5% GRPC request failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05 for: 5m labels: severity: critical annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: More than 5% GRPC request failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.6. Etcd GRPC requests slow

       GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdGrpcRequestsSlow expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15 for: 5m labels: severity: warning annotations: summary: Etcd GRPC requests slow (instance {{ $labels.instance }}) description: GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.7. Etcd high number of failed HTTP requests

       More than 1% HTTP failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01 for: 5m labels: severity: warning annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: More than 1% HTTP failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.8. Etcd high number of failed HTTP requests

       More than 5% HTTP failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05 for: 5m labels: severity: critical annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: More than 5% HTTP failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.9. Etcd HTTP requests slow

       HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdHttpRequestsSlow expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15 for: 5m labels: severity: warning annotations: summary: Etcd HTTP requests slow (instance {{ $labels.instance }}) description: HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.10. Etcd member communication slow

       Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdMemberCommunicationSlow expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15 for: 5m labels: severity: warning annotations: summary: Etcd member communication slow (instance {{ $labels.instance }}) description: Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.11. Etcd high number of failed proposals

       Etcd server got more than 5 failed proposals over the past hour[copy]

       

        - alert: EtcdHighNumberOfFailedProposals expr: increase(etcd_server_proposals_failed_total[1h]) > 5 for: 5m labels: severity: warning annotations: summary: Etcd high number of failed proposals (instance {{ $labels.instance }}) description: Etcd server got more than 5 failed proposals over the past hour\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.12. Etcd high fsync durations

       Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes[copy]

       

        - alert: EtcdHighFsyncDurations expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 for: 5m labels: severity: warning annotations: summary: Etcd high fsync durations (instance {{ $labels.instance }}) description: Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 5.4.13. Etcd high commit durations

       Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes[copy]

       

        - alert: EtcdHighCommitDurations expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25 for: 5m labels: severity: warning annotations: summary: Etcd high commit durations (instance {{ $labels.instance }}) description: Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 5. 5. Linkerd : Embedded exporter (1 rule)[copy all]

    • 5.5.1. Linkerd high error rate

       Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%[copy]

       

        - alert: LinkerdHighErrorRate expr: sum(rate(request_errors_total[5m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[5m])) by (deployment, statefulset, daemonset) * 100 > 10 for: 5m labels: severity: warning annotations: summary: Linkerd high error rate (instance {{ $labels.instance }}) description: Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 5. 6. Istio

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 6. 1. Ceph : Embedded exporter (13 rules)[copy all]

    • 6.1.1. Ceph State

       Ceph instance unhealthy[copy]

       

        - alert: CephState expr: ceph_health_status != 0 for: 5m labels: severity: critical annotations: summary: Ceph State (instance {{ $labels.instance }}) description: Ceph instance unhealthy\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.2. Ceph monitor clock skew

       Ceph monitor clock skew detected. Please check ntp and hardware clock settings[copy]

       

        - alert: CephMonitorClockSkew expr: abs(ceph_monitor_clock_skew_seconds) > 0.2 for: 5m labels: severity: warning annotations: summary: Ceph monitor clock skew (instance {{ $labels.instance }}) description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.3. Ceph monitor low space

       Ceph monitor storage is low.[copy]

       

        - alert: CephMonitorLowSpace expr: ceph_monitor_avail_percent < 10 for: 5m labels: severity: warning annotations: summary: Ceph monitor low space (instance {{ $labels.instance }}) description: Ceph monitor storage is low.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.4. Ceph OSD Down

       Ceph Object Storage Daemon Down[copy]

       

        - alert: CephOsdDown expr: ceph_osd_up == 0 for: 5m labels: severity: critical annotations: summary: Ceph OSD Down (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon Down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.5. Ceph high OSD latency

       Ceph Object Storage Daemon latency is high. Please check whether it is stuck in a weird state.[copy]

       

        - alert: CephHighOsdLatency expr: ceph_osd_perf_apply_latency_seconds > 10 for: 5m labels: severity: warning annotations: summary: Ceph high OSD latency (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon latency is high. Please check whether it is stuck in a weird state.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.6. Ceph OSD low space

       Ceph Object Storage Daemon is running out of space. Please add more disks.[copy]

       

        - alert: CephOsdLowSpace expr: ceph_osd_utilization > 90 for: 5m labels: severity: warning annotations: summary: Ceph OSD low space (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon is running out of space. Please add more disks.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.7. Ceph OSD reweighted

       Ceph Object Storage Daemon is taking too much time to resize.[copy]

       

        - alert: CephOsdReweighted expr: ceph_osd_weight < 1 for: 5m labels: severity: warning annotations: summary: Ceph OSD reweighted (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon is taking too much time to resize.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.8. Ceph PG down

       Some Ceph placement groups are down. Please ensure that all the data are available.[copy]

       

        - alert: CephPgDown expr: ceph_pg_down > 0 for: 5m labels: severity: critical annotations: summary: Ceph PG down (instance {{ $labels.instance }}) description: Some Ceph placement groups are down. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.9. Ceph PG incomplete

       Some Ceph placement groups are incomplete. Please ensure that all the data are available.[copy]

       

        - alert: CephPgIncomplete expr: ceph_pg_incomplete > 0 for: 5m labels: severity: critical annotations: summary: Ceph PG incomplete (instance {{ $labels.instance }}) description: Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.10. Ceph PG inconsistent

       Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.[copy]

       

        - alert: CephPgInconsistant expr: ceph_pg_inconsistent > 0 for: 5m labels: severity: warning annotations: summary: Ceph PG inconsistent (instance {{ $labels.instance }}) description: Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.11. Ceph PG activation long

       Some Ceph placement groups are taking too long to activate.[copy]

       

        - alert: CephPgActivationLong expr: ceph_pg_activating > 0 for: 5m labels: severity: warning annotations: summary: Ceph PG activation long (instance {{ $labels.instance }}) description: Some Ceph placement groups are taking too long to activate.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.12. Ceph PG backfill full

       Some Ceph placement groups are located on a full Object Storage Daemon. Those PGs may become unavailable shortly. Please check the OSDs, change weights or reconfigure CRUSH rules.[copy]

       

        - alert: CephPgBackfillFull expr: ceph_pg_backfill_toofull > 0 for: 5m labels: severity: warning annotations: summary: Ceph PG backfill full (instance {{ $labels.instance }}) description: Some Ceph placement groups are located on a full Object Storage Daemon. Those PGs may become unavailable shortly. Please check the OSDs, change weights or reconfigure CRUSH rules.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.1.13. Ceph PG unavailable

       Some Ceph placement groups are unavailable.[copy]

       

        - alert: CephPgUnavailable expr: ceph_pg_total - ceph_pg_active > 0 for: 5m labels: severity: critical annotations: summary: Ceph PG unavailable (instance {{ $labels.instance }}) description: Some Ceph placement groups are unavailable.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 6. 2. SpeedTest : Speedtest exporter (2 rules)[copy all]

    • 6.2.1. SpeedTest Slow Internet Download

       Internet download speed is currently {{humanize $value}} Mbps.[copy]

       

        - alert: SpeedtestSlowInternetDownload expr: avg_over_time(speedtest_download[30m]) < 75 for: 5m labels: severity: warning annotations: summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }}) description: Internet download speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.2.2. SpeedTest Slow Internet Upload

       Internet upload speed is currently {{humanize $value}} Mbps.[copy]

       

        - alert: SpeedtestSlowInternetUpload expr: avg_over_time(speedtest_upload[30m]) < 20 for: 5m labels: severity: warning annotations: summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }}) description: Internet upload speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 6. 3. ZFS : node-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋 

  • 6. 4. OpenEBS : Embedded exporter (1 rule)[copy all]

    • 6.4.1. OpenEBS used pool capacity

       OpenEBS pool uses more than 80% of its capacity[copy]

       

        - alert: OpenebsUsedPoolCapacity expr: (openebs_used_pool_capacity_percent) > 80 for: 5m labels: severity: warning annotations: summary: OpenEBS used pool capacity (instance {{ $labels.instance }}) description: OpenEBS pool uses more than 80% of its capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 6. 5. Minio : Embedded exporter (2 rules)[copy all]

    • 6.5.1. Minio disk offline

       Minio disk is offline[copy]

       

        - alert: MinioDiskOffline expr: minio_offline_disks > 0 for: 5m labels: severity: critical annotations: summary: Minio disk offline (instance {{ $labels.instance }}) description: Minio disk is offline\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.5.2. Minio storage space exhausted

       Minio storage space is low (< 10 GB)[copy]

       

        - alert: MinioStorageSpaceExhausted expr: minio_disk_storage_free_bytes / 1024 / 1024 / 1024 < 10 for: 5m labels: severity: warning annotations: summary: Minio storage space exhausted (instance {{ $labels.instance }}) description: Minio storage space is low (< 10 GB)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 6. 6. Juniper : czerwonk/junos_exporter (3 rules)[copy all]

    • 6.6.1. Juniper switch down

       The switch appears to be down[copy]

       

        - alert: JuniperSwitchDown expr: junos_up == 0 for: 5m labels: severity: critical annotations: summary: Juniper switch down (instance {{ $labels.instance }}) description: The switch appears to be down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.6.2. Juniper high Bandwidth Usage 1GiB

       Interface is highly saturated for at least 1 min. (> 0.90GiB/s)[copy]

       

        - alert: JuniperHighBandwithUsage1gib expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90 for: 5m labels: severity: critical annotations: summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }}) description: Interface is highly saturated for at least 1 min. (> 0.90GiB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 6.6.3. Juniper high Bandwidth Usage 1GiB

       Interface is getting saturated for at least 1 min. (> 0.80GiB/s)[copy]

       

        - alert: JuniperHighBandwithUsage1gib expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80 for: 5m labels: severity: warning annotations: summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }}) description: Interface is getting saturated for at least 1 min. (> 0.80GiB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
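
       In both Juniper rules, rate(junos_interface_transmit_bytes[1m]) is bytes per second and the * 8 converts it to bits per second, so 1e+9 * 0.90 and 1e+9 * 0.80 correspond to 0.9 and 0.8 Gbit/s of a 1 Gbit/s link (despite the "GiB" wording). A sketch for eyeballing per-interface transmit utilisation as a percentage of a 1 Gbit/s link:

        # Transmit utilisation of each interface, as % of a 1 Gbit/s link
        rate(junos_interface_transmit_bytes[1m]) * 8 / 1e9 * 100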

       


  • 6. 7. CoreDNS : Embedded exporter (1 rule)[copy all]

    • 6.7.1. CoreDNS Panic Count

       Number of CoreDNS panics encountered[copy]

       

        - alert: CorednsPanicCount expr: increase(coredns_panic_count_total[10m]) > 0 for: 5m labels: severity: critical annotations: summary: CoreDNS Panic Count (instance {{ $labels.instance }}) description: Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


  • 7. 1. Thanos (3 rules)[copy all]

    • 7.1.1. Thanos compaction halted

       Thanos compaction has failed to run and is now halted.[copy]

       

        - alert: ThanosCompactionHalted expr: thanos_compactor_halted == 1 for: 5m labels: severity: critical annotations: summary: Thanos compaction halted (instance {{ $labels.instance }}) description: Thanos compaction has failed to run and is now halted.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 7.1.2. Thanos compact bucket operation failure

       Thanos compaction has failing storage operations[copy]

       

        - alert: ThanosCompactBucketOperationFailure expr: rate(thanos_objstore_bucket_operation_failures_total[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Thanos compact bucket operation failure (instance {{ $labels.instance }}) description: Thanos compaction has failing storage operations\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       

    • 7.1.3. Thanos compact not run

       Thanos compaction has not run in 24 hours.[copy]

       

        - alert: ThanosCompactNotRun expr: (time() - thanos_objstore_bucket_last_successful_upload_time) > 24*60*60 for: 5m labels: severity: critical annotations: summary: Thanos compact not run (instance {{ $labels.instance }}) description: Thanos compaction has not run in 24 hours.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}

       


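  To actually evaluate any of the groups above, save them as rule files, reference them from prometheus.yml and point Prometheus at Alertmanager. A minimal sketch; the rules path and the alertmanager:9093 target are placeholders to adapt:

   # prometheus.yml (fragment)
   rule_files:
     - /etc/prometheus/rules/*.yml        # directory holding the alert groups above

   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['alertmanager:9093']
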
Awesome Prometheus alerts is maintained by samber.

