Prometheus cluster + Alertmanager cluster + InfluxDB remote storage: highly available monitoring


Server A: 192.168.1.190 (Prometheus, Alertmanager)

Server B: 192.168.1.206 (Prometheus, Alertmanager, InfluxDB, Nginx)

 

Basic HA + remote storage

Building on the basic HA model, adding Remote Storage support saves the monitoring data to a third-party storage service.
This keeps the Prometheus service available while also persisting the data: if a Prometheus server crashes or its data is lost, it can be restored quickly, and a Prometheus server can be migrated easily. The approach therefore suits environments where the monitoring footprint is modest but you want durable monitoring data and a Prometheus server that can be moved between hosts.

 

 

Install InfluxDB on server B with Docker

mkdir -p /data/influxdb

vi /data/influxdb/docker-compose-monitor.yml

version: '2'
services:
  influxdb:
    image: influxdb
    container_name: influxdb
    hostname: influxdb
    restart: always
    command: -config /etc/influxdb/influxdb.conf
    ports:
      - "8086:8086"
      - "8083:8083"
    volumes:
      - /data/influxdb/conf:/etc/influxdb
      - /data/influxdb/data:/var/lib/influxdb/data 
      - /data/influxdb/meta:/var/lib/influxdb/meta
      - /data/influxdb/wal:/var/lib/influxdb/wal
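
The volume mounts above expect these host directories to exist, so create them before writing the config file:

mkdir -p /data/influxdb/{conf,data,meta,wal}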

vi /data/influxdb/conf/influxdb.conf

### Welcome to the InfluxDB configuration file.

# The values in this file override the default values used by the system if
# a config option is not specified. The commented out lines are the configuration
# field and the default value used. Uncommenting a line and changing the value
# will change the value used at runtime when the process is restarted.

# Once every 24 hours InfluxDB will report usage data to usage.influxdata.com
# The data includes a random ID, os, arch, version, the number of series and other
# usage data. No data from user databases is ever transmitted.
# Change this option to true to disable reporting.
# reporting-disabled = false

# Bind address to use for the RPC service for backup and restore.
# bind-address = "127.0.0.1:8088"
#
#
###############################
## InfluxDB configuration tuning, version 1.6 ##
###############################
###
### [meta]
###
### Controls the parameters for the Raft consensus group that stores metadata
### about the InfluxDB cluster.
###

[meta]
# Where the metadata/raft database is stored
# Metadata storage directory
dir = "/var/lib/influxdb/meta"

# Automatically create a default retention policy when creating a database.
# retention-autocreate = true

# If log messages are printed for the meta service
# logging-enabled = true

###
### [data]
###
### Controls where the actual shard data for InfluxDB lives and how it is
### flushed from the WAL. "dir" may need to be changed to a suitable place
### for your system, but the WAL settings are an advanced configuration. The
### defaults should work for most systems.
###

[data]
# The directory where the TSM storage engine stores TSM files.
# Data storage directory
dir = "/var/lib/influxdb/data"

# The directory where the TSM storage engine stores WAL files.
# WAL data directory
wal-dir = "/var/lib/influxdb/wal"

# The amount of time that a write will wait before fsyncing. A duration
# greater than 0 can be used to batch up multiple fsync calls. This is useful for slower
# disks or when WAL write contention is seen. A value of 0s fsyncs every write to the WAL.
# Values in the range of 0-100ms are recommended for non-SSD disks.
# wal-fsync-delay = "0s"


# The type of shard index to use for new shards. The default is an in-memory index that is
# recreated at startup. A value of "tsi1" will use a disk based index that supports higher
# cardinality datasets.
# index-version = "inmem"

# Trace logging provides more verbose output around the tsm engine. Turning
# this on can provide more useful output for debugging tsm engine issues.
# trace-logging-enabled = false

# Whether queries should be logged before execution. Very useful for troubleshooting, but will
# log any sensitive data contained within a query.
# query-log-enabled = true

# Settings for the TSM engine

# CacheMaxMemorySize is the maximum size a shard's cache can
# reach before it starts rejecting writes.
# Valid size suffixes are k, m, or g (case insensitive, 1024 = 1k).
# Values without a size suffix are in bytes.
#
# 4294967296 (bytes) = 4 GB; the maximum amount of cached data, which is buffered before being written
cache-max-memory-size = "8g"

# CacheSnapshotMemorySize is the size at which the engine will
# snapshot the cache and write it to a TSM file, freeing up memory
# Valid size suffixes are k, m, or g (case insensitive, 1024 = 1k).
# Values without a size suffix are in bytes.
# cache-snapshot-memory-size = "25m"

# CacheSnapshotWriteColdDuration is the length of time at
# which the engine will snapshot the cache and write it to
# a new TSM file if the shard hasn't received writes or deletes
# cache-snapshot-write-cold-duration = "10m"

# CompactFullWriteColdDuration is the duration at which the engine
# will compact all TSM files in a shard if it hasn't received a
# write or delete
# compact-full-write-cold-duration = "4h"

# The maximum number of concurrent full and level compactions that can run at one time. A
# value of 0 results in 50% of runtime.GOMAXPROCS(0) used at runtime. Any number greater
# than 0 limits compactions to that value. This setting does not apply
# to cache snapshotting.
# max-concurrent-compactions = 0

# The threshold, in bytes, when an index write-ahead log file will compact
# into an index file. Lower sizes will cause log files to be compacted more
# quickly and result in lower heap usage at the expense of write throughput.
# Higher sizes will be compacted less frequently, store more series in-memory,
# and provide higher write throughput.
# Valid size suffixes are k, m, or g (case insensitive, 1024 = 1k).
# Values without a size suffix are in bytes.
# max-index-log-file-size = "1m"

# The maximum series allowed per database before writes are dropped. This limit can prevent
# high cardinality issues at the database level. This limit can be disabled by setting it to
# 0.
max-series-per-database = 0

# The maximum number of tag values per tag that are allowed before writes are dropped. This limit
# can prevent high cardinality tag values from being written to a measurement. This limit can be
# disabled by setting it to 0.
max-values-per-tag = 0

# If true, then the mmap advise value MADV_WILLNEED will be provided to the kernel with respect to
# TSM files. This setting has been found to be problematic on some kernels, and defaults to off.
# It might help users who have slow disks in some cases.
# tsm-use-madv-willneed = false

###
### [coordinator]
###
### Controls the clustering service configuration.
###

[coordinator]
# The default time a write request will wait until a "timeout" error is returned to the caller.
write-timeout = "10s"

# The maximum number of concurrent queries allowed to be executing at one time. If a query is
# executed and exceeds this limit, an error is returned to the caller. This limit can be disabled
# by setting it to 0.
#
# max-concurrent-queries caps how many queries may execute at once; zero means no limit.
# If more queries than this are submitted, they fail with:
# ERR: max concurrent queries reached
#
max-concurrent-queries = 0

# The maximum time a query is allowed to execute before being killed by the system. This limit
# can help prevent run away queries. Setting the value to 0 disables the limit.
#
# query-timeout sets a per-query execution time limit; if a query runs longer, InfluxDB kills
# it and reports: ERR: query timeout reached
# If you use continuous queries, it is best not to set query-timeout: as data volume grows,
# continuous queries need longer to produce their results, and a timeout would make them fail.
query-timeout = "0"

# The time threshold when a query will be logged as a slow query. This limit can be set to help
# discover slow or resource intensive queries. Setting the value to 0 disables the slow query logging.
#
# log-queries-after sets how long a statement must run before it is logged as a slow query;
# 0 disables slow-query logging. For example, with "1s", any statement that runs longer than
# one second is flagged as a slow query and recorded in the log.
#
log-queries-after = "10s"

# The maximum number of points a SELECT can process. A value of 0 will make
# the maximum point count unlimited. This will only be checked every second so queries will not
# be aborted immediately when hitting the limit.
#
# Set this only when the number of points per query is predictable.
# max-select-point caps how much data a single query may return; InfluxDB treats each record
# as a point, hence "maximum points per query". 0 means no limit. If a query exceeds the
# limit, InfluxDB kills it and reports:
# ERR: max number of points reached
#
max-select-point = 0

# The maximum number of series a SELECT can run. A value of 0 will make the maximum series
# count unlimited.
#
# max-select-series caps how many series a single statement may process. If a statement
# touches more series than this, InfluxDB refuses to execute it and reports:
# ERR: max select series count exceeded: <query_series_count> series
#
max-select-series = 0

# The maximum number of group-by-time buckets a SELECT can create. A value of zero makes
# the number of buckets unlimited.
max-select-buckets = 0

###
### [retention]
###
### Controls the enforcement of retention policies for evicting old data.
###

[retention]
# Determines whether retention policy enforcement is enabled.
# enabled = true

# The interval of time when retention policy enforcement checks run.
# check-interval = "30m"

###
### [shard-precreation]
###
### Controls the precreation of shards, so they are available before data arrives.
### Only shards that, after creation, will have both a start- and end-time in the
### future, will ever be created. Shards are never precreated that would be wholly
### or partially in the past.

[shard-precreation]
# Determines whether shard pre-creation service is enabled.
# enabled = true

# The interval of time when the check to pre-create new shards runs.
# check-interval = "10m"

# The default period ahead of the endtime of a shard group that its successor
# group is created.
advance-period = "10m"

###
### Controls the system self-monitoring, statistics and diagnostics.
###
### The internal database for monitoring data is created automatically if
### it does not already exist. The target retention within this database
### is called 'monitor' and is also created with a retention period of 7 days
### and a replication factor of 1, if it does not exist. In all cases
### this retention policy is configured as the default for the database.

[monitor]
# Whether to record statistics internally.
# store-enabled = true

# The destination database for recorded statistics
# store-database = "_internal"

# The interval at which to record statistics
# store-interval = "10s"

###
### [http]
###
### Controls how the HTTP endpoints are configured. These are the primary
### mechanism for getting data into and out of InfluxDB.
###

[http]
# Determines whether HTTP endpoint is enabled.
# enabled = true

# The bind address used by the HTTP service.
bind-address = ":8086"

# Determines whether user authentication is enabled over HTTP/HTTPS.
#auth-enabled = true

# The default realm sent back when issuing a basic auth challenge.
# realm = "InfluxDB"

# Determines whether HTTP request logging is enabled.
# Defaults to true and logs every HTTP request; disabling it is recommended, otherwise the
# log file grows roughly 1:1 with the volume of inserted data.
#
log-enabled = false

# Determines whether the HTTP write request logs should be suppressed when the log is enabled.
# suppress-write-log = false

# When HTTP request logging is enabled, this option specifies the path where
# log entries should be written. If unspecified, the default is to write to stderr, which
# intermingles HTTP logs with internal InfluxDB logging.
#
# If influxd is unable to access the specified path, it will log an error and fall back to writing
# the request log to stderr.
# access-log-path = ""

# Determines whether detailed write logging is enabled.
# write-tracing = false

# Determines whether the pprof endpoint is enabled. This endpoint is used for
# troubleshooting and monitoring.
# pprof-enabled = true

# Enables a pprof endpoint that binds to localhost:6060 immediately on startup.
# This is only needed to debug startup issues.
# debug-pprof-enabled = false

# Determines whether HTTPS is enabled.
# https-enabled = false

# The SSL certificate to use when HTTPS is enabled.
# https-certificate = "/etc/ssl/influxdb.pem"

# Use a separate private key location.
# https-private-key = ""

# The JWT auth shared secret to validate requests using JSON web tokens.
# shared-secret = ""

# The default chunk size for result sets that should be chunked.
# Maximum number of records a query page displays
#
max-row-limit = 10000

# The maximum number of HTTP connections that may be open at once. New connections that
# would exceed this limit are dropped. Setting this value to 0 disables the limit.
# max-connection-limit = 0

# Enable http service over unix domain socket
# unix-socket-enabled = false

# The path of the unix domain socket.
# bind-socket = "/var/run/influxdb.sock"

# The maximum size of a client request body, in bytes. Setting this value to 0 disables the limit.
# max-body-size = 25000000

# The maximum number of writes processed concurrently.
# Setting this to 0 disables the limit.
# max-concurrent-write-limit = 0

# The maximum number of writes queued for processing.
# Setting this to 0 disables the limit.
# max-enqueued-write-limit = 0

# The maximum duration for a write to wait in the queue to be processed.
# Setting this to 0 or setting max-concurrent-write-limit to 0 disables the limit.
# enqueued-write-timeout = 0


###
### [ifql]
###
### Configures the ifql RPC API.
###

[ifql]
# Determines whether the RPC service is enabled.
# enabled = true

# Determines whether additional logging is enabled.
# log-enabled = true

# The bind address used by the ifql RPC service.
# bind-address = ":8082"


###
### [logging]
###
### Controls how the logger emits logs to the output.
###

[logging]
# Determines which log encoder to use for logs. Available options
# are auto, logfmt, and json. auto will use a more user-friendly
# output format if the output terminal is a TTY, but the format is not as
# easily machine-readable. When the output is a non-TTY, auto will use
# logfmt.
# format = "auto"

# Determines which level of logs will be emitted. The available levels
# are error, warn, info, and debug. Logs that are equal to or above the
# specified level will be emitted.
# level = "info"

# Suppresses the logo output that is printed when the program is started.
# The logo is always suppressed if STDOUT is not a TTY.
# suppress-logo = false

###
### [subscriber]
###
### Controls the subscriptions, which can be used to fork a copy of all data
### received by the InfluxDB host.
###

[subscriber]
# Determines whether the subscriber service is enabled.
# enabled = true

# The default timeout for HTTP writes to subscribers.
# http-timeout = "30s"

# Allows insecure HTTPS connections to subscribers. This is useful when testing with self-
# signed certificates.
# insecure-skip-verify = false

# The path to the PEM encoded CA certs file. If the empty string, the default system certs will be used
# ca-certs = ""

# The number of writer goroutines processing the write channel.
# write-concurrency = 40

# The number of in-flight writes buffered in the write channel.
# write-buffer-size = 1000


###
### [[graphite]]
###
### Controls one or many listeners for Graphite data.
###

[[graphite]]
# Determines whether the graphite endpoint is enabled.
# enabled = false
# database = "graphite"
# retention-policy = ""
# bind-address = ":2003"
# protocol = "tcp"
# consistency-level = "one"

# These next lines control how batching works. You should have this enabled
# otherwise you could get dropped metrics or poor performance. Batching
# will buffer points in memory if you have many coming in.

# Flush if this many points get buffered
# batch-size = 5000

# number of batches that may be pending in memory
# batch-pending = 10

# Flush at least this often even if we haven't hit buffer limit
# batch-timeout = "1s"

# UDP Read buffer size, 0 means OS default. UDP listener will fail if set above OS max.
# udp-read-buffer = 0

### This string joins multiple matching 'measurement' values providing more control over the final measurement name.
# separator = "."

### Default tags that will be added to all metrics. These can be overridden at the template level
### or by tags extracted from metric
# tags = ["region=us-east", "zone=1c"]

### Each template line requires a template pattern. It can have an optional
### filter before the template and separated by spaces. It can also have optional extra
### tags following the template. Multiple tags should be separated by commas and no spaces
### similar to the line protocol format. There can be only one default template.
# templates = [
# "*.app env.service.resource.measurement",
# # Default template
# "server.*",
# ]

###
### [collectd]
###
### Controls one or many listeners for collectd data.
###

[[collectd]]
# enabled = false
# bind-address = ":25826"
# database = "collectd"
# retention-policy = ""
#
# The collectd service supports either scanning a directory for multiple types
# db files, or specifying a single db file.
# typesdb = "/usr/local/share/collectd"
#
# security-level = "none"
# auth-file = "/etc/collectd/auth_file"

# These next lines control how batching works. You should have this enabled
# otherwise you could get dropped metrics or poor performance. Batching
# will buffer points in memory if you have many coming in.

# Flush if this many points get buffered
# batch-size = 5000

# Number of batches that may be pending in memory
# batch-pending = 10

# Flush at least this often even if we haven't hit buffer limit
# batch-timeout = "10s"

# UDP Read buffer size, 0 means OS default. UDP listener will fail if set above OS max.
# read-buffer = 0

# Multi-value plugins can be handled two ways.
# "split" will parse and store the multi-value plugin data into separate measurements
# "join" will parse and store the multi-value plugin as a single multi-value measurement.
# "split" is the default behavior for backward compatability with previous versions of influxdb.
# parse-multivalue-plugin = "split"
###
### [opentsdb]
###
### Controls one or many listeners for OpenTSDB data.
###

[[opentsdb]]
# enabled = false
# bind-address = ":4242"
# database = "opentsdb"
# retention-policy = ""
# consistency-level = "one"
# tls-enabled = false
# certificate= "/etc/ssl/influxdb.pem"

# Log an error for every malformed point.
# log-point-errors = true

# These next lines control how batching works. You should have this enabled
# otherwise you could get dropped metrics or poor performance. Only points
# metrics received over the telnet protocol undergo batching.

# Flush if this many points get buffered
# batch-size = 1000

# Number of batches that may be pending in memory
# batch-pending = 5

# Flush at least this often even if we haven't hit buffer limit
# batch-timeout = "1s"

###
### [[udp]]
###
### Controls the listeners for InfluxDB line protocol data via UDP.
###

[[udp]]
# enabled = false
# bind-address = ":8089"
# database = "udp"
# retention-policy = ""

# InfluxDB precision for timestamps on received points ("" or "n", "u", "ms", "s", "m", "h")
# precision = ""

# These next lines control how batching works. You should have this enabled
# otherwise you could get dropped metrics or poor performance. Batching
# will buffer points in memory if you have many coming in.

# Flush if this many points get buffered
# batch-size = 5000

# Number of batches that may be pending in memory
# batch-pending = 10

# Will flush at least this often even if we haven't hit buffer limit
# batch-timeout = "1s"

# UDP Read buffer size, 0 means OS default. UDP listener will fail if set above OS max.
# read-buffer = 0

###
### [continuous_queries]
###
### Controls how continuous queries are run within InfluxDB.
### Continuous queries

[continuous_queries]
# Determines whether the continuous query service is enabled.
# Enable continuous queries
#
enabled = true

# Controls whether queries are logged when executed by the CQ service.
# Enable continuous-query logging; it helps surface problems
#
log-enabled = true

# Controls whether queries are logged to the self-monitoring data store.
# query-stats-enabled = false

# interval for how often continuous queries will be checked if they need to run
# run-interval = "1s"

###
### [tls]
###
### Global configuration settings for TLS in InfluxDB.
###

[tls]
# Determines the available set of cipher suites. See https://golang.org/pkg/crypto/tls/#pkg-constants
# for a list of available ciphers, which depends on the version of Go (use the query
# SHOW DIAGNOSTICS to see the version of Go used to build InfluxDB). If not specified, uses
# the default settings from Go's crypto/tls package.
# ciphers = [
# "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
# "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
# ]

# Minimum version of the tls protocol that will be negotiated. If not specified, uses the
# default settings from Go's crypto/tls package.
# min-version = "tls1.2"

# Maximum version of the tls protocol that will be negotiated. If not specified, uses the
# default settings from Go's crypto/tls package.
# max-version = "tls1.2"

 

Once the container is up, create a database named prometheus:

docker exec -it influxdb bash  
influx
create database prometheus
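
Alternatively, the database can be created through InfluxDB's standard 1.x HTTP API without entering the container:

curl -XPOST "http://192.168.1.206:8086/query" --data-urlencode "q=CREATE DATABASE prometheus"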

Prometheus cluster

Install Prometheus with Docker on both A and B.

See https://www.cnblogs.com/xiaoyou2018/p/14037006.html

A: http://192.168.1.190:9090

B: http://192.168.1.206:9090
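
For reference, a minimal docker run for Prometheus (a sketch only; the linked article has the full steps, and the host config path /data/prometheus/prometheus.yml is an assumption):

docker run -d --name prometheus -p 9090:9090 \
  -v /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus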

Install Nginx on B and use it to proxy the two Prometheus instances:

[root@kibana vhost]# cat prometheus.conf 
# backend pool of the two Prometheus servers
upstream prom.test.com {
        server 192.168.1.190:9090  max_fails=0 fail_timeout=0s weight=3;
        server 192.168.1.206:9090  max_fails=0 fail_timeout=0s weight=3;
        keepalive 300;
        }

    server {
        listen       80;
        server_name  prom.test.com;
        access_log /var/log/nginx/prom.test.com.access.log;
        error_log  /var/log/nginx/prom.test.com.error.log;
        # Load configuration files for the default server block.
        #include /etc/nginx/default.d/*.conf;

        location / {
                # proxy_pass must name the upstream block defined above
                proxy_pass http://prom.test.com;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header x-forwarded-for $proxy_add_x_forwarded_for;
                proxy_redirect default;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
        }

        error_page 404 /404.html;
            location = /40x.html {
        }

        error_page 500 502 503 504 /50x.html;
            location = /50x.html {
        }
    }
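
Check the config and reload Nginx (standard commands, assuming Nginx runs under systemd):

nginx -t && systemctl reload nginx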

Once DNS (or a hosts entry) points prom.test.com at server B,

visit http://prom.test.com

 

Hook the Prometheus instances on A and B up to InfluxDB

A: read and write

B: read only

Install the remote_storage_adapter component

Link: https://pan.baidu.com/s/1c0rWQhRg9QZpDb4eadkeOg
Extraction code: cu6n
Place the binary in /data/prometheus

Run it on both A and B:

nohup /data/prometheus/remote_storage_adapter --influxdb-url=http://192.168.1.206:8086 --influxdb.username=prom --influxdb.database=prometheus --influxdb.retention-policy=autogen &
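
One note on authentication: the stock remote_storage_adapter from the Prometheus repository reads the InfluxDB password from the INFLUXDB_PW environment variable rather than a flag; if your build behaves the same, export it before launching (placeholder value shown):

export INFLUXDB_PW='xxx'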

Modify prometheus.yml on A and B

Append to A's prometheus.yml:

remote_write:
  - url: "http://192.168.1.206:8086/api/v1/prom/write?db=prometheus&u=prom&p=xxx"

remote_read:
  - url: "http://192.168.1.206:8086/api/v1/prom/read?db=prometheus&u=prom&p=xxx"

Append to B's prometheus.yml:

remote_read:
  - url: "http://192.168.1.206:8086/api/v1/prom/read?db=prometheus&u=prom&p=TTdjy911.500"

Wait a moment, then check whether InfluxDB is receiving data:

[root@localhost alertmanager-0.21.0.linux-amd64]# docker exec -it influxdb bash 
root@influxdb:/# influx
Connected to http://localhost:8086 version 1.8.3
InfluxDB shell version: 1.8.3
> use prometheus
Using database prometheus
> SHOW MEASUREMENTS

If SHOW MEASUREMENTS lists metric names, remote write is working.
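
To spot-check actual samples, query a measurement directly; the metric up is a safe choice because every Prometheus scrape produces it:

> SELECT * FROM "up" LIMIT 5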

 

"InfluxDBStudio", a GUI tool for managing InfluxDB

Link: https://pan.baidu.com/s/1c0rWQhRg9QZpDb4eadkeOg
Extraction code: cu6n

Setting a data retention period in InfluxDB:

> use prometheus
Using database prometheus
> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 2160h0m0s 168h0m0s           1        true

A retention policy describes how long data is kept in InfluxDB, how many copies of it are stored (the open-source edition keeps only one replica), and how much time each shard covers. Every InfluxDB database has its own retention policy. A few underlying concepts:

DURATION: how long the retention policy keeps data. For time-series data this mechanism is very useful.
SHARD: the unit that actually stores InfluxDB data. Each shard holds one time slice of data, 7 days by default; to store a year of data, InfluxDB packs each consecutive 7-day window into one shard and uses many shards in total.
SHARD DURATION: how large the time slice held by each shard is, 7 days by default. Note that when data ages out of the retention policy, InfluxDB does not delete it point by point; it drops entire shard groups.
SHARD GROUP: as the name suggests, a shard group contains multiple shards. In the open-source edition this distinction hardly matters; you can think of a shard group as containing exactly one shard. In the enterprise multi-node cluster edition, a shard group can span shards on different nodes, which lets InfluxDB store more data.
SHARD REPLICATION: how many replicas each shard has. The open-source edition supports only a single replica; the enterprise edition can store shards redundantly to avoid single points of failure.

By default, data is kept forever.

To change the retention period of a retention policy, use an ALTER RETENTION POLICY statement:

ALTER RETENTION POLICY "autogen" ON "prometheus" DURATION 30d REPLICATION 1 SHARD DURATION 7d DEFAULT
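
Running the earlier command again should now report a duration of 720h0m0s (30 days):

> SHOW RETENTION POLICIES ON prometheus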

 

Alertmanager cluster

Alertmanager uses a Gossip mechanism to exchange information between Alertmanager instances. It ensures that even when multiple Alertmanagers each receive the same alert, only one notification is sent to the Receiver.

The Gossip protocol

Gossip is a protocol widely used in distributed systems to exchange information and synchronize state between nodes. Gossip state synchronization spreads much like a rumor or a virus:

(Figure: Gossip as a distributed protocol)

Gossip is generally implemented in one of two ways: push-based or pull-based. In a push-based implementation, when node A in the cluster completes a piece of work, it picks another node B at random and sends it the corresponding message; B repeats the same work on receipt, and so on until the message has reached every node in the cluster. In a pull-based implementation, node A randomly asks node B whether it has new state to synchronize, and B returns it if so. With that background, here is how Alertmanager builds a highly available cluster on top of Gossip. When an Alertmanager receives an alert from Prometheus, it processes it through the following pipeline:

(Figure: the notification pipeline)

  • In the first stage, Silence, Alertmanager checks whether the notification matches any silence rule. If not, it moves on to the next stage; otherwise the pipeline stops and no notification is sent.
  • In the second stage, Wait, the Alertmanager waits for index * 5s, where index is this Alertmanager's position in the cluster ordering.
  • Once its wait is over, the Dedup stage checks the Alertmanager's own database to see whether this notification has already been sent. If it has, the pipeline stops and no alert goes out; otherwise processing continues to the Send stage, which delivers the notification.
  • After the notification is sent, the Alertmanager enters the final stage, Gossip, which informs the other Alertmanager instances that this alert has been delivered. On receiving the Gossip message, each instance records in its own database that the notification was sent.

As shown below, the Gossip mechanism hinges on two key points:

(Figure: the Gossip mechanism)

  • Silence synchronization: at startup, an Alertmanager pulls the current silence state from the other cluster nodes (pull-based); when a new silence is created, it is propagated through the cluster as a push-based Gossip message.
  • Notification-state synchronization: after an alert notification is delivered, its send state is synchronized push-based. The Wait stage ensures the cluster state stays consistent.

Although Alertmanager's Gossip-based clustering cannot guarantee that every instance holds identical data at every instant, it implements an AP system in CAP terms: availability with partition tolerance. It also keeps the Prometheus side simple to configure, since Prometheus servers need no state synchronization among themselves.

Download the Alertmanager release:

https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz

Install on both A and B (create the target directory first, since tar -C requires it to exist):

mkdir -p /data/alertmanager
tar zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /data/alertmanager/

Server A

vi /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager --web.listen-address=":9093" --cluster.listen-address="192.168.1.190:9094" --cluster.peer=192.168.1.206:9094 --config.file=/data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager.yml

[Install]
WantedBy=multi-user.target

Start Alertmanager:

systemctl daemon-reload
systemctl start alertmanager
systemctl status alertmanager

If the systemd service fails to start, run the binary directly instead:

nohup /data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager --web.listen-address=":9093" --cluster.listen-address="192.168.1.190:9094"  --cluster.peer=192.168.1.206:9094  --config.file=/data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager.yml &

 

Server B

vi /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager --web.listen-address=":9093" --cluster.listen-address="192.168.1.206:9094"  --cluster.peer=192.168.1.190:9094  --config.file=/data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager.yml

[Install]
WantedBy=multi-user.target
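
Reload systemd and start the service, exactly as on A:

systemctl daemon-reload
systemctl start alertmanager
systemctl status alertmanager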

If it fails to start, run directly:

nohup /data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager --web.listen-address=":9093" --cluster.listen-address="192.168.1.206:9094"  --cluster.peer=192.168.1.190:9094  --config.file=/data/alertmanager/alertmanager-0.21.0.linux-amd64/alertmanager.yml &

 

Modify the Alertmanager configuration section of prometheus.yml on both A and B so that each Prometheus targets both Alertmanager instances, as sketched below.

(Figure: prometheus.yml alerting configuration)
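
A minimal sketch of that section, assuming default ports and plain static_configs targets (the exact file contents here are an assumption):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.1.190:9093
            - 192.168.1.206:9093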

Verification:

Stop node_exporter on one of the servers.

Open http://192.168.1.190:9093 and http://192.168.1.206:9093.

Both show the received alert, but DingTalk receives only a single notification.
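
To confirm the two Alertmanagers actually formed a cluster, query the v2 status API on either node; the response should list both peers under the cluster section:

curl -s http://192.168.1.190:9093/api/v2/status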

 

