Installing and using repmgr
node1: 192.168.5.132
node2: 192.168.5.133
I. Common installation (run on both nodes):
1. Install repmgr:
Install the repository definition for your distribution and PostgreSQL version (the number in the URL is the PostgreSQL major version, 12 in this guide):
curl https://dl.2ndquadrant.com/default/release/get/12/rpm | sudo bash
sudo yum repolist
Install:
sudo yum -y install repmgr12
2. Install PostgreSQL 12:
yum localinstall postgresql12-*
II. Primary setup:
1. Initialize the database:
cd /usr/pgsql-12/
initdb -D data
2. Edit the database configuration files:
[postgres@node1 data]$ vim postgresql.conf
listen_addresses = '*'
shared_preload_libraries = 'repmgr'
wal_log_hints = on
Optionally, to enable synchronous replication, add:
synchronous_standby_names = 'ANY 1(*)'
[postgres@node1 data]$ vim pg_hba.conf
host all all 192.168.5.132/32 trust
host all all 192.168.5.133/32 trust
host replication all 192.168.5.132/32 trust
host replication all 192.168.5.133/32 trust
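Start the database so that the parameters above take effect. Note: no password can be typed in while the standby is being cloned, so this guide uses trust authentication.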
pg_ctl -D ./ start
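Before starting the server, it can help to confirm that the edits above are really present and not commented out. The following is a minimal sketch, not part of the original procedure: check_pg_settings is a hypothetical helper name, and it only greps the file, so it does not catch a later duplicate line overriding an earlier one.

```shell
# Hypothetical helper (not part of repmgr): confirm that postgresql.conf
# contains uncommented lines for the settings this guide relies on.
# Usage: check_pg_settings /usr/pgsql-12/data/postgresql.conf
check_pg_settings() {
    local conf="$1" missing=0 pattern
    for pattern in \
        "^[[:space:]]*listen_addresses[[:space:]]*=" \
        "^[[:space:]]*shared_preload_libraries[[:space:]]*=.*repmgr" \
        "^[[:space:]]*wal_log_hints[[:space:]]*=[[:space:]]*on"
    do
        # grep -E: extended regex, -q: quiet, only the exit status matters
        if ! grep -Eq "$pattern" "$conf"; then
            echo "missing or commented out: $pattern" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The function returns 0 when all three settings are found and 1 otherwise, so it can be chained as check_pg_settings postgresql.conf && pg_ctl -D ./ start.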
3. Edit the repmgr configuration file:
vim /etc/repmgr/12/repmgr.conf
node_id=1
node_name=node1
conninfo='host=192.168.5.132 port=5432 user=postgres dbname=postgres'
data_directory='/usr/pgsql-12/data'
4. Register the primary node:
[postgres@node1 data]$ repmgr primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
[postgres@node1 data]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=192.168.5.132 port=5432 user=postgres dbname=postgres
III. Standby setup:
1. Edit the repmgr configuration file:
vim /etc/repmgr/12/repmgr.conf
node_id=2
node_name=node2
conninfo='host=192.168.5.133 port=5432 user=postgres dbname=postgres'
data_directory='/usr/pgsql-12/data'
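The primary and standby configuration files differ only in node_id, node_name and the host in conninfo. A small sketch of generating them from those three values follows; gen_repmgr_conf is a hypothetical helper name, and the port, user, database and data directory are simply the values used in this guide.

```shell
# Hypothetical helper: print a minimal repmgr.conf for one node.
# Arguments: node id, node name, node IP address.
# Port, user, dbname and data_directory are the values used in this guide.
gen_repmgr_conf() {
    local id="$1" name="$2" ip="$3"
    cat <<EOF
node_id=${id}
node_name=${name}
conninfo='host=${ip} port=5432 user=postgres dbname=postgres'
data_directory='/usr/pgsql-12/data'
EOF
}
```

For example, gen_repmgr_conf 2 node2 192.168.5.133 prints the four lines shown above for the standby; redirect the output to /etc/repmgr/12/repmgr.conf once it looks right.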
2. Clone the standby:
[postgres@localhost pgsql-12]$ repmgr standby clone -h 192.168.5.132 -U postgres
NOTICE: destination directory "/usr/pgsql-12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.5.132 user=postgres
DETAIL: current installation size is 23 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
INFO: creating directory "/usr/pgsql-12/data"...
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing: pg_basebackup -l "repmgr base backup" -D /usr/pgsql-12/data -h 192.168.5.132 -p 5432 -U postgres -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /usr/pgsql-12/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"
3. Start and register the standby:
[postgres@localhost pgsql-12]$ pg_ctl -D /usr/pgsql-12/data start
waiting for server to start....2020-03-08 10:43:32.861 CST [74044] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
2020-03-08 10:43:32.935 CST [74044] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-03-08 10:43:32.968 CST [74044] LOG: listening on IPv6 address "::", port 5432
2020-03-08 10:43:32.973 CST [74044] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-08 10:43:33.004 CST [74044] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-03-08 10:43:33.072 CST [74044] LOG: redirecting log output to logging collector process
2020-03-08 10:43:33.072 CST [74044] HINT: Future log output will appear in directory "log".
 done
server started
[postgres@localhost pgsql-12]$ repmgr standby register
INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered
[postgres@localhost pgsql-12]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.5.132 port=5432 user=postgres dbname=postgres
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.5.133 port=5432 user=postgres dbname=postgres
IV. Start the repmgrd service (run on both primary and standby)
1. Check the service status:
[postgres@localhost pgsql-12]$ repmgr service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+-----+---------+--------------------
1 | node1 | primary | * running | | not running | n/a | n/a | n/a
2 | node2 | standby | running | node1 | not running | n/a | n/a | n/a
2. Add configuration to enable automatic failover:
vim /etc/repmgr/12/repmgr.conf
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote'
follow_command='/usr/pgsql-12/bin/repmgr standby follow'
The failover parameter accepts two values:
automatic: enable automatic failover
manual: disable automatic failover
3. Start the repmgrd service:
[postgres@localhost pgsql-12]$ repmgrd -d
[2020-03-07 18:48:58] [NOTICE] repmgrd (repmgrd 5.1.0) starting up
[2020-03-07 18:48:58] [INFO] connecting to database "host=192.168.5.133 port=5432 user=postgres dbname=postgres"
[postgres@localhost pgsql-12]$ INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2020-03-07 18:48:58] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2020-03-07 18:48:58] [INFO] "connection_check_type" set to "ping"
[2020-03-07 18:48:58] [INFO] monitoring connection to upstream node "node1" (ID: 1)
4. Check the service status again:
[postgres@node1 data]$ repmgr service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 16778 | no | n/a
2 | node2 | standby | running | node1 | running | 74238 | no | 1 second(s) ago
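Since repmgrd has to be running on every node for automatic failover to work, the status table above is worth checking mechanically. The sketch below parses the table text as it appears in this guide; nodes_without_repmgrd is a hypothetical helper name, and the column positions are an assumption based on the output shown here (the repmgrd state is the sixth "|"-separated column).

```shell
# Hypothetical helper: read "repmgr service status" output on stdin and
# print the name of every node whose repmgrd column is not "running".
# Assumes the column layout shown in this guide (repmgrd is column 6).
nodes_without_repmgrd() {
    awk -F'|' '
        # Only data rows start with a numeric node ID; this skips the
        # header row and the dashed separator row.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ && NF >= 6 {
            name = $2; state = $6
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (state != "running") print name
        }
    '
}
```

Typical use would be repmgr service status | nodes_without_repmgrd, which prints nothing when repmgrd is running everywhere.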
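V. Logging:
By default, repmgr and repmgrd write log output to STDERR.
An alternative log destination (a file or syslog) can be specified.
Note:
Even when another log destination is configured, the repmgr command-line application itself keeps writing its log output to STDERR; otherwise, any output produced by a command-line operation would "disappear" into the log.
Add the log file path to the configuration file: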
[postgres@node1 repmgr]$ vim /etc/repmgr/12/repmgr.conf
log_file='/var/log/repmgr/repmgrd.log'
Add a logrotate configuration file:
[root@node1 ~]# vim /etc/logrotate.d/repmgr
/var/log/repmgr/repmgrd.log {
    missingok
    compress
    rotate 52
    maxsize 100M
    weekly
    create 0600 postgres postgres
    postrotate
        /usr/bin/killall -HUP repmgrd
    endscript
}
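VI. Add a witness node
Registering the witness against the standby (192.168.5.133) fails, because a witness must run on its own independent primary server: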
[postgres@localhost pgsql-12]$ repmgr witness register -h 192.168.5.133
INFO: connecting to witness node "node2" (ID: 2)
ERROR: provided node is a standby
HINT: a witness node must run on an independent primary server
VII. Command usage
[postgres@localhost pgsql-12]$ repmgr --help
repmgr: replication management tool for PostgreSQL
Usage:
repmgr [OPTIONS] primary {register|unregister}
repmgr [OPTIONS] standby {register|unregister|clone|promote|follow|switchover}
repmgr [OPTIONS] node {status|check|rejoin|service}
repmgr [OPTIONS] cluster {show|event|matrix|crosscheck|cleanup}
repmgr [OPTIONS] witness {register|unregister}
repmgr [OPTIONS] service {status|pause|unpause}
repmgr [OPTIONS] daemon {start|stop}
1) View node status and information
[postgres@localhost pgsql-12]$ repmgr node status
Node "node2":
PostgreSQL version: 12.4
Total data size: 23 MB
Conninfo: host=192.168.5.133 port=5432 user=postgres dbname=postgres
Role: standby
WAL archiving: off
Archive command: (none)
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: node1 (ID: 1)
Replication lag: 0 seconds
Last received LSN: 0/5000BF0
Last replayed LSN: 0/5000BF0
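The "Replication lag" line in this output is a convenient value to monitor. A minimal sketch of extracting it follows; replication_lag_seconds and lag_within are hypothetical helper names, and the parsing assumes the exact "Replication lag: N seconds" wording shown above.

```shell
# Hypothetical helper: read "repmgr node status" output on stdin and print
# the replication lag in seconds (prints nothing if no such line is found).
replication_lag_seconds() {
    sed -n 's/^[[:space:]]*Replication lag:[[:space:]]*\([0-9][0-9]*\) seconds.*/\1/p'
}

# Hypothetical check: succeed only if a lag value was found on stdin and
# it is at most $1 seconds.
lag_within() {
    local limit="$1" lag
    lag=$(replication_lag_seconds)
    [ -n "$lag" ] && [ "$lag" -le "$limit" ]
}
```

On the standby, repmgr node status | lag_within 30 would then fail when the lag exceeds 30 seconds or when the line is missing.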
2) View cluster status
[postgres@node1 repmgr]$ repmgr daemon status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 16778 | no | n/a
2 | node2 | standby | running | node1 | running | 74238 | no | 1 second(s) ago
3) Stop repmgrd
Add the following commands to repmgr.conf:
repmgrd_service_start_command ='service repmgr-12 start'
repmgrd_service_stop_command ='service repmgr-12 stop'
Run the stop command:
[postgres@localhost pgsql-12]$ repmgr daemon stop
NOTICE: executing: "service repmgr-12 stop"
ERROR: repmgrd does not appear to have stopped after 15 seconds
HINT: use "repmgr service status" to confirm that repmgrd was successfully started
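-- However, "service repmgr-12 stop" did not stop repmgrd, probably because repmgrd was started directly with "repmgrd -d" rather than through the service.
Try a different stop command in the configuration: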
repmgrd_service_stop_command ='repmgr node service --list-actions --action=stop'
Run it again:
[postgres@localhost pgsql-12]$ repmgr daemon stop
NOTICE: executing: "repmgr node service --list-actions --action=stop"
ERROR: repmgrd does not appear to have stopped after 15 seconds
HINT: use "repmgr service status" to confirm that repmgrd was successfully started
repmgrd still did not stop:
[postgres@node1 repmgr]$ repmgr service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 16778 | no | n/a
2 | node2 | standby | running | node1 | running | 74238 | no | 1 second(s) ago
For now, the workaround is to stop the database with pg_ctl and terminate the repmgrd process with kill.
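The kill workaround can be wrapped so that the PID does not have to be looked up by hand. The sketch below is one possible way to do it, assuming repmgrd writes its PID to /tmp/repmgrd.pid as reported in the startup log earlier; stop_repmgrd is a hypothetical helper name, and the 15-second wait mirrors the timeout seen in the error messages above.

```shell
# Hypothetical helper: stop repmgrd via its pidfile.
# Default pidfile is the path reported in the repmgrd startup log in this
# guide; pass another path as $1 if yours differs.
stop_repmgrd() {
    local pidfile="${1:-/tmp/repmgrd.pid}" pid i=0
    [ -r "$pidfile" ] || { echo "pidfile not found: $pidfile" >&2; return 1; }
    pid=$(cat "$pidfile")
    # Send SIGTERM; fail if the process does not exist.
    kill "$pid" 2>/dev/null || { echo "process $pid is not running" >&2; return 1; }
    # Poll for up to about 15 seconds until the process is gone.
    while [ "$i" -lt 15 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    echo "process $pid still running after 15 seconds" >&2
    return 1
}
```

This only stops repmgrd; the database itself is still stopped separately with pg_ctl.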
4) Pause cluster monitoring
[postgres@localhost pgsql-12]$ repmgr service pause
NOTICE: node 1 (node1) paused
NOTICE: node 2 (node2) paused
[postgres@localhost pgsql-12]$ repmgr daemon status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | standby | running | node2 | running | 27253 | yes | 1 second(s) ago
2 | node2 | primary | * running | | running | 86308 | yes | n/a
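VIII. Verification
1) What happens if the repmgr process is stopped?
repmgr is not a running process; it is an extension (with a command-line tool), so there is nothing to stop.
2) What happens if the repmgrd process is stopped?
Only the repmgrd process goes away. The database is not affected and the streaming replication cluster keeps working normally, although automatic failover is unavailable while repmgrd is down.
3) How to reload the repmgr.conf file?
Kill the process and start it again: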
[postgres@localhost pgsql-12]$ kill 74238
[postgres@localhost pgsql-12]$ [2020-03-07 19:44:15] [NOTICE] TERM signal received
[2020-03-07 19:44:15] [INFO] repmgrd terminating...
[postgres@localhost pgsql-12]$ repmgr daemon status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 16778 | no | n/a
2 | node2 | standby | running | node1 | not running | n/a | n/a | n/a
[postgres@localhost pgsql-12]$ repmgrd -d
[2020-03-07 19:44:48] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
4) How to simulate a failover?
Stop the primary database manually.
Run on the primary:
[postgres@node1 data]$ pg_ctl -D ./ stop
waiting for server to shut down.... done
server stopped
[postgres@node1 data]$ [2020-09-29 17:22:34] [WARNING] unable to ping "host=192.168.5.132 port=5432 user=postgres dbname=postgres"
[2020-09-29 17:22:34] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-09-29 17:22:34] [WARNING] connection to node "node1" (ID: 1) lost
[2020-09-29 17:22:34] [DETAIL]
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
[2020-09-29 17:22:34] [INFO] attempting to reconnect to node "node1" (ID: 1)
[2020-09-29 17:22:34] [ERROR] connection to database failed
[2020-09-29 17:22:34] [DETAIL]
could not connect to server: Connection refused
    Is the server running on host "192.168.5.132" and accepting
    TCP/IP connections on port 5432?
[2020-09-29 17:22:34] [DETAIL] attempted to connect using:
  user=postgres dbname=postgres host=192.168.5.132 port=5432 connect_timeout=2 fallback_application_name=repmgr
[2020-09-29 17:22:34] [WARNING] reconnection to node "node1" (ID: 1) failed
[2020-09-29 17:22:34] [WARNING] unable to connect to local node
[2020-09-29 17:22:34] [INFO] checking state of node 1, 1 of 6 attempts
[2020-09-29 17:22:34] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.5.132 port=5432 connect_timeout=2 fallback_application_name=repmgr"
These errors keep repeating.
Checking the status on the standby shows that the failover has taken place. The number of reconnection attempts and the interval between them can be configured (reconnect_attempts and reconnect_interval in repmgr.conf); the defaults are 6 attempts at 10-second intervals:
[postgres@localhost pgsql-12]$ repmgr daemon status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | - failed  | ?        | n/a     | n/a   | n/a     | n/a
 2  | node2 | primary | * running |          | running | 79388 | no      | n/a
Start the old primary again; the cluster now shows two primaries:
[postgres@localhost data]$ repmgr daemon status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | ! running |          | running | 16778 | no      | n/a
 2  | node2 | primary | * running |          | running | 79388 | no      | n/a
Now stop the old primary (node1) and attach it to the cluster again with node rejoin.
The following settings must be added to repmgr.conf so that repmgr can start the database; repmgrd does not need to be restarted:
service_start_command = '/usr/pgsql-12/bin/pg_ctl -D /usr/pgsql-12/data start'
service_stop_command = '/usr/pgsql-12/bin/pg_ctl -D /usr/pgsql-12/data stop'
service_restart_command = '/usr/pgsql-12/bin/pg_ctl -D /usr/pgsql-12/data restart'
service_reload_command = '/usr/pgsql-12/bin/pg_ctl -D /usr/pgsql-12/data reload'
-- First do a dry run of the rewind (-h is the address of the new primary, node2):
repmgr node rejoin -h 192.168.5.133 -U postgres -d postgres --force-rewind --dry-run --verbose
-- Then perform the actual rejoin:
repmgr node rejoin -h 192.168.5.133 -U postgres -d postgres --force-rewind --verbose
[postgres@node1 pgsql-12]$ repmgr node rejoin -h192.168.5.133 -Upostgres -dpostgres --force-rewind --verbose
INFO: checking for package configuration file "/etc/repmgr/12/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/12/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/usr/pgsql-12/data' --source-server='host=192.168.5.133 port=5432 user=postgres dbname=postgres'"
pg_rewind: servers diverged at WAL location 0/3009918 on timeline 1
pg_rewind: no rewind required
NOTICE: 0 files copied to /usr/pgsql-12/data
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.5.132 port=5432 user=postgres dbname=postgres"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql-12/bin/pg_ctl -D /usr/pgsql-12/data start"
INFO: node "node1" (ID: 1) is pingable
INFO: node "node1" (ID: 1) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[postgres@localhost pgsql-12]$ repmgr daemon status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | standby |   running | node2    | running | 27253 | no      | 0 second(s) ago
 2  | node2 | primary | * running |          | running | 86308 | no      | n/a
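The dry-run-then-rejoin sequence can be scripted so that the real rejoin only runs if the rehearsal succeeds. This is a sketch under stated assumptions: rejoin_with_rehearsal is a hypothetical wrapper name, the REPMGR variable exists only so the binary can be swapped out for testing, and the user and database are the ones used in this guide. PostgreSQL on the rejoining node must already be stopped.

```shell
# Hypothetical wrapper: rehearse "repmgr node rejoin" with --dry-run and
# perform the real rejoin only if the rehearsal succeeds.
# $1: address of the current primary (the node to attach to).
# REPMGR: optional override for the repmgr binary (an assumption of this
# sketch, useful for testing); defaults to "repmgr".
rejoin_with_rehearsal() {
    local primary_host="$1"
    local repmgr_bin="${REPMGR:-repmgr}"

    "$repmgr_bin" node rejoin -h "$primary_host" -U postgres -d postgres \
        --force-rewind --verbose --dry-run || {
        echo "dry run failed; not rejoining" >&2
        return 1
    }
    "$repmgr_bin" node rejoin -h "$primary_host" -U postgres -d postgres \
        --force-rewind --verbose
}
```

On node1 in the scenario above, this would be called as rejoin_with_rehearsal 192.168.5.133.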
5) After a standby goes offline and joins again, it may be left in an inactive state. How to handle this?
Check the cluster status:
[kingbase@node1 bin]$ repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node132 | primary | * running |          | default  | 100      | 1        | host=192.168.5.132 user=esrep dbname=esrep port=54329 connect_timeout=3 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node133 | standby | ! running | node132  | default  | 100      | 1        | host=192.168.5.133 user=esrep dbname=esrep port=54329 connect_timeout=3 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node133" (ID: 2) is running but the repmgr node record is inactive
Check the repmgrd log on the standby:
[2020-11-21 04:17:06] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2020-11-21 04:17:06] [INFO] connecting to database "host=192.168.5.133 user=esrep dbname=esrep port=54329 connect_timeout=3 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2020-11-21 04:17:06] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-11-21 04:17:06] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
[2020-11-21 04:17:06] [INFO] repmgrd terminating...
The log shows that the standby has to be registered again. Re-register it on the standby:
[kingbase@localhost kingbase]$ repmgr standby register -F
INFO: connecting to local node "node133" (ID: 2)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "node133" (ID: 2) successfully registered
Check the cluster status from the primary again; it is back to normal:
[kingbase@node1 bin]$ repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node132 | primary | * running |          | default  | 100      | 1        | host=192.168.5.132 user=esrep dbname=esrep port=54329 connect_timeout=3 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node133 | standby |   running | node132  | default  | 100      | 1        | host=192.168.5.133 user=esrep dbname=esrep port=54329 connect_timeout=3 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
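The inactive state is reported as a warning line under the cluster show table, so it can be picked out with a simple text filter. The sketch below assumes the exact warning wording shown in this section; inactive_nodes is a hypothetical helper name.

```shell
# Hypothetical helper: read "repmgr cluster show" output on stdin and print
# the name of every node reported as running with an inactive node record.
# Relies on the warning text shown in this guide.
inactive_nodes() {
    sed -n 's/.*node "\([^"]*\)" (ID: [0-9]*) is running but the repmgr node record is inactive.*/\1/p'
}
```

If repmgr cluster show 2>&1 | inactive_nodes prints a node name, run repmgr standby register -F on that node as described above.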
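6) If no witness node is configured and the cluster suffers a split brain, how to handle it?
For example, if the primary is isolated by a network failure, the standby concludes that the primary is offline and is promoted to primary (even with trust_server configured, nothing can be pinged once the whole primary node is cut off from the network; in production, if the network is poor, consider lengthening the timeouts). When the network recovers, the cluster has two primaries. In this situation, first back up the data of both primaries, then restore the cluster, deciding which of the two primaries should become the new primary.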
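A split brain of this kind shows up in the cluster show output as more than one row with the primary role in a running state. A minimal detection sketch follows; count_running_primaries is a hypothetical helper name, and it assumes the column layout shown in this guide (Role in the third column, Status in the fourth).

```shell
# Hypothetical helper: read "repmgr cluster show" output on stdin and print
# how many nodes are listed as a running primary.
# Assumes Role is column 3 and Status is column 4, as in this guide.
count_running_primaries() {
    awk -F'|' '
        # Data rows start with a numeric node ID; header and separator
        # rows are ignored.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ && $3 ~ /primary/ && $4 ~ /running/ { n++ }
        END { print n + 0 }
    '
}
```

A result greater than 1 from repmgr cluster show | count_running_primaries suggests that two nodes both consider themselves primary and the recovery steps in this section apply.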
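Suggested approach:
A) Use the node with the higher timeline as the new primary, so that the other primary only has to rejoin the cluster. Procedure: stop the node with the lower timeline, unregister its node ID on the new primary, bring the node back up (with PostgreSQL on it shut down cleanly, as node rejoin requires), and finally run the rejoin command:
On the new primary, run:
repmgr primary unregister --node-id n
Once the node is back up, rejoin it to the cluster:
repmgr node rejoin -h 192.168.5.132 -U postgres -d postgres
If the WAL has diverged, add --force-rewind to resolve it.
B) Use the node with the lower timeline as the new primary. The node with the higher timeline then has to be rebuilt, following the procedure for creating a new standby.
Finally, after restoring the cluster with either A or B, use the backed-up data directories to merge the divergent data into the new cluster, for example by querying the data or extracting it at the application level.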