Case description: When the primary database is large, a new standby node can be added either by an online clone, or offline by directly copying all cluster-related directories from an existing node. This case walks through creating a new standby node by an offline physical copy of the directories, with detailed operation steps.
I. Case environment
Operating system:
[root@node1 ~]# cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

Database:

test=# select version();
                                   version
------------------------------------------------------------------------------
 KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

Cluster architecture:

II. Configure the new node's system environment (identical to the other cluster nodes)
1. Configure resource limits
cat /etc/security/limits.conf
# End of file
* soft nofile 655360
root soft nofile 655360
* hard nofile 655360
root hard nofile 655360
* soft nproc 655360
root soft nproc 655360
* hard nproc 655360
root hard nproc 655360
* soft core unlimited
root soft core unlimited
* hard core unlimited
root hard core unlimited
* soft memlock 50000000
root soft memlock 50000000
* hard memlock 50000000
root hard memlock 50000000
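As a quick sanity check after editing limits.conf, the effective per-process limit can be compared against the intended value. A minimal sketch (655360 is the nofile value used in this case; note that limits.conf only applies to new login sessions):

```shell
# Compare the nofile limit the current shell actually got with the
# value configured above. A mismatch usually means a fresh login is
# needed, or the limits.conf entry was not picked up.
WANT=655360
GOT=$(ulimit -n)
if [ "$GOT" = "$WANT" ]; then
  echo "nofile ok: $GOT"
else
  echo "nofile is $GOT, expected $WANT - log in again or re-check limits.conf"
fi
```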
2. Configure sysctl.conf
[root@node2 ~]# cat /etc/sysctl.conf
# System default settings live in /usr/lib/sysctl.d/00-system.conf.
# To override those settings, enter new settings here, or in an /etc/sysctl.d/<name>.conf file
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
kernel.sem= 5010 641280 5010 256
fs.file-max=7672460
fs.aio-max-nr=1048576
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.wmem_default=262144
net.core.wmem_max=4194304
net.ipv4.ip_local_port_range=9000 65500
net.ipv4.tcp_wmem=8192 65536 16777216
net.ipv4.tcp_rmem=8192 87380 16777216
vm.min_free_kbytes=512000
vm.vfs_cache_pressure=200
vm.swappiness=20
net.ipv4.tcp_max_syn_backlog=4096
net.core.somaxconn=4096
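To make the kernel parameters take effect without a reboot they can be loaded with `sysctl -p`. A minimal sketch (the drop-in path /etc/sysctl.d/99-kingbase.conf is an assumption, not from the case; the sketch stages a subset of the values above into a temporary file and prints the load command instead of running it as root):

```shell
# Stage a sysctl drop-in with a subset of the values above, then load
# it with `sysctl -p` as root. FRAG defaults to /tmp so the sketch can
# be tried safely; use /etc/sysctl.d/99-kingbase.conf on a real node.
FRAG=${FRAG:-/tmp/99-kingbase.conf}
cat > "$FRAG" <<'EOF'
kernel.sem = 5010 641280 5010 256
fs.file-max = 7672460
fs.aio-max-nr = 1048576
net.core.somaxconn = 4096
EOF
echo "staged $FRAG; apply as root with: sysctl -p $FRAG"
```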
3. Disable the firewall on the new node
[root@node3 .ssh]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2021-06-24 13:59:09 CST; 23min ago
Main PID: 798 (firewalld)
CGroup: /system.slice/firewalld.service
└─798 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
Jun 24 13:59:06 localhost.localdomain systemd[1]: Starting firewalld - dynamic firewall daemon...
Jun 24 13:59:09 localhost.localdomain systemd[1]: Started firewalld - dynamic firewall daemon.
[root@node3 .ssh]# systemctl stop firewalld
[root@node3 .ssh]# systemctl disable firewalld
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.
[root@node3 .ssh]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Jun 24 13:59:06 localhost.localdomain systemd[1]: Starting firewalld - dynamic firewall daemon...
Jun 24 13:59:09 localhost.localdomain systemd[1]: Started firewalld - dynamic firewall daemon.
Jun 24 14:22:46 node3 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jun 24 14:22:48 node3 systemd[1]: Stopped firewalld - dynamic firewall daemon.
4. Verify the ssh trust between the new node and the other nodes
kingbase user: the kingbase user can ssh to every node without a password, both as kingbase and as root.

root user: the root user can likewise ssh to every node without a password.
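The trust setup itself is not shown in the case; the usual approach is ssh-keygen plus ssh-copy-id for each user/node pair. A minimal sketch (node names are the ones from this case) that writes the needed commands to a file for review instead of executing them, since ssh-copy-id prompts for remote passwords:

```shell
# Generate the list of ssh-copy-id commands needed for full trust
# between the kingbase/root users and all nodes. Review the file,
# then run it on each node as the matching user.
OUT=${OUT:-/tmp/ssh-trust-setup.sh}
{
  echo "ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa   # once per user, if no key yet"
  for n in node1 node2 node3; do
    echo "ssh-copy-id kingbase@$n"
    echo "ssh-copy-id root@$n"
  done
} > "$OUT"
echo "wrote $OUT"
```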

5. Disable SELinux
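The commands for this step are missing from the original case; the usual CentOS 7 procedure is `setenforce 0` plus an edit of /etc/selinux/config. A sketch that demonstrates the edit against a scratch copy (on a real node set CFG=/etc/selinux/config and also run `setenforce 0` as root):

```shell
# Disable SELinux persistently by rewriting the SELINUX= line.
# CFG defaults to a throwaway demo file so the sketch is safe to run.
CFG=${CFG:-/tmp/selinux-config-demo}
[ -f "$CFG" ] || printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$CFG"
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$CFG"
grep '^SELINUX=' "$CFG"
# On a real node, additionally: setenforce 0   (immediate, non-persistent)
```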

III. Check the current cluster status
1. Cluster node status
[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2. Streaming replication status between primary and standby
test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 13347 |    16384 | esrep   | node249          | 192.168.7.249 |                 |       23228 | 2021-03-01 14:45:03.723296+08 |              | streaming | 1/F205BC90 | 1/F205BC90 | 1/F205BC90 | 1/F205BC90 |           |           |            |             1 | quorum     | 2021-03-01 14:54:58.127023+08
(1 row)
IV. Create the new standby node by physical copy
1. Create the directory tree on the new standby (same layout as the other cluster nodes)
[kingbase@node3 .ssh]$ mkdir -p /home/kingbase/cluster/R6HA/KHA/kingbase/data
2. Copy the cluster-related directories and files from the existing primary node
[kingbase@node2 KHA]$ scp -r * node3:/home/kingbase/cluster/R6HA/KHA/
Note: the source instance must be cleanly shut down before it is copied.
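A simple guard before running the scp is to check for a leftover postmaster.pid under the data directory (the file name follows the PostgreSQL convention that KingbaseES inherits; this check is a sketch added here, not part of the original case):

```shell
# Refuse to copy if the source instance still looks like it is running.
# A present postmaster.pid means the server is up, or died abnormally.
DATA=${DATA:-/home/kingbase/cluster/R6HA/KHA/kingbase/data}
if [ -e "$DATA/postmaster.pid" ]; then
  echo "NOT safe to copy: $DATA/postmaster.pid exists (server running or crashed?)"
else
  echo "no postmaster.pid under $DATA - looks shut down, ok to copy"
fi
```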
3. Set setuid permissions on the ip and arping executables

=== If the cluster uses a VIP, the ip and arping executables must carry the setuid bit. ===
[root@node3 soft]# chmod 4755 /sbin/ip
[root@node3 soft]# chmod 4755 /sbin/arping
[root@node3 soft]# ls -lh /sbin/ip
-rwsr-xr-x. 1 root root 319K Nov 20  2015 /sbin/ip
[root@node3 soft]# ls -lh /sbin/arping
-rwsr-xr-x. 1 root root 24K Nov 21  2015 /sbin/arping
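Besides reading the `ls -l` mode string, the setuid bit can be verified with the shell's `-u` file test. A minimal sketch:

```shell
# Report whether the setuid bit is present on the binaries the VIP
# management calls. Paths are the ones used in this case.
for f in /sbin/ip /sbin/arping; do
  if [ -u "$f" ]; then
    echo "setuid ok: $f"
  else
    echo "setuid missing (or file absent): $f"
  fi
done
```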
V. Add the new standby node to the cluster
1. Edit repmgr.conf
[kingbase@node3 etc]$ cat repmgr.conf
on_bmj=off
node_id=3
node_name=node243
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
log_file='/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log'
data_directory='/home/kingbase/cluster/R6HA/KHA/kingbase/data'
sys_bindir='/home/kingbase/cluster/R6HA/KHA/kingbase/bin'
ssh_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22'
reconnect_attempts=3
reconnect_interval=5
failover='automatic'
recovery='manual'
monitoring_history='no'
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
net_device='enp0s3'
ipaddr_path='/sbin'
arping_path='/sbin'
synchronous='quorum'
repmgrd_pid_file='/home/kingbase/cluster/R6HA/KHA/kingbase/hamgrd.pid'
ping_path='/usr/bin'

A first rejoin attempt fails because of the instance's state:

[kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force
ERROR: database is still running in state "shut down in recovery"
HINT: "repmgr node rejoin" cannot be executed on a running node
2. Start the standby database service
Start the database:
[kingbase@node3 bin]$ chmod 700 ../data
[kingbase@node3 bin]$ ./sys_ctl start -D ../data
waiting for server to start....2021-03-01 13:59:02.770 CST [20835] LOG:  sepapower extension initialized
2021-03-01 13:59:02.813 CST [20835] LOG:  starting KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
2021-03-01 13:59:02.814 CST [20835] LOG:  listening on IPv4 address "0.0.0.0", port 54321
2021-03-01 13:59:02.814 CST [20835] LOG:  listening on IPv6 address "::", port 54321
2021-03-01 13:59:02.859 CST [20835] LOG:  listening on Unix socket "/tmp/.s.KINGBASE.54321"
2021-03-01 13:59:02.902 CST [20835] LOG:  redirecting log output to logging collector process
2021-03-01 13:59:02.902 CST [20835] HINT:  Future log output will appear in directory "sys_log".
 done
server started
3. Register the node with the cluster as a standby
[kingbase@node3 bin]$ ./repmgr standby register --force
INFO: connecting to local node "node243" (ID: 3)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
WARNING: local node not attached to primary node 1
NOTICE: -F/--force supplied, continuing anyway
INFO: standby registration complete
NOTICE: standby node "node243" (ID: 3) successfully registered
4. Rejoin the node to the cluster
[kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep
ERROR: connection to database failed
DETAIL: fe_sendauth: no password supplied
=== The output above shows the joining node cannot authenticate against the primary database. ===
5. Copy the authentication file from another cluster node into the new node's home directory
[kingbase@node3 ~]$ ls -lha .encpwd
-rw-------. 1 kingbase kingbase 55 Mar  1 14:33 .encpwd
[kingbase@node3 ~]$ cat .encpwd
*:*:*:system:MTIzNDU2
*:*:*:esrep:S2luZ2Jhc2VoYTExMA==
=== R6 clusters use the hidden file .encpwd so that the OS user can log in to the database without a password. ===
6. Rejoin the new node to the cluster
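Judging from the sample entries above, the last field of each .encpwd line appears to be plain base64 (this is an observation from the file contents, not documented behavior), so an entry can be sanity-checked with `base64 -d`:

```shell
# Decode the password field of a sample .encpwd entry from this case
# to confirm the file is well-formed. Fields are colon-separated:
# host:port:db:user:base64(password).
line='*:*:*:system:MTIzNDU2'
user=$(echo "$line" | cut -d: -f4)
pw=$(printf '%s' "${line##*:}" | base64 -d)
echo "user=$user decoded_password=$pw"
# prints: user=system decoded_password=123456
```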
[kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 1/F2055920, rejoin target lsn is 1/F2062AB0
NOTICE: setting node 3's upstream to node 1
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 14:34:19.973116
NOTICE: starting server using "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' -l /home/kingbase/cluster/R6HA/KHA/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 14:34:20.187969
NOTICE: NODE REJOIN successful
DETAIL: node 3 is now attached to node 1
=== The output above shows the new node node243 has joined the cluster as a standby. ===
7. Check the cluster node status
[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 13347 |    16384 | esrep   | node249          | 192.168.7.249 |                 |       23228 | 2021-03-01 14:45:03.723296+08 |              | streaming | 1/F20659C8 | 1/F20659C8 | 1/F20659C8 | 1/F20659C8 |           |           |            |             1 | quorum     | 2021-03-01 15:35:31.027066+08
 25123 |    16384 | esrep   | node243          | 192.168.7.243 |                 |       49130 | 2021-03-01 15:33:59.607489+08 |              | streaming | 1/F20659C8 | 1/F20659C8 | 1/F20659C8 | 1/F20659C8 |           |           |            |             1 | quorum     | 2021-03-01 14:36:01.384836+08
(2 rows)
8. Data synchronization test
1) Run DDL on the primary
test=# \c prod
You are now connected to database "prod" as user "system".
prod=# create table t8 (like t7);
CREATE TABLE
prod=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------------+-------------------+--------
......
public | t8 | table | system
(16 rows)
2) Check the synchronized table on the standby
test=# \c prod
You are now connected to database "prod" as user "system".
prod=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------------+-------------------+--------
.......
public | t8 | table | system
(16 rows)
VI. Restart the cluster and verify
1. Restart the cluster
[kingbase@node1 bin]$ ./sys_monitor.sh restart
2021-03-01 15:37:28 Ready to stop all DB ...
Service process "node_export" was killed at process 14434
Service process "postgres_ex" was killed at process 14435
Service process "node_export" was killed at process 14008
Service process "postgres_ex" was killed at process 14009
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
2021-03-01 15:37:37 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 15:37:38 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 15:37:38 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 15:37:39 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 15:37:39 begin to stop repmgrd on "[192.168.7.243]".
2021-03-01 15:37:40 repmgrd on "[192.168.7.243]" already stopped.
2021-03-01 15:37:40 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped
2021-03-01 15:37:41 DB on "[192.168.7.249]" stop success.
2021-03-01 15:37:41 begin to stop DB on "[192.168.7.243]".
waiting for server to shut down.... done
server stopped
2021-03-01 15:37:43 DB on "[192.168.7.243]" stop success.
2021-03-01 15:37:43 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down...... done
server stopped
2021-03-01 15:37:46 DB on "[192.168.7.248]" stop success.
2021-03-01 15:37:46 Done.
2021-03-01 15:37:46 Ready to start all DB ...
2021-03-01 15:37:46 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 15:37:48 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 15:37:49 DB on "[192.168.7.248]" start success.
2021-03-01 15:37:49 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 15:37:52 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 15:37:54 Try to ping trusted_servers on host 192.168.7.243 ...
2021-03-01 15:37:57 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 15:37:59 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 15:38:00 DB on "[192.168.7.249]" start success.
2021-03-01 15:38:00 begin to start DB on "[192.168.7.243]".
waiting for server to start.... done
server started
2021-03-01 15:38:02 execute to start DB on "[192.168.7.243]" success, connect to check it.
2021-03-01 15:38:03 DB on "[192.168.7.243]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2021-03-01 15:38:03 The primary DB is started.
2021-03-01 15:38:07 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.248].
2021-03-01 15:38:07 Try to ping vip on host 192.168.7.248 ...
2021-03-01 15:38:10 Try to ping vip on host 192.168.7.249 ...
2021-03-01 15:38:13 Try to ping vip on host 192.168.7.243 ...
2021-03-01 15:38:16 begin to start repmgrd on "[192.168.7.248]".
[2021-03-01 15:38:17] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 15:38:17] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:38:17 repmgrd on "[192.168.7.248]" start success.
2021-03-01 15:38:17 begin to start repmgrd on "[192.168.7.249]".
[2021-03-01 15:38:08] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 15:38:08] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:38:18 repmgrd on "[192.168.7.249]" start success.
2021-03-01 15:38:18 begin to start repmgrd on "[192.168.7.243]".
[2021-03-01 14:38:40] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:38:40] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:38:19 repmgrd on "[192.168.7.243]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node248 | primary | * running |          | running | 27678 | no      | n/a
 2  | node249 | standby |   running | node248  | running | 25551 | no      | 1 second(s) ago
 3  | node243 | standby |   running | node248  | running | 20067 | no      | n/a
2021-03-01 15:38:31 Done.
=== The output above shows the new node can now be managed with sys_monitor.sh. ===
2. Check the cluster node status
[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
3. The crond service entry is created automatically on the new node when the cluster is started through sys_monitor.sh
[root@node3 cron.d]# cat KINGBASECRON
*/1 * * * * kingbase . /etc/profile;/home/kingbase/cluster/R6HA/KHA/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf >> /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log 2>&1
VII. Summary of the procedure:
1) Configure the new node's system environment to match the other cluster nodes.
2) Manually configure ssh trust between the new node and the other cluster nodes: root to root, kingbase to kingbase, and kingbase to root.
3) Disable the firewall and SELinux on the new node.
4) Shut down the cluster, then copy the cluster directories and related files (including the database) from the primary to the new node.
5) Set the setuid bit on the ip and arping executables.
6) Configure the new standby's repmgr.conf.
7) Start the cluster, start the new standby's database service, and register the new standby with the cluster.
8) Copy the .encpwd file to the new standby, stop the new standby's database service, and rejoin the node to the cluster.
9) Verify the status and streaming replication information of all cluster nodes.
10) Restart the cluster to verify.
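The steps above can be condensed into a reviewable checklist (paths and node names are the ones used in this case; the sketch only writes and prints the per-step commands, it does not execute anything, and the exact cluster stop/start invocations should be checked against your sys_monitor.sh):

```shell
# Write the offline-copy procedure out as a numbered checklist file,
# then display it. Review each line before running it by hand.
OUT=${OUT:-/tmp/kb-add-node-checklist.txt}
cat > "$OUT" <<'EOF'
01 [all]   match limits.conf and sysctl.conf with the existing nodes
02 [all]   set up ssh trust: root-root, kingbase-kingbase, kingbase-root
03 [node3] disable firewalld and SELinux
04 [cluster] shut the cluster down before copying
05 [node3] mkdir -p /home/kingbase/cluster/R6HA/KHA/kingbase/data
06 [node2] scp -r * node3:/home/kingbase/cluster/R6HA/KHA/
07 [node3] chmod 4755 /sbin/ip /sbin/arping   (only if a VIP is used)
08 [node3] edit etc/repmgr.conf (node_id, node_name, conninfo, data_directory)
09 [cluster] start the cluster; on node3 run: sys_ctl start -D ../data
10 [node3] repmgr standby register --force
11 [node3] copy .encpwd over, stop the standby DB, then: repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep
12 [any]   repmgr cluster show; select * from sys_stat_replication;
EOF
cat "$OUT"
```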
