In PostgreSQL high-availability architectures, the primary/standby pair is usually built on streaming replication; see the references below for its history.
PostgreSQL 12 further enhanced promote; see https://paquier.xyz/postgresql-2/postgres-12-promote-function/.
The promote_trigger_file parameter controls whether a hot standby stops recovery and is promoted. https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY
All promote-related files:
/* File path names (all relative to $PGDATA) */
#define RECOVERY_SIGNAL_FILE        "recovery.signal"
#define STANDBY_SIGNAL_FILE         "standby.signal"
#define BACKUP_LABEL_FILE           "backup_label"
#define BACKUP_LABEL_OLD            "backup_label.old"
#define TABLESPACE_MAP              "tablespace_map"
#define TABLESPACE_MAP_OLD          "tablespace_map.old"
/* files to signal promotion to primary */
#define PROMOTE_SIGNAL_FILE         "promote"
#define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
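For reference, a minimal sketch of how promotion can be triggered by hand on a stock PostgreSQL 12+ standby (the trigger-file path and data directory are illustrative; LightDB ships lt_ctl in place of pg_ctl):

# Option 1: promote_trigger_file (set in postgresql.conf on the standby, PG 12-15):
#   promote_trigger_file = '/tmp/postgresql.trigger'
touch /tmp/postgresql.trigger        # the startup process notices the file and promotes

# Option 2: ask the server directly; this writes the "promote" signal file into $PGDATA
pg_ctl promote -D /home/lightdb/data/defaultCluster

# Option 3: from SQL (PG 12+); waits up to 60 seconds by default for the promote to finish
psql -c "SELECT pg_promote();"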
EDB provides a reference on the performance impact:
Running select now(), pg_last_xact_replay_timestamp(); on the standby shows the actual replication delay: the gap between the two timestamps approximates the apply lag.
From the above, HA mode costs roughly 10%-30% of throughput. remote_write has very little impact because it only requires the WAL to reach the standby's memory; on depends entirely on the remote disk's performance — essentially the WAL is written twice in series plus one network round trip. If the primary is not at a bottleneck (CPU not maxed out, I/O not maxed out, network not at its limit), the HA TPS will very likely match single-instance TPS; HA only becomes a problem when the network turns into the bottleneck, for example during batch jobs that must replicate synchronously.
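A sketch of how this lag can be checked on the standby, combining the query above with pg_wal_lsn_diff (the column aliases are mine; shown with the stock psql client):

psql -x -c "
SELECT pg_last_wal_receive_lsn()                                 AS received_lsn,
       pg_last_wal_replay_lsn()                                  AS replayed_lsn,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn())                 AS replay_gap_bytes,
       now() - pg_last_xact_replay_timestamp()                   AS approx_apply_delay;"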
In the author's environment running LightDB for PG, ltbench sustains a very stable 70,000 TPS with a 1.5 ms response time, on PMEM storage over Mellanox RoCE; local, remote_apply, and on differ by less than 10%, with primary CPU utilization around 80%.
To get automatic failover, repmgr is the usual choice (Patroni+etcd, written in Python, and pg_auto_failover are alternatives). For non-Kubernetes environments we recommend repmgr for failover management plus a keepalived VIP for client-transparent failover. LightDB for PG uses a similar mechanism, but hardens split-brain stability and reliability, adds non-blocking standby failure handling, and supports external SLBs. (Whatever HA solution you pick, the foundation is streaming replication — that is the core; the tooling on top only automates as much as possible, and the ability to intervene manually when things go wrong remains the single most important capability.)
Patroni-based HA will be covered in a dedicated follow-up post.
repmgr configuration items
Besides the basic settings, repmgr has a few very important but frequently overlooked configuration items, mainly related to event notifications.
https://repmgr.org/docs/repmgr.html#EVENT-NOTIFICATIONS
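A hedged repmgr.conf sketch of those event-notification settings (the script path and the chosen event list are illustrative; the %-placeholders are the ones documented by repmgr):

# repmgr.conf excerpt: forward important cluster events to an external script
# %n = node ID, %e = event type, %s = success (1/0), %t = timestamp, %d = details
event_notification_command='/home/lightdb/scripts/repmgr_event.sh %n %e %s "%t" "%d"'
# limit which events trigger the command; leave unset to be notified of all events
event_notifications='primary_register,standby_register,standby_promote,repmgrd_failover_promote,repmgrd_failover_follow'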
The main repmgr commands are:
repmgr primary register — initialise a repmgr installation and register the primary node
repmgr primary unregister — unregister an inactive primary node
repmgr standby clone — clone a PostgreSQL standby node from another PostgreSQL node
repmgr standby register — add a standby's information to the repmgr metadata
repmgr standby unregister — remove a standby's information from the repmgr metadata
repmgr standby promote — promote a standby to a primary
repmgr standby follow — attach a running standby to a new upstream node
repmgr standby switchover — promote a standby to primary and demote the existing primary to a standby
repmgr witness register — add a witness node's information to the repmgr metadata
repmgr witness unregister — remove a witness node's information from the repmgr metadata
=========== Day-to-day administration mainly uses the node, cluster, and service command groups; the three groups above are mainly used for initial installation and maintenance.
repmgr node status — show overview of a node's basic information and replication status
repmgr node check — performs some health checks on a node from a replication perspective
repmgr node rejoin — rejoin a dormant (stopped) node to the replication cluster
repmgr node service — show or execute the system service command to stop/start/restart/reload/promote a node
repmgr cluster show — display information about each registered node in the replication cluster
repmgr cluster matrix — runs repmgr cluster show on each node and summarizes output
repmgr cluster crosscheck — cross-checks connections between each combination of nodes
repmgr cluster event — output a formatted list of cluster events
repmgr cluster cleanup — purge monitoring history
repmgr service status — display information about the status of repmgrd on each node in the cluster
repmgr service pause — Instruct all repmgrd instances in the replication cluster to pause failover operations
repmgr service unpause — Instruct all repmgrd instances in the replication cluster to resume failover operations
repmgr daemon start — Start the repmgrd daemon on the local node
repmgr daemon stop — Stop the repmgrd daemon on the local node
The most important ones are:
- View the cluster's node roles and status (can be run on any node; includes information about all nodes).
repmgr cluster show -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
 ID | Name                       | Role    | Status    | Upstream                  | Location | Priority | Timeline | Connection string
----+----------------------------+---------+-----------+---------------------------+----------+----------+----------+------------------------------------------------------------------------
 1  | 10.19.36.10-defaultcluster | standby |   running | 10.19.36.9-defaultcluster | default  | 100      | 5        | host=10.19.36.10 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | 10.19.36.9-defaultcluster  | primary | * running |                           | default  | 100      | 5        | host=10.19.36.9 port=5432 user=repmgr dbname=repmgr connect_timeout=2
- View the status of the repmgrd daemons (can be run on any node; includes information about all nodes).
[lightdb@localhost ~]$ repmgr service status -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
 ID | Name                       | Role    | Status    | Upstream                  | repmgrd | PID     | Paused? | Upstream last seen
----+----------------------------+---------+-----------+---------------------------+---------+---------+---------+--------------------
 1  | 10.19.36.10-defaultcluster | standby |   running | 10.19.36.9-defaultcluster | running | 3324044 | no      | 0 second(s) ago
 2  | 10.19.36.9-defaultcluster  | primary | * running |                           | running | 4170296 | no      | n/a
The Status column can take several values, in particular with a *, ?, or ! prefix. Their meanings (implemented in the format_node_status function of ltcluster/ltcluster-client.c) are as follows:
- * running / running (no prefix): the node is reachable and its actual role matches what is registered in the metadata, i.e. the normal state;
- ? running: the server can be pinged, but a libpq connection cannot be established;
- ? unreachable: the server cannot be pinged.
- A ! prefix, e.g. ! running as primary, means the node's state recorded in the ltcluster metadata does not match its actual state — for example, a switchover/failover was performed by some means other than ltcluster.
Under normal conditions all nodes have the same timeline id; a witness shows timeline id = n/a.
If the Upstream column carries a ? prefix the meaning is similar: the link from this node to its upstream is not healthy, e.g. ? node1 means the upstream cannot be reached. If another node shows ! running as primary, the cluster topology has changed (check whether a failover executed by repmgrd would also leave it marked with !, i.e. whether the metadata was updated at the same time — in theory it should be); that node's timeline id is usually the current maximum.
/* * Generate formatted node status output for display by "cluster show" and * "service status". */ bool format_node_status(t_node_info *node_info, PQExpBufferData *node_status, PQExpBufferData *upstream, ItemList *warnings) { bool error_found = false; t_node_info remote_node_rec = T_NODE_INFO_INITIALIZER; RecordStatus remote_node_rec_found = RECORD_NOT_FOUND; if (PQstatus(node_info->conn) == CONNECTION_OK) { node_info->node_status = NODE_STATUS_UP; node_info->recovery_type = get_recovery_type(node_info->conn); /* get node's copy of its record so we can see what it thinks its status is */ remote_node_rec_found = get_node_record_with_upstream(node_info->conn, node_info->node_id, &remote_node_rec); } else { /* check if node is reachable, but just not letting us in */ if (is_server_available_quiet(node_info->conninfo)) node_info->node_status = NODE_STATUS_REJECTED; else node_info->node_status = NODE_STATUS_DOWN; node_info->recovery_type = RECTYPE_UNKNOWN; } /* format node status info */ switch (node_info->type) { case PRIMARY: { /* node is reachable */ if (node_info->node_status == NODE_STATUS_UP) { if (node_info->active == true) { switch (node_info->recovery_type) { case RECTYPE_PRIMARY: appendPQExpBufferStr(node_status, "* running"); break; case RECTYPE_STANDBY: appendPQExpBufferStr(node_status, "! running as standby"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is registered as primary but running as standby", node_info->node_name, node_info->node_id); break; case RECTYPE_UNKNOWN: appendPQExpBufferStr(node_status, "! unknown"); item_list_append_format(warnings, "node \"%s\" (ID: %i) has unknown replication status", node_info->node_name, node_info->node_id); break; } } else { if (node_info->recovery_type == RECTYPE_PRIMARY) { appendPQExpBufferStr(node_status, "! running"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is running but the ltcluster node record is inactive", node_info->node_name, node_info->node_id); } else { appendPQExpBufferStr(node_status, "! running as standby"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is registered as an inactive primary but running as standby", node_info->node_name, node_info->node_id); } } } /* node is up but cannot connect */ else if (node_info->node_status == NODE_STATUS_REJECTED) { if (node_info->active == true) { appendPQExpBufferStr(node_status, "? running"); } else { appendPQExpBufferStr(node_status, "! running"); error_found = true; } } /* node is unreachable */ else { /* node is unreachable but marked active */ if (node_info->active == true) { appendPQExpBufferStr(node_status, "? unreachable"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is registered as an active primary but is unreachable", node_info->node_name, node_info->node_id); } /* node is unreachable and marked as inactive */ else { appendPQExpBufferStr(node_status, "- failed"); error_found = true; } } } break; case STANDBY: { /* node is reachable */ if (node_info->node_status == NODE_STATUS_UP) { if (node_info->active == true) { switch (node_info->recovery_type) { case RECTYPE_STANDBY: appendPQExpBufferStr(node_status, " running"); break; case RECTYPE_PRIMARY: appendPQExpBufferStr(node_status, "! running as primary"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is registered as standby but running as primary", node_info->node_name, node_info->node_id); break; case RECTYPE_UNKNOWN: appendPQExpBufferStr(node_status, "! 
unknown"); item_list_append_format( warnings, "node \"%s\" (ID: %i) has unknown replication status", node_info->node_name, node_info->node_id); break; } } else { if (node_info->recovery_type == RECTYPE_STANDBY) { appendPQExpBufferStr(node_status, "! running"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is running but the ltcluster node record is inactive", node_info->node_name, node_info->node_id); } else { appendPQExpBufferStr(node_status, "! running as primary"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is running as primary but the ltcluster node record is inactive", node_info->node_name, node_info->node_id); } } /* warn about issue with paused WAL replay */ if (is_wal_replay_paused(node_info->conn, true)) { item_list_append_format(warnings, _("WAL replay is paused on node \"%s\" (ID: %i) with WAL replay pending; this node cannot be manually promoted until WAL replay is resumed"), node_info->node_name, node_info->node_id); } } /* node is up but cannot connect */ else if (node_info->node_status == NODE_STATUS_REJECTED) { if (node_info->active == true) { appendPQExpBufferStr(node_status, "? running"); } else { appendPQExpBufferStr(node_status, "! running"); error_found = true; } } /* node is unreachable */ else { /* node is unreachable but marked active */ if (node_info->active == true) { appendPQExpBufferStr(node_status, "? unreachable"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is registered as an active standby but is unreachable", node_info->node_name, node_info->node_id); } else { appendPQExpBufferStr(node_status, "- failed"); error_found = true; } } } break; case WITNESS: { /* node is reachable */ if (node_info->node_status == NODE_STATUS_UP) { if (node_info->active == true) { appendPQExpBufferStr(node_status, "* running"); } else { appendPQExpBufferStr(node_status, "! running"); error_found = true; } } /* node is up but cannot connect */ else if (node_info->node_status == NODE_STATUS_REJECTED) { if (node_info->active == true) { appendPQExpBufferStr(node_status, "? rejected"); } else { appendPQExpBufferStr(node_status, "! failed"); error_found = true; } } /* node is unreachable */ else { if (node_info->active == true) { appendPQExpBufferStr(node_status, "? unreachable"); } else { appendPQExpBufferStr(node_status, "- failed"); error_found = true; } } } break; case UNKNOWN: { /* this should never happen */ appendPQExpBufferStr(node_status, "? unknown node type"); error_found = true; } break; } /* format node upstream info */ if (remote_node_rec_found == RECORD_NOT_FOUND) { /* * Unable to retrieve the node's copy of its own record - copy the * name from our own copy of the record */ appendPQExpBuffer(upstream, "? %s", node_info->upstream_node_name); } else if (remote_node_rec.type == WITNESS) { /* no upstream - unlikely to happen */ if (remote_node_rec.upstream_node_id == NO_UPSTREAM_NODE) { appendPQExpBufferStr(upstream, "! "); item_list_append_format(warnings, "node \"%s\" (ID: %i) is a witness but reports it has no upstream node", node_info->node_name, node_info->node_id); } /* mismatch between reported upstream and upstream in local node's metadata */ else if (node_info->upstream_node_id != remote_node_rec.upstream_node_id) { appendPQExpBufferStr(upstream, "! 
"); if (node_info->upstream_node_id != remote_node_rec.upstream_node_id) { item_list_append_format(warnings, "node \"%s\" (ID: %i) reports a different upstream (reported: \"%s\", expected \"%s\")", node_info->node_name, node_info->node_id, remote_node_rec.upstream_node_name, node_info->upstream_node_name); } } else { t_node_info upstream_node_rec = T_NODE_INFO_INITIALIZER; RecordStatus upstream_node_rec_found = get_node_record(node_info->conn, node_info->upstream_node_id, &upstream_node_rec); if (upstream_node_rec_found != RECORD_FOUND) { appendPQExpBufferStr(upstream, "? "); item_list_append_format(warnings, "unable to find record for upstream node ID %i", node_info->upstream_node_id); } else { PGconn *upstream_conn = establish_db_connection_quiet(upstream_node_rec.conninfo); if (PQstatus(upstream_conn) != CONNECTION_OK) { appendPQExpBufferStr(upstream, "? "); item_list_append_format(warnings, "unable to connect to node \"%s\" (ID: %i)'s upstream node \"%s\" (ID: %i)", node_info->node_name, node_info->node_id, upstream_node_rec.node_name, upstream_node_rec.node_id); } PQfinish(upstream_conn); } } appendPQExpBufferStr(upstream, remote_node_rec.upstream_node_name); } else if (remote_node_rec.type == STANDBY) { if (node_info->upstream_node_id != NO_UPSTREAM_NODE && node_info->upstream_node_id == remote_node_rec.upstream_node_id) { /* * expected and reported upstreams match - check if node is actually * connected to the upstream */ NodeAttached attached_to_upstream = NODE_ATTACHED_UNKNOWN; char *replication_state = NULL; t_node_info upstream_node_rec = T_NODE_INFO_INITIALIZER; RecordStatus upstream_node_rec_found = get_node_record(node_info->conn, node_info->upstream_node_id, &upstream_node_rec); if (upstream_node_rec_found != RECORD_FOUND) { item_list_append_format(warnings, "unable to find record for upstream node ID %i", node_info->upstream_node_id); } else { PGconn *upstream_conn = establish_db_connection_quiet(upstream_node_rec.conninfo); if (PQstatus(upstream_conn) != CONNECTION_OK) { item_list_append_format(warnings, "unable to connect to node \"%s\" (ID: %i)'s upstream node \"%s\" (ID: %i)", node_info->node_name, node_info->node_id, upstream_node_rec.node_name, upstream_node_rec.node_id); } else { attached_to_upstream = is_downstream_node_attached(upstream_conn, node_info->node_name, &replication_state); } PQfinish(upstream_conn); } if (attached_to_upstream == NODE_ATTACHED_UNKNOWN) { appendPQExpBufferStr(upstream, "? "); item_list_append_format(warnings, "unable to determine if node \"%s\" (ID: %i) is attached to its upstream node \"%s\" (ID: %i)", node_info->node_name, node_info->node_id, upstream_node_rec.node_name, upstream_node_rec.node_id); } if (attached_to_upstream == NODE_NOT_ATTACHED) { appendPQExpBufferStr(upstream, "? "); item_list_append_format(warnings, "node \"%s\" (ID: %i) attached to its upstream node \"%s\" (ID: %i) in state \"%s\"", node_info->node_name, node_info->node_id, upstream_node_rec.node_name, upstream_node_rec.node_id, replication_state); } else if (attached_to_upstream == NODE_DETACHED) { appendPQExpBufferStr(upstream, "! 
"); item_list_append_format(warnings, "node \"%s\" (ID: %i) is not attached to its upstream node \"%s\" (ID: %i)", node_info->node_name, node_info->node_id, upstream_node_rec.node_name, upstream_node_rec.node_id); } appendPQExpBufferStr(upstream, node_info->upstream_node_name); } else { if (node_info->upstream_node_id != NO_UPSTREAM_NODE && remote_node_rec.upstream_node_id == NO_UPSTREAM_NODE) { appendPQExpBufferChar(upstream, '!'); item_list_append_format(warnings, "node \"%s\" (ID: %i) reports it has no upstream (expected: \"%s\")", node_info->node_name, node_info->node_id, node_info->upstream_node_name); } else if (node_info->upstream_node_id != NO_UPSTREAM_NODE && remote_node_rec.upstream_node_id != NO_UPSTREAM_NODE) { appendPQExpBuffer(upstream, "! %s", remote_node_rec.upstream_node_name); item_list_append_format(warnings, "node \"%s\" (ID: %i) reports a different upstream (reported: \"%s\", expected \"%s\")", node_info->node_name, node_info->node_id, remote_node_rec.upstream_node_name, node_info->upstream_node_name); } } } return error_found; }
- --csv produces CSV output, which is convenient for automated analysis, for example:
[lightdb@localhost ~]$ repmgr service status -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf --csv
1,10.19.36.10-defaultcluster,standby,1,1,3324044,0,100,1,default
2,10.19.36.9-defaultcluster,primary,1,1,4170296,0,100,-1,default
- View the local node's role, status, and replication lag (node status is usually sufficient; only covers the local node).
[lightdb@localhost ~]$ repmgr node status -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
Node "10.19.36.10-defaultcluster":
	LightDB version: 13.3
	Total data size: 9086 MB
	Conninfo: host=10.19.36.10 port=5432 user=repmgr dbname=repmgr connect_timeout=2
	Role: standby
	WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)   # unless set to always, archiving is always disabled while the standby is replaying WAL
	Archive command: test ! -f /home/lightdb/lightdb-x/13.3-21.2/archive/%f && cp %p /home/lightdb/lightdb-x/13.3-21.2/archive/%f && find /home/lightdb/lightdb-x/13.3-21.2/archive -type f -mmin +10080 | xargs -i rm {}
	WALs pending archiving: 2 pending files
	Replication connections: 0 (of maximal 10)
	Replication slots: 0 physical (of maximal 10; 0 missing)
	Upstream node: 10.19.36.9-defaultcluster (ID: 2)
	Replication lag: 55 seconds   # the first reading can be inaccurate and unusually high
	Last received LSN: 6/F641CCE8
	Last replayed LSN: 6/F64159D0
If in doubt, also run node check, which performs real-time checks. For monitoring, to be safe, run both (each only covers the local node).
[lightdb@localhost ~]$ repmgr node check -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
Node "10.19.36.10-defaultcluster":
	Server role: OK (node is standby)
	Replication lag: OK (0 seconds)
	WAL archiving: OK (2 pending archive ready files)
	Upstream connection: OK (node "10.19.36.10-defaultcluster" (ID: 1) is attached to expected upstream node "10.19.36.9-defaultcluster" (ID: 2))
	Downstream servers: OK (this node has no downstream nodes)
	Replication slots: OK (node has no physical replication slots)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/home/lightdb/data/defaultCluster")
View connectivity between cluster nodes (can be run on any node; includes information about all nodes).
[lightdb@localhost ~]$ repmgr cluster crosscheck -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
INFO: connecting to database
 Name                       | ID | 1 | 2
----------------------------+----+---+---
 10.19.36.10-defaultcluster | 1  | * | *
 10.19.36.9-defaultcluster  | 2  | * | *
- Monitor standby failures (unless the primary's synchronous mode is local; with on, remote_apply, or remote_write, the standby must be monitored so that a standby failure does not block the primary and impact production).
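A sketch of such a check, run on the primary (thresholds and alerting are left out; an empty result while synchronous_commit is on/remote_apply/remote_write means commits can start to hang):

psql -c "
SELECT application_name, client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_gap_bytes
FROM pg_stat_replication;"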
============================ The above covers monitoring; now for the day-to-day administration tasks.
- Node maintenance restart.
[lightdb@lightdb1 ~]$ repmgr service pause -f /mnt/pmem1/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
NOTICE: node 1 (10.20.137.41-defaultcluster) paused
NOTICE: node 2 (10.20.137.42-defaultcluster) paused
[lightdb@lightdb1 ~]$ repmgr service status -f /mnt/pmem1/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
 ID | Name                        | Role    | Status    | Upstream                    | repmgrd | PID    | Paused? | Upstream last seen
----+-----------------------------+---------+-----------+-----------------------------+---------+--------+---------+--------------------
 1  | 10.20.137.41-defaultcluster | primary | * running |                             | running | 38834  | yes     | n/a
 2  | 10.20.137.42-defaultcluster | standby |   running | 10.20.137.41-defaultcluster | running | 185064 | yes     | 1 second(s) ago
...... modify the parameters, restart the instance ......
During this window the standby's repmgr log keeps retrying, as follows:
[2021-10-30 18:22:14] [INFO] node "10.20.137.42-defaultcluster" (ID: 2) monitoring upstream node "10.20.137.41-defaultcluster" (ID: 1) in normal state
[2021-10-30 18:22:14] [DETAIL] last monitoring statistics update was 2 seconds ago
[2021-10-30 18:23:12] [WARNING] unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 18:23:12] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:12] [WARNING] unable to connect to upstream node "10.20.137.41-defaultcluster" (ID: 1)
[2021-10-30 18:23:12] [INFO] checking state of node "10.20.137.41-defaultcluster" (ID: 1), 1 of 3 attempts
[2021-10-30 18:23:12] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=10.20.137.41 port=5432 fallback_application_name=repmgr"
[2021-10-30 18:23:12] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:12] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2021-10-30 18:23:17] [INFO] checking state of node "10.20.137.41-defaultcluster" (ID: 1), 2 of 3 attempts
[2021-10-30 18:23:17] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=10.20.137.41 port=5432 fallback_application_name=repmgr"
[2021-10-30 18:23:17] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:17] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2021-10-30 18:23:22] [INFO] checking state of node "10.20.137.41-defaultcluster" (ID: 1), 3 of 3 attempts
[2021-10-30 18:23:22] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=10.20.137.41 port=5432 fallback_application_name=repmgr"
[2021-10-30 18:23:22] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:22] [WARNING] unable to reconnect to node "10.20.137.41-defaultcluster" (ID: 1) after 3 attempts
[2021-10-30 18:23:22] [NOTICE] repmgrd on this node is paused
[2021-10-30 18:23:22] [DETAIL] no failover will be carried out
[2021-10-30 18:23:22] [HINT] execute "repmgr service unpause" to resume normal failover mode
[2021-10-30 18:23:22] [WARNING] unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 18:23:22] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:22] [ERROR] unable to execute get_primary_current_lsn()
[2021-10-30 18:23:22] [DETAIL] FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
[2021-10-30 18:23:22] [WARNING] unable to retrieve primary's current LSN
[2021-10-30 18:23:24] [WARNING] unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 18:23:24] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:24] [WARNING] unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 18:23:24] [DETAIL] PQping() returned "PQPING_REJECT"
[2021-10-30 18:23:26] [WARNING] unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 18:23:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-10-30 18:23:26] [WARNING] unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 18:23:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-10-30 18:23:28] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-10-30 18:23:28] [NOTICE] reconnected to upstream node "10.20.137.41-defaultcluster" (ID: 1) after 6 seconds, resuming monitoring
[2021-10-30 18:27:16] [INFO] node "10.20.137.42-defaultcluster" (ID: 2) monitoring upstream node "10.20.137.41-defaultcluster" (ID: 1) in normal state
[2021-10-30 18:27:16] [DETAIL] last monitoring statistics update was 2 seconds ago
The pg log looks like this:
2021-10-30 18:12:30.114559T @ checkpointer 00000[2021-10-29 20:45:28 CST] 0 [115395] DETAIL:  Last completed transaction was at log time 2021-10-30 18:12:30.084333+08.
2021-10-30 18:14:41.898079T @ walreceiver 00000[2021-10-30 17:58:15 CST] 0 [144662] LOG:  replication terminated by primary server
2021-10-30 18:14:41.898079T @ walreceiver 00000[2021-10-30 17:58:15 CST] 0 [144662] DETAIL:  End of WAL reached on timeline 3 at 10/800000A0.
2021-10-30 18:14:41.898109T @ walreceiver XX000[2021-10-30 17:58:15 CST] 0 [144662] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2021-10-30 18:14:41.898250T @ startup 00000[2021-10-29 20:45:28 CST] 0 [115394] LOG:  invalid record length at 10/800000A0: wanted 24, got 0
2021-10-30 18:14:41.909281T @ walreceiver XX000[2021-10-30 18:14:41 CST] 0 [158899] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
2021-10-30 18:14:46.909030T @ walreceiver 00000[2021-10-30 18:14:46 CST] 0 [158962] LOG:  started streaming WAL from primary at 10/80000000 on timeline 3
2021-10-30 18:15:30.175149T @ checkpointer 00000[2021-10-29 20:45:28 CST] 0 [115395] LOG:  restartpoint starting: time
- Starting and stopping the repmgrd daemon.
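A minimal example, reusing the configuration path from the earlier commands:

repmgr daemon start -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf
repmgr daemon stop -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf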
- Switchover. Essentially this is a promote of the standby; the difference is that the process also cleanly stops the old primary. The timeline is incremented just the same, as follows:
At this point the new standby's WAL is identical. The .partial file exists because the promote happened at that moment, so its contents were also carried over onto the new timeline.
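A hedged sketch of the command sequence, run on the standby that is to become the new primary (the --dry-run pass only validates preconditions; --siblings-follow only matters when there are additional standbys):

repmgr standby switchover -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf --dry-run
repmgr standby switchover -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf --siblings-follow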
- Failure recovery. After the failed primary is repaired, it is fine if it can simply rejoin, but more often the rejoin fails (this typically happens after a failover: the standby is promoted to the new primary and then keeps writing; an immediate rejoin of the old primary almost always works). For example:
[lightdb@hs-10-19-36-9 log]$ repmgr node rejoin -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf -d 'host=10.19.36.10 dbname=repmgr user=repmgr'
ERROR: this node cannot attach to rejoin target node 1
DETAIL: rejoin target server's timeline 6 forked off current database system timeline 5 before current recovery point 8/C00000A0
HINT: use --force-rewind to execute lt_rewind
[lightdb@hs-10-19-36-9 log]$ repmgr node rejoin -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf -d 'host=10.19.36.10 dbname=repmgr user=repmgr' --force-rewind
NOTICE: lt_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 6 forked off current database system timeline 5 before current recovery point 8/C00000A0
NOTICE: executing lt_rewind
DETAIL: lt_rewind command is "/home/lightdb/lightdb-x/13.3-21.2/bin/lt_rewind -D '/home/lightdb/data/defaultCluster' --source-server='host=10.19.36.10 port=5432 user=repmgr dbname=repmgr connect_timeout=2'"
ERROR: lt_rewind execution failed
DETAIL: lt_rewind: servers diverged at WAL location 8/A0000000 on timeline 5
lt_rewind: error: could not open file "/home/lightdb/data/defaultCluster/pg_wal/000000050000000800000004": No such file or directory
lt_rewind: fatal: could not find previous WAL record at 8/8DF84C18
Check whether the new primary or the failed old node still has that file in its pg_wal directory, and if so copy it over (it usually exists). Then re-run the rejoin.
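A sketch of that manual fix for the error above, assuming the missing segment still exists on the new primary (hosts and paths are taken from the output above):

# on the node being rejoined (the failed old primary), fetch the missing segment from the new primary
scp 10.19.36.10:/home/lightdb/data/defaultCluster/pg_wal/000000050000000800000004 \
    /home/lightdb/data/defaultCluster/pg_wal/
# then retry the rejoin
repmgr node rejoin -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf \
    -d 'host=10.19.36.10 dbname=repmgr user=repmgr' --force-rewind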
[lightdb@lightdb1 data]$ repmgr node rejoin -d 'host=10.20.137.42 dbname=repmgr user=repmgr' -f /mnt/pmem1/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf --force-rewind
NOTICE: lt_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 6 forked off current database system timeline 5 before current recovery point 23/C0000028
The accurate but not very helpful explanation is: a standby server was promoted before it could catch up with the primary, and now the primary cannot take the role of a standby server.
A bug was raised at https://www.postgresql.org/message-id/5211C82B.2080605%40vmware.com, but it does not seem to have received a conclusive answer.
NOTICE: executing lt_rewind
DETAIL: lt_rewind command is "/mnt/pmem1/lightdb/lightdb-x/13.3-21.2/bin/lt_rewind -D '/mnt/pmem1/lightdb/data' --source-server='host=10.20.137.42 port=5432 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /mnt/pmem1/lightdb/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/mnt/pmem1/lightdb/lightdb-x/13.3-21.2/bin/lt_ctl -w -D '/mnt/pmem1/lightdb/data' start"
WARNING: unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
WARNING: unable to ping "host=10.20.137.41 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
INFO: waiting for node "10.19.69.192-defaultcluster" (ID: 1) to respond to pings; 56 of max 60 attempts (parameter "node_rejoin_timeout")
WARNING: unable to ping "host=10.19.69.192 port=5433 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
WARNING: unable to ping "host=10.19.69.192 port=5433 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
WARNING: unable to ping "host=10.19.69.192 port=5433 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
WARNING: unable to ping "host=10.19.69.192 port=5433 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
ERROR: NODE REJOIN failed
DETAIL: local node "10.19.69.192-defaultcluster" did not become available start after 60 seconds
HINT: check the LightDB log on the local node
The problem is that the rejoin appears to have succeeded, yet the instance still fails to start (even with --force-rewind). The corresponding pg.log is:
2021-10-30 23:46:06.436515T @ startup 00000[2021-10-30 23:46:06 CST] 0 [139358] LOG:  database system was interrupted while in recovery at log time 2021-10-30 23:10:42 CST
2021-10-30 23:46:06.436515T @ startup 00000[2021-10-30 23:46:06 CST] 0 [139358] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-10-30 23:46:06.459085T @ startup 00000[2021-10-30 23:46:06 CST] 0 [139358] LOG:  entering standby mode
2021-10-30 23:46:06.459218T @ startup XX000[2021-10-30 23:46:06 CST] 0 [139358] FATAL:  requested timeline 6 does not contain minimum recovery point 2A/687352C8 on timeline 4
2021-10-30 23:46:06.460032T @ postmaster 00000[2021-10-30 23:46:05 CST] 0 [139356] LOG:  startup process (PID 139358) exited with exit code 1
2021-10-30 23:46:06.460061T @ postmaster 00000[2021-10-30 23:46:05 CST] 0 [139356] LOG:  aborting startup due to startup process failure
2021-10-30 23:46:06.461174T @ postmaster 00000[2021-10-30 23:46:05 CST] 0 [139356] LOG:  database system is shut down
What is certain is that the new primary's timeline id is 5, so why does the minimum recovery point refer to timeline 4? Because pg_rewind searches the failed node's WAL for the last full checkpoint before the point where the new primary's timeline forked off, and then uses the WAL written after that checkpoint to locate the blocks modified since the fork; that is what makes incremental recovery possible.
It may also be the following error:
2021-11-04 22:40:44.695099T @ walreceiver 00000[2021-11-04 22:40:44 CST] 0 [5559] LOG:  started streaming WAL from primary at 3/60000000 on timeline 2
2021-11-04 22:40:44.695197T @ walreceiver XX000[2021-11-04 22:40:44 CST] 0 [5559] FATAL:  could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000300000003 has already been removed
2021-11-04 22:40:45.249731T host_10_19_69_193 repmgr@repmgr 10.19.69.192(40182) client backend 57P03[2021-11-04 22:40:45 CST] 0 [5560] FATAL:  the database system is starting up
2021-11-04 22:40:45.250723T host_10_19_69_193 repmgr@repmgr 10.19.69.192(40184) client backend 57P03[2021-11-04 22:40:45 CST] 0 [5561] FATAL:  the database system is starting up
2021-11-04 22:40:47.258559T host_10_19_69_193 repmgr@repmgr 10.19.69.192(40188) client backend 57P03[2021-11-04 22:40:47 CST] 0 [5564] FATAL:  the database system is starting up
2021-11-04 22:40:47.259607T host_10_19_69_193 repmgr@repmgr 10.19.69.192(40190) client backend 57P03[2021-11-04 22:40:47 CST] 0 [5565] FATAL:  the database system is starting up
2021-11-04 22:40:49.267843T host_10_19_69_193 repmgr@repmgr 10.19.69.192(40194) client backend 57P03[2021-11-04 22:40:49 CST] 0 [5567] FATAL:  the database system is starting up
2021-11-04 22:40:49.268930T host_10_19_69_193 repmgr@repmgr 10.19.69.192(40196) client backend 57P03[2021-11-04 22:40:49 CST] 0 [5568] FATAL:  the database system is starting up
2021-11-04 22:40:49.700288T @ walreceiver 00000[2021-11-04 22:40:49 CST] 0 [5570] LOG:  started streaming WAL from primary at 3/60000000 on timeline 2
2021-11-04 22:40:49.700369T @ walreceiver XX000[2021-11-04 22:40:49 CST] 0 [5570] FATAL:  could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000300000003 has already been removed
This is usually because the original standby (now the new primary) was misconfigured and the required WAL was removed before it had been archived.
If the WAL really cannot be found anywhere, the only option is to re-initialize the node with a clone:
repmgr -h 10.19.69.193 -U repmgr -d repmgr -f lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf standby clone --dry-run
repmgr -h 10.19.69.193 -U repmgr -d repmgr -f lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf standby clone -F
repmgr -f lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf standby register -F
lt_ctl start
Note: a PostgreSQL timeline is, strictly speaking, the same idea as Oracle's resetlogs — it is as if the instance had been re-initialized. The first 8 hex digits of a WAL file name are the timeline id. Each time a point-in-time recovery, pg_rewind, or promote takes place — i.e. the server goes through ArchiveRecovery mode (the presence of recovery.signal or standby.signal enables it) — the timeline id is incremented by 1 and a NewTimelineID.history file is generated, so the first history file is 00000002.history. For point-in-time recovery (controlled by recovery_target_timeline and recovery_target_lsn), the history file records both the configured recovery target and the actual recovery point. For example:
cat 00000002.history
1	0/70000D8	after LSN 0/7000060
The fields mean the following:
1 <parentTLI>    0/70000D8 <switchpoint>    after LSN 0/7000060 <reason>
parentTLI:   ID of the parent timeline
switchpoint: XLogRecPtr of the WAL location where the switch happened
reason:      human-readable explanation of why the timeline was changed
So after multiple point-in-time recoveries it can look like the above.
Or like this:
Looking at pg_wal, you can see that the oldest files already start from 00004.
[lightdb@hs-10-19-36-9 defaultCluster]$ cd /home/lightdb/data/defaultCluster/pg_wal
[lightdb@hs-10-19-36-9 pg_wal]$ ll
total 13107224
-rw------- 1 lightdb lightdb       333 Oct 29 11:19 000000010000000000000002.00000028.backup
-rw------- 1 lightdb lightdb        42 Oct 29 12:13 00000002.history
-rw------- 1 lightdb lightdb        85 Oct 29 14:02 00000003.history
-rw------- 1 lightdb lightdb       128 Oct 29 15:49 00000004.history
-rw------- 1 lightdb lightdb 536870912 Oct 29 23:12 000000050000000800000005
-rw------- 1 lightdb lightdb 536870912 Oct 29 23:12 000000050000000800000006
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:39 000000050000000800000007
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:42 000000050000000900000000
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:44 000000050000000900000001
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:46 000000050000000900000002
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:47 000000050000000900000003
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:54 000000050000000900000004
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:49 000000050000000900000005
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:52 000000050000000900000006
-rw------- 1 lightdb lightdb 536870912 Oct 29 23:03 000000050000000900000007
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:39 000000050000000A00000000
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:39 000000050000000A00000001
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:40 000000050000000A00000002
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:41 000000050000000A00000003
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:43 000000050000000A00000004
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:46 000000050000000A00000005
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:48 000000050000000A00000006
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:51 000000050000000A00000007
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:40 000000050000000B00000000
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:41 000000050000000B00000001
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:43 000000050000000B00000002
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:45 000000050000000B00000003
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:50 000000050000000B00000004
-rw------- 1 lightdb lightdb 536870912 Oct 29 22:53 000000050000000B00000005
-rw------- 1 lightdb lightdb       171 Oct 29 16:04 00000005.history
-rw------- 1 lightdb lightdb       214 Oct 29 23:12 00000006.history
drwx------ 2 lightdb lightdb       242 Oct 29 23:12 archive_status
PS: WAL file names consist of three 8-digit hexadecimal parts, starting from 00000001 00000000 00000001, for example 00000001 (timeline id), 0000000C (logNo = LSN / 0x100000000), 000000CE (segNo within the log = (LSN / wal_segsz_bytes) % (0x100000000 / wal_segsz_bytes)). So WAL file names are generated systematically; depending on the WAL segment size the third part covers a different range, and it is not simply the yy of an LSN written as x/yyZZZZZZ.
0x100000000 = 4,294,967,296; regardless of wal_segsz_bytes, logNo increments once every 4 GB, while the number of segments per logNo varies (with the 512 MB segments shown above there are only 8 per logNo, 0 through 7).
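To sanity-check this mapping, the built-in helper can be used (a sketch, run on a primary with the stock psql client):

psql -c "SELECT pg_current_wal_lsn(),
                pg_walfile_name(pg_current_wal_lsn()),
                current_setting('wal_segment_size');"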
Let's look at the contents of each history file.
[lightdb@hs-10-19-36-9 pg_wal]$ tail -fn 100 00000005.history
1	0/800000A0	no recovery target specified   # means no recovery_target parameter was declared; see the else branch around line 7611 of xlog.c
2	0/A00000A0	no recovery target specified
3	4/C00000A0	no recovery target specified
4	4/E00000A0	no recovery target specified
^C
[lightdb@hs-10-19-36-9 pg_wal]$ cat 00000005.history
1	0/800000A0	no recovery target specified
2	0/A00000A0	no recovery target specified
3	4/C00000A0	no recovery target specified
4	4/E00000A0	no recovery target specified
[lightdb@hs-10-19-36-9 pg_wal]$ cat 00000004.history
1	0/800000A0	no recovery target specified
2	0/A00000A0	no recovery target specified
3	4/C00000A0	no recovery target specified
[lightdb@hs-10-19-36-9 pg_wal]$ cat 00000003.history
1	0/800000A0	no recovery target specified
2	0/A00000A0	no recovery target specified
[lightdb@hs-10-19-36-9 pg_wal]$ cat 00000006.history
1	0/800000A0	no recovery target specified
2	0/A00000A0	no recovery target specified
3	4/C00000A0	no recovery target specified
4	4/E00000A0	no recovery target specified
5	8/A0000000	no recovery target specified
Failure-recovery problem 2: besides the scenario above, recovery can also run into the following situation — the surviving node's repmgr keeps trying to connect to the dead node, but never performs a failover.
[2021-10-30 23:02:01] [WARNING] unable to ping "host=10.20.137.42 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 23:02:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-10-30 23:02:01] [WARNING] connection to node "10.20.137.42-defaultcluster" (ID: 2) lost
[2021-10-30 23:02:01] [DETAIL] connection pointer is NULL
[2021-10-30 23:02:01] [INFO] attempting to reconnect to node "10.20.137.42-defaultcluster" (ID: 2)
[2021-10-30 23:02:01] [ERROR] connection to database failed
[2021-10-30 23:02:01] [DETAIL] could not connect to server: Connection refused
	Is the server running on host "10.20.137.42" and accepting TCP/IP connections on port 5432?
[2021-10-30 23:02:01] [DETAIL] attempted to connect using: user=repmgr connect_timeout=2 dbname=repmgr host=10.20.137.42 port=5432 fallback_application_name=repmgr options=-csearch_path=
[2021-10-30 23:02:01] [WARNING] reconnection to node "10.20.137.42-defaultcluster" (ID: 2) failed
[2021-10-30 23:02:01] [WARNING] unable to ping "host=10.20.137.42 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-10-30 23:02:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-10-30 23:02:01] [ERROR] unable to determine if server is in recovery
[2021-10-30 23:02:01] [DETAIL] query text is: SELECT pg_catalog.pg_is_in_recovery()
[2021-10-30 23:02:01] [WARNING] unable to determine node recovery status
If no witness is configured, check whether the repmgr service is paused: when it is paused, no failover will be carried out (95% of the time this is the cause).
Finally, on how many nodes should repmgrd run — primary/standby/standby2/witness — and if it runs on all of them, how do they coordinate? In practice it works as follows:
The repmgrd on each standby monitors not only its own node but also the primary, so that a failover can be carried out when the primary fails. If there are multiple standbys, they negotiate among themselves based on LSN, Priority, and Node_ID: the candidate with the higher LSN is elected first; if the LSNs are equal, the Priority from the configuration file is compared (0 means the node does not take part in the election); if the priorities are also equal, the node with the smaller Node ID wins. With two standbys you also have to make sure the VIP moves to the correct standby; since repmgrd controls the election, it also needs to control the VIP migration, which is technically straightforward — LightDB implements VIP takeover this way, with the decision made in check_lightdb.sh. The witness, however, does not need to run repmgrd.
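A hedged repmgr.conf excerpt for the failover-related settings mentioned above (values are illustrative, not recommendations):

failover='automatic'        # or 'manual' to only log/notify instead of promoting
priority=100                # 0 = this node never takes part in the election
reconnect_attempts=6        # probes of the primary before failover is considered
reconnect_interval=10       # seconds between probes
promote_command='repmgr standby promote -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /home/lightdb/lightdb-x/13.3-21.2/etc/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'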
https://www.pgcon.org/2018/schedule/attachments/485_pg_prefaulter.pdf
https://paquier.xyz/content/materials/20150917_pgopen2015_standbys.pdf