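Background:
While rebuilding a standby in production, I ran:
pg_basebackup -D <target_data_directory> -P -v --wal-method=stream
The data directory stopped growing, yet the pg_basebackup process was still alive, so the backup appeared to hang. That prompted me to check what pg_basebackup needs before it can start, and what the official documentation says:
At the beginning of the backup, a checkpoint needs to be written on the server the backup is taken from. Especially if the option --checkpoint=fast is not used, this can take some time, during which pg_basebackup will appear to be idle.
So the backup was most likely stuck in the checkpoint phase.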
    /*
     * Start the actual backup
     */
    PQescapeStringConn(conn, escaped_label, label, sizeof(escaped_label), &i);

    if (maxrate > 0)
        maxrate_clause = psprintf("MAX_RATE %u", maxrate);

    if (verbose)
        pg_log_info("initiating base backup, waiting for checkpoint to complete");

    if (showprogress && !verbose)
    {
        fprintf(stderr, "waiting for checkpoint");
        if (isatty(fileno(stderr)))
            fprintf(stderr, "\r");
        else
            fprintf(stderr, "\n");
    }
So before the backup actually starts, pg_basebackup waits for a checkpoint.
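One way to confirm that a running backup is in this wait is to look at the progress view on the primary. The sketch below only prints a query; it assumes PostgreSQL 13 or later, where the pg_stat_progress_basebackup view exists, and the helper name basebackup_phase_sql is made up for illustration. A row whose phase reads "waiting for checkpoint to finish" would match the symptom described here.

```shell
#!/bin/sh
# Print (do not run) a query that shows the phase of each running base backup.
# Assumption: the server is PostgreSQL 13+, which provides pg_stat_progress_basebackup.
basebackup_phase_sql() {
    printf '%s\n' "SELECT pid, phase, backup_streamed, backup_total FROM pg_stat_progress_basebackup;"
}

# Example invocation on the primary (connection settings are assumptions):
#   psql -X -d postgres -c "$(basebackup_phase_sql)"
```

On older servers this view is not available, so the check does not apply there.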
So when is a checkpoint (ckp) triggered?
/*
 * RequestCheckpoint
 *      Called in backend processes to request a checkpoint
 *
 * flags is a bitwise OR of the following:
 *  CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
 *  CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
 *  CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
 *      ignoring checkpoint_completion_target parameter.
 *  CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
 *      since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
 *      CHECKPOINT_END_OF_RECOVERY).
 *  CHECKPOINT_WAIT: wait for completion before returning (otherwise,
 *      just signal checkpointer to do it, and return).
 *  CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
 *      (This affects logging, and in particular enables CheckPointWarning.)
 */
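- When the database shuts down
- When pg_basebackup starts a base backup
- When checkpoint_timeout is reached
- When the WAL volume threshold derived from checkpoint_completion_target and max_wal_size is reached
- When a manual CHECKPOINT is issued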
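A scheduled (non-immediate) checkpoint is paced by these parameters, and until that checkpoint completes pg_basebackup simply waits, which is why it looked stuck.
To get the backup going immediately, I ran CHECKPOINT manually on the primary, and the data directory started growing right away.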
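The manual step can be sketched as a small wrapper around psql. This is only an illustration: the function name force_checkpoint, the PSQL override variable, and the default database name "postgres" are assumptions, and the connecting role needs the privilege to run CHECKPOINT (typically a superuser).

```shell
#!/bin/sh
# PSQL names the client binary; overriding it is only a convenience for dry runs.
PSQL=${PSQL:-psql}

# Issue an immediate checkpoint on whichever server psql connects to.
# $1 = database name (defaults to "postgres", an assumption).
force_checkpoint() {
    "$PSQL" -X -d "${1:-postgres}" -c 'CHECKPOINT;'
}

# Example, run against the primary (host is a placeholder):
#   PGHOST=<primary_host> force_checkpoint
```

Connection details are left to the usual libpq environment variables (PGHOST, PGPORT, PGUSER) so the wrapper stays minimal.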
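What happens once a checkpoint completes:
- Dirty pages are flushed to disk
- WAL written before this checkpoint is no longer needed for crash recovery and can be cleaned up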
Checkpoint-related parameters:
postgres=# select name, short_desc from pg_settings where name like '%checkpoint%';
             name             |                                        short_desc
------------------------------+------------------------------------------------------------------------------------------
 checkpoint_completion_target | Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval.
 checkpoint_flush_after       | Number of pages after which previously performed writes are flushed to disk.
 checkpoint_timeout           | Sets the maximum time between automatic WAL checkpoints.
 checkpoint_warning           | Enables warnings if checkpoint segments are filled more frequently than this.
 log_checkpoints              | Logs each checkpoint.
(5 rows)
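checkpoint_completion_target:
A checkpoint happens every 5 minutes (the default checkpoint_timeout) or whenever the max_wal_size threshold is reached. During the checkpoint, every dirty page in shared buffers has to be flushed to disk, which can cause a large I/O spike.
checkpoint_completion_target comes to the rescue here.
It slows the flushing down: PostgreSQL aims to spend checkpoint_completion_target * checkpoint_timeout writing the data.
For example, with checkpoint_completion_target set to 0.5 and checkpoint_timeout at 5 minutes, the database throttles its writes so that the last write finishes after about 2.5 minutes.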
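The arithmetic in that example can be checked with a tiny helper. This is a sketch, not anything PostgreSQL ships: the function name checkpoint_write_window is made up, and it only covers checkpoints triggered by checkpoint_timeout, not those triggered by WAL volume.

```shell
#!/bin/sh
# Seconds over which a timed checkpoint's writes are spread.
# $1 = checkpoint_timeout in seconds
# $2 = checkpoint_completion_target (a fraction between 0 and 1)
checkpoint_write_window() {
    awk -v timeout="$1" -v target="$2" 'BEGIN { printf "%.0f\n", timeout * target }'
}

checkpoint_write_window 300 0.5   # prints 150 (seconds, i.e. 2.5 minutes)
```

With a 300-second timeout and a target of 0.5, the writes are spread over roughly the first half of the interval.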
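checkpoint_timeout:
The maximum time between automatic WAL checkpoints.
checkpoint_flush_after:
Whenever more than checkpoint_flush_after bytes have been written while performing a checkpoint, PostgreSQL attempts to force the OS to flush those writes to the underlying storage. This limits the amount of dirty data in the kernel page cache and reduces the chance of a stall when fsync is issued at the end of the checkpoint.
This setting may have no effect on some platforms.
What checkpoints are for:
They shorten crash recovery and ease the performance pressure on the server.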
pg_basebackup options:
To start the backup without waiting for a spread checkpoint, pass -c fast (long form --checkpoint=fast). The option accepts fast or spread.
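Putting the option together with the command from the beginning of this post, a sketch of the full invocation might look like the following. The helper only prints the command line instead of running it; the function name basebackup_fast_cmd and the host, user, and directory values in the example are hypothetical.

```shell
#!/bin/sh
# Print (do not run) a pg_basebackup command line that requests a fast checkpoint.
# $1 = target data directory for the new standby (hypothetical path)
# $2 = primary host (hypothetical)
# $3 = replication user (hypothetical)
basebackup_fast_cmd() {
    printf 'pg_basebackup -h %s -U %s -D %s -P -v --wal-method=stream --checkpoint=fast\n' "$2" "$3" "$1"
}

# Example:
#   basebackup_fast_cmd /data/standby primary.example replicator
```

A fast checkpoint flushes dirty pages as quickly as possible, so it can add I/O load on the primary; that trade-off is the reason spread remains available.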