Checking PostgreSQL's synchronous_commit parameter, and some thoughts on streaming replication

Reposted. 2020-03-09 16:18, PostgreSQL

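When reading the official manual, we often notice that the description of synchronous_commit introduces on before remote_apply, and conclude that on is a stricter level than remote_apply. That is not the case.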

The official documentation says:

synchronous_commit (enum)
Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a “success” indication to the client. Valid values are on, remote_apply, remote_write, local, and off. The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 29.3.

If synchronous_standby_names is non-empty, this parameter also controls whether or not transaction commits will wait for their WAL records to be replicated to the standby server(s).
When set to on, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to disk. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage.
When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s).
When set to remote_write, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it out to their operating system. This setting is sufficient to ensure data preservation even if a standby instance of PostgreSQL were to crash, but not if the standby suffers an operating-system-level crash, since the data has not necessarily reached stable storage on the standby.
Finally, the setting local causes commits to wait for local flush to disk, but not for replication. This is not usually desirable when synchronous replication is in use, but is provided for completeness.

If synchronous_standby_names is empty, the settings on, remote_apply, remote_write and local all provide the same synchronization level: transaction commits only wait for local flush to disk.

This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction.
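
The rules quoted above can be condensed into a small lookup. The following is a minimal sketch in standalone C, not PostgreSQL source: SyncCommitSetting, CommitWait and commit_waits_for are hypothetical names invented for illustration. It returns the strongest event a commit waits for, given the setting and whether synchronous_standby_names is non-empty.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; these are not PostgreSQL identifiers. */
typedef enum
{
	SETTING_OFF,
	SETTING_LOCAL,
	SETTING_REMOTE_WRITE,
	SETTING_ON,
	SETTING_REMOTE_APPLY
} SyncCommitSetting;

typedef enum
{
	WAIT_NOTHING,			/* success is reported before the local flush */
	WAIT_LOCAL_FLUSH,		/* local WAL flushed to disk */
	WAIT_STANDBY_WRITE,		/* standby wrote the WAL to its operating system */
	WAIT_STANDBY_FLUSH,		/* standby flushed the WAL to disk */
	WAIT_STANDBY_APPLY		/* standby replayed the commit record */
} CommitWait;

/*
 * Strongest event a commit waits for.  The remote levels also include the
 * local flush; only the strongest requirement is returned here.
 */
CommitWait
commit_waits_for(SyncCommitSetting setting, bool have_sync_standbys)
{
	if (setting == SETTING_OFF)
		return WAIT_NOTHING;

	/* With no synchronous standbys, every other setting means local flush. */
	if (!have_sync_standbys || setting == SETTING_LOCAL)
		return WAIT_LOCAL_FLUSH;

	switch (setting)
	{
		case SETTING_REMOTE_WRITE:
			return WAIT_STANDBY_WRITE;
		case SETTING_ON:
			return WAIT_STANDBY_FLUSH;
		case SETTING_REMOTE_APPLY:
			return WAIT_STANDBY_APPLY;
		default:
			return WAIT_LOCAL_FLUSH;
	}
}
```

For example, commit_waits_for(SETTING_ON, false) yields WAIT_LOCAL_FLUSH, while commit_waits_for(SETTING_REMOTE_APPLY, true) yields WAIT_STANDBY_APPLY.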

But the order in the code is as follows:

src\backend\replication\syncrep.c SyncRepReleaseWaiters

/*
	 * Set the lsn first so that when we wake backends they will release up to
	 * this location.
	 */
	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
	{
		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
	}
	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
	{
		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
	}
	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
	{
		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
	}

	LWLockRelease(SyncRepLock);

src\include\access\xact.h

typedef enum
{
	SYNCHRONOUS_COMMIT_OFF,		/* asynchronous commit */
	SYNCHRONOUS_COMMIT_LOCAL_FLUSH, /* wait for local flush only */
	SYNCHRONOUS_COMMIT_REMOTE_WRITE,	/* wait for local flush and remote
										 * write */
	SYNCHRONOUS_COMMIT_REMOTE_FLUSH,	/* wait for local and remote flush */
	SYNCHRONOUS_COMMIT_REMOTE_APPLY /* wait for local flush and remote apply */
}			SyncCommitLevel;

/* Define the default setting for synchronous_commit */
#define SYNCHRONOUS_COMMIT_ON	SYNCHRONOUS_COMMIT_REMOTE_FLUSH

/* Synchronous commit level */
extern int	synchronous_commit;

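on maps to SYNCHRONOUS_COMMIT_REMOTE_FLUSH. Judging by the enum order, and by the order of the if blocks (write, flush, apply), remote_apply is the last and therefore the strictest level, above on.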
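
To make the ordering concrete, here is a simplified standalone sketch. The enum is copied from the xact.h excerpt above; StandbyProgress and waiter_can_be_released are hypothetical names for illustration, and the real SyncRepWakeQueue works on shared-memory queues rather than a single function call. It assumes a waiter is released once the standby position for its level has reached the waiter's commit LSN.

```c
#include <stdbool.h>
#include <stdint.h>

/* Copied from the xact.h excerpt above. */
typedef enum
{
	SYNCHRONOUS_COMMIT_OFF,
	SYNCHRONOUS_COMMIT_LOCAL_FLUSH,
	SYNCHRONOUS_COMMIT_REMOTE_WRITE,
	SYNCHRONOUS_COMMIT_REMOTE_FLUSH,
	SYNCHRONOUS_COMMIT_REMOTE_APPLY
} SyncCommitLevel;

#define SYNCHRONOUS_COMMIT_ON	SYNCHRONOUS_COMMIT_REMOTE_FLUSH

/* Hypothetical struct: positions reported by a synchronous standby. */
typedef struct
{
	uint64_t	write_lsn;		/* written to the standby's OS */
	uint64_t	flush_lsn;		/* flushed to the standby's disk */
	uint64_t	apply_lsn;		/* replayed on the standby */
} StandbyProgress;

/* Hypothetical helper: may a backend waiting for commit_lsn be released? */
bool
waiter_can_be_released(SyncCommitLevel level, uint64_t commit_lsn,
					   StandbyProgress progress)
{
	switch (level)
	{
		case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
			return progress.write_lsn >= commit_lsn;
		case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
			return progress.flush_lsn >= commit_lsn;
		case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
			return progress.apply_lsn >= commit_lsn;
		default:
			/* off and local never wait for the standby */
			return true;
	}
}
```

On a standby, apply_lsn normally trails flush_lsn, which trails write_lsn, so a remote_apply waiter is the last to be released for the same commit LSN.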

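A few more questions need to be confirmed:

1. Does the WAL receiver obtain WAL changes in units of WAL records or in units of pages?

-- Probably per record. Every operation that generates WAL produces a record (each insert produces its own record, rather than one record per transaction; a transaction can generate a great many records, and they are shipped to the standby even if the transaction never commits; aborting a transaction also produces a record). Each record has an LSN, and the current write position can be checked with pg_current_wal_lsn() (PostgreSQL 10 and later; pg_current_xlog_location() in earlier releases). The record is shipped to the standby without waiting for the transaction to commit, rather than all of a transaction's records being shipped together at commit time.

2. What triggers sending a new WAL record to the standby?

-- Inserting a row produces a record, which first sits in wal_buffers. After it has been flushed to the WAL file (XLogBackgroundFlush in xlog.c wakes up the WAL sender), the WAL changes are read back and added to the TCP queue.

3. Why does the standby replay from the WAL files instead of from wal_buffers? The data in the WAL files is most likely still in memory, so replay should not cause disk reads, but there is still some overhead.

-- Replaying records straight from wal_buffers would be hard to control, whereas replaying WAL files is much simpler?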

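So a record must be flushed to disk before it can be sent to the standby. When does that flush happen?

WAL is flushed periodically, not only at commit. The operations that trigger a flush of wal_buffers are:

1) A commit.

2) A checkpoint, which ensures that the WAL of committed transactions has reached disk.

3) The WAL writer waking up after wal_writer_delay. It flushes if at least wal_writer_delay has passed since the last flush, or if at least wal_writer_flush_after of WAL has been produced since then; otherwise it only writes the WAL to the operating system.

The background WAL flushing process and its related parameters: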

wal_buffers (integer) Defaults to -1, which means 1/32 of shared_buffers, but not less than 64kB and not more than one WAL segment, typically 16MB.

The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment, typically 16MB. This value can be set manually if the automatic choice is too large or too small, but any positive value less than 32kB will be treated as 32kB. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. This parameter can only be set at server start.

The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.
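
The auto-tuning rule quoted above (1/32 of shared_buffers, clamped between 64kB and one WAL segment) can be written as a small calculation. This is a sketch under those stated rules only; auto_wal_buffers_bytes is a hypothetical helper working in bytes, not a PostgreSQL function.

```c
#include <stdint.h>

#define KB						UINT64_C(1024)
#define MIN_AUTO_WAL_BUFFERS	(64 * KB)	/* lower bound from the manual */

/*
 * Hypothetical helper: size chosen when wal_buffers = -1.
 * All sizes are in bytes.
 */
uint64_t
auto_wal_buffers_bytes(uint64_t shared_buffers_bytes,
					   uint64_t wal_segment_bytes)
{
	uint64_t	size = shared_buffers_bytes / 32;

	/* never more than one WAL segment (typically 16MB) */
	if (size > wal_segment_bytes)
		size = wal_segment_bytes;

	/* never less than 64kB */
	if (size < MIN_AUTO_WAL_BUFFERS)
		size = MIN_AUTO_WAL_BUFFERS;

	return size;
}
```

With shared_buffers at 128MB and 16MB segments this gives 4MB; with shared_buffers at 1GB the raw result of 32MB is capped at 16MB.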


wal_writer_delay (integer) 
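How often the background walwriter flushes wal_buffers to disk. It interacts with wal_writer_flush_after: if the walwriter is woken up less than wal_writer_delay (200ms by default) after the last flush, and less than wal_writer_flush_after (1MB by default) of new WAL has been produced since, it only writes the WAL to the operating system and skips the flush to disk.
On a highly concurrent system, set this parameter lower; 10ms is suggested.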

Specifies how often the WAL writer flushes WAL, in time terms. After flushing WAL the writer sleeps for the length of time given by wal_writer_delay, unless woken up sooner by an asynchronously committing transaction. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If this value is specified without units, it is taken as milliseconds. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

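wal_writer_flush_after (integer) The amount of new WAL produced since the last flush that makes the walwriter flush again, even if wal_writer_delay has not yet elapsed.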

Specifies how often the WAL writer flushes WAL, in volume terms. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If wal_writer_flush_after is set to 0 then WAL data is always flushed immediately. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. The default is 1MB. This parameter can only be set in the postgresql.conf file or on the server command line.
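
The interaction between wal_writer_delay and wal_writer_flush_after described in the two manual excerpts above can be sketched as a decision function. This is an illustration of the documented rule only, assuming there is unflushed WAL to handle; WalWriterAction and wal_writer_action are hypothetical names, and the real logic in xlog.c has more cases.

```c
#include <stdint.h>

/* Hypothetical names for this sketch. */
typedef enum
{
	WAL_WRITE_ONLY,			/* hand WAL to the operating system only */
	WAL_WRITE_AND_FLUSH		/* write and flush to disk */
} WalWriterAction;

/*
 * Decide what the WAL writer does on one wakeup, following the manual:
 * write only when the last flush was less than wal_writer_delay ago AND
 * less than wal_writer_flush_after of WAL was produced since; flush otherwise.
 */
WalWriterAction
wal_writer_action(int64_t ms_since_last_flush,
				  int64_t bytes_since_last_flush,
				  int64_t wal_writer_delay_ms,
				  int64_t wal_writer_flush_after_bytes)
{
	/* wal_writer_flush_after = 0 means WAL is always flushed immediately */
	if (wal_writer_flush_after_bytes == 0)
		return WAL_WRITE_AND_FLUSH;

	if (ms_since_last_flush >= wal_writer_delay_ms)
		return WAL_WRITE_AND_FLUSH;

	if (bytes_since_last_flush >= wal_writer_flush_after_bytes)
		return WAL_WRITE_AND_FLUSH;

	return WAL_WRITE_ONLY;
}
```

With the defaults (200ms, 1MB), a wakeup 50ms after the last flush with 10kB of new WAL only writes, while the same wakeup with 2MB of new WAL flushes.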

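commit_delay (integer) Defaults to 0. On a highly concurrent system every commit flushes wal_buffers to disk, and flushing that often is not necessarily good for performance, so a commit can wait briefly and let several commits be flushed together. Setting a delay does not guarantee that it takes effect: that also depends on how many transactions are currently in progress. The wait happens only when at least commit_siblings other transactions are active.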

Setting commit_delay adds a time delay before a WAL flush is initiated. This can improve group commit throughput by allowing a larger number of transactions to commit via a single WAL flush, if system load is high enough that additional transactions become ready to commit within the given interval. However, it also increases latency by up to the commit_delay for each WAL flush. Because the delay is just wasted if no other transactions become ready to commit, a delay is only performed if at least commit_siblings other transactions are active when a flush is about to be initiated. Also, no delays are performed if fsync is disabled. If this value is specified without units, it is taken as microseconds. The default commit_delay is zero (no delay). Only superusers can change this setting.

In PostgreSQL releases prior to 9.3, commit_delay behaved differently and was much less effective: it affected only commits, rather than all WAL flushes, and waited for the entire configured delay even if the WAL flush was completed sooner. Beginning in PostgreSQL 9.3, the first process that becomes ready to flush waits for the configured interval, while subsequent processes wait only until the leader completes the flush operation.

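commit_siblings (integer) Usually set somewhat higher; think about how many concurrent transactions a highly concurrent system has. Whether the commit delay takes effect depends on whether the number of transactions that have not yet committed reaches this value.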

Minimum number of concurrent open transactions to require before performing the commit_delay delay. A larger value makes it more probable that at least one other transaction will become ready to commit during the delay interval. The default is five transactions.
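
The conditions under which commit_delay actually causes a wait, as stated in the manual text above, can be summarized in one predicate. This is a minimal sketch; should_sleep_commit_delay and its parameter names are hypothetical, and counting the other active transactions is left to the caller.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: does a backend that is about to flush WAL sleep
 * for commit_delay first?  Per the manual, the delay applies only when
 * commit_delay is non-zero, fsync is enabled, and at least
 * commit_siblings other transactions are active.
 */
bool
should_sleep_commit_delay(int commit_delay_us,
						  bool fsync_enabled,
						  int other_active_xacts,
						  int commit_siblings)
{
	return commit_delay_us > 0 &&
		fsync_enabled &&
		other_active_xacts >= commit_siblings;
}
```

With the default commit_siblings of 5, a commit_delay of 1000 microseconds has no effect while only three other transactions are active.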

Do not confuse these with max_wal_size:

max_wal_size (integer) Roughly, once this much WAL has been produced with no checkpoint in between, a checkpoint is triggered.
Maximum size to let the WAL grow to between automatic WAL checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances, such as heavy load, a failing archive_command, or a high wal_keep_segments setting. If this value is specified without units, it is taken as megabytes. The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.

So when exactly are the generated records sent to the standby over the TCP connection, and what triggers it? We will look into that next time.
