PostgreSQL checkpoint_completion_target and the dirty-data flush process


1) Checkpoint-related parameters

checkpoint_timeout (integer)
Maximum time between automatic WAL checkpoints. If this value is specified without units, it is taken as seconds. 
The valid range is between 30 seconds and one day. The default is five minutes (5min). Increasing this parameter can increase the amount of time needed for crash recovery. 
This parameter can only be set in the postgresql.conf file or on the server command line.

checkpoint_completion_target (floating point)
Specifies the target of checkpoint completion, as a fraction of total time between checkpoints. 
The default is 0.5. This parameter can only be set in the postgresql.conf file or on the server command line.

checkpoint_flush_after (integer)
Whenever more than this amount of data has been written while performing a checkpoint,
attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, 
reducing the likelihood of stalls when an fsync is issued at the end of the checkpoint, or when the OS writes data back in larger batches in the background. 
Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, 
but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. If this value is specified without units,
it is taken as blocks, that is BLCKSZ bytes, typically 8kB. The valid range is between 0, which disables forced writeback, and 2MB. The default is 256kB on Linux, 
0 elsewhere. (If BLCKSZ is not 8kB, the default and maximum values scale proportionally to it.) This parameter can only be set in the postgresql.conf file or on the server command line.
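
To make the units of checkpoint_flush_after concrete, here is a minimal Python sketch, assuming the typical BLCKSZ of 8kB. The helper names (flush_after_blocks, writeback_hints) are hypothetical and only model the counting described above; they are not PostgreSQL functions.

```python
BLCKSZ = 8 * 1024  # assumed default block size in bytes


def flush_after_blocks(setting_kb: int, blcksz: int = BLCKSZ) -> int:
    """Convert a checkpoint_flush_after value given in kB into blocks."""
    return (setting_kb * 1024) // blcksz


def writeback_hints(blocks_written: int, flush_after: int) -> int:
    """Count the writeback hints issued in this simplified model.

    One hint is counted each time `flush_after` blocks have accumulated.
    A value of 0 disables forced writeback, as in the documentation.
    """
    if flush_after <= 0:
        return 0
    return blocks_written // flush_after


if __name__ == "__main__":
    blocks = flush_after_blocks(256)      # the Linux default of 256kB
    print(blocks)                         # 32
    print(writeback_hints(1000, blocks))  # 31
```

In this model, a checkpoint that writes 1000 blocks with the Linux default would ask the OS to start writeback 31 times before the final fsync.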

checkpoint_warning (integer)
Write a message to the server log if checkpoints caused by the filling of WAL segment files happen closer together than this amount of time 
(which suggests that max_wal_size ought to be raised). If this value is specified without units, it is taken as seconds. The default is 30 seconds (30s). 
Zero disables the warning. No warnings will be generated if checkpoint_timeout is less than checkpoint_warning. This parameter can only be set in the postgresql.conf file or on the server command line.
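
The rules for checkpoint_warning can be restated as a small predicate. This is a sketch of the documented behaviour only, not the server's code; the function name should_warn and its arguments are hypothetical.

```python
def should_warn(caused_by_wal: bool,
                seconds_since_last: float,
                checkpoint_warning: float = 30.0,
                checkpoint_timeout: float = 300.0) -> bool:
    """Return True if the documented rules would log a warning."""
    if checkpoint_warning <= 0:
        return False  # zero disables the warning
    if checkpoint_timeout < checkpoint_warning:
        return False  # timed checkpoints already come faster than the threshold
    # Only checkpoints forced by WAL filling up count, and only when
    # they follow the previous checkpoint too closely.
    return caused_by_wal and seconds_since_last < checkpoint_warning


if __name__ == "__main__":
    print(should_warn(True, 12))    # True: WAL-driven, 12s after the last one
    print(should_warn(False, 12))   # False: not caused by WAL volume
```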

max_wal_size (integer)
Maximum size to let the WAL grow to between automatic WAL checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances,
such as heavy load, a failing archive_command, or a high wal_keep_segments setting. If this value is specified without units, it is taken as megabytes.
The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.

  

2) Summary of the checkpoint_completion_target parameter:

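Roughly speaking, the larger checkpoint_completion_target is, the more often the checkpointer process gets to sleep between writes, which is how it paces the flushing of dirty blocks.

During a checkpoint, once the amount of dirty data written exceeds a threshold (checkpoint_flush_after), the checkpointer asks the OS to write that data back from the page cache to disk. (On Linux this is a writeback hint issued through sync_file_range rather than a full fsync.)

So the more the checkpointer sleeps, the less frequent these flushes are, and the I/O pressure from flushing drops a little.

Once the checkpoint has finished its writes, fsync is called to make sure everything still in the page cache reaches disk.

So the more the checkpointer sleeps, the smoother the I/O is by the time of that final fsync.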
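
The pacing idea can be sketched in Python. This is a simplified model, loosely following the time-based check the checkpointer performs before deciding to sleep; it ignores the WAL-volume check, and the function names and arguments are assumptions made for illustration.

```python
def write_phase_seconds(checkpoint_timeout: float,
                        completion_target: float) -> float:
    """Time budget the checkpoint aims to spread its writes over."""
    return checkpoint_timeout * completion_target


def is_on_schedule(buffers_written: int,
                   buffers_total: int,
                   elapsed_seconds: float,
                   checkpoint_timeout: float,
                   completion_target: float) -> bool:
    """True when the checkpoint is ahead of schedule and may sleep."""
    if buffers_total == 0:
        return True  # nothing to write
    # Scale progress so that 100% written maps onto the target fraction
    # of the checkpoint interval.
    progress = (buffers_written / buffers_total) * completion_target
    elapsed_fraction = elapsed_seconds / checkpoint_timeout
    return progress >= elapsed_fraction


if __name__ == "__main__":
    print(write_phase_seconds(300, 0.5))            # 150.0
    print(is_on_schedule(500, 1000, 60, 300, 0.5))  # True: may sleep
    print(is_on_schedule(500, 1000, 100, 300, 0.5)) # False: behind, keep writing
    print(is_on_schedule(500, 1000, 100, 300, 0.9)) # True: larger target, more sleep
```

With the same progress and elapsed time, raising the target from 0.5 to 0.9 turns "behind schedule" into "may sleep", which is the effect described above.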

 

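3) A similar mechanism is used in pg_start_backup:

A checkpoint runs either as a scheduled (paced) checkpoint or as a full-speed checkpoint (no sleeping). The second parameter of the pg_start_backup function selects whether to use the fast checkpoint mode; the default is false.

For the related logic, see: http://blog.itpub.net/6906/viewspace-2652315/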
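
As an illustration, the statement can be built with the fast flag passed as a bound parameter. This sketch only constructs the SQL and parameters in the style of a DB-API driver; the label value and the helper name start_backup_query are hypothetical, and it assumes a PostgreSQL version that still provides pg_start_backup.

```python
def start_backup_query(label: str, fast: bool = False):
    """Return (sql, params) for calling pg_start_backup.

    fast=False lets the checkpoint be paced like a scheduled one;
    fast=True requests a full-speed checkpoint with no sleeping.
    """
    return "SELECT pg_start_backup(%s, %s)", (label, fast)


if __name__ == "__main__":
    sql, params = start_backup_query("nightly", fast=True)
    print(sql)     # SELECT pg_start_backup(%s, %s)
    print(params)  # ('nightly', True)
```

The resulting pair could be handed to a driver's cursor.execute(sql, params); a fast checkpoint starts the backup sooner at the cost of a burst of I/O.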

 

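4) Dirty pages are handled in several steps:

First, the background writer writes modified pages in shared buffers (that is, dirty pages) into the operating system page cache by calling write. As the function BgBufferSync shows, PG's background writer process scans shared buffers following the LRU list (in practice only a portion per round) and, when it finds a dirty page, issues the write system call. The bgwriter_delay parameter controls the interval between scan rounds. After calling write on a page, the background writer records the file that the page belongs to (actually a segment of the table; a table may have several segments, each a separate physical file) in the shared-memory array CheckpointerShmem->requests. The call sequence is:
BackgroundWriterMain -> BgBufferSync -> SyncOneBuffer -> FlushBuffer -> smgrwrite
                                                                           |
                                                                           V
                  ForwardFsyncRequest <- register_dirty_segment <- mdwrite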
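
The background-writer step above can be sketched as a toy model in Python. The Buffer class, the segment names, and bgwriter_round are hypothetical; clearing the dirty flag stands in for write() into the page cache, and appending to the queue stands in for ForwardFsyncRequest.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Buffer:
    segment: str         # table segment file this page belongs to
    dirty: bool = False


def bgwriter_round(buffers, start, scan_count, fsync_requests):
    """Scan part of the buffer pool, write dirty pages, queue fsync requests.

    Returns the position where the next round should resume.
    """
    n = len(buffers)
    if n == 0:
        return 0
    for i in range(scan_count):
        buf = buffers[(start + i) % n]
        if buf.dirty:
            buf.dirty = False                   # models write(): page cache only
            fsync_requests.append(buf.segment)  # remember the file for a later fsync
    return (start + scan_count) % n


if __name__ == "__main__":
    pool = [Buffer("t1.0", True), Buffer("t1.0"),
            Buffer("t1.1", True), Buffer("t2.0", True)]
    requests = deque()
    next_pos = bgwriter_round(pool, 0, 3, requests)
    print(list(requests))   # ['t1.0', 't1.1']
    print(next_pos)         # 3
    print(pool[3].dirty)    # True: left for a later round or the checkpoint
```

Note that no fsync happens here; the round only records which files will need one.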


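The checkpointer process eventually reads the CheckpointerShmem->requests array, collects these requests, and puts them into pendingOpsTable. The work of actually forcing dirty pages to disk is done by the checkpointer. On each checkpoint it also calls smgrwrite to write all remaining dirty pages in shared buffers (those not yet cleaned by the background writer) into the OS page cache and records them in pendingOpsTable, so pendingOpsTable covers every dirty page that has been written, including those already handled by the background writer. The checkpointer then walks the entries in pendingOpsTable and syncs them to disk (note that each fsync call syncs one file of a table, and all dirty pages in that file are written to disk). The call sequence is:
CheckPointGuts -> CheckPointBuffers -> ... -> mdsync -> pg_fsync -> fsync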
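
The checkpointer side can be modelled the same way. In this sketch a Python set plays the role of pendingOpsTable; the function name run_checkpoint and the segment names are hypothetical, and real writes and fsyncs are replaced by counters.

```python
def run_checkpoint(dirty_pages, fsync_requests):
    """Model one checkpoint.

    dirty_pages: (segment, page_no) pairs still dirty in shared buffers.
    fsync_requests: segments already written by the background writer.
    Returns (pages_written, segments_fsynced).
    """
    pending_ops = set(fsync_requests)  # absorb the queued requests
    pages_written = 0
    for segment, _page_no in dirty_pages:
        pages_written += 1             # models write() into the page cache
        pending_ops.add(segment)       # this file now needs an fsync too
    # One fsync per file, no matter how many of its pages were written.
    return pages_written, sorted(pending_ops)


if __name__ == "__main__":
    written, synced = run_checkpoint(
        dirty_pages=[("t2.0", 7), ("t1.0", 3)],
        fsync_requests=["t1.0", "t1.1", "t1.0"],
    )
    print(written)  # 2
    print(synced)   # ['t1.0', 't1.1', 't2.0']
```

The set shows why duplicate requests are cheap: a file written many times by the background writer and the checkpointer is still synced once.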


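If the checkpointer writes to disk too frequently, each round may write only a small amount of data. Disks handle large sequential writes far more efficiently than random writes, and writing a little data each time produces many random writes; if we lower the checkpoint frequency, multiple scattered pages have a chance to be combined into one sequential batch write, which is much more efficient. In addition, checkpoints perform fsync, and a large number of fsync calls can stall system I/O and hurt stability, so checkpoints should not be too frequent. But the checkpoint interval cannot grow without bound either: after a crash, recovery must start from the last checkpoint, so an overly long interval makes recovery slow and reduces availability.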
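
A back-of-the-envelope calculation illustrates the trade-off. This is a crude model with a hypothetical constant WAL generation rate; it ignores checkpoints triggered by max_wal_size and the time a checkpoint itself takes.

```python
def checkpoints_per_hour(checkpoint_timeout_s: float) -> float:
    """How many timed checkpoints fit into one hour."""
    return 3600 / checkpoint_timeout_s


def rough_replay_mb(wal_rate_mb_per_s: float,
                    checkpoint_timeout_s: float) -> float:
    """Rough amount of WAL to replay if the crash comes just before
    the next timed checkpoint."""
    return wal_rate_mb_per_s * checkpoint_timeout_s


if __name__ == "__main__":
    for timeout in (300, 1800):
        print(timeout,
              checkpoints_per_hour(timeout),
              rough_replay_mb(2, timeout))
    # 300 12.0 600
    # 1800 2.0 3600
```

Under these assumptions, stretching the interval from 5 to 30 minutes cuts checkpoints from 12 to 2 per hour but multiplies the WAL that recovery may have to replay by six.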

 

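5) Summary:

1) bgwriter periodically writes out some dirty data by calling write directly and then records this in shared memory. fsync is not called at this point.
2) At checkpoint time, the checkpointer learns which dirty blocks bgwriter has already written and only writes the dirty blocks that have not been written yet.
3) During the checkpoint, each time more than a certain amount of dirty data has been written (checkpoint_flush_after), a flush to disk is triggered (on Linux a writeback request to the OS rather than a full fsync). In the end all data is flushed to disk.

4) The checkpoint has only one way to get shared_buffers to disk: buffered I/O, that is, write followed by fsync. The WAL writer has two families of methods for flushing wal_buffers (selected by wal_sync_method): the open_* methods open the WAL file with synchronous-write flags such as O_SYNC/O_DSYNC (combined with O_DIRECT where PostgreSQL can use it), while the f* methods call fsync or a similar function at commit to flush the page cache.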
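
The two styles in point 4 can be shown at the system-call level with Python's os module. This is a minimal sketch, not PostgreSQL's code: the file names are hypothetical, and the fallback from O_DSYNC to O_SYNC (or to no flag on platforms that have neither) is an assumption made so the example runs everywhere.

```python
import os
import tempfile


def _write_all(fd: int, data: bytes) -> None:
    """Keep calling write() until every byte has been handed to the OS."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_then_fsync(path: str, data: bytes) -> None:
    """Buffered style: write() into the page cache, then fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        _write_all(fd, data)  # data may still be only in the page cache
        os.fsync(fd)          # force it to stable storage
    finally:
        os.close(fd)


def write_synchronously(path: str, data: bytes) -> None:
    """open_* style: the sync flag makes each write() wait for the disk."""
    sync_flag = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "datafile")
        b = os.path.join(tmp, "walfile")
        write_then_fsync(a, b"page contents")
        write_synchronously(b, b"wal record")
        print(os.path.getsize(a), os.path.getsize(b))  # 13 10
```

Both functions leave the same bytes on disk; they differ in when the caller waits for durability, once at fsync or on every write.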

 

