lightdb中clog(commit log)內部實現、及與wal、commit的關系及細節


  WAL相當於oracle中的redo log,mysql中的redolog,9.6及之前名為xlog,10+當前在pg_wal文件夾中,wal段默認大小為16M,在initdb時可以指定大小,后續原則上不可以修改。可以通過pg_waldump查看二進制日志的內容。wal的結構解析https://www.cnblogs.com/abclife/p/13708947.html,雖然不完全正確,比如LSN的物理文件ID解析就不正確)。wal的物理結構(定義在src/include/access/xlogrecord.h)如下:

typedef struct XLogRecord
{
    uint32        xl_tot_len;        /* total len of entire record */
    TransactionId xl_xid;        /* xact id */
    XLogRecPtr    xl_prev;        /* ptr to previous record in log */
    uint8        xl_info;        /* flag bits, see below */
    RmgrId        xl_rmid;        /* resource manager for this record */
    /* 2 bytes of padding here, initialize to zero */
    pg_crc32c    xl_crc;            /* CRC for this record */

    /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

typedef struct XLogRecordBlockHeader
{
    uint8        id;                /* block reference ID */
    uint8        fork_flags;        /* fork within the relation, and flags */
    uint16        data_length;    /* number of payload bytes (not including page
                                 * image) */

    /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
    /* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
    /* BlockNumber follows */
} XLogRecordBlockHeader;

  其中BlockHeader和block data一一對應,其中場景之一是第一次寫入塊或checkpoint之后第一次修改塊時會生成block data。所以每個日志條目的最小長度為24字節+n字節(記錄null標記,1個bit 1位,所以等於n字段數/8)+main data。

  具體為:解析wal記錄的時候,首先讀取XLogRecord,並根據xl_tot_len讀取整條記錄,然后根據其中的xl_rmid判斷當前是哪個RMGR,調用具體rmgr read回調,RMGR首先根據xl_info(info是第二個關鍵口子,也是RMGR后的第一個關鍵依據)判斷接下去是XLogRecordBlockHeader(0個或多個)、XLogRecordDataHeader(有且只有一個)還是其它(就是一事一議)。

   WAL歸檔的執行過程可見https://wiki.moritetu.xyz/?PostgreSQL/%E8%A7%A3%E6%9E%90/WAL%E3%82%A2%E3%83%BC%E3%82%AB%E3%82%A4%E3%83%96

   clog(全稱Commit Log,PostgreSQL transaction-commit-log manager,主要在clog.c中實現)里面記錄了事務的執行狀態,每次事務提交和回滾的時候,都需要更新該狀態(調用CommitTransactionCommand(void)),PostgreSQL服務器訪問該文件確定事務的狀態,保存在pg_xact目錄中,每個文件大小為256KB,每個事務2位(bit),故1個文件可以包含131072個事務。對於第一次修改的數據行來說,因為事務狀態存儲在clog中,所以修改后第一次判斷行的可見性需要通過訪問clog來確定,而訪問clog是一個非常耗費性能的過程,故關於clog訪問優化,有一個很長的discussion。

  注:執行層的操作lt_xact文件的接口函數主要在transam.c中,clog.c是底層實現。

  事務在clog中的狀態包括:

/*
 * Possible transaction statuses --- note that all-zeroes is the initial
 * state.
 *
 * A "subcommitted" transaction is a committed subtransaction whose parent
 * hasn't committed or aborted yet.
 */
typedef int XidStatus;

#define TRANSACTION_STATUS_IN_PROGRESS        0x00
#define TRANSACTION_STATUS_COMMITTED        0x01
#define TRANSACTION_STATUS_ABORTED            0x02
#define TRANSACTION_STATUS_SUB_COMMITTED    0x03

  在clog.c中。

  因為pg的MVCC在文件中實現undo,即使事務回滾了,新創建的行也不會被刪除,但是因為clog中記錄了事務的執行狀態,所以其他事務在xmin和xmax判斷時候可以過濾掉或不過濾掉這些記錄(主要是xmax=0的情況,因為此時可能提交了、也可能稍微提交)。

pg_xact(9.6及之前名為pg_clog,雖然代碼中還是clog.c)

[postgres@hs-10-20-30-194 pg_xact]$ ll
total 13208
-rw------- 1 postgres postgres 262144 May 24 17:26 0000
-rw------- 1 postgres postgres 262144 May 24 17:26 0001
-rw------- 1 postgres postgres 262144 May 24 17:27 0002
-rw------- 1 postgres postgres 262144 May 24 17:27 0003
-rw------- 1 postgres postgres 262144 May 24 17:27 0004
-rw------- 1 postgres postgres 262144 May 24 17:28 0005
-rw------- 1 postgres postgres 262144 May 24 17:28 0006
-rw------- 1 postgres postgres 262144 May 24 17:28 0007

  clog和wal的交互:這得先理解事務的完整過程。

  

  在AM層,調用xlog相關接口將WAL條目寫入WAL文件,PortalDrop清理執行完成后,主入口exec_simple_query()->finish_xact_command()會依次調用CommitTransactionCommand()->CommitTransaction()->RecordTransactionCommit()->XactLogCommitRecord()調用XLogInsert()將commit wal條目寫入WAL文件,然后RecordTransactionCommit()調用XLogFlush刷新commit WAL日志,然后調用TransactionIdCommitTree()更新clog。TransactionIdCommitTree->TransactionIdSetTreeStatus->TransactionIdSetPageStatus->TransactionIdSetPageStatusInternal,然后根據pageno找到slotno(使用slru簡單最近最少訪問算法管理),調用TransactionIdSetStatusBit(其根據xid找到偏移量,然后進行位運算更新事務狀態)

 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
 * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
 * and CLOG segment numbering at
 * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
 * explicit notice of that fact in this module, except when comparing segment
 * and page numbers in TruncateCLOG (see CLOGPagePrecedes).
 */

/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT    2
#define CLOG_XACTS_PER_BYTE 4   每字節包含的事務數
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)    每BLOCK包含的事務數,32768
#define CLOG_XACT_BITMASK    ((1 << CLOG_BITS_PER_XACT) - 1)   0x11

#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)    根據事務ID找到頁,事務ID 整除 32768
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)    頁內事務相對順序號偏移量 事務ID 取余 32768
#define TransactionIdToByte(xid)    (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)  頁內字節偏移量
#define TransactionIdToBIndex(xid)    ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)     字節內事務相對順序號偏移量  事務ID 取余 4

 

 /* We store the latest async LSN for each group of transactions *

#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)

#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)

  因為更新clog是內存中進行的,不會刷盤,那問題來了。1、重啟恢復的時候哪里會用到?2、判斷元祖可見性的時候哪里會調用到?

  所有的元祖在被fetch時,都會檢查xmin、xmax是否已經提交,如果infomask_2上沒有標記的話,就回去clog緩存區查詢,如下:

TransactionIdGetStatus clog.c:654
TransactionLogFetch transam.c:79
TransactionIdDidCommit transam.c:129
HeapTupleSatisfiesMVCC heapam_visibility.c:1058
HeapTupleSatisfiesVisibility heapam_visibility.c:1695
heapgetpage heapam.c:476
heapgettup_pagemode heapam.c:917
heap_getnextslot heapam.c:1390
table_scan_getnextslot tableam.h:906
SeqNext nodeSeqscan.c:80
ExecScanFetch execScan.c:133
ExecScan execScan.c:182
ExecSeqScan nodeSeqscan.c:112
ExecProcNodeFirst execProcnode.c:454
ExecProcNode executor.h:248
ExecutePlan execMain.c:1632
standard_ExecutorRun execMain.c:350
CitusExecutorRun multi_executor.c:214
pgss_ExecutorRun pg_stat_statements.c:1043
pgsk_ExecutorRun pg_stat_kcache.c:1034
pgqs_ExecutorRun pg_qualstats.c:661
explain_ExecutorRun auto_explain.c:334
ExecutorRun execMain.c:292
PortalRunSelect pquery.c:912
PortalRun pquery.c:756
exec_simple_query postgres.c:1325
PostgresMain postgres.c:4415
BackendRun postmaster.c:4527
BackendStartup postmaster.c:4211
ServerLoop postmaster.c:1740
PostmasterMain postmaster.c:1413
main main.c:231
__libc_start_main 0x00007f3353efd555
_start 0x0000000000483aa9

 

/*
 * information stored in t_infomask:
 */
#define HEAP_HASNULL            0x0001    /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002    /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL        0x0004    /* has external stored attribute(s) */
#define HEAP_HASOID_OLD            0x0008    /* has an object-id field */
#define HEAP_XMAX_KEYSHR_LOCK    0x0010    /* xmax is a key-shared locker */
#define HEAP_COMBOCID            0x0020    /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK        0x0040    /* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY        0x0080    /* xmax, if valid, is only a locker */

 /* xmax is a shared locker */
#define HEAP_XMAX_SHR_LOCK    (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)

#define HEAP_LOCK_MASK    (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                         HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_XMIN_COMMITTED        0x0100    /* t_xmin committed */
#define HEAP_XMIN_INVALID        0x0200    /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED        0x0400    /* t_xmax committed */
#define HEAP_XMAX_INVALID        0x0800    /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI        0x1000    /* t_xmax is a MultiXactId */
#define HEAP_UPDATED            0x2000    /* this is UPDATEd version of row */
#define HEAP_MOVED_OFF            0x4000    /* moved to another place by pre-9.0
                                         * VACUUM FULL(cluster同義詞); kept for binary
                                         * upgrade support */
#define HEAP_MOVED_IN            0x8000    /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HEAP_XACT_MASK            0xFFF0    /* visibility-related bits */

  每當一個新的clog頁面(和pg中其他頁面一樣,也是BLCKSZ宏定義,默認8KB)被初始化為0的時候,clog.c就會生成一條wal記錄。xact.c中針對提交和回滾操作的記錄(recording)也會寫clog。對於同步提交:在clog記錄commit前,XLOG會確保被刷新,所以WAL可以自動被保證。對於異步提交:必須跟蹤最新的LSN影響的每個CLOG頁,這樣才能刷新響應的xlog。clog的細節描述具體可以參見:https://www.interdb.jp/pg/pgsql05.html。clog的清理參見:https://www.interdb.jp/pg/pgsql06.html#_6.4.,由vacuum freeze負責清理。

部分結構化描述可以參見https://blog.csdn.net/weixin_39540651/article/details/115677138。 

其他目錄說明:

  • pg_logical
  • pg_commit_ts
  • pg_multixact
  • pg_subtrans
  • pg_snapshots
  • pg_replslot
  • pg_dynshmem

9.6目錄說明

10.0+目錄說明(到14為止未在發生調整) 

https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

vacuum的內部調用鏈路

[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0  0x00007f7eb7bb0a86 in ?? () from /lib64/libpthread.so.0
#1  0x00007f7eb7bb0b78 in ?? () from /lib64/libpthread.so.0
#2  0x0000000000710202 in PGSemaphoreLock (sema=0x7f4e36e3e8b8) at pg_sema.c:327
#3  0x0000000000785cf5 in LWLockAcquire (lock=0x7f7deb25bb00, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1331
#4  0x0000000000764d2a in StartBufferIO (buf=buf@entry=0x7f4e5a3a7b00, forInput=forInput@entry=true) at bufmgr.c:4188
#5  0x0000000000766cdd in BufferAlloc (foundPtr=0x7ffe45a794fb, strategy=0x2501320, blockNum=<optimized out>, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x258bc30) at bufmgr.c:1286
#6  ReadBuffer_common (smgr=0x258bc30, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=<optimized out>, blockNum@entry=4198708, mode=RBM_NORMAL, strategy=0x2501320, hit=0x7ffe45a795af) at bufmgr.c:761
#7  0x0000000000767360 in ReadBufferExtended (reln=reln@entry=0x7f4e367f7548, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=4198708, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at bufmgr.c:677
#8  0x00000000004dbcc2 in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1092
#9  heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#10 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#11 vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#12 0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#13 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#14 do_autovacuum () at autovacuum.c:2467
#15 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#16 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#17 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#18 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#19 <signal handler called>
#20 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#21 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#22 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#23 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0  0x00007f7eb745a987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004dbc1d in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1034
#4  heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#5  0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#6  vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#7  0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8  0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#9  do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0  0x00007f7eb745a987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004dbc1d in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1034
#4  heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#5  0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#6  vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#7  0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8  0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#9  do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ 


[lightdb@hs-10-20-30-217 base]$ pstack 3695188
#0  0x00007f1394acb987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004e8635 in btvacuumpage (vstate=vstate@entry=0x7fff02c75f80, scanblkno=scanblkno@entry=81599) at nbtree.c:1109
#4  0x00000000004e8e21 in btvacuumscan (info=info@entry=0x7fff02c76130, stats=stats@entry=0x1106688, callback=callback@entry=0x4daac0 <lazy_tid_reaped>, callback_state=callback_state@entry=0x7efc67f13048, cycleid=<optimized out>) at nbtree.c:1029
#5  0x00000000004e8f7e in btbulkdelete (info=0x7fff02c76130, stats=0x1106688, callback=0x4daac0 <lazy_tid_reaped>, callback_state=0x7efc67f13048) at nbtree.c:871
#6  0x00000000004da85e in lazy_vacuum_index (indrel=<optimized out>, stats=stats@entry=0x10f3df8, dead_tuples=0x7efc67f13048, reltuples=263139328, vacrelstats=vacrelstats@entry=0x10f3a58) at vacuumlazy.c:2444
#7  0x00000000004db677 in lazy_vacuum_all_indexes (onerel=onerel@entry=0x7efca8105c08, Irel=Irel@entry=0x10f3648, stats=stats@entry=0x10f3dd8, vacrelstats=vacrelstats@entry=0x10f3a58, lps=lps@entry=0x0, nindexes=nindexes@entry=12) at vacuumlazy.c:1810
#8  0x00000000004dbecf in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x10f3648, vacrelstats=<optimized out>, params=0x10b27c4, onerel=<optimized out>) at vacuumlazy.c:1696
#9  heap_vacuum_rel (onerel=<optimized out>, params=0x10b27c4, bstrategy=<optimized out>) at vacuumlazy.c:518
#10 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x10b27c4, rel=0x7efca8105c08) at ../../../src/include/access/tableam.h:1463
#11 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x10b27c4) at vacuum.c:1893
#12 0x000000000062e1cb in vacuum (relations=0x111f518, params=params@entry=0x10b27c4, bstrategy=<optimized out>, bstrategy@entry=0x10b2550, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#13 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x10b2550, tab=0x10b27c0) at autovacuum.c:3137
#14 do_autovacuum () at autovacuum.c:2467
#15 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#16 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#17 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#18 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#19 <signal handler called>
#20 0x00007f1394acb987 in select () from /lib64/libc.so.6
#21 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#22 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xfd0110) at postmaster.c:1488
#23 0x000000000048e7de in main (argc=1, argv=0xfd0110) at main.c:231
[lightdb@hs-10-20-30-217 base]$ pstack 3695188
#0  0x00007f1395222ff4 in pread64 () from /lib64/libpthread.so.0
#1  0x000000000076cb41 in FileRead (file=<optimized out>, buffer=buffer@entry=0x7efd0674fb00 "Q\002", amount=amount@entry=8192, offset=offset@entry=814096384, wait_event_info=wait_event_info@entry=167772173) at fd.c:2102
#2  0x00000000007902a0 in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=99377, buffer=0x7efd0674fb00 "Q\002") at md.c:663
#3  0x0000000000766a05 in ReadBuffer_common (smgr=0x10cc198, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=<optimized out>, blockNum@entry=99377, mode=RBM_NORMAL, strategy=0x10b2550, hit=0x7fff02c74e9f) at bufmgr.c:916
#4  0x0000000000767360 in ReadBufferExtended (reln=reln@entry=0x7efca810b6d8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=99377, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at bufmgr.c:677
#5  0x00000000004e864c in btvacuumpage (vstate=vstate@entry=0x7fff02c75f80, scanblkno=scanblkno@entry=99377) at nbtree.c:1117
#6  0x00000000004e8e21 in btvacuumscan (info=info@entry=0x7fff02c76130, stats=stats@entry=0x1106688, callback=callback@entry=0x4daac0 <lazy_tid_reaped>, callback_state=callback_state@entry=0x7efc67f13048, cycleid=<optimized out>) at nbtree.c:1029
#7  0x00000000004e8f7e in btbulkdelete (info=0x7fff02c76130, stats=0x1106688, callback=0x4daac0 <lazy_tid_reaped>, callback_state=0x7efc67f13048) at nbtree.c:871
#8  0x00000000004da85e in lazy_vacuum_index (indrel=<optimized out>, stats=stats@entry=0x10f3df8, dead_tuples=0x7efc67f13048, reltuples=263139328, vacrelstats=vacrelstats@entry=0x10f3a58) at vacuumlazy.c:2444
#9  0x00000000004db677 in lazy_vacuum_all_indexes (onerel=onerel@entry=0x7efca8105c08, Irel=Irel@entry=0x10f3648, stats=stats@entry=0x10f3dd8, vacrelstats=vacrelstats@entry=0x10f3a58, lps=lps@entry=0x0, nindexes=nindexes@entry=12) at vacuumlazy.c:1810
#10 0x00000000004dbecf in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x10f3648, vacrelstats=<optimized out>, params=0x10b27c4, onerel=<optimized out>) at vacuumlazy.c:1696
#11 heap_vacuum_rel (onerel=<optimized out>, params=0x10b27c4, bstrategy=<optimized out>) at vacuumlazy.c:518
#12 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x10b27c4, rel=0x7efca8105c08) at ../../../src/include/access/tableam.h:1463
#13 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x10b27c4) at vacuum.c:1893
#14 0x000000000062e1cb in vacuum (relations=0x111f518, params=params@entry=0x10b27c4, bstrategy=<optimized out>, bstrategy@entry=0x10b2550, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#15 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x10b2550, tab=0x10b27c0) at autovacuum.c:3137
#16 do_autovacuum () at autovacuum.c:2467
#17 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#18 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#19 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#20 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#21 <signal handler called>
#22 0x00007f1394acb987 in select () from /lib64/libc.so.6
#23 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#24 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xfd0110) at postmaster.c:1488
#25 0x000000000048e7de in main (argc=1, argv=0xfd0110) at main.c:231


[lightdb@hs-10-20-30-217 data]$ pstack 4021998
#0  0x00007f78c40ac987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004dce20 in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x2a18588, vacrelstats=<optimized out>, params=0x29d03cc, onerel=<optimized out>) at vacuumlazy.c:984
#4  heap_vacuum_rel (onerel=<optimized out>, params=0x29d03cc, bstrategy=<optimized out>) at vacuumlazy.c:518
#5  0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x29d03cc, rel=0x7f61d764f158) at ../../../src/include/access/tableam.h:1463
#6  vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x29d03cc) at vacuum.c:1893
#7  0x000000000062e1cb in vacuum (relations=0x2a20378, params=params@entry=0x29d03cc, bstrategy=<optimized out>, bstrategy@entry=0x29d0158, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8  0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x29d0158, tab=0x29d03c8) at autovacuum.c:3137
#9  do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f78c40ac987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x28f4110) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x28f4110) at main.c:231

 https://cloud.tencent.com/developer/article/2027383

wal的內部實現解析

  WALbuffer分配在共享內存中,最小64KB,通過XLogCtlInsert控制具體backend的申請,作為XLogCtlData的成員對外提供,無單獨訪問接口。
45000 53000 會話1
61000 69000 會話2
69000 77000 會話1
通過調試的指針地址從上可見,確實是共享。
  XLogInsert->XLogInsertRecord->CopyXLogRecordToWAL,只負責拷貝到wal buffer中的buffer(根據lsn位置/wal塊大小計算而來,類似整除,會提前計算wal lsn_start和lsn_end)(包括過大到下一頁),並不寫入wal文件。
LOG: INSERT @ 0/2091D708: - Heap/INSERT: off 3 flags 0x08
LOG: INSERT @ 0/2091D778: - Transaction/COMMIT: 2022-10-16 23:44:59.926298+08
  上述僅表示生成到XLogInsertRecord,無寫入的意思。通過系統調用跟蹤也可以看出,如下:
[{EPOLLIN, {u32=23473288, u64=23473288}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0.insert into t values('wocao"..., 8192, 0, NULL, NULL) = 47
lseek(6, 0, SEEK_END) = 8192
open("base/14199/16388_fsm", O_RDWR) = 7
lseek(7, 0, SEEK_END) = 24576
open("base/14199/16388", O_RDWR) = 8
lseek(8, 0, SEEK_END) = 245760
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=180, tv_usec=0}}, NULL) = 0
sendto(11, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22
recvfrom(11, 0xf94780, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5,
------------insert沒有產生pwrite64系統調用

[{EPOLLIN, {u32=23473288, u64=23473288}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13
open("lt_wal/000000010000000000000001", O_RDWR) = 9
pwrite64(9, "\6\321\5\0\1\0\0\0\0\0\222 \0\0\0\0%\32\0\0\0\0\0\0yyyyyyyy"..., 8192, 9568256) = 8192
fdatasync(9) = 0
sendto(10, "\2\0\0\0\250\3\0\0w7\0\0\10\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 936, 0, NULL, 0) = 936
sendto(10, "\2\0\0\0\350\1\0\0w7\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 488, 0, NULL, 0) = 488
sendto(10, "\16\0\0\0H\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 72, 0, NULL, 0) = 72
sendto(11, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18
recvfrom(11, 0xf94780, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5,

  如果wal buffer 64KB,滿了的話,那會怎么樣呢?理論上需要8個單行8000字節的請求就可以占滿,按照實現,是進行evict淘汰不再需要的wal buffer的,但是實際未寫到wal文件的話,怎么evict出去呢?后面如果又寫會怎么樣?這個還需要研究!!!

  當前wal文件句柄保存在xlog.c文件的openLogFile全局變量中
  因為wal也是分8k為單位讀寫,通過XLogWritePages寫入wal文件,每次8K*n字節,那如何保證每次寫小量而不重復拷貝呢?比如幾十字節?實際是在commit flush(RecordTransactionCommit函數中)的時候,每次寫仍然是8KB為單位,具體由startoffset += XLogWritePages(from, npages, startoffset)每次寫的from和startoffset參數可以看出。因為文件系統寫以塊為單位,所以寫8KB(如果文件系統8KB塊的話,如果512/4KB,則多了8倍或2倍)和80字節本質上的io是一樣的。

btree節點分裂的wal日志記錄

  在 btree 索引中,頁面拆分需要分配一個新頁面,並在父 btree 級別插入一個新鍵,但出於鎖定原因,這必須由兩個單獨的 WAL 記錄反映出來。 重放第一條記錄,以分配新頁面並將元組移動到其中,在頁面上設置一個標志,以指示鍵尚未插入到父級。 重放第二條記錄將清除標志。 在正常操作期間,其他后端永遠不會看到此中間狀態,因為子頁面上的鎖在兩個操作期間會一直持有,但如果操作在寫入第二個 WAL 記錄之前中斷,則會看到這種狀態。 搜索算法實現時考慮了這種不一致的異常,如果插入遇到設置了未完全拆分標志的頁面,它將通過在繼續之前將鍵插入父級來完成中斷的拆分。

In btree indexes, for example,a page split requires a new page to be allocated, and an insertion of a new key in the parent btree level, but for locking reasons this has to be reflected by two separate WAL records. Replaying the first record, to allocate the new page and move tuples to it, sets a flag on the page to indicate that the key has not been inserted to the parent yet. Replaying the second record clears the flag. This intermediate state is never seen by other backends during normal operation, because the lock on the child page is held across the two actions, but will be seen if the operation is interrupted before writing the second WAL record. The search algorithm works with the intermediate state as normal, but if an insertion encounters a page with the incomplete-split flag set, it will finish the interrupted split by inserting the key to the parent, before proceeding.


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM