This article will make clear the differences between fsync(), fdatasync(), sync(), O_DIRECT, O_SYNC, REQ_PREFLUSH, and REQ_FUA, and what each one does.
What are fsync(), fdatasync(), and sync()?
First of all, they are system calls.
fsync
The fsync(int fd) system call flushes all buffered data and metadata associated with the open file descriptor fd to disk (non-volatile storage).
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
fdatasync
fdatasync(int fd) is similar to fsync(), but it does not flush metadata unless that metadata is needed for subsequent reads of the data. For example, a change to the file's modification time is not flushed, but a change to the file size affects subsequent reads of the file, so it is flushed together with the data. fdatasync() therefore performs better than fsync().
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see stat(2)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush. The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
sync
The sync(void) system call flushes all kernel buffers containing modified file data (data blocks, indirect blocks, metadata, and so on) to disk.
Flush file system buffers, force changed blocks to disk, update the super block
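To make the usage concrete, here is a minimal userspace sketch of the common write-then-sync pattern (the path is hypothetical and error handling is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical path, used only for illustration */
    int fd = open("/tmp/durable.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "important record\n";
    if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }

    /* fdatasync() pushes the data (and size-changing metadata) to
     * non-volatile storage; use fsync() if all metadata, e.g. the
     * timestamps, must be durable as well */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}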
What are O_DIRECT, O_SYNC, REQ_PREFLUSH, and REQ_FUA?
They are all flags. Their end effect may be the same, but they operate at different layers.
O_DIRECT and O_SYNC are flag arguments to the open() system call, while REQ_PREFLUSH and REQ_FUA are flags on the kernel's bio requests.
To understand these flags you need to know about two caches:
- one is in your system memory: the buff/cache shown by free -h;
- the other is the disk's own built-in cache (its volatile write cache).
A simplified flow of a write I/O going down to disk is: user data → kernel page cache (buff/cache) → the disk's own cache → non-volatile media.
For the full picture, compare the Linux storage stack diagram.
O_DIRECT
O_DIRECT means the I/O bypasses the system page cache, which may degrade your I/O performance. It transfers data synchronously, but it does not guarantee data safety.
Note: "data safety" below always means the data has been written to the disk's non-volatile storage.
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT.
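One practical detail worth noting: O_DIRECT I/O generally must be aligned (buffer address, file offset, and transfer length) to the device's logical block size, and not every filesystem supports the flag. A minimal sketch, with a hypothetical path and an assumed 4096-byte alignment:

#define _GNU_SOURCE          /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT requires an aligned buffer; 4096 is a common safe
     * choice, but the real requirement depends on the device */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) { perror("posix_memalign"); return 1; }
    memset(buf, 0, 4096);

    /* hypothetical target; open() fails with EINVAL on filesystems
     * that do not support O_DIRECT (e.g. tmpfs) */
    int fd = open("/mnt/test/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* bypasses the page cache, but the data may still sit in the
     * disk's own cache: synchronous transfer, not data safety */
    if (write(fd, buf, 4096) < 0) perror("write");

    close(fd);
    free(buf);
    return 0;
}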
The dd command shows the difference between O_DIRECT and non-O_DIRECT clearly; note the change in buff/cache:
#clear the caches:
#echo 3 > /proc/sys/vm/drop_caches
#free -h
total used free shared buff/cache available
Mem: 62G 1.1G 61G 9.2M 440M 60G
Swap: 31G 0B 31G
#dd without direct
#dd if=/dev/zero of=/dev/bcache0 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 27.8166 s, 38.6 MB/s
#free -h
total used free shared buff/cache available
Mem: 62G 1.0G 60G 105M 1.5G 60G
Swap: 31G 0B 31G
#echo 3 > /proc/sys/vm/drop_caches
#free -h
total used free shared buff/cache available
Mem: 62G 626M 61G 137M 337M 61G
Swap: 31G 0B 31G
#dd with direct
#dd if=/dev/zero of=/dev/bcache0 bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.72088 s, 395 MB/s
#free -h
total used free shared buff/cache available
Mem: 62G 628M 61G 137M 341M 61G
Swap: 31G 0B 31G
O_SYNC
O_SYNC is the synchronous I/O flag; it guarantees the data is safely written to non-volatile storage.
Write operations on the file will complete according to the
requirements of synchronized I/O file integrity completion
REQ_PREFLUSH
REQ_PREFLUSH is a bio request flag. It means that before this I/O starts, all I/Os completed before it must already be written to non-volatile storage.
My understanding is that REQ_PREFLUSH only guarantees that previously completed I/Os have reached the non-volatile medium; the flagged bio itself may land only in the disk's cache and is not itself guaranteed to be safe.
REQ_PREFLUSH can also be set on an otherwise empty bio, meaning: flush the data sitting in the disk's cache (see the sketch after the REQ_FUA quote below).
Explicit cache flushes
The REQ_PREFLUSH flag can be OR ed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_PREFLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O. It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.
REQ_FUA
REQ_FUA is a bio request flag; it means completion is signaled only after the data has been written to non-volatile storage.
Forced Unit Access
The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
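As a rough in-kernel illustration (a sketch only, not a drop-in implementation: the bio APIs shift between kernel versions, and this follows roughly the 5.x-era interfaces matching the blk_types.h excerpt below), an empty preflush bio would be built like this:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* flush everything previously written into bdev's volatile cache:
 * an empty bio carrying only REQ_PREFLUSH, as described above */
static int flush_example(struct block_device *bdev)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 0);   /* zero data pages */
        int ret;

        bio_set_dev(bio, bdev);
        bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
        ret = submit_bio_wait(bio);    /* returns once the flush completes */
        bio_put(bio);
        return ret;
}

/* for a data write that must itself reach non-volatile media,
 * OR REQ_FUA into the write instead:
 *     bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_FUA;          */

The kernel's own blkdev_issue_flush() helper does essentially this, and the documentation quoted above recommends using it for a pure cache flush.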
Experimental verification
I recompiled the bcache kernel module to print bio->bi_opf. opf is the bio's operation flags field; it is composed of request flags OR'ed together and determines the I/O's behavior.
Request flags
/* from Linux source code include/linux/blk_types.h */
enum req_opf {
/* read sectors from the device */
REQ_OP_READ = 0,
/* write sectors to the device */
REQ_OP_WRITE = 1,
/* flush the volatile write cache */
REQ_OP_FLUSH = 2,
/* discard sectors */
REQ_OP_DISCARD = 3,
/* securely erase sectors */
REQ_OP_SECURE_ERASE = 5,
/* reset a zone write pointer */
REQ_OP_ZONE_RESET = 6,
/* write the same sector many times */
REQ_OP_WRITE_SAME = 7,
/* reset all the zone present on the device */
REQ_OP_ZONE_RESET_ALL = 8,
/* write the zero filled sector many times */
REQ_OP_WRITE_ZEROES = 9,
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN = 32,
REQ_OP_SCSI_OUT = 33,
/* Driver private requests */
REQ_OP_DRV_IN = 34,
REQ_OP_DRV_OUT = 35,
REQ_OP_LAST,
};
enum req_flag_bits {
__REQ_FAILFAST_DEV = /* 8 no driver retries of device errors */
REQ_OP_BITS,
__REQ_FAILFAST_TRANSPORT, /* 9 no driver retries of transport errors */
__REQ_FAILFAST_DRIVER, /* 10 no driver retries of driver errors */
__REQ_SYNC, /* 11 request is sync (sync write or read) */
__REQ_META, /* 12 metadata io request */
__REQ_PRIO, /* 13 boost priority in cfq */
__REQ_NOMERGE, /* 14 don't touch this for merging */
__REQ_IDLE, /* 15 anticipate more IO after this one */
__REQ_INTEGRITY, /* 16 I/O includes block integrity payload */
__REQ_FUA, /* 17 forced unit access */
__REQ_PREFLUSH, /* 18 request for cache flush */
__REQ_RAHEAD, /* 19 read ahead, can fail anytime */
__REQ_BACKGROUND, /* 20 background IO */
__REQ_NOWAIT, /* 21 Don't wait if request will block */
__REQ_NOWAIT_INLINE, /* 22 Return would-block error inline */
	/* ... */
}
bio->bi_opf lookup table
This table is needed for the tests below; you can skip it for now and come back when in doubt.
Decimal flag | Binary flag | REQ flags
---|---|---
2049 | 1000 0000 0001 | REQ_OP_WRITE \| REQ_SYNC
4096 | 0001 0000 0000 0000 | REQ_OP_READ \| REQ_META
34817 | 1000 1000 0000 0001 | REQ_OP_WRITE \| REQ_SYNC \| REQ_IDLE
399361 | 0110 0001 1000 0000 0001 | REQ_OP_WRITE \| REQ_SYNC \| REQ_META \| REQ_FUA \| REQ_PREFLUSH
264193 | 0100 0000 1000 0000 0001 | REQ_OP_WRITE \| REQ_SYNC \| REQ_PREFLUSH
165889 | 0010 1000 1000 0000 0001 | REQ_OP_WRITE \| REQ_SYNC \| REQ_IDLE \| REQ_FUA
1048577 | 0001 0000 0000 0000 0000 0001 | REQ_OP_WRITE \| REQ_BACKGROUND
133121 | 0010 0000 1000 0000 0001 | REQ_OP_WRITE \| REQ_SYNC \| REQ_FUA
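If you want to decode such values yourself, here is a small userspace sketch using the bit positions from the blk_types.h excerpt above (these positions match the kernel version quoted and can change between releases):

#include <stdio.h>

#define REQ_OP_MASK 0xff          /* low 8 bits hold the op (REQ_OP_BITS == 8) */
#define BIT(n)      (1u << (n))

static void decode(unsigned int opf)
{
    printf("%u: op=%u", opf, opf & REQ_OP_MASK);   /* 0=READ, 1=WRITE, 2=FLUSH... */
    if (opf & BIT(11)) printf(" |REQ_SYNC");
    if (opf & BIT(12)) printf(" |REQ_META");
    if (opf & BIT(15)) printf(" |REQ_IDLE");
    if (opf & BIT(17)) printf(" |REQ_FUA");
    if (opf & BIT(18)) printf(" |REQ_PREFLUSH");
    printf("\n");
}

int main(void)
{
    decode(34817);    /* op=1 |REQ_SYNC |REQ_IDLE */
    decode(264193);   /* op=1 |REQ_SYNC |REQ_PREFLUSH */
    decode(165889);   /* op=1 |REQ_SYNC |REQ_IDLE |REQ_FUA */
    return 0;
}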
Tests
Test the block device /dev/bcache0 directly with dd.
I have confirmed from the dd source code that: oflag=direct opens the file with O_DIRECT; oflag=sync opens it with O_SYNC; conv=fdatasync makes dd issue one fdatasync(fd) at the end; conv=fsync makes dd issue one fsync(fd) at the end.
- direct
#dd if=/dev/zero of=/dev/bcache0 oflag=direct bs=8k count=1
//messages
kernel: bcache: cached_dev_make_request() bi_opf 34817, size 8192
bi_opf 34817, size 8192: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE. This is a synchronous write; as you can see, it does not guarantee data safety.
- direct & sync
#dd if=/dev/zero of=/dev/bcache0 oflag=direct,sync bs=8k count=1
kernel: bcache: cached_dev_make_request() bi_opf 165889, size 8192
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
bi_opf 165889, size 8192: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE | REQ_FUA. This is a synchronous write request, and the I/O is written straight to the disk's non-volatile storage.
bi_opf 264193, size 0: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH with size = 0, meaning: flush the disk's cache so that all previously written I/Os reach non-volatile storage.
This shows why O_SYNC can guarantee data safety.
#dd if=/dev/zero of=/dev/bcache0 oflag=direct,sync bs=8k count=2
kernel: bcache: cached_dev_make_request() bi_opf 165889, size 8192
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
kernel: bcache: cached_dev_make_request() bi_opf 165889, size 8192
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
Two I/Os were written to the device. As shown above, with the file opened O_SYNC, every write() is followed by a flush request, which hurts performance considerably. This is why, in practice, files are generally not opened with O_SYNC; instead fdatasync() is called once after a batch of I/Os.
- without direct
#dd if=/dev/zero of=/dev/bcache0 bs=8k count=1
kernel: bcache: cached_dev_make_request() bi_opf 2049, size 4096
kernel: bcache: cached_dev_make_request() bi_opf 2049, size 4096
bi_opf 2049, size 4096: bi_opf = REQ_OP_WRITE | REQ_SYNC, a synchronous write request. The original 8k I/O was split in the page cache into two 4k I/Os before being written down; data safety is not guaranteed.
- direct & fdatasync
#dd if=/dev/zero of=/dev/bcache0 oflag=direct conv=fdatasync bs=8k count=1
kernel: bcache: cached_dev_make_request() bi_opf 34817, size 8192
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
bi_opf 34817, size 8192: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE, a synchronous write request; this alone does not guarantee data safety.
bi_opf 264193, size 0: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH, a request to flush the disk's cache; this one was issued by fdatasync().
- direct & sync & fdatasync
#dd if=/dev/zero of=/dev/bcache0 oflag=direct,sync conv=fdatasync bs=8k count=1
kernel: bcache: cached_dev_make_request() bi_opf 165889, size 8192
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
bi_opf 165889, size 8192: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE | REQ_FUA, a synchronous write whose data goes straight to the disk's non-volatile storage.
bi_opf 264193, size 0: bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH with size = 0, a flush of the disk's cache guaranteeing all previously written I/Os reach non-volatile storage.
Combining this with the analysis above: these three bios are one data write plus two flush I/Os, triggered by O_SYNC and by fdatasync() respectively.
- direct & fsync
#dd if=/dev/zero of=/dev/bcache0 oflag=direct conv=fsync bs=8k count=1
kernel: bcache: cached_dev_make_request() bi_opf 34817, size 8192
kernel: bcache: cached_dev_make_request() bi_opf 264193, size 0
Same as direct + fdatasync. Presumably because the target is a block device, there is no metadata (or the metadata did not change), so the bios produced by fdatasync() and fsync() are identical.
Bonus
How to enable and disable the disk cache
Check the disk's current write-cache state:
#hdparm -W /dev/sda
/dev/sda:
write-caching = 1 (on)
Disable the disk's write cache:
#hdparm -W 0 /dev/sda
Enable the disk's write cache:
#hdparm -W 1 /dev/sda