Previous articles:
Ceph Distributed Storage: Architecture and How It Works
Manually Deploying a Three-Node Ceph Mimic Cluster
RBD
RBD: Ceph's RADOS Block Devices. Ceph block devices are thin-provisioned, resizable, and store data striped over multiple OSDs in a Ceph cluster.
Ceph RBD is an enterprise-grade block storage solution. It supports resizing and thin provisioning, has copy-on-write (COW) semantics, and each block device (volume) is split into a number of objects stored in RADOS.
CEPH BLOCK DEVICE
- Thin-provisioned
- Images up to 16 exabytes
- Configurable striping
- In-memory caching
- Snapshots
- Copy-on-write cloning
- Kernel driver support
- KVM/libvirt support
- Back-end for cloud solutions
- Incremental backup
- Disaster recovery (multisite asynchronous replication)
Creating and Deleting an RBD Pool
HELP:
osd pool create <poolname> <int[0-]> {<int[0-]>} {replicated|erasure} {<erasure_code_profile>} {<rule>} {<int>}
Create a pool rbd_pool and initialize it for RBD use:
[root@ceph-node1 ~]# ceph osd pool create rbd_pool 8 8
pool 'rbd_pool' created
[root@ceph-node1 ~]# ceph osd pool create rbd_pool02 8 8
pool 'rbd_pool02' created
[root@ceph-node1 ~]# rbd pool init rbd_pool
[root@ceph-node1 ~]# rbd pool init rbd_pool02
NOTE: pg_num must be specified when creating a pool. The official recommendations are as follows (a rough calculation sketch follows the list):
- fewer than 5 OSDs: set pg_num to 128
- 5 to 10 OSDs: set pg_num to 512
- 10 to 50 OSDs: set pg_num to 4096
- more than 50 OSDs: use the Ceph PGs per Pool Calculator to work it out
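For a rough calculation, the rule of thumb behind those numbers is to target on the order of 100 PGs per OSD, divided by the replica count (and the number of pools sharing the OSDs), rounded up to a power of two. A minimal sketch, with illustrative numbers only (the helper is ours, not a Ceph API):
def suggest_pg_num(num_osds, replica_size=3, num_pools=1, target_pgs_per_osd=100):
    """Rough pg_num suggestion: ~100 PGs per OSD, shared across pools,
    divided by the replica count, rounded up to a power of two."""
    raw = (num_osds * target_pgs_per_osd) / (replica_size * num_pools)
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggest_pg_num(9))   # e.g. 9 OSDs with 3 replicas -> 512; the tiny demo pools above simply use 8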
View per-pool usage with rados df:
[root@ceph-node1 ~]# rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
.rgw.root 2.2 KiB 6 0 18 0 0 0 63 42 KiB 6 6 KiB
default.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B
default.rgw.log 0 B 207 0 621 0 0 0 36870 36 MiB 24516 0 B
default.rgw.meta 0 B 0 0 0 0 0 0 0 0 B 0 0 B
rbd_pool 114 MiB 44 0 132 0 0 0 2434 49 MiB 843 223 MiB
total_objects 265
total_used 9.4 GiB
total_avail 81 GiB
total_space 90 GiB
View cluster-wide and per-pool usage with ceph df:
[root@ceph-node1 ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
90 GiB 79 GiB 11 GiB 12.39
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
.rgw.root 3 2.2 KiB 0 24 GiB 6
default.rgw.control 4 0 B 0 24 GiB 8
default.rgw.meta 5 0 B 0 24 GiB 0
default.rgw.log 6 0 B 0 24 GiB 207
images 9 39 MiB 0.16 24 GiB 9
volumes 10 36 B 0 24 GiB 5
vms 11 19 B 0 24 GiB 3
backups 12 19 B 0 24 GiB 2
View other pool-related information:
# List the pools
[root@ceph-node1 ~]# rados lspools
rbd_pool
rbd_pool02
# Check the pool's pg_num and pgp_num
[root@ceph-node1 ~]# ceph osd pool get rbd_pool pg_num
pg_num: 8
[root@ceph-node1 ~]# ceph osd pool get rbd_pool pgp_num
pgp_num: 8
# Dump the pool entries from the OSD map
[root@ceph-node1 ~]# ceph osd dump | grep pool
pool 1 'rbd_pool' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 71 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 56 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 59 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 61 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 63 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
# Look up the OSD map placement of the pool's objects; you can see that objects in rbd_pool have 3 replicas (each PG is mapped to 3 OSDs)
[root@ceph-node1 ~]# ceph osd map rbd_pool rbd_info
osdmap e53 pool 'rbd_pool' (1) object 'rbd_info' -> pg 1.ac0e573a (1.2) -> up ([4,0,8], p4) acting ([4,0,8], p4)
[root@ceph-node1 ~]# ceph osd map rbd_pool rbd_directory
osdmap e53 pool 'rbd_pool' (1) object 'rbd_directory' -> pg 1.30a98c1c (1.4) -> up ([7,3,1], p7) acting ([7,3,1], p7)
[root@ceph-node1 ~]# ceph osd map rbd_pool rbd_id.volume01
osdmap e53 pool 'rbd_pool' (1) object 'rbd_id.volume01' -> pg 1.8f1d799c (1.4) -> up ([7,3,1], p7) acting ([7,3,1], p7)
Delete an RBD pool:
[root@ceph-node1 ~]# ceph osd pool delete rbd_pool02
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool rbd_pool02. If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*, followed by --yes-i-really-really-mean-it.
[root@ceph-node1 ~]# ceph osd pool delete rbd_pool02 rbd_pool02 --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool
The output above shows that the deletion failed. Deleting a pool is an extremely high-risk operation, so Ceph guards it behind a configuration option; we modify the configuration to allow the pool to be deleted:
# The option only needs to be enabled on the MONs, so edit /etc/ceph/ceph.conf directly
[mon]
mon allow pool delete = true
$ systemctl restart ceph-mon.target
Delete the pool again:
[root@ceph-node1 ~]# ceph osd pool delete rbd_pool02 rbd_pool02 --yes-i-really-really-mean-it
pool 'rbd_pool02' removed
Set quotas on a pool:
# Limit the pool to a maximum of 100 objects
ceph osd pool set-quota test-pool max_objects 100
# Limit the pool to a maximum of 10 GB of data
ceph osd pool set-quota test-pool max_bytes $((10 * 1024 * 1024 * 1024))
# To remove a quota, set the corresponding value back to 0
Rename a pool:
ceph osd pool rename test-pool test-pool-new
Create/delete a pool snapshot:
# Create
ceph osd pool mksnap test-pool test-pool-snapshot
# Delete
ceph osd pool rmsnap test-pool test-pool-snapshot
NOTE: Ceph has two snapshot types and they are mutually exclusive: a pool that already has self-managed snapshots cannot take pool snapshots, and vice versa.
- Pool snapshot: a snapshot of an entire pool
- Self-managed snapshot: a snapshot of an individual RBD block device
Set pool parameters:
ceph osd pool set {pool-name} {key} {value}
# Set the pool's replica size to 3
ceph osd pool set test-pool size 3
Creating and Deleting Block Devices
To create an RBD block device, first log in to any MON node of the Ceph cluster, to a host with Ceph cluster admin privileges, or to a Ceph client. Below we run the commands directly on a MON node.
Create block devices in a specific RBD pool:
[root@ceph-node1 ~]# rbd create rbd_pool/volume01 --size 1024
[root@ceph-node1 ~]# rbd create rbd_pool/volume02 --size 1024
[root@ceph-node1 ~]# rbd create rbd_pool/volume03 --size 1024
NOTE: if no RBD pool name is specified, the default is rbd, so in that case a pool named rbd must exist first.
View block device information:
[root@ceph-node1 ~]# rbd --image rbd_pool/volume01 info
rbd image 'volume01':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: 11126b8b4567
block_name_prefix: rbd_data.11126b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Tue Apr 23 07:30:57 2019
# size: size of the block device
# order: object size; 22 means 2**22 bytes, i.e. 4 MiB
# block_name_prefix: the block device's globally unique identifier within the Ceph cluster
# format: image format, either 1 or 2
# features: image features, broken down as follows
# - layering: layering support
# - striping: striping v2 support
# - exclusive-lock: exclusive lock support
# - object-map: object map index support, depends on exclusive-lock
# - fast-diff: fast diff calculation support, depends on object-map
# - deep-flatten: snapshot flattening support
# - journaling: I/O journaling support, depends on exclusive-lock
View the objects backing the pool's block devices:
[root@ceph-node1 ~]# rados -p rbd_pool ls
...
rbd_directory
rbd_id.volume01
# rbd_id.volume01: stores volume01's own block_name_prefix (image id)
# rbd_directory: stores an index of all block devices in this pool
Fetch one of the pool's objects and inspect its contents (a python-rados equivalent follows the transcript):
[root@ceph-node1 ~]# rados -p rbd_pool get rbd_info rbd_info
[root@ceph-node1 ~]# ls
rbd_info
[root@ceph-node1 ~]# hexdump -vC rbd_info
00000000 6f 76 65 72 77 72 69 74 65 20 76 61 6c 69 64 61 |overwrite valida|
00000010 74 65 64 |ted|
00000013
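The same objects can also be read programmatically with the official python-rados bindings, assuming they are installed on a node that has /etc/ceph/ceph.conf and the admin keyring. A minimal sketch (error handling omitted):
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # connects as client.admin by default
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd_pool')
    # Read the small bootstrap object shown in the hexdump above
    print(ioctx.read('rbd_info'))          # b'overwrite validated'
    # List every object in the pool, like `rados -p rbd_pool ls`
    for obj in ioctx.list_objects():
        print(obj.key)
    ioctx.close()
finally:
    cluster.shutdown()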
Delete a block device:
[root@ceph-node1 ~]# rbd ls rbd_pool
volume01
volume02
volume03
[root@ceph-node1 ~]# rbd rm volume03 -p rbd_pool
Removing image: 100% complete...done.
[root@ceph-node1 ~]# rbd ls rbd_pool
volume01
volume02
Mapping and Unmapping Block Devices
The RBD driver is integrated into the Linux kernel (2.6.39 or later), so the prerequisite for mounting a volume on a Linux client is loading the rbd kernel module. Below we first map a volume on ceph-node1; since ceph-node1 is already a native client, no extra setup is required.
Map a block device on the client:
[root@ceph-node1 ~]# lsmod | grep rbd
rbd 83640 2
libceph 306625 1 rbd
[root@ceph-node1 ~]# rbd map rbd_pool/volume01
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable rbd_pool/volume01 object-map fast-diff deep-flatten".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address
The mapping failed here because the client kernel is too old to support all of the image's features. Any of the following changes fixes it.
Set the cluster-wide default feature list for RBD block devices:
$ vi /etc/ceph/ceph.conf
[global]
...
rbd_default_features = 1
$ ceph-deploy --overwrite-conf admin ceph-node1 ceph-node2 ceph-node3
Or specify the image features explicitly when creating the block device:
rbd create rbd_pool/volume03 --size 1024 --image-format 1 --image-feature layering
Or disable the features the kernel does not support, then map the image again:
[root@ceph-node1 ~]# rbd info rbd_pool/volume01
rbd image 'volume01':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: 11126b8b4567
block_name_prefix: rbd_data.11126b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Tue Apr 23 07:30:57 2019
[root@ceph-node1 ~]# rbd feature disable rbd_pool/volume01 object-map fast-diff deep-flatten
[root@ceph-node1 ~]# rbd feature disable rbd_pool/volume02 object-map fast-diff deep-flatten
[root@ceph-node1 ~]# rbd info rbd_pool/volume01
rbd image 'volume01':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: 11126b8b4567
block_name_prefix: rbd_data.11126b8b4567
format: 2
features: layering, exclusive-lock
op_features:
flags:
create_timestamp: Tue Apr 23 07:30:57 2019
[root@ceph-node1 ~]# rbd map rbd_pool/volume01
/dev/rbd0
List the block devices mapped on the client:
[root@ceph-node1 ~]# rbd showmapped
id pool image snap device
0 rbd_pool volume01 - /dev/rbd0
[root@ceph-node1 ~]# lsblk | grep rbd0
rbd0 252:0 0 1G 0 disk
Once mapped, the block device behaves like a raw disk; it needs to be partitioned/formatted and given a filesystem:
[root@ceph-node1 ~]# mkfs.xfs /dev/rbd0
meta-data=/dev/rbd0 isize=512 agcount=8, agsize=32768 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The data ultimately lives on the OSDs as objects whose names are prefixed with the block device's block_name_prefix. The more data is written, the more objects there are:
[root@ceph-node1 deploy]# rados ls -p rbd_pool | grep rbd_data.121896b8b4567
rbd_data.121896b8b4567.0000000000000080
rbd_data.121896b8b4567.00000000000000a0
rbd_data.121896b8b4567.0000000000000082
rbd_data.121896b8b4567.00000000000000e0
rbd_data.121896b8b4567.00000000000000ff
rbd_data.121896b8b4567.0000000000000081
rbd_data.121896b8b4567.0000000000000040
rbd_data.121896b8b4567.0000000000000020
rbd_data.121896b8b4567.0000000000000000
rbd_data.121896b8b4567.00000000000000c0
rbd_data.121896b8b4567.0000000000000060
rbd_data.121896b8b4567.0000000000000001
NOTE: the suffix of these object names is hexadecimal, and objects are named block_name_prefix + index. The index range [0x00, 0xff] is 256 values in decimal, which matches the block device's size of 256 objects.
[root@ceph-node1 ~]# rbd --image rbd_pool/volume01 info
rbd image 'volume01':
size 1 GiB in 256 objects
Clearly, though, those 256 objects do not actually exist yet. Ceph RBD is thin provisioned: the 256 objects corresponding to the image size are not all created when the volume is created; the object count grows as data is actually written. The 12 objects above were produced purely by formatting the XFS filesystem.
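Putting the two observations together, here is a small sketch (our own helper, assuming the default striping where the stripe unit equals the object size) that maps a byte offset within the image to the RADOS object that would hold it:
def rbd_object_for_offset(block_name_prefix, offset, order=22):
    """Return the name of the RADOS object holding the given byte offset.
    order=22 means 4 MiB (2**22 byte) objects, as reported by `rbd info`."""
    object_size = 1 << order
    index = offset // object_size
    return "%s.%016x" % (block_name_prefix, index)

# The last byte of a 1 GiB image lands in object index 0xff (the 256th object)
print(rbd_object_for_offset("rbd_data.121896b8b4567", 1024 * 1024 * 1024 - 1))
# -> rbd_data.121896b8b4567.00000000000000ff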
Mount the block device and write some data:
[root@ceph-node1 ~]# mkdir -pv /mnt/volume01
[root@ceph-node1 ~]# mount /dev/rbd0 /mnt/volume01
[root@ceph-node1 ~]# dd if=/dev/zero of=/mnt/volume01/fi1e1 count=10 bs=1M
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.0129169 s, 812 MB/s
[root@ceph-node1 deploy]# rados ls -p rbd_pool | grep rbd_data.121896b8b4567 | wc -l
37
As you can see, the block device's objects have grown dynamically.
Let's look at the contents of volume01's first and second objects:
[root@ceph-node1 ~]# rados -p rbd_pool get rbd_data.121896b8b4567.0000000000000000 rbd_data.121896b8b4567.0000000000000000
[root@ceph-node1 ~]# rados -p rbd_pool get rbd_data.121896b8b4567.0000000000000001 rbd_data.121896b8b4567.0000000000000001
[root@ceph-node1 ~]# ll -lht
total 172K
-rw-r--r-- 1 root root 32K Apr 24 06:30 rbd_data.121896b8b4567.0000000000000001
-rw-r--r-- 1 root root 128K Apr 24 06:30 rbd_data.121896b8b4567.0000000000000000
[root@ceph-node1 ~]# hexdump -vC rbd_data.121896b8b4567.0000000000000000 | more
00000000 58 46 53 42 00 00 10 00 00 00 00 00 00 04 00 00 |XFSB............|
...
Two points are worth noting. First, the contents of the first object confirm that it was indeed created by mkfs.xfs, and this is what XFS filesystem metadata looks like on disk. Second, neither object 0 nor object 1 has reached 4 MiB, yet additional objects were created. This is because the client stripes the data across an object set in round-robin fashion instead of filling one object completely before moving to the next (!!! this still needs verification, since volume01 does not have the striping feature enabled).
Unmount and unmap the block device:
[root@controller ~]# umount /mnt/volume01/
[root@ceph-node1 deploy]# rbd showmapped
id pool image snap device
1 rbd_pool volume01 - /dev/rbd1
[root@ceph-node1 deploy]# rbd unmap rbd_pool/volume01
[root@ceph-node1 deploy]# rbd showmapped
An amusing way to think about RBD block devices and XFS:
RBD really is a complete block device. If you picture a 1 GiB volume as a 1024-storey tower, then XFS is the building manager who lives inside it: it can only see those 1024 floors, and it decides which floor and which room every tenant (file) lives in and whether they sleep on the floor or the ceiling (file offset). The manager next door is called ext4; it lives in an identical tower but arranges its tenants by its own rules. That is a metaphor for how a filesystem organizes files. Then one day the head of the demolition crew shows up and says: I don't care how you (xfs or ext4) arranged things or why you built so high. He chops the 1024 floors into slices of four storeys each (4 MiB), 256 four-storey pieces in all, packs them up and hauls them off to a neighbourhood called Ceph. Looking around that neighbourhood, no building is taller than four storeys (a filled object), and some are still just foundations (nothing written yet).
Adding a New Client
Here we use the OpenStack controller node as the client. First load the rbd kernel module:
[root@controller ~]# uname -r
3.10.0-957.10.1.el7.x86_64
[root@controller ~]# modprobe rbd
[root@controller ~]# lsmod | grep rbd
rbd 83640 0
libceph 306625 1 rbd
To grant the client access to the Ceph cluster, the admin keyring and the configuration file need to be copied to it. Authentication between the client and the cluster is based on this keyring; here we use the admin keyring, which gives the client full access to the cluster. The more sensible approach is to create keyrings with a restricted set of capabilities and distribute those to non-admin clients. Below we again use ceph-deploy to install and authorize the client.
- Configure the Ceph Mimic YUM repository on the controller.
- Set up passwordless SSH from the ceph-deploy node to the controller:
[root@ceph-node1 deploy]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@controller
- Install the client software on the controller:
ceph-deploy install controller
- Copy the keyring and configuration file to the controller:
ceph-deploy --overwrite-conf admin controller
LOG:
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to controller
- Query the Ceph cluster status from the client:
[root@controller ~]# ceph -s
cluster:
id: d82f0b96-6a69-4f7f-9d79-73d5bac7dd6c
health: HEALTH_WARN
too few PGs per OSD (2 < min 30)
services:
mon: 3 daemons, quorum ceph-node1,ceph-node2,ceph-node3
mgr: ceph-node1(active), standbys: ceph-node2, ceph-node3
osd: 9 osds: 9 up, 9 in
data:
pools: 1 pools, 8 pgs
objects: 40 objects, 84 MiB
usage: 9.3 GiB used, 81 GiB / 90 GiB avail
pgs: 8 active+clean
- Map and mount a volume on the controller client:
[root@controller ~]# rbd map rbd_pool/volume02
/dev/rbd0
[root@controller ~]# rbd showmapped
id pool image snap device
0 rbd_pool volume02 - /dev/rbd0
[root@controller ~]# mkfs.xfs /dev/rbd0
meta-data=/dev/rbd0 isize=512 agcount=8, agsize=32768 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@controller ~]# mkdir -pv /mnt/volume02
mkdir: created directory ‘/mnt/volume02’
[root@controller ~]# mount /dev/rbd0 /mnt/volume02
[root@controller ~]# df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/centos-root xfs 196G 3.2G 193G 2% /
devtmpfs devtmpfs 9.8G 0 9.8G 0% /dev
tmpfs tmpfs 9.8G 0 9.8G 0% /dev/shm
tmpfs tmpfs 9.8G 938M 8.9G 10% /run
tmpfs tmpfs 9.8G 0 9.8G 0% /sys/fs/cgroup
/dev/sda1 xfs 197M 165M 32M 84% /boot
tmpfs tmpfs 2.0G 0 2.0G 0% /run/user/0
/dev/rbd0 xfs 1014M 43M 972M 5% /mnt/volume02
[root@controller ~]# dd if=/dev/zero of=/mnt/volume02/fi1e1 count=10 bs=1M
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.0100852 s, 1.0 GB/s
Resizing Block Devices
Expand a block device:
[root@ceph-node1 deploy]# rbd resize rbd_pool/volume01 --size 2048
Resizing image: 100% complete...done.
[root@ceph-node1 ~]# rbd info rbd_pool/volume01
rbd image 'volume01':
size 2 GiB in 512 objects
order 22 (4 MiB objects)
id: 11126b8b4567
block_name_prefix: rbd_data.11126b8b4567
format: 2
features: layering, exclusive-lock
op_features:
flags:
create_timestamp: Tue Apr 23 07:30:57 2019
If the device is already mounted, also check that the kernel has picked up the new capacity, then grow the filesystem:
[root@ceph-node1 ~]# df -Th
...
/dev/rbd0 xfs 1014M 103M 912M 11% /mnt/volume01
[root@ceph-node1 ~]# xfs_growfs -d /mnt/volume01/
meta-data=/dev/rbd0 isize=512 agcount=8, agsize=32768 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 262144 to 524288
[root@ceph-node1 ~]# df -Th
...
/dev/rbd0 xfs 2.0G 103M 1.9G 6% /mnt/volume01
RBD Image Formats: Format 1 vs. Format 2
As is well known, RBD block devices come in two formats, each supporting a different set of features:
- Format 1: the legacy format (Hammer-era default, rbd_default_features = 3)
- Format 2: the current format (Jewel-era default, rbd_default_features = 61)
Feature numbers: the value of the rbd_default_features option is the sum of the numbers below (a small sketch after the list shows the arithmetic):
NOTE: only applies to format 2 images
- +1 for layering,
- +2 for striping v2,
- +4 for exclusive lock,
- +8 for object map
- +16 for fast-diff,
- +32 for deep-flatten,
- +64 for journaling
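To make the arithmetic concrete, here is a tiny sketch (the helper is ours) that composes an rbd_default_features value from the flag values listed above:
# Feature bits as listed above (format 2 images only)
RBD_FEATURES = {
    "layering": 1,
    "striping": 2,
    "exclusive-lock": 4,
    "object-map": 8,
    "fast-diff": 16,
    "deep-flatten": 32,
    "journaling": 64,
}

def default_features(*names):
    """Sum the feature bits, e.g. to build an rbd_default_features value."""
    return sum(RBD_FEATURES[n] for n in names)

# Jewel default (61): everything except striping and journaling
print(default_features("layering", "exclusive-lock", "object-map", "fast-diff", "deep-flatten"))  # 61
# Hammer default (3): layering + striping
print(default_features("layering", "striping"))  # 3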
Block Device Snapshots, Clones, and Rollback
A Ceph block device snapshot is a point-in-time, read-only copy of the image. You can take a snapshot to preserve the state of the block device at a given moment, and multiple snapshots are supported, which is why the feature is also called a Time Machine.
Ceph RBD snapshots use COW (copy-on-write): without modifying the original image data, you get a new image (or snapshot file) that stores the changed data. Taking a snapshot is therefore very fast, because it only updates the block device's metadata, e.g. by adding a snapshot ID. RBD layering has the same characteristics as QCOW2 image files; for more on how COW works, see the earlier article on COW and ROW snapshot techniques.
Note: depending on the implementation, a snapshot does not necessarily exist as an actual file, so "snapshot" here means the newly changed data.
Chained snapshots based on COW:
Create a snapshot:
$ rbd snap create rbd_pool/volume01@snap01
List snapshots:
[root@ceph-node1 ~]# rbd snap ls rbd_pool/volume01
SNAPID NAME SIZE TIMESTAMP
4 snap01 2 GiB Tue Apr 23 23:50:23 2019
Delete a specific snapshot:
$ rbd snap rm rbd_pool/volume01@snap01
Delete all snapshots:
[root@ceph-node1 ~]# rbd snap purge rbd_pool/volume01
Removing all snapshots: 100% complete...done.
A clone is a new block device created from a snapshot. It is likewise very fast, since it only modifies the image's metadata, e.g. by adding parent information. To keep things clear, let's first agree on some terminology:
- Block device, image (RBD image): generic terms for a Ceph RBD block device.
- Original image, template image: the block device a snapshot was taken from.
- COW copy image: a block device cloned from a given snapshot.
NOTE: the difference between a snapshot and a clone is that a clone uses the snapshot's COW layering to turn the specified snapshot's data into a new block device that Ceph recognizes as a first-class image.
Thanks to RBD layering, an OpenStack deployment backed by Ceph can boot hundreds of virtual machines in a very short time; each VM disk is in essence a COW copy image holding only the data changed by the guest OS.
Create an RBD block device:
[root@ceph-node1 ~]# rbd create rbd_pool/volume03 --size 1024 --image-format 2
[root@ceph-node1 ~]# rbd ls rbd_pool
volume01
volume02
volume03
[root@ceph-node1 ~]# rbd info rbd_pool/volume03
rbd image 'volume03':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: 12bb6b8b4567
block_name_prefix: rbd_data.12bb6b8b4567
format: 2
features: layering
op_features:
flags:
create_timestamp: Wed Apr 24 03:53:28 2019
Take a snapshot:
[root@ceph-node1 ~]# rbd snap create rbd_pool/volume03@snap01
[root@ceph-node1 ~]# rbd snap ls rbd_pool/volume03
SNAPID NAME SIZE TIMESTAMP
8 snap01 1 GiB Wed Apr 24 03:54:53 2019
Protect the snapshot (so it cannot be deleted); the protected snapshot now serves as the original (parent) image:
rbd snap protect rbd_pool/volume03@snap01
Clone the snapshot to obtain a COW copy image, which records its parent information:
[root@ceph-node1 ~]# rbd clone rbd_pool/volume03@snap01 rbd_pool/vol01_from_volume03
[root@ceph-node1 ~]# rbd ls rbd_pool
vol01_from_volume03
volume01
volume02
volume03
[root@ceph-node1 ~]# rbd info rbd_pool/vol01_from_volume03
rbd image 'vol01_from_volume03':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: faf86b8b4567
block_name_prefix: rbd_data.faf86b8b4567
format: 2
features: layering
op_features:
flags:
create_timestamp: Wed Apr 24 03:56:36 2019
parent: rbd_pool/volume03@snap01
overlap: 1 GiB
Flatten the COW copy image so that it no longer depends on the parent image and holds the block device's complete data (old + new) on its own:
[root@ceph-node1 ~]# rbd flatten rbd_pool/vol01_from_volume03
Image flatten: 100% complete...done.
[root@ceph-node1 ~]# rbd info rbd_pool/vol01_from_volume03
rbd image 'vol01_from_volume03':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: faf86b8b4567
block_name_prefix: rbd_data.faf86b8b4567
format: 2
features: layering
op_features:
flags:
create_timestamp: Wed Apr 24 03:56:36 2019
Unprotect and delete the snapshot:
[root@ceph-node1 ~]# rbd snap unprotect rbd_pool/volume03@snap01
[root@ceph-node1 ~]# rbd snap rm rbd_pool/volume03@snap01
Removing snap: 100% complete...done.
[root@ceph-node1 ~]# rbd snap ls rbd_pool/volume03
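The snapshot → protect → clone → flatten workflow above can also be scripted with the official python-rbd bindings. A minimal sketch, assuming python-rbd and python-rados are installed and reusing the pool and image names from the transcript (error handling omitted):
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd_pool')

# Snapshot the parent image and protect the snapshot so it can be cloned
with rbd.Image(ioctx, 'volume03') as parent:
    parent.create_snap('snap01')
    parent.protect_snap('snap01')

# Clone the protected snapshot into a new child image (layering feature required)
rbd.RBD().clone(ioctx, 'volume03', 'snap01',
                ioctx, 'vol01_from_volume03',
                features=rbd.RBD_FEATURE_LAYERING)

# Flatten the child so it no longer depends on the parent
with rbd.Image(ioctx, 'vol01_from_volume03') as child:
    child.flatten()

ioctx.close()
cluster.shutdown()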
Rolling back a snapshot is a trip back in time: it restores the block device to its state at a particular moment, which is why Ceph's restore command uses the word rollback:
[root@ceph-node1 ~]# rbd snap rollback rbd_pool/volume01@snap01
Rolling back to snapshot: 100% complete...done.
NOTE: if the block device is mounted, re-mount it afterwards to refresh the filesystem state
Block Device I/O Model
- The client uses the librbd library to create a block device and write data to it.
- librbd calls into librados; after the Pool, RBD image, Object, PG and OSD mappings are resolved layer by layer, it obtains the IP:Port of the Primary OSD.
- The client opens a socket to the Primary OSD and sends the data directly to it; the Primary OSD then replicates it to the Replica OSDs. (A minimal client-side sketch follows.)
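From the application's point of view, this whole path is hidden behind a couple of librbd calls; CRUSH-based placement and OSD communication happen inside the libraries. A minimal client-side sketch with the python-rbd bindings (the image name volume04 is ours for illustration):
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                       # librados: authenticate and fetch cluster maps
ioctx = cluster.open_ioctx('rbd_pool')  # I/O context bound to the pool

rbd.RBD().create(ioctx, 'volume04', 1 * 1024**3)   # create a 1 GiB image
with rbd.Image(ioctx, 'volume04') as image:
    # librbd turns this into object writes; librados/CRUSH pick the primary OSD
    image.write(b'hello ceph', 0)
    print(image.read(0, 10))            # b'hello ceph'

ioctx.close()
cluster.shutdown()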
RBD QoS
QoS (Quality of Service) originated in networking, where it is used to combat congestion and latency and to give selected traffic better service. In storage, a cluster's I/O capacity is just as finite, in terms of bandwidth, IOPS and so on. How do you keep tenants from fighting over resources while still guaranteeing service quality for high-priority users? The limited I/O capacity has to be allocated sensibly, and that is storage QoS. RBD QoS is QoS implemented in Ceph's librbd module.
librbd architecture:
- librbd interface layer: reads configuration and sets up a watch on the configuration object
- Internal: the convergence point for all librbd read/write interfaces
- osdc: slices the user data stream
- objectcache: user data caching layer
- send op: the layer that sends data operations
RBD QoS options:
- conf_rbd_qos_iops_limit: IOPS limit (I/Os per second)
- conf_rbd_qos_read_iops_limit: read IOPS limit
- conf_rbd_qos_write_iops_limit: write IOPS limit
- conf_rbd_qos_bps_limit: bandwidth limit (bytes per second)
- conf_rbd_qos_read_bps_limit: read bandwidth limit
- conf_rbd_qos_write_bps_limit: write bandwidth limit
Token bucket algorithm
In the current release (Mimic), RBD QoS based on the token bucket algorithm is implemented, but for now only the conf_rbd_qos_iops_limit option is supported.
[root@ceph-node2 ~]# rbd create rbd_pool/volume01 --size 10240
[root@ceph-node2 ~]# rbd image-meta set rbd_pool/volume01 conf_rbd_qos_iops_limit 1000
[root@ceph-node2 ~]# rbd image-meta list rbd_pool/volume01
There is 1 metadatum on this image:
Key Value
conf_rbd_qos_iops_limit 1000
[root@ceph-node2 ~]# rbd image-meta set rbd_pool/volume01 conf_rbd_qos_bps_limit 2048000
failed to set metadata conf_rbd_qos_bps_limit of image : (2) No such file or directory
rbd: setting metadata failed: (2) No such file or directory
The basic idea of the token bucket algorithm (see the sketch after this list):
- If the configured average send rate is R, one token is added to the bucket every 1/R seconds;
- The bucket can hold at most N tokens; if a new token arrives when the bucket is already full, it is discarded;
- Packets are first classified against the configured match rules; packets that do not match bypass the token bucket and are sent directly;
- Conversely, packets that do match must pass through the token bucket. When an M-byte packet arrives, M tokens are taken from the bucket and the packet is sent along with them;
- If the bucket contains fewer than M tokens, the packet is not sent; it can only go out once new tokens have been added. This caps the traffic at (or below) the rate at which tokens are added, which is how the rate limit is enforced.
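A minimal, generic token-bucket sketch (not librbd's implementation; rate and capacity are illustrative):
import time

class TokenBucket:
    """Simple token bucket: tokens accrue at `rate` per second, up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1):
        """Return True (and spend `cost` tokens) if the request may proceed now."""
        now = time.monotonic()
        # Refill, discarding tokens beyond the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # not enough tokens: the caller must queue or retry the I/O

bucket = TokenBucket(rate=1000, capacity=1000)   # roughly like the conf_rbd_qos_iops_limit=1000 example above
print(bucket.allow())   # True while tokens remain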
The token bucket flow in librbd:
- A user-issued asynchronous I/O request arrives at the Image.
- The request enters the ImageRequestWQ queue.
- The token bucket (TokenBucket) is applied as requests are dequeued from ImageRequestWQ.
- The token bucket throttles the request, which is then handed to ImageRequest for processing.
dmClock algorithm
QoS based on dmClock is still being developed through a steady stream of PRs, many of which have not yet been merged to master, so we will not discuss it in depth here; it is worth continuing to follow in the community.
dmClock is a time-tag based I/O scheduling algorithm, first proposed by VMware for centrally managed storage systems. dmClock can treat the following entities as QoS targets:
- RBD image
- Pool
- CephFS directory
- A client or a group of clients
- A data set
dmClock implements QoS control mainly through Reservation, Weight and Limit (a simplified tag sketch follows the list):
- reservation: the minimum I/O resources guaranteed to a client.
- weight: the client's proportional share of the shared I/O resources.
- limit: the maximum I/O resources a client can receive.
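As a rough illustration of the time-tag idea, here is a simplified, single-server mClock-style sketch (not Ceph's dmClock code; the refinements that make it distributed are omitted):
import time

class DmClockTags:
    """Assign reservation/weight/limit time tags to a client's requests."""
    def __init__(self, reservation, weight, limit):
        self.reservation, self.weight, self.limit = reservation, weight, limit
        self.r_tag = self.w_tag = self.l_tag = 0.0

    def tag_request(self):
        now = time.monotonic()
        # Each tag advances by the inverse of its rate, but never lags behind "now"
        self.r_tag = max(self.r_tag + 1.0 / self.reservation, now)
        self.w_tag = max(self.w_tag + 1.0 / self.weight, now)
        self.l_tag = max(self.l_tag + 1.0 / self.limit, now)
        return self.r_tag, self.w_tag, self.l_tag

# Scheduling idea: first serve requests whose reservation tag is due (<= now);
# then, among clients whose limit tag is not yet due, serve by smallest weight tag.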
Block Device Performance Testing
Baseline testing with RADOS bench
Ceph provides a built-in benchmark, RADOS bench, for testing the performance of the Ceph object store.
Syntax:
rados bench -p <pool_name> <seconds> <write|seq|rand>
- -p: the pool to test against
- seconds: how long to run the test, in seconds
- write|seq|rand: the test type: write, sequential read or random read
- -t: concurrency, default 16
- --no-cleanup: keep the temporary data that RADOS bench writes to the pool so it can be used by the read tests; by default it is deleted.
Create a test pool:
[root@ceph-node1 fio_tst]# ceph osd pool create test_pool 8 8
pool 'test_pool' created
[root@ceph-node1 fio_tst]# rados lspools
rbd_pool
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
test_pool
Write performance test:
[root@ceph-node1 ~]# rados bench -p test_pool 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph-node1_34663
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 16 0 0 0 - 0
2 16 19 3 5.99838 6 1.66988 1.45496
3 16 25 9 11.997 24 2.99072 2.19588
4 16 32 16 15.996 28 3.33757 2.62036
5 16 39 23 18.3956 28 2.37105 2.54244
6 16 45 29 19.3289 24 2.65705 2.53162
7 16 52 36 20.5669 28 1.10485 2.51232
8 16 54 38 18.9957 8 3.27297 2.55235
9 16 67 51 22.6617 52 2.22446 2.57639
10 15 69 54 21.5954 12 2.41363 2.55092
11 2 69 67 24.3585 52 2.54614 2.54768
Total time run: 11.0136
Total writes made: 69
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 25.0599
Stddev Bandwidth: 17.0752
Max bandwidth (MB/sec): 52
Min bandwidth (MB/sec): 0
Average IOPS: 6
Stddev IOPS: 4
Max IOPS: 13
Min IOPS: 0
Average Latency(s): 2.52608
Stddev Latency(s): 0.615785
Max latency(s): 4.03768
Min latency(s): 1.00465
- Average IOPS: 6
- Bandwidth (MB/sec): 25.0599
Inspect the temporary data that was written:
[root@ceph-node1 ~]# rados ls -p test_pool
benchmark_data_ceph-node1_34663_object59
benchmark_data_ceph-node1_34663_object42
...
Sequential read test:
[root@ceph-node1 ~]# rados bench -p test_pool 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 35 19 75.9741 76 0.323362 0.505157
2 16 59 43 85.9734 96 0.198388 0.543002
3 13 69 56 74.645 52 2.17551 0.543934
Total time run: 3.22507
Total reads made: 69
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 85.5795
Average IOPS: 21
Stddev IOPS: 5
Max IOPS: 24
Min IOPS: 13
Average Latency(s): 0.716294
Max latency(s): 2.17551
Min latency(s): 0.136899
- Average IOPS: 21
- Bandwidth (MB/sec): 85.5795
Random read test:
[root@ceph-node1 ~]# rados bench -p test_pool 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 108 92 367.61 368 0.0316746 0.119223
2 16 206 190 379.757 392 0.608621 0.145352
3 16 299 283 377.153 372 0.0378421 0.149943
4 15 368 353 352.859 280 0.6233 0.167492
5 16 473 457 365.474 416 0.66508 0.1645
6 16 563 547 364.547 360 0.00293503 0.166475
7 16 627 611 349.038 256 0.00289834 0.173168
8 16 748 732 365.897 484 0.0360462 0.168754
9 16 825 809 359.457 308 0.00331438 0.171356
10 16 896 880 351.908 284 0.75437 0.173779
Total time run: 10.3753
Total reads made: 897
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 345.821
Average IOPS: 86
Stddev IOPS: 17
Max IOPS: 121
Min IOPS: 64
Average Latency(s): 0.182936
Max latency(s): 1.06424
Min latency(s): 0.00266894
- Average IOPS: 86
- Bandwidth (MB/sec): 345.821
I/O testing with fio
fio is a third-party I/O testing tool that is very convenient to use on Linux. fio parameters explained:
filename=/dev/emcpowerb   test target, a filesystem file or a raw device, e.g. -filename=/dev/sda2 or -filename=/dev/sdb
direct=1   bypass the machine's buffer cache so the results are more realistic
rw=randread   random read I/O
rw=randwrite   random write I/O
rw=randrw   mixed random read and write I/O
rw=read   sequential read I/O
rw=write   sequential write I/O
rw=rw   mixed sequential read and write I/O
bs=4k   block size of a single I/O is 4k
bsrange=512-2048   same as above, but specifies a range of block sizes
size=5g   the test file size is 5 GB, written in 4k I/Os
numjobs=30   30 test threads
runtime=1000   run for 1000 seconds; if omitted, keep running until the 5 GB file has been fully written in 4k I/Os
ioengine=psync   use the psync I/O engine; to use the libaio engine, install the libaio-devel package (yum install libaio-devel)
rwmixwrite=30   in mixed read/write mode, writes account for 30%
group_reporting   aggregate the per-process results in the report
lockmem=1g   use only 1 GB of memory for the test
zero_buffers   initialize the I/O buffers with zeros
nrfiles=8   number of files generated per process
Install fio and the librbd development libraries (e.g. librbd-devel):
$ yum install -y fio "*librbd*"
Create a test block device:
[root@ceph-node1 ~]# rbd create rbd_pool/volume01 --size 10240
[root@ceph-node1 ~]# rbd ls rbd_pool
volume01
[root@ceph-node1 ~]# rbd info rbd_pool/volume01
rbd image 'volume01':
size 10 GiB in 2560 objects
order 22 (4 MiB objects)
id: 1229a6b8b4567
block_name_prefix: rbd_data.1229a6b8b4567
format: 2
features: layering
op_features:
flags:
create_timestamp: Thu Apr 25 00:11:20 2019
Write the test job file:
$ mkdir /root/fio_tst && cd /root/fio_tst
$ cat write.fio
[global]
description="write test with block size of 4M"
direct=1
ioengine=rbd
clustername=ceph
clientname=admin
pool=rbd_pool
rbdname=volume01
iodepth=32
runtime=300
rw=randrw
numjobs=1
bs=8k
[logging]
write_iops_log=write_iops_log
write_bw_log=write_bw_log
write_lat_log=write_lat_log
Run the I/O test:
[root@ceph-node1 fio_tst]# fio write.fio
logging: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=rbd, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=392KiB/s,w=280KiB/s][r=49,w=35 IOPS][eta 00m:00s]
logging: (groupid=0, jobs=1): err= 0: pid=34842: Thu Apr 25 02:41:25 2019
Description : ["write test with block size of 4M"]
read: IOPS=120, BW=965KiB/s (988kB/s)(283MiB/300063msec)
slat (nsec): min=734, max=688371, avg=4864.30, stdev=11806.74
clat (usec): min=771, max=1638.5k, avg=112142.48, stdev=118607.44
lat (usec): min=773, max=1638.5k, avg=112147.35, stdev=118607.37
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 5], 10.00th=[ 7], 20.00th=[ 20],
| 30.00th=[ 31], 40.00th=[ 48], 50.00th=[ 71], 60.00th=[ 107],
| 70.00th=[ 144], 80.00th=[ 192], 90.00th=[ 268], 95.00th=[ 342],
| 99.00th=[ 523], 99.50th=[ 617], 99.90th=[ 827], 99.95th=[ 894],
| 99.99th=[ 1133]
bw ( KiB/s): min= 4, max=10622, per=41.55%, avg=400.92, stdev=773.36, samples=36196
iops : min= 1, max= 1, avg= 1.00, stdev= 0.00, samples=36196
write: IOPS=120, BW=966KiB/s (990kB/s)(283MiB/300063msec)
slat (usec): min=2, max=1411, avg=18.63, stdev=31.60
clat (msec): min=4, max=2301, avg=152.15, stdev=137.61
lat (msec): min=4, max=2301, avg=152.17, stdev=137.61
clat percentiles (msec):
| 1.00th=[ 11], 5.00th=[ 18], 10.00th=[ 24], 20.00th=[ 40],
| 30.00th=[ 61], 40.00th=[ 86], 50.00th=[ 116], 60.00th=[ 150],
| 70.00th=[ 192], 80.00th=[ 245], 90.00th=[ 326], 95.00th=[ 405],
| 99.00th=[ 609], 99.50th=[ 735], 99.90th=[ 961], 99.95th=[ 1133],
| 99.99th=[ 2198]
bw ( KiB/s): min= 3, max= 1974, per=14.09%, avg=136.11, stdev=164.08, samples=36248
iops : min= 1, max= 1, avg= 1.00, stdev= 0.00, samples=36248
lat (usec) : 1000=0.01%
lat (msec) : 2=0.39%, 4=1.87%, 10=4.40%, 20=7.39%, 50=19.19%
lat (msec) : 100=18.12%, 250=33.10%, 500=13.85%, 750=1.35%, 1000=0.28%
lat (msec) : 2000=0.05%, >=2000=0.01%
cpu : usr=0.33%, sys=0.12%, ctx=5983, majf=0, minf=30967
IO depths : 1=0.6%, 2=1.7%, 4=5.4%, 8=17.9%, 16=68.3%, 32=6.1%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=95.2%, 8=0.7%, 16=1.3%, 32=2.8%, 64=0.0%, >=64=0.0%
issued rwt: total=36196,36248,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=965KiB/s (988kB/s), 965KiB/s-965KiB/s (988kB/s-988kB/s), io=283MiB (297MB), run=300063-300063msec
WRITE: bw=966KiB/s (990kB/s), 966KiB/s-966KiB/s (990kB/s-990kB/s), io=283MiB (297MB), run=300063-300063msec
Disk stats (read/write):
dm-0: ios=0/1336, merge=0/0, ticks=0/42239, in_queue=45656, util=1.33%, aggrios=0/1356, aggrmerge=0/87, aggrticks=0/46003, aggrin_queue=46002, aggrutil=1.39%
sda: ios=0/1356, merge=0/87, ticks=0/46003, in_queue=46002, util=1.39%
- read: IOPS=120, BW=965KiB/s (988kB/s)(283MiB/300063msec)
- write: IOPS=120, BW=966KiB/s (990kB/s)(283MiB/300063msec)
Key fields in the output:
io   total amount of I/O performed (MiB)
bw   average I/O bandwidth
iops   IOPS
runt   thread run time
slat   submission latency
clat   completion latency
lat   total latency (response time)
cpu   CPU utilization
IO depths   I/O queue depth distribution
IO submit   number of I/Os submitted per submit call
IO complete   like the submit numbers above, but for completions
IO issued   the number of read/write requests issued, and how many of them were short
IO latencies   distribution of I/O completion latencies
aggrb   aggregate group bandwidth
minb   minimum average bandwidth
maxb   maximum average bandwidth
mint   shortest thread runtime in the group
maxt   longest thread runtime in the group
ios   total number of I/Os performed by all groups
merge   total number of I/O merges
ticks   number of ticks the disk was kept busy
io_queue   total time spent on the queue
util   disk utilization
Apply a QoS IOPS limit to the block device:
[root@ceph-node1 fio_tst]# rbd image-meta set rbd_pool/volume01 conf_rbd_qos_iops_limit 50
[root@ceph-node1 fio_tst]# rbd image-meta list rbd_pool/volume01
There is 1 metadatum on this image:
Key Value
conf_rbd_qos_iops_limit 50
Run the I/O test again:
[root@ceph-node1 fio_tst]# fio write.fio
logging: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=rbd, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][r=56KiB/s,w=200KiB/s][r=7,w=25 IOPS][eta 00m:00s]
logging: (groupid=0, jobs=1): err= 0: pid=35040: Thu Apr 25 02:50:54 2019
Description : ["write test with block size of 4M"]
read: IOPS=24, BW=199KiB/s (204kB/s)(58.5MiB/300832msec)
slat (nsec): min=807, max=753733, avg=4863.82, stdev=16831.94
clat (usec): min=1311, max=1984.2k, avg=482448.41, stdev=481601.17
lat (usec): min=1313, max=1984.2k, avg=482453.27, stdev=481601.01
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 6], 20.00th=[ 8],
| 30.00th=[ 14], 40.00th=[ 23], 50.00th=[ 71], 60.00th=[ 969],
| 70.00th=[ 978], 80.00th=[ 986], 90.00th=[ 995], 95.00th=[ 995],
| 99.00th=[ 1003], 99.50th=[ 1011], 99.90th=[ 1036], 99.95th=[ 1938],
| 99.99th=[ 1989]
bw ( KiB/s): min= 4, max= 6247, per=100.00%, avg=486.86, stdev=666.26, samples=7483
iops : min= 1, max= 1, avg= 1.00, stdev= 0.00, samples=7483
write: IOPS=25, BW=200KiB/s (205kB/s)(58.8MiB/300832msec)
slat (usec): min=3, max=1427, avg=19.40, stdev=33.29
clat (msec): min=6, max=1998, avg=799.32, stdev=501.54
lat (msec): min=6, max=1998, avg=799.34, stdev=501.54
clat percentiles (msec):
| 1.00th=[ 13], 5.00th=[ 16], 10.00th=[ 20], 20.00th=[ 34],
| 30.00th=[ 961], 40.00th=[ 986], 50.00th=[ 995], 60.00th=[ 995],
| 70.00th=[ 995], 80.00th=[ 1003], 90.00th=[ 1011], 95.00th=[ 1938],
| 99.00th=[ 1989], 99.50th=[ 1989], 99.90th=[ 1989], 99.95th=[ 1989],
| 99.99th=[ 2005]
bw ( KiB/s): min= 4, max= 1335, per=50.98%, avg=101.96, stdev=184.46, samples=7522
iops : min= 1, max= 1, avg= 1.00, stdev= 0.00, samples=7522
lat (msec) : 2=0.04%, 4=1.27%, 10=11.24%, 20=11.63%, 50=11.50%
lat (msec) : 100=2.03%, 250=0.63%, 500=0.27%, 750=0.34%, 1000=48.66%
lat (msec) : 2000=12.41%
cpu : usr=0.06%, sys=0.03%, ctx=1118, majf=0, minf=10355
IO depths : 1=0.9%, 2=2.7%, 4=7.9%, 8=22.2%, 16=61.5%, 32=4.8%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=96.0%, 8=0.2%, 16=0.6%, 32=3.1%, 64=0.0%, >=64=0.0%
issued rwt: total=7483,7522,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=199KiB/s (204kB/s), 199KiB/s-199KiB/s (204kB/s-204kB/s), io=58.5MiB (61.3MB), run=300832-300832msec
WRITE: bw=200KiB/s (205kB/s), 200KiB/s-200KiB/s (205kB/s-205kB/s), io=58.8MiB (61.6MB), run=300832-300832msec
Disk stats (read/write):
dm-0: ios=11/1599, merge=0/0, ticks=68/19331, in_queue=19399, util=0.92%, aggrios=11/1555, aggrmerge=0/46, aggrticks=68/19228, aggrin_queue=19295, aggrutil=0.92%
sda: ios=11/1555, merge=0/46, ticks=68/19228, in_queue=19295, util=0.92%
- read: IOPS=24, BW=199KiB/s (204kB/s)(58.5MiB/300832msec)
- write: IOPS=25, BW=200KiB/s (205kB/s)(58.8MiB/300832msec)
Judging from these results, RBD QoS works reasonably well.