Tags: ceph, ceph operations, pg
Cluster environment:
[root@node3 ~]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@node3 ~]# ceph -v
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
Current cluster layout:
[root@node3 ceph-6]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.08844 root default
-3 0.02948 host node1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
3 hdd 0.00980 osd.3 up 1.00000 1.00000
-5 0.02948 host node2
1 hdd 0.00980 osd.1 up 1.00000 1.00000
4 hdd 0.00980 osd.4 up 1.00000 1.00000
-7 0.02948 host node3
2 hdd 0.00980 osd.2 up 1.00000 1.00000
5 hdd 0.00980 osd.5 up 1.00000 1.00000
Add one more OSD to each host:
To reproduce the "too few PGs" error, and to demonstrate creating an OSD with an explicitly specified data location, the steps below create a BlueStore OSD with its data on /dev/sdd2 and its block.db on /dev/sdd1. Run the following on each host.
Step 1: prepare a BlueStore OSD:
[root@node2 ~]# ceph-disk prepare --bluestore /dev/sdd2 --block.db /dev/sdd1
set_data_partition: incorrect partition UUID: cafecafe-9b03-4f30-b4c6-b4b80ceff106, expected ['4fbd7e29-9d25-41b8-afd0-5ec00ceff05d', '4fbd7e29-9d25-41b8-afd0-062c0ceff05d', '4fbd7e29-8ae0-4982-bf9d-5a8d867af560', '4fbd7e29-9d25-41b8-afd0-35865ceff05d']
prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
prepare_device: Block.db /dev/sdd1 was not prepared with ceph-disk. Symlinking directly.
meta-data=/dev/sdd2 isize=2048 agcount=4, agsize=648895 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=2595579, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
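Before activating, it can be worth confirming how ceph-disk laid out the partitions. A quick check, not part of the original session:
ceph-disk list    # lists each partition and its role (data, block.db, journal, ...)
lsblk /dev/sdd    # confirms the partition layout on the disk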
Step 2: activate the OSD:
[root@node2 ~]# ceph-disk activate /dev/sdd2
creating /var/lib/ceph/tmp/mnt.mR3qCJ/keyring
added entity osd.8 auth auth(auid = 18446744073709551615 key=AQBNqOVZt/iUBBAArkrWrZi9N0zxhHhYfhanyw== with 0 caps)
got monmap epoch 1
Removed symlink /etc/systemd/system/ceph-osd.target.wants/ceph-osd@8.service.
Created symlink from /etc/systemd/system/ceph-osd.target.wants/ceph-osd@8.service to /usr/lib/systemd/system/ceph-osd@.service.
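Once activation completes, the new daemon should be running and registered as a BlueStore OSD. A quick sanity check (the OSD id 8 is taken from the output above):
systemctl is-active ceph-osd@8               # expect: active
ceph osd metadata 8 | grep osd_objectstore   # expect: "osd_objectstore": "bluestore"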
Finally, check the cluster layout again; there are now nine OSDs in total:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.08844 root default
-3 0.02948 host node1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
3 hdd 0.00980 osd.3 up 1.00000 1.00000
7 hdd 0.00989 osd.7 up 1.00000 1.00000
-5 0.02948 host node2
1 hdd 0.00980 osd.1 up 1.00000 1.00000
4 hdd 0.00980 osd.4 up 1.00000 1.00000
8 hdd 0.00989 osd.8 up 1.00000 1.00000
-7 0.02948 host node3
2 hdd 0.00980 osd.2 up 1.00000 1.00000
5 hdd 0.00980 osd.5 up 1.00000 1.00000
6 hdd 0.00989 osd.6 up 1.00000 1.00000
Reproducing the "too few PGs" error:
Create a pool with a small pg_num:
[root@node3 ~]# ceph osd pool create rbd 64 64
pool 'rbd' created
[root@node3 ~]# rados lspools
rbd
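The pool's settings can be inspected right after creation; ceph osd pool ls detail (not part of the original session) shows pg_num, pgp_num, and the replica size, all of which matter for the calculation below:
ceph osd pool ls detail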
[root@node3 ~]# ceph -s
cluster:
id: b8b4aa68-d825-43e9-a60a-781c92fec20e
health: HEALTH_WARN
too few PGs per OSD (21 < min 30)
services:
mon: 1 daemons, quorum node1
mgr: node1(active)
osd: 9 osds: 9 up, 9 in
data:
pools: 1 pools, 64 pgs
objects: 0 objects, 0 bytes
usage: 9742 MB used, 82717 MB / 92459 MB avail
pgs: 64 active+clean
As the output shows, the warning means the PG count per OSD (21) is below the minimum of 30. The pool has 64 PGs and the replica size is 3, so with 9 OSDs each OSD holds on average 64 × 3 / 9 ≈ 21 PGs, which falls below the minimum of 30 and triggers the warning; the sketch below reproduces the arithmetic.
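The threshold itself is the monitor option mon_pg_warn_min_per_osd (default 30 in Luminous), readable through the admin socket on the monitor host. A quick check, not part of the original session:
# PGs per OSD = pg_num * replica size / OSD count
echo $(( 64 * 3 / 9 ))    # prints 21, matching the warning
# Read the warning threshold on the monitor host (node1 here):
ceph daemon mon.node1 config get mon_pg_warn_min_per_osd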
If data is written and operated on while the cluster is in this state, the cluster can appear stuck and stop responding to I/O, and large numbers of OSDs may go down.
Fix: increase pg_num on the rbd pool.
[root@node3 ~]# ceph osd pool set rbd pg_num 128
set pool 1 pg_num to 128
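Note that in this release pg_num can only be increased, never decreased (PG merging only arrived in later releases, with Nautilus). The new value can be verified right away:
ceph osd pool get rbd pg_num    # expect: pg_num: 128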
Then check the cluster status:
[root@node3 ~]# ceph -s
cluster:
id: b8b4aa68-d825-43e9-a60a-781c92fec20e
health: HEALTH_WARN
Reduced data availability: 5 pgs inactive, 44 pgs peering
Degraded data redundancy: 49 pgs unclean
1 pools have pg_num > pgp_num
services:
mon: 1 daemons, quorum node1
mgr: node1(active)
osd: 9 osds: 9 up, 9 in
data:
pools: 1 pools, 128 pgs
objects: 0 objects, 0 bytes
usage: 9743 MB used, 82716 MB / 92459 MB avail
pgs: 7.031% pgs unknown
38.281% pgs not active
70 active+clean
44 peering
9 unknown
5 activating
The cluster is still not healthy: it warns that pg_num is greater than pgp_num, so pgp_num must be raised as well. pg_num is the number of PGs the pool is split into, while pgp_num is the number of PGs CRUSH actually uses when placing data; the two should normally be kept equal.
[root@node3 ~]# ceph osd pool set rbd pgp_num 128
set pool 1 pgp_num to 128
Check the cluster status once more:
[root@node3 ~]# ceph -s
cluster:
id: b8b4aa68-d825-43e9-a60a-781c92fec20e
health: HEALTH_OK
services:
mon: 1 daemons, quorum node1
mgr: node1(active)
osd: 9 osds: 9 up, 9 in
data:
pools: 1 pools, 128 pgs
objects: 0 objects, 0 bytes
usage: 9750 MB used, 82709 MB / 92459 MB avail
pgs: 128 active+clean
This was a simple experiment and the pool held no data, so changing the PG count had little impact. On a production cluster, however, raising the PG count later is far more disruptive: whenever pg_num changes, data across the whole cluster is rebalanced and migrated, and the more data there is, the longer I/O will suffer. It is therefore best to choose an appropriate PG count when the pool is first created.
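A widely used rule of thumb (the pgcalc formula) is total PGs ≈ (OSD count × 100) / replica size, rounded to a nearby power of two; for this 9-OSD, 3-replica cluster that gives 9 × 100 / 3 = 300, i.e. 256. The defaults that new pools pick up can also be pinned in ceph.conf; a minimal sketch, with the numbers above as assumptions:
# /etc/ceph/ceph.conf, under [global]:
osd pool default pg num = 256
osd pool default pgp num = 256
Alternatively, pass the values explicitly at creation time, e.g. ceph osd pool create <pool-name> 256 256.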
