After a pool was created in a Ceph cluster, the cluster status changed to HEALTH_WARN, with the details shown below.
Check the cluster information
List the pools
[root@serverc ~]# ceph osd pool ls
images    # only one pool in the cluster
[root@serverc ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1       0.13129 root default
-5       0.04376     host serverc
 2   hdd 0.01459         osd.2        up  1.00000 1.00000   # all 9 OSDs are up and in
 3   hdd 0.01459         osd.3        up  1.00000 1.00000
 7   hdd 0.01459         osd.7        up  1.00000 1.00000
-3       0.04376     host serverd
 0   hdd 0.01459         osd.0        up  1.00000 1.00000
 5   hdd 0.01459         osd.5        up  1.00000 1.00000
 6   hdd 0.01459         osd.6        up  1.00000 1.00000
-7       0.04376     host servere
 1   hdd 0.01459         osd.1        up  1.00000 1.00000
 4   hdd 0.01459         osd.4        up  1.00000 1.00000
 8   hdd 0.01459         osd.8        up  1.00000 1.00000
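Before reproducing the problem, it can also help to pull up the exact health message and the per-pool parameters. A minimal check, using only standard ceph subcommands:

ceph health detail        # full text of every active health warning
ceph osd pool ls detail   # per-pool settings, including pg_num, pgp_num and size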
Reproduce the error
[root@serverc ~]# ceph osd pool create images 64 64
[root@serverc ~]# ceph osd pool application enable images rbd
[root@serverc ~]# ceph -s
cluster:
    id:     04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            too few PGs per OSD (21 < min 30)

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 64 pgs
    objects: 8 objects, 12418 kB
    usage:   1005 MB used, 133 GB / 134 GB avail
    pgs:     64 active+clean
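To see where the 21-PGs-per-OSD figure comes from, the pool settings can be queried directly. These are standard `ceph osd pool get` calls; the values in the comments are what this cluster should report, given that the pool was created with 64/64 and uses 3 replicas:

ceph osd pool get images pg_num     # pg_num: 64
ceph osd pool get images pgp_num    # pgp_num: 64
ceph osd pool get images size       # size: 3 (replica count)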
[root@serverc ~]# ceph pg dump
dumped all
version 1334
stamp 2019-03-29 22:21:41.795511
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
1.3f 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.871318 0'0 33:41 [7,1,0] 7 [7,1,0] 7 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.3e 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.867341 0'0 33:41 [4,5,7] 4 [4,5,7] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.3d 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.871213 0'0 33:41 [0,3,1] 0 [0,3,1] 0 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.3c 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.859216 0'0 33:41 [5,7,1] 5 [5,7,1] 5 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.3b 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.870865 0'0 33:41 [0,8,7] 0 [0,8,7] 0 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.3a 2 0 0 0 0 19 17 17 active+clean 2019-03-29 22:17:34.858977 33'17 33:117 [4,6,7] 4 [4,6,7] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.39 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.871027 0'0 33:41 [0,3,4] 0 [0,3,4] 0 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.38 1 0 0 0 0 16 1 1 active+clean 2019-03-29 22:17:34.861985 30'1 33:48 [4,2,5] 4 [4,2,5] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.37 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.861667 0'0 33:41 [6,7,1] 6 [6,7,1] 6 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.36 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860382 0'0 33:41 [6,3,1] 6 [6,3,1] 6 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.35 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860407 0'0 33:41 [8,6,2] 8 [8,6,2] 8 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.34 0 0 0 0 0 0 2 2 active+clean 2019-03-29 22:17:34.861874 32'2 33:44 [4,3,0] 4 [4,3,0] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.33 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860929 0'0 33:41 [4,6,2] 4 [4,6,2] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
1.32 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860589 0'0 33:41 [4,2,6] 4 [4,2,6] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
…………
1   8 0 0 0 0 12716137 78 78
sum 8 0 0 0 0 12716137 78 78
OSD_STAT USED  AVAIL  TOTAL  HB_PEERS          PG_SUM PRIMARY_PG_SUM
8        119M  15229M 15348M [0,1,2,3,4,5,6,7] 22     6
7        119M  15229M 15348M [0,1,2,3,4,5,6,8] 22     9
6        119M  15229M 15348M [0,1,2,3,4,5,7,8] 23     5
5        107M  15241M 15348M [0,1,2,3,4,6,7,8] 18     7
4        107M  15241M 15348M [0,1,2,3,5,6,7,8] 18     9
3        107M  15241M 15348M [0,1,2,4,5,6,7,8] 23     6
2        107M  15241M 15348M [0,1,3,4,5,6,7,8] 19     6
1        107M  15241M 15348M [0,2,3,4,5,6,7,8] 24     8
0        107M  15241M 15348M [1,2,3,4,5,6,7,8] 23     8
sum      1005M 133G   134G
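The per-OSD PG count (the PG_SUM column above) can also be read more directly from `ceph osd df`, which prints a PGS column for every OSD. This is a standard subcommand in Luminous and later, shown here only as a convenience:

ceph osd df    # the PGS column should show roughly 18-24 PGs per OSD here, all below 30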
The warning says that the number of PGs per OSD (21) is below the minimum of 30. This is because the pool was created with pg_num and pgp_num set to 64; with a 3-replica configuration spread over 9 OSDs, each OSD ends up with roughly 64 × 3 / 9 ≈ 21 PGs, which is below the minimum of 30 and triggers the warning above. The pg dump confirms this: the PG count (PG_SUM) on every OSD is below 30.
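The same arithmetic can be used to pick a pg_num that clears the warning. A minimal sketch in shell, assuming 9 OSDs, 3 replicas and the default warning threshold of 30 PGs per OSD (mon_pg_warn_min_per_osd); the variable names are only for illustration:

osds=9; replicas=3; pg_num=64
echo $(( pg_num * replicas / osds ))   # 21  -> below the warning threshold of 30

pg_num=128                             # next power of two
echo $(( pg_num * replicas / osds ))   # 42  -> above the threshold, warning clears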
If data is written to and read from the cluster while it is in this state, the cluster can appear stuck and fail to respond to I/O, and it can even lead to large numbers of OSDs being marked down.
Solution
Increase the pool's pg_num
[root@serverc ~]# ceph osd pool set images pg_num 128
set pool 1 pg_num to 128
[root@serverc ~]# ceph -s
cluster:
    id:     04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            Reduced data availability: 21 pgs peering
            Degraded data redundancy: 21 pgs unclean
            1 pools have pg_num > pgp_num
            too few PGs per OSD (21 < min 30)

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage:   1005 MB used, 133 GB / 134 GB avail
    pgs:     50.000% pgs unknown
             16.406% pgs not active
             64 unknown
             43 active+clean
             21 peering
The "too few PGs per OSD" warning is still shown, and a new warning has appeared: 1 pools have pg_num > pgp_num, because only pg_num has been raised so far.
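The mismatch behind the pg_num > pgp_num warning can be confirmed directly from the pool settings (standard `ceph osd pool get` calls):

ceph osd pool get images pg_num     # now 128
ceph osd pool get images pgp_num    # still 64 until it is raised as well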
Next, raise pgp_num to match
[root@serverc ~]# ceph osd pool set images pgp_num 128
set pool 1 pgp_num to 128
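Before looking at the overall cluster status, it may be worth confirming that both values now match; `ceph osd pool ls detail` prints pg_num and pgp_num for every pool (the grep is only a convenience filter):

ceph osd pool ls detail | grep "'images'"   # should show pg_num 128 pgp_num 128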
Check the status
[root@serverc ~]# ceph -s
cluster:
    id:     04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            Reduced data availability: 7 pgs peering
            Degraded data redundancy: 24 pgs unclean, 2 pgs degraded

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage:   1005 MB used, 133 GB / 134 GB avail
    pgs:     24.219% pgs not active   # PG states: the data is rebalancing (see part 3 of https://www.cnblogs.com/zyxnhr/p/10616497.html for what each state means)
             97 active+clean
             20 activating
             9  peering
             2  activating+degraded

[root@serverc ~]# ceph -s
  cluster:
    id:     04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            Reduced data availability: 7 pgs peering
            Degraded data redundancy: 3/24 objects degraded (12.500%), 33 pgs unclean, 4 pgs degraded

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage:   1005 MB used, 133 GB / 134 GB avail
    pgs:     35.938% pgs not active
             3/24 objects degraded (12.500%)
             79 active+clean
             34 activating
             9  peering
             3  activating+degraded
             2  active+clean+snaptrim
             1  active+recovery_wait+degraded

  io:
    recovery: 1 B/s, 0 objects/s

[root@serverc ~]# ceph -s
  cluster:
    id:     04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage:   1050 MB used, 133 GB / 134 GB avail
    pgs:     128 active+clean

  io:
    recovery: 1023 kB/s, 0 keys/s, 0 objects/s

[root@serverc ~]# ceph -s
  cluster:
    id:     04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_OK   # rebalancing finished, the cluster state is back to normal

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage:   1016 MB used, 133 GB / 134 GB avail
    pgs:     128 active+clean

  io:
    recovery: 778 kB/s, 0 keys/s, 0 objects/s
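Instead of rerunning ceph -s by hand while the PGs peer, activate and recover, the transition can be followed continuously; both commands below are standard:

watch -n 2 ceph -s    # refresh the status summary every 2 seconds
ceph -w               # or stream cluster events until all 128 PGs are active+clean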
Note: this is a lab environment and the pool holds no data, so changing the PG count has little impact here. In a production environment, however, changing pg_num at this point has a much larger impact: when the PG count changes, data across the whole cluster is rebalanced and migrated, and the more data there is, the longer I/O is affected. See https://www.cnblogs.com/zyxnhr/p/10543814.html for a detailed explanation of the PG state values. In production, if the change must not disturb the business, every aspect needs to be considered in advance, for example when recovery is allowed to run and at what time pg_num/pgp_num is changed; one cautious, stepwise approach is sketched below.
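As one possible way to limit the impact in production, the increase can be applied in small steps, waiting for the cluster to settle between steps. A minimal bash sketch, not taken from the original article; the pool name, step size and target are placeholders, and only standard ceph subcommands (osd pool get/set, health) are used:

#!/bin/bash
# Grow pg_num/pgp_num of a pool gradually instead of in one jump.
pool=images        # placeholder pool name
target=128         # desired pg_num/pgp_num
step=32            # how many PGs to add per iteration

current=$(ceph osd pool get "$pool" pg_num | awk '{print $2}')
while [ "$current" -lt "$target" ]; do
    next=$(( current + step ))
    [ "$next" -gt "$target" ] && next=$target
    ceph osd pool set "$pool" pg_num  "$next"
    ceph osd pool set "$pool" pgp_num "$next"
    # wait until the new PGs are created and the cluster is healthy again
    until ceph health | grep -q HEALTH_OK; do
        sleep 30
    done
    current=$next
done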
References: