Ceph HA Distributed Storage Cluster 04: Handling the CephFS "filesystem is degraded" Fault


[root@ceph-osd-122 ~]# ceph status
  cluster:
    id:     f7e2d72a-5ef0-4f45-a866-119de1046594
    health: HEALTH_WARN
            1 filesystem is degraded
            2 MDSs report oversized cache
            1 MDSs report slow requests
            1 filesystem is online with fewer MDS than max_mds
            1 nearfull osd(s)
            17 pool(s) nearfull
            Long heartbeat ping times on back interface seen, longest is 8915.301 msec
            Long heartbeat ping times on front interface seen, longest is 8646.933 msec
            Degraded data redundancy: 97087/326349843 objects degraded (0.030%), 1092 pgs degraded, 1048 pgs undersized
            4 daemons have recently crashed
            1 slow ops, oldest one blocked for 1916198 sec, mon.ceph-osd-140 has slow ops
  services:
    mon: 3 daemons, quorum ceph-osd-128,ceph-osd-140,ceph-mon-121 (age 3h)
    mgr: ceph-mon-121(active, since 2d), standbys: ceph-osd-124, ceph-osd-128
    mds: 5 fs (degraded: test_data_pool:1/1) 4 up:standby-replay 8 up:standby 1 up:rejoin 4 up:active
    osd: 269 osds: 269 up (since 111m), 269 in (since 111m); 8383 remapped pgs
    rgw: 3 daemons active (ceph-mon-121, ceph-osd-128, ceph-osd-136)
  data:
    pools:   17 pools, 22272 pgs
    objects: 161.13M objects, 485 TiB
    usage:   984 TiB used, 974 TiB / 1.9 PiB avail
    pgs:     97087/326349843 objects degraded (0.030%)
             83991926/326349843 objects misplaced (25.737%)
             13828 active+clean
             7209  active+remapped+backfill_wait
             1022  active+recovery_wait+undersized+degraded+remapped
             74    active+recovery_wait+remapped
             59    active+recovery_wait+degraded
             42    active+remapped+backfilling
             25    active+recovering+undersized+remapped
             10    active+recovery_wait+degraded+remapped
             1     active+recovery_wait
             1     active+undersized+degraded+remapped+backfill_wait
             1     active+recovering
  io:
    client:   397 MiB/s rd, 83 MiB/s wr, 207 op/s rd, 33 op/s wr
    recovery: 5.2 MiB/s, 26 keys/s, 5 objects/s
 
Solution
Run the ceph health detail command to identify which MDS node is causing the filesystem fault.
After confirming that a standby node is available, restart or stop the affected MDS service (see the sketch after the command below).
Example command:
systemctl stop ceph-mds@ceph-osd-209.service
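A minimal sketch of the full check-then-restart sequence, assuming the stuck daemon is the ceph-osd-209 one from the command above (adjust the unit name to whatever ceph health detail reports):

# Find the MDS stuck in a recovery state (e.g. up:rejoin)
ceph health detail
ceph fs status test_data_pool
# Confirm at least one standby exists before touching the stuck daemon
ceph mds stat
# On the affected node, restart the MDS so a standby can take over the rank
systemctl restart ceph-mds@ceph-osd-209.service
# Watch the rank walk through replay -> reconnect -> rejoin -> active
ceph fs get test_data_pool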
 
Check the result (output of ceph fs get test_data_pool):
Filesystem 'test_data_pool' (3)
fs_name test_data_pool
epoch   464216
flags   32
created 2020-10-20 12:44:14.557513
modified        2020-12-11 15:52:15.077549
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  146881
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=53700}
failed
damaged
stopped 1
data_pools      [12]
metadata_pool   13
inline_data     disabled
balancer
standby_count_wanted    1
53700:  [v2:10.2.36.136:6800/376801554,v1:10.2.36.136:6801/376801554] 'ceph-osd-136' mds.0.463578 up:active seq 2105
53712:  [v2:10.2.36.134:6812/2816905550,v1:10.2.36.134:6813/2816905550] 'ceph-osd-134' mds.0.0 up:standby-replay seq 2
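In the output above, rank 0 is back up:active on ceph-osd-136, with ceph-osd-134 following it in standby-replay, so the rank has recovered. As a quick confirmation (a sketch, not from the original session), the degraded warning should be gone:

ceph health detail | grep -iE 'mds|degraded'   # the degraded-filesystem warning should no longer appear
ceph fs status test_data_pool                  # rank 0 active, with a standby-replay follower

The MDS log on ceph-osd-136 below shows the full recovery sequence: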
# tail -f /var/log/ceph/ceph-mds.ceph-osd-136.log
  log_file /var/log/ceph/ceph-mds.ceph-osd-136.log
2020-12-11 13:32:05.852 7f82d818c1c0  0 ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process ceph-mds, pid 76582
2020-12-11 13:32:05.852 7f82d818c1c0  0 pidfile_write: ignore empty --pid-file
2020-12-11 13:32:05.915 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 462957 from mon.0
2020-12-11 13:32:06.510 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 462958 from mon.0
2020-12-11 13:32:10.735 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 462959 from mon.0
2020-12-11 13:32:10.735 7f82c5f7f700  1 mds.ceph-osd-136 Map has assigned me to become a standby
2020-12-11 14:31:58.925 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463578 from mon.0
2020-12-11 14:31:58.929 7f82c5f7f700  1 mds.0.463578 handle_mds_map i am now mds.0.463578
2020-12-11 14:31:58.929 7f82c5f7f700  1 mds.0.463578 handle_mds_map state change up:boot --> up:replay
2020-12-11 14:31:58.929 7f82c5f7f700  1 mds.0.463578 replay_start
2020-12-11 14:31:58.929 7f82c5f7f700  1 mds.0.463578  recovery set is
2020-12-11 14:31:58.929 7f82c5f7f700  1 mds.0.463578  waiting for osdmap 146881 (which blacklists prior instance)
2020-12-11 14:31:58.965 7f82bef71700  0 mds.0.cache creating system inode with ino:0x100
2020-12-11 14:31:58.965 7f82bef71700  0 mds.0.cache creating system inode with ino:0x1
2020-12-11 14:32:08.230 7f82bdf6f700  1 mds.0.463578 Finished replaying journal
2020-12-11 14:32:08.230 7f82bdf6f700  1 mds.0.463578 making mds journal writeable
2020-12-11 14:32:08.981 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463580 from mon.0
2020-12-11 14:32:08.981 7f82c5f7f700  1 mds.0.463578 handle_mds_map i am now mds.0.463578
2020-12-11 14:32:08.981 7f82c5f7f700  1 mds.0.463578 handle_mds_map state change up:replay --> up:reconnect
2020-12-11 14:32:08.981 7f82c5f7f700  1 mds.0.463578 reconnect_start
2020-12-11 14:32:08.981 7f82c5f7f700  1 mds.0.463578 reopen_log
2020-12-11 14:32:08.981 7f82c5f7f700  1 mds.0.server reconnect_clients -- 4 sessions
2020-12-11 14:32:08.987 7f82c5f7f700  0 log_channel(cluster) log [DBG] : reconnect by client.53865 v1:10.2.111.122:0/1576025043 after 0.00500009
2020-12-11 14:32:08.988 7f82c5f7f700  0 log_channel(cluster) log [DBG] : reconnect by client.63815 v1:10.2.111.41:0/3853651453 after 0.00700012
2020-12-11 14:32:08.988 7f82c5f7f700  0 log_channel(cluster) log [DBG] : reconnect by client.63812 v1:10.2.110.32:0/251902561 after 0.00700012
2020-12-11 14:32:08.988 7f82c5f7f700  0 log_channel(cluster) log [DBG] : reconnect by client.82333 v1:10.2.111.137:0/3749474228 after 0.00700012
2020-12-11 14:32:08.989 7f82c5f7f700  1 mds.0.463578 reconnect_done
2020-12-11 14:32:09.984 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463581 from mon.0
2020-12-11 14:32:09.984 7f82c5f7f700  1 mds.0.463578 handle_mds_map i am now mds.0.463578
2020-12-11 14:32:09.984 7f82c5f7f700  1 mds.0.463578 handle_mds_map state change up:reconnect --> up:rejoin
2020-12-11 14:32:09.984 7f82c5f7f700  1 mds.0.463578 rejoin_start
2020-12-11 14:32:09.993 7f82c5f7f700  1 mds.0.463578 rejoin_joint_start
2020-12-11 14:32:13.828 7f82bf772700  1 mds.0.463578 rejoin_done
2020-12-11 14:32:13.942 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463583 from mon.0
2020-12-11 14:32:18.268 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463585 from mon.0
2020-12-11 14:32:18.268 7f82c5f7f700  1 mds.0.463578 handle_mds_map i am now mds.0.463578
2020-12-11 14:32:18.268 7f82c5f7f700  1 mds.0.463578 handle_mds_map state change up:rejoin --> up:active
2020-12-11 14:32:18.268 7f82c5f7f700  1 mds.0.463578 recovery_done -- successful recovery!
2020-12-11 14:32:18.270 7f82c5f7f700  1 mds.0.463578 active_start
2020-12-11 14:32:18.767 7f82c5f7f700  1 mds.0.463578 cluster recovered.
2020-12-11 14:32:18.767 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463586 from mon.0
2020-12-11 14:32:22.884 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463588 from mon.0
2020-12-11 14:32:30.003 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463590 from mon.0
2020-12-11 14:32:34.037 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463592 from mon.0
2020-12-11 14:32:38.274 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463594 from mon.0
2020-12-11 14:32:46.071 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463596 from mon.0
2020-12-11 14:32:50.058 7f82c5f7f700  1 mds.ceph-osd-136 Updating MDS map to version 463598 from mon.0
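The log walks through the normal MDS recovery state machine: up:boot -> up:replay -> up:reconnect -> up:rejoin -> up:active. To follow only the state transitions during a failover like this, filtering the log works well (a sketch; the path matches the node above):

grep 'state change' /var/log/ceph/ceph-mds.ceph-osd-136.log | tail -n 20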
 
Summary: For a variety of reasons, CephFS was not very stable for us, so we later switched to Ceph object storage and stopped using CephFS. For an ops team, being able to sleep through the night matters most.
 
Author: Dexter_Wang   Role: senior cloud computing and storage engineer at an Internet company   Contact: 993852246@qq.com

