解決ceph節點因斷開SSH遠程后的造成集群網絡不穩定(節點的Mon和OSD進程自動down)的問題


故障描述:ceph節點因為斷開SSH網絡鏈接會立刻導致mon和osd守護進程自動down的問題

觀察/var/log/ceph/ceph.log的部分關鍵信息顯示如下:

2020-07-27 17:49:01.395696 mon.ceph-node1 (mon.0) 381808 : cluster [WRN] Health check
 update: Reduced data availability: 1 pg inactive, 5 pgs peering (PG_AVAILABILITY)
2020-07-27 17:49:03.369683 mon.ceph-node1 (mon.0) 381809 : cluster [INF] Health check
 cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 5 pgs peeri
ng)
2020-07-27 17:48:55.313287 mgr.ceph-node1 (mgr.6352) 266574 : cluster [DBG] pgmap v29
8025: 320 pgs: 22 active+undersized, 47 active+undersized+degraded, 9 peering, 242 ac
tive+clean; 53 GiB data, 759 GiB used, 11 TiB / 12 TiB avail; 0 B/s wr, 0 op/s; 2669/
40779 objects degraded (6.545%); 0 B/s, 0 objects/s recovering
2020-07-27 17:48:57.314405 mgr.ceph-node1 (mgr.6352) 266575 : cluster [DBG] pgmap v29
8027: 320 pgs: 44 stale+active+clean, 27 active+undersized, 51 active+undersized+degr
aded, 20 peering, 178 active+clean; 53 GiB data, 759 GiB used, 11 TiB / 12 TiB avail;
 0 B/s wr, 0 op/s; 3051/40779 objects degraded (7.482%); 0 B/s, 0 objects/s recoverin
g


2020-07-27 17:51:02.089931 mon.ceph-node1 (mon.0) 382017 : cluster [INF] Health check
 cleared: MON_DOWN (was: 1/3 mons down, quorum ceph-node1,ceph-node2)


2020-07-27 17:51:02.579862 mon.ceph-node1 (mon.0) 382026 : cluster [WRN] overall HEAL
TH_WARN 4 osds down; 1 host (4 osds) down; Long heartbeat ping times on back interfac
e seen, longest is 2171.403 msec; Long heartbeat ping times on front interface seen, 
longest is 2171.434 msec; Degraded data redundancy: 11649/40770 objects degraded (28.
572%), 190 pgs degraded, 181 pgs undersized


2020-07-27 17:52:32.565545 osd.9 (osd.9) 59 : cluster [WRN] slow request osd_op(clien
t.6400.0:370569 3.20 3:06380552:::rbd_header.172d226df4f8:head [watch unwatch cookie 
140360537903920] snapc 0=[] ondisk+write+known_if_redirected e31947) initiated 2020-0
7-27 17:52:01.830706 currently started


2020-07-27 17:55:06.335968 mon.ceph-node1 (mon.0) 382428 : cluster [WRN] Health check
 failed: 2 slow ops, oldest one blocked for 31 sec, mon.ceph-node1 has slow ops (SLOW
_OPS)

2020-07-27 17:56:03.133399 osd.8 (osd.8) 25 : cluster [WRN] Monitor daemon marked osd
.8 down, but it is still running

[WRN]
Health check update: Long heartbeat ping times on front interface seen, longest is 21297.249 msec (OSD_SLOW_PING_TIME_FRONT)

2020-07-28 10:02:39.045969
[WRN]
Health check update: Long heartbeat ping times on back interface seen, longest is 21297.238 msec (OSD_SLOW_PING_TIME_BACK)

在存在故障的節點上通過dmesg命令查看到部分的kernel的硬件信息,一般用於設備故障的診斷時使用

[root@ceph-node3 ~]# dmesg -T | tail
[Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em2: link is not ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em3: link is not ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em4: link is not ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
[Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready

對比查看其他ceph節點上的配置文件信息,發現配置參數有點不一致的問題

vim /etc/sysconfig/network-scripts/ifcfg-ib0

CONNECTED_MODE=no
TYPE=InfiniBand
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ib0
UUID=2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89
DEVICE=ib0
ONBOOT=yes
IPADDR=10.0.0.20
NETMASK=255.255.255.0
#USERS=ROOT	//多個此參數,與其他節點上有不同,於是刪除了此參數

修改后重啟network服務和NetworkManager服務,發現描述的故障已經解除。再次使用dmesg也查看不到最新的錯誤信息。USERS=ROOT這個參數的作用暫時還不明確?


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM