mds0: Many clients (191) failing to respond to cache pressure


CephFS is the main distributed file system our product depends on, but it has not been very cooperative: it frequently ran into trouble under stress testing.

Background

The problem that showed up in the cluster: mds0: Many clients (191) failing to respond to cache pressure
Setup: three nodes, 100-odd client mounts, and only about 100 MB of free memory left on the servers. Ceph reported the following:

[root@node1 ceph]# ceph -s
    cluster 1338affa-2d3d-416e-9251-4aa6e9c20eef
     health HEALTH_WARN
            mds0: Many clients (191) failing to respond to cache pressure
     monmap e1: 3 mons at {node1=192.168.0.1:6789/0,node2=192.168.0.2:6789/0,node3=192.168.0.3:6789/0}
            election epoch 22, quorum 0,1,2 node1,node2,node3
      fsmap e924: 1/1/1 up {0=node1=up:active}, 2 up:standby
     osdmap e71: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v48336: 576 pgs, 3 pools, 82382 MB data, 176 kobjects
            162 GB used, 5963 GB / 6126 GB avail
                 576 active+clean
  client io 0 B/s rd, 977 kB/s wr, 19 op/s rd, 116 op/s wr

To this day the problem has not really been solved (by which I mean I never fully worked out the capability (caps) mechanism; if instead you take the approach of "can't solve the problem, so get rid of whoever raised it", see the third section below).
The mds log looks like this:

2019-11-12 16:00:17.679876 7fa6a5040700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.236623 secs
2019-11-12 16:00:17.679914 7fa6a5040700  0 log_channel(cluster) log [WRN] : slow request 34.236623 seconds old, received at 2019-11-12 15:59:43.326917: client_request(client.154893:13683 open #1000005cb77 2019-11-12 15:59:43.293037) currently failed to xlock, waiting
2019-11-12 16:03:27.614474 7fa6a5040700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.350555 secs
2019-11-12 16:03:27.614523 7fa6a5040700  0 log_channel(cluster) log [WRN] : slow request 34.350555 seconds old, received at 2019-11-12 16:02:53.263857: client_request(client.155079:5446 open #1000003e360 2019-11-12 16:02:54.011037) currently failed to xlock, waiting
2019-11-12 16:03:57.615297 7fa6a5040700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 64.351379 secs
2019-11-12 16:03:57.615322 7fa6a5040700  0 log_channel(cluster) log [WRN] : slow request 64.351379 seconds old, received at 2019-11-12 16:02:53.263857: client_request(client.155079:5446 open #1000003e360 2019-11-12 16:02:54.011037) currently failed to xlock, waiting
2019-11-12 16:03:58.181330 7fa6a5040700  0 log_channel(cluster) log [WRN] : client.155079 isn't responding to mclientcaps(revoke), ino 1000003e360 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 64.458260 seconds ago
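The last log line is the key one: the MDS sent mclientcaps(revoke) and client.155079 sat on the capability for over 64 seconds. One blunt mitigation is to raise the MDS inode cache limit so it applies cache pressure to clients less aggressively. This is only a sketch: the option name is the Jewel-era one, and the value 300000 is an example you would need to size against your own memory budget.

```shell
# Jewel-era knob: mds_cache_size is an inode count (default 100000).
# Raising it lets the MDS hold more inodes before pressuring clients,
# at the cost of more MDS memory. Value below is an example only.
ceph tell mds.0 injectargs '--mds_cache_size=300000'

# To make the change persistent, add it to ceph.conf on the MDS nodes:
#   [mds]
#   mds cache size = 300000
```

Note this trades the warning for MDS memory use; on a box already down to 100 MB free it may just move the problem around.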

Follow-up efforts

I set up an environment of my own to reproduce the issue, using a test server with Ubuntu installed. To my surprise, no matter how many directories I mounted from the same client, there were always only the same two connections to the backend.
The reproduction still ran into a similar problem, though.

mds0: Client ubuntu:guest failing to respond to capability release

After the cluster had been left idle for a while, the following errors appeared:

[root@ceph741 ~]# ceph -s
    cluster 1338affa-2d3d-416e-9251-4aa6e9c20eef
     health HEALTH_WARN
            mds0: Client ubuntu:guest failing to respond to capability release
            mds0: Client ubuntu:guest failing to advance its oldest client/flush tid
     monmap e2: 3 mons at {ceph741=192.168.15.112:6789/0,ceph742=192.168.15.113:6789/0,ceph743=192.168.15.114:6789/0}
            election epoch 38, quorum 0,1,2 ceph741,ceph742,ceph743
      fsmap e8989: 1/1/1 up {0=ceph743=up:active}, 2 up:standby
     osdmap e67: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v847657: 576 pgs, 3 pools, 20803 MB data, 100907 objects
            44454 MB used, 241 GB / 284 GB avail
                 576 active+clean
  client io 59739 B/s rd, 3926 kB/s wr, 58 op/s rd, 770 op/s wr
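Before reaching for eviction, a gentler client-side step can sometimes clear both warnings by letting the client give its capabilities back. A hedged sketch, to be run on the misbehaving client; the mount point, monitor address, and credentials below are made-up examples, substitute your own:

```shell
# Flush dirty data, then ask the kernel to drop clean dentries/inodes
# so the CephFS client can release the capabilities it is holding.
sync
echo 2 > /proc/sys/vm/drop_caches

# If the warning persists, lazy-unmount and remount the filesystem
# (example mount point and credentials only).
umount -l /mnt/cephfs
mount -t ceph 192.168.15.112:6789:/ /mnt/cephfs -o name=guest,secretfile=/etc/ceph/guest.secret
```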

A temporary workaround

The temporary workaround is simply to kick out the offending client.
The main commands are:

ceph tell mds.0 session ls
ceph tell mds.0 session evict id=249632

Here id is the id of the problem client. As for what makes the problem client different from the others, honestly, I don't know either; you can have a look for yourselves:
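To decide which session to evict, one practical heuristic is to sort the session ls output by how many caps each client holds. A minimal sketch; the JSON below is a made-up, abridged sample, and the field names (id, num_caps, client_metadata) follow the Jewel-era output but should be checked against your version:

```python
import json

# Made-up, abridged sample of `ceph tell mds.0 session ls` output.
sample = """
[
  {"id": 154893, "num_caps": 12040, "client_metadata": {"hostname": "node-a"}},
  {"id": 155079, "num_caps": 98231, "client_metadata": {"hostname": "ubuntu"}},
  {"id": 155101, "num_caps": 310,   "client_metadata": {"hostname": "node-b"}}
]
"""

sessions = json.loads(sample)

# Sessions holding the most capabilities are the usual suspects when
# the MDS complains about cache pressure or capability release.
for s in sorted(sessions, key=lambda s: s["num_caps"], reverse=True):
    print(s["id"], s["client_metadata"]["hostname"], s["num_caps"])
```

The id on the first line printed would then be the one to pass to `ceph tell mds.0 session evict id=...`.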

References:
https://www.jianshu.com/p/d1e0e32346ac
http://www.talkwithtrend.com/Article/242905
https://www.jianshu.com/p/fa49e40f6133

