Ceph MDS Problem Analysis
1. Problem Background
1.1 Client cache problem
$ ceph -s
health HEALTH_WARN
mds0: Client xxx-online00.gz01 failing to respond to cache pressure
Official explanation
Message: "Client name failing to respond to cache pressure"
Code: MDS_HEALTH_CLIENT_RECALL, MDS_HEALTH_CLIENT_RECALL_MANY
Description:
Clients maintain their own metadata caches. Entries in a client's cache (such as inodes) also exist in the MDS cache, so when the MDS needs to shrink its cache (to stay below mds_cache_size), it sends messages to clients asking them to shrink theirs as well. If a client is unresponsive or buggy, it prevents the MDS from keeping its cache below mds_cache_size, and the MDS may eventually run out of memory and crash. This message appears when a client takes longer than mds_recall_state_timeout (default: 60s) to respond.
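The two thresholds involved can be read from the MDS admin socket (a quick sketch; mds.<name> is a placeholder for your daemon name):
# Cache size limit (an entry count in these pre-Luminous versions)
$ ceph daemon mds.<name> config get mds_cache_size
# How long the MDS waits for a client to respond to a recall request
$ ceph daemon mds.<name> config get mds_recall_state_timeout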
1.2 Server-side memory not released
Same issue; see 1.1 Client cache problem above.
1.3 Too many inodes in MDS sessions
A client session holding too many inodes drives MDS memory usage up, which in turn makes loading inodes during an active/standby MDS failover slow, seriously hurting service availability.
1.4 MDS hangs and slow requests
- Clients recursively searching and traversing files (uncontrollable)
- Sessions holding too many inodes, driving MDS load up
- Overly verbose log levels, driving MDS load up
2. Analysis Approach
The problems above are interrelated and influence one another, so we start from what is known and dig in step by step to analyze, optimize, and resolve them.
2.1 Component communication diagram

- Client <-> MDS: metadata operations and capabilities
- Client <-> OSD: data I/O
- Client <-> Monitor: authentication, cluster maps, etc.
- MDS <-> Monitor: heartbeats, cluster maps, etc.
- MDS <-> OSD: metadata I/O
- Monitor <-> OSD: heartbeats, cluster maps, etc.
2.2 Inspecting client sessions
$ ceph --admin-daemon /var/run/ceph/ceph-mds.ceph-xxxx-osd02.py.asok session ls
[
{
"id": 5122511,
"num_leases": 0,
"num_caps": 655,
"state": "open",
"replay_requests": 0,
"completed_requests": 1,
"reconnecting": false,
"inst": "client.5122511 192.168.1.2:0\/2026289820",
"client_metadata": {
"ceph_sha1": "b1e0532418e4631af01acbc0cedd426f1905f4af",
"ceph_version": "ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)",
"entity_id": "log_xxx_cephfs",
"hostname": "ceph-test-osd02",
"mount_point": "\/mnt\/log"
}
}
]
Field descriptions:
- id: unique client id
- num_caps: number of caps the client holds
- inst: the client's ip/port connection info
- ceph_version: the client's ceph-fuse version (kernel_version when a kernel client is used)
- hostname: the client's hostname
- mount_point: the client's mount point on that host
- pid: pid of the client's ceph-fuse process
2.3 Counting a client's inodes
Tracing the code shows that num_caps in the session output is effectively a count of the client's inodes; a rough tally put the number of open inodes at about 4 million.
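For reference, a one-liner like the following can total num_caps across all sessions (a sketch assuming jq is available; the daemon name is a placeholder):
# Sum num_caps over every client session to estimate cached inode count
$ ceph daemon mds.<name> session ls | jq '[.[].num_caps] | add'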

Summary:
- Client session info can be inspected, including host, mount point, inodes, etc.
- The inode counts of all client sessions can be totalled.
2.4 Testing MDS active/standby failover
2.4.1 Execution log
2018-04-27 19:24:03.923349 7f53015d7700 1 mds.0.2738 handle_mds_map state change up:boot --> up:replay
2018-04-27 19:24:03.923356 7f53015d7700 1 mds.0.2738 replay_start
2018-04-27 19:24:03.923360 7f53015d7700 1 mds.0.2738 recovery set is
2018-04-27 19:24:03.923365 7f53015d7700 1 mds.0.2738 waiting for osdmap 6339 (which blacklists prior instance)
2018-04-27 19:24:03.948526 7f52fc2ca700 0 mds.0.cache creating system inode with ino:100
2018-04-27 19:24:03.948675 7f52fc2ca700 0 mds.0.cache creating system inode with ino:1
2018-04-27 19:24:04.238128 7f52fa2b8700 1 mds.0.2738 replay_done
2018-04-27 19:24:04.238143 7f52fa2b8700 1 mds.0.2738 making mds journal writeable
2018-04-27 19:24:04.924352 7f53015d7700 1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:24:04.924357 7f53015d7700 1 mds.0.2738 handle_mds_map state change up:replay --> up:reconnect
2018-04-27 19:24:04.924370 7f53015d7700 1 mds.0.2738 reconnect_start
2018-04-27 19:24:04.924371 7f53015d7700 1 mds.0.2738 reopen_log
2018-04-27 19:24:04.924380 7f53015d7700 1 mds.0.server reconnect_clients -- 19 sessions
2018-04-27 19:24:04.926357 7f53015d7700 0 log_channel(cluster) log [DBG] : reconnect by client.4375 192.168.1.3:0/1796553051 after 0.001950
2018-04-27 19:24:04.926429 7f53015d7700 0 log_channel(cluster) log [DBG] : reconnect by client.4403 192.168.1.3:0/1032897847 after 0.002036
2018-04-27 19:24:15.228507 7f53015d7700 1 mds.0.2738 reconnect_done
2018-04-27 19:24:15.984143 7f53015d7700 1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:24:15.984148 7f53015d7700 1 mds.0.2738 handle_mds_map state change up:reconnect --> up:rejoin
2018-04-27 19:24:15.984156 7f53015d7700 1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700 1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700 1 mds.0.2738 rejoin_done
2018-04-27 19:27:42.206654 7f53015d7700 1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:27:42.206658 7f53015d7700 1 mds.0.2738 handle_mds_map state change up:rejoin --> up:active
2018-04-27 19:27:42.206666 7f53015d7700 1 mds.0.2738 recovery_done -- successful recovery!
Failover state sequence:
- handle_mds_map state change up:boot --> up:replay
- handle_mds_map state change up:replay --> up:reconnect
- handle_mds_map state change up:reconnect --> up:rejoin
- handle_mds_map state change up:rejoin --> up:active
up:boot
- This state is broadcast to the Ceph monitors during startup. It is never actually observed, because the monitor immediately assigns the MDS to an available rank or directs it to act as a standby; it is documented here for completeness.
up:replay
- Journal recovery phase: the journal contents are read into memory and replayed.
up:reconnect
- The recovering MDS must re-establish connections with its former clients and query the file handles those clients had published, recreating the consistent caps and lock state in the MDS cache.
The MDS does not persistently record which files are open, both to avoid adding latency on MDS access and because most files are opened read-only.
up:rejoin
- The clients' inodes are loaded into the MDS cache. (This is the most time-consuming phase.)
Why is MDS failover so slow?
- Log analysis (the rejoin_start / rejoin_joint_start steps take most of the time):
2018-04-27 19:24:15.984156 7f53015d7700 1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700 1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700 1 mds.0.2738 rejoin_done
2018-04-27 19:27:42.206654 7f53015d7700 1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:27:42.206658 7f53015d7700 1 mds.0.2738 handle_mds_map state change up:rejoin --> up:active
- Code tracing (the timeout happens in process_imported_caps, whose main job is opening inodes and loading them into the cache)
Summary:
- The detailed MDS states during failover
- Most of the failover time goes to the rejoin phase, loading the inode information of client sessions (a reproduction sketch follows)
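To reproduce the measurement, the failover can be triggered deliberately and watched from another terminal (a sketch; rank 0 is assumed to be the active MDS):
# Fail the active rank so a standby takes over
$ ceph mds fail 0
# Watch the standby walk through replay -> reconnect -> rejoin -> active
$ watch -n 1 ceph mds stat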
2.5 Releasing client inodes
2.5.1 Simulating a session with many inodes
- Check the client session info
# Check the session's inode count: num_caps is 7
$ ceph daemon mds.ceph-xxx-osd01.ys session ls
[
{
"id": 554418,
"num_leases": 0,
"num_caps": 7,
"state": "open",
"replay_requests": 0,
"completed_requests": 0,
"reconnecting": false,
"inst": "client.554418 192.168.1.2:0/1285681097",
"client_metadata": {
"ceph_sha1": "fe3a2269d799a8b950404cb2de11af84c7af0ea4",
"ceph_version": "didi_dss version 12.2.2.4 (fe3a2269d799a8b950404cb2de11af84c7af0ea4) luminous (stable)",
"entity_id": "admin",
"hostname": "ceph-xxx-osd01.ys",
"mount_point": "/mnt",
"pid": "2084",
"root": "/"
}
}
]
- Traverse all files from the client
# Walk every file under the mount point
$ tree /mnt/
# Count all directories and files under the mount point
$ tree /mnt/ | wc -l
347
- Check the client session info again
# Check the session's inode count: num_caps is now 346
$ ceph daemon mds.ceph-xxx-osd01.ys session ls
[
{
"id": 554418,
"num_leases": 1,
"num_caps": 346,
"state": "open",
"replay_requests": 0,
"completed_requests": 2,
"reconnecting": false,
"inst": "client.554418 192.168.1.3:0/1285681097",
"client_metadata": {
"ceph_sha1": "fe3a2269d799a8b950404cb2de11af84c7af0ea4",
"ceph_version": "didi_dss version 12.2.2.4 (fe3a2269d799a8b950404cb2de11af84c7af0ea4) luminous (stable)",
"entity_id": "admin",
"hostname": "ceph-xxx-osd01.ys",
"mount_point": "/mnt",
"pid": "2084",
"root": "/"
}
}
]
Conclusions:
- After the client traverses every file under the mount point, the server-side session's num_caps matches the number of directories and files on the client.
- In other words, every file handle the client has touched is recorded on the server. (The MDS caches dentries in memory and evicts them with an LRU policy; see the MDS-side check below.)
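The same growth is visible from the MDS side via its perf counters (a sketch; the inodes/caps counter names are as seen in Luminous perf dump output):
# MDS-side view of cached inodes and issued caps, before and after the traversal
$ ceph daemon mds.<name> perf dump | jq '.mds | {inodes, caps}'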
2.5.2 Releasing a session's inodes
Possible solutions:
- Option 1: multi-active MDS (multi-active is not yet stable in version 12; see the sketch after this list)
- Option 2: evict client (proactively kick out the problematic client)
- Option 3: client remount (remount the problematic client)
- Option 4: drop_cache / limit_cache
- mds limiting cache by memory: https://github.com/ceph/ceph/pull/17711
- (an upstream patch for actively dropping the MDS cache, still under review at the time, targeted at ceph-14.0.0) https://github.com/ceph/ceph/pull/21566
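For Option 1, a Luminous cluster would enable a second active rank roughly as follows (a sketch only, given the stability caveat above; <fs_name> is a placeholder):
# Allow and enable two active MDS ranks (the flag is required on Luminous
# and was removed in later releases)
$ ceph fs set <fs_name> allow_multimds true
$ ceph fs set <fs_name> max_mds 2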
3. Deeper Analysis
The analysis above gives us a rough direction. Here we dig further into Option 2.
3.1 Evicting a client session
3.1.1 Check the client session info
$ ceph daemon mds.ceph-xxx-osd01.ys session ls
[
{
"id": 554418,
"num_leases": 0,
"num_caps": 1589,
"state": "open",
"replay_requests": 0,
"completed_requests": 2,
"reconnecting": false,
"inst": "client.554418 192.168.1.2:0/1285681097",
"client_metadata": {
"ceph_sha1": "fe3a2269d799a8b950404cb2de11af84c7af0ea4",
"ceph_version": "didi_dss version 12.2.2.4 (fe3a2269d799a8b950404cb2de11af84c7af0ea4) luminous (stable)",
"entity_id": "admin",
"hostname": "ceph-xxx-osd01.ys",
"mount_point": "/mnt",
"pid": "2084",
"root": "/"
}
}
]
3.1.2 Evict the client session
$ ceph tell mds.ceph-xxx-osd01.ys client evict id=554418
3.1.3 Check the OSD blacklist
# The blacklist expiry is 1 hour: evicted at 16:16:30, entry expires at 17:16:30
$ ceph osd blacklist ls
listed 1 entries
192.168.1.2:0/1285681097 2018-10-10 17:16:30.819201
3.1.4 Check the client's mount point (reads/writes fail)
$ ll /mnt
ls: cannot access /mnt: No such file or directory
3.1.5 Restore the evicted client
$ ceph osd blacklist rm 192.168.1.2:0/1285681097
un-blacklisting 192.168.1.2:0/1285681097
3.1.6 Check the client's mount point (reads/writes work again)
$ ll /mnt
total 147698
-rw-r--r-- 1 root root 4 Oct 10 15:25 aa.txt
...
3.1.7 Blacklist timeout for evicted clients
- Older versions: 1 hour
- Newer versions (12.2.2): 300s (the effective settings can be checked as shown below)
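Which timeout applies depends on release and configuration; the relevant options can be inspected directly (a sketch; option names as of Luminous, availability may vary by release):
# Whether eviction blacklists the client at all
$ ceph daemon mds.<name> config get mds_session_blacklist_on_evict
# Default blacklist duration applied by the monitors
$ ceph daemon mon.<name> config get mon_osd_blacklist_default_expire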
Summary:
- A specific client can be evicted with the client evict command
- The evicted client is added to the OSD blacklist
- Clients on the OSD blacklist cannot read or write
- To restore an evicted client, its entry must be removed from the OSD blacklist
- Once the blacklist entry is removed, the client can immediately read and write again
- fuse clients recover this way; kernel clients do not
- Experiments showed that:
- evicting a client does release its inodes
- but the active MDS's memory is not released
- after an active/standby switchover, the memory does get released
- after a switchover, the inode-loading phase is skipped, making the switch fast (on the order of seconds)
3.2 Analyzing the unreleased memory
3.2.1 Dependencies
$ yum install google-perftools
3.2.2 Check MDS memory usage
top - 13:14:06 up 63 days, 21:36, 1 user, load average: 0.06, 0.08, 0.12
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.1 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13149521+total, 96957576 free, 10023744 used, 24513896 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 11539159+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4997 ceph 20 0 4081012 1.447g 11452 S 0.0 1.2 0:54.29 ceph-mds
3.2.3 Start the heap profiler
$ ceph tell mds.0 heap start_profiler
2018-10-12 13:15:35.979279 7f3430bfa700 0 client.5796596 ms_handle_reset on 192.168.1.2:6804/2252738073
2018-10-12 13:15:36.008686 7f34293fc700 0 client.5796599 ms_handle_reset on 192.168.1.2:6804/2252738073
mds.ceph-xxx-osd01.ys started profiler
3.2.4 Dump the heap profile
$ ceph tell mds.0 heap dump
2018-10-12 13:16:34.891671 7efd04bfa700 0 client.5796659 ms_handle_reset on 192.168.1.2:6804/2252738073
2018-10-12 13:16:34.922696 7efcfd3fc700 0 client.5796662 ms_handle_reset on 192.168.1.2:6804/2252738073
mds.ceph-xxx-osd01.ys dumping heap profile now.
------------------------------------------------
MALLOC: 1225155304 ( 1168.4 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 289987072 ( 276.6 MiB) Bytes in central cache freelist
MALLOC: + 11013456 ( 10.5 MiB) Bytes in transfer cache freelist
MALLOC: + 7165384 ( 6.8 MiB) Bytes in thread cache freelists
MALLOC: + 7598240 ( 7.2 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1540919456 ( 1469.5 MiB) Actual memory used (physical + swap)
MALLOC: + 112582656 ( 107.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1653502112 ( 1576.9 MiB) Virtual address space used
MALLOC:
MALLOC: 94545 Spans in use
MALLOC: 16 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
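As the dump output above suggests, tcmalloc can be asked to return its freelist memory to the OS without restarting the daemon:
# Release tcmalloc freelist memory back to the OS
$ ceph tell mds.0 heap release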
3.2.5 Analyzing the heap with google-pprof
$ pprof --text bin/ceph-mds out/mds.a.profile.0001.heap
$ pprof --text bin/ceph-mds out/mds.a.profile.0008.heap
Using local file bin/ceph-mds.
Using local file out/mds.a.profile.0008.heap.
Total: 46.6 MB
18.1 38.7% 38.7% 19.5 41.9% Server::prepare_new_inode
6.2 13.3% 52.0% 6.2 13.3% std::_Rb_tree::_M_emplace_hint_unique (inline)
5.0 10.7% 62.7% 5.8 12.3% CDir::add_null_dentry
3.8 8.1% 70.8% 3.8 8.1% std::_Rb_tree::_Rb_tree_impl::_M_initialize (inline)
3.6 7.7% 78.6% 3.6 7.7% ceph::logging::Log::create_entry
3.1 6.7% 85.2% 3.1 6.7% Counter::_count (inline)
2.6 5.5% 90.7% 2.6 5.5% ceph::buffer::raw_combined::create (inline)
0.9 2.0% 92.8% 0.9 2.0% std::_Vector_base::_M_create_storage (inline)
0.8 1.6% 94.4% 0.8 1.6% CDir::add_null_dentry (inline)
0.6 1.2% 95.6% 0.6 1.2% CInode::add_client_cap (inline)
0.5 1.1% 96.6% 0.5 1.1% std::string::_Rep::_S_create
0.5 1.0% 97.6% 0.5 1.0% MDCache::add_inode (inline)
0.2 0.5% 98.2% 0.3 0.6% decode_message
0.2 0.4% 98.5% 0.2 0.5% MDCache::request_start (inline)
0.1 0.2% 98.7% 0.1 0.3% CInode::project_inode
0.1 0.2% 99.0% 0.1 0.2% std::_Rb_tree::_M_insert_unique (inline)
0.1 0.1% 99.1% 0.1 0.1% std::string::_M_data (inline)
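When multiple dumps exist, pprof can also diff two of them to isolate what grew in between (a sketch using the dump files from above):
# Show only the allocations that appeared between dump 0001 and dump 0008
$ pprof --text --base=out/mds.a.profile.0001.heap bin/ceph-mds out/mds.a.profile.0008.heap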
4. Conclusions
Current cephfs releases run a single active MDS, and complex, unpredictable client behaviour can hurt its performance.
For example: even searching a local disk puts noticeable pressure on local CPU and load when the file count is large, let alone doing so on a complex, shared network filesystem.
Recommended optimizations:
- QoS rate limiting in the ceph-fuse client, so a sudden burst of IO does not make the MDS stutter (limit IOPS at the client to avoid resource contention and shocks to the system)
- Multi-active MDS with directory fragmentation (the ultimate solution; see: the multi-active MDS performance tests)
- The MDS main processing path is single-threaded, which caps single-MDS performance at roughly 8k ops/s, with CPU utilization around 140%.
- If MDS load or memory gets too high, limit its memory or reclaim its cache periodically (reduces MDS pressure and improves throughput; a configuration sketch follows below) https://github.com/ceph/ceph/pull/17711/files
- Evicting a client releases its inode count but not the memory; a subsequent failover, however, becomes much faster.
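For the cache-limiting route referenced above, Luminous and later expose a memory-based limit (the behaviour introduced by PR 17711). A sketch, with the value in bytes and the daemon name as a placeholder:
# Cap the MDS cache at 8 GiB instead of limiting by inode count
$ ceph daemon mds.<name> config set mds_cache_memory_limit 8589934592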