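Ceph's CRUSH algorithm is a great thing: it lets clients compute where objects should be read and written. But, sigh, the biggest problem is this: why is the PG distribution so uneven?
How the problem arises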
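When running Ceph in practice, we often hit this problem: after creating pools, ceph osd df shows that the PGs of those pools are unevenly distributed across the OSDs, sometimes by a wide margin. This is especially true for data pools such as rbd-pool or rgw-data, where OSDs can differ by a dozen or even dozens of PGs. Once the cluster is more than 80% full, this becomes a real headache: some OSDs have already reached nearfull while others are only 60% used.
An effective fix is to tune the pool right after the cluster is built, by reweighting the OSDs. After several rounds of reweight, the chosen pool ends up reasonably well balanced across the OSDs. However, this only works while the cluster is freshly built; in an expansion scenario, this repeated-adjustment approach no longer works.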
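Approach
When running Ceph in production, what we want most is for every disk to have almost the same usage. That means at least the PGs of the pools carrying most of the data should be nearly perfectly distributed across all the designated OSDs, usage across all OSDs should stay well balanced after expansion, and all of this should come at minimal cost. Is there a way to do that?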
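Of course there is. Starting with 12.2.x, the community added upmap support to the osdmaptool tool, which lets us run calculations against a given osdmap. Combined with the ceph osd pg-upmap-items command, it enables manual migration at the level of a single PG. In other words, we can manually direct a particular PG to move to particular OSDs, which is pretty amazing!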
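Keep in mind that PG placement is computed by feeding inputs such as the crushmap and the reweight values into the algorithm, so that clients can work out by calculation where to read and write. Manually changing PG placement therefore goes, to some extent, against the original intent of the algorithm.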
upmap
Here is an excerpt from the official documentation:
Starting in Luminous v12.2.z there is a new pg-upmap exception table in
the OSDMap that allows the cluster to explicitly map specific PGs to
specific OSDs. This allows the cluster to fine-tune the data distribution
to, in most cases, perfectly distributed PGs across OSDs.
The key caveat to this new mechanism is that it requires that all clients
understand the new pg-upmap structure in the OSDMap.
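In other words, upmap lets us specify PG placement by hand, but clients must be able to understand the new pg-upmap structure. Unlike placement computed directly by CRUSH, once a PG has been moved manually, its new location can no longer be derived from the algorithm alone, so a new structure has to be introduced to record it.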
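To make the idea of an exception table concrete, here is a minimal sketch of how a list of from/to pairs could be applied on top of a CRUSH result. The function name apply_upmap and the three-OSD CRUSH result are hypothetical, chosen only for illustration; the real OSDMap code performs additional validity checks that this sketch leaves out.

```shell
#!/bin/sh
# Illustrative only: mimic how a pg_upmap_items entry rewrites a CRUSH result.
# Usage: apply_upmap "<osds chosen by CRUSH>" "<from to from to ...>"
apply_upmap() {
    crush_set=$1
    pairs=$2
    result=""
    for osd in $crush_set; do
        mapped=$osd
        # Walk the from/to pairs and substitute the first matching source OSD.
        set -- $pairs
        while [ "$#" -ge 2 ]; do
            if [ "$1" = "$osd" ]; then
                mapped=$2
                break
            fi
            shift 2
        done
        result="${result:+$result }$mapped"
    done
    echo "$result"
}

# Hypothetical CRUSH result "98 92 169" with the pairs 92->147 and 169->117.
apply_upmap "98 92 169" "92 147 169 117"
```

A client that does not know about the table would still compute the plain CRUSH result and talk to the wrong OSDs, which is why the minimum client release matters.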
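How to use it
Let's try it out and see whether this tool is really that good.
There are two prerequisites for using upmap. First, the Ceph version must be 12.2.x or later. Second, the cluster's client feature level must be at least luminous, so that clients can interpret the new pg-upmap structure.
How do we do that? Read on.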
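ceph features # show the feature/release level of the connected daemons and clients
ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it # require Luminous (or later) clients; the --yes-i-really-mean-it flag forces the setting, so check the ceph features output first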
1. Use ceph osd getmap to fetch the latest osdmap
ceph osd getmap -o {osdmap_filename}
Example: ceph osd getmap -o osdmap.bin
Then check the current PG distribution of the cluster's rgw data pool
osdmaptool --test-map-pgs --pool 6 ./osdmap.bin
2.osdmaptool --upmap-pool [poolname] [osdmapfile] --upmap [outfilename]
osdmaptool {osdmap_filename} --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>] # compute the optimizations that would balance the cluster's current data
Example: osdmaptool --upmap-pool default.rgw.buckets.data osdmap.bin --upmap upmap.txt --upmap-max 99 --upmap-deviation 1
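Notes
upmap-pool: the name of the pool to optimize.
upmap-max: the number of entries to optimize in one run, 100 by default. Adjust it to suit your environment and workload: the more entries you change at once, the more data is migrated, which may affect production traffic.
upmap-deviation: the maximum deviation, 0.01 (that is, 1%) by default. An OSD whose utilization differs from the average by less than this value is considered perfectly balanced. Later Ceph releases express this option as a number of PGs rather than a ratio, which is how the example above uses it (--upmap-deviation 1), so check osdmaptool --help for your version.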
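To get a feel for what a deviation threshold means, the following sketch computes the mean PG count and the largest absolute deviation from it. It reads lines of the form "osd_id pg_count" on stdin. The function name pg_deviation is hypothetical, and apart from the 36 PGs on osd 92 the sample counts are made up. It works on raw PG counts only, which is a simplification of what Ceph calculates internally (Ceph also takes OSD weights into account).

```shell
#!/bin/sh
# Illustrative only: mean PG count and largest deviation from the mean.
# Input on stdin: one "osd_id pg_count" pair per line.
pg_deviation() {
    awk '
        { pgs[$1] = $2; sum += $2; n++ }
        END {
            if (n == 0) exit 1
            mean = sum / n
            for (id in pgs) {
                d = pgs[id] - mean
                if (d < 0) d = -d          # absolute deviation
                if (d > max) max = d
            }
            printf "osds=%d mean=%.2f max_dev=%.2f\n", n, mean, max
        }'
}

# Sample input: three OSDs (counts for 147 and 117 are hypothetical).
printf '92 36\n147 28\n117 32\n' | pg_deviation
```

If the largest deviation is already within the threshold you plan to pass to osdmaptool, the tool is unlikely to propose many changes.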
View the proposed optimizations
cat upmap.txt
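Before applying the plan, it can help to see which OSDs it drains the most. The sketch below assumes that upmap.txt contains lines of the form "ceph osd pg-upmap-items <pgid> <from> <to> [<from> <to> ...]"; that is the format osdmaptool is expected to write, but verify it against your own file. The function name upmap_sources is hypothetical, and the sample lines are reconstructed from a few entries of the output shown in step 3.

```shell
#!/bin/sh
# Illustrative only: count how many PG replicas the plan moves off each OSD.
# Reads the plan from the given file(s), or from stdin when no file is given.
upmap_sources() {
    awk '
        $1 == "ceph" && $3 == "pg-upmap-items" {
            # Fields 5,7,9,... are source OSDs; 6,8,10,... are targets.
            for (i = 5; i < NF; i += 2) out[$i]++
        }
        END { for (id in out) print out[id], "osd." id }' "$@" |
    LC_ALL=C sort -k1,1nr -k2,2
}

# Real usage would be: upmap_sources upmap.txt
# Demo with a few sample lines in the assumed format:
upmap_sources <<'EOF'
ceph osd pg-upmap-items 6.7 118 152
ceph osd pg-upmap-items 6.8 107 117
ceph osd pg-upmap-items 6.b 92 147 169 117
ceph osd pg-upmap-items 6.20 107 109
EOF
```

The OSDs at the top of the list give up the most PG replicas, so they are the ones to watch with ceph osd df while the data moves.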
3. source the exported upmap plan
source upmap.txt # apply the optimizations
set 6.7 pg_upmap_items mapping to [118->152]
set 6.8 pg_upmap_items mapping to [107->117]
set 6.b pg_upmap_items mapping to [92->147,169->117]
set 6.10 pg_upmap_items mapping to [180->80]
set 6.17 pg_upmap_items mapping to [171->131,110->152,180->81]
set 6.18 pg_upmap_items mapping to [99->96]
set 6.1d pg_upmap_items mapping to [171->134,91->167]
set 6.20 pg_upmap_items mapping to [107->109]
set 6.21 pg_upmap_items mapping to [107->109]
set 6.24 pg_upmap_items mapping to [120->108]
set 6.25 pg_upmap_items mapping to [11->156]
set 6.2c pg_upmap_items mapping to [104->108]
set 6.2e pg_upmap_items mapping to [169->113,100->96]
set 6.48 pg_upmap_items mapping to [107->117]
set 6.4a pg_upmap_items mapping to [107->117]
set 6.4b pg_upmap_items mapping to [177->124]
set 6.4c pg_upmap_items mapping to [91->94]
set 6.58 pg_upmap_items mapping to [126->123]
set 6.60 pg_upmap_items mapping to [118->112]
set 6.63 pg_upmap_items mapping to [92->90,177->124,104->112]
set 6.66 pg_upmap_items mapping to [177->129]
set 6.6c pg_upmap_items mapping to [101->175]
set 6.6d pg_upmap_items mapping to [34->35]
set 6.79 pg_upmap_items mapping to [110->106]
set 6.7a pg_upmap_items mapping to [120->116]
set 6.7d pg_upmap_items mapping to [91->94]
set 6.7e pg_upmap_items mapping to [92->94,118->172]
set 6.8b pg_upmap_items mapping to [169->113,92->90,177->127]
set 6.9e pg_upmap_items mapping to [92->94]
set 6.a3 pg_upmap_items mapping to [107->117]
set 6.b5 pg_upmap_items mapping to [92->147,107->117]
set 6.b7 pg_upmap_items mapping to [171->131]
set 6.b9 pg_upmap_items mapping to [110->172]
set 6.ba pg_upmap_items mapping to [180->81]
set 6.bb pg_upmap_items mapping to [92->94]
set 6.c8 pg_upmap_items mapping to [107->119]
set 6.ca pg_upmap_items mapping to [107->117]
set 6.d2 pg_upmap_items mapping to [92->90,169->113,179->112,177->127,171->134]
set 6.d4 pg_upmap_items mapping to [11->156]
set 6.d7 pg_upmap_items mapping to [110->108,171->134]
set 6.dd pg_upmap_items mapping to [91->94]
set 6.e0 pg_upmap_items mapping to [107->139]
set 6.e3 pg_upmap_items mapping to [92->94]
set 6.e6 pg_upmap_items mapping to [179->108,101->96]
set 6.e8 pg_upmap_items mapping to [171->131,99->96]
set 6.ec pg_upmap_items mapping to [104->108]
set 6.ed pg_upmap_items mapping to [34->35]
set 6.fe pg_upmap_items mapping to [150->124,92->94]
set 6.ff pg_upmap_items mapping to [43->42]
set 6.105 pg_upmap_items mapping to [120->152]
set 6.10a pg_upmap_items mapping to [138->178]
set 6.10b pg_upmap_items mapping to [169->113]
set 6.110 pg_upmap_items mapping to [92->147,126->127]
set 6.11c pg_upmap_items mapping to [179->108]
set 6.137 pg_upmap_items mapping to [104->112]
set 6.13a pg_upmap_items mapping to [120->152]
set 6.13e pg_upmap_items mapping to [92->147,150->127]
set 6.148 pg_upmap_items mapping to [107->117]
set 6.150 pg_upmap_items mapping to [126->124]
set 6.152 pg_upmap_items mapping to [92->89,179->152,169->139,100->175]
set 6.15c pg_upmap_items mapping to [179->112]
set 6.15e pg_upmap_items mapping to [92->89,150->128]
set 6.160 pg_upmap_items mapping to [126->129]
set 6.161 pg_upmap_items mapping to [107->115]
set 6.162 pg_upmap_items mapping to [149->115]
set 6.16e pg_upmap_items mapping to [169->113,179->116]
set 6.174 pg_upmap_items mapping to [34->35]
set 6.17d pg_upmap_items mapping to [150->127]
set 6.185 pg_upmap_items mapping to [91->94]
set 6.18c pg_upmap_items mapping to [179->112,103->175]
set 6.190 pg_upmap_items mapping to [126->123]
set 6.192 pg_upmap_items mapping to [179->152,177->124]
set 6.19d pg_upmap_items mapping to [171->131]
set 6.19e pg_upmap_items mapping to [92->89,150->124]
set 6.1a0 pg_upmap_items mapping to [107->117,118->106]
set 6.1a1 pg_upmap_items mapping to [103->96]
set 6.1a8 pg_upmap_items mapping to [171->178]
set 6.1ac pg_upmap_items mapping to [101->175]
set 6.1b9 pg_upmap_items mapping to [110->108]
set 6.1bd pg_upmap_items mapping to [171->131,150->123]
set 6.1c9 pg_upmap_items mapping to [177->124,104->106]
set 6.1cb pg_upmap_items mapping to [92->147,169->113]
set 6.1cc pg_upmap_items mapping to [179->112]
set 6.1d2 pg_upmap_items mapping to [179->152,169->115,177->128]
set 6.1d7 pg_upmap_items mapping to [171->134,180->85]
set 6.1dd pg_upmap_items mapping to [149->119]
set 6.1e0 pg_upmap_items mapping to [118->152]
set 6.1ed pg_upmap_items mapping to [34->35]
set 6.1ee pg_upmap_items mapping to [169->139,138->134]
set 6.1f5 pg_upmap_items mapping to [179->108,107->117]
set 6.1fa pg_upmap_items mapping to [180->153]
set 6.1ff pg_upmap_items mapping to [43->42]
Checking the changes, we can see that OSDs with many PGs are handing PGs over to OSDs with fewer PGs
ceph pg map 6.b
osdmap e10488 pg 6.b (6.b) -> up [98,147,117,135,177,81,116] acting [98,92,113,134,127,81,116]
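The up set is where the PG should end up and the acting set is where it is currently served from, so the positions where the two differ are the moves still in flight. A minimal sketch for extracting them from a ceph pg map line follows. The function name pending_moves is hypothetical, and the position-by-position comparison is an assumption that suits pools where slot order matters (such as erasure-coded pools); it may be misleading for replicated pools, where order can differ.

```shell
#!/bin/sh
# Illustrative only: list positions where the up and acting sets differ.
# Input on stdin: one line of `ceph pg map <pgid>` output.
pending_moves() {
    # Pull out the two bracketed OSD lists, then compare them slot by slot.
    sed -n 's/.*up \[\([0-9,]*\)] acting \[\([0-9,]*\)].*/\1 \2/p' |
    awk '{
        n = split($1, up, ",")
        split($2, act, ",")
        for (i = 1; i <= n; i++)
            if (up[i] != act[i])
                printf "slot %d: osd.%s -> osd.%s\n", i - 1, act[i], up[i]
    }'
}

# The line reported for pg 6.b in this cluster:
echo 'osdmap e10488 pg 6.b (6.b) -> up [98,147,117,135,177,81,116] acting [98,92,113,134,127,81,116]' | pending_moves
```

On a live cluster the same function could be fed with ceph pg map 6.b | pending_moves; once backfill finishes and the two sets match, it prints nothing.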
ceph osd df | grep -w " 92 "
92 hdd 7.27698 0.84999 7.3 TiB 6.4 TiB 6.4 TiB 56 KiB 19 GiB 859 GiB 88.47 1.72 36 up
ceph osd df | grep -w "147 "