014 Ceph Administration and Customizing the CRUSH Map


1. Concepts

1.1 The Ceph cluster write path

The client first contacts a Ceph monitor to obtain a copy of the cluster map, which tells it the state and configuration of the cluster.

The data is split into one or more objects; each object is identified by an object name and a pool name.

The object name is hashed and the result is taken modulo the pool's PG count, mapping the object to a placement group (PG).

From the PG, the CRUSH algorithm computes the set of OSDs (one per replica) that will hold the data; the first OSD in the set is the primary, the rest are secondaries.

The client now knows the OSD IDs and talks to those OSDs directly to store the data (writes go to the primary, which replicates them to the secondaries).

Note: all of the placement computation above happens on the client, so it adds no load on the Ceph cluster side.
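This placement computation can be observed from any node with client admin access using ceph osd map, which runs the same hash-plus-CRUSH calculation without writing anything. A minimal sketch; <pool> and <objname> are placeholders and the object does not need to exist:

ceph osd map <pool> <objname>

The output has the form osdmap eN pool '<pool>' (<id>) object '<objname>' -> pg <pgid> -> up ([osds], p<primary>) acting ([osds], p<primary>), showing the PG the object hashes to and the ordered OSD set CRUSH selects for it. The same command is used for exactly this purpose in sections 5.9 and 6.5 below.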

1.2 CRUSH and object placement policy

Ceph uses the CRUSH algorithm (Controlled Replication Under Scalable Hashing) to compute which OSDs store which objects.

Objects are assigned to PGs, and CRUSH decides which OSDs each PG uses to store its objects. Ideally, CRUSH spreads the data evenly across the available storage.

When a new OSD is added or an existing OSD fails, Ceph uses CRUSH to rebalance the data across the remaining active OSDs.

The CRUSH map is the central configuration mechanism for the CRUSH algorithm; by adjusting it you can control where data is placed.

By default, CRUSH places an object's replicas on OSDs in different hosts. The CRUSH map and CRUSH rules can be changed so that replicas land on hosts in different rooms or different racks, or so that SSDs are dedicated to pools that need fast storage.

1.3 Components of the CRUSH map

CRUSH hierarchy: a tree structure that usually mirrors where the OSDs physically live. By default there is one root bucket that contains all of the host buckets, and the OSDs are the leaves of those host buckets. The hierarchy can be customized, rearranged, or given extra levels, for example grouping OSD hosts into racks or rooms.

CRUSH rule: a rule determines how PGs are mapped onto OSDs drawn from the buckets. Every pool must have a CRUSH rule, and different pools can be mapped to different CRUSH rules.
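A pool's current rule can be queried and changed from the command line; a quick sketch, where mypool and myrule are placeholder names (the same two commands appear later in this walkthrough):

ceph osd pool get mypool crush_rule
ceph osd pool set mypool crush_rule myrule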

2. Extracting, decompiling, compiling, and injecting the CRUSH map

[root@ceph2 ceph]# ceph osd getcrushmap -o ./crushmap.bin

[root@ceph2 ceph]# file ./crushmap.bin

./crushmap.bin: MS Windows icon resource - 8 icons, 2-colors

(The extracted CRUSH map is an opaque binary blob; file merely guesses at the format, so this identification means nothing.)

[root@ceph2 ceph]# crushtool -d crushmap.bin -o ./crushmap.txt

[root@ceph2 ceph]# vim ./crushmap.txt

[root@ceph2 ceph]# crushtool -c crushmap.txt -o crushmap-new.bin

[root@ceph2 ceph]# ceph osd tree

ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.13129 root default
-3       0.04376     host ceph2
 0   hdd 0.01459         osd.0      up  1.00000 1.00000
 3   hdd 0.01459         osd.3      up  1.00000 1.00000
 6   hdd 0.01459         osd.6      up  1.00000 1.00000
-7       0.04376     host ceph3
 2   hdd 0.01459         osd.2      up  1.00000 1.00000
 5   hdd 0.01459         osd.5      up  1.00000 1.00000
 8   hdd 0.01459         osd.8      up  1.00000 1.00000
-5       0.04376     host ceph4
 1   hdd 0.01459         osd.1      up  1.00000 1.00000
 4   hdd 0.01459         osd.4      up  1.00000 1.00000
 7   hdd 0.01459         osd.7      up  1.00000 1.00000

[root@ceph2 ceph]# vim ./crushmap.txt

An annotated excerpt from crushmap.txt:

host ceph3 {
        id -7                   # do not change unnecessarily; a negative integer, so bucket ids never collide with storage device ids
        id -8 class hdd         # do not change unnecessarily
        # weight 0.044
        alg straw2              # algorithm used when mapping PGs to the items in this bucket; straw2 is the default
        hash 0                  # rjenkins1; each bucket has a hash algorithm, and Ceph currently supports rjenkins1, selected by the value 0
        item osd.2 weight 0.015 # the buckets or leaves that this bucket contains
        item osd.5 weight 0.015
        item osd.8 weight 0.015
}

The CRUSH map also contains the data placement rules. This cluster has two: replicated_rule, which every cluster ships with by default, and an erasure rule (EC-pool) that was created automatically when an erasure-coded pool was set up.

ceph osd crush rule ls lists the existing rules, and ceph osd crush rule dump prints them in detail.

[root@ceph2 ceph]# ceph osd crush rule ls

[root@ceph2 ceph]# ceph osd crush rule dump

[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "EC-pool",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 5,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

The same rules appear in the decompiled crushmap text:

rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

The general syntax of a rule:

rule <rulename> {
        id <id>                              # integer rule id
        type [replicated|erasure]            # whether the rule serves replicated or erasure-coded pools
        min_size <min-size>                  # the rule is not applied to a pool whose replica count is below this value
        max_size <max-size>                  # the rule is not applied to a pool whose replica count is above this value
        step take <bucket-name>              # the bucket this rule starts from; "default" in the stock rule
        step [chooseleaf|choose] firstn <num> type <bucket-type>
                                             # num == 0     -> choose N buckets (N = the pool's replica count)
                                             # 0 < num < N  -> choose num buckets
                                             # num < 0      -> choose N - |num| buckets
        step emit
}
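Before injecting an edited map, the rules can be dry-run with crushtool to make sure they map PGs the way you expect. A minimal sketch; the rule id and replica count are examples, and this check is not part of the original transcript:

crushtool -i crushmap-new.bin --test --rule 0 --num-rep 3 --show-mappings
crushtool -i crushmap-new.bin --test --rule 0 --num-rep 3 --show-utilization

--show-mappings prints the OSD set chosen for each simulated input, and --show-utilization summarizes how many PGs would land on each OSD, so a badly written rule shows up before it ever touches the cluster.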

Create a replicated pool:

[root@ceph2 ceph]# ceph osd pool create testpool 32 32
pool 'testpool' already exists
[root@ceph2 ceph]# ceph osd pool set tetspool size 11
Error ENOENT: unrecognized pool 'tetspool'

The point being made: replicated_rule caps max_size at 10, so the CRUSH rules have to be changed before a replica count of 11 can be used. (Note that the error shown above is actually triggered by the misspelled pool name 'tetspool'.) Modify the CRUSH rules:

[root@ceph2 ceph]# vim crushmap.txt

rule replicated1_rule {
        id 2
        type replicated
        min_size 1
        max_size 11
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

[root@ceph2 ceph]# vim crushmap.txt

[root@ceph2 ceph]# crushtool -c crushmap.txt -o crushmap-new.bin

[root@ceph2 ceph]# ceph osd setcrushmap -i crushmap-new.bin

[root@ceph2 ceph]# ceph osd crush rule ls

[root@ceph2 ceph]# ceph osd pool get testpool all

size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0
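With replicated1_rule (max_size 11) now in the map, the intended follow-up would be to point the pool at the new rule and raise its size; a sketch of those two steps, which are not captured in the original transcript:

ceph osd pool set testpool crush_rule replicated1_rule
ceph osd pool set testpool size 11

Note that on this three-host cluster a size of 11 can never actually be satisfied with a host failure domain, so this only exercises the max_size check.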

3. Updating CRUSH map parameters from the command line

3.1 Default bucket types

type 0 osd, type 1 host, type 2 chassis, type 3 rack, type 4 row, type 5 pdu,
type 6 pod, type 7 room, type 8 datacenter, type 9 region, type 10 root
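These are exactly the entries in the # types section of a decompiled CRUSH map, where additional custom types can be declared if needed:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root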

(The original post included a diagram of the default bucket hierarchy here.)

4. Updating the CRUSH map hierarchy from the command line

4.1 Create a bucket

[root@ceph2 ceph]# ceph osd crush add-bucket DC1 datacenter
added bucket DC1 type datacenter to crush map
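The new bucket is created outside any root. If it were meant to sit under the existing default root, it could be placed there with the same move syntax used later in this article; a sketch of a step that is not performed here:

ceph osd crush move DC1 root=default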

4.2 Verify

[root@ceph2 ceph]# ceph osd getcrushmap -o crushmap.bin

[root@ceph2 ceph]# crushtool -d crushmap.bin -o crushmap.txt

[root@ceph2 ceph]# vim crushmap.txt

datacenter DC1 {
        id -9           # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0          # rjenkins1
}

[root@ceph2 ceph]# ceph osd tree

4.3 Delete the bucket

[root@ceph2 ceph]# vim crushmap.txt

Delete this stanza:

datacenter DC1 {
        id -9           # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0          # rjenkins1
}

4.4 Compile and inject

[root@ceph2 ceph]# crushtool -c crushmap.txt -o crushmap-new.bin

[root@ceph2 ceph]# ceph osd setcrushmap -i crushmap-new.bin

The bucket is gone:

[root@ceph2 ceph]# ceph osd tree
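Editing the decompiled map is only one way to do this; an empty bucket can also be removed directly from the CLI, which is equivalent to the manual edit above (a sketch, not part of the original transcript):

ceph osd crush remove DC1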

4.5 Recreate the bucket, this time as a root

[root@serverc ~]# ceph osd crush add-bucket dc1 root
added bucket dc1 type root to crush map

5. Laying out the new buckets

5.1 Define three racks

[root@ceph2 ceph]# ceph osd crush add-bucket rack1 rack
added bucket rack1 type rack to crush map
[root@ceph2 ceph]# ceph osd crush add-bucket rack2 rack
added bucket rack2 type rack to crush map
[root@ceph2 ceph]# ceph osd crush add-bucket rack3 rack
added bucket rack3 type rack to crush map

5.2 Move the three racks into dc1

[root@ceph2 ceph]# ceph osd crush move rack1 root=dc1
moved item id -10 name 'rack1' to location {root=dc1} in crush map
[root@ceph2 ceph]# ceph osd crush move rack2 root=dc1
moved item id -11 name 'rack2' to location {root=dc1} in crush map
[root@ceph2 ceph]# ceph osd crush move rack3 root=dc1
moved item id -12 name 'rack3' to location {root=dc1} in crush map

5.3 Place the three hosts into the racks

If the hosts were moved, the default root would be left without any hosts, so use link instead; link adds an additional location without removing the existing one.

[root@ceph2 ceph]# ceph osd crush link ceph2 rack=rack1
linked item id -3 name 'ceph2' to location {rack=rack1} in crush map
[root@ceph2 ceph]# ceph osd crush link ceph2 rack=rack2
linked item id -3 name 'ceph2' to location {rack=rack2} in crush map
[root@ceph2 ceph]# ceph osd crush link ceph3 rack=rack2
linked item id -7 name 'ceph3' to location {rack=rack2} in crush map
[root@ceph2 ceph]# ceph osd crush link ceph4 rack=rack2
linked item id -5 name 'ceph4' to location {rack=rack2} in crush map

[root@ceph2 ceph]# ceph osd tree

5.4 Fix the mistake

The linking above was done incorrectly (ceph2 ended up in both rack1 and rack2, and ceph4 was linked to rack2 instead of rack3), so correct it by editing the CRUSH map.

Original configuration:

rack rack1 {
        id -10          # do not change unnecessarily
        id -15 class hdd        # do not change unnecessarily
        # weight 0.045
        alg straw2
        hash 0  # rjenkins1
        item ceph2 weight 0.045
}
rack rack2 {
        id -11          # do not change unnecessarily
        id -14 class hdd        # do not change unnecessarily
        # weight 0.135
        alg straw2
        hash 0  # rjenkins1
        item ceph2 weight 0.045
        item ceph3 weight 0.045
        item ceph4 weight 0.045
}
rack rack3 {
        id -12          # do not change unnecessarily
        id -13 class hdd        # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}

Corrected configuration:

rack rack1 {
        id -10          # do not change unnecessarily
        id -15 class hdd        # do not change unnecessarily
        # weight 0.045
        alg straw2
        hash 0  # rjenkins1
        item ceph2 weight 0.045
}
rack rack2 {
        id -11          # do not change unnecessarily
        id -14 class hdd        # do not change unnecessarily
        # weight 0.135
        alg straw2
        hash 0  # rjenkins1
        item ceph3 weight 0.045
}
rack rack3 {
        id -12          # do not change unnecessarily
        id -13 class hdd        # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item ceph4 weight 0.045
}
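The same correction could probably be made from the command line instead of editing the map, using unlink to drop the extra locations and link to add the missing one. A sketch only; this sequence was not run as part of the walkthrough:

ceph osd crush unlink ceph2 rack2
ceph osd crush unlink ceph4 rack2
ceph osd crush link ceph4 rack=rack3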

5.5 Compile and inject

[root@ceph2 ceph]# crushtool -c crushmap.txt -o crushmap-new.bin 
[root@ceph2 ceph]# ceph osd setcrushmap -i crushmap-new.bin
29

 [root@ceph2 ceph]# ceph osd tree

5.6 Create a rule whose failure domain is the rack

[root@ceph2 ceph]# ceph osd crush rule create-replicated indc1 dc1 rack

[root@ceph2 ceph]# ceph osd getcrushmap -o ./crushmap.bin

[root@ceph2 ceph]# crushtool -d ./crushmap.bin -o ./crushmap.txt

[root@ceph2 ceph]# vim ./crushmap.txt

rule indc1 {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take dc1
        step chooseleaf firstn 0 type rack
        step emit
}

5.7 Create a pool to test with

[root@ceph2 ceph]# ceph osd pool create test 32 32
pool 'test' created
[root@ceph2 ceph]# ceph osd pool application enable test rbd
enabled application 'rbd' on pool 'test'
[root@ceph2 ceph]# ceph osd pool get test all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 32
pgp_num: 32
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

5.8 Switch the pool to the new rule

[root@ceph2 ceph]# ceph osd pool set test crush_rule indc1
set pool 16 crush_rule to indc1
[root@ceph2 ceph]# ceph osd pool get test all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 32
pgp_num: 32
crush_rule: indc1
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

5.9 Store an object to test placement

[root@ceph2 ceph]# rados -p test put test /etc/ceph/ceph.conf
[root@ceph2 ceph]# rados -p test ls
test
[root@ceph2 ceph]# ceph osd map test test
osdmap e239 pool 'test' (16) object 'test' -> pg 16.40e8aab5 (16.15) -> up ([5,6], p5) acting ([5,6,0], p5)
[root@ceph2 ceph]# cd /var/lib/ceph/osd/ceph-0/current/
[root@ceph2 current]# ls
10.0_head  1.1f_head  13.5_head  14.20_head  1.48_head  15.26_head  1.53_head  15.5c_head  15.79_head  16.11_head  1.66_TEMP  1.79_head  5.2_head  6.5_head

6. Building a pool on designated SSD OSDs

Pick one disk on each host to act as the SSD:
ceph2: osd.0, ceph3: osd.2, ceph4: osd.1
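As an aside, on Luminous and later, CRUSH device classes are a simpler alternative to maintaining a parallel root: tag the fast OSDs with a class and build a rule that only draws from that class. A hedged sketch, not part of the original walkthrough; ssd-rule and ssdpool-dc are placeholder names, and the commands assume the three OSDs really are SSDs:

ceph osd crush rm-device-class osd.0 osd.1 osd.2
ceph osd crush set-device-class ssd osd.0 osd.1 osd.2
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd pool create ssdpool-dc 32 32 replicated ssd-rule

The rest of this section follows the separate-root approach, which works on any version and is what the original lab demonstrates.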

6.1 Create host buckets for these hosts

[root@ceph2 current]# ceph osd crush add-bucket ceph2-ssd host
added bucket ceph2-ssd type host to crush map
[root@ceph2 current]# ceph osd crush add-bucket ceph3-ssd host
added bucket ceph3-ssd type host to crush map
[root@ceph2 current]# ceph osd crush add-bucket ceph4-ssd host
added bucket ceph4-ssd type host to crush map

 [root@ceph2 current]# ceph osd tree

6.2 Create a root and move the three hosts into it

[root@ceph2 current]# ceph osd crush add-bucket ssd-root root
added bucket ssd-root type root to crush map
[root@ceph2 current]# ceph osd crush move ceph2-ssd root=ssd-root
moved item id -17 name 'ceph2-ssd' to location {root=ssd-root} in crush map
[root@ceph2 current]# ceph osd crush move ceph3-ssd root=ssd-root
moved item id -18 name 'ceph3-ssd' to location {root=ssd-root} in crush map
[root@ceph2 current]# ceph osd crush move ceph4-ssd root=ssd-root
moved item id -19 name 'ceph4-ssd' to location {root=ssd-root} in crush map

[root@ceph2 current]# ceph osd tree

6.3 Add only the designated SSD OSDs to these hosts

[root@ceph2 current]# ceph osd crush add osd.0 0.01500 root=ssd-root host=ceph2-ssd
add item id 0 name 'osd.0' weight 0.015 at location {host=ceph2-ssd,root=ssd-root} to crush map
[root@ceph2 current]# ceph osd crush add osd.2 0.01500 root=ssd-root host=ceph3-ssd
add item id 2 name 'osd.2' weight 0.015 at location {host=ceph3-ssd,root=ssd-root} to crush map
[root@ceph2 current]# ceph osd crush add osd.1 0.01500 root=ssd-root host=ceph4-ssd
add item id 1 name 'osd.1' weight 0.015 at location {host=ceph4-ssd,root=ssd-root} to crush map

[root@ceph2 current]# ceph osd tree

6.4 Create a CRUSH rule with a host-level failure domain

[root@ceph2 current]# ceph osd crush rule create-replicated ssdrule ssd-root host
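The generated rule can be inspected without decompiling the whole map; a quick check that is not shown in the original transcript:

ceph osd crush rule dump ssdrule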

6.5 Create a pool to test with

[root@ceph2 current]# ceph osd pool create ssdpool 32 32 replicated ssdrule
pool 'ssdpool' created
[root@ceph2 current]# ceph osd pool get ssdpool all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 32
pgp_num: 32
crush_rule: ssdrule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0
[root@ceph2 current]# rados -p ssdpool put test /etc/ceph/ceph.conf
[root@ceph2 current]# rados -p ssdpool ls
test
[root@ceph2 current]# ceph osd map ssdpool test
osdmap e255 pool 'ssdpool' (17) object 'test' -> pg 17.40e8aab5 (17.15) -> up ([2,1,0], p2) acting ([2,1,0], p2)

The object maps to OSDs 2, 1, and 0, which are exactly the three OSDs placed under ssd-root.

6.6 Keep the two roots from overlapping

The designated SSD OSDs can be removed from the dc1 root so that each root uses its own set of disks.

[root@ceph2 current]# cd /etc/ceph/

[root@ceph2 ceph]# ceph osd getcrushmap -o crushmap.bin
41

[root@ceph2 ceph]# crushtool -d crushmap.bin -o crushmap.txt

[root@ceph2 ceph]# vim crushmap.txt

host ceph2 {
        id -3           # do not change unnecessarily
        id -4 class hdd         # do not change unnecessarily
        # weight 0.045
        alg straw2
        hash 0  # rjenkins1
        #item osd.0 weight 0.015
        item osd.3 weight 0.015
        item osd.6 weight 0.015
}
host ceph4 {
        id -5           # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        # weight 0.045
        alg straw2
        hash 0  # rjenkins1
        #item osd.1 weight 0.015
        item osd.4 weight 0.015
        item osd.7 weight 0.015
}
host ceph3 {
        id -7           # do not change unnecessarily
        id -8 class hdd         # do not change unnecessarily
        # weight 0.045
        alg straw2
        hash 0  # rjenkins1
        #item osd.2 weight 0.015
        item osd.5 weight 0.015
        item osd.8 weight 0.015
}

[root@ceph2 ceph]# crushtool -c crushmap.txt -o crushmap-new.bin

[root@ceph2 ceph]# ceph osd setcrushmap -i crushmap-new.bin
42

[root@ceph2 ceph]# ceph osd tree

ID  CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-20       0.04500 root ssd-root
-17       0.01500     host ceph2-ssd
  0   hdd 0.01500         osd.0              up  1.00000 1.00000
-18       0.01500     host ceph3-ssd
  2   hdd 0.01500         osd.2              up  1.00000 1.00000
-19       0.01500     host ceph4-ssd
  1   hdd 0.01500         osd.1              up  1.00000 1.00000
 -9       0.17999 root dc1
-10       0.04500     rack rack1
 -3       0.04500         host ceph2
  3   hdd 0.01500             osd.3          up  1.00000 1.00000
  6   hdd 0.01500             osd.6          up  1.00000 1.00000
-11       0.13499     rack rack2
 -7       0.04500         host ceph3
  5   hdd 0.01500             osd.5          up  1.00000 1.00000
  8   hdd 0.01500             osd.8          up  1.00000 1.00000
-12             0     rack rack3
 -5       0.04500         host ceph4
  4   hdd 0.01500             osd.4          up  1.00000 1.00000
  7   hdd 0.01500             osd.7          up  1.00000 1.00000
 -1       0.13197 root default
 -3       0.04399     host ceph2
  3   hdd 0.01500         osd.3              up  1.00000 1.00000
  6   hdd 0.01500         osd.6              up  1.00000 1.00000
 -7       0.04399     host ceph3
  5   hdd 0.01500         osd.5              up  1.00000 1.00000
  8   hdd 0.01500         osd.8              up  1.00000 1.00000
 -5       0.04399     host ceph4
  4   hdd 0.01500         osd.4              up  1.00000 1.00000
  7   hdd 0.01500         osd.7              up  1.00000 1.00000

Each group of OSDs now lives in its own root; the experiment is complete.
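For reference, the same cleanup can likely be done from the CLI without decompiling the map: ceph osd crush remove accepts an optional ancestor bucket, so each OSD can be removed from its shared host bucket only, leaving its copy under ssd-root intact. A sketch, not run as part of this lab:

ceph osd crush remove osd.0 ceph2
ceph osd crush remove osd.2 ceph3
ceph osd crush remove osd.1 ceph4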


Author's note: the content of this post comes mainly from teacher Yan Wei of Yutian Education, and I verified all of the operations myself. Readers who want to repost it should contact Yutian Education (http://www.yutianedu.com/) or Yan Wei (https://www.cnblogs.com/breezey/) for permission. Thank you!

 

