Reposted from: https://cloud.tencent.com/developer/article/1626935
Rook version: 1.3.11; Ceph version: 14.2.10
I. What Rook Is and What Problem It Solves
First things first: Rook is not a CSI driver.
The official definition of Rook is:
Rook is an open source cloud-native storage orchestrator, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with cloud-native environments. Rook turns storage software into self-managing, self-scaling, and self-healing storage services. It does this by automating deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management. Rook uses the facilities provided by the underlying cloud-native container management, scheduling and orchestration platform to perform its duties. Rook integrates deeply into cloud native environments leveraging extension points and providing a seamless experience for scheduling, lifecycle management, resource management, security, monitoring, and user experience.
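In short:
Rook is an open source storage orchestrator for cloud-native environments. It relies on the container management, scheduling, and orchestration facilities of the underlying platform to turn storage software into self-managing, self-scaling, and self-healing storage services, automating deployment, bootstrapping, configuration, scaling, upgrades, migration, disaster recovery, monitoring, and resource management. It integrates deeply with the cloud-native environment and aims for a smooth experience in scheduling, lifecycle management, resource management, security, and monitoring.
The problems Rook solves are therefore:
- Quickly deploying a cloud-native storage cluster;
- Managing that storage cluster as a platform across its full lifecycle, including scaling, upgrades, monitoring, and disaster recovery;
- Running on top of a cloud-native container platform (such as Kubernetes), which makes it convenient to manage.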
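II. Deploying a Ceph Cluster with Rook
Rook currently supports deploying several kinds of storage clusters, including:
- Ceph, a highly scalable distributed storage solution for block storage, object storage, and shared file systems, with years of production deployments behind it.
- EdgeFS, a high-performance, fault-tolerant decentralized data fabric that can be accessed as object, file, NoSQL, and block storage.
- Cassandra, a highly available NoSQL database with very fast performance, tunable data consistency, and massive scalability.
- CockroachDB, a cloud-native SQL database for building global, scalable cloud services that survive disasters.
- NFS, which lets remote hosts mount file systems over the network and interact with them as if they were mounted locally.
- YugabyteDB, a high-performance, cloud-native distributed SQL database that automatically tolerates disk, node, zone, and region failures.
Of these, Ceph and EdgeFS are already marked stable and can be adopted in production step by step. Today we deploy the superstar of the storage world: Ceph.
1. Preparation
The official documentation lists the prerequisites, which mainly cover what the Kubernetes cluster and the node operating system need in order to support Ceph. I used the stock CentOS 7.6 release and did the following:
- Make sure lvm2 is installed on every node used for the deployment; it can be installed with: yum install lvm2
- If, like me, you plan to use Ceph for RBD storage, make sure the rbd kernel module is available on every such node; this can be checked with: modprobe rbd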
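The two prerequisites above can be checked on each node with a small script. This is only a minimal sketch: the helper names have_cmd, module_loaded, and report are made up for illustration, and it assumes that loaded modules are listed in /proc/modules (a driver compiled directly into the kernel will not appear there).

```shell
#!/bin/sh
# Hypothetical helpers for checking the node prerequisites described above.

# have_cmd NAME: succeed if NAME is an executable found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# module_loaded NAME [FILE]: succeed if kernel module NAME is listed in FILE.
# FILE defaults to /proc/modules, whose first column is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# report: print one status line per prerequisite.
report() {
    if have_cmd lvm; then
        echo "lvm2: installed"
    else
        echo "lvm2: missing (yum install lvm2)"
    fi
    if module_loaded rbd; then
        echo "rbd: loaded"
    else
        echo "rbd: not loaded (modprobe rbd)"
    fi
}
```

Calling report on a node prints the status of both checks; the functions themselves change nothing on the system.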
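2. Deploying the Ceph cluster
Everything needed for the deployment is in the official Rook Git repository. I recommend cloning the latest stable release and then following the official documentation step by step. Here is how the deployment went on my side.
Clone the project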
git clone --single-branch --branch release-1.2 https://github.com/rook/rook.git
Edit the image names in operator.yaml so that they point to the private registry
ROOK_CSI_CEPH_IMAGE: "10.2.55.8:5000/kubernetes/cephcsi:v2.1.2"
ROOK_CSI_REGISTRAR_IMAGE: "10.2.55.8:5000/kubernetes/csi-node-driver-registrar:v1.2.0"
ROOK_CSI_RESIZER_IMAGE: "10.2.55.8:5000/kubernetes/csi-resizer:v0.4.0"
ROOK_CSI_PROVISIONER_IMAGE: "10.2.55.8:5000/kubernetes/csi-provisioner:v1.4.0"
ROOK_CSI_SNAPSHOTTER_IMAGE: "10.2.55.8:5000/kubernetes/csi-snapshotter:v1.2.2"
ROOK_CSI_ATTACHER_IMAGE: "10.2.55.8:5000/kubernetes/csi-attacher:v2.1.0"
# Change this if you have moved the kubelet data directory
ROOK_CSI_KUBELET_DIR_PATH: "/data/k8s/kubelet"
Apply operator.yaml
cd rook/cluster/examples/kubernetes/ceph
# create the namespace, CRDs, service accounts, roles, and role bindings
kubectl create -f common.yaml
# create the rook-ceph operator
kubectl create -f operator.yaml
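Configure the cluster
The contents of cluster.yaml must be adapted to your own hardware. Read the comments in the file carefully to avoid the pitfalls I ran into.
Apart from adding or removing OSD devices, any change to this file only takes effect after the Ceph cluster is reinstalled, so plan the cluster in advance. If you modify the file and apply it without first uninstalling Ceph, a reinstall of the cluster is triggered and the cluster breaks.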
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  # Cluster name and namespace; only one cluster is supported per namespace
  name: rook-ceph
  namespace: rook-ceph
spec:
  # Ceph version
  # v13 is mimic, v14 is nautilus, and v15 is octopus.
  cephVersion:
    # Use a Ceph image from the private registry to speed up deployment
    image: 10.2.55.8:5000/kubernetes/ceph:v14.2.10
    # Whether to allow unsupported Ceph versions
    allowUnsupported: false
  # Host path where Rook stores its data on each node
  dataDirHostPath: /data/k8s/rook
  # Whether to continue an upgrade when the pre-upgrade checks fail
  skipUpgradeChecks: false
  # From Rook 1.5 on, the number of mons must be odd
  mon:
    count: 3
    # Whether several mon pods may run on a single node
    allowMultiplePerNode: false
  mgr:
    modules:
    - name: pg_autoscaler
      enabled: true
  # Enable the dashboard, disable SSL, and use port 7000. You can keep the default
  # HTTPS settings; I disabled SSL to keep the Ingress configuration simple.
  dashboard:
    enabled: true
    port: 7000
    ssl: false
  # Whether to deploy the PrometheusRule (disabled here)
  monitoring:
    enabled: false
    # Namespace for the PrometheusRule; defaults to the namespace of this CR
    rulesNamespace: rook-ceph
  crashCollector:
    disable: false
  placement:
    osd:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node
  # Storage settings. Both switches default to true, which means the devices on
  # every node of the cluster are wiped and initialized.
  storage: # cluster level storage configuration and selection
    useAllNodes: false    # do not use all nodes
    useAllDevices: false  # do not use all devices
    nodes:
    - name: "k8s-node1"   # host name of the storage node
      devices:
      - name: "sdb"       # use the disk /dev/sdb
    - name: "k8s-node2"
      devices:
      - name: "sdb"
For more CephCluster CRD configuration options, see the Rook CephCluster CRD documentation.
Label the OSD nodes
[root@k8s-master ceph]# kubectl label nodes k8s-node1 role=storage-node
node/k8s-node1 labeled
[root@k8s-master ceph]# kubectl label nodes k8s-node2 role=storage-node
node/k8s-node2 labeled
[root@k8s-master ceph]# kubectl label nodes k8s-master role=storage-node
node/k8s-master labeled
Run the installation
# create the Ceph cluster
kubectl create -f cluster.yaml
# Once it has completed (about 5 minutes for me, depending on network conditions), it should look like this.
# All the pods are deployed in the rook-ceph namespace.
[root@k8s-master cephfs]# kubectl get pod -n rook-ceph
NAME                                                   READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-gpxh5                                 3/3     Running     3          24h
csi-cephfsplugin-j2ms4                                 3/3     Running     3          24h
csi-cephfsplugin-mnrfj                                 3/3     Running     3          24h
csi-cephfsplugin-provisioner-845c5c79b4-4xzhl          5/5     Running     5          24h
csi-cephfsplugin-provisioner-845c5c79b4-8frl8          5/5     Running     4          24h
csi-rbdplugin-lkl2f                                    3/3     Running     3          24h
csi-rbdplugin-n6p2k                                    3/3     Running     3          24h
csi-rbdplugin-n9hmx                                    3/3     Running     3          24h
csi-rbdplugin-provisioner-5fd9759ff6-2pjwd             6/6     Running     6          24h
csi-rbdplugin-provisioner-5fd9759ff6-87g9c             6/6     Running     5          24h
rook-ceph-crashcollector-k8s-master-579874bc7d-lfqd7   1/1     Running     1          24h
rook-ceph-crashcollector-k8s-node1-7845c5d877-nkgp4    1/1     Running     0          34m
rook-ceph-crashcollector-k8s-node2-6f9d46bffb-mlzpk    1/1     Running     0          34m
rook-ceph-mds-myfs-a-757d4b-vnbwk                      1/1     Running     0          34m
rook-ceph-mds-myfs-b-69b5cc7f8-vjjfm                   1/1     Running     0          34m
rook-ceph-mgr-a-7f54fc9664-8wgzn                       1/1     Running     1          24h
rook-ceph-mon-a-7fc6b89ffb-w62bt                       1/1     Running     1          24h
rook-ceph-mon-b-7b88756867-99wc7                       1/1     Running     1          24h
rook-ceph-mon-c-846595bfcf-hv9cd                       1/1     Running     1          24h
rook-ceph-operator-6b57cd66b7-xnxnt                    1/1     Running     1          24h
rook-ceph-osd-0-7f6f9dcdf6-jlqhb                       1/1     Running     1          24h
rook-ceph-osd-1-6bd5556d9f-nrl8t                       1/1     Running     1          24h
rook-ceph-osd-2-5b6fc44884-nwfwb                       1/1     Running     1          24h
rook-ceph-osd-prepare-k8s-master-rw4gf                 0/1     Completed   0          40m
rook-ceph-osd-prepare-k8s-node1-l2jsw                  0/1     Completed   0          40m
rook-ceph-osd-prepare-k8s-node2-qc9nm                  0/1     Completed   0          40m
rook-ceph-tools-7fc67d8895-wbtr6                       1/1     Running     1          24h
rook-discover-9ms6x                                    1/1     Running     1          24h
rook-discover-jj5sl                                    1/1     Running     1          24h
rook-discover-l2rnr                                    1/1     Running     1          24h
Once the deployment has finished, the health of the Ceph cluster can be checked with the official toolbox (it is in the same Git directory as before):
# create the Ceph toolbox for checking
kubectl create -f toolbox.yaml
# enter the pod to run ceph commands
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
[root@rook-ceph-tools-7fc67d8895-wbtr6 /]# ceph status
  cluster:
    id:     37eb0bc7-0863-482b-bdf5-90f0a3c3fb66
    health: HEALTH_WARN
            clock skew detected on mon.b, mon.c

  services:
    mon: 3 daemons, quorum a,b,c (age 52m)
    mgr: a(active, since 48m)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 52m), 3 in (since 24h)

  task status:
    scrub status:
        mds.myfs-a: idle
        mds.myfs-b: idle

  data:
    pools:   2 pools, 64 pgs
    objects: 26 objects, 92 KiB
    usage:   3.0 GiB used, 57 GiB / 60 GiB avail
    pgs:     64 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

[root@rook-ceph-tools-7fc67d8895-wbtr6 /]# ceph osd status
+----+------------+-------+-------+--------+---------+--------+---------+-----------+
| id |    host    |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | k8s-node1  | 1028M | 18.9G |    0   |    0    |    2   |   106   | exists,up |
| 1  | k8s-node2  | 1028M | 18.9G |    0   |    0    |    0   |    0    | exists,up |
| 2  | k8s-master | 1028M | 18.9G |    0   |    0    |    0   |    0    | exists,up |
+----+------------+-------+-------+--------+---------+--------+---------+-----------+
[root@rook-ceph-tools-7fc67d8895-wbtr6 /]# ceph df
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED       RAW USED     %RAW USED
    hdd       60 GiB     57 GiB     13 MiB     3.0 GiB      5.02
    TOTAL     60 GiB     57 GiB     13 MiB     3.0 GiB      5.02

POOLS:
    POOL              ID     STORED     OBJECTS     USED        %USED     MAX AVAIL
    myfs-metadata     1      91 KiB     25          1.9 MiB     0         18 GiB
    myfs-data0        2      158 B      1           192 KiB     0         18 GiB
[root@rook-ceph-tools-7fc67d8895-wbtr6 /]# rados df
POOL_NAME        USED      OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR       USED COMPR  UNDER COMPR
myfs-data0       192 KiB   1        0       3       0                   0        0         6       3 KiB    3       2 KiB    0 B         0 B
myfs-metadata    1.9 MiB   25       0       75      0                   0        0         5165    2.6 MiB  149     160 KiB  0 B         0 B

total_objects    26
total_used       3.0 GiB
total_avail      57 GiB
total_space      60 GiB
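For scripted checks it can help to reduce the ceph status output to just the health state. The helpers below are a sketch under stated assumptions, not part of Rook or Ceph: the names ceph_health and is_healthy are made up, and the parsing assumes the "health: HEALTH_..." line format shown in the output above.

```shell
#!/bin/sh
# ceph_health: read `ceph status` text on stdin and print the health state,
# for example HEALTH_OK or HEALTH_WARN. Prints nothing if no health line exists.
ceph_health() {
    awk '$1 == "health:" { print $2; exit }'
}

# is_healthy: succeed only when the state read from stdin is HEALTH_OK.
is_healthy() {
    [ "$(ceph_health)" = "HEALTH_OK" ]
}
```

Inside the toolbox pod this could be used as: ceph status | ceph_health. For the cluster above it would report HEALTH_WARN because of the clock skew between the monitors. The ceph health command prints the same state directly; the parser is mainly useful when a saved status dump is all you have.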
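III. Deleting the Ceph Cluster
Before deleting the Ceph cluster, first remove the pods that use its storage.
Delete the block storage and file storage
kubectl delete -n rook-ceph cephblockpool replicapool
kubectl delete storageclass rook-ceph-block
kubectl delete -f filesystem.yaml
kubectl delete storageclass rook-cephfs
Delete the Ceph cluster
kubectl -n rook-ceph delete cephcluster rook-ceph
Delete the operator and the related CRDs
kubectl delete -f cluster.yaml
kubectl delete -f operator.yaml
kubectl delete -f common.yaml
kubectl delete -f crds.yaml
Clean up the data on the hosts
After the Ceph cluster has been deleted, its configuration data is left behind in the /data/k8s/rook/ directory (the dataDirHostPath) on every node that ran Ceph components.
rm -rf /data/k8s/rook/*
If you deploy a new Ceph cluster later, delete this leftover data on every node first, otherwise the monitors will fail to start: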
# cat clean-rook-dir.sh
hosts=(
192.168.130.130
192.168.130.131
192.168.130.132
)
for host in "${hosts[@]}"; do
  ssh "$host" "rm -rf /data/k8s/rook/*"
done
Wipe the devices
yum install gdisk -y
export DISK="/dev/sdb"
sgdisk --zap-all "$DISK"
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
blkdiscard "$DISK"
ls /dev/mapper/ceph-* | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-*
If deleting the Ceph cluster hangs for some reason, run the following command first; after that the deletion no longer hangs:
kubectl -n rook-ceph patch cephclusters.ceph.rook.io rook-ceph -p '{"metadata":{"finalizers": []}}' --type=merge
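IV. Using the Ceph Cluster
Ceph can provide object storage, block storage, and a shared file system. Block storage is used here because it offers better compatibility and more flexibility.
# go to the Ceph examples folder
cd rook/cluster/examples/kubernetes/ceph/
# create the Ceph RBD StorageClass for testing
[root@k8s-master ceph]# kubectl apply -f csi/rbd/storageclass.yaml
cephblockpool.ceph.rook.io/replicapool created
storageclass.storage.k8s.io/rook-ceph-block created
[root@k8s-master ceph]# kubectl get storageclass
NAME              PROVISIONER                  AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com   17s
With a Ceph StorageClass in place, declaring a PVC is all it takes to have a block device and the matching PV created on demand. Compared with the traditional approach of creating a PV by hand and then declaring a matching PVC, this is simpler to operate and easier to manage.
Create a PVC from the example manifest that uses the Ceph StorageClass: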
[root@k8s-master ceph]# kubectl apply -f csi/rbd/pvc.yaml
persistentvolumeclaim/rbd-pvc created
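The csi/rbd/pvc.yaml manifest itself is not reproduced in this article. A minimal sketch of an equivalent claim is shown below: the claim name, size, access mode, and StorageClass are taken from the kubectl get pvc output in this article, while the function name rbd_pvc_manifest is made up and the file shipped in the Rook repository may differ in detail.

```shell
#!/bin/sh
# rbd_pvc_manifest: print a minimal PVC that requests 1Gi of ReadWriteOnce
# storage from the rook-ceph-block StorageClass.
rbd_pvc_manifest() {
    cat <<'EOF'
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
EOF
}
```

Printing the manifest has no side effects; piping it into kubectl apply -f - would create the claim in the current namespace.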
After the PVC has been created, check whether the PV was created automatically:
[root@k8s-master ceph]# kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
rbd-pvc   Bound    pvc-ef49d8f8-b9fd-4aad-b604-9d4ec667e346   1Gi        RWO            rook-ceph-block   23s
[root@k8s-master ceph]# kubectl get pvc,pv
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/rbd-pvc   Bound    pvc-ef49d8f8-b9fd-4aad-b604-9d4ec667e346   1Gi        RWO            rook-ceph-block   29s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS      REASON   AGE
persistentvolume/pvc-ef49d8f8-b9fd-4aad-b604-9d4ec667e346   1Gi        RWO            Delete           Bound    default/rbd-pvc   rook-ceph-block            26s
Create a Pod that uses the PVC:
[root@k8s-master ceph]# kubectl apply -f csi/rbd/pod.yaml
pod/csirbd-demo-pod created
The Pod manifest:
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
Wait for the Pod to come up, then look at the storage mounted inside it:
[root@k8s-master ceph]# kubectl exec -it csirbd-demo-pod -- bash -c df -h
Filesystem               1K-blocks     Used  Available  Use%  Mounted on
overlay                   16558080  8315872    8242208   51%  /
tmpfs                        65536        0      65536    0%  /dev
tmpfs                      1447648        0    1447648    0%  /sys/fs/cgroup
/dev/mapper/centos-root   16558080  8315872    8242208   51%  /etc/hosts
shm                          65536        0      65536    0%  /dev/shm
/dev/rbd0                   999320     2564     980372    1%  /var/lib/www/html
tmpfs                      1447648       12    1447636    1%  /run/secrets/kubernetes.io/serviceaccount
tmpfs                      1447648        0    1447648    0%  /proc/acpi
tmpfs                      1447648        0    1447648    0%  /proc/scsi
tmpfs                      1447648        0    1447648    0%  /sys/firmware
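Understanding the Access Mode property
Access control for storage has come a long way in the Kubernetes era, far beyond the crude approach of the plain Docker days. Here are the access modes Kubernetes provides when managing storage (PV, PVC):
- RWO: ReadWriteOnce, the volume can be mounted read-write by a single node only;
- ROX: ReadOnlyMany, the volume can be mounted by many nodes, read-only;
- RWX: ReadWriteMany, the volume can be mounted by many nodes, and both reads and writes are allowed.
So RWO, ROX, and RWX are only about the number of worker nodes using the volume at the same time, not about the number of pods!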
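The node-versus-pod distinction can be made concrete with a toy model. The function below is purely illustrative: it is not how Kubernetes implements the check, and the name can_mount is made up. Given an access mode, the node of a new pod that wants read-write access, and the nodes that already have the volume mounted, it reports whether the new mount would be allowed.

```shell
#!/bin/sh
# can_mount MODE NEW_NODE [ATTACHED_NODE...]
# Toy model: succeed if a pod on NEW_NODE may mount the volume read-write,
# given the list of nodes that already have it mounted.
can_mount() {
    mode=$1
    new_node=$2
    shift 2
    case "$mode" in
        RWX)
            return 0 ;;   # any number of nodes may read and write
        ROX)
            return 1 ;;   # read-only volume: a read-write mount is refused
        RWO)
            for node in "$@"; do
                # A different node already holds the volume: refuse.
                [ "$node" = "$new_node" ] || return 1
            done
            return 0 ;;   # unused, or only used on the same node
        *)
            echo "unknown access mode: $mode" >&2
            return 2 ;;
    esac
}
```

For example, can_mount RWO node03 node03 succeeds (a second pod on the same node), while can_mount RWO node01 node03 fails (a pod on a different node).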
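Until now I had no cloud-native storage system to try these features on; thanks to how convenient Rook is, I can finally give them a go. Two scenarios are planned:
- Test ReadWriteOnce, with the following steps:
  - First deploy a single pod named ceph-pv-pod that uses a PVC with the ReadWriteOnce access mode
  - Then deploy a deployment named n2 that uses the same PVC, with 1 pod replica
  - Scale n2 to 6 pod replicas
  - Observe the result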
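The manifests used for this test are not shown in the article. As a rough sketch of the second and third steps, the function below prints a Deployment named n2 that mounts the existing claim. The claim name ceph-pv-claim comes from the kubectl get pvc output later in this article; the image, label, volume name, and mount path are assumptions made for illustration.

```shell
#!/bin/sh
# n2_manifest: print a Deployment named n2 with one replica that mounts the
# existing ReadWriteOnce claim. Image, labels, and mount path are assumed.
n2_manifest() {
    cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: n2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: n2
  template:
    metadata:
      labels:
        app: n2
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ceph-pv-claim
EOF
}
```

The manifest could be applied with n2_manifest | kubectl apply -f - and then scaled with kubectl scale deployment n2 --replicas=6, which spreads replicas over several nodes and triggers the behaviour examined next.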
> kubectl get pod -o wide --sort-by=.spec.nodeName | grep -E '^(n2|ceph)'
NAME                  READY   STATUS              IP               NODE
n2-7db787d7f4-ww2fp   0/1     ContainerCreating   <none>           node01
n2-7db787d7f4-8r4n4   0/1     ContainerCreating   <none>           node02
n2-7db787d7f4-q5msc   0/1     ContainerCreating   <none>           node02
n2-7db787d7f4-2pfvd   1/1     Running             100.96.174.137   node03
n2-7db787d7f4-c8r8k   1/1     Running             100.96.174.139   node03
n2-7db787d7f4-hrwv4   1/1     Running             100.96.174.138   node03
ceph-pv-pod           1/1     Running             100.96.174.135   node03
The result shows that because the pod ceph-pv-pod was the first to bind the ReadWriteOnce PVC, the n2 pods scheduled to its node, node03, start successfully, while the n2 pods scheduled to other nodes cannot start. Here is the error for one of them:
> kubectl describe pod n2-7db787d7f4-ww2fp
...
Events:
  Type     Reason              Age        From                     Message
  ----     ------              ----       ----                     -------
  Normal   Scheduled           <unknown>  default-scheduler        Successfully assigned default/n2-7db787d7f4-ww2fp to node01
  Warning  FailedAttachVolume  10m        attachdetach-controller  Multi-Attach error for volume "pvc-fb2d6d97-d7aa-43df-808c-81f15e7a2797" Volume is already used by pod(s) n2-7db787d7f4-c8r8k, ceph-pv-pod, n2-7db787d7f4-2pfvd, n2-7db787d7f4-hrwv4
The pod events show the error clearly: because the volume is ReadWriteOnce, it cannot be attached to more than one node (Multi-Attach), which is the expected behaviour.
- Test ReadWriteMany, with the following steps:
  - First deploy a single pod named 2ceph-pv-pod that uses a PVC with the ReadWriteMany access mode
  - Then deploy a deployment named n3 that uses the same PVC, with 1 pod replica
  - Scale n3 to 6 pod replicas
  - Observe the result
My first idea was simply to edit the PVC YAML from the first test, but that produced the error below. It means that, apart from the requested storage size, the properties of an existing PVC cannot be changed.
kubectl apply -f pvc.yaml
The PersistentVolumeClaim "ceph-pv-claim" is invalid: spec: Forbidden: is immutable after creation except resources.requests for bound claims
So a new PVC had to be created. But declaring the new PVC ran into another problem: it stayed in the Pending state.
> kubectl get pvc
NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
ceph-pv-claim   Bound     pvc-fb2d6d97-d7aa-43df-808c-81f15e7a2797   1Gi        RWO            rook-ceph-block   36h
ceph-pvc-2      Pending                                                                        rook-ceph-block   10m
> kubectl describe pvc ceph-pvc-2
...
Events:
  Type     Reason                Age                   From                                                                                                              Message
  ----     ------                ----                  ----                                                                                                              -------
  Normal   Provisioning          4m41s (x11 over 13m)  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-66f64ff49c-wvpkg_b78217fb-8739-4ced-9e18-7430fdde964b  External provisioner is provisioning volume for claim "default/ceph-pvc-2"
  Warning  ProvisioningFailed    4m41s (x11 over 13m)  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-66f64ff49c-wvpkg_b78217fb-8739-4ced-9e18-7430fdde964b  failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
  Normal   ExternalProvisioning  3m4s (x42 over 13m)   persistentvolume-controller                                                                                       waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
The event details contain this error message:
failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
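In other words, multi-node access modes are only supported on RBD volumes of the block type, that is, volumes consumed as raw block devices rather than formatted and mounted as a file system the way this PVC is. So is this Ceph RBD StorageClass a fake "block storage"?
When you hit a baffling error like this, a good first step is to search the issues and PRs of the official GitHub repository for similar problems. After some searching I found a relevant statement from a maintainer: using the RWX access mode with Ceph RBD is not recommended, because without a locking mechanism at the application layer the data may get corrupted.
That in turn led to the official statement: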
There are two CSI drivers integrated with Rook that will enable different scenarios:
- RBD: This driver is optimized for RWO pod access where only one pod may access the storage
- CephFS: This driver allows for RWX with one or more pods accessing the same storage
So the official website already states that CephFS is the right choice for the RWX mode.
Testing ReadWriteMany (RWX) mode with CephFS
The project already provides a StorageClass for CephFS; we only need to deploy it:
cd rook/cluster/examples/kubernetes/ceph/
kubectl apply -f filesystem.yaml
kubectl apply -f csi/cephfs/storageclass.yaml
[root@k8s-master ceph]# kubectl get sc
NAME              PROVISIONER                     AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com      10m
rook-cephfs       rook-ceph.cephfs.csi.ceph.com   57m
With the CephFS StorageClass and the filesystem created, the test can begin. The scenario is a deployment with 3 replicas that use a volume in RWX mode:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-cephfs
Create an nginx deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: n4-cephfs
    pv: cephfs
  name: n4-cephfs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: n4-cephfs
      pv: cephfs
  template:
    metadata:
      labels:
        app: n4-cephfs
        pv: cephfs
    spec:
      volumes:
      - name: fsceph-pv-storage
        persistentVolumeClaim:
          claimName: cephfs-pvc
      containers:
      - image: 10.2.55.8:5000/library/nginx:1.18.0
        name: nginx
        ports:
        - containerPort: 80
          name: "http-server"
        volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: fsceph-pv-storage
After deploying, check how each pod is running and whether the PV and PVC were created:
[root@k8s-master ceph]# kubectl get pod -l pv=cephfs -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
n4-cephfs-66cbcfd84b-25k4n   1/1     Running   0          44m   10.244.2.196   k8s-node2    <none>           <none>
n4-cephfs-66cbcfd84b-7rgw6   1/1     Running   0          44m   10.244.1.242   k8s-node1    <none>           <none>
n4-cephfs-66cbcfd84b-gl9pj   1/1     Running   0          44m   10.244.0.178   k8s-master   <none>           <none>
[root@k8s-master ceph]# kubectl get pvc,pv
NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/cephfs-pvc   Bound    pvc-7265f11e-39ce-42df-9b7c-02e8916bc5c2   1Gi        RWX            rook-cephfs       44m
persistentvolumeclaim/rbd-pvc      Bound    pvc-ef49d8f8-b9fd-4aad-b604-9d4ec667e346   1Gi        RWO            rook-ceph-block   13m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                STORAGECLASS      REASON   AGE
persistentvolume/pvc-7265f11e-39ce-42df-9b7c-02e8916bc5c2   1Gi        RWX            Delete           Bound    default/cephfs-pvc   rook-cephfs                44m
persistentvolume/pvc-ef49d8f8-b9fd-4aad-b604-9d4ec667e346   1Gi        RWO            Delete           Bound    default/rbd-pvc      rook-ceph-block            13m
[root@k8s-master ~]# kubectl exec pod/n4-cephfs-66cbcfd84b-7rgw6 -- bash -c df -h
Filesystem               1K-blocks     Used  Available  Use%  Mounted on
overlay                   16558080  8313176    8244904   51%  /
tmpfs                        65536        0      65536    0%  /dev
tmpfs                      1447648        0    1447648    0%  /sys/fs/cgroup
/dev/mapper/centos-root   16558080  8313176    8244904   51%  /etc/hosts
shm                          65536        0      65536    0%  /dev/shm
10.99.56.167:6789,10.110.248.120:6789,10.110.195.253:6789:/volumes/csi/csi-vol-e14ef036-5002-11eb-b1e4-e2f740f51378/319af089-cd37-4b55-a776-d81216bca859  1048576  0  1048576  0%  /usr/share/nginx/html
tmpfs                      1447648       12    1447636    1%  /run/secrets/kubernetes.io/serviceaccount
tmpfs                      1447648        0    1447648    0%  /proc/acpi
tmpfs                      1447648        0    1447648    0%  /proc/scsi
tmpfs                      1447648        0    1447648    0%  /sys/firmware
[root@k8s-master ~]# kubectl exec pod/n4-cephfs-66cbcfd84b-25k4n -- bash -c df -h
Filesystem               1K-blocks     Used  Available  Use%  Mounted on
overlay                   16558080  9623308    6934772   59%  /
tmpfs                        65536        0      65536    0%  /dev
tmpfs                      1447648        0    1447648    0%  /sys/fs/cgroup
/dev/mapper/centos-root   16558080  9623308    6934772   59%  /etc/hosts
shm                          65536        0      65536    0%  /dev/shm
10.99.56.167:6789,10.110.248.120:6789,10.110.195.253:6789:/volumes/csi/csi-vol-e14ef036-5002-11eb-b1e4-e2f740f51378/319af089-cd37-4b55-a776-d81216bca859  1048576  0  1048576  0%  /usr/share/nginx/html
tmpfs                      1447648       12    1447636    1%  /run/secrets/kubernetes.io/serviceaccount
tmpfs                      1447648        0    1447648    0%  /proc/acpi
tmpfs                      1447648        0    1447648    0%  /proc/scsi
tmpfs                      1447648        0    1447648    0%  /sys/firmware
The pods are spread across different nodes and all of them start successfully, and the PV is created and bound as well. This matches the expected result.
A closer look at how the storage is mounted
With the two test scenarios done, let us look at how cloud-native storage works behind the scenes:
- Look at the mounts from inside the pod
Compare the storage mounts inside the pods of the two test scenarios:
# use ceph as rbd storage
# it is mounted as a block device
df -h
/dev/rbd0 976M 3.3M 957M 1% /usr/share/nginx/html
# use ceph as file system storage
# it is mounted as a network file system, addressed by monitor endpoint and path
df -h
10.109.80.220:6789:/volumes/csi/csi-vol-1dc92634-79cd-11ea-96a3-26ab72958ea2 1.0G 0 1.0G 0% /usr/share/nginx/html
Ceph RBD and CephFS are clearly mounted into the pod in different ways.
- Look at the mounts on the host
# use ceph as rbd storage
> df -h |grep rbd # on the worker node
/dev/rbd0 976M 3.3M 957M 1% /var/lib/kubelet/pods/e432e18d-b18f-4b26-8128-0b0219a60662/volumes/kubernetes.io~csi/pvc-fb2d6d97-d7aa-43df-808c-81f15e7a2797/mount
# use ceph as file system storage
> df -h |grep csi
10.109.80.220:6789:/volumes/csi/csi-vol-1dc92634-79cd-11ea-96a3-26ab72958ea2 1.0G 0 1.0G 0% /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-75b40dd7-b880-4d67-9da6-88aba8616466/globalmount
A short explanation of the naming rule for the per-pod paths on the host:
/var/lib/kubelet/pods/<Pod ID>/volumes/kubernetes.io~<Volume type>/<Volume name>
In the end the directory is mapped into the container, conceptually the same as a docker run bind mount:
docker run -v /var/lib/kubelet/pods/<Pod-ID>/volumes/kubernetes.io~<Volume type>/<Volume name>:/<target directory in the container> image ...
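The naming rule for the per-pod directory can be written as a small helper, which is handy when hunting for a mount on a worker node. This is a sketch: the function name pod_volume_path is made up, and the optional fourth argument is an assumption that covers clusters where the kubelet directory has been moved (such as /data/k8s/kubelet in this deployment).

```shell
#!/bin/sh
# pod_volume_path POD_UID VOLUME_TYPE VOLUME_NAME [KUBELET_DIR]
# Build the per-pod volume directory following the naming rule above.
# KUBELET_DIR defaults to /var/lib/kubelet.
pod_volume_path() {
    printf '%s/pods/%s/volumes/kubernetes.io~%s/%s\n' \
        "${4:-/var/lib/kubelet}" "$1" "$2" "$3"
}
```

With the pod UID and PV name from the RBD example above and the volume type csi, the function returns the directory that holds the mount subdirectory seen in the df output.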
- Look at the mounts from Kubernetes
Kubernetes offers a resource type that exposes the relationship between the storage driver (attacher), the PV, and the node: volumeattachment. Its official description is:
VolumeAttachment captures the intent to attach or detach the specified volume to/from the specified node. VolumeAttachment objects are non-namespaced.
Here is the current state:
[root@k8s-master ~]# kubectl get volumeattachment
NAME                                                                   ATTACHER                        PV                                         NODE         ATTACHED   AGE
csi-0824b06d082cc7fca254899682d6665890473ad6023e13361031c57f60094361   rook-ceph.rbd.csi.ceph.com      pvc-ef49d8f8-b9fd-4aad-b604-9d4ec667e346   k8s-node1    true       15m
csi-321887a821d7fd1ad443965cb5527feaaec7db107331312f2e33da02c8544938   rook-ceph.cephfs.csi.ceph.com   pvc-7265f11e-39ce-42df-9b7c-02e8916bc5c2   k8s-master   true       48m
csi-3beca915ee1a56489667bb1f848e9dd23ce605a408e308f46ce28b1f301bf613   rook-ceph.cephfs.csi.ceph.com   pvc-7265f11e-39ce-42df-9b7c-02e8916bc5c2   k8s-node1    true       48m
csi-c684923291c052d41ae6cc5afd7f9852f6e6d3e2d5183ea4148c29ed1430e5b8   rook-ceph.cephfs.csi.ceph.com   pvc-7265f11e-39ce-42df-9b7c-02e8916bc5c2   k8s-node2    true       48m
It shows which node each PV is attached to, which is handy for troubleshooting.
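When troubleshooting, a quick summary of how many nodes each PV is attached to makes the RWO/RWX difference visible at a glance. The helper below is a sketch with a made-up name (nodes_per_pv); it assumes the default column order NAME ATTACHER PV NODE ATTACHED AGE shown above and reads that listing from standard input.

```shell
#!/bin/sh
# nodes_per_pv: read `kubectl get volumeattachment` output on stdin and print
# "<pv> <number of attachments>" for each PV, sorted by PV name.
# Only rows whose ATTACHED column is "true" are counted; the header is skipped.
nodes_per_pv() {
    awk 'NR > 1 && $5 == "true" { count[$3]++ }
         END { for (pv in count) print pv, count[pv] }' | sort
}
```

Used as kubectl get volumeattachment | nodes_per_pv, the listing above would report three attachments for the CephFS PV (pvc-7265f11e-...) and one for the RBD PV (pvc-ef49d8f8-...).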
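Graphical management with Ceph Dashboard
Rook thoughtfully ships a graphical management solution for Ceph: the Ceph dashboard. A standard Rook deployment already includes it. By default it is not reachable from outside the cluster; exposing the service manually as a NodePort is enough:
[root@k8s-master ~]# kubectl -n rook-ceph get svc |grep dash
rook-ceph-mgr-dashboard   NodePort   10.111.149.53   <none>   7000:30111/TCP   25h
Open http://node-ip:30111 in a browser (SSL is disabled in the cluster.yaml above; with the default SSL setting the URL would use https). The default login user is admin, and the password can be retrieved like this:
[root@k8s-master ~]# kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo
]Qz!5OK^|%#a$lzgQ(<n
After logging in, the dashboard offers a lot of information, including read/write throughput monitoring and health monitoring. It is a great help for managing Ceph.
It also provides interactive API documentation, which is a nice touch.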
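Summary
Rook helps you quickly set up a production-ready cloud-native storage platform and manages its full lifecycle, which makes it a good fit for storage administrators at beginner, intermediate, and advanced levels alike.
The deployment materials used in this article can be obtained here: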

