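A project at my company recently needed to run GPU jobs in a k8s cluster, so I looked into how to set that up. The deployment steps are below.
1. First, you need a working k8s cluster. For cluster deployment, see "Installing k8s with kubeadm".
2. Prepare the GPU node
2.1 Install the driver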
    curl -fsSL https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
    echo "deb https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/ ./" | sudo tee /etc/apt/sources.list.d/cuda.list
    sudo apt-get update
    sudo apt-get install -y cuda-drivers-455  # install whichever version you need
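Before moving on, it is worth confirming that the driver actually loaded. The sketch below is one way to do that; `check_nvidia_driver` is a hypothetical helper name of my own, and it assumes `nvidia-smi` was installed along with the driver package (a reboot may be needed before the driver is usable).

```shell
# Hypothetical helper: report the GPU model and driver version, or fail
# with a message if nvidia-smi is not available on this node.
check_nvidia_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: driver is not installed or not on PATH"
    return 1
  fi
  # Prints one CSV line per GPU: name, driver version
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

# Usage on the GPU node:
#   check_nvidia_driver
```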
2.2 Install nvidia-docker2
<!-- Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit. This is because the new --gpus options hasn't reached kubernetes yet -->
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-docker2

    ## Add the following to /etc/docker/daemon.json so that nvidia becomes the default runtime
    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }

    ## Restart docker
    sudo systemctl restart docker
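A typo in daemon.json is easy to miss, so a quick check that the default runtime setting is really there can save some debugging later. This is only a sketch: `has_nvidia_default_runtime` is a hypothetical helper, and it does a plain text match that assumes the key and its value sit on the same line, as in the snippet above.

```shell
# Hypothetical helper: succeeds if the given daemon.json sets "nvidia"
# as the default runtime. Defaults to /etc/docker/daemon.json.
# Assumption: "default-runtime" and its value are on the same line.
has_nvidia_default_runtime() {
  config_file="${1:-/etc/docker/daemon.json}"
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config_file"
}

# Usage on the GPU node:
#   has_nvidia_default_runtime && echo "default runtime is nvidia"
```

After docker restarts, the output of docker info should also list nvidia as the default runtime.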
2.3 Install nvidia-device-plugin in the k8s cluster
This is what makes the cluster GPU-aware.
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.3/nvidia-device-plugin.yml
    # If the file is unreachable because of network problems, open it in a browser:
    # https://github.com/NVIDIA/k8s-device-plugin/blob/v0.7.3/nvidia-device-plugin.yml
    ## then copy its contents into a local file and create it from there
nvidia-device-plugin does three things:
- Expose the number of GPUs on each node of your cluster
- Keep track of the health of your GPUs
- Run GPU-enabled containers in your Kubernetes cluster
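Then join the node to the k8s cluster.
Once the steps above have completed successfully, run the following commands. If the output resembles the screenshot below, the plugin is installed correctly.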
    kubectl get pod --all-namespaces | grep nvidia
    kubectl describe node 10.31.0.17
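The part of the node description that matters here is the nvidia.com/gpu entry under Capacity and Allocatable. As a rough sketch, the filter below pulls just those values out of the describe output; `gpu_counts` is a hypothetical helper, and it assumes the usual layout where the two sections start with "Capacity:" and "Allocatable:" at the beginning of a line.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and
# print the nvidia.com/gpu value found in the Capacity and Allocatable sections.
gpu_counts() {
  awk '
    /^Capacity:/    { section = "capacity" }
    /^Allocatable:/ { section = "allocatable" }
    /nvidia\.com\/gpu:/ { print section "=" $2 }
  '
}

# Usage (node name taken from the command above):
#   kubectl describe node 10.31.0.17 | gpu_counts
```

If nothing is printed, the node is not advertising any GPUs, which usually points back to the driver, the docker runtime setting, or the device plugin pod.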
3. Run GPU jobs
    # cat nvidia-gpu-demo.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:9.0-devel
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
        - name: digits-container
          image: nvidia/digits:6.0
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
    kubectl apply -f nvidia-gpu-demo.yaml
    kubectl exec -it xxx-76dd5bd849-hlmdr -- bash   # substitute the name of your own pod
    # nvidia-smi    (run inside the container)
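Note that the demo pod asks for 2 GPUs in each of its two containers, so it can only be scheduled on a node with 4 free GPUs. For a first test on a smaller node, a single-GPU pod that runs nvidia-smi once and exits may be easier. The following is a minimal sketch under my own assumptions: the helper `write_gpu_smoke_pod`, the pod name gpu-smoke-test, and the file name are all hypothetical, and the image is simply the one used in the demo above.

```shell
# Hypothetical helper: write a one-container pod manifest that requests
# a single GPU, runs nvidia-smi once, and then stops.
write_gpu_smoke_pod() {
  manifest="${1:-gpu-smoke-test.yaml}"
  cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # hypothetical pod name
spec:
  restartPolicy: Never          # run once instead of restarting
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel   # same image as the demo above
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # request a single GPU
EOF
}

# Usage (requires a working cluster with the device plugin installed):
#   write_gpu_smoke_pod gpu-smoke-test.yaml
#   kubectl apply -f gpu-smoke-test.yaml
#   kubectl logs gpu-smoke-test
```

If scheduling and the runtime are both working, the pod log should contain the nvidia-smi table for the one GPU assigned to the container.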
That is a simple way to get k8s scheduling GPU jobs.
If you run into problems, feel free to discuss them in the comments.