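Scheduling GPUs with k8s
One of our projects recently needed to run GPU tasks in a k8s cluster, so I looked into how to do it. The deployment steps are below.
1. First you need a working k8s cluster. For cluster deployment, see "Installing k8s with kubeadm".
2. Prepare the GPU node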
2.1 Install the driver

    curl -fsSL https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
    # tee runs under sudo, so the write to /etc/apt works for a non-root user too
    echo "deb https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/ ./" | sudo tee /etc/apt/sources.list.d/cuda.list
    sudo apt-get update
    sudo apt-get install -y cuda-drivers-455  # install whichever version you need
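Before moving on, it may be worth confirming that the driver that is actually loaded matches the package you asked for (a reboot may be needed before the new driver is picked up). Here is a minimal sketch, assuming `nvidia-smi` is on the PATH of the GPU node. The helper name `driver_major` is my own and not part of any NVIDIA tooling.

```shell
# Hypothetical helper: print the major part of a driver version string,
# e.g. the "455" in "455.45.01".
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Usage on a GPU node (assumes nvidia-smi is installed and on PATH):
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   [ "$(driver_major "$version")" = "455" ] && echo "driver 455 is loaded"
```

If the major version is not the one you installed, an older driver is probably still loaded.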
2.2 Install nvidia-docker2
<!-- Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit. This is because the new --gpus options hasn't reached kubernetes yet -->

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-docker2

    ## Add the following to /etc/docker/daemon.json so that nvidia becomes the default runtime
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    ## Restart docker
    sudo systemctl restart docker
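A typo in daemon.json is easy to miss, and without the nvidia default runtime the device plugin cannot hand GPUs to containers. The following is a rough sketch of a check you could run after editing the file. The helper name `has_nvidia_default_runtime` is hypothetical, and it does a plain text match rather than real JSON parsing.

```shell
# Hypothetical helper: succeed if the given daemon.json sets
# "default-runtime" to "nvidia". Whitespace is stripped first so the
# check works whether or not the key and value share a line.
has_nvidia_default_runtime() {
    tr -d '[:space:]' < "$1" | grep -q '"default-runtime":"nvidia"'
}

# Usage on the GPU node:
#   has_nvidia_default_runtime /etc/docker/daemon.json && echo "default runtime is nvidia"
#   docker info | grep -i 'default runtime'
```

The second usage line asks the running daemon directly, which also confirms that the restart picked up the change.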
2.3 Install nvidia-device-plugin in the k8s cluster
This is what makes the cluster GPU-aware.

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.3/nvidia-device-plugin.yml
    # If network problems keep you from reaching that file, open https://github.com/NVIDIA/k8s-device-plugin/blob/v0.7.3/nvidia-device-plugin.yml in a browser
    ## and copy its contents into a local file, then apply that instead
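nvidia-device-plugin does three things:
- Expose the number of GPUs on each node of your cluster
- Keep track of the health of your GPUs
- Run GPU enabled containers in your Kubernetes cluster.
After that, join the node to the k8s cluster.
Once the steps above have completed successfully, run the following commands. If the output looks like the screenshots below, the plugin is installed.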

    kubectl get pod --all-namespaces | grep nvidia
    kubectl describe node 10.31.0.17
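In the `kubectl describe node` output, the line to look for is `nvidia.com/gpu:` under Capacity and Allocatable. If you would rather not scan the whole output by eye, something like the sketch below can pull the number out. The helper name `node_gpu_capacity` is made up for this example, and it assumes the Capacity section is printed before Allocatable.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and
# print the value of the first "nvidia.com/gpu:" line, which is expected
# to be the entry in the Capacity section. Prints 0 if there is no such line.
node_gpu_capacity() {
    awk '$1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
         END { if (!found) print 0 }'
}

# Usage:
#   kubectl describe node 10.31.0.17 | node_gpu_capacity
```

A result of 0 on a GPU node usually means the device plugin pod is not running there yet.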
3. Run GPU Jobs

    # cat nvidia-gpu-demo.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:9.0-devel
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
        - name: digits-container
          image: nvidia/digits:6.0
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
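Keep in mind that each container sets its own limit, so this Pod as a whole asks for 4 GPUs and will stay Pending unless a single node has that many free. As a quick sanity check before applying a manifest, you could total the limits with something like the sketch below. The helper name `total_gpu_limit` is hypothetical, and it is a plain text scan, not a YAML parser: it assumes each limit is written on one line as `nvidia.com/gpu: N`.

```shell
# Hypothetical helper: sum every "nvidia.com/gpu: N" value in a manifest.
# Text scan only; assumes one limit per line in the form "nvidia.com/gpu: N".
total_gpu_limit() {
    awk '$1 == "nvidia.com/gpu:" { sum += $2 } END { print sum + 0 }' "$1"
}

# Usage:
#   total_gpu_limit nvidia-gpu-demo.yaml
```

Compare the result with the GPU count that `kubectl describe node` reports for the node you expect the Pod to land on.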

    kubectl apply -f nvidia-gpu-demo.yaml
    # xxx-76dd5bd849-hlmdr is a placeholder; use the name of your own pod
    kubectl exec -it xxx-76dd5bd849-hlmdr -- bash
    # Then, inside the container:
    # nvidia-smi
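That is all it takes to get basic GPU task scheduling working on k8s.
If you run into problems, feel free to discuss them in the comments.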