Using GPUs in Kubernetes (k8s)


1. Using Device Plugins

Schedule GPUs | official Kubernetes documentation

Kubernetes implements Device Plugins to let Pods access special hardware features such as GPUs. As a cluster operator, you need to install the GPU driver from the hardware vendor on each node and run the corresponding device plugin provided by that GPU vendor.

Once these conditions are met, Kubernetes exposes amd.com/gpu or nvidia.com/gpu as schedulable resources, and containers can consume GPU devices by requesting the <vendor>.com/gpu resource. When using GPUs, however, there are some restrictions on how resource requirements can be specified:

  • GPUs can only be specified in the limits section, which means:
    • You may not specify requests without also specifying limits
    • You may specify both limits and requests, but the two values must be equal
    • You may specify limits without specifying requests; Kubernetes will use the limit value as the default request
  • GPUs are not shared between containers (or Pods), and GPUs cannot be overcommitted
  • Each container can request one or more GPUs, but requesting a fraction of a GPU is not allowed
# Each container requests 2 GPUs, so the Pod needs 4 GPUs in total
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2
    - name: digits-container
      image: nvcr.io/nvidia/digits:20.12-tensorflow-py3
      resources:
        limits:
          nvidia.com/gpu: 2
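
If you do set requests explicitly, they must equal the limits. A minimal sketch of that variant, reusing the CUDA image from the example above (the Pod name here is only illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-explicit-requests
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        # For GPUs, requests and limits must be equal
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1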

2. Deploying the AMD GPU Device Plugin

To use AMD GPU resources on a node, you first need to deploy the k8s-device-plugin, and the Kubernetes node must already have the AMD GPU Linux driver installed.

# Install the GPU device plugin
$ kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml
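
Once the plugin is running, a Pod can consume an AMD GPU by requesting the amd.com/gpu resource. A minimal sketch (the Pod name and image are placeholders, not from the original steps; substitute your own ROCm workload image):

apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-pod
spec:
  containers:
    - name: rocm-container
      image: rocm/tensorflow:latest   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1   # request a single AMD GPU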

3. Deploying the NVIDIA GPU Device Plugin

To use NVIDIA GPU resources on a node, you first need to deploy the k8s-device-plugin, and the following prerequisites must be met in advance:

  • The Kubernetes nodes must have the NVIDIA driver installed
  • The Kubernetes nodes must have nvidia-docker 2.0 installed
  • Docker's default runtime must be set to nvidia-container-runtime instead of runc
  • The NVIDIA driver version must be 384.81 or later
# Install the nvidia-docker 2.0 tooling
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

# Configure nvidia-container-runtime as Docker's default runtime
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
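
After editing /etc/docker/daemon.json, restart Docker and check that the default runtime has switched to nvidia (an optional sanity check, not part of the original steps):

# Verify the default runtime
$ sudo systemctl restart docker
$ docker info | grep -i 'default runtime'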

# Install the GPU device plugin
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
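
To confirm the plugin is actually running, look for its Pod in the kube-system namespace (the exact Pod name varies with the manifest version):

# Check the device plugin Pod
$ kubectl get pods -n kube-system | grep nvidia-device-plugin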

Helm installation

# Alternatively, install with Helm
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
$ helm install --version=0.9.0 --generate-name nvdp/nvidia-device-plugin

# Alternatively, run the plugin directly with Docker
$ docker run -it \
    --security-opt=no-new-privileges \
    --cap-drop=ALL --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvcr.io/nvidia/k8s-device-plugin:devel

In short, with the device plugin in place, once we specify one of the following resource fields in the Pod spec, the system allocates the requested number of GPUs to our service when the Pod starts, so our program can use the GPU resources.

  • amd.com/gpu
  • nvidia.com/gpu

Note that installing the GPU driver for the first time does not require rebooting the server, whereas later driver upgrades do. Even so, it is recommended to reboot after the first driver installation as well, to avoid unexpected problems.

4. Verification

1. Add the node label

# kubectl label nodes 192.168.1.56 nvidia.com/gpu.present=true

root@hello:~# kubectl  get  nodes -L nvidia.com/gpu.present
NAME           STATUS                     ROLES    AGE    VERSION   GPU.PRESENT
192.168.1.55   Ready,SchedulingDisabled   master   128m   v1.22.2  
192.168.1.56   Ready                      node     127m   v1.22.2   true

2. Install Helm and deploy the device plugin

root@hello:~# curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
root@hello:~# sudo apt-get install apt-transport-https --yes
root@hello:~# echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
root@hello:~# sudo apt-get update
root@hello:~# sudo apt-get install helm

helm install \
    --version=0.10.0 \
    --generate-name \
    nvdp/nvidia-device-plugin
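
This assumes the nvdp repository was already added in section 3. You can confirm that the chart was deployed with helm list (an optional check):

$ helm list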

3. Check that nvidia resources are present

kubectl describe node 192.168.1.56 | grep nv
                    nvidia.com/gpu.present=true
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  kube-system                 nvidia-device-plugin-1637728448-fgg2d         0 (0%)        0 (0%)      0 (0%)           0 (0%)         50s
  nvidia.com/gpu     0           0
root@hello:~#

Download the test image

root@hello:~# docker pull registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
root@hello:~# docker save -o tensorflow-gpu.tar  registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
root@hello:~# docker load -i tensorflow-gpu.tar

Create a TensorFlow test Pod

root@hello:~# vim gpu-test.yaml
root@hello:~# cat gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
  labels:
    test-gpu: "true"
spec:
  containers:
  - name: training
    image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=300
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - effect: NoSchedule
    operator: Exists
root@hello:~#

root@hello:~# kubectl  apply -f gpu-test.yaml
pod/test-gpu created
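
Before reading the logs, you can check that the Pod has reached the Running state (an optional step):

$ kubectl get pod test-gpu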

Check the logs

 kubectl logs test-gpu
WARNING:tensorflow:From tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:


Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.


See tf.nn.softmax_cross_entropy_with_logits_v2.


2021-11-24 04:38:50.846973: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 04:38:50.847698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:10.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2021-11-24 04:38:50.847759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:00:10.0, compute capability: 7.5)
root@hello:~#
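
As an extra check, you can run nvidia-smi inside the Pod to confirm that the container sees the GPU, assuming the Pod is still running (this step is not in the original transcript):

$ kubectl exec test-gpu -- nvidia-smi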
