https://blog.gtwang.org/virtualization/centos-linux-install-nvidia-docker-gpu-container-tutorial/
https://www.cnblogs.com/yxfangcs/p/8438462.html
https://hub.docker.com/r/nvidia/cuda/
https://cloud.google.com/compute/docs/gpus/add-gpus
https://kairen.github.io/2018/02/17/container/docker-nvidia-install/
This post mainly covers how to use NVIDIA Docker v2 to give containers access to GPUs.
In the past, NVIDIA Docker v1 required replacing docker with nvidia-docker to run GPU images, or manually mounting the NVIDIA driver and CUDA libraries so that Docker could build and run GPU application images.
Newer Docker versions can instead select the NVIDIA Docker v2 runtime via --runtime to run GPU applications.
If your Docker is recent enough, you may not need to install nvidia-docker at all to use GPUs with Docker; see the discussion below.
nvidia-docker is a plugin.
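As an aside for newer setups (not the Docker 17.05 host used in the rest of this post): starting with Docker 19.03 and the nvidia-container-toolkit package, the daemon understands a --gpus flag directly, so neither the nvidia-docker wrapper nor the --runtime switch is needed:

```shell
# Assumes Docker >= 19.03 with nvidia-container-toolkit installed.
docker run --gpus all --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi

# Expose only specific GPUs to the container:
docker run --gpus '"device=0,1"' --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
```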
NVIDIA driver and CUDA library
1: Install the driver and CUDA
[root@v5]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
[root@v5]#
[root@v5]#
[root@v5]# cat /usr/local/cuda/version.txt
CUDA Version 9.1.85
[root@~]# nvidia-smi
Sat Apr 28 14:21:36 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
https://github.com/moby/moby/issues/35906
yum install http://mirror.centos.org/centos/7/os/x86_64/Packages/libseccomp-2.3.1-3.el7.x86_64.rpm
docker/nvidia-container-runtime depends on a relatively new libseccomp; the version shipped with the system is too old.
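A quick sketch for checking the installed libseccomp before installing nvidia-container-runtime (version_ge is a small helper defined here, not a system command; 2.3.1 matches the RPM installed above):

```shell
# version_ge succeeds when $1 >= $2 in dotted-version order (GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

installed=$(rpm -q --qf '%{VERSION}' libseccomp 2>/dev/null || echo 0)
if version_ge "$installed" 2.3.1; then
  echo "libseccomp $installed is new enough"
else
  echo "libseccomp $installed is too old; install >= 2.3.1 first"
fi
```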
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions
2: To use GPUs with Docker, install nvidia-container-runtime and register the nvidia runtime with the Docker daemon.
How to register:
- Install the repository for your distribution by following the instructions here.
- Install the nvidia-container-runtime package:
sudo yum install nvidia-container-runtime
3: Register the nvidia runtime:
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo pkill -SIGHUP dockerd
The methods described at https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup are each independent; do not apply more than one, or errors will occur. Here we register the nvidia runtime via daemon.json.
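After writing daemon.json and sending SIGHUP, a quick sanity check that the daemon picked up the new runtime:

```shell
# The Runtimes line should list both the default runc and the new nvidia runtime.
docker info 2>/dev/null | grep -i runtime
# Then select it explicitly with: docker run --runtime=nvidia ...
```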
If your Docker is not the latest version, you need to use nvidia-docker together with a matching nvidia-container-runtime version, as follows:
How do I install 2.0 if I'm not using the latest Docker version?
You must pin the versions of both nvidia-docker2 and nvidia-container-runtime when installing, for instance:
sudo apt-get install -y nvidia-docker2=2.0.2+docker1.12.6-1 nvidia-container-runtime=1.1.1+docker1.12.6-1
Use apt-cache madison nvidia-docker2 nvidia-container-runtime or yum search --showduplicates nvidia-docker2 nvidia-container-runtime to list the available versions.
The minimum Docker version supported by nvidia-docker is Docker 1.12.
[root@~]# docker version
Client:
 Version:         17.05.0-ce
 API version:     1.29
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64

Server:
 Version:         17.05.0-ce
 API version:     1.29 (minimum version 1.12)
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64
 Experimental:    false
Version matching is important; otherwise you will hit the error flag provided but not defined: -console.
Install matching versions: the nvidia-container-runtime version installed by default above did not work, but nvidia-container-runtime-2.0.0-1.docker17.03.2.x86_64 did.
Finally, the long-awaited output appeared:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
Sat Apr 28 07:18:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Success.
https://github.com/NVIDIA/nvidia-docker/issues/531#issuecomment-343993909
No, please don't install the driver/CUDA inside the container :). The image won't be portable to other machines.
With 2.0, we now use environment variables to list the driver libraries that must be mounted inside the container at runtime:
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
In your case, you will need the following:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES video,compute,utility
Also note that starting from CUDA 9.0, we have a new tag: nvidia/cuda:9.0-base. It will set up our repositories and set the environment variables (but you will be missing the video capability for ffmpeg). It avoids having a base image with ALL the CUDA libraries.
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
https://hub.docker.com/r/nvidia/cuda/
The Dockerfiles at these two links can be used to build the docker CUDA images; the environment variables in them are exactly the ones defined by nvidia-container-runtime (https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup).
Building one yourself from scratch is not easy; https://www.cnblogs.com/yxfangcs/p/8438462.html may help in part.
The best approach is simply to find the version you want at https://hub.docker.com/r/nvidia/cuda/, e.g. 9.1-runtime-centos7,
then:
docker pull nvidia/cuda:9.1-runtime-centos7
Tags: what the different tag variants mean
CUDA images come in three flavors:
- base: starting from CUDA 9.0, contains the bare minimum (libcudart) to deploy a pre-built CUDA application. Use this image if you want to manually select which CUDA packages you want to install.
- runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application using multiple CUDA libraries.
- devel: extends the runtime image by adding the compiler toolchain, the debugging tools, the headers and the static libraries. Use this image to compile a CUDA application from sources.
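These flavors combine naturally with a multi-stage build (supported since Docker 17.05, which is exactly the version on this host): compile against devel, then ship the binary on the much smaller runtime image. A sketch, where vectoradd.cu stands in for your own source file:

```dockerfile
# Build stage: full compiler toolchain.
FROM nvidia/cuda:9.1-devel-centos7 AS build
COPY vectoradd.cu /src/
RUN nvcc -o /src/vectoradd /src/vectoradd.cu

# Deploy stage: shared libraries only.
FROM nvidia/cuda:9.1-runtime-centos7
COPY --from=build /src/vectoradd /usr/local/bin/vectoradd
CMD ["vectoradd"]
```

Run the resulting image with docker run --runtime=nvidia as shown elsewhere in this post.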
For example, 9.1-base-centos7:
FROM centos:7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"

RUN NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

# Download cuda.repo locally beforehand
COPY cuda.repo /etc/yum.repos.d/cuda.repo

# Pin the version to install
ENV CUDA_VERSION 9.1.85
ENV CUDA_PKG_VERSION 9-1-$CUDA_VERSION-1

# Install cuda-9.1 and symlink it as /usr/local/cuda
RUN yum install -y \
        cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-9.1 /usr/local/cuda && \
    rm -rf /var/cache/yum/*

# nvidia-docker 1.0
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# One quirk here: if you register the runtime manually and start containers with docker run --runtime=nvidia, the container has no /usr/local/nvidia directory; containers started via nvidia-docker or mesos-executor do have it.
# A pitfall we hit at my company: we launch containers via Mesos containers, and environment variables set this way did not propagate into the container; running ldconfig in the Dockerfile did not help either. You have to run ldconfig in the CMD startup script for it to take effect. This deviates considerably from standard Docker practice.
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,video  # adjust to your needs
ENV NVIDIA_REQUIRE_CUDA "cuda>=9.1"
9.1-runtime-centos7: the runtime image builds on base
ARG repository
FROM ${repository}:9.1-base-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
9.1-devel-centos7: the devel image builds on runtime
ARG repository
FROM ${repository}:9.1-runtime-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-dev-$CUDA_PKG_VERSION \
cuda-nvml-dev-$CUDA_PKG_VERSION \
cuda-minimal-build-$CUDA_PKG_VERSION \
cuda-command-line-tools-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs:${LIBRARY_PATH}
Two ways to use the environment variables:
1: Write them in the Dockerfile, as above.
2: Pass them at run time:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
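NVIDIA_VISIBLE_DEVICES can likewise be passed with -e to limit which GPUs the container sees; it accepts all, none, or a comma-separated list of GPU indices or UUIDs:

```shell
# Only GPUs 0 and 1 will be visible inside this container.
docker run --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0,1 \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
```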
https://github.com/nvidia/nvidia-container-runtime#nvidia_driver_capabilities
cuda.repo, downloaded from the web:
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
ldconfig -p | grep libnvidia
ldconfig -p | grep libcuda
find / -name "libnvidia-encode*"
/usr/lib64/nvidia/libnvidia-encode.so.390.30
ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30  # find the missing libraries
find / -name "libcuda*"
/usr/lib64/libcuda.so
ldd /usr/lib64/libcuda.so
The libnvidia-encode library requires the video capability to be included in NVIDIA_DRIVER_CAPABILITIES.
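The ldd checks above can be filtered down to the unresolved dependencies only (the .so path is the one found on this machine and will differ per driver version):

```shell
# Print only the library names that the dynamic linker cannot resolve.
ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30 | awk '/not found/ {print $1}'
```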
What on earth is nvidia-docker2?
https://github.com/NVIDIA/nvidia-docker/issues/633
Installing the nvidia-container-runtime for 17.03.2 should be just fine. Then you don't need the nvidia-docker2 package.
It's a very simple package that does two things: 1) provide a compatibility script called nvidia-docker; 2) register the new runtime to the docker daemon.
You can register the new runtime yourself: https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup
1: Provides a compatibility script called nvidia-docker
2: Registers the runtime with the docker daemon
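The linked README also documents registering the runtime without daemon.json, via a systemd drop-in for the docker service (remember: use only one registration method at a time). A sketch:

```shell
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
```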
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
The early approach to using GPUs with Docker:
One of the early work-arounds to this problem was to fully install the NVIDIA drivers inside the container and map in the character devices corresponding to the NVIDIA GPUs (e.g. /dev/nvidia0) on launch.
This solution is brittle because the version of the host driver must exactly match the version of the driver installed in the container.
This requirement drastically reduced the portability of these early containers, undermining one of Docker's more important features.
To enable portability in Docker images that leverage NVIDIA GPUs, we developed nvidia-docker, an open-source project hosted on GitHub that provides the two critical components needed for portable GPU-based containers:
- driver-agnostic CUDA images; and
- a Docker command line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
nvidia-docker is essentially a wrapper around the docker command that transparently provisions a container with the necessary components to execute code on the GPU.
If you need CUDA 6.5 or 7.0, you can specify a tag for the image. A list of available CUDA images for Ubuntu and CentOS can be found on the nvidia-docker wiki.
nvidia-docker run --rm -ti nvidia/cuda:7.0 nvcc --version
https://blog.csdn.net/a632189007/article/details/78801166
nvidia-docker-plugin is a Docker plugin that makes it easy to deploy containers in GPU-equipped environments. It runs as a daemon that discovers the host driver files and GPU devices, and mounts them into containers in response to requests from the Docker daemon, thereby enabling GPU use in Docker.
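In the v1 design this plugin also exposed a small REST API on the host (port 3476, if I recall correctly) that can print the extra docker CLI arguments (devices plus the driver volume) it would inject, which is a handy way to see what nvidia-docker v1 actually does:

```shell
# nvidia-docker v1 only; requires the plugin daemon to be running on the host.
curl -s http://localhost:3476/docker/cli
```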