https://blog.gtwang.org/virtualization/centos-linux-install-nvidia-docker-gpu-container-tutorial/
https://www.cnblogs.com/yxfangcs/p/8438462.html
https://hub.docker.com/r/nvidia/cuda/
https://cloud.google.com/compute/docs/gpus/add-gpus
https://kairen.github.io/2018/02/17/container/docker-nvidia-install/
This post covers how to use NVIDIA Docker v2 to give containers access to GPUs.
With NVIDIA Docker v1 you had to run GPU images through the nvidia-docker wrapper instead of docker, or manually mount the NVIDIA driver and CUDA libraries so that Docker could build and run GPU application images.
With newer Docker releases, you can instead select the NVIDIA Docker v2 runtime via the --runtime flag to run GPU applications.
If your Docker is a recent version, you may not need to install nvidia-docker at all to use GPUs with Docker; see the details later in this post.
nvidia-docker is a plugin.
NVIDIA driver and CUDA library
1: Install the driver and CUDA
[root@v5]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
[root@v5]#
[root@v5]#
[root@v5]# cat /usr/local/cuda/version.txt
CUDA Version 9.1.85
[root@~]# nvidia-smi
Sat Apr 28 14:21:36 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
https://github.com/moby/moby/issues/35906
yum install http://mirror.centos.org/centos/7/os/x86_64/Packages/libseccomp-2.3.1-3.el7.x86_64.rpm
docker/nvidia-container-runtime depends on a fairly new libseccomp; the version shipped with the system is too old.
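Before installing, it can help to check which libseccomp is currently installed. A minimal sketch, with the 2.3 threshold taken from the libseccomp-2.3.1 rpm above; `sort -V` performs the version comparison:

```shell
# Check whether the installed libseccomp is at least 2.3 (threshold taken
# from the libseccomp-2.3.1 rpm above). Falls back to 0 if rpm is unavailable
# or the package is missing.
ver=$(rpm -q --qf '%{VERSION}' libseccomp 2>/dev/null) || ver=0
if [ "$(printf '%s\n2.3\n' "$ver" | sort -V | head -n1)" = "2.3" ]; then
  echo "libseccomp $ver is new enough"
else
  echo "libseccomp $ver is too old; install >= 2.3"
fi
```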
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions
2: To use GPUs with Docker, install nvidia-container-runtime and register the nvidia runtime with the Docker daemon.
Registration steps:
- Install the repository for your distribution by following the instructions here.
- Install the nvidia-container-runtime package:
  sudo yum install nvidia-container-runtime
3: Register the nvidia runtime with the Docker daemon:
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo pkill -SIGHUP dockerd
The several methods described at https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup are independent of one another; pick one and do not combine them, or errors will occur. Here we register the nvidia runtime via daemon.json.
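After reloading the daemon, you can confirm the registration took effect. A small sketch that greps the `Runtimes:` line of `docker info` (the exact output layout can vary across engine versions):

```shell
# Confirm the nvidia runtime is visible to the daemon. Prints a hint if
# Docker is not running or the runtime was not registered.
if docker info 2>/dev/null | grep -i '^ *Runtimes:' | grep -q nvidia; then
  echo "nvidia runtime registered"
else
  echo "nvidia runtime missing; re-check /etc/docker/daemon.json and reload"
fi
```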
If your Docker is not the latest version, you need to use nvidia-docker together with a matching nvidia-container-runtime version, as follows:
How do I install 2.0 if I'm not using the latest Docker version?
You must pin the versions of both nvidia-docker2 and nvidia-container-runtime when installing, for instance:
sudo apt-get install -y nvidia-docker2=2.0.2+docker1.12.6-1 nvidia-container-runtime=1.1.1+docker1.12.6-1
Use apt-cache madison nvidia-docker2 nvidia-container-runtime or yum search --showduplicates nvidia-docker2 nvidia-container-runtime to list the available versions.
The minimum Docker version supported by nvidia-docker is Docker 1.12.
[root@~]# docker version
Client:
 Version:         17.05.0-ce
 API version:     1.29
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64

Server:
 Version:         17.05.0-ce
 API version:     1.29 (minimum version 1.12)
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64
 Experimental:    false
Version matching matters; with mismatched versions you will hit the error "flag provided but not defined: -console".
The nvidia-container-runtime version installed by default above did not work here; install the matching one, nvidia-container-runtime-2.0.0-1.docker17.03.2.x86_64.
With that in place, the long-awaited output finally appeared:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
Sat Apr 28 07:18:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Success.
https://github.com/NVIDIA/nvidia-docker/issues/531#issuecomment-343993909
No, please don't install the driver/cuda inside the container :). The image won't be portable to other machines. Do not install the driver in the image; that approach is not portable.
With 2.0, we now use environment variables to list the driver libraries that must be mounted inside the container at runtime.
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
In your case, you will need the following:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES video,compute,utility
Also note that starting from CUDA 9.0, we have a new tag: nvidia/cuda:9.0-base. It will set up our repositories and set the environment variables (but you will be missing the video capability for ffmpeg). It avoids having a base image with ALL the CUDA libraries.
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
https://hub.docker.com/r/nvidia/cuda/
The Dockerfiles at these two links can be used to build the docker CUDA images; the environment variables inside them are the ones defined by nvidia-container-runtime at https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup.
Building one successfully by hand is not easy; https://www.cnblogs.com/yxfangcs/p/8438462.html can serve as a partial reference.
The best approach is still to find the version you want at https://hub.docker.com/r/nvidia/cuda/, e.g. 9.1-runtime-centos7,
and then:
docker pull nvidia/cuda:9.1-runtime-centos7
Tags: what the different tag flavors mean
CUDA images come in three flavors:
- base: starting from CUDA 9.0, contains the bare minimum (libcudart) to deploy a pre-built CUDA application. Use this image if you want to manually select which CUDA packages you want to install.
- runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application using multiple CUDA libraries.
- devel: extends the runtime image by adding the compiler toolchain, the debugging tools, the headers and the static libraries. Use this image to compile a CUDA application from sources.
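The three flavors can be summarized as a tiny helper that maps a use case to a tag. This is purely illustrative (the `pick_flavor` function is ours); the flavor names are the official ones from above:

```shell
# Map a use case to the official CUDA image flavor (pick_flavor is a
# hypothetical helper; only the flavor names come from the docs).
pick_flavor() {
  case "$1" in
    deploy)  echo base ;;     # pre-built app, only needs libcudart
    run)     echo runtime ;;  # pre-built app using several CUDA libraries
    compile) echo devel ;;    # building from source, needs the toolchain
    *)       echo unknown ;;
  esac
}
pick_flavor compile   # prints "devel", i.e. pull nvidia/cuda:9.1-devel-centos7
```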
For example, 9.1-base-centos7:
FROM centos:7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"

RUN NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

# download cuda.repo locally first (its contents are shown further below)
COPY cuda.repo /etc/yum.repos.d/cuda.repo

# the CUDA version to install
ENV CUDA_VERSION 9.1.85
ENV CUDA_PKG_VERSION 9-1-$CUDA_VERSION-1

# install cuda-9.1 and symlink it to /usr/local/cuda
RUN yum install -y \
    cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-9.1 /usr/local/cuda && \
    rm -rf /var/cache/yum/*

# nvidia-docker 1.0
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# Another oddity: if you register the runtime by hand and start the container with docker run --runtime=nvidia, the container has no /usr/local/nvidia directory; containers started via nvidia-docker or a mesos-executor do have that directory.
# We hit a pitfall at my company: we launch containers through Mesos containers, and environment variables set this way did not make it into the container; running ldconfig in the Dockerfile did not help either. You have to run ldconfig inside the startup script invoked by CMD for it to take effect, which diverges considerably from standard Docker practice!
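The Mesos workaround above can be sketched as an entrypoint script that runs ldconfig before handing off to the application. The file name and Dockerfile wiring are illustrative, not taken from the original setup:

```shell
# entrypoint.sh (illustrative): refresh the linker cache at container start,
# after the runtime has mounted the driver libraries, then exec the real app.
cat > entrypoint.sh <<'EOF'
#!/bin/sh
ldconfig        # rebuild the ld.so cache so the mounted driver libs resolve
exec "$@"       # replace the shell with the actual command
EOF
chmod +x entrypoint.sh
# In the Dockerfile:  ENTRYPOINT ["/entrypoint.sh"]  then  CMD ["your-app"]
```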
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
# adjust the capability list to your needs
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,video
ENV NVIDIA_REQUIRE_CUDA "cuda>=9.1"
9.1-runtime-centos7 (runtime builds on base):
ARG repository
FROM ${repository}:9.1-base-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
9.1-devel-centos7 (devel builds on runtime):
ARG repository
FROM ${repository}:9.1-runtime-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-dev-$CUDA_PKG_VERSION \
cuda-nvml-dev-$CUDA_PKG_VERSION \
cuda-minimal-build-$CUDA_PKG_VERSION \
cuda-command-line-tools-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs:${LIBRARY_PATH}
Two ways to set these environment variables:
1: in the Dockerfile, as above
2: at run time:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
https://github.com/nvidia/nvidia-container-runtime#nvidia_driver_capabilities
cuda.repo, downloaded from the web:
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
ldconfig -p | grep libnvidia
ldconfig -p | grep libcuda
find / -name "libnvidia-encode*"
/usr/lib64/nvidia/libnvidia-encode.so.390.30
ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30   # find any missing libraries
find / -name "libcuda*"
/usr/lib64/libcuda.so
ldd /usr/lib64/libcuda.so
The libnvidia-encode library requires the video capability to be included in NVIDIA_DRIVER_CAPABILITIES.
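The ldd checks above can be wrapped in a small filter that prints only the unresolved dependencies (the `missing_libs` helper name is ours):

```shell
# missing_libs: read ldd output on stdin and print only the libraries the
# dynamic loader could not resolve ("not found" entries).
missing_libs() { awk '/not found/ {print $1}'; }
# usage, with the encode library located above:
#   ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30 | missing_libs
```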
What exactly is nvidia-docker2?
https://github.com/NVIDIA/nvidia-docker/issues/633
Installing the nvidia-container-runtime for 17.03.2 should be just fine. Then you don't need the nvidia-docker2 package.
It's a very simple package that does two things: 1) provide a compatibility script called nvidia-docker
2) Register the new runtime to the docker daemon.
You can register the new runtime yourself: https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup
1: provides a convenience compatibility script called nvidia-docker
2: registers the runtime with the Docker daemon
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
How Docker used GPUs in the early days:
One of the early work-arounds to this problem was to fully install the NVIDIA drivers inside the container and map in the character devices corresponding to the NVIDIA GPUs (e.g. /dev/nvidia0) on launch.
This solution is brittle because the version of the host driver must exactly match the version of the driver installed in the container.
This requirement drastically reduced the portability of these early containers, undermining one of Docker's more important features.
To enable portability in Docker images that leverage NVIDIA GPUs, we developed nvidia-docker, an open-source project hosted on GitHub that provides the two critical components needed for portable GPU-based containers:
- driver-agnostic CUDA images; and
- a Docker command line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
nvidia-docker is essentially a wrapper around the docker command that transparently provisions a container with the necessary components to execute code on the GPU.
If you need CUDA 6.5 or 7.0, you can specify a tag for the image. A list of available CUDA images for Ubuntu and CentOS can be found on the nvidia-docker wiki.
nvidia-docker run --rm -ti nvidia/cuda:7.0 nvcc --version
https://blog.csdn.net/a632189007/article/details/78801166
nvidia-docker-plugin is a Docker plugin meant to make it easy to deploy containers into GPU-equipped environments. It runs as a daemon-like process that discovers the host's driver files and GPU devices and mounts them into requests coming from the Docker daemon, which is how Docker GPU support is provided.
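What the plugin enabled can be sketched as the flags nvidia-docker v1 assembled on your behalf: a driver volume served by nvidia-docker-plugin plus the GPU character devices. The `gpu_flags` helper is ours, and the driver version 387.26 matches this host; treat this as an illustration of the mechanism, not exact v1 behavior:

```shell
# Build the docker run flags that nvidia-docker v1 added transparently:
# the driver volume from nvidia-docker-plugin plus the GPU character devices.
gpu_flags() {
  flags="--volume-driver=nvidia-docker"
  flags="$flags --volume=nvidia_driver_387.26:/usr/local/nvidia:ro"
  flags="$flags --device=/dev/nvidiactl --device=/dev/nvidia-uvm"
  for dev in /dev/nvidia[0-9]*; do           # one entry per GPU, if present
    [ -e "$dev" ] && flags="$flags --device=$dev"
  done
  echo "$flags"
}
# docker run $(gpu_flags) nvidia/cuda:9.1-runtime-centos7 nvidia-smi
```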