https://blog.gtwang.org/virtualization/centos-linux-install-nvidia-docker-gpu-container-tutorial/
https://www.cnblogs.com/yxfangcs/p/8438462.html
https://hub.docker.com/r/nvidia/cuda/
https://cloud.google.com/compute/docs/gpus/add-gpus
https://kairen.github.io/2018/02/17/container/docker-nvidia-install/
This post covers how to use NVIDIA Docker v2 to give containers access to GPUs.
With NVIDIA Docker v1 you had to run GPU images through the nvidia-docker wrapper instead of docker, or manually mount the NVIDIA driver and CUDA libraries so that Docker could build and run GPU application images.
With newer Docker releases, you can instead select the NVIDIA Docker v2 runtime via the --runtime flag to run GPU applications.
If your Docker is a recent version, you may not need to install nvidia-docker at all to use GPUs with Docker; see the details later in this post.
nvidia-docker is a plugin.
NVIDIA driver and CUDA library
1: Install the driver and CUDA
[root@v5]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
[root@v5]#
[root@v5]#
[root@v5]# cat /usr/local/cuda/version.txt
CUDA Version 9.1.85
[root@~]# nvidia-smi
Sat Apr 28 14:21:36 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
https://github.com/moby/moby/issues/35906
yum install http://mirror.centos.org/centos/7/os/x86_64/Packages/libseccomp-2.3.1-3.el7.x86_64.rpm
docker/nvidia-container-runtime depends on a fairly new libseccomp; the version shipped with the system is too old.
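Before installing, it can help to check which libseccomp is currently installed. A minimal sketch, with the 2.3 threshold taken from the libseccomp-2.3.1 rpm above; `sort -V` performs the version comparison:

```shell
# Check whether the installed libseccomp is at least 2.3 (threshold taken
# from the libseccomp-2.3.1 rpm above). Falls back to 0 if rpm is unavailable
# or the package is missing.
ver=$(rpm -q --qf '%{VERSION}' libseccomp 2>/dev/null) || ver=0
if [ "$(printf '%s\n2.3\n' "$ver" | sort -V | head -n1)" = "2.3" ]; then
  echo "libseccomp $ver is new enough"
else
  echo "libseccomp $ver is too old; install >= 2.3"
fi
```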
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions
2: To use GPUs with Docker, install nvidia-container-runtime and register the nvidia runtime with the Docker daemon.
Registration steps:
- Install the repository for your distribution by following the instructions here.
- Install the nvidia-container-runtime package:
  sudo yum install nvidia-container-runtime
3: Register the nvidia runtime with the Docker daemon:
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo pkill -SIGHUP dockerd
The several methods described at https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup are independent of one another; pick one and do not combine them, or errors will occur. Here we register the nvidia runtime via daemon.json.
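After reloading the daemon, you can confirm the registration took effect. A small sketch that greps the `Runtimes:` line of `docker info` (the exact output layout can vary across engine versions):

```shell
# Confirm the nvidia runtime is visible to the daemon. Prints a hint if
# Docker is not running or the runtime was not registered.
if docker info 2>/dev/null | grep -i '^ *Runtimes:' | grep -q nvidia; then
  echo "nvidia runtime registered"
else
  echo "nvidia runtime missing; re-check /etc/docker/daemon.json and reload"
fi
```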
If your Docker is not the latest version, you need to use nvidia-docker together with a matching nvidia-container-runtime version, as follows:
How do I install 2.0 if I'm not using the latest Docker version?
You must pin the versions of both nvidia-docker2 and nvidia-container-runtime when installing, for instance:
sudo apt-get install -y nvidia-docker2=2.0.2+docker1.12.6-1 nvidia-container-runtime=1.1.1+docker1.12.6-1
Use apt-cache madison nvidia-docker2 nvidia-container-runtime or yum search --showduplicates nvidia-docker2 nvidia-container-runtime to list the available versions.
The minimum Docker version supported by nvidia-docker is Docker 1.12.
[root@~]# docker version
Client:
 Version:         17.05.0-ce
 API version:     1.29
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64

Server:
 Version:         17.05.0-ce
 API version:     1.29 (minimum version 1.12)
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64
 Experimental:    false
Version matching matters; with mismatched versions you will hit the error "flag provided but not defined: -console".
The nvidia-container-runtime version installed by default above did not work here; install the matching one, nvidia-container-runtime-2.0.0-1.docker17.03.2.x86_64.
With that in place, the long-awaited output finally appeared:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
Sat Apr 28 07:18:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Success.
https://github.com/NVIDIA/nvidia-docker/issues/531#issuecomment-343993909
No, please don't install the driver/cuda inside the container :). The image won't be portable to other machines. Do not install the driver in the image; that approach is not portable.
With 2.0, we now use environment variables to list the driver libraries that must be mounted inside the container at runtime.
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
In your case, you will need the following:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES video,compute,utility
Also note that starting from CUDA 9.0, we have a new tag: nvidia/cuda:9.0-base. It will set up our repositories and set the environment variables (but you will be missing the video capability for ffmpeg). It avoids having a base image with ALL the CUDA libraries.
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
https://hub.docker.com/r/nvidia/cuda/
The Dockerfiles at these two links can be used to build the docker CUDA images; the environment variables inside them are the ones defined by nvidia-container-runtime at https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup.
Building one successfully by hand is not easy; https://www.cnblogs.com/yxfangcs/p/8438462.html can serve as a partial reference.
The best approach is still to find the version you want at https://hub.docker.com/r/nvidia/cuda/, e.g. 9.1-runtime-centos7,
and then:
docker pull nvidia/cuda:9.1-runtime-centos7
Tags: what the different tag flavors mean
CUDA images come in three flavors:
- base: starting from CUDA 9.0, contains the bare minimum (libcudart) to deploy a pre-built CUDA application. Use this image if you want to manually select which CUDA packages you want to install.
- runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application using multiple CUDA libraries.
- devel: extends the runtime image by adding the compiler toolchain, the debugging tools, the headers and the static libraries. Use this image to compile a CUDA application from sources.
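The three flavors can be summarized as a tiny helper that maps a use case to a tag. This is purely illustrative (the `pick_flavor` function is ours); the flavor names are the official ones from above:

```shell
# Map a use case to the official CUDA image flavor (pick_flavor is a
# hypothetical helper; only the flavor names come from the docs).
pick_flavor() {
  case "$1" in
    deploy)  echo base ;;     # pre-built app, only needs libcudart
    run)     echo runtime ;;  # pre-built app using several CUDA libraries
    compile) echo devel ;;    # building from source, needs the toolchain
    *)       echo unknown ;;
  esac
}
pick_flavor compile   # prints "devel", i.e. pull nvidia/cuda:9.1-devel-centos7
```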
For example, 9.1-base-centos7:
FROM centos:7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"

RUN NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

# download cuda.repo locally first (its contents are shown further below)
COPY cuda.repo /etc/yum.repos.d/cuda.repo

# the CUDA version to install
ENV CUDA_VERSION 9.1.85
ENV CUDA_PKG_VERSION 9-1-$CUDA_VERSION-1

# install cuda-9.1 and symlink it to /usr/local/cuda
RUN yum install -y \
    cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-9.1 /usr/local/cuda && \
    rm -rf /var/cache/yum/*

# nvidia-docker 1.0
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# Another oddity: if you register the runtime by hand and start the container with docker run --runtime=nvidia, the container has no /usr/local/nvidia directory; containers started via nvidia-docker or a mesos-executor do have that directory.
# We hit a pitfall at my company: we launch containers through Mesos containers, and environment variables set this way did not make it into the container; running ldconfig in the Dockerfile did not help either. You have to run ldconfig inside the startup script invoked by CMD for it to take effect, which diverges considerably from standard Docker practice!
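The Mesos workaround above can be sketched as an entrypoint script that runs ldconfig before handing off to the application. The file name and Dockerfile wiring are illustrative, not taken from the original setup:

```shell
# entrypoint.sh (illustrative): refresh the linker cache at container start,
# after the runtime has mounted the driver libraries, then exec the real app.
cat > entrypoint.sh <<'EOF'
#!/bin/sh
ldconfig        # rebuild the ld.so cache so the mounted driver libs resolve
exec "$@"       # replace the shell with the actual command
EOF
chmod +x entrypoint.sh
# In the Dockerfile:  ENTRYPOINT ["/entrypoint.sh"]  then  CMD ["your-app"]
```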
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
# adjust the capability list to your needs
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,video
ENV NVIDIA_REQUIRE_CUDA "cuda>=9.1"
9.1-runtime-centos7 (runtime builds on base):
ARG repository
FROM ${repository}:9.1-base-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
9.1-devel-centos7 (devel builds on runtime):
ARG repository
FROM ${repository}:9.1-runtime-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-dev-$CUDA_PKG_VERSION \
cuda-nvml-dev-$CUDA_PKG_VERSION \
cuda-minimal-build-$CUDA_PKG_VERSION \
cuda-command-line-tools-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs:${LIBRARY_PATH}
Two ways to set these environment variables:
1: in the Dockerfile, as above
2: at run time:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
https://github.com/nvidia/nvidia-container-runtime#nvidia_driver_capabilities
cuda.repo, downloaded from the web:
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
ldconfig -p | grep libnvidia
ldconfig -p | grep libcuda
find / -name "libnvidia-encode*"
/usr/lib64/nvidia/libnvidia-encode.so.390.30
ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30   # find any missing libraries
find / -name "libcuda*"
/usr/lib64/libcuda.so
ldd /usr/lib64/libcuda.so
The libnvidia-encode library requires the video capability to be included in NVIDIA_DRIVER_CAPABILITIES.
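The ldd checks above can be wrapped in a small filter that prints only the unresolved dependencies (the `missing_libs` helper name is ours):

```shell
# missing_libs: read ldd output on stdin and print only the libraries the
# dynamic loader could not resolve ("not found" entries).
missing_libs() { awk '/not found/ {print $1}'; }
# usage, with the encode library located above:
#   ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30 | missing_libs
```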
What exactly is nvidia-docker2?
https://github.com/NVIDIA/nvidia-docker/issues/633
Installing the nvidia-container-runtime for 17.03.2 should be just fine. Then you don't need the nvidia-docker2 package.
It's a very simple package that does two things: 1) provide a compatibility script called nvidia-docker
2) Register the new runtime to the docker daemon.
You can register the new runtime yourself: https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup
1: provides a convenience compatibility script called nvidia-docker
2: registers the runtime with the Docker daemon
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
How Docker used GPUs in the early days:
One of the early work-arounds to this problem was to fully install the NVIDIA drivers inside the container and map in the character devices corresponding to the NVIDIA GPUs (e.g. /dev/nvidia0) on launch.
This solution is brittle because the version of the host driver must exactly match the version of the driver installed in the container.
This requirement drastically reduced the portability of these early containers, undermining one of Docker's more important features.
To enable portability in Docker images that leverage NVIDIA GPUs, we developed nvidia-docker, an open-source project hosted on GitHub that provides the two critical components needed for portable GPU-based containers:
- driver-agnostic CUDA images; and
- a Docker command line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
nvidia-docker is essentially a wrapper around the docker command that transparently provisions a container with the necessary components to execute code on the GPU.
If you need CUDA 6.5 or 7.0, you can specify a tag for the image. A list of available CUDA images for Ubuntu and CentOS can be found on the nvidia-docker wiki.
nvidia-docker run --rm -ti nvidia/cuda:7.0 nvcc --version
https://blog.csdn.net/a632189007/article/details/78801166
nvidia-docker-plugin is a Docker plugin meant to make it easy to deploy containers into GPU-equipped environments. It runs as a daemon-like process that discovers the host's driver files and GPU devices and mounts them into requests coming from the Docker daemon, which is how Docker GPU support is provided.
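What the plugin enabled can be sketched as the flags nvidia-docker v1 assembled on your behalf: a driver volume served by nvidia-docker-plugin plus the GPU character devices. The `gpu_flags` helper is ours, and the driver version 387.26 matches this host; treat this as an illustration of the mechanism, not exact v1 behavior:

```shell
# Build the docker run flags that nvidia-docker v1 added transparently:
# the driver volume from nvidia-docker-plugin plus the GPU character devices.
gpu_flags() {
  flags="--volume-driver=nvidia-docker"
  flags="$flags --volume=nvidia_driver_387.26:/usr/local/nvidia:ro"
  flags="$flags --device=/dev/nvidiactl --device=/dev/nvidia-uvm"
  for dev in /dev/nvidia[0-9]*; do           # one entry per GPU, if present
    [ -e "$dev" ] && flags="$flags --device=$dev"
  done
  echo "$flags"
}
# docker run $(gpu_flags) nvidia/cuda:9.1-runtime-centos7 nvidia-smi
```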