https://blog.gtwang.org/virtualization/centos-linux-install-nvidia-docker-gpu-container-tutorial/
https://www.cnblogs.com/yxfangcs/p/8438462.html
https://hub.docker.com/r/nvidia/cuda/
https://cloud.google.com/compute/docs/gpus/add-gpus
https://kairen.github.io/2018/02/17/container/docker-nvidia-install/
This post mainly covers how to use NVIDIA Docker v2 to give containers access to GPUs.
In the past, NVIDIA Docker v1 required replacing docker with nvidia-docker to run GPU images, or manually mounting the NVIDIA driver and CUDA libraries so that Docker could build and run GPU application images.
Newer Docker versions can instead select the NVIDIA Docker v2 runtime via --runtime to run GPU applications.
If your Docker is recent enough, you may not need to install nvidia-docker at all to use GPUs with Docker; see the discussion below.
nvidia-docker is a plugin.
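As an aside for newer setups (not the Docker 17.05 host used in the rest of this post): starting with Docker 19.03 and the nvidia-container-toolkit package, the daemon understands a --gpus flag directly, so neither the nvidia-docker wrapper nor the --runtime switch is needed:

```shell
# Assumes Docker >= 19.03 with nvidia-container-toolkit installed.
docker run --gpus all --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi

# Expose only specific GPUs to the container:
docker run --gpus '"device=0,1"' --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
```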
NVIDIA driver and CUDA library
1: Install the driver and CUDA
[root@v5]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
[root@v5]#
[root@v5]#
[root@v5]# cat /usr/local/cuda/version.txt
CUDA Version 9.1.85
[root@~]# nvidia-smi
Sat Apr 28 14:21:36 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
https://github.com/moby/moby/issues/35906
yum install http://mirror.centos.org/centos/7/os/x86_64/Packages/libseccomp-2.3.1-3.el7.x86_64.rpm
docker/nvidia-container-runtime depends on a relatively new libseccomp; the version shipped with the system is too old.
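A quick sketch for checking the installed libseccomp before installing nvidia-container-runtime (version_ge is a small helper defined here, not a system command; 2.3.1 matches the RPM installed above):

```shell
# version_ge succeeds when $1 >= $2 in dotted-version order (GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

installed=$(rpm -q --qf '%{VERSION}' libseccomp 2>/dev/null || echo 0)
if version_ge "$installed" 2.3.1; then
  echo "libseccomp $installed is new enough"
else
  echo "libseccomp $installed is too old; install >= 2.3.1 first"
fi
```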
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions
2: To use GPUs with Docker, install nvidia-container-runtime and register the nvidia runtime with the Docker daemon.
How to register:
- Install the repository for your distribution by following the instructions here.
- Install the nvidia-container-runtime package:
sudo yum install nvidia-container-runtime
3: Register the nvidia runtime:
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo pkill -SIGHUP dockerd
The methods described at https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup are each independent; do not apply more than one, or errors will occur. Here we register the nvidia runtime via daemon.json.
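After writing daemon.json and sending SIGHUP, a quick sanity check that the daemon picked up the new runtime:

```shell
# The Runtimes line should list both the default runc and the new nvidia runtime.
docker info 2>/dev/null | grep -i runtime
# Then select it explicitly with: docker run --runtime=nvidia ...
```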
If your Docker is not the latest version, you need to use nvidia-docker together with a matching nvidia-container-runtime version, as follows:
How do I install 2.0 if I'm not using the latest Docker version?
You must pin the versions of both nvidia-docker2 and nvidia-container-runtime when installing, for instance:
sudo apt-get install -y nvidia-docker2=2.0.2+docker1.12.6-1 nvidia-container-runtime=1.1.1+docker1.12.6-1
Use apt-cache madison nvidia-docker2 nvidia-container-runtime or yum search --showduplicates nvidia-docker2 nvidia-container-runtime to list the available versions.
The minimum Docker version supported by nvidia-docker is Docker 1.12.
[root@~]# docker version
Client:
 Version:         17.05.0-ce
 API version:     1.29
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64

Server:
 Version:         17.05.0-ce
 API version:     1.29 (minimum version 1.12)
 Package version: docker-17.05.0-1001.el7.centos.x86_64
 Go version:      go1.8.3
 Git commit:      e1bfc47
 Built:           Fri Mar 23 13:44:53 2018
 OS/Arch:         linux/amd64
 Experimental:    false
Version matching is important; otherwise you will hit the error flag provided but not defined: -console.
Install matching versions: the nvidia-container-runtime version installed by default above did not work, but nvidia-container-runtime-2.0.0-1.docker17.03.2.x86_64 did.
Finally, the long-awaited output appeared:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
Sat Apr 28 07:18:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           Off  | 00000000:05:00.0 Off |                  N/A |
| N/A   35C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           Off  | 00000000:06:00.0 Off |                  N/A |
| N/A   32C    P0    16W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           Off  | 00000000:07:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           Off  | 00000000:08:00.0 Off |                  N/A |
| N/A   30C    P0    15W /  53W |      0MiB /  8127MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Success.
https://github.com/NVIDIA/nvidia-docker/issues/531#issuecomment-343993909
No, please don't install the driver/CUDA inside the container :). The image won't be portable to other machines.
With 2.0, we now use environment variables to list the driver libraries that must be mounted inside the container at runtime:
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
In your case, you will need the following:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES video,compute,utility
Also note that starting from CUDA 9.0, we have a new tag: nvidia/cuda:9.0-base. It will set up our repositories and set the environment variables (but you will be missing the video capability for ffmpeg). It avoids having a base image with ALL the CUDA libraries.
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile#L30-33
https://hub.docker.com/r/nvidia/cuda/
The Dockerfiles at these two links can be used to build the docker CUDA images; the environment variables in them are exactly the ones defined by nvidia-container-runtime (https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup).
Building one yourself from scratch is not easy; https://www.cnblogs.com/yxfangcs/p/8438462.html may help in part.
The best approach is simply to find the version you want at https://hub.docker.com/r/nvidia/cuda/, e.g. 9.1-runtime-centos7,
then:
docker pull nvidia/cuda:9.1-runtime-centos7
Tags: what the different tag variants mean
CUDA images come in three flavors:
- base: starting from CUDA 9.0, contains the bare minimum (libcudart) to deploy a pre-built CUDA application. Use this image if you want to manually select which CUDA packages you want to install.
- runtime: extends the base image by adding all the shared libraries from the CUDA toolkit. Use this image if you have a pre-built application using multiple CUDA libraries.
- devel: extends the runtime image by adding the compiler toolchain, the debugging tools, the headers and the static libraries. Use this image to compile a CUDA application from sources.
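These flavors combine naturally with a multi-stage build (supported since Docker 17.05, which is exactly the version on this host): compile against devel, then ship the binary on the much smaller runtime image. A sketch, where vectoradd.cu stands in for your own source file:

```dockerfile
# Build stage: full compiler toolchain.
FROM nvidia/cuda:9.1-devel-centos7 AS build
COPY vectoradd.cu /src/
RUN nvcc -o /src/vectoradd /src/vectoradd.cu

# Deploy stage: shared libraries only.
FROM nvidia/cuda:9.1-runtime-centos7
COPY --from=build /src/vectoradd /usr/local/bin/vectoradd
CMD ["vectoradd"]
```

Run the resulting image with docker run --runtime=nvidia as shown elsewhere in this post.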
For example, 9.1-base-centos7:
FROM centos:7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"

RUN NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

# Download cuda.repo locally beforehand
COPY cuda.repo /etc/yum.repos.d/cuda.repo

# Pin the version to install
ENV CUDA_VERSION 9.1.85
ENV CUDA_PKG_VERSION 9-1-$CUDA_VERSION-1

# Install cuda-9.1 and symlink it as /usr/local/cuda
RUN yum install -y \
        cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-9.1 /usr/local/cuda && \
    rm -rf /var/cache/yum/*

# nvidia-docker 1.0
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# One quirk here: if you register the runtime manually and start containers with docker run --runtime=nvidia, the container has no /usr/local/nvidia directory; containers started via nvidia-docker or mesos-executor do have it.
# A pitfall we hit at my company: we launch containers via Mesos containers, and environment variables set this way did not propagate into the container; running ldconfig in the Dockerfile did not help either. You have to run ldconfig in the CMD startup script for it to take effect. This deviates considerably from standard Docker practice.
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility,video  # adjust to your needs
ENV NVIDIA_REQUIRE_CUDA "cuda>=9.1"
9.1-runtime-centos7: the runtime image builds on base
ARG repository
FROM ${repository}:9.1-base-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
9.1-devel-centos7: the devel image builds on runtime
ARG repository
FROM ${repository}:9.1-runtime-centos7
LABEL maintainer "NVIDIA CORPORATION <cudatools@nvidia.com>"
RUN yum install -y \
cuda-libraries-dev-$CUDA_PKG_VERSION \
cuda-nvml-dev-$CUDA_PKG_VERSION \
cuda-minimal-build-$CUDA_PKG_VERSION \
cuda-command-line-tools-$CUDA_PKG_VERSION && \
rm -rf /var/cache/yum/*
ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs:${LIBRARY_PATH}
Two ways to use the environment variables:
1: Write them in the Dockerfile, as above.
2: Pass them at run time:
docker run --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
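NVIDIA_VISIBLE_DEVICES can likewise be passed with -e to limit which GPUs the container sees; it accepts all, none, or a comma-separated list of GPU indices or UUIDs:

```shell
# Only GPUs 0 and 1 will be visible inside this container.
docker run --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0,1 \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  --rm nvidia/cuda:9.1-runtime-centos7 nvidia-smi
```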
https://github.com/nvidia/nvidia-container-runtime#nvidia_driver_capabilities
cuda.repo, downloaded from the web:
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
ldconfig -p | grep libnvidia
ldconfig -p | grep libcuda
find / -name "libnvidia-encode*"
/usr/lib64/nvidia/libnvidia-encode.so.390.30
ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30  # find the missing libraries
find / -name "libcuda*"
/usr/lib64/libcuda.so
ldd /usr/lib64/libcuda.so
The libnvidia-encode library requires the video capability to be included in NVIDIA_DRIVER_CAPABILITIES.
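The ldd checks above can be filtered down to the unresolved dependencies only (the .so path is the one found on this machine and will differ per driver version):

```shell
# Print only the library names that the dynamic linker cannot resolve.
ldd /usr/lib64/nvidia/libnvidia-encode.so.390.30 | awk '/not found/ {print $1}'
```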
What on earth is nvidia-docker2?
https://github.com/NVIDIA/nvidia-docker/issues/633
Installing the nvidia-container-runtime for 17.03.2 should be just fine. Then you don't need the nvidia-docker2 package.
It's a very simple package that does two things: 1) provide a compatibility script called nvidia-docker; 2) register the new runtime to the docker daemon.
You can register the new runtime yourself: https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup
1: Provides a compatibility script called nvidia-docker
2: Registers the runtime with the docker daemon
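The linked README also documents registering the runtime without daemon.json, via a systemd drop-in for the docker service (remember: use only one registration method at a time). A sketch:

```shell
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
```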
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
The early approach to using GPUs with Docker:
One of the early work-arounds to this problem was to fully install the NVIDIA drivers inside the container and map in the character devices corresponding to the NVIDIA GPUs (e.g. /dev/nvidia0) on launch.
This solution is brittle because the version of the host driver must exactly match the version of the driver installed in the container.
This requirement drastically reduced the portability of these early containers, undermining one of Docker's more important features.
To enable portability in Docker images that leverage NVIDIA GPUs, we developed nvidia-docker, an open-source project hosted on GitHub that provides the two critical components needed for portable GPU-based containers:
- driver-agnostic CUDA images; and
- a Docker command line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
nvidia-docker is essentially a wrapper around the docker command that transparently provisions a container with the necessary components to execute code on the GPU.
If you need CUDA 6.5 or 7.0, you can specify a tag for the image. A list of available CUDA images for Ubuntu and CentOS can be found on the nvidia-docker wiki.
nvidia-docker run --rm -ti nvidia/cuda:7.0 nvcc --version
https://blog.csdn.net/a632189007/article/details/78801166
nvidia-docker-plugin is a Docker plugin that makes it easy to deploy containers in GPU-equipped environments. It runs as a daemon that discovers the host driver files and GPU devices, and mounts them into containers in response to requests from the Docker daemon, thereby enabling GPU use in Docker.
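In the v1 design this plugin also exposed a small REST API on the host (port 3476, if I recall correctly) that can print the extra docker CLI arguments (devices plus the driver volume) it would inject, which is a handy way to see what nvidia-docker v1 actually does:

```shell
# nvidia-docker v1 only; requires the plugin daemon to be running on the host.
curl -s http://localhost:3476/docker/cli
```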