# lshw -numeric -C display    # list the display adapters to count the GPUs
*-display
description: VGA compatible controller
product: NVIDIA Corporation [10DE:2206]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:0a:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:113 memory:fb000000-fbffffff memory:d0000000-dfffffff memory:e0000000-e1ffffff ioport:f000(size=128) memory:fc000000-fc07ffff
*-display
description: VGA compatible controller
product: NVIDIA Corporation [10DE:2206]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:0b:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:114 memory:f9000000-f9ffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:e000(size=128) memory:fa000000-fa07ffff
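Step 1: Identify the GPU model
Find out which NVIDIA GPU model you have (it is usually stated when you buy the machine; mine was printed on a sticker on the computer). My card is an RTX 3080.
# lspci -vnn | grep VGA    # shows the same devices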
0a:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2206] (rev a1) (prog-if 00 [VGA controller])
0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2206] (rev a1) (prog-if 00 [VGA controller])
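The GPU count can also be derived from the lspci output instead of being read off by eye. The following is a minimal sketch, assuming the NVIDIA PCI vendor ID 10de that appears in the output above; the function name count_nvidia_vga is hypothetical.

```shell
#!/bin/sh
# Count VGA controllers whose PCI vendor ID is 10de (NVIDIA).
# Reads `lspci -nn` style lines on stdin; count_nvidia_vga is an illustrative name.
count_nvidia_vga() {
    # grep -c prints the number of matching lines (0 if none);
    # "|| true" keeps the function from failing when nothing matches.
    grep -ci 'VGA compatible controller.*\[10de:[0-9a-f]\{4\}]' || true
}

# Query the real hardware only when lspci is available.
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA VGA controllers: $(lspci -nn 2>/dev/null | count_nvidia_vga)"
fi
```

Reading from stdin keeps the function usable on saved output as well as on a live system.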
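Step 2: Find the driver for the RTX 3080
Go to the NVIDIA driver search page and look up the version number of the latest driver that supports the RTX 3080.
Driver version: 455.45 - Release date: 2020-11-17
Refresh the package index by running
# apt-get update
# apt-cache search 'nvidia-*' | grep 455    # check whether a 455-series driver package exists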
nvidia-driver-418-server - NVIDIA Server Driver metapackage
nvidia-driver-440-server - NVIDIA Server Driver metapackage
nvidia-driver-450-server - NVIDIA Server Driver metapackage
nvidia-driver-455 - NVIDIA driver metapackage
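The availability check can be scripted so that it matches the exact package name instead of any line containing "455". This is a sketch under the assumption that apt-cache search prints one "name - description" line per package, as in the listing above; has_driver_package is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if a package named exactly nvidia-driver-<version> is listed.
# Reads `apt-cache search` output on stdin; has_driver_package is an illustrative name.
has_driver_package() {
    # Anchor on the start of the line and the following space so that
    # nvidia-driver-450-server does not count as nvidia-driver-450.
    grep -q "^nvidia-driver-$1 "
}

if command -v apt-cache >/dev/null 2>&1; then
    if apt-cache search nvidia-driver 2>/dev/null | has_driver_package 455; then
        echo "nvidia-driver-455 is available"
    else
        echo "nvidia-driver-455 not found; check the package sources"
    fi
fi
```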
Install
# apt-get install nvidia-driver-455 -y
Reboot the system once the installation finishes
Check that the installation succeeded
# nvidia-smi
Fri Nov 27 09:56:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 Off | 00000000:0A:00.0 On | N/A |
| 0% 37C P8 2W / 320W | 299MiB / 10015MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3080 Off | 00000000:0B:00.0 Off | N/A |
| 0% 33C P8 4W / 320W | 10MiB / 10018MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1441 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 2997 G /usr/bin/gnome-shell 54MiB |
| 0 N/A N/A 3833 G /usr/lib/xorg/Xorg 94MiB |
| 0 N/A N/A 3990 G /usr/bin/gnome-shell 127MiB |
| 1 N/A N/A 1441 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2997 G /usr/bin/gnome-shell 0MiB |
| 1 N/A N/A 3833 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 3990 G /usr/bin/gnome-shell 0MiB |
This output shows that the driver was installed successfully
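For a scripted check, the loaded driver version can be queried directly instead of reading the table. The sketch below assumes the 455 series installed above and uses the --query-gpu and --format options of nvidia-smi; driver_major is a hypothetical helper name.

```shell
#!/bin/sh
# Print the major part of a driver version string, e.g. 455.38 -> 455.
# driver_major is an illustrative helper name.
driver_major() {
    printf '%s\n' "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # One line per GPU; the driver version is the same for all, so take the first.
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    if [ "$(driver_major "$version")" = "455" ]; then
        echo "driver $version loaded (455 series)"
    else
        echo "unexpected driver version: ${version:-none}"
    fi
fi
```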
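Installing nvidia-docker (use sudo if you lack the required permissions)
Install Docker first (adapt this to your own environment)
# apt-get update    # refresh Ubuntu's apt package index
Install the packages that let apt use repositories over HTTPS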
#apt-get install apt-transport-https ca-certificates curl software-properties-common
Add Docker's official GPG key
#curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Set up the Docker stable repository
#add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
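To see what the command above registers on a given machine, the repository line can be built and printed first. This is only an illustrative sketch; docker_repo_line is a hypothetical name, and the codename (for example bionic) comes from lsb_release -cs as in the command above.

```shell
#!/bin/sh
# Build the apt source line for Docker's stable repository.
# docker_repo_line is an illustrative name; $1 is the Ubuntu codename.
docker_repo_line() {
    printf 'deb [arch=amd64] https://download.docker.com/linux/ubuntu %s stable\n' "$1"
}

if command -v lsb_release >/dev/null 2>&1; then
    # Print the line that add-apt-repository would receive on this machine.
    docker_repo_line "$(lsb_release -cs)"
fi
```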
After adding the repository, refresh the apt package index
#apt-get update
Install the latest Docker CE (Community Edition)
# apt-get install docker-ce
Configure nvidia-docker
# curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
OK
Add the official NVIDIA package repository
# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
#deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH)
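The distribution variable decides which list file is downloaded, so it is worth printing before running the curl command. A minimal sketch of the same derivation follows, with the sourcing moved into a subshell; distribution_from is a hypothetical name.

```shell
#!/bin/sh
# Print ID followed by VERSION_ID from an os-release style file.
# distribution_from is an illustrative name; the subshell body keeps the
# sourced variables out of the caller's environment.
distribution_from() (
    . "$1"
    printf '%s%s\n' "$ID" "$VERSION_ID"
)

if [ -r /etc/os-release ]; then
    echo "distribution=$(distribution_from /etc/os-release)"
fi
```

On the Ubuntu 18.04 system used here this yields ubuntu18.04, which matches the paths in the list above.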
Refresh the package index
# apt update
Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 InRelease [1,139 B]
Hit:2 https://download.docker.com/linux/ubuntu bionic InRelease
Get:3 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 InRelease [1,136 B]
Get:4 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 InRelease [1,129 B]
Get:5 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 Packages [9,128 B]
Get:6 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages [6,148 B]
Get:7 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 Packages [4,332 B]
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Fetched 275 kB in 2s (129 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
8 packages can be upgraded. Run 'apt list --upgradable' to see them.
Install nvidia-docker2
# apt install -y nvidia-docker2
# systemctl restart docker
Configure the Docker daemon file
Note: default-runtime must be set; otherwise Docker containers started by k8s cannot find nvidia-smi.
# cat /etc/docker/daemon.json
{
"registry-mirrors": ["https://5twf62k1.mirror.aliyuncs.com"],
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Pay particular attention to the path entry above
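Both points, the default-runtime setting and the path entry, can be checked before restarting Docker. The sketch below does text-level matching only and is not a JSON parser; it assumes the file layout shown above, and both function names are hypothetical.

```shell
#!/bin/sh
# Text-level checks on a daemon.json file (not a JSON parser).
# Both function names are illustrative.

# Succeeds if "default-runtime" is set to "nvidia".
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Prints the value of the first "path" key found in the file.
runtime_path() {
    sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

conf=/etc/docker/daemon.json
if [ -r "$conf" ]; then
    has_nvidia_default_runtime "$conf" || echo "default-runtime is not nvidia"
    p=$(runtime_path "$conf")
    [ -x "$p" ] || echo "runtime binary missing or not executable: ${p:-unset}"
fi
```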
Restart the Docker daemon
# systemctl daemon-reload && systemctl restart docker
Verify nvidia-docker2
# docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi    # the table below indicates a successful installation
Fri Nov 27 02:54:19 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 Off | 00000000:0A:00.0 On | N/A |
| 0% 36C P8 1W / 320W | 202MiB / 10015MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3080 Off | 00000000:0B:00.0 Off | N/A |
| 0% 33C P8 8W / 320W | 10MiB / 10018MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
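A stricter check than looking at the table is to confirm that the container sees as many GPUs as the host. The sketch below assumes nvidia-smi -L prints one "GPU <n>: ..." line per device and reuses the image from the verification command above; count_listed_gpus is a hypothetical name.

```shell
#!/bin/sh
# Count GPUs in `nvidia-smi -L` output (one "GPU <n>: ..." line per device).
# count_listed_gpus is an illustrative name.
count_listed_gpus() {
    # "|| true" keeps the function from failing when no line matches.
    grep -c '^GPU [0-9][0-9]*:' || true
}

if command -v nvidia-smi >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
    host=$(nvidia-smi -L 2>/dev/null | count_listed_gpus)
    # Same image and flags as the verification command above.
    ctr=$(docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi -L 2>/dev/null | count_listed_gpus)
    echo "host GPUs: $host, container GPUs: $ctr"
fi
```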
=================================================================
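If you run into this symptom:
"No program is running, but nvidia-smi reports GPU-Util at 100%, i.e. very high GPU utilization"
the driver needs to be switched to persistence mode so that it stays resident in memory. The command is:
root@node3:~# nvidia-smi -pm 1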
=================================================================
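Whether persistence mode is already on can be checked per GPU before changing it. This is a sketch, assuming the persistence_mode query field of nvidia-smi reports Enabled or Disabled; gpus_without_persistence is a hypothetical name.

```shell
#!/bin/sh
# Print the index of every GPU whose persistence mode is not Enabled.
# Reads "index, persistence_mode" CSV lines on stdin;
# gpus_without_persistence is an illustrative name.
gpus_without_persistence() {
    awk -F', *' 'NF >= 2 && $2 != "Enabled" { print $1 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    pending=$(nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader 2>/dev/null | gpus_without_persistence)
    if [ -n "$pending" ]; then
        echo "persistence mode is off on GPU(s):" $pending
        echo "enable it with: nvidia-smi -pm 1"
    fi
fi
```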