# lshw -numeric -C display    # list the display adapters to see how many GPUs are installed
*-display
description: VGA compatible controller
product: NVIDIA Corporation [10DE:2206]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:0a:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:113 memory:fb000000-fbffffff memory:d0000000-dfffffff memory:e0000000-e1ffffff ioport:f000(size=128) memory:fc000000-fc07ffff
*-display
description: VGA compatible controller
product: NVIDIA Corporation [10DE:2206]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:0b:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:114 memory:f9000000-f9ffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:e000(size=128) memory:fa000000-fa07ffff
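Step 1: Identify the GPU model
Find out the model of your NVIDIA graphics card (it is usually listed when you buy the computer; mine was printed on a sticker on the machine). My card is an RTX 3080.
# lspci -vnn | grep VGA    # shows the same information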
0a:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2206] (rev a1) (prog-if 00 [VGA controller])
0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2206] (rev a1) (prog-if 00 [VGA controller])
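As a rough sketch, the GPU count can also be derived from the `lspci` output rather than read by eye. The helper name `count_nvidia_vga` is hypothetical. It assumes every GPU shows up as a "VGA compatible controller" line with NVIDIA's PCI vendor ID `10de`, as in the output above; cards that register under a different device class would not be counted.

```shell
#!/usr/bin/env bash
# Count NVIDIA VGA controllers in `lspci -nn` style output read from stdin.
# Assumption: one "VGA compatible controller" line per GPU, with a bracketed
# vendor:device ID that starts with 10de (NVIDIA's PCI vendor ID).
count_nvidia_vga() {
    grep -i 'VGA compatible controller' | grep -ci '\[10de:'
}

# Example usage on a real machine:
#   lspci -nn | count_nvidia_vga
```

Reading from stdin keeps the helper usable on saved output as well as on a live system.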
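Step 2: Find a driver for the RTX 3080
Go to the NVIDIA driver search page and look up the version number of the latest driver that supports the RTX 3080.
Driver version: 455.45 - Release date: 2020-11-17
Refresh the package sources by running:
# apt-get update
# apt-cache search 'nvidia-*' | grep 455    # check whether a 455 driver package is available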
nvidia-driver-418-server - NVIDIA Server Driver metapackage
nvidia-driver-440-server - NVIDIA Server Driver metapackage
nvidia-driver-450-server - NVIDIA Server Driver metapackage
nvidia-driver-455 - NVIDIA driver metapackage
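The availability check can be scripted so that it only succeeds on an exact package name. This is a minimal sketch: the helper name `has_driver_package` is hypothetical, and it assumes the `name - description` line format shown in the listing above.

```shell
#!/usr/bin/env bash
# Succeed if a package named exactly nvidia-driver-<version> appears in
# `apt-cache search` style output ("name - description") read from stdin.
has_driver_package() {
    local version="$1"
    grep -q "^nvidia-driver-${version} - "
}

# Example usage on a real machine:
#   apt-cache search nvidia-driver | has_driver_package 455 && echo "available"
```

Anchoring the pattern at the start of the line and including the trailing ` - ` keeps variants such as `nvidia-driver-450-server` from matching a query for `450`.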
Install the driver:
# apt-get install nvidia-driver-455 -y
Reboot the system once the installation has finished.
Check that the driver was installed correctly:
# nvidia-smi
Fri Nov 27 09:56:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 Off | 00000000:0A:00.0 On | N/A |
| 0% 37C P8 2W / 320W | 299MiB / 10015MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3080 Off | 00000000:0B:00.0 Off | N/A |
| 0% 33C P8 4W / 320W | 10MiB / 10018MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1441 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 2997 G /usr/bin/gnome-shell 54MiB |
| 0 N/A N/A 3833 G /usr/lib/xorg/Xorg 94MiB |
| 0 N/A N/A 3990 G /usr/bin/gnome-shell 127MiB |
| 1 N/A N/A 1441 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2997 G /usr/bin/gnome-shell 0MiB |
| 1 N/A N/A 3833 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 3990 G /usr/bin/gnome-shell 0MiB |
This output shows that the driver was installed successfully.
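For an unattended check, `nvidia-smi` can also print selected fields as CSV via `--query-gpu` and `--format=csv,noheader`. The sketch below is one possible way to use that. The helper name `all_gpus_on_branch` is hypothetical, and it assumes the fields are requested in the order `index,name,driver_version`.

```shell
#!/usr/bin/env bash
# Read "index, name, driver_version" CSV lines from stdin and succeed only if
# at least one GPU is listed and every GPU reports the expected driver branch
# (the part of the version string before the first dot).
all_gpus_on_branch() {
    local branch="$1"
    awk -F', ' -v want="$branch" '
        { split($3, v, "."); if (v[1] != want) bad = 1; n++ }
        END { exit (n > 0 && !bad) ? 0 : 1 }
    '
}

# Example usage on a real machine:
#   nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader \
#       | all_gpus_on_branch 455 && echo "driver OK"
```

Comparing only the branch number tolerates the minor-version difference seen here, where the packaged driver reports 455.38 while the search page listed 455.45.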
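Installing nvidia-docker (use sudo if you lack the required permissions)
Install Docker first (adapt these steps to your own environment)
# apt-get update    # refresh Ubuntu's apt package index
Install the packages that let apt use repositories over HTTPS:
# apt-get install apt-transport-https ca-certificates curl software-properties-common
Add Docker's official GPG key:
# curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Set up the Docker stable repository:
# add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
After adding the repository, refresh the apt package index:
# apt-get update
Install the latest Docker CE (Community Edition):
# apt-get install docker-ce
Configuring nvidia-docker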
# curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
OK
Add the official NVIDIA package repository:
# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
#deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH)
#deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH)
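The `distribution=$(. /etc/os-release;echo $ID$VERSION_ID)` line sources the OS description file in a subshell and joins two of its variables, which is where the `ubuntu18.04` in the repository URLs comes from. The sketch below illustrates this. The function name `distribution_of` is hypothetical, and the optional file argument is added only so the idiom can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Print the "$ID$VERSION_ID" string for a given os-release file.
# The file is sourced inside a subshell so its variables do not leak into
# the calling shell, mirroring the $(. /etc/os-release; echo ...) idiom.
distribution_of() {
    local os_release_file="${1:-/etc/os-release}"
    ( . "$os_release_file" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Example usage on a real machine:
#   distribution=$(distribution_of)
#   echo "$distribution"
```

Printing this value before downloading the `.list` file is a cheap way to confirm that the URL will point at a directory that exists for your release.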
Refresh the package index:
# apt update
Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 InRelease [1,139 B]
Hit:2 https://download.docker.com/linux/ubuntu bionic InRelease
Get:3 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 InRelease [1,136 B]
Get:4 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 InRelease [1,129 B]
Get:5 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 Packages [9,128 B]
Get:6 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages [6,148 B]
Get:7 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 Packages [4,332 B]
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Fetched 275 kB in 2s (129 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
8 packages can be upgraded. Run 'apt list --upgradable' to see them.
Install nvidia-docker2:
# apt install -y nvidia-docker2
# systemctl restart docker
Configure the Docker daemon file
# cat /etc/docker/daemon.json
# Note: default-runtime must be set; otherwise Docker containers started under k8s cannot find nvidia-smi
{
"registry-mirrors": ["https://5twf62k1.mirror.aliyuncs.com"],
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
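Pay particular attention to the path entry above: it has to match the location where nvidia-container-runtime is actually installed.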
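The two settings stressed above can be checked with a short script before the daemon is restarted. This is only a sketch: the helper names `has_nvidia_default_runtime` and `runtime_path` are hypothetical, and the text matching assumes the file is laid out one key per line as shown. It does not validate the JSON.

```shell
#!/usr/bin/env bash
# Succeed if the given daemon.json sets "default-runtime" to "nvidia".
# Assumption: a plain-text pattern match is enough for a quick check.
has_nvidia_default_runtime() {
    local config_file="${1:-/etc/docker/daemon.json}"
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config_file"
}

# Print the first "path" value found in the given daemon.json.
runtime_path() {
    local config_file="${1:-/etc/docker/daemon.json}"
    sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$config_file" | head -n 1
}

# Example usage on a real machine:
#   has_nvidia_default_runtime || echo "default-runtime is not nvidia"
#   [ -x "$(runtime_path)" ] || echo "runtime binary not found at configured path"
```

Once the daemon is running with the new configuration, `docker info --format '{{.DefaultRuntime}}'` should report the runtime Docker actually picked up.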
Restart the Docker daemon:
# systemctl daemon-reload && systemctl restart docker
Verify nvidia-docker2:
# docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi    # a table like the one below means the installation succeeded
Fri Nov 27 02:54:19 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 Off | 00000000:0A:00.0 On | N/A |
| 0% 36C P8 1W / 320W | 202MiB / 10015MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3080 Off | 00000000:0B:00.0 Off | N/A |
| 0% 33C P8 8W / 320W | 10MiB / 10018MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
=================================================================
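If you run into the following symptom:
"No programs are running, yet nvidia-smi reports GPU-Util at 100% (very high GPU utilization)"
the driver needs to be switched to persistence mode so that it stays loaded in memory. The command is:
root@node3:~# nvidia-smi -pm 1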
=================================================================
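To see which GPUs still need the setting, the persistence state can be queried as CSV. The sketch below assumes the `index,persistence_mode` query fields and the `Enabled`/`Disabled` values that `nvidia-smi` prints for them. The helper name `gpus_without_persistence` is hypothetical.

```shell
#!/usr/bin/env bash
# Read "index, persistence_mode" CSV lines from stdin and print the index of
# every GPU whose persistence mode is not "Enabled".
gpus_without_persistence() {
    awk -F', ' '$2 != "Enabled" { print $1 }'
}

# Example usage on a real machine:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
#       | gpus_without_persistence
```

Empty output means every listed GPU already has persistence mode enabled. As far as I know, the mode set with `nvidia-smi -pm 1` does not survive a reboot, so a check like this may be worth rerunning after restarts.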