Pre-installation preparation:
Check the graphics card, the OS release, and the kernel information
cat /etc/centos-release
lshw -numeric -C display
yum install pciutils
lspci | grep -i vga
lspci | grep -i nvidia
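Compute cards such as Tesla boards are often listed by lspci as "3D controller" rather than "VGA compatible controller", so the vga filter above can come back empty even when a GPU is present. A minimal sketch of a filter that accepts both classes; count_nvidia_gpus is a hypothetical helper name, and it only reads lspci output from standard input:

```shell
# Count NVIDIA display devices in lspci output read from stdin.
# Matches both "VGA compatible controller" and "3D controller" entries.
# count_nvidia_gpus is an illustrative name, not part of pciutils.
count_nvidia_gpus() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -Eic '(VGA compatible controller|3D controller).*nvidia' || true
}

# Usage (requires pciutils):
#   lspci | count_nvidia_gpus
```

A result of 0 suggests checking that the card is seated and visible to the host before going further.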
1. Install the build environment: gcc, kernel-devel, kernel-headers (the "kernel-devel-uname-r == $(uname -r)" spec ensures that the installed kernel-devel matches the currently running kernel)
yum -y install gcc kernel-devel "kernel-devel-uname-r == $(uname -r)" dkms
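2. Check that the running kernel version and the kernel-devel (kernel source) version match; if they differ, upgrade with yum until they match. The version shown by
ls /boot | grep vmlinu
must match the version shown by
rpm -aq | grep kernel-devel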
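The comparison in step 2 can also be scripted instead of checked by eye. The following is only a sketch: kernel_devel_matches is a hypothetical helper, and it assumes rpm prints package names in its default name-version-release.arch form.

```shell
# Succeeds if one of the given package names is the kernel-devel build
# for the given kernel release. Illustrative helper, not an rpm feature.
kernel_devel_matches() (
    running="$1"
    shift
    for pkg in "$@"; do
        [ "$pkg" = "kernel-devel-$running" ] && exit 0
    done
    exit 1
)

# Usage on the target host:
#   kernel_devel_matches "$(uname -r)" $(rpm -q kernel-devel) && echo "versions match"
```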
Remove other kernel versions and rebuild the kernel boot configuration
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
Reboot
reboot
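The step above mentions removing other kernel versions but gives no command for it. As a hedged sketch, the helper below (list_other_kernels, a hypothetical name) only lists the installed kernel packages that do not belong to the running kernel, so they can be reviewed before anything is removed:

```shell
# Print every package name that is not the package for the running kernel.
# Illustrative helper; arguments are a kernel release followed by package
# names in the form printed by "rpm -q kernel".
list_other_kernels() (
    running="$1"
    shift
    for pkg in "$@"; do
        [ "$pkg" = "kernel-$running" ] || printf '%s\n' "$pkg"
    done
)

# Usage on the target host:
#   list_other_kernels "$(uname -r)" $(rpm -q kernel)
# Review the list, then remove an entry with: yum remove <package>
```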
Check whether the nouveau driver is loaded (if the lsmod command is missing, install it with yum)
lsmod | grep nouveau
Blacklist the nouveau driver that ships with the system
Edit the dist-blacklist.conf file:
vim /lib/modprobe.d/dist-blacklist.conf
Comment out nvidiafb:
#blacklist nvidiafb
Then add the following lines:
blacklist nouveau
options nouveau modeset=0
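The same edit can be made without opening an editor. This is a minimal sketch, assuming the file layout described above; blacklist_nouveau is a hypothetical helper that takes the config file path as its argument and is safe to run more than once:

```shell
# Comment out "blacklist nvidiafb" and append the two nouveau lines
# to the given modprobe config file, skipping lines already present.
# Illustrative helper; run it as root against the real file.
blacklist_nouveau() (
    conf="$1"
    tmp="$(mktemp)" || exit 1
    # Prefix an active "blacklist nvidiafb" line with "#";
    # lines that are already commented out do not match.
    sed 's/^[[:space:]]*blacklist[[:space:]][[:space:]]*nvidiafb/#&/' "$conf" > "$tmp" || exit 1
    cat "$tmp" > "$conf"
    rm -f "$tmp"
    for line in 'blacklist nouveau' 'options nouveau modeset=0'; do
        grep -qxF "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
    done
)

# Usage:
#   blacklist_nouveau /lib/modprobe.d/dist-blacklist.conf
```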
3. Rebuild the initramfs image (this generates a new initramfs, not a new kernel; with it, the nouveau driver is not loaded at boot)
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
Change the default run level to text mode
systemctl set-default multi-user.target
Reboot
reboot
Run lsmod | grep nouveau; if it prints nothing, nouveau is not loaded
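For use in a script, the check can be turned into an exit status with an exact match on the module name column. A small sketch; module_listed is a hypothetical helper that reads lsmod output from standard input:

```shell
# Succeeds if the named module appears in the first column of lsmod
# output read from stdin. The header line is skipped.
# Illustrative helper, not part of kmod.
module_listed() {
    awk -v mod="$1" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}

# Usage:
#   lsmod | module_listed nouveau && echo "nouveau is still loaded"
```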
I. Install the NVIDIA graphics driver
Download the driver:
https://www.nvidia.cn/drivers/unix/
Make the installer executable
chmod +x NVIDIA-Linux-x86_64-455.45.01.run
Run it
./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.15.2.el7.x86_64/ --no-drm
(Note: the --no-drm option is required; otherwise the installation fails with: ERROR: The nvidia-drm kernel module failed to load. This kernel
module is required for the proper operation of DRM-KMS. If you do not need to use DRM-KMS, you can try to install
this driver package again with the '--no-drm' option.)
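The kernel source path in the command above is hard-coded to one kernel release and will be wrong after a kernel update. As a sketch under that assumption, the helper below (nvidia_installer_command, a hypothetical name) prints the same command line with the release filled in from an argument, so it can be reviewed before being run:

```shell
# Print the installer command line for a given .run file and kernel release.
# Illustrative helper; it only prints the command, it does not execute it.
nvidia_installer_command() {
    printf './%s --kernel-source-path=/usr/src/kernels/%s/ --no-drm\n' "$1" "$2"
}

# Usage:
#   nvidia_installer_command NVIDIA-Linux-x86_64-455.45.01.run "$(uname -r)"
```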
Select yes at the prompts. Once the installation completes, reboot
reboot
Run nvidia-smi; if the GPU information appears, the NVIDIA driver was installed successfully
Sat Feb 27 15:39:09 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   38C    P0    66W / 250W |      0MiB / 22945MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
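When the driver version is needed in a script, nvidia-smi can print selected fields as CSV instead of the table above. A minimal sketch; driver_version_from_csv is a hypothetical helper, and it assumes the two queried fields are separated by a comma and a space:

```shell
# Extract the driver version from one line of
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# read from stdin. Illustrative helper, not part of the driver package.
driver_version_from_csv() {
    awk -F', ' '{ print $2 }'
}

# Usage:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | driver_version_from_csv
```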
Install the Docker service
Install the dependencies:
yum install -y yum-utils device-mapper-persistent-data lvm2
Add the repo file
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
List the available versions:
yum list docker-ce --showduplicates | sort -r
Install a specific Docker version
yum install docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6 containerd.io
Start Docker
systemctl start docker
systemctl status docker
systemctl enable docker
Install nvidia-docker
References:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html (official installation guide)
https://nvidia.github.io/libnvidia-container/ (may only be reachable through a proxy)
Set up the key and import the repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
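The distribution string in the command above is built by sourcing /etc/os-release and joining ID and VERSION_ID (for example centos7). A sketch that separates the two steps so the resulting URL can be inspected before it is downloaded; distribution_id and nvidia_docker_repo_url are hypothetical helper names:

```shell
# Print ID and VERSION_ID from an os-release style file, joined together.
# The file is sourced in a subshell so its variables do not leak out.
distribution_id() (
    . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID"
)

# Build the repo URL used above from a distribution string.
nvidia_docker_repo_url() {
    printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.repo\n' "$1"
}

# Usage:
#   nvidia_docker_repo_url "$(distribution_id /etc/os-release)"
```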
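Clear the expired yum cache
yum clean expire-cache
Rebuild the cache
yum makecache
List the installable nvidia-docker versions:
yum search --showduplicates nvidia-docker
Install nvidia-docker (a version can be specified; by default the latest stable version is installed)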
yum install -y nvidia-docker2
Edit the daemon.json file
Note: default-runtime must be set; otherwise Docker containers started by k8s cannot find nvidia-smi.
root@slash:/home/slash# cat /etc/docker/daemon.json
{
    "registry-mirrors": ["https://5twf62k1.mirror.aliyuncs.com"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
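Pay particular attention to the path value above; it must point to the nvidia-container-runtime binary that is actually installed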
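Both points, the default-runtime entry and the path, can be checked before Docker is restarted. The following is only a sketch: the two helper names are hypothetical, and they use plain text matching rather than a real JSON parser, so they assume each key and its value sit on one line as in the file above.

```shell
# Succeeds if the file sets "default-runtime" to "nvidia".
default_runtime_is_nvidia() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Print the first "path" value found in the file.
runtime_path() {
    sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage:
#   default_runtime_is_nvidia /etc/docker/daemon.json || echo "default-runtime is not nvidia"
#   [ -x "$(runtime_path /etc/docker/daemon.json)" ] || echo "runtime binary not found"
```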
Restart the Docker daemon
systemctl daemon-reload && systemctl restart docker
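Verify nvidia-docker2
Run either of the following commands; the GPU table shown below indicates a successful installation. (The --gpus flag requires Docker 19.03 or later; on the Docker 18.09.6 installed above, use the nvidia-docker form.)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
nvidia-docker run --rm nvidia/cuda nvidia-smi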
Mon Mar 1 02:47:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   40C    P0    66W / 250W |      0MiB / 22945MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Use nvidia-docker to view the GPU information:
nvidia-docker run --rm nvidia/cuda nvidia-smi
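Which verification command applies depends on the Docker version: the --gpus flag of docker run was introduced in Docker 19.03, while older releases such as the 18.09.6 installed above rely on the nvidia runtime or the nvidia-docker wrapper. A minimal sketch of that version check; docker_supports_gpus_flag is a hypothetical helper that takes a version string as its argument:

```shell
# Succeeds if the given Docker version string is 19.03 or later,
# i.e. new enough for "docker run --gpus". Illustrative helper.
docker_supports_gpus_flag() (
    version="$1"
    major="${version%%.*}"      # "18.09.6" -> "18"
    rest="${version#*.}"        # "18.09.6" -> "09.6"
    minor="${rest%%.*}"         # "09.6"    -> "09"
    minor="${minor#0}"          # drop one leading zero: "09" -> "9"
    [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "${minor:-0}" -ge 3 ]; }
)

# Usage:
#   if docker_supports_gpus_flag "$(docker version --format '{{.Server.Version}}')"; then
#       docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
#   else
#       nvidia-docker run --rm nvidia/cuda nvidia-smi
#   fi
```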