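To run a recent GPU build of TensorFlow, the graphics driver on the host has to be updated first (this article uses NVIDIA 390 as the example).
1. Updating the driver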
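Disable the nouveau driver. Add the following two lines to /etc/modprobe.d/blacklist.conf:
blacklist nouveau
options nouveau modeset=0
Then rebuild the initramfs with "sudo update-initramfs -u". Run "lsmod | grep nouveau"; if it prints nothing, nouveau has been disabled. Do not reboot at this point, or the system may fail to boot. If a reboot has already left the system unbootable, see: https://blog.csdn.net/wei_supreme/article/details/82227765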
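Before going further, it can be worth confirming that both blacklist lines really ended up in the file. The following is only a minimal sketch, assuming the file path used above; the function name nouveau_blacklisted is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given modprobe config file
# contains both lines needed to disable nouveau.
nouveau_blacklisted() {
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$1" &&
    grep -Eq '^[[:space:]]*options[[:space:]]+nouveau[[:space:]]+modeset=0[[:space:]]*$' "$1"
}

# Path taken from this article; adjust it if you used a different file.
conf_file=/etc/modprobe.d/blacklist.conf

if nouveau_blacklisted "$conf_file" 2>/dev/null; then
    echo "nouveau is blacklisted in $conf_file"
else
    echo "nouveau blacklist lines are missing from $conf_file"
fi
```

The check only looks at the file contents; it does not tell you whether the module is currently loaded, which is what "lsmod | grep nouveau" is for.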
Add the Graphics Drivers PPA: "sudo -E add-apt-repository ppa:graphics-drivers/ppa", then "sudo apt-get update". Search for a suitable driver with "sudo ubuntu-drivers devices".
Remove any existing driver: sudo apt-get remove --purge 'nvidia*'
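Stop the LightDM display manager (the graphical desktop): "sudo service lightdm stop"
Install the driver: "sudo apt-get install nvidia-384" (replace 384 with the version you picked in the search step, for example 390)
Run "sudo apt-get upgrade", then reboot with "sudo reboot".
Run "nvidia-smi" to confirm that the driver was installed successfully.
If you get the error "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver", disable Secure Boot in the firmware (BIOS/UEFI) settings.
Restart the graphical environment: "sudo service lightdm start"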
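If nvidia-smi cannot talk to the driver, one quick way to see whether Secure Boot is the cause is to ask mokutil. This is a rough sketch that assumes the mokutil tool is available on the host; the function name secure_boot_enabled is hypothetical, and it only parses the text that "mokutil --sb-state" prints.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `mokutil --sb-state` on stdin and
# succeeds if it reports that Secure Boot is enabled.
secure_boot_enabled() {
    grep -qi '^SecureBoot enabled'
}

if command -v mokutil >/dev/null 2>&1; then
    if mokutil --sb-state 2>/dev/null | secure_boot_enabled; then
        echo "Secure Boot is enabled: disable it in the firmware setup, then reboot"
    else
        echo "Secure Boot is not reported as enabled"
    fi
else
    # Assumption: without mokutil the setting has to be checked by hand.
    echo "mokutil not found; check the Secure Boot setting in the firmware setup"
fi
```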
2. Error:
Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --utility --pid=11077 /var/lib/docker/overlay2/510a6de5ed82decf7421a392e5274b4fe47e8d0cd3610175c3550f1d26c91376/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
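The message points to a driver problem. My first thought was that upgrading the earlier nvidia-384 driver to nvidia-410 had gone wrong. Rebooting made no difference, so I tried reinstalling nvidia-410 through apt:
$ add-apt-repository ppa:graphics-drivers/ppa
$ apt update
$ apt install nvidia-410
After another reboot the same kind of error was still there. Searching further, I found https://zhuanlan.zhihu.com/p/37519492, which describes a similar problem. Running nvidia-container-cli -k -d /dev/tty info gave the specific error: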
E0117 08:51:20.843706 12905 driver.c:197] could not start driver service: load library failed: libnvidia-fatbinaryloader.so.384.145: cannot open shared object file: no such file or directory
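I had already removed the 384 driver, so why was it still looking for this library? Was the new 410 installation incomplete? Reading further, that post says:
Installing the driver normally pulls in the libcuda1-384 package automatically. This is probably some legacy issue, or the purge followed by an install broke the package dependencies, so the package now has to be reinstalled.
That made me wonder whether libcuda1-410 was missing for my 410 driver as well. "apt search libcuda" showed that this package does exist, so I installed it with "apt install libcuda1-410". Running nvidia-container-cli -k -d /dev/tty info again, everything was back to normal.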
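The useful clue in that log line is the driver version embedded in the library name. As a small sketch (the function name missing_driver_version is made up, and the pattern only assumes the "load library failed: lib....so.MAJOR.MINOR:" shape seen in the error above), the major version can be pulled out like this:

```shell
#!/bin/sh
# Hypothetical helper: reads nvidia-container-cli debug output on stdin and
# prints the major driver version found in the first "load library failed"
# line. Prints nothing if no such line is present.
missing_driver_version() {
    sed -n 's/.*load library failed: lib[^ ]*\.so\.\([0-9][0-9]*\)\.[0-9.]*:.*/\1/p' | head -n 1
}

# The error line quoted in this article, used here as sample input.
log_line='E0117 08:51:20.843706 12905 driver.c:197] could not start driver service: load library failed: libnvidia-fatbinaryloader.so.384.145: cannot open shared object file: no such file or directory'

wanted=$(printf '%s\n' "$log_line" | missing_driver_version)
echo "loader is looking for driver version: $wanted"
```

If the version printed differs from the driver you just installed, as it did in my case (384 versus 410), that is a hint that a companion package for the new driver is missing.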
3. Error: ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Solution:
Change into /usr/lib/nvidia-390
and create a symbolic link:
sudo ln -f -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 libcuda.so.1
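Note that ln -f -s will happily create a link even when the target does not exist, so it is worth checking that the link resolves. A minimal sketch, assuming the directory used above; the function name libcuda_link_ok is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if DIR/libcuda.so.1 exists and, when it is a
# symlink, points at a file that is really there.
libcuda_link_ok() {
    # [ -e ] follows symlinks, so a dangling link fails this test.
    [ -e "$1/libcuda.so.1" ]
}

# Directory taken from this article; adjust it for your driver version.
dir=/usr/lib/nvidia-390

if libcuda_link_ok "$dir"; then
    echo "$dir/libcuda.so.1 resolves"
else
    echo "$dir/libcuda.so.1 is missing or dangling"
fi
```

If the link turns out to be dangling, the target path /usr/lib/x86_64-linux-gnu/libcuda.so.1 is the first thing to check.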
4. Installing nvidia-docker2
Official installation guide: https://github.com/NVIDIA/nvidia-docker
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Verify that it works:
docker run -it --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi