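Run the following command to check the version of the kernel module used by your GPU driver:
cat /proc/driver/nvidia/version
Run the following command to see which NVIDIA packages dpkg has installed or changed, according to its log:
cat /var/log/dpkg.log | grep nvidia
Run the following command to list all NVIDIA packages currently installed on the machine:
dpkg --list | grep nvidia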
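If the driver version is needed inside a script, the same file can be read from Python. This is only a minimal sketch: parse_driver_version is a hypothetical helper name, and it assumes the version is the first dotted number on the line that starts with "NVRM version", which may not hold for every driver release.

```python
import re
from pathlib import Path
from typing import Optional

VERSION_FILE = Path("/proc/driver/nvidia/version")


def parse_driver_version(text: str) -> Optional[str]:
    """Return the first dotted version number on the NVRM line, or None."""
    for line in text.splitlines():
        if line.startswith("NVRM version"):
            # Assumption: the driver version is the first token like 450.80.02.
            match = re.search(r"\b\d+\.\d+(?:\.\d+)*\b", line)
            return match.group(0) if match else None
    return None


if __name__ == "__main__":
    if VERSION_FILE.exists():
        print(parse_driver_version(VERSION_FILE.read_text()))
    else:
        print("File not found: the NVIDIA kernel module does not seem to be loaded")
```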
Problem 1:
root@4f80b64fe9f6:/# nvidia-smi
Failed to initialize NVML: Unknown Error
Solution: start the container with GPU access:
sudo docker run --gpus all -it ubuntu18_torch1.6:v0.3
The --gpus all option is required.
Problem 2:
After installing nvidia-docker, the NVIDIA driver, CUDA, cuDNN, and the CUDA build of PyTorch, torch.cuda.is_available() returns False inside the Docker container.
Solution:
sudo docker run --gpus all -it [-e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all]
Add these options to the docker run command: -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all
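To confirm from inside the container that these two variables actually arrived, a small check like the one below can help. It is a sketch, not an official diagnostic: check_gpu_env is a hypothetical helper, and it only compares the environment with the values used in the command above. Treating "void" and "none" as "no GPU exposed" is an assumption based on the NVIDIA container runtime's documented behavior.

```python
import os
from typing import List, Mapping

# The capabilities passed with -e in the docker run command above.
EXPECTED_CAPABILITIES = {"compute", "utility"}


def check_gpu_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of findings; an empty list means the variables match the article's fix."""
    findings = []

    devices = env.get("NVIDIA_VISIBLE_DEVICES", "")
    if devices in ("", "void", "none"):
        findings.append("NVIDIA_VISIBLE_DEVICES is unset or exposes no GPU: %r" % devices)

    raw = env.get("NVIDIA_DRIVER_CAPABILITIES", "")
    caps = {c.strip() for c in raw.split(",") if c.strip()}
    if "all" not in caps and not EXPECTED_CAPABILITIES <= caps:
        missing = ", ".join(sorted(EXPECTED_CAPABILITIES - caps))
        findings.append("NVIDIA_DRIVER_CAPABILITIES does not include: " + missing)

    return findings


if __name__ == "__main__":
    for finding in check_gpu_env(os.environ) or ["GPU environment variables look fine"]:
        print(finding)
```

If the list is empty and torch.cuda.is_available() still returns False, the cause is probably elsewhere, for example a CPU-only PyTorch build.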
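Problem 3:
Running the PyTorch project from PyCharm fails with: RuntimeError: Not compiled with GPU support
Solution:
Delete the entire build folder in benchmark and recompile the library. From the project root, run: python setup.py build develop
Once the build finishes, remember to save the container as an image (-a sets the author, -m the commit message):
sudo docker commit -a "author" -m "comment" container_id image_name:image_tag
Then point PyCharm at the new Docker image.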
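Problem 4: In PyCharm 2020.3, configuring Docker under Settings > Build, Execution, Deployment fails with: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
This is a communication problem between the Docker client and the daemon. If the daemon is running, the usual cause is that the current user has no permission to access the socket.
Solution:
Run the following in a terminal:
sudo chown *your-username* /var/run/docker.sock # *your-username* is your login name on the host, e.g. igs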
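Before changing the owner, it can be useful to see whether the socket exists and whether the current user can read and write it. The sketch below uses only the standard library; socket_status is a hypothetical helper, and the path is the one from the error message above.

```python
import os
import stat

DOCKER_SOCKET = "/var/run/docker.sock"


def socket_status(path: str = DOCKER_SOCKET) -> str:
    """Classify the socket as 'missing', 'not-a-socket', 'no-permission' or 'ok'."""
    if not os.path.exists(path):
        return "missing"  # the daemon is probably not running
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return "not-a-socket"
    if os.access(path, os.R_OK | os.W_OK):
        return "ok"
    return "no-permission"  # the case that the chown command above addresses


if __name__ == "__main__":
    print(DOCKER_SOCKET, "->", socket_status())
```

A result of "no-permission" matches the situation described here. "missing" points to a daemon that is not running, which chown will not fix.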
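Problem 5: Running the project code in Docker fails with: RuntimeError: Unrecognized tensor type ID: AutogradCUDA
Cause: the project was compiled against PyTorch 1.6 + torchvision 0.7, and PyTorch was then upgraded to 1.7 + torchvision 0.8, so the compiled extension no longer matches the installed PyTorch.
Solution: recompile the project with python setup.py build develop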
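Problem 6: Upgrading PyTorch inside Docker with pip install pytorch1.7.1-***.whl
fails: pip times out and then reports an error.
Solution: add --no-deps so that pip installs only the wheel and does not try to install its dependencies:
pip install --no-deps pytorch1.7.1-***.whl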
Problem 7: In a multi-GPU environment, setting NUM_WORKER to 2 fails immediately
export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS ../../tools/training/train.py
Traceback (most recent call last):
  File "train.py", line 159, in <module>
    train(args=args)
  File "train.py", line 50, in train
    rank = args.local_rank
  File "/home/wby/anaconda3/envs/wby/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/wby/anaconda3/envs/wby/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use
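The cause is that the TCP port used for the rendezvous is already in use.
Solution 1:
Specify a port when launching the program. Any free port number will do:
--master_port 29501 (the port number)
python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_port 29501 ../../tools/training/train.py
Solution 2:
Find the port in use (insert a print statement in the program), look up the PID that owns that port with netstat -nltp, and then free the port with kill -9 PID.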
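Instead of guessing a port number, a free one can be requested from the operating system and then passed to --master_port. This is a minimal standard-library sketch; find_free_port and port_in_use are hypothetical helper names. Another process could still grab the port between the check and the launch, so treat it as a convenience rather than a guarantee.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) fails, i.e. the port is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
        return False


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("29501 in use:", port_in_use(29501))
    print("free port for --master_port:", find_free_port())
```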
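Problem 8: no implementation found for {} on types that implement
The following line compares box1 itself with the class torch.Tensor instead of checking its type:
if box1 == torch.Tensor: box1=box1.cpu().numpy()
Change it to:
if isinstance(box1, torch.Tensor): box1 = box1.cpu().numpy()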
Problem 9: can't convert cuda:0 device type tensor to numpy
lt=np.maximum(box1[:,None,:2],box2[:,:2])
Before this line, move both boxes to the CPU and convert them to NumPy arrays:
if type(box1) == torch.Tensor: box1=box1.cpu().numpy()
if type(box2) == torch.Tensor: box2=box2.cpu().numpy()
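The conversions for Problems 8 and 9 can be collected in one place. The helper below is a sketch with a hypothetical name, to_numpy. It uses isinstance so that tensor subclasses are also handled, and it adds detach(), which the original code does not use, so that tensors that require gradients can be converted as well.

```python
try:
    import torch  # only needed when tensors are actually passed in
except ImportError:
    torch = None


def to_numpy(x):
    """Return a NumPy array if x is a torch.Tensor; otherwise return x unchanged."""
    if torch is not None and isinstance(x, torch.Tensor):
        # detach() drops the autograd graph, cpu() copies from the GPU to host memory.
        return x.detach().cpu().numpy()
    return x
```

With this helper, the two lines above become box1 = to_numpy(box1) and box2 = to_numpy(box2).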
Problem 10: When training in Docker, single-GPU training converges normally, but multi-GPU training fails to converge.
Reference:
NVIDIA Docker CUDA容器化原理分析 (An analysis of how NVIDIA Docker containerizes CUDA)
https://cloud.tencent.com/developer/article/1496697