If the training command you used before was python train.py --device gpu --save_dir ./checkpoints
just add -m paddle.distributed.launch
and you get distributed training: python -m paddle.distributed.launch train.py --device gpu --save_dir ./checkpoints
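Then it failed with: error code is libnccl.so: cannot open shared object file: No such file or directory
According to the message, NCCL is missing, and it gives a download address: https://developer.nvidia.com/nccl/nccl-download
You have to register before you can download... so I am writing the steps down here: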
Network Installer for Ubuntu18.04
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
$ sudo apt-get update

Network Installer for Ubuntu16.04
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-ubuntu1604.pin
$ sudo mv cuda-ubuntu1604.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/ /"
$ sudo apt-get update

Then run the following command to install NCCL:
For Ubuntu: sudo apt install libnccl2=2.11.4-1+cuda10.2 libnccl-dev=2.11.4-1+cuda10.2
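Before re-running the training, it may help to check whether the dynamic loader can now find NCCL at all. The following is only a rough sketch using the standard ctypes module; the function name find_nccl is made up for illustration, and the versioned name libnccl.so.2 is an assumption about what the libnccl2 package installs.

```python
import ctypes
import ctypes.util


def find_nccl():
    """Return the name of a loadable NCCL library, or None if none can be loaded."""
    # "libnccl.so" is the name from the error message; "libnccl.so.2" is an
    # assumed versioned name for the library shipped by the libnccl2 package.
    candidates = ["libnccl.so", "libnccl.so.2"]
    found = ctypes.util.find_library("nccl")
    if found:
        candidates.insert(0, found)
    for name in candidates:
        try:
            ctypes.CDLL(name)  # raises OSError if the loader cannot open it
            return name
        except OSError:
            continue
    return None


if __name__ == "__main__":
    name = find_nccl()
    if name:
        print(f"NCCL can be loaded as {name}")
    else:
        print("NCCL not found; the launch will probably hit the libnccl.so error again")
```

If this prints that NCCL was not found even after the apt install, the library directory may simply not be on the loader's search path.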
Haha, running it again worked.
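It turned out to be using all 4 cards at once.
To avoid disturbing other jobs already running on them, I recommend first running export CUDA_VISIBLE_DEVICES=2,3
to limit which cards are visible to the training.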
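The same restriction can also be applied from a small Python wrapper instead of an export in the shell. This is just a sketch under assumptions: build_launch is a hypothetical helper name, and it only prepares the command and environment; actually starting it needs paddle and train.py to be present.

```python
import os
import sys


def build_launch(gpu_ids, script_args):
    """Return (command, env) for a launch restricted to the given GPU ids."""
    env = dict(os.environ)  # copy, so the current process is not modified
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    cmd = [sys.executable, "-m", "paddle.distributed.launch", *script_args]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_launch(
        [2, 3], ["train.py", "--device", "gpu", "--save_dir", "./checkpoints"]
    )
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
    print(" ".join(cmd))
    # To really start the training, something like this should work:
    #   import subprocess
    #   subprocess.run(cmd, env=env, check=True)
```

Passing the variable through env keeps the restriction local to that one launch, so other shells and jobs are not affected.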
You can also check what each card is doing: a log folder is generated in the current directory.
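To peek at those logs quickly, something like the following might do. It is only a sketch: tail_logs is a hypothetical helper, the folder name log is taken from the note above, and no particular file naming is assumed, so it simply reads every regular file it finds there.

```python
from pathlib import Path


def tail_logs(log_dir="log", lines=5):
    """Return {file name: last `lines` lines} for every regular file in log_dir."""
    result = {}
    root = Path(log_dir)
    if not root.is_dir():
        return result  # nothing generated yet
    for path in sorted(root.iterdir()):
        if path.is_file():
            text = path.read_text(errors="replace")
            result[path.name] = text.splitlines()[-lines:]
    return result


if __name__ == "__main__":
    for name, tail in tail_logs().items():
        print(f"== {name} ==")
        print("\n".join(tail))
```

Run it from the same directory where the training was launched.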