Installing NCCL in a Docker container fails with the error: Failed to init nccl communicator for group, init nccl communicator for group nccl_world_group

Reposted article. 2021-07-17 13:14. Tags: Inspur computing platform / MindSpore (deep learning computing framework)

=================================================================

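After installing NCCL inside a Docker container, test whether the installation works:

Use nccl-tests, the official test tool provided by NVIDIA.

Download mirror for users in mainland China:

https://gitee.com/devilmaycry812839668/nccl-tests

After downloading, compile it: make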

If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME.

Similarly, if NCCL is not installed in /usr, you may specify NCCL_HOME.

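By default, CUDA is installed in: /usr/local/cuda

By default, NCCL is installed in: /usr

If CUDA and NCCL are not in the default locations but were installed manually somewhere else, pass their paths to make: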

$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

CUDA_HOME is the CUDA installation path.

NCCL_HOME is the NCCL installation path.
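Before running make with custom paths, it can help to confirm that the two directories actually contain the headers the build needs. The following is a minimal sketch, not part of nccl-tests: check_build_paths is a hypothetical helper, and it assumes the headers sit at include/cuda_runtime.h and include/nccl.h under the respective install paths.

```shell
#!/usr/bin/env bash
# Sketch: verify that the CUDA and NCCL headers exist before building nccl-tests.
# check_build_paths is a hypothetical helper, not part of nccl-tests.
# Assumption: headers live at <CUDA_HOME>/include/cuda_runtime.h
# and <NCCL_HOME>/include/nccl.h.

check_build_paths() {
    local cuda_home="${1:-/usr/local/cuda}"   # default CUDA location
    local nccl_home="${2:-/usr}"              # default NCCL location
    local status=0

    if [ ! -f "$cuda_home/include/cuda_runtime.h" ]; then
        echo "cuda_runtime.h not found under $cuda_home/include" >&2
        status=1
    fi
    if [ ! -f "$nccl_home/include/nccl.h" ]; then
        echo "nccl.h not found under $nccl_home/include" >&2
        status=1
    fi
    # 0 when both headers were found, 1 otherwise
    return "$status"
}
```

A possible use is: check_build_paths /path/to/cuda /path/to/nccl && make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl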

After make finishes, run a simple example to check whether NCCL was installed successfully.

Quick examples

Run on 8 GPUs (-g 8), scanning from 8 Bytes to 128MBytes :

$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Run with MPI on 40 processes (potentially on multiple nodes) with 4 GPUs each :

$ mpirun -np 40 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

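(This example assumes that OpenMPI is already installed on the system.)

The part of the example commands we need to change is -g: with one GPU use -g 1, with four GPUs use -g 4.

Note that if a GPU does not have enough free memory (because other processes have already filled it), you need to set an environment variable: export CUDA_VISIBLE_DEVICES="0,1,2,3"

CUDA_VISIBLE_DEVICES specifies which GPUs may be used for the test; change the number after -g to match.

Suppose GPU 1 has no free memory left; then set export CUDA_VISIBLE_DEVICES="0,2,3"

Then run: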

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

mpirun -np 40 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
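Since the value after -g has to match the number of GPUs listed in CUDA_VISIBLE_DEVICES, the count can be derived from the variable instead of being typed by hand. This is a minimal sketch; count_visible_gpus is a hypothetical helper, and it assumes the variable holds a plain comma-separated list.

```shell
#!/usr/bin/env bash
# Sketch: derive the -g value from a comma-separated device list.
# count_visible_gpus is a hypothetical helper, not part of nccl-tests.

count_visible_gpus() {
    local list="${1:-}"
    # An empty list means no devices were selected.
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    # Put each entry on its own line and count the non-empty lines.
    printf '%s\n' "$list" | tr ',' '\n' | grep -c .
}
```

With export CUDA_VISIBLE_DEVICES="0,2,3" in place, the single-process test could then be started as: ./build/all_reduce_perf -b 8 -e 128M -f 2 -g "$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")"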

===========================================================

After installing NCCL in the container, testing with nccl-tests reports the error:

Failed to init nccl communicator for group

init nccl communicator for group nccl_world_group

78244:78465 [0] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78466 [1] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78465 [0] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78466 [1] NCCL INFO Call to connect returned Connection timed out, retrying

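The most likely cause is that the container was created without the option that allows shared memory: --ipc=host

Solution:

Recreate the container, adding this option when creating it: --ipc=host

For example: sudo docker run -it --ipc=host ************************** (other arguments omitted here)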
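To see whether an existing container was created with --ipc=host before recreating it, the IPC mode can be read from the host with docker inspect. The sketch below is only illustrative: container_ipc_mode and is_host_ipc are hypothetical helper names, and my_container in the usage note is a placeholder for your own container name.

```shell
#!/usr/bin/env bash
# Sketch: check from the host whether a container uses the host IPC namespace.
# container_ipc_mode and is_host_ipc are hypothetical helpers.

container_ipc_mode() {
    # Prints the IPC mode Docker recorded for the container,
    # for example "host" or "private".
    docker inspect --format '{{.HostConfig.IpcMode}}' "$1"
}

is_host_ipc() {
    # Succeeds only when the given mode string is exactly "host".
    [ "$1" = "host" ]
}
```

A possible use is: is_host_ipc "$(container_ipc_mode my_container)" || echo "recreate the container with --ipc=host"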
