Error: ncclCommInitRank failed.


Environment

  • 4 GeForce GTX 1080 GPUs
  • Docker image nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0

Code

  • Pull the image from the nnabla-ext-cuda-multi-gpu repository: docker pull nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0
  • Start a container: docker run -it --rm --gpus all nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0
  • Add test.py:
import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context
extension_module = "cudnn"
ctx = get_extension_context(extension_module)
comm = C.MultiProcessCommunicator(ctx)
comm.init()
print(f'size={comm.size}, rank={comm.rank}')
  • Running mpiexec -np 4 python test.py raises an exception. (The exception only occurs when more than 2 GPUs are used.)

Bug

The following exception is raised:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    comm.init()
  File "communicator.pyx", line 121, in nnabla.communicator.Communicator.init
RuntimeError: target_specific error in init
/home/gitlab-runner/builds/g9zRZKFe/2/nnabla/builders/all/nnabla-ext-cuda/src/nbla/cuda/communicator/multi_process_data_parallel_communicator.cu:358
ncclCommInitRank failed.

Set NCCL_DEBUG=INFO to see the details: mpiexec -np 4 -x NCCL_DEBUG=INFO python test.py

...
0db89117f3b2:87:87 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
0db89117f3b2:87:87 [2] NCCL INFO include/shm.h:41 -> 2

0db89117f3b2:87:87 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6d2dacd576938b74-0-3-2 (size 9637888)
...

The log shows that there is no shared memory left, yet nvidia-smi shows that GPU memory is barely used.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 27%   30C    P8     5W / 180W |    815MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:02:00.0 Off |                  N/A |
| 27%   33C    P8     6W / 180W |      4MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 00000000:03:00.0 Off |                  N/A |
| 28%   35C    P8     5W / 180W |      4MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 00000000:04:00.0  On |                  N/A |
| 28%   34C    P8     6W / 180W |      4MiB /  8118MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
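
The NCCL warning refers to /dev/shm, a tmpfs backed by host RAM, not to GPU memory, so nvidia-smi cannot show the shortage. A minimal sketch for checking the space from inside the container follows, assuming Python 3 and a Linux container; the helper name shm_usage_mib is made up for this illustration.

```python
import os
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, used, free) for the filesystem holding `path`, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total / mib, usage.used / mib, usage.free / mib


if __name__ == "__main__":
    # Fall back to the current directory on systems without /dev/shm.
    path = "/dev/shm" if os.path.isdir("/dev/shm") else "."
    total, used, free = shm_usage_mib(path)
    print(f"{path}: total={total:.0f} MiB, used={used:.0f} MiB, free={free:.0f} MiB")
```

Running this in a container started without extra options should report a total close to Docker's default /dev/shm size.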

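Cause

The exception occurs because NCCL cannot create its shared memory files in /dev/shm. Docker's default /dev/shm size is 64 MB, which is too small, so when more than 2 GPUs are used NCCL reports that there is no space left.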
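
A rough back-of-the-envelope check shows how tight the default is. The sketch below assumes every segment has the size reported in the NCCL warning above (9637888 bytes); how many segments NCCL actually allocates per rank is not stated in the log, so this only illustrates how few of them fit. The function name is made up for this illustration.

```python
# Docker's default /dev/shm size, taken as 64 MiB.
DEFAULT_SHM_BYTES = 64 * 1024 * 1024
# Segment size reported in the NCCL warning (assumed the same for all segments).
SEGMENT_BYTES = 9637888


def segments_that_fit(shm_bytes, segment_bytes=SEGMENT_BYTES):
    """Return how many whole segments of `segment_bytes` fit in `shm_bytes`."""
    return shm_bytes // segment_bytes


print(segments_that_fit(DEFAULT_SHM_BYTES))   # 6
print(segments_that_fit(256 * 1024 * 1024))   # 27
```

Under that assumption, only 6 such segments fit in the default /dev/shm, while a 256 MiB /dev/shm has room for 27.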

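Solution

There are three options:

  • In /etc/nccl.conf or ~/.nccl.conf, add NCCL_SHM_DISABLE=1; if the NCCL version is older than 2.7, also set NCCL_P2P_LEVEL=SYS. See the NCCL configuration documentation for details.
  • Mount the host's /dev/shm into the container, i.e. docker run -v /dev/shm:/dev/shm ..., but this leaves stale files on the host.
  • Increase the container's shared memory size at run time, i.e. docker run --shm-size=256m ...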
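
The configuration-file option can be sketched as a small script that writes the settings as KEY=VALUE lines, the format NCCL's configuration file uses. The function name write_nccl_conf and the temporary demo path are made up for this illustration; for real use the target would be ~/.nccl.conf or /etc/nccl.conf.

```python
import os
import tempfile


def write_nccl_conf(path, settings):
    """Write NCCL settings to `path`, one KEY=VALUE pair per line.

    Note: this overwrites any existing file at `path`.
    """
    with open(path, "w") as f:
        for key, value in settings.items():
            f.write(f"{key}={value}\n")


if __name__ == "__main__":
    # The demo writes to a temporary file so that it has no side effects.
    demo_path = os.path.join(tempfile.mkdtemp(), "nccl.conf")
    write_nccl_conf(demo_path, {"NCCL_SHM_DISABLE": "1"})
    with open(demo_path) as f:
        print(f.read(), end="")
```

Since the reproduction above already passes NCCL_DEBUG with mpiexec -x, the same setting could presumably also be passed per run as -x NCCL_SHM_DISABLE=1 instead of through a file.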
