Training hangs just before this line:
torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=2, rank=args.local_rank)
Pressing Ctrl+C at that point prints the following errors:
torch.distributed.elastic.multiprocessing.api.SignalException: Process 214426 got signal: 2
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 214465 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 214466 closing signal SIGINT
^CWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 214465 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 214466 closing signal SIGTERM
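For context, these messages come from the elastic launcher rather than the training script itself, so the hang above corresponds to a launch along these lines (train.py is a placeholder for the actual script; set --nproc_per_node to the number of GPUs):

torchrun --nproc_per_node=2 train.py
# or, with the older launcher that passes --local_rank to the script:
python -m torch.distributed.launch --nproc_per_node=2 train.py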
Solution
Answers found online all say to add:
os.environ['MASTER_ADDR'] = '127.0.0.1'
# os.environ['MASTER_PORT'] = '62222'  # not needed for single-machine multi-GPU; setting it can keep training from starting
torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=2, rank=args.local_rank)
In practice, single-machine multi-GPU training does not need the 'MASTER_PORT' variable at all; commenting it out is enough. Alternatively, use a different initialization method:
torch.distributed.init_process_group(backend='gloo', init_method='file:///home/user/switch.txt', world_size=2, rank=args.local_rank)
This initialization method rendezvouses through a shared file, so it needs no network or host environment variables such as MASTER_ADDR and MASTER_PORT.
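As a minimal end-to-end sketch of the file-based fix (assumptions not from the original post: the rendezvous path /tmp/ddp_init, the placeholder nn.Linear model, and reading LOCAL_RANK as a fallback for torchrun):

import argparse
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank; torchrun sets LOCAL_RANK instead
    parser.add_argument('--local_rank', type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    args = parser.parse_args()

    # File-based rendezvous: no MASTER_ADDR/MASTER_PORT required.
    # /tmp/ddp_init is an arbitrary writable path; every process must see the same file.
    dist.init_process_group(backend='gloo',
                            init_method='file:///tmp/ddp_init',
                            world_size=2, rank=args.local_rank)

    torch.cuda.set_device(args.local_rank)
    model = nn.Linear(10, 10).cuda(args.local_rank)  # placeholder model
    model = DDP(model, device_ids=[args.local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == '__main__':
    main()

One caveat: delete the rendezvous file between runs; a stale file left by a previous job can make initialization hang again.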