单机单卡

1.判断卡存在
torch.cuda.is_available()
2.数据拷贝到GPU
data.cuda()
3.模型拷贝到GPU
model.cuda()
4.加载的时候，需要map_location参数设置加载到哪个GPU
torch.load(path, map_location= torch.device("cuda:0")) #可以是cpu,cuda, cuda:idx

单机多卡

方法一：torch.nn.DataParallel（单进程效率慢）

1.只需要加1行model=torch.nn.DataParallel(model.cuda(), device_ids=[0,1,2,3])
2.需要主要模型保存的时候需要是 torch.save(model.modules.state_dict()),单卡是torch.save(model.state_dict())
3.加载的时候，torch.load需要map_location参数设置加载到哪个GPU（同单GPU）
4.注意Batch_size为所有GPU的Batch_size总和

方法二：torch.nn.parallel.DistributedDataParallel（多进程多卡）

1、初始化进程组 torch.distributed.init_process_group("nccl"，world_size=n_gpu，rank=args.local_rank) # 第一参数nccl为GPU通信方式, world_size为当前机器GPU个数,rank为当前进程在哪个PGU上
2、设置进程使用第几张卡 torch.cuda.set_device(args.local_rank)
3.对模型进行包裹 model=torch.nn.DistributedDataParallel(model.cuda(args.local_rank), device_ids=[args.local_rank]), 这里device_ids传入一张卡即可，因为是多进程多卡，一个进程一个卡
4、将数据分配到不同的GPU train_sampler = torch.util.data.distributed.DistributedSampler(train_dataset) # train_dataset为Dataset()
5.将train_sampler传入到DataLoader中,不需要传入shuffle=True,因为shuffle和sampler互斥 data_dataloader = DataLoader(..., sampler=train_sampler)
6.数据拷贝到GPU data = data.cuda(args.local_rank)
注意：
7.在每个epoch开始时候，需要调用train_sampler.set_epoch(epoch)使得数据充分打乱，要不然每个epoch返回数据是相同的
8.模型保存 torch.save在local_rank=0的位置保存，torch.save(model.modules.state_dict())
9.加载的时候，torch.load需要map_location参数设置加载到哪个GPU（同单GPU）
10.执行命令的时候需加入-m torch.distributed.launch参数，nproc_per_node执行进程个数/GPU个数，launch会像train.py传入args.local_rank,local_rank从0到n_gpus - 1个索引
python -m torch.distributed.launch --nproc_per_node=n_gpus train.py
11、launch会像train.py传入args.local_rank,local_rank从0到n_gpus - 1个索引，train.py需要接受local_rank的参数
12.注意Batch_size为每个GPU的Batch_size

多机多卡

1、代码编写和单机多卡的DDP一致
2、执行的时候需要在多个机器上执行命令（以2个节点为例，每个节点n_gpus 个GPU）,--nnodes有几个机器, --node-rank当前机器是第几个
python -m torch.distributed.launch --nproc_per_node=n_gpus train.py --nnodes=2 --node-rank=0 --master_addr="主节点IP" --master_port="主节点端口" train.py
python -m torch.distributed.launch --nproc_per_node=n_gpus train.py --nnodes=2 --node-rank=1 --master_addr="主节点IP" --master_port="主节点端口" train.py

免责声明！

本站转载的文章为个人学习借鉴使用，本站对版权不负任何法律责任。如果侵犯了您的隐私权益，请联系本站邮箱yoyou2525@163.com删除。

猜您在找 tensorflow 单机多GPU训练时间比单卡更慢/没有很大时间上提升《无忧行》——一个软件实现苹果单卡手机双卡双待（仅支持移动）单机多GPU训练报错 torch单机多卡重点： pytorch单机多卡训练 Pytorch使用单机多卡训练使用pytorch的DistributedParallel进行单机多卡训练 TensorFlow分布式部署【单机多卡】（转）PyTorch DDP模式单机多卡训练 pytorch单机多卡并行计算示例

GPU, 单机单卡， 多机多卡