Single machine, single GPU

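1. Check whether a GPU is available
torch.cuda.is_available()
2. Copy the data to the GPU
data = data.cuda()  # Tensor.cuda() returns a copy, so keep the result
3. Copy the model to the GPU
model.cuda()
4. When loading a checkpoint, use the map_location argument to choose which device it is loaded onto
torch.load(path, map_location=torch.device("cuda:0"))  # can be "cpu", "cuda", or "cuda:idx"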
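The four steps above can be sketched end to end as follows. This is a minimal sketch, not a full training script: the nn.Linear toy model, the random input, and the temporary checkpoint path are assumptions made for illustration, and the code falls back to the CPU when no GPU is present.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Step 1: use the first GPU if one is present, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Steps 2-3: move the (toy) model and data to the chosen device.
model = nn.Linear(4, 2).to(device)    # stand-in for a real network
data = torch.randn(8, 4).to(device)   # Tensor.to() returns a copy; keep the result
output = model(data)

# Step 4: save the weights, then reload them onto the chosen device.
ckpt_path = os.path.join(tempfile.mkdtemp(), "model.pt")  # hypothetical path
torch.save(model.state_dict(), ckpt_path)
state_dict = torch.load(ckpt_path, map_location=device)
model.load_state_dict(state_dict)
```

Using .to(device) instead of .cuda() keeps the same script usable on machines without a GPU; on a GPU machine the two calls behave the same way for this purpose.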

Single machine, multiple GPUs

Method 1: torch.nn.DataParallel (single process, relatively slow)

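1. Only one extra line is needed: model = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])
2. Note that the wrapped model should be saved with torch.save(model.module.state_dict(), path), so that the keys carry no "module." prefix and the checkpoint loads into an unwrapped model; on a single GPU it is torch.save(model.state_dict(), path)
3. When loading, torch.load needs the map_location argument to choose the target GPU (same as the single-GPU case)
4. Note that batch_size is the total batch size across all GPUs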
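The reason for saving through model.module can be seen by comparing the state_dict keys of the wrapper and of the wrapped model. The sketch below assumes a toy nn.Linear model and lets DataParallel pick up every visible GPU (device_ids left at its default); without a GPU, DataParallel simply runs the wrapped module directly.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

base = nn.Linear(4, 2).to(device)   # toy model, stands in for a real network
model = nn.DataParallel(base)       # device_ids defaults to all visible GPUs

# The batch passed here is the total batch; DataParallel splits it across GPUs.
batch = torch.randn(16, 4).to(device)
out = model(batch)

# Keys of the wrapper are prefixed with "module."; keys of the inner model are not.
wrapped_keys = list(model.state_dict().keys())
plain_keys = list(model.module.state_dict().keys())
print(wrapped_keys)
print(plain_keys)
```

A checkpoint written from model.module.state_dict() can be loaded into the same network on a single GPU or on the CPU without renaming any keys.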

Method 2: torch.nn.parallel.DistributedDataParallel (multiple processes, one per GPU)

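1. Initialize the process group: torch.distributed.init_process_group("nccl", world_size=n_gpu, rank=args.local_rank)  # the first argument, "nccl", is the GPU communication backend; world_size is the total number of processes (on a single machine, the number of GPUs); rank is the index of this process (on a single machine it equals local_rank, i.e. which GPU the process uses)
2. Select the GPU used by this process: torch.cuda.set_device(args.local_rank)
3. Wrap the model: model = torch.nn.parallel.DistributedDataParallel(model.cuda(args.local_rank), device_ids=[args.local_rank]). device_ids takes a single GPU here, because each process drives exactly one GPU
4. Shard the data across the GPUs: train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)  # train_dataset is a Dataset
5. Pass train_sampler to the DataLoader. Do not pass shuffle=True, because shuffle and sampler are mutually exclusive: data_dataloader = DataLoader(..., sampler=train_sampler)
6. Copy the data to the GPU: data = data.cuda(args.local_rank)
Notes:
7. At the start of every epoch, call train_sampler.set_epoch(epoch) so that the data is reshuffled; otherwise every epoch yields the data in the same order
8. Save the model only in the process with local_rank 0: torch.save(model.module.state_dict(), path)
9. When loading, torch.load needs the map_location argument to choose the target GPU (same as the single-GPU case)
10. Start the script with -m torch.distributed.launch; --nproc_per_node is the number of processes to start, one per GPU (recent PyTorch releases deprecate this launcher in favor of torchrun, which exposes the rank through the LOCAL_RANK environment variable)
python -m torch.distributed.launch --nproc_per_node=n_gpus train.py
11. launch passes a local_rank argument to train.py, with values from 0 to n_gpus - 1, so train.py must accept that argument
12. Note that batch_size is the batch size of each GPU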
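The steps and notes above can be assembled into one train.py along the following lines. This is a sketch under stated assumptions rather than a reference implementation: the toy nn.Linear model, the random TensorDataset, the SGD settings, and the "model.pt" filename are all made up for illustration. The helper names parse_local_rank, build_loader, and train are hypothetical. The parser accepts both --local_rank and --local-rank and falls back to the LOCAL_RANK environment variable, since the spelling depends on the launcher and PyTorch version. init_process_group is called without world_size and rank, on the assumption that the launcher has exported RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.

```python
import argparse
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def parse_local_rank(argv=None):
    """Read local_rank from the command line, else from the LOCAL_RANK env var."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args, _ = parser.parse_known_args(argv)
    return args.local_rank


def build_loader(dataset, batch_size, num_replicas=None, rank=None):
    """Steps 4-5: shard the dataset; None means 'ask the process group'."""
    sampler = DistributedSampler(dataset, num_replicas=num_replicas, rank=rank)
    # No shuffle=True here: DataLoader rejects shuffle together with a sampler.
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler


def train(local_rank, epochs=2, per_gpu_batch_size=4):
    # Steps 1-2: join the process group and bind this process to its GPU.
    dist.init_process_group(backend="nccl")  # rank/world_size come from env vars
    torch.cuda.set_device(local_rank)

    # Step 3: wrap the (toy) model; one process drives exactly one GPU.
    model = DDP(nn.Linear(4, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 2))  # toy data
    loader, sampler = build_loader(dataset, per_gpu_batch_size)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # note 7: reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)  # step 6
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    if dist.get_rank() == 0:  # note 8: only one process writes the checkpoint
        torch.save(model.module.state_dict(), "model.pt")  # hypothetical filename
    dist.destroy_process_group()


# Only start training when a launcher has set up the distributed environment.
if __name__ == "__main__" and "RANK" in os.environ:
    train(parse_local_rank())
```

Passing num_replicas and rank explicitly to build_loader makes it possible to inspect the sharding in a single process: for the same epoch, the index lists of the ranks do not overlap and together cover the whole dataset.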

Multiple machines, multiple GPUs

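1. The code is the same as for single-machine multi-GPU DDP, except that the rank given to init_process_group must be the global rank (the launcher exports it as the RANK environment variable) rather than local_rank, and world_size must be the total number of processes on all machines
2. Run the launch command on every machine (the example uses 2 nodes with n_gpus GPUs each). --nnodes is the number of machines and --node_rank is the index of the current machine. All launcher options must come before train.py
python -m torch.distributed.launch --nproc_per_node=n_gpus --nnodes=2 --node_rank=0 --master_addr="<master node IP>" --master_port="<master node port>" train.py
python -m torch.distributed.launch --nproc_per_node=n_gpus --nnodes=2 --node_rank=1 --master_addr="<master node IP>" --master_port="<master node port>" train.py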
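Because the two commands differ only in the node index, they can be generated from one template. The helper below is a small illustrative sketch using only the standard library; the function name launch_command, the address 192.168.1.10, the port 29500, and the value 4 for n_gpus are assumptions, not values from these notes.

```python
import shlex


def launch_command(node_rank, nnodes, nproc_per_node, master_addr, master_port,
                   script="train.py"):
    """Build the launcher command line for one node."""
    args = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,  # the script comes last; anything after it is passed to the script
    ]
    return shlex.join(args)


# One command per node; run each on its own machine.
commands = [
    launch_command(node_rank, nnodes=2, nproc_per_node=4,
                   master_addr="192.168.1.10", master_port=29500)
    for node_rank in range(2)
]
for command in commands:
    print(command)
```

With 2 nodes and 4 processes per node, the resulting job has 8 processes in total, which is the world_size seen by every process.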
