Recording a bug hit during PyTorch multi-GPU training.
The error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 30; expected version 29 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
This only happens with multi-GPU training; everything runs fine on a single GPU.
Following a tip found online, I wrapped the failing code in a `with torch.autograd.set_detect_anomaly(True):` block. With anomaly detection enabled, PyTorch prints the stack trace of the forward operation whose gradient computation failed; in my case it pointed at BatchNorm.
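A minimal sketch of what that looks like (the toy model, inputs, and loss here are stand-ins for the real training step):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and batch.
model = nn.Sequential(nn.Linear(8, 512), nn.BatchNorm1d(512), nn.Linear(512, 2))
inputs = torch.randn(4, 8)
targets = torch.randint(0, 2, (4,))
criterion = nn.CrossEntropyLoss()

# Anomaly mode records a traceback for every forward op, so when backward()
# fails it also prints where the offending tensor was created.
with torch.autograd.set_detect_anomaly(True):
    loss = criterion(model(inputs), targets)
    loss.backward()  # a failure here now reports the forward-op stack trace
```

Anomaly detection adds significant overhead, so only enable it while debugging.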
Searching turned up a solution: https://discuss.pytorch.org/t/ddp-sync-batch-norm-gradient-computation-modified/82847/5
The fix is to pass broadcast_buffers=False when constructing the DistributedDataParallel wrapper.
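The likely reason this helps: with the default broadcast_buffers=True, DDP re-broadcasts module buffers (e.g. BatchNorm's running_mean/running_var) from rank 0 at the start of every forward pass, and that in-place write bumps the version counters that autograd checks during backward. A sketch of the change, assuming the script is launched with torchrun (which sets the LOCAL_RANK environment variable):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumption: launched via torchrun, which sets LOCAL_RANK per process.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(8, 512), nn.BatchNorm1d(512)).cuda(local_rank)
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    # Default is True: DDP re-broadcasts buffers (BatchNorm running_mean /
    # running_var) from rank 0 on every forward, an in-place write that can
    # trip autograd's version check. False keeps each rank's buffers local.
    broadcast_buffers=False,
)
```

Trade-off: with broadcast_buffers=False, each rank keeps its own BatchNorm running statistics. If they need to stay synchronized across ranks, converting the model with torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) before wrapping it in DDP is the usual route; the linked thread discusses that combination too.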