one of the variables needed for gradient computation has been modified by an inplace operation


A record of a bug I hit during PyTorch multi-GPU training.
The error message was:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 30; expected version 29 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

This only occurred with multi-GPU training; single-GPU runs were completely fine.

Following suggestions found online, I first wrapped the failing code in a `with torch.autograd.set_detect_anomaly(True):` block. With anomaly detection enabled, PyTorch prints the forward-pass stack trace of the operation that failed to compute its gradient; in my case it pointed at a BatchNorm layer.
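A minimal, self-contained sketch of how anomaly detection helps here (this is not the author's model; `exp()` is just a stand-in for any op whose saved tensor gets modified in place):

```python
import torch

def locate_inplace_error():
    """Reproduce the in-place version-counter error and return the exception."""
    x = torch.ones(3, requires_grad=True)
    with torch.autograd.set_detect_anomaly(True):
        y = x.exp()   # exp() saves its output tensor for the backward pass
        y.add_(1)     # in-place edit bumps the saved tensor's version counter
        try:
            y.sum().backward()
        except RuntimeError as e:
            # With anomaly detection on, PyTorch also prints a warning
            # traceback pointing at the forward op (here, exp) whose
            # saved tensor was modified.
            return e
    return None

print(locate_inplace_error())
```

The same wrapper around a real training step makes the printed traceback point at the layer (for me, BatchNorm) whose buffer was modified between forward and backward.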

Searching turned up a solution here: https://discuss.pytorch.org/t/ddp-sync-batch-norm-gradient-computation-modified/82847/5

The fix is to pass `broadcast_buffers=False` when constructing the DistributedDataParallel wrapper, so DDP stops re-broadcasting module buffers (such as BatchNorm running statistics) from rank 0 on every forward pass — that broadcast is the in-place write that invalidates the saved tensors.
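A minimal sketch of where the flag goes, using a single-process `gloo` group on CPU so it runs anywhere (a real multi-GPU run would launch one process per GPU; the two-layer model is a placeholder, not the author's network):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model():
    # Single-process process group purely for demonstration.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    if not dist.is_initialized():
        dist.init_process_group("gloo", rank=0, world_size=1)
    model = torch.nn.Sequential(torch.nn.Linear(4, 4),
                                torch.nn.BatchNorm1d(4))
    # broadcast_buffers=False: DDP no longer overwrites buffers
    # (e.g. BatchNorm running mean/var) in place on each forward,
    # which is what triggered the version-counter error.
    return DDP(model, broadcast_buffers=False)

ddp = build_ddp_model()
out = ddp(torch.randn(8, 4))
out.sum().backward()  # backward completes without the in-place error
```

Note the trade-off: with the flag off, each rank keeps its own buffer values, so BatchNorm statistics are no longer synchronized across processes (SyncBatchNorm is the usual alternative if synchronized statistics matter).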

