[PyTorch][Continuously Updated] A Roundup of PyTorch Pitfalls


  1. BatchNorm layers fail when a training batch contains only 1 sample
File "/home/user02/wildkid1024/haq/models/mobilenet.py", line 71, in forward
    x = self.features(x)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user02/wildkid1024/haq/lib/utils/utils.py", line 244, in lambda_forward
    return m.old_forward(x)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/functional.py", line 1619, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])

Problem analysis: The model uses batch normalization, and during batch training the final batch can end up with a single sample. For example, with 17 samples in the dataset and batch_size=8, the last batch holds only 1 sample; BatchNorm cannot compute statistics from one value per channel in training mode, so it raises this error.
Solutions: 1. Set the DataLoader's drop_last parameter to True to discard the incomplete final batch. 2. Manually trim the dataset so that no batch of size 1 remains. 3. If this occurs during validation, call model.eval() so the BN layers use their running statistics instead of batch statistics. 4. If training genuinely must run with 1 sample per batch, replace BatchNorm with InstanceNorm.
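Solution 1 above can be sketched as follows, assuming a hypothetical toy TensorDataset standing in for the real data (the 17-samples/batch_size-8 numbers come from the example in the analysis):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 17 samples with batch_size=8 leaves a final batch of
# 1 sample, which breaks BatchNorm in training mode.
dataset = TensorDataset(torch.randn(17, 3))

# drop_last=True discards the incomplete final batch.
loader = DataLoader(dataset, batch_size=8, drop_last=True)
batch_sizes = [batch[0].shape[0] for batch in loader]
print(batch_sizes)  # [8, 8] -- the lone 17th sample is dropped
```

Without drop_last=True the loader would yield a third batch of size 1 and any BatchNorm layer in training mode would raise the ValueError shown in the traceback.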

  2. Autograd called without marking any tensor as differentiable
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Problem analysis: requires_grad=True was never set on the tensors involved, so no gradient-tracking computation graph exists; when backward() tries to take derivatives, there is no grad_fn to propagate through.
Solution: Check whether model.eval() or torch.no_grad() is in effect and remove them if so; otherwise, set requires_grad=True on the input (note the attribute is requires_grad, not required_grad).
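A minimal sketch of the failure and the fix, using a hypothetical standalone tensor rather than the original model's input:

```python
import torch

# Without requires_grad, backward() has no graph to traverse and raises
# the RuntimeError from the traceback above.
x = torch.randn(3)
try:
    x.sum().backward()
except RuntimeError as e:
    print("fails:", e)  # element 0 of tensors does not require grad ...

# Marking the input as differentiable builds a graph backward() can use.
x = torch.randn(3, requires_grad=True)
loss = (2 * x).sum()
loss.backward()
print(x.grad)  # each element receives gradient 2
```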

  3. After one epoch, training cannot continue: the DataLoader hangs, or the program exits with a non-zero status
    Problem analysis: This is related to PyTorch's multi-process data loading; worker processes can deadlock while reading data.
    Solutions: 1. Check whether data loading uses cv2.imread; prefer reading with PIL's Image instead, or disable OpenCV's internal parallelism with cv2.setNumThreads(0) and cv2.ocl.setUseOpenCL(False). 2. Set num_workers=0, at the cost of slower data loading. If you do not want to set it to 0, set pin_memory=True so pinned memory is pre-allocated.
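Solution 2 above can be sketched with the two DataLoader configurations side by side, assuming a hypothetical toy TensorDataset (the pin_memory trade-off follows the claim in the text; pin_memory itself mainly speeds up host-to-GPU copies rather than preventing deadlocks):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(32, 3))

# Safest option: num_workers=0 loads data in the main process,
# sidestepping worker deadlocks at the cost of throughput.
safe_loader = DataLoader(dataset, batch_size=8, num_workers=0)

# If multiple workers are kept, pin_memory=True pre-allocates
# page-locked host memory for faster host-to-GPU transfers.
fast_loader = DataLoader(dataset, batch_size=8, num_workers=2, pin_memory=True)

print(sum(1 for _ in safe_loader))  # 4 batches of 8
```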

