The new framework is semseg, by hs-z: https://github.com/hszhao/semseg
1. Installing apex fails with: fatal error: gnu-crypt.h: No such file or directory
The root cause is a problem with the pip distribution of cryptacular; running conda install cryptacular instead fixes it.
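2. pip install keeps installing into a different virtual environment
This happens because the pip being invoked does not belong to the current virtual environment. conda install targets the active environment by default, but pip does not necessarily do so.
So run whereis pip to locate the pip executable that belongs to the environment you want, then run pip install through its absolute path.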
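As a complement to looking up pip's absolute path, here is a minimal sketch of a general Python technique (not something taken from the notes above): running pip as a module of a specific interpreter ties the install to that interpreter's environment. The helper name pip_command is hypothetical.

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip command bound to the interpreter running this script."""
    # `python -m pip` uses the pip module of this exact interpreter,
    # so packages land in this interpreter's environment.
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    # `pip --version` reports the location of the pip package being used.
    result = subprocess.run(
        pip_command("--version"), capture_output=True, text=True
    )
    print(result.stdout.strip() or result.stderr.strip())
```

Run it with the python of the environment you care about; the printed path should point inside that environment.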
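3. About the pip and conda mirrors
As of today, June 10, 2019, the Tsinghua conda mirror has been shut down over copyright issues, while the Tsinghua pip mirror still works normally.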
4. ModuleNotFoundError: No module named 'yaml'
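The fix is
conda install pyaml
(The yaml module itself is provided by the PyYAML package, so conda install pyyaml installs it directly.)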
5. TypeError: Class advice impossible in Python3. Use the @implementer class decorator instead
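First, switch the active CUDA version so that it matches the CUDA version PyTorch was built with.
Then uninstall any previously installed apex.
Then run: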
git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
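With several virtual environments things can get mixed up. In that case, run
whereis pip
to find the pip executable of your current virtual environment, then invoke pip by its absolute path.
The same goes for the final apex install step above: the python/pip executable can be given by its absolute path, which guarantees the install lands in the right place.
Do not install apex with the following command: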
git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
The reason may be that I was inside a conda virtual environment.
Reference: https://github.com/NVIDIA/apex/issues/214#issuecomment-476399539
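The first step of item 5 (matching the active CUDA toolkit to PyTorch's CUDA version) can be checked with a small script. This is a sketch under assumptions: it assumes nvcc is on PATH and that its version banner contains the text "release X.Y"; the helper names are hypothetical. torch.version.cuda is the attribute used elsewhere in these notes.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def toolkit_cuda_version():
    """Return the CUDA release of the nvcc on PATH, or None if absent."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(out.stdout)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the comparison
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    print("nvcc CUDA: ", toolkit_cuda_version())
    print("torch CUDA:", torch_cuda)
```

If the two printed versions differ, the apex build is compiling against a different toolkit than the one PyTorch was built with.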
6. No error is reported at all, not even under pdb. The process just vanishes at x = self.layer0(x)
Reducing the batch size and the input image size fixed it.
7. Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
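If this warning persists even though apex was installed correctly, the cause lies deeper.
According to https://discuss.pytorch.org/t/undefined-symbol-when-import-lltm-cpp-extension/32627/2
there are two fixes:
- build cpp extensions with -D_GLIBCXX_USE_CXX11_ABI=1
- build pytorch with -D_GLIBCXX_USE_CXX11_ABI=0
However, I do not know how to pass extra compiler flags to the apex build. I tried the export approach suggested there, namely: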
export CFLAGS="-D_GLIBCXX_USE_CXX11_ABI=1 $CFLAGS"
and then rebuilt apex, but during compilation the flag was still 0; the export had no effect.
Finally, based on this passage:
The best way to solve this problem in any case is to compile Pytorch from source and use that same compiler for the extension. Then all problems go away.
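I decided to just build both PyTorch and apex from source on this machine.
Building PyTorch from source turned out to be very hard... I hit a pile of bugs I could not find fixes for, and in the end decided to simply reinstall PyTorch.
This time I installed PyTorch through a different route than before.
Previously I used: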
conda install pytorch torchvision cudatoolkit=10.0
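To speed up the download, I had not wanted to pull from the official PyTorch channel and used the default conda channel instead.
However, checking in Python with: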
torch._C._GLIBCXX_USE_CXX11_ABI
returned True, which corresponds to
-D_GLIBCXX_USE_CXX11_ABI=1
and does not meet the requirement.
This time I used the official PyTorch channel:
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
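Surprisingly, this time the value was
-D_GLIBCXX_USE_CXX11_ABI=0
The official PyTorch channel is the reliable one... the default conda channel carries a copy of the official build, but it is not updated promptly enough.
After that, everything worked.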
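The ABI check from item 7 can be wrapped into a small script that prints the compiler define matching the installed PyTorch build. A minimal sketch: abi_define is a hypothetical helper, and torch._C._GLIBCXX_USE_CXX11_ABI is the same private attribute queried in the notes, so it may change between PyTorch versions.

```python
def abi_define(use_cxx11_abi):
    """Map the boolean ABI flag to the matching compiler define."""
    return "-D_GLIBCXX_USE_CXX11_ABI={}".format(int(bool(use_cxx11_abi)))


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        # Same private attribute as queried in the notes.
        print(abi_define(torch._C._GLIBCXX_USE_CXX11_ABI))
```

Running this after each reinstall shows right away whether the build you got uses ABI 0 or 1, before spending time compiling apex against it.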
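To summarize:
1. Official sources are the most reliable. Do not cut corners, and do not keep trying to compile things yourself either; that only causes more problems.
2. When you hit a problem, search the issues of the relevant GitHub repository. This matters a lot. In particular, search in English; Chinese results are all second-hand information.
3. Google's search really is powerful; use Google whenever possible!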
8. ValueError: batch_size should be a positive integer value, but got batch_size=0
In the config file, the setting
batch_size_val: 8 # batch size for validation during training, memory and speed tradeoff
has to be set to the same value as the number of GPUs, although I do not know why.
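One plausible explanation, stated as an assumption rather than something confirmed in these notes: if the training script divides the configured batch size by the number of GPUs per node in distributed mode and truncates to an integer, any value smaller than the GPU count becomes 0. The function below is a hypothetical illustration of that arithmetic, not the script's actual code.

```python
def per_process_batch_size(batch_size, ngpus_per_node):
    """Batch size each process gets if the total is split evenly per GPU."""
    # Assumed behaviour: divide, then truncate to an integer.
    return int(batch_size / ngpus_per_node)


if __name__ == "__main__":
    for total in (4, 8, 16):
        print(total, "->", per_process_batch_size(total, 8))
```

Under this assumption, batch_size_val must be at least the number of GPUs; setting it equal to the GPU count gives each process a validation batch of 1.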
9. cv2.error: OpenCV(4.1.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'
data_root is wrong, so no data was read.
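cv2.imread returns None instead of raising when a file cannot be read, which is why the failure only surfaces later inside cvtColor. A minimal sketch for catching this earlier: check that the listed files exist under data_root before training. missing_files and the example file names are hypothetical.

```python
import os
import tempfile


def missing_files(data_root, relative_paths):
    """Return the entries of relative_paths that do not exist under data_root."""
    return [
        rel for rel in relative_paths
        if not os.path.isfile(os.path.join(data_root, rel))
    ]


if __name__ == "__main__":
    # Demo with a throwaway directory standing in for data_root.
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "a.jpg"), "wb").close()
        print(missing_files(root, ["a.jpg", "b.jpg"]))
```

In practice the relative paths would come from the dataset's list file; a non-empty result means data_root or the list is wrong.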
10. Exception: process 0 terminated with signal SIGSEGV
Out of memory.
No fix so far.
I could only turn off distributed training; it runs for now, but very slowly.
11. pip downloads are slow
On Linux, edit ~/.pip/pip.conf (create it if it does not exist) and point index-url at TUNA, with the following content:
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
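The same file can be written programmatically, which helps when setting up several machines. A minimal sketch using the standard configparser module; write_pip_conf is a hypothetical helper, and the demo writes to a temporary directory so it does not touch your real configuration.

```python
import configparser
import os
import tempfile


def write_pip_conf(path, index_url):
    """Write a pip.conf whose [global] section sets index-url."""
    config = configparser.ConfigParser()
    if os.path.exists(path):
        config.read(path)  # keep any settings already present
    if not config.has_section("global"):
        config.add_section("global")
    config.set("global", "index-url", index_url)
    os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
    with open(path, "w") as handle:
        config.write(handle)


if __name__ == "__main__":
    # Demo against a throwaway directory; for real use, pass
    # os.path.expanduser("~/.pip/pip.conf") as the path.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "pip.conf")
        write_pip_conf(target, "https://pypi.tuna.tsinghua.edu.cn/simple")
        with open(target) as handle:
            print(handle.read())
```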
12. Check which CUDA version PyTorch was built with
import torch
print(torch.version.cuda)
13. libSM.so.6: cannot open shared object file: No such file or directory
pip install opencv-python-headless
# also contrib, if needed
pip install opencv-contrib-python-headless
14. fatal error: gnu-crypt.h: No such file or directory
This appeared while installing apex. Run
conda install cryptacular
and then install apex.
15. OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
Traceback (most recent call last):
  File "tool/train.py", line 456, in <module>
    main()
  File "tool/train.py", line 106, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGABRT
I hit this while trying to set up the environment again on a new server.
Fix:
Add this line to train.sh:
export KMP_INIT_AT_FORK=FALSE
train.sh then looks like this:

#!/bin/sh
PARTITION=gpu
PYTHON=python

dataset=$1
exp_name=$2
exp_dir=exp/${dataset}/${exp_name}
model_dir=${exp_dir}/model
result_dir=${exp_dir}/result
config=config/${dataset}/${dataset}_${exp_name}.yaml
now=$(date +"%Y%m%d_%H%M%S")

mkdir -p ${model_dir} ${result_dir}
cp tool/train.sh tool/train.py ${config} ${exp_dir}

export PYTHONPATH=./
export KMP_INIT_AT_FORK=FALSE
#sbatch -p $PARTITION --gres=gpu:8 -c16 --job-name=train \
$PYTHON -u tool/train.py \
  --config=${config} \
  2>&1 | tee ${model_dir}/train-$now.log
Reference: https://github.com/ContinuumIO/anaconda-issues/issues/11294
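When the training job is launched from Python rather than from train.sh, the same variables can be passed through the child process environment. A minimal sketch: training_env is a hypothetical helper, and the child command here is only a stand-in that echoes the variable.

```python
import os
import subprocess
import sys


def training_env(base_env=None):
    """Copy the environment and add the variables train.sh exports."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONPATH"] = "./"
    env["KMP_INIT_AT_FORK"] = "FALSE"
    return env


if __name__ == "__main__":
    # Stand-in child process that just echoes the variable; replace the
    # command with the real training invocation.
    subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['KMP_INIT_AT_FORK'])"],
        env=training_env(),
        check=True,
    )
```

Setting the variable in the child's environment before it starts mirrors what the export line does for the shell script.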
16. RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1556653114079/work/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled cuda error
Many sources (for example https://github.com/pytorch/pytorch/issues/23534)
say this is an NCCL version problem, so I printed the version (on the command line, before launching python):
export NCCL_DEBUG=VERSION
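Then I ran the program and saw that my version is 2.4.8.
At that point the error in the title no longer appeared...
So this error is only a symptom that masks the real error.
But after running for a while the job died again, and rerunning it produced the same error... very strange.
To print more debug information, I added: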
export NCCL_DEBUG=info
and found this error:
Cuda failure 'an illegal memory access was encountered'
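There is no universal fix for this; everyone's case is a little different.
I suddenly noticed that the error occurs every time the process for GPU3 starts initializing. After I stopped using GPU3, the error went away... Could the hardware be broken?
After further checking, I confirmed that the problem only occurs on GPU3. The error differs from run to run, so I record them here.
Error record 1: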

Traceback (most recent call last): File "tool/train.py", line 480, in <module> main() File "tool/train.py", line 115, in main main_worker(args.train_gpu, args.ngpus_per_node, args) File "tool/train.py", line 290, in main_worker loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch) File "tool/train.py", line 335, in train output, main_loss, aux_loss = model(input, target) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward return self.module(*inputs[0], **kwargs[0]) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 93, in forward x = self.layer4(x_tmp) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward input = module(input) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward out = self.conv2(out) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward return self.conv2d_forward(input, self.weight) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward self.padding, self.dilation, self.groups) 
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Error record 2:

Traceback (most recent call last): File "tool/train.py", line 480, in <module> main() File "tool/train.py", line 115, in main main_worker(args.train_gpu, args.ngpus_per_node, args) File "tool/train.py", line 290, in main_worker loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch) File "tool/train.py", line 335, in train output, main_loss, aux_loss = model(input, target) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward return self.module(*inputs[0], **kwargs[0]) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 92, in forward x_tmp = self.layer3(x) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward input = module(input) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward out = self.conv2(out) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward return self.conv2d_forward(input, self.weight) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward self.padding, self.dilation, self.groups) 
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1568696969690/work/aten/src/THC/THCGeneral.cpp:216
For error record 1,
I followed https://github.com/qqwweee/keras-yolo3/issues/332#issuecomment-517989338
but after installing it, it still did not help.
For error record 2, following https://github.com/huggingface/transfer-learning-conv-ai/issues/10#issuecomment-496111466
I added export CUDA_LAUNCH_BLOCKING=1
Below is error record 3:

Traceback (most recent call last): File "tool/train.py", line 480, in <module> main() File "tool/train.py", line 115, in main main_worker(args.train_gpu, args.ngpus_per_node, args) File "tool/train.py", line 290, in main_worker loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch) File "tool/train.py", line 335, in train output, main_loss, aux_loss = model(input, target) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward return self.module(*inputs[0], **kwargs[0]) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 90, in forward x = self.layer1(x) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward input = module(input) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 90, in forward residual = self.downsample(x) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward input = module(input) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__ result = self.forward(*input, **kwargs) File 
"/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward return self.conv2d_forward(input, self.weight) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward self.padding, self.dilation, self.groups) RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
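After adding the following line it worked for a while:
torch.backends.cudnn.benchmark = True
It worked for ten minutes and then broke again, with the same error as record 3...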
==================================================================================================================
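I was not sure whether the third card was still the culprit, so this time I used all the other cards and trained for longer to see.
Training on all the other cards together ran for a long time without any problem.
Two conclusions:
1. Using only the third card causes problems
2. With a single card, training takes the plain, ordinary path and does not trigger any multi-threading, multi-processing, or distributed code
Next I will try to see whether this can be fixed in software; if not, I can only blame a broken GPU, or a problem with the server.
Some people say it is a cuDNN version problem, so I first disabled cuDNN: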
torch.backends.cudnn.enabled = False
With cuDNN disabled, running on the third card alone did work.
Then I tested all cards with cuDNN disabled. After 20 iterations it failed with:

[2019-09-29 20:03:08,046 INFO train.py line 404 98898] Epoch: [44/200][20/186] Data 0.001 (0.106) Batch 1.274 (1.504) Remain 12:11:15 MainLoss 0.1086 AuxLoss 0.1171 Loss 0.1555 Accuracy 0.9622. terminate called after throwing an instance of 'c10::Error' what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1568696969690/work/c10/cuda/CUDACachingAllocator.cpp:569) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ffa46ff5477 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x17044 (0x7ffa47231044 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) frame #2: <unknown function> + 0x1cccb (0x7ffa47236ccb in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7ffa46fe2e8d in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so) frame #4: <unknown function> + 0x1c2789 (0x7ffa7892f789 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #5: <unknown function> + 0x445d2b (0x7ffa78bb2d2b in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x445d61 (0x7ffa78bb2d61 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #7: <unknown function> + 0x1a184f (0x558f4667b84f in /home/lzx/.conda/envs/seg/bin/python) frame #8: <unknown function> + 0xfd1a8 (0x558f465d71a8 in /home/lzx/.conda/envs/seg/bin/python) frame #9: <unknown function> + 0x10e3c7 (0x558f465e83c7 in /home/lzx/.conda/envs/seg/bin/python) frame #10: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python) frame #11: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python) frame #12: <unknown function> + 0xf5777 (0x558f465cf777 in 
/home/lzx/.conda/envs/seg/bin/python) frame #13: <unknown function> + 0xf57e3 (0x558f465cf7e3 in /home/lzx/.conda/envs/seg/bin/python) frame #14: <unknown function> + 0xf5766 (0x558f465cf766 in /home/lzx/.conda/envs/seg/bin/python) frame #15: <unknown function> + 0x1db5e3 (0x558f466b55e3 in /home/lzx/.conda/envs/seg/bin/python) frame #16: _PyEval_EvalFrameDefault + 0x2a5a (0x558f466a7e4a in /home/lzx/.conda/envs/seg/bin/python) frame #17: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python) frame #18: _PyEval_EvalFrameDefault + 0x6a3 (0x558f466a5a93 in /home/lzx/.conda/envs/seg/bin/python) frame #19: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python) frame #20: _PyEval_EvalFrameDefault + 0x416 (0x558f466a5806 in /home/lzx/.conda/envs/seg/bin/python) frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python) frame #22: _PyFunction_FastCallKeywords + 0x387 (0x558f4663df57 in /home/lzx/.conda/envs/seg/bin/python) frame #23: _PyEval_EvalFrameDefault + 0x14dc (0x558f466a68cc in /home/lzx/.conda/envs/seg/bin/python) frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python) frame #25: PyEval_EvalCodeEx + 0x44 (0x558f465ef424 in /home/lzx/.conda/envs/seg/bin/python) frame #26: PyEval_EvalCode + 0x1c (0x558f465ef44c in /home/lzx/.conda/envs/seg/bin/python) frame #27: <unknown function> + 0x22ab74 (0x558f46704b74 in /home/lzx/.conda/envs/seg/bin/python) frame #28: PyRun_StringFlags + 0x7d (0x558f4670fddd in /home/lzx/.conda/envs/seg/bin/python) frame #29: PyRun_SimpleStringFlags + 0x3f (0x558f4670fe3f in /home/lzx/.conda/envs/seg/bin/python) frame #30: <unknown function> + 0x235f3d (0x558f4670ff3d in /home/lzx/.conda/envs/seg/bin/python) frame #31: _Py_UnixMain + 0x3c (0x558f467102bc in /home/lzx/.conda/envs/seg/bin/python) frame #32: __libc_start_main + 0xf0 (0x7ffa91705830 in /lib/x86_64-linux-gnu/libc.so.6) 
frame #33: <unknown function> + 0x1db062 (0x558f466b5062 in /home/lzx/.conda/envs/seg/bin/python) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) Traceback (most recent call last): File "tool/train.py", line 488, in <module> main() File "tool/train.py", line 114, in main mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args)) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn while not spawn_context.join(): File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join raise Exception(msg) Exception: -- Process 3 terminated with the following error: Traceback (most recent call last): File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap fn(i, *args) File 
"/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 298, in main_worker loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch) File "/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 351, in train scaled_loss.backward() File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/tensor.py", line 120, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward allow_unreachable=True) # allow_unreachable flag RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
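But why does the final error still blame cuDNN? Wasn't it disabled?
According to https://blog.csdn.net/qq_39938666/article/details/86611474
the Python version may be the problem, so I tried the same version they used, 3.6.6.
I recreated the environment with Python=3.6.6; still broken...
Every other card works and only GPU03 fails, which means the code is fine and the problem is the card itself, or some incompatibility.
I tried swapping the order of the cards, but the error always points at GPU02. (It used to be GPU03; for some reason it has since always been GPU02. Could it be a problem with the power connector?)
The current error (single card, GPU02) is cuda runtime error (77) : an illegal memory access was encountered
I am trying to upgrade CUDA to 10.1; it is currently 10.0.
That did not help either... I will just use 7 cards for now.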
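Excluding one suspect card can be done with the CUDA_VISIBLE_DEVICES environment variable, which restricts which devices a process can see. A minimal sketch, assuming 8 GPUs indexed 0 to 7 (the notes mention falling back to 7 cards) and that index 3 is the one to skip; visible_devices is a hypothetical helper.

```python
import os


def visible_devices(total_gpus, excluded):
    """Build a CUDA_VISIBLE_DEVICES value that skips the excluded indices."""
    skip = set(excluded)
    return ",".join(str(i) for i in range(total_gpus) if i not in skip)


if __name__ == "__main__":
    # Must be set before CUDA is initialised in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(8, [3])
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that a training script which sets CUDA_VISIBLE_DEVICES itself from its config would override this, in which case the GPU list in the config is the place to drop the card.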
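Update, October 3
I more or less solved this today, by accident.
The main reference was https://github.com/pytorch/pytorch/issues/22050#issuecomment-521030783
I recognize this person's avatar: an NVIDIA employee who often answers questions on the PyTorch forum.
He said that besides conda, you should also try installing with pip.
So I used pip to install PyTorch into a different environment. After that, neither the other environment nor the old one had the problem any more.
Essentially, cuDNN was probably overwritten by what pip installed. Everyone says cuDNN is the problem; maybe the version that comes with the pip package happens to be fine.
The command he gave is
pip3 install torch torchvision
However, as everyone knows, the packages on the pip index may get updated, so here is what I actually downloaded: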
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple Collecting torch Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB) |████████████████████████████████| 748.9MB 68kB/s Collecting torchvision Downloading https://pypi.tuna.tsinghua.edu.cn/packages/06/e6/a564eba563f7ff53aa7318ff6aaa5bd8385cbda39ed55ba471e95af27d19/torchvision-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB) |████████████████████████████████| 8.8MB 1.0MB/s Requirement already satisfied: numpy in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torch) (1.17.2) Requirement already satisfied: pillow>=4.1.1 in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (6.1.0) Requirement already satisfied: six in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (1.12.0) Installing collected packages: torch, torchvision Successfully installed torch-1.2.0 torchvision-0.4.0
For reference only.
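Since the packages behind pip3 install torch torchvision change over time, it can help to record the versions actually installed next to each experiment. A minimal sketch using importlib.metadata, which requires Python 3.8 or newer (newer than the environments in these notes); installed_versions is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "torchvision"]).items():
        print(name, version)
```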