Pitfalls encountered while reconfiguring a semantic segmentation experiment environment


The new framework is semseg, by hszhao: https://github.com/hszhao/semseg

1. Installing apex fails with "fatal error: gnu-crypt.h: No such file or directory"

The root cause is a broken cryptacular package on the pip index; installing it with conda instead (conda install cryptacular) fixes it.

 

2. pip install keeps installing packages into a different virtual environment

This happens because the pip currently on PATH does not belong to the active virtual environment. conda install defaults to the active environment, but pip does not necessarily do so.

Run whereis pip to locate the pip executable that belongs to the active environment, then invoke pip install through that absolute path.
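
A sketch of an alternative that sidesteps the PATH ambiguity entirely (a general tip, not from the semseg repo): drive pip through the interpreter of the environment you are actually running.

import subprocess
import sys

# sys.executable is the Python binary of the currently active environment, so
# "python -m pip" is guaranteed to install into that environment's site-packages.
print(sys.executable)
subprocess.check_call([sys.executable, "-m", "pip", "install", "pyyaml"])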

 

3. On pip and conda mirrors

As of June 10, 2019, the Tsinghua (TUNA) conda mirror has been shut down over licensing issues, while the Tsinghua pip mirror still works normally.

 

4. ModuleNotFoundError: No module named 'yaml'

The fix is:

conda install pyaml

(The yaml module itself is provided by PyYAML, which pyaml pulls in as a dependency; conda install pyyaml also works.)

 

 

5. TypeError: Class advice impossible in Python3.  Use the @implementer class decorator instead

First, switch the active CUDA version so that it matches the CUDA version your PyTorch build expects.

Then uninstall any previously installed apex.

Then:

git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

 

With multiple virtual environments things can get mixed up. In that case run

whereis pip

to locate the pip executable belonging to the active environment, and run the install through that absolute path.

Likewise, the python in the final apex install step can be invoked by its absolute path to guarantee the package ends up in the right environment.

 

Do not install apex with the following command:

git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext

The reason is probably that I am working inside a conda virtual environment.

 

Reference: https://github.com/NVIDIA/apex/issues/214#issuecomment-476399539
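
A quick sanity check after reinstalling (a sketch; amp_C and apex_C are the extension modules apex builds of that era, so they only import after a successful --cpp_ext/--cuda_ext build in the right environment):

# If these imports succeed, the compiled extensions landed in the environment this
# interpreter belongs to; if they fail, apex falls back to its pure-Python code
# paths and prints the "installed without --cpp_ext" warning (see item 7 below).
import amp_C      # fused CUDA kernels built by --cuda_ext
import apex_C     # C++ helpers built by --cpp_ext
from apex import amp
print("apex C++/CUDA extensions are importable")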

 

6. No error message at all, not even under pdb; the process just vanishes at x = self.layer0(x)

Using a smaller batch size and smaller input images makes it go away (which suggests the process was simply being killed for running out of memory).

 

7. Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.

If apex was installed correctly and this warning still keeps appearing, the problem runs deep.

 

According to https://discuss.pytorch.org/t/undefined-symbol-when-import-lltm-cpp-extension/32627/2

there are two possible fixes:

  1. build cpp extensions with -D_GLIBCXX_USE_CXX11_ABI=1.
  2. build pytorch with -D_GLIBCXX_USE_CXX11_ABI=0

But I did not know how to pass extra compile flags to the apex build. Following the export approach they suggest, i.e.:

export CFLAGS="-D_GLIBCXX_USE_CXX11_ABI=1 $CFLAGS"

and then rebuilding apex, the flag still ended up as 0 during compilation; the export had no effect.

 

In the end, following this advice:

The best way to solve this problem in any case is to compile Pytorch from source and use that same compiler for the extension. Then all problems go away.

 

I decided to just build both PyTorch and apex from source on this machine.

 

It turned out that building PyTorch from source is genuinely hard... I ran into a pile of bugs I could not find fixes for, so in the end I decided to reinstall a prebuilt PyTorch.

This time I installed PyTorch through a different route than before:

 

Previously the command was:

conda install pytorch torchvision cudatoolkit=10.0

To speed up the download I avoided the official pytorch channel and pulled the package from the default conda channel instead.

However, checking in Python with:

torch._C._GLIBCXX_USE_CXX11_ABI

the result was True, i.e. the binary was built with

-D_GLIBCXX_USE_CXX11_ABI=1

which does not meet the requirement.
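
For reference, the full check looks like this (torch._C._GLIBCXX_USE_CXX11_ABI is an internal attribute, but it is the usual way to see which C++ ABI a PyTorch binary was built with):

import torch

# True  -> the binary was built with -D_GLIBCXX_USE_CXX11_ABI=1
# False -> the binary was built with -D_GLIBCXX_USE_CXX11_ABI=0
# Extensions such as apex must be compiled with the matching value, otherwise
# you get undefined-symbol errors or silent fallbacks to the Python paths.
print(torch.__version__)
print(torch._C._GLIBCXX_USE_CXX11_ABI)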

 

This time I used the official pytorch channel:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Magically, this time the flag came back as

-D_GLIBCXX_USE_CXX11_ABI=0

The official PyTorch channel is the reliable one... the default conda channel carries the official builds, but they are not updated promptly.

 

After that everything worked...

 

To sum up:

1. The official sources are the reliable ones. Do not cut corners, and do not keep trying to build things yourself; that only creates more problems.

2. When you hit a problem, search the issues of the corresponding GitHub repository, and search in English. Chinese-language results are mostly second-hand information.

3. Google search really is that much better; use Google whenever possible.

 

 

8. ValueError: batch_size should be a positive integer value, but got batch_size=0 

In the config file,

batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff

needs to be set to the same value as the number of GPUs, though I did not know why at the time. (The likely reason: the training script divides this value by the number of GPUs to get a per-process batch size, so anything smaller than the GPU count becomes 0 after integer division.)
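
A minimal sketch of that suspicion (hypothetical numbers; the division mirrors what multiprocessing-distributed training scripts of this kind typically do):

# A configured value smaller than the GPU count collapses to 0 after the usual
# per-process division, and DataLoader then raises the batch_size error above.
batch_size_val = 8            # value from the yaml config
ngpus_per_node = 16           # hypothetical machine with more GPUs than that
per_process_bs = int(batch_size_val / ngpus_per_node)
print(per_process_bs)         # 0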

 

9. cv2.error: OpenCV(4.1.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

data_root is wrong, so no image was actually read from disk.
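
A quick way to confirm it (a sketch with a hypothetical path):

import cv2

# cv2.imread returns None instead of raising when the path does not exist, so
# the failure only surfaces later inside cvtColor as "!_src.empty()".
path = "/path/to/data_root/images/example.png"   # hypothetical; take one from your list file
img = cv2.imread(path)
assert img is not None, f"could not read {path} -- check data_root in the config"
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)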

 

 

10. Exception: process 0 terminated with signal SIGSEGV

Out of memory.

No real fix yet.

The only workaround for now is to disable distributed training; it then runs, but slowly.

 

11. Slow pip downloads

On Linux, edit ~/.pip/pip.conf (create it if it does not exist) and point index-url at the TUNA mirror:

 [global]
 index-url = https://pypi.tuna.tsinghua.edu.cn/simple
 

12. Checking which CUDA version PyTorch was built against

import torch
print(torch.version.cuda)

 

13. libSM.so.6: cannot open shared object file: No such file or directory

https://stackoverflow.com/questions/47113029/importerror-libsm-so-6-cannot-open-shared-object-file-no-such-file-or-directo

pip install opencv-python-headless
# also contrib, if needed
pip install opencv-contrib-python-headless

 

 

14. fatal error: gnu-crypt.h: No such file or directory

This appeared while installing apex (the same issue as item 1). Use

conda install cryptacular

and then install apex again.

 

15.  OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).

 

OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.

Traceback (most recent call last):
  File "tool/train.py", line 456, in <module>
    main()
  File "tool/train.py", line 106, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGABRT

I hit this while trying to set up the environment again on a new server.

 

Fix:

Add this line to train.sh:

export KMP_INIT_AT_FORK=FALSE

 

train.sh then looks like this:

#!/bin/sh
PARTITION=gpu
PYTHON=python

dataset=$1
exp_name=$2
exp_dir=exp/${dataset}/${exp_name}
model_dir=${exp_dir}/model
result_dir=${exp_dir}/result
config=config/${dataset}/${dataset}_${exp_name}.yaml
now=$(date +"%Y%m%d_%H%M%S")

mkdir -p ${model_dir} ${result_dir}
cp tool/train.sh tool/train.py ${config} ${exp_dir}

export PYTHONPATH=./
export KMP_INIT_AT_FORK=FALSE
#sbatch -p $PARTITION --gres=gpu:8 -c16 --job-name=train \
$PYTHON -u tool/train.py \
  --config=${config} \
  2>&1 | tee ${model_dir}/train-$now.log

 

Reference: https://github.com/ContinuumIO/anaconda-issues/issues/11294

 

 

 

16.  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1556653114079/work/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled cuda error

Many places (e.g. https://github.com/pytorch/pytorch/issues/23534)

say this is an NCCL version problem, so I printed the NCCL version (in the shell, before launching Python):

export NCCL_DEBUG=VERSION

Then I ran the program and saw that my NCCL version is 2.4.8.

At that point, the error in the title no longer appeared...

So this error is only a surface symptom that masks the real error.

 

But after running for a while it dropped out again, and re-running produced the same error... very strange.

After turning on more verbose debug output by adding

export NCCL_DEBUG=info

I found this error:

Cuda failure 'an illegal memory access was encountered'

 

There is no universal fix for this one; everyone's situation is a bit different.

I suddenly noticed that the error always appears exactly when the process on GPU 3 starts initializing. After excluding GPU 3, the error went away... could the hardware be broken?

After more testing I confirmed that the problem only occurs on GPU 3. Repeated runs give different errors; they are recorded below.

Error log 1:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 93, in forward
    x = self.layer4(x_tmp)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward
    out = self.conv2(out)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Error log 2:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 92, in forward
    x_tmp = self.layer3(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward
    out = self.conv2(out)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1568696969690/work/aten/src/THC/THCGeneral.cpp:216

 

For error log 1,

according to https://github.com/qqwweee/keras-yolo3/issues/332#issuecomment-517989338

the CUDA 10.0 patch should be installed: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

Installing it did not help.

 

For error log 2: following https://github.com/huggingface/transfer-learning-conv-ai/issues/10#issuecomment-496111466

I added export CUDA_LAUNCH_BLOCKING=1.

Error log 3 is below:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 90, in forward
    x = self.layer1(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 90, in forward
    residual = self.downsample(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Adding the following line made it work temporarily:

torch.backends.cudnn.benchmark = True

It worked for ten minutes and then broke again, with the same error as log 3...

 

 ==================================================================================================================

 

Not sure whether it is still the third card, so this time I trained on all the other cards for a longer period to see.

Training on all the other cards together ran for a long time without any problem.

 

Two conclusions:

1. Using only the third card causes problems.

2. With a single card the training reduces to the plain, ordinary path and never triggers the multi-threading, multi-process, or distributed code.

 

I will try once more to see whether this can be fixed at the software level; if not, it has to be blamed on a broken GPU, or a problem with the server.

 

Some people say it is a cuDNN version problem, so first disable cuDNN:

torch.backends.cudnn.enabled = False 

With cuDNN disabled, running on the third card alone does work.

 

Then I tested all cards together with cuDNN disabled. After 20 iterations it crashed:

[2019-09-29 20:03:08,046 INFO train.py line 404 98898] Epoch: [44/200][20/186] Data 0.001 (0.106) Batch 1.274 (1.504) Remain 12:11:15 MainLoss 0.1086 AuxLoss 0.1171 Loss 0.1555 Accuracy 0.9622.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1568696969690/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ffa46ff5477 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17044 (0x7ffa47231044 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1cccb (0x7ffa47236ccb in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7ffa46fe2e8d in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1c2789 (0x7ffa7892f789 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x445d2b (0x7ffa78bb2d2b in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x445d61 (0x7ffa78bb2d61 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x1a184f (0x558f4667b84f in /home/lzx/.conda/envs/seg/bin/python)
frame #8: <unknown function> + 0xfd1a8 (0x558f465d71a8 in /home/lzx/.conda/envs/seg/bin/python)
frame #9: <unknown function> + 0x10e3c7 (0x558f465e83c7 in /home/lzx/.conda/envs/seg/bin/python)
frame #10: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python)
frame #11: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python)
frame #12: <unknown function> + 0xf5777 (0x558f465cf777 in /home/lzx/.conda/envs/seg/bin/python)
frame #13: <unknown function> + 0xf57e3 (0x558f465cf7e3 in /home/lzx/.conda/envs/seg/bin/python)
frame #14: <unknown function> + 0xf5766 (0x558f465cf766 in /home/lzx/.conda/envs/seg/bin/python)
frame #15: <unknown function> + 0x1db5e3 (0x558f466b55e3 in /home/lzx/.conda/envs/seg/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x2a5a (0x558f466a7e4a in /home/lzx/.conda/envs/seg/bin/python)
frame #17: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x6a3 (0x558f466a5a93 in /home/lzx/.conda/envs/seg/bin/python)
frame #19: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x416 (0x558f466a5806 in /home/lzx/.conda/envs/seg/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0x387 (0x558f4663df57 in /home/lzx/.conda/envs/seg/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x14dc (0x558f466a68cc in /home/lzx/.conda/envs/seg/bin/python)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python)
frame #25: PyEval_EvalCodeEx + 0x44 (0x558f465ef424 in /home/lzx/.conda/envs/seg/bin/python)
frame #26: PyEval_EvalCode + 0x1c (0x558f465ef44c in /home/lzx/.conda/envs/seg/bin/python)
frame #27: <unknown function> + 0x22ab74 (0x558f46704b74 in /home/lzx/.conda/envs/seg/bin/python)
frame #28: PyRun_StringFlags + 0x7d (0x558f4670fddd in /home/lzx/.conda/envs/seg/bin/python)
frame #29: PyRun_SimpleStringFlags + 0x3f (0x558f4670fe3f in /home/lzx/.conda/envs/seg/bin/python)
frame #30: <unknown function> + 0x235f3d (0x558f4670ff3d in /home/lzx/.conda/envs/seg/bin/python)
frame #31: _Py_UnixMain + 0x3c (0x558f467102bc in /home/lzx/.conda/envs/seg/bin/python)
frame #32: __libc_start_main + 0xf0 (0x7ffa91705830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #33: <unknown function> + 0x1db062 (0x558f466b5062 in /home/lzx/.conda/envs/seg/bin/python)

/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "tool/train.py", line 488, in <module>
    main()
  File "tool/train.py", line 114, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 298, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 351, in train
    scaled_loss.backward()
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/tensor.py", line 120, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

But why does the final error still mention cuDNN when it was supposedly disabled? (One possible explanation: with mp.spawn each worker is a fresh process, so a flag set only in the parent process does not carry over unless it is also set inside main_worker.)

 

According to https://blog.csdn.net/qq_39938666/article/details/86611474

the Python version might be the problem, so I tried 3.6.6, the same version that author used.

I recreated the environment with Python 3.6.6; still broken...

Everything works on the other cards and only GPU 3 fails, which means the code is fine; it is the card itself, or some incompatibility.

 

I tried swapping the cards around, and the error then consistently pointed at GPU 2 instead. (It used to be GPU 3; after the swap it has, for reasons I do not understand, consistently been GPU 2. Could the power connector or the slot be the problem?)

 

The current error (single card, GPU 2) is: cuda runtime error (77) : an illegal memory access was encountered

 

Based on an answer at https://ethereum.stackexchange.com/questions/65652/error-cuda-mining-an-illegal-memory-access-was-encountered

I am trying to upgrade CUDA from 10.0 to 10.1.

 

That did not help either... for now I will just train on the remaining 7 cards.
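
For the record, one way to run on 7 cards without touching the code (a hedged workaround sketch, not from the repo; which index to skip depends on the misbehaving physical card) is to hide the suspect GPU before CUDA is initialized, either by exporting CUDA_VISIBLE_DEVICES in train.sh or at the very top of the entry script:

import os

# Must be set before CUDA is initialized; PyTorch then only sees the listed
# physical GPUs and renumbers them 0..6, so the suspect card is never touched.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3,4,5,6,7"   # hypothetical: skip the bad card

import torch
print(torch.cuda.device_count())   # 7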

 

 

Update, October 3

 

Today I more or less solved this problem, largely by accident.

The key reference was: https://github.com/pytorch/pytorch/issues/22050#issuecomment-521030783

I recognize that avatar; it belongs to an NVIDIA engineer who frequently answers questions on the PyTorch forums.

He suggested trying a pip install in addition to conda.

So I used pip to install PyTorch into a different environment, and after that both the new environment and the old one stopped showing the problem.

Essentially, the cuDNN being used was probably overridden by what pip installed. Everyone says this is a cuDNN problem; the version that pip ships apparently happens to be fine.

 

The command he gave is:

pip3 install torch torchvision

 

But as everyone knows, the packages on the index get updated over time, so here is exactly what was downloaded in my case:

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting torch
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB)
     |████████████████████████████████| 748.9MB 68kB/s 
Collecting torchvision
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/06/e6/a564eba563f7ff53aa7318ff6aaa5bd8385cbda39ed55ba471e95af27d19/torchvision-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB)
     |████████████████████████████████| 8.8MB 1.0MB/s 
Requirement already satisfied: numpy in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torch) (1.17.2)
Requirement already satisfied: pillow>=4.1.1 in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (6.1.0)
Requirement already satisfied: six in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (1.12.0)
Installing collected packages: torch, torchvision
Successfully installed torch-1.2.0 torchvision-0.4.0

 

For reference only.

 

