Problems encountered when training py-faster-rcnn on the VOC2007 dataset

This article is a repost. 2017-05-31 14:51, py-faster-rcnn

I ran into the following problems while training py-faster-rcnn on the VOC2007 dataset:

1. KeyError: 'chair'

File "/home/sai/py-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 217, in _load_pascal_annotation
cls = self._class_to_ind[obj.find('name').text.lower().strip()]
KeyError: 'chair'

Solution:

You probably need to add a few lines of code that skip any object whose class is not one of the classes you are looking for when the annotation is loaded in _load_pascal_annotation.

In the _load_pascal_annotation method, look for a line like
objs = diff_objs (or non_diff_objs)
After that line, insert something similar to the code below. The name is normalised with .lower().strip() so that it matches the lookup on line 217 shown in the traceback:

cls_objs = [obj for obj in objs if obj.find('name').text.lower().strip() in self._classes]
objs = cls_objs
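The filter above can be tried outside py-faster-rcnn. The following is a minimal sketch, not the repository's code: the class tuple and the sample XML are made up for illustration, and `filter_objects` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of classes being trained on (not from the original post).
CLASSES = ('__background__', 'person', 'car')

# Made-up annotation containing one class ("chair") that is not in CLASSES.
SAMPLE_XML = """<annotation>
  <object><name>person</name></object>
  <object><name> Chair </name></object>
  <object><name>Car</name></object>
</annotation>"""


def filter_objects(objs, classes):
    """Keep only <object> elements whose normalised name is a known class."""
    known = set(classes)
    return [o for o in objs if o.find('name').text.lower().strip() in known]


objs = ET.fromstring(SAMPLE_XML).findall('object')
kept = filter_objects(objs, CLASSES)
kept_names = [o.find('name').text.lower().strip() for o in kept]
print(kept_names)
```

With this sample, the "chair" object is dropped before any class-to-index lookup happens, which is the step that raised the KeyError.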


https://github.com/rbgirshick/py-faster-rcnn/issues/316

2. Check failed: error == cudaSuccess (30 vs. 0) unknown error

1. I'd just like to point out that CUDA runtime error (30) might show if your program is unable to create or open the /dev/nvidia-uvm device file. This is usually fixed by installing package nvidia-modprobe:

sudo apt-get install nvidia-modprobe

Note also that since your GPU has compute capability 2.1 as per this page you will not be able to use CuDNN and will need to disable support for CuDNN in the Caffe makefile.

2. The problem happened because modprobe could not insert nvidia_340_uvm.

Thus, I had to install nvidia_340_uvm via: sudo apt-get install nvidia-340-uvm.
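Regarding the second point: my graphics driver is not version 340, and after I ran the command the system crashed. Supposedly you only need to change the number to your own driver version. My driver is version 378, but after replacing 340 with 378, apt reported that it was unable to locate the package.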

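3. I found that I had not installed the CUDA samples, so I followed someone else's tutorial to install them. (The system and software versions in that tutorial differ from mine, so take care.)

Compile the CUDA Samples

Command:

cd /usr/local/cuda-6.5/samples
sudo make

After compilation finishes, change to the directory /samples/bin/x86_64/linux/release

Run the command:

./deviceQuery

Output: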

./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = Tesla K40c
Result = PASS

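If you see the output above, congratulations: the NVIDIA driver and CUDA have been installed successfully, and you can move on to the next step, installing the Caffe environment.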
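Since the earlier reply ties cuDNN support to the GPU's compute capability, it can help to pull that number out of the deviceQuery text. This is a rough sketch only: the helper names are hypothetical, and the 3.0 threshold is an assumption to check against the cuDNN release notes for your version.

```python
import re


def compute_capability(device_query_output):
    """Return (major, minor) parsed from deviceQuery text, or None if absent."""
    m = re.search(
        r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)",
        device_query_output,
    )
    return (int(m.group(1)), int(m.group(2))) if m else None


# Assumed minimum compute capability for cuDNN; verify for your cuDNN version.
MIN_CUDNN_CAPABILITY = (3, 0)


def cudnn_usable(capability, minimum=MIN_CUDNN_CAPABILITY):
    """True if a parsed capability tuple meets the assumed minimum."""
    return capability is not None and capability >= minimum


sample = "CUDA Capability Major/Minor version number: 3.5"
cap = compute_capability(sample)
print(cap, cudnn_usable(cap))
```

Under that assumption, the Tesla K40c in the listing (3.5) passes, while a 2.1 device like the one mentioned in the reply would not.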

(2.2.4.9) Verify that the NVIDIA driver and CUDA were installed successfully

Command to check the installed NVIDIA driver version:

cat /proc/driver/nvidia/version

Output:

NVRM version: NVIDIA UNIX x86_64 Kernel Module 340.96 Sun Nov 8 22:33:28 PST 2015
GCC version: gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1)
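The two checks discussed in this item (the driver version, and whether /dev/nvidia-uvm exists) can be scripted. A minimal sketch, with hypothetical function names, assuming the version file has the format shown above:

```python
import os
import re


def driver_version(text):
    """Extract the driver version from /proc/driver/nvidia/version contents."""
    m = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return m.group(1) if m else None


def uvm_device_present(path="/dev/nvidia-uvm"):
    """True if the UVM device file mentioned in the reply above exists."""
    return os.path.exists(path)


sample = ("NVRM version: NVIDIA UNIX x86_64 Kernel Module 340.96 "
          "Sun Nov 8 22:33:28 PST 2015")
print(driver_version(sample))
print(uvm_device_present())
```

On a real machine, pass the contents of /proc/driver/nvidia/version to `driver_version` instead of the sample string.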

3. I found someone who had run into the same problem, so I registered an NVIDIA account and asked them about it.

https://devtalk.nvidia.com/default/topic/987119/problem-with-run-cuda-on-geforce-gt-755m/#reply

https://github.com/NVIDIA/DIGITS/issues/1663

Hello,

I have a problem using a GeForce GTX 1080 Ti for machine learning (Caffe framework).
My platform:
ubuntu 16.04, Cuda V8.0.61, CuDNN8.0

I suspect my versions are too new and I have to downgrade.
Could you advise the best way to solve my problem?

More details follow:

nvcc warns about deprecation, but it is not an error and, as far as I know, it only concerns future releases.

nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release

Caffe and py-faster-rcnn install with no errors, but when training with py-faster-rcnn I receive the following message:

I0601 15:30:44.833746 28338 layer_factory.hpp:77] Creating layer input-data
I0601 15:30:44.834151 28338 net.cpp:106] Creating Layer input-data
I0601 15:30:44.834161 28338 net.cpp:411] input-data -> data
I0601 15:30:44.834169 28338 net.cpp:411] input-data -> im_info
I0601 15:30:44.834178 28338 net.cpp:411] input-data -> gt_boxes
F0601 15:30:44.852488 28338 syncedmem.hpp:18] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***

More outputs:

~/caffe#nvcc -V gives

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

~/caffe# nvidia-smi 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13                 Driver Version: 378.13                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 0000:01:00.0      On |                  N/A |
| 23%   39C    P8    17W / 250W |    578MiB / 11171MiB |     15%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1155    G   /usr/lib/xorg/Xorg                              20MiB |
|    0      1341    G   /usr/lib/xorg/Xorg                             262MiB |
|    0      1794    G   compiz                                          82MiB |
|    0      1939    G   fcitx-qimpanel                                   9MiB |
|    0      2015    G   ...el-token=3645A468136299F390B7B0886FE96671   173MiB |
+-----------------------------------------------------------------------------+

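4. Still the error from item 2. I saw someone say that running

sudo ./experiments/scripts/faster_rcnn_alt_opt.sh 0 VGG16 pascal_voc

fixes the problem, but when I tried it the following error appeared:

ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

The full output is as follows: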

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ NET=VGG16
+ NET_lc=vgg16
+ DATASET=pascal_voc
+ array=($@)
+ len=3
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case $DATASET in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ PT_DIR=pascal_voc
+ ITERS=40000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/faster_rcnn_alt_opt_VGG16_.txt.2017-06-01_14-56-06
+ exec
++ tee -a experiments/logs/faster_rcnn_alt_opt_VGG16_.txt.2017-06-01_14-56-06
+ echo Logging output to experiments/logs/faster_rcnn_alt_opt_VGG16_.txt.2017-06-01_14-56-06
Logging output to experiments/logs/faster_rcnn_alt_opt_VGG16_.txt.2017-06-01_14-56-06
+ ./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name VGG16 --weights data/imagenet_models/VGG16.v2.caffemodel --imdb voc_2007_trainval --cfg experiments/cfgs/faster_rcnn_alt_opt.yml
Traceback (most recent call last):
File "./tools/train_faster_rcnn_alt_opt.py", line 17, in <module>
from fast_rcnn.train import get_training_roidb, train_net
File "/home/jz/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 10, in <module>
import caffe
File "/home/jz/py-faster-rcnn/tools/../caffe-fast-rcnn/python/caffe/__init__.py", line 1, in <module>
from .pycaffe import Net, SGDSolver, NesterovSolver, AdaGradSolver, RMSPropSolver, AdaDeltaSolver, AdamSolver
File "/home/jz/py-faster-rcnn/tools/../caffe-fast-rcnn/python/caffe/pycaffe.py", line 13, in <module>
from ._caffe import Net, SGDSolver, NesterovSolver, AdaGradSolver, \
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

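5. Still the error from item 2. Someone said it was a graphics driver problem, so I opened NVIDIA X Server Settings to check whether I was actually using the NVIDIA driver. I found that on Ubuntu 16.04 there was no PRIME Profiles option, so I could not switch drivers with one click.

Someone said PRIME Profiles has to be installed separately, but after sudo apt-get install nvidia-prime the option was still missing from NVIDIA X Server Settings. I then remembered that I apparently had not disabled Ubuntu's built-in graphics driver after installing the NVIDIA driver. I checked in Software & Updates (the original post showed a screenshot here), changed the setting so that the device is not used, then blacklisted nouveau as follows and rebooted.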

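Add the nouveau driver to the blacklist

  $ sudo nano /etc/modprobe.d/blacklist-nouveau.conf

  Add the following lines to blacklist-nouveau.conf:
  blacklist nouveau
  blacklist lbm-nouveau
  options nouveau modeset=0
  alias nouveau off
  alias lbm-nouveau off

Disable the nouveau kernel module

  $ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf

  $ sudo update-initramfs -u

Reboot, then run
lsmod | grep nouveau

If it prints nothing, nouveau has been disabled successfully. After the reboot you should notice that the screen resolution has clearly changed. If it has not, the default driver may not have been disabled. Ubuntu 16.04 seems to differ from 14.04 here, so look up other ways to disable the driver.
After disabling it successfully, the error from item 2 still appeared.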
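The `lsmod | grep nouveau` check can also be done from a script by reading /proc/modules, which is the file lsmod formats. A minimal sketch with hypothetical helper names; the sample text is made up and only mimics the file's layout:

```python
def module_loaded(name, modules_text):
    """True if `name` is the first field of any line in /proc/modules-style text."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def read_proc_modules(path="/proc/modules"):
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


# Made-up sample in the layout of /proc/modules (name, size, refcount, ...).
sample = "nvidia 12345678 50 - Live 0x0\nnouveau 1500000 1 - Live 0x0\n"
print(module_loaded("nouveau", sample))
print(module_loaded("nouveau", read_proc_modules()))
```

Matching on the first field rather than a substring avoids a false positive from a module whose name merely contains "nouveau".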

6. Building on item 5, I tried running Caffe's MNIST example, without success.

Input: ./examples/mnist/train_lenet.sh

Output:

F0601 20:57:37.822069 3866 db_lmdb.hpp:15] Check failed: mdb_status == 0 (13 vs. 0) Permission denied
*** Check failure stack trace: ***
@ 0x7f9308c9a95d google::LogMessage::Fail()
@ 0x7f9308c9c6e0 google::LogMessage::SendToLog()
@ 0x7f9308c9a543 google::LogMessage::Flush()
@ 0x7f9308c9d0ae google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9309415428 caffe::db::LMDB::Open()
@ 0x7f93092b3b9f caffe::DataLayer<>::DataLayer()
@ 0x7f93092b3d32 caffe::Creator_DataLayer<>()
@ 0x7f9309468d90 caffe::Net<>::Init()
@ 0x7f930946b79e caffe::Net<>::Net()
@ 0x7f930944e865 caffe::Solver<>::InitTrainNet()
@ 0x7f930944fc55 caffe::Solver<>::Init()
@ 0x7f930944ff6f caffe::Solver<>::Solver()
@ 0x7f93094402b1 caffe::Creator_SGDSolver<>()
@ 0x40a9e8 train()
@ 0x4072e0 main
@ 0x7f9307c0b830 (unknown)
@ 0x407b09 _start
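The status code 13 in the LMDB check above is the operating system's "permission denied" errno, so one thing worth checking is whether the current user can read and write the database directory. A small sketch; the LMDB path is an assumption about where the MNIST data was created, and `describe_access` is a hypothetical helper:

```python
import errno
import os


def describe_access(path):
    """Report whether the current user can see, read and write `path`."""
    return {
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


# errno 13 is the symbolic code behind "Permission denied".
print(errno.errorcode[13])
# Assumed location of the MNIST training LMDB, relative to the Caffe root.
print(describe_access("examples/mnist/mnist_train_lmdb"))
```

If the directory turns out to be owned by root (for example because it was created with sudo), that would be consistent with the script failing as a normal user.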

Input: sudo ./examples/mnist/train_lenet.sh

Output:

error while loading shared libraries: libcudart.so.8.0: cannot open shared object file: No such file or directory

Solution: https://cgcvtutorials.wordpress.com/2016/10/14/error-while-loading-shared-libraries-libcudart-so-8-0-cannot-open-shared-object-file-no-such-file-or-directory/

sudo ldconfig /usr/local/cuda-8.0/lib64

After running sudo ./examples/mnist/train_lenet.sh again, MNIST trained successfully.

Reference links: https://devtalk.nvidia.com/default/topic/963814/cuda-setup-and-installation/cuda-8-libcudart-error/

https://askubuntu.com/questions/889015/cant-install-cuda-8-but-have-correct-nvidia-driver-ubuntu-16

https://github.com/tensorflow/tensorflow/issues/5343

http://blog.crboy.net/2012/05/solution-for-cannot-open-shared-object.html
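A quick way to see whether the dynamic loader can find libcudart in a given environment is to try opening it directly. This is a minimal sketch with a hypothetical helper name; running it once as a normal user and once under sudo may give different answers, since the two can see different library search paths.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


print(can_load("libcudart.so.8.0"))
```

If this prints False, then importing caffe in the same environment would be expected to fail with the ImportError shown earlier.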

7. With item 6 working, py-faster-rcnn training could now start.

sudo ./experiments/scripts/faster_rcnn_alt_opt.sh 0 VGG16 pascal_voc

8. After the model finished training successfully, another problem appeared at test time:

Input: python ./tools/demo.py

Output: I0602 20:04:00.106807 10014 net.cpp:413] Input 0 -> data

F0602 20:04:00.110862 10014 syncedmem.hpp:18] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
Aborted (core dumped)

Solution: I tried the following commands in order:

1. sudo ldconfig /usr/local/cuda-8.0/lib64 (no effect)

2. export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64 (no effect)

3. export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}} (I was inside the faster-rcnn directory, so this line had no effect; after changing the root path it still did not help)

4. export LD_LIBRARY_PATH=${CUDA_HOME}/lib64

export PATH=${CUDA_HOME}/bin:${PATH} (no effect)

5. export CUDA_HOME=/usr/local/cuda | export LD_LIBRARY_PATH=${CUDA_HOME}/lib64 (no effect; note that if typed literally with | in bash, each export runs in its own subshell, so neither variable is set in the current shell)

6. sudo ldconfig /usr/local/cuda/lib64

python ./tools/demo.py (no effect)

7. sudo python ./tools/demo.py

Error:

Cannot copy param 0 weights from layer 'bbox_pred'; shape mismatch. Source param shape is 8 4096 (32768); target param shape is 84 4096 (344064). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.

Solution: in the file '/home/jz/py-faster-rcnn/models/pascal_voc/VGG16/faster_rcnn_alt_opt/faster_rcnn_test.pt', change num_output of the bbox_pred layer to 8.

Running command 7 (sudo python ./tools/demo.py) again then produced the test results successfully.
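The 8 versus 84 in the shape mismatch follows from the number of classes: the box regression layer has four outputs per class, counting background. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def head_sizes(num_foreground_classes):
    """Output sizes of the two detection heads for K foreground classes."""
    num_classes = num_foreground_classes + 1  # plus the background class
    return {"cls_score": num_classes, "bbox_pred": 4 * num_classes}


print(head_sizes(20))  # the 20 PASCAL VOC classes
print(head_sizes(1))   # a single custom class
```

For 20 classes this gives 21 and 84, matching the target shape in the error. For one class it gives 2 and 8, matching the saved weights, so the cls_score layer in the same test prototxt presumably needs num_output 2 as well.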

9. Testing

sudo time ./tools/test_net.py --gpu 0 --def models/pascal_voc/VGG16/faster_rcnn_end2end/test.prototxt --net data/faster_rcnn_models/VGG16_end2end_ignore-difficult1.caffemodel

Final command: sudo ./experiments/scripts/faster_rcnn_end2end_test.sh 0 VGG16 pascal_voc

Reference: sudo ./tools/test_net.py --gpu 0 --def models/pascal_voc/VGG16/faster_rcnn_end2end/test.prototxt --net /home/jz/py-faster-rcnn/output/faster_rcnn_end2end/voc_2012_train/vgg16_faster_rcnn_iter_60000.caffemodel --imdb voc_2012_test --cfg experiments/cfgs/faster_rcnn_end2end.yml

Problem:

wrote gt roidb to /home/jz/py-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
Traceback (most recent call last):
File "./tools/test_net.py", line 90, in <module>
test_net(net, imdb, max_per_image=args.max_per_image, vis=args.vis)
File "/home/jz/py-faster-rcnn/tools/../lib/fast_rcnn/test.py", line 242, in test_net
roidb = imdb.roidb
File "/home/jz/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 67, in roidb
self._roidb = self.roidb_handler()
File "/home/jz/py-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 128, in selective_search_roidb
ss_roidb = self._load_selective_search_roidb(gt_roidb)
File "/home/jz/py-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 162, in _load_selective_search_roidb
'Selective search data not found at: {}'.format(filename)
AssertionError: Selective search data not found at: /home/jz/py-faster-rcnn/data/selective_search_data/voc_2007_test.mat
Command exited with non-zero status 1
1.39user 2.38system 0:06.47elapsed 58%CPU (0avgtext+0avgdata 1815028maxresident)k
1263144inputs+48outputs (589major+388691minor)pagefaults 0swaps

Solution: You can modify the following flag in "lib/fast_rcnn/config.py":

# Propose boxes
__C.TEST.HAS_RPN = True
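Editing the default is one option. The reference command above passes a YAML file with --cfg, and such a file can override the same flag without touching config.py. The sketch below is a simplified illustration of that kind of override using plain dictionaries; it is not the repository's merge code, and the default value and the override contents are assumptions.

```python
def merge_into(overrides, defaults):
    """Recursively copy values from `overrides` into `defaults`, in place."""
    for key, value in overrides.items():
        if key not in defaults:
            raise KeyError("{} is not a valid config key".format(key))
        if isinstance(value, dict) and isinstance(defaults[key], dict):
            merge_into(value, defaults[key])
        else:
            defaults[key] = value


# Assumed default: RPN proposals are off unless something enables them.
cfg = {"TEST": {"HAS_RPN": False}}
# Hypothetical parsed contents of a YAML file passed with --cfg.
yaml_overrides = {"TEST": {"HAS_RPN": True}}

merge_into(yaml_overrides, cfg)
print(cfg["TEST"]["HAS_RPN"])
```

With the flag off, the test code falls back to precomputed selective search proposals, which is where the missing .mat file in the traceback comes from.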

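Problems:

If KeyError: 'xxxxxxxxxx' appears at the end, delete $FRCN_ROOT/data/VOCdevkit2007/annotations_cache/annots.pkl
If you discover partway through that some data was labelled incorrectly, delete $FRCN_ROOT/data/cache/voc_2007_trainval_gt_roidb.pkl after relabelling the data.
If the final test fails with IndexError: too many indices for array, it is because some classes are missing from your test data. Following the error message, find the corresponding code ($FRCN_ROOT/lib/datasets/voc_eval.py, line 148) and put an if statement in front of it:
if len(BB) != 0: BB = BB[sorted_ind, :]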
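Deleting the two stale cache files named above is easy to forget after relabelling, so it can be wrapped in a small helper. This is a sketch only: `clear_caches` is a hypothetical name, and the demonstration runs on a temporary directory rather than a real checkout.

```python
import os
import tempfile

# Cache files named in the notes above, relative to $FRCN_ROOT.
STALE_CACHE_FILES = (
    os.path.join("data", "VOCdevkit2007", "annotations_cache", "annots.pkl"),
    os.path.join("data", "cache", "voc_2007_trainval_gt_roidb.pkl"),
)


def clear_caches(frcn_root, relative_paths=STALE_CACHE_FILES):
    """Delete the listed cache files if present; return the paths removed."""
    removed = []
    for rel in relative_paths:
        path = os.path.join(frcn_root, rel)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


# Demonstration on a throwaway directory containing one fake cache file.
demo_root = tempfile.mkdtemp()
demo_file = os.path.join(demo_root, STALE_CACHE_FILES[1])
os.makedirs(os.path.dirname(demo_file))
open(demo_file, "wb").close()
removed = clear_caches(demo_root)
print(removed)
```

On a real setup, call `clear_caches` with the path of your py-faster-rcnn checkout.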

====================================================================================================================================================================

Video detection:

sudo python ./tools/demo_video1.py --net zf

10. Results

In the py-faster-rcnn directory, run:

./tools/demo.py --net zf

Or change the default model to zf. The original line is:

parser.add_argument('--net', dest='demo_net', help='Network to use [vgg16]',
choices=NETS.keys(), default='vgg16')

Change the default to:

default='zf'

Then run:

./tools/demo.py
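The effect of changing the default can be checked in isolation. The following is a minimal sketch of the same argparse pattern, not demo.py itself: the NETS table here is a hypothetical stand-in with placeholder values.

```python
import argparse

# Hypothetical stand-in for the NETS table in demo.py; values are placeholders.
NETS = {"vgg16": "VGG16 model", "zf": "ZF model"}


def parse_args(argv=None, default_net="zf"):
    """Parse demo arguments, falling back to `default_net` when --net is absent."""
    parser = argparse.ArgumentParser(description="Faster R-CNN demo (sketch)")
    parser.add_argument("--net", dest="demo_net",
                        help="Network to use [{}]".format(default_net),
                        choices=sorted(NETS.keys()), default=default_net)
    return parser.parse_args(argv)


print(parse_args([]).demo_net)
print(parse_args(["--net", "vgg16"]).demo_net)
```

An explicit --net on the command line still wins over the default, so both ways of selecting zf described above can coexist.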
