Recently I was training a CNN-based translation model using an open-source project from GitHub. Running THEANO_FLAGS='floatX=float32,device=gpu2,lib.cnmem=1' python run_nnet.py -w data/exp1/ failed with the error "The image and the kernel must have the same type. inputs(float64), kerns(float32)". Switching to THEANO_FLAGS='floatX=float64,device=gpu2,lib.cnmem=1' python run_nnet.py -w data/exp1/ ran successfully, but training on just a few hundred examples took more than ten minutes. Far too slow.
Watching the GPU with nvidia-smi -l showed GPU memory usage at full capacity but GPU-Util at 0, while top showed CPU usage at 1600% (on a 16-core machine). This is the opposite of what other jobs look like (GPU-Util near 100%, CPU below 100%): the CPU was saturated and the GPU sat idle, so the computation was clearly running entirely on the CPU. Googling the symptoms turned up no satisfying answer, so I went back to the official documentation.
First, try the assert_no_cpu_op=raise flag:
THEANO_FLAGS="floatX=float64,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1,assert_no_cpu_op=raise" python run_nnet.py
According to the documentation, with this flag set, any operation that executes on the CPU should raise an exception. In practice, nothing was raised.
Reading the documentation more carefully, the Theano FAQ (refer: http://deeplearning.net/software/theano/faq.html) says:
“It should be noted that using float32 and int{32, 64} together inside a function would provide float64 as output.
Since the GPU can’t compute this kind of output, it would be preferable not to use those dtypes together.
To help you find where float64 are created, see the warn_float64 Theano flag.”
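A minimal sketch of the mixing rule the FAQ describes (the variable names are illustrative):

import theano.tensor as T

x = T.fmatrix('x')   # float32 matrix
i = T.lscalar('i')   # int64 scalar
y = x * i            # float32 combined with int64 is upcast
print(y.dtype)       # prints 'float64', which the GPU cannot compute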
So any float64 computation cannot run on the GPU and silently falls back to the CPU. Going further, the tutorial on using the GPU (refer: http://deeplearning.net/software/theano/tutorial/using_gpu.html) says:
"
- Only computations with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but float64 computations are still relatively slow (Jan 2010).
- Prefer constructors like matrix, vector and scalar to dmatrix, dvector and dscalar because the former will give you float32 variables when floatX=float32.
- Ensure that your output variables have a float32 dtype and not float64. The more float32 variables are in your graph, the more work the GPU can do for you."
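A quick illustration of the constructor advice, assuming the process was started with floatX=float32:

import theano.tensor as T

# With THEANO_FLAGS='floatX=float32':
a = T.matrix('a')    # dtype follows floatX, so float32 here
b = T.dmatrix('b')   # the 'd' prefix always means float64
print(a.dtype)       # 'float32', GPU-friendly
print(b.dtype)       # 'float64', forces CPU execution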
So the cause is a float64 input somewhere in the code. Following the documentation's advice, the warn_float64 config flag can help locate where float64 values are created, so I ran:
THEANO_FLAGS="floatX=float64,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1,warn_float64=raise" python run_nnet.py -w data/exp1/
The exception traceback:
Traceback (most recent call last):
File "run_nnet.py", line 570, in <module>
main()
File "run_nnet.py", line 208, in main
nnet_q.set_input((x_q, x_q_overlap))
File ".../nn_layers.py", line 65, in set_input
self.output = self.output_func(input)
File ".../nn_layers.py", line 89, in output_func
layer.set_input(cur_input)
This traceback only shows that a float64 turned up while set_input was being called on the network, but the offending float64 variables were created long before that point, so the problem still could not be pinned down. In this case the flag was of no real help.
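The reason is that warn_float64 fires when a float64 symbolic variable is created during graph construction, not when the underlying numpy array is allocated. A minimal sketch of that behavior (as I understand the flag; names are illustrative):

import theano
import theano.tensor as T

theano.config.warn_float64 = 'raise'   # same effect as the THEANO_FLAGS setting

# Raises here: a float64 symbolic variable is created on this line.
x = T.dmatrix('x')

# A float64 numpy array that merely feeds the graph, however, was allocated
# earlier, where warn_float64 never sees it, so the traceback points at
# graph-building code such as set_input rather than at the numpy call.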
Since nearly all the inputs in the code are created or loaded by numpy, I checked the numpy documentation: array-creation routines default to float64 when no dtype is specified, e.g. numpy.ones, numpy.zeros, and numpy.random.RandomState.randn. Theano's config (refer: http://deeplearning.net/software/theano/library/config.html) has a cast_policy flag; according to the documentation, with floatX=float32 and cast_policy=numpy+floatX set together, arrays produced by numpy are automatically converted to float32 during execution. So I ran:
THEANO_FLAGS="floatX=float32,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1,cast_policy=numpy+floatX" python run_nnet.py -w data/exp1/
And the result... the same error again: "NotImplementedError: The image and the kernel must have the same type.inputs(float64), kerns(float32)"
The documentation for this flag does include a warning:
" Note that ‘numpy+floatX’ is not currently behaving exactly as planned (it is a work-in-progress), and thus you should consider it as experimental. "
So it really is experimental, and there are cases it cannot handle.
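The numpy behavior itself is easy to demonstrate; the sketch below is generic and assumes nothing about the project's code:

import numpy as np

rng = np.random.RandomState(42)

# Without an explicit dtype, numpy's array constructors return float64:
w = rng.randn(3, 4)
b = np.zeros(4)
print(w.dtype, b.dtype)      # float64 float64

# The fix is to request or cast to float32 at every creation site:
w32 = rng.randn(3, 4).astype(np.float32)
b32 = np.zeros(4, dtype=np.float32)
print(w32.dtype, b32.dtype)  # float32 float32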
That left the brute-force option: carefully go through every numpy call and explicitly specify dtype=numpy.float32 wherever an array is created. With everything fixed, I ran:
THEANO_FLAGS="floatX=float32,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1" python run_nnet.py -w data/exp1/
This finally pushed GPU-Util to full utilization, brought the CPU down to 100%, and cut the training time for those few hundred examples from over ten minutes to mere seconds.
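To confirm that a compiled function really runs on the GPU, the using_gpu tutorial suggests inspecting the optimized graph; a generic sketch of that check (not code from this project):

import numpy as np
import theano
import theano.tensor as T

x = T.fmatrix('x')                      # float32, so GPU-eligible
f = theano.function([x], T.exp(x))

# On the GPU, elementwise ops compile to GpuElemwise instead of Elemwise:
if np.any([isinstance(node.op, T.Elemwise)
           for node in f.maker.fgraph.toposort()]):
    print('Used the CPU')
else:
    print('Used the GPU')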
Lessons learned:
Reading the official documentation is an extremely effective way to attack a problem: the common issues are usually spelled out clearly there, which helps you understand exactly what is wrong and then track down a solution.
