TensorFlow | ReluGrad input is not finite. Tensor had NaN values


Question

I ran into this problem while training a CNN on the MNIST dataset with TensorFlow. The key piece of code is the following:

# y_ holds the one-hot labels and y_conv the softmax output of the CNN.
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

With a learning rate of \(10^{-4}\) there is no problem, but as soon as I raise it to \(0.01\) or \(0.5\), the program quickly fails with:

tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values

Analysis

Learning Rate

To expose what was happening, I added a couple of lines to print the values of y_conv and cross_entropy:

# Wrap the tensors with tf.Print so their values are logged every time
# they are evaluated during training.
y_conv = tf.Print(y_conv, [y_conv], "y_conv: ")
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "cross_entropy: ")

With the learning rate set to \(0.01\), the program soon reports an error:

I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.0374929e-06 0.0059775524 0.980205...]
step 0, training accuracy 0.04
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [9.2028862e-10 1.4812358e-05 0.044873074...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [648.49146]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.024463326 1.4828938e-31 0...]
step 1, training accuracy 0.2
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [2.4634053e-11 3.3087209e-34 0...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [nan]
step 2, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [nan nan nan...]
W tensorflow/core/common_runtime/executor.cc:1027] 0x7ff51d92a940 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values

With the learning rate set to \(10^{-4}\), the program runs without error:

I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.00056920078 8.4922984e-09 0.00033719366...]
step 0, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [7.0613837e-10 9.28294e-09 0.00016230672...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [439.95135]
step 1, training accuracy 0.16
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.031509314 3.6221365e-05 0.015359053...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.7112056e-07 1.8543299e-09 8.9234991e-06...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [436.37653]
step 2, training accuracy 0.12
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.015578311 0.0026688741 0.44736364...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [6.0428465e-07 0.0001744287 0.026451336...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [385.33765]

From this we can see that an overly large learning rate is one cause of the error.
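The logs above also hint at the mechanism: the entries of y_conv are driven toward 0 (e.g. 3.3087209e-34, then exactly 0), tf.log of such a value yields -inf, and the hand-written cross entropy turns into nan. A minimal NumPy sketch of the same effect, using made-up illustrative values rather than numbers from the actual run:

import numpy as np

# When the softmax output saturates, some entries of y_conv underflow to
# exactly 0; log(0) = -inf and 0 * (-inf) = nan, so the sum becomes nan.
y_ = np.array([0.0, 1.0, 0.0])       # one-hot label
y_conv = np.array([1.0, 0.0, 0.0])   # saturated, confidently wrong prediction
print(-np.sum(y_ * np.log(y_conv)))  # -> nan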

According to the Stanford CS 224D lecture notes, when NaNs appear while training a deep neural network, the most likely cause is that the learning rate is too large, the gradient values grow too large, and the gradients explode. The notes give a precise definition of Gradient Explosion:

During experimentation, once the gradient value grows extremely large, it causes an overflow (i.e. NaN) which is easily detectable at runtime; this issue is called the Gradient Explosion Problem.

Solutions

  1. Decrease the learning rate appropriately.
  2. Apply gradient clipping. Gradient clipping was first introduced by Thomas Mikolov: whenever the gradients reach a certain threshold, they are scaled back to a smaller value, as described in the CS 224D lecture notes (a TensorFlow sketch is given after Algorithm 1 below).

To solve the problem of exploding gradients, Thomas Mikolov first introduced a simple heuristic solution that clips gradients to a small number whenever they explode. That is, whenever they reach a certain threshold, they are set back to a small number as shown in Algorithm 1.
Algorithm 1:
\(g \leftarrow \frac{\partial E}{\partial W}\)
if \(\Vert g \Vert \ge threshold\) then
\(\quad g \leftarrow \frac{threshold}{\Vert g \Vert}\, g\)
end if
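A minimal TensorFlow sketch of this clipping step, written against the same TF 1.x-style API as the code above; the threshold of 5.0 is an arbitrary illustrative choice, not a value from the original post:

# Instead of calling minimize(), compute the gradients explicitly, rescale
# each one whose norm exceeds the threshold (Algorithm 1), then apply them.
optimizer = tf.train.AdamOptimizer(1e-2)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
clipped = [(tf.clip_by_norm(g, clip_norm=5.0), v)
           for g, v in grads_and_vars if g is not None]
train_step = optimizer.apply_gradients(clipped)

tf.clip_by_norm implements the per-tensor scaling of Algorithm 1; tf.clip_by_global_norm is a common variant that clips all gradients jointly by their combined norm.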

