Question
I ran into this problem while training a CNN on the MNIST dataset with TensorFlow. The key code is the following:
# Cross-entropy computed directly from the softmax output y_conv
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
# Adam optimizer with learning rate 1e-4
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
A learning rate of \(1e-4\) works fine, but as soon as I raise it to \(0.01\) or \(0.5\), the run fails almost immediately with:
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
Analysis
Learning Rate
So I added a few lines of code to print out the values of y_conv and cross_entropy:
# tf.Print is an identity op that logs the listed tensors each time it is evaluated
y_conv = tf.Print(y_conv, [y_conv], "y_conv: ")
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "cross_entropy: ")
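As an aside (not part of the original debugging session), tf.check_numerics can serve a similar purpose and fails fast with a custom message instead of logging:

# Identity op that raises InvalidArgumentError as soon as y_conv contains NaN or Inf
y_conv = tf.check_numerics(y_conv, "y_conv contains NaN or Inf")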
When the learning rate is \(0.01\), the program fails:
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.0374929e-06 0.0059775524 0.980205...]
step 0, training accuracy 0.04
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [9.2028862e-10 1.4812358e-05 0.044873074...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [648.49146]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.024463326 1.4828938e-31 0...]
step 1, training accuracy 0.2
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [2.4634053e-11 3.3087209e-34 0...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [nan]
step 2, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [nan nan nan...]
W tensorflow/core/common_runtime/executor.cc:1027] 0x7ff51d92a940 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
When the learning rate is \(1e-4\), the program runs without errors:
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.00056920078 8.4922984e-09 0.00033719366...]
step 0, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [7.0613837e-10 9.28294e-09 0.00016230672...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [439.95135]
step 1, training accuracy 0.16
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.031509314 3.6221365e-05 0.015359053...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.7112056e-07 1.8543299e-09 8.9234991e-06...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [436.37653]
step 2, training accuracy 0.12
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.015578311 0.0026688741 0.44736364...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [6.0428465e-07 0.0001744287 0.026451336...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [385.33765]
At this point we can see that an overly large learning rate is one cause of the error: within two steps some entries of y_conv are driven to exactly 0, so tf.log(y_conv) returns -inf, cross_entropy becomes NaN, and the NaN then propagates back into the ReLU gradients.
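To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original run) of what happens once the softmax output contains an exact 0:

import numpy as np

y_ = np.array([0.0, 1.0])                              # one-hot label

# True class gets probability 0: log(0) = -inf, so the loss is inf
print(-np.sum(y_ * np.log(np.array([1.0, 0.0]))))      # inf

# A wrong class gets probability 0: 0 * log(0) = 0 * (-inf) = nan
print(-np.sum(y_ * np.log(np.array([0.0, 1.0]))))      # nan

Either way, a non-finite value enters the graph and eventually reaches ReluGrad.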
According to the Stanford CS 224D lecture notes, when NaNs appear while training a deep neural network, the most likely cause is that the learning rate is too large: the gradient values grow out of control and the gradients explode. The notes give a precise description of the Gradient Explosion Problem:
During experimentation, once the gradient value grows extremely large, it causes an overflow (i.e. NaN) which is easily detectable at runtime; this issue is called the Gradient Explosion Problem.
Solutions
- Decrease the learning rate appropriately.
- Apply gradient clipping, a method first proposed by Thomas Mikolov: whenever the gradients reach a certain threshold, they are set back to a smaller value, as shown in Algorithm 1 and the sketch below.
Again from the Stanford CS 224D lecture notes:
To solve the problem of exploding gradients, Thomas Mikolov first introduced a simple heuristic solution that clips gradients to a small number whenever they explode. That is, whenever they reach a certain threshold, they are set back to a small number as shown in Algorithm 1.
Algorithm 1:
\(g \gets \frac{\partial E}{\partial W}\)
if \(\Vert g \Vert \ge threshold\) then
\(\quad g \gets \frac{threshold}{\Vert g \Vert}\, g\)
end if
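Here is a minimal TensorFlow sketch of this algorithm, assuming the same TF 1.x-style API as the snippets above (the threshold value 5.0 is purely illustrative):

# Compute gradients explicitly instead of calling minimize() directly
optimizer = tf.train.AdamOptimizer(1e-2)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
grads, variables = zip(*grads_and_vars)

# tf.clip_by_global_norm implements Algorithm 1: if the overall gradient
# norm exceeds the threshold, every gradient is rescaled by threshold/norm
clipped_grads, global_norm = tf.clip_by_global_norm(grads, 5.0)
train_step = optimizer.apply_gradients(list(zip(clipped_grads, variables)))

tf.clip_by_norm could be used on each tensor individually instead; clipping by the global norm has the advantage of preserving the direction of the overall gradient.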