Question
I ran into this problem while training a CNN on the MNIST dataset with TensorFlow. The key code is the following:
# Cross-entropy computed directly from the softmax output y_conv
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
# Adam optimizer with learning rate 1e-4
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
A learning rate of \(1e-4\) works fine, but as soon as I raise it to \(0.01\) or \(0.5\), the run fails almost immediately with:
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
Analysis
Learning Rate
So I added a few lines of code to print out the values of y_conv and cross_entropy:
# tf.Print is an identity op that logs the listed tensors each time it is evaluated
y_conv = tf.Print(y_conv, [y_conv], "y_conv: ")
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "cross_entropy: ")
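As an aside (not part of the original debugging session), tf.check_numerics can serve a similar purpose and fails fast with a custom message instead of logging:

# Identity op that raises InvalidArgumentError as soon as y_conv contains NaN or Inf
y_conv = tf.check_numerics(y_conv, "y_conv contains NaN or Inf")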
When the learning rate is \(0.01\), the program fails:
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.0374929e-06 0.0059775524 0.980205...]
step 0, training accuracy 0.04
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [9.2028862e-10 1.4812358e-05 0.044873074...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [648.49146]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.024463326 1.4828938e-31 0...]
step 1, training accuracy 0.2
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [2.4634053e-11 3.3087209e-34 0...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [nan]
step 2, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [nan nan nan...]
W tensorflow/core/common_runtime/executor.cc:1027] 0x7ff51d92a940 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
When the learning rate is \(1e-4\), the program runs without errors:
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.00056920078 8.4922984e-09 0.00033719366...]
step 0, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [7.0613837e-10 9.28294e-09 0.00016230672...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [439.95135]
step 1, training accuracy 0.16
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.031509314 3.6221365e-05 0.015359053...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.7112056e-07 1.8543299e-09 8.9234991e-06...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [436.37653]
step 2, training accuracy 0.12
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.015578311 0.0026688741 0.44736364...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [6.0428465e-07 0.0001744287 0.026451336...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [385.33765]
At this point we can see that an overly large learning rate is one cause of the error: within two steps some entries of y_conv are driven to exactly 0, so tf.log(y_conv) returns -inf, cross_entropy becomes NaN, and the NaN then propagates back into the ReLU gradients.
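To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original run) of what happens once the softmax output contains an exact 0:

import numpy as np

y_ = np.array([0.0, 1.0])                              # one-hot label

# True class gets probability 0: log(0) = -inf, so the loss is inf
print(-np.sum(y_ * np.log(np.array([1.0, 0.0]))))      # inf

# A wrong class gets probability 0: 0 * log(0) = 0 * (-inf) = nan
print(-np.sum(y_ * np.log(np.array([0.0, 1.0]))))      # nan

Either way, a non-finite value enters the graph and eventually reaches ReluGrad.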
According to the Stanford CS 224D lecture notes, when NaNs appear while training a deep neural network, the most likely cause is that the learning rate is too large: the gradient values grow out of control and the gradients explode. The notes give a precise description of the Gradient Explosion Problem:
During experimentation, once the gradient value grows extremely large, it causes an overflow (i.e. NaN) which is easily detectable at runtime; this issue is called the Gradient Explosion Problem.
Solutions
- Decrease the learning rate appropriately.
- Apply gradient clipping, a method first proposed by Thomas Mikolov: whenever the gradients reach a certain threshold, they are set back to a smaller value, as shown in Algorithm 1 and the sketch below.
Again from the Stanford CS 224D lecture notes:
To solve the problem of exploding gradients, Thomas Mikolov first introduced a simple heuristic solution that clips gradients to a small number whenever they explode. That is, whenever they reach a certain threshold, they are set back to a small number as shown in Algorithm 1.
Algorithm 1:
\(g \gets \frac{\partial E}{\partial W}\)
if \(\Vert g \Vert \ge threshold\) then
\(\quad g \gets \frac{threshold}{\Vert g \Vert}\, g\)
end if
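Here is a minimal TensorFlow sketch of this algorithm, assuming the same TF 1.x-style API as the snippets above (the threshold value 5.0 is purely illustrative):

# Compute gradients explicitly instead of calling minimize() directly
optimizer = tf.train.AdamOptimizer(1e-2)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
grads, variables = zip(*grads_and_vars)

# tf.clip_by_global_norm implements Algorithm 1: if the overall gradient
# norm exceeds the threshold, every gradient is rescaled by threshold/norm
clipped_grads, global_norm = tf.clip_by_global_norm(grads, 5.0)
train_step = optimizer.apply_gradients(list(zip(clipped_grads, variables)))

tf.clip_by_norm could be used on each tensor individually instead; clipping by the global norm has the advantage of preserving the direction of the overall gradient.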