Deep Learning: Improving Deep Neural Networks - Week 2 - Optimization Algorithms - Andrew Ng

Reposted article. 2019-03-29 21:14. Tags: Andrew Ng / Deep Learning

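Many people recommend the survey "An overview of gradient descent optimization algorithms", so it is listed here at the top; it is worth reading when you have the chance.

This article is based mainly on Andrew Ng's Coursera course "Improving Deep Neural Networks". It also includes some comparisons between optimization algorithms drawn from other sources; these are not attributed individually, and the full list of references is given at the end of the article.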

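The optimization algorithms covered in this article are:

Mini-batch gradient descent
Exponentially weighted averages
Gradient descent with momentum
RMSprop
Adam

All of these are refinements of gradient descent. For each algorithm the article covers the basic principle, its use in TensorFlow, and a summary of its strengths and weaknesses relative to the others. The final sections cover learning rate decay and the problem of local optima.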

1. Mini-batch gradient descent
2. Exponentially weighted averages
3. Gradient descent with momentum
4. RMSprop
- 4.1 Pseudocode
- 4.2 RMSprop in TensorFlow
  - 4.2.1 Building the optimizer
  - 4.2.2 tf.train.RMSPropOptimizer()
5. Adam optimization algorithm
- 5.1 Adam algorithm: pseudocode
- 5.2 Adam optimization algorithm in TensorFlow
  - 5.2.1 Building the optimizer
  - 5.2.2 tf.train.AdamOptimizer
6. Summary of the pros and cons of the optimization algorithms
7. Learning rate decay
- 7.1 Ways to reduce the learning rate
- 7.2 Setting the learning rate in TensorFlow
8. The problem of local optima
Reference

1. Mini-batch gradient descent

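When the training set is not too large, batch gradient descent is the usual choice: the entire training set is fed through the network for every gradient descent step. In practice, however, training sets are often very large, for example millions of samples, and batch gradient descent then becomes very slow because every single update requires a pass over all the data.

Running each gradient descent step on only a subset of the training set instead gives mini-batch gradient descent.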

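1.1 How the algorithm works

The full training set is \(X=[x^1, x^2, \cdots, x^m] \in R^{n \times m}\) with labels \(Y=[y^1, y^2, \cdots, y^m] \in R^{1 \times m}\).

Assume:

\(m=5000000\) and each mini-batch holds 1000 samples, that is, \(X^{\{t\}} \in R^{n \times 1000},Y^{\{t\}} \in R^{1 \times 1000}, t=1, 2, \cdots, 5000\), so there are \(T=5000\) mini-batches.

\(x^i\) denotes the \(i\)-th sample; \(Z^{[l]}\) denotes the linear output of layer \(l\); \(X^{\{t\}}, Y^{\{t\}}\) denote the \(t\)-th mini-batch.

Gradient descent is then run on each mini-batch in turn. In pseudocode: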

# one epoch
for t = 1, ..., T{
    Forward Propagation
    Compute Cost Function
    Backward Propagation
}

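Each step in detail:

(1) Forward Propagation

Output of the first layer (linear part, then activation):

\[Z^{[1]} = W^{[1]}X^{\{t\}} + b^{[1]} \]

\[A^{[1]} = g^{[1]}(Z^{[1]}) \]

Output of layer \(l\), with \(A^{[0]} = X^{\{t\}}\):

\[Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]} \]

\[A^{[l]} = g^{[l]}(Z^{[l]}) \]

(2) Compute Cost Function

The cost on the current mini-batch of 1000 samples, with L2 regularization:

\[J = \dfrac{1}{1000} \sum_{i=1}^{1000}Loss(\hat{y}^i, y^i) + \dfrac{\lambda}{2 \times 1000} \sum_{l}||W^{[l]}||_F^2 \]

(3) Backward Propagation

Compute the gradients \(dW^{[l]}, db^{[l]}\) of \(J\), then update the weights and biases:

\[W^{[l]} : = W^{[l]} - \alpha dW^{[l]} \]

\[b^{[l]} : = b^{[l]} - \alpha db^{[l]} \]

After the for loop has run T times, the network has been trained on the whole training set once, which is one epoch. Training usually runs for multiple epochs.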
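The loop above can be sketched in a few lines of pure Python. This is a minimal illustration, not the course's implementation: it assumes a one-feature linear model y = w * x + b with a mean-squared-error cost in place of a deep network, and the function name `minibatch_gd`, the toy data, and the hyperparameter values are all hypothetical.

```python
import random

def minibatch_gd(xs, ys, batch_size=4, alpha=0.1, epochs=500, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent (toy linear model)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                       # one pass over all mini-batches = one epoch
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass and cost gradient on this mini-batch only.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            dw = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            db = sum(errors) / len(batch)
            # Parameter update: W := W - alpha * dW, b := b - alpha * db.
            w -= alpha * dw
            b -= alpha * db
    return w, b

# Noise-free toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
print(w, b)
```

With 8 samples and a batch size of 4, each epoch performs T = 2 updates, which mirrors the "T steps per epoch" structure of the pseudocode.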

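1.2 A closer look at mini-batch gradient descent

With batch gradient descent, one epoch performs a single gradient descent step; with mini-batch gradient descent, one epoch performs T steps.

1.2.1 Cost function

(1) With batch gradient descent (left plot in the original figure), the cost decreases steadily as the number of iterations over the whole training set grows.

(2) With mini-batch gradient descent (right plot), the cost trends downward as training moves from one mini-batch to the next, but it oscillates because of noise.

(3) The oscillation arises because mini-batches differ from one another: some are "easy" subsets, while others contain noisy samples, so the cost computed on each one fluctuates.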

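1.2.2 Choosing the batch size

There are three options: (1) batch_size=m; (2) batch_size=1; (3) batch_size between 1 and m.

(1) Batch Gradient Descent (batch_size=m)

With batch_size=m the method is batch gradient descent: there is a single subset, the whole data set, that is, \((X^{\{1\}}, Y^{\{1\}})=(X,Y)\).

(2) Stochastic Gradient Descent (batch_size=1)

With batch_size=1 the method is stochastic gradient descent: there are m subsets and each sample is its own subset, that is, \((X^{\{i\}}, Y^{\{i\}})=(x^i,y^i)\).

(3) Mini-batch gradient descent (batch_size between 1 and m)

The figure in the original post compares the descent paths of the three methods:

a. Blue is batch gradient descent. It approaches the minimum smoothly, but because every step uses the whole data set, each step is slow.

b. Purple is stochastic gradient descent. Each step is fast, but because it uses a single sample the path oscillates strongly; it also does not converge to the minimum and ends up wandering around it.

c. Green is mini-batch gradient descent. Each step is fairly fast and the oscillation is smaller, so it gets close to the minimum; if it keeps fluctuating around the minimum, reduce the learning rate.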

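Algorithm	Stochastic Gradient Descent	Mini-batch gradient descent	Batch Gradient Descent
Pros	Works on one sample at a time	(1) Learns quickly; (2) keeps the speed-up from vectorization; (3) makes progress without waiting for a full pass over the training set	(none listed)
Cons	(1) Loses the speed-up from vectorization; (2) inefficient	(none listed)	Each iteration takes too long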

How should the batch size for mini-batch gradient descent be chosen?

Typically 64 to 512, as a power of 2, which tends to run faster;
make sure each \(X^{\{t\}}, Y^{\{t\}}\) fits in GPU/CPU memory.
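As a rough illustration of how a training set is cut into mini-batches of a chosen size, here is a small pure-Python sketch. The helper name `make_minibatches` and the toy data are hypothetical; the last mini-batch is simply allowed to be smaller when the batch size does not divide the number of samples.

```python
import random

def make_minibatches(samples, labels, batch_size=64, seed=0):
    """Shuffle (sample, label) pairs together and cut them into mini-batches."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)            # same permutation for X and Y
    batches = []
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        batches.append(([samples[i] for i in chunk], [labels[i] for i in chunk]))
    return batches

samples = list(range(1000))                       # toy inputs
labels = [2 * s for s in samples]                 # toy labels, paired with the inputs
batches = make_minibatches(samples, labels, batch_size=64)
print(len(batches), len(batches[0][0]), len(batches[-1][0]))
```

Shuffling the inputs and labels with one shared permutation keeps every sample paired with its own label.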

1.3 Gradient descent in TensorFlow

The snippets in this article use the TensorFlow 1.x API (tf.train.*, tf.Session, tf.placeholder).

1.3.1 Building the optimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train = optimizer.minimize(loss)

1.3.2 tf.train.GradientDescentOptimizer()

tf.train.GradientDescentOptimizer.__init__(self, 
                                           learning_rate, 
                                           use_locking=False, 
                                           name="GradientDescent"):
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate to use.  # learning rate
	use_locking: If True use locks for update operations.
	name: Optional name prefix for the operations created when applying gradients. Defaults to "GradientDescent".

1.3.3 Usage in TensorFlow

#coding=utf-8
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
y_pred = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(y_pred - y))  # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))

2. Exponentially weighted averages

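Exponentially weighted averages are a key building block of the optimization algorithms beyond plain gradient descent, so the concept is introduced first.

2.1 London temperature example

How the exponentially weighted average is motivated is not repeated here; see Week 2 of Andrew Ng's "Improving Deep Neural Networks" on NetEase Cloud Classroom, or the course notes by 紅色石頭Will.

Let \(V_0 = 0\) and

\[V_t = \beta V_{t-1} + (1 - \beta) \theta_t \]

where \(\theta_t\) is the temperature on day \(t\) and \(V_t\) is the moving-average-smoothed value of the daily temperatures.
The value of \(\beta\) determines how many days the average effectively covers, roughly \(\dfrac{1}{1-\beta}\). The larger \(\beta\) is, the more days are averaged, the smoother the resulting curve, and the further it shifts to the right (it lags behind the raw data).

For example, \(\beta=0.9\) gives \(\dfrac{1}{1-\beta}=10\), which corresponds to an exponentially weighted average over roughly the last 10 days.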
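One way to see where \(\dfrac{1}{1-\beta}\) comes from (a standard rule of thumb, not an exact identity): the weight of a sample that is \(k\) steps old is \(\beta^k\) times the weight of the newest sample. Writing \(\varepsilon = 1-\beta\), after \(\dfrac{1}{\varepsilon}\) steps that factor has dropped to about \(\dfrac{1}{e}\):

\[\beta^{\frac{1}{1-\beta}} = (1-\varepsilon)^{\frac{1}{\varepsilon}} \approx \dfrac{1}{e} \approx 0.37 \]

\[0.9^{10} \approx 0.35, \qquad 0.98^{50} \approx 0.36 \]

Samples older than about \(\dfrac{1}{1-\beta}\) steps therefore contribute comparatively little to \(V_t\).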

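2.2 A closer look at exponentially weighted averages

2.2.1 The general form

\[V_t = \beta V_{t-1} + (1-\beta)\theta_{t} \]

Unrolling the recursion gives

\[V_t = (1-\beta) \cdot \theta_{t} + (1-\beta) \cdot \beta \cdot \theta_{t-1} + (1-\beta) \cdot \beta^2 \cdot \theta_{t-2} + \cdots + (1-\beta)\cdot \beta^{t-1}\cdot \theta_1 + \beta^t\cdot V_0 \]

where \(\theta_t, \theta_{t-1}, \cdots , \theta_1\) are the raw data (the first plot in the original figure);

the coefficients \((1-\beta), (1-\beta)\cdot \beta, \cdots, (1-\beta)\cdot \beta^{t-1}\) follow an exponential curve (the second plot): reading from the most recent sample back to the oldest, they decay exponentially.

\(V_t\) is the sum of the element-wise products of the two, that is, the raw data weighted by exponentially decaying coefficients: the more recent a sample, the larger its influence; the older a sample, the more it is attenuated.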

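2.2.2 Computing the exponentially weighted average in practice

In practice, to save memory, the average is computed in place with a single running variable: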

\(V_0=0\)

Repeat{

\[Get \quad next \quad \theta_t \]

\[V_{\theta} := \beta V_{\theta} + (1-\beta)\theta_t \]

}
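The running update can be written directly in Python. The sketch below keeps a single number in memory, as in the pseudocode, and then checks the result against the unrolled weighted sum from section 2.2.1; the function name `ewa` and the temperature values are made up for illustration.

```python
def ewa(values, beta=0.9):
    """Return the running exponentially weighted averages V_1..V_t, with V_0 = 0."""
    v = 0.0
    out = []
    for theta in values:
        v = beta * v + (1.0 - beta) * theta       # only one running value is stored
        out.append(v)
    return out

temps = [10.0, 12.0, 11.0, 15.0, 14.0]            # hypothetical daily temperatures
beta = 0.9
v = ewa(temps, beta)

# Unrolled form: V_t = sum over k of (1 - beta) * beta^(t - k) * theta_k.
t = len(temps)
explicit = sum((1.0 - beta) * beta ** (t - k) * temps[k - 1] for k in range(1, t + 1))
print(v[-1], explicit)
```

Both numbers agree up to floating-point rounding, which confirms that the in-place update and the unrolled sum describe the same quantity.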

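2.3 Bias correction

Because \(V_0=0\), the first few values computed with \(V_t = \beta V_{t-1} + (1-\beta)\theta_t\) are strongly biased: they come out smaller than they should. The effect fades as more data is processed, and later values are close to normal.

Bias correction fixes this by rescaling \(V_t\):

\[\dfrac{V_t}{1-\beta^t} \]

As \(t\) grows, \(\beta^t\) approaches 0 and the correction factor approaches 1, so only the early values are affected. In machine learning practice bias correction is not mandatory and is often omitted, since the early biased phase passes quickly.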
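To make the effect of the correction concrete, here is a small sketch that feeds a constant signal into the average. The function name `ewa_with_correction` and the constant value 20 are assumptions chosen for illustration. For a constant input \(c\), the raw average is \(c(1-\beta^t)\), so dividing by \(1-\beta^t\) recovers \(c\) from the very first step.

```python
def ewa_with_correction(values, beta=0.9):
    """Return (raw, corrected) exponentially weighted averages, starting from V_0 = 0."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * theta
        raw.append(v)                              # biased toward 0 for small t
        corrected.append(v / (1.0 - beta ** t))    # bias-corrected estimate
    return raw, corrected

raw, corrected = ewa_with_correction([20.0] * 5)
print(raw)
print(corrected)
```

The raw values start far below 20 and climb slowly, while the corrected values sit at 20 (up to rounding) throughout.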

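3. Gradient descent with momentum

Gradient descent with momentum is usually faster than standard gradient descent.

The idea: at every training step, compute an exponentially weighted average of the gradients, and use that average, rather than the raw gradient, to update the weights and biases.

3.1 Gradient descent

The blue zigzag path in the original figure is standard gradient descent. It oscillates on its way down because each step depends only on the gradient at the current point, which produces the zigzag.

The red path is gradient descent with momentum. The oscillation is much weaker because each step depends not only on the current gradient but also on the earlier ones. The result is smaller swings along the vertical axis and faster progress along the horizontal axis.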

3.2 Pseudocode

On iteration t{

Compute dW, db on the current mini-batch

\(V_{dW} = \beta V_{dW} + (1-\beta)dW\)

\(V_{db} = \beta V_{db} + (1-\beta)db\)

Update the weights and biases

\(W := W - \alpha V_{dW}, b := b - \alpha V_{db}\)

}

Initialize \(V_{dW}=0, V_{db}=0\); a typical value is \(\beta=0.9\).
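A minimal pure-Python sketch of this update, assuming a toy cost \(f(w) = \frac{1}{2}(w_1^2 + 25 w_2^2)\), an elongated bowl that stands in for the "steep vertical axis, shallow horizontal axis" picture. The function names, starting point, and step count are illustrative choices, not part of the course material.

```python
def grad(w):
    """Gradient of the toy cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def momentum_descent(w, alpha=0.1, beta=0.9, steps=400):
    """Gradient descent with momentum, using the (1 - beta) form from the pseudocode."""
    v = [0.0 for _ in w]                           # V_dW initialised to zero
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
    return w

w_final = momentum_descent([10.0, 1.0])
loss = 0.5 * (w_final[0] ** 2 + 25.0 * w_final[1] ** 2)
print(w_final, loss)
```

On this cost, plain gradient descent with the same \(\alpha = 0.1\) would multiply the steep coordinate by \(1 - 0.1 \times 25 = -1.5\) at every step and diverge, whereas the averaged gradient keeps the update stable.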

3.3 Gradient descent with momentum in TensorFlow

3.3.1 Building the optimizer

# optimizer
optimizer = tf.train.MomentumOptimizer(0.01, momentum)  # momentum plays the role of \beta
train = optimizer.minimize(loss)

3.3.2 tf.train.MomentumOptimizer()

tf.train.MomentumOptimizer.__init__(self, learning_rate, momentum,
               use_locking=False, name="Momentum", use_nesterov=False):
    
Args:
	learning_rate: A `Tensor` or a floating point value.  The learning rate. # learning rate
	momentum: A `Tensor` or a floating point value.  The momentum. # plays the role of \beta (typically 0.9); note that TensorFlow accumulates momentum * accumulation + gradient, without the (1-\beta) factor
	use_locking: If `True` use locks for update operations. 
	name: Optional name prefix for the operations created when applying gradients.  Defaults to "Momentum".
	use_nesterov: If `True` use Nesterov Momentum. # a variant of momentum that often works better; see http://jmlr.org/proceedings/papers/v28/sutskever13.pdf

Return:
    optimizer

4. RMSprop

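RMSprop (Root Mean Square prop) is another refinement of gradient descent. Like gradient descent with momentum, it damps the swings along the vertical axis and speeds up progress along the horizontal axis.

4.1 Pseudocode

On iteration t{

Compute dW, db on the current mini-batch

\(S_{dW} = \beta S_{dW} + (1-\beta)(dW)^2\)

\(S_{db} = \beta S_{db} + (1-\beta)(db)^2\)

Update the weights and biases

\(W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}}+\epsilon}, b := b - \alpha \dfrac{db}{\sqrt{S_{db}}+\epsilon}\)

}

The squares are element-wise. A typical value is \(\epsilon=10^{-8}\), which keeps the denominator away from 0.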
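A minimal pure-Python sketch of the RMSprop update, assuming the toy cost \(f(w) = \frac{1}{2}(w_1^2 + 25 w_2^2)\); the function names, starting point, and hyperparameter values are illustrative assumptions. Because each gradient component is divided by the root of its own running mean square, the steep and the shallow coordinate advance at nearly the same pace even though their gradients differ by a factor of 25.

```python
import math

def grad(w):
    """Gradient of the toy cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def rmsprop(w, alpha=0.01, beta=0.9, eps=1e-8, steps=1000):
    """RMSprop with epsilon added outside the square root, as in the pseudocode."""
    s = [0.0 for _ in w]                           # S_dW initialised to zero
    for _ in range(steps):
        g = grad(w)
        s = [beta * si + (1.0 - beta) * gi * gi for si, gi in zip(s, g)]
        w = [wi - alpha * gi / (math.sqrt(si) + eps)
             for wi, gi, si in zip(w, g, s)]
    return w

w_final = rmsprop([1.0, 1.0])
print(w_final)
```

Since the normalized step has a size of roughly \(\alpha\) regardless of the gradient's scale, the learning rate here controls the step length much more directly than in plain gradient descent.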

4.2 RMSprop in TensorFlow

4.2.1 Building the optimizer

# optimizer
optimizer = tf.train.RMSPropOptimizer(0.01, decay, momentum)  # decay corresponds to \beta in the pseudocode above
train = optimizer.minimize(loss)

4.2.2 tf.train.RMSPropOptimizer()

tf.train.RMSPropOptimizer.__init__(self,
                                  learning_rate,
                                  decay=0.9,
                                  momentum=0.0,
                                  epsilon=1e-10,
                                  use_locking=False,
                                  centered=False,
                                  name="RMSProp")
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate.  # learning rate
	decay: Discounting factor for the history/coming gradient  # \beta in the pseudocode: decay rate of the moving average of squared gradients
	momentum: A scalar tensor. # optional extra momentum term; 0.0 gives the plain RMSprop update (this is not \alpha)
	epsilon: Small value to avoid zero denominator.  # \epsilon, keeps the denominator away from 0
	use_locking: If True use locks for update operation.
	centered: If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
	name: Optional name prefix for the operations created when applying gradients. Defaults to "RMSProp".

5. Adam optimization algorithm

Adam combines gradient descent with momentum and RMSprop, and it has been shown to work well across a wide range of neural networks.

5.1 Adam algorithm: pseudocode

Initialize: \(V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0\).

On iteration t {

Compute \(dW, db\) on the current mini-batch

\(V_{dW} = \beta_1 V_{dW} + (1-\beta_1)dW\)

\(V_{db} = \beta_1 V_{db} + (1-\beta_1)db\)

\(S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2\)

\(S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2\)

\(V_{dW}^{corrected}= \dfrac{V_{dW}}{1-\beta_1^t}, V_{db}^{corrected}= \dfrac{V_{db}}{1-\beta_1^t}\)

\(S_{dW}^{corrected}= \dfrac{S_{dW}}{1-\beta_2^t}, S_{db}^{corrected}= \dfrac{S_{db}}{1-\beta_2^t}\)

\(W := W - \alpha \dfrac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}, \quad b := b - \alpha \dfrac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}\)

}

Adam uses bias correction.

Hyperparameters: \(\beta_1 = 0.9, \beta_2=0.999, \epsilon = 10^{-8}\); usually only the learning rate \(\alpha\) needs tuning.
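The Adam steps can be sketched in pure Python as follows. This is an illustration under assumptions, not a reference implementation: it uses the toy cost \(f(w) = \frac{1}{2}(w_1^2 + 25 w_2^2)\), and the function names, starting point, learning rate, and step count are chosen only for the example.

```python
import math

def grad(w):
    """Gradient of the toy cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def adam(w, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam: momentum-style first moment, RMSprop-style second moment, both bias-corrected."""
    v = [0.0 for _ in w]                           # first moment  (V_dW)
    s = [0.0 for _ in w]                           # second moment (S_dW)
    for t in range(1, steps + 1):
        g = grad(w)
        v = [beta1 * vi + (1.0 - beta1) * gi for vi, gi in zip(v, g)]
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        v_hat = [vi / (1.0 - beta1 ** t) for vi in v]   # bias correction
        s_hat = [si / (1.0 - beta2 ** t) for si in s]
        w = [wi - alpha * vh / (math.sqrt(sh) + eps)
             for wi, vh, sh in zip(w, v_hat, s_hat)]
    return w

w_final = adam([1.0, 1.0])
print(w_final)
```

The bias correction matters most here for the second moment: with \(\beta_2 = 0.999\) the uncorrected \(S\) would stay far too small for the first several hundred steps and inflate the early updates.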

5.2 Adam optimization algorithm in TensorFlow

5.2.1 Building the optimizer

optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
train = optimizer.minimize(loss)

5.2.2 tf.train.AdamOptimizer

tf.train.AdamOptimizer.__init__(self,
                               learning_rate=0.001,
                               beta1=0.9,
                               beta2=0.999,
                               epsilon=1e-8,
                               use_locking=False,
                               name="Adam"):
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate. # learning rate
	beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. # \beta_1
	beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates. # \beta_2
	epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
	use_locking: If True use locks for update operations.
	name: Optional name for the operations created when applying gradients. Defaults to "Adam".

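6. Summary of the pros and cons of the optimization algorithms

6.1 Batch Gradient Descent

Idea: run gradient descent on the whole training set for every weight update.

Pros:

Every step uses the loss over the whole training set, so the gradient is exact and the descent is stable; on a convex cost it reaches the global optimum.

Cons:

Each iteration is expensive to compute and uses a lot of memory.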

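6.2 Stochastic Gradient Descent

Idea: pick one sample at random from the training set, compute its gradient, and update the parameters.

Pros:

Each gradient is computed on the loss of a single sample, so the cost per step is small;
the noise it introduces has a regularizing effect.

Cons:

Each step considers only one sample, so it can get stuck in a local optimum;
on a large training set, training takes a long time;
choosing a good learning rate is difficult;
it is sensitive to parameter initialization.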

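6.3 Mini-batch Gradient Descent

Idea: pick batch_size samples from the training set, compute the gradient of their loss, and update the weights.

Pros:

On very large training sets it converges comparatively fast.

Cons:

The update direction depends on the samples in the current batch, so it is noisy;
it may oscillate around the minimum instead of converging to it, so the learning rate has to be reduced gradually.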

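6.4 Gradient Descent with Momentum

Idea: update using both the earlier gradient directions and the gradient of the current batch.

Pros:

Damps the swings in the vertical direction and suppresses oscillation to some degree;
speeds up movement in the horizontal direction, approaching the optimum faster and accelerating convergence.

6.5 RMSprop

Idea: like gradient descent with momentum, it uses an exponentially weighted average, here of the squared gradients, which is then used to scale each update.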

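6.6 AdaGrad

Idea: accumulate the squared gradients of each parameter over the whole run and divide that parameter's learning rate by the square root of the accumulated sum.

Pros:

Large updates early in training;
small updates late in training;
well suited to sparse gradients.

Cons:

Late in training the effective learning rate becomes very small, so the parameters update very slowly;
a global learning rate still has to be set by hand.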

7. Learning rate decay

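Gradually reducing the learning rate during training can speed up training: large steps make fast progress early on, and smaller steps later let the parameters settle near the minimum instead of oscillating around it. This is called learning rate decay: the learning rate \(\alpha\) shrinks as the number of epochs grows.

7.1 Ways to reduce the learning rate

(1) Option 1:

\[\alpha = \dfrac{1}{1+ decay\_rate \times epoch\_num}\cdot \alpha_0 \]

where \(decay\_rate\) is the decay hyperparameter, \(epoch\_num\) is the number of the current epoch, and \(\alpha_0\) is the initial learning rate.

(2) Option 2:

\[\alpha = 0.95^{epoch\_num} \cdot \alpha_0 \]

(3) Option 3:

\[\alpha = \dfrac{k}{\sqrt{epoch\_num}}\cdot \alpha_0 \quad \text{or} \quad \alpha = \dfrac{k}{\sqrt{t}}\cdot \alpha_0 \]

where \(k\) is a constant and \(t\) is the mini-batch iteration number.

(4) Option 4:

Make \(\alpha\) a piecewise-constant function of \(t\), so that it drops in steps as \(t\) grows.

(5) Option 5:

Watch the training logs and adjust the learning rate by hand.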
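The first four schedules are simple enough to write as plain Python functions. The sketch below is illustrative: the function names are made up, and the staircase parameters `drop` and `every` are hypothetical, since the text does not specify how large the steps are or how often they occur.

```python
import math

def inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    """Option 1: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """Option 2: alpha = base ** epoch_num * alpha0."""
    return (base ** epoch_num) * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    """Option 3: alpha = k / sqrt(epoch_num) * alpha0 (requires epoch_num >= 1)."""
    return k / math.sqrt(epoch_num) * alpha0

def staircase_decay(alpha0, epoch_num, drop=0.5, every=10):
    """Option 4: multiply alpha by `drop` once every `every` epochs (assumed parameters)."""
    return alpha0 * (drop ** (epoch_num // every))

for epoch in (1, 10, 20):
    print(epoch,
          inverse_decay(0.2, epoch),
          exponential_decay(0.2, epoch),
          sqrt_decay(0.2, epoch),
          staircase_decay(0.2, epoch))
```

Each function returns the learning rate to use for a given epoch, so the training loop only needs to call one of them at the start of every epoch.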

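7.2 Setting the learning rate in TensorFlow

TensorFlow offers quite a few ways to set the learning rate. Since this article is about gradient descent optimization algorithms, covering them here would take a lot of space and bloat the text, so they are summarized in a separate post:

Ways to set the learning rate in TensorFlow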

8. The problem of local optima

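When gradient descent is used to minimize the cost function, it may end up at a local optimum rather than the global one.

The usual mental picture of a local optimum is the one on the left of the original figure. In neural networks, however, the picture changes: most points where the gradient is zero are not such basins but saddle points, as on the right of the figure.

Plateaus, which often surround saddle points, slow learning down. A plateau is a flat region where the gradient is close to zero; progress across it is slow, so reaching the saddle point takes a long time. Once there, random perturbations let the descent continue, but a lot of time has already been spent on the plateau.

Gradient descent with momentum, RMSprop, and Adam help with slow progress on plateaus and speed up training.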
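A toy illustration of why the neighbourhood of a saddle point costs time, assuming the function \(f(x, y) = x^2 - y^2\) (chosen only for this example). Its gradient is zero at the origin, which is a minimum along \(x\) and a maximum along \(y\). Plain gradient descent started with a tiny offset in \(y\), standing in for a random perturbation, needs many steps before it moves away, because the gradient near the saddle is so small.

```python
def grad(x, y):
    """Gradient of the toy saddle f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y

def steps_to_escape(x=1.0, y=1e-6, alpha=0.1, max_steps=1000):
    """Count plain gradient descent steps until |y| exceeds 1."""
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        x -= alpha * gx                            # x is pulled toward the saddle
        y -= alpha * gy                            # y slowly grows away from it
        if abs(y) > 1.0:
            return step
    return max_steps

slow = steps_to_escape()                           # tiny initial perturbation in y
fast = steps_to_escape(y=1e-3)                     # larger perturbation escapes sooner
print(slow, fast)
```

The function is unbounded below, so "escape" here only means leaving the flat region around the saddle; the point is the number of steps spent there, not the final value.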

The end.

Author's personal site: https://chenzhen.online

Reference
