Optimizers


The five most popular optimizers today are Momentum (momentum optimization), NAG (Nesterov Accelerated Gradient), AdaGrad, RMSProp, and Adam. All of them build on plain gradient descent, adding a notion of inertia and of awareness of the surrounding landscape to keep improving the updates.

Momentum Optimization

The idea behind momentum optimization is simple: think of physical inertia. Imagine a bowling ball rolling down a gentle slope on a smooth surface: it starts out slowly, but it quickly picks up momentum until it reaches terminal velocity (assuming there is some friction and air resistance).

Momentum optimization takes the previous gradients into account. The update equations are:

\((1)m \leftarrow \beta m + \eta \nabla _\theta J(\theta)\)

\((2)\theta \leftarrow \theta - m\)

The hyperparameter \(\beta\) is called the momentum; it must be set between 0 (high friction) and 1 (no friction), and a typical value is 0.9.

It is easy to verify that if the gradient stays constant, the terminal velocity (i.e., the maximum size of the weight update) equals the gradient multiplied by the learning rate multiplied by \(\frac{1}{1-\beta}\). With \(\beta = 0.9\), the terminal velocity is 10 times the gradient times the learning rate, so momentum optimization can end up going 10 times faster than plain gradient descent. In deep neural networks that do not use batch normalization, the upper layers often end up with inputs of very different scales, so momentum optimization helps a lot; it also helps roll past local optima.
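
As a quick sanity check of the \(\frac{1}{1-\beta}\) claim, here is a minimal Python sketch (not part of the original text; the gradient and learning rate are illustrative values) that applies update (1) with a constant gradient. The accumulated step converges to gradient × learning rate × \(\frac{1}{1-\beta}\):

# Minimal sketch: momentum update (1) with a constant gradient g
beta, eta, g = 0.9, 0.1, 1.0      # illustrative values
m = 0.0
for step in range(200):
    m = beta * m + eta * g        # equation (1)
print(m)                          # ~= eta * g / (1 - beta) = 1.0, i.e. 10x a plain gradient descent step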

Because of the momentum, the optimizer may overshoot a bit, come back, overshoot again, and oscillate like this several times before settling at the minimum. This is one of the reasons it is good to have a bit of friction in the system: it damps the oscillations and thus speeds up convergence.

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,momentum=0.9)

Nesterov Accelerated Gradient (NAG)

Update equations:

\((1)m \leftarrow \beta m + \eta \nabla _\theta J(\theta - \beta m)\)

\((2)\theta \leftarrow \theta - m\)

The only difference from momentum optimization is that the gradient is measured at \(\theta - \beta m\), that is, slightly ahead of \(\theta\) in the direction the momentum is carrying us. This small tweak works because the momentum vector generally points in the right direction, so it is slightly more accurate to use the gradient measured a bit farther along that direction than the gradient at the original position.
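
For illustration, here is a minimal sketch of a few NAG steps on the toy objective \(J(\theta) = \frac{1}{2}\theta^2\) (so \(\nabla_\theta J(\theta) = \theta\)); the objective and all values are assumptions made for this example, not from the original text:

# Minimal sketch: Nesterov steps on J(theta) = theta**2 / 2, whose gradient is simply theta
def grad(theta):
    return theta

beta, eta = 0.9, 0.1                          # illustrative values
theta, m = 1.0, 0.0
for step in range(10):
    lookahead = theta - beta * m              # measure the gradient slightly ahead, along the momentum direction
    m = beta * m + eta * grad(lookahead)      # equation (1)
    theta = theta - m                         # equation (2)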

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,momentum=0.9,use_nesterov=True)

AdaGrad

AdaGrad adapts the learning rate for each parameter: it accumulates the squares of the gradients in a vector \(s\) and divides the gradient by \(\sqrt{s+\varepsilon}\), so the dimensions with the steepest slopes get slowed down the most. It generally performs well on simple quadratic problems, but when training neural networks it often stops too early: the learning rate gets scaled down so much that the algorithm comes to a halt before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, do not use it to train deep neural networks.

Update equations:

\((1)s \leftarrow s + \nabla _\theta J(\theta) \otimes \nabla _\theta J(\theta)\)

\((2)\theta \leftarrow \theta - \eta \nabla _\theta J(\theta) \oslash \sqrt{s+\varepsilon}\)
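
The reason AdaGrad grinds to a halt is visible directly in equation (1): the squared gradients keep accumulating in \(s\), so the effective step size \(\eta / \sqrt{s+\varepsilon}\) shrinks toward zero. A minimal NumPy sketch with an assumed constant gradient (all values are illustrative, not from the original text):

import numpy as np

# Minimal sketch: AdaGrad with a constant gradient, showing the ever-shrinking step size
eta, eps = 0.1, 1e-8
theta = np.array([1.0, 1.0])
grad = np.array([1.0, 0.1])                         # illustrative constant gradient
s = np.zeros_like(theta)
for step in range(100):
    s = s + grad * grad                             # equation (1): squared gradients accumulate without decay
    theta = theta - eta * grad / np.sqrt(s + eps)   # equation (2): effective step ~ eta/sqrt(step), tends to 0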

RMSProp

AdaGrad slows down too fast and never converges to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (rather than all the gradients since the beginning of training), which it does by using exponential decay in the first step.

Update equations:

\((1)s \leftarrow \beta s + (1-\beta)\nabla _\theta J(\theta) \otimes \nabla _\theta J(\theta)\)

\((2)\theta \leftarrow \theta - \eta \nabla _\theta J(\theta) \oslash \sqrt{s+\varepsilon}\)

The decay rate \(\beta\) is typically set to 0.9.
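
Compared with AdaGrad, the only change is the exponential decay in the accumulator, which keeps the effective step size from collapsing to zero. A minimal NumPy sketch under the same illustrative assumptions as above:

import numpy as np

# Minimal sketch: RMSProp keeps only a decaying average of the squared gradients
beta, eta, eps = 0.9, 0.01, 1e-10
theta = np.array([1.0, 1.0])
grad = np.array([1.0, 0.1])                         # illustrative constant gradient
s = np.zeros_like(theta)
for step in range(100):
    s = beta * s + (1 - beta) * grad * grad         # equation (1): old squared gradients decay away
    theta = theta - eta * grad / np.sqrt(s + eps)   # equation (2): effective step levels off instead of vanishing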

optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,momentum=0.9,decay=0.9,epsilon=1e-10)

Except on very simple problems, this optimizer almost always performs better than AdaGrad, and it generally also performs better than momentum optimization and NAG. In fact, until Adam appeared, it was the optimization algorithm preferred by many researchers.

Adam Optimization

Adam stands for adaptive moment estimation and combines the ideas of momentum optimization and RMSProp: like momentum optimization, it keeps track of an exponentially decaying average of past gradients, and like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

The Adam algorithm:

\((1)m \leftarrow \beta_1 m + (1-\beta_1) \nabla _\theta J(\theta)\)

\((2)s \leftarrow \beta_2s +(1-\beta_2)\nabla _\theta J(\theta) \otimes \nabla _\theta J(\theta)\)

\((3)\hat{m} \leftarrow \frac{m}{1-\beta_1^T}\)

\((4)\hat{s} \leftarrow \frac{s}{1-\beta_2^T}\)

\((5)\theta \leftarrow \theta - \eta \hat{m}\oslash \sqrt{\hat{s}+\varepsilon}\)

Note: T is the iteration number (starting at 1). Common defaults are \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\varepsilon = 10^{-8}\) (these are also TensorFlow's AdamOptimizer defaults).
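
As a worked illustration (not part of the original text), here is a minimal NumPy sketch of the full Adam loop on the toy objective \(J(\theta) = \frac{1}{2}\|\theta\|^2\), using the defaults above. Note that the bias-corrected averages are kept in separate variables so they do not feed back into the running averages:

import numpy as np

# Minimal sketch: Adam on the toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is theta
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for T in range(1, 1001):                                 # T is the iteration number, starting at 1
    grad = theta
    m = beta1 * m + (1 - beta1) * grad                   # equation (1)
    s = beta2 * s + (1 - beta2) * grad * grad            # equation (2)
    m_hat = m / (1 - beta1 ** T)                         # equation (3): bias correction
    s_hat = s / (1 - beta2 ** T)                         # equation (4): bias correction
    theta = theta - eta * m_hat / np.sqrt(s_hat + eps)   # equation (5)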

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

Testing the Adam optimizer on MNIST

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected,batch_norm
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data',one_hot=True)

tf.reset_default_graph()
n_input = 784
n_hidden1 = 300
n_hidden2 = 100
n_output = 10

X = tf.placeholder(tf.float32,shape=(None,n_input),name='X')
Y = tf.placeholder(tf.float32,shape=(None,10),name='Y')   # one-hot labels, float32 to match the loss
# batch normalization parameters
is_training = tf.placeholder(tf.bool,shape=(),name='is_training')
bn_params = {'is_training':is_training,'decay':0.99,'updates_collections':None}

with tf.name_scope('dnn'):
    with tf.contrib.framework.arg_scope([fully_connected],normalizer_fn=batch_norm,normalizer_params=bn_params):
        hidden1 = fully_connected(X,n_hidden1,activation_fn=tf.nn.elu,scope='hidden1')
        hidden2 = fully_connected(hidden1,n_hidden2,activation_fn=tf.nn.elu,scope='hidden2')
        # output raw logits; the softmax is applied inside the loss function below
        logits = fully_connected(hidden2,n_output,activation_fn=None,scope='output')
        y_prab = tf.nn.softmax(logits)
with tf.name_scope('train'):
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y,logits=logits))
    learning_rate = tf.placeholder(tf.float32,shape=(),name='learning_rate')
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
with tf.name_scope('accuracy'):
    prab_bool = tf.equal(tf.argmax(y_prab,1),tf.argmax(Y,1))
    accuracy = tf.reduce_mean(tf.cast(prab_bool,tf.float32))
with tf.name_scope('tensorboard_mnist'):
    file_writer = tf.summary.FileWriter('./tensorboard/',tf.get_default_graph())
    accuracy_summary = tf.summary.scalar('accuracy',accuracy)
with tf.name_scope('saver'):
    saver = tf.train.Saver()
with tf.name_scope('collection'):
    tf.add_to_collection('logits',logits)
    
epoches = 20
batch_size = 100
n_batches = mnist.train.num_examples // batch_size
rate = 0.1
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(epoches):
        for batch in range(n_batches):
            x_batch,y_batch = mnist.train.next_batch(batch_size)
            sess.run(optimizer,feed_dict={X:x_batch,Y:y_batch,learning_rate:rate,is_training:True})
        result = sess.run([accuracy,accuracy_summary],feed_dict={X:mnist.test.images,Y:mnist.test.labels,
                                                                 learning_rate:rate,is_training:False})
        
        file_writer.add_summary(result[1],epoch)
        print('epoch:{},accuracy:{}'.format(epoch,result[0]))
    saver.save(sess,'./model/model_final.ckpt',global_step=5)
    print('stop')
Extracting MNIST_data\train-images-idx3-ubyte.gz
Extracting MNIST_data\train-labels-idx1-ubyte.gz
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz
epoch:0,accuracy:0.945900022983551
epoch:1,accuracy:0.9574999809265137
epoch:2,accuracy:0.9635000228881836
epoch:3,accuracy:0.9693999886512756
epoch:4,accuracy:0.970300018787384
epoch:5,accuracy:0.9704999923706055
epoch:6,accuracy:0.9758999943733215
epoch:7,accuracy:0.9757999777793884
epoch:8,accuracy:0.9768999814987183
epoch:9,accuracy:0.9783999919891357
epoch:10,accuracy:0.9783999919891357
epoch:11,accuracy:0.9642999768257141
epoch:12,accuracy:0.9779999852180481
epoch:13,accuracy:0.9799000024795532
epoch:14,accuracy:0.9760000109672546
epoch:15,accuracy:0.977400004863739
epoch:16,accuracy:0.9819999933242798
epoch:17,accuracy:0.9781000018119812
epoch:18,accuracy:0.9661999940872192
epoch:19,accuracy:0.9779000282287598
stop

