[TensorFlow 2.0] Optimizers

This article is a repost. 2020-04-13 10:52, tagged tensorflow

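The machine learning world has its own guild of alchemists. Their daily routine goes like this:

Gather the herbs (data), set up the Eight Trigrams furnace (model), light the Six-Flavor True Fire (optimization algorithm), then sit back with a palm-leaf fan and wait for the elixir to come out.

But anyone who has worked as a cook knows that with the same ingredients and the same recipe, a different heat produces a wildly different dish. Too low and the food is undercooked; too high and it burns; uneven and it comes out half raw, half burnt.

Machine learning is no different: the choice of optimization algorithm directly affects the performance of the final model. When results are poor, the cause is not necessarily the features or the model design; it may well be the optimization algorithm.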

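Deep learning optimization algorithms have evolved roughly along the path SGD -> SGDM -> NAG -> Adagrad -> Adadelta (RMSprop) -> Adam -> Nadam.

For details, see "A Single Framework for Understanding Optimization Algorithms: SGD/AdaGrad/Adam":

https://zhuanlan.zhihu.com/p/32230623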

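For most novice alchemists, using the Adam optimizer with its default parameters is good enough.

Some alchemists who like writing papers, and who therefore chase better evaluation metrics, may prefer to use Adam in the early stage for a fast descent and then switch to SGD with carefully tuned parameters in the later stage to get a better final result.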
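The two-phase idea can be sketched on the same toy quadratic used later in this article. This is only an illustration: the helper name run_phase, the step counts, and the learning rates are assumptions chosen for the example, not recommended settings, and real training would normally switch optimizers on a model rather than on a single variable.

```python
import tensorflow as tf

# Toy objective: f(x) = x**2 - 2*x + 1, with its minimum at x = 1.
x = tf.Variable(0.0, name="x", dtype=tf.float32)

def loss_fn():
    return x * x - 2.0 * x + 1.0

def run_phase(optimizer, steps):
    """Hypothetical helper: apply `steps` eager gradient updates to x."""
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = loss_fn()
        grad = tape.gradient(loss, x)
        optimizer.apply_gradients([(grad, x)])
    return float(x.numpy())

# Phase 1: Adam for a fast initial descent (values are arbitrary).
x_after_adam = run_phase(tf.keras.optimizers.Adam(learning_rate=0.1), steps=50)

# Phase 2: plain SGD, continuing from where Adam stopped.
x_after_sgd = run_phase(tf.keras.optimizers.SGD(learning_rate=0.1), steps=100)

print("x after Adam phase:", x_after_adam)
print("x after SGD phase: ", x_after_sgd)
```

Each phase uses a fresh optimizer object, so the SGD phase does not inherit any of Adam's internal state; only the variable x carries over.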

There are also some more recent optimization algorithms that are reported to outperform Adam, such as LazyAdam, Look-ahead, RAdam, and Ranger.

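1. Using optimizers

An optimizer is mainly used through its apply_gradients method, which takes variables and their corresponding gradients and updates the given variables, or through its minimize method, which iteratively optimizes an objective function directly.

A more common approach, of course, is to pass the optimizer to a Keras Model at compile time and let model.fit iteratively optimize the loss.

When an optimizer is initialized, it creates a variable, optimizer.iterations, that records the number of iterations. For this reason an optimizer, like a tf.Variable, should generally be created outside of @tf.function.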
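As a minimal sketch of that last point, the snippet below creates the variable and the optimizer once at module level and only applies gradients inside the traced function. The names w, opt, and train_step and the loss w**2 are assumptions made for this illustration.

```python
import tensorflow as tf

# Both objects are created once, outside the traced function.
w = tf.Variable(5.0, name="w")
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step():
    # Inside the graph we only read and update existing state.
    with tf.GradientTape() as tape:
        loss = tf.square(w)
    grad = tape.gradient(loss, w)
    opt.apply_gradients([(grad, w)])
    return loss

for _ in range(3):
    train_step()

# optimizer.iterations counts how many times gradients were applied.
steps_taken = int(opt.iterations.numpy())
print("iterations:", steps_taken)
print("w:", float(w.numpy()))
```

Creating the optimizer inside train_step instead would mean creating new variables on every trace, which @tf.function does not allow after the first call.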

import tensorflow as tf
import numpy as np 
 
# Print a separator line followed by the current time
@tf.function
def printbar():
    ts = tf.timestamp()
    today_ts = ts%(24*60*60)
 
    # +8 shifts the UTC hour to UTC+8
    hour = tf.cast(today_ts//3600+8,tf.int32)%tf.constant(24)
    minute = tf.cast((today_ts%3600)//60,tf.int32)
    second = tf.cast(tf.floor(today_ts%60),tf.int32)
 
    # Left-pad single-digit values with a zero
    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}",m))==1:
            return(tf.strings.format("0{}",m))
        else:
            return(tf.strings.format("{}",m))
 
    timestring = tf.strings.join([timeformat(hour),timeformat(minute),
                timeformat(second)],separator = ":")
    tf.print("=========="*8,end = "")
    tf.print(timestring)
 
# Find the minimum of f(x) = a*x**2 + b*x + c
 
# Using optimizer.apply_gradients
 
x = tf.Variable(0.0,name = "x",dtype = tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
 
@tf.function
def minimizef():
    a = tf.constant(1.0)
    b = tf.constant(-2.0)
    c = tf.constant(1.0)
 
    while tf.constant(True): 
        with tf.GradientTape() as tape:
            y = a*tf.pow(x,2) + b*x + c
        dy_dx = tape.gradient(y,x)
        optimizer.apply_gradients(grads_and_vars=[(dy_dx,x)])
 
        # Stopping condition: the gradient is close to zero
        if tf.abs(dy_dx)<tf.constant(0.00001):
            break
 
        if tf.math.mod(optimizer.iterations,100)==0:
            printbar()
            tf.print("step = ",optimizer.iterations)
            tf.print("x = ", x)
            tf.print("")
 
    y = a*tf.pow(x,2) + b*x + c
    return y
 
tf.print("y =",minimizef())
tf.print("x =",x)

================================================================================10:50:09
step =  100
x =  0.867380381

================================================================================10:50:09
step =  200
x =  0.98241204

================================================================================10:50:09
step =  300
x =  0.997667611

================================================================================10:50:09
step =  400
x =  0.999690652

================================================================================10:50:09
step =  500
x =  0.999959

================================================================================10:50:09
step =  600
x =  0.999994457

y = 0
x = 0.999995172

# Find the minimum of f(x) = a*x**2 + b*x + c
 
# Using optimizer.minimize
 
x = tf.Variable(0.0,name = "x",dtype = tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)   
 
def f():   
    a = tf.constant(1.0)
    b = tf.constant(-2.0)
    c = tf.constant(1.0)
    y = a*tf.pow(x,2)+b*x+c
    return(y)
 
@tf.function
def train(epoch = 1000):  
    for _ in tf.range(epoch):  
        optimizer.minimize(f,[x])
    tf.print("epoch = ",optimizer.iterations)
    return(f())
 
train(1000)
tf.print("y = ",f())
tf.print("x = ",x)

epoch = 1000

y = 0

x = 0.99999851

# Find the minimum of f(x) = a*x**2 + b*x + c
# Using model.fit
 
tf.keras.backend.clear_session()
 
class FakeModel(tf.keras.models.Model):
    def __init__(self,a,b,c):
        super(FakeModel,self).__init__()
        self.a = a
        self.b = b
        self.c = c
 
    def build(self):
        self.x = tf.Variable(0.0,name = "x")
        self.built = True
 
    def call(self,features):
        loss  = self.a*(self.x)**2+self.b*(self.x)+self.c
        return(tf.ones_like(features)*loss)
 
def myloss(y_true,y_pred):
    return tf.reduce_mean(y_pred)
 
model = FakeModel(tf.constant(1.0),tf.constant(-2.0),tf.constant(1.0))
 
model.build()
model.summary()
 
model.compile(optimizer = 
              tf.keras.optimizers.SGD(learning_rate=0.01),loss = myloss)
history = model.fit(tf.zeros((100,2)),
                    tf.ones(100),batch_size = 1,epochs = 10)  # 1000 iterations in total (100 steps x 10 epochs)
 
tf.print("x=",model.x)
tf.print("loss=",model(tf.constant(0.0)))

Model: "fake_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Total params: 1
Trainable params: 1
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
100/100 [==============================] - 0s 901us/step - loss: 0.2481
Epoch 2/10
100/100 [==============================] - 0s 940us/step - loss: 0.0044
Epoch 3/10
100/100 [==============================] - 0s 926us/step - loss: 7.6740e-05
Epoch 4/10
100/100 [==============================] - 0s 908us/step - loss: 1.3500e-06
Epoch 5/10
100/100 [==============================] - 0s 909us/step - loss: 1.8477e-08
Epoch 6/10
100/100 [==============================] - 0s 965us/step - loss: 0.0000e+00
Epoch 7/10
100/100 [==============================] - 0s 842us/step - loss: 0.0000e+00
Epoch 8/10
100/100 [==============================] - 0s 828us/step - loss: 0.0000e+00
Epoch 9/10
100/100 [==============================] - 0s 837us/step - loss: 0.0000e+00
Epoch 10/10
100/100 [==============================] - 0s 936us/step - loss: 0.0000e+00
x= 0.99999851
loss= 0

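2. Built-in optimizers

As noted earlier, deep learning optimization algorithms have evolved roughly along the path SGD -> SGDM -> NAG -> Adagrad -> Adadelta (RMSprop) -> Adam -> Nadam.

Almost all of them have a corresponding class in the keras.optimizers submodule.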
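For orientation, here is one way the algorithm names in that chain might be mapped onto tf.keras.optimizers classes. The dictionary name and the learning-rate and momentum values are placeholders chosen for this sketch, not tuned recommendations.

```python
import tensorflow as tf

# Algorithm name -> optimizer instance. Hyperparameter values are placeholders.
optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "SGDM": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "NAG": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "Adagrad": tf.keras.optimizers.Adagrad(learning_rate=0.01),
    "RMSprop": tf.keras.optimizers.RMSprop(learning_rate=0.001),
    "Adadelta": tf.keras.optimizers.Adadelta(learning_rate=1.0),
    "Adam": tf.keras.optimizers.Adam(learning_rate=0.001),
    "Nadam": tf.keras.optimizers.Nadam(learning_rate=0.001),
}

for name, opt in optimizers.items():
    print(name, "->", type(opt).__name__)
```

Note that SGD, SGDM, and NAG all share the single SGD class and differ only in the momentum and nesterov arguments.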

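SGD: with default parameters this is plain SGD. Setting momentum to a nonzero value turns it into SGDM, which takes first-order momentum into account. Setting nesterov to True turns it into NAG (Nesterov Accelerated Gradient), which evaluates the gradient at the position one step ahead.
Adagrad: takes second-order momentum (accumulated squared gradients) into account, so each parameter gets its own learning rate, i.e. an adaptive learning rate. Its drawback is that the learning rate decreases monotonically, so learning may become too slow late in training or even stop prematurely.
RMSprop: takes second-order momentum into account and gives each parameter its own adaptive learning rate. It improves on Adagrad by using exponential smoothing, so only the second-order momentum within a recent window is considered.
Adadelta: takes second-order momentum into account. It is similar to RMSprop but somewhat more complex and more adaptive.
Adam: takes both first-order and second-order momentum into account. It can be seen as RMSprop with Momentum added.
Nadam: Adam with Nesterov acceleration added.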
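To make the first-order and second-order momentum terms concrete, the following pure-Python sketch applies textbook forms of four of these update rules to the scalar objective f(x) = x**2 - 2*x + 1. These are simplified illustrations, not the Keras implementations (which differ in details such as initial accumulator values and epsilon placement), and all function names and hyperparameter values are assumptions for this example.

```python
import math

def grad(x):
    # Gradient of f(x) = x**2 - 2*x + 1
    return 2.0 * x - 2.0

def sgd_momentum(x, steps=500, lr=0.01, beta=0.9):
    m = 0.0
    for _ in range(steps):
        m = beta * m + grad(x)          # first-order momentum: decayed sum of gradients
        x -= lr * m
    return x

def adagrad(x, steps=500, lr=0.5, eps=1e-7):
    v = 0.0
    for _ in range(steps):
        g = grad(x)
        v += g * g                      # accumulates over all history, so it never shrinks
        x -= lr * g / (math.sqrt(v) + eps)
    return x

def rmsprop(x, steps=500, lr=0.01, rho=0.9, eps=1e-7):
    v = 0.0
    for _ in range(steps):
        g = grad(x)
        v = rho * v + (1 - rho) * g * g  # exponential smoothing: recent window only
        x -= lr * g / (math.sqrt(v) + eps)
    return x

def adam(x, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first-order momentum
        v = beta2 * v + (1 - beta2) * g * g  # second-order momentum
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

results = {
    "SGDM": sgd_momentum(0.0),
    "Adagrad": adagrad(0.0),
    "RMSprop": rmsprop(0.0),
    "Adam": adam(0.0),
}
for name, x_final in results.items():
    print(f"{name:8s} x = {x_final:.6f}")
```

The only difference between adagrad and rmsprop here is how v is updated: a running sum versus an exponential moving average, which is what keeps RMSprop's effective learning rate from decaying monotonically.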

References:

Open-source e-book: https://lyhue1991.github.io/eat_tensorflow2_in_30_days/

GitHub project: https://github.com/lyhue1991/eat_tensorflow2_in_30_days
