How to Set the Learning Rate in TensorFlow


  The previous article, "Principles and Comparison of Optimization Algorithms for Deep Neural Networks", introduced the gradient descent optimization algorithms commonly used in deep learning. One important hyperparameter, the learning rate \(\alpha\), must be specified before training, and its setting matters a great deal: a learning rate that is too small slows down optimization and lengthens training, while one that is too large may keep the result from converging or cause it to oscillate over a wide range. It is therefore well worth adjusting the learning rate during training according to the number of iterations completed.

  This article therefore covers how learning rates are used in TensorFlow and the several learning-rate schedules it provides. Citations are not called out individually in the text; the main reference is acknowledged at the end of the article.

  The learning-rate schedules covered here are:

  • Exponential decay: tf.train.exponential_decay()
  • Piecewise constant decay: tf.train.piecewise_constant()
  • Natural exponential decay: tf.train.natural_exp_decay()
  • Polynomial decay: tf.train.polynomial_decay()
  • Inverse time decay: tf.train.inverse_time_decay()
  • Cosine decay: tf.train.cosine_decay()

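Before looking at the individual schedules, here is a minimal sketch (TF 1.x style, with a hypothetical toy loss standing in for a real model) of how a decayed learning rate is typically wired into training: the schedule reads global_step, and passing global_step to minimize() makes the optimizer increment it after every update so the schedule advances automatically.

# coding:utf-8
import tensorflow as tf

# hypothetical toy loss; in practice this is the model's loss tensor
loss = tf.Variable(1.0, name='toy_loss')

global_step = tf.Variable(0, name='global_step', trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step,
    decay_steps=10, decay_rate=0.9, staircase=True)

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# global_step is incremented by minimize() after each training step
train_op = optimizer.minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(30):
        _, lr = sess.run([train_op, learning_rate])
    print(lr)  # learning rate after 30 steps
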
1. Exponential Decay

tf.train.exponential_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate,
    staircase=False,
    name=None)
Parameter        Description
learning_rate    initial learning rate
global_step      current iteration count (global step)
decay_steps      decay period; when staircase=True the learning rate is held constant within each span of decay_steps steps, giving a discrete (staircase) schedule
decay_rate       decay rate
staircase        whether to use the discrete (staircase) schedule, defaults to False
name             operation name, defaults to ExponentialDecay

Computation:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
# if staircase=True, the exponent is floor(global_step / decay_steps), so the learning rate changes only once every decay_steps iterations
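
As a quick numerical check of the formula (illustrative numbers of my own, matching the example below): with learning_rate=0.5, decay_rate=0.9 and decay_steps=10, the value at step 25 works out as follows.

# worked check of the exponential decay formula (illustrative values)
lr0, rate, steps, step = 0.5, 0.9, 10, 25
continuous = lr0 * rate ** (step / steps)   # 0.5 * 0.9**2.5 ~= 0.384
staircase = lr0 * rate ** (step // steps)   # 0.5 * 0.9**2   =  0.405
print(continuous, staircase)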

Example:

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

global_step = tf.Variable(0, name='global_step', trainable=False)  # iteration counter

y = []
z = []
EPOCH = 200

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # staircase (discrete) decay
        learing_rate1 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
        # standard (continuous) exponential decay
        learing_rate2 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])

plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.title('exponential_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['staircase', 'continuous'], loc='upper right')
plt.show()

2. Piecewise Constant Decay

tf.train.piecewise_constant(
    x, 
    boundaries, 
    values, 
    name=None)
Parameter     Description
x             equivalent to global_step, the iteration count
boundaries    list of step boundaries at which the learning rate changes
values        list of learning-rate values, one per interval (one more element than boundaries)
name          operation name, defaults to PiecewiseConstant

Computation:

# parameter
global_step = tf.Variable(0, trainable=False)
boundaries = [100, 200]
values = [1.0, 0.5, 0.1]
# learning_rate
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
# interpretation:
# for global_step <= 100,        learning_rate = 1.0
# for 100 < global_step <= 200,  learning_rate = 0.5
# for global_step > 200,         learning_rate = 0.1

Example:

# coding:utf-8

import matplotlib.pyplot as plt
import tensorflow as tf

global_step = tf.Variable(0, name='global_step', trainable=False)
boundaries = [10, 20, 30]
learing_rates = [0.1, 0.07, 0.025, 0.0125]

y = []
N = 40

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        learing_rate = tf.train.piecewise_constant(global_step, boundaries=boundaries, values=learing_rates)
        lr = sess.run([learing_rate])
        y.append(lr)

x = range(N)
plt.plot(x, y, 'r-', linewidth=2)
plt.title('piecewise_constant')
plt.show()

3. Natural Exponential Decay

Similar to exponential decay, and likewise a function of the current step, except that the base is e.

tf.train.natural_exp_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate,
    staircase=False,
    name=None
)
Parameter        Description
learning_rate    initial learning rate
global_step      current iteration count
decay_steps      decay period; when staircase=True the learning rate is held constant within each span of decay_steps steps, giving a discrete (staircase) schedule
decay_rate       decay rate
staircase        whether to use the discrete (staircase) schedule, defaults to False
name             operation name, defaults to ExponentialTimeDecay

Computation:

decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
# if staircase=True, the exponent uses floor(global_step / decay_steps), so the learning rate changes only once every decay_steps iterations

Example:

# coding:utf-8

import matplotlib.pyplot as plt
import tensorflow as tf

global_step = tf.Variable(0, name='global_step', trainable=False)

y = []
z = []
w = []
m = []
EPOCH = 200

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):

        # staircase (discrete) natural exponential decay
        learing_rate1 = tf.train.natural_exp_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)

        # standard (continuous) natural exponential decay
        learing_rate2 = tf.train.natural_exp_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)

        # staircase exponential decay (for comparison)
        learing_rate3 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)

        # standard exponential decay (for comparison)
        learing_rate4 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        lr3 = sess.run([learing_rate3])
        lr4 = sess.run([learing_rate4])

        y.append(lr1)
        z.append(lr2)
        w.append(lr3)
        m.append(lr4)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])

plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, w, 'r--', linewidth=2)
plt.plot(x, m, 'g--', linewidth=2)

plt.title('natural_exp_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['natural_staircase', 'natural_continuous', 'staircase', 'continuous'], loc='upper right')
plt.show()

As the plot shows, natural exponential decay shrinks the learning rate far more aggressively than ordinary exponential decay.
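
A rough numerical comparison (my own numbers, assuming the formulas above) makes the gap concrete: with learning_rate=0.5, decay_rate=0.9 and decay_steps=10, the two continuous schedules differ by a factor of about fifty at step 50.

import math

lr0, rate, steps, step = 0.5, 0.9, 10, 50
natural = lr0 * math.exp(-rate * step / steps)   # 0.5 * exp(-4.5) ~= 0.0056
exponential = lr0 * rate ** (step / steps)       # 0.5 * 0.9**5   ~= 0.295
print(natural, exponential)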

4. Polynomial Decay

tf.train.polynomial_decay(
    learning_rate,
    global_step,
    decay_steps,
    end_learning_rate=0.0001,
    power=1.0,
    cycle=False,
    name=None)
Parameter            Description
learning_rate        initial learning rate
global_step          current iteration count
decay_steps          decay period
end_learning_rate    minimum (final) learning rate, defaults to 0.0001
power                exponent of the polynomial, defaults to 1
cycle                bool; whether the learning rate rises again and re-decays after reaching the minimum, defaults to False
name                 operation name, defaults to PolynomialDecay

Computation:

# when cycle=False
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                          (1 - global_step / decay_steps) ^ (power) +
                          end_learning_rate
# when cycle=True
decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                          (1 - global_step / decay_steps) ^ (power) +
                          end_learning_rate

Example:

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 200

global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # cycle=False
        learing_rate1 = tf.train.polynomial_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            end_learning_rate=0.01, power=0.5, cycle=False)
        # cycle=True
        learing_rate2 = tf.train.polynomial_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            end_learning_rate=0.01, power=0.5, cycle=True)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, y, 'r--', linewidth=2)
plt.title('polynomial_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['cycle=True', 'cycle=False'], loc='upper right')
plt.show()

As the plot shows, the learning rate reaches its minimum after decay_steps=50 iterations. With cycle=False it then stays at that preset minimum; with cycle=True it jumps back up and decays again.

The point of letting the learning rate rise and fall repeatedly in polynomial decay is to avoid getting stuck: late in training a very small learning rate can leave the network parameters trapped in a local optimum, and raising the learning rate again gives them a chance to escape it.
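
To make the cycle=True branch concrete, here is a hand computation (my own example) for global_step=120 with the settings used above (learning_rate=0.1, end_learning_rate=0.01, decay_steps=50, power=0.5).

import math

lr0, lr_end, steps, power, step = 0.1, 0.01, 50, 0.5, 120
# with cycle=True the decay period is stretched to the next multiple of decay_steps
steps_cycled = steps * math.ceil(step / steps)   # 3 * 50 = 150
decayed = (lr0 - lr_end) * (1 - step / steps_cycled) ** power + lr_end
print(steps_cycled, decayed)                     # 150, ~0.050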

5. Inverse Time Decay

tf.train.inverse_time_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate,
    staircase=False,
    name=None)
Parameter        Description
learning_rate    initial learning rate
global_step      current iteration count
decay_steps      decay period
decay_rate       decay rate
staircase        whether to produce a discrete (staircase) schedule, defaults to False
name             operation name, defaults to InverseTimeDecay

Computation:

# if staircase=False, the decayed learning rate is continuous:
decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_step)

# if staircase=True, the decayed learning rate is discrete:
decayed_learning_rate = learning_rate / (1 + decay_rate * floor(global_step / decay_step))

Example:

# coding:utf-8

import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 200
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # staircase (discrete) decay
        learing_rate1 = tf.train.inverse_time_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=20,
            decay_rate=0.2, staircase=True)

        # continuous decay
        learing_rate2 = tf.train.inverse_time_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=20,
            decay_rate=0.2, staircase=False)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])

        y.append(lr1)
        z.append(lr2)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2)
plt.plot(x, y, 'g-', linewidth=2)
plt.title('inverse_time_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['continuous', 'staircase'])
plt.show()

Again, the learning rate decreases as the iteration count grows, and the size of each decrease also shrinks over time.
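
Plugging the example's settings (learning_rate=0.1, decay_rate=0.2, decay_steps=20) into the continuous formula shows the shrinking step sizes (my own numbers):

lr0, rate, steps = 0.1, 0.2, 20
for step in (0, 100, 200):
    print(step, lr0 / (1 + rate * step / steps))   # 0.1, 0.05, ~0.033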

6. Cosine Decay

The TensorFlow version I used before was 1.4, which had no cosine learning-rate decay; after upgrading to 1.9 I found four new cosine-decay functions, introduced below.

6.1 Standard Cosine Decay

Source: [Loshchilov & Hutter, ICLR2016], SGDR: Stochastic Gradient Descent with Warm Restarts. https://arxiv.org/abs/1608.03983

tf.train.cosine_decay(
    learning_rate, 
    global_step, 
    decay_steps, 
    alpha=0.0, 
    name=None)
Parameter        Description
learning_rate    initial learning rate
global_step      current iteration count
decay_steps      decay period
alpha            minimum learning rate, as a fraction of learning_rate, defaults to 0
name             operation name, defaults to CosineDecay

Computation:

global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed
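
As a quick sanity check (my own numbers): halfway through the decay the cosine term equals one half, so with learning_rate=0.1, decay_steps=50 and alpha=0 the rate at step 25 is exactly 0.05.

import math

lr0, steps, alpha, step = 0.1, 50, 0.0, 25
cosine = 0.5 * (1 + math.cos(math.pi * step / steps))   # cos(pi/2) = 0 -> 0.5
print(lr0 * ((1 - alpha) * cosine + alpha))             # 0.05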

Example:

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 200
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # cosine decay
        learing_rate1 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50)
        learing_rate2 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=100)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=50', 'decay_steps=100'], loc='upper right')
plt.show()

6.2 Cosine Decay with Restarts

Source: [Loshchilov & Hutter, ICLR2016], SGDR: Stochastic Gradient Descent with Warm Restarts. https://arxiv.org/abs/1608.03983

tf.train.cosine_decay_restarts(
    learning_rate,
    global_step,
    first_decay_steps,
    t_mul=2.0,
    m_mul=1.0,
    alpha=0.0,
    name=None)
Parameter            Description
learning_rate        initial learning rate
global_step          current iteration count
first_decay_steps    number of steps in the first decay period
t_mul                used to derive the number of iterations in the i-th period
m_mul                used to derive the initial learning rate of the i-th period
alpha                minimum learning rate, as a fraction of learning_rate, defaults to 0
name                 operation name, defaults to SGDRDecay

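There is no short pseudocode for this one in the documentation, but the roles of t_mul and m_mul can be sketched as follows (my own illustration of the documented behaviour, not TensorFlow's implementation): with first_decay_steps=40 and the defaults t_mul=2.0, m_mul=1.0, the decay restarts after 40 steps, then after 80 more, then 160 more, and each restart begins from m_mul times the previous period's starting rate.

# rough sketch of where the warm restarts fall (illustrative, not TF's code)
first_decay_steps, t_mul, m_mul, lr0 = 40, 2.0, 1.0, 0.1

period_len, start_lr, boundary = first_decay_steps, lr0, 0
for i in range(4):
    boundary += period_len
    print('period %d: %d steps, starts at lr %.3f, restarts at step %d'
          % (i, period_len, start_lr, boundary))
    period_len = int(period_len * t_mul)   # each period is t_mul times longer
    start_lr *= m_mul                      # and starts from m_mul times the previous start
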
Example:

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 100
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # cosine decay with warm restarts
        learing_rate1 = tf.train.cosine_decay_restarts(learning_rate=0.1, global_step=global_step,
                                           first_decay_steps=40)
        learing_rate2 = tf.train.cosine_decay_restarts(learning_rate=0.1, global_step=global_step,
                                                       first_decay_steps=60)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('cosine_decay_restarts')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['first_decay_steps=40', 'first_decay_steps=60'], loc='upper right')
plt.show()

6.3 Linear Cosine Decay

Source: [Bello et al., ICML2017] Neural Optimizer Search with RL. https://arxiv.org/abs/1709.07417

tf.train.linear_cosine_decay(
    learning_rate,
    global_step,
    decay_steps,
    num_periods=0.5,
    alpha=0.0,
    beta=0.001,
    name=None)
Parameter        Description
learning_rate    initial learning rate
global_step      current iteration count
decay_steps      decay period
num_periods      number of periods in the cosine part of the decay
alpha            see the computation below, defaults to 0
beta             see the computation below, defaults to 0.001
name             operation name, defaults to LinearCosineDecay

Computation:

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed

Example:

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 100
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # linear cosine decay
        learing_rate1 = tf.train.linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=40,
            num_periods=0.2, alpha=0.5, beta=0.2)
        learing_rate2 = tf.train.linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=60,
            num_periods=0.2, alpha=0.5, beta=0.2)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        y.append(lr1)
        z.append(lr2)


x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('linear_cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=40', 'decay_steps=60'], loc='upper right')
plt.show()

6.4 Noisy Linear Cosine Decay

Source: [Bello et al., ICML2017] Neural Optimizer Search with RL. https://arxiv.org/abs/1709.07417

tf.train.noisy_linear_cosine_decay(
    learning_rate,
    global_step,
    decay_steps,
    initial_variance=1.0,
    variance_decay=0.55,
    num_periods=0.5,
    alpha=0.0,
    beta=0.001,
    name=None)
Parameter           Description
learning_rate       initial learning rate
global_step         current iteration count
decay_steps         decay period
initial_variance    initial variance of the noise
variance_decay      decay rate of the noise variance
num_periods         number of periods in the cosine part of the decay
alpha               see the computation below
beta                see the computation below
name                operation name, defaults to NoisyLinearCosineDecay

Computation:

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (
    1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed
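
Here eps_t is the noise term; according to the TensorFlow documentation it is zero-centered Gaussian noise whose variance decays as initial_variance / (1 + global_step) ** variance_decay.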

Example:

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 100
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # noisy linear cosine decay
        learing_rate1 = tf.train.noisy_linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=40,
            initial_variance=0.01, variance_decay=0.1, num_periods=2, alpha=0.5, beta=0.2)
        learing_rate2 = tf.train.noisy_linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=60,
            initial_variance=0.01, variance_decay=0.1, num_periods=2, alpha=0.5, beta=0.2)

        lr1 = sess.run([learing_rate1])
        lr2 = sess.run([learing_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('noisy_linear_cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=40', 'decay_steps=60'], loc='upper right')
plt.show()

  To wrap up: having walked through every learning-rate decay scheme TensorFlow provides, I realize that what I have actually learned is only which schemes exist and roughly how to call them. How to set the parameters of a particular scheme, the mathematics behind it, and which method suits which situation are questions I still cannot answer. If I gain new insight while using them later, I will come back and extend this post.

  If you spot any mistakes, please do not hesitate to point them out; thank you.

  Thanks to the post "tensorflow中學習率更新策略" by 未雨愁眸, which this article mainly draws on; similar articles appear elsewhere as well, and I have not been able to verify who published the material first.

Author's personal website: https://chenzhen.onliine


