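The previous post, "Principles and Comparison of Optimization Algorithms in Deep Neural Networks", introduced the gradient-descent optimizers commonly used in deep learning. They share one important hyperparameter, the learning rate \(\alpha\), which has to be specified before training. Choosing it well matters: a learning rate that is too small slows optimization and lengthens training, while one that is too large can keep the result from converging or leave it oscillating over a wide range. It is therefore often necessary to adjust the learning rate during training as a function of the iteration count.
This post covers how learning rates are used in TensorFlow and the learning-rate schedules that TensorFlow provides. References are not cited individually in the text; links to them are given at the end of the post.
The schedules covered are: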
- Exponential decay: tf.train.exponential_decay()
- Piecewise constant decay: tf.train.piecewise_constant()
- Natural exponential decay: tf.train.natural_exp_decay()
- Polynomial decay: tf.train.polynomial_decay()
- Inverse time decay: tf.train.inverse_time_decay()
- Cosine decay: tf.train.cosine_decay()
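1. Exponential decay
tf.train.exponential_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate,
    staircase=False,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. With staircase=True the learning rate stays constant within each block of decay_steps iterations, which gives a discrete (stepwise) learning rate. |
decay_rate | Decay rate coefficient. |
staircase | Whether to use a discrete (stepwise) learning rate; defaults to False. |
name | Name of the operation; defaults to ExponentialDecay. |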
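Computation:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
# With staircase=True, global_step / decay_steps is an integer division, so the learning rate takes discrete values and is updated once every decay_steps iterations.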
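To make the formula concrete, here is a minimal plain-Python sketch of the same computation. It is an illustration only, not TensorFlow's implementation: the helper name `exponential_decay_value` is hypothetical, and it assumes that staircase=True simply floors the ratio global_step / decay_steps.

```python
import math


def exponential_decay_value(learning_rate, global_step, decay_steps,
                            decay_rate, staircase=False):
    """Evaluate the exponential decay formula for one step (illustrative)."""
    p = global_step / decay_steps
    if staircase:
        # Assumed behaviour: only completed decay periods count.
        p = math.floor(p)
    return learning_rate * decay_rate ** p


# Compare the continuous and the staircase variant at a few steps.
for step in (0, 5, 10, 25):
    smooth = exponential_decay_value(0.5, step, 10, 0.9)
    stepped = exponential_decay_value(0.5, step, 10, 0.9, staircase=True)
    print(step, round(smooth, 4), round(stepped, 4))
```

The two variants agree whenever the step is a multiple of decay_steps; between those points the staircase variant holds the last value.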
Example:
# coding:utf-8
# Written for TensorFlow 1.x (tf.Session and tf.train.*_decay).
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')  # iteration count
# Staircase (stepwise) decay
learning_rate1 = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
# Standard (continuous) exponential decay
learning_rate2 = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
y = []
z = []
EPOCH = 200
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.title('exponential_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['staircase', 'continuous'], loc='upper right')
plt.show()
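2. Piecewise constant decay
tf.train.piecewise_constant(
    x,
    boundaries,
    values,
    name=None)
Parameter | Description |
---|---|
x | Plays the role of global_step: the iteration count. |
boundaries | List of boundaries at which the learning rate changes. |
values | List of learning-rate values, one per segment; it has one more element than boundaries. |
name | Name of the operation; defaults to PiecewiseConstant. |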
Computation:
# parameters
global_step = tf.Variable(0, trainable=False)
boundaries = [100, 200]
values = [1.0, 0.5, 0.1]
# learning_rate
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
# Explanation (a step equal to a boundary still belongs to the earlier segment)
# global_step in [0, 100]:   learning_rate = 1.0
# global_step in [101, 200]: learning_rate = 0.5
# global_step >= 201:        learning_rate = 0.1
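The segment lookup can be sketched in plain Python as follows. This is an illustration rather than TensorFlow's code: the helper name `piecewise_constant_value` is hypothetical, and it assumes that a step equal to a boundary still uses the earlier segment's value.

```python
from bisect import bisect_left


def piecewise_constant_value(x, boundaries, values):
    """Return the value of the segment that step x falls into (illustrative)."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values needs exactly one more element than boundaries")
    # bisect_left counts the boundaries strictly below x, so a step that
    # equals a boundary is still assigned to the earlier segment.
    return values[bisect_left(boundaries, x)]


boundaries = [100, 200]
values = [1.0, 0.5, 0.1]
for step in (0, 100, 101, 200, 201):
    print(step, piecewise_constant_value(step, boundaries, values))
```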
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so the schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
boundaries = [10, 20, 30]
learning_rates = [0.1, 0.07, 0.025, 0.0125]
learning_rate = tf.train.piecewise_constant(global_step, boundaries=boundaries, values=learning_rates)
y = []
N = 40
with tf.Session() as sess:
    for step in range(N):
        lr = sess.run(learning_rate, feed_dict={global_step: step})
        y.append(lr)
x = range(N)
plt.plot(x, y, 'r-', linewidth=2)
plt.title('piecewise_constant')
plt.show()
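3. Natural exponential decay
Similar to exponential decay: it also depends on the current iteration count, but uses e as the base.
tf.train.natural_exp_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate,
    staircase=False,
    name=None
)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. With staircase=True the learning rate stays constant within each block of decay_steps iterations, which gives a discrete (stepwise) learning rate. |
decay_rate | Decay rate coefficient. |
staircase | Whether to use a discrete (stepwise) learning rate; defaults to False. |
name | Name of the operation; defaults to ExponentialTimeDecay. |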
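Computation:
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
# With staircase=True, global_step / decay_steps is an integer division, so the learning rate takes discrete values and is updated once every decay_steps iterations.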
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# Natural exponential decay, staircase
learning_rate1 = tf.train.natural_exp_decay(
    learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
# Natural exponential decay, continuous
learning_rate2 = tf.train.natural_exp_decay(
    learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
# Exponential decay, staircase
learning_rate3 = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
# Exponential decay, continuous
learning_rate4 = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
y = []
z = []
w = []
m = []
EPOCH = 200
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2, lr3, lr4 = sess.run(
            [learning_rate1, learning_rate2, learning_rate3, learning_rate4],
            feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
        w.append(lr3)
        m.append(lr4)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, w, 'r--', linewidth=2)
plt.plot(x, m, 'g--', linewidth=2)
plt.title('natural_exp_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['natural_staircase', 'natural_continuous', 'staircase', 'continuous'], loc='upper right')
plt.show()
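As the plot shows, with the same parameters natural exponential decay reduces the learning rate much faster than ordinary exponential decay.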
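The gap between the two curves can be checked numerically with a small plain-Python sketch. It assumes the natural variant computes learning_rate * exp(-decay_rate * global_step / decay_steps) and the ordinary variant learning_rate * decay_rate ^ (global_step / decay_steps); both helper names are hypothetical and neither is TensorFlow's own code. Under that assumption, with decay_rate = 0.9 each block of decay_steps iterations multiplies the rate by exp(-0.9), roughly 0.41, in the natural case, against 0.9 in the ordinary case.

```python
import math


def natural_exp_decay_value(learning_rate, global_step, decay_steps,
                            decay_rate, staircase=False):
    """Natural exponential decay for one step (illustrative)."""
    p = global_step / decay_steps
    if staircase:
        p = math.floor(p)
    return learning_rate * math.exp(-decay_rate * p)


def exponential_decay_value(learning_rate, global_step, decay_steps,
                            decay_rate, staircase=False):
    """Ordinary exponential decay for one step (illustrative)."""
    p = global_step / decay_steps
    if staircase:
        p = math.floor(p)
    return learning_rate * decay_rate ** p


# Same parameters as in the example: the natural schedule shrinks faster.
for step in (0, 10, 20, 50):
    print(step,
          natural_exp_decay_value(0.5, step, 10, 0.9),
          exponential_decay_value(0.5, step, 10, 0.9))
```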
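4. Polynomial decay
tf.train.polynomial_decay(
    learning_rate,
    global_step,
    decay_steps,
    end_learning_rate=0.0001,
    power=1.0,
    cycle=False,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. |
end_learning_rate | Minimum (final) learning rate; defaults to 0.0001. |
power | Power of the polynomial; defaults to 1.0. |
cycle | bool; whether the learning rate rises again and then decays once it has reached the minimum; defaults to False. |
name | Name of the operation; defaults to PolynomialDecay. |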
Computation:
# If cycle=False
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate
# If cycle=True
decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate
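Both branches of the formula can be written out as a plain-Python sketch. The helper name `polynomial_decay_value` is hypothetical and this is not TensorFlow's implementation; in particular, treating the multiplier as 1 at step 0 when cycle=True is an assumption made here to avoid dividing by zero.

```python
import math


def polynomial_decay_value(learning_rate, global_step, decay_steps,
                           end_learning_rate=0.0001, power=1.0, cycle=False):
    """Evaluate the polynomial decay formula for one step (illustrative)."""
    if cycle:
        # Stretch the horizon to the next multiple of decay_steps.
        # Assumption: the multiplier is 1 at step 0.
        multiplier = 1 if global_step == 0 else math.ceil(global_step / decay_steps)
        decay_steps = decay_steps * multiplier
    else:
        # Without cycling the schedule stays at the minimum after decay_steps.
        global_step = min(global_step, decay_steps)
    fraction = 1 - global_step / decay_steps
    return (learning_rate - end_learning_rate) * fraction ** power + end_learning_rate


for step in (0, 50, 51, 100, 101):
    print(step,
          polynomial_decay_value(0.1, step, 50, 0.01, 0.5, cycle=False),
          polynomial_decay_value(0.1, step, 50, 0.01, 0.5, cycle=True))
```

With cycle=True the horizon is extended each time it is reached, so the value jumps up right after steps 50, 100, and so on, then decays towards end_learning_rate again.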
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# cycle=False
learning_rate1 = tf.train.polynomial_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=50,
    end_learning_rate=0.01, power=0.5, cycle=False)
# cycle=True
learning_rate2 = tf.train.polynomial_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=50,
    end_learning_rate=0.01, power=0.5, cycle=True)
y = []
z = []
EPOCH = 200
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, y, 'r--', linewidth=2)
plt.title('polynomial_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['cycle=True', 'cycle=False'], loc='upper right')
plt.show()
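The plot shows the learning rate reaching its minimum after decay_steps=50 iterations. With cycle=False it then stays at that preset minimum; with cycle=True it jumps back up and then decays again.
The purpose of letting the learning rate rise and fall repeatedly in polynomial decay is to keep the network parameters from getting stuck in a local optimum late in training because the learning rate has become too small; raising the learning rate again may let them escape that local optimum.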
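5. Inverse time decay
tf.train.inverse_time_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate,
    staircase=False,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. |
decay_rate | Learning-rate decay parameter. |
staircase | Whether to produce a discrete (stepwise) learning rate; defaults to False. |
name | Name of the operation; defaults to InverseTimeDecay. |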
Computation:
# If staircase=False, the learning rate decays continuously:
decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)
# If staircase=True, the learning rate decays in discrete steps:
decayed_learning_rate = learning_rate / (1 + decay_rate * floor(global_step / decay_steps))
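A plain-Python sketch of the two cases follows. The helper name `inverse_time_decay_value` is hypothetical; it only mirrors the formulas above and is not TensorFlow's code.

```python
import math


def inverse_time_decay_value(learning_rate, global_step, decay_steps,
                             decay_rate, staircase=False):
    """Evaluate the inverse time decay formula for one step (illustrative)."""
    p = global_step / decay_steps
    if staircase:
        # Discrete variant: only completed decay periods count.
        p = math.floor(p)
    return learning_rate / (1 + decay_rate * p)


# Successive drops get smaller as the step count grows.
previous = inverse_time_decay_value(0.1, 0, 20, 0.2)
for step in (20, 40, 60, 80):
    current = inverse_time_decay_value(0.1, step, 20, 0.2)
    print(step, current, previous - current)
    previous = current
```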
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# Staircase (stepwise) decay
learning_rate1 = tf.train.inverse_time_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=20,
    decay_rate=0.2, staircase=True)
# Continuous decay
learning_rate2 = tf.train.inverse_time_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=20,
    decay_rate=0.2, staircase=False)
y = []
z = []
EPOCH = 200
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2)
plt.plot(x, y, 'g-', linewidth=2)
plt.title('inverse_time_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['continuous', 'staircase'])
plt.show()
Here too the learning rate decreases as the iteration count grows, and the size of each decrease shrinks as well.
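6. Cosine decay
I had been using TensorFlow 1.4, which has no cosine learning-rate decay. After upgrading to 1.9 I found four new functions related to cosine learning-rate decay; they are introduced below.
6.1 Standard cosine decay
Source: [Loshchilov & Hutter, ICLR 2017], SGDR: Stochastic Gradient Descent with Warm Restarts. https://arxiv.org/abs/1608.03983
tf.train.cosine_decay(
    learning_rate,
    global_step,
    decay_steps,
    alpha=0.0,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. |
alpha | Minimum learning rate, as a fraction of learning_rate; defaults to 0.0. |
name | Name of the operation; defaults to CosineDecay. |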
Computation:
global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed
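The four lines above translate directly into plain Python. The sketch below is illustrative only: the helper name `cosine_decay_value` is hypothetical and this is not TensorFlow's code.

```python
import math


def cosine_decay_value(learning_rate, global_step, decay_steps, alpha=0.0):
    """Evaluate the cosine decay formula for one step (illustrative)."""
    # After decay_steps the schedule stays at its final value.
    global_step = min(global_step, decay_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * global_step / decay_steps))
    # alpha sets the floor as a fraction of the initial learning rate.
    decayed = (1 - alpha) * cosine + alpha
    return learning_rate * decayed


for step in (0, 25, 50, 100):
    print(step, cosine_decay_value(0.1, step, 50), cosine_decay_value(0.1, step, 50, alpha=0.1))
```

The rate starts at learning_rate, passes through the midpoint between start and floor halfway through decay_steps, and ends at learning_rate * alpha.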
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# Cosine decay with two different decay periods
learning_rate1 = tf.train.cosine_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=50)
learning_rate2 = tf.train.cosine_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=100)
y = []
z = []
EPOCH = 200
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=50', 'decay_steps=100'], loc='upper right')
plt.show()
6.2 Cosine decay with restarts
Source: [Loshchilov & Hutter, ICLR 2017], SGDR: Stochastic Gradient Descent with Warm Restarts. https://arxiv.org/abs/1608.03983
tf.train.cosine_decay_restarts(
    learning_rate,
    global_step,
    first_decay_steps,
    t_mul=2.0,
    m_mul=1.0,
    alpha=0.0,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
first_decay_steps | Number of steps in the first decay period. |
t_mul | Used to derive the number of iterations in the i-th period. |
m_mul | Used to derive the initial learning rate of the i-th period. |
alpha | Minimum learning rate, as a fraction of learning_rate; defaults to 0.0. |
name | Name of the operation; defaults to SGDRDecay. |
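No formula is given for this schedule here, so the following plain-Python sketch shows one way to read the restart rule from the parameter descriptions: period i lasts first_decay_steps * t_mul ** i steps and starts from learning_rate * m_mul ** i, with a cosine decay inside each period. It is an assumption-laden illustration, not TensorFlow's implementation; the helper name `cosine_decay_restarts_value` is hypothetical, the current period is found with a simple loop, and t_mul >= 1 is assumed.

```python
import math


def cosine_decay_restarts_value(learning_rate, global_step, first_decay_steps,
                                t_mul=2.0, m_mul=1.0, alpha=0.0):
    """Cosine decay with warm restarts for one step (illustrative)."""
    if t_mul < 1:
        raise ValueError("this sketch assumes t_mul >= 1")
    period = float(first_decay_steps)  # length of the current period
    start = 0.0                        # step at which the current period began
    i_restart = 0                      # number of restarts so far
    while global_step >= start + period:
        start += period
        period *= t_mul
        i_restart += 1
    completed_fraction = (global_step - start) / period
    cosine = 0.5 * (m_mul ** i_restart) * (1 + math.cos(math.pi * completed_fraction))
    decayed = (1 - alpha) * cosine + alpha
    return learning_rate * decayed


# With first_decay_steps=40 and t_mul=2 the periods are [0, 40), [40, 120), ...
for step in (0, 20, 39, 40, 80, 119, 120):
    print(step, cosine_decay_restarts_value(0.1, step, 40))
```

Under these assumptions the rate returns to its (possibly scaled) starting value at every restart, and each period is t_mul times longer than the one before.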
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# Cosine decay with restarts, for two different first periods
learning_rate1 = tf.train.cosine_decay_restarts(
    learning_rate=0.1, global_step=global_step, first_decay_steps=40)
learning_rate2 = tf.train.cosine_decay_restarts(
    learning_rate=0.1, global_step=global_step, first_decay_steps=60)
y = []
z = []
EPOCH = 100
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('cosine_decay_restarts')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['first_decay_steps=40', 'first_decay_steps=60'], loc='upper right')
plt.show()
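6.3 Linear cosine decay
Source: [Bello et al., ICML2017] Neural Optimizer Search with RL. https://arxiv.org/abs/1709.07417
tf.train.linear_cosine_decay(
    learning_rate,
    global_step,
    decay_steps,
    num_periods=0.5,
    alpha=0.0,
    beta=0.001,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. |
num_periods | Number of periods in the cosine part of the decay. |
alpha | Offset added to the linear decay term; see the formula below. Defaults to 0.0. |
beta | Constant added to the decayed multiplier; see the formula below. Defaults to 0.001. |
name | Name of the operation. |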
Computation:
global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed
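A plain-Python sketch of this formula helps to see what alpha and beta actually do. The helper name `linear_cosine_decay_value` is hypothetical and the code only mirrors the formula above; it is not TensorFlow's implementation.

```python
import math


def linear_cosine_decay_value(learning_rate, global_step, decay_steps,
                              num_periods=0.5, alpha=0.0, beta=0.001):
    """Evaluate the linear cosine decay formula for one step (illustrative)."""
    global_step = min(global_step, decay_steps)
    linear_decay = (decay_steps - global_step) / decay_steps
    cosine_decay = 0.5 * (
        1 + math.cos(math.pi * 2 * num_periods * global_step / decay_steps))
    # alpha shifts the linear term, beta shifts the whole multiplier.
    decayed = (alpha + linear_decay) * cosine_decay + beta
    return learning_rate * decayed


for step in (0, 20, 40):
    print(step,
          linear_cosine_decay_value(0.1, step, 40),
          linear_cosine_decay_value(0.1, step, 40, num_periods=0.2, alpha=0.5, beta=0.2))
```

Note that at step 0 the multiplier is 1 + alpha + beta, so with non-zero alpha or beta the schedule starts above learning_rate; with the default num_periods=0.5 it ends at learning_rate * beta.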
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# Linear cosine decay with two different decay periods
learning_rate1 = tf.train.linear_cosine_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=40,
    num_periods=0.2, alpha=0.5, beta=0.2)
learning_rate2 = tf.train.linear_cosine_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=60,
    num_periods=0.2, alpha=0.5, beta=0.2)
y = []
z = []
EPOCH = 100
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('linear_cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=40', 'decay_steps=60'], loc='upper right')
plt.show()
6.4 Noisy linear cosine decay
Source: [Bello et al., ICML2017] Neural Optimizer Search with RL. https://arxiv.org/abs/1709.07417
tf.train.noisy_linear_cosine_decay(
    learning_rate,
    global_step,
    decay_steps,
    initial_variance=1.0,
    variance_decay=0.55,
    num_periods=0.5,
    alpha=0.0,
    beta=0.001,
    name=None)
Parameter | Description |
---|---|
learning_rate | Initial learning rate. |
global_step | Current iteration (global step) count. |
decay_steps | Decay period. |
initial_variance | Initial variance of the noise. |
variance_decay | Decay rate of the noise variance. |
num_periods | Number of periods in the cosine part of the decay. |
alpha | Offset added to the linear decay term; see the formula below. Defaults to 0.0. |
beta | Constant added to the decayed multiplier; see the formula below. Defaults to 0.001. |
name | Name of the operation; defaults to NoisyLinearCosineDecay. |
Computation:
global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps
cosine_decay = 0.5 * (
    1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed
# eps_t is zero-mean Gaussian noise with variance initial_variance / (1 + global_step) ^ variance_decay
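The role of the noise term can be seen in a plain-Python sketch. It assumes eps_t is zero-mean Gaussian noise whose variance is initial_variance / (1 + global_step) ** variance_decay; the helper name `noisy_linear_cosine_decay_value` and its `rng` argument are hypothetical, and this is not TensorFlow's code.

```python
import math
import random


def noisy_linear_cosine_decay_value(learning_rate, global_step, decay_steps,
                                    initial_variance=1.0, variance_decay=0.55,
                                    num_periods=0.5, alpha=0.0, beta=0.001,
                                    rng=random):
    """Evaluate noisy linear cosine decay for one step (illustrative)."""
    global_step = min(global_step, decay_steps)
    # Assumed noise model: the variance shrinks as the step count grows.
    variance = initial_variance / (1 + global_step) ** variance_decay
    eps_t = rng.gauss(0.0, math.sqrt(variance))
    linear_decay = (decay_steps - global_step) / decay_steps
    cosine_decay = 0.5 * (
        1 + math.cos(math.pi * 2 * num_periods * global_step / decay_steps))
    decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta
    return learning_rate * decayed


# A seeded generator makes the noisy curve reproducible.
rng = random.Random(0)
for step in (0, 10, 20, 40):
    print(step, noisy_linear_cosine_decay_value(
        0.1, step, 40, initial_variance=0.01, variance_decay=0.1, rng=rng))
```

Setting initial_variance=0 removes the noise, in which case the sketch reduces to the linear cosine decay formula of section 6.3.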
Example:
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
# Feed the step through a placeholder so each schedule op is built only once.
global_step = tf.placeholder(tf.int32, shape=[], name='global_step')
# Noisy linear cosine decay with two different decay periods
learning_rate1 = tf.train.noisy_linear_cosine_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=40,
    initial_variance=0.01, variance_decay=0.1, num_periods=2, alpha=0.5, beta=0.2)
learning_rate2 = tf.train.noisy_linear_cosine_decay(
    learning_rate=0.1, global_step=global_step, decay_steps=60,
    initial_variance=0.01, variance_decay=0.1, num_periods=2, alpha=0.5, beta=0.2)
y = []
z = []
EPOCH = 100
with tf.Session() as sess:
    for step in range(EPOCH):
        lr1, lr2 = sess.run([learning_rate1, learning_rate2], feed_dict={global_step: step})
        y.append(lr1)
        z.append(lr2)
x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('noisy_linear_cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=40', 'decay_steps=60'], loc='upper right')
plt.show()
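A closing note: having tried out roughly all of the learning-rate decay methods that TensorFlow provides, I realized that what I have learned is only which decay methods exist and roughly how to call them. How to set the parameters of a particular schedule, the mathematics behind it, and which method suits which situation are questions I still cannot answer. If possible, I will come back and add to this post as I gain new understanding from using them.
If you spot any mistakes, please point them out. Thank you.
Thanks to 未雨愁眸 for the blog post "tensorflow中學習率更新策略" (learning-rate update strategies in TensorFlow), which this post mainly draws on. I have seen similar articles elsewhere, but I could not establish who published first or where.
Author's personal site: https://chenzhen.onliine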