That little thing called the learning rate

1. Learning Rate Finder

 

Deep learning models are typically trained by a stochastic gradient descent optimizer. There are many variations of stochastic gradient descent: Adam, RMSProp, Adagrad, etc. All of them let you set the learning rate. This parameter tells the optimizer how far to move the weights in the direction of the gradient for a mini-batch.
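As a minimal sketch of where this parameter lives (PyTorch is assumed here only because the code later in this post uses the PyTorch-based fastai library; the model is a throwaway placeholder):

import torch

model = torch.nn.Linear(10, 2)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)    # lr is the learning rate
# the same argument exists for the other optimizers:
# torch.optim.Adam(model.parameters(), lr=1e-3)
# torch.optim.RMSprop(model.parameters(), lr=1e-3)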

 

If the learning rate is low, then training is more reliable, but optimization will take a lot of time because steps towards the minimum of the loss function are tiny.

 

If the learning rate is high, then training may not converge or even diverge. Weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse.

 

Gradient descent with small (top) and large (bottom) learning rates. Source: Andrew Ng’s Machine Learning course on Coursera

 

The training should start from a relatively large learning rate because, in the beginning, random weights are far from optimal, and then the learning rate can decrease during training to allow more fine-grained weight updates.
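A minimal sketch of that idea, reusing the optimizer above (the schedule and its numbers are illustrative, not taken from the text): start relatively large and shrink the learning rate as training progresses.

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # multiply the lr by 0.1 every 30 epochs
for epoch in range(90):
    ...                                                 # one pass over the training data goes here
    scheduler.step()                                    # decay the learning rate on schedule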

 

How to find it

 

Leslie N. Smith describes a powerful technique to select a range of learning rates for a neural network in section 3.3 of the 2015 paper “Cyclical Learning Rates for Training Neural Networks”.

 

The trick is to train a network starting from a low learning rate and increase the learning rate exponentially for every batch.

 

Learning rate increases after each mini-batch

 

Record the learning rate and training loss for every batch. Then plot the loss against the learning rate. Typically, it looks like this:
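A rough sketch of this range test (my own toy setup, not the author's code: a tiny synthetic dataset and model stand in for the real ones): the learning rate is multiplied by a constant factor after every mini-batch so that it grows exponentially, and (lr, loss) is logged for the plot.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(20, 2)                                      # toy model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7)
data = TensorDataset(torch.randn(6400, 20), torch.randint(0, 2, (6400,)))
train_loader = DataLoader(data, batch_size=64, shuffle=True)  # 100 batches per epoch

lr, lr_mult = 1e-7, (10.0 / 1e-7) ** (1.0 / 100)              # grow from 1e-7 to 10 over 100 batches
lrs, losses = [], []
for i, (x, y) in enumerate(train_loader):
    if i >= 100:
        break
    for group in optimizer.param_groups:                      # set the learning rate for this batch
        group["lr"] = lr
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())                                # record lr and training loss
    lr *= lr_mult                                             # exponential increase after every batch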

 

First, with low learning rates, the loss improves slowly, then training accelerates until the learning rate becomes too large and loss goes up: the training process diverges.

 

Another way to look at these numbers is to calculate the rate of change of the loss (the derivative of the loss with respect to the iteration number), then plot the rate of change on the y-axis and the learning rate on the x-axis.

 

Rate of change of the loss

 

It looks too noisy; let’s smooth it out using a simple moving average.

 

Rate of change of the loss, simple moving average

 

This looks better. On this graph, we need to find the minimum. It is close to lr=0.01.
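A small sketch of that post-processing, assuming the lrs and losses lists recorded by the toy loop above (the window length is an arbitrary choice):

import numpy as np

d_loss = np.diff(np.array(losses))                      # rate of change of the loss per iteration

window = 10                                             # simple moving average to smooth the noise
sma = np.convolve(d_loss, np.ones(window) / window, mode="valid")

best = int(np.argmin(sma))                              # steepest decrease of the smoothed loss
print("candidate learning rate:", lrs[best + window - 1])   # roughly aligned with the window end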

 

How to pick the learning rate

 

One thing to pay attention to is how to pick a suitable learning rate from the learning rate vs. rate-of-change plot.

In the example above, we choose lr = 1e-2 (10^-2) rather than lr = 1e-1 (10^-1). In other words, we don't pick the learning rate with the lowest loss; we go back a little.

 

So the question is: why pick it this way?

 

That brings us to a neat trick: stochastic gradient descent with restarts (SGDR).

 

SGD restarts

 

Another method was proposed by Loshchilov & Hutter [5] in their paper “SGDR: Stochastic Gradient Descent with Restarts”: cosine annealing, in which the learning rate decreases from its maximum value following the cosine function and then ‘restarts’ at the maximum at the beginning of the next cycle. The authors also suggest making each successive cycle longer than the previous one by a constant factor T_mult.
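A sketch of that schedule, written from the description above rather than taken from the paper's code: inside a cycle of length T_i iterations the learning rate follows half a cosine from lr_max down to lr_min, and every new cycle is T_mult times longer than the previous one.

import math

def sgdr_lr(t, lr_max, lr_min=0.0, T_0=100, T_mult=2):
    """Learning rate at iteration t under cosine annealing with warm restarts."""
    T_i, t_cur = T_0, t
    while t_cur >= T_i:                  # find the current cycle and the position inside it
        t_cur -= T_i
        T_i *= T_mult                    # each cycle is T_mult times longer than the last
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / T_i))

# example: peak lr 1e-2, first cycle 100 iterations, then 200, then 400, ...
schedule = [sgdr_lr(t, lr_max=1e-2) for t in range(700)]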

 

Before explaining what SGDR is, let's first look at a related technique: learning rate annealing.
Learning rate annealing looks like this:

This is a widely used way of adjusting the learning rate along a cosine curve: start with a relatively large learning rate so the loss drops quickly, and then, once you are close to the minimum, quickly drop the learning rate and inch forward in small steps.

 

So what does this have to do with SGD restarts? Take a look at the output of the following code:

 

# fastai (v0.7) example; PATH, tfms and arch come from the surrounding course notebook
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 3, cycle_len=1)   # peak lr 1e-2, 3 cycles, restart (cycle_len) every 1 epoch
learn.sched.plot_lr()             # plot the learning rate schedule that was actually used

 

 

 

The figure above shows the idea: we first apply cosine annealing to the learning rate, then jump it back up, then anneal it again, and keep cycling.
That is why it is called a restart.

 

The benefit is clear. If we have landed in a spiky minimum, raising the learning rate again lets us jump out of that spiky region. If we have instead found a wide, flat minimum that generalizes well, then even after the learning rate is increased the next step still lands on that flat valley floor, so the weights we end up with generalize better.

 

Although deep neural networks don’t usually converge to a global minimum, there is a notion of ‘good’ and ‘bad’ local minima in terms of generalization. Keskar et al. [6] argue that local minima with flat basins tend to generalize better. It should be intuitive that sharp minima are not ideal, because slight changes to the weights tend to change the model’s predictions dramatically. If the learning rate is large enough, the intrinsic random motion across gradient steps prevents the optimizer from reaching any of the sharp basins along its optimization path. However, if the learning rate is small, the model tends to converge into the closest local minimum. That being said, increasing the learning rate from time to time helps the optimization algorithm escape sharp minima, so that it ends up converging to a ‘good’ set of weights.

 

As the figure below illustrates:

 

This is also why, when picking the learning rate, we don't choose the one where the loss is lowest but go back a little, to 10^-2: this value serves as the starting point of each restart (the highest point on the y-axis of the plot_lr() figure above), and the learning rate then decreases gradually from there.

 

If we had instead started from a learning rate that is much smaller, then even when the learning rate jumps back up it might not be large enough to escape the spiky minimum, and we would never reach the wider minimum that generalizes better.

 

The cycle_len argument in the code above sets the restart interval in epochs: cycle_len=1 means the learning rate is reset, jumping back up, at every epoch.
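For readers not on the old fastai 0.7 API, plain PyTorch ships a comparable schedule; the sketch below (reusing the toy model, criterion and train_loader from the range-test sketch earlier, with illustrative numbers) restarts the cosine schedule once per epoch, roughly like cycle_len=1.

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)             # 1e-2 acts as the restart peak
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1, T_mult=1)  # restart every epoch

for epoch in range(3):
    for i, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        # step the scheduler fractionally inside the epoch so the lr follows the cosine curve
        scheduler.step(epoch + i / len(train_loader))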

 

Epoch, iterations, cycles and stepsize

These terms have specific meanings in this algorithm; understanding them will make it easy to plug them into the equations.

Let us consider a training dataset with 50,000 instances.

An epoch is one run of your training algorithm across the entire training set. If we set a batch size of 100, we get 500 batches, i.e. 500 iterations, per epoch. The iteration count accumulates over epochs, so epoch 2 spans iterations 501 to 1000 for the same 500 batches, and so on.

With that in mind, a cycle is the number of iterations over which we want the learning rate to go from the base learning rate up to the max learning rate and back. A stepsize is half of a cycle. Note that a cycle need not fall on the boundary of an epoch, though in practice it often does.

CLR stepsize (triangular learning rate schedule)

In the diagram above, the base lr and max lr for the algorithm are demarcated by the red lines. The blue line shows how the learning rate is modified (in a triangular fashion), with the x-axis being the iteration count. One complete up-and-down of the blue line is one cycle, and the stepsize is half of that.
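A sketch of that triangular policy, using the 50,000-instance / batch-size-100 numbers from the text (so 500 iterations per epoch); the base and max learning rates below are illustrative placeholders.

import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, stepsize=500):
    """Triangular CLR: the lr climbs from base_lr to max_lr over one stepsize, then back down."""
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# with 50,000 instances and a batch size of 100 there are 500 iterations per epoch,
# so stepsize=500 makes one full cycle span two epochs (one up, one down)
lrs_triangular = [triangular_lr(t) for t in range(2000)]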

 


 

2. Visualizing Learning Rate vs. Batch Size

 

 
The plot shows loss vs. learning rate for the dataset. Now it is easy to choose an optimal range for the learning rate, just before the curve flattens.

 

So simple, once discovered, that it is surprising how revolutionary it is. It has the potential to save a huge number of hours and a lot of computational resources once it becomes widespread practice. In addition to realizing this, I also thought about how insightful it would be to use the tool to visualize the famous relationship between learning rate and batch size.

 

For those unaware, the general rule is “bigger batch size, bigger learning rate”. This is logical because a bigger batch size means more confidence in the direction of your “descent” of the error surface, while the smaller the batch size, the closer you are to fully “stochastic” descent (batch size 1). Small steps can also work, but the direction of each individual step is more… well, more stochastic.

 

If the batch size is very large (say, 1024 samples per epoch and a batch size of 512), it means you are quite confident about your data: no matter where the batches start from, they will all meet at the same valley floor, so you can stride forward boldly and charge straight downhill.
Conversely, when the batch size is tiny, in the extreme case equal to 1 (so one epoch is split into 1024 batches), you get the traditional SGD everyone talks about. Taking small steps still works, but because the troops are scattered and fighting guerrilla-style all over the place, the randomness goes up.

 

A picture here is worth a thousand words:

 

There you have it: the relationship between learning rate and error, plotted using batch sizes from 64 down to 4 for the “cats vs. dogs” dataset.
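A hedged sketch of how such a comparison could be produced (this is not the author's code: it reuses the toy synthetic-data idea from the range-test sketch and simply repeats the test once per batch size, overlaying the curves):

import torch
import matplotlib.pyplot as plt
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(6400, 20), torch.randint(0, 2, (6400,)))  # toy data again

def lr_range_test(batch_size, num_batches=100, lr_start=1e-7, lr_end=10.0):
    """Run one exponential learning rate range test and return (lrs, losses)."""
    model = nn.Linear(20, 2)                            # fresh model for every run
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start)
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    lr, mult = lr_start, (lr_end / lr_start) ** (1.0 / num_batches)
    lrs, losses = [], []
    for i, (x, y) in enumerate(loader):
        if i >= num_batches:
            break
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= mult
    return lrs, losses

for bs in (64, 32, 16, 8, 4):                           # batch sizes 64 down to 4, as in the plot
    test_lrs, test_losses = lr_range_test(bs)
    plt.plot(test_lrs, test_losses, label=f"batch size {bs}")
plt.xscale("log")
plt.xlabel("learning rate")
plt.ylabel("loss")
plt.legend()
plt.show()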

 

As expected, a bigger batch size shows a bigger optimal learning rate, but the picture gives more subtle and complete information. I find it interesting to see how the curves relate to each other; it is also worth noting how the noise in the relationship increases as the batch size gets smaller.

 

The bottom line: the bigger the batch size, the higher the learning rate!

 


 

 

