1. Learning Rate Finder
Deep learning models are typically trained by a stochastic gradient descent optimizer. There are many variations of stochastic gradient descent: Adam, RMSProp, Adagrad, etc. All of them let you set the learning rate. This parameter tells the optimizer how far to move the weights against the gradient computed on a mini-batch.
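As a toy illustration of that update (not code from the article), a single vanilla SGD step for one layer's weights could be sketched like this; weights, grad and learning_rate are made-up stand-ins for whatever the optimizer tracks internally:

import numpy as np

# Hypothetical quantities for one layer: the current weights and the gradient of the
# loss computed on a single mini-batch.
weights = np.random.randn(10)
grad = np.random.randn(10)
learning_rate = 1e-2

# One SGD step: move the weights against the gradient, scaled by the learning rate.
weights -= learning_rate * grad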
If the learning rate is low, then training is more reliable, but optimization will take a lot of time because steps towards the minimum of the loss function are tiny.
If the learning rate is high, then training may not converge or even diverge. Weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse.
The training should start from a relatively large learning rate because, in the beginning, random weights are far from optimal, and then the learning rate can decrease during training to allow more fine-grained weight updates.
How to find
Leslie N. Smith describes a powerful technique to select a range of learning rates for a neural network in section 3.3 of the 2015 paper “Cyclical Learning Rates for Training Neural Networks”.
The trick is to train a network starting from a low learning rate and increase the learning rate exponentially for every batch.
Record the learning rate and training loss for every batch. Then, plot the loss and the learning rate. Typically, it looks like this:
First, with low learning rates, the loss improves slowly, then training accelerates until the learning rate becomes too large and loss goes up: the training process diverges.
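A minimal sketch of the range test just described, assuming a plain PyTorch model, train_loader and criterion (none of which appear in the original code), could look like this:

import torch

def lr_range_test(model, train_loader, criterion,
                  lr_start=1e-7, lr_end=10.0, num_iters=100):
    # Start from a very low learning rate and grow it exponentially on every batch.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start)
    mult = (lr_end / lr_start) ** (1.0 / num_iters)
    lr, lrs, losses = lr_start, [], []
    for i, (x, y) in enumerate(train_loader):
        if i >= num_iters:
            break
        optimizer.param_groups[0]["lr"] = lr
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= mult  # exponential increase per batch
    return lrs, losses  # plot losses against lrs on a log-scale x-axis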
Another way to look at these numbers is to calculate the rate of change of the loss (the derivative of the loss with respect to the iteration number) and plot that rate of change on the y-axis against the learning rate on the x-axis.
This plot is too noisy, so let's smooth it out with a simple moving average.
This looks better. On this graph, we need to find the minimum. It is close to lr=0.01.
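For example, with the lrs and losses lists from the sketch above (and an arbitrary window size of 10), the smoothed rate of change and its minimum could be computed like this:

import numpy as np

d_loss = np.gradient(np.array(losses))   # rate of change of the loss per iteration
window = 10
smoothed = np.convolve(d_loss, np.ones(window) / window, mode="valid")  # simple moving average
# Approximate: this ignores the small index offset introduced by the moving-average window.
best_lr = lrs[int(np.argmin(smoothed))]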
How to pick learning rate
One thing to watch out for is how to pick a suitable learning rate from the learning rate vs. rate-of-change plot.
In the example above, we choose lr=1e-2 (10^-2) rather than lr=1e-1 (10^-1). In other words, we do not pick the lr with the lowest loss; we go back a little.
So why pick it this way?
This is where a neat technique comes in: stochastic gradient descent with restarts (SGDR).
SGD restarts
Another method, proposed by Loshchilov & Hutter [5] in their paper “SGDR: Stochastic Gradient Descent with Warm Restarts”, is called ‘cosine annealing’: the learning rate decreases from its maximum value following the cosine function and then ‘restarts’ at the maximum at the beginning of the next cycle. The authors also suggest making each next cycle longer than the previous one by some constant factor T_mul.
Before explaining what SGDR is, let's first look at a related technique: learning rate annealing.
Learning rate annealing looks like this:
It is a widely used way of adjusting the learning rate along a cosine curve: start with a relatively large learning rate so the loss drops quickly, and then, once you are close to the minimum, quickly drop the learning rate and take small steps forward.
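In plain PyTorch (not the fast.ai wrapper used elsewhere in this post), cosine annealing over a single run can be sketched with the built-in CosineAnnealingLR scheduler; the stand-in model and the concrete numbers below are illustrative assumptions:

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # start relatively large
# Decay the learning rate from 1e-2 down to eta_min over T_max epochs along a cosine curve.
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-4)

for epoch in range(10):
    # ... one training epoch would go here ...
    scheduler.step()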
So what does this have to do with SGD restarts? Take a look at the output of the following code:
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 3, cycle_len=1)
learn.sched.plot_lr()
The plot above shows what happens: we apply cosine annealing to the learning rate, then jump it back up, anneal it with the cosine schedule again, and keep cycling.
That is why it is called a ‘restart’.
The benefit is obvious: if we have landed in a spiky minimum, raising the learning rate again lets us jump out of that spiky region; if we have found a wide minimum that generalizes well, then even after raising the learning rate the next step still lands in that flat valley, so the solution we end up with generalizes better.
Although deep neural networks don’t usually converge to a global minimum, there’s a notion of ‘good’ and ‘bad’ local minima in terms of generalization. Keskar et al. [6] argue that local minima with flat basins tend to generalize better. It should be intuitive that sharp minima are not the best, because slight changes to the weights tend to change model predictions dramatically. If the learning rate is large enough, the intrinsic random motion across gradient steps prevents the optimizer from reaching any of the sharp basins along its optimization path. However, if the learning rate is small, the model tends to converge into the closest local minimum. That being said, increasing the learning rate from time to time helps the optimization algorithm escape from sharp minima and converge to a ‘good’ set of weights.
As illustrated in the figure below:
This is also why, when picking the learning rate, we don't take the one at the minimum loss but go back a little to 1e-2 (10^-2): that value serves as the starting point of each restart (the highest point on the y-axis of the learning-rate plot above), and the rate is then gradually decreased from there.
If we started from an even smaller learning rate instead, then even when we jump back up we might not be able to escape a spiky minimum and reach the wider, better-generalizing one.
So the cycle_len argument above sets the interval, in epochs, between restarts: cycle_len=1 means the learning rate is reset and jumps back up every epoch.
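An equivalent schedule expressed in plain PyTorch (again an illustrative sketch, not the library call used above) would step a warm-restarts scheduler once per mini-batch, with the cycle length set to one epoch's worth of iterations:

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # 1e-2 is the restart peak picked above
iters_per_epoch = 500                                     # hypothetical, e.g. 50,000 images / batch size 100
# T_0 = one epoch of iterations, so the learning rate restarts every epoch (like cycle_len=1);
# setting T_mult=2 would make each cycle twice as long as the previous one (the T_mul factor from the paper).
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=iters_per_epoch, T_mult=1, eta_min=1e-4)

for epoch in range(3):
    for it in range(iters_per_epoch):
        # ... forward pass, backward pass and optimizer.step() for one mini-batch would go here ...
        scheduler.step()                                  # the restart happens when a cycle ends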
Epoch, iterations, cycles and stepsize
These terms have a specific meaning in this algorithm, and understanding them makes it easy to plug them into the equations. Let us consider a training dataset with 50,000 instances.
An epoch is one run of your training algorithm across the entire training set. If we set a batch size of 100, we get 500 batches, i.e. 500 iterations, in one epoch. The iteration count is accumulated over epochs, so in epoch 2 we get iterations 501 to 1000 for the same 500 batches, and so on.
With that in mind, a cycle is defined as the number of iterations over which we want our learning rate to go from a base learning rate up to a max learning rate, and back. A stepsize is half of a cycle. Note that a cycle need not fall on the boundary of an epoch, though in practice it often does.
In the above diagram, we set a base lr and max lr for the algorithm, demarcated by the red lines. The blue line shows how the learning rate is modified (in a triangular fashion), with iterations on the x-axis. A complete rise and fall of the blue line is one cycle, and the stepsize is half of that.
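Putting these definitions together, the triangular policy can be written as a small function of the iteration number; the base_lr, max_lr and stepsize values below are placeholders, not numbers taken from the diagram:

import numpy as np

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, stepsize=500):
    # Which cycle this iteration falls in (a cycle is 2 * stepsize iterations long).
    cycle = np.floor(1 + iteration / (2 * stepsize))
    # Position within the cycle, mapped so that x = 0 at the peak and x = 1 at both ends.
    x = np.abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * np.maximum(0.0, 1 - x)

# With 50,000 instances and a batch size of 100 (500 iterations per epoch),
# stepsize=500 makes one full cycle span exactly two epochs.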
2. Visualizing Learning rate vs Batch size
The plot shows loss vs. learning rate for the dataset. Now it is easy to choose an optimal range for the learning rate, just before the curve flattens.
So simple, once discovered, that it is surprising how revolutionary it is. It has the potential to save a huge number of hours and computational resources as soon as it becomes widespread practice. In addition to realizing this, I also thought about how insightful it would be to use the tool to visualize the famous relationship between learning rate and batch size.
For those unaware, the general rule is “bigger batch size, bigger learning rate”. This is logical because a bigger batch size means more confidence in the direction of your “descent” of the error surface, while the smaller the batch size is, the closer you are to “stochastic” descent (batch size 1). Small steps can also work, but the direction of each individual step is, well, more stochastic.
If the batch size is very large (say 512 when one epoch has 1,024 samples in total), you are quite confident in your data: wherever the steps start from, they will end up in the same valley, so you can stride ahead in big steps and charge straight downhill.
Conversely, when the batch size is tiny, in the extreme case equal to 1 (one epoch split into 1,024 batches), you get the classic SGD everyone talks about. Taking small steps still works, but with the troops scattered and fighting guerrilla-style all over the place, the randomness increases.
A picture is worth a thousand words:
There you have it: the relationship between learning rate and error, plotted for batch sizes from 64 down to 4 on the “cats vs. dogs” dataset.
As expected, a bigger batch size shows a bigger optimal learning rate, but the picture gives more subtle and complete information. I find it interesting to see how those curves relate to each other; it is also worth noting how the noise of the relationship increases as the batch size gets smaller.
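A hedged sketch of how such a comparison could be produced with the same (old) fast.ai objects used earlier: rerun the learning rate finder once per batch size and overlay the resulting plots. Whether from_paths takes a bs argument exactly like this, and the exact behaviour of lr_find and sched.plot, are assumptions on my part:

for bs in [64, 32, 16, 8, 4]:
    data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)
    learn = ConvLearner.pretrained(arch, data, precompute=True)
    learn.lr_find()      # the exponential range test described in section 1
    learn.sched.plot()   # loss vs. learning rate for this batch size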
The bottom line: the bigger the batch size, the higher the learning rate!