Caffe solver parameter settings
http://caffe.berkeleyvision.org/tutorial/solver.html
The solver controls parameter optimization by coordinating the parameter updates formed from the forward/backward passes. Learning a model is divided between the Solver, which supervises the optimization and generates the parameter updates, and the Net, which produces the loss and gradients.
Caffe provides the following optimization methods (an example of selecting one follows the list):
- Stochastic Gradient Descent (type: “SGD”),
- AdaDelta (type: “AdaDelta”),
- Adaptive Gradient (type: “AdaGrad”),
- Adam (type: “Adam”),
- Nesterov’s Accelerated Gradient (type: “Nesterov”),
- RMSprop (type: “RMSProp”)
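The method is selected with the type field of solver.prototxt. A minimal sketch, in which the net path and the concrete values are placeholders rather than recommendations:

```
# Hypothetical solver.prototxt fragment; net path and values are placeholders.
net: "models/example/train_val.prototxt"
type: "Adam"      # any of the types listed above; defaults to "SGD" if omitted
base_lr: 0.001    # initial learning rate
max_iter: 10000   # total number of training iterations
```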
The solver
1. scaffolds the optimization bookkeeping and creates the training network for learning and test network(s) for evaluation.
2. iteratively optimizes by calling forward / backward and updating parameters
3. (periodically) evaluates the test networks
4. snapshots the model and solver state throughout the optimization

where each iteration
1. calls network forward to compute the output and loss
2. calls network backward to compute the gradients
3. incorporates the gradients into parameter updates according to the solver method
4. updates the solver state according to learning rate, history, and method

to take the weights all the way from initialization to learned model.
Like Caffe models, Caffe solvers run in CPU / GPU modes.
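In solver.prototxt the mode is chosen with the solver_mode field; a minimal sketch:

```
# Run the solver on the GPU; change to CPU to stay on the CPU.
solver_mode: GPU
```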
SGD
Stochastic gradient descent (type: “SGD”) updates the weights $W$ by a linear combination of the negative gradient $\nabla L(W)$ and the previous weight update $V_t$. The learning rate $\alpha$ is the weight of the negative gradient. The momentum $\mu$ is the weight of the previous update.

Formally, we have the following formulas to compute the update value $V_{t+1}$ and the updated weights $W_{t+1}$ at iteration $t+1$, given the previous weight update $V_t$ and current weights $W_t$:

$$V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$$

$$W_{t+1} = W_t + V_{t+1}$$
The learning “hyperparameters” (α and μ) might require a bit of tuning for best results. If you’re not sure where to start, take a look at the “Rules of thumb” below, and for further information you might refer to Leon Bottou’s Stochastic Gradient Descent Tricks [1].
[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, 2012.
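As a hedged illustration of where α and μ live in solver.prototxt, the relevant fields are base_lr and momentum; the values below are only an example, not tuned recommendations:

```
# Example SGD hyperparameters; values are illustrative only.
base_lr: 0.01         # α: learning rate applied to the negative gradient
momentum: 0.9         # μ: weight of the previous update V_t
lr_policy: "step"     # drop the learning rate every stepsize iterations
gamma: 0.1            # multiply the learning rate by this factor at each drop
stepsize: 100000
max_iter: 350000
```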
Summary of the parameters in the solver file
iteration: one forward/backward training pass over a batch of data
batch_size: the number of images used for training in each iteration
epoch: one epoch means all training images have passed through the network once
For example: with 1,280,000 images and batch_size = 256, one epoch takes 1280000 / 256 = 5000 iterations.
With max_iter = 450000, training therefore runs for 450000 / 5000 = 90 epochs.
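Of these numbers, only max_iter is set in the solver file; the batch size belongs to the data layer of the net definition. A sketch using the example figures above:

```
# 1280000 images / batch_size 256 (set in the net's data layer) = 5000 iterations per epoch,
# so 450000 iterations = 90 epochs.
max_iter: 450000
```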
When the lr decays is governed by stepsize, and by how much it decreases is governed by gamma (this corresponds to the “step” lr_policy). For example, with stepsize = 500, base_lr = 0.01 and gamma = 0.1, the lr decays for the first time at iteration 500, and the decayed lr = lr * gamma = 0.01 * 0.1 = 0.001; the process then repeats every stepsize iterations. In other words, stepsize is the decay step of the lr and gamma is its decay factor.
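Written as solver settings with the “step” learning-rate policy, the same example becomes the following; under this policy Caffe computes lr = base_lr * gamma ^ floor(iter / stepsize):

```
# lr(iter) = base_lr * gamma ^ floor(iter / stepsize)
# iterations 0-499:   lr = 0.01
# iterations 500-999: lr = 0.001, and so on
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 500
```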
During training, the network is tested at regular intervals determined by test_interval; e.g. with test_interval = 1000, the network is tested once every 1000 training iterations.
The test batch size, test_iter, and the number of test images determine how testing is done: the test batch size (set in the test net's data layer) is the number of images fed in per test iteration, and test_iter is the number of iterations needed to go through all the test images. For example, with 500 test images and test_iter = 100, the test batch size is 5. In the solver file you only need to set test_iter according to the total number of test images, and set test_interval as needed.
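For the 500-image example, the corresponding solver fragment might look like the following; the test batch size of 5 is set in the test net's data layer, not in the solver file:

```
# Test every 1000 training iterations; one test run covers
# test_iter * test batch size = 100 * 5 = 500 images, i.e. the whole test set.
test_iter: 100
test_interval: 1000
```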