caffe 中base_lr、weight_decay、lr_mult、decay_mult代表什么意思?


機器學習或者模式識別中,會出現overfitting,而當網絡逐漸overfitting時網絡權值逐漸變大,因此,為了避免出現overfitting,會給誤差函數添加一個懲罰項,常用的懲罰項是所有權重的平方乘以一個衰減常量之和。其用來懲罰大的權值。

The learning rate is a parameter that determines how much an updating step influences the current value of the weights. While weight decay is an additional term in the weight update rule that causes the weights to exponentially decay to zero, if no other update is scheduled.

So let's say that we have a cost or error function E(w) that we want to minimize. Gradient descent tells us to modify the weights w in the direction of steepest descent in E:

wiwiηEwi,

where η is the learning rate, and if it's large you will have a correspondingly large modification of the weights wi(in general it shouldn't be too large, otherwise you'll overshoot the local minimum in your cost function).

In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero mean Gaussian prior over the weights, which is equivalent to changing the cost function to E˜(w)=E(w)+λ2w2. In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter λ determines how you trade off the original cost E with the large weights penalization.

Applying gradient descent to this new cost function we obtain:

wiwiηEwiηλwi.

The new term ηλwi coming from the regularization causes the weight to decay in proportion to its size.

In your solver you likely have a learning rate set as well as weight decay.  lr_mult indicates what to multiply the learning rate by for a particular layer.  This is useful if you want to update some layers with a smaller learning rate (e.g. when finetuning some layers while training others from scratch) or if you do not want to update the weights for one layer (perhaps you keep all the conv layers the same and just retrain fully connected layers).  decay_mult is the same, just for weight decay.

 

參考資料:http://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate

     http://blog.csdn.net/u010025211/article/details/50055815

     https://groups.google.com/forum/#!topic/caffe-users/8J_J8tc1ZHc


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM