In machine learning and pattern recognition, overfitting can occur, and as a network starts to overfit its weights tend to grow large. To discourage this, a penalty term is added to the error function; a common choice is the sum of the squares of all the weights multiplied by a decay constant, which penalizes large weights.
The learning rate is a parameter that determines how much an update step influences the current value of the weights, while weight decay is an additional term in the weight update rule that causes the weights to decay exponentially to zero if no other update is scheduled.
So let's say that we have a cost or error function E(w) that we want to minimize. Gradient descent tells us to modify the weights w in the direction of steepest descent in E:

wi ← wi − η ∂E/∂wi

where η is the learning rate; if it is large you get a correspondingly large modification of the weights wi (in general it shouldn't be too large, otherwise you'll overshoot the local minimum of your cost function).
In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function to Ẽ(w) = E(w) + (λ/2)·w². In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter λ determines how you trade off the original cost E against the penalization of large weights.
Applying gradient descent to this new cost function we obtain:

wi ← wi − η ∂E/∂wi − ηλwi

The new term −ηλwi coming from the regularization causes the weight to decay in proportion to its size.
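To make the update above concrete, here is a minimal numpy sketch; the toy cost function, its gradient, and the values of η and λ are illustrative choices, not anything prescribed by the discussion above.

import numpy as np

# Toy cost E(w) = ||w - target||^2 (an illustrative choice); its gradient is 2*(w - target).
target = np.array([3.0, -2.0])

def grad_E(w):
    return 2.0 * (w - target)

eta = 0.1   # learning rate
lam = 0.01  # regularization strength lambda

w = np.zeros(2)
for _ in range(200):
    # plain gradient-descent step ...
    step = -eta * grad_E(w)
    # ... plus the weight-decay term -eta*lambda*w from the regularized cost
    step += -eta * lam * w
    w += step

print(w)  # close to the target, pulled slightly toward zero by the decay term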
In your solver you likely have a learning rate set as well as weight decay. lr_mult indicates what to multiply the learning rate by for a particular layer. This is useful if you want to update some layers with a smaller learning rate (e.g. when finetuning some layers while training others from scratch) or if you do not want to update the weights for one layer (perhaps you keep all the conv layers the same and just retrain fully connected layers). decay_mult is the same, just for weight decay.
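For concreteness, the per-layer values that end up being used are simply the solver settings multiplied by the layer's multipliers; a tiny sketch with made-up numbers:

# Solver-level settings (as in solver.prototxt); the numbers are made up.
base_lr = 0.01
weight_decay = 0.0005

# Multipliers from a layer's param blocks (e.g. the bias blob of a conv layer).
lr_mult = 2
decay_mult = 0   # a common choice: no weight decay on biases

effective_lr = base_lr * lr_mult                     # 0.02
effective_weight_decay = weight_decay * decay_mult   # 0.0

print(effective_lr, effective_weight_decay)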
References:
http://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
http://blog.csdn.net/u010025211/article/details/50055815
https://groups.google.com/forum/#!topic/caffe-users/8J_J8tc1ZHc
Vision Layers and Their Parameters
For the parameters shared by all layers, such as name, type, bottom, top and transform_param, please see my previous post: Caffe Learning Series (2): Data Layers and Parameters.
This post only covers the parameters of the vision layers, which include the Convolution, Pooling, Local Response Normalization (LRN) and im2col layers.
1. Convolution layer:
This is the convolution layer, the core layer of a convolutional neural network (CNN).
Layer type: Convolution
lr_mult: multiplier for the learning rate. The final learning rate is this value times base_lr in the solver.prototxt configuration file. If there are two lr_mult entries, the first applies to the weights and the second to the bias term; the bias learning rate is usually twice the weight learning rate.
In the convolution_param block that follows, we can set the parameters specific to the convolution layer.
Required parameters:
num_output: the number of convolution kernels (filters)
kernel_size: the size of the convolution kernel. If the kernel's height and width differ, set them separately with kernel_h and kernel_w.
Other parameters:
stride: the stride of the convolution kernel; defaults to 1. It can also be set with stride_h and stride_w.
pad: border padding; defaults to 0 (no padding). Padding is symmetric left/right and top/bottom: for example, with a 5*5 kernel, setting pad to 2 pads each of the four borders by 2 pixels, so the width and height each grow by 4 pixels and the feature map does not shrink after the convolution. It can also be set separately with pad_h and pad_w.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}
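As a quick sanity check on how kernel_size, stride and pad interact, the spatial output size follows the usual convolution formula; a small helper (the 28*28 input size is only an assumed example, not something fixed by the layer above):

def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    # Standard formula: floor((W + 2*pad - kernel) / stride) + 1
    return (input_size + 2 * pad - kernel_size) // stride + 1

# conv1 above: 5x5 kernel, stride 1, no padding, on an assumed 28x28 input
print(conv_output_size(28, 5, stride=1, pad=0))  # 24

# with pad = 2 for a 5x5 kernel, the feature map keeps its size
print(conv_output_size(28, 5, stride=1, pad=2))  # 28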
2. Pooling layer:
Layer type: Pooling
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
The pooling layer's computation works in basically the same way as the convolution layer, as the sketch below shows.
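Here is a small numpy sketch of 2-D max pooling with kernel_size 3 and stride 2 as in pool1; the input array is arbitrary, and Caffe's exact border and rounding handling is ignored.

import numpy as np

def max_pool2d(x, kernel_size, stride):
    # Slide a kernel_size x kernel_size window over x with the given stride,
    # taking the maximum of each window (same windowing as a convolution,
    # but with max instead of a weighted sum).
    out_h = (x.shape[0] - kernel_size) // stride + 1
    out_w = (x.shape[1] - kernel_size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + kernel_size,
                          j * stride:j * stride + kernel_size].max()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
print(max_pool2d(x, kernel_size=3, stride=2))  # 3x3 output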

3. Local Response Normalization (LRN) layer:
Layer type: LRN
LRN normalizes each value over a local region of the input (across neighbouring channels by default), acting as a form of lateral inhibition; local_size sets the size of that region, and alpha and beta are the scaling factor and exponent of the normalization formula.
layer {
  name: "norm1"
  type: "LRN"
  bottom: "pool1"
  top: "norm1"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
4. im2col layer
If you are familiar with MATLAB, you should know what im2col does: it splits a large matrix into overlapping sub-matrices, serializes each sub-matrix into a vector, and assembles the results into a new matrix.
A picture of the process makes this clear.
In Caffe, convolution is performed by first applying im2col to the data and then computing an inner product (a matrix multiplication). Done this way, it is faster than the naive convolution.
Let's look at how the two convolution approaches compare:
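A small numpy sketch of the comparison: a direct sliding-window convolution versus im2col followed by a single matrix product. The array sizes are arbitrary, and this is only a sketch of the idea, not Caffe's actual implementation.

import numpy as np

def conv2d_direct(x, w):
    # Direct convolution: slide the kernel over x and take dot products.
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def im2col(x, kh, kw):
    # Unfold every kh x kw patch of x into one column of a new matrix.
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.random.rand(6, 6)
w = np.random.rand(3, 3)

direct = conv2d_direct(x, w)
# im2col turns the convolution into a single matrix product.
gemm = (w.ravel() @ im2col(x, 3, 3)).reshape(direct.shape)

print(np.allclose(direct, gemm))  # True: both give the same result

Both paths produce the same feature map; the im2col form simply rearranges the work so that it becomes one large matrix multiplication, which highly optimized BLAS routines handle efficiently.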