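Concept: Adam is a first-order optimization algorithm that can replace classical stochastic gradient descent; it iteratively updates neural network weights based on the training data. Adam was proposed by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in the paper "Adam: A Method for Stochastic Optimization", submitted to ICLR 2015. The name "Adam" is neither an acronym nor a person's name; it is derived from adaptive moment estimation.
Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term. It uses estimates of the first and second moments of the gradient to dynamically adjust the learning rate of each parameter. Its main advantage is that, after bias correction, the learning rate at every iteration falls within a definite range, which keeps the parameter updates fairly stable. The formulas are as follows: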
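For reference, the standard Adam update rules from the paper can be written as below. The notation is the paper's and is an assumption about the symbols intended here: g_t is the gradient at step t, m_t and v_t are the first- and second-moment estimates, \beta_1 and \beta_2 their decay rates, \eta the learning rate, and \epsilon a small stability constant.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```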
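Here the first two formulas are the first-moment and second-moment estimates of the gradient, which can be viewed as estimates of the expectations E[g_t] and E[g_t^2];
Formulas 3 and 4 correct the first- and second-moment estimates so that they are approximately unbiased estimates of those expectations. As can be seen, estimating the moments directly from the gradient needs no memory beyond the two running averages themselves, and the estimates adjust dynamically with the gradient. In the last formula, the ratio that multiplies the learning rate η acts as a dynamic constraint on it, and that ratio has a definite range.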
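The five formulas can be sketched in plain Python for a single scalar parameter. This is a minimal illustration, not the PyTorch implementation: the function names `adam_step` and `minimize` and the toy objective f(x) = (x - 3)^2 are made up for this example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter (hypothetical helper).

    t is the 1-based step count; returns the new (theta, m, v).
    """
    # Formulas 1 and 2: running estimates of the first and second moments.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Formulas 3 and 4: bias correction, needed because m and v start at 0.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Formula 5: the parameter update.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize(grad_fn, theta, steps, lr):
    """Run `steps` Adam updates starting from `theta` (hypothetical helper)."""
    m = v = 0.0
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, grad_fn(theta), m, v, t, lr=lr)
    return theta


if __name__ == "__main__":
    # Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); minimum at x = 3.
    x = minimize(lambda x: 2.0 * (x - 3.0), theta=0.0, steps=1000, lr=0.1)
    print(x)
```

On the very first step the corrected moments are simply g and g^2, so the parameter moves by almost exactly lr regardless of how large the gradient is.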
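Advantages:
1. Combines the strengths of Adagrad, which handles sparse gradients well, and RMSprop, which handles non-stationary objectives well;
2. Low memory requirements;
3. Computes a separate adaptive learning rate for each parameter;
4. Also suited to most non-convex optimization problems, and to large datasets and high-dimensional spaces.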
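Point 3 can be illustrated with a small sketch. When a parameter's gradient is constant, the bias-corrected moments equal g and g^2, so Adam's step is about lr in magnitude whatever the gradient's scale, whereas a plain SGD step is proportional to the gradient. The helper name `adam_step_sizes` and the gradient values are chosen only for illustration.

```python
import math


def adam_step_sizes(grad, steps=5, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's update magnitudes for a constant gradient (hypothetical helper)."""
    m = v = 0.0
    sizes = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        sizes.append(abs(lr * m_hat / (math.sqrt(v_hat) + eps)))
    return sizes


if __name__ == "__main__":
    lr = 1e-3
    for grad in (100.0, 0.01):
        adam = adam_step_sizes(grad, lr=lr)[-1]
        sgd = abs(lr * grad)
        # Adam's step stays near lr; the SGD step scales with the gradient.
        print(f"grad={grad}: adam step={adam:.6f}, sgd step={sgd:.6f}")
```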
Usage and source code:
Constructor signature:
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
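Parameter meanings:
params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups.
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)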
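As a sketch of how these arguments fit together, including the dict form of `params`, the snippet below builds an optimizer with two parameter groups. The model layout and the specific values are arbitrary choices for illustration; options given inside a group's dict override the keyword defaults for that group.

```python
import torch

# A small two-layer model; the sizes are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)

optimizer = torch.optim.Adam(
    [
        # First group: uses the keyword defaults given below.
        {"params": model[0].parameters()},
        # Second group: overrides lr and weight_decay for the last layer only.
        {"params": model[2].parameters(), "lr": 1e-2, "weight_decay": 1e-4},
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)

# Each group carries its own resolved hyperparameters.
for i, group in enumerate(optimizer.param_groups):
    print(i, group["lr"], group["betas"], group["eps"], group["weight_decay"])
```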
Source code of torch.optim.adam:
import math
from .optimizer import Optimizer

class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = grad.new().resize_as_(grad).zero_()
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = grad.new().resize_as_(grad).zero_()

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                if group['weight_decay'] != 0:
                    grad = grad.add(group['weight_decay'], p.data)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(-step_size, exp_avg, denom)

        return loss
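One detail worth noting when reading `step`: rather than computing bias-corrected moments separately, the code folds both corrections into `step_size`. A short derivation suggests how this relates to the textbook form. The notation is assumed here rather than taken from the source code: m_t and v_t stand for `exp_avg` and `exp_avg_sq`, \eta for `lr`, and \hat{m}_t = m_t/(1-\beta_1^t), \hat{v}_t = v_t/(1-\beta_2^t) for the bias-corrected moments.

```latex
\begin{aligned}
\Delta\theta_t
&= -\,\eta\,\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}\cdot\frac{m_t}{\sqrt{v_t}+\epsilon} \\
&= -\,\eta\,\frac{m_t/(1-\beta_1^t)}{\bigl(\sqrt{v_t}+\epsilon\bigr)/\sqrt{1-\beta_2^t}} \\
&= -\,\eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon/\sqrt{1-\beta_2^t}}
\end{aligned}
```

So this version matches the update \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) exactly when \epsilon = 0, and otherwise behaves as if \epsilon were scaled by 1/\sqrt{1-\beta_2^t}, a factor that approaches 1 as t grows.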
Usage example:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
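That should be enough to handle the vast majority of applications, which is basically all I set out to cover. The next step is to deepen this understanding through practice.
References: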
1 https://blog.csdn.net/kgzhang/article/details/77479737
2 https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html