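I. Parameter Update Strategies
1. SGD
This is stochastic gradient descent. The simplest form of update moves the parameters along the negative gradient direction (the gradient points in the direction of increase, but we usually want to minimize the loss function). Given a parameter vector x and its gradient dx, the simplest update is: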
x += - learning_rate * dx
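Here learning_rate is a hyperparameter that sets the size of each step. It is an important one: if it is too large the loss can oscillate or diverge, and if it is too small training takes too long. Some tricks for adapting the learning rate are covered later.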
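To make the effect of the learning rate concrete, here is a minimal sketch on a toy problem. The loss f(x) = x^2, the helper name run_sgd, and the three learning rates are my own assumptions for illustration; they are not from the course code.

```python
def run_sgd(learning_rate, steps=50, x0=5.0):
    """Plain gradient descent on the toy loss f(x) = x**2 (hypothetical example)."""
    x = x0
    for _ in range(steps):
        dx = 2.0 * x                 # gradient of x**2
        x += -learning_rate * dx     # the vanilla update from the text
    return x

x_good = run_sgd(0.1)     # shrinks toward the minimum at 0
x_large = run_sgd(1.1)    # every step overshoots, so |x| grows
x_small = run_sgd(0.001)  # moves in the right direction, but very slowly

print(x_good, x_large, x_small)
```

With the large rate each step overshoots the minimum by more than it started from, which is the "abnormal loss" case; with the tiny rate x has barely moved after 50 steps.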
2. Momentum
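Also known as the momentum method. I find that many sources do not explain it clearly. My own simple understanding is that when updating the parameters we should keep information about the previous gradients. A function that has been descending steadily in one direction is unlikely to need a sudden sharp turn sideways, so accumulating past gradients into a velocity smooths out such turns. The benefit is clearest in a ravine (I am not sure of the standard term; I mean the situation where the loss keeps decreasing along one direction while the gradient swings back and forth with large amplitude across it): the oscillating components largely cancel in the velocity, while the consistent component builds up.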
import numpy as np

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    # The velocity is a decaying sum of past gradient steps.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v

    return next_w, config
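To see the ravine argument in numbers, here is a small sketch that compares momentum = 0 (plain SGD) with momentum = 0.9 on an elongated quadratic bowl. The loss, the curvature values, the starting point, and the helper names are all assumptions I picked for illustration.

```python
import numpy as np

# Hypothetical ravine: shallow along x[0], steep along x[1].
CURVATURE = np.array([1.0, 50.0])

def loss(x):
    return 0.5 * np.sum(CURVATURE * x ** 2)

def grad(x):
    return CURVATURE * x

def run(momentum, learning_rate=0.03, steps=100):
    """Same update rule as sgd_momentum; momentum=0 gives plain SGD."""
    x = np.array([10.0, 1.0])
    v = np.zeros_like(x)
    for _ in range(steps):
        v = momentum * v - learning_rate * grad(x)
        x = x + v
    return loss(x)

loss_sgd = run(momentum=0.0)
loss_momentum = run(momentum=0.9)
print("final loss, SGD:     ", loss_sgd)
print("final loss, momentum:", loss_momentum)
```

The learning rate is limited by the steep direction, so plain SGD crawls along the shallow one; the velocity lets momentum keep moving along the shallow direction and it ends with a lower loss after the same number of steps.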
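3. Nesterov
This is an improved version of the momentum method. Since the velocity term mu * v is going to move the parameters anyway, why not evaluate the gradient at the point where that move will land (the look-ahead point) instead of at the current position?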
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead, not at x)
v = mu * v - learning_rate * dx_ahead
x += v
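In actual code we use the form below instead, because it has the same interface as the earlier update rules: in general all we have is dx, the gradient at the parameters we currently store: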
v_prev = v                               # back up the old velocity
v = mu * v - learning_rate * dx          # velocity update is unchanged
x += -mu * v_prev + (1 + mu) * v         # position update changes form
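Why is this allowed? It is an exact change of variables, not an approximation. Store x_ahead = x + mu * v in place of x, so that the dx we compute is already the gradient at the look-ahead point. After the velocity update, the new position is x + v and the new look-ahead point is (x + v) + mu * v. Substituting x = x_ahead - mu * v_prev gives x_ahead - mu * v_prev + (1 + mu) * v, which is exactly the position update above. So there is never a need to compute a separate dx_ahead: the variable being optimized is the look-ahead point itself.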
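The relation between the two Nesterov forms can also be checked numerically. The sketch below runs both on a small quadratic (the matrix A, the step sizes, and the starting point are assumptions of mine) and compares the stored variable of the second form with x + mu * v from the first form.

```python
import numpy as np

A = np.diag([1.0, 10.0])          # hypothetical quadratic f(x) = 0.5 * x^T A x

def grad(x):
    return A @ x

mu, learning_rate = 0.9, 0.05

# Form 1: keep x, evaluate the gradient at the look-ahead point.
x = np.array([3.0, -2.0])
v1 = np.zeros_like(x)

# Form 2: keep the look-ahead point y directly, only ever use its own gradient.
y = x.copy()                      # equals x + mu * v1 because v1 starts at zero
v2 = np.zeros_like(y)

max_gap = 0.0
for _ in range(20):
    x_ahead = x + mu * v1
    v1 = mu * v1 - learning_rate * grad(x_ahead)
    x = x + v1

    v_prev = v2
    v2 = mu * v2 - learning_rate * grad(y)
    y = y - mu * v_prev + (1 + mu) * v2

    max_gap = max(max_gap, np.max(np.abs(y - (x + mu * v1))))

print("largest difference between y and x + mu * v:", max_gap)
```

The difference stays at floating-point noise, which is what the change of variables predicts.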
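4. RMSProp and Adam
Both of these are adaptive learning rate strategies: they scale the step for each parameter using running statistics of its gradient.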
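As a quick illustration of what per-parameter scaling means, the sketch below takes one Adam-style step (moving averages of the gradient and its square, with bias correction) from zero-initialized state. The gradient values are made up, chosen to differ by four orders of magnitude.

```python
import numpy as np

learning_rate, beta1, beta2, epsilon = 1e-3, 0.9, 0.999, 1e-8

dx = np.array([100.0, -0.01])      # hypothetical gradients of very different size
m = np.zeros_like(dx)              # moving average of the gradient
v = np.zeros_like(dx)              # moving average of the squared gradient
t = 1                              # first iteration

m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * dx ** 2
m_hat = m / (1 - beta1 ** t)       # bias correction undoes the zero initialization
v_hat = v / (1 - beta2 ** t)

adam_step = -learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
sgd_step = -learning_rate * dx

print("plain SGD step:", sgd_step)
print("Adam step:     ", adam_step)
```

Plain SGD moves the first parameter ten thousand times farther than the second, while the Adam step has magnitude of roughly learning_rate for both and only the sign follows the gradient.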
def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    cache = config['cache']
    decay_rate = config['decay_rate']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']

    # Exponential moving average of the squared gradient. Unlike a plain
    # running sum (Adagrad), old values decay, so the effective learning
    # rate is smoothed rather than shrinking toward zero over time.
    cache = decay_rate * cache + (1 - decay_rate) * (dx**2)
    x += - learning_rate * dx / (np.sqrt(cache) + epsilon)  # note: updates x in place
    config['cache'] = cache
    next_x = x

    return next_x, config


def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)

    m = config['m']
    v = config['v']
    beta1 = config['beta1']
    beta2 = config['beta2']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']
    t = config['t']

    t += 1
    m = beta1 * m + (1 - beta1) * dx          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * (dx**2)     # second moment (RMSProp-like)
    m_bias = m / (1 - beta1**t)               # bias-corrected first moment
    v_bias = v / (1 - beta2**t)               # bias-corrected second moment
    x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)  # updates x in place
    next_x = x
    config['m'] = m
    config['v'] = v
    config['t'] = t

    return next_x, config
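II. Batch Normalization
Next come two techniques that are very useful when training neural networks. The first is batch normalization. In short: we know the input to a network should be preprocessed, typically to zero mean and unit variance, which helps training. After the data has passed through a few layers, though, the activations will almost certainly no longer have this property. What BN does is apply the same normalization at the start of each layer, using the statistics of the current mini-batch. Normalizing may throw away some useful information, so a scale gamma and a shift beta are added to let the network recover the original activations; both gamma and beta are learnable.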
# Initialize the BN parameters
if self.use_batchnorm and i < len(hidden_dims):
    self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))
    self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
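The key pieces of BN are the forward pass, the backward pass, and how the layer is used at test time.
Following the batch normalization formulas, the BN layer can be built as follows: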
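For reference, the forward pass and the gradients of the backward pass can be written out as below. This is my own transcription in standard notation, assuming a mini-batch of N samples x_i, a loss L, and the same biased variance estimate that np.var computes; the symbols are chosen to line up with the variable names used in the code.

```latex
\begin{align*}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, &
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \\
\hat{x}_i &= \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, &
y_i &= \gamma\,\hat{x}_i + \beta, \\[6pt]
\frac{\partial L}{\partial \hat{x}_i} &= \frac{\partial L}{\partial y_i}\,\gamma, \\
\frac{\partial L}{\partial \sigma^2} &= -\frac{1}{2}\,(\sigma^2 + \epsilon)^{-3/2}
  \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}\,(x_i - \mu), \\
\frac{\partial L}{\partial \mu} &= -\frac{1}{\sqrt{\sigma^2 + \epsilon}}
  \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}
  \;-\; 2\,\frac{\partial L}{\partial \sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu), \\
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma^2 + \epsilon}}
  + \frac{2}{N}\,\frac{\partial L}{\partial \sigma^2}\,(x_i - \mu)
  + \frac{1}{N}\,\frac{\partial L}{\partial \mu}, \\
\frac{\partial L}{\partial \gamma} &= \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\,\hat{x}_i, &
\frac{\partial L}{\partial \beta} &= \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}.
\end{align*}
```

The second term of the gradient with respect to the mean is zero in exact arithmetic, since the centered values sum to zero, but it is kept so that every term has a counterpart in the code.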
def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']  # train and test take two different paths
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        sample_mean = np.mean(x, axis=0, keepdims=True)                # [1,D]
        sample_var = np.var(x, axis=0, keepdims=True)                  # [1,D]
        x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)   # [N,D]
        out = gamma * x_normalized + beta
        cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)
        # Use momentum to maintain the running_mean and running_var
        # that will be used at test time.
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':
        # At test time, normalize with the running statistics instead of
        # the statistics of the current batch.
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normalized + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache


def batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N, D = x.shape

    dx_normalized = dout * gamma                        # [N,D]
    x_mu = x - sample_mean                              # [N,D]
    sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]

    dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
    dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \
                   2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)

    dx1 = dx_normalized * sample_std_inv    # direct path through x_normalized
    dx2 = 2.0/N * dsample_var * x_mu        # path through the batch variance
    dx = dx1 + dx2 + 1.0/N * dsample_mean   # plus the path through the batch mean
    dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)

    return dx, dgamma, dbeta
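A backward pass like this is easy to get subtly wrong, so it is worth comparing it against a numerical gradient. The sketch below is a condensed, self-contained copy of the train-mode forward pass and the dx computation; the helper names, the random test data, and the step size h are my own assumptions.

```python
import numpy as np

def bn_forward_train(x, gamma, beta, eps=1e-5):
    """Condensed train-mode forward pass (no running statistics)."""
    mu = np.mean(x, axis=0, keepdims=True)
    var = np.var(x, axis=0, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (gamma, mu, var, x, eps)

def bn_backward_dx(dout, cache):
    """Same staged computation of dx as in batchnorm_backward."""
    gamma, mu, var, x, eps = cache
    N = x.shape[0]
    dx_hat = dout * gamma
    x_mu = x - mu
    std_inv = 1.0 / np.sqrt(var + eps)
    dvar = -0.5 * np.sum(dx_hat * x_mu, axis=0, keepdims=True) * std_inv ** 3
    dmu = (-np.sum(dx_hat * std_inv, axis=0, keepdims=True)
           - 2.0 * dvar * np.mean(x_mu, axis=0, keepdims=True))
    return dx_hat * std_inv + 2.0 / N * dvar * x_mu + dmu / N

def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                      # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma = rng.normal(size=(1, 3))
beta = rng.normal(size=(1, 3))
dout = rng.normal(size=(8, 3))            # stands in for the upstream gradient

out, cache = bn_forward_train(x, gamma, beta)
dx = bn_backward_dx(dout, cache)
# Scalar loss whose gradient with respect to out is exactly dout.
dx_num = numerical_grad(lambda z: np.sum(bn_forward_train(z, gamma, beta)[0] * dout), x)

max_err = np.max(np.abs(dx - dx_num))
print("max abs difference:", max_err)
```

If the analytic and numerical gradients agree to many decimal places, the staged formula is very likely implemented correctly.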
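One important problem that Batch Normalization addresses is gradient saturation. Combined with ReLU, it largely takes care of the saturation problem in practice.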
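III. Dropout
Dropout is very easy to understand: during training, the neurons of each layer are dropped with a certain probability, as shown in the figure below:
My understanding: applying dropout at every training step helps prevent overfitting. Why? Each step trains a model that looks different, yet all of these models share the same parameters, so dropout can be loosely viewed as a bagging operation, and bagging is well known in machine learning as a very effective guard against overfitting. Limiting the number of active neurons at each step also keeps an oversized network from overfitting the dataset.
Dropout can also be seen as a form of regularization: at each training step it forces some features to zero, which encourages the network to learn sparser representations.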
def dropout_forward(x, dropout_param):
    # Here p is the probability of keeping a neuron.
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None
    if mode == 'train':
        # Note the division by p here (inverted dropout): it keeps the
        # expected activation unchanged, so at test time the input can be
        # passed through as is.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        out = x

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        dx = dout * mask   # gradient flows only through the kept units
    elif mode == 'test':
        dx = dout
    return dx
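The reason for dividing the mask by p can be checked with a few lines. This is a minimal sketch with an all-ones input and a keep probability of 0.5, both of which I chose only to make the numbers easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # probability of keeping a unit
x = np.ones((1000, 100))                  # hypothetical activations

mask = (rng.random(x.shape) < p) / p      # kept units are scaled up by 1 / p
train_out = x * mask                      # train mode: drop and rescale
test_out = x                              # test mode: identity

kept_fraction = np.count_nonzero(mask) / mask.size
train_mean = train_out.mean()
test_mean = test_out.mean()

print("fraction of units kept:", kept_fraction)
print("mean activation, train:", train_mean)
print("mean activation, test: ", test_mean)
```

About half the units are zeroed and the rest are doubled, so the average activation seen by the next layer is about the same in both modes. Without the division, the test-time output would have to be multiplied by p instead.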
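IV. Summary
Overall, many of these tricks are very useful when training a model, although some, such as BN, are already gradually being replaced or dropped, which shows how important it is to keep up with the latest conferences. Also, some of the derivations still need to be worked through by hand; I plan to do them when I go back over the cs231n lecture notes.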