The second assignment is quite difficult, but after working (copying) my way through it, I still learned a lot.
The first step is to refactor the earlier neural network code so that a fully connected neural network of arbitrary size can be built. The code is organized in a modular way, with each layer exposing a forward and a backward function, roughly like this template:
# Forward pass
def layer_forward(x, w):
  """ Receive inputs x and weights w """
  # Do the forward computation ...
  z = # intermediate value we need to store so the backward pass can use it
  # Do some more computations ...
  out = # the output

  cache = (x, w, z, out)  # Values we need to compute gradients

  return out, cache


# Backward pass
def layer_backward(dout, cache):
  """
  Receive derivative of loss with respect to outputs and cache,
  and compute derivative with respect to inputs.
  """
  # Unpack cache values
  x, w, z, out = cache

  # Use values in cache to compute derivatives
  dx = # Derivative of loss with respect to x
  dw = # Derivative of loss with respect to w

  return dx, dw
Guided by this design, the assignment asks us to implement the following layers:
def affine_forward(x, w, b):
  """
  x has shape (N, d_1, ..., d_k): the first dimension is the minibatch size and
  the remaining dimensions are the shape of each image, so on the way in we
  flatten everything after the first dimension into a single vector.

  Inputs:
  - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
  - w: A numpy array of weights, of shape (D, M)
  - b: A numpy array of biases, of shape (M,)

  Returns a tuple of:
  - out: output, of shape (N, M)
  - cache: (x, w, b)
  """
  out = None
  N = x.shape[0]
  x_new = x.reshape(N, -1)          # flatten to a 2-D array of shape (N, D)
  out = np.dot(x_new, w) + b
  cache = (x, w, b)                 # no need to store out
  return out, cache


def affine_backward(dout, cache):
  x, w, b = cache
  dx, dw, db = None, None, None
  dx = np.dot(dout, w.T)
  dx = np.reshape(dx, x.shape)      # restore the original input shape
  x_new = x.reshape(x.shape[0], -1)
  dw = np.dot(x_new.T, dout)
  db = np.sum(dout, axis=0, keepdims=True)
  return dx, dw, db


def relu_forward(x):
  """
  Computes the forward pass for a layer of rectified linear units (ReLUs).

  Input:
  - x: Inputs, of any shape

  Returns a tuple of:
  - out: Output, of the same shape as x
  - cache: x
  """
  out = None
  out = np.maximum(0, x)
  cache = x
  return out, cache


def relu_backward(dout, cache):
  dx, x = None, cache
  #############################################################################
  # TODO: Implement the ReLU backward pass.                                   #
  #############################################################################
  dx = dout.copy()                  # copy so we do not modify dout in place
  dx[x <= 0] = 0
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  return dx
A point worth pausing on is the formula for db, namely db = np.sum(dout, axis=0, keepdims=True). At first glance it looks as if an averaging step over the minibatch is missing, but none is needed: the upstream gradient dout already carries the 1/N factor from the softmax data loss, so a plain sum over the batch dimension is exactly the gradient with respect to b. The keepdims=True simply keeps db as a (1, M) row vector, matching how the biases are initialized later, and the gradient-check code does not need any special treatment for it.
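To convince ourselves, here is a minimal numeric check of db against a centered finite difference. It is only a sketch that relies on the affine_forward / affine_backward functions above; the assignment itself does the same kind of check with eval_numerical_gradient_array from cs231n.gradient_check.

import numpy as np

np.random.seed(0)
x = np.random.randn(4, 5)
w = np.random.randn(5, 3)
b = np.random.randn(3)
dout = np.random.randn(4, 3)

_, cache = affine_forward(x, w, b)
_, _, db = affine_backward(dout, cache)

# numeric gradient of f(b) = sum(affine_forward(x, w, b) * dout) with respect to b
h = 1e-5
db_num = np.zeros_like(b)
for i in xrange(b.size):
  bp, bm = b.copy(), b.copy()
  bp[i] += h
  bm[i] -= h
  db_num[i] = (np.sum(affine_forward(x, w, bp)[0] * dout)
               - np.sum(affine_forward(x, w, bm)[0] * dout)) / (2 * h)

print np.max(np.abs(db.ravel() - db_num))   # should be tiny, around 1e-9 or smaller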
With these two basic layers in place, we can build a "sandwich" layer. Since the fc-relu pattern is so common, it is provided directly:
def affine_relu_forward(x, w, b):
  """
  Convenience layer that performs an affine transform followed by a ReLU

  Inputs:
  - x: Input to the affine layer
  - w, b: Weights for the affine layer

  Returns a tuple of:
  - out: Output from the ReLU
  - cache: Object to give to the backward pass
  """
  a, fc_cache = affine_forward(x, w, b)
  out, relu_cache = relu_forward(a)
  cache = (fc_cache, relu_cache)
  return out, cache


def affine_relu_backward(dout, cache):
  """
  Backward pass for the affine-relu convenience layer
  """
  fc_cache, relu_cache = cache
  da = relu_backward(dout, relu_cache)
  dx, dw, db = affine_backward(da, fc_cache)
  return dx, dw, db
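A quick shape check of the sandwich layer (a sketch with made-up sizes; note that because of keepdims=True above, db comes back with shape (1, M) rather than (M,)):

import numpy as np

x = np.random.randn(2, 3, 4)               # input that is not yet flattened
w = np.random.randn(12, 10)                # 3*4 = 12 input features, 10 outputs
b = np.random.randn(10)

out, cache = affine_relu_forward(x, w, b)  # out has shape (2, 10)
dout = np.random.randn(*out.shape)
dx, dw, db = affine_relu_backward(dout, cache)

assert dx.shape == x.shape
assert dw.shape == w.shape
assert db.shape == (1, 10)                 # (1, M) because of keepdims=True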
Later on there is a higher-level network built from these layers, which I will not go into; instead, let's go straight to the most powerful class so far, FullyConnectedNet. Code and comments first:
class FullyConnectedNet(object):
  """
  A fully-connected neural network with an arbitrary number of hidden layers,
  ReLU nonlinearities, and a softmax loss function. This will also implement
  dropout and batch normalization as options. For a network with L layers,
  the architecture will be

  {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

  where batch normalization and dropout are optional, and the {...} block is
  repeated L - 1 times.

  Similar to the TwoLayerNet above, learnable parameters are stored in the
  self.params dictionary and will be learned using the Solver class.
  """

  def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
               dropout=0, use_batchnorm=False, reg=0.0,
               weight_scale=1e-2, dtype=np.float32, seed=None):
    """
    Initialize a new FullyConnectedNet.

    Inputs:
    - hidden_dims: A list of integers giving the size of each hidden layer.
    - input_dim: An integer giving the size of the input.
    - num_classes: An integer giving the number of classes to classify.
    - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then
      the network should not use dropout at all.
    - use_batchnorm: Whether or not the network should use batch normalization.
    - reg: Scalar giving L2 regularization strength.
    - weight_scale: Scalar giving the standard deviation for random
      initialization of the weights.
    - dtype: A numpy datatype object; all computations will be performed using
      this datatype. float32 is faster but less accurate, so you should use
      float64 for numeric gradient checking.
    - seed: If not None, then pass this random seed to the dropout layers. This
      will make the dropout layers deterministic so we can gradient check the
      model.
    """
    self.use_batchnorm = use_batchnorm
    self.use_dropout = dropout > 0
    self.reg = reg
    self.num_layers = 1 + len(hidden_dims)
    self.dtype = dtype
    self.params = {}

    ############################################################################
    # TODO: Initialize the parameters of the network, storing all values in    #
    # the self.params dictionary. Store weights and biases for the first layer #
    # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
    # initialized from a normal distribution with standard deviation equal to  #
    # weight_scale and biases should be initialized to zero.                   #
    #                                                                          #
    # When using batch normalization, store scale and shift parameters for the #
    # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
    # beta2, etc. Scale parameters should be initialized to one and shift      #
    # parameters should be initialized to zero.                                #
    ############################################################################
    # layers_dims holds the size of every layer; hidden_dims is already a list,
    # so wrap input_dim and num_classes in lists to concatenate them.
    layers_dims = [input_dim] + hidden_dims + [num_classes]
    for i in xrange(self.num_layers):
      self.params['W' + str(i + 1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i + 1])
      self.params['b' + str(i + 1)] = np.zeros((1, layers_dims[i + 1]))
      if self.use_batchnorm and i < len(hidden_dims):  # the last layer does not get batchnorm
        self.params['gamma' + str(i + 1)] = np.ones((1, layers_dims[i + 1]))
        self.params['beta' + str(i + 1)] = np.zeros((1, layers_dims[i + 1]))
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # When using dropout we need to pass a dropout_param dictionary to each
    # dropout layer so that the layer knows the dropout probability and the mode
    # (train / test). You can pass the same dropout_param to each dropout layer.
    self.dropout_param = {}
    if self.use_dropout:
      self.dropout_param = {'mode': 'train', 'p': dropout}
      if seed is not None:
        self.dropout_param['seed'] = seed

    # With batch normalization we need to keep track of running means and
    # variances, so we need to pass a special bn_param object to each batch
    # normalization layer. You should pass self.bn_params[0] to the forward pass
    # of the first batch normalization layer, self.bn_params[1] to the forward
    # pass of the second batch normalization layer, etc.
    self.bn_params = []
    if self.use_batchnorm:
      self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

    # Cast all parameters to the correct datatype
    for k, v in self.params.iteritems():
      self.params[k] = v.astype(dtype)


  def loss(self, X, y=None):
    """
    Compute loss and gradient for the fully-connected net.

    Input / output: Same as TwoLayerNet above.
    """
    X = X.astype(self.dtype)
    mode = 'test' if y is None else 'train'

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.dropout_param is not None:
      self.dropout_param['mode'] = mode
    if self.use_batchnorm:
      for bn_param in self.bn_params:
        bn_param['mode'] = mode

    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully-connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    #                                                                          #
    # When using batch normalization, you'll need to pass self.bn_params[0] to #
    # the forward pass for the first batch normalization layer, pass           #
    # self.bn_params[1] to the forward pass for the second batch normalization #
    # layer, etc.                                                              #
    ############################################################################
    h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}
    out[0] = X  # store every layer's output; logically X is out[0]

    # Forward pass: compute loss
    for i in xrange(self.num_layers - 1):
      # fetch this layer's parameters
      w, b = self.params['W' + str(i + 1)], self.params['b' + str(i + 1)]
      if self.use_batchnorm:
        gamma, beta = self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)]
        h[i], cache1[i] = affine_forward(out[i], w, b)
        bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
        out[i + 1], cache3[i] = relu_forward(bn[i])
        if self.use_dropout:
          out[i + 1], cache4[i] = dropout_forward(out[i + 1], self.dropout_param)
      else:
        out[i + 1], cache3[i] = affine_relu_forward(out[i], w, b)
        if self.use_dropout:
          out[i + 1], cache4[i] = dropout_forward(out[i + 1], self.dropout_param)

    W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
    scores, cache = affine_forward(out[self.num_layers - 1], W, b)  # the final affine layer
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early
    if mode == 'test':
      return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully-connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    #                                                                          #
    # When using batch normalization, you don't need to regularize the scale   #
    # and shift parameters.                                                    #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    data_loss, dscores = softmax_loss(scores, y)
    reg_loss = 0
    for i in xrange(self.num_layers):
      reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i + 1)] * self.params['W' + str(i + 1)])
    loss = data_loss + reg_loss

    # Backward pass: compute gradients
    dout, dbn, dh, ddrop = {}, {}, {}, {}
    t = self.num_layers - 1
    # cache here is the one returned by the final affine_forward above
    dout[t], grads['W' + str(t + 1)], grads['b' + str(t + 1)] = affine_backward(dscores, cache)
    for i in xrange(t):
      if self.use_batchnorm:
        if self.use_dropout:
          dout[t - i] = dropout_backward(dout[t - i], cache4[t - 1 - i])
        dbn[t - 1 - i] = relu_backward(dout[t - i], cache3[t - 1 - i])
        dh[t - 1 - i], grads['gamma' + str(t - i)], grads['beta' + str(t - i)] = batchnorm_backward(dbn[t - 1 - i], cache2[t - 1 - i])
        dout[t - 1 - i], grads['W' + str(t - i)], grads['b' + str(t - i)] = affine_backward(dh[t - 1 - i], cache1[t - 1 - i])
      else:
        if self.use_dropout:
          dout[t - i] = dropout_backward(dout[t - i], cache4[t - 1 - i])
        dout[t - 1 - i], grads['W' + str(t - i)], grads['b' + str(t - i)] = affine_relu_backward(dout[t - i], cache3[t - 1 - i])

    # Add the regularization gradient contribution
    for i in xrange(self.num_layers):
      grads['W' + str(i + 1)] += self.reg * self.params['W' + str(i + 1)]
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads
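Before hooking this up to a Solver, a quick sanity check on random data is worthwhile. The sketch below uses illustrative sizes and assumes the softmax_loss function from the assignment's layers is available: with small random weights and reg=0.0 the initial loss should be close to log(C), about 2.3 for 10 classes, and turning on regularization should make it go up.

import numpy as np

N, D, H1, H2, C = 3, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                          reg=0.0, weight_scale=1e-2, dtype=np.float64)
loss, grads = model.loss(X, y)
print 'initial loss (no reg): ', loss     # should be close to log(10), about 2.3

model.reg = 1.0
loss, grads = model.loss(X, y)
print 'initial loss (reg=1.0):', loss     # should be larger than before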
Because this is high-level code that does not worry about how each backward pass is implemented (we already did that earlier), it is fairly easy to follow. But we are not done yet: we still need a Solver to optimize the network.
import numpy as np

from cs231n import optim


class Solver(object):
  """
  A Solver encapsulates all the logic necessary for training classification
  models. The Solver performs stochastic gradient descent using different
  update rules defined in optim.py.

  The solver accepts both training and validation data and labels so it can
  periodically check classification accuracy on both training and validation
  data to watch out for overfitting.

  To train a model, you will first construct a Solver instance, passing the
  model, dataset, and various options (learning rate, batch size, etc) to the
  constructor. You will then call the train() method to run the optimization
  procedure and train the model.

  After the train() method returns, model.params will contain the parameters
  that performed best on the validation set over the course of training.
  In addition, the instance variable solver.loss_history will contain a list
  of all losses encountered during training and the instance variables
  solver.train_acc_history and solver.val_acc_history will be lists containing
  the accuracies of the model on the training and validation set at each epoch.

  Example usage might look something like this:

  data = {
    'X_train': # training data
    'y_train': # training labels
    'X_val': # validation data
    'y_val': # validation labels
  }
  model = MyAwesomeModel(hidden_size=100, reg=10)
  solver = Solver(model, data,
                  update_rule='sgd',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  lr_decay=0.95,
                  num_epochs=10, batch_size=100,
                  print_every=100)
  solver.train()


  A Solver works on a model object that must conform to the following API:

  - model.params must be a dictionary mapping string parameter names to numpy
    arrays containing parameter values.

  - model.loss(X, y) must be a function that computes training-time loss and
    gradients, and test-time classification scores, with the following inputs
    and outputs:

    Inputs:
    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
      label for X[i].

    Returns:
    If y is None, run a test-time forward pass and return:
    - scores: Array of shape (N, C) giving classification scores for X where
      scores[i, c] gives the score of class c for X[i].

    If y is not None, run a training time forward and backward pass and return
    a tuple of:
    - loss: Scalar giving the loss
    - grads: Dictionary with the same keys as self.params mapping parameter
      names to gradients of the loss with respect to those parameters.
  """

  def __init__(self, model, data, **kwargs):
    """
    Construct a new Solver instance.

    Required arguments:
    - model: A model object conforming to the API described above
    - data: A dictionary of training and validation data with the following:
      'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
      'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
      'y_train': Array of shape (N_train,) giving labels for training images
      'y_val': Array of shape (N_val,) giving labels for validation images

    Optional arguments:
    - update_rule: A string giving the name of an update rule in optim.py.
      Default is 'sgd'.
    - optim_config: A dictionary containing hyperparameters that will be
      passed to the chosen update rule. Each update rule requires different
      hyperparameters (see optim.py) but all update rules require a
      'learning_rate' parameter so that should always be present.
    - lr_decay: A scalar for learning rate decay; after each epoch the learning
      rate is multiplied by this value.
    - batch_size: Size of minibatches used to compute loss and gradient during
      training.
    - num_epochs: The number of epochs to run for during training.
    - print_every: Integer; training losses will be printed every print_every
      iterations.
    - verbose: Boolean; if set to false then no output will be printed during
      training.
    """
    self.model = model
    self.X_train = data['X_train']
    self.y_train = data['y_train']
    self.X_val = data['X_val']
    self.y_val = data['y_val']

    # Unpack keyword arguments
    self.update_rule = kwargs.pop('update_rule', 'sgd')
    self.optim_config = kwargs.pop('optim_config', {})
    self.lr_decay = kwargs.pop('lr_decay', 1.0)
    self.batch_size = kwargs.pop('batch_size', 100)
    self.num_epochs = kwargs.pop('num_epochs', 10)

    self.print_every = kwargs.pop('print_every', 10)
    self.verbose = kwargs.pop('verbose', True)

    # Throw an error if there are extra keyword arguments
    if len(kwargs) > 0:
      extra = ', '.join('"%s"' % k for k in kwargs.keys())
      raise ValueError('Unrecognized arguments %s' % extra)

    # Make sure the update rule exists, then replace the string
    # name with the actual function
    if not hasattr(optim, self.update_rule):
      raise ValueError('Invalid update_rule "%s"' % self.update_rule)
    self.update_rule = getattr(optim, self.update_rule)

    self._reset()


  def _reset(self):
    """
    Set up some book-keeping variables for optimization. Don't call this
    manually.
    """
    # Set up some variables for book-keeping
    self.epoch = 0
    self.best_val_acc = 0
    self.best_params = {}
    self.loss_history = []
    self.train_acc_history = []
    self.val_acc_history = []

    # Make a deep copy of the optim_config for each parameter
    self.optim_configs = {}
    for p in self.model.params:
      d = {k: v for k, v in self.optim_config.iteritems()}
      self.optim_configs[p] = d


  def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
      dw = grads[p]
      config = self.optim_configs[p]
      next_w, next_config = self.update_rule(w, dw, config)  # the update rule is pluggable (see optim.py)
      self.model.params[p] = next_w
      self.optim_configs[p] = next_config


  def check_accuracy(self, X, y, num_samples=None, batch_size=100):
    """
    Check accuracy of the model on the provided data.

    Inputs:
    - X: Array of data, of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,)
    - num_samples: If not None, subsample the data and only test the model
      on num_samples datapoints.
    - batch_size: Split X and y into batches of this size to avoid using too
      much memory.

    Returns:
    - acc: Scalar giving the fraction of instances that were correctly
      classified by the model.
    """

    # Maybe subsample the data
    N = X.shape[0]
    if num_samples is not None and N > num_samples:
      mask = np.random.choice(N, num_samples)
      N = num_samples
      X = X[mask]
      y = y[mask]

    # Compute predictions in batches
    num_batches = N / batch_size
    if N % batch_size != 0:
      num_batches += 1
    y_pred = []
    for i in xrange(num_batches):
      start = i * batch_size
      end = (i + 1) * batch_size
      scores = self.model.loss(X[start:end])
      y_pred.append(np.argmax(scores, axis=1))
    y_pred = np.hstack(y_pred)
    acc = np.mean(y_pred == y)

    return acc


  def train(self):
    """
    Run optimization to train the model.
    """
    num_train = self.X_train.shape[0]
    iterations_per_epoch = max(num_train / self.batch_size, 1)
    num_iterations = self.num_epochs * iterations_per_epoch

    for t in xrange(num_iterations):
      self._step()

      # Maybe print training loss
      if self.verbose and t % self.print_every == 0:
        print '(Iteration %d / %d) loss: %f' % (
              t + 1, num_iterations, self.loss_history[-1])

      # At the end of every epoch, increment the epoch counter and decay the
      # learning rate.
      epoch_end = (t + 1) % iterations_per_epoch == 0
      if epoch_end:
        self.epoch += 1
        for k in self.optim_configs:
          self.optim_configs[k]['learning_rate'] *= self.lr_decay

      # Check train and val accuracy on the first iteration, the last
      # iteration, and at the end of each epoch.
      first_it = (t == 0)
      last_it = (t == num_iterations - 1)
      if first_it or last_it or epoch_end:
        train_acc = self.check_accuracy(self.X_train, self.y_train,
                                        num_samples=1000)
        val_acc = self.check_accuracy(self.X_val, self.y_val)
        self.train_acc_history.append(train_acc)
        self.val_acc_history.append(val_acc)

        if self.verbose:
          print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                self.epoch, self.num_epochs, train_acc, val_acc)

        # Keep track of the best model
        if val_acc > self.best_val_acc:
          self.best_val_acc = val_acc
          self.best_params = {}
          for k, v in self.model.params.iteritems():
            self.best_params[k] = v.copy()

    # At the end of training swap the best params into the model
    self.model.params = self.best_params
At this point we have essentially built a framework for a fully connected deep learning network. Let's review what was done:
1. Implement the forward and backward passes for the fully connected (affine) layer and the ReLU layer.
2. Write the sandwich functions, which simply compose the layers above.
3. Write a FullyConnectedNet class: pass in the network's hyperparameters and get back a corresponding model.
4. Write a Solver class: pass in the model and the data, and it runs the final optimization (an end-to-end sketch follows this list).
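As a concrete picture of how steps 3 and 4 fit together, here is a minimal sketch. It assumes data is the CIFAR-10 dictionary (keys 'X_train', 'y_train', 'X_val', 'y_val') produced by the assignment's data-loading helper, and the hyperparameters are only illustrative.

model = FullyConnectedNet([100, 100], input_dim=3*32*32, num_classes=10,
                          weight_scale=5e-2, reg=1e-2)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-3},
                lr_decay=0.95,
                num_epochs=10, batch_size=100,
                print_every=100)
solver.train()
# after training, model.params holds the parameters that scored best on the validation set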
A few things to watch out for:
1. The forward pass needs to save some intermediate values, so each layer returns both cache and out.
2. The multi-layer code requires careful bookkeeping of each layer's indices: for layer i, its input is out[i], its output is out[i+1], and its cached values live in cache[i].
3. The plain SGD update rule is still quite naive; other update rules are worth trying later (a momentum-style rule is sketched after this list).
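On point 3, any rule that follows the (w, dw, config) -> (next_w, next_config) convention can be plugged into the Solver through update_rule. As a sketch, an SGD-with-momentum rule in that convention could look like the following; the assignment's optim.py asks for essentially this under the name sgd_momentum.

import numpy as np

def sgd_momentum(w, dw, config=None):
  """
  SGD with momentum, following the Solver's update-rule convention:
  take (w, dw, config) and return (next_w, next_config).
  """
  if config is None:
    config = {}
  config.setdefault('learning_rate', 1e-2)
  config.setdefault('momentum', 0.9)
  v = config.get('velocity', np.zeros_like(w))

  # accumulate velocity in the negative gradient direction, then take a step
  v = config['momentum'] * v - config['learning_rate'] * dw
  next_w = w + v

  config['velocity'] = v
  return next_w, config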
So the code is written; now what?
There are a few very useful tricks worth remembering.
When you have built a neural network and are ready to run it on your dataset, do not start with the full original dataset. The best approach is to first overfit a small subset to prove that the network can actually learn; at that stage you can tune hyperparameters aggressively. My personal suggestion is a smallish learning rate, more iterations, and a weight scale chosen according to the situation. A minimal version of this check is sketched below.
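This sketch uses the same assumptions about the data dictionary as above; the subset size and hyperparameters are only illustrative.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

model = FullyConnectedNet([100, 100], weight_scale=1e-2, reg=0.0)
solver = Solver(model, small_data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-2},
                num_epochs=20, batch_size=25,
                print_every=10)
solver.train()
# training accuracy should reach roughly 100% within a few epochs;
# if it does not, the network or the learning rate needs another look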
Summary: the second assignment covers a lot of ground; that's all for now, to be continued.