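Preface:
This exercise practices training a network with two hidden layers. Each layer is trained with the sparse autoencoder approach, and the two hidden layers together extract features from the input data. The task is handwritten digit recognition on MNIST; the content and steps follow the web tutorial Exercise: Implement deep networks for digit classification. Once features have been extracted from the digit images, softmax is used to classify them. For an introduction to MNIST, see the web page MNIST Dataset. The theory is also covered in an earlier post: Deep learning: 16 (deep networks).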
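Experiment basics:
A deep network is trained roughly as follows:
1. Use the raw input data as input and train the parameters of the first hidden layer (with the sparse autoencoder method), then use the trained parameters to compute the output of the first hidden layer.
2. Use the output of step 1 as the input of the second network and train the parameters of the second hidden layer in the same way.
3. Use the output of step 2 as the input of the softmax multi-class classifier, then train the softmax parameters using the labels of the original data.
4. Compute the loss function of the whole network (two hidden layers plus the softmax classifier) together with its partial derivative with respect to every parameter.
5. Use the parameters from steps 1, 2 and 3 to initialize the whole deep network (two hidden layers and one softmax output layer), then run L-BFGS to iterate toward parameters near a minimum of the loss function above, and take them as the final parameters of the whole network.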
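The procedure above assumes a softmax classifier, whose loss function and derivatives have explicit formulas. During fine-tuning the whole network can therefore be treated as one unit: we compute the loss of the whole network and its partial derivatives, and once labeled data is available we start from the pretrained parameters and let an optimization algorithm find the parameters of the whole network. But what if the final classifier is not softmax but something else, such as an SVM or a random forest? The feature extraction layers are already pretrained, and those parameters can still initialize the front of the network, but how should fine-tuning be done then? The labels are only used by the final classifier, so there is no way to compute a loss for the whole system. Would we have to make the final output of the first n layers reproduce the input of the first layer again (that is, a multi-layer sparse autoencoder)? I have not worked this out yet and hope to understand it later.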
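A few points to note when training a deep network (assuming two hidden layers):
- When pretraining with sparse autoencoders, the output of each hidden layer has to be computed in turn. If a softmax classifier follows, the output of the last hidden layer is likewise needed as the softmax input in order to train the softmax parameters.
- As the previous point implies, the classifier parameters must also be pretrained before fine-tuning. During fine-tuning, all layers are treated as one single network, so every iteration updates the parameters of all layers.
In practice, training the first hidden layer takes the longest, presumably because the weight matrix to learn is 200*784 (not counting the bias b). The second hidden layer trains faster, mainly because it only has to learn a 200*200 weight matrix, which is far fewer parameters. Training softmax is faster still, because it has even fewer parameters and its loss and derivative formulas are simpler than those of the two preceding layers. Finally, fine-tuning the whole network takes about as long as training the second hidden layer.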
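To make the size comparison concrete, here is a small sketch that counts the weights mentioned above. It assumes the layer sizes used later in stackedAEExercise.m; the variable names nW1, nW2 and nSoftmax are only illustrative. Note that during pretraining each sparse autoencoder also learns a decoder matrix of the same size, which this count leaves out, just as the text does.

```matlab
% Layer sizes used in this experiment.
inputSize    = 28 * 28;   % 784 input pixels
hiddenSizeL1 = 200;
hiddenSizeL2 = 200;
numClasses   = 10;

% Encoder weights only (biases and decoder weights are not counted).
nW1      = hiddenSizeL1 * inputSize;      % 200 * 784
nW2      = hiddenSizeL2 * hiddenSizeL1;   % 200 * 200
nSoftmax = numClasses * hiddenSizeL2;     % 10 * 200

fprintf('layer 1: %d, layer 2: %d, softmax: %d\n', nW1, nW2, nSoftmax);
```

The first layer has roughly four times as many weights as the second, and the softmax layer is far smaller than either, which matches the ordering of the observed training times.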
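Some functions in the program:
[params, netconfig] = stack2params(stack)
Converts the network parameters stored layer by layer in stack (possibly several weight matrices and bias vectors) into a single vector params, which makes it convenient to apply general-purpose optimization algorithms. netconfig stores information about the network: netconfig.inputsize is the number of input layer nodes, and the elements of netconfig.layersizes give the number of nodes in each hidden layer.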
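The following is a minimal sketch of what such a flattening step might look like. It is not the tutorial's file itself; it assumes that each layer contributes w(:) followed by b(:), and the toy sizes (4 inputs, then 3 and 2 hidden units) are made up for illustration.

```matlab
% Toy stack: 4 inputs -> 3 hidden units -> 2 hidden units.
stack = cell(2, 1);
stack{1}.w = reshape(1:12, 3, 4);   stack{1}.b = [13; 14; 15];
stack{2}.w = reshape(16:21, 2, 3);  stack{2}.b = [22; 23];

% Flatten: for each layer append w(:) and then b(:) (assumed ordering).
params = [];
for d = 1:numel(stack)
    params = [params; stack{d}.w(:); stack{d}.b(:)];
end

% Record what is needed to rebuild the stack later.
netconfig = struct;
netconfig.inputsize  = size(stack{1}.w, 2);
netconfig.layersizes = cell(numel(stack), 1);
for d = 1:numel(stack)
    netconfig.layersizes{d} = size(stack{d}.w, 1);
end
```

With this toy stack, params is simply the column vector 1 through 23, so it is easy to see which entries belong to which layer.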
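[ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, numClasses, netconfig,lambda, data, labels)
This function computes the loss of the whole network and the partial derivative of the loss with respect to every parameter. The loss is a single real number. It is computed from the softmax classifier and only needs the labels and the values of the softmax output layer. The partial derivatives, on the other hand, are many: there is one for each parameter, covering not only the hidden layers but also the softmax layer. The softmax part of the gradient follows directly from its formula, while the part for the deep network layers is obtained by backpropagation (first compute the error term of each layer, then use it to compute the gradients with respect to w and b).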
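A numerical gradient check is a handy way to test this kind of code. The sketch below is my own toy setup, not part of the exercise: one sigmoid hidden layer followed by softmax, no weight decay, and a loss averaged over the number of examples m. All names (costFn, getW and so on) are hypothetical helpers. It compares the backpropagation gradient with central differences.

```matlab
% Toy network: 3 inputs -> 4 sigmoid units -> 2-class softmax, 5 examples.
n = 3; h = 4; k = 2; m = 5;
X = [0.1 0.9 0.4 0.7 0.2;
     0.8 0.3 0.5 0.1 0.6;
     0.2 0.7 0.9 0.4 0.3];
labels = [1 2 1 2 2];
G = full(sparse(labels, 1:m, 1, k, m));    % one-hot ground truth, k-by-m
theta = 0.1 * sin((1:(h*n + h + k*h))');   % deterministic stand-in for random init

% Helpers; assumed layout of theta is [W(:); b; softmaxTheta(:)].
sigmoid    = @(z) 1 ./ (1 + exp(-z));
getW       = @(t) reshape(t(1:h*n), h, n);
getB       = @(t) t(h*n+1 : h*n+h);
getT       = @(t) reshape(t(h*n+h+1 : end), k, h);
hidden     = @(t) sigmoid(bsxfun(@plus, getW(t) * X, getB(t)));
colSoftmax = @(S) bsxfun(@rdivide, exp(S), sum(exp(S), 1));
probs      = @(t) colSoftmax(getT(t) * hidden(t));
costFn     = @(t) -sum(sum(G .* log(probs(t)))) / m;

% Analytic gradient by backpropagation.
A = hidden(theta);  P = probs(theta);  T = getT(theta);
gradT = -(G - P) * A' / m;                 % softmax layer
delta = -(T' * (G - P)) .* A .* (1 - A);   % error term of the hidden layer
gradW = delta * X' / m;
gradB = sum(delta, 2) / m;
grad  = [gradW(:); gradB(:); gradT(:)];

% Numerical gradient by central differences.
epsilon = 1e-4;
numGrad = zeros(size(theta));
for i = 1:numel(theta)
    e = zeros(size(theta));  e(i) = epsilon;
    numGrad(i) = (costFn(theta + e) - costFn(theta - e)) / (2 * epsilon);
end
relDiff = norm(numGrad - grad) / norm(numGrad + grad);
fprintf('relative difference: %g\n', relDiff);
```

If the two gradients agree, the relative difference should be tiny. A large value usually points to a sign error or a mismatched scale factor between cost and gradient.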
stack = params2stack(params, netconfig)
The inverse of the function above: it unrolls a parameter vector back into the layer-by-layer structure of the deep network.
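A rough sketch of the unrolling step could look like this. It assumes the parameter vector stores w(:) followed by b(:) for each layer, and that netconfig.layersizes is a cell array; both are assumptions for illustration, and the toy sizes are made up.

```matlab
% Hypothetical configuration: 4 inputs -> 3 hidden units -> 2 hidden units.
netconfig = struct;
netconfig.inputsize  = 4;
netconfig.layersizes = {3; 2};
params = (1:23)';              % 3*4 + 3 + 2*3 + 2 = 23 values

depth = numel(netconfig.layersizes);
stack = cell(depth, 1);
prevSize = netconfig.inputsize;
pos = 1;                       % index of the next unread entry in params
for d = 1:depth
    curSize = netconfig.layersizes{d};
    nW = curSize * prevSize;
    stack{d}.w = reshape(params(pos : pos + nW - 1), curSize, prevSize);
    pos = pos + nW;
    stack{d}.b = params(pos : pos + curSize - 1);
    pos = pos + curSize;
    prevSize = curSize;        % this layer's output feeds the next layer
end
```

After the loop, pos points one past the end of params, which is a cheap sanity check that the vector length matches the network configuration.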
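[pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
This function predicts the output class for the input data. theta holds the parameters of the whole network (including the classifier part), numClasses is the number of classes, and netconfig describes the network structure.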
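The prediction step is just a forward pass followed by an argmax. Here is a small sketch with hand-picked toy weights (the numbers are made up so the result is easy to verify by eye):

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Toy network: 2 inputs -> 2 hidden units -> 2 classes.
stack = cell(1, 1);
stack{1}.w = [5 0; 0 5];
stack{1}.b = [-2.5; -2.5];
softmaxTheta = [1 0; 0 1];     % numClasses-by-hiddenSize
data = [1 0 1;
        0 1 0];                % one example per column

% Forward pass through every hidden layer.
a = data;
for d = 1:numel(stack)
    a = sigmoid(bsxfun(@plus, stack{d}.w * a, stack{d}.b));
end

% Softmax is monotonic in its scores, so the largest score is also the
% largest probability; no need to normalize before taking the argmax.
[~, pred] = max(softmaxTheta * a, [], 1);
disp(pred);
```

Each column of data activates the hidden unit that matches its nonzero input, so the predicted classes follow the input pattern.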
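[h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
This function displays the matrix A. Each column of A must be one weight vector (one filter), and the length of each column must be a perfect square. After the call, every column of A is shown as a small patch image; how many patches there are and how they are arranged is decided inside the function.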
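A quick way to see which feature lengths satisfy the perfect-square requirement is to test them directly. This sketch only checks the dimensions that come up in this post; it does not call display_network.

```matlab
dims = [784 200 196];   % input pixels, hidden units, truncated hidden units
for L = dims
    side = round(sqrt(L));
    if side * side == L
        fprintf('%d -> %dx%d patch\n', L, side, side);
    else
        fprintf('%d is not a perfect square\n', L);
    end
end
isSquare = (round(sqrt(dims)) .^ 2 == dims);
```

So first-layer weights (784 values per hidden unit) can be shown as 28x28 patches, while second-layer weights (200 values per hidden unit) cannot be shown this way without some reshaping or truncation.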
Built-in MATLAB functions:
struct:
s = struct; creates a struct array s (a 1-by-1 struct with no fields).
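For example, the optimizer options in this exercise are built this way. A minimal sketch:

```matlab
options = struct;             % 1-by-1 struct with no fields yet
options.Method  = 'lbfgs';    % fields are created on first assignment
options.maxIter = 400;
disp(fieldnames(options));
```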
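nargout:
Returns the number of output arguments of a function.
save:
For example, save('saves/step2.mat', 'sae1OptTheta'); requires that a directory named saves exists under the current directory; otherwise the call fails.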
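One way to avoid that failure is to create the directory first if it is missing. A small sketch, using a hypothetical directory under tempdir and a placeholder vector instead of real trained parameters:

```matlab
outDir = fullfile(tempdir, 'saves_demo');   % hypothetical directory for this demo
if ~exist(outDir, 'dir')
    mkdir(outDir);                          % create it only when missing
end

sae1OptTheta = zeros(5, 1);                 % placeholder for trained parameters
outFile = fullfile(outDir, 'step2.mat');
save(outFile, 'sae1OptTheta');
```

In the exercise script the same guard would go before the first save call, with 'saves' as the directory name.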
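Experiment results:
[Figure: features learned by the first hidden layer, displayed with display_network]
I do not know how to display the features of the second hidden layer. Each of its nodes corresponds to a 200-dimensional weight vector, and display_network cannot show that: it only handles features whose dimension is a perfect square. So it is unclear whether 200 should be arranged as 20*10 or as 8*25. I am curious how the many deep learning papers display their second-layer networks, and which factorization of 200 would be representative. To be determined. For now nothing is shown here, because taking only the first 196 of the 200 dimensions and passing them to display_network gives an image in which nothing can be recognized.
Recognition accuracy without fine-tuning the network parameters:
Before Finetuning Test Accuracy: 92.190%
Recognition accuracy after fine-tuning the network parameters:
After Finetuning Test Accuracy: 97.670%
Main code of the experiment, with comments:
stackedAEExercise.m: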
%% CS294A/CS294W Stacked Autoencoder Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file.
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.

DISPLAY = true;
inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                       %  in the lecture notes).
lambda = 3e-3;         % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Load data from the MNIST database
%
%  This loads our training data from the MNIST database files.

% Load MNIST database files
trainData = loadMNISTImages('train-images.idx3-ubyte');
trainLabels = loadMNISTLabels('train-labels.idx1-ubyte');

trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1

%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the unlabelled MNIST training
%  images.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

%  Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                a hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta
addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[sae1OptTheta, cost] = minFunc(@(p)sparseAutoencoderCost(p,...
    inputSize,hiddenSizeL1,lambda,sparsityParam,beta,trainData),sae1Theta,options); % train the first-layer parameters
save('saves/step2.mat', 'sae1OptTheta');

if DISPLAY
    W1 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
    display_network(W1');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 2: Train the second sparse autoencoder
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                        inputSize, trainData);

%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                a hidden size of "hiddenSizeL2" and an inputsize of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta
[sae2OptTheta, cost] = minFunc(@(p)sparseAutoencoderCost(p,...
    hiddenSizeL1,hiddenSizeL2,lambda,sparsityParam,beta,sae1Features),sae2Theta,options); % train the second-layer parameters
save('saves/step3.mat', 'sae2OptTheta');

figure;
if DISPLAY
    W11 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
    W12 = reshape(sae2OptTheta(1:hiddenSizeL2 * hiddenSizeL1), hiddenSizeL2, hiddenSizeL1);
    % TODO(zellyn): figure out how to display a 2-level network
    % display_network(log(W11' ./ (1-W11')) * W12');
    % W12_temp = W12(1:196,1:196);
    % display_network(W12_temp');
    % figure;
    % display_network(W12_temp');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 3: Train the softmax classifier
%  This trains the sparse autoencoder on the second autoencoder features.
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.

[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                        hiddenSizeL1, sae1Features);

%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);
softmaxLambda = 1e-4;
numClasses = 10;
softoptions = struct;
softoptions.maxIter = 400;
softmaxModel = softmaxTrain(hiddenSizeL2,numClasses,softmaxLambda,...
                            sae2Features,trainLabels,softoptions);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);
save('saves/step4.mat', 'saeSoftmaxOptTheta');
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 5: Finetune softmax model

% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.

% Initialize the stack using the parameters learned
stack = cell(2,1);
% sae1OptTheta and sae2OptTheta also contain the weights of the sparse
% autoencoder's reconstruction (decoder) layer; only the encoder part is used here
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);

% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
% stackedAETheta is a vector holding the parameters of the whole network,
% including the classifier part, which is placed first
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ];

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the
%                dimension of the input to the classifier, which corresponds
%                to "hiddenSizeL2".
%
%
[stackedAEOptTheta, cost] = minFunc(@(p)stackedAECost(p,inputSize,hiddenSizeL2,...
                         numClasses, netconfig,lambda, trainData, trainLabels),...
                        stackedAETheta,options); % fine-tune the parameters of the whole network
save('saves/step5.mat', 'stackedAEOptTheta');

figure;
if DISPLAY
    optStack = params2stack(stackedAEOptTheta(hiddenSizeL2*numClasses+1:end), netconfig);
    W11 = optStack{1}.w;
    W12 = optStack{2}.w;
    % TODO(zellyn): figure out how to display a 2-level network
    % display_network(log(1 ./ (1-W11')) * W12');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 6: Test
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%

% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('t10k-images.idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels.idx1-ubyte');

testLabels(testLabels == 0) = 10; % Remap 0 to 10

[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check
% your code for errors, and make sure you are training on the
% entire data set of 60000 28x28 training images
% (unless you modified the loading code, this should be the case)
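The index arithmetic used in STEP 5 to pull the encoder weights and biases out of sae1OptTheta is easier to follow on a toy example. The sketch below assumes the autoencoder parameter vector is laid out as [W1(:); W2(:); b1; b2], which is consistent with the indices used above; the toy sizes and the names encW and encB are made up for illustration.

```matlab
visibleSize = 4;  hiddenSize = 3;            % toy sizes

% Fill each block with a distinct constant so it is easy to recognize.
W1 = 1 * ones(hiddenSize, visibleSize);      % encoder weights
W2 = 2 * ones(visibleSize, hiddenSize);      % decoder (reconstruction) weights
b1 = 3 * ones(hiddenSize, 1);                % encoder bias
b2 = 4 * ones(visibleSize, 1);               % decoder bias
theta = [W1(:); W2(:); b1; b2];              % assumed layout

nW = hiddenSize * visibleSize;               % size of W1 (and of W2)
encW = reshape(theta(1:nW), hiddenSize, visibleSize);
encB = theta(2*nW + 1 : 2*nW + hiddenSize);  % skip W1 and W2, then read b1
```

Only the encoder half of each autoencoder goes into the stack; the decoder weights and bias are discarded once pretraining is done.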
stackedAECost.m:
function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)

% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.

% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example.
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example

%% Unroll softmaxTheta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0; % You need to compute this

% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));

%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%

% Forward pass through the hidden layers
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;

for layer = (1:depth)
    z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
    a{layer+1} = sigmoid(z{layer+1});
end

% Softmax probabilities (the column maximum is subtracted for numerical stability)
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));
p = bsxfun(@rdivide, exp(M), sum(exp(M)));

% NOTE: the data term is scaled by 1/numClasses here (M was overwritten above),
% whereas the tutorial's softmax cost averages over the number of examples.
% Cost and gradients below use the same factor, so they stay consistent.
cost = -1/numClasses * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numClasses * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;

% Backpropagate the error terms through the hidden layers
d = cell(depth+1, 1);

d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});

for layer = (depth:-1:2)
    d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end

for layer = (depth:-1:1)
    stackgrad{layer}.w = (1/numClasses) * d{layer+1} * a{layer}';
    stackgrad{layer}.b = (1/numClasses) * sum(d{layer+1}, 2);
end

% -------------------------------------------------------------------------

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
stackedAEPredict.m:
function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)

% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.

% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example.

% Your code should produce the prediction matrix
% pred, where pred(i) is argmax_c P(y(c) | x(i)).

%% Unroll theta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start
%                from 1.

depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;

for layer = (1:depth)
    z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
    a{layer+1} = sigmoid(z{layer+1});
end

[~, pred] = max(softmaxTheta * a{depth+1}); % pick the output with the largest probability

% -----------------------------------------------------------

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
References:
Exercise: Implement deep networks for digit classification
Deep learning: 16 (deep networks)