Deep learning: Part 37 (Optimization methods in deep learning)


 

  Contents:

  This post is mainly based on the paper On optimization methods for deep learning; it records notes on how three common optimization algorithms, SGD (stochastic gradient descent), L-BFGS (limited-memory BFGS), and CG (conjugate gradient), perform in deep learning systems. Below are my notes after reading the paper.

  Advantages of SGD: simple to implement, and very fast to optimize when there are plenty of training samples.

  Disadvantages of SGD: many parameters have to be tuned by hand, such as the learning rate and the convergence criterion. It is also an inherently sequential method, which makes it hard to exploit GPU parallelism or distributed processing.
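  To make the update concrete, here is a minimal sketch of plain SGD on a toy least-squares problem (entirely my own illustration, not code from the paper); note how the learning rate alpha is exactly the kind of hand-tuned parameter mentioned above, and how each step depends on the previous one, making the loop sequential:

% Minimal SGD sketch on a toy least-squares problem (illustrative only).
X = randn(10, 1000); w_true = randn(10, 1); y = w_true' * X;  % synthetic data
w = zeros(10, 1);            % parameters to learn
alpha = 0.01;                % learning rate: must be hand-tuned
batchSize = 50;
for t = 1:200
    idx  = randi(1000, 1, batchSize);          % sample a random minibatch
    Xb   = X(:, idx); yb = y(idx);
    grad = Xb * (w' * Xb - yb)' / batchSize;   % gradient of 0.5*mean((w'x - y).^2)
    w    = w - alpha * grad;                   % the sequential SGD update
end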

  What distinguishes the common deep learning methods (e.g. Autoencoder, RBM, DBN, ICA, Sparse coding) is the form of the objective function. This is really the most essential difference, and different objectives can call for different optimization methods. For example, the RBM objective involves the network's energy and is optimized with CD, whereas the Autoencoder objective is the MSE between the desired output and the actual output; since the partial derivatives of that objective can be computed directly, it can be optimized with L-BFGS, CG, and similar methods, and the same reasoning applies to the rest. So you cannot tell which deep learning method a network implements from its structure alone: given a standalone 2-layer 64-100 network, there is no way to know which method it belongs to, because that network could be trained either as an RBM or as an Autoencoder.
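  As a concrete instance, the autoencoder objective mentioned above, which is exactly what deepAutoencoder.m below computes over M training examples x^{(m)} with reconstructions \hat{x}^{(m)}, is

J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{2} \left\| \hat{x}^{(m)} - x^{(m)} \right\|^2

and its gradient is available in closed form via backpropagation, which is what makes L-BFGS and CG applicable.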

  The author's conclusion from the experiments is that different optimization algorithms have different strengths and suit different settings. For example, L-BFGS outperforms SGD (stochastic gradient descent) and CG (conjugate gradient) when the parameter dimension is relatively low (roughly below 10,000 dimensions), especially for models with convolution, while for high-dimensional parameter problems CG beats the other two. In other words, SGD generally comes out worst, and the same picture holds with GPU acceleration: L-BFGS and CG speed up markedly on a GPU, while SGD gains little. On a single core, L-BFGS's advantage comes mainly from exploiting a second-order approximation of the objective to accelerate optimization, while CG benefits from the conjugacy information between parameters without having to form the Hessian matrix explicitly.

  That said, when SGD is run with a large minibatch and a line search, its optimization performance also improves.
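  A sketch of what such a variant could look like, using a simple backtracking (Armijo) line search on each minibatch; this is my own illustration of the idea, not the exact procedure from the paper, and lossAndGrad is an assumed function handle returning the minibatch loss and gradient (e.g. deepAutoencoder with the data argument bound):

function theta = sgdLineSearch(lossAndGrad, theta, maxIter)
% SGD where the step size is chosen by a backtracking line search
% (illustrative sketch; lossAndGrad is an assumed function handle).
alpha0 = 1; beta = 0.5; c = 1e-4;   % standard Armijo line-search constants
for t = 1:maxIter
    [f, g] = lossAndGrad(theta);
    alpha = alpha0;
    % shrink the step until the sufficient-decrease condition holds
    while lossAndGrad(theta - alpha * g) > f - c * alpha * (g' * g)
        alpha = beta * alpha;
    end
    theta = theta - alpha * g;      % take the accepted step
end
end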

  Comparing the optimization performance of SGD, L-BFGS, and CG on a single core, with a standard Autoencoder model as the target, gives the results below:

  [Figure: optimization comparison of SGD, L-BFGS, and CG on a standard autoencoder; image not preserved]

  As the figure shows, SGD performs worst.

  Under the same setup, the comparison when training a Sparse autoencoder model instead is shown below:

  [Figure: optimization comparison of SGD, L-BFGS, and CG on a sparse autoencoder; image not preserved]

  Here SGD does even worse. The main reason is that L-BFGS and CG can use a large minibatch of data to estimate each hidden unit's expected activation, the quantity used to enforce that unit's sparsity constraint, whereas SGD has to estimate it from small, noisy batches.
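  To make that concrete, here is a hedged sketch of the standard way such a sparsity term is written (this is the usual KL-divergence penalty of sparse autoencoders, assuming sigmoid hidden activations in [0,1]; it is my illustration, not the paper's exact code). The per-unit mean activation is estimated over the minibatch, so a large minibatch gives a reliable estimate while a small one gives a noisy one:

% Sketch: sparsity penalty from a minibatch estimate of expected activation.
% h is a hidden-activation matrix (units x examples) from forward prop;
% rho and lambda are assumed hyperparameters, not values from the paper.
rho    = 0.05;                 % target mean activation per hidden unit
lambda = 3;                    % weight of the sparsity penalty
rhoHat = mean(h, 2);           % expected activation, estimated on the minibatch
% KL divergence between Bernoulli(rho) and Bernoulli(rhoHat), summed over units
klPenalty = sum(rho*log(rho./rhoHat) + (1-rho)*log((1-rho)./(1-rhoHat)));
cost = cost + lambda * klPenalty;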

  The author also ran quite a few experiments on GPUs and with convolution, of course.

  Finally, the author trained a Sparse autoencoder network with 2 hidden layers (not counting the pooling layers) and applied it to MNIST; the recognition results are shown below:

  [Figure/table: MNIST recognition accuracy of the 2-hidden-layer sparse autoencoder; image not preserved]

  The author's website provides some code; see deep autoencoder with L-BFGS. From the title I expected the code to implement pre-training and fine-tuning of a deep convolutional autoencoder, since the paper uses convolution, but after reading the code I found that it implements just an ordinary two-layer autoencoder. So it seems I am back to the answer to the second question of my earlier post: Deep learning: Part 36 (Some confusion about building a deep convolutional SAE network).

 

  Below are some annotations on the main parts of the author's code:

optimizeAutoencoderLBFGS.m (runs the parameter optimization for the deep autoencoder network):

function [] = optimizeAutoencoderLBFGS(layersizes, datasetpath, ...
                                       finalObjective)
% train a deep autoencoder with variable hidden sizes
% layersizes : the sizes of the hidden layers. For instance, specifying
%     layersizes = [200 100] will create a network that looks like input -> 200
%     -> 100 -> 200 -> output (same size as input). Notice the mirroring
%     structure of the autoencoder. Default layersizes = [2*3072 100]
% datasetpath: the path to the CIFAR dataset (where we find the *.mat
%     files). see loadData.m
% finalObjective: the final objective that you use to compare to
%                 terminate your optimization. To qualify, the objective
%                 function on the entire training set must be below this
%                 value.
%
% Author: Quoc V. Le (quocle@stanford.edu)
% 
%% Handle default parameters
if nargin < 3 || isempty(finalObjective)
    finalObjective = 70; % i am just making this up, the evaluation objective 
                         % will be much lower
end
if nargin < 2 || isempty(datasetpath)
  datasetpath = '.';
end
if nargin < 1 || isempty(layersizes)
  layersizes = [2*3072 100];
  layersizes = [200 100]; % note: this line immediately overrides the default above
end

%% Load data
loadData % traindata is 3072x10000; each column is one training vector

%% Random initialization
initializeWeights;% reading the corresponding initialization code, I see no trace of convolution or pooling at all, so how do they come into the picture?

%% Optimization: minibatch L-BFGS
% Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. 
% On optimization methods for deep learning. ICML, 2011

addpath minFunc/
options.Method = 'lbfgs'; 
options.maxIter = 20;      
options.display = 'on';
options.TolX = 1e-3;

perm = randperm(size(traindata,2));
traindata = traindata(:,perm);% randomly permute the training samples
batchSize = 1000;% with 10000 samples in total, this gives 10 minibatches
maxIter = 20;
for i=1:maxIter    
    startIndex = mod((i-1) * batchSize, size(traindata,2)) + 1;
    fprintf('startIndex = %d, endIndex = %d\n', startIndex, startIndex + batchSize-1);
    data = traindata(:, startIndex:startIndex + batchSize-1); 
    [theta, obj] = minFunc( @deepAutoencoder, theta, options, layersizes, ...
                            data);
    if obj <= finalObjective % use the minibatch obj as a heuristic for stopping
                             % because checking the entire dataset is very
                             % expensive
        % yes, we should check the objective for the entire training set        
        trainError = deepAutoencoder(theta, layersizes, traindata);
        if trainError <= finalObjective
            % now your submission is qualified
            break
        end
    end
end

%% write to text files so that we can test your program
writeToTextFiles;
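
For reference, a call like the following would start training with the defaults described in the header comment, assuming the CIFAR *.mat files are in the current directory (this usage example is mine, inferred from the function signature):

% train the default 200-100 autoencoder on data found in '.' and stop once
% the objective on the full training set drops below 70
optimizeAutoencoderLBFGS([200 100], '.', 70);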

 

deepAutoencoder.m (computes the deep network's cost function and its gradient):

function [cost,grad] = deepAutoencoder(theta, layersizes, data)
% cost and gradient of a deep autoencoder 
% layersizes is a vector of sizes of hidden layers, e.g., 
% layersizes[2] is the size of layer 2
% this does not count the visible layer
% data is the input data, each column is an example
% the activation function of the last layer is linear, the activation
% function of intermediate layers is the hyperbolic tangent function

% WARNING: the code is optimized for ease of implementation and
% understanding, not for speed or space

%% FORCING THETA TO BE IN MATRIX FORMAT FOR EASE OF UNDERSTANDING
% Note that this is not optimized for space, one can just retrieve W and b
% on the fly during forward prop and backprop. But i do it here so that the
% readers can understand what's going on
layersizes = [size(data,1) layersizes];
l = length(layersizes);
lnew = 0;
for i=1:l-1
    lold = lnew + 1;
    lnew = lnew + layersizes(i) * layersizes(i+1);
    W{i} = reshape(theta(lold:lnew), layersizes(i+1), layersizes(i));
    lold = lnew + 1;
    lnew = lnew + layersizes(i+1);
    b{i} = theta(lold:lnew);
end
% handle tied-weight stuff
j = 1;
for i=l:2*(l-1)
    lold = lnew + 1;
    lnew = lnew + layersizes(l-j);
    W{i} = W{l - j}'; % tied weights: just the transpose of the corresponding encoder weights
    b{i} = theta(lold:lnew);
    j = j + 1;
end
assert(lnew == length(theta), 'Error: dimensions of theta and layersizes do not match\n')


%% FORWARD PROP
for i=1:2*(l-1)-1
    if i==1
        [h{i} dh{i}] = tanhAct(bsxfun(@plus, W{i}*data, b{i}));
    else
        [h{i} dh{i}] = tanhAct(bsxfun(@plus, W{i}*h{i-1}, b{i}));
    end
end
h{i+1} = linearAct(bsxfun(@plus, W{i+1}*h{i}, b{i+1}));

%% COMPUTE COST
diff = h{i+1} - data; 
M = size(data,2); 
cost = 1/M * 0.5 * sum(diff(:).^2);% the plain, standard autoencoder cost, with no sparsity or other penalty

%% BACKPROP
if nargout > 1
    outderv = 1/M * diff;    
    for i=2*(l-1):-1:2
        Wgrad{i} = outderv * h{i-1}';
        bgrad{i} = sum(outderv,2);        
        outderv = (W{i}' * outderv) .* dh{i-1};        
    end
    Wgrad{1} = outderv * data';
    bgrad{1} = sum(outderv,2);
        
    % handle tied-weight stuff        
    j = 1;
    for i=l:2*(l-1)
        Wgrad{l-j} = Wgrad{l-j} + Wgrad{i}';
        j = j + 1;
    end
    % dump the results to the grad vector
    grad = zeros(size(theta));
    lnew = 0;
    for i=1:l-1
        lold = lnew + 1;
        lnew = lnew + layersizes(i) * layersizes(i+1);
        grad(lold:lnew) = Wgrad{i}(:);
        lold = lnew + 1;
        lnew = lnew + layersizes(i+1);
        grad(lold:lnew) = bgrad{i}(:);
    end
    j = 1;
    for i=l:2*(l-1)
        lold = lnew + 1;
        lnew = lnew + layersizes(l-j);
        grad(lold:lnew) = bgrad{i}(:);
        j = j + 1;
    end
end 
end

%% USEFUL ACTIVATION FUNCTIONS
function [a da] = sigmoidAct(x)

a = 1 ./ (1 + exp(-x));
if nargout > 1
    da = a .* (1-a);
end
end

function [a da] = tanhAct(x)
a = tanh(x);
if nargout > 1
    da = (1-a) .* (1+a);
end
end

function [a da] = linearAct(x)
a = x;
if nargout > 1
    da = ones(size(a));
end
end
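
Since deepAutoencoder returns both the cost and the gradient, the backprop is easy to sanity-check against a central-difference numerical gradient. A minimal sketch (my addition, not part of the author's code):

% Numerical gradient check for deepAutoencoder (illustrative sketch).
layersizes = [5 3];                  % a tiny network keeps the check fast
data = randn(8, 10);                 % 8-dimensional inputs, 10 examples
% total parameter count: encoder W+b per layer, then decoder biases only,
% mirroring the unpacking logic at the top of deepAutoencoder.m
n = 8*5 + 5 + 5*3 + 3 + 5 + 8;
theta = 0.1 * randn(n, 1);
[~, grad] = deepAutoencoder(theta, layersizes, data);
epsilon = 1e-5; numgrad = zeros(size(theta));
for k = 1:length(theta)
    e = zeros(size(theta)); e(k) = epsilon;
    numgrad(k) = (deepAutoencoder(theta + e, layersizes, data) - ...
                  deepAutoencoder(theta - e, layersizes, data)) / (2*epsilon);
end
fprintf('max abs difference = %g\n', max(abs(numgrad - grad)));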

 

initializeWeights.m (parameter initialization; random, but with a specific scaling):

%% Random initialization
% X. Glorot, Y. Bengio. 
% Understanding the difficulty of training deep feedforward neural networks.
% AISTATS 2010.
% QVL: this initialization method appears to perform better than 
% theta = randn(d,1);
s0 = size(traindata,1);% s0 is the dimensionality of the training samples
layersizes = [s0 layersizes];% input layer - hidden1 - hidden2; here 3072-6144-100
l = length(layersizes);% number of layers, not counting the decoder; with 2 hidden layers, l = 3
lnew = 0;
for i=1:l-1% i runs over the encoder layers, i.e. 1 to l-1 (1 to 2 here)
    lold = lnew + 1;
    lnew = lnew + layersizes(i) * layersizes(i+1);
    r  = sqrt(6) / sqrt(layersizes(i+1)+layersizes(i));   
    A = rand(layersizes(i+1), layersizes(i))*2*r - r; %reshape(theta(lold:lnew), layersizes(i+1), layersizes(i));
    theta(lold:lnew) = A(:); % this fills in the weight matrix W
    lold = lnew + 1;
    lnew = lnew + layersizes(i+1);
    A = zeros(layersizes(i+1),1);
    theta(lold:lnew) = A(:);% this fills in the bias vector b
end % the loop above initializes the encoder part
j = 1;
for i=l:2*(l-1) % i runs from l to 2(l-1) (3 to 4 here); the decoder part starts now
    lold = lnew + 1;
    lnew = lnew + layersizes(l-j);
    theta(lold:lnew)= zeros(layersizes(l-j),1);
    j = j + 1;
end
theta = theta';
layersizes = layersizes(2:end); % drop the input layer
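
The scaling used above is the "normalized initialization" of Glorot & Bengio: the weights between two layers of sizes n_i and n_{i+1} are drawn uniformly from

W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\ \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}} \right]

which is exactly the r = sqrt(6)/sqrt(layersizes(i+1)+layersizes(i)) in the code; the biases all start at zero.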

 

 

  References:

  Le, Q. V., et al. (2011). On optimization methods for deep learning. Proc. of ICML.

  deep autoencoder with L-BFGS

  Deep learning: Part 36 (Some confusion about building a deep convolutional SAE network)

 

 

 

 

