Contents:
This post is mainly based on the paper "On optimization methods for deep learning". It records my notes on how three common optimization algorithms, SGD (stochastic gradient descent), L-BFGS (limited-memory BFGS) and CG (conjugate gradient), perform in a deep learning setting. Below are some notes taken after reading the paper.
Advantages of SGD: simple to implement, and very fast at optimizing when there are enough training samples.
Disadvantages of SGD: many hyperparameters have to be tuned by hand, such as the learning rate and the convergence criterion. In addition, it is an inherently sequential method, which makes it hard to exploit GPU parallelism or distributed processing. (A small sketch of such a loop follows below.)
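To make the tuning burden concrete, here is a minimal sketch (mine, not from the paper) of a plain minibatch SGD loop; costGrad is a hypothetical [cost, grad] function, traindata and theta are assumed to be in the workspace as in the author's scripts, and the learning rate, decay, batch size and stopping rule are all placeholders that would have to be chosen by hand:

% Minimal minibatch SGD sketch; costGrad(theta, data) is a hypothetical function
% returning [cost, grad]. Every constant below has to be tuned by hand.
eta0      = 0.1;                                  % initial learning rate
batchSize = 100;
tol       = 1e-4;
for epoch = 1:50
    perm = randperm(size(traindata, 2));          % shuffle the examples each epoch
    for k = 1:floor(size(traindata, 2) / batchSize)
        idx = perm((k-1)*batchSize + 1 : k*batchSize);
        [~, grad] = costGrad(theta, traindata(:, idx));
        eta   = eta0 / (1 + 0.01*epoch);          % simple hand-picked decay schedule
        theta = theta - eta * grad;               % plain gradient step
    end
    if norm(grad) < tol, break; end               % crude convergence test
end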
The common deep learning methods (e.g. autoencoders, RBMs, DBNs, ICA, sparse coding) differ in the form of their objective functions. This is really the most essential difference: because the objectives differ, the methods used to optimize them may differ as well. For example, the RBM objective is related to the network's energy and is optimized with CD (contrastive divergence), whereas the autoencoder objective is the MSE between the desired output and the actual output; since the gradient of that objective can be computed directly, methods such as L-BFGS and CG can be used, and similarly for the other models (see the sketch after this paragraph). So you cannot tell from the network structure alone which deep learning method you are looking at: given only a 2-layer 64-100 network, there is no way to say which method it belongs to, because that network could be trained either as an RBM or as an autoencoder.
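As a concrete illustration of the MSE case, below is a minimal sketch (my own, not the paper's code) of a tied-weight single-hidden-layer autoencoder objective that returns both the cost and its analytic gradient; the names mseAutoencoderCost, visSize and hidSize are made up for this example. Because [cost, grad] is available in closed form, the same handle can be passed to a batch optimizer such as minFunc with options.Method set to 'lbfgs' or 'cg'.

function [cost, grad] = mseAutoencoderCost(theta, visSize, hidSize, data)
% Sketch: MSE autoencoder objective with closed-form gradient, usable by L-BFGS/CG.
% theta packs the tied weight matrix W (hidSize x visSize), then b1, then b2.
W  = reshape(theta(1:hidSize*visSize), hidSize, visSize);
b1 = theta(hidSize*visSize+1 : hidSize*visSize+hidSize);
b2 = theta(hidSize*visSize+hidSize+1 : end);
M  = size(data, 2);
h    = tanh(bsxfun(@plus, W*data, b1));        % encoder
xhat = bsxfun(@plus, W'*h, b2);                % linear decoder (tied weights)
diff = xhat - data;
cost = 0.5/M * sum(diff(:).^2);                % mean squared reconstruction error
delta2 = diff / M;                             % backprop through the linear decoder
delta1 = (W*delta2) .* (1 - h.^2);             % backprop through the tanh encoder
Wgrad  = delta1*data' + h*delta2';             % tied-weight gradient (both paths)
grad   = [Wgrad(:); sum(delta1,2); sum(delta2,2)];
end

% usage sketch (assuming minFunc is on the path, as in the author's code):
%   options.Method = 'cg';
%   theta = minFunc(@mseAutoencoderCost, 0.01*randn(100*64+100+64,1), options, 64, 100, data);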
The author's experimental conclusions are: different optimization algorithms have different strengths and weaknesses and suit different settings. For example, L-BFGS works better than SGD (stochastic gradient descent) and CG (conjugate gradient) when the parameter dimensionality is relatively low (roughly under 10,000 dimensions), especially for models with convolution. For high-dimensional parameter problems, CG works better than the other two. In other words, SGD generally comes out somewhat worse, and the picture is the same with GPU acceleration: running L-BFGS and CG on a GPU speeds up optimization noticeably, while SGD gains very little. On a single core, the advantage of L-BFGS comes mainly from exploiting an approximate second-order (curvature) relationship between the parameters to accelerate optimization, while CG benefits from conjugate information between the parameters, without explicitly forming the Hessian matrix.
That said, when SGD uses a large minibatch together with a line search, its optimization performance also improves (a sketch of what such a step could look like is given below).
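The paper does not list code for this variant, but a rough sketch of one SGD step with a backtracking (Armijo-style) line search on a large minibatch might look like the following; costGrad and batchData are assumed to exist, and the constants are illustrative only:

% Sketch: one SGD step on a large minibatch with a backtracking (Armijo) line search,
% replacing the hand-tuned fixed learning rate. costGrad is assumed to return [cost, grad].
[cost0, grad] = costGrad(theta, batchData);
step = 1;                                     % initial trial step size
d    = -grad;                                 % descent direction
while costGrad(theta + step*d, batchData) > cost0 + 1e-4*step*(grad'*d)
    step = step/2;                            % shrink until sufficient decrease holds
    if step < 1e-8, break; end                % give up on pathological batches
end
theta = theta + step*d;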
Comparing the optimization performance of SGD, L-BFGS and CG on a single core for a plain autoencoder model, the results show that SGD performs worst.
Under the same conditions but training a sparse autoencoder model, SGD does even worse. The main reason is that L-BFGS and CG can use a large minibatch to estimate the expected activation of each hidden unit, which is exactly the quantity the sparsity constraint on that unit is built on, whereas SGD has to estimate it from noisy information (see the sketch after this paragraph).
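As a small illustration (my own, not the author's code) of why batch size matters here: the usual KL-divergence sparsity penalty depends on the mean activation of each hidden unit, and that mean is estimated from the current batch, so a larger batch gives a less noisy estimate. The variable names below are made up, and h is assumed to hold sigmoid-style activations in (0,1):

% Sketch: KL-divergence sparsity penalty estimated from a minibatch.
% h is hidSize x batchSize hidden activations in (0,1); rho is the target sparsity.
rho    = 0.05;
beta   = 3;                                    % weight of the sparsity term
rhoHat = mean(h, 2);                           % per-unit mean activation over the batch
klPen  = sum(rho*log(rho./rhoHat) + (1-rho)*log((1-rho)./(1-rhoHat)));
cost   = cost + beta*klPen;
% extra term that would be added (via bsxfun) to every column of the hidden-layer
% delta before multiplying by the activation derivative during backprop:
sparseDelta = beta * (-rho./rhoHat + (1-rho)./(1-rhoHat));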
Of course, the author also ran quite a few experiments on GPUs and with convolutional models.
Finally, the author trains a sparse autoencoder with 2 hidden layers (not counting the pooling layers), applies it to MNIST, and reports the resulting recognition accuracy.
The author's website provides some code; see "deep autoencoder with L-BFGS". From the title I expected the code to implement pre-training and fine-tuning of a deep convolutional autoencoder, since the paper uses convolution, but after reading it I found that it implements just an ordinary two-layer autoencoder. So it seems we are back to the answer to the second question in the earlier post Deep learning:三十六(關於構建深度卷積SAE網絡的一點困惑).
Below are some annotations on the main parts of the author's code:
optimizeAutoencoderLBFGS.m (drives the parameter optimization of the deep autoencoder):
function [] = optimizeAutoencoderLBFGS(layersizes, datasetpath, ...
                                       finalObjective)
% train a deep autoencoder with variable hidden sizes
% layersizes : the sizes of the hidden layers. For instance, specifying layersizes =
%              [200 100] will create a network that looks like input -> 200 -> 100 -> 200
%              -> output (same size as input). Notice the mirroring structure of the
%              autoencoders. Default layersizes = [2*3072 100]
% datasetpath: the path to the CIFAR dataset (where we find the *.mat
%              files). see loadData.m
% finalObjective: the final objective that you use to compare to
%                 terminate your optimization. To qualify, the objective
%                 function on the entire training set must be below this
%                 value.
%
% Author: Quoc V. Le (quocle@stanford.edu)
%
%% Handle default parameters
if nargin < 3 || isempty(finalObjective)
  finalObjective = 70; % i am just making this up, the evaluation objective
                       % will be much lower
end
if nargin < 2 || isempty(datasetpath)
  datasetpath = '.';
end
if nargin < 1 || isempty(layersizes)
  layersizes = [2*3072 100];
  layersizes = [200 100];
end

%% Load data
loadData % traindata is 3072 x 10000; each column is one example vector

%% Random initialization
initializeWeights; % looking at the author's code for this part, I still cannot see
                   % where convolution and pooling come in; how do they get connected?

%% Optimization: minibatch L-BFGS
% Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng.
% On optimization methods for deep learning. ICML, 2011

addpath minFunc/
options.Method = 'lbfgs';
options.maxIter = 20;
options.display = 'on';
options.TolX = 1e-3;

perm = randperm(size(traindata,2));
traindata = traindata(:,perm); % randomly shuffle the training examples
batchSize = 1000;              % 10000 examples in total, so 10 minibatches
maxIter = 20;
for i=1:maxIter
  startIndex = mod((i-1) * batchSize, size(traindata,2)) + 1;
  fprintf('startIndex = %d, endIndex = %d\n', startIndex, startIndex + batchSize-1);
  data = traindata(:, startIndex:startIndex + batchSize-1);
  [theta, obj] = minFunc( @deepAutoencoder, theta, options, layersizes, ...
                          data);
  if obj <= finalObjective % use the minibatch obj as a heuristic for stopping
                           % because checking the entire dataset is very
                           % expensive
    % yes, we should check the objective for the entire training set
    trainError = deepAutoencoder(theta, layersizes, traindata);
    if trainError <= finalObjective
      % now your submission is qualified
      break
    end
  end
end

%% write to text files so that we can test your program
writeToTextFiles;
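For reference, assuming the CIFAR *.mat files are where loadData.m expects them and minFunc/ is present, the whole pipeline could be driven with a call like the following (the arguments simply spell out the defaults):

% hypothetical invocation, arguments equal to the function's effective defaults
optimizeAutoencoderLBFGS([200 100], '.', 70);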
deepAutoencoder.m (computes the deep network's cost function and its gradient):
function [cost,grad] = deepAutoencoder(theta, layersizes, data)
% cost and gradient of a deep autoencoder
% layersizes is a vector of sizes of hidden layers, e.g.,
% layersizes[2] is the size of layer 2
% this does not count the visible layer
% data is the input data, each column is an example
% the activation function of the last layer is linear, the activation
% function of intermediate layers is the hyperbolic tangent function
% WARNING: the code is optimized for ease of implementation and
% understanding, not speed nor space

%% FORCING THETA TO BE IN MATRIX FORMAT FOR EASE OF UNDERSTANDING
% Note that this is not optimized for space, one can just retrieve W and b
% on the fly during forward prop and backprop. But i do it here so that the
% readers can understand what's going on
layersizes = [size(data,1) layersizes];
l = length(layersizes);
lnew = 0;
for i=1:l-1
    lold = lnew + 1;
    lnew = lnew + layersizes(i) * layersizes(i+1);
    W{i} = reshape(theta(lold:lnew), layersizes(i+1), layersizes(i));
    lold = lnew + 1;
    lnew = lnew + layersizes(i+1);
    b{i} = theta(lold:lnew);
end
% handle tied-weight stuff
j = 1;
for i=l:2*(l-1)
    lold = lnew + 1;
    lnew = lnew + layersizes(l-j);
    W{i} = W{l - j}'; % just use the transpose of the corresponding encoder weights
    b{i} = theta(lold:lnew);
    j = j + 1;
end
assert(lnew == length(theta), 'Error: dimensions of theta and layersizes do not match\n')

%% FORWARD PROP
for i=1:2*(l-1)-1
    if i==1
        [h{i} dh{i}] = tanhAct(bsxfun(@plus, W{i}*data, b{i}));
    else
        [h{i} dh{i}] = tanhAct(bsxfun(@plus, W{i}*h{i-1}, b{i}));
    end
end
h{i+1} = linearAct(bsxfun(@plus, W{i+1}*h{i}, b{i+1}));

%% COMPUTE COST
diff = h{i+1} - data;
M = size(data,2);
cost = 1/M * 0.5 * sum(diff(:).^2); % a plain standard autoencoder, no extra penalty such as sparsity

%% BACKPROP
if nargout > 1
    outderv = 1/M * diff;
    for i=2*(l-1):-1:2
        Wgrad{i} = outderv * h{i-1}';
        bgrad{i} = sum(outderv,2);
        outderv = (W{i}' * outderv) .* dh{i-1};
    end
    Wgrad{1} = outderv * data';
    bgrad{1} = sum(outderv,2);

    % handle tied-weight stuff
    j = 1;
    for i=l:2*(l-1)
        Wgrad{l-j} = Wgrad{l-j} + Wgrad{i}';
        j = j + 1;
    end

    % dump the results to the grad vector
    grad = zeros(size(theta));
    lnew = 0;
    for i=1:l-1
        lold = lnew + 1;
        lnew = lnew + layersizes(i) * layersizes(i+1);
        grad(lold:lnew) = Wgrad{i}(:);
        lold = lnew + 1;
        lnew = lnew + layersizes(i+1);
        grad(lold:lnew) = bgrad{i}(:);
    end
    j = 1;
    for i=l:2*(l-1)
        lold = lnew + 1;
        lnew = lnew + layersizes(l-j);
        grad(lold:lnew) = bgrad{i}(:);
        j = j + 1;
    end
end
end

%% USEFUL ACTIVATION FUNCTIONS
function [a da] = sigmoidAct(x)
a = 1 ./ (1 + exp(-x));
if nargout > 1
    da = a .* (1-a);
end
end

function [a da] = tanhAct(x)
a = tanh(x);
if nargout > 1
    da = (1-a) .* (1+a);
end
end

function [a da] = linearAct(x)
a = x;
if nargout > 1
    da = ones(size(a));
end
end
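Before trusting the analytic gradient, it is worth spot-checking it numerically on a tiny random problem. The sketch below is my own; it uses central differences on a few randomly chosen coordinates of theta and relies only on the deepAutoencoder function above:

% Sketch: central-difference gradient check for deepAutoencoder on a tiny problem.
layersizes = [5 3];                        % small hidden sizes so the check is fast
data = randn(8, 20);                       % 8-dimensional inputs, 20 examples
% theta layout expected by deepAutoencoder: encoder W/b pairs, then decoder biases (weights tied)
nEnc = 8*5 + 5 + 5*3 + 3;                  % encoder weights and biases
nDec = 5 + 8;                              % decoder biases only
theta = 0.1*randn(nEnc + nDec, 1);
[~, grad] = deepAutoencoder(theta, layersizes, data);
epsk = 1e-5;
idxs = randperm(numel(theta));
for k = idxs(1:5)                          % spot-check five random coordinates
    e = zeros(size(theta)); e(k) = epsk;
    num = (deepAutoencoder(theta+e, layersizes, data) - ...
           deepAutoencoder(theta-e, layersizes, data)) / (2*epsk);
    fprintf('k = %d: analytic = %g, numeric = %g\n', k, grad(k), num);
end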
initializeWeights.m (random initialization of the parameters, subject to certain constraints):
%% Random initialization
% X. Glorot, Y. Bengio.
% Understanding the difficulty of training deep feedforward neural networks.
% AISTATS 2010.
% QVL: this initialization method appears to perform better than
% theta = randn(d,1);
s0 = size(traindata,1);       % s0 is the dimensionality of the input samples
layersizes = [s0 layersizes]; % input layer - hidden1 - hidden2, here 3072-6144-100
l = length(layersizes);       % number of layers, not counting the decoder part;
                              % with 2 hidden layers, l = 3 here
lnew = 0;
for i=1:l-1 % i runs from 1 to l-1, i.e. 1 to 2 here: the encoder part
    lold = lnew + 1;
    lnew = lnew + layersizes(i) * layersizes(i+1);
    r = sqrt(6) / sqrt(layersizes(i+1)+layersizes(i));
    A = rand(layersizes(i+1), layersizes(i))*2*r - r;
    % reshape(theta(lold:lnew), layersizes(i+1), layersizes(i));
    theta(lold:lnew) = A(:); % this assigns the weight matrix W
    lold = lnew + 1;
    lnew = lnew + layersizes(i+1);
    A = zeros(layersizes(i+1),1);
    theta(lold:lnew) = A(:); % this assigns the bias vector b
end
% the above is the encoder part
j = 1;
for i=l:2*(l-1) % i runs from l to 2*(l-1), i.e. 3 to 4 here: the decoder part (biases only, weights are tied)
    lold = lnew + 1;
    lnew = lnew + layersizes(l-j);
    theta(lold:lnew)= zeros(layersizes(l-j),1);
    j = j + 1;
end
theta = theta';
layersizes = layersizes(2:end); % drop the input layer
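A quick sanity check (my own addition) that the packing above matches what deepAutoencoder.m expects to unpack: the length of theta should equal the number of encoder weights plus encoder biases plus decoder biases (the decoder weights are tied):

% run right after initializeWeights; s0 and the trimmed layersizes are still in the workspace
sizes    = [s0 layersizes];                       % re-attach the input layer
expected = sum(sizes(1:end-1).*sizes(2:end)) ...  % encoder weight matrices
         + sum(sizes(2:end)) ...                  % encoder biases
         + sum(sizes(1:end-1));                   % decoder biases (mirrored layer sizes)
assert(numel(theta) == expected, 'theta length does not match layersizes');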
References:
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. (2011). On optimization methods for deep learning. In Proceedings of ICML.
Deep learning:三十六(關於構建深度卷積SAE網絡的一點困惑)