Deep Learning 16：用自編碼器對數據進行降維_讀論文“Reducing the Dimensionality of Data with Neural Networks”的筆記

本文轉載自查看原文 2015-12-28 19:37 12795 deep learning/ DBN/ hinton/ 深度學習

前言

論文“Reducing the Dimensionality of Data with Neural Networks”是深度學習鼻祖hinton於2006年發表於《SCIENCE 》的論文，也是這篇論文揭開了深度學習的序幕。

筆記

摘要：高維數據可以通過一個多層神經網絡把它編碼成一個低維數據，從而重建這個高維數據，其中這個神經網絡的中間層神經元數是較少的，可把這個神經網絡叫做自動編碼網絡或自編碼器（autoencoder）。梯度下降法可用來微調這個自動編碼器的權值，但是只有在初始化權值較好時才能得到最優解，不然就容易陷入局部最優解。本文提供了一種有效的初始化權值算法，就是利用深度自動編碼網絡來學習得到初始權值。這一算法比用主成份分析（PCA）來對數據進行降維更好更有效。

內容：

降維在分類、可視化、通信、高維數據的存儲等方面都非常有促進作用。一個簡單且廣泛應用的方法就是PCA降維，它通過尋找數據中的最大變化方向，然后把每個數據都投影到這些方向構成的坐標系中，並表示出來。本文提出了一種PCA的非線性泛化算法，該算法用一個自適應的多層自動編碼網絡來把高維數據編碼為一個低維數據，同時用一個類似的解碼網絡來把這個低維數據重構為原高維數據。

首先，對這兩個網絡的權值進行隨機初始化，然后通過最小化重構項和原始數據之間的誤差對權值進行訓練。誤差的偏導數通過后向傳播得到梯度，也就是把誤差偏導數先通過解碼網絡，再通過編碼網絡進行傳播。整個系統叫做自編碼器，具體見圖1。

圖1.預訓練，就是訓練一系列的RBM，每個RBM只有一層特征檢測器。前一個RBM學習的特征作為下一個RBM的輸入。預訓練完成后把RBM展開得到一個深層自動編碼網絡，然后把誤差的偏導數后向傳播，用來對這個網絡進行微調。

最優化有多層隱藏層（2－4層）的非線性自編碼器的權值比較困難。因為如果權值初始值較大時，自編碼器非常容易陷入局部最優解；如果權值初始值較小時，前幾層的梯度下降是非常小的，權值更新就非常慢，這樣就必須增加自編碼器的隱藏層數，不然就訓練不出最優值。如果初始權值比較接近最優解，那么就能能過梯度下降法很快訓練得到最優解，但是通過一次學習一層特征的算法來找出這樣的初始權值非常困難。“預訓練”可以很好地解決這一問題，通過“預訓練”可以得到比較接近最優解的初始權值。雖然本文中的“預訓練”過程是用的二值數據，但是推廣到其他真實的數據也是可以的，並且證明是有效的。

一個二值向量（如：圖像）可以通過一個2層網絡（即：RBM）來重構，在RBM（文獻[5][6]）中，通過對稱加權連接把隨機二值像素點和隨機二值特征檢測器聯系起來。那些像素點相當於RBM的可視化單元，因為它們的狀態是可見的；那些特征檢測器相當於隱藏單元。可視單元和隱藏單元的聯合系統（v,h）之間的能量（文獻[7]）表示為：

其中，v_i和h_j分別是第i個可視層單元和第j個隱藏層單元的狀態，bi和bj是偏置項，wji是權值。這個網絡通過這個能量函數得到每個可能圖像的概率，具體解釋見文獻[8]。神經元的輸入輸出關系是sigmoid函數。給定一張輸入圖像（暫時是以二值圖像為例），我們可以通過調整網絡的權值和偏置值使得網絡對該輸入圖像的能量最低。權值更新公式如下：

單層的二值網絡不足以模擬大量的數據集，因此一般采用多層網絡，即把第一層網絡的輸出作為第二層網絡的輸入。並且每增加一個網絡層，就會提高網絡對輸入數據重構的log下界概率值，且上層的網絡能夠提取出其下層網絡更高階的特征。

當網絡的預訓練過程完成后，我們需要把解碼和編碼部分重新拿回來展開構成整個網絡，然后用真實的數據作為樣本標簽來微調網絡的參數。

對於連續的數據，第一個RBM的隱藏層仍然是二值的，但是其可視化層單元是帶高斯白噪聲的線性單元。如果該噪聲是單位方差，隱藏單元的更新規則仍然是一樣的，第i個可視化層單元的更新規則是從一個高斯噪聲中抽樣，這個噪聲的方差是單位方差，均值是的平均值。

在實驗中，每個RBM的可視層單元都有真實的[0，1]內激活值，對於高層RBM，其可視化層單元就是前一個RBM的隱藏層單元的激活概率，但是除了最上面一個RBM之外，其他的RBM的隱藏層單元都是隨機的二值。最上面一個RBM的隱藏單元是一個隨機實值狀態，它是從單位方差噪聲中抽樣得到的，這個單位方差噪聲的均值由RBM的可視單元決定。比起PCA，本算法較好地利用了連續變量。預訓練和微調的細節見文獻[8]。

交叉熵誤差公式如下：

其中，pi是輸入數據的重構值。

接下來，做了一系列實驗。

實驗

實驗基礎說明

1.實驗代碼：http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html

2.在CG_MNIST.m中會用到：后向傳導算法求各層偏導數df，見“http://ufldl.stanford.edu/wiki/index.php/用反向傳導思想求導”

3.一些matlab函數

rem和mod:

　　參考資料取模（mod）與取余（rem）的區別——Matlab學習筆記

　　通常取模運算也叫取余運算，它們返回結果都是余數.rem和mod唯一的區別在於:
　　當x和y的正負號一樣的時候，兩個函數結果是等同的；當x和y的符號不同時，rem函數結果的符號和x的一樣，而mod和y一樣。這是由於這兩個函數的生成機制不同，rem函數采用fix函數，而mod函數采用了floor函數（這兩個函數是用來取整的，fix函數向0方向舍入，floor函數向無窮小方向舍入）。rem（x，y）命令返回的是x-n.*y，如果y不等於0，其中的n = fix(x./y)，而mod(x,y)返回的是x-n.*y，當y不等於0時，n=floor(x./y)

4.函數說明

converter.m:

　　實現的功能是將樣本集從.ubyte格式轉換成.ascii格式，然后繼續轉換成.mat格式。

　　makebatches.m:

　　實現的是將原本的2維數據集變成3維的，因為分了多個批次，另外1維表示的是批次。

function [f, df] = CG_MNIST(VV,Dim,XX);

　該函數實現的功能是計算網絡代價函數值f，以及f對網絡中各個參數值的偏導數df，權值和偏置值是同時處理。其中參數VV為網絡中所有參數構成的列向量，參數Dim為每層網絡的節點數構成的向量，XX為訓練樣本集合。f和df分別表示網絡的代價函數和偏導函數值。

　　共軛梯度下降的優化函數形式為：

　　[X, fX, i] = minimize(X, f, length, P1, P2, P3, ... )

　　該函數時使用共軛梯度的方法來對參數X進行優化，所以X是網絡的參數值，為一個列向量。f是一個函數的名稱，它主要是用來計算網絡中的代價函數以及代價函數對各個參數X的偏導函數，f的參數值分別為X，以及minimize函數后面的P1,P2,P3,…使用共軛梯度法進行優化的最大線性搜索長度為length。返回值X為找到的最優參數，fX為在此最優參數X下的代價函數，i為線性搜索的長度（即迭代的次數）。

疑問

1.rbm.m的代碼中，直接有v1=p(v1|h0)，而實際上應該是把p(v1|h0)與均勻分布的隨機數比較得出v1，即：01化，但是在該代碼中並沒有把p(v1|h0)進行01化？為什么？

2.在第4個RBM的預訓練代碼rbmhidlinear.m中，有這句話：

poshidprobs = (data*vishid) + repmat(hidbiases,numcases,1);

即：p(hj=1|v0)＝Wji*v0+bj，為什么？

答：因為輸出層神經元（即：第4個rbm的隱含層神經元）的激活函數是f(x)=x，而不是原來的logistic函數。

3.在把4個RBM展開連接起來，再用訓練數據進行微調整個模型的代碼backprop.m中這句話：

w4probs = w3probs*w4; w4probs = [w4probs ones(N,1)];

為什么？

答：因為沒有把4個RBM展開前輸出層神經元（即：第4個rbm的隱含層神經元）的激活函數是f(x)=x，而不是原來的logistic函數。所以把4個RBM展開並連接起來變為9層神經網絡后，它的第5層神經元的激活函數仍然是f(x)=x。

即：下圖中節點數為30的網絡層神經元激活函數為f(x)=x。

4.backprop.m中這句話：

dataout = 1./(1 + exp(-w7probs*w8));

這里dataout是輸出層的輸出概率密度，但是它是下面代碼中的作用是輸出數據或重構數據，為什么？

答：原因不知道。但從這里可推導出：輸出層的輸出概率密度＝重構數據的概率密度＝重構數據

實驗步驟

1.加載數據集，並轉換為.mat格式，即代碼中的converter.m；

2.依次預訓練4個rbm，並把前一個rbm的輸入作為后一個rbm的輸入，見rbm.m；

3.把4個rbm展開成圖1中的“Unrolling”部分，計算該網絡的代價函數及其對各權值的偏導數，見CG_MNIST.m；

4.利用共軛梯度下降法對代價函數進行優化，見minimize.m。

實驗結果

Train squared error: 4.318

Test squared error: 4.520

代碼

mnistdeepauto.m

% Version 1.000
%
% Code provided by Ruslan Salakhutdinov and Geoff Hinton  
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our 
% web page. 
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.


% This program pretrains a deep autoencoder for MNIST dataset
% You can set the maximum number of epochs for pretraining each layer
% and you can set the architecture of the multilayer net.

clear all
close all

maxepoch=10; %最大迭代次數  In the Science paper we use maxepoch=50, but it works just fine. 
numhid=1000; numpen=500; numpen2=250; numopen=30;

fprintf(1,'Converting Raw files into Matlab format \n');
converter; % 把測試數據集和訓練數據集轉換為.mat格式

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;% 把數據集及其標簽進行打包或分批，方便以后分批進行處理，因為數據太大了，這樣可加快學習速率
[numcases numdims numbatches]=size(batchdata);%返回訓練數據集的大小

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;             %預訓練第1個rbm
hidrecbiases=hidbiases; % 第一個rbm的隱含層偏置項
save mnistvh vishid hidrecbiases visbiases;% 保存第1個rbm的權值、隱含層偏置項、可視化層偏置項，為mnistvh.mat

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;% 第1個rbm中整個數據第一次正向傳播時隱含層的輸出概率（注意：不是把概率01化后的輸出狀態），作為第2個rbm的輸入數據
numhid=numpen;% 第2個rbm的隱含層神經元數
restart=1;
rbm;       %預訓練第2個rbm
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthp hidpen penrecbiases hidgenbiases;% 保存第2個rbm的權值、隱含層偏置項、可視化層偏置項，為mnisthp.mat

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;% 第2個rbm中整個數據第一次正向傳播時隱含層的輸出概率，作為第3個rbm的輸入數據（注意：不是把概率01化后的輸出狀態作為輸入數據）
numhid=numpen2;
restart=1;
rbm;       %預訓練第3個rbm
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;
save mnisthp2 hidpen2 penrecbiases2 hidgenbiases2;% 保存第3個rbm的權值、隱含層偏置項、可視化層偏置項，為mnisthp2.mat

fprintf(1,'\nPretraining Layer 4 with RBM: %d-%d \n',numpen2,numopen);
batchdata=batchposhidprobs;% 第3個rbm中整個數據第一次正向傳播時隱含層的輸出概率，作為第4個rbm的輸入數據
numhid=numopen; 
restart=1;
rbmhidlinear;      % 預訓練第4個rbm，但是注意輸出層單元激活函數是1，而不再是logistic函數
hidtop=vishid; toprecbiases=hidbiases; topgenbiases=visbiases;
save mnistpo hidtop toprecbiases topgenbiases;% 保存第4個rbm的權值、隱含層偏置項、可視化層偏置項，為mnistpo.mat

backprop; % 把4個RBM展開連接起來，再用訓練數據進行微調整個模型

converter.m

% Version 1.000
% % 作用：把測試數據集和訓練數據集轉換為.mat格式
% 最終得到的測試數據集：test(0~9).mat
% 最終得到的訓練數據集：digit(0~9).mat
% Code provided by Ruslan Salakhutdinov and Geoff Hinton
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

% This program reads raw MNIST files available at 
% http://yann.lecun.com/exdb/mnist/ 
% and converts them to files in matlab format 
% Before using this program you first need to download files:
% train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz 
% t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz
% and gunzip them. You need to allocate some space for this.  

% This program was originally written by Yee Whye Teh 

%% 首先轉換測試數據集的格式 Work with test files first 
fprintf(1,'You first need to download files:\n train-images-idx3-ubyte.gz\n train-labels-idx1-ubyte.gz\n t10k-images-idx3-ubyte.gz\n t10k-labels-idx1-ubyte.gz\n from http://yann.lecun.com/exdb/mnist/\n and gunzip them \n'); 

f = fopen('t10k-images.idx3-ubyte','r');
[a,count] = fread(f,4,'int32');
  
g = fopen('t10k-labels.idx1-ubyte','r');
[l,count] = fread(g,2,'int32');

fprintf(1,'Starting to convert Test MNIST images (prints 10 dots) \n'); 
n = 1000;

Df = cell(1,10);
for d=0:9,
  Df{d+1} = fopen(['test' num2str(d) '.ascii'],'w');
end;
  
for i=1:10,
  fprintf('.');
  rawimages = fread(f,28*28*n,'uchar');
  rawlabels = fread(g,n,'uchar');
  rawimages = reshape(rawimages,28*28,n);

  for j=1:n,
    fprintf(Df{rawlabels(j)+1},'%3d ',rawimages(:,j));
    fprintf(Df{rawlabels(j)+1},'\n');
  end;
end;

fprintf(1,'\n');
for d=0:9,
  fclose(Df{d+1});
  D = load(['test' num2str(d) '.ascii'],'-ascii');%這個test.ascii文件從哪來的？
  fprintf('%5d Digits of class %d\n',size(D,1),d);
  save(['test' num2str(d) '.mat'],'D','-mat');
end;


%% 然后轉換訓練數據集的格式Work with trainig files second  
f = fopen('train-images.idx3-ubyte','r');
[a,count] = fread(f,4,'int32');

g = fopen('train-labels.idx1-ubyte','r');
[l,count] = fread(g,2,'int32');

fprintf(1,'Starting to convert Training MNIST images (prints 60 dots)\n'); 
n = 1000;

Df = cell(1,10);
for d=0:9,
  Df{d+1} = fopen(['digit' num2str(d) '.ascii'],'w');
end;

for i=1:60,
  fprintf('.');
  rawimages = fread(f,28*28*n,'uchar');
  rawlabels = fread(g,n,'uchar');
  rawimages = reshape(rawimages,28*28,n);

  for j=1:n,
    fprintf(Df{rawlabels(j)+1},'%3d ',rawimages(:,j));
    fprintf(Df{rawlabels(j)+1},'\n');
  end;
end;

fprintf(1,'\n');
for d=0:9,
  fclose(Df{d+1});
  D = load(['digit' num2str(d) '.ascii'],'-ascii');
  fprintf('%5d Digits of class %d\n',size(D,1),d);
  save(['digit' num2str(d) '.mat'],'D','-mat');
end;

dos('rm *.ascii');

makebatches.m

% Version 1.000
% 作用：把數據集及其標簽進行分批，方便以后分批進行處理，因為數據太大了，分批處理可加快學習速率
% 訓練數據集及標簽的打包結果：batchdata、batchtargets
% 測試數據集及標簽的打包結果：testbatchdata、testbatchtargets

% Code provided by Ruslan Salakhutdinov and Geoff Hinton
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

%% 訓練數據集分批
digitdata=[]; % 訓練數據
targets=[];   % 訓練數據的標簽
load digit0; digitdata = [digitdata; D]; targets = [targets; repmat([1 0 0 0 0 0 0 0 0 0], size(D,1), 1)];  
load digit1; digitdata = [digitdata; D]; targets = [targets; repmat([0 1 0 0 0 0 0 0 0 0], size(D,1), 1)];
load digit2; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 1 0 0 0 0 0 0 0], size(D,1), 1)]; 
load digit3; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 1 0 0 0 0 0 0], size(D,1), 1)];
load digit4; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 1 0 0 0 0 0], size(D,1), 1)]; 
load digit5; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 1 0 0 0 0], size(D,1), 1)];
load digit6; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 1 0 0 0], size(D,1), 1)];
load digit7; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 0 1 0 0], size(D,1), 1)];
load digit8; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 0 0 1 0], size(D,1), 1)];
load digit9; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 0 0 0 1], size(D,1), 1)];
digitdata = digitdata/255;% 簡單縮放歸一化

totnum=size(digitdata,1);%訓練樣本數：60000
fprintf(1, 'Size of the training dataset= %5d \n', totnum);

rand('state',0); %so we know the permutation of the training data
randomorder=randperm(totnum);% 產生totnum個小於等於totnum的正整數

numbatches=totnum/100;          % 批數：600
numdims  =  size(digitdata,2);  % 每個訓練樣本的維數：784
batchsize = 100;                % 每個batch中的訓練樣本數：100
batchdata = zeros(batchsize, numdims, numbatches);
batchtargets = zeros(batchsize, 10, numbatches);

for b=1:numbatches
  batchdata(:,:,b) = digitdata(randomorder(1+(b-1)*batchsize:b*batchsize), :);
  batchtargets(:,:,b) = targets(randomorder(1+(b-1)*batchsize:b*batchsize), :);
end;
clear digitdata targets;

%% 測試數據集分批
digitdata=[];
targets=[];
load test0; digitdata = [digitdata; D]; targets = [targets; repmat([1 0 0 0 0 0 0 0 0 0], size(D,1), 1)]; 
load test1; digitdata = [digitdata; D]; targets = [targets; repmat([0 1 0 0 0 0 0 0 0 0], size(D,1), 1)]; 
load test2; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 1 0 0 0 0 0 0 0], size(D,1), 1)];
load test3; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 1 0 0 0 0 0 0], size(D,1), 1)];
load test4; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 1 0 0 0 0 0], size(D,1), 1)];
load test5; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 1 0 0 0 0], size(D,1), 1)];
load test6; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 1 0 0 0], size(D,1), 1)];
load test7; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 0 1 0 0], size(D,1), 1)];
load test8; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 0 0 1 0], size(D,1), 1)];
load test9; digitdata = [digitdata; D]; targets = [targets; repmat([0 0 0 0 0 0 0 0 0 1], size(D,1), 1)];
digitdata = digitdata/255;

totnum=size(digitdata,1);
fprintf(1, 'Size of the test dataset= %5d \n', totnum);

rand('state',0); %so we know the permutation of the training data
randomorder=randperm(totnum);

numbatches=totnum/100;
numdims  =  size(digitdata,2);
batchsize = 100;
testbatchdata = zeros(batchsize, numdims, numbatches);
testbatchtargets = zeros(batchsize, 10, numbatches);

for b=1:numbatches
  testbatchdata(:,:,b) = digitdata(randomorder(1+(b-1)*batchsize:b*batchsize), :);
  testbatchtargets(:,:,b) = targets(randomorder(1+(b-1)*batchsize:b*batchsize), :);
end;
clear digitdata targets;


%%% Reset random seeds 
rand('state',sum(100*clock)); 
randn('state',sum(100*clock));

rbm.m

% Version 1.000 
% 作用：訓練RBM，利用1步CD算法
% Code provided by Geoff Hinton and Ruslan Salakhutdinov 
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

% This program trains Restricted Boltzmann Machine in which
% visible, binary, stochastic pixels are connected to
% hidden, binary, stochastic feature detectors using symmetrically
% weighted connections. Learning is done with 1-step Contrastive Divergence.   
% The program assumes that the following variables are set externally:
% maxepoch  -- 最大迭代次數maximum number of epochs
% numhid    -- 隱含層神經元數number of hidden units 
% batchdata -- 分批后的訓練數據集the data that is divided into batches (numcases numdims numbatches)
% restart   -- 如果從第1層開始學習，就置restart為1.set to 1 if learning starts from beginning 

epsilonw      = 0.1;   % 權值的學習速率Learning rate for weights 
epsilonvb     = 0.1;   % 可視化層偏置項的學習速率Learning rate for biases of visible units 
epsilonhb     = 0.1;   % 隱含層偏置項的學習速率Learning rate for biases of hidden units 
weightcost  = 0.0002;  % 權衰減，用於防止出現過擬合，見論文“受限波爾茲曼機RBM”
initialmomentum  = 0.5;% 動量項學習率，用於克服收斂速度和算法的不穩定性之間的矛盾
finalmomentum    = 0.9;

[numcases numdims numbatches]=size(batchdata);%[numcases numdims numbatches]＝[每批中的樣本數 每個樣本的維數 訓練樣本批數]

if restart ==1,
  restart=0;
  epoch=1;

% Initializing symmetric weights and biases. 
  vishid     = 0.1*randn(numdims, numhid);% 連接權值Wij
  hidbiases  = zeros(1,numhid);           % 隱含層偏置項ci
  visbiases  = zeros(1,numdims);          % 可視化層偏置項bj

  poshidprobs = zeros(numcases,numhid);%100*1000，單個batch第一次正向傳播時隱含層的輸出概率p(h|v0)
  neghidprobs = zeros(numcases,numhid);%第二次正向傳播時隱含層的輸出概率p(h|v1)
  posprods    = zeros(numdims,numhid);% posprods表示p(hi=1|v0)*v0，以后更新detaWij時會用到這一項
  negprods    = zeros(numdims,numhid);% negprods表示p(hi=1|v1)*v1，以后更新detaWij時會用到這一項
  vishidinc  = zeros(numdims,numhid);% 權值更新的增量deta Wji
  hidbiasinc = zeros(1,numhid);      % 隱含層偏置項更新的增量deta bj
  visbiasinc = zeros(1,numdims);     % 可視化層偏置項更新的增量deta ci
  batchposhidprobs=zeros(numcases,numhid,numbatches);% 整個數據第一次正向傳播時隱含層的輸出概率
end

for epoch = epoch:maxepoch,
 fprintf(1,'epoch %d\r',epoch); 
 errsum=0;
 for batch = 1:numbatches,
 fprintf(1,'epoch %d batch %d\r',epoch,batch); 

%%%%%%%%% 求正項部分 START POSITIVE PHASE %%%%%%%%%%%%%%%%%以下的代碼請對照“深度學習筆記_-_RBM”看%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  data = batchdata(:,:,batch);% data表示可視化層初始數據v0,每次迭代都需要取出一個batch的數據，每一行代表一個樣本值（這里的數據是double的，不是01的，嚴格的說后面應將其01化）
  poshidprobs = 1./(1 + exp(-data*vishid - repmat(hidbiases,numcases,1)));% 樣本第一次正向傳播時隱含層節點的輸出概率，即：p(hj=1|v0)       
  batchposhidprobs(:,:,batch)=poshidprobs;
  posprods    = data' * poshidprobs;% posprods表示p(hi=1|v0)*v0，以后更新detaWij時會用到這一項
  poshidact   = sum(poshidprobs);% 所有p(hi=1|v0)的累加，以后更新deta ci時會用到這一項
  posvisact = sum(data);% 所有v0的累加，以后更新deta bj時會用到這一項

%%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  poshidstates = poshidprobs > rand(numcases,numhid); %poshidstates表示隱含層的狀態h0，將隱含層數據01化（此步驟在posprods之后進行），按照概率值大小來判定.
                                                      %rand(m,n)為產生m*n大小的矩陣，矩陣中元素為(0,1)之間的均勻分布。
  
%%%%%%%%%求負項部分 START NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  negdata = 1./(1 + exp(-poshidstates*vishid' - repmat(visbiases,numcases,1)));% 從下面來推斷，negdata表示第一次反向進行時的可視層數據v1，但從其表達式上推斷negdata實際上是p(v1|h0)，這里為什么沒有將p(v1|h0)數據01，從而變為v1？而是直接v1=p(v1|h0)？感覺不對
  neghidprobs = 1./(1 + exp(-negdata*vishid - repmat(hidbiases,numcases,1))); % 第一次反向進行后又馬上正向傳播的隱含層概率值，即：p(hj=1|v1)   
  negprods  = negdata'*neghidprobs;% negprods表示p(hi=1|v1)*v1，以后更新detaWij時會用到這一項
  neghidact = sum(neghidprobs);    % 所有p(hi=1|v1)的累加，以后更新deta ci時會用到這一項
  negvisact = sum(negdata);        % 所有v1的累加，以后更新deta bj時會用到這一項

%%%%%%%%% END OF NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  err= sum(sum( (data-negdata).^2 ));
  errsum = err + errsum;

   if epoch>5,
     momentum=finalmomentum;%0.5，momentum表示保持上一次更新增量的比例，如果迭代次數越少，則這個比例值可以稍微大一點
   else
     momentum=initialmomentum;%0.9
   end;

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
    vishidinc = momentum*vishidinc + ...  %vishidinc表示權值更新時的增量deta Wij；
                epsilonw*( (posprods-negprods)/numcases - weightcost*vishid);% posprods-negprods表示deta W，weightcost*vishid表示權衰減項，防止出現過擬合
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*(posvisact-negvisact);% (posvisact-negvisact)表示 deta bj
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*(poshidact-neghidact);% (poshidact-neghidact)表示 deta ci

    vishid = vishid + vishidinc;
    visbiases = visbiases + visbiasinc;
    hidbiases = hidbiases + hidbiasinc;

%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

  end
  fprintf(1, 'epoch %4i error %6.1f  \n', epoch, errsum); 
end;

rbmhidlinear.m

% Version 1.000
% 作用：訓練最頂層的一個RBM
% 輸出層神經元的激活函數為1，是線性的，不再是sigmoid函數，所以該函數名字叫：rbmhidlinear.m
% Code provided by Ruslan Salakhutdinov and Geoff Hinton
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

% This program trains Restricted Boltzmann Machine in which
% visible, binary, stochastic pixels are connected to
% hidden, tochastic real-valued feature detectors drawn from a unit
% variance Gaussian whose mean is determined by the input from 
% the logistic visible units. Learning is done with 1-step Contrastive Divergence.
% The program assumes that the following variables are set externally:
% maxepoch  -- maximum number of epochs
% numhid    -- number of hidden units
% batchdata -- the data that is divided into batches (numcases numdims numbatches)
% restart   -- set to 1 if learning starts from beginning

epsilonw      = 0.001; % Learning rate for weights 
epsilonvb     = 0.001; % Learning rate for biases of visible units
epsilonhb     = 0.001; % Learning rate for biases of hidden units 
weightcost  = 0.0002;  
initialmomentum  = 0.5;
finalmomentum    = 0.9;


[numcases numdims numbatches]=size(batchdata);

if restart ==1,
  restart=0;
  epoch=1;

% Initializing symmetric weights and biases.
  vishid     = 0.1*randn(numdims, numhid);
  hidbiases  = zeros(1,numhid);
  visbiases  = zeros(1,numdims);


  poshidprobs = zeros(numcases,numhid);
  neghidprobs = zeros(numcases,numhid);
  posprods    = zeros(numdims,numhid);
  negprods    = zeros(numdims,numhid);
  vishidinc  = zeros(numdims,numhid);
  hidbiasinc = zeros(1,numhid);
  visbiasinc = zeros(1,numdims);
  sigmainc = zeros(1,numhid);
  batchposhidprobs=zeros(numcases,numhid,numbatches);
end

for epoch = epoch:maxepoch,
 fprintf(1,'epoch %d\r',epoch); 
 errsum=0;

 for batch = 1:numbatches,
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  data = batchdata(:,:,batch);
  poshidprobs =  (data*vishid) + repmat(hidbiases,numcases,1);% 樣本第一次正向傳播時隱含層節點的輸出值，即：p(hj=1|v0)
                                                              % 為什么是這個表達式：p(hj=1|v0)＝Wji*v0+bj ？因為輸出層激活函數為1
  batchposhidprobs(:,:,batch)=poshidprobs;
  posprods    = data' * poshidprobs;
  poshidact   = sum(poshidprobs);
  posvisact = sum(data);
  
%%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
poshidstates = poshidprobs+randn(numcases,numhid);% h0:非概率密度，而是01后的實值

%%%%%%%%% START NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  negdata = 1./(1 + exp(-poshidstates*vishid' - repmat(visbiases,numcases,1)));% v1=p(v1|h0)?
  neghidprobs = (negdata*vishid) + repmat(hidbiases,numcases,1);%為什么是這個表達式p(hj=1|v1)＝Wji*v1+bj ？ neghidprobs表示樣本第二次正向傳播時隱含層節點的輸出值，即：p(hj=1|v1)
  negprods  = negdata'*neghidprobs;
  neghidact = sum(neghidprobs);
  negvisact = sum(negdata); 

%%%%%%%%% END OF NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


  err= sum(sum( (data-negdata).^2 )); 
  errsum = err + errsum;
   if epoch>5,
     momentum=finalmomentum;
   else
     momentum=initialmomentum;
   end;

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    vishidinc = momentum*vishidinc + ...
                epsilonw*( (posprods-negprods)/numcases - weightcost*vishid);
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*(posvisact-negvisact);
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*(poshidact-neghidact);
    vishid = vishid + vishidinc;
    visbiases = visbiases + visbiasinc;
    hidbiases = hidbiases + hidbiasinc;

%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end
fprintf(1, 'epoch %4i error %f \n', epoch, errsum);

end

backprop.m

% Version 1.000
% 作用：把4個RBM展開連接起來，再用訓練數據進行微調整個模型。相當於論文圖1中的“Unrolling”部分，
% Code provided by Ruslan Salakhutdinov and Geoff Hinton
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

% This program fine-tunes an autoencoder with backpropagation.
% Weights of the autoencoder are going to be saved in mnist_weights.mat
% and trainig and test reconstruction errors in mnist_error.mat
% You can also set maxepoch, default value is 200 as in our paper.  

maxepoch=200;
fprintf(1,'\nFine-tuning deep autoencoder by minimizing cross entropy error. \n');
fprintf(1,'60 batches of 1000 cases each. \n');% 60個batch，每個batch1000個樣本

load mnistvh
load mnisthp
load mnisthp2
load mnistpo 

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases; 

%%%% PREINITIALIZE WEIGHTS OF THE AUTOENCODER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
w1=[vishid; hidrecbiases];    % [W1;b1] 分別裝載每層的權值和偏置值，將它們作為一個整體
w2=[hidpen; penrecbiases];    % [W2;b2]
w3=[hidpen2; penrecbiases2];  % [W3;b3]
w4=[hidtop; toprecbiases];    % [W4;b4]
w5=[hidtop'; topgenbiases];   % [W4';v4]
w6=[hidpen2'; hidgenbiases2]; % [W3';v3]
w7=[hidpen'; hidgenbiases];   % [W2';v2]
w8=[vishid'; visbiases];      % [W1';v1]

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;% 每個網絡層中節點的個數
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w4,1)-1;
l5=size(w5,1)-1;
l6=size(w6,1)-1;
l7=size(w7,1)-1;
l8=size(w8,1)-1;
l9=l1;           % 輸出層節點和輸入層的一樣
test_err=[];
train_err=[];


for epoch = 1:maxepoch

%%  %%%%%%%%%%%%%%%%%% 計算訓練誤差 COMPUTE TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0; 
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];%正向傳播，計算每一層的輸出概率密度p(h|v)，且同時在輸出上增加一維（值為常量1）
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));% 輸出層的輸出概率密度，即：重構數據的概率密度，也即：重構數據
  err= err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 )); % 每個batch內的均方誤差
  end
 train_err(epoch)=err/numbatches;% 迭代第epoch次的所有樣本內的均方誤差

%%%%%%%%%%%%%% END OF COMPUTING TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%  %% DISPLAY FIGURE TOP ROW REAL DATA BOTTOM ROW RECONSTRUCTIONS 顯示真實數據和重構數據 %%%%%%%%%%%%%%%%%%%%%%%%%
fprintf(1,'Displaying in figure 1: Top row - real data, Bottom row -- reconstructions \n');
output=[];
 for ii=1:15
  output = [output data(ii,1:end-1)' dataout(ii,:)'];%output為15（因為是顯示15個數字）組，每組2列，分別為理論值和重構值
 end
   if epoch==1 
   close all 
   figure('Position',[100,600,1000,200]);
   else 
   figure(1)
   end 
   mnistdisp(output);%顯示圖片
   drawnow;%刷新屏幕

%% %%%%%%%%%%%%%%%%%% 計算測試誤差 COMPUTE TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[testnumcases testnumdims testnumbatches]=size(testbatchdata);% [100 784 100] 測試數據為100個batch，每個batch含100個測試樣本，每個樣本維數為784
N=testnumcases;
err=0;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err = err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 ));
  end
 test_err(epoch)=err/testnumbatches;
 fprintf(1,'Before epoch %d Train squared error: %6.3f Test squared error: %6.3f \t \t \n',epoch,train_err(epoch),test_err(epoch));

%%%%%%%%%%%%%% END OF COMPUTING TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 
 tt=0;
 for batch = 1:numbatches/10          % 訓練樣本：批數numbatches是600，每個batch內100個樣本，組合后變為批數60，每個batch1000個樣本
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% 在訓練數據內組合10個mini-batch為一個larger-batch ，COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1; 
 data=[];
 for kk=1:10
  data=[data 
        batchdata(:,:,(tt-1)*10+kk)]; %使訓練數據變為60個batch，每個batch內含1000個樣本
 end 

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES 進行共軛梯度3次線性搜索%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;
  VV = [w1(:)' w2(:)' w3(:)' w4(:)' w5(:)' w6(:)' w7(:)' w8(:)']';% 把所有權值（已經包括了偏置值）變成一個大的列向量
  Dim = [l1; l2; l3; l4; l5; l6; l7; l8; l9];% 每層網絡對應節點的個數（不包括偏置值）

  [X, fX] = minimize(VV,'CG_MNIST',max_iter,Dim,data);% X為3次線性搜索最優化后得到的權值參數，是一個列向量

  w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
  xxx = (l1+1)*l2;
  w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
  xxx = xxx+(l2+1)*l3;
  w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
  xxx = xxx+(l3+1)*l4;
  w4 = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
  xxx = xxx+(l4+1)*l5;
  w5 = reshape(X(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
  xxx = xxx+(l5+1)*l6;
  w6 = reshape(X(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
  xxx = xxx+(l6+1)*l7;
  w7 = reshape(X(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
  xxx = xxx+(l7+1)*l8;
  w8 = reshape(X(xxx+1:xxx+(l8+1)*l9),l8+1,l9);%依次重新賦值為優化后的參數

%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnist_weights w1 w2 w3 w4 w5 w6 w7 w8 
 save mnist_error test_err train_err;

end

CG_MNIST.m

% Version 1.000
% 得到代價函數及其對各權值的偏導數
% Code provided by Ruslan Salakhutdinov and Geoff Hinton
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

function [f, df] = CG_MNIST(VV,Dim,XX)
% VV：權值（已經包括了偏置值），為一個大的列向量
% Dim：每層網絡對應節點的個數
% XX：訓練樣本
% f ：代價函數，即交叉熵誤差
% df ：代價函數對各權值的偏導數


l1 = Dim(1);%每層網絡對應節點的個數（不包括偏置值）
l2 = Dim(2);
l3 = Dim(3);
l4= Dim(4);
l5= Dim(5);
l6= Dim(6);
l7= Dim(7);
l8= Dim(8);
l9= Dim(9);
N = size(XX,1);% 樣本的個數

% Do decomversion.下面一系列步驟完成權值的矩陣化
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);% VV是一個長的列向量，它包括偏置值和權值，這里取出的向量已經包括了偏置值
 xxx = (l1+1)*l2; %xxx 表示已經使用了的長度
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w4 = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
 xxx = xxx+(l4+1)*l5;
 w5 = reshape(VV(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
 xxx = xxx+(l5+1)*l6;
 w6 = reshape(VV(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
 xxx = xxx+(l6+1)*l7;
 w7 = reshape(VV(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
 xxx = xxx+(l7+1)*l8;
 w8 = reshape(VV(xxx+1:xxx+(l8+1)*l9),l8+1,l9);


  XX = [XX ones(N,1)];% 訓練樣本，加1維使其下可乘w1
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];% 第5層神經元激活函數為1，而不是logistic函數
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  XXout = 1./(1 + exp(-w7probs*w8));% 輸出層的概率密度，也就是重構數據

f = -1/N*sum(sum( XX(:,1:end-1).*log(XXout) + (1-XX(:,1:end-1)).*log(1-XXout)));%代價函數，即交叉熵誤差，怎么推導的？可見論文最后一頁
IO = 1/N*(XXout-XX(:,1:end-1));% 重構誤差
%% % 用后向傳導算法求各層偏導數df，見“http://ufldl.stanford.edu/wiki/index.php/用反向傳導思想求導”
Ix8=IO; % 相當於輸出層“殘差”
dw8 =  w7probs'*Ix8;% 用后向傳導算法求輸出層偏導數

Ix7 = (Ix8*w8').*w7probs.*(1-w7probs); % 第7層“殘差”
Ix7 = Ix7(:,1:end-1);
dw7 =  w6probs'*Ix7;  % 第7層偏導數

Ix6 = (Ix7*w7').*w6probs.*(1-w6probs); 
Ix6 = Ix6(:,1:end-1);
dw6 =  w5probs'*Ix6;

Ix5 = (Ix6*w6').*w5probs.*(1-w5probs); 
Ix5 = Ix5(:,1:end-1);
dw5 =  w4probs'*Ix5;

Ix4 = (Ix5*w5');
Ix4 = Ix4(:,1:end-1);
dw4 =  w3probs'*Ix4;

Ix3 = (Ix4*w4').*w3probs.*(1-w3probs); 
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs); 
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs); 
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw4(:)' dw5(:)' dw6(:)'  dw7(:)'  dw8(:)'  ]'; %網絡代價函數的偏導數

minimize.m

function [X, fX, i] = minimize(X, f, length, varargin)
%作用：利用共軛梯度下降法對目標函數進行優化
% Minimize a differentiable multivariate function. 

% [X, fX, i]中的X : 3次線性搜索最優化后得到的權值參數，是一個列向量

% minimize(X, f, length, varargin)中的X : 優化目標，即權值
% minimize(X, f, length, varargin)中的f : 代價函數的名稱
% minimize(X, f, length, varargin)中的length : 線性搜索次數
% minimize(X, f, length, varargin)中的varargin : 每層網絡對應的節點數Dim和訓練數據data

% Usage: [X, fX, i] = minimize(X, f, length, P1, P2, P3, ... )
%
% where the starting point is given by "X" (D by 1), and the function named in
% the string "f", must return a function value and a vector of partial
% derivatives of f wrt X, the "length" gives the length of the run: if it is
% positive, it gives the maximum number of line searches, if negative its
% absolute gives the maximum allowed number of function evaluations. You can
% (optionally) give "length" a second component, which will indicate the
% reduction in function value to be expected in the first line-search (defaults
% to 1.0). The parameters P1, P2, P3, ... are passed on to the function f.
%
% The function returns when either its length is up, or if no further progress
% can be made (ie, we are at a (local) minimum, or so close that due to
% numerical problems, we cannot get any closer). NOTE: If the function
% terminates within a few iterations, it could be an indication that the
% function values and derivatives are not consistent (ie, there may be a bug in
% the implementation of your "f" function). The function returns the found
% solution "X", a vector of function values "fX" indicating the progress made
% and "i" the number of iterations (line searches or function evaluations,
% depending on the sign of "length") used.
%
% The Polack-Ribiere flavour of conjugate gradients is used to compute search
% directions, and a line search using quadratic and cubic polynomial
% approximations and the Wolfe-Powell stopping criteria is used together with
% the slope ratio method for guessing initial step sizes. Additionally a bunch
% of checks are made to make sure that exploration is taking place and that
% extrapolation will not be unboundedly large.
%
% See also: checkgrad 
%
% Copyright (C) 2001 - 2006 by Carl Edward Rasmussen (2006-09-08).

INT = 0.1;    % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0;                  % extrapolate maximum 3 times the current step-size
MAX = 20;                         % max 20 function evaluations per line search
RATIO = 10;                                       % maximum allowed slope ratio
SIG = 0.1; RHO = SIG/2; % SIG and RHO are the constants controlling the Wolfe-
% Powell conditions. SIG is the maximum allowed absolute ratio between
% previous and new slopes (derivatives in the search direction), thus setting
% SIG to low (positive) values forces higher precision in the line-searches.
% RHO is the minimum allowed fraction of the expected (from the slope at the
% initial point in the linesearch). Constants must satisfy 0 < RHO < SIG < 1.
% Tuning of SIG (depending on the nature of the function to be optimized) may
% speed up the minimization; it is probably not worth playing much with RHO.

% The code falls naturally into 3 parts, after the initial line search is
% started in the direction of steepest descent. 1) we first enter a while loop
% which uses point 1 (p1) and (p2) to compute an extrapolation (p3), until we
% have extrapolated far enough (Wolfe-Powell conditions). 2) if necessary, we
% enter the second loop which takes p2, p3 and p4 chooses the subinterval
% containing a (local) minimum, and interpolates it, unil an acceptable point
% is found (Wolfe-Powell conditions). Note, that points are always maintained
% in order p0 <= p1 <= p2 < p3 < p4. 3) compute a new search direction using
% conjugate gradients (Polack-Ribiere flavour), or revert to steepest if there
% was a problem in the previous line-search. Return the best value so far, if
% two consecutive line-searches fail, or whenever we run out of function
% evaluations or line-searches. During extrapolation, the "f" function may fail
% either with an error or returning Nan or Inf, and minimize should handle this
% gracefully.

if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end
if length>0, S='Linesearch'; else S='Function evaluation'; end 

i = 0;                                            % zero the run length counter
ls_failed = 0;                             % no previous line search has failed
[f0 df0] = feval(f, X, varargin{:});          % get function value and gradient
fX = f0;
i = i + (length<0);                                            % count epochs?!
s = -df0; d0 = -s'*s;           % initial search direction (steepest) and slope
x3 = red/(1-d0);                                  % initial step is red/(|s|+1)

while i < abs(length)                                      % while not finished
  i = i + (length>0);                                      % count iterations?!

  X0 = X; F0 = f0; dF0 = df0;                   % make a copy of current values
  if length>0, M = MAX; else M = min(MAX, -length-i); end

  while 1                             % keep extrapolating as long as necessary
    x2 = 0; f2 = f0; d2 = d0; f3 = f0; df3 = df0;
    success = 0;
    while ~success && M > 0
      try
        M = M - 1; i = i + (length<0);                         % count epochs?!
        [f3 df3] = feval(f, X+x3*s, varargin{:});
        if isnan(f3) || isinf(f3) || any(isnan(df3)+isinf(df3)), error(''), end
        success = 1;
      catch                                % catch any error which occured in f
        x3 = (x2+x3)/2;                                  % bisect and try again
      end
    end
    if f3 < F0, X0 = X+x3*s; F0 = f3; dF0 = df3; end         % keep best values
    d3 = df3'*s;                                                    % new slope
    if d3 > SIG*d0 || f3 > f0+x3*RHO*d0 || M == 0  % are we done extrapolating?
      break
    end
    x1 = x2; f1 = f2; d1 = d2;                        % move point 2 to point 1
    x2 = x3; f2 = f3; d2 = d3;                        % move point 3 to point 2
    A = 6*(f1-f2)+3*(d2+d1)*(x2-x1);                 % make cubic extrapolation
    B = 3*(f2-f1)-(2*d1+d2)*(x2-x1);
    x3 = x1-d1*(x2-x1)^2/(B+sqrt(B*B-A*d1*(x2-x1))); % num. error possible, ok!
    if ~isreal(x3) || isnan(x3) || isinf(x3) || x3 < 0 % num prob | wrong sign?
      x3 = x2*EXT;                                 % extrapolate maximum amount
    elseif x3 > x2*EXT                  % new point beyond extrapolation limit?
      x3 = x2*EXT;                                 % extrapolate maximum amount
    elseif x3 < x2+INT*(x2-x1)         % new point too close to previous point?
      x3 = x2+INT*(x2-x1);
    end
  end                                                       % end extrapolation

  while (abs(d3) > -SIG*d0 || f3 > f0+x3*RHO*d0) && M > 0  % keep interpolating
    if d3 > 0 || f3 > f0+x3*RHO*d0                         % choose subinterval
      x4 = x3; f4 = f3; d4 = d3;                      % move point 3 to point 4
    else
      x2 = x3; f2 = f3; d2 = d3;                      % move point 3 to point 2
    end
    if f4 > f0           
      x3 = x2-(0.5*d2*(x4-x2)^2)/(f4-f2-d2*(x4-x2));  % quadratic interpolation
    else
      A = 6*(f2-f4)/(x4-x2)+3*(d4+d2);                    % cubic interpolation
      B = 3*(f4-f2)-(2*d2+d4)*(x4-x2);
      x3 = x2+(sqrt(B*B-A*d2*(x4-x2)^2)-B)/A;        % num. error possible, ok!
    end
    if isnan(x3) || isinf(x3)
      x3 = (x2+x4)/2;               % if we had a numerical problem then bisect
    end
    x3 = max(min(x3, x4-INT*(x4-x2)),x2+INT*(x4-x2));  % don't accept too close
    [f3 df3] = feval(f, X+x3*s, varargin{:});
    if f3 < F0, X0 = X+x3*s; F0 = f3; dF0 = df3; end         % keep best values
    M = M - 1; i = i + (length<0);                             % count epochs?!
    d3 = df3'*s;                                                    % new slope
  end                                                       % end interpolation

  if abs(d3) < -SIG*d0 && f3 < f0+x3*RHO*d0          % if line search succeeded
    X = X+x3*s; f0 = f3; fX = [fX' f0]';                     % update variables
    fprintf('%s %6i;  Value %4.6e\r', S, i, f0);
    s = (df3'*df3-df0'*df3)/(df0'*df0)*s - df3;   % Polack-Ribiere CG direction
    df0 = df3;                                               % swap derivatives
    d3 = d0; d0 = df0'*s;
    if d0 > 0                                      % new slope must be negative
      s = -df0; d0 = -s'*s;                  % otherwise use steepest direction
    end
    x3 = x3 * min(RATIO, d3/(d0-realmin));          % slope ratio but max RATIO
    ls_failed = 0;                              % this line search did not fail
  else
    X = X0; f0 = F0; df0 = dF0;                     % restore best point so far
    if ls_failed || i > abs(length)         % line search failed twice in a row
      break;                             % or we ran out of time, so we give up
    end
    s = -df0; d0 = -s'*s;                                        % try steepest
    x3 = 1/(1-d0);                     
    ls_failed = 1;                                    % this line search failed
  end
end
fprintf('\n');

參考文獻

Deep learning：三十五(用NN實現數據降維練習)；

Deep learning：三十四(用NN實現數據的降維)；

Reducing the dimensionality of data with neural networks；

supporting online material (pdf) ；

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Deep Learning 學習筆記（8）：自編碼器( Autoencoders ) Deep Learning學習隨記（一）稀疏自編碼器一天一經典Reducing the Dimensionality of Data with Neural Networks [Science2006] 《Neural Networks and Deep Learning》課程筆記論文筆記系列-Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves 自編碼器深度自編碼器(Deep Autoencoder)MATLAB解讀自編碼器自編碼器