Preface:
This post is the exercise part of the previous post Deep learning: 34 (Dimensionality reduction with neural networks), i.e. the code accompanying Hinton's Science paper Reducing the dimensionality of data with neural networks; the code can be downloaded from http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html. I spent some time reading and running it. The code is really two separate projects. One uses the MNIST database only for deep autoencoder compression; it is trained in an unsupervised way and evaluated by the reconstruction error (MSE). The other project is MNIST handwritten-digit recognition: the pretraining of the network is unsupervised, while the fine-tuning is supervised, and the evaluation criterion is the recognition rate (equivalently, the error rate).
MNIST dimensionality-reduction experiment:
This experiment trains a deep autoencoder with 4 hidden layers. The input layer has 784 units, and the 4 hidden layers have 1000, 500, 250 and 30 units respectively. The procedure for obtaining the weights of the whole network is as follows:
- First train the first RBM, i.e. the network formed by the 784-dimensional input layer and the first hidden layer of 1000 units. It is trained with the RBM learning procedure on the training samples; once training is done, compute the hidden-layer outputs for the training samples.
- Use the result of step 1 as the input for training the second RBM, again optimized with the RBM procedure, and compute its outputs. Train the third and fourth RBMs in the same way.
- Unroll the 4 RBMs above and connect them into a new network, split into an encoder part and a decoder part, and initialize this new network with the weights obtained in steps 1 and 2 (a short excerpt showing this initialization follows this list).
- Since the final output of the new network has the same number of units as the original input, the original input values can be used as the target outputs of the network; the cost function and its partial derivatives can then be computed with the back-propagation algorithm.
- Starting from the initial values of step 3 and using the cost and derivatives of step 4, optimize the whole new network with conjugate-gradient descent to obtain the final weights. The entire procedure above is unsupervised.
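In the code this unrolling simply reuses each pretrained RBM weight matrix for the encoder layers and its transpose for the decoder layers; the corresponding initialization (quoted from backprop.m, which is listed in full later) looks like this:

% Encoder layers: pretrained RBM weights with the hidden (recognition) biases.
w1 = [vishid;   hidrecbiases];    % 784 -> 1000
w2 = [hidpen;   penrecbiases];    % 1000 -> 500
w3 = [hidpen2;  penrecbiases2];   % 500 -> 250
w4 = [hidtop;   toprecbiases];    % 250 -> 30   (the code layer)
% Decoder layers: the same matrices transposed, with the visible (generative) biases.
w5 = [hidtop';  topgenbiases];    % 30 -> 250
w6 = [hidpen2'; hidgenbiases2];   % 250 -> 500
w7 = [hidpen';  hidgenbiases];    % 500 -> 1000
w8 = [vishid';  visbiases];       % 1000 -> 784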
Some MATLAB functions:
rem and mod:
Reference: The difference between mod (modulo) and rem (remainder), MATLAB study notes.
The modulo operation is often also called the remainder operation; both return a remainder. The only difference between rem and mod is:
When x and y have the same sign, the two functions give the same result; when the signs differ, the result of rem has the same sign as x, while the result of mod has the same sign as y. This comes from how the two functions are defined: rem uses fix while mod uses floor (both are rounding functions; fix rounds towards 0, floor rounds towards negative infinity). rem(x,y) returns x - n.*y with n = fix(x./y) when y is nonzero, whereas mod(x,y) returns x - n.*y with n = floor(x./y) when y is nonzero.
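A quick check in MATLAB (this snippet is just an illustration, not part of Hinton's code) makes the sign behaviour concrete:

x = 5;  y = -3;
rem(x,y)    % returns  2, same sign as x: fix(5/-3) = -1, and 5 - (-1)*(-3) = 2
mod(x,y)    % returns -1, same sign as y: floor(5/-3) = -2, and 5 - (-2)*(-3) = -1
rem(-5,3)   % returns -2
mod(-5,3)   % returns  1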
M-files in the project:
converter.m:
Converts the sample set from .ubyte format to .ascii format, and then to .mat format.
makebatches.m:
Turns the original 2-D data set into a 3-D array, because the data are split into multiple batches and the extra dimension indexes the batch.
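As a rough sketch only (the real makebatches.m additionally shuffles the samples and also builds the target batches; the variable name digitdata is assumed here), splitting a 2-D sample matrix into batches could look like:

% Hypothetical sketch: split an N*D matrix 'digitdata' into batches of 100 cases each.
numcases   = 100;                               % samples per batch
[N, D]     = size(digitdata);
numbatches = floor(N/numcases);
batchdata  = zeros(numcases, D, numbatches);    % 3-D array: cases x dims x batches
for b = 1:numbatches
  batchdata(:,:,b) = digitdata((b-1)*numcases+1 : b*numcases, :);
end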
Here is roughly how the program optimizes the RBM weights (assume a 2-layer RBM, i.e. only a visible layer and a hidden layer, with binary units in both layers):
- Randomly initialize a weight matrix w and bias vectors b for the network.
- Propagate the visible-layer input matrix v upwards to compute the hidden-layer output matrix h, and compute the matrix of mean products of the corresponding units of v and h.
- The output h in step 2 consists of probabilities; stochastically binarize it into 0/1 values.
- Propagate the binarized h of step 3 back down to compute the visible-layer matrix v'. (In principle this v' should also be binarized.)
- Propagate v' upwards to compute the hidden-layer matrix h', and compute the matrix of mean products of the corresponding units of v' and h'.
- Subtract the mean-product matrix of step 5 from the one of step 2; the result is the increment matrix for the corresponding weights.
- Combine this with the corresponding learning rate and update the weights using the weight-update rule.
- Repeat steps 2 to 7 until convergence.
Bias optimization steps:
- Randomly initialize a weight matrix w and bias vectors b for the network.
- Propagate the visible-layer input matrix v upwards to compute the hidden-layer output matrix h, and compute the mean vector of the v-layer samples and the mean vector of the h layer.
- The output h in step 2 consists of probabilities; stochastically binarize it into 0/1 values.
- Propagate the binarized h of step 3 back down to compute the visible-layer matrix v'.
- Propagate v' upwards to compute the hidden-layer matrix h', and compute the mean vector of the v'-layer samples and the mean vector of the h' layer.
- Subtract the v' mean vector of step 5 from the v mean vector of step 2; the result is the increment vector for the visible-layer biases. Subtract the h' mean vector of step 5 from the h mean vector of step 2; the result is the increment vector for the hidden-layer biases.
- Combine these with the corresponding learning rates and update the biases using the update rule.
- Repeat steps 2 to 7 until convergence.
Of course, the weight update and the bias update are performed together in every iteration, so they should converge at the same time. The update rule can also be modified slightly, for instance by adding a momentum term, so that part of the previous update increment is carried over into the current update; a condensed sketch of one such update step is given right below.
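The two procedures above together amount to one step of contrastive divergence (CD-1). The following is a minimal, self-contained sketch of a single update, written to mirror the logic of rbm.m below but not taken from the author's code; the variables v, w, vb, hb, winc, vbinc, hbinc, epsilon, weightcost and momentum are assumed to be initialized elsewhere, with v of size numcases*numdims and w of size numdims*numhid.

% One CD-1 update step (sketch).
hprobs  = 1./(1 + exp(-v*w - repmat(hb,numcases,1)));         % step 2: upward pass, hidden probabilities
posprod = v'*hprobs;                                          % positive statistics <v_i h_j>
hstates = hprobs > rand(numcases,numhid);                     % step 3: binarize the hidden units
vrecon  = 1./(1 + exp(-hstates*w' - repmat(vb,numcases,1)));  % step 4: downward pass, reconstruction
hrecon  = 1./(1 + exp(-vrecon*w - repmat(hb,numcases,1)));    % step 5: upward pass on the reconstruction
negprod = vrecon'*hrecon;                                     % negative statistics

winc  = momentum*winc  + epsilon*((posprod-negprod)/numcases - weightcost*w);  % steps 6-7 for the weights
vbinc = momentum*vbinc + (epsilon/numcases)*(sum(v)      - sum(vrecon));        % visible-bias increment
hbinc = momentum*hbinc + (epsilon/numcases)*(sum(hprobs) - sum(hrecon));        % hidden-bias increment
w  = w  + winc;   vb = vb + vbinc;   hb = hb + hbinc;                           % apply the increments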
The function CG_MNIST has the following form:
function [f, df] = CG_MNIST(VV,Dim,XX);
This function computes the network cost f and the partial derivatives df of f with respect to all network parameters, handling weights and biases together. The parameter VV is a column vector containing all network parameters, Dim is a vector with the number of units in each layer, and XX is the set of training samples. f and df are the cost of the network and the gradient of that cost, respectively.
The conjugate-gradient optimization function has the form:
[X, fX, i] = minimize(X, f, length, P1, P2, P3, ... )
This function optimizes the parameters X with the conjugate-gradient method, so X holds the network parameters as a column vector. f is the name of a function that computes the network cost and its partial derivatives with respect to X; it is called with X followed by the extra arguments P1, P2, P3, … given to minimize. length is the maximum number of line searches used by the conjugate-gradient optimization. The return value X is the optimized parameter vector, fX contains the cost values at those parameters, and i is the number of line searches (i.e. iterations) actually used.
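For example, in backprop.m below it is called as follows, where max_iter plays the role of the length argument and Dim and data are passed through to CG_MNIST:

max_iter = 3;                                          % at most 3 line searches
[X, fX] = minimize(VV,'CG_MNIST',max_iter,Dim,data);   % VV: initial parameter vector; Dim and data become P1 and P2 of CG_MNIST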
Experimental results:
The author set the number of fine-tuning epochs to 200. In my run the first 35 epochs already took more than 6 hours, so I did not wait for the full run (it would have taken over 30 hours). The original digits and their reconstructions at that point are shown below:
The mean squared errors are:
Train squared error: 4.318
Test squared error: 4.520
Main parts of the experiment code, with comments:
mnistdeepauto.m:
clear all
close all

maxepoch=10;        %In the Science paper we use maxepoch=50, but it works just fine.
numhid=1000; numpen=500; numpen2=250; numopen=30;

fprintf(1,'Converting Raw files into Matlab format \n');
converter;                          % convert the data into Matlab format

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;
[numcases numdims numbatches]=size(batchdata);

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;
hidrecbiases=hidbiases;             % hidbiases are the hidden-layer biases
save mnistvh vishid hidrecbiases visbiases;   % save this layer's variables: weights, hidden biases, visible biases

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;         % batchposhidprobs holds the output probabilities of the first RBM
numhid=numpen;
restart=1;
rbm;                                % train the 2nd RBM
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthp hidpen penrecbiases hidgenbiases;   % mnisthp is the file name used for saving

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;
numhid=numpen2;
restart=1;
rbm;
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;   % 3rd RBM
save mnisthp2 hidpen2 penrecbiases2 hidgenbiases2;

fprintf(1,'\nPretraining Layer 4 with RBM: %d-%d \n',numpen2,numopen);
batchdata=batchposhidprobs;
numhid=numopen;
restart=1;
rbmhidlinear;
hidtop=vishid; toprecbiases=hidbiases; topgenbiases=visbiases;      % 4th RBM (linear hidden units)
save mnistpo hidtop toprecbiases topgenbiases;

backprop;
rbm.m:
epsilonw      = 0.1;   % Learning rate for weights
epsilonvb     = 0.1;   % Learning rate for biases of visible units
epsilonhb     = 0.1;   % Learning rate for biases of hidden units
% Note that the visible and hidden layers have separate bias vectors, while the (symmetric) weights are shared.
weightcost  = 0.0002;
initialmomentum  = 0.5;
finalmomentum    = 0.9;

[numcases numdims numbatches]=size(batchdata);   % [100,784,600]

if restart ==1,
  restart=0;
  epoch=1;

% Initializing symmetric weights and biases.
  vishid     = 0.1*randn(numdims, numhid);   % small random initial weights, 784*1000
  hidbiases  = zeros(1,numhid);              % biases initialized to 0
  visbiases  = zeros(1,numdims);

  poshidprobs = zeros(numcases,numhid);      % 100*1000, hidden-unit probabilities for one batch in the positive (upward) pass
  neghidprobs = zeros(numcases,numhid);
  posprods    = zeros(numdims,numhid);       % 784*1000
  negprods    = zeros(numdims,numhid);
  vishidinc  = zeros(numdims,numhid);
  hidbiasinc = zeros(1,numhid);
  visbiasinc = zeros(1,numdims);
  batchposhidprobs=zeros(numcases,numhid,numbatches);   % hidden-unit probabilities for the whole data set
end

for epoch = epoch:maxepoch,   % 10 epochs in total
 fprintf(1,'epoch %d\r',epoch);
 errsum=0;
 for batch = 1:numbatches,    % every epoch sweeps over all batches
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  data = batchdata(:,:,batch);   % take one batch; each row is one sample (the values are doubles, not 0/1; strictly speaking they should be binarized)
  poshidprobs = 1./(1 + exp(-data*vishid - repmat(hidbiases,numcases,1)));   % hidden-unit probabilities in the positive pass
  batchposhidprobs(:,:,batch)=poshidprobs;
  posprods    = data' * poshidprobs;   % 784*1000, the positive statistics <v_i h_j>; each element accumulates the product of a visible and a hidden unit over the batch
  poshidact   = sum(poshidprobs);      % sums over the samples
  posvisact = sum(data);

%%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  poshidstates = poshidprobs > rand(numcases,numhid);   % binarize the hidden units (done after posprods) by sampling against their probabilities
                                                        % rand(m,n) produces an m*n matrix of values uniformly distributed in (0,1)

%%%%%%%%% START NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  negdata = 1./(1 + exp(-poshidstates*vishid' - repmat(visbiases,numcases,1)));   % reconstructed visible-layer data in the downward pass
  neghidprobs = 1./(1 + exp(-negdata*vishid - repmat(hidbiases,numcases,1)));     % hidden probabilities from propagating the reconstruction back up
  negprods  = negdata'*neghidprobs;   % negative statistics, 784*1000
  neghidact = sum(neghidprobs);
  negvisact = sum(negdata);

%%%%%%%%% END OF NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  err= sum(sum( (data-negdata).^2 ));   % squared reconstruction error
  errsum = err + errsum;                % errsum is only used to report the error of each epoch

   if epoch>5,
     momentum=finalmomentum;      % 0.9; momentum is the fraction of the previous weight increment that is retained
   else
     momentum=initialmomentum;    % 0.5 during the first 5 epochs
   end;

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    vishidinc = momentum*vishidinc + ...   % vishidinc 784*1000, increment added to the weights
                epsilonw*( (posprods-negprods)/numcases - weightcost*vishid);   % posprods/numcases is the expectation of v_i*h_j in the positive phase; negprods/numcases is the corresponding expectation under the reconstruction
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*(posvisact-negvisact);   % these three follow the standard update rule
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*(poshidact-neghidact);

    vishid = vishid + vishidinc;
    visbiases = visbiases + visbiasinc;
    hidbiases = hidbiases + hidbiasinc;

%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 end
 fprintf(1, 'epoch %4i error %6.1f  \n', epoch, errsum);
end;
CG_MNIST.m:
function [f, df] = CG_MNIST(VV,Dim,XX);

l1 = Dim(1);
l2 = Dim(2);
l3 = Dim(3);
l4= Dim(4);
l5= Dim(5);
l6= Dim(6);
l7= Dim(7);
l8= Dim(8);
l9= Dim(9);
N = size(XX,1);   % number of samples

% Do decomversion.
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);   % VV is one long column vector; the extracted matrices already include the biases
 xxx = (l1+1)*l2;                         % xxx records how much of VV has been consumed so far
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w4 = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
 xxx = xxx+(l4+1)*l5;
 w5 = reshape(VV(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
 xxx = xxx+(l5+1)*l6;
 w6 = reshape(VV(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
 xxx = xxx+(l6+1)*l7;
 w7 = reshape(VV(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
 xxx = xxx+(l7+1)*l8;
 w8 = reshape(VV(xxx+1:xxx+(l8+1)*l9),l8+1,l9);   % the steps above turn the parameter vector back into weight matrices

  XX = [XX ones(N,1)];
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  XXout = 1./(1 + exp(-w7probs*w8));

f = -1/N*sum(sum( XX(:,1:end-1).*log(XXout) + (1-XX(:,1:end-1)).*log(1-XXout)));   % cross entropy between the original data and the reconstruction
IO = 1/N*(XXout-XX(:,1:end-1));
Ix8=IO;
dw8 =  w7probs'*Ix8;   % error term of the output layer; it differs from the formula seen earlier because the cost here is the cross entropy, not MSE

Ix7 = (Ix8*w8').*w7probs.*(1-w7probs);
Ix7 = Ix7(:,1:end-1);
dw7 =  w6probs'*Ix7;

Ix6 = (Ix7*w7').*w6probs.*(1-w6probs);
Ix6 = Ix6(:,1:end-1);
dw6 =  w5probs'*Ix6;

Ix5 = (Ix6*w6').*w5probs.*(1-w5probs);
Ix5 = Ix5(:,1:end-1);
dw5 =  w4probs'*Ix5;

Ix4 = (Ix5*w5');          % the code layer is linear, so there is no sigmoid derivative here
Ix4 = Ix4(:,1:end-1);
dw4 =  w3probs'*Ix4;

Ix3 = (Ix4*w4').*w3probs.*(1-w3probs);
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs);
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs);
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw4(:)' dw5(:)' dw6(:)' dw7(:)' dw8(:)'  ]';   % partial derivatives of the network cost
backprop.m:
maxepoch=200;   % 35 epochs already took more than 6 hours in my run; 200 would take over 30 hours, so I stopped it early
fprintf(1,'\nFine-tuning deep autoencoder by minimizing cross entropy error. \n');   % fine-tuning minimizes the cross entropy
fprintf(1,'60 batches of 1000 cases each. \n');

load mnistvh    % load the parameters of the 4 pretrained RBMs
load mnisthp
load mnisthp2
load mnistpo

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases;

%%%% PREINITIALIZE WEIGHTS OF THE AUTOENCODER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
w1=[vishid; hidrecbiases];   % stack each layer's weights and biases into one matrix
w2=[hidpen; penrecbiases];
w3=[hidpen2; penrecbiases2];
w4=[hidtop; toprecbiases];
w5=[hidtop'; topgenbiases];
w6=[hidpen2'; hidgenbiases2];
w7=[hidpen'; hidgenbiases];
w8=[vishid'; visbiases];

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;   % number of units in each layer
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w4,1)-1;
l5=size(w5,1)-1;
l6=size(w6,1)-1;
l7=size(w7,1)-1;
l8=size(w8,1)-1;
l9=l1;             % the output layer has as many units as the input layer
test_err=[];
train_err=[];

for epoch = 1:maxepoch

%%%%%%%%%%%%%%%%%%%% COMPUTE TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  data = [data ones(N,1)];   % append one dimension for the bias term
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];   % forward pass; compute each layer's output and append a constant 1
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err= err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 ));   % reconstruction error
  end
 train_err(epoch)=err/numbatches;   % average reconstruction error on the training set
%%%%%%%%%%%%%% END OF COMPUTING TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%% DISPLAY FIGURE TOP ROW REAL DATA BOTTOM ROW RECONSTRUCTIONS %%%%%%%%%%%%%%%%%%%%%%%%%
fprintf(1,'Displaying in figure 1: Top row - real data, Bottom row -- reconstructions \n');
output=[];
 for ii=1:15
  output = [output data(ii,1:end-1)' dataout(ii,:)'];   % 15 digits are displayed; each group has 2 columns, the original and the reconstruction
 end
   if epoch==1
   close all
   figure('Position',[100,600,1000,200]);
   else
   figure(1)
   end
   mnistdisp(output);
   drawnow;

%%%%%%%%%%%%%%%%%%%% COMPUTE TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[testnumcases testnumdims testnumbatches]=size(testbatchdata);
N=testnumcases;
err=0;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err = err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 ));
  end
 test_err(epoch)=err/testnumbatches;
 fprintf(1,'Before epoch %d Train squared error: %6.3f Test squared error: %6.3f \t \t \n',epoch,train_err(epoch),test_err(epoch));
%%%%%%%%%%%%%% END OF COMPUTING TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 tt=0;
 for batch = 1:numbatches/10   % numbatches is 600 for the training set, so this gives 60 combined batches of 1000 cases
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1;
 data=[];
 for kk=1:10
  data=[data
        batchdata(:,:,(tt-1)*10+kk)];
 end

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;   % conjugate gradient with 3 line searches
  VV = [w1(:)' w2(:)' w3(:)' w4(:)' w5(:)' w6(:)' w7(:)' w8(:)']';   % stack all weights (biases included) into one long column vector
  Dim = [l1; l2; l3; l4; l5; l6; l7; l8; l9];                        % number of units in each layer (excluding the bias unit)

  [X, fX] = minimize(VV,'CG_MNIST',max_iter,Dim,data);

  w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
  xxx = (l1+1)*l2;
  w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
  xxx = xxx+(l2+1)*l3;
  w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
  xxx = xxx+(l3+1)*l4;
  w4 = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
  xxx = xxx+(l4+1)*l5;
  w5 = reshape(X(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
  xxx = xxx+(l5+1)*l6;
  w6 = reshape(X(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
  xxx = xxx+(l6+1)*l7;
  w7 = reshape(X(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
  xxx = xxx+(l7+1)*l8;
  w8 = reshape(X(xxx+1:xxx+(l8+1)*l9),l8+1,l9);   % copy the optimized parameters back into the weight matrices
%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnist_weights w1 w2 w3 w4 w5 w6 w7 w8
 save mnist_error test_err train_err;

end
MNIST recognition experiment:
The MNIST handwritten-digit recognition part is very similar to the dimensionality-reduction part above. It also pretrains the whole network first, except that for recognition the network eventually also includes a softmax output part, and that part is pretrained with a supervised method. The fine-tuning differs as follows: in the dimensionality-reduction experiment the fine-tuning is unsupervised, i.e. the targets are the original input data themselves, whereas in the recognition experiment the targets are the actual labels of the training samples.
When doing MNIST handwritten-digit recognition, the cost function of the network with the softmax part added has to be computed; the author's program provides two functions for this. The first one is used to pretrain the softmax classifier:
function [f, df] = CG_CLASSIFY_INIT(VV,Dim,w3probs,target);
This function is used only for pretraining the softmax classifier part, because the initial RBM pretraining does not include the softmax output layer. The input VV is the weight vector of the softmax part of the network, Dim is the vector with the number of units in the 2 layers of that part, w3probs is the sample set used to train the softmax layer, and target holds the corresponding labels. f and df are the cost function of the softmax network and its partial derivatives.
The other one computes the cost actually used for fine-tuning the network:
function [f, df] = CG_CLASSIFY(VV,Dim,XX,target);
Its input VV is the parameter vector of the whole network, Dim is the vector with the number of units in each layer, XX is the training sample set, and target holds the labels of the training samples; f and df are the cost of the whole network and the partial derivatives of that cost.
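In backpropclassify.m below, both functions are invoked through minimize; during the first few epochs only the softmax weights are optimized, and afterwards the whole network (VV and Dim are set up differently before each call, as the full listing shows):

% First few epochs: optimize only the softmax weights (w3probs are the fixed features of the top hidden layer).
[X, fX] = minimize(VV,'CG_CLASSIFY_INIT',max_iter,Dim,w3probs,targets);
% Later epochs: optimize all the weights of the network at once.
[X, fX] = minimize(VV,'CG_CLASSIFY',max_iter,Dim,data,targets);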
Experimental results:
The author uses one input layer, three hidden layers and a softmax output layer for classification; the layer sizes are 784-500-500-2000-10.
The final recognition error rate is 1.2%.
Main parts of the experiment code, with comments:
mnistclassify.m:
clear all
close all

maxepoch=50;
numhid=500; numpen=500; numpen2=2000;

fprintf(1,'Converting Raw files into Matlab format \n');
converter;

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;
[numcases numdims numbatches]=size(batchdata);

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;
hidrecbiases=hidbiases;
save mnistvhclassify vishid hidrecbiases visbiases;   % mnistvhclassify is the file the first layer's weights are saved to

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;
numhid=numpen;
restart=1;
rbm;
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthpclassify hidpen penrecbiases hidgenbiases;   % likewise, the 2nd layer's weights

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;
numhid=numpen2;
restart=1;
rbm;
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;
save mnisthp2classify hidpen2 penrecbiases2 hidgenbiases2;

backpropclassify;
backpropclassify.m:
maxepoch=200;
fprintf(1,'\nTraining discriminative model on MNIST by minimizing cross entropy error. \n');
fprintf(1,'60 batches of 1000 cases each. \n');

load mnistvhclassify   % load the pretrained weights of the 3 RBMs
load mnisthpclassify
load mnisthp2classify

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases;

%%%% PREINITIALIZE WEIGHTS OF THE DISCRIMINATIVE MODEL%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

w1=[vishid; hidrecbiases];
w2=[hidpen; penrecbiases];
w3=[hidpen2; penrecbiases2];
w_class = 0.1*randn(size(w3,2)+1,10);   % for classification the last layer outputs 10 units directly, i.e. a softmax classifier

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w_class,1)-1;
l5=10;

test_err=[];
train_err=[];

for epoch = 1:maxepoch   % 200

%%%%%%%%%%%%%%%%%%%% COMPUTE TRAINING MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
err_cr=0;
counter=0;
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  target = [batchtargets(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);   % softmax classifier

  [I J]=max(targetout,[],2);     % J is the index of the predicted class
  [I1 J1]=max(target,[],2);
  counter=counter+length(find(J==J1));   % length(find(J==J1)) is the number of cases where the prediction equals the label
  err_cr = err_cr- sum(sum( target(:,1:end).*log(targetout))) ;
 end
 train_err(epoch)=(numcases*numbatches-counter);   % number of training misclassifications in this epoch
 train_crerr(epoch)=err_cr/numbatches;

%%%%%%%%%%%%%% END OF COMPUTING TRAINING MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%% COMPUTE TEST MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
err_cr=0;
counter=0;
[testnumcases testnumdims testnumbatches]=size(testbatchdata);
N=testnumcases;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  target = [testbatchtargets(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);

  [I J]=max(targetout,[],2);
  [I1 J1]=max(target,[],2);
  counter=counter+length(find(J==J1));
  err_cr = err_cr- sum(sum( target(:,1:end).*log(targetout))) ;
end
 test_err(epoch)=(testnumcases*testnumbatches-counter);   % test misclassifications, still measured on top of the pretrained weights
 test_crerr(epoch)=err_cr/testnumbatches;
 fprintf(1,'Before epoch %d Train # misclassified: %d (from %d). Test # misclassified: %d (from %d) \t \t \n',...
            epoch,train_err(epoch),numcases*numbatches,test_err(epoch),testnumcases*testnumbatches);

%%%%%%%%%%%%%% END OF COMPUTING TEST MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 tt=0;
 for batch = 1:numbatches/10
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1;
 data=[];
 targets=[];
 for kk=1:10
  data=[data
        batchdata(:,:,(tt-1)*10+kk)];
  targets=[targets
        batchtargets(:,:,(tt-1)*10+kk)];
 end

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;

  if epoch<6   % First update top-level weights holding other weights fixed: the first 5 epochs (epoch<6) pretrain only the softmax part
    N = size(data,1);
    XX = [data ones(N,1)];
    w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
    w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
    w3probs = 1./(1 + exp(-w2probs*w3)); %w3probs = [w3probs  ones(N,1)];

    VV = [w_class(:)']';
    Dim = [l4; l5];
    [X, fX] = minimize(VV,'CG_CLASSIFY_INIT',max_iter,Dim,w3probs,targets);
    w_class = reshape(X,l4+1,l5);

  else
    VV = [w1(:)' w2(:)' w3(:)' w_class(:)']';
    Dim = [l1; l2; l3; l4; l5];
    [X, fX] = minimize(VV,'CG_CLASSIFY',max_iter,Dim,data,targets);

    w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
    xxx = (l1+1)*l2;
    w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
    xxx = xxx+(l2+1)*l3;
    w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
    xxx = xxx+(l3+1)*l4;
    w_class = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
  end
%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnistclassify_weights w1 w2 w3 w_class
 save mnistclassify_error test_err test_crerr train_err train_crerr;

end
CG_CLASSIFY_INIT.m:
function [f, df] = CG_CLASSIFY_INIT(VV,Dim,w3probs,target);   % only a 2-layer network
l1 = Dim(1);
l2 = Dim(2);
N = size(w3probs,1);   % N is the number of training samples
% Do decomversion.
  w_class = reshape(VV,l1+1,l2);
  w3probs = [w3probs  ones(N,1)];

  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);
  f = -sum(sum( target(:,1:end).*log(targetout))) ;   % f is the cost (cross entropy) of the softmax classifier

IO = (targetout-target(:,1:end));
Ix_class=IO;
dw_class =  w3probs'*Ix_class;   % partial derivatives

df = [dw_class(:)']';
CG_CLASSIFY.m:
function [f, df] = CG_CLASSIFY(VV,Dim,XX,target);
l1 = Dim(1);
l2 = Dim(2);
l3= Dim(3);
l4= Dim(4);
l5= Dim(5);
N = size(XX,1);

% Do decomversion.
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);
 xxx = (l1+1)*l2;
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w_class = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);

  XX = [XX ones(N,1)];
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];

  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);
  f = -sum(sum( target(:,1:end).*log(targetout))) ;

IO = (targetout-target(:,1:end));
Ix_class=IO;
dw_class =  w3probs'*Ix_class;

Ix3 = (Ix_class*w_class').*w3probs.*(1-w3probs);
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs);
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs);
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw_class(:)']';
Experiment summary:
1. I have finally read through an RBM implementation. After all the formula-heavy theory, it is very satisfying to read the corresponding code.
2. Since the training here uses whole images (not patches of them), there is no convolution or pooling; during pretraining the input of each RBM is simply the output of the previous RBM. When no softmax layer is added, the fine-tuning stage is still unsupervised (the targets are still the original input data); once the softmax layer is added, the fine-tuning uses the true labels of the training samples, because the task is now classification.
3. The deeper the network, the longer the fine-tuning takes; it needs a lot of time to converge even with pretraining.
4. I have not yet figured out where the data for the second layer would come from when training with convolution on large images. It might, as above, be the output of the previous layer (but then how would fine-tuning work without label data?), or it might be the output obtained after the large image passes through the first layer's convolution and pooling (in which case the network cost function is harder to handle, because it involves the convolution and pooling operations).
References:
Reducing the dimensionality of data with neural networks
http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html
The difference between mod (modulo) and rem (remainder), MATLAB study notes