Preface:
This post is the exercise part of the previous post Deep learning: 34 (Dimensionality reduction with neural networks), i.e. the code accompanying Hinton's Science paper Reducing the dimensionality of data with neural networks; the code can be downloaded from http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html. I spent some time reading and running it. The code is really two separate projects. One uses the MNIST database only for deep autoencoder compression; it is trained in an unsupervised way and evaluated by the reconstruction error (MSE). The other project is MNIST handwritten-digit recognition: the pretraining of the network is unsupervised, while the fine-tuning is supervised, and the evaluation criterion is the recognition rate (equivalently, the error rate).
MNIST dimensionality-reduction experiment:
This experiment trains a deep autoencoder with 4 hidden layers. The input layer has 784 units, and the 4 hidden layers have 1000, 500, 250 and 30 units respectively. The procedure for obtaining the weights of the whole network is as follows:
- First train the first RBM, i.e. the network formed by the 784-dimensional input layer and the first hidden layer of 1000 units. It is trained with the RBM learning procedure on the training samples; once training is done, compute the hidden-layer outputs for the training samples.
- Use the result of step 1 as the input for training the second RBM, again optimized with the RBM procedure, and compute its outputs. Train the third and fourth RBMs in the same way.
- Unroll the 4 RBMs above and connect them into a new network, split into an encoder part and a decoder part, and initialize this new network with the weights obtained in steps 1 and 2 (a short excerpt showing this initialization follows this list).
- Since the final output of the new network has the same number of units as the original input, the original input values can be used as the target outputs of the network; the cost function and its partial derivatives can then be computed with the back-propagation algorithm.
- Starting from the initial values of step 3 and using the cost and derivatives of step 4, optimize the whole new network with conjugate-gradient descent to obtain the final weights. The entire procedure above is unsupervised.
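In the code this unrolling simply reuses each pretrained RBM weight matrix for the encoder layers and its transpose for the decoder layers; the corresponding initialization (quoted from backprop.m, which is listed in full later) looks like this:

% Encoder layers: pretrained RBM weights with the hidden (recognition) biases.
w1 = [vishid;   hidrecbiases];    % 784 -> 1000
w2 = [hidpen;   penrecbiases];    % 1000 -> 500
w3 = [hidpen2;  penrecbiases2];   % 500 -> 250
w4 = [hidtop;   toprecbiases];    % 250 -> 30   (the code layer)
% Decoder layers: the same matrices transposed, with the visible (generative) biases.
w5 = [hidtop';  topgenbiases];    % 30 -> 250
w6 = [hidpen2'; hidgenbiases2];   % 250 -> 500
w7 = [hidpen';  hidgenbiases];    % 500 -> 1000
w8 = [vishid';  visbiases];       % 1000 -> 784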
Some MATLAB functions:
rem and mod:
Reference: The difference between mod (modulo) and rem (remainder), MATLAB study notes.
The modulo operation is often also called the remainder operation; both return a remainder. The only difference between rem and mod is:
When x and y have the same sign, the two functions give the same result; when the signs differ, the result of rem has the same sign as x, while the result of mod has the same sign as y. This comes from how the two functions are defined: rem uses fix while mod uses floor (both are rounding functions; fix rounds towards 0, floor rounds towards negative infinity). rem(x,y) returns x - n.*y with n = fix(x./y) when y is nonzero, whereas mod(x,y) returns x - n.*y with n = floor(x./y) when y is nonzero.
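A quick check in MATLAB (this snippet is just an illustration, not part of Hinton's code) makes the sign behaviour concrete:

x = 5;  y = -3;
rem(x,y)    % returns  2, same sign as x: fix(5/-3) = -1, and 5 - (-1)*(-3) = 2
mod(x,y)    % returns -1, same sign as y: floor(5/-3) = -2, and 5 - (-2)*(-3) = -1
rem(-5,3)   % returns -2
mod(-5,3)   % returns  1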
M-files in the project:
converter.m:
Converts the sample set from .ubyte format to .ascii format, and then to .mat format.
makebatches.m:
Turns the original 2-D data set into a 3-D array, because the data are split into multiple batches and the extra dimension indexes the batch.
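As a rough sketch only (the real makebatches.m additionally shuffles the samples and also builds the target batches; the variable name digitdata is assumed here), splitting a 2-D sample matrix into batches could look like:

% Hypothetical sketch: split an N*D matrix 'digitdata' into batches of 100 cases each.
numcases   = 100;                               % samples per batch
[N, D]     = size(digitdata);
numbatches = floor(N/numcases);
batchdata  = zeros(numcases, D, numbatches);    % 3-D array: cases x dims x batches
for b = 1:numbatches
  batchdata(:,:,b) = digitdata((b-1)*numcases+1 : b*numcases, :);
end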
Here is roughly how the program optimizes the RBM weights (assume a 2-layer RBM, i.e. only a visible layer and a hidden layer, with binary units in both layers):
- Randomly initialize a weight matrix w and bias vectors b for the network.
- Propagate the visible-layer input matrix v upwards to compute the hidden-layer output matrix h, and compute the matrix of mean products of the corresponding units of v and h.
- The output h in step 2 consists of probabilities; stochastically binarize it into 0/1 values.
- Propagate the binarized h of step 3 back down to compute the visible-layer matrix v'. (In principle this v' should also be binarized.)
- Propagate v' upwards to compute the hidden-layer matrix h', and compute the matrix of mean products of the corresponding units of v' and h'.
- Subtract the mean-product matrix of step 5 from the one of step 2; the result is the increment matrix for the corresponding weights.
- Combine this with the corresponding learning rate and update the weights using the weight-update rule.
- Repeat steps 2 to 7 until convergence.
Bias optimization steps:
- Randomly initialize a weight matrix w and bias vectors b for the network.
- Propagate the visible-layer input matrix v upwards to compute the hidden-layer output matrix h, and compute the mean vector of the v-layer samples and the mean vector of the h layer.
- The output h in step 2 consists of probabilities; stochastically binarize it into 0/1 values.
- Propagate the binarized h of step 3 back down to compute the visible-layer matrix v'.
- Propagate v' upwards to compute the hidden-layer matrix h', and compute the mean vector of the v'-layer samples and the mean vector of the h' layer.
- Subtract the v' mean vector of step 5 from the v mean vector of step 2; the result is the increment vector for the visible-layer biases. Subtract the h' mean vector of step 5 from the h mean vector of step 2; the result is the increment vector for the hidden-layer biases.
- Combine these with the corresponding learning rates and update the biases using the update rule.
- Repeat steps 2 to 7 until convergence.
Of course, the weight update and the bias update are performed together in every iteration, so they should converge at the same time. The update rule can also be modified slightly, for instance by adding a momentum term, so that part of the previous update increment is carried over into the current update; a condensed sketch of one such update step is given right below.
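The two procedures above together amount to one step of contrastive divergence (CD-1). The following is a minimal, self-contained sketch of a single update, written to mirror the logic of rbm.m below but not taken from the author's code; the variables v, w, vb, hb, winc, vbinc, hbinc, epsilon, weightcost and momentum are assumed to be initialized elsewhere, with v of size numcases*numdims and w of size numdims*numhid.

% One CD-1 update step (sketch).
hprobs  = 1./(1 + exp(-v*w - repmat(hb,numcases,1)));         % step 2: upward pass, hidden probabilities
posprod = v'*hprobs;                                          % positive statistics <v_i h_j>
hstates = hprobs > rand(numcases,numhid);                     % step 3: binarize the hidden units
vrecon  = 1./(1 + exp(-hstates*w' - repmat(vb,numcases,1)));  % step 4: downward pass, reconstruction
hrecon  = 1./(1 + exp(-vrecon*w - repmat(hb,numcases,1)));    % step 5: upward pass on the reconstruction
negprod = vrecon'*hrecon;                                     % negative statistics

winc  = momentum*winc  + epsilon*((posprod-negprod)/numcases - weightcost*w);  % steps 6-7 for the weights
vbinc = momentum*vbinc + (epsilon/numcases)*(sum(v)      - sum(vrecon));        % visible-bias increment
hbinc = momentum*hbinc + (epsilon/numcases)*(sum(hprobs) - sum(hrecon));        % hidden-bias increment
w  = w  + winc;   vb = vb + vbinc;   hb = hb + hbinc;                           % apply the increments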
The function CG_MNIST has the following form:
function [f, df] = CG_MNIST(VV,Dim,XX);
This function computes the network cost f and the partial derivatives df of f with respect to all network parameters, handling weights and biases together. The parameter VV is a column vector containing all network parameters, Dim is a vector with the number of units in each layer, and XX is the set of training samples. f and df are the cost of the network and the gradient of that cost, respectively.
The conjugate-gradient optimization function has the form:
[X, fX, i] = minimize(X, f, length, P1, P2, P3, ... )
This function optimizes the parameters X with the conjugate-gradient method, so X holds the network parameters as a column vector. f is the name of a function that computes the network cost and its partial derivatives with respect to X; it is called with X followed by the extra arguments P1, P2, P3, … given to minimize. length is the maximum number of line searches used by the conjugate-gradient optimization. The return value X is the optimized parameter vector, fX contains the cost values at those parameters, and i is the number of line searches (i.e. iterations) actually used.
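For example, in backprop.m below it is called as follows, where max_iter plays the role of the length argument and Dim and data are passed through to CG_MNIST:

max_iter = 3;                                          % at most 3 line searches
[X, fX] = minimize(VV,'CG_MNIST',max_iter,Dim,data);   % VV: initial parameter vector; Dim and data become P1 and P2 of CG_MNIST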
Experimental results:
The author set the number of fine-tuning epochs to 200. In my run the first 35 epochs already took more than 6 hours, so I did not wait for the full run (it would have taken over 30 hours). The original digits and their reconstructions at that point are shown below:
The mean squared errors are:
Train squared error: 4.318
Test squared error: 4.520
Main parts of the experiment code, with comments:
mnistdeepauto.m:
clear all
close all

maxepoch=10;        %In the Science paper we use maxepoch=50, but it works just fine.
numhid=1000; numpen=500; numpen2=250; numopen=30;

fprintf(1,'Converting Raw files into Matlab format \n');
converter;                          % convert the data into Matlab format

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;
[numcases numdims numbatches]=size(batchdata);

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;
hidrecbiases=hidbiases;             % hidbiases are the hidden-layer biases
save mnistvh vishid hidrecbiases visbiases;   % save this layer's variables: weights, hidden biases, visible biases

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;         % batchposhidprobs holds the output probabilities of the first RBM
numhid=numpen;
restart=1;
rbm;                                % train the 2nd RBM
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthp hidpen penrecbiases hidgenbiases;   % mnisthp is the file name used for saving

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;
numhid=numpen2;
restart=1;
rbm;
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;   % 3rd RBM
save mnisthp2 hidpen2 penrecbiases2 hidgenbiases2;

fprintf(1,'\nPretraining Layer 4 with RBM: %d-%d \n',numpen2,numopen);
batchdata=batchposhidprobs;
numhid=numopen;
restart=1;
rbmhidlinear;
hidtop=vishid; toprecbiases=hidbiases; topgenbiases=visbiases;      % 4th RBM (linear hidden units)
save mnistpo hidtop toprecbiases topgenbiases;

backprop;
rbm.m:
epsilonw      = 0.1;   % Learning rate for weights
epsilonvb     = 0.1;   % Learning rate for biases of visible units
epsilonhb     = 0.1;   % Learning rate for biases of hidden units
% Note that the visible and hidden layers have separate bias vectors, while the (symmetric) weights are shared.
weightcost  = 0.0002;
initialmomentum  = 0.5;
finalmomentum    = 0.9;

[numcases numdims numbatches]=size(batchdata);   % [100,784,600]

if restart ==1,
  restart=0;
  epoch=1;

% Initializing symmetric weights and biases.
  vishid     = 0.1*randn(numdims, numhid);   % small random initial weights, 784*1000
  hidbiases  = zeros(1,numhid);              % biases initialized to 0
  visbiases  = zeros(1,numdims);

  poshidprobs = zeros(numcases,numhid);      % 100*1000, hidden-unit probabilities for one batch in the positive (upward) pass
  neghidprobs = zeros(numcases,numhid);
  posprods    = zeros(numdims,numhid);       % 784*1000
  negprods    = zeros(numdims,numhid);
  vishidinc  = zeros(numdims,numhid);
  hidbiasinc = zeros(1,numhid);
  visbiasinc = zeros(1,numdims);
  batchposhidprobs=zeros(numcases,numhid,numbatches);   % hidden-unit probabilities for the whole data set
end

for epoch = epoch:maxepoch,   % 10 epochs in total
 fprintf(1,'epoch %d\r',epoch);
 errsum=0;
 for batch = 1:numbatches,    % every epoch sweeps over all batches
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  data = batchdata(:,:,batch);   % take one batch; each row is one sample (the values are doubles, not 0/1; strictly speaking they should be binarized)
  poshidprobs = 1./(1 + exp(-data*vishid - repmat(hidbiases,numcases,1)));   % hidden-unit probabilities in the positive pass
  batchposhidprobs(:,:,batch)=poshidprobs;
  posprods    = data' * poshidprobs;   % 784*1000, the positive statistics <v_i h_j>; each element accumulates the product of a visible and a hidden unit over the batch
  poshidact   = sum(poshidprobs);      % sums over the samples
  posvisact = sum(data);

%%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  poshidstates = poshidprobs > rand(numcases,numhid);   % binarize the hidden units (done after posprods) by sampling against their probabilities
                                                        % rand(m,n) produces an m*n matrix of values uniformly distributed in (0,1)

%%%%%%%%% START NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  negdata = 1./(1 + exp(-poshidstates*vishid' - repmat(visbiases,numcases,1)));   % reconstructed visible-layer data in the downward pass
  neghidprobs = 1./(1 + exp(-negdata*vishid - repmat(hidbiases,numcases,1)));     % hidden probabilities from propagating the reconstruction back up
  negprods  = negdata'*neghidprobs;   % negative statistics, 784*1000
  neghidact = sum(neghidprobs);
  negvisact = sum(negdata);

%%%%%%%%% END OF NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  err= sum(sum( (data-negdata).^2 ));   % squared reconstruction error
  errsum = err + errsum;                % errsum is only used to report the error of each epoch

   if epoch>5,
     momentum=finalmomentum;      % 0.9; momentum is the fraction of the previous weight increment that is retained
   else
     momentum=initialmomentum;    % 0.5 during the first 5 epochs
   end;

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    vishidinc = momentum*vishidinc + ...   % vishidinc 784*1000, increment added to the weights
                epsilonw*( (posprods-negprods)/numcases - weightcost*vishid);   % posprods/numcases is the expectation of v_i*h_j in the positive phase; negprods/numcases is the corresponding expectation under the reconstruction
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*(posvisact-negvisact);   % these three follow the standard update rule
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*(poshidact-neghidact);

    vishid = vishid + vishidinc;
    visbiases = visbiases + visbiasinc;
    hidbiases = hidbiases + hidbiasinc;

%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 end
 fprintf(1, 'epoch %4i error %6.1f  \n', epoch, errsum);
end;
CG_MNIST.m:
function [f, df] = CG_MNIST(VV,Dim,XX);

l1 = Dim(1);
l2 = Dim(2);
l3 = Dim(3);
l4= Dim(4);
l5= Dim(5);
l6= Dim(6);
l7= Dim(7);
l8= Dim(8);
l9= Dim(9);
N = size(XX,1);   % number of samples

% Do decomversion.
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);   % VV is one long column vector; the extracted matrices already include the biases
 xxx = (l1+1)*l2;                         % xxx records how much of VV has been consumed so far
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w4 = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
 xxx = xxx+(l4+1)*l5;
 w5 = reshape(VV(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
 xxx = xxx+(l5+1)*l6;
 w6 = reshape(VV(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
 xxx = xxx+(l6+1)*l7;
 w7 = reshape(VV(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
 xxx = xxx+(l7+1)*l8;
 w8 = reshape(VV(xxx+1:xxx+(l8+1)*l9),l8+1,l9);   % the steps above turn the parameter vector back into weight matrices

  XX = [XX ones(N,1)];
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  XXout = 1./(1 + exp(-w7probs*w8));

f = -1/N*sum(sum( XX(:,1:end-1).*log(XXout) + (1-XX(:,1:end-1)).*log(1-XXout)));   % cross entropy between the original data and the reconstruction
IO = 1/N*(XXout-XX(:,1:end-1));
Ix8=IO;
dw8 =  w7probs'*Ix8;   % error term of the output layer; it differs from the formula seen earlier because the cost here is the cross entropy, not MSE

Ix7 = (Ix8*w8').*w7probs.*(1-w7probs);
Ix7 = Ix7(:,1:end-1);
dw7 =  w6probs'*Ix7;

Ix6 = (Ix7*w7').*w6probs.*(1-w6probs);
Ix6 = Ix6(:,1:end-1);
dw6 =  w5probs'*Ix6;

Ix5 = (Ix6*w6').*w5probs.*(1-w5probs);
Ix5 = Ix5(:,1:end-1);
dw5 =  w4probs'*Ix5;

Ix4 = (Ix5*w5');          % the code layer is linear, so there is no sigmoid derivative here
Ix4 = Ix4(:,1:end-1);
dw4 =  w3probs'*Ix4;

Ix3 = (Ix4*w4').*w3probs.*(1-w3probs);
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs);
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs);
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw4(:)' dw5(:)' dw6(:)' dw7(:)' dw8(:)'  ]';   % partial derivatives of the network cost
backprop.m:
maxepoch=200;   % 35 epochs already took more than 6 hours in my run; 200 would take over 30 hours, so I stopped it early
fprintf(1,'\nFine-tuning deep autoencoder by minimizing cross entropy error. \n');   % fine-tuning minimizes the cross entropy
fprintf(1,'60 batches of 1000 cases each. \n');

load mnistvh    % load the parameters of the 4 pretrained RBMs
load mnisthp
load mnisthp2
load mnistpo

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases;

%%%% PREINITIALIZE WEIGHTS OF THE AUTOENCODER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
w1=[vishid; hidrecbiases];   % stack each layer's weights and biases into one matrix
w2=[hidpen; penrecbiases];
w3=[hidpen2; penrecbiases2];
w4=[hidtop; toprecbiases];
w5=[hidtop'; topgenbiases];
w6=[hidpen2'; hidgenbiases2];
w7=[hidpen'; hidgenbiases];
w8=[vishid'; visbiases];

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;   % number of units in each layer
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w4,1)-1;
l5=size(w5,1)-1;
l6=size(w6,1)-1;
l7=size(w7,1)-1;
l8=size(w8,1)-1;
l9=l1;             % the output layer has as many units as the input layer
test_err=[];
train_err=[];

for epoch = 1:maxepoch

%%%%%%%%%%%%%%%%%%%% COMPUTE TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  data = [data ones(N,1)];   % append one dimension for the bias term
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];   % forward pass; compute each layer's output and append a constant 1
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err= err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 ));   % reconstruction error
  end
 train_err(epoch)=err/numbatches;   % average reconstruction error on the training set
%%%%%%%%%%%%%% END OF COMPUTING TRAINING RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%% DISPLAY FIGURE TOP ROW REAL DATA BOTTOM ROW RECONSTRUCTIONS %%%%%%%%%%%%%%%%%%%%%%%%%
fprintf(1,'Displaying in figure 1: Top row - real data, Bottom row -- reconstructions \n');
output=[];
 for ii=1:15
  output = [output data(ii,1:end-1)' dataout(ii,:)'];   % 15 digits are displayed; each group has 2 columns, the original and the reconstruction
 end
   if epoch==1
   close all
   figure('Position',[100,600,1000,200]);
   else
   figure(1)
   end
   mnistdisp(output);
   drawnow;

%%%%%%%%%%%%%%%%%%%% COMPUTE TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[testnumcases testnumdims testnumbatches]=size(testbatchdata);
N=testnumcases;
err=0;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  w4probs = w3probs*w4; w4probs = [w4probs  ones(N,1)];
  w5probs = 1./(1 + exp(-w4probs*w5)); w5probs = [w5probs  ones(N,1)];
  w6probs = 1./(1 + exp(-w5probs*w6)); w6probs = [w6probs  ones(N,1)];
  w7probs = 1./(1 + exp(-w6probs*w7)); w7probs = [w7probs  ones(N,1)];
  dataout = 1./(1 + exp(-w7probs*w8));
  err = err +  1/N*sum(sum( (data(:,1:end-1)-dataout).^2 ));
  end
 test_err(epoch)=err/testnumbatches;
 fprintf(1,'Before epoch %d Train squared error: %6.3f Test squared error: %6.3f \t \t \n',epoch,train_err(epoch),test_err(epoch));
%%%%%%%%%%%%%% END OF COMPUTING TEST RECONSTRUCTION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 tt=0;
 for batch = 1:numbatches/10   % numbatches is 600 for the training set, so this gives 60 combined batches of 1000 cases
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1;
 data=[];
 for kk=1:10
  data=[data
        batchdata(:,:,(tt-1)*10+kk)];
 end

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;   % conjugate gradient with 3 line searches
  VV = [w1(:)' w2(:)' w3(:)' w4(:)' w5(:)' w6(:)' w7(:)' w8(:)']';   % stack all weights (biases included) into one long column vector
  Dim = [l1; l2; l3; l4; l5; l6; l7; l8; l9];                        % number of units in each layer (excluding the bias unit)

  [X, fX] = minimize(VV,'CG_MNIST',max_iter,Dim,data);

  w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
  xxx = (l1+1)*l2;
  w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
  xxx = xxx+(l2+1)*l3;
  w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
  xxx = xxx+(l3+1)*l4;
  w4 = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
  xxx = xxx+(l4+1)*l5;
  w5 = reshape(X(xxx+1:xxx+(l5+1)*l6),l5+1,l6);
  xxx = xxx+(l5+1)*l6;
  w6 = reshape(X(xxx+1:xxx+(l6+1)*l7),l6+1,l7);
  xxx = xxx+(l6+1)*l7;
  w7 = reshape(X(xxx+1:xxx+(l7+1)*l8),l7+1,l8);
  xxx = xxx+(l7+1)*l8;
  w8 = reshape(X(xxx+1:xxx+(l8+1)*l9),l8+1,l9);   % copy the optimized parameters back into the weight matrices
%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnist_weights w1 w2 w3 w4 w5 w6 w7 w8
 save mnist_error test_err train_err;

end
MNIST recognition experiment:
The MNIST handwritten-digit recognition part is very similar to the dimensionality-reduction part above. It also pretrains the whole network first, except that for recognition the network eventually also includes a softmax output part, and that part is pretrained with a supervised method. The fine-tuning differs as follows: in the dimensionality-reduction experiment the fine-tuning is unsupervised, i.e. the targets are the original input data themselves, whereas in the recognition experiment the targets are the actual labels of the training samples.
When doing MNIST handwritten-digit recognition, the cost function of the network with the softmax part added has to be computed; the author's program provides two functions for this. The first one is used to pretrain the softmax classifier:
function [f, df] = CG_CLASSIFY_INIT(VV,Dim,w3probs,target);
This function is used only for pretraining the softmax classifier part, because the initial RBM pretraining does not include the softmax output layer. The input VV is the weight vector of the softmax part of the network, Dim is the vector with the number of units in the 2 layers of that part, w3probs is the sample set used to train the softmax layer, and target holds the corresponding labels. f and df are the cost function of the softmax network and its partial derivatives.
The other one computes the cost actually used for fine-tuning the network:
function [f, df] = CG_CLASSIFY(VV,Dim,XX,target);
Its input VV is the parameter vector of the whole network, Dim is the vector with the number of units in each layer, XX is the training sample set, and target holds the labels of the training samples; f and df are the cost of the whole network and the partial derivatives of that cost.
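In backpropclassify.m below, both functions are invoked through minimize; during the first few epochs only the softmax weights are optimized, and afterwards the whole network (VV and Dim are set up differently before each call, as the full listing shows):

% First few epochs: optimize only the softmax weights (w3probs are the fixed features of the top hidden layer).
[X, fX] = minimize(VV,'CG_CLASSIFY_INIT',max_iter,Dim,w3probs,targets);
% Later epochs: optimize all the weights of the network at once.
[X, fX] = minimize(VV,'CG_CLASSIFY',max_iter,Dim,data,targets);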
Experimental results:
The author uses one input layer, three hidden layers and a softmax output layer for classification; the layer sizes are 784-500-500-2000-10.
The final recognition error rate is 1.2%.
Main parts of the experiment code, with comments:
mnistclassify.m:
clear all
close all

maxepoch=50;
numhid=500; numpen=500; numpen2=2000;

fprintf(1,'Converting Raw files into Matlab format \n');
converter;

fprintf(1,'Pretraining a deep autoencoder. \n');
fprintf(1,'The Science paper used 50 epochs. This uses %3i \n', maxepoch);

makebatches;
[numcases numdims numbatches]=size(batchdata);

fprintf(1,'Pretraining Layer 1 with RBM: %d-%d \n',numdims,numhid);
restart=1;
rbm;
hidrecbiases=hidbiases;
save mnistvhclassify vishid hidrecbiases visbiases;   % mnistvhclassify is the file the first layer's weights are saved to

fprintf(1,'\nPretraining Layer 2 with RBM: %d-%d \n',numhid,numpen);
batchdata=batchposhidprobs;
numhid=numpen;
restart=1;
rbm;
hidpen=vishid; penrecbiases=hidbiases; hidgenbiases=visbiases;
save mnisthpclassify hidpen penrecbiases hidgenbiases;   % likewise, the 2nd layer's weights

fprintf(1,'\nPretraining Layer 3 with RBM: %d-%d \n',numpen,numpen2);
batchdata=batchposhidprobs;
numhid=numpen2;
restart=1;
rbm;
hidpen2=vishid; penrecbiases2=hidbiases; hidgenbiases2=visbiases;
save mnisthp2classify hidpen2 penrecbiases2 hidgenbiases2;

backpropclassify;
backpropclassify.m:
maxepoch=200;
fprintf(1,'\nTraining discriminative model on MNIST by minimizing cross entropy error. \n');
fprintf(1,'60 batches of 1000 cases each. \n');

load mnistvhclassify   % load the pretrained weights of the 3 RBMs
load mnisthpclassify
load mnisthp2classify

makebatches;
[numcases numdims numbatches]=size(batchdata);
N=numcases;

%%%% PREINITIALIZE WEIGHTS OF THE DISCRIMINATIVE MODEL%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

w1=[vishid; hidrecbiases];
w2=[hidpen; penrecbiases];
w3=[hidpen2; penrecbiases2];
w_class = 0.1*randn(size(w3,2)+1,10);   % for classification the last layer outputs 10 units directly, i.e. a softmax classifier

%%%%%%%%%% END OF PREINITIALIZATIO OF WEIGHTS  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

l1=size(w1,1)-1;
l2=size(w2,1)-1;
l3=size(w3,1)-1;
l4=size(w_class,1)-1;
l5=10;

test_err=[];
train_err=[];

for epoch = 1:maxepoch   % 200

%%%%%%%%%%%%%%%%%%%% COMPUTE TRAINING MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
err_cr=0;
counter=0;
[numcases numdims numbatches]=size(batchdata);
N=numcases;
 for batch = 1:numbatches
  data = [batchdata(:,:,batch)];
  target = [batchtargets(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);   % softmax classifier

  [I J]=max(targetout,[],2);     % J is the index of the predicted class
  [I1 J1]=max(target,[],2);
  counter=counter+length(find(J==J1));   % length(find(J==J1)) is the number of cases where the prediction equals the label
  err_cr = err_cr- sum(sum( target(:,1:end).*log(targetout))) ;
 end
 train_err(epoch)=(numcases*numbatches-counter);   % number of training misclassifications in this epoch
 train_crerr(epoch)=err_cr/numbatches;

%%%%%%%%%%%%%% END OF COMPUTING TRAINING MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%% COMPUTE TEST MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
err=0;
err_cr=0;
counter=0;
[testnumcases testnumdims testnumbatches]=size(testbatchdata);
N=testnumcases;
for batch = 1:testnumbatches
  data = [testbatchdata(:,:,batch)];
  target = [testbatchtargets(:,:,batch)];
  data = [data ones(N,1)];
  w1probs = 1./(1 + exp(-data*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];
  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);

  [I J]=max(targetout,[],2);
  [I1 J1]=max(target,[],2);
  counter=counter+length(find(J==J1));
  err_cr = err_cr- sum(sum( target(:,1:end).*log(targetout))) ;
end
 test_err(epoch)=(testnumcases*testnumbatches-counter);   % test misclassifications, still measured on top of the pretrained weights
 test_crerr(epoch)=err_cr/testnumbatches;
 fprintf(1,'Before epoch %d Train # misclassified: %d (from %d). Test # misclassified: %d (from %d) \t \t \n',...
            epoch,train_err(epoch),numcases*numbatches,test_err(epoch),testnumcases*testnumbatches);

%%%%%%%%%%%%%% END OF COMPUTING TEST MISCLASSIFICATION ERROR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 tt=0;
 for batch = 1:numbatches/10
 fprintf(1,'epoch %d batch %d\r',epoch,batch);

%%%%%%%%%%% COMBINE 10 MINIBATCHES INTO 1 LARGER MINIBATCH %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 tt=tt+1;
 data=[];
 targets=[];
 for kk=1:10
  data=[data
        batchdata(:,:,(tt-1)*10+kk)];
  targets=[targets
        batchtargets(:,:,(tt-1)*10+kk)];
 end

%%%%%%%%%%%%%%% PERFORM CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  max_iter=3;

  if epoch<6   % First update top-level weights holding other weights fixed: the first 5 epochs (epoch<6) pretrain only the softmax part
    N = size(data,1);
    XX = [data ones(N,1)];
    w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
    w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
    w3probs = 1./(1 + exp(-w2probs*w3)); %w3probs = [w3probs  ones(N,1)];

    VV = [w_class(:)']';
    Dim = [l4; l5];
    [X, fX] = minimize(VV,'CG_CLASSIFY_INIT',max_iter,Dim,w3probs,targets);
    w_class = reshape(X,l4+1,l5);

  else
    VV = [w1(:)' w2(:)' w3(:)' w_class(:)']';
    Dim = [l1; l2; l3; l4; l5];
    [X, fX] = minimize(VV,'CG_CLASSIFY',max_iter,Dim,data,targets);

    w1 = reshape(X(1:(l1+1)*l2),l1+1,l2);
    xxx = (l1+1)*l2;
    w2 = reshape(X(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
    xxx = xxx+(l2+1)*l3;
    w3 = reshape(X(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
    xxx = xxx+(l3+1)*l4;
    w_class = reshape(X(xxx+1:xxx+(l4+1)*l5),l4+1,l5);
  end
%%%%%%%%%%%%%%% END OF CONJUGATE GRADIENT WITH 3 LINESEARCHES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 end

 save mnistclassify_weights w1 w2 w3 w_class
 save mnistclassify_error test_err test_crerr train_err train_crerr;

end
CG_CLASSIFY_INIT.m:
function [f, df] = CG_CLASSIFY_INIT(VV,Dim,w3probs,target);   % only a 2-layer network
l1 = Dim(1);
l2 = Dim(2);
N = size(w3probs,1);   % N is the number of training samples
% Do decomversion.
  w_class = reshape(VV,l1+1,l2);
  w3probs = [w3probs  ones(N,1)];

  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);
  f = -sum(sum( target(:,1:end).*log(targetout))) ;   % f is the cost (cross entropy) of the softmax classifier

IO = (targetout-target(:,1:end));
Ix_class=IO;
dw_class =  w3probs'*Ix_class;   % partial derivatives

df = [dw_class(:)']';
CG_CLASSIFY.m:
function [f, df] = CG_CLASSIFY(VV,Dim,XX,target);
l1 = Dim(1);
l2 = Dim(2);
l3= Dim(3);
l4= Dim(4);
l5= Dim(5);
N = size(XX,1);

% Do decomversion.
 w1 = reshape(VV(1:(l1+1)*l2),l1+1,l2);
 xxx = (l1+1)*l2;
 w2 = reshape(VV(xxx+1:xxx+(l2+1)*l3),l2+1,l3);
 xxx = xxx+(l2+1)*l3;
 w3 = reshape(VV(xxx+1:xxx+(l3+1)*l4),l3+1,l4);
 xxx = xxx+(l3+1)*l4;
 w_class = reshape(VV(xxx+1:xxx+(l4+1)*l5),l4+1,l5);

  XX = [XX ones(N,1)];
  w1probs = 1./(1 + exp(-XX*w1)); w1probs = [w1probs  ones(N,1)];
  w2probs = 1./(1 + exp(-w1probs*w2)); w2probs = [w2probs ones(N,1)];
  w3probs = 1./(1 + exp(-w2probs*w3)); w3probs = [w3probs  ones(N,1)];

  targetout = exp(w3probs*w_class);
  targetout = targetout./repmat(sum(targetout,2),1,10);
  f = -sum(sum( target(:,1:end).*log(targetout))) ;

IO = (targetout-target(:,1:end));
Ix_class=IO;
dw_class =  w3probs'*Ix_class;

Ix3 = (Ix_class*w_class').*w3probs.*(1-w3probs);
Ix3 = Ix3(:,1:end-1);
dw3 =  w2probs'*Ix3;

Ix2 = (Ix3*w3').*w2probs.*(1-w2probs);
Ix2 = Ix2(:,1:end-1);
dw2 =  w1probs'*Ix2;

Ix1 = (Ix2*w2').*w1probs.*(1-w1probs);
Ix1 = Ix1(:,1:end-1);
dw1 =  XX'*Ix1;

df = [dw1(:)' dw2(:)' dw3(:)' dw_class(:)']';
Experiment summary:
1. I have finally read through an RBM implementation. After all the formula-heavy theory, it is very satisfying to read the corresponding code.
2. Since the training here uses whole images (not patches of them), there is no convolution or pooling; during pretraining the input of each RBM is simply the output of the previous RBM. When no softmax layer is added, the fine-tuning stage is still unsupervised (the targets are still the original input data); once the softmax layer is added, the fine-tuning uses the true labels of the training samples, because the task is now classification.
3. The deeper the network, the longer the fine-tuning takes; it needs a lot of time to converge even with pretraining.
4. I have not yet figured out where the data for the second layer would come from when training with convolution on large images. It might, as above, be the output of the previous layer (but then how would fine-tuning work without label data?), or it might be the output obtained after the large image passes through the first layer's convolution and pooling (in which case the network cost function is harder to handle, because it involves the convolution and pooling operations).
References:
Reducing the dimensionality of data with neural networks
http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html
The difference between mod (modulo) and rem (remainder), MATLAB study notes