Preface
This post works through the Sparse Coding part of Stanford's UFLDL deep learning tutorial; see the exercise page Exercise:Sparse Coding. The task is to learn from 20,000 natural-image patches, using both ordinary sparse coding and topographic sparse coding, and to inspect the basis images (features) that are learned. The training data is the natural-image set IMAGES, available from the tutorial site.
Experiment basics
The main idea of sparse coding is to learn a set of "basis vectors" from the input data set. Once these bases are obtained, every data point in the input set can be represented as a linear combination of them, and sparsity means that the combination coefficients are sparse, i.e. most of them are zero. Each basis vector clearly has the same dimensionality as an original input sample, and the number of bases is usually larger than the dimensionality of each sample. The simplest way to see this is the formula mentioned in an earlier post:
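(The formula image from the original post is missing here; reconstructed in LaTeX from the description in the next paragraph, it is the standard decomposition:)

\[
\mathbf{x} \;\approx\; \sum_{i=1}^{k} a_i\,\boldsymbol{\phi}_i
\]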
where the input data x is decomposed as a linear combination of the bases Ф with combination coefficients ai. That formula is for a single data point; in machine learning we usually deal with many samples, so below the samples are treated as a matrix. In the earlier post Deep learning:二十六(Sparse coding簡單理解) we already gave the cost function of the non-topographic sparse coding system:
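(Formula reconstructed in LaTeX to match the code below, which divides the reconstruction term by the number of samples m; X is the data matrix, A the basis/weight matrix and s the feature/coefficient matrix:)

\[
J(A, s) \;=\; \frac{1}{m}\,\lVert A s - X \rVert_F^{2}
\;+\; \lambda \sum_{k,j} \sqrt{s_{k,j}^{2} + \epsilon}
\;+\; \gamma\,\lVert A \rVert_F^{2}
\]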
The cost function for the topographic case is as follows:
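(Again reconstructed to match the code; V is the grouping matrix, groupMatrix in the code, and the only change is that the sparsity penalty is taken over pooled groups of squared coefficients:)

\[
J(A, s) \;=\; \frac{1}{m}\,\lVert A s - X \rVert_F^{2}
\;+\; \lambda \sum_{r,j} \sqrt{\textstyle\sum_{k} V_{rk}\, s_{k,j}^{2} + \epsilon}
\;+\; \gamma\,\lVert A \rVert_F^{2}
\]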
In the training phase our goal is to find the optimal parameter A, since A is the set of basis vectors we are after. However, both cost functions contain two unknown parameter matrices, A and s, so a straightforward optimization does not work. The usual strategy is alternating optimization: fix A and optimize s, then fix that s and optimize A, and so on, stopping when a preset number of iterations is reached. The first thing to work out in this process is the derivative of the cost function with respect to the parameter matrices A and s.
These derivatives involve differentiating matrix norms, which can be done in two ways. The first converts the problem into derivatives of matrix traces; see the earlier post Deep learning:二十七(Sparse coding中關於矩陣的范數求導). The second uses the back-propagation idea; see Deep learning:二十八(使用BP算法思想求解Sparse coding中矩陣范數導數).
The derivative of the cost function with respect to the weight matrix A is as follows (it is the same in the topographic and non-topographic cases, because the terms that depend on A are identical in the two cost functions):
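(Reconstructed; it matches the grad expression in sparseCodingWeightCost.m below:)

\[
\frac{\partial J}{\partial A} \;=\; \frac{2}{m}\left(A s - X\right)s^{\top} \;+\; 2\gamma A
\]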
In the non-topographic case the derivative of the cost function with respect to s is:
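(Reconstructed; the square, square root and division in the sparsity term are element-wise, matching gradResidue + gradSparsity in sparseCodingFeatureCost.m for an identity grouping matrix:)

\[
\frac{\partial J}{\partial s} \;=\; \frac{2}{m}\,A^{\top}\!\left(A s - X\right)
\;+\; \lambda\,\frac{s}{\sqrt{s^{2} + \epsilon}}
\]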
In the topographic case the derivative of the cost function with respect to s is:
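(Reconstructed; here the square, the -1/2 power and the product ∘ are all element-wise, which corresponds exactly to the Matlab expression lambda*groupMatrix'*(groupMatrix*(featureMatrix.^2)+epsilon).^(-0.5).*featureMatrix in the code below:)

\[
\frac{\partial J}{\partial s} \;=\; \frac{2}{m}\,A^{\top}\!\left(A s - X\right)
\;+\; \lambda\,\Big(V^{\top}\big(V s^{2} + \epsilon\big)^{-\frac{1}{2}}\Big)\circ s
\]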
A few notes about the program:
- Under the formulas above and our interpretation, A consists of the learned basis vectors and S holds each sample's coefficients in that basis. With this convention the dimensions are: A is n*k, each column a learned basis vector; S is k*m, each column the sparse coding coefficients of the corresponding input sample; and X is then n*m, each column one data sample. If we flip the roles (not the interpretation used here, just a different way of arranging the matrices), so that A holds the decomposition coefficients (the codes) of the input data X and S holds the bases learned from the data set, then the dimensions work out as follows: suppose there are m samples, each an n-dimensional vector, so X is m*n; we want sparse coding to learn k features that minimise the cost; then A is m*k, where row i gives the coefficients of the i-th sample, and S is k*n, where row i is the i-th learned basis. In this experiment, and in similar situations later on, we stick with the first (correct) convention.
- In Matlab, right division by a matrix (B/A) and right-multiplication by inv(A) are mathematically equivalent, but their results can differ numerically; right division is more accurate (see the short sketch after this list).
- Note that the last factor in the topographic derivative with respect to s is an element-wise product (corresponding entries multiplied together); if it is written as an ordinary matrix product, gradient checking will not pass.
- The training samples come from IMAGES: ten 512*512 images, from which 20,000 small 8*8 patches are extracted (the original post shows a montage of some of these patches at this point).
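A tiny sketch of the point made in the second note above; the 3*3 system here is made up purely for illustration:

A = [3 1 0; 1 4 1; 0 1 5];
B = [1 2 3; 4 5 6];
X1 = B / A;          % right division: solves X*A = B directly
X2 = B * inv(A);     % forms the inverse explicitly
disp(norm(X1 - X2)); % mathematically zero, but in general B/A is more accurate (and faster)

This is also why the analytic update for weightMatrix in the script below uses the / operator rather than inv.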
Some Matlab functions:
circshift:
This function circularly shifts a matrix. For example, B = circshift(A, shiftsize) shifts A according to shiftsize, which is normally a vector with one entry per dimension. The first element shifts along the first dimension (up/down for the 2-D case considered here; a positive value shifts downward), and the second element shifts along the second dimension (left/right; a positive value shifts to the right).
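A minimal sanity check of circshift (the 3*3 matrix is just an illustrative example):

A = [1 2 3; 4 5 6; 7 8 9];
B1 = circshift(A, [1 0])    % shift down by one row:    [7 8 9; 1 2 3; 4 5 6]
B2 = circshift(A, [0 -1])   % shift left by one column: [2 3 1; 5 6 4; 8 9 7]

The exercise script uses the same idea on a 3-D array to slide the poolDim*poolDim block of ones around the grouping grid, wrapping at the border.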
randperm:
This function produces a random row vector. randperm(n) returns an n-element row vector containing a random permutation of the integers 1 to n (no repeats), and randperm(n,k) returns a k-element row vector whose entries are drawn from 1 to n without repetition.
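A short example of how the exercise script uses randperm to draw a mini-batch (numPatches and batchNumPatches as in the script):

numPatches = 20000;
batchNumPatches = 2000;
indices = randperm(numPatches);          % random permutation of 1..20000
indices = indices(1:batchNumPatches);    % keep 2000 of them: one batch, no repeats
% in newer Matlab versions, randperm(numPatches, batchNumPatches) does this in one call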
questdlg:
button = questdlg('qstring','title','str1','str2','str3',default) pops up a question dialog: qstring is the body text, title is the dialog title, the next three arguments are the labels of the three buttons (playing the roles of yes/no/cancel), and the final argument default names the button that is selected by default.
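The dialog in the exercise script is the two-button variant of this call; a minimal sketch (numFeatures set to the script's value):

numFeatures = 256;
button = questdlg('Topographic or non-topographic sparse coding?', ...  % body text
                  'Sparse coding', ...                                   % dialog title
                  'Non-topographic', 'Topographic', ...                  % button labels
                  'Non-topographic');                                    % default button
if isequal(button, 'Non-topographic')
    groupMatrix = eye(numFeatures);   % identity grouping = ordinary sparse coding
end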
Experiment results:
In the alternating optimization, the step that optimizes A for a fixed s has a closed-form solution, so there is no need to run lbfgs or another iterative optimizer; setting the derivative of the cost function with respect to A to zero gives the analytic solution:
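(Reconstructed; this matches the weightMatrix update line in sparseCodingExercise.m below, where m is the number of samples in the batch:)

\[
A \;=\; X s^{\top}\left(s\, s^{\top} + m\,\gamma\, I\right)^{-1}
\]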
Note that the identity matrix must carry a coefficient (the number of samples m); if it is left out, the A computed this way will not pass gradient checking in the program.
The non-topographic bases learned this way look as follows:
The result above is a bit rough; it was obtained with 16*16 patches rather than 8*8 ones.
Using cg as the optimizer, with 256 features learned from 16*16 patches, the result is as follows:
Changing to 8*8 patches and 121 features gives the following result (this one looks much more reasonable):
Using lbfgs with 256 features on 16*16 patches gives the following (much worse, which shows that the choice of optimization method affects the result):
Code for the experiment, with comments:
sparseCodingExercise.m:
%% CS294A/CS294W Sparse Coding Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  sparse coding exercise. In this exercise, you will need to modify
%  sparseCodingFeatureCost.m and sparseCodingWeightCost.m. You will also
%  need to modify this file, sparseCodingExercise.m slightly.

% Add the paths to your earlier exercises if necessary
% addpath /path/to/solution

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

addpath minFunc;
numPatches = 20000;     % number of patches
numFeatures = 256;      % number of features to learn
patchDim = 16;          % patch dimension
visibleSize = patchDim * patchDim;  % single-channel grayscale patches, patchDim^2 pixels each

% dimension of the grouping region (poolDim x poolDim) for topographic sparse coding
poolDim = 3;

% number of patches per batch
batchNumPatches = 2000; % i.e. the 20000 patches are split into 10 batches

lambda = 5e-5;  % L1-regularisation parameter (on features)
epsilon = 1e-5; % L1-regularisation epsilon |x| ~ sqrt(x^2 + epsilon)
gamma = 1e-2;   % L2-regularisation parameter (on basis)

%%======================================================================
%% STEP 1: Sample patches

images = load('IMAGES.mat');
images = images.IMAGES;

patches = sampleIMAGES(images, patchDim, numPatches);
display_network(patches(:, 1:64));

%%======================================================================
%% STEP 3: Iterative optimization
%  Once you have implemented the cost functions, you can now optimize for
%  the objective iteratively. The code to do the iterative optimization
%  using mini-batching and good initialization of the features has already
%  been included for you.
%
%  However, you will still need to derive and fill in the analytic solution
%  for optimizing the weight matrix given the features.
%  Derive the solution and implement it in the code below, verify the
%  gradient as described in the instructions below, and then run the
%  iterative optimization.

% Initialize options for minFunc
options.Method = 'cg';
options.display = 'off';
options.verbose = 0;

% Initialize matrices
weightMatrix = rand(visibleSize, numFeatures);       % visibleSize x numFeatures
featureMatrix = rand(numFeatures, batchNumPatches);  % numFeatures x batchNumPatches

% Initialize grouping matrix
assert(floor(sqrt(numFeatures)) ^2 == numFeatures, 'numFeatures should be a perfect square');
donutDim = floor(sqrt(numFeatures));
assert(donutDim * donutDim == numFeatures,'donutDim^2 must be equal to numFeatures');

groupMatrix = zeros(numFeatures, donutDim, donutDim);  % numFeatures x donutDim x donutDim
groupNum = 1;
for row = 1:donutDim
    for col = 1:donutDim
        groupMatrix(groupNum, 1:poolDim, 1:poolDim) = 1;  % poolDim = 3
        groupNum = groupNum + 1;
        groupMatrix = circshift(groupMatrix, [0 0 -1]);
    end
    groupMatrix = circshift(groupMatrix, [0 -1, 0]);
end
groupMatrix = reshape(groupMatrix, numFeatures, numFeatures);  % numFeatures x numFeatures

if isequal(questdlg('Initialize grouping matrix for topographic or non-topographic sparse coding?', 'Topographic/non-topographic?', 'Non-topographic', 'Topographic', 'Non-topographic'), 'Non-topographic')
    groupMatrix = eye(numFeatures);  % in the non-topographic case groupMatrix is the identity
end

% Initial batch
indices = randperm(numPatches);        % 1 x numPatches
indices = indices(1:batchNumPatches);  % 1 x batchNumPatches
batchPatches = patches(:, indices);

fprintf('%6s%12s%12s%12s%12s\n','Iter', 'fObj','fResidue','fSparsity','fWeight');
warning off;

for iteration = 1:200
    error = weightMatrix * featureMatrix - batchPatches;
    error = sum(error(:) .^ 2) / batchNumPatches;  % the reconstruction error is divided by the number of samples
    fResidue = error;
    num_batches = size(batchPatches,2);

    R = groupMatrix * (featureMatrix .^ 2);
    R = sqrt(R + epsilon);
    fSparsity = lambda * sum(R(:));  % the sparsity and weight-decay terms are not divided by the number of samples
    fWeight = gamma * sum(weightMatrix(:) .^ 2);
    % (at this point in the first iteration the weights are still their random initial values)
    fprintf('  %4d  %10.4f  %10.4f  %10.4f  %10.4f\n', iteration, fResidue+fSparsity+fWeight, fResidue, fSparsity, fWeight)

    % Select a new batch
    indices = randperm(numPatches);
    indices = indices(1:batchNumPatches);
    batchPatches = patches(:, indices);

    % Reinitialize featureMatrix with respect to the new batch,
    % following the method described on the tutorial page
    featureMatrix = weightMatrix' * batchPatches;
    normWM = sum(weightMatrix .^ 2)';
    featureMatrix = bsxfun(@rdivide, featureMatrix, normWM);

    % Optimize for the feature matrix with the weight matrix held fixed
    options.maxIter = 20;
    [featureMatrix, cost] = minFunc( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix), ...
                                     featureMatrix(:), options);
    featureMatrix = reshape(featureMatrix, numFeatures, batchNumPatches);

    % Optimize for the weight matrix: analytic solution given the features
    weightMatrix = (batchPatches*featureMatrix')/(gamma*num_batches*eye(size(featureMatrix,1))+featureMatrix*featureMatrix');
    [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix);
end
figure;
display_network(weightMatrix);
sparseCodingWeightCost.m:
function [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingWeightCost - given the features in featureMatrix,
%                         computes the cost and gradient with respect to
%                         the weights, given in weightMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
    else
        groupMatrix = eye(numFeatures);  % for non-topographic sparse coding groupMatrix is the identity
    end

    numExamples = size(patches, 2);  % 5 examples when gradient-checking the code

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);  % the inputs already have these shapes
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.
    % -------------------- YOUR CODE HERE --------------------

    %% cost function
    delta = weightMatrix*featureMatrix-patches;
    fResidue = sum(sum(delta.^2))./numExamples;  % reconstruction error
    fWeight = gamma*sum(sum(weightMatrix.^2));   % keeps the basis entries from growing too large
%     sparsityMatrix = sqrt(groupMatrix*(featureMatrix.^2)+epsilon);
%     fSparsity = lambda*sum(sparsityMatrix(:));  % sparsity penalty on the feature coefficients
%     cost = fResidue+fWeight+fSparsity;          % full cost
    cost = fResidue+fWeight;  % the sparsity term does not depend on A, so it can be dropped here

    %% gradient of the cost function
    grad = (2*weightMatrix*featureMatrix*featureMatrix'-2*patches*featureMatrix')./numExamples+2*gamma*weightMatrix;
    grad = grad(:);
end
sparseCodingFeatureCost.m:
function [cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingFeatureCost - given the weights in weightMatrix,
%                          computes the cost and gradient with respect to
%                          the features, given in featureMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    isTopo = 1;  % assume topographic unless groupMatrix turns out to be the identity
%     L = size(groupMatrix,1);
%     [K M] = size(featureMatrix);
    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
        if isequal(groupMatrix, eye(numFeatures))
            isTopo = 0;
        end
    else
        groupMatrix = eye(numFeatures);
        isTopo = 0;
    end

    numExamples = size(patches, 2);

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   features given in featureMatrix.
    %   You may wish to write the non-topographic version, ignoring
    %   the grouping matrix groupMatrix first, and extend the
    %   non-topographic version to the topographic version later.
    % -------------------- YOUR CODE HERE --------------------

    %% cost function
    delta = weightMatrix*featureMatrix-patches;
    fResidue = sum(sum(delta.^2))./numExamples;  % reconstruction error
%     fWeight = gamma*sum(sum(weightMatrix.^2));  % weight decay on the basis
    sparsityMatrix = sqrt(groupMatrix*(featureMatrix.^2)+epsilon);
    fSparsity = lambda*sum(sparsityMatrix(:));   % sparsity penalty on the feature coefficients
%     cost = fResidue+fSparsity+fWeight;  % A is constant here, so its weight-decay term can be dropped
    cost = fResidue+fSparsity;

    %% gradient of the cost function
    gradResidue = (-2*weightMatrix'*patches+2*weightMatrix'*weightMatrix*featureMatrix)./numExamples;

    % non-topographic case:
    if ~isTopo
        gradSparsity = lambda*(featureMatrix./sparsityMatrix);
    end

    % topographic case:
    if isTopo
        % be careful: the last factor is an element-wise product, not a matrix product
        gradSparsity = lambda*groupMatrix'*(groupMatrix*(featureMatrix.^2)+epsilon).^(-0.5).*featureMatrix;
    end
    grad = gradResidue+gradSparsity;
    grad = grad(:);
end
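The exercise asks you to verify these gradients numerically before running the full optimization. A minimal, self-contained check for sparseCodingFeatureCost might look like the sketch below; the tiny dimensions are made up purely for testing, central differences are written inline instead of relying on the computeNumericalGradient helper from the earlier exercises, and the same pattern works for sparseCodingWeightCost:

% Gradient check on a tiny random problem; all sizes here are made up for testing.
visibleSize = 9; numFeatures = 4; numExamples = 5;
gamma = 1e-2; lambda = 5e-5; epsilon = 1e-5;
weightMatrix  = randn(visibleSize, numFeatures);
featureMatrix = randn(numFeatures, numExamples);
patches       = randn(visibleSize, numExamples);
groupMatrix   = eye(numFeatures);  % identity = non-topographic; swap in a grouping matrix to test the topographic path

J = @(s) sparseCodingFeatureCost(weightMatrix, s, visibleSize, numFeatures, ...
                                 patches, gamma, lambda, epsilon, groupMatrix);
theta = featureMatrix(:);
[~, grad] = J(theta);            % analytic gradient

numgrad = zeros(size(theta));    % numerical gradient by central differences
h = 1e-4;
for i = 1:numel(theta)
    e = zeros(size(theta)); e(i) = h;
    numgrad(i) = (J(theta + e) - J(theta - e)) / (2*h);
end
fprintf('max |grad - numgrad| = %g\n', max(abs(grad - numgrad)));  % should be very small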
sampleIMAGES.m:
function patches = sampleIMAGES(images, patchsize, numpatches)
% sampleIMAGES
% Returns numpatches patches for training

% load IMAGES;        % load images from disk
% patchsize = 8;      % we'll use 8x8 patches
% numpatches = 10000;

% Initialize patches with zeros. Your code will fill in this matrix--one
% column per patch, numpatches columns.
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data
%  from IMAGES.
%
%  IMAGES is a 3D array containing 10 images
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images look a bit off because they have
%  been preprocessed using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1

for imageNum = 1:10  % sample 2000 patches from each of the 10 images, 20000 in total
    [rowNum colNum] = size(images(:,:,imageNum));
    for patchNum = 1:2000  % 2000 random patches per image
        xPos = randi([1, rowNum-patchsize+1]);
        yPos = randi([1, colNum-patchsize+1]);
        patches(:,(imageNum-1)*2000+patchNum) = reshape(images(xPos:xPos+patchsize-1, yPos:yPos+patchsize-1, ...
                                                        imageNum), patchsize*patchsize, 1);
    end
end

%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure
% the range of pixel values is also bounded between [0,1]
% patches = normalizeData(patches);  % deliberately left commented out here; see the summary at the end of the post

end

%% ---------------------------------------------------------------
function patches = normalizeData(patches)
% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images).
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;  % by the 3-sigma rule the vast majority of the data falls in this range;
                                                  % the data is now in [-1, 1]

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end
Summary:
Topographic sparse coding is not finished; the runs produce nothing meaningful. Pointers from readers would be appreciated.
2013.5.6:
The non-topographic sparse coding is now solved. The problem back then was that the reconstruction-error term in the cost function was not divided by the number of samples (as a commenter pointed out below), so the cost function, the derivatives, and the analytic solution for A were all correspondingly wrong.
Topographic sparse coding still would not train, though: the cost value did not decrease during training and showed essentially no pattern.
2013.5.14:
Topographic sparse coding is now basically working. The main reason no features were learned before was that the final patch normalization in sampleIMAGES.m had not been commented out (my guess: the large source images had already been whitened before sampling, so applying that additional lossy normalization afterwards probably introduces extra error and makes the samples unsuitable for sparse coding). Also, following a hint from @地皮菜 in the discussion group, switching the optimizer from lbfgs to cg produces reasonable results. So the choice of optimization algorithm really does affect the final result.
References:
Deep learning:二十六(Sparse coding簡單理解)
Deep learning:二十七(Sparse coding中關於矩陣的范數求導)
Deep learning:二十八(使用BP算法思想求解Sparse coding中矩陣范數導數)