1 Preface
My purpose in writing technical blog posts is simple: many things fade from memory if I leave them untouched for a long while. Writing them down deepens my learning and makes it easy to recall later what I once studied.
While searching online for materials, I saw UFLDL repeatedly recommended as a good place to start from the basics. Andrew Ng writes without any pretension, which I appreciated, and for those of us whose English is weak, the translators of the tutorial, such as Yu Kai, deserve real thanks. I had read some deep learning papers first, but felt my understanding was shallow; you usually only understand something deeply once you program it yourself, or at least read someone else's program closely. So I worked through the tutorial's exercises step by step, also consulting programs others had already written, both to save time and to pick up their good programming habits.
I also found tornadomeet's blog very strong; his posts are extremely detailed and well suited to beginners like me, for which I am grateful. I follow him on Weibo, and he now works on deep learning at NetEase.
This post works through the Sparse Autoencoder exercise from UFLDL.
2 Exercise Basics
Before attempting this exercise, please read the first section of the tutorial, which presents the underlying theory and the notation conventions.
Note: the code in this post is already fully vectorized, so even gradient checking runs fairly quickly.
The background notes below are quoted from tornadomeet's blog. They are well written; you can tell at a glance that he read every program Ng provides very carefully. I have also added some material of my own.
Some MATLAB functions:
bsxfun:
C = bsxfun(fun, A, B) applies an element-wise binary operation to the two arrays A and B; fun is a function handle, an m-file, or a built-in function, and in practice it is usually something like addition or subtraction, preceded by the '@' symbol. Normally A and B must be the same size; if they are not, they may differ in only one dimension, and in that dimension one of them must have size 1. For example, in bsxfun(@minus, A, mean(A)) the sizes of A and mean(A) differ: mean(A) is implicitly expanded to the size of A, and each element of A then has the corresponding element of the expanded mean(A) subtracted from it.
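A toy sketch (my own example, not from the exercise code) of this broadcasting behaviour:
% Subtract each column's mean from a 3x3 matrix.
A = magic(3);                    % 3x3 matrix
colMean = mean(A);               % 1x3 row vector of column means
B = bsxfun(@minus, A, colMean);  % colMean is implicitly expanded to 3x3
% Equivalent to: B = A - repmat(colMean, size(A,1), 1);
disp(B);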
rand:
Generates uniformly distributed pseudorandom numbers in the interval (0,1).
Main syntax: rand(m,n) generates an m-by-n matrix of uniformly distributed pseudorandom numbers.
rand(m,n,'double') generates uniformly distributed pseudorandom numbers of the specified precision; the argument can also be 'single'.
rand(RandStream,m,n) generates pseudorandom numbers from the specified RandStream (which I think of as a random seed/stream).
randn:
Generates standard normally distributed pseudorandom numbers (mean 0, variance 1). The main syntax is the same as above.
randi:
Generates uniformly distributed pseudorandom integers.
Main syntax: randi(iMax) generates uniformly distributed pseudorandom integers in the closed interval [1, iMax].
randi(iMax,m,n) generates an m-by-n random matrix of integers in [1, iMax].
r = randi([iMin,iMax],m,n) generates an m-by-n random matrix of integers in the closed interval [iMin, iMax].
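A few toy calls (my own examples) showing the three generators side by side:
U = rand(2,3);          % 2x3 matrix, uniform on (0,1)
G = randn(2,3);         % 2x3 matrix, standard normal (mean 0, variance 1)
K = randi(10,2,3);      % 2x3 matrix of integers drawn uniformly from 1..10
R = randi([5,8],2,3);   % 2x3 matrix of integers drawn uniformly from 5..8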
exist:
Tests whether something exists. For example, exist('opt_normalize', 'var') checks whether the variable opt_normalize exists; the 'var' argument means "variable".
colormap:
Sets the colormap (the table of color values) used by the current figure.
floor:
floor(A): the largest integer not greater than A.
ceil:
ceil(A): the smallest integer not less than A.
imagesc:
imagesc is similar to image and can be used to display images. For example, imagesc(array,'EraseMode','none',[-1 1]) linearly maps the data in array onto [-1,1] and displays it with the currently set colormap, so that [-1,1] spans the entire colormap; the erase mode is set to 'none', meaning the background is not erased.
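A toy example (mine, not from the exercise) of displaying a matrix with imagesc, a gray colormap, and explicit color limits:
A = 2*rand(8) - 1;     % random 8x8 matrix with values in (-1,1)
imagesc(A, [-1 1]);    % the [-1 1] argument sets the color limits (CLim)
colormap gray;
colorbar;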
repmat:
This function builds a larger matrix by tiling copies of the original matrix into it. For example, B = repmat(A,m,n) creates a matrix B containing m*n copies of A, so B has size [size(A,1)*m size(A,2)*n].
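A toy example (mine) of the tiling behaviour, plus the way the exercise code uses it:
A = [1 2; 3 4];
B = repmat(A, 2, 3);   % B is 4x6: size(A,1)*2 rows and size(A,2)*3 columns
% In sparseAutoencoderCost.m, repmat(b1,1,m) copies the bias column once per training example.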
The @ operator in MATLAB:
Q: What does f=@(x)acos(x) mean? What does @ stand for?
A: f is a function handle, and @ is the operator that creates the handle. f=@(x)acos(x) is equivalent to creating a function file:
% f.m
function y=f(x)
y=acos(x);
Given the statement: xsqual=@(x)1/2.*(x==-1/2)+1.*(x>-1/2&x<1/2)+1/2.*(x==1/2);
it is equivalent to creating a function file:
% xsqual.m
function y=xsqual(x)
y=1/2.*(x==-1/2)+1.*(x>-1/2&x<1/2)+1/2.*(x==1/2);
Benefits of function handles:
① Faster execution. Every time MATLAB calls a function by name it searches the entire path (and, as set path shows, the path can be very long), so if a function is called many times in your program, calling it through a function handle speeds things up.
② A handle can be used as conveniently as a variable. For example, if I create a handle to a function in the current directory and then change to another directory, I can still call the handle directly without copying the function file over, because the function handle already stores the path.
What function handles are for:
Without a function handle, every call to a function triggers a full path search, which directly hurts performance; a handle avoids this cost entirely by pointing straight at the function. A function handle acts like the function's name and is somewhat like a reference in C++.
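A toy example (mine, not from the exercise) of defining a handle and passing it around, the same pattern train.m later uses when it hands an anonymous function to computeNumericalGradient and minFunc:
f = @(x) x.^2 + 3*x;        % anonymous function of one variable
y = f(2);                   % call it like an ordinary function: y = 10
g = @(x) f(x) + 1;          % a handle can be captured inside another handle
z = arrayfun(g, [1 2 3]);   % apply g element-wise: z = [5 11 19]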
The cost function is:

$$J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big)$$

where

$$J(W,b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\big\|h_{W,b}(x^{(i)}) - y^{(i)}\big\|^2 + \frac{\lambda}{2}\sum_{l}\sum_{i}\sum_{j}\big(W_{ji}^{(l)}\big)^2, \qquad \mathrm{KL}\big(\rho\,\|\,\hat{\rho}_j\big) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}$$

Formulas needed when computing the gradient:

For the output layer (layer $n_l$), compute

$$\delta^{(n_l)}_i = -\big(y_i - a^{(n_l)}_i\big)\, f'\big(z^{(n_l)}_i\big)$$

where $y$ is the desired output; for the autoencoder, $y = x$.

For the hidden layer (layer 2), the sparsity term is added to the back-propagated error:

$$\delta^{(2)}_i = \bigg(\sum_{j=1}^{s_3} W^{(2)}_{ji}\,\delta^{(3)}_j + \beta\Big(-\frac{\rho}{\hat{\rho}_i} + \frac{1-\rho}{1-\hat{\rho}_i}\Big)\bigg)\, f'\big(z^{(2)}_i\big)$$

where we let

$$\hat{\rho}_j = \frac{1}{m}\sum_{i=1}^{m} a^{(2)}_j\big(x^{(i)}\big)$$

denote the average activation of hidden unit $j$ over the training set.
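As a quick numerical illustration of the sparsity penalty above, here is a toy MATLAB sketch with made-up activations (my own example; the real, vectorized computation is in sparseAutoencoderCost.m in Section 4):
% Toy sketch of the sparsity penalty; a2 stands for the hidden-layer activations.
a2 = [0.02 0.04 0.01; 0.30 0.10 0.20];   % 2 hidden units, 3 training examples
rho = 0.01;                              % sparsityParam
beta = 3;
rhoHat = mean(a2, 2);                    % average activation of each hidden unit
kl = rho*log(rho./rhoHat) + (1-rho)*log((1-rho)./(1-rhoHat));
penalty = beta * sum(kl);                % the term added to the cost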
3 Exercise Steps
1. Generate the training set. Randomly select 10,000 8x8 patches from ten 512x512 images and normalize them to obtain the training set patches. See sampleIMAGES.m.
2. Compute the cost function Jsparse(W,b) and its gradient. See sparseAutoencoderCost.m.
3. Compute an approximate gradient numerically with computeNumericalGradient.m (EPSILON = 10^-4) and use checkNumericalGradient.m to verify that the gradient code from the previous step is correct. First, the gradient of the simple function h(x1,x2) = x1^2 + 3*x1*x2 at the point [4, 10] is computed analytically and compared with the gradient computed by the method in computeNumericalGradient.m; if the two differ by less than 10^-9, computeNumericalGradient.m is correct. Then the gradient of the cost function Jsparse(W,b) computed with computeNumericalGradient.m is compared with the gradient computed by sparseAutoencoderCost.m; if those two differ by less than 10^-9, sparseAutoencoderCost.m is correct.
4. Train the sparse autoencoder with the L-BFGS algorithm (note: this implementation must not be used for commercial purposes; for commercial use, the fminlbfgs function is an alternative, slower than L-BFGS but commercially usable); see the minFunc folder. When initializing the parameter vector theta (containing W(1), W(2), b(1), b(2)), the entries of W(1) and W(2) are drawn uniformly at random from [-sqrt(6/(n_in+n_out+1)), sqrt(6/(n_in+n_out+1))], where n_in is the number of hidden-layer neurons and n_out is the number of output-layer neurons; b(1) and b(2) are initialized to 0 (a rough sketch of this initialization is given just after this list).
5. Visualize the results. Run the main script train.m to train the sparse autoencoder and produce the visualization. Visualizing the learned weights shows which features the algorithm has extracted from the images.
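initializeParameters.m itself is not reproduced in Section 4 (it ships with sparseae_exercise.zip), so the following is only a rough sketch of the uniform initialization described in step 4, reconstructed by me; the file in the official zip is the reference.
% Hypothetical sketch of the step-4 initialization (not the official initializeParameters.m).
hiddenSize = 25; visibleSize = 64;
r = sqrt(6) / sqrt(hiddenSize + visibleSize + 1);   % half-width of the uniform interval
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;     % uniform in [-r, r)
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
theta = [W1(:); W2(:); b1(:); b2(:)];               % unroll into one parameter vector
Unrolling everything into a single column vector matches what minFunc and sparseAutoencoderCost.m expect.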
Running the whole program, from generating the training set to producing the final visualization, took 144.498443 seconds.
Detailed comments on display_network.m, the function that visualizes the first-layer features, can be found in: Deep Learning 8: Stacked Autoencoders and Implement deep networks for digit classification_Exercise (Stanford UFLDL tutorial).
The output of the run is shown below.
The training set (figure):
The visualized features (figure):
As can be seen, the features learned by the sparse autoencoder are essentially the edges in the images.
>> train
Elapsed time is 0.034801 seconds.
38.0000 38.0000
12.0000 12.0000
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
2.1452e-12
The value above is the difference between the gradient of the simple function h(x1,x2) = x1^2 + 3*x1*x2 at the point [4, 10] computed analytically and the gradient computed by the method in computeNumericalGradient.m; as you can see, it is less than 10^-9.
7.8813e-11
Iteration FunEvals Step Length Function Val Opt Cond
1 4 7.66801e-02 8.26267e+00 2.99521e+02
2 5 1.00000e+00 4.00448e+00 1.59279e+02
The value 7.8813e-11 above is the difference between the gradient of the cost function Jsparse(W,b) computed with computeNumericalGradient.m and the gradient computed by sparseAutoencoderCost.m; it is indeed less than 10^-9.
4 Code and Comments
train.m
%% CS294A/CS294W Programming Assignment Starter Code
%  Instructions
%  ------------
%  This file contains code that helps you get started on the
%  programming assignment. You will need to complete the code in sampleIMAGES.m,
%  sparseAutoencoderCost.m and computeNumericalGradient.m.
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file.

%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.

visibleSize = 8*8;     % number of input units
hiddenSize = 25;       % number of hidden units
sparsityParam = 0.01;  % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks
                       %  like a lower-case "p", in the lecture notes.)
lambda = 0.0001;       % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Implement sampleIMAGES
%  After implementing sampleIMAGES, the display_network command should
%  display a random sample of 200 patches from the dataset

patches = sampleIMAGES;
display_network(patches(:,randi(size(patches,2),200,1)),8);  % randomly pick 200 of the 10000 patches to display

%  Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================
%% STEP 2: Implement sparseAutoencoderCost
%  You can implement all of the components (squared error cost, weight decay term,
%  sparsity penalty) in the cost function at once, but it may be easier to do
%  it step-by-step and run gradient checking (see STEP 3) after each step. We
%  suggest implementing the sparseAutoencoderCost function using the following steps:
%
%  (a) Implement forward propagation in your neural network, and implement the
%      squared error term of the cost function. Implement backpropagation to
%      compute the derivatives. Then (using lambda=beta=0), run Gradient Checking
%      to verify that the calculations corresponding to the squared error cost
%      term are correct.
%
%  (b) Add in the weight decay term (in both the cost function and the derivative
%      calculations), then re-run Gradient Checking to verify correctness.
%
%  (c) Add in the sparsity penalty term, then re-run Gradient Checking to
%      verify correctness.
%
%  Feel free to change the training settings when debugging your code.
%  (For example, reducing the training set size or number of hidden units
%  may make your code run faster; and setting beta and/or lambda to zero
%  may be helpful for debugging.) However, in your final submission of the
%  visualized weights, please use the parameters we gave in Step 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...
                                     sparsityParam, beta, patches);

%%======================================================================
%% STEP 3: Gradient Checking
%  Hint: If you are debugging your code, performing gradient checking on smaller models
%  and smaller training sets (e.g., using only 10 training examples and 1-2 hidden
%  units) may speed things up.

%  First, lets make sure your numerical gradient computation is correct for a
%  simple function. After you have implemented computeNumericalGradient.m,
%  run the following:
checkNumericalGradient();

%  Now we can use it to check your cost function and derivative calculations
%  for the sparse autoencoder.
numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
                                                  hiddenSize, lambda, ...
                                                  sparsityParam, beta, ...
                                                  patches), theta);

%  Use this to visually compare the gradients side by side
disp([numgrad grad]);

%  Compare numerically computed gradients with the ones obtained from backpropagation
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff);  % Should be small. In our implementation, these values are
             % usually less than 1e-9.
             % When you got this working, Congratulations!!!

%%======================================================================
%% STEP 4: After verifying that your implementation of
%  sparseAutoencoderCost is correct, you can start training your sparse
%  autoencoder with minFunc (L-BFGS).

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

%  Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs';  % Here, we use L-BFGS to optimize our cost
                           % function. Generally, for minFunc to work, you
                           % need a function pointer with two outputs: the
                           % function value and the gradient. In our problem,
                           % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;     % Maximum number of iterations of L-BFGS to run
options.display = 'on';

[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   visibleSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, patches), ...
                            theta, options);

%%======================================================================
%% STEP 5: Visualization

W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
display_network(W1', 12);

print -djpeg weights.jpg   % save the visualization to a file
sampleIMAGES.m
function patches = sampleIMAGES()
% sampleIMAGES
% Returns 10000 patches for training

load IMAGES;    % load images from disk

patchsize = 8;  % we'll use 8x8 patches
numpatches = 10000;

% Initialize patches with zeros. Your code will fill in this matrix--one
% column per patch, 10000 columns.
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data from IMAGES.
%
%  IMAGES is a 3D array containing 10 images.
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images looks a bit off because they have
%  been preprocessed using "whitening." See the lecture notes for more
%  details.) As a second example, IMAGES(21:30,21:30,1) is an image patch
%  corresponding to the pixels in the block (21,21) to (30,30) of Image 1.

tic
image_size = size(IMAGES);
% random top-left indices: 1 x numpatches vectors with values in 1 .. image_size(d)-patchsize+1
i = randi(image_size(1)-patchsize+1, 1, numpatches);
j = randi(image_size(2)-patchsize+1, 1, numpatches);
k = randi(image_size(3), 1, numpatches);   % which of the 10 images each patch comes from
for num = 1:numpatches
    patches(:,num) = reshape(IMAGES(i(num):i(num)+patchsize-1, j(num):j(num)+patchsize-1, k(num)), ...
                             1, patchsize*patchsize);
end
toc

%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data.
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure
% the range of pixel values is also bounded between [0,1].
patches = normalizeData(patches);

end

%% ---------------------------------------------------------------
function patches = normalizeData(patches)
% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images): subtract each patch's (column's) mean
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));                      % 3 times the overall standard deviation
patches = max(min(patches, pstd), -pstd) / pstd; % by the 3-sigma rule nearly all values fall in this range;
                                                 % after dividing, the data lie in [-1,1]

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end
sparseAutoencoderCost.m
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.
% theta: the parameter vector, containing W1, W2, b1, b2

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%  and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
%  W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
%  Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
%  as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
%  respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
%  with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term
%  [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
%  of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
%  Stated differently, if we were using batch gradient descent to optimize the parameters,
%  the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.

% 1. Forward propagation
data_size = size(data);
active_value2 = repmat(b1, 1, data_size(2));
active_value3 = repmat(b2, 1, data_size(2));
active_value2 = sigmoid(W1*data + active_value2);           % layer-2 (hidden) activations
active_value3 = sigmoid(W2*active_value2 + active_value3);  % layer-3 (output) activations

% 2. Error terms and cost
ave_square = sum(sum((active_value3-data).^2)./2)/data_size(2);   % mean squared-error term
weight_decay = lambda/2*(sum(sum(W1.^2))+sum(sum(W2.^2)));        % weight decay term
p_real = sum(active_value2,2)./data_size(2);    % average activation of each hidden unit (25x1)
p_para = repmat(sparsityParam, hiddenSize, 1);  % sparsity parameter rho replicated to 25x1
sparsity = beta.*sum(p_para.*log(p_para./p_real)+(1-p_para).*log((1-p_para)./(1-p_real)));  % sparsity penalty
cost = ave_square + weight_decay + sparsity;    % total cost

delta3 = (active_value3-data).*(active_value3).*(1-active_value3);               % layer-3 error term
average_sparsity = repmat(sum(active_value2,2)./data_size(2), 1, data_size(2));  % rho_hat replicated to 25x10000
default_sparsity = repmat(sparsityParam, hiddenSize, data_size(2));              % rho replicated to 25x10000
sparsity_penalty = beta.*(-(default_sparsity./average_sparsity)+((1-default_sparsity)./(1-average_sparsity)));
delta2 = (W2'*delta3 + sparsity_penalty).*((active_value2).*(1-active_value2));  % layer-2 error term

% 3. Backpropagation (gradients)
W2grad = delta3*active_value2'./data_size(2) + lambda.*W2;   % gradient of W2
W1grad = delta2*data'./data_size(2) + lambda.*W1;            % gradient of W1
b2grad = sum(delta3,2)./data_size(2);                        % gradient of b2
b1grad = sum(delta2,2)./data_size(2);                        % gradient of b1

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc). Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients. This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
computeNumericalGradient.m
function numgrad = computeNumericalGradient(J, theta)
% numgrad = computeNumericalGradient(J, theta)
% theta: a vector of parameters (here it contains W1, W2, b1, b2)
% J: a function that outputs a real-number. Calling y = J(theta) will return the
%    function value at theta.

% Initialize numgrad with zeros
numgrad = zeros(size(theta));

%% ---------- YOUR CODE HERE --------------------------------------
% Instructions:
% Implement numerical gradient checking, and return the result in numgrad.
% (See Section 2.3 of the lecture notes.)
% You should write code so that numgrad(i) is (the numerical approximation to) the
% partial derivative of J with respect to the i-th input argument, evaluated at theta.
% I.e., numgrad(i) should be (approximately) the partial derivative of J with
% respect to theta(i).
%
% Hint: You will probably want to compute the elements of numgrad one at a time.

EPSILON = 1e-4;
for i = 1:numel(theta)              % numel(theta) rather than size(theta), which returns a size vector
    theta_plus = theta;
    theta_minus = theta;
    theta_plus(i) = theta_plus(i) + EPSILON;
    theta_minus(i) = theta_minus(i) - EPSILON;
    % central-difference approximation of the i-th partial derivative
    numgrad(i) = (J(theta_plus) - J(theta_minus)) / (2*EPSILON);
end

%% ---------------------------------------------------------------
end
The remaining files (checkNumericalGradient.m, display_network.m, the minFunc package, and initializeParameters.m) can be found in UFLDL's sparseae_exercise.zip.
References:
http://www.cnblogs.com/hrlnw/archive/2013/06/08/3127162.html
http://www.cnblogs.com/tornadomeet/archive/2013/03/20/2970724.html
http://www.cnblogs.com/cj695/p/4498699.html