Stanford Coursera Machine Learning programming assignment, Exercise 4: training a neural network with the BP algorithm to recognize Arabic digits (0-9)

Reposted article. 2016-11-29 21:44. Tags: programming assignment / machine learning

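In this post we implement the BP (backpropagation) algorithm and apply it to the automatic recognition of handwritten Arabic digits (0-9).

The training set is as follows: there are 5000 training instances, and each training instance is a 400-dimensional feature vector (a 20*20 pixel image). Representing the whole training set as a matrix X, X is a 5000*400 matrix (5000 rows, 400 columns), with one training instance per row.

In addition, there is a 5000*1 column vector y that holds the labels of the training set. For example, the label of the first training instance is the digit 5.

① Model representation

We use a three-layer neural network model: an input layer, one hidden layer, and an output layer. The training set matrix X, which holds every training instance, is loaded into Matlab with the load command:

Because the network has three layers, there are two parameter matrices: Θ⁽¹⁾ (Theta1) and Θ⁽²⁾ (Theta2). They are stored in advance in the file ex4weights.mat and are loaded into Matlab with load('ex4weights'), as follows:

How are the dimensions of the parameter matrices Θ⁽¹⁾ (Theta1) and Θ⁽²⁾ (Theta2) determined?

In general, for a given training set the input dimension and the number of output classes are fixed. In this post each digit image is represented by a 400-dimensional feature vector, and the output is one of the Arabic digits 0-9, so there are 10 possible outputs. The number of hidden units, by contrast, is a design choice made according to the problem at hand; here the hidden layer has 25 units. Hence Θ⁽¹⁾ (Theta1) is a 25*401 matrix: the number of rows, 25, is the number of hidden units (not counting the bias unit), and the number of columns is the number of input features (401 once the bias unit is added). Likewise, Θ⁽²⁾ (Theta2) is a 10*26 matrix: the number of rows is the number of output classes (the 10 digits 0-9), and the number of columns is the number of hidden units (25 hidden units plus one bias unit).

[Figure: structure of the neural network]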
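The dimension rule described above can be sketched in a few lines of Matlab. This is only an illustration: the matrices are filled with zeros rather than the real weights from ex4weights.mat, and num_params is a name introduced here.

```matlab
% Layer sizes used in this post: 400 input pixels, 25 hidden units, 10 digits.
input_layer_size  = 400;
hidden_layer_size = 25;
num_labels        = 10;

% Each Theta maps one layer (plus its bias unit) to the next layer:
% rows = units in the next layer, columns = units in this layer + 1.
Theta1 = zeros(hidden_layer_size, input_layer_size + 1);   % 25 x 401
Theta2 = zeros(num_labels, hidden_layer_size + 1);         % 10 x 26

% Number of parameters once both matrices are unrolled into one vector.
num_params = numel(Theta1) + numel(Theta2);
fprintf('Theta1: %d x %d, Theta2: %d x %d, parameters: %d\n', ...
        size(Theta1, 1), size(Theta1, 2), ...
        size(Theta2, 1), size(Theta2, 2), num_params);
```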

② Cost function

The cost function of the neural network, without regularization, is as follows:
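The formula itself did not survive in this copy of the post. A reconstruction that agrees with the computation in nnCostFunction.m later in this post (the standard cross-entropy cost summed over all K output units) would be:

```latex
J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ -y_k^{(i)} \log\!\left( \left( h_\theta(x^{(i)}) \right)_k \right)
         - \left( 1 - y_k^{(i)} \right) \log\!\left( 1 - \left( h_\theta(x^{(i)}) \right)_k \right) \right]
```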

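Here m equals 5000, the number of training instances, and K = 10, the number of possible output classes (the digits 0-9).

Interpreting the hypothesis h_θ(x⁽ⁱ⁾) and h_θ(x⁽ⁱ⁾)_k

h_θ(x⁽ⁱ⁾) is computed with the following formulas:

a⁽¹⁾ = x with the bias unit a₀⁽¹⁾ added, where x is the i-th row of the training matrix X (the i-th training instance) written as a 400*1 column vector. The superscript (1) denotes the layer of the network.

z⁽²⁾ = Θ⁽¹⁾ * a⁽¹⁾. Applying the sigmoid function to z⁽²⁾ gives a⁽²⁾, the values of the hidden-layer neurons. a⁽²⁾ is a 25*1 column vector.

a⁽²⁾ = sigmoid(z⁽²⁾). After a bias unit a₀⁽²⁾ is added to the 25 hidden neurons, the unit vector of the third (output) layer, a⁽³⁾, can be computed. a⁽³⁾ is a 10*1 column vector.

Similarly, z⁽³⁾ = Θ⁽²⁾ * a⁽²⁾ and a⁽³⁾ = sigmoid(z⁽³⁾). The resulting a⁽³⁾ is the hypothesis h_θ(x⁽ⁱ⁾).

It follows that the hypothesis h_θ(x⁽ⁱ⁾) is a 10*1 column vector, and h_θ(x⁽ⁱ⁾)_k is the k-th element of that vector. [Author's open question: does this mean the training instance is the digit k with probability h_θ(x⁽ⁱ⁾)_k?]

For example: h_θ(x⁽⁶⁾) = (0, 0, 0.03, 0, 0.97, 0, 0, 0, 0, 0)^T [Author's open question: what exactly is the output of h_θ(x⁽ⁱ⁾), a probability between 0 and 1 for each element? To be checked by debugging in Matlab.]

Its meaning is: when the 6th training instance in the training set is fed through the network, the result is the digit 3 with probability 0.03 and the digit 5 with probability 0.97.

(Note: index 10 of the vector stands for the digit 0.)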
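The forward-propagation steps above can be sketched for a single instance as follows. The weights and the input here are made-up stand-ins (the post loads the real ones from ex4weights.mat and X), and sigmoid is defined inline so that the snippet is self-contained.

```matlab
% Sigmoid defined inline; the exercise provides the same thing in sigmoid.m.
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical weights and input, for illustration only.
Theta1 = reshape(sin(1:25*401), 25, 401) / 10;   % 25 x 401
Theta2 = reshape(cos(1:10*26), 10, 26) / 10;     % 10 x 26
x = linspace(0, 1, 400)';                        % stand-in for one unrolled 20*20 image

a1 = [1; x];                 % add bias unit, 401 x 1
z2 = Theta1 * a1;            % 25 x 1
a2 = [1; sigmoid(z2)];       % hidden activations plus bias unit, 26 x 1
z3 = Theta2 * a2;            % 10 x 1
a3 = sigmoid(z3);            % 10 x 1, this is h_theta(x)

% The predicted digit is the index of the largest output; index 10 means digit 0.
[~, k] = max(a3);
digit = mod(k, 10);
fprintf('Predicted index %d, i.e. digit %d\n', k, digit);
```

Each element of a3 lies strictly between 0 and 1, but the ten outputs are independent sigmoids and need not sum to 1, so reading them as probabilities is an informal interpretation.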

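Interpreting the result vector y (label of result) of the training set

Training a neural network is supervised learning, i.e. the training data has the form (x⁽ⁱ⁾, y⁽ⁱ⁾): for a training instance x⁽ⁱ⁾ we already know that its correct result is y⁽ⁱ⁾. Our goal is to build a neural network model whose trained hypothesis h_θ(x) correctly recognizes the result for previously unseen input x^(k).

Therefore the result data y in the training set is the known, correct answer. For example, y⁽⁶⁰⁰⁾ = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0)^T means that the correct result for the 600th training instance is the digit 5 (because the 5th element of y⁽⁶⁰⁰⁾ is 1 and all other elements are 0). Note also that when the 10th element of y⁽ⁱ⁾ is 1, it stands for the digit 0.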
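The mapping from a label to such a 0/1 vector can be sketched as below. The three labels are made up for illustration, and indexing into an identity matrix is just one possible way to do it; nnCostFunction.m later in this post uses a loop instead.

```matlab
num_labels = 10;
y = [5; 10; 3];          % hypothetical labels; the label 10 encodes the digit 0

% Column i of Y is the 0/1 vector for y(i): picking column y(i) of the
% identity matrix puts a single 1 in row y(i).
I = eye(num_labels);
Y = I(:, y);             % 10 x 3

disp(Y(:, 1)');          % the 1 sits in position 5
```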

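③ BP (backpropagation) algorithm

The BP algorithm is used to compute the gradient of the neural network's cost function.

Computing the gradient amounts to computing partial derivatives. One way to obtain them is the traditional one: evaluate each partial derivative directly from its mathematical definition. This is the approach used by the "Gradient Checking" covered in Ng's course. But when there are very many input features (millions...) and the parameter matrices Θ are very large, this requires an enormous amount of computation (Ng also points out in the course that "Gradient Checking" must be turned off when actually training the network). This echoes a view once held by the neural-network pioneer Minsky (see the blog post 神經網絡淺講, "A Brief Introduction to Neural Networks"):

Minsky argued that once the number of computation layers is increased to two, the amount of computation becomes too large and there is no effective learning algorithm, so he considered research on deeper networks to be of no value.

The BP algorithm solves this problem of excessive computation. BP, also known as the backpropagation algorithm, starts at the output layer and works back toward the input layer, performing "a certain form" of computation that yields a set of "data", and this data turns out to be exactly the gradient we need.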

Once you have computed the gradient, you will be able to train the neural network 
by minimizing the cost function J(Θ) using an advanced optimizer such as fmincg.

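Derivative of the sigmoid function

As for why the sigmoid function is introduced at all, I will explain once I understand it more deeply (Ng's course also touches on it). The derivative of the sigmoid has a useful property: it can be expressed in terms of the sigmoid itself, g′(z) = g(z)(1 − g(z)), where g(z) = 1 / (1 + e^(−z)).

The proof is as follows (read f(x) in the proof as g(z)):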
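A short derivation of this identity, written directly in terms of g(z) and assuming the usual definition of the sigmoid, could run as follows:

```latex
g(z) = \frac{1}{1 + e^{-z}}
\quad\Longrightarrow\quad
g'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
      = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
      = g(z)\,\bigl(1 - g(z)\bigr),
\qquad\text{since}\qquad
1 - g(z) = \frac{e^{-z}}{1 + e^{-z}} .
```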

The Matlab implementation of the sigmoid derivative (sigmoidGradient.m) is as follows:

g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the gradient of the sigmoid function evaluated at
%               each value of z (z can be a matrix, vector or scalar).

g = sigmoid(z) .* (1 - sigmoid(z));
% =============================================================
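As a quick sanity check of this formula, the sketch below compares it with a central finite difference. Both functions are redefined inline as anonymous functions so the snippet runs on its own; the test points z and the step h are arbitrary choices.

```matlab
% Inline stand-ins for sigmoid.m and sigmoidGradient.m.
sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoidGradient = @(z) sigmoid(z) .* (1 - sigmoid(z));

z = [-1 -0.5 0 0.5 1];
g = sigmoidGradient(z);          % analytic derivative; equals 0.25 at z = 0

% Central finite difference as an independent estimate of the derivative.
h = 1e-5;
g_num = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h);

max_err = max(abs(g - g_num));
fprintf('g''(0) = %.4f, largest deviation from finite difference: %g\n', g(3), max_err);
```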

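The "symmetry" phenomenon when training a neural network: randomly initializing the parameter matrices (weight matrices Θ)

Random initialization assigns a random value to every element of the parameter matrix Θ^(L), usually drawn from the range [−ε, ε]. The exercise handout suggests choosing ε_init = √6 / √(L_in + L_out), where L_in and L_out are the numbers of units in the layers adjacent to Θ^(L); for Θ⁽¹⁾ here this gives about 0.12, the value used in the code below.

Suppose instead that every element of Θ^(L) is initialized to 0. Then, from a⁽²⁾ = g(Θ⁽¹⁾ * a⁽¹⁾), every element of a⁽²⁾ takes the same value (g(0) = 0.5), so all hidden units compute the same function and receive identical updates.

Random initialization breaks this symmetry, and keeping the initial values in a small range makes learning more efficient: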

This range of values ensures that the parameters are kept small and makes the learning more efficient
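To make the symmetry problem concrete, the sketch below runs one forward and backward pass on an all-zero network. The network is a made-up tiny one (3 inputs, 4 hidden units, 2 outputs) with one made-up training instance; the variable names are chosen only for this example.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

x  = [0.2; -0.7; 1.5];          % hypothetical input with 3 features
y  = [1; 0];                    % hypothetical 0/1 target for 2 classes
a1 = [1; x];                    % add bias unit, 4 x 1

Theta1 = zeros(4, 4);           % all-zero initialization
Theta2 = zeros(2, 5);

% Forward pass: every hidden unit outputs sigmoid(0) = 0.5.
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];
a3 = sigmoid(Theta2 * a2);

% Backward pass for this single instance.
d3 = a3 - y;
d2_full = Theta2' * d3;
d2 = d2_full(2:end) .* sigmoid(z2) .* (1 - sigmoid(z2));

grad2 = d3 * a2';               % 2 x 5; columns 2..5 are identical
grad1 = d2 * a1';               % 4 x 4; all zeros because Theta2 is zero

disp(a2');
disp(grad2);
```

Because every hidden unit has the same activation and gets the same gradient, a gradient step leaves the hidden units identical to one another, which is the symmetry that random initialization is meant to break.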

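The Matlab code for random initialization is shown below. It first calls the function defined in randInitializeWeights.m to initialize each matrix, and then unrolls initial_Theta1 and initial_Theta2 into a single column vector: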

initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);

% Unroll parameters
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];

randInitializeWeights.m is implemented as follows:

function W = randInitializeWeights(L_in, L_out)
%RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
%incoming connections and L_out outgoing connections
%   W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights 
%   of a layer with L_in incoming connections and L_out outgoing 
%   connections. 
%
%   Note that W should be set to a matrix of size(L_out, 1 + L_in) as
%   the first column of W handles the "bias" terms
%
% You need to return the following variables correctly 
W = zeros(L_out, 1 + L_in);

% ====================== YOUR CODE HERE ======================
% Instructions: Initialize W randomly so that we break the symmetry while
%               training the neural network.
%
% Note: The first column of W corresponds to the parameters for the bias units
%
epsilon_init = 0.12;
% rand gives values in (0, 1); scale and shift them into (-epsilon_init, epsilon_init)
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;

% =========================================================================

end


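The BP algorithm proceeds as follows:

For each training instance (x, y), first use forward propagation to compute the activations (a⁽²⁾, a⁽³⁾), and then compute a residual δ_j^(L) (error term) for each layer. Note that the input layer does not need a residual.

The residual formulas for each layer are given below (the network in this post has only 3 layers, with a single hidden layer).

For the output layer the residual is δ_k⁽³⁾ = a_k⁽³⁾ − y_k. Here the input layer is layer 1, the hidden layer is layer 2, and the output layer is layer 3.

The residual δ⁽³⁾ is a vector, and the subscript k denotes its k-th element. y_k is the k-th element of the result vector y described earlier. (Here the vector y is the 0/1 vector for a single instance, obtained by splitting up the label vector y of the training set.)

In vector form the subtraction above is δ⁽³⁾ = a⁽³⁾ − y, so δ⁽³⁾ has the same dimension as a⁽³⁾, which is the number of output classes. In this example that is 10, so δ⁽³⁾ is a 10*1 vector.

For the hidden layer the residual is δ⁽²⁾ = (Θ⁽²⁾)^T δ⁽³⁾ .* g′(z⁽²⁾), where the element corresponding to the bias unit is dropped.

Once the residual of each layer has been computed, the Δ (delta) matrices can be updated. Each Δ matrix has the same dimensions as the corresponding parameter matrix, and all of its elements are initially 0.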

% nnCostFunction.m
Theta1_grad = zeros(size(Theta1)); % 25*401, same size as Θ⁽¹⁾; accumulates Δ⁽¹⁾
Theta2_grad = zeros(size(Theta2)); % 10*26, same size as Θ⁽²⁾; accumulates Δ⁽²⁾

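It is defined (computed) as Δ^(L) = Δ^(L) + δ^(L+1) * (a^(L))^T.

Here δ^(L+1) is a column vector and (a^(L))^T is a row vector, so their product is a matrix.

Once the Δ matrices have been computed, they give the partial derivatives of the cost function: D_ij^(L) = (1/m) Δ_ij^(L) when regularization is ignored.

The Matlab code for one complete run of the BP algorithm, processing one training instance per loop iteration, is as follows: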

for i = 1:m
    a1 = X(i, :)'; % the i-th input, 401*1 (X already has the bias column prepended)
    z2 = Theta1 * a1;
    a2 = sigmoid(z2); % hidden-layer activations, 25*1
    a2 = [1; a2]; % add bias unit, a2's size is 26*1
    z3 = Theta2 * a2;
    a3 = sigmoid(z3); % h_theta(x)

    error_3 = a3 - Y(:, i); % residual of the last (third) layer, 10*1
    %error_2 = ( Theta2' * error_3 ) .*  ( a2 .* (1 - a2) );% g'(z2)=g(z2)*(1-g(z2)), 26*1

    err_2 = Theta2' * error_3; % 26*1
    error_2 = err_2(2:end) .* sigmoidGradient(z2); % drop the error unit of the bias unit, 25*1

    Theta2_grad = Theta2_grad + error_3 * a2'; % Δ^(L) = Δ^(L) + δ^(L+1) * (a^(L))^T
    Theta1_grad = Theta1_grad + error_2 * a1';
end

Theta2_grad = Theta2_grad / m; % video 9-2 backpropagation algorithm the 11 th minute
Theta1_grad = Theta1_grad / m; % the result here is D_i,j^(L)

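④ Gradient checking

The idea behind gradient checking is this: BP gives us the derivatives of the cost function in a clever way, but are they actually correct? To find out, we can compute the derivatives from the definition of the derivative in calculus (the limit definition) and then compare the two: the derivatives obtained with BP and the derivatives obtained from the definition.

The definition of the derivative (limit definition), stated informally and approximated with a small ε: ∂J/∂θ ≈ (J(θ + ε) − J(θ − ε)) / (2ε).

Presumably it is because computing derivatives directly from the definition is so expensive that the course video says to remember to turn off gradient checking for the actual training run.

The gradient checking results show that the two computations give almost identical values, so the BP implementation is working correctly.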
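The comparison described above can be sketched on a deliberately tiny network so that the numerical gradient is cheap to compute. Everything in this snippet is an assumption made for illustration: the layer sizes (3, 4, 2), the deterministic sin/cos data and weights, ε = 1e-4, and the vectorized form of backpropagation. It is not the course's checkNNGradients.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Tiny hypothetical network and data set.
n_in = 3; n_hid = 4; K = 2; m = 5;
X  = reshape(sin(1:m*n_in), m, n_in);          % m x n_in
y  = 1 + mod((1:m)', K);                       % labels in 1..K
I  = eye(K);
Y  = I(:, y);                                  % K x m, one 0/1 column per instance
A1 = [ones(1, m); X'];                         % inputs with bias row, (n_in+1) x m

Theta1 = reshape(sin(1:n_hid*(n_in+1)), n_hid, n_in+1) / 10;
Theta2 = reshape(cos(1:K*(n_hid+1)), K, n_hid+1) / 10;
theta  = [Theta1(:); Theta2(:)];               % unrolled parameters
split  = n_hid * (n_in + 1);

% Unregularized cost as a function of the unrolled parameter vector.
hyp  = @(T1, T2) sigmoid(T2 * [ones(1, m); sigmoid(T1 * A1)]);
xent = @(H) sum(sum(-Y .* log(H) - (1 - Y) .* log(1 - H))) / m;
cost = @(t) xent(hyp(reshape(t(1:split), n_hid, n_in+1), ...
                     reshape(t(split+1:end), K, n_hid+1)));

% Gradient from backpropagation (all instances at once).
Z2 = Theta1 * A1;
A2 = [ones(1, m); sigmoid(Z2)];
A3 = sigmoid(Theta2 * A2);
D3 = A3 - Y;
D2 = (Theta2(:, 2:end)' * D3) .* sigmoid(Z2) .* (1 - sigmoid(Z2));
grad = [reshape(D2 * A1' / m, [], 1); reshape(D3 * A2' / m, [], 1)];

% Gradient from the two-sided difference, one parameter at a time.
epsilon = 1e-4;
numgrad = zeros(size(theta));
for p = 1:numel(theta)
    step = zeros(size(theta));
    step(p) = epsilon;
    numgrad(p) = (cost(theta + step) - cost(theta - step)) / (2 * epsilon);
end

rel_diff = norm(numgrad - grad) / norm(numgrad + grad);
fprintf('Relative difference: %g\n', rel_diff);
```

The loop evaluates the full cost twice per parameter, which is why this check is only practical on a small network and should be switched off for the real training run.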

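⑤ Regularizing the neural network

A neural network is highly expressive and therefore prone to overfitting, so regularization is usually needed. Regularization simply adds a regularization term to the cost function. Note that the weights of the bias units are not regularized.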
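Written out for the three-layer network in this post, the regularization term would look as follows. This is a sketch consistent with the reg term and the gradient update in nnCostFunction.m later in this post; J_unreg is a name introduced here for the unregularized cost, and the column index k starts at 1 so that the bias column (k = 0) is skipped.

```latex
J(\Theta) = J_{\text{unreg}}(\Theta)
  + \frac{\lambda}{2m} \left[
      \sum_{j=1}^{25} \sum_{k=1}^{400} \left( \Theta^{(1)}_{j,k} \right)^{2}
    + \sum_{j=1}^{10} \sum_{k=1}^{25}  \left( \Theta^{(2)}_{j,k} \right)^{2}
    \right]

\frac{\partial J}{\partial \Theta^{(l)}_{j,k}} =
\begin{cases}
  \dfrac{1}{m} \Delta^{(l)}_{j,k} & k = 0 \\[2ex]
  \dfrac{1}{m} \Delta^{(l)}_{j,k} + \dfrac{\lambda}{m} \Theta^{(l)}_{j,k} & k \ge 1
\end{cases}
```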

With lambda == 1 and MaxIter == 50 training iterations, the results are as follows: the cost goes from 3.29... at the start to 0.52... at the end.

Accuracy on the training set: Training Set Accuracy: 94.820000

Training Neural Network... 
Iteration     1 | Cost: 3.295180e+00
Iteration     2 | Cost: 3.250966e+00
Iteration     3 | Cost: 3.216955e+00
Iteration     4 | Cost: 2.884544e+00
Iteration     5 | Cost: 2.746602e+00
Iteration     6 | Cost: 2.429900e+00
.....
.....
.....
Iteration    46 | Cost: 5.428769e-01
Iteration    47 | Cost: 5.363841e-01
Iteration    48 | Cost: 5.332370e-01
Iteration    49 | Cost: 5.302586e-01
Iteration    50 | Cost: 5.202410e-01

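A neural network overfits easily. For example, with lambda set to 0.1 and MaxIter set to 200, the training results are as follows: the training accuracy reaches 99.94%, which is quite possibly overfitting (training accuracy alone cannot confirm this; that would take a held-out validation set).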

Training Neural Network... 
Iteration     1 | Cost: 3.303119e+00
Iteration     2 | Cost: 3.241696e+00
Iteration     3 | Cost: 3.220572e+00
Iteration     4 | Cost: 2.637648e+00
Iteration     5 | Cost: 2.182911e+00
....
......
.........
Iteration   197 | Cost: 8.177972e-02
Iteration   198 | Cost: 8.171843e-02
Iteration   199 | Cost: 8.169971e-02
Iteration   200 | Cost: 8.165209e-02

Training Set Accuracy: 99.940000

⑥ Using the fmincg function (supplied with the exercise) in Matlab to obtain the final parameter matrices Θ

% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, X, y, lambda);

% Now, costFunction is a function that takes in only one argument (the
% neural network parameters)
options = optimset('MaxIter', 50); % iteration limit; 50 matches the first run reported above
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

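The last line of the code, fmincg(costFunction, initial_nn_params, options), returns the learned neural network parameters nn_params. initial_nn_params is the randomly initialized parameter vector described earlier.

The complete Matlab code for the cost function, the gradient, and regularization is in the file nnCostFunction.m: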

function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, input_layer_size, hidden_layer_size, ...
%   num_labels, X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices. 
% 
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);
         
% You need to return the following variables correctly 
J = 0;
Theta1_grad = zeros(size(Theta1));% Theta1_grad is a 25*401 matrix
Theta2_grad = zeros(size(Theta2));% Theta2_grad is a 10*26 matrix

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
%               following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m
X = [ones(m,1) X]; %5000*401
a_super2 = sigmoid(Theta1 * X'); % attention a_super2 is a 25*5000 matrix
a_super2 = [ones(1,m);a_super2]; %add each bias unit for a_superscript2, 26 * 5000

% attention a_super3 is a 10 * 5000 matrix, each column is a predict value
a_super3 = sigmoid(Theta2 * a_super2);%10*5000

a3 = 1 - a_super3;%10*5000

% Convert the 5000-element label vector y into a matrix Y whose elements are only 0 or 1
Y = zeros(num_labels, m); %10*5000, each column is a label result
for i = 1:num_labels
    Y(i, y==i)=1;
end

Y1 = 1 - Y;
res1 = 0;
res2 = 0;
for j = 1:m
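    % Multiply the j-th columns of the two matrices element-wise and sum the result.
    % The product of the prediction and the matching label element is the cost of one input x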
    tmp1 = sum( log(a_super3(:,j)) .* Y(:,j) ); 
    res1 = res1 + tmp1; % m 列之和
    tmp2 = sum( log(a3(:,j)) .* Y1(:,j) );
    res2 = res2 + tmp2;
end
J = (-res1 - res2) / m;

%
% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a 
%               binary vector of 1's and 0's to be used with the neural network
%               cost function.
%
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the 
%               first time.
%

for i = 1:m
    a1 = X(i, :)'; % the i-th input, 401*1 (the bias column was already added in Part 1)
    z2 = Theta1 *  a1;
    a2 =  sigmoid( z2 ); % hidden-layer activations, 25*1
    a2 = [ 1; a2 ];% add bias unit, a2's size is 26 * 1
    z3 = Theta2 * a2;
    a3 = sigmoid( z3 ); % h_theta(x)
    
    error_3 = a3 - Y( :, i ); % last layer's error, 10*1
    %error_2 = ( Theta2' * error_3 ) .*  ( a2 .* (1 - a2) );% g'(z2)=g(z2)*(1-g(z2)), 26*1
    
    err_2 =  Theta2' * error_3; % 26*1
    error_2 = ( err_2(2:end) ) .*  sigmoidGradient(z2);% drop the error unit of the bias unit, 25*1
    
    Theta2_grad = Theta2_grad + error_3 * a2';
    Theta1_grad = Theta1_grad + error_2 * a1';
end

Theta2_grad = Theta2_grad / m; % video 9-2 backpropagation algorithm the 11 th minute
Theta1_grad = Theta1_grad / m;

% Part 3: Implement regularization with the cost function and gradients.
%
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%

% reg for cost function J,  ex4.pdf page 6 
Theta1_tmp = Theta1(:, 2:end).^2;
Theta2_tmp = Theta2(:, 2:end).^2;
reg = lambda / (2*m) * ( sum( Theta1_tmp(:) ) + sum( Theta2_tmp(:) ) );

J = (-res1 - res2) / m + reg;
% -------------------------------------------------------------

% reg for bp, ex4.pdf materials page 11
Theta1(:,1) = 0;
Theta2(:,1) = 0;

Theta1_grad = Theta1_grad + lambda / m * Theta1;
Theta2_grad = Theta2_grad + lambda / m * Theta2;
% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end


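⑦ Visualizing the neural network

A multi-layer neural network has many layers, and the more layers it has, the deeper the features it can represent and the stronger its ability to approximate functions. Each layer is a more "abstract" (deeper) representation of the layer before it.

Take the 0-9 digit recognition in this post: there is only one hidden layer (in general the first layer is called the input layer, the last layer the output layer, and every layer in between a hidden layer). The hidden layer learns only some edge features of the digits, and the output layer, building on the hidden layer, is then essentially able to recognize the digits.

A visualization of the hidden layer:

A visualization of the output layer "might" look like this:

Original article: http://www.cnblogs.com/hapjin/p/6106182.html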
