Neural Network Basics: Gradient Descent and the BP (Backpropagation) Algorithm

This article is a repost (2018-12-28 16:50). Original article:

https://blog.csdn.net/weixin_38206214/article/details/81143894

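I am working through deep learning from the ground up, one technique at a time. I am a DL beginner and am keeping notes on what I read, so feel free to discuss. This article is based on Andrew Ng's Deep Learning course on Coursera, an excellent course that I recommend.
This article assumes you already have a rough idea of the basic concepts of deep learning. If you need a gentler introduction, see Andrew Ng's introductory course: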
http://study.163.com/courses-search?keyword=%E5%90%B4%E6%81%A9%E8%BE%BE#/?ot=5
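Please credit the source when reposting; beyond that, do as you like.
1. Introduction
The previous article covered some basics of neural networks, but not enough to actually build anything. How do we train a neural network? How exactly are the computations carried out? Can hidden layers be added, and how many are appropriate? This article takes up these questions.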

2. Forward Computation in a Neural Network

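The previous article introduced the structure of a neural network. Because the layers are fully connected, computing every product one scalar multiplication at a time means that training a deep network takes millions of individual operations. Instead, we can vectorize: stack all the parameters into matrices and compute with matrix operations. The network from the previous article is reproduced in the figure above.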

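In the figure above, the computation at each hidden-layer neuron has two parts: computing z and computing a.
Pay attention to the shapes of the parameter matrices between layers.
Between the input layer and the hidden layer:
w[1].shape = (4, 3): 4 is the number of hidden-layer neurons, 3 is the number of input-layer neurons;
b[1].shape = (4, 1): 4 is the number of hidden-layer neurons. The 1 is nothing to worry about: NumPy's broadcasting mechanism copies b to whatever shape the matrix addition needs;
Between the hidden layer and the output layer:
w[2].shape = (1, 4): 1 is the number of output-layer neurons, 4 is the number of hidden-layer neurons;
b[2].shape = (1, 1): 1 is the number of output-layer neurons, and the second 1 can be expanded by broadcasting.
From this we can read off the rule for w. For any two adjacent layers, treat the earlier layer as the input and the later layer as the output. The weight matrix w between them has shape (n_out, n_in) and the bias b has shape (n_out, 1), where n_in and n_out are the numbers of neurons in the earlier and later layer.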
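The shape rule can be checked with a few lines of NumPy. This is a minimal sketch rather than code from the original article: the layer sizes (3, 4, 1) come from the text, while the variable names, the random values, the number of examples and the use of tanh are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 3, 4, 1   # layer sizes used in the text
m = 5                             # number of examples (arbitrary choice)

W1 = rng.standard_normal((n_hidden, n_in))    # (n_out, n_in) for layer 1
b1 = np.zeros((n_hidden, 1))                  # (n_out, 1) for layer 1
W2 = rng.standard_normal((n_out, n_hidden))   # (n_out, n_in) for layer 2
b2 = np.zeros((n_out, 1))                     # (n_out, 1) for layer 2

X = rng.standard_normal((n_in, m))            # one example per column

# b1 has a single column; broadcasting adds it to every column of W1 @ X.
Z1 = W1 @ X + b1
Z2 = W2 @ np.tanh(Z1) + b2

print(W1.shape, b1.shape, Z1.shape)  # (4, 3) (4, 1) (4, 5)
print(W2.shape, b2.shape, Z2.shape)  # (1, 4) (1, 1) (1, 5)
```

The bias keeps its (n_out, 1) shape no matter how many columns the input has, which is why the second dimension of b never needs to be adjusted by hand.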
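Now let us compute the output value in vectorized form:

With the matrix form, only the four formulas on the right-hand side of the corresponding figure need to be implemented to obtain a[2], which is our output value yhat.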
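The figure with the four formulas is missing from this copy. Assuming they are the usual forward-pass equations for a network with one hidden layer, with g^{[1]} and g^{[2]} standing for the activation functions of the two layers (notation assumed here, not taken from the figure), they read:

```latex
\begin{aligned}
z^{[1]} &= W^{[1]} x + b^{[1]} \\
a^{[1]} &= g^{[1]}\!\left(z^{[1]}\right) \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \\
a^{[2]} &= g^{[2]}\!\left(z^{[2]}\right) = \hat{y}
\end{aligned}
```

In the code later in this article, g^{[1]} is tanh and g^{[2]} is the sigmoid function.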
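3. Vectorizing the Neural Network
Vectorizing the parameters simplifies the computation for a single training example. The same idea applies across the m training examples, because the computation is identical for every example. Following the earlier approach, we could process the m examples with a for loop:
for i in range(m):
    ...  # one training pass on example i
However, an explicit for loop like this is expensive in Python. We can vectorize once more and process all m examples in one go, which speeds up the computation.
Vectorization works as follows: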

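Above, [l] denotes the layer index and (i) denotes the example index. For now, assume b = 0.
All m training examples go through the same computation, so we can stack them as the columns of a matrix X of shape (n_x, m), where n_x is the number of features per example and m is the number of training examples.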
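The equivalence between the loop and the stacked matrix can be checked directly. The sketch below is illustrative only: the sizes and random values are arbitrary assumptions, and b is taken to be 0 as in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 6

W1 = rng.standard_normal((n_h, n_x))
X = rng.standard_normal((n_x, m))      # columns are the examples x^(1) ... x^(m)

# Loop version: one matrix-vector product per example (b = 0 as in the text).
Z_loop = np.zeros((n_h, m))
for i in range(m):
    Z_loop[:, i] = W1 @ X[:, i]

# Vectorized version: a single matrix-matrix product handles all m columns.
Z_vec = W1 @ X

print(np.allclose(Z_loop, Z_vec))  # True
```

Column i of the result is exactly the z vector of example i, so nothing is lost by computing all columns at once.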
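4. The Backpropagation Algorithm
Once forward computation is in place, we can evaluate the loss function and the cost function to measure how well the network performs. We can then run backpropagation (backward prop) to update the parameters so that the model's predictions move closer to the values we want. Gradient descent is one method for optimizing w and b.
A simple view of gradient descent
We start with a simple example to explain what gradient descent is:

Consider a simple neural network (arguably "neuron" is the better word). Its loss function is computed as: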
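The formula image is missing here. Assuming the neuron is the logistic-regression unit from Andrew Ng's course, with two inputs x_1, x_2 and a sigmoid activation σ (an assumption, although it agrees with the cross-entropy cost used in the code later in this article), the forward computation and the loss are:

```latex
\begin{aligned}
z &= w_1 x_1 + w_2 x_2 + b \\
a &= \sigma(z) = \frac{1}{1 + e^{-z}} \\
L(a, y) &= -\bigl(y \log a + (1 - y)\log(1 - a)\bigr)
\end{aligned}
```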

We turn the formulas above into the following computation graph:

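We now want to optimize w1, w2 and b so that L(a, y) is as small as possible. To do so, we take the partial derivative of L with respect to each of them and use those derivatives to update w1, w2 and b. Because L(a, y) is a convex function of the parameters in this single-neuron model, the step-by-step updates move gradually toward the global optimum.
The computation proceeds as follows.
First, compute the derivatives da and dz:

Then compute the derivatives with respect to w1, w2 and b: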
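The derivative formulas were images as well. Assuming a sigmoid neuron with z = w_1 x_1 + w_2 x_2 + b, a = σ(z), the cross-entropy loss L(a, y) = -(y log a + (1 - y) log(1 - a)), and the course's shorthand dv for ∂L/∂v, the chain rule gives:

```latex
\begin{aligned}
da &= \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} \\
dz &= da \cdot \sigma'(z) = da \cdot a(1 - a) = a - y \\
dw_1 &= x_1 \, dz, \qquad dw_2 = x_2 \, dz, \qquad db = dz
\end{aligned}
```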

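Next, update the parameters with gradient descent:

w1 := w1 - α * dw1,  w2 := w2 - α * dw2,  b := b - α * db

Here α is the learning rate, which can be thought of as the step size: how far we move toward the optimum on each update. If the learning rate is too large, we may oscillate back and forth around the optimum without ever reaching it. If it is too small, we may need a great many iterations to get there. Choosing a suitable learning rate therefore matters.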
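A single gradient-descent step for one example can be sketched as follows. This is an illustration under stated assumptions, not the article's code: the neuron is assumed to be a sigmoid unit with cross-entropy loss, and the input values, starting weights and learning rate are arbitrary. A finite-difference check is included to confirm that the analytic gradient is right.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def gradients(w1, w2, b, x1, x2, y):
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y                      # dL/dz for sigmoid + cross-entropy
    return x1 * dz, x2 * dz, dz     # dw1, dw2, db

# One illustrative example and starting point (values chosen arbitrarily).
x1, x2, y = 1.5, -0.5, 1.0
w1, w2, b = 0.1, -0.2, 0.0
alpha = 0.1                         # learning rate

dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)

# Finite-difference approximation of dL/dw1, to compare with the analytic dw1.
eps = 1e-6
dw1_numeric = (loss(w1 + eps, w2, b, x1, x2, y)
               - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

loss_before = loss(w1, w2, b, x1, x2, y)
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
loss_after = loss(w1, w2, b, x1, x2, y)

print(abs(dw1 - dw1_numeric) < 1e-6)   # analytic and numeric gradients agree
print(loss_after < loss_before)        # one small step lowers the loss
```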
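Next, we give the cost function over m examples:

With m examples, the partial derivatives of the cost function with respect to w and b can be written as the average of the per-example partial derivatives: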
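The formulas for the cost and its averaged derivatives are missing from this copy. Assuming a per-example loss L, with the superscript (i) marking the i-th example and dw_1^{(i)}, dw_2^{(i)}, db^{(i)} denoting the derivatives of that example's loss, they can be written as:

```latex
\begin{aligned}
J(w, b) &= \frac{1}{m} \sum_{i=1}^{m} L\bigl(a^{(i)}, y^{(i)}\bigr) \\
\frac{\partial J}{\partial w_1} &= \frac{1}{m} \sum_{i=1}^{m} dw_1^{(i)}, \qquad
\frac{\partial J}{\partial w_2} = \frac{1}{m} \sum_{i=1}^{m} dw_2^{(i)}, \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} db^{(i)}
\end{aligned}
```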

Then, just as in the single-example case, we update w1, w2 and b for the next round of training:

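Andrew Ng's course gives two animations that illustrate how the learning rate affects gradient descent.
When the learning rate is small or well chosen, the process looks like the following, and the model eventually reaches the optimum.

When the learning rate is set too high, the steps are so large that we end up oscillating around the final result. This wastes time and may even keep us from ever reaching the optimum, as shown below: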
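The same effect can be seen on a toy problem. The sketch below is an illustration only and is not taken from the course: it minimizes f(w) = w^2, whose gradient is 2w, starting from w = 1 with three hypothetical learning rates.

```python
def descend(alpha, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the visited points."""
    path = [w]
    for _ in range(steps):
        w = w - alpha * 2 * w      # gradient of w**2 is 2*w
        path.append(w)
    return path

small = descend(alpha=0.1)   # each step multiplies w by 0.8: smooth convergence
large = descend(alpha=0.9)   # each step multiplies w by -0.8: sign flips every step
huge = descend(alpha=1.1)    # each step multiplies w by -1.2: oscillates and diverges

print(abs(small[-1]), abs(large[-1]), abs(huge[-1]))
```

With alpha = 0.9 the iterate jumps from one side of the minimum to the other on every step, which is the oscillation described above; with alpha = 1.1 each jump overshoots by more than the last, so the optimum is never reached.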

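Gradient descent for a shallow neural network
Now let us return to the first example in this article:

We continue the discussion of gradient descent with this network. First, here are the derivative formulas for a single example:

The figure above states the derivatives directly. I work out dz[2] by hand below; the other results can be derived in the same way: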
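The handwritten derivation is not available in this copy. Assuming a sigmoid output unit and the cross-entropy loss, as in the code later in the article, a derivation along the same lines would be:

```latex
\begin{aligned}
L &= -\bigl(y \log a^{[2]} + (1 - y)\log(1 - a^{[2]})\bigr), \qquad a^{[2]} = \sigma\bigl(z^{[2]}\bigr) \\
da^{[2]} &= \frac{\partial L}{\partial a^{[2]}} = -\frac{y}{a^{[2]}} + \frac{1 - y}{1 - a^{[2]}} \\
\frac{\partial a^{[2]}}{\partial z^{[2]}} &= a^{[2]}\bigl(1 - a^{[2]}\bigr) \\
dz^{[2]} &= da^{[2]} \cdot a^{[2]}\bigl(1 - a^{[2]}\bigr)
          = -y\bigl(1 - a^{[2]}\bigr) + (1 - y)\,a^{[2]} = a^{[2]} - y
\end{aligned}
```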

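(My handwriting is not pretty, so bear with it, or work through the calculation yourself.) The computation as a whole is not hard; the core idea is the chain rule.
Next, here are the vectorized derivatives: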

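The difference from the single-example case is that the gradients of w and b are averages of the per-example partial derivatives. With them we can update the parameters and complete one iteration.
In summary, backpropagation is defined relative to forward propagation. During training, forward propagation computes the current model's predictions and the resulting cost. Backpropagation then computes the partial derivative of that cost with respect to every parameter, and gradient descent uses those derivatives to update the parameters so that the model predicts more accurately.
5. Neural Network Code and Filling in the Gaps
This is more or less my first original article. Because the explanation is split across two articles, it seems worth tying all the points together with code.
We use a simple binary classification problem to walk through how a neural network is built:
Step 0: Dataset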

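As the figure above shows, in this example we use a simple neural network to partition the region in the picture. The horizontal and vertical axes are the features x1 and x2. The color of each point is its label y: blue is 1 and red is 0. Our goal is to predict a point's color (y) from its coordinates (x1, x2).
Neural network model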

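We choose the network shown in the figure above, with tanh as the activation function in the hidden layer and sigmoid as the activation function in the output layer.
For each training example, the computation is:

The final cost function is: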
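Both formula images are missing. The expressions below are read off from the forward_propagation and compute_cost code later in this section, so they should match what that code computes; the notation itself is an assumption.

```latex
\begin{aligned}
z^{[1](i)} &= W^{[1]} x^{(i)} + b^{[1]} \\
a^{[1](i)} &= \tanh\bigl(z^{[1](i)}\bigr) \\
z^{[2](i)} &= W^{[2]} a^{[1](i)} + b^{[2]} \\
\hat{y}^{(i)} = a^{[2](i)} &= \sigma\bigl(z^{[2](i)}\bigr) \\
J &= -\frac{1}{m} \sum_{i=1}^{m} \Bigl( y^{(i)} \log a^{[2](i)} + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - a^{[2](i)}\bigr) \Bigr)
\end{aligned}
```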

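Here is the procedure for building the neural network:
1. Define the network structure (number of input units, number of hidden units, and so on)
2. Initialize the model parameters
3. Loop (for a given number of iterations):
   - implement forward propagation
   - compute the loss
   - implement backpropagation to obtain the gradients
   - update the parameters (gradient descent)
Finally, merge steps 1-3 into a single model. Once the model has learned suitable parameters (training is complete), it can make predictions on new data.
Step 1: Define the network structure
X is the input, of shape (number of features, number of examples);
Y is the output, of shape (output size, number of examples). Every x has a corresponding y, so the number of examples is the same;
n_x is the size of the input layer;
n_h is the size of the hidden layer, set to 4;
n_y is the size of the output layer.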
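def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)

    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[0]  # size of input layer
    n_h = 4           # size of hidden layer (fixed at 4 in this example)
    n_y = Y.shape[0]  # size of output layer
    return (n_x, n_h, n_y)

Step 2: Initialize the model parameters
We use np.random.randn(a, b) * 0.01 to initialize the weights w;
we use np.zeros((a, b)) to initialize the biases b.
The weights w must not be initialized to 0. If they are, every hidden unit performs exactly the same computation and receives exactly the same gradient, so the units stay identical throughout training and the hidden layer is no better than a single unit. Drawing w from randn() breaks this symmetry, and scaling by 0.01 keeps the values small, so tanh and sigmoid start out in the region where their gradients are not close to zero.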
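The symmetry problem can be observed numerically. The sketch below is a minimal illustration, assuming the same 2-4-1 architecture (tanh hidden layer, sigmoid output, cross-entropy cost) with biases fixed at 0; the helper name weight_gradients and the random data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_gradients(W1, W2, X, Y):
    """Gradients of the cross-entropy cost w.r.t. W1 and W2 (biases set to 0)."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)
    A2 = sigmoid(W2 @ A1)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))                 # 2 features, 8 examples
Y = (rng.random((1, 8)) > 0.5).astype(float)    # random 0/1 labels

# All-zero weights: every weight gradient is exactly zero, so W never changes.
dW1_zero, dW2_zero = weight_gradients(np.zeros((4, 2)), np.zeros((1, 4)), X, Y)

# Small random weights: the hidden units receive non-zero gradients.
W1 = rng.standard_normal((4, 2)) * 0.01
W2 = rng.standard_normal((1, 4)) * 0.01
dW1_rand, dW2_rand = weight_gradients(W1, W2, X, Y)

print(np.all(dW1_zero == 0), np.all(dW2_zero == 0))
print(np.any(dW1_rand != 0))
```

With all-zero weights the hidden activations are tanh(0) = 0, so no gradient signal reaches W1 or W2 and gradient descent can only adjust the output bias.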
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    params -- python dictionary containing your parameters:
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    # Fix the seed so that the output is reproducible although the
    # initialization is random.
    np.random.seed(2)

    # Small random weights (scaled by 0.01, as described above); zero biases.
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    assert W1.shape == (n_h, n_x)
    assert b1.shape == (n_h, 1)
    assert W2.shape == (n_y, n_h)
    assert b2.shape == (n_y, 1)

    parameters = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

    return parameters

Step 3: Implement the loop
First we need to implement forward propagation (forward prop).
We read the initialized parameters w and b from the parameters dict. During forward propagation we store z and a in a cache so that backpropagation can reuse them.
def sigmoid(z):
    # Logistic function. The original listing calls sigmoid() without
    # defining it; this is the standard definition.
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters
                  (output of initialization function)

    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Implement forward propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    assert A2.shape == (1, X.shape[1])

    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

    return A2, cache

Next we compute the cost function.
The cost can be computed from A2 and Y alone.
def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost.

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2
                  (not used here; kept so the call signature stays the same)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]  # number of examples

    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), (1 - Y))
    cost = -(1.0 / m) * np.sum(logprobs)

    # Make sure cost is a plain Python float, e.g. turn [[17]] into 17.0
    cost = float(np.squeeze(cost))

    assert isinstance(cost, float)

    return cost

Next we compute backpropagation (backward prop).
We store the derivatives in a dictionary (grads).
The formulas are as follows:
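The formula image is missing. The equations below are transcribed from the backward_propagation code that follows, so they describe what that code computes; the symbol * denotes element-wise multiplication, and the sums run over the m columns (examples).

```latex
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m}\, dZ^{[2]} A^{[1]T} \\
db^{[2]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)} \\
dZ^{[1]} &= W^{[2]T} dZ^{[2]} * \bigl(1 - (A^{[1]})^{2}\bigr) \\
dW^{[1]} &= \frac{1}{m}\, dZ^{[1]} X^{T} \\
db^{[1]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}
\end{aligned}
```

The factor 1 - (A^{[1]})^2 is the derivative of tanh, the hidden-layer activation.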

The code is as follows:
def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the formulas above.

    Arguments:
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing your gradients with respect
             to different parameters
    """
    m = X.shape[1]

    # Retrieve W1 and W2 from the dictionary "parameters"
    W1 = parameters["W1"]
    W2 = parameters["W2"]

    # Retrieve A1 and A2 from the dictionary "cache"
    A1 = cache["A1"]
    A2 = cache["A2"]

    # Backward propagation: calculate dW1, db1, dW2, db2
    dZ2 = A2 - Y
    dW2 = 1.0 / m * np.dot(dZ2, A1.T)
    db2 = 1.0 / m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))  # 1 - A1^2 is tanh'
    dW1 = 1.0 / m * np.dot(dZ1, X.T)
    db1 = 1.0 / m * np.sum(dZ1, axis=1, keepdims=True)

    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

    return grads

Next we update the parameters, which completes one pass of the loop.
The learning rate is set to 1.2. We read the parameters and the derivatives from the parameters and grads dicts, and store the updated parameters back in parameters.
def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Updates parameters using the gradient descent update rule.

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    # Update rule for each parameter
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2

    parameters = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

    return parameters

Step 4: Assemble the model
Next we combine the steps above into a single model, which is our neural network.
We set the number of training iterations (num_iterations) to 10000 and print the cost every 1000 iterations.
def nn_model(X, Y, n_h, num_iterations=10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]

    # Initialize parameters. Inputs: n_x, n_h, n_y. Output: parameters dict.
    parameters = initialize_parameters(n_x, n_h, n_y)

    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation. Inputs: X, parameters. Outputs: A2, cache.
        A2, cache = forward_propagation(X, parameters)

        # Cost function. Inputs: A2, Y, parameters. Output: cost.
        cost = compute_cost(A2, Y, parameters)

        # Backpropagation. Inputs: parameters, cache, X, Y. Output: grads.
        grads = backward_propagation(parameters, cache, X, Y)

        # Gradient descent parameter update. Inputs: parameters, grads.
        parameters = update_parameters(parameters, grads, learning_rate=1.2)

        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters

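In the figure above, the cost is lower at every 1000-iteration checkpoint, which shows that gradient descent is working.
Step 5: Prediction function
Finally, we close the article with a prediction function.
We feed the test data into the model to get the prediction A2. If A2 > 0.5, the model assigns more than 50% probability to the point being blue; otherwise the point is predicted to be red.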
def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X.

    Arguments:
    parameters -- python dictionary containing your parameters
    X -- input data of size (n_x, m)

    Returns:
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """
    # Compute probabilities with forward propagation, then classify
    # to 0/1 using 0.5 as the threshold.
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)

    return predictions

Finally, the trained model partitions the original dataset into the regions shown in the figure.
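One common way to put a number on how well such a partition matches the labels is accuracy, the fraction of points whose predicted class equals the true label. The sketch below uses hypothetical prediction and label arrays of shape (1, m) purely for illustration; it is not a result from the article's dataset.

```python
import numpy as np

# Hypothetical predictions (as returned by a thresholding step) and true labels.
predictions = np.array([[True, False, True, True, False]])
Y = np.array([[1, 0, 0, 1, 0]])

# A prediction is correct when the boolean matches the 0/1 label.
accuracy = np.mean(predictions == Y.astype(bool))

print(f"Accuracy: {accuracy * 100:.0f}%")  # Accuracy: 80%
```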

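In summary, this article shows how an MLP, or neural network, is put together. Forward propagation computes a prediction; backpropagation computes the derivatives in the reverse direction, and gradient descent uses them to optimize the model parameters so that the model predicts the sample values more accurately.
Author: Dominic221. Source: CSDN. Original: https://blog.csdn.net/weixin_38206214/article/details/81143894. Copyright notice: this is the blogger's original article; please include a link to the original post when reposting.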
