I have recently been following Andrew Ng's deep learning course. While studying shallow (two-layer) neural networks, I ran into some confusion deriving the backpropagation formulas, and I could not find a systematic derivation online. After studying some matrix-calculus techniques I finally worked it out. Let us start from the simplest case: logistic regression (a single-layer neural network).
Gradient descent in logistic regression
Logistic regression with a single training sample
The input training sample is \(x\) and the network parameters are \(w\) and \(b\), where \(x\) is a column vector of dimension \((n_0,1)\), \(w\) is a row vector of dimension \((1,n_0)\), and \(b\) is a scalar. The network output is $$a = \sigma(z),\quad z = wx + b$$ where \(\sigma(\cdot)\) is the sigmoid function, defined as $$\sigma(x) = \frac{1}{1+e^{-x}}$$
The loss function of the network is defined as $$l(a) = -(y\log a+(1-y)\log(1-a))$$
where \(y\) is the training label; for logistic regression \(y \in \{0,1\}\).
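To pin down the shapes, here is a minimal numpy sketch of this forward pass and loss; the dimension \(n_0\) and all concrete values are arbitrary assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n0 = 4                            # input dimension (arbitrary choice)
rng = np.random.default_rng(0)
x = rng.normal(size=(n0, 1))      # input, column vector (n0, 1)
w = rng.normal(size=(1, n0))      # weights, row vector (1, n0)
b = 0.1                           # bias, scalar
y = 1.0                           # label, 0 or 1

z = w @ x + b                     # (1, 1)
a = sigmoid(z)                    # (1, 1)
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a)).item()
print(loss)
```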
1. First solve \(\frac{\partial l}{\partial z}\):
Since \(a\) and \(z\) are scalars, the ordinary chain rule applies, and \(\sigma^{'}(z) = \sigma(z)(1-\sigma(z)) = a(1-a)\), so $$\frac{\partial l}{\partial z} = \frac{\partial l}{\partial a}\frac{\partial a}{\partial z} = \left(-\frac{y}{a}+\frac{1-y}{1-a}\right)a(1-a) = a-y$$
2. Next solve \(\frac{\partial l}{\partial w}\):
Since \(w\) is a row vector, this is the derivative of a scalar with respect to a vector. It can be computed directly from the definition, $$\frac{\partial l}{\partial w} = \left[\frac{\partial l}{\partial w_1},\frac{\partial l}{\partial w_2},\dots,\frac{\partial l}{\partial w_{n_0}}\right]$$ substituting the scalar chain rule \(\frac{\partial l}{\partial w_i} = \frac{\partial l}{\partial z}\frac{\partial z}{\partial w_i}\) for each component.
However, to stay consistent with the vectorized implementation and the two-layer derivation that follow, I will use the rules of matrix calculus here, even though it is arguably overkill. One point must be made clear first: the scalar chain rule does not carry over to vectors and matrices, and it cannot be applied blindly. I made exactly this mistake, and was baffled when my own derivation would not work out. Matrix calculus does, however, offer a counterpart to the scalar chain rule. The central formula is $$dl = tr\left(\frac{\partial l}{\partial W}^{T}dW\right)$$
Here \(dl\) is the differential of the scalar \(l\), \(W\) is a matrix, and \(tr\) denotes the trace. Whenever \(dl\) and \(dW\) can be arranged into this form, the factor multiplying \(dW\) inside the trace is the transpose of the derivative of \(l\) with respect to \(W\). A simple example:
Let \(f = a^{T}Xb\), where \(f\) is a scalar, \(a,b\) are column vectors, and \(X\) is a matrix; find \(\frac{\partial f}{\partial X}\). Since \(a\) and \(b\) are constant and a scalar equals its own trace, $$df = d(a^{T}Xb) = a^{T}dX\,b = tr(a^{T}dX\,b) = tr(b\,a^{T}dX) = tr\left((ab^{T})^{T}dX\right)$$
Comparing with the formula above, we get \(\frac{\partial f}{\partial X} = ab^{T}\). The derivation uses matrix-differential identities such as \(d(XY) = dX\,Y+X\,dY\), together with trace manipulations such as the cyclic property \(tr(ABC) = tr(CAB) = tr(BCA)\). For a more thorough treatment of matrix calculus, see the posts by the blogger 疊加態的貓.
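The result \(\frac{\partial f}{\partial X} = ab^{T}\) is easy to spot-check numerically with finite differences. A minimal sketch; the shapes and random values are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
a = rng.normal(size=(m, 1))          # column vector
b = rng.normal(size=(n, 1))          # column vector
X = rng.normal(size=(m, n))

f = lambda X: (a.T @ X @ b).item()   # f = a^T X b, a scalar

# finite-difference gradient of f with respect to each entry of X
eps = 1e-6
num_grad = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        num_grad[i, j] = (f(Xp) - f(Xm)) / (2 * eps)

print(np.allclose(num_grad, a @ b.T))  # True: df/dX = a b^T
```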
3. Next solve \(\frac{\partial l}{\partial b}\):
Since \(b\) is a scalar and \(\frac{\partial z}{\partial b} = 1\), it follows immediately that \(\frac{\partial l}{\partial b} = \frac{\partial l}{\partial z}\).
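Collecting the three results: \(dz = a-y\), \(dw = dz\,x^{T}\), \(db = dz\). A minimal sketch that computes these analytic gradients and verifies \(\frac{\partial l}{\partial w}\) against finite differences; shapes and values are again arbitrary assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, x, b, y):
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a)).item()

rng = np.random.default_rng(0)
n0 = 4
x = rng.normal(size=(n0, 1)); w = rng.normal(size=(1, n0)); b = 0.1; y = 1.0

# analytic gradients from the derivation above
a = sigmoid(w @ x + b)
dz = a - y                   # dl/dz, (1, 1)
dw = dz @ x.T                # dl/dw, (1, n0)
db = dz.item()               # dl/db, scalar

# finite-difference check of dl/dw
eps = 1e-6
num_dw = np.zeros_like(w)
for i in range(n0):
    wp = w.copy(); wp[0, i] += eps
    wm = w.copy(); wm[0, i] -= eps
    num_dw[0, i] = (loss(wp, x, b, y) - loss(wm, x, b, y)) / (2 * eps)

print(np.allclose(dw, num_dw))  # True
```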
Vectorized implementation of logistic regression with m training samples
The input is now a matrix \(X\) of dimension \((n_{0},m)\), with one training sample per column. The parameters are \(\boldsymbol{w}\) and \(b\), where \(\boldsymbol{w}\) is a row vector of dimension \((1,n_0)\) and \(\boldsymbol{b}=\overrightarrow{1}^{T}b\) is the scalar bias broadcast into a \((1,m)\) row vector. The network output is $$\boldsymbol{a} = \sigma(\boldsymbol{z}),\quad \boldsymbol{z} = \boldsymbol{w}X + \boldsymbol{b}$$
\(\boldsymbol{z},\boldsymbol{a}\) are both row vectors of dimension \((1,m)\). The cost function is defined as $$J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})\right)$$
It can also be written in matrix form: $$J = -\frac{1}{m}\left(\boldsymbol{y}\log\boldsymbol{a}^{T}+(\overrightarrow{1}^{T}-\boldsymbol{y})\log(\overrightarrow{1}^{T}-\boldsymbol{a})^{T}\right)$$
where \(\overrightarrow{1}\) is the all-ones column vector, \(\boldsymbol{y}\) is the \((1,m)\) row vector of labels, and the \(\log\) is applied element-wise.
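As a quick check that the summation form and the matrix form agree, a minimal numpy sketch; shapes and values are arbitrary assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n0, m = 4, 8
X = rng.normal(size=(n0, m))
w = rng.normal(size=(1, n0)); b = 0.1
y = rng.integers(0, 2, size=(1, m)).astype(float)  # labels, row vector (1, m)

a = sigmoid(w @ X + b)        # (1, m); the scalar b broadcasts like 1^T b

# summation form
J_sum = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# matrix form, using the all-ones column vector
ones = np.ones((m, 1))
J_mat = (-(y @ np.log(a).T + (ones.T - y) @ np.log(ones.T - a).T) / m).item()

print(np.isclose(J_sum, J_mat))  # True
```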
1. First solve \(\frac{\partial J}{\partial \boldsymbol{z}}\):
Whether from the definition of a scalar-by-vector derivative or via the matrix "chain rule" above, one obtains $$\frac{\partial J}{\partial \boldsymbol{z}} = \frac{1}{m}(\boldsymbol{a}-\boldsymbol{y})$$
Note that this derivative of \(J\) with respect to \(\boldsymbol{z}\) differs slightly from Andrew Ng's result: it carries an extra factor of \(\frac{1}{m}\). In my view the \(\frac{1}{m}\) belongs here if the derivative rules are followed strictly; Andrew Ng attaches the \(\frac{1}{m}\) to \(dw\) and \(db\) instead, so the final parameter updates are unaffected.
2. Next solve \(\frac{\partial J}{\partial \boldsymbol{w}}\):
We know \(dJ=tr\left(\frac{\partial J}{\partial \boldsymbol{z}}^{T}d\boldsymbol{z}\right)\). Substituting \(d\boldsymbol{z} = d\boldsymbol{w}\,X\) gives $$dJ = tr\left(\frac{\partial J}{\partial \boldsymbol{z}}^{T}d\boldsymbol{w}\,X\right) = tr\left(X\frac{\partial J}{\partial \boldsymbol{z}}^{T}d\boldsymbol{w}\right) = tr\left(\left(\frac{\partial J}{\partial \boldsymbol{z}}X^{T}\right)^{T}d\boldsymbol{w}\right)$$
Therefore \(\frac{\partial J}{\partial \boldsymbol{w}}=\frac{\partial J}{\partial \boldsymbol{z}}X^{T}\).
3. Next solve \(\frac{\partial J}{\partial b}\):
Substituting \(d\boldsymbol{z} = \overrightarrow{1}^{T}db\) gives \(dJ = tr\left(\frac{\partial J}{\partial \boldsymbol{z}}^{T}\overrightarrow{1}^{T}\right)db\), and the trace evaluates to \(\frac{\partial J}{\partial \boldsymbol{z}}\overrightarrow{1}\,db\).
Therefore \(\frac{\partial J}{\partial b}=\frac{\partial J}{\partial \boldsymbol{z}}\overrightarrow{1}\).
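These three results translate directly into numpy. A minimal sketch, with the \(\frac{1}{m}\) placed in \(d\boldsymbol{z}\) as derived here; shapes and values are arbitrary assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n0, m = 4, 8
X = rng.normal(size=(n0, m))
w = rng.normal(size=(1, n0)); b = 0.1
Y = rng.integers(0, 2, size=(1, m)).astype(float)

A = sigmoid(w @ X + b)              # forward pass, (1, m)

dZ = (A - Y) / m                    # dJ/dz, (1, m), with the 1/m from the derivation
dw = dZ @ X.T                       # dJ/dw = dJ/dz X^T, (1, n0)
db = (dZ @ np.ones((m, 1))).item()  # dJ/db = dJ/dz 1, a scalar
```

Since multiplying by the all-ones vector just sums the entries, `db` is equivalently `dZ.sum()`, which is how vectorized implementations usually write it.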
Gradient descent in a two-layer neural network
Let the input, hidden, and output layers have \(n_0,n_1,n_2=1\) units respectively. The hidden layer has activation function \(g(\cdot)\) and parameters \(W_1,\boldsymbol{b_1}\), where \(W_1\) is a matrix of dimension \((n_1,n_0)\) and \(\boldsymbol{b_1}\) is a column vector of dimension \((n_1,1)\). The output layer uses the sigmoid activation and has parameters \(\boldsymbol{w_2},b_2\), where \(\boldsymbol{w_2}\) is a row vector of dimension \((1,n_1)\) and \(b_2\) is a scalar.
Derivation for a single training sample
Given an input \(\boldsymbol{x}\), the forward pass of the network is $$\boldsymbol{z_1} = W_1\boldsymbol{x} + \boldsymbol{b_1},\quad \boldsymbol{a_1} = g(\boldsymbol{z_1}),\quad z_2 = \boldsymbol{w_2}\boldsymbol{a_1} + b_2,\quad a_2 = \sigma(z_2)$$
The loss function is defined exactly as in logistic regression.
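A minimal numpy sketch of this forward pass, with \(g = \tanh\) standing in for the hidden activation; that choice, the shapes, and the values are all arbitrary assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
n0, n1 = 4, 3
x  = rng.normal(size=(n0, 1)); y = 1.0        # input and label
W1 = rng.normal(size=(n1, n0)); b1 = np.zeros((n1, 1))
w2 = rng.normal(size=(1, n1));  b2 = 0.0

z1 = W1 @ x + b1        # (n1, 1)
a1 = np.tanh(z1)        # hidden activation, g = tanh (assumption)
z2 = w2 @ a1 + b2       # (1, 1)
a2 = sigmoid(z2)        # network output
```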
1. First solve \(\frac{\partial l}{\partial z_2}\):
Exactly as in logistic regression, \(\frac{\partial l}{\partial z_2}=a_2-y\).
2. Next solve \(\frac{\partial l}{\partial \boldsymbol{w_2}}\):
Exactly as in logistic regression, \(\frac{\partial l}{\partial \boldsymbol{w_2}}=\frac{\partial l}{\partial z_2}\boldsymbol{a_1}^{T}\).
3. In the same way, \(\frac{\partial l}{\partial b_2}=\frac{\partial l}{\partial z_2}\).
4. Solve \(\frac{\partial l}{\partial \boldsymbol{z_1}}\):
Since \(dz_2 = \boldsymbol{w_2}\,d\boldsymbol{a_1}\) and \(d\boldsymbol{a_1} = g^{'}(\boldsymbol{z_1})*d\boldsymbol{z_1}\), $$dl = tr\left(\frac{\partial l}{\partial z_2}^{T}\boldsymbol{w_2}\left(g^{'}(\boldsymbol{z_1})*d\boldsymbol{z_1}\right)\right) = tr\left(\left(\boldsymbol{w_2}^{T}\frac{\partial l}{\partial z_2}*g^{'}(\boldsymbol{z_1})\right)^{T}d\boldsymbol{z_1}\right)$$
Therefore \(\frac{\partial l}{\partial \boldsymbol{z_1}}=\boldsymbol{w_2}^{T}\frac{\partial l}{\partial z_2}*g^{'}(\boldsymbol{z_1})\), where \(*\) denotes element-wise multiplication. The last step uses the trace identity \(tr(A^{T}(B*C))=tr((A*B)^{T}C)\).
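That trace identity is easy to spot-check numerically; shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
A, B, C = (rng.normal(size=(3, 4)) for _ in range(3))
lhs = np.trace(A.T @ (B * C))   # * is element-wise multiplication
rhs = np.trace((A * B).T @ C)
print(np.isclose(lhs, rhs))     # True
```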
5. Solve \(\frac{\partial l}{\partial W_1}\):
Substituting \(d\boldsymbol{z_1} = dW_1\,\boldsymbol{x}\) into \(dl=tr\left(\frac{\partial l}{\partial \boldsymbol{z_1}}^{T}d\boldsymbol{z_1}\right)\) and applying the cyclic property of the trace as before,
\(\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial \boldsymbol{z_1}}\boldsymbol{x}^{T}\).
6. Solve \(\frac{\partial l}{\partial \boldsymbol{b_1}}\):
Since \(d\boldsymbol{z_1} = d\boldsymbol{b_1}\), it follows that \(\frac{\partial l}{\partial \boldsymbol{b_1}}=\frac{\partial l}{\partial \boldsymbol{z_1}}\).
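Collecting results 1–6, the whole backward pass for one sample fits in a few lines. A minimal sketch that repeats the forward pass from the earlier sketch, again with the assumed \(g = \tanh\), so \(g^{'}(z) = 1-\tanh^{2}(z)\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
n0, n1 = 4, 3
x  = rng.normal(size=(n0, 1)); y = 1.0
W1 = rng.normal(size=(n1, n0)); b1 = np.zeros((n1, 1))
w2 = rng.normal(size=(1, n1));  b2 = 0.0

# forward pass (same as the sketch above, g = tanh)
z1 = W1 @ x + b1
a1 = np.tanh(z1)
z2 = w2 @ a1 + b2
a2 = sigmoid(z2)

# backward pass, results 1-6 of the derivation
dz2 = a2 - y                       # (1, 1)
dw2 = dz2 @ a1.T                   # (1, n1)
db2 = dz2.item()                   # scalar
dz1 = (w2.T @ dz2) * (1 - a1**2)   # g'(z1) = 1 - tanh(z1)^2, element-wise
dW1 = dz1 @ x.T                    # (n1, n0)
db1 = dz1                          # (n1, 1)
```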
Derivation of the vectorized implementation for m training samples
The input is a matrix \(X\) of dimension \((n_0,m)\), one sample per column. The forward pass of the network is $$Z_1 = W_1X + \boldsymbol{b_1}\overrightarrow{1}^{T},\quad A_1 = g(Z_1),\quad \boldsymbol{z_2} = \boldsymbol{w_2}A_1 + \overrightarrow{1}^{T}b_2,\quad \boldsymbol{a_2} = \sigma(\boldsymbol{z_2})$$
1. First solve \(\frac{\partial J}{\partial \boldsymbol{z_2}}\):
By the same method as in logistic regression, \(\frac{\partial J}{\partial \boldsymbol{z_2}}=\frac{1}{m}(\boldsymbol{a_2}-\boldsymbol{Y})\).
2. Next solve \(\frac{\partial J}{\partial \boldsymbol{w_2}}\): by the same method as in logistic regression, \(\frac{\partial J}{\partial \boldsymbol{w_2}}=\frac{\partial J}{\partial \boldsymbol{z_2}}A_1^{T}\).
3. Next solve \(\frac{\partial J}{\partial b_2}\): as in logistic regression, \(\frac{\partial J}{\partial b_2}=\frac{\partial J}{\partial \boldsymbol{z_2}}\overrightarrow{1}\).
4. Next solve \(\frac{\partial J}{\partial Z_1}\):
By the same method as for a single sample, \(\frac{\partial J}{\partial Z_1}=\boldsymbol{w_2}^{T}\frac{\partial J}{\partial \boldsymbol{z_2}}*g^{'}(Z_1)\).
5. Solve \(\frac{\partial J}{\partial W_1}\):
By the same method as for a single sample, \(\frac{\partial J}{\partial W_1}=\frac{\partial J}{\partial Z_1}X^{T}\).
6. Solve \(\frac{\partial J}{\partial \boldsymbol{b_1}}\):
Therefore \(\frac{\partial J}{\partial \boldsymbol{b_1}}=\frac{\partial J}{\partial Z_1}\overrightarrow{1}\).
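Finally, the vectorized forward and backward passes for all \(m\) samples, as a sketch under the same assumptions (\(g = \tanh\), arbitrary shapes and values). Note how the single \(\frac{1}{m}\) enters through \(\frac{\partial J}{\partial \boldsymbol{z_2}}\) and propagates into every parameter gradient:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(6)
n0, n1, m = 4, 3, 8
X  = rng.normal(size=(n0, m))
Y  = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.normal(size=(n1, n0)); b1 = np.zeros((n1, 1))
w2 = rng.normal(size=(1, n1));  b2 = 0.0

# forward pass; adding b1 broadcasts it across the m columns, like b1 1^T
Z1 = W1 @ X + b1                  # (n1, m)
A1 = np.tanh(Z1)                  # (n1, m), g = tanh (assumption)
z2 = w2 @ A1 + b2                 # (1, m)
a2 = sigmoid(z2)                  # (1, m)

# backward pass, results 1-6 of the derivation
ones = np.ones((m, 1))
dz2 = (a2 - Y) / m                # (1, m), with the 1/m from dJ/dz2
dw2 = dz2 @ A1.T                  # (1, n1)
db2 = (dz2 @ ones).item()         # scalar
dZ1 = (w2.T @ dz2) * (1 - A1**2)  # (n1, m), g'(Z1) = 1 - tanh(Z1)^2
dW1 = dZ1 @ X.T                   # (n1, n0)
db1 = dZ1 @ ones                  # (n1, 1)
```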