Overview
Consider an autoencoder
\[\hat{x} = f(x) \in \mathbb{R}^D, \quad x \in \mathbb{R}^D. \]
How should \(f\) be structured so that
\[\hat{x} = [f_1, f_2(x_1), f_3(x_1, x_2), \ldots, f_D(x_1, x_2, \ldots, x_{D-1})]? \]
That is, \(\hat{x}_d\) depends only on \(x_{< d}\) (in particular, \(\hat{x}_1\) depends on no input at all).
Main content
Suppose layer \(l\) of the network computes:
\[x^l = \sigma^l(W^lx^{l-1} + b^l). \]
The authors' idea is to assign to the \(k\)-th neuron of the first hidden layer a number \(m^1(k) \in \{1, \ldots, D-1\}\), and build a mask matrix \(M^1\):
\[M^1_{k,d} = \left \{ \begin{array}{ll} 1, & m^1(k) \ge d \\ 0, & \mathrm{else}. \end{array} \right . \]
The actual computation then becomes:
\[x^1 = \sigma^1\big((W^1 \odot M^1)\, x + b^1\big). \]
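A minimal NumPy sketch of this first-layer mask; the input dimension \(D=5\), hidden width \(K=8\), and the uniform random choice of \(m^1\) are all illustrative assumptions, not values from the paper:

```python
import numpy as np

D, K = 5, 8                      # assumed sizes, for illustration only
rng = np.random.default_rng(0)

# assign each hidden unit a degree m^1(k) in {1, ..., D-1}
m1 = rng.integers(1, D, size=K)

# M^1_{k,d} = 1 iff m^1(k) >= d, with inputs indexed d = 1..D
d = np.arange(1, D + 1)
M1 = (m1[:, None] >= d[None, :]).astype(float)

# masked first layer: x^1 = sigma((W^1 ⊙ M^1) x + b^1)
W1 = rng.standard_normal((K, D))
b1 = np.zeros(K)
x = rng.standard_normal(D)
x1 = np.tanh((W1 * M1) @ x + b1)
```

Note that since \(m^1(k) \le D-1 < D\), the last input \(x_D\) is masked out of every hidden unit, as it must be: no output may depend on it except \(\hat{x}_D\)'s successors, of which there are none.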
Going further, assign to the \(i\)-th neuron of hidden layer \(l\) a number \(m^l(i) \in \{\min_j m^{l-1}(j), \ldots, D-1\}\) (otherwise some rows of \(M^l\) could be all zeros):
\[M^l_{i,j} = \left \{ \begin{array}{ll} 1, & m^l(i) \ge m^{l-1}(j) \\ 0, & \mathrm{else}. \end{array} \right . \]
\[x^l = \sigma^l\big((W^l \odot M^l)\, x^{l-1} + b^l\big). \]
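The same recipe for a deeper hidden layer, as a sketch (sizes and degree sampling are again assumptions); the lower bound \(\min_j m^{l-1}(j)\) guarantees each new unit connects to at least one unit below:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K1, K2 = 5, 8, 8              # assumed sizes

m1 = rng.integers(1, D, size=K1)
# m^2(i) drawn from {min_j m^1(j), ..., D-1}, so no row of M^2 is all zero
m2 = rng.integers(m1.min(), D, size=K2)

# M^2_{i,j} = 1 iff m^2(i) >= m^1(j)
M2 = (m2[:, None] >= m1[None, :]).astype(float)
```

Every row of `M2` is nonzero: for each \(i\), the unit \(j\) attaining \(\min_j m^1(j)\) satisfies \(m^2(i) \ge m^1(j)\).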
And finally the output layer, where the inequality is strict:
\[M^L_{d,k} = \left \{ \begin{array}{ll} 1, & d > m^{L-1}(k) \\ 0, & \mathrm{else}. \end{array} \right . \]
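Composing the masks gives the end-to-end connectivity: input \(x_{d'}\) can reach output \(\hat{x}_d\) only if \((M^L M^{L-1} \cdots M^1)_{d,d'} > 0\), and the autoregressive property says this product must be strictly lower triangular. A self-contained check for a two-hidden-layer stack (all sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K1, K2 = 5, 16, 16            # assumed sizes

m1 = rng.integers(1, D, size=K1)
m2 = rng.integers(m1.min(), D, size=K2)

d = np.arange(1, D + 1)
M1 = (m1[:, None] >= d[None, :]).astype(float)   # first hidden mask
M2 = (m2[:, None] >= m1[None, :]).astype(float)  # second hidden mask
ML = (d[:, None] > m2[None, :]).astype(float)    # output mask: d > m^{L-1}(k)

# connectivity from input d' to output d
C = ML @ M2 @ M1
# C_{d,d'} > 0 requires d > m2(k) >= m1(j) >= d' for some path, hence d' < d
assert np.all(np.triu(C) == 0)
```

The chain of inequalities in the comment is exactly why the construction works: a surviving path forces \(d > d'\), so \(C\) is zero on and above the diagonal.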
My personal impression: the masks will be quite noticeably sparse, and the effect should get worse the deeper the network is.
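This sparsity concern can be probed numerically. The following sketch (the sizes, depth, and uniform resampling of degrees are my own choices for illustration) reports the fraction of unmasked weights in each layer of one random degree assignment:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K, L = 10, 32, 6              # assumed: input dim, width, hidden depth

d = np.arange(1, D + 1)
m_prev = d                       # treat the input as "layer 0" with m(d) = d
densities = []
for l in range(L):
    # degrees drawn from {min_j m^{l-1}(j), ..., D-1}
    m = rng.integers(m_prev.min(), D, size=K)
    M = (m[:, None] >= m_prev[None, :]).astype(float)
    densities.append(M.mean())   # fraction of surviving weights
    m_prev = m

print([round(x, 2) for x in densities])
```

This is only a single random draw, not a proof of any trend, but it makes the per-layer density easy to inspect.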