Reference: Keras API reference / Layers API / Core layers / Dense layer
The signature is as follows:
tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
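For reference, a minimal usage sketch; the layer size, activation, and input shape below are illustrative only, and the two initializers are simply the defaults written out explicitly:

```python
import tensorflow as tf

# Illustrative sizes only: 16 input features, 64 output units.
layer = tf.keras.layers.Dense(
    units=64,
    activation="relu",
    kernel_initializer="glorot_uniform",  # default, written out for clarity
    bias_initializer="zeros",             # default, written out for clarity
)

x = tf.random.normal((32, 16))  # (batch_size, input_dim)
y = layer(x)                    # output = activation(dot(x, kernel) + bias)
print(y.shape)                  # (32, 64)
```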
Just your regular densely-connected NN layer.
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Note: If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 1 of the kernel (using tf.tensordot). For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).
Besides, layer attributes cannot be modified after the layer has been called once (except the trainable attribute).
The discussion below focuses on interpreting the highlighted part, i.e., the Note above.
When the rank of the inputs is greater than 2 (roughly speaking, when it has more than two dimensions), Dense takes the dot product (via tf.tensordot) between the last axis of the inputs and the kernel.
For example:
Suppose the inputs have shape $X=(batch\_size, d_0, d_1)$ and the kernel has shape $W=(d_1, units)$. The output is then computed as
$$Y = XW$$
so the output shape is $Y=(batch\_size, d_0, units)$. The shape arithmetic itself is not hard to follow, but its meaning becomes more interesting when applied in a neural network.
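A minimal sketch to verify this (the sizes batch_size=4, d0=5, d1=8, units=3 are hypothetical), including the equivalent tf.tensordot computation mentioned in the Note:

```python
import tensorflow as tf

# Hypothetical sizes for illustration.
batch_size, d0, d1, units = 4, 5, 8, 3

x = tf.random.normal((batch_size, d0, d1))
dense = tf.keras.layers.Dense(units, use_bias=False)  # no bias, to keep the comparison simple
y = dense(x)
print(y.shape)  # (4, 5, 3), i.e. (batch_size, d0, units)

# The same result via tensordot: contract the last axis of x with axis 0
# of the kernel (shape (d1, units)); the kernel is shared across all d0 positions.
kernel = dense.kernel  # built after the first call, shape (8, 3)
y_manual = tf.tensordot(x, kernel, axes=[[2], [0]])
print(float(tf.reduce_max(tf.abs(y - y_manual))))  # ~0.0
```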
In other words, the last dimension $d_1$ is fully connected to $units$, and this happens $d_0$ times (once per position along $d_0$), with all positions sharing the same kernel. This is also how Attention is commonly implemented: applying a Dense layer with a single unit to a 3-D input collapses each $d_1$-dimensional vector into a single number, which serves as the attention weight $\alpha$.
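As a hedged sketch of that idea (an additive-attention-style scorer; the sequence length and hidden size below are hypothetical), Dense(1) gives one score per timestep, and a softmax over the timestep axis turns the scores into the weights $\alpha$:

```python
import tensorflow as tf

# Hypothetical sizes: batch=2, seq_len=10, hidden=16.
h = tf.random.normal((2, 10, 16))           # e.g. encoder outputs, (batch, seq_len, hidden)

score_layer = tf.keras.layers.Dense(1)      # one shared kernel of shape (hidden, 1)
scores = score_layer(h)                     # (batch, seq_len, 1): one scalar per timestep
alpha = tf.nn.softmax(scores, axis=1)       # attention weights over the seq_len axis

context = tf.reduce_sum(alpha * h, axis=1)  # weighted sum of timesteps, (batch, hidden)
print(alpha.shape, context.shape)           # (2, 10, 1) (2, 16)
```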

