Reference: Keras API reference / Layers API / Core layers / Dense layer
The signature is as follows:
```python
tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
```
Just your regular densely-connected NN layer. `Dense` implements the operation: `output = activation(dot(input, kernel) + bias)`, where `activation` is the element-wise activation function passed as the `activation` argument, `kernel` is a weights matrix created by the layer, and `bias` is a bias vector created by the layer (only applicable if `use_bias` is `True`).
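As a quick check of this formula, here is a minimal sketch (assuming TensorFlow 2.x; the sizes are made up) that builds a `Dense` layer and recomputes its output by hand as `activation(dot(input, kernel) + bias)`:

```python
import numpy as np
import tensorflow as tf

# A small Dense layer: 4 input features -> 3 units, ReLU activation.
layer = tf.keras.layers.Dense(units=3, activation="relu")

x = tf.random.normal([2, 4])   # batch of 2 samples, 4 features each
y = layer(x)                   # first call creates layer.kernel and layer.bias

# Recompute the same operation by hand: activation(dot(input, kernel) + bias).
kernel, bias = layer.kernel, layer.bias
y_manual = tf.nn.relu(tf.matmul(x, kernel) + bias)

print(y.shape)                                   # (2, 3)
print(np.allclose(y.numpy(), y_manual.numpy()))  # True
```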
Note: If the input to the layer has a rank greater than 2, then `Dense` computes the dot product between the `inputs` and the `kernel` along the last axis of the `inputs` and axis 0 of the `kernel` (using `tf.tensordot`). For example, if input has dimensions `(batch_size, d0, d1)`, then we create a `kernel` with shape `(d1, units)`, and the `kernel` operates along axis 2 of the `input`, on every sub-tensor of shape `(1, 1, d1)` (there are `batch_size * d0` such sub-tensors). The output in this case will have shape `(batch_size, d0, units)`.
Besides, layer attributes cannot be modified after the layer has been called once (except the `trainable` attribute).
The discussion below mainly interprets the highlighted part, i.e. the note about inputs with rank greater than 2.
When the rank of the inputs is greater than 2 (loosely speaking, the number of dimensions), Dense takes the dot product of the kernel with the inputs along the last axis of the inputs.
For example:
The inputs have shape $X=(batch\_size, d_0, d_1)$ and the kernel has shape $W=(d_1, units)$, so the output is computed as:
$$Y = XW$$
It follows that the output has shape $Y=(batch\_size, d_0, units)$. On its own this is not hard to understand, but it takes on a different flavor when applied in a neural network.
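To make these shapes concrete, the following sketch (hypothetical sizes, assuming TensorFlow 2.x) feeds a rank-3 input through a `Dense` layer and verifies that the output equals the tensordot of the input with the shared kernel:

```python
import numpy as np
import tensorflow as tf

batch_size, d0, d1, units = 2, 5, 7, 3

layer = tf.keras.layers.Dense(units, use_bias=False)
X = tf.random.normal([batch_size, d0, d1])   # rank-3 input (batch_size, d0, d1)

Y = layer(X)
print(layer.kernel.shape)   # (7, 3)    -> (d1, units)
print(Y.shape)              # (2, 5, 3) -> (batch_size, d0, units)

# The same kernel W is applied to every (d1,)-sized sub-vector: Y = X . W
W = layer.kernel
Y_manual = tf.tensordot(X, W, axes=[[2], [0]])
print(np.allclose(Y.numpy(), Y_manual.numpy()))  # True
```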
In effect, the last dimension $d_1$ is mapped to $units$ by $d_0$ separate fully-connected transforms that all share the same kernel. This is also how Attention is commonly implemented: applying a Dense layer (with units = 1) to the 3-D input collapses each $d_1$-dimensional sub-vector into a single number, namely the attention score $\alpha$ (see the sketch below).
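As a rough illustration of that attention use case (an illustrative sketch only, not the implementation of any particular paper or library), a `Dense(1)` layer maps every time step's feature vector to a scalar score, which is then normalized with a softmax to obtain the weights $\alpha$:

```python
import tensorflow as tf

batch_size, seq_len, hidden = 2, 6, 8

# Hidden states of some encoder: one hidden-dim vector per time step.
h = tf.random.normal([batch_size, seq_len, hidden])

# Dense(1) shares one kernel of shape (hidden, 1) across all time steps,
# turning each time step's vector into a single unnormalized score.
score_layer = tf.keras.layers.Dense(1, activation="tanh")
scores = score_layer(h)                     # (batch_size, seq_len, 1)

# Softmax over the time axis gives the attention weights alpha.
alpha = tf.nn.softmax(scores, axis=1)       # (batch_size, seq_len, 1)

# Weighted sum of the hidden states -> context vector.
context = tf.reduce_sum(alpha * h, axis=1)  # (batch_size, hidden)

print(scores.shape, alpha.shape, context.shape)
```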