Basic structure
Basic parameters
- $n_{layer}$: total number of transformer blocks
- $d_{model}$: number of units in each bottleneck layer, and number of units of each Q/K/V input
- $n_{head}$: number of attention heads in each transformer block
- $n_{ctx}$: input sequence length
Derived parameters
- $d_{head}$: dimension of each attention head, $d_{head} = d_{model} / n_{head}$
- $d_{ff}$: number of intermediate units in the feed-forward layer, commonly $d_{ff} = 4 \times d_{model}$ (see the sketch below)
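To make the relation between the basic and derived parameters concrete, here is a minimal Python sketch. The `TransformerConfig` name and its fields are assumptions chosen for illustration, not part of any particular library, and the $4 \times d_{model}$ feed-forward size is only the common convention.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Hypothetical config holding the basic parameters of this section."""
    n_layer: int   # total number of transformer blocks
    d_model: int   # units of each bottleneck layer / each Q, K, V input
    n_head: int    # attention heads per transformer block
    n_ctx: int     # input sequence length

    @property
    def d_head(self) -> int:
        # dimension of each attention head
        return self.d_model // self.n_head

    @property
    def d_ff(self) -> int:
        # intermediate units of the feed-forward layer (the common 4x choice)
        return 4 * self.d_model
```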
A detailed diagram of where each parameter appears in the transformer block is shown below (double-click to zoom in):
Zoom-in on the Feed Forward submodule
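As a textual companion to the diagram, the following is a rough PyTorch-style sketch of a single transformer block, showing where $d_{model}$, $n_{head}$, $d_{head}$ and $d_{ff}$ enter. It assumes a post-norm layout and standard `torch.nn` modules; it illustrates parameter placement only and is not a reproduction of any specific model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        # Multi-head attention: Q/K/V projections all take d_model-dim inputs;
        # each of the n_head heads operates on d_head = d_model // n_head units.
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_head)
        self.ln1 = nn.LayerNorm(d_model)
        # Feed Forward submodule: expand to d_ff = 4 * d_model, then project back.
        d_ff = 4 * d_model
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (n_ctx, batch, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.ff(x))
        return x
```

A full model stacks $n_{layer}$ such blocks on top of the embedding layer.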
Basic parameters of typical models
| Application | Model | $n_{layer}$ | $d_{model}$ | $n_{head}$ | $n_{ctx}$ |
| --- | --- | --- | --- | --- | --- |
| NLP | GPT-3 | 96 | 12288 | 96 | 2048 |
| NLP | BERT_Base | 12 | 768 | 12 | 128/512 |
| NLP | BERT_Large | 24 | 1024 | 16 | 128/512 |
| RecSys | BST | 1 | 128 (max) | 8 | 20 |
- BST: Behavior Sequence Transformer
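As a quick check of the derived parameters against the table above, the hypothetical `TransformerConfig` sketched earlier can be instantiated with these rows. BERT's 128/512 sequence length is taken as 512 here, and the $4 \times d_{model}$ feed-forward convention is assumed; it may not hold for BST.

```python
# Values taken from the table above; TransformerConfig is the sketch
# defined earlier in this section (hypothetical, not a library class).
configs = {
    "GPT-3":      TransformerConfig(n_layer=96, d_model=12288, n_head=96, n_ctx=2048),
    "BERT_Base":  TransformerConfig(n_layer=12, d_model=768,   n_head=12, n_ctx=512),
    "BERT_Large": TransformerConfig(n_layer=24, d_model=1024,  n_head=16, n_ctx=512),
    "BST":        TransformerConfig(n_layer=1,  d_model=128,   n_head=8,  n_ctx=20),
}
for name, cfg in configs.items():
    # d_head = d_model / n_head; d_ff = 4 * d_model (common convention)
    print(f"{name}: d_head={cfg.d_head}, d_ff={cfg.d_ff}")
```

This prints a per-head dimension $d_{head}$ of 128 for GPT-3, 64 for BERT_Base and BERT_Large, and 16 for BST.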