Dissecting the transformer block


Basic structure

[Figure: basic structure of a transformer block]

Basic parameters

  • $n_{layers}$ or $L$: total number of transformer blocks

  • $d_{model}$ or $H$: number of units in each bottleneck layer, and the number of units of each Q/K/V input

  • $n_{heads}$ or $A$: number of attention heads in each transformer block

  • $n_{ctx}$ or $S$: input sequence length

Derived parameters

  • $d_{head}$: dimension of each attention head, $d_{head} = d_{model} / n_{heads}$

  • $d_{ff}$: number of units in the intermediate layer of the feed-forward sub-module, $d_{ff} = 4 \times d_{model}$ (a configuration sketch computing these values follows this list)
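
To make the relationships above concrete, here is a minimal Python sketch of a configuration object that stores the basic parameters and computes the derived ones. The class name `TransformerConfig` and its layout are illustrative assumptions rather than something taken from the referenced papers; the 4× feed-forward expansion follows the formula above.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative container for the basic transformer-block parameters."""
    n_layers: int   # L: total number of transformer blocks
    d_model: int    # H: units in each bottleneck layer / Q, K, V input width
    n_heads: int    # A: attention heads per transformer block
    n_ctx: int      # S: input sequence length

    @property
    def d_head(self) -> int:
        # Dimension of each attention head: d_head = d_model / n_heads
        assert self.d_model % self.n_heads == 0, "d_model must be divisible by n_heads"
        return self.d_model // self.n_heads

    @property
    def d_ff(self) -> int:
        # Intermediate feed-forward width, assuming the common 4x expansion
        return 4 * self.d_model
```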

The diagram below shows where each of these parameters appears inside a transformer block:

[Figure: detailed diagram of the parameters inside a transformer block]
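
As a textual companion to the diagram, the sketch below shows how $d_{model}$, $n_{heads}$, and $d_{head}$ interact inside the multi-head self-attention sub-module: an input of width $d_{model}$ is projected to Q/K/V, split into $n_{heads}$ heads of width $d_{head}$, and merged back. This is a generic PyTorch-style illustration under those assumptions, not the exact implementation used by GPT-3, BERT, or BST.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention; shapes annotated with the
    symbols used above (S = n_ctx, H = d_model, A = n_heads)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads            # d_head = d_model / n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # joint Q/K/V projection
        self.out = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, S, d_model)
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)       # each (batch, S, d_model)

        def split(t: torch.Tensor) -> torch.Tensor:
            # Split d_model into n_heads heads of width d_head.
            return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)                    # (batch, n_heads, S, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5     # (batch, n_heads, S, S)
        attn = F.softmax(scores, dim=-1)
        ctx = attn @ v                                            # (batch, n_heads, S, d_head)
        ctx = ctx.transpose(1, 2).reshape(b, s, d)                # back to (batch, S, d_model)
        return self.out(ctx)
```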

Zooming in on the feed-forward sub-module

[Figure: zoomed-in view of the feed-forward sub-module]
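
The feed-forward sub-module is two position-wise linear layers: an expansion from $d_{model}$ to $d_{ff}$ followed by a projection back to $d_{model}$. A minimal sketch is below; the GELU activation is an assumption here (BERT and GPT use GELU, while the original Transformer paper uses ReLU).

```python
from typing import Optional

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward sub-module: d_model -> d_ff -> d_model."""

    def __init__(self, d_model: int, d_ff: Optional[int] = None):
        super().__init__()
        if d_ff is None:
            d_ff = 4 * d_model               # common 4x expansion (see derived parameters above)
        self.fc1 = nn.Linear(d_model, d_ff)  # expand to the intermediate width
        self.fc2 = nn.Linear(d_ff, d_model)  # project back to the model width
        self.act = nn.GELU()                 # assumption: GELU (original paper uses ReLU)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, S, d_model) -> (batch, S, d_model)
        return self.fc2(self.act(self.fc1(x)))
```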

Basic parameters of typical models

| Application | Model | $n_{layers}$ | $d_{model}$ | $n_{heads}$ | $n_{ctx}$ |
| --- | --- | --- | --- | --- | --- |
| NLP | GPT-3 | 96 | 12288 | 96 | 2048 |
| NLP | BERT_Base | 12 | 768 | 12 | 128/512 |
| NLP | BERT_Large | 24 | 1024 | 16 | 128/512 |
| RecSys | BST | 1 | 128 (max) | 8 | 20 |
  • BST: Behavior Sequence Transformer
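
As a usage note, the rows of the table can be run through the derived-parameter formulas from earlier. The snippet below is illustrative only: BERT's maximum sequence length of 512 stands in for 128/512, and the 4× feed-forward expansion is assumed for all four models (it is stated for GPT-3 and matches BERT's intermediate size, but is an assumption for BST).

```python
# Table rows expressed as basic parameters (n_layers, d_model, n_heads, n_ctx).
models = {
    "GPT-3":      dict(n_layers=96, d_model=12288, n_heads=96, n_ctx=2048),
    "BERT_Base":  dict(n_layers=12, d_model=768,   n_heads=12, n_ctx=512),
    "BERT_Large": dict(n_layers=24, d_model=1024,  n_heads=16, n_ctx=512),
    "BST":        dict(n_layers=1,  d_model=128,   n_heads=8,  n_ctx=20),
}

for name, cfg in models.items():
    d_head = cfg["d_model"] // cfg["n_heads"]   # d_head = d_model / n_heads
    d_ff = 4 * cfg["d_model"]                   # assumed 4x expansion
    print(f"{name}: d_head={d_head}, d_ff={d_ff}")
```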

References

  1. The GPT-3 Architecture, on a Napkin

  2. GPT-3 An Overview

  3. Language Models are Few-Shot Learners

  4. Improving Language Understanding by Generative Pre-Training

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Attention Is All You Need

  7. BERT transformer block code

  8. Deep Learning Recommendation Model for Personalization and Recommendation Systems

  9. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba

