Transformer Block Breakdown


Basic structure

[Figure: basic structure of a transformer block]

Basic parameters

  • $n_{layers}$ or $L$: total number of transformer blocks

  • $d_{model}$ or $H$: number of units in each bottleneck layer, which is also the width of each Q/K/V input

  • $n_{heads}$ or $A$: number of attention heads in each transformer block

  • $n_{ctx}$ or $S$: input sequence length

Derived parameters

  • $d_{head}$: dimension of each attention head, $d_{head} = d_{model} / n_{heads}$

  • $d_{ff}$: number of units in the intermediate layer of the feed-forward sublayer, $d_{ff} = 4 \cdot d_{model}$
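
As a quick sanity check of these two formulas, the snippet below computes them for GPT-3 (values taken from the table further down; the 4x feed-forward expansion is the convention used by the GPT-3 and BERT papers):

```python
# Basic parameters of GPT-3 ("Language Models are Few-Shot Learners")
n_layers, d_model, n_heads, n_ctx = 96, 12288, 96, 2048

d_head = d_model // n_heads  # dimension of each attention head
d_ff = 4 * d_model           # intermediate width of the feed-forward layer

print(d_head)  # 128
print(d_ff)    # 49152
```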

The diagram below shows where each parameter appears inside the transformer block:

[Figure: where each parameter appears inside the transformer block]
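
For readers who prefer code to diagrams, here is a minimal PyTorch sketch of one such block. It is a simplified post-LN variant with GELU, intended only to show where each parameter lives, not to reproduce any particular model's implementation:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One of the n_layers identical blocks (simplified post-LN sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Multi-head self-attention: Q/K/V inputs of width d_model, split
        # internally into n_heads heads of size d_head = d_model // n_heads.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        # Position-wise feed forward, intermediate width d_ff = 4 * d_model.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_ctx, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)     # residual connection + layer norm
        x = self.ln2(x + self.ffn(x))  # residual connection + layer norm
        return x

# Shape check with BERT_Base-like sizes: d_model=768, n_heads=12, n_ctx=128.
block = TransformerBlock(d_model=768, n_heads=12)
x = torch.randn(2, 128, 768)
print(block(x).shape)  # torch.Size([2, 128, 768])
```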

Zooming in on the Feed Forward submodule

[Figure: the feed-forward sublayer in detail]
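
The same submodule as a standalone sketch, assuming BERT_Base sizes (d_model = 768, so d_ff = 3072), which makes the 4x expansion visible in the tensor shapes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_ctx = 768, 128  # BERT_Base-like sizes (assumed for illustration)
d_ff = 4 * d_model         # 3072

expand = nn.Linear(d_model, d_ff)    # d_model -> d_ff
project = nn.Linear(d_ff, d_model)   # d_ff -> d_model

x = torch.randn(1, n_ctx, d_model)
h = F.gelu(expand(x))
print(h.shape)           # torch.Size([1, 128, 3072]) -- intermediate width d_ff
print(project(h).shape)  # torch.Size([1, 128, 768])  -- back to d_model
```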

Basic parameters of typical models

Domain   Model        n_layers (L)   d_model (H)   n_heads (A)   n_ctx (S)
NLP      GPT-3        96             12288         96            2048
NLP      BERT_Base    12             768           12            128/512
NLP      BERT_Large   24             1024          16            128/512
RecSys   BST          1              128 (max)     8             20
  • BST: Behavior Sequence Transformer
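
These numbers also support a back-of-the-envelope parameter estimate: each block carries about 4*d_model^2 attention weights (the Q/K/V/output projections) plus 8*d_model^2 feed-forward weights (two d_model x 4*d_model matrices), roughly 12 * n_layers * d_model^2 per model. This is my own sanity check, not a figure from the papers; it ignores embeddings, biases, and layer norms:

```python
def approx_params(n_layers: int, d_model: int) -> int:
    # 4*d_model^2 (attention) + 8*d_model^2 (feed forward, d_ff = 4*d_model)
    # per block, summed over n_layers blocks.
    return 12 * n_layers * d_model ** 2

print(f"GPT-3:      {approx_params(96, 12288) / 1e9:.1f}B")  # ~173.9B
print(f"BERT_Base:  {approx_params(12, 768) / 1e6:.1f}M")    # ~84.9M
print(f"BERT_Large: {approx_params(24, 1024) / 1e6:.1f}M")   # ~302.0M
```

The published totals (175B for GPT-3, 110M for BERT_Base, 340M for BERT_Large) are higher mainly because of the token and position embeddings that this estimate omits.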

References

  1. The GPT-3 Architecture, on a Napkin

  2. GPT-3 An Overview

  3. Language Models are Few-Shot Learners

  4. Improving Language Understanding by Generative Pre-Training

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Attention Is All You Need

  7. BERT transformer block code

  8. Deep Learning Recommendation Model for Personalization and Recommendation Systems

  9. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba

