Generative Pre-trained Transformer (GPT)
Overall, GPT-1, GPT-2, and GPT-3 all use the same unidirectional Transformer-decoder architecture trained with a language-modeling objective; the main differences are the amount of training data and the model size, both of which grow from one generation to the next.
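"Unidirectional decoder" means each position may attend only to earlier tokens. A minimal PyTorch sketch of that causal masking, with illustrative GPT-1-sized defaults rather than the exact OpenAI implementation:

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Unidirectional (masked) self-attention, the core of a GPT-style decoder block.
    Sizes default to GPT-1-like values (dim 768, 12 heads); a sketch, not the original code."""
    def __init__(self, d_model=768, n_heads=12, max_len=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # True above the diagonal = "not allowed to attend": position i only sees tokens <= i.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        out, _ = self.attn(x, x, x, attn_mask=self.mask[:seq_len, :seq_len])
        return out

x = torch.randn(2, 16, 768)
print(CausalSelfAttention()(x).shape)           # torch.Size([2, 16, 768])
```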
| | GPT-1 | GPT-2 | GPT-3 |
|---|---|---|---|
| Paper | Improving Language Understanding by Generative Pre-Training (link) | Language Models are Unsupervised Multitask Learners (link) | Language Models are Few-Shot Learners (link) |
| Learning objective | Unsupervised language-model pre-training, then supervised fine-tuning | Multi-task modeling, P(output \| input, task); zero-shot task transfer | Few-shot (in-context) learning |
| Main differences | (baseline) | More data, more layers, larger dimensions; LayerNorm moved to the input of each sub-block plus a final LayerNorm after the last block; residual weights scaled at initialization (pre-LN sketch below) | More data, more layers, larger dimensions |
| Dataset | ~7,000 unpublished books, mostly long-form text | WebText: 40 GB, 8 million documents | Common Crawl, WebText2, Books1, Books2, and Wikipedia, about 45 TB in total |
| Model architecture | 12-layer decoder, 12 heads, dim 768, FFN 3072 | 48 layers, dim 1600 (largest variant) | 96 layers, 96 heads, dim 12288 |
| Training parameters | 100 epochs, batch size 64, sequence length 512, lr 2.5e-4, BPE vocab 40,000 | vocab 50,257, batch size 512, context window 1024 | context 2048, Adam with β1 = 0.9, β2 = 0.95, ε = 10⁻⁸ |
| Parameter count (rough estimate below) | 117M | 117M (same as GPT-1), 345M, 762M, and 1.5B (GPT-2) | 175 billion (175B) |
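The "LN moved forward" change noted for GPT-2 means LayerNorm is applied at the input of each sub-block (pre-norm) instead of after the residual addition (post-norm), with one extra LayerNorm after the final block and residual-path weights scaled at initialization. A rough sketch of the two orderings (generic module names, not the original code):

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """GPT-1 ordering: sublayer -> residual add -> LayerNorm."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.ln = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return self.ln(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """GPT-2 ordering: LayerNorm -> sublayer -> residual add.
    GPT-2 also adds a final LayerNorm after the last block."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.ln = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.ln(x))
```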
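As a sanity check on the parameter counts above, a back-of-the-envelope estimate from the architecture columns alone (embeddings plus per-layer attention and feed-forward weights; biases and LayerNorm are omitted, and the GPT-3 vocabulary size is assumed to match GPT-2's 50,257):

```python
def approx_params(n_layers, d_model, vocab, n_ctx, d_ff=None):
    """Rough GPT-style decoder parameter count.
    Per layer: 4*d^2 for the Q/K/V/output projections plus 2*d*d_ff for the feed-forward net;
    add token and position embeddings; biases and LayerNorm are ignored."""
    d_ff = d_ff or 4 * d_model
    per_layer = 4 * d_model ** 2 + 2 * d_model * d_ff
    embeddings = vocab * d_model + n_ctx * d_model
    return n_layers * per_layer + embeddings

print(f"GPT-1 ~ {approx_params(12, 768, 40_000, 512) / 1e6:.0f}M")      # ~116M vs reported 117M
print(f"GPT-2 ~ {approx_params(48, 1600, 50_257, 1024) / 1e9:.2f}B")    # ~1.56B vs reported 1.5B
print(f"GPT-3 ~ {approx_params(96, 12_288, 50_257, 2048) / 1e9:.0f}B")  # ~175B vs reported 175B
```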
Reference
https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2
