Have you ever considered the relationship between sample size and parameter count in deep learning?


In deep learning, what is the relationship between sample size and the number of parameters?

Is it simply that more samples and more parameters always mean better model performance?
More parameters naturally raise the concern of overfitting, so what relationship should be maintained between sample size and parameter count?

Reference paper: Scaling Laws for Neural Language Models

Summary

The paper mainly discusses the following points.

**Performance depends strongly on scale, weakly on model shape:** Model performance depends most strongly on scale, which consists of three factors: the number of model parameters \(N\) (excluding embeddings), the size of the dataset \(D\), and the amount of compute \(C\) used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)

Model performance depends mostly on the scale of training and only weakly on the model's shape.
Scale consists of three factors: the number of model parameters \(N\) (excluding embeddings), the dataset size \(D\), and the training compute \(C\). Performance is mainly determined by these three factors and has little to do with depth vs. width.
This is discussed in Section 3 of the paper.

**Smooth power laws:** Performance has a power-law relationship with each of the three scale factors \(N, D, C\) when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)

Smooth power laws

The loss has a power-law relationship with each of the three scale factors \(N\), \(D\), and \(C\) when it is not bottlenecked by the other two. Figure 1 shows these trends spanning more than six orders of magnitude. The experiments show no significant deviation from these trends at the upper end, though performance must eventually flatten out before reaching zero loss.

The test loss of a Transformer trained to autoregressively model language can be predicted using a power-law when performance is limited by only either the number of non-embedding parameters \(N\), the dataset size \(D\), or the optimally allocated compute budget \(C_{\min }\) (see Figure 1):

The test loss of a Transformer trained for autoregressive language modeling can be predicted with the following power laws, depending on which factor is the bottleneck (a small numerical sketch follows the three cases):

  1. For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
    When the number of parameters is the bottleneck and the model is trained to convergence on a sufficiently large dataset, the loss depends only on \(N\):

\[L(N)=\left(N_{\mathrm{c}} / N\right)^{\alpha_{N}} ; \alpha_{N} \sim 0.076, \quad N_{\mathrm{c}} \sim 8.8 \times 10^{13} \text { (non-embedding parameters) } \]

  2. For large models trained with a limited dataset with early stopping:
    For a large model trained on a limited dataset with early stopping, the loss depends only on \(D\):

\[L(D)=\left(D_{\mathrm{c}} / D\right)^{\alpha_{D}} ; \alpha_{D} \sim 0.095, \quad D_{\mathrm{c}} \sim 5.4 \times 10^{13} \text { (tokens) } \]

  3. When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal use of compute):

    When compute is the bottleneck, with sufficient data, an optimally sized model, and a sufficiently small batch size, the loss depends only on the optimally allocated compute \(C_{\min}\):

\[L\left(C_{\min }\right)=\left(C_{\mathrm{c}}^{\min } / C_{\min }\right)^{\alpha_{C}^{\min }} ; \alpha_{C}^{\min } \sim 0.050, \quad C_{\mathrm{c}}^{\min } \sim 3.1 \times 10^{8} \text { (PF-days) } \]
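As a quick numerical illustration, the three fitted power laws above can be evaluated directly. This is a minimal sketch that only plugs the constants quoted above into the formulas; the example values of \(N\), \(D\), and \(C_{\min}\) are arbitrary and not taken from the paper.

```python
# Evaluate the three power laws with the fitted constants quoted above.
# The example N, D, C values passed in below are arbitrary illustrations.

def loss_from_params(n, n_c=8.8e13, alpha_n=0.076):
    """L(N): loss when limited only by non-embedding parameter count N."""
    return (n_c / n) ** alpha_n

def loss_from_data(d, d_c=5.4e13, alpha_d=0.095):
    """L(D): loss when limited only by dataset size D (tokens), with early stopping."""
    return (d_c / d) ** alpha_d

def loss_from_compute(c_min, c_c=3.1e8, alpha_c=0.050):
    """L(C_min): loss when limited only by optimally allocated compute (PF-days)."""
    return (c_c / c_min) ** alpha_c

print(loss_from_params(1.5e8))   # e.g. a ~150M non-embedding-parameter model
print(loss_from_data(2.2e10))    # e.g. ~22B training tokens
print(loss_from_compute(1.0))    # e.g. 1 PF-day of optimally allocated compute
```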

**Universality of overfitting:** Performance improves predictably as long as we scale up \(N\) and \(D\) in tandem, but enters a regime of diminishing returns if either \(N\) or \(D\) is held fixed while the other increases. The performance penalty depends predictably on the ratio \(N^{0.74} / D\), meaning that every time we increase the model size \(8 \mathrm{x}\), we only need to increase the data by roughly \(5 \mathrm{x}\) to avoid a penalty. (Section 4)

Universality of overfitting

As long as \(N\) and \(D\) are scaled up together, performance improves predictably; but if either \(N\) or \(D\) is held fixed while the other increases, we enter a regime of diminishing returns. The performance penalty depends predictably on the ratio \(N^{0.74} / D\): every time the model size increases 8x, the training data only needs to increase by roughly 5x to avoid a penalty.

This is the relationship between parameter count and training-set size that the paper gives.
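The 8x/5x rule of thumb follows from keeping \(N^{0.74} / D\) fixed: scaling \(N\) by 8 means \(D\) must scale by \(8^{0.74}\). A one-line check:

```python
# Keeping N**0.74 / D constant: an 8x increase in N requires a D increase of 8**0.74.
print(8 ** 0.74)   # ≈ 4.66, i.e. roughly 5x more data
```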

**Universality of training:** Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)

Universality of training
Training curves also follow predictable power laws, and the parameters of those power laws are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be reached if training continued for much longer.
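As a rough illustration of how such an extrapolation could work in practice (this is not the paper's fitting procedure), one can fit a power law with an asymptote to the early part of a training curve and extrapolate it forward. The curve below is synthetic, and the functional form \(L(S) = A \cdot S^{-\alpha} + L_{\infty}\) is an assumption made for this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic training curve: loss vs. optimization step (made-up constants, illustration only).
steps = np.arange(100, 10_001, 100)
rng = np.random.default_rng(0)
observed = 1.976 * steps ** -0.08 + 1.5 + rng.normal(scale=0.002, size=steps.size)

# Power law with an asymptote: L(S) = A * S**(-alpha) + L_inf
def power_law(s, a, alpha, l_inf):
    return a * s ** (-alpha) + l_inf

# Fit only the early part of the curve (steps <= 2000) ...
early = steps <= 2_000
params, _ = curve_fit(power_law, steps[early], observed[early], p0=(2.0, 0.1, 1.0), maxfev=10_000)

# ... then extrapolate to a much later step and compare against the held-out tail.
print("predicted loss at step 10000:", power_law(10_000, *params))
print("observed  loss at step 10000:", observed[-1])
```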

**Transfer improves with test performance:** When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss - in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)

Transfer improves with test performance
When a model is evaluated on text from a distribution different from the one it was trained on, the results are strongly correlated with those on the training validation set, with a roughly constant offset in the loss. In other words, transferring to a different distribution incurs a constant penalty, but otherwise performance improves roughly in line with performance on the training set.

**Sample efficiency:** Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).

Sample efficiency
Large models are more sample-efficient than small models: they reach the same level of performance with fewer optimization steps (Figure 2) and fewer data points (Figure 4).

This feels closely related to transfer learning in practice: once a large model is trained well, it can be transferred to downstream tasks and fine-tuned.


**Convergence is inefficient:** When working within a fixed compute budget \(C\) but without any other restrictions on the model size \(N\) or available data \(D\), we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as \(D \sim C^{0.27}\) with training compute. (Section 6)

Convergence is inefficient
When the compute budget \(C\) is fixed but there are no other constraints on model size \(N\) or available data \(D\), optimal performance is obtained by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training is therefore far more sample-efficient than one would expect from training small models to convergence, with data requirements growing only slowly as \(D \sim C^{0.27}\) with training compute.
Simply put, when compute is ample it is better to train a very large model on a large dataset and stop well before convergence than to train a small model to convergence.
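To make the \(D \sim C^{0.27}\) relation concrete, the small sketch below shows how slowly data requirements grow as compute is scaled up under compute-efficient training; the 10x/100x/1000x compute factors are arbitrary examples.

```python
# Under compute-efficient training, data requirements grow roughly as C**0.27.
for compute_factor in (10, 100, 1000):
    print(f"{compute_factor}x compute -> about {compute_factor ** 0.27:.1f}x data")
```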

**Optimal batch size:** The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly \(1-2\) million tokens at convergence for the largest models we can train. (Section 5.1)

Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.

Optimal batch size
The ideal batch size for training these models is roughly a power of the loss only, and it can be determined by measuring the gradient noise scale [MKAT18]; for the largest models the paper trains, it is roughly 1-2 million tokens at convergence.
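The batch-size result leans on the gradient noise scale of [MKAT18]. As a rough illustration of the underlying idea only (not the measurement procedure used in either paper), the "simple" noise scale can be estimated as the total per-example gradient variance divided by the squared norm of the mean gradient; the per-example gradients below are synthetic.

```python
import numpy as np

# Toy per-example gradients: rows = examples, columns = parameters (all synthetic).
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1_000)                                   # pretend "full-batch" gradient
per_example = true_grad + rng.normal(scale=5.0, size=(4096, 1_000))  # noisy per-example gradients

mean_grad = per_example.mean(axis=0)                    # estimate of the full-batch gradient G
total_var = per_example.var(axis=0, ddof=1).sum()       # tr(Sigma): summed per-example gradient variance
noise_scale = total_var / np.dot(mean_grad, mean_grad)  # simple noise scale  tr(Sigma) / |G|^2

print(f"estimated simple gradient noise scale: {noise_scale:.0f}")
```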
Taken together, these results show that language modeling performance improves smoothly and predictably as model size, data, and compute are scaled up appropriately; larger language models are expected to perform better and be more sample-efficient than current models. This is the paper's headline conclusion.

