BERT Model Compression Methods


 

Model compression reduces the redundancy in a trained neural network. Since neither BERT nor BERT-Large can readily be deployed on a single GPU, let alone a smartphone, compression methods are highly valuable for BERT's practical applications going forward.

 

I. Compression Methods

1. Pruning: removing unnecessary parts of the network after training.

This includes magnitude-based weight pruning, attention-head pruning, and pruning of entire layers or other components. Some methods also apply regularization during training to make the network easier to prune (e.g., LayerDrop).
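Below is a minimal sketch of magnitude pruning, assuming PyTorch; the threshold-by-k-th-smallest logic is my own illustration, not the procedure from any specific paper listed later.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out roughly the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

w = torch.randn(768, 768)              # e.g., one attention projection matrix
w_pruned = magnitude_prune(w, 0.3)     # zero out ~30% of the weights
print((w_pruned == 0).float().mean())  # ~0.3
```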

2. Weight factorization: approximating a parameter matrix by factoring it into the product of two smaller matrices.

This imposes a low-rank constraint on the matrix. Weight factorization can be applied to the token embedding layer (which saves a lot of memory on disk) or to the parameters of the feed-forward / self-attention layers (for some speed gains).

===> Factoring into two smaller matrices reduces the parameter count: for example, a 5x5 matrix has 25 parameters, while approximating it as a 5x2 matrix times a 2x5 matrix needs only 20, and the savings grow rapidly for larger matrices.
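As a concrete sketch of this idea (my own illustration using truncated SVD, not code from any of the papers below), the snippet factors a square matrix into a tall matrix and a wide matrix of rank r:

```python
import numpy as np

d_out, d_in, r = 768, 768, 64
W = np.random.randn(d_out, d_in)

# Truncated SVD: W (d_out x d_in) ~= A (d_out x r) @ B (r x d_in)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # absorb the singular values into A
B = Vt[:r, :]
W_approx = A @ B

print("original params:  ", W.size)            # 589,824
print("factorized params:", A.size + B.size)   # 98,304 (about 6x fewer)
print("relative error:   ", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```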

3. Knowledge distillation: also known as "student-teacher" training.

A much smaller Transformer is trained from scratch on the pre-training/downstream data. This would normally fail, but, for reasons that are not fully understood, using soft labels from the full-sized model improves optimization. Some methods distill BERT into other architectures with faster inference, such as LSTMs. Still others mine the teacher more deeply, matching not only its outputs but also its weight matrices and hidden activations.
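A minimal sketch of the vanilla soft-label distillation loss is shown below, assuming PyTorch; the temperature T and mixing weight alpha are illustrative hyperparameters, not values taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 3, requires_grad=True)  # small student
teacher_logits = torch.randn(8, 3)                      # frozen full-size teacher
labels = torch.randint(0, 3, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```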

4. Weight sharing: some weights in the model share the same values as other parameters in the model.

For example, ALBERT uses the same weight matrices for every self-attention layer in BERT.
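The sketch below illustrates the idea of cross-layer sharing by applying a single Transformer encoder layer repeatedly; the sizes are illustrative and this is not ALBERT's actual implementation.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, num_layers=12):
        super().__init__()
        # One set of layer weights, reused at every depth of the stack.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)   # same parameters at every depth
        return x

model = SharedEncoder()
hidden = model(torch.randn(2, 16, 768))             # (batch, seq_len, hidden)
print(sum(p.numel() for p in model.parameters()))   # cost of one layer, not twelve
```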

5. Quantization: truncating floating-point values so that they use only a few bits (which introduces rounding error).

The quantized values can be learned either during training (quantization-aware training) or after training (post-training quantization).
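Below is a minimal sketch of post-training symmetric 8-bit quantization of a single weight matrix, written in NumPy for illustration; the actual schemes in Q8BERT and Q-BERT are considerably more elaborate.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                          # map max |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                      # approximate floats back

w = np.random.randn(768, 768).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())        # bounded by ~scale / 2
```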

6. Pre-training vs. downstream: some methods compress BERT only with a specific downstream task in mind, while others compress it in a task-agnostic way.

 

Surveyed papers, grouped by the primary compression technique they use:

Pruning
  • Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
  • Are Sixteen Heads Really Better than One?
  • Pruning a BERT-based Question Answering Model
  • Reducing Transformer Depth on Demand with Structured Dropout
  • Reweighted Proximal Pruning for Large-Scale Language Representation
  • Structured Pruning of Large Language Models

Weight factorization and weight sharing
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Knowledge distillation
  • Extreme Language Model Compression with Optimal Subwords and Shared Projections
  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
  • Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data
  • Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models (multi-task)
  • Patient Knowledge Distillation for BERT Model Compression
  • TinyBERT: Distilling BERT for Natural Language Understanding
  • MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer

Quantization
  • Q8BERT: Quantized 8Bit BERT
  • Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

 

Below is a comparison of results, focusing mainly on the following metrics: parameter reduction, inference speed-up, and accuracy.

If I had to pick winners, I think they are ALBERT, DistilBERT, MobileBERT, Q-BERT, LayerDrop, and RPP. You could also stack some of these methods on top of each other (see note 4 below), although some of the pruning papers are more science than practical engineering, so it is worth checking the numbers:

 

 

| Paper | Reduction of | Speed-up | Accuracy | Comments |
| --- | --- | --- | --- | --- |
| Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning | 30% params | ? | Same | Some interesting ablation experiments and fine-tuning analysis |
| Are Sixteen Heads Really Better than One? | 50-60% attn heads | 1.2x | Same | |
| Pruning a BERT-based Question Answering Model | 50% attn heads + FF | 2x | -1.5 F1 | |
| Reducing Transformer Depth on Demand with Structured Dropout | 50-75% layers | ? | Same | |
| Reweighted Proximal Pruning for Large-Scale Language Representation | 40-80% params | ? | Same | |
| Structured Pruning of Large Language Models | 35% params | ? | Same | |
| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 90-95% params | 6-20x | Worse | Allows training larger models (BERT-xxlarge), so effectively 30% param reduction and 1.5x speed-up with better accuracy |
| Extreme Language Model Compression with Optimal Subwords and Shared Projections | 80-98% params | ? | Worse to much worse | |
| DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 40% params | 2.5x | 97% | 🤗 Huggingface |
| Distilling Task-Specific Knowledge from BERT into Simple Neural Networks | 99% params | 15x | ELMo-equivalent | Distills into Bi-LSTMs |
| Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data | 96% params | ? | ? | Low-resource only |
| Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models | 90% params | 14x | Better than Tang et al. above | Distills into Bi-LSTMs |
| Patient Knowledge Distillation for BERT Model Compression | 50-75% layers | 2-4x | Worse | But better than vanilla KD |
| TinyBERT: Distilling BERT for Natural Language Understanding | 87% params | 9.4x | 96% | |
| MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer | 77% params | 4x | Competitive | |
| Q8BERT: Quantized 8Bit BERT | 75% bits | ? | Negligible loss | "Need hardware to show speed-ups", but I don't think anyone has it |
| Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | 93% bits | ? | "At most 2.3% worse" | Probably the same hardware caveat as above |

 

Recommended related papers and blog posts

  • Sparse Transformer: Concentrated Attention Through Explicit Selection (paper: https://openreview.net/forum?id=Hye87grYDH)
  • Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks (paper: http://arxiv.org/abs/1906.04393)
  • Adaptively Sparse Transformers (paper: https://www.semanticscholar.org/paper/f6390beca54411b06f3bde424fb983a451789733)
  • Compressing BERT for Faster Prediction (blog post: https://blog.rasa.com/compressing-bert-for-faster-prediction-2/amp/)

Closing notes:
1. Note that not all compression methods make the model faster: unstructured pruning is notoriously hard to speed up through GPU parallelism. One of the papers also argues that in Transformers, computation time is dominated by the softmax rather than by matrix multiplication.

2. Better evaluation metrics for model compression would be welcome, something along the lines of F1.

3. Some of these percentages are measured against BERT-Large rather than BERT-Base; treat them as rough reference points.

4. How different compression methods interact when combined is an open research question.


