PyTorch BERT Source Code Walkthrough

Reposted from the original article below, 2019-07-29 10:14.

https://daiwk.github.io/posts/nlp-bert.html

Reference: https://www.zhihu.com/question/298203515/answer/509703208

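Overview

This post introduces BERT (Bidirectional Encoder Representations from Transformers), a new language representation model. Unlike other recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks. It outperforms many systems with task-specific architectures and sets new state-of-the-art results on 11 NLP tasks.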

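There are currently two strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.

The feature-based strategy (e.g. ELMo) uses task-specific architectures that include the pre-trained representations as additional features.
The fine-tuning strategy (e.g. the Generative Pre-trained Transformer, OpenAI GPT) introduces minimal task-specific parameters and is trained on downstream tasks by simply fine-tuning the pre-trained parameters.

In previous work, both strategies use the same objective function during pre-training: a unidirectional language model that learns general language representations.

The authors argue that current techniques severely restrict the power of pre-trained representations, especially for fine-tuning approaches. The main limitation is that standard language models are unidirectional, which limits the architectures that can be used during pre-training. For example, OpenAI GPT uses a left-to-right architecture in which every token can only attend to previous tokens in the Transformer's self-attention layers. Such restrictions are sub-optimal for sentence-level tasks and could be devastating for token-level tasks such as SQuAD question answering, where incorporating context from both directions is crucial.

BERT (Bidirectional Encoder Representations from Transformers) improves the fine-tuning-based approach.

BERT proposes a new pre-training objective, the masked language model (MLM), to overcome the unidirectional constraint described above. MLM is inspired by the Cloze task (Taylor, 1953). It randomly masks some of the input tokens, and the objective is to predict the original vocabulary id of each masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective lets the representation fuse left and right context, which makes it possible to pre-train a deep bidirectional Transformer. In addition to MLM, the paper introduces a "next sentence prediction" task that jointly pre-trains text-pair representations.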

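Contributions:

The paper demonstrates the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), who pre-train with a unidirectional language model, BERT uses MLM to pre-train deep bidirectional representations. It also differs from Peters et al. (2018), who use a shallow concatenation of independently trained left-to-right and right-to-left LMs.
It shows that pre-trained representations remove the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.
BERT sets new state-of-the-art results on 11 NLP tasks. The paper also reports ablation studies of BERT, which show that the bidirectional nature of the model is its single most important new contribution. Code and pre-trained models will be released at goo.gl/language/bert.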

BERT

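Model Architecture

BERT is designed to pre-train deep bidirectional representations by conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for many tasks, such as question answering and language inference, without substantial task-specific architecture changes.

BERT's model architecture is a multi-layer bidirectional Transformer encoder, based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library (see also https://daiwk.github.io/posts/platform-tensor-to-tensor.html and https://daiwk.github.io/posts/platform-tensor-to-tensor-coding.html).

In this post the number of layers (i.e. Transformer blocks) is denoted \(L\), the hidden size \(H\), and the number of self-attention heads \(A\). In all experiments the feed-forward/filter size is set to \(4H\), i.e. 3072 for H=768 and 4096 for H=1024. Results are reported mainly for two model sizes: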

\(BERT_{BASE}\): L=12, H=768, A=12, Total Parameters=110M
\(BERT_{LARGE}\): L=24, H=1024, A=16, Total Parameters=340M
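
The totals above can be sanity-checked with a back-of-the-envelope count. The sketch below is an estimate under stated assumptions, not the paper's own accounting: it assumes a 30,000-token vocabulary, 512 positions, two segment embeddings, a bias on every linear layer, two LayerNorms per block, one LayerNorm after the embeddings, and a pooler layer on top of [CLS]. `bert_param_count` is a hypothetical helper name.

```python
def bert_param_count(num_layers, hidden, vocab_size=30000,
                     max_positions=512, num_segments=2):
    """Rough parameter count of a BERT-style encoder (assumed layout)."""
    # Embedding tables (token + position + segment) plus one LayerNorm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections, each H x H with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward network: H -> 4H -> H, with biases.
    ffn = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    # Two LayerNorms per block, each with a gain and a bias vector.
    layer_norms = 2 * 2 * hidden
    per_layer = attention + ffn + layer_norms
    # Assumed pooler: one H x H linear layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-BASE  estimate: {base / 1e6:.1f}M parameters")
print(f"BERT-LARGE estimate: {large / 1e6:.1f}M parameters")
```

Under these assumptions both estimates land within a few percent of the 110M and 340M figures, and most of the weight comes from the roughly 12H² parameters in each Transformer block.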

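\(BERT_{BASE}\) has the same model size as OpenAI GPT. The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which every token can only attend to context on its left. In the literature, the bidirectional Transformer is usually called a "Transformer encoder", while the left-context-only version is called a "Transformer decoder" because it can be used for text generation.

A figure in the original post contrasts the structures of BERT, the GPT Transformer, and ELMo:

BERT uses a bidirectional Transformer.
OpenAI GPT uses a left-to-right Transformer.
ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks.

Of the three, only BERT representations are conditioned on both left and right context in all layers.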

Input Representation

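The paper's input representation can unambiguously represent either a single text sentence or a pair of text sentences (e.g. [Question, Answer]) in one token sequence. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings:

WordPiece embeddings are used (GNMT, "Google's neural machine translation system: Bridging the gap between human and machine translation") with a 30,000-token vocabulary. Split word pieces are marked with ##.
Learned positional embeddings are used, with supported sequence lengths of up to 512 tokens.
The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state corresponding to this token (i.e. the Transformer output) is used as the aggregate sequence representation for classification tasks. For non-classification tasks this vector is ignored.
Sentence pairs are packed into a single sequence. The sentences are distinguished in two ways.
- First, they are separated by a special token ([SEP]).
- Second, a learned sentence A embedding is added to every token of the first sentence, and a sentence B embedding to every token of the second sentence.
For single-sentence inputs, only the sentence A embedding is used.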
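
The packing rules above can be illustrated with a few lines of plain Python. This is a sketch under assumptions, not the tokenizer from the paper: tokens are taken as already split, `pack_pair` and the `[PAD]` string are hypothetical, and the segment labels 1, 2 and 0 (padding) follow the convention of the `SegmentEmbedding` class shown later in this post.

```python
def pack_pair(tokens_a, tokens_b=None, seq_len=16):
    """Build tokens, segment labels and position ids for one or two sentences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segments = [1] * len(tokens)                # sentence A, including [CLS] and [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segments += [2] * (len(tokens_b) + 1)   # sentence B and its [SEP]
    # Truncate to the maximum length, then pad up to it.
    tokens, segments = tokens[:seq_len], segments[:seq_len]
    padding = seq_len - len(tokens)
    tokens += ["[PAD]"] * padding
    segments += [0] * padding                   # 0 marks padding positions
    positions = list(range(seq_len))
    return tokens, segments, positions


tokens, segments, positions = pack_pair(["my", "dog"], ["is", "hairy"], seq_len=8)
for position, segment, token in zip(positions, segments, tokens):
    print(position, segment, token)
```

Each of the three lists indexes one embedding table, and the three looked-up vectors are summed per position.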

Pre-training Tasks

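While training the bidirectional language model, BERT replaces a small fraction of the words, with low probability, by [MASK] or by another random word. The apparent purpose is to force the model to rely more on the surrounding context (from the Zhihu answer).
A next-sentence-prediction loss is added.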

Task #1: Masked LM

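Standard conditional language models can only be trained left-to-right or right-to-left, because bidirectional conditioning would allow each word to indirectly "see itself" in a multi-layer context.

To train a deep bidirectional representation, the team takes a simple approach: randomly mask some of the input tokens, then predict only those masked tokens. The paper calls this procedure "masked LM" (MLM), although in the literature it is often referred to as a Cloze task (Taylor, 1953).

In this case, the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of the team's experiments, 15% of the WordPiece tokens in each sequence are masked at random. In contrast to denoising auto-encoders (Vincent et al., 2008), only the masked words are predicted rather than the entire input being reconstructed.

Although this does give the team a bidirectional pre-trained model, the approach has two downsides.

Downside 1: there is a mismatch between pre-training and fine-tuning, because the [MASK] token is never seen during fine-tuning.

To mitigate this, the team does not always replace the "masked" words with the actual [MASK] token. Instead, the training data generator chooses 15% of the tokens at random.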

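For example, in the sentence "my dog is hairy", suppose the chosen token is "hairy". Rather than always replacing the chosen word with [MASK], the data generator does the following:

80% of the time: replace the word with the [MASK] token, e.g. my dog is hairy → my dog is [MASK]
10% of the time: replace the word with a random word, e.g. my dog is hairy → my dog is apple
10% of the time: keep the word unchanged, e.g. my dog is hairy → my dog is hairy. The purpose is to bias the representation toward the word that was actually observed.

The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. In addition, because random replacement happens for only 1.5% of all tokens (i.e. 10% of 15%), it does not seem to harm the model's language understanding capability.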
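
The 80/10/10 rule can be sketched in plain Python. This is an illustration under assumptions, not the paper's data generator: each token is selected independently with probability 0.15 (rather than exactly 15% of a sequence), and `mask_tokens`, `MASK`, and the toy vocabulary are hypothetical names.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token and predict nothing here.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)                      # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random word
        else:
            corrupted.append(token)               # 10%: keep the word unchanged
    return corrupted, labels


rng = random.Random(0)
vocab = ["my", "dog", "is", "hairy", "apple"]
sentence = "my dog is hairy".split()
corrupted, labels = mask_tokens(sentence, vocab, rng)
print(corrupted)
print(labels)
```

Only positions with a non-None label contribute to the MLM loss, which is also why each batch yields a training signal for roughly 15% of its tokens.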

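Downside 2: only 15% of the tokens in each batch are predicted, which suggests that the model may need more pre-training steps to converge.

The team shows that MLM converges slightly more slowly than a left-to-right model (which predicts every token), but that the empirical gains of the MLM model far outweigh the added training cost.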

Task #2: Next Sentence Prediction

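To train a model that understands relationships between sentences, BERT pre-trains a binarized next-sentence prediction task, which can be generated from any monolingual corpus. Specifically, when sentences A and B are chosen for a pre-training example, 50% of the time B is the actual sentence that follows A, and 50% of the time it is a random sentence from the corpus. For example: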

 
    Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
    Label = IsNext

    Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
    Label = NotNext
          

The NotNext sentences are chosen completely at random, and the final pre-trained model reaches 97%-98% accuracy on this task.
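
The 50/50 sampling can be sketched as follows. This is a minimal illustration, assuming the corpus is an ordered list of sentences; `sample_pair` is a hypothetical helper, not a function from the paper or from BERT-pytorch.

```python
import random


def sample_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for the sentence at `index`.

    `index` must not point at the last sentence, since that one has no successor.
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # IsNext: B is the sentence that actually follows A.
        return sentence_a, sentences[index + 1], True
    # NotNext: B is drawn uniformly from the corpus. In a tiny corpus the draw
    # can coincide with the true next sentence; this sketch does not filter that.
    return sentence_a, rng.choice(sentences), False


sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
rng = random.Random(0)
print(sample_pair(sentences, 0, rng))
```

The pair is then packed as [CLS] A [SEP] B [SEP], and the classifier on the [CLS] position is trained against `is_next`.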

Pre-training Procedure

The GELU activation function is used (Bridging nonlinearities and stochastic regularizers with gaussian error linear units). In PyTorch it is implemented as follows:

 
    import math

    import torch
    from torch import nn


    class GELU(nn.Module):
        """
        Paper Section 3.4, last paragraph notice that BERT used the GELU instead of RELU
        """

        def forward(self, x):
            # tanh approximation of GELU
            return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
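
The formula in `GELU.forward` is the tanh approximation of GELU. The dependency-free sketch below compares it with the exact form x·Φ(x), computed through `math.erf`, and with ReLU. The function names are hypothetical, and scalar floats stand in for tensors.

```python
import math


def gelu_tanh(x):
    """tanh approximation, the same expression as in GELU.forward."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu_exact={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative values through for moderately negative inputs. The two GELU variants stay very close over the range printed here.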
          

Fine-tuning Procedure

Comparison of BERT and OpenAI GPT

Experiments

The network structure is as follows:

GLUE Datasets

GLUE Results

SQuAD v1.1

Named Entity Recognition

SWAG

Ablation Studies

Effect of Pre-training Tasks

Effect of Model Size

Effect of Number of Training Steps

Feature-based Approach with BERT

Code Implementation

PyTorch version

https://github.com/codertimo/BERT-pytorch

A fork: https://github.com/daiwk/BERT-pytorch

Input data/corpus.small:

 
    Welcome to the \t the jungle \n
    I can stay \t here all night \n
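
Each corpus line therefore holds two sentences separated by a tab. A minimal parsing sketch, assuming the `\t` and `\n` in the listing stand for real tab and newline characters; `parse_corpus` is a hypothetical helper and not part of BERT-pytorch.

```python
def parse_corpus(lines):
    """Split every 'sentence A <tab> sentence B' line into a pair of strings."""
    pairs = []
    for line in lines:
        first, second = line.rstrip("\n").split("\t")
        pairs.append((first.strip(), second.strip()))
    return pairs


corpus = [
    "Welcome to the \t the jungle \n",
    "I can stay \t here all night \n",
]
pairs = parse_corpus(corpus)
for sentence_a, sentence_b in pairs:
    print(repr(sentence_a), "->", repr(sentence_b))
```

The two halves of a line serve as sentence A and its true next sentence B for the next-sentence-prediction task.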
          

Visualization requires:

 
    brew install graphviz  # mac
    pip3 install git+https://github.com/szagoruyko/pytorchviz
          

To draw the BERT architecture graph, generate the vocab first. If the local dot supports only png/jpg and not pdf, change dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12")) to dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12"), format="png") in lib/python3.6/site-packages/torchviz/dot.py. Then run:

 
    import sys

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchviz import make_dot, make_dot_from_trace

    sys.path.append("./bert_pytorch-0.0.1a4.src/")
    #from trainer import BERTTrainer
    from model import BERTLM, BERT
    from dataset import BERTDataset, WordVocab


    def demo():
        lstm_cell = nn.LSTMCell(128, 128)
        x = torch.randn(1, 128)
        dot = make_dot(lstm_cell(x), params=dict(list(lstm_cell.named_parameters())))
        file_out = "xx"
        dot.render(file_out)


    def bert_dot():
        """
        """
        vocab_size = 128
        train_dataset_path = "data/bert_train_data.xxx"
        vocab_path = "data/vocab.all.xxx"
        vocab = WordVocab.load_vocab(vocab_path)
        train_dataset = BERTDataset(train_dataset_path, vocab, seq_len=20,
                                    corpus_lines=2000, on_memory=True)
        train_data_loader = DataLoader(train_dataset, batch_size=8, num_workers=8)

        bert = BERT(len(vocab), hidden=256, n_layers=8, attn_heads=8)
        device = torch.device("cpu")
        mymodel = BERTLM(bert, vocab_size).to(device)

        data_iter = train_data_loader
        out_idx = 0
        for data in data_iter:
            data = {key: value.to(device) for key, value in data.items()}
            if out_idx == 0:
                g = make_dot(mymodel(data["bert_input"], data["segment_label"]),
                             params=dict(mymodel.named_parameters()))
                g.render("./bert-arch")
            # only the first batch is needed to trace the graph
            break


    bert_dot()
          

This produces the graph below. It is too large to embed, so download it to take a look:

https://daiwk.github.io/assets/bert-arch.jpeg

The corresponding PDF:

https://daiwk.github.io/assets/bert-arch.pdf

The corresponding dot file:

https://daiwk.github.io/assets/bert-arch

To convert the dot file to other formats:

 
    input=./bert-arch
    output=./bert-arch
    dot $input -Tjpeg -o $output.jpeg
    dot $input -Tpdf -o $output.pdf
          

A simplified PDF with the model set to a single layer:

https://daiwk.github.io/assets/bert-arch-1layer.pdf

Code Walkthrough

For the Transformer part, see http://nlp.seas.harvard.edu/2018/04/03/attention.htm

https://blog.csdn.net/stupid_3/article/details/83184691 is also worth studying; it explains things in great detail.

Background

See https://daiwk.github.io/posts/knowledge-pytorch-usage.html

position encoding

Code

 
    class PositionalEncoding(nn.Module):
        "Implement the PE function."

        def __init__(self, d_model, dropout, max_len=5000):
            super(PositionalEncoding, self).__init__()
            self.dropout = nn.Dropout(p=dropout)

            # Compute the positional encodings once in log space.
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(0, max_len).unsqueeze(1)
            div_term = torch.exp(torch.arange(0, d_model, 2) *
                                 -(math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            pe = pe.unsqueeze(0)
            self.register_buffer('pe', pe)

        def forward(self, x):
            x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
            return self.dropout(x)
          

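The encoding table pe is a matrix of shape (max_len, d_model), where d_model is the embedding size. The figure in the original post uses max_len=100 and d_model=20, and plots the values of dimensions 4, 5, 6 and 7 across the 100 positions.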
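
The same table can be computed without PyTorch, which makes the formula easier to see: even dimension i holds sin(pos · 10000^(-i/d_model)) and the following odd dimension holds the matching cosine. The sketch below mirrors the `div_term` expression from the class above; `sinusoid_table` is a hypothetical name.

```python
import math


def sinusoid_table(max_len, d_model):
    """Return a max_len x d_model list of lists with sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same quantity as position * div_term in PositionalEncoding.
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoid_table(max_len=100, d_model=20)
# Dimensions 4..7 at the first few positions, matching the dimensions in the figure.
for pos in range(3):
    print(pos, [round(value, 4) for value in pe[pos][4:8]])
```

Lower dimensions oscillate quickly with position and higher dimensions slowly, so every position receives a distinct pattern.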

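BERT-pytorch renames it. Note that its forward returns only the positional table, without adding it to x or applying dropout: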

 
    class PositionalEmbedding(nn.Module):

        def __init__(self, d_model, max_len=512):
            super().__init__()

            # Compute the positional encodings once in log space.
            pe = torch.zeros(max_len, d_model).float()
            # Fixed table, not trained (the original spelled this `require_grad`,
            # which only set an unused attribute).
            pe.requires_grad = False

            position = torch.arange(0, max_len).float().unsqueeze(1)
            div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)

            pe = pe.unsqueeze(0)
            self.register_buffer('pe', pe)

        def forward(self, x):
            return self.pe[:, :x.size(1)]
          

BERT has two more embeddings, segment and token, which are implemented very simply here:

 
    class SegmentEmbedding(nn.Embedding):
        def __init__(self, embed_size=512):
            # The input is segment_label, which marks a token as sentence 1,
            # sentence 2, or padding, so num_embeddings is 3.
            super().__init__(3, embed_size, padding_idx=0)


    class TokenEmbedding(nn.Embedding):
        def __init__(self, vocab_size, embed_size=512):
            super().__init__(vocab_size, embed_size, padding_idx=0)
          

In use, the three are summed:

 
    class BERTEmbedding(nn.Module):
        """
        BERT Embedding, which consists of the following features
            1. TokenEmbedding : normal embedding matrix
            2. PositionalEmbedding : adding positional information using sin, cos
            3. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)

            sum of all these features are output of BERTEmbedding
        """

        def __init__(self, vocab_size, embed_size, dropout=0.1):
            """
            :param vocab_size: total vocab size
            :param embed_size: embedding size of token embedding
            :param dropout: dropout rate
            """
            super().__init__()
            self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
            self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
            self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
            self.dropout = nn.Dropout(p=dropout)
            self.embed_size = embed_size

        def forward(self, sequence, segment_label):
            x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
            return self.dropout(x)
          

In the rendered architecture graph, this part corresponds to the following (figure in the original post):

position-wise feed forward

 
    class PositionwiseFeedForward(nn.Module):
        "Implements FFN equation."

        def __init__(self, d_model, d_ff, dropout=0.1):
            super(PositionwiseFeedForward, self).__init__()
            self.w_1 = nn.Linear(d_model, d_ff)
            self.w_2 = nn.Linear(d_ff, d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            return self.w_2(self.dropout(F.relu(self.w_1(x))))
          

BERT replaces ReLU with GELU, so:

 
    class GELU(nn.Module):
        """
        Paper Section 3.4, last paragraph notice that BERT used the GELU instead of RELU
        """

        def forward(self, x):
            return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))


    class PositionwiseFeedForward(nn.Module):
        "Implements FFN equation."

        def __init__(self, d_model, d_ff, dropout=0.1):
            super(PositionwiseFeedForward, self).__init__()
            self.w_1 = nn.Linear(d_model, d_ff)
            self.w_2 = nn.Linear(d_ff, d_model)
            self.dropout = nn.Dropout(dropout)
            self.activation = GELU()

        def forward(self, x):
            return self.w_2(self.dropout(self.activation(self.w_1(x))))
          

attention和Multi-head attention

The code is as follows:

 
    def attention(query, key, value, mask=None, dropout=None):
        "Compute 'Scaled Dot Product Attention'"
        d_k = query.size(-1)
        scores = torch.matmul(query, key.transpose(-2, -1)) \
                 / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        p_attn = F.softmax(scores, dim = -1)
        if dropout is not None:
            p_attn = dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn


    class MultiHeadedAttention(nn.Module):
        def __init__(self, h, d_model, dropout=0.1):
            "Take in model size and number of heads."
            super(MultiHeadedAttention, self).__init__()
            assert d_model % h == 0
            # We assume d_v always equals d_k
            self.d_k = d_model // h
            self.h = h
            # `clones` is a helper from the referenced Transformer tutorial;
            # it is not defined in this post.
            self.linears = clones(nn.Linear(d_model, d_model), 4)
            self.attn = None
            self.dropout = nn.Dropout(p=dropout)

        def forward(self, query, key, value, mask=None):
            "Implements Figure 2"
            if mask is not None:
                # Same mask applied to all h heads.
                mask = mask.unsqueeze(1)
            nbatches = query.size(0)

            # 1) Do all the linear projections in batch from d_model => h x d_k
            query, key, value = \
                [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
                 for l, x in zip(self.linears, (query, key, value))]

            # 2) Apply attention on all the projected vectors in batch.
            x, self.attn = attention(query, key, value, mask=mask,
                                     dropout=self.dropout)

            # 3) "Concat" using a view and apply a final linear.
            x = x.transpose(1, 2).contiguous() \
                 .view(nbatches, -1, self.h * self.d_k)
            return self.linears[-1](x)
          

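Notes:

The rendered graph is available at https://daiwk.github.io/assets/bert-arch-1layer.pdf

There are four Linear layers: three multiply q, k and v respectively, and the last multiplies the concatenated result. All have size (d_model, d_model). Since d_k = d_v = d_model/h, q needs h projections of shape (d_k, d_model), so a single (d_model, d_model) matrix is enough; the same holds for k and v. Batching is also involved, so in the rendered graph q and k first go through a bmm, the result goes through a bmm with v, the final concat is just a view, and the output is then multiplied (mm) by the last linear layer.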
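
To make the two matrix products concrete, here is scaled dot-product attention for a single head written with plain Python lists. It is a didactic sketch rather than the PyTorch code above: there is no batching and no dropout, and the mask is a list of keep/drop flags per query row. It reuses the name `attention` for a list-based stand-in.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention(query, key, value, mask=None):
    """query, key: lists of d_k-dim vectors; value: list of d_v-dim vectors."""
    d_k = len(query[0])
    outputs, weights = [], []
    for row, q in enumerate(query):
        # First product: q . k for every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in key]
        if mask is not None:
            # Masked positions get a large negative score, as in masked_fill.
            scores = [s if keep else -1e9 for s, keep in zip(scores, mask[row])]
        p_attn = softmax(scores)
        weights.append(p_attn)
        # Second product: attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(p_attn, value))
                        for j in range(len(value[0]))])
    return outputs, weights


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, attn = attention(q, k, v)
print(attn)
print(out)
```

With h heads the same computation runs h times on d_model/h-dimensional slices; the `view` and `transpose` calls in `MultiHeadedAttention` only rearrange the tensor so that all heads are processed in one batched matmul.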

Wrapped up as modules:

 
    class Attention(nn.Module):
        """
        Compute 'Scaled Dot Product Attention
        """

        def forward(self, query, key, value, mask=None, dropout=None):
            scores = torch.matmul(query, key.transpose(-2, -1)) \
                     / math.sqrt(query.size(-1))

            if mask is not None:
                scores = scores.masked_fill(mask == 0, -1e9)

            p_attn = F.softmax(scores, dim=-1)

            if dropout is not None:
                p_attn = dropout(p_attn)

            return torch.matmul(p_attn, value), p_attn


    class MultiHeadedAttention(nn.Module):
        """
        Take in model size and number of heads.
        """

        def __init__(self, h, d_model, dropout=0.1):
            super().__init__()
            assert d_model % h == 0

            # We assume d_v always equals d_k
            self.d_k = d_model // h
            self.h = h

            self.linear_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
            self.output_linear = nn.Linear(d_model, d_model)
            self.attention = Attention()

            self.dropout = nn.Dropout(p=dropout)

        def forward(self, query, key, value, mask=None):
            batch_size = query.size(0)

            # 1) Do all the linear projections in batch from d_model => h x d_k
            query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                                 for l, x in zip(self.linear_layers, (query, key, value))]

            # 2) Apply attention on all the projected vectors in batch.
            x, attn = self.attention(query, key, value, mask=mask, dropout=self.dropout)

            # 3) "Concat" using a view and apply a final linear.
            x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)

            return self.output_linear(x)
          

layernorm和sublayer

 
    class LayerNorm(nn.Module):
        "Construct a layernorm module (See citation for details)."

        def __init__(self, features, eps=1e-6):
            super(LayerNorm, self).__init__()
            self.a_2 = nn.Parameter(torch.ones(features))
            self.b_2 = nn.Parameter(torch.zeros(features))
            self.eps = eps

        def forward(self, x):
            mean = x.mean(-1, keepdim=True)
            std = x.std(-1, keepdim=True)
            return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


    class SublayerConnection(nn.Module):
        """
        A residual connection followed by a layer norm.
        Note for code simplicity the norm is first as opposed to last.
        """

        def __init__(self, size, dropout):
            super(SublayerConnection, self).__init__()
            self.norm = LayerNorm(size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer):
            "Apply residual connection to any sublayer with the same size."
            return x + self.dropout(sublayer(self.norm(x)))
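
A plain-Python sketch of what these two classes compute for a single vector may help. It assumes a 1-D list in place of a tensor and omits dropout. `layer_norm` and `sublayer_connection` are hypothetical names. The standard deviation uses the n-1 denominator, which is the default behaviour of `x.std()` in PyTorch.

```python
import math


def layer_norm(x, gain=None, bias=None, eps=1e-6):
    """Normalize one feature vector the way LayerNorm.forward does."""
    n = len(x)
    gain = gain if gain is not None else [1.0] * n   # a_2, initialised to ones
    bias = bias if bias is not None else [0.0] * n   # b_2, initialised to zeros
    mean = sum(x) / n
    # Sample standard deviation (divide by n - 1).
    std = math.sqrt(sum((value - mean) ** 2 for value in x) / (n - 1))
    return [g * (value - mean) / (std + eps) + b for g, value, b in zip(gain, x, bias)]


def sublayer_connection(x, sublayer):
    """Pre-norm residual: x + sublayer(norm(x)), with dropout left out."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x)
print([round(value, 4) for value in y])
print(sublayer_connection(x, lambda hidden: [2.0 * value for value in hidden]))
```

Note that eps is added to the standard deviation rather than to the variance under the square root, which differs slightly from `torch.nn.LayerNorm`.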
          

The encoder layer in the Transformer:

 
    class EncoderLayer(nn.Module):
        "Encoder is made up of self-attn and feed forward (defined below)"

        def __init__(self, size, self_attn, feed_forward, dropout):
            super(EncoderLayer, self).__init__()
            self.self_attn = self_attn
            self.feed_forward = feed_forward
            self.sublayer = clones(SublayerConnection(size, dropout), 2)
            self.size = size

        def forward(self, x, mask):
            "Follow Figure 1 (left) for connections."
            x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
            return self.sublayer[1](x, self.feed_forward)
          

The decoder part:

 
    class Decoder(nn.Module):
        "Generic N layer decoder with masking."

        def __init__(self, layer, N):
            super(Decoder, self).__init__()
            self.layers = clones(layer, N)
            self.norm = LayerNorm(layer.size)

        def forward(self, x, memory, src_mask, tgt_mask):
            for layer in self.layers:
                x = layer(x, memory, src_mask, tgt_mask)
            return self.norm(x)


    class DecoderLayer(nn.Module):
        "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

        def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
            super(DecoderLayer, self).__init__()
            self.size = size
            self.self_attn = self_attn
            self.src_attn = src_attn
            self.feed_forward = feed_forward
            self.sublayer = clones(SublayerConnection(size, dropout), 3)

        def forward(self, x, memory, src_mask, tgt_mask):
            "Follow Figure 1 (right) for connections."
            m = memory
            x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
            x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
            return self.sublayer[2](x, self.feed_forward)
          

The mask part:

 
    def subsequent_mask(size):
        "Mask out subsequent positions."
        attn_shape = (1, size, size)
        # np.triu with k=1 keeps the strictly upper triangle of the (square)
        # matrix: entries above the diagonal are 1, the rest are 0.
        subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
        return torch.from_numpy(subsequent_mask) == 0


    class Batch:
        "Object for holding a batch of data with mask during training."

        def __init__(self, src, trg=None, pad=0):
            self.src = src
            self.src_mask = (src != pad).unsqueeze(-2)
            if trg is not None:
                self.trg = trg[:, :-1]
                self.trg_y = trg[:, 1:]
                self.trg_mask = \
                    self.make_std_mask(self.trg, pad)
                self.ntokens = (self.trg_y != pad).data.sum()

        @staticmethod
        def make_std_mask(tgt, pad):
            "Create a mask to hide padding and future words."
            tgt_mask = (tgt != pad).unsqueeze(-2)
            tgt_mask = tgt_mask & Variable(
                subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
            return tgt_mask
          

The mask is attached when the data is generated:

 
    def data_gen(V, batch, nbatches):
        "Generate random data for a src-tgt copy task."
        for i in range(nbatches):
            data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
            data[:, 0] = 1
            src = Variable(data, requires_grad=False)
            tgt = Variable(data, requires_grad=False)
            yield Batch(src, tgt, 0)
          

The whole model:

 
    class EncoderDecoder(nn.Module):
        """
        A standard Encoder-Decoder architecture. Base for this and many
        other models.
        """

        def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
            super(EncoderDecoder, self).__init__()
            self.encoder = encoder
            self.decoder = decoder
            self.src_embed = src_embed
            self.tgt_embed = tgt_embed
            self.generator = generator

        def forward(self, src, tgt, src_mask, tgt_mask):
            "Take in and process masked src and target sequences."
            return self.decode(self.encode(src, src_mask), src_mask,
                               tgt, tgt_mask)

        def encode(self, src, src_mask):
            return self.encoder(self.src_embed(src), src_mask)

        def decode(self, memory, src_mask, tgt, tgt_mask):
            return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


    class Generator(nn.Module):
        "Define standard linear + softmax generation step."

        def __init__(self, d_model, vocab):
            super(Generator, self).__init__()
            self.proj = nn.Linear(d_model, vocab)

        def forward(self, x):
            return F.log_softmax(self.proj(x), dim=-1)


    def make_model(src_vocab, tgt_vocab, N=6,
                   d_model=512, d_ff=2048, h=8, dropout=0.1):
        "Helper: Construct a model from hyperparameters."
        c = copy.deepcopy
        attn = MultiHeadedAttention(h, d_model)
        ff = PositionwiseFeedForward(d_model, d_ff, dropout)
        position = PositionalEncoding(d_model, dropout)
        model = EncoderDecoder(
            Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
            Decoder(DecoderLayer(d_model, c(attn), c(attn),
                                 c(ff), dropout), N),
            nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
            nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
            Generator(d_model, tgt_vocab))

        # This was important from their code.
        # Initialize parameters with Glorot / fan_avg.
        for p in model.parameters():
            if p.dim() > 1:
                # in-place variant; the name without the trailing underscore is deprecated
                nn.init.xavier_uniform_(p)
        return model
          

The TransformerBlock in BERT (effectively encoder-only, but with its own mask):

 
    class TransformerBlock(nn.Module):
        """
        Bidirectional Encoder = Transformer (self-attention)
        Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
        """

        def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
            """
            :param hidden: hidden size of transformer
            :param attn_heads: head sizes of multi-head attention
            :param feed_forward_hidden: feed_forward_hidden, usually 4*hidden_size
            :param dropout: dropout rate
            """
            super().__init__()
            self.attention = MultiHeadedAttention(h=attn_heads, d_model=hidden)
            self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
            self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
            self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
            self.dropout = nn.Dropout(p=dropout)

        def forward(self, x, mask):
            x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
            x = self.output_sublayer(x, self.feed_forward)
            return self.dropout(x)
          

The complete BERT:

 
    class BERT(nn.Module):
        """
        BERT model : Bidirectional Encoder Representations from Transformers.
        """

        def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
            """
            :param vocab_size: vocab_size of total words
            :param hidden: BERT model hidden size
            :param n_layers: numbers of Transformer blocks(layers)
            :param attn_heads: number of attention heads
            :param dropout: dropout rate
            """
            super().__init__()
            self.hidden = hidden
            self.n_layers = n_layers
            self.attn_heads = attn_heads

            # paper noted they used 4*hidden_size for ff_network_hidden_size
            self.feed_forward_hidden = hidden * 4

            # embedding for BERT, sum of positional, segment, token embeddings
            self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=hidden)

            # multi-layers transformer blocks, deep network
            self.transformer_blocks = nn.ModuleList(
                [TransformerBlock(hidden, attn_heads, hidden * 4, dropout) for _ in range(n_layers)])

        def forward(self, x, segment_info):
            # attention masking for padded token
            # mask shape: [batch_size, 1, seq_len, seq_len]
            mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

            # embedding the indexed sequence to sequence of vectors
            x = self.embedding(x, segment_info)

            # running over multiple transformer blocks
            for transformer in self.transformer_blocks:
                x = transformer.forward(x, mask)

            return x
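
The one-line mask expression in `BERT.forward` is dense, so here is the same construction with nested Python lists. It is a sketch for reading the shapes only: `padding_mask` is a hypothetical name, and token id 0 is taken to be padding, as in the embedding classes above.

```python
def padding_mask(batch):
    """batch: list of token-id sequences of equal length, where 0 means padding.

    Returns a nested list of shape [batch_size][1][seq_len][seq_len] in which
    entry [b][0][i][j] is True when key position j of example b is a real token.
    """
    return [[[[token > 0 for token in sequence] for _ in sequence]]
            for sequence in batch]


batch = [[5, 7, 0],
         [3, 0, 0]]
mask = padding_mask(batch)
for example in mask:
    for row in example[0]:
        print(row)
    print()
```

Every query row of an example receives the same key mask, so padded keys are hidden from all positions, while the rows that belong to padded queries are left unmasked. The singleton second dimension broadcasts over the attention heads.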
          

For pre-training:

 
    class BERTLM(nn.Module):
        """
        BERT Language Model
        Next Sentence Prediction Model + Masked Language Model
        """

        def __init__(self, bert: BERT, vocab_size):
            """
            :param bert: BERT model which should be trained
            :param vocab_size: total vocab size for masked_lm
            """
            super().__init__()
            self.bert = bert
            self.next_sentence = NextSentencePrediction(self.bert.hidden)
            self.mask_lm = MaskedLanguageModel(self.bert.hidden, vocab_size)

        def forward(self, x, segment_label):
            x = self.bert(x, segment_label)
            return self.next_sentence(x), self.mask_lm(x)


    class NextSentencePrediction(nn.Module):
        """
        2-class classification model : is_next, is_not_next
        """

        def __init__(self, hidden):
            """
            :param hidden: BERT model output size
            """
            super().__init__()
            self.linear = nn.Linear(hidden, 2)
            self.softmax = nn.LogSoftmax(dim=-1)

        def forward(self, x):
            # x[:, 0] is the hidden state at the first ([CLS]) position
            return self.softmax(self.linear(x[:, 0]))


    class MaskedLanguageModel(nn.Module):
        """
        predicting origin token from masked input sequence
        n-class classification problem, n-class = vocab_size
        """

        def __init__(self, hidden, vocab_size):
            """
            :param hidden: output size of BERT model
            :param vocab_size: total vocab size
            """
            super().__init__()
            self.linear = nn.Linear(hidden, vocab_size)
            self.softmax = nn.LogSoftmax(dim=-1)

        def forward(self, x):
            return self.softmax(self.linear(x))
          

The whole training procedure:

 
    class BERTTrainer:
        """
        BERTTrainer make the pretrained BERT model with two LM training method.

            1. Masked Language Model : 3.3.1 Task #1: Masked LM
            2. Next Sentence prediction : 3.3.2 Task #2: Next Sentence Prediction

        please check the details on README.md with simple example.
        """

        def __init__(self, bert: BERT, vocab_size: int,
                     train_dataloader: DataLoader, test_dataloader: DataLoader = None,
                     lr: float = 1e-4, betas=(0.9, 0.999), weight_decay: float = 0.01, warmup_steps=10000,
                     with_cuda: bool = True, cuda_devices=None, log_freq: int = 10):
            """
            :param bert: BERT model which you want to train
            :param vocab_size: total word vocab size
            :param train_dataloader: train dataset data loader
            :param test_dataloader: test dataset data loader [can be None]
            :param lr: learning rate of optimizer
            :param betas: Adam optimizer betas
            :param weight_decay: Adam optimizer weight decay param
            :param with_cuda: training with cuda
            :param log_freq: logging frequency of the batch iteration
            """

            # Setup cuda device for BERT training, argument -c, --cuda should be true
            cuda_condition = torch.cuda.is_available() and with_cuda
            self.device = torch.device("cuda:0" if cuda_condition else "cpu")

            # This BERT model will be saved every epoch
            self.bert = bert
            # Initialize the BERT Language Model, with BERT model
            self.model = BERTLM(bert, vocab_size).to(self.device)

            # Distributed GPU training if CUDA can detect more than 1 GPU
            if with_cuda and torch.cuda.device_count() > 1:
                print("Using %d GPUS for BERT" % torch.cuda.device_count())
                self.model = nn.DataParallel(self.model, device_ids=cuda_devices)

            # Setting the train and test data loader
            self.train_data = train_dataloader
            self.test_data = test_dataloader

            # Setting the Adam optimizer with hyper-param
            self.optim = Adam(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
            self.optim_schedule = ScheduledOptim(self.optim, self.bert.hidden, n_warmup_steps=warmup_steps)

            # Using Negative Log Likelihood Loss function for predicting the masked_token
            self.criterion = nn.NLLLoss(ignore_index=0)

            self.log_freq = log_freq

            print("Total Parameters:", sum([p.nelement() for p in self.model.parameters()]))

        def train(self, epoch):
            self.iteration(epoch, self.train_data)

        def test(self, epoch):
            self.iteration(epoch, self.test_data, train=False)

        def iteration(self, epoch, data_loader, train=True):
            """
            loop over the data_loader for training or testing
            if on train status, backward operation is activated
            and also auto save the model every epoch

            :param epoch: current epoch index
            :param data_loader: torch.utils.data.DataLoader for iteration
            :param train: boolean value of is train or test
            :return: None
            """
            str_code = "train" if train else "test"

            # Setting the tqdm progress bar
            data_iter = tqdm.tqdm(enumerate(data_loader),
                                  desc="EP_%s:%d" % (str_code, epoch),
                                  total=len(data_loader),
                                  bar_format="{l_bar}{r_bar}")

            avg_loss = 0.0
            total_correct = 0
            total_element = 0

            for i, data in data_iter:
                # 0. batch_data will be sent into the device(GPU or cpu)
                data = {key: value.to(self.device) for key, value in data.items()}

                # 1. forward the next_sentence_prediction and masked_lm model
                next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])

                # 2-1. NLL(negative log likelihood) loss of is_next classification result
                next_loss = self.criterion(next_sent_output, data["is_next"])

                # 2-2. NLLLoss of predicting masked token word
                mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

                # 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
                loss = next_loss + mask_loss

                # 3. backward and optimization only in train
                if train:
                    self.optim_schedule.zero_grad()
                    loss.backward()
                    self.optim_schedule.step_and_update_lr()

                # next sentence prediction accuracy
                correct = next_sent_output.argmax(dim=-1).eq(data["is_next"]).sum().item()
                avg_loss += loss.item()
                total_correct += correct
                total_element += data["is_next"].nelement()

                post_fix = {
                    "epoch": epoch,
                    "iter": i,
                    "avg_loss": avg_loss / (i + 1),
                    "avg_acc": total_correct / total_element * 100,
                    "loss": loss.item()
                }

                if i % self.log_freq == 0:
                    data_iter.write(str(post_fix))

            print("EP%d_%s, avg_loss=" % (epoch, str_code), avg_loss / len(data_iter), "total_acc=",
                  total_correct * 100.0 / total_element)

        def save(self, epoch, file_path="output/bert_trained.model"):
            """
            Saving the current BERT model on file_path

            :param epoch: current epoch number
            :param file_path: model output path which gonna be file_path+"ep%d" % epoch
            :return: final_output_path
            """
            output_path = file_path + ".ep%d" % epoch
            torch.save(self.bert.cpu(), output_path)
            self.bert.to(self.device)
            print("EP:%d Model Saved on:" % epoch, output_path)
            return output_path
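
`nn.NLLLoss(ignore_index=0)` averages the negative log-probability of the target class over the positions whose target is not 0. A plain-Python sketch of that reduction follows. `nll_loss` is a hypothetical helper. The dataset code is not shown in this post, so the sketch assumes, as the trainer suggests, that `bert_label` holds 0 at positions that were not selected for masking.

```python
import math


def nll_loss(log_probs, targets, ignore_index=0):
    """Mean of -log p(target) over the positions whose target is not ignore_index."""
    kept = [-row[target] for row, target in zip(log_probs, targets)
            if target != ignore_index]
    return sum(kept) / len(kept)   # undefined (division by zero) if every target is ignored


# Three positions, three-word vocabulary; each row is a log-softmax output.
log_probs = [
    [math.log(0.1), math.log(0.7), math.log(0.2)],
    [math.log(0.5), math.log(0.25), math.log(0.25)],
    [math.log(0.3), math.log(0.1), math.log(0.6)],
]
targets = [1, 0, 2]   # the middle position was not masked, so its label is 0
loss = nll_loss(log_probs, targets)
print(round(loss, 4))
```

Because the trainer reuses the same criterion for `is_next`, any `is_next` target equal to 0 would be skipped in the same way, which is worth checking when adapting the code.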
          

vocab and dataset

The vocab part:

 
    import pickle
    from collections import Counter

    import tqdm


    class TorchVocab(object):
        """Defines a vocabulary object that will be used to numericalize a field.
        Attributes:
            freqs: A collections.Counter object holding the frequencies of tokens
                in the data used to build the Vocab.
            stoi: A collections.defaultdict instance mapping token strings to
                numerical identifiers.
            itos: A list of token strings indexed by their numerical identifiers.
        """

        def __init__(self, counter, max_size=None, min_freq=1, specials=['<pad>', '<oov>'],
                     vectors=None, unk_init=None, vectors_cache=None):
            """Create a Vocab object from a collections.Counter.
            Arguments:
                counter: collections.Counter object holding the frequencies of
                    each value found in the data.
                max_size: The maximum size of the vocabulary, or None for no
                    maximum. Default: None.
                min_freq: The minimum frequency needed to include a token in the
                    vocabulary. Values less than 1 will be set to 1. Default: 1.
                specials: The list of special tokens (e.g., padding or eos) that
                    will be prepended to the vocabulary in addition to an <unk>
                    token. Default: ['<pad>']
                vectors: One of either the available pretrained vectors
                    or custom pretrained vectors (see Vocab.load_vectors);
                    or a list of aforementioned vectors
                unk_init (callback): by default, initialize out-of-vocabulary word vectors
                    to zero vectors; can be any function that takes in a Tensor and
                    returns a Tensor of the same size. Default: torch.Tensor.zero_
                vectors_cache: directory for cached vectors. Default: '.vector_cache'
            """
            self.freqs = counter
            counter = counter.copy()
            min_freq = max(min_freq, 1)

            self.itos = list(specials)
            # frequencies of special tokens are not counted when building vocabulary
            # in frequency order
            for tok in specials:
                del counter[tok]

            max_size = None if max_size is None else max_size + len(self.itos)

            # sort by frequency, then alphabetically
            words_and_frequencies = sorted(counter.items(), key=lambda tup: tup[0])
            words_and_frequencies.sort(key=lambda tup: tup[1], reverse=True)

            for word, freq in words_and_frequencies:
                if freq < min_freq or len(self.itos) == max_size:
                    break
                self.itos.append(word)

            # stoi is simply a reverse dict for itos
            self.stoi = {tok: i for i, tok in enumerate(self.itos)}

            self.vectors = None
            if vectors is not None:
                self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
            else:
                assert unk_init is None and vectors_cache is None

        def __eq__(self, other):
            if self.freqs != other.freqs:
                return False
            if self.stoi != other.stoi:
                return False
            if self.itos != other.itos:
                return False
            if self.vectors != other.vectors:
                return False
            return True

        def __len__(self):
            return len(self.itos)

        def vocab_rerank(self):
            self.stoi = {word: i for i, word in enumerate(self.itos)}

        def extend(self, v, sort=False):
            words = sorted(v.itos) if sort else v.itos
            for w in words:
                if w not in self.stoi:
                    self.itos.append(w)
                    self.stoi[w] = len(self.itos) - 1


    class Vocab(TorchVocab):
        def __init__(self, counter, max_size=None, min_freq=1):
            self.pad_index = 0
            self.unk_index = 1
            self.eos_index = 2
            self.sos_index = 3
            self.mask_index = 4
            super().__init__(counter, specials=["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"],
                             max_size=max_size, min_freq=min_freq)

        def to_seq(self, sentence, seq_len, with_eos=False, with_sos=False) -> list:
            pass

        def from_seq(self, seq, join=False, with_pad=False):
            pass

        @staticmethod
        def load_vocab(vocab_path: str) -> 'Vocab':
            with open(vocab_path, "rb") as f:
                return pickle.load(f)

        def save_vocab(self, vocab_path):
            with open(vocab_path, "wb") as f:
                pickle.dump(self, f)


    # Building Vocab with text files
    class WordVocab(Vocab):
        def __init__(self, texts, max_size=None, min_freq=1):
            print("Building Vocab")
            counter = Counter()
            for line in tqdm.tqdm(texts):
                if isinstance(line, list):
                    words = line
                else:
                    words = line.replace("\n", "").replace("\t", "").split()

                for word in words:
                    counter[word] += 1
            super().__init__(counter, max_size=max_size, min_freq=min_freq)

        def to_seq(self, sentence, seq_len=None, with_eos=False, with_sos=False, with_len=False):
            if isinstance(sentence, str):
                sentence = sentence.split()

            seq = [self.stoi.get(word, self.unk_index) for word in sentence]

            if with_eos:
                seq += [self.eos_index]  # eos_index is 2 in this vocab
            if with_sos:
                seq = [self.sos_index] + seq

            origin_seq_len = len(seq)

            if seq_len is None:
                pass
            elif len(seq) <= seq_len:
                seq += [self.pad_index for _ in range(seq_len - len(seq))]
            else:
                seq = seq[:seq_len]

            return (seq, origin_seq_len) if with_len else seq

        def from_seq(self, seq, join=False, with_pad=False):
            words = [self.itos[idx]
                     if idx < len(self.itos)
                     else "<%d>" % idx
                     for idx in seq
                     if not with_pad or idx != self.pad_index]

            return " ".join(words) if join else words

        @staticmethod
        def load_vocab(vocab_path: str) -> 'WordVocab':
            with open(vocab_path, "rb") as f:
                return pickle.load(f)


    def build():
        import argparse

        parser = argparse.ArgumentParser()
        parser.add_argument("-c", "--corpus_path", required=True, type=str)
        parser.add_argument("-o", "--output_path", required=True, type=str)
        parser.add_argument("-s", "--vocab_size", type=int, default=None)
        parser.add_argument("-e", "--encoding", type=str, default="utf-8")
        parser.add_argument("-m", "--min_freq", type=int, default=1)
        args = parser.parse_args()

        with open(args.corpus_path, "r", encoding=args.encoding) as f:
            vocab = WordVocab(f, max_size=args.vocab_size, min_freq=args.min_freq)

        print("VOCAB SIZE:", len(vocab))
        vocab.save_vocab(args.output_path)
          
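To make the vocabulary logic above concrete, here is a minimal self-contained sketch of the same two steps: ordering words by frequency after the five special tokens, and turning a sentence into a fixed-length id sequence. The helper names `build_itos` and `to_seq` and the toy corpus are assumptions for illustration only; the sketch does not import the classes above and ignores `max_size`.

```python
from collections import Counter

# Special tokens occupy indices 0..4, in the same order as in Vocab above.
SPECIALS = ["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"]


def build_itos(texts, min_freq=1):
    """Specials first, then words by descending frequency, ties broken alphabetically."""
    counter = Counter(word for line in texts for word in line.split())
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return SPECIALS + [word for word, freq in ordered if freq >= min_freq]


def to_seq(stoi, sentence, seq_len=None, with_eos=False, with_sos=False):
    """Map words to ids, add <eos>/<sos> if requested, then pad or truncate."""
    seq = [stoi.get(word, 1) for word in sentence.split()]  # 1 = <unk>
    if with_eos:
        seq.append(2)       # 2 = <eos>
    if with_sos:
        seq.insert(0, 3)    # 3 = <sos>
    if seq_len is not None:
        seq = (seq + [0] * seq_len)[:seq_len]  # 0 = <pad>
    return seq


itos = build_itos(["hello world", "hello bert"])
stoi = {tok: i for i, tok in enumerate(itos)}
ids = to_seq(stoi, "hello unseen world", seq_len=6, with_eos=True, with_sos=True)
print(itos[5:])  # ['hello', 'bert', 'world']
print(ids)       # [3, 5, 1, 7, 2, 0]
```

The word "unseen" is not in the toy corpus, so it maps to the `<unk>` index 1, and the trailing 0 is padding up to `seq_len`.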

The main function

 
print("Loading Vocab", args.vocab_path)
vocab = WordVocab.load_vocab(args.vocab_path)
print("Vocab Size: ", len(vocab))

print("Loading Train Dataset", args.train_dataset)
train_dataset = BERTDataset(args.train_dataset, vocab, seq_len=args.seq_len,
                            corpus_lines=args.corpus_lines, on_memory=args.on_memory)

print("Loading Test Dataset", args.test_dataset)
test_dataset = BERTDataset(args.test_dataset, vocab, seq_len=args.seq_len, on_memory=args.on_memory) \
    if args.test_dataset is not None else None

print("Creating Dataloader")
train_data_loader = DataLoader(train_dataset, batch_size=args.batch_size, num_workers=args.num_workers)
test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size, num_workers=args.num_workers) \
    if test_dataset is not None else None

print("Building BERT model")
bert = BERT(len(vocab), hidden=args.hidden, n_layers=args.layers, attn_heads=args.attn_heads)

print("Creating BERT Trainer")
trainer = BERTTrainer(bert, len(vocab), train_dataloader=train_data_loader, test_dataloader=test_data_loader,
                      lr=args.lr, betas=(args.adam_beta1, args.adam_beta2), weight_decay=args.adam_weight_decay,
                      with_cuda=args.with_cuda, cuda_devices=args.cuda_devices, log_freq=args.log_freq)

print("Training Start")
for epoch in range(args.epochs):
    trainer.train(epoch)
    trainer.save(epoch, args.output_path)

    if test_data_loader is not None:
        trainer.test(epoch)
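The snippet above reads every setting from an `args` object that is not shown. As a rough sketch, a parser along the following lines would supply all the attributes it uses. The attribute names are taken from the snippet; the flag spellings, the default values, and the example file names are assumptions for illustration, not something the article specifies.

```python
import argparse


def str2bool(value):
    """Parse 'true'/'false' style command-line values."""
    return value.lower() == "true"


def build_parser():
    # Flag spellings and defaults below are illustrative assumptions.
    p = argparse.ArgumentParser()
    p.add_argument("--train_dataset", required=True, type=str)
    p.add_argument("--test_dataset", type=str, default=None)
    p.add_argument("--vocab_path", required=True, type=str)
    p.add_argument("--output_path", required=True, type=str)
    p.add_argument("--hidden", type=int, default=256)
    p.add_argument("--layers", type=int, default=8)
    p.add_argument("--attn_heads", type=int, default=8)
    p.add_argument("--seq_len", type=int, default=20)
    p.add_argument("--batch_size", type=int, default=64)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--num_workers", type=int, default=5)
    p.add_argument("--with_cuda", type=str2bool, default=True)
    p.add_argument("--cuda_devices", type=int, nargs="+", default=None)
    p.add_argument("--log_freq", type=int, default=10)
    p.add_argument("--corpus_lines", type=int, default=None)
    p.add_argument("--on_memory", type=str2bool, default=True)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--adam_weight_decay", type=float, default=0.01)
    p.add_argument("--adam_beta1", type=float, default=0.9)
    p.add_argument("--adam_beta2", type=float, default=0.999)
    return p


# Hypothetical file names, passed as a list so the sketch runs without a shell.
args = build_parser().parse_args(
    ["--train_dataset", "corpus.txt", "--vocab_path", "vocab.pkl", "--output_path", "bert.model"]
)
print(args.seq_len, args.test_dataset)
```

Because `--test_dataset` defaults to `None`, the snippet above would skip building the test dataset and the test data loader unless a path is passed.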
          

The dataset part:

 
from torch.utils.data import Dataset
import tqdm
import torch
import random


class BERTDataset(Dataset):
    def __init__(self, corpus_path, vocab, seq_len, encoding="utf-8", corpus_lines=None, on_memory=True):
        self.vocab = vocab
        self.seq_len = seq_len

        self.on_memory = on_memory
        self.corpus_lines = corpus_lines
        self.corpus_path = corpus_path
        self.encoding = encoding

        with open(corpus_path, "r", encoding=encoding) as f:
            if self.corpus_lines is None and not on_memory:
                # count the lines; start from 0 because corpus_lines is None here
                self.corpus_lines = 0
                for _ in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines):
                    self.corpus_lines += 1

            if on_memory:
                # each line holds two tab-separated sentences
                self.lines = [line[:-1].split("\t")
                              for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]
                self.corpus_lines = len(self.lines)

        if not on_memory:
            self.file = open(corpus_path, "r", encoding=encoding)
            self.random_file = open(corpus_path, "r", encoding=encoding)

            # skip a random number of lines, always fewer than the file holds
            for _ in range(random.randrange(min(self.corpus_lines, 1000))):
                next(self.random_file)

    def __len__(self):
        return self.corpus_lines

    def __getitem__(self, item):
        t1, t2, is_next_label = self.random_sent(item)
        t1_random, t1_label = self.random_word(t1)
        t2_random, t2_label = self.random_word(t2)

        # [CLS] tag = SOS tag, [SEP] tag = EOS tag
        t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]
        t2 = t2_random + [self.vocab.eos_index]

        t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
        t2_label = t2_label + [self.vocab.pad_index]

        # segment 1 for the first sentence, segment 2 for the second
        segment_label = ([1 for _ in range(len(t1))] + [2 for _ in range(len(t2))])[:self.seq_len]
        bert_input = (t1 + t2)[:self.seq_len]
        bert_label = (t1_label + t2_label)[:self.seq_len]

        padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]
        bert_input.extend(padding)
        bert_label.extend(padding)
        segment_label.extend(padding)

        output = {"bert_input": bert_input,
                  "bert_label": bert_label,
                  "segment_label": segment_label,
                  "is_next": is_next_label}

        return {key: torch.tensor(value) for key, value in output.items()}

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                prob /= 0.15

                # 80% randomly change token to mask token
                if prob < 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10% randomly change token to random token
                elif prob < 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10% randomly change token to current token
                else:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label

    def random_sent(self, index):
        t1, t2 = self.get_corpus_line(index)

        # output_text, label(isNotNext:0, isNext:1)
        if random.random() > 0.5:
            return t1, t2, 1
        else:
            return t1, self.get_random_line(), 0

    def get_corpus_line(self, item):
        if self.on_memory:
            return self.lines[item][0], self.lines[item][1]
        else:
            # a file iterator raises StopIteration at EOF, so ask for a None default
            line = next(self.file, None)
            if line is None:
                self.file.close()
                self.file = open(self.corpus_path, "r", encoding=self.encoding)
                line = next(self.file)

            t1, t2 = line[:-1].split("\t")
            return t1, t2

    def get_random_line(self):
        if self.on_memory:
            return self.lines[random.randrange(len(self.lines))][1]

        # read from the separate random reader so the sequential reader is not advanced
        line = next(self.random_file, None)
        if line is None:
            self.random_file.close()
            self.random_file = open(self.corpus_path, "r", encoding=self.encoding)
            for _ in range(random.randrange(min(self.corpus_lines, 1000))):
                next(self.random_file)
            line = next(self.random_file)
        return line[:-1].split("\t")[1]
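The masking rule in `random_word` above can be checked in isolation. The following is a minimal sketch under stated assumptions: `mask_tokens`, the toy `stoi` mapping, and the explicit `rng` argument are hypothetical names introduced here so the example runs without the dataset class. It selects about 15% of the tokens for prediction, and of those replaces roughly 80% with the mask index, 10% with a random id, and leaves 10% unchanged.

```python
import random

MASK, UNK = 4, 1  # special-token indices, matching Vocab above


def mask_tokens(tokens, stoi, vocab_size, rng):
    """Return (input_ids, labels); labels are 0 wherever no prediction is required."""
    input_ids, labels = [], []
    for token in tokens:
        token_id = stoi.get(token, UNK)
        roll = rng.random()
        if roll < 0.15:              # token is selected for prediction
            roll /= 0.15             # rescale to [0, 1) to split 80/10/10
            if roll < 0.8:
                input_ids.append(MASK)
            elif roll < 0.9:
                input_ids.append(rng.randrange(vocab_size))
            else:
                input_ids.append(token_id)
            labels.append(token_id)  # the label is always the original id
        else:
            input_ids.append(token_id)
            labels.append(0)
    return input_ids, labels


rng = random.Random(0)
stoi = {"the": 5, "cat": 6, "sat": 7}   # toy vocabulary for illustration
tokens = ["the", "cat", "sat"] * 2000
input_ids, labels = mask_tokens(tokens, stoi, vocab_size=8, rng=rng)

selected = [i for i, label in enumerate(labels) if label != 0]
masked = [i for i in selected if input_ids[i] == MASK]
print(len(selected) / len(tokens))  # close to 0.15
print(len(masked) / len(selected))  # close to 0.8
```

Unselected positions get label 0, which is also `pad_index`, presumably so that a loss function can ignore both padding and unselected tokens with a single ignore index.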
          

Official version

https://github.com/google-research/bert

For details, see https://daiwk.github.io/posts/nlp-bert-code.html

Original article; please credit the source when reposting.
Link to this article: http://daiwk.github.io/posts/nlp-bert.html
