Outputs of the BERT model in transformers

Reposted from the original article, 2021-06-01 22:01. Tags: NLP / pytorch

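When we use a BERT model for an NLP task, we usually attach a task-specific model downstream of BERT. To customize those downstream layers flexibly, we need to know exactly what the BERT model outputs. This article looks at the concrete outputs of BERT as implemented in the transformers library, a PyTorch implementation of BERT.
When fine-tuning BERT with transformers, we often write code like the following: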

outputs = self.bert(input_ids,
		   attention_mask=attention_mask,
		   token_type_ids=token_type_ids,
		   position_ids=position_ids,
		   head_mask=head_mask)
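
To make the context of that call concrete, here is a minimal sketch of a downstream model that wraps BERT and adds a linear classification layer. The class name BertClassifier, the tiny BertConfig values, and num_labels=3 are assumptions made for illustration; the model is randomly initialised so that nothing is downloaded, and the attribute access on outputs assumes a recent transformers version that returns a ModelOutput by default.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel


class BertClassifier(nn.Module):
    """Hypothetical downstream model: BERT followed by a linear classifier."""

    def __init__(self, config, num_labels):
        super().__init__()
        self.bert = BertModel(config)  # randomly initialised, no download
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled = outputs.pooler_output                 # (batch_size, hidden_size)
        return self.classifier(self.dropout(pooled))   # (batch_size, num_labels)


# Tiny configuration chosen only to keep the example fast (an assumption,
# not the bert-base-uncased configuration).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertClassifier(config, num_labels=3).eval()

input_ids = torch.randint(0, config.vocab_size, (4, 10))  # batch of 4, length 10
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    logits = model(input_ids, attention_mask=attention_mask)
print(logits.shape)  # torch.Size([4, 3])
```

In real fine-tuning the BERT weights would normally come from BertModel.from_pretrained rather than a random configuration.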

The official documentation for BertModel(BertPreTrainedModel) explains the return value outputs as follows:

Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:

last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the model.

pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during Bert pretraining. This output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

hidden_states: (optional, returned when config.output_hidden_states=True), list of torch.FloatTensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions: (optional, returned when config.output_attentions=True), list of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length):
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

>>> from transformers import BertTokenizer, BertModel
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertModel.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

With the latest transformers interface, each BERT output is read as an attribute of outputs:

last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output
hidden_states = outputs.hidden_states
attentions = outputs.attentions
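
The attribute access above can be checked with a small runnable sketch. It assumes a tiny, randomly initialised BertConfig (the sizes are illustrative, not those of bert-base-uncased) so that no weights are downloaded, and it requests hidden_states through the output_hidden_states forward argument. It also shows that an output that was not requested is None, and that the returned object still supports the older integer indexing.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random model; the sizes are assumptions chosen to keep the example fast.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (4, 10))  # batch of 4, length 10
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

print(outputs.last_hidden_state.shape)  # torch.Size([4, 10, 32])
print(outputs.pooler_output.shape)      # torch.Size([4, 32])
print(len(outputs.hidden_states))       # 3: embeddings + 2 layers
print(outputs.attentions)               # None, because it was not requested

# Integer indexing still works: index 0 is last_hidden_state, index 1 is pooler_output.
first_by_index = outputs[0]
second_by_index = outputs[1]
```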

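As we can see, BERT's output consists of four parts:
last_hidden_state: shape (batch_size, sequence_length, hidden_size), where hidden_size=768 for bert-base. It is the sequence of hidden states output by the last layer of the model. (Commonly used for named entity recognition.)
pooler_output: shape (batch_size, hidden_size). This is the last-layer hidden state of the first token of the sequence (the classification token), further processed by a linear layer and a Tanh activation function. (Commonly used for sentence classification; whether to use this representation or an average or pooling of the hidden-state sequence over the whole input depends on the task.)
hidden_states: an optional output, returned only when config.output_hidden_states=True. It is a tuple whose first element is the embedding output and whose remaining elements are the outputs of each layer; every element has shape (batch_size, sequence_length, hidden_size).
attentions: also an optional output, returned only when config.output_attentions=True. It is a tuple whose elements are the attention weights of each layer, used to compute the weighted average in the self-attention heads.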
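
The alternative to pooler_output mentioned above, averaging the hidden-state sequence, can be sketched as follows. The helper name masked_mean_pool is hypothetical, and the hand-written tensors stand in for a real last_hidden_state and attention_mask; padding positions (mask value 0) are excluded from the average.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, ignoring padded positions.

    last_hidden_state: (batch_size, sequence_length, hidden_size)
    attention_mask:    (batch_size, sequence_length), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, L, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (B, 1), avoids division by zero
    return summed / counts


# One sequence of three tokens with hidden_size=2; the third token is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # tensor([[2., 3.]])
```

With a real model, the same function would be called as masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"]).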

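One more point to note: pooler_output is obtained by taking the last-layer hidden state of the first token of the sequence (the classification token) and processing it further with a linear layer and a Tanh activation function. We can see this in the official source code: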

from torch import nn


class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
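
The relationship described by BertPooler can be checked numerically. The sketch below assumes a tiny, randomly initialised model (illustrative sizes, no download) and recomputes the pooled output by hand from last_hidden_state, using the dense layer stored on model.pooler.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random model; the sizes are assumptions chosen to keep the example fast.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (2, 8))  # batch of 2, length 8
with torch.no_grad():
    outputs = model(input_ids)
    # Hidden state of the first ([CLS]) position in the last layer.
    cls_state = outputs.last_hidden_state[:, 0]           # (2, 32)
    # Same steps as BertPooler.forward: linear layer, then tanh.
    manual_pooled = torch.tanh(model.pooler.dense(cls_state))

matches = torch.allclose(manual_pooled, outputs.pooler_output, atol=1e-6)
print(matches)
```

Because of the Tanh, every entry of pooler_output lies between -1 and 1, which is one reason it is not interchangeable with the raw first-token hidden state.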
