A Brief Description of the Return Values of the BERT Pretrained Model in transformers

Reposted article. 2020-01-08 18:04. Tags: NLP / deeplearning / transformer / bert

When fine-tuning BERT with transformers, you will often write code like the following:

outputs = self.bert(input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids,
                    position_ids=position_ids,
                    head_mask=head_mask)

In BertModel(BertPreTrainedModel), the return value outputs is documented as follows:

r"""
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
"""

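Note that pooler_output is not the raw last-layer hidden state of the [CLS] token: it is the last-layer hidden state of the first token ([CLS]) after it has been passed through an additional Linear layer and a Tanh activation. The end of the forward function's source code shows how the return value is assembled: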

        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output)

        outputs = (sequence_output, pooled_output,) + encoder_outputs[
            1:
        ]  # add hidden_states and attentions if they are here
        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)
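
The ordering of the returned elements can be checked with a small experiment. The following is a minimal sketch, assuming a tiny, randomly initialised BertConfig (all sizes are made-up values chosen so that nothing needs to be downloaded). Depending on the library version, outputs may be a plain tuple or a tuple-like output object, so integer indexing is used here on the assumption that it behaves the same in both cases.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialised config (assumed toy values, not a real checkpoint)
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    output_hidden_states=True,  # ask for the optional hidden_states element
)
model = BertModel(config)
model.eval()  # disable dropout so the outputs are deterministic

input_ids = torch.randint(0, config.vocab_size, (2, 5))  # (batch_size, sequence_length)
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)

sequence_output = outputs[0]  # (batch_size, sequence_length, hidden_size)
pooled_output = outputs[1]    # (batch_size, hidden_size)
hidden_states = outputs[2]    # embeddings output + one tensor per layer

print(sequence_output.shape, pooled_output.shape, len(hidden_states))
```

With two layers, hidden_states should hold three tensors (the embedding output plus one per layer), and the last of them is the same tensor as sequence_output.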

As the forward source shows, sequence_output is passed through a pooler layer, which is defined as follows:

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
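
To make the behaviour of the pooler concrete, here is a small sketch that repeats the same three steps (take the first token, apply Linear, apply Tanh) on random tensors. The hidden size and the helper name pool are assumptions made for illustration. It shows that the pooled result depends only on position 0 of the sequence and that Tanh keeps every value within [-1, 1].

```python
import torch
import torch.nn as nn

hidden_size = 8  # assumed toy size
dense = nn.Linear(hidden_size, hidden_size)
activation = nn.Tanh()

def pool(hidden_states):
    # Same steps as BertPooler.forward: first token -> Linear -> Tanh
    return activation(dense(hidden_states[:, 0]))

hidden_states = torch.randn(2, 5, hidden_size)  # (batch_size, sequence_length, hidden_size)
pooled = pool(hidden_states)

# Change every position except the first; the pooled result stays the same
perturbed = hidden_states.clone()
perturbed[:, 1:] = torch.randn(2, 4, hidden_size)
pooled_perturbed = pool(perturbed)

print(pooled.shape)                              # torch.Size([2, 8])
print(torch.allclose(pooled, pooled_perturbed))  # True
print(bool(pooled.abs().max() <= 1.0))           # True, since Tanh bounds the output
```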

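So BertModel does not simply return the encoder outputs as they are: the second element has additionally been through the pooler. Generally speaking, for sentence-level tasks you can use pooled_output as a baseline; for further tuning you can work from last_hidden_state instead, for example by averaging or pooling it over the sequence, as the docstring suggests.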
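
A baseline of this kind could look like the sketch below: a dropout layer and a single linear layer applied to pooled_output. The class name SentenceClassifierHead, the sizes, and the random tensor standing in for pooled_output are all assumptions for illustration, not code taken from the library.

```python
import torch
import torch.nn as nn

class SentenceClassifierHead(nn.Module):
    """Hypothetical baseline head: dropout + linear layer over a sentence vector."""

    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sentence_vector):
        # sentence_vector: (batch_size, hidden_size), e.g. pooled_output
        return self.classifier(self.dropout(sentence_vector))

head = SentenceClassifierHead(hidden_size=8, num_labels=3)

pooled_output = torch.randn(2, 8)  # stand-in for the model's pooled_output
labels = torch.tensor([0, 2])      # made-up class labels

logits = head(pooled_output)       # (batch_size, num_labels)
loss = nn.functional.cross_entropy(logits, labels)
print(logits.shape, loss.item())
```

The same head accepts any other sentence vector of shape (batch_size, hidden_size), so swapping pooled_output for a vector derived from last_hidden_state requires no change to the head itself.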

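last_hidden_state is laid out as follows: along the sequence dimension, position 0 is [CLS], which corresponds to the sentence vector, and the remaining positions correspond to the token (word) vectors.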
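
Given this layout, the two kinds of vectors can be separated by slicing along the sequence dimension. The sketch below uses random tensors in place of real model outputs, and masked_mean_pool is a hypothetical helper name. It also shows the averaging alternative mentioned in the docstring, with padded positions excluded through the attention mask.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average the token vectors over the sequence, ignoring padded positions."""
    # (batch_size, seq_len) -> (batch_size, seq_len, 1) so it broadcasts over hidden_size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Dummy tensors standing in for real model outputs
last_hidden_state = torch.randn(2, 4, 8)  # (batch_size, sequence_length, hidden_size)
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])

cls_vectors = last_hidden_state[:, 0]     # position 0: [CLS], shape (2, 8)
token_vectors = last_hidden_state[:, 1:]  # remaining positions, shape (2, 3, 8)
mean_vectors = masked_mean_pool(last_hidden_state, attention_mask)

print(cls_vectors.shape, token_vectors.shape, mean_vectors.shape)
```

Note that the remaining positions also include [SEP] and any padding tokens, which is why the mean is taken with the attention mask rather than over the raw slice.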
