When using a BERT model for an NLP task, we usually need to attach a task-specific model downstream of BERT. It is therefore important to know exactly what BERT outputs, so that we can flexibly design the layers that follow it. This post looks at the concrete outputs of BERT in the transformers library, a PyTorch implementation of BERT.
When fine-tuning BERT with transformers, you will often write code like the following:
outputs = self.bert(input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids,
                    position_ids=position_ids,
                    head_mask=head_mask)
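A call like this normally sits inside a custom nn.Module that wraps BertModel and adds a task-specific head on top of its outputs. As a minimal sketch (the class name BertClassifier and the num_labels argument are illustrative, not part of the original snippet), such a module might look like this:

import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """Hypothetical downstream model: BertModel plus a linear classification head."""
    def __init__(self, num_labels, pretrained_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask)
        # Sentence-level classification typically uses the pooled [CLS] representation
        pooled_output = self.dropout(outputs.pooler_output)
        return self.classifier(pooled_output)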
Looking at the official documentation of BertModel (a subclass of BertPreTrainedModel), the return value outputs is described as follows:
Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:
last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the model.
pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during Bert pretraining. This output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states: (optional, returned when config.output_hidden_states=True) list of torch.FloatTensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size)
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions: (optional, returned when config.output_attentions=True) list of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
>>> from transformers import BertTokenizer, BertModel
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertModel.from_pretrained('bert-base-uncased')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
With the latest transformers API, the individual BERT outputs are accessed like this:
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output
hidden_states = outputs.hidden_states
attentions = outputs.attentions
As you can see, the BERT output consists of four parts:
last_hidden_state: shape (batch_size, sequence_length, hidden_size), where hidden_size is 768 for bert-base. These are the hidden states at the output of the model's last layer. (Typically used for token-level tasks such as named entity recognition.)
pooler_output: shape (batch_size, hidden_size). This is the last-layer hidden state of the first token of the sequence (the classification token), further processed by a linear layer and a Tanh activation. (Typically used for sentence classification; whether to use this representation or instead average/pool the whole sequence of hidden states depends on the task.)
hidden_states: optional, returned only when config.output_hidden_states=True. It is a tuple whose first element is the embedding output and whose remaining elements are the outputs of each layer; every element has shape (batch_size, sequence_length, hidden_size).
attentions: also optional, returned only when config.output_attentions=True. It is likewise a tuple containing the attention weights of each layer, which are used to compute the weighted average in the self-attention heads. (A short sketch after this list shows how to request the optional outputs and inspect all four shapes.)
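To make the four parts concrete, here is a minimal sketch (shapes shown for bert-base-uncased, where hidden_size is 768 and there are 12 layers with 12 heads) that requests the two optional outputs at load time and prints the shape of each part; it also shows the masked mean pooling over last_hidden_state mentioned above as an alternative sentence representation:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# The optional outputs can be requested via config flags passed to from_pretrained
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
model.eval()

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 8, 768])
print(outputs.pooler_output.shape)      # torch.Size([1, 768])
print(len(outputs.hidden_states))       # 13: embeddings + 12 layers
print(outputs.hidden_states[0].shape)   # torch.Size([1, 8, 768])
print(len(outputs.attentions))          # 12: one per layer
print(outputs.attentions[0].shape)      # torch.Size([1, 12, 8, 8])

# Alternative sentence representation: masked mean pooling over last_hidden_state
mask = inputs['attention_mask'].unsqueeze(-1).float()
mean_pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_pooled.shape)                # torch.Size([1, 768])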
One more point worth noting: pooler_output is obtained by taking the first token (the classification token) of the last layer's hidden states and passing it through a linear layer followed by a Tanh activation. We can see this directly in the official source code:
class BertPooler(nn.Module):
def __init__(self, config):
super(BertPooler, self).__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
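As a quick sanity check, the pooled output can be reproduced by applying the model's own pooler to the [CLS] hidden state of last_hidden_state. This is a minimal sketch; it assumes the pooler is accessible as model.pooler, which holds for BertModel in recent transformers versions:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    # Re-apply the model's own pooler to the hidden state of the first token ([CLS])
    first_token_tensor = outputs.last_hidden_state[:, 0]
    recomputed = model.pooler.activation(model.pooler.dense(first_token_tensor))

print(torch.allclose(recomputed, outputs.pooler_output, atol=1e-6))  # True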