1. The difference between encode and encode_plus
Differences:
1. encode only returns input_ids.
2. encode_plus returns all of the encoding information, specifically:
   - 'input_ids': the indices of the tokens in the vocabulary
   - 'token_type_ids': distinguishes the two sentences of a pair (all 0 for the first sentence, all 1 for the second)
   - 'attention_mask': indicates which tokens self-attention should attend to
Code demo:
```python
import torch
from transformers import BertTokenizer

model_name = 'bert-base-uncased'

# a. Load the tokenizer from the pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)

sentence = "Hello, my son is laughing."
print(tokenizer.encode(sentence))
print(tokenizer.encode_plus(sentence))
```
Output:

```
[101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102]
{'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Below is the corresponding source code of encode_plus from the transformers library:
```python
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def encode_plus(
    self,
    text: Union[TextInput, PreTokenizedInput, EncodedInput],
    text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = False,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    **kwargs
) -> BatchEncoding:
    """
    Tokenize and prepare for the model a sequence or a pair of sequences.

    .. warning::
        This method is deprecated, ``__call__`` should be used instead.

    Args:
        text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]` (the latter only for not-fast tokenizers)):
            The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
            ``tokenize`` method) or a list of integers (tokenized string ids using the ``convert_tokens_to_ids``
            method).
        text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`):
            Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
            the ``tokenize`` method) or a list of integers (tokenized string ids using the ``convert_tokens_to_ids``
            method).
    """
    # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
        padding=padding,
        truncation=truncation,
        max_length=max_length,
        pad_to_multiple_of=pad_to_multiple_of,
        verbose=verbose,
        **kwargs,
    )

    return self._encode_plus(
        text=text,
        text_pair=text_pair,
        add_special_tokens=add_special_tokens,
        padding_strategy=padding_strategy,
        truncation_strategy=truncation_strategy,
        max_length=max_length,
        stride=stride,
        is_split_into_words=is_split_into_words,
        pad_to_multiple_of=pad_to_multiple_of,
        return_tensors=return_tensors,
        return_token_type_ids=return_token_type_ids,
        return_attention_mask=return_attention_mask,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_offsets_mapping=return_offsets_mapping,
        return_length=return_length,
        verbose=verbose,
        **kwargs,
    )
```
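As the docstring above notes, encode_plus is deprecated in favor of calling the tokenizer object directly (its `__call__` method). A minimal sketch, reusing the tokenizer and sentence from the demo above:

```python
# tokenizer(...) returns the same BatchEncoding as encode_plus(...)
print(tokenizer(sentence))
# {'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```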
The default value of add_special_tokens is True.
text_pair: an optional second sequence to be encoded.
```python
import torch
from transformers import BertTokenizer

model_name = 'bert-base-uncased'

# a. Load the tokenizer from the pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)

sentence = "Hello, my son is laughing."
sentence2 = "Hello, my son is cuting."
print(tokenizer.encode_plus(sentence, sentence2))
```
Output:

```
{'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Here 101 is [CLS] and 102 is [SEP]. Since add_special_tokens defaults to True, a [SEP] token is inserted where the two sentences are joined, and 'token_type_ids' distinguishes them (all 0 for the first sentence, all 1 for the second).
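To make the mapping concrete, the ids can be converted back to tokens with convert_ids_to_tokens (a quick check on the output above, reusing the same tokenizer):

```python
enc = tokenizer.encode_plus(sentence, sentence2)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# ['[CLS]', 'hello', ',', 'my', 'son', 'is', 'laughing', '.', '[SEP]',
#  'hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.', '[SEP]']
```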
```python
print(tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length"))
```

padding="max_length" pads the sequence with zeros; when max_length is not given, it pads up to the model's maximum length, which is max_length=512 for BERT.
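A quick length check on the call above confirms this (a minimal sketch, reusing the tokenizer and sentences defined earlier):

```python
enc = tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length")
print(len(enc['input_ids']))       # 512
print(len(enc['attention_mask']))  # 512; padding positions are marked with 0
```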
```python
print(tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length", max_length=12, stride=2, return_token_type_ids=True))
```

Output (with truncation="only_second" and max_length=12, only the second sentence is truncated so that the pair fits into 12 tokens):

```
{'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102, 7592, 1010, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
```python
print(tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length", max_length=12, stride=2, return_token_type_ids=True, return_overflowing_tokens=True))
```

Output:

```
{'overflowing_tokens': [7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012], 'num_truncated_tokens': 6, 'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102, 7592, 1010, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
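With return_overflowing_tokens=True the truncated part is returned as well: num_truncated_tokens is 6 (six tokens of the second sentence were cut off), and overflowing_tokens holds those six tokens plus a 2-token overlap with the kept part, controlled by stride=2. Decoding them back to tokens (a small check, reusing the same tokenizer):

```python
# 8 overflowing ids = 6 truncated tokens + stride=2 tokens of overlap
print(tokenizer.convert_ids_to_tokens([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012]))
# ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
```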
2. The difference between encode and tokenize
- Comparing them in code is the most intuitive (the sentence below is made up on purpose, to highlight the WordPiece behavior):
sentence = "Hello, my son is cuting." input_ids_method1 = torch.tensor( tokenizer.encode(sentence, add_special_tokens=True)) # Batch size 1
# tensor([ 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102])
input_token2 = tokenizer.tokenize(sentence) # ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
input_ids_method2 = tokenizer.convert_tokens_to_ids(input_token2) # tensor([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012]) # 並沒有開頭和結尾的標記:[cls]、[sep]
(If add_special_tokens is set to False in tokenizer.encode, the [CLS] and [SEP] markers are likewise not added.)
As the example shows, encode produces model-ready input in a single step.
By contrast, tokenize only splits the text into (WordPiece) tokens, and the result still has to be passed through convert_tokens_to_ids manually, which is more cumbersome.
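To summarize the relationship, here is a minimal sketch (reusing the BERT tokenizer and sentence from above; 101 and 102 are the ids of [CLS] and [SEP], as noted earlier):

```python
# encode is roughly tokenize + convert_tokens_to_ids (+ special tokens)
tokens = tokenizer.tokenize(sentence)           # WordPiece tokens
ids = tokenizer.convert_tokens_to_ids(tokens)   # vocabulary ids

assert ids == tokenizer.encode(sentence, add_special_tokens=False)
assert [101] + ids + [102] == tokenizer.encode(sentence)  # [CLS] ... [SEP]
```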
Reading the source code shows that encode calls tokenize internally, so by setting the parameters of encode we can turn raw text into a trainable format in one step. The relevant parameters and concrete usage of encode are introduced below.