1. The difference between encode and encode_plus
Differences:
1. encode only returns input_ids.
2. encode_plus returns all of the encoding information, specifically:
   - 'input_ids': the indices of the tokens in the vocabulary
   - 'token_type_ids': distinguishes the two sentences of a pair (all 0 for the first sentence, all 1 for the second)
   - 'attention_mask': indicates which tokens self-attention should attend to
Code demo:
```python
import torch
from transformers import BertTokenizer

model_name = 'bert-base-uncased'

# a. Load the tokenizer from the pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)

sentence = "Hello, my son is laughing."
print(tokenizer.encode(sentence))
print(tokenizer.encode_plus(sentence))
```
Output:

```
[101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102]
{'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Below is the corresponding source code of encode_plus from the transformers library:
```python
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def encode_plus(
    self,
    text: Union[TextInput, PreTokenizedInput, EncodedInput],
    text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = False,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    **kwargs
) -> BatchEncoding:
    """
    Tokenize and prepare for the model a sequence or a pair of sequences.

    .. warning::
        This method is deprecated, ``__call__`` should be used instead.

    Args:
        text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]` (the latter only for not-fast tokenizers)):
            The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
            ``tokenize`` method) or a list of integers (tokenized string ids using the ``convert_tokens_to_ids``
            method).
        text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`):
            Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
            the ``tokenize`` method) or a list of integers (tokenized string ids using the ``convert_tokens_to_ids``
            method).
    """
    # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
        padding=padding,
        truncation=truncation,
        max_length=max_length,
        pad_to_multiple_of=pad_to_multiple_of,
        verbose=verbose,
        **kwargs,
    )

    return self._encode_plus(
        text=text,
        text_pair=text_pair,
        add_special_tokens=add_special_tokens,
        padding_strategy=padding_strategy,
        truncation_strategy=truncation_strategy,
        max_length=max_length,
        stride=stride,
        is_split_into_words=is_split_into_words,
        pad_to_multiple_of=pad_to_multiple_of,
        return_tensors=return_tensors,
        return_token_type_ids=return_token_type_ids,
        return_attention_mask=return_attention_mask,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_offsets_mapping=return_offsets_mapping,
        return_length=return_length,
        verbose=verbose,
        **kwargs,
    )
```
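As the docstring above notes, encode_plus is deprecated in favor of calling the tokenizer object directly (its `__call__` method). A minimal sketch, reusing the tokenizer and sentence from the demo above:

```python
# tokenizer(...) returns the same BatchEncoding as encode_plus(...)
print(tokenizer(sentence))
# {'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```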
The default value of add_special_tokens is True.
text_pair: an optional second sequence to be encoded.
```python
import torch
from transformers import BertTokenizer

model_name = 'bert-base-uncased'

# a. Load the tokenizer from the pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)

sentence = "Hello, my son is laughing."
sentence2 = "Hello, my son is cuting."
print(tokenizer.encode_plus(sentence, sentence2))
```
Output:

```
{'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Here 101 is [CLS] and 102 is [SEP]. Since add_special_tokens defaults to True, a [SEP] token is inserted where the two sentences are joined, and 'token_type_ids' distinguishes them (all 0 for the first sentence, all 1 for the second).
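To make the mapping concrete, the ids can be converted back to tokens with convert_ids_to_tokens (a quick check on the output above, reusing the same tokenizer):

```python
enc = tokenizer.encode_plus(sentence, sentence2)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# ['[CLS]', 'hello', ',', 'my', 'son', 'is', 'laughing', '.', '[SEP]',
#  'hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.', '[SEP]']
```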
```python
print(tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length"))
```

padding="max_length" pads the sequence with zeros; when max_length is not given, it pads up to the model's maximum length, which is max_length=512 for BERT.
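A quick length check on the call above confirms this (a minimal sketch, reusing the tokenizer and sentences defined earlier):

```python
enc = tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length")
print(len(enc['input_ids']))       # 512
print(len(enc['attention_mask']))  # 512; padding positions are marked with 0
```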
```python
print(tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length", max_length=12, stride=2, return_token_type_ids=True))
```

Output (with truncation="only_second" and max_length=12, only the second sentence is truncated so that the pair fits into 12 tokens):

```
{'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102, 7592, 1010, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
```python
print(tokenizer.encode_plus(sentence, sentence2, truncation="only_second", padding="max_length", max_length=12, stride=2, return_token_type_ids=True, return_overflowing_tokens=True))
```

Output:

```
{'overflowing_tokens': [7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012], 'num_truncated_tokens': 6, 'input_ids': [101, 7592, 1010, 2026, 2365, 2003, 5870, 1012, 102, 7592, 1010, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
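With return_overflowing_tokens=True the truncated part is returned as well: num_truncated_tokens is 6 (six tokens of the second sentence were cut off), and overflowing_tokens holds those six tokens plus a 2-token overlap with the kept part, controlled by stride=2. Decoding them back to tokens (a small check, reusing the same tokenizer):

```python
# 8 overflowing ids = 6 truncated tokens + stride=2 tokens of overlap
print(tokenizer.convert_ids_to_tokens([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012]))
# ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
```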
2. The difference between encode and tokenize
- Comparing them in code is the most intuitive (the sentence below is made up on purpose, to highlight the WordPiece behavior):
sentence = "Hello, my son is cuting." input_ids_method1 = torch.tensor( tokenizer.encode(sentence, add_special_tokens=True)) # Batch size 1
# tensor([ 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102])
input_token2 = tokenizer.tokenize(sentence) # ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
input_ids_method2 = tokenizer.convert_tokens_to_ids(input_token2) # tensor([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012]) # 並沒有開頭和結尾的標記:[cls]、[sep]
(If add_special_tokens is set to False in tokenizer.encode, the [CLS] and [SEP] markers are likewise not added.)
As the example shows, encode produces model-ready input in a single step.
By contrast, tokenize only splits the text into (WordPiece) tokens, and the result still has to be passed through convert_tokens_to_ids manually, which is more cumbersome.
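To summarize the relationship, here is a minimal sketch (reusing the BERT tokenizer and sentence from above; 101 and 102 are the ids of [CLS] and [SEP], as noted earlier):

```python
# encode is roughly tokenize + convert_tokens_to_ids (+ special tokens)
tokens = tokenizer.tokenize(sentence)           # WordPiece tokens
ids = tokenizer.convert_tokens_to_ids(tokens)   # vocabulary ids

assert ids == tokenizer.encode(sentence, add_special_tokens=False)
assert [101] + ids + [102] == tokenizer.encode(sentence)  # [CLS] ... [SEP]
```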
Reading the source code shows that encode calls tokenize internally, so by setting the parameters of encode we can turn raw text into a trainable format in one step. The relevant parameters and concrete usage of encode are introduced below.