Using PreTrainedTokenizer in transformers


When we use transformers to work with pretrained models and fine-tune them, we first need to preprocess our data; only then can the processed data be "fed" into a BERT-style model. The main tool in this step is the tokenizer. You can build a tokenizer from the tokenizer class associated with the pretrained model you want to use; for example, for RoBERTa we can use the matching RobertaTokenizer. Alternatively, use the AutoTokenizer class, which automatically figures out which model a tokenizer corresponds to. The tokenizer splits a given text into a sequence of tokens and then maps those tokens to their indices in the tokenizer's vocabulary. Along the way, it also adds the extra symbols required by the pretrained model's input format, such as '[CLS]' and '[SEP]'. After this preprocessing, the data can be fed straight into the pretrained model.
Next, let's look at how to use it concretely.
First, obtain a PreTrainedTokenizer instance:

from transformers import BertTokenizer

# local path to a bert-base-chinese checkpoint; passing the model name
# "bert-base-chinese" to from_pretrained would also work
TOKENIZER_PATH = "../input/huggingface-bert/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(TOKENIZER_PATH)

The tokenizer has a tokenize method that splits an input sentence into tokens.
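A minimal sketch (the sample sentence here is our own):

tokens = tokenizer.tokenize("我爱北京天安门")
print(tokens)
# bert-base-chinese splits Chinese text character by character,
# so this prints something like ['我', '爱', '北', '京', '天', '安', '门']
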
The convert_tokens_to_ids method maps the tokens produced by tokenization to their corresponding indices in the vocabulary.
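Continuing the sketch with the tokens from above:

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# one vocabulary index per token
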
To decode an index sequence back into text, the decode method does the job.
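For instance, decoding the indices we just produced:

print(tokenizer.decode(ids))
# reproduces the original text with spaces between tokens,
# e.g. '我 爱 北 京 天 安 门'
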
The tokenizer methods encode, encode_plus, and batch_encode_plus wrap both of the steps above and are more convenient to use. However, these methods are slated for deprecation in future versions of transformers; all of their functionality has been folded into the __call__ method. So below we focus on __call__. Its parameters are essentially the same as those of the deprecated methods, so we won't go over them separately.

Consulting the official documentation for the __call__ method, the parameters we most often need to set are the following:

  • text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
  • padding (bool, str or PaddingStrategy, optional, defaults to False)
    Activates and controls padding. Accepts the following values:
    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • truncation (bool, str or TruncationStrategy, optional, defaults to False)
    Activates and controls truncation. Accepts the following values:
    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • max_length (int, optional)
    Controls the maximum length to use by one of the truncation/padding parameters.
    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
  • is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
  • return_tensors (str or TensorType, optional)
    If set, will return tensors instead of list of python integers. Acceptable values are:
    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • return_token_type_ids (bool, optional)
    Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.
  • return_attention_mask (bool, optional)
    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

Below, we work through examples to learn how to use __call__ and how to understand its parameters.
First, a quick look at its basic usage.
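A minimal sketch, again with a sample sentence of our own:

encoded = tokenizer("我爱北京天安门")
print(encoded)
# a dict with 'input_ids', 'token_type_ids' and 'attention_mask';
# 'input_ids' starts with 101 and ends with 102
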
Notice that the resulting index list has an extra 101 at the beginning and an extra 102 at the end: these are the added special tokens "[CLS]" and "[SEP]". We can verify this.
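One way to check, using the tokenizer's convert_ids_to_tokens method:

print(tokenizer.convert_ids_to_tokens([101, 102]))
# ['[CLS]', '[SEP]']
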
Not every model needs special tokens added; gpt2-medium, unlike bert-base-cased, is one example. If you want to disable this behavior (strongly recommended when you have already added the special tokens yourself), set add_special_tokens=False.
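A sketch, reusing our sample sentence:

encoded = tokenizer("我爱北京天安门", add_special_tokens=False)
print(encoded["input_ids"])
# no 101 at the start or 102 at the end this time
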
If you have several sentences to process, you can pass them in as a list.
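A sketch with two sample sentences of our own:

sentences = ["我爱北京天安门", "天安门上太阳升"]
batch = tokenizer(sentences)
print(batch["input_ids"])
# one index list per sentence; the lists may have different lengths
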
If we pass in several sentences at once as a batch, and the processed result is to be fed into a model, we can set the padding parameter, the truncation parameter, and return_tensors="pt" to get PyTorch tensors back.
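For example (max_length=16 here is an arbitrary choice for illustration):

batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
# a single 2-D tensor of shape (batch_size, padded_length)
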
Sometimes you need to feed sentence pairs into your model, for example when classifying whether two sentences are similar, or in a question-answering model where one sentence serves as the context and the other as the question. In that case, for a BERT model, the input has to be formatted like this:
[CLS] Sequence A [SEP] Sequence B [SEP]
Pass the first sentence as the text argument and the second as the text_pair argument.
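A sketch with a made-up pair of sentences:

pair = tokenizer("今天天气很好", "我们去爬山吧")
print(tokenizer.decode(pair["input_ids"]))
# something like '[CLS] 今 天 天 气 很 好 [SEP] 我 们 去 爬 山 吧 [SEP]'
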
If you have several sentence pairs to process, pass them to the tokenizer as two lists: one list of first sentences and one list of second sentences.
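For example, with two made-up pairs:

first_sentences = ["今天天气很好", "你吃饭了吗"]
second_sentences = ["我们去爬山吧", "我刚吃过了"]
batch_pairs = tokenizer(first_sentences, second_sentences)
for ids in batch_pairs["input_ids"]:
    print(tokenizer.decode(ids))
# each line has the form '[CLS] first sentence [SEP] second sentence [SEP]'
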
In the same way, for sentence-pair input you can set the max_length, padding, and truncation parameters.
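For instance (the parameter values here are arbitrary):

batch_pairs = tokenizer(
    first_sentences,
    second_sentences,
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(batch_pairs["input_ids"].shape)
# torch.Size([2, 32])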

