Basic usage of huggingface 🤗 Transformers


This post walks through the basic usage of huggingface 🤗 Transformers.
Original page: https://huggingface.co/transformers/quicktour.html


Using the transformers library involves two components: a tokenizer and a model.
Both can be downloaded with .from_pretrained(name).

1. Instantiating the tokenizer and model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Or:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
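
The pipeline imported in the first snippet wraps a tokenizer and a model behind a single call. A minimal sketch (assuming the same model_name as above) of running the sentiment-analysis pipeline:

classifier = pipeline("sentiment-analysis", model=model_name)
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# roughly: [{'label': 'POSITIVE', 'score': 0.99...}]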

2. Tokenizer
The tokenizer does two things:
1. Splits the input text into tokens.
2. Maps each token to a unique integer ID.

pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt"
)
print(pt_batch)

When a list of sentences is passed as a batch: padding controls whether all sentences are padded to the same length, and truncation controls whether sentences longer than max_length are cut down to max_length.
return_tensors="pt" returns PyTorch tensors; return_tensors="tf" returns TensorFlow tensors.

Output:
{'input_ids': tensor([[ 101, 2057, 2024, 2200,  102], [ 101, 2057, 3246, 2017,  102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]])}
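
Because max_length=5 with truncation enabled, each sentence keeps only five tokens (including [CLS] and [SEP]). A quick sanity-check sketch, decoding the first row of input_ids:

# decode the first (truncated) row to see the effect of max_length=5
print(tokenizer.decode(pt_batch["input_ids"][0]))
# expected to look like: "[CLS] we are very [SEP]"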

3. Model
Feed the tokenized batch to the model.
For a PyTorch model, you need to unpack the dictionary by adding **:
pt_outputs = pt_model(**pt_batch)
print(pt_outputs)
Output:
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In 🤗 Transformers, all model outputs are tuple-like objects; here we get just the final activations of the model.
PyTorch model outputs are special dataclasses, so you get autocompletion for their attributes in an IDE.
They also behave like a tuple or a dictionary (e.g. you can index with an integer, a slice or a string),
in which case the attributes that are not set (those with None values) are ignored.
So the logits can be accessed directly as pt_outputs[0].
Output:
tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>)

All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax)
since this final activation function is often fused with the loss.

You can also have the model return intermediate results:
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = pt_outputs[-2:]
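
What comes back: hidden_states is a tuple with the embedding output plus one tensor per layer, and attentions has one tensor per layer. A small sketch inspecting the shapes (assuming the 6-layer DistilBERT checkpoint used above):

print(len(all_hidden_states))        # 7 = embedding output + 6 layers
print(all_hidden_states[-1].shape)   # (batch_size, sequence_length, hidden_size)
print(len(all_attentions))           # 6, one per layer
print(all_attentions[0].shape)       # (batch_size, num_heads, seq_len, seq_len)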

4. Activation function
Finally, apply the activation function:
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim=-1)
print(pt_predictions)
Output:
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
As you can see, the first sentence "We are very happy to show you the 🤗 Transformers library." clearly leans toward the second label (i.e. positive),
while the second sentence "We hope you don't hate it." is hard to classify.
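
The mapping from the two columns to label names is stored in the model config. A small sketch (using the config of the checkpoint above, where 0 is NEGATIVE and 1 is POSITIVE) that turns the probabilities into readable labels:

# map the argmax of each row back to the label names stored in the config
pred_ids = pt_predictions.argmax(dim=-1)
print([pt_model.config.id2label[i.item()] for i in pred_ids])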


5. Customizing the model
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Or:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)
After such changes you either retrain from scratch (if the modifications are large) or fine-tune (if only the top layers were changed).


Main concepts

The library is built around three types of classes for each model:

  • Model classes such as BertModel, which are 30+ PyTorch models (torch.nn.Module) or Keras models (tf.keras.Model) that work with the pretrained weights provided in the library.

  • Configuration classes such as BertConfig, which store all the parameters required to build a model. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).

  • Tokenizer classes such as BertTokenizer, which store the vocabulary for each model and provide methods for encoding/decoding strings into a list of token embedding indices to be fed to a model.

All these classes can be instantiated from pretrained instances and saved locally using two methods:

  • from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (the supported models are listed in the library's documentation) or stored locally (or on a server) by the user,

  • save_pretrained() lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained()
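
A minimal sketch of that save/reload round trip, using the sentiment model from earlier as an example (the directory "./my_model_dir" is just an example path):

pt_model.save_pretrained("./my_model_dir")
tokenizer.save_pretrained("./my_model_dir")

# later, reload from the local directory instead of the hub
from transformers import AutoModelForSequenceClassification, AutoTokenizer
pt_model = AutoModelForSequenceClassification.from_pretrained("./my_model_dir")
tokenizer = AutoTokenizer.from_pretrained("./my_model_dir")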

On top of those three base classes, the library provides two APIs:

  • pipeline() for quickly using a model (plus its associated tokenizer and configuration) on a given task,

  • Trainer() to quickly train or fine-tune a given model.



Some terminology:

Input IDs:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer.tokenize(sequence))
t1 = tokenizer(sequence)
print(t1)
t2 = tokenizer.decode(t1["input_ids"])
print(t2)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] A Titan RTX has 24GB of VRAM [SEP]
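
The token-to-id mapping can also be queried in both directions. A small sketch with convert_tokens_to_ids / convert_ids_to_tokens (same tokenizer and sequence as above):

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)                                   # same ids as above, minus the special [CLS]/[SEP] tokens
print(tokenizer.convert_ids_to_tokens(ids))  # back to the WordPiece tokens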


 
        

Token Type IDs:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
print(encoded_dict)



{'input_ids': [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The 0/1 values in token_type_ids distinguish the two sentences.


Position IDs:

the position IDs (position_ids) are used by the model to identify each token’s position in the list of tokens.

They are an optional parameter. If no position_ids are passed to the model, the IDs are automatically created as absolute positional embeddings.

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].
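
A sketch of passing explicit position_ids to a BERT model; normally you omit them, and this mainly shows the expected shape and value range:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
inputs = tokenizer("A Titan RTX has 24GB of VRAM", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]
# absolute positions 0 .. seq_len-1, shape (1, seq_len)
position_ids = torch.arange(seq_len).unsqueeze(0)
outputs = model(**inputs, position_ids=position_ids)
print(outputs.last_hidden_state.shape)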

 

Labels:

These are the ground-truth targets, used to compute the loss.

 

Decoder input IDs:

The input IDs of labels that will be fed to the decoder.

Most encoder-decoder models (BART, T5) create their decoder_input_ids on their own from the labels. In such models, passing the labels is the preferred way to handle training.
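
A sketch with T5 (t5-small here is just an example checkpoint): passing labels alone is enough, and the model builds the decoder_input_ids from them internally and returns a loss:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
# decoder_input_ids are created internally by shifting the labels
outputs = model(**inputs, labels=labels)
print(outputs.loss)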

 

 

Some useful points from the "Using 🤗 Transformers" section:

1. Preprocessing data

For a pair of sentences:
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


>>> batch_sentences = ["Hello I'm a single sentence", ... "And another sentence", ... "And the very very last one"] >>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence", ... "And I should be encoded with the second sentence", ... "And I go with the very last one"] >>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences) >>> print(encoded_inputs) {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],  [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],  [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


For pre-tokenized inputs:

If you want to use pre-tokenized inputs, just set is_split_into_words=True when passing your inputs to the tokenizer. For instance:

>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True) >>> print(encoded_input) {'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]} 

2. Fine-tuning
The library also includes a number of task-specific final layers or ‘heads’
whose weights are instantiated randomly when not present in the specified pre-trained model.
For example, instantiating a model with
BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
will create a BERT model instance with encoder weights
copied from the bert-base-uncased model and a randomly initialized sequence classification head
on top of the encoder with an output size of 2.
Models are initialized in eval mode by default. We can call model.train() to put it in train mode.
 
        
import torch
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# set to train mode (default is eval mode)
model.train()

from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
labels = torch.tensor([1,0])
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss

# these two lines perform one training step
loss.backward()
optimizer.step()
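
For more than a single step you would wrap this in a loop over a DataLoader, zero the gradients each step, and optionally add a warmup scheduler. A sketch under those assumptions (train_dataset is a torch Dataset such as the IMDbDataset defined further below; the hyperparameters are illustrative):

from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["label"])
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()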

You can also use the Trainer that ships with the Transformers library:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments


fine-tuning code:
from pathlib import Path
import os

# get data & labels
def read_imdb_split(split_dir):
    texts = []
    labels = []
    for folder in ["pos", "neg"]:
        path = os.path.join(split_dir, folder)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r') as f:
                texts.append(f.read())
            labels.append(0 if folder == "pos" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

# create validation set
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2)

# tokenize
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

# create a torch dataset
from torch.utils.data import Dataset
import torch

class IMDbDataset(Dataset):
    # A Dataset subclass must define two methods:
    #   __len__: returns the size of the dataset
    #   __getitem__: returns one sample (data + label) for a given index
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # x.__getitem__(idx) <==> x[idx]
        # The encodings contain several fields (input_ids, attention_mask, ...),
        # so iterating over the key/value pairs makes sure none of them is missed.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["label"] = torch.tensor(self.labels[idx])
        return item

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

# fine-tune with Trainer
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=2,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)
# a single call to train() fine-tunes the model
trainer.train()

model.save_pretrained("./save_pretrained")


# eval
# test_pred = model(test_encodings)
# print(test_pred[0:100])
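
The commented-out call above would not run as written (test_encodings holds plain Python lists, not tensors, and the whole test set would not fit in one forward pass). A sketch that evaluates with the Trainer instead, assuming the test_dataset and test_labels defined above:

import numpy as np

# Trainer.predict batches the test Dataset and returns the logits.
predictions = trainer.predict(test_dataset)
pred_labels = np.argmax(predictions.predictions, axis=-1)
accuracy = (pred_labels == np.array(test_labels)).mean()
print(f"test accuracy: {accuracy:.4f}")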






