Basic usage of huggingface 🤗 Transformers


This post walks through the basic usage of huggingface 🤗 Transformers.
Original page: https://huggingface.co/transformers/quicktour.html


Using the transformers library involves two components: a tokenizer and a model.
Both can be downloaded with .from_pretrained(name).

1. Instantiating the tokenizer and model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Or:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
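
The pipeline imported in the first snippet wraps a tokenizer and a model behind a single call. A minimal sketch (assuming the same model_name as above) of running the sentiment-analysis pipeline:

classifier = pipeline("sentiment-analysis", model=model_name)
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# roughly: [{'label': 'POSITIVE', 'score': 0.99...}]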

2. Tokenizer
The tokenizer does two things:
1. Splits the input text into tokens.
2. Maps each token to a unique integer ID.

pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt"
)
print(pt_batch)

When a list of sentences is passed as a batch: padding controls whether all sentences are padded to the same length, and truncation controls whether sentences longer than max_length are cut down to max_length.
return_tensors="pt" returns PyTorch tensors; return_tensors="tf" returns TensorFlow tensors.

Output:
{'input_ids': tensor([[ 101, 2057, 2024, 2200,  102], [ 101, 2057, 3246, 2017,  102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]])}
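
Because max_length=5 with truncation enabled, each sentence keeps only five tokens (including [CLS] and [SEP]). A quick sanity-check sketch, decoding the first row of input_ids:

# decode the first (truncated) row to see the effect of max_length=5
print(tokenizer.decode(pt_batch["input_ids"][0]))
# expected to look like: "[CLS] we are very [SEP]"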

3. Model
Feed the tokenized batch to the model.
For a PyTorch model, you need to unpack the dictionary by adding **:
pt_outputs = pt_model(**pt_batch)
print(pt_outputs)
Output:
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In 🤗 Transformers, all model outputs are tuple-like objects; here we get just the final activations of the model.
PyTorch model outputs are special dataclasses, so you get autocompletion for their attributes in an IDE.
They also behave like a tuple or a dictionary (e.g. you can index with an integer, a slice or a string),
in which case the attributes that are not set (those with None values) are ignored.
So the logits can be accessed directly as pt_outputs[0].
Output:
tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>)

All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax)
since this final activation function is often fused with the loss.

You can also have the model return intermediate results:
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = pt_outputs[-2:]
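
What comes back: hidden_states is a tuple with the embedding output plus one tensor per layer, and attentions has one tensor per layer. A small sketch inspecting the shapes (assuming the 6-layer DistilBERT checkpoint used above):

print(len(all_hidden_states))        # 7 = embedding output + 6 layers
print(all_hidden_states[-1].shape)   # (batch_size, sequence_length, hidden_size)
print(len(all_attentions))           # 6, one per layer
print(all_attentions[0].shape)       # (batch_size, num_heads, seq_len, seq_len)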

4. Activation function
Finally, apply the activation function:
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim=-1)
print(pt_predictions)
Output:
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
As you can see, the first sentence "We are very happy to show you the 🤗 Transformers library." clearly leans toward the second label (i.e. positive),
while the second sentence "We hope you don't hate it." is hard to classify.
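
The mapping from the two columns to label names is stored in the model config. A small sketch (using the config of the checkpoint above, where 0 is NEGATIVE and 1 is POSITIVE) that turns the probabilities into readable labels:

# map the argmax of each row back to the label names stored in the config
pred_ids = pt_predictions.argmax(dim=-1)
print([pt_model.config.id2label[i.item()] for i in pred_ids])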


5. Customizing the model
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Or:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)
After such changes you either retrain from scratch (if the modifications are large) or fine-tune (if only the top layers were changed).


Main concepts

The library is built around three types of classes for each model:

  • Model classes such as BertModel, which are 30+ PyTorch models (torch.nn.Module) or Keras models (tf.keras.Model) that work with the pretrained weights provided in the library.

  • Configuration classes such as BertConfig, which store all the parameters required to build a model. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).

  • Tokenizer classes such as BertTokenizer, which store the vocabulary for each model and provide methods for encoding/decoding strings into a list of token embedding indices to be fed to a model.

All these classes can be instantiated from pretrained instances and saved locally using two methods:

  • from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (the supported models are listed in the library's documentation) or stored locally (or on a server) by the user,

  • save_pretrained() lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained()
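
A minimal sketch of that save/reload round trip, using the sentiment model from earlier as an example (the directory "./my_model_dir" is just an example path):

pt_model.save_pretrained("./my_model_dir")
tokenizer.save_pretrained("./my_model_dir")

# later, reload from the local directory instead of the hub
from transformers import AutoModelForSequenceClassification, AutoTokenizer
pt_model = AutoModelForSequenceClassification.from_pretrained("./my_model_dir")
tokenizer = AutoTokenizer.from_pretrained("./my_model_dir")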

On top of those three base classes, the library provides two APIs:

  • pipeline() for quickly using a model (plus its associated tokenizer and configuration) on a given task,

  • Trainer() to quickly train or fine-tune a given model.



Some terminology:

Input IDs:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer.tokenize(sequence))
t1 = tokenizer(sequence)
print(t1)
t2 = tokenizer.decode(t1["input_ids"])
print(t2)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] A Titan RTX has 24GB of VRAM [SEP]
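
The token-to-id mapping can also be queried in both directions. A small sketch with convert_tokens_to_ids / convert_ids_to_tokens (same tokenizer and sequence as above):

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)                                   # same ids as above, minus the special [CLS]/[SEP] tokens
print(tokenizer.convert_ids_to_tokens(ids))  # back to the WordPiece tokens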


 
        

Token Type IDs:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
print(encoded_dict)



{'input_ids': [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The 0/1 values in token_type_ids distinguish the two sentences.


Position IDs:

the position IDs (position_ids) are used by the model to identify each token’s position in the list of tokens.

They are an optional parameter. If no position_ids are passed to the model, the IDs are automatically created as absolute positional embeddings.

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].
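
A sketch of passing explicit position_ids to a BERT model; normally you omit them, and this mainly shows the expected shape and value range:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
inputs = tokenizer("A Titan RTX has 24GB of VRAM", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]
# absolute positions 0 .. seq_len-1, shape (1, seq_len)
position_ids = torch.arange(seq_len).unsqueeze(0)
outputs = model(**inputs, position_ids=position_ids)
print(outputs.last_hidden_state.shape)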

 

Labels:

These are the ground-truth targets, used to compute the loss.

 

Decoder input IDs:

The input IDs of labels that will be fed to the decoder.

Most encoder-decoder models (BART, T5) create their decoder_input_ids on their own from the labels. In such models, passing the labels is the preferred way to handle training.
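
A sketch with T5 (t5-small here is just an example checkpoint): passing labels alone is enough, and the model builds the decoder_input_ids from them internally and returns a loss:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
# decoder_input_ids are created internally by shifting the labels
outputs = model(**inputs, labels=labels)
print(outputs.loss)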

 

 

Some useful points from the "Using 🤗 Transformers" section:

1. Preprocessing data

For a pair of sentences:
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


>>> batch_sentences = ["Hello I'm a single sentence", ... "And another sentence", ... "And the very very last one"] >>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence", ... "And I should be encoded with the second sentence", ... "And I go with the very last one"] >>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences) >>> print(encoded_inputs) {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],  [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],  [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


For pre-tokenized inputs:

If you want to use pre-tokenized inputs, just set is_split_into_words=True when passing your inputs to the tokenizer. For instance:

>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True) >>> print(encoded_input) {'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]} 

2. Fine-tuning
The library also includes a number of task-specific final layers or ‘heads’
whose weights are instantiated randomly when not present in the specified pre-trained model.
For example, instantiating a model with
BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
will create a BERT model instance with encoder weights
copied from the bert-base-uncased model and a randomly initialized sequence classification head
on top of the encoder with an output size of 2.
Models are initialized in eval mode by default. We can call model.train() to put it in train mode.
 
        
import torch
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# set to train mode (default is eval mode)
model.train()

from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
labels = torch.tensor([1,0])
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss

# these two lines perform one training step
loss.backward()
optimizer.step()
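
For more than a single step you would wrap this in a loop over a DataLoader, zero the gradients each step, and optionally add a warmup scheduler. A sketch under those assumptions (train_dataset is a torch Dataset such as the IMDbDataset defined further below; the hyperparameters are illustrative):

from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["label"])
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()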

You can also use the Trainer that ships with the Transformers library:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments


fine-tuning code:
from pathlib import Path
import os

# get data & labels
def read_imdb_split(split_dir):
    texts = []
    labels = []
    for folder in ["pos", "neg"]:
        path = os.path.join(split_dir, folder)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r') as f:
                texts.append(f.read())
            labels.append(0 if folder == "pos" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

# create validation set
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2)

# tokenize
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

# create a torch dataset
from torch.utils.data import Dataset
import torch

class IMDbDataset(Dataset):
    # A Dataset subclass must define two methods:
    #   __len__: returns the size of the dataset
    #   __getitem__: returns one sample (data + label) for a given index
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # x.__getitem__(idx) <==> x[idx]
        # The encodings contain several fields (input_ids, attention_mask, ...),
        # so iterating over the key/value pairs makes sure none of them is missed.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["label"] = torch.tensor(self.labels[idx])
        return item

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

# fine-tune with Trainer
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=2,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)
# a single call to train() fine-tunes the model
trainer.train()

model.save_pretrained("./save_pretrained")


# eval
# test_pred = model(test_encodings)
# print(test_pred[0:100])
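
The commented-out call above would not run as written (test_encodings holds plain Python lists, not tensors, and the whole test set would not fit in one forward pass). A sketch that evaluates with the Trainer instead, assuming the test_dataset and test_labels defined above:

import numpy as np

# Trainer.predict batches the test Dataset and returns the logits.
predictions = trainer.predict(test_dataset)
pred_labels = np.argmax(predictions.predictions, axis=-1)
accuracy = (pred_labels == np.array(test_labels)).mean()
print(f"test accuracy: {accuracy:.4f}")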






