NLP tasks come up often, and getting better accuracy usually requires a pre-trained embedding model.
Reference: github
Install
pip install pytorch-pretrained-bert
Usage
BertTokenizer
BertTokenizer splits the input sentence into tokens so that they can be embedded later.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)
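Words that are not in the vocabulary are split into sub-word pieces by greedy longest-match; for example, "puppeteer" becomes "puppet" and "##eer".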
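To make the splitting rule concrete, here is a minimal sketch of greedy longest-match-first splitting for a single word. It is not the library's own code: the function name `wordpiece_split`, the `toy_vocab` set and the `max_chars` default are assumptions made for illustration, and the real tokenizer also lowercases text and splits punctuation before this step.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first split of one word (hypothetical helper)."""
    if len(word) > max_chars:
        return [unk_token]  # overly long words are treated as unknown
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for this example only.
toy_vocab = {"puppet", "##eer", "jim", "henson"}
print(wordpiece_split("puppeteer", toy_vocab))  # ['puppet', '##eer']
```

A word with no matching pieces at all, such as "xyz" with this toy vocabulary, comes back as a single unknown token.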
BertModel
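# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
Convert the list above to a tensor and pass it to BertModel.
# Convert inputs to PyTorch tensors (a batch holding one sentence)
tokens_tensor = torch.tensor([indexed_tokens])
# Assumption: the text has no [SEP] split, so every token is in segment 0
segments_tensors = torch.zeros_like(tokens_tensor)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)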
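The model call takes a segments tensor alongside the token ids. For a sentence pair marked with [CLS] and [SEP], the segment ids could be built roughly as follows. This is only a sketch: `build_segment_ids` is a hypothetical helper, not part of pytorch-pretrained-bert, and it assumes the tokens already contain the special markers.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 to the first sentence (through its [SEP]) and 1 to the rest."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == sep_token:
            current = 1  # BERT has two segment types, so this stays at 1
    return segment_ids


# Token list assumed for illustration; it mirrors the example sentence
# with [CLS] and [SEP] markers added.
tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]",
          "jim", "henson", "was", "a", "puppet", "##eer", "[SEP]"]
segments_ids = build_segment_ids(tokens)
print(segments_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

The resulting list can be wrapped as torch.tensor([segments_ids]) so that it has the same batch shape as the token tensor.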