NLP (9): Implementing BERT in PyTorch with the transformers library


I. Resources

(1) Pre-trained model weights

Link:  Password: 1upi

(2) The dataset is THUCNews. Download it and extract 100k samples yourself: the task is 10-class Chinese classification of news title text, with an equal 10k titles per class. The dataset can also be downloaded from my Baidu Netdisk: Link:  Password: p0wj.

(3) Installation

pip install transformers

(4) References:

https://zhuanlan.zhihu.com/p/112655246

https://spaces.ac.cn/archives/6736

II. Overview

Since pytorch_pretrained_bert is the old version of the transformers library and is no longer maintained, the code from the original article has been updated to use transformers as the framework, and the printed output has been made more concise. This article is aimed at beginners; if you are interested, give it a try.

----------------- divider ----------------

I had previously always worked with BERT through the keras-bert wrapper, which is very convenient (see Su Jianlin's blog post 當Bert遇上Keras:這可能是Bert最簡單的打開姿勢, roughly "When BERT meets Keras: possibly the easiest way to get started with BERT"). This time I wanted to try a PyTorch-based BERT workflow.

PyTorch is very popular at the moment, yet few blog posts give complete, PyTorch-based BERT application code. This article starts from the simplest task, Chinese text classification, and presents every piece of code step by step. (The code is simple and clear; interested readers are encouraged to try it themselves.)

(1) First install the transformers library, i.e. pip install transformers (version 4.4.2 is used here);

(2) Then download the pre-trained model weights. Here chinese_roberta_wwm_ext_pytorch is used, which can be downloaded from the Chinese BERT-wwm model series page (several model variants are available there). If that does not work, it is also on my Baidu Netdisk: Link:  Password: 1upi;

(3) The dataset is THUCNews. Download it and extract 100k samples yourself: the task is 10-class Chinese classification of news title text, with an equal 10k titles per class. The dataset can also be downloaded from my Baidu Netdisk: Link:  Password: p0wj.

The last 10 rows of the dataset are shown in the preview below; the format is "title \t label", i.e. the title column and the class column separated by '\t'.
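A minimal sketch for printing that preview yourself (the file name news_title_dataset.csv is the one used in the preprocessing step below; adjust the path if your copy is named differently):

# Print the last 10 rows of the dataset to check the "title \t label" format.
with open("news_title_dataset.csv", encoding="utf-8") as f:
    lines = f.readlines()

print(len(lines))           # should be 100000
for line in lines[-10:]:    # last 10 rows
    title, label = line.strip().split("\t")
    print(title, "->", label)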

Enough talk, let's get to the code. (Training environment: Google Colab with a T4 GPU and roughly 15 GB of GPU memory.)

1 Import the necessary libraries

import pandas as pd
import numpy as np
import json, time
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertModel, BertConfig, BertTokenizer, AdamW, get_cosine_schedule_with_warmup
import warnings
warnings.filterwarnings('ignore')

bert_path = "bert_model/"                              # this folder contains three files: 'vocab.txt', 'pytorch_model.bin', 'config.json'
tokenizer = BertTokenizer.from_pretrained(bert_path)   # initialize the tokenizer

2 Preprocess the dataset

input_ids, input_masks, input_types = [], [], []   # token ids, attention masks, segment type ids
labels = []                                        # labels
maxlen = 30                                        # a max length of 30 covers about 99% of the titles

with open("news_title_dataset.csv", encoding='utf-8') as f:
    for i, line in tqdm(enumerate(f)):
        title, y = line.strip().split('\t')
        # encode_plus returns a dict with 'input_ids', 'token_type_ids' and 'attention_mask';
        # with these arguments, short sequences are padded and long ones truncated
        encode_dict = tokenizer.encode_plus(text=title, max_length=maxlen,
                                            padding='max_length', truncation=True)
        input_ids.append(encode_dict['input_ids'])
        input_types.append(encode_dict['token_type_ids'])
        input_masks.append(encode_dict['attention_mask'])
        labels.append(int(y))

input_ids, input_types, input_masks = np.array(input_ids), np.array(input_types), np.array(input_masks)
labels = np.array(labels)
print(input_ids.shape, input_types.shape, input_masks.shape, labels.shape)
Output: (about 27 seconds, quite fast)
100000it [00:27, 3592.75it/s]
(100000, 30) (100000, 30) (100000, 30) (100000,)
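To get a feel for what encode_plus returns, here is a small illustrative example on a made-up title (the string itself is purely hypothetical and not taken from the dataset):

# Illustrative only: the title string below is made up, not a real dataset row.
demo = tokenizer.encode_plus(text="今日股市大漲", max_length=maxlen,
                             padding='max_length', truncation=True)
print(list(demo.keys()))        # the three keys: input_ids, token_type_ids, attention_mask
print(len(demo['input_ids']))   # 30: padded/truncated to maxlen
print(demo['input_ids'])        # starts with the [CLS] id, then the characters, [SEP], and 0-padding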

3 Split into training, validation and test sets

# shuffle the indices
idxes = np.arange(input_ids.shape[0])
np.random.seed(2019)   # fix the random seed
np.random.shuffle(idxes)
print(idxes.shape, idxes[:10])

# 8:1:1 split into training, validation and test sets
input_ids_train, input_ids_valid, input_ids_test = input_ids[idxes[:80000]], input_ids[idxes[80000:90000]], input_ids[idxes[90000:]]
input_masks_train, input_masks_valid, input_masks_test = input_masks[idxes[:80000]], input_masks[idxes[80000:90000]], input_masks[idxes[90000:]]
input_types_train, input_types_valid, input_types_test = input_types[idxes[:80000]], input_types[idxes[80000:90000]], input_types[idxes[90000:]]
y_train, y_valid, y_test = labels[idxes[:80000]], labels[idxes[80000:90000]], labels[idxes[90000:]]

print(input_ids_train.shape, y_train.shape, input_ids_valid.shape, y_valid.shape,
      input_ids_test.shape, y_test.shape)
Output: the shuffled index array shape (100000,) and its first 10 indices, followed by the split shapes (80000, 30) (80000,) (10000, 30) (10000,) (10000, 30) (10000,)
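Note that the split is purely random rather than stratified; if you want to confirm each class is still well represented in every split, a quick hedged check (assuming the labels are the integers 0-9):

# Count how many titles of each class end up in each split.
print(np.bincount(y_train))
print(np.bincount(y_valid))
print(np.bincount(y_test))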

4 Load the data into an efficient DataLoader

BATCH_SIZE = 64   # reduce this if you run into OOM problems

# training set
train_data = TensorDataset(torch.LongTensor(input_ids_train),
                           torch.LongTensor(input_masks_train),
                           torch.LongTensor(input_types_train),
                           torch.LongTensor(y_train))
train_sampler = RandomSampler(train_data)
train_loader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

# validation set
valid_data = TensorDataset(torch.LongTensor(input_ids_valid),
                           torch.LongTensor(input_masks_valid),
                           torch.LongTensor(input_types_valid),
                           torch.LongTensor(y_valid))
valid_sampler = SequentialSampler(valid_data)
valid_loader = DataLoader(valid_data, sampler=valid_sampler, batch_size=BATCH_SIZE)

# test set (treated as unlabelled here)
test_data = TensorDataset(torch.LongTensor(input_ids_test),
                          torch.LongTensor(input_masks_test),
                          torch.LongTensor(input_types_test))
test_sampler = SequentialSampler(test_data)
test_loader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)
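As a quick sanity check (not required for training), you can peek at a single batch to confirm the tensor shapes:

# Fetch one training batch: token ids, attention masks, segment ids, labels.
ids, att, tpe, y = next(iter(train_loader))
print(ids.shape, att.shape, tpe.shape, y.shape)
# expected: torch.Size([64, 30]) for the first three and torch.Size([64]) for the labels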

5 Define the BERT model

# define the model
class Bert_Model(nn.Module):
    def __init__(self, bert_path, classes=10):
        super(Bert_Model, self).__init__()
        self.config = BertConfig.from_pretrained(bert_path)
        self.bert = BertModel.from_pretrained(bert_path)
        self.fc = nn.Linear(self.config.hidden_size, classes)   # classify directly on top of BERT

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids, attention_mask, token_type_ids)
        out_pool = outputs[1]        # the pooled output
        logit = self.fc(out_pool)
        return logit

As you can see, thanks to this convenient wrapper library, defining the BERT model is quite easy. If you want to add CNN/RNN layers on top of BERT, this is where you would define them; a sketch of one possibility follows.
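For example, here is a hedged sketch of a variant that runs a bidirectional GRU over BERT's token-level outputs before classifying (the class name and the hidden size 256 are illustrative choices, not part of the original article and not tuned):

# Illustrative sketch: BERT followed by a bidirectional GRU, then a linear classifier.
class Bert_GRU_Model(nn.Module):
    def __init__(self, bert_path, classes=10):
        super(Bert_GRU_Model, self).__init__()
        self.config = BertConfig.from_pretrained(bert_path)
        self.bert = BertModel.from_pretrained(bert_path)
        self.gru = nn.GRU(self.config.hidden_size, 256,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256 * 2, classes)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids, attention_mask, token_type_ids)
        sequence_output = outputs[0]              # token-level output: [batch, seq_len, hidden]
        _, h_n = self.gru(sequence_output)        # h_n: [2, batch, 256] (two directions)
        h = torch.cat([h_n[0], h_n[1]], dim=1)    # concatenate both directions
        return self.fc(h)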

6 Instantiate the BERT model

def get_parameter_number(model):
    # report the number of model parameters
    total_num = sum(p.numel() for p in model.parameters())
    trainable_num = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return 'Total parameters: {}, Trainable parameters: {}'.format(total_num, trainable_num)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EPOCHS = 5
model = Bert_Model(bert_path).to(DEVICE)
print(get_parameter_number(model))
Output: Total parameters: 102275338, Trainable parameters: 102275338
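The total breaks down into the BERT backbone plus the linear head (768 * 10 + 10 = 7,690 parameters); you can verify the split like this:

# Split the printed total into the BERT backbone and the classification head.
bert_params = sum(p.numel() for p in model.bert.parameters())
fc_params = sum(p.numel() for p in model.fc.parameters())    # 768*10 + 10 = 7,690
print(bert_params, fc_params, bert_params + fc_params)       # the sum matches 102275338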

7 Define the optimizer

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=1e-4)   # AdamW optimizer
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=len(train_loader),
                                            num_training_steps=EPOCHS * len(train_loader))
# The learning rate warms up linearly for one epoch, then follows a cosine decay.
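To see that schedule concretely, a hedged sketch that steps a throwaway optimizer/scheduler pair so the real optimizer's state is left untouched; the exact curve depends on len(train_loader), which is 80000 / 64 = 1250 here:

# Trace the learning-rate schedule on a dummy optimizer, leaving the real one alone.
dummy_opt = AdamW([nn.Parameter(torch.zeros(1))], lr=2e-5)
dummy_sched = get_cosine_schedule_with_warmup(dummy_opt,
                                              num_warmup_steps=len(train_loader),
                                              num_training_steps=EPOCHS * len(train_loader))
lrs = []
for _ in range(EPOCHS * len(train_loader)):
    lrs.append(dummy_sched.get_last_lr()[0])
    dummy_opt.step()
    dummy_sched.step()
print(lrs[0], max(lrs), lrs[-1])   # starts near 0, peaks at 2e-5 after one epoch, then decays towards 0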

8 Define the training, validation and prediction functions

# evaluate model performance on the validation set
def evaluate(model, data_loader, device):
    model.eval()
    val_true, val_pred = [], []
    with torch.no_grad():
        for idx, (ids, att, tpe, y) in enumerate(data_loader):
            y_pred = model(ids.to(device), att.to(device), tpe.to(device))
            y_pred = torch.argmax(y_pred, dim=1).detach().cpu().numpy().tolist()
            val_pred.extend(y_pred)
            val_true.extend(y.squeeze().cpu().numpy().tolist())
    return accuracy_score(val_true, val_pred)   # return accuracy


# the test set is treated as unlabelled, so just predict
def predict(model, data_loader, device):
    model.eval()
    val_pred = []
    with torch.no_grad():
        for idx, (ids, att, tpe) in tqdm(enumerate(data_loader)):
            y_pred = model(ids.to(device), att.to(device), tpe.to(device))
            y_pred = torch.argmax(y_pred, dim=1).detach().cpu().numpy().tolist()
            val_pred.extend(y_pred)
    return val_pred


def train_and_eval(model, train_loader, valid_loader, optimizer, scheduler, device, epoch):
    best_acc = 0.0
    patience = 0                        # placeholder for early stopping (not used below)
    criterion = nn.CrossEntropyLoss()
    for i in range(epoch):
        """train the model"""
        start = time.time()
        model.train()
        print("***** Running training epoch {} *****".format(i + 1))
        train_loss_sum = 0.0
        for idx, (ids, att, tpe, y) in enumerate(train_loader):
            ids, att, tpe, y = ids.to(device), att.to(device), tpe.to(device), y.to(device)
            y_pred = model(ids, att, tpe)
            loss = criterion(y_pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()            # update the learning rate
            train_loss_sum += loss.item()
            if (idx + 1) % (len(train_loader) // 5) == 0:   # print only 5 times per epoch
                print("Epoch {:04d} | Step {:04d}/{:04d} | Loss {:.4f} | Time {:.4f}".format(
                    i + 1, idx + 1, len(train_loader), train_loss_sum / (idx + 1), time.time() - start))
                # print("Learning rate = {}".format(optimizer.state_dict()['param_groups'][0]['lr']))

        """validate the model"""
        model.eval()
        acc = evaluate(model, valid_loader, device)   # evaluate on the validation set
        if acc > best_acc:
            best_acc = acc
            torch.save(model.state_dict(), "best_bert_model.pth")
        print("current acc is {:.4f}, best acc is {:.4f}".format(acc, best_acc))
        print("time costed = {}s \n".format(round(time.time() - start, 5)))

9 Train and validate the model

# training and validation
train_and_eval(model, train_loader, valid_loader, optimizer, scheduler, DEVICE, EPOCHS)
Output: (training takes a while, roughly 500 s per epoch; here only 2 epochs were trained and the validation set already reached an accuracy of 0.9680)

10 Load the best model and evaluate on the test set

# load the best weights and evaluate on the test set
model.load_state_dict(torch.load("best_bert_model.pth"))
pred_test = predict(model, test_loader, DEVICE)
print("\n Test Accuracy = {} \n".format(accuracy_score(y_test, pred_test)))
print(classification_report(y_test, pred_test, digits=4))
Output: the test-set accuracy is 96.72%

------------------------------------

With these 10 steps, you have a fairly complete PyTorch-based BERT text classification pipeline. The code is simple and easy to follow; if it helped you, please give it a like to show your support~

- The End -

 

