PyTorch: Fine-tuning a Pre-trained BERT Model for Chinese Text Classification


My laptop is too weak to run this, so the code below was run on Google Colab.

Corpus link: https://pan.baidu.com/s/1YxGGYmeByuAlRdAVov_ZLg
Extraction code: tzao

neg.txt and pos.txt each contain 5,000 hotel reviews, one review per line.

Install the transformers library

!pip install transformers

Import packages and set the hyperparameters

import numpy as np
import random
import torch
import matplotlib.pyplot as plt
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from transformers import get_linear_schedule_with_warmup

SEED = 123
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 1e-2
EPSILON = 1e-8

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

1. Data Preprocessing

1.1 Reading the files

def readfile(filename):
    with open(filename, encoding="utf-8") as f:
        content = f.readlines()
        return content

pos_text, neg_text = readfile('hotel/pos.txt'), readfile('hotel/neg.txt')
sentences = pos_text + neg_text

# set the labels: 1 for positive, 0 for negative
pos_targets = np.ones((len(pos_text)))
neg_targets = np.zeros((len(neg_text)))
targets = np.concatenate((pos_targets, neg_targets), axis=0).reshape(-1, 1)   # (10000, 1)
total_targets = torch.tensor(targets)

Tip: calling readfile raised UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 0.

Fix: open the txt files in Notepad++, click Encoding in the toolbar, and convert them to UTF-8.
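If Notepad++ is not handy, the same conversion can be done once in Python. This is only a sketch that assumes the source files are GBK-encoded (a guess based on the offending 0xbe byte, not verified here):

# one-off re-encoding of the corpus files; assumes (unverified) that they are GBK-encoded
for name in ['hotel/pos.txt', 'hotel/neg.txt']:
    with open(name, encoding='gbk', errors='ignore') as f:
        text = f.read()
    with open(name, 'w', encoding='utf-8') as f:
        f.write(text)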

1.2 Encoding with BertTokenizer: converting each sentence to token ids

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir="E:/transformer_file/")
print(pos_text[2])
print(tokenizer.tokenize(pos_text[2]))
print(tokenizer.encode(pos_text[2]))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(pos_text[2])))

不錯,下次還考慮入住。交通也方便,在餐廳吃的也不錯。

['不', '錯', ',', '下', '次', '還', '考', '慮', '入', '住', '。', '交', '通', '也', '方', '便', ',', '在', '餐', '廳', '吃', '的', '也', '不', '錯', '。']

[101, 679, 7231, 8024, 678, 3613, 6820, 5440, 5991, 1057, 857, 511, 769, 6858, 738, 3175, 912, 8024, 1762, 7623, 1324, 1391, 4638, 738, 679, 7231, 511, 102]

['[CLS]', '不', '錯', ',', '下', '次', '還', '考', '慮', '入', '住', '。', '交', '通', '也', '方', '便', ',', '在', '餐', '廳', '吃', '的', '也', '不', '錯', '。', '[SEP]']

To give every sentence the same length, a little extra processing is needed:

# convert each sentence to token ids (truncate anything longer than 126, pad anything shorter;
# with the two special tokens added at the start and end, the total length is 128)
def convert_text_to_token(tokenizer, sentence, limit_size=126):

    tokens = tokenizer.encode(sentence[:limit_size])       # truncate directly
    if len(tokens) < limit_size + 2:                       # pad (the id of [PAD] is 0)
        tokens.extend([0] * (limit_size + 2 - len(tokens)))
    return tokens

input_ids = [convert_text_to_token(tokenizer, sen) for sen in sentences]

input_tokens = torch.tensor(input_ids)
print(input_tokens.shape)                    # torch.Size([10000, 128])

1.3 attention_masks: within a text, [PAD] positions are 0 and all other positions are 1

# build the attention masks
def attention_masks(input_ids):
    atten_masks = []
    for seq in input_ids:
        seq_mask = [float(i > 0) for i in seq]
        atten_masks.append(seq_mask)
    return atten_masks

atten_masks = attention_masks(input_ids)
attention_tokens = torch.tensor(atten_masks)

The input_ids and atten_masks built here serve the same purpose as the input_ids and attention_mask returned by the .encode_plus function mentioned in the previous section. token_type_ids are irrelevant to this task; they are meant for tasks where each training example contains two sentences (e.g., question answering).
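For comparison, roughly the same input_ids and attention_mask can be produced in a single encode_plus call. This is only a sketch; the exact padding/truncation keyword arguments depend on the installed transformers version:

# sketch: let the tokenizer do truncation, padding and mask construction in one call
encoded = tokenizer.encode_plus(
    pos_text[2],
    max_length=128,                # same total length as convert_text_to_token above
    padding='max_length',          # pad with 0 up to max_length
    truncation=True,               # truncate longer sentences
    return_attention_mask=True,
)
print(encoded['input_ids'])        # list of 128 token ids
print(encoded['attention_mask'])   # 1 for real tokens, 0 for [PAD]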

1.4 Splitting into training and test sets

The random_state and test_size arguments of the two train_test_split calls must be identical, otherwise train_inputs and train_masks would no longer correspond one-to-one (an index-based alternative is sketched after the example output below).

from sklearn.model_selection import train_test_split
train_inputs, test_inputs, train_labels, test_labels = train_test_split(input_tokens, total_targets, random_state=666, test_size=0.2)
train_masks, test_masks, _, _ = train_test_split(attention_tokens, input_tokens, random_state=666, test_size=0.2)
print(train_inputs.shape, test_inputs.shape)      # torch.Size([8000, 128]) torch.Size([2000, 128])
print(train_masks.shape)                          # torch.Size([8000, 128]), same shape as train_inputs

print(train_inputs[0])
print(train_masks[0])

tensor([ 101, 2769, 6370, 4638, 3221, 10189, 1039, 4638, 117, 852, 2769, 6230, 2533, 8821, 1039, 4638, 7599, 3419, 3291, 1962, 671, 763, 117, 3300, 671, 2476, 1377, 809, 1288, 1309, 4638, 3763, 1355, 119, 2456, 6379, 1920, 2157, 6370, 3249, 6858, 7313, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
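As mentioned above, an alternative that does not depend on two calls sharing the same random_state is to split an index array once and then index every tensor with the same result (a minimal sketch, not the original code):

# split once on indices so inputs, masks and labels are guaranteed to stay aligned
indices = np.arange(len(input_tokens))
train_idx, test_idx = train_test_split(indices, random_state=666, test_size=0.2)

train_inputs, test_inputs = input_tokens[train_idx], input_tokens[test_idx]
train_masks,  test_masks  = attention_tokens[train_idx], attention_tokens[test_idx]
train_labels, test_labels = total_targets[train_idx], total_targets[test_idx]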

1.5 Creating DataLoaders to pull out one batch of data at a time

TensorDataset packs tensors together, much like Python's zip. It indexes along each tensor's first dimension, so all tensors passed in must have the same size in that dimension, and every argument must be a tensor.

RandomSampler samples the dataset in random order.

SequentialSampler samples the dataset in sequential order.
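A tiny illustration of the zip-like behaviour, using throwaway tensors that are not part of the pipeline:

# each item of the dataset is the slice of every tensor at the same index
xs = torch.arange(6).view(3, 2)       # tensor([[0, 1], [2, 3], [4, 5]])
ys = torch.tensor([0, 1, 0])
demo = TensorDataset(xs, ys)
print(len(demo))      # 3
print(demo[1])        # (tensor([2, 3]), tensor(1))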

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

Take a quick look at the contents of train_dataloader:

for i, (train, mask, label) in enumerate(train_dataloader):
    print(train.shape, mask.shape, label.shape)            # torch.Size([16, 128]) torch.Size([16, 128]) torch.Size([16, 1])
    break
print('len(train_dataloader)=', len(train_dataloader))     # 500

2. Creating the Model and Optimizer

Create the model

model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)     # num_labels=2: two classes, positive and negative reviews
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Define the optimizer

The eps argument is a term added to the denominator to improve numerical stability (default: 1e-8).

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)

A more general formulation: bias and LayerNorm.weight parameters get no weight decay.

no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': WEIGHT_DECAY},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE, eps=EPSILON)

Learning rate warm-up: training starts from a small learning rate.

epochs = 2
# number of training steps: [number of batches] x [number of epochs]
total_steps = len(train_dataloader) * epochs

# set up the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

3. Training and Evaluating the Model

3.1 Model accuracy

def binary_acc(preds, labels):      # preds.shape=(16, 2), labels.shape=torch.Size([16, 1])
    correct = torch.eq(torch.max(preds, dim=1)[1], labels.flatten()).float()      # both arguments to eq have shape torch.Size([16])
    acc = correct.sum().item() / len(correct)
    return acc
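A quick sanity check with made-up logits (purely illustrative):

# predicted classes are [1, 0, 1]; two of the three labels match
demo_preds = torch.tensor([[0.1, 0.9], [2.0, -1.0], [-0.5, 0.5]])
demo_labels = torch.tensor([[1], [1], [1]])
print(binary_acc(demo_preds, demo_labels))   # 0.666...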

3.2 Measuring running time

import time
import datetime
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))   # return the elapsed time as hh:mm:ss

3.3 Training the model

  • The arguments passed to model must be tensors;
  • nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2) is used to keep gradients from exploding during training;

Its arguments are (model parameters, maximum gradient norm, norm type=2); the norm type defaults to the L2 norm.

Tip: note that this is only applied during training, not during evaluation.

def train(model, optimizer):
    t0 = time.time()
    avg_loss, avg_acc = [], []

    model.train()
    for step, batch in enumerate(train_dataloader):

        # print the elapsed time every 40 batches
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

        output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss, logits = output[0], output[1]

        avg_loss.append(loss.item())

        acc = binary_acc(logits, b_labels)
        avg_acc.append(acc)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 1.0)      # clip gradient norms above 1.0 to prevent exploding gradients
        optimizer.step()              # update the model parameters
        scheduler.step()              # update the learning rate

    avg_acc = np.array(avg_acc).mean()
    avg_loss = np.array(avg_loss).mean()
    return avg_loss, avg_acc

Here output is a tuple: element 0 is the loss, and element 1 holds the scores (logits) for the negative and positive classes for each example in the batch:

(tensor(0.0210, device='cuda:0', grad_fn=<NllLossBackward>), 
tensor([[-2.9815,  2.6931],
        [-3.2380,  3.1935],
        [-3.0775,  3.0713],
        [ 3.0191, -2.3689],
        [ 3.1146, -2.7957],
        [ 3.7798, -2.7410],
        [-0.3273,  0.8227],
        [ 2.5012, -1.5535],
        [-3.0231,  3.0162],
        [ 3.4146, -2.5582],
        [ 3.3104, -2.2134],
        [ 3.3776, -2.5190],
        [-2.6513,  2.5108],
        [-3.3691,  2.9516],
        [ 3.2397, -2.0473],
        [-2.8622,  2.7395]], device='cuda:0', grad_fn=<AddmmBackward>))
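If actual probabilities are wanted instead of raw scores, the logits can be passed through a softmax (a small sketch on top of the logits variable from the loop above, not part of the original training code):

import torch.nn.functional as F

probs = F.softmax(logits, dim=1)     # shape (batch_size, 2), each row sums to 1
preds = torch.argmax(probs, dim=1)   # 0 = negative review, 1 = positive review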

3.4 Evaluating the model

No labels are passed in when calling the model here.

def evaluate(model):
    avg_acc = []
    model.eval()         # switch to evaluation mode

    with torch.no_grad():
        for batch in test_dataloader:
            b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

            output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

            acc = binary_acc(output[0], b_labels)
            avg_acc.append(acc)
    avg_acc = np.array(avg_acc).mean()
    return avg_acc

Here output is a tuple whose element 0 holds the negative/positive logits for each example in the batch:

(tensor([[ 3.8217, -2.7516],
        [ 2.7585, -2.0853],
        [-2.9317,  2.9092],
        [-3.3724,  3.2597],
        [-2.8692,  2.6741],
        [-3.2784,  2.9276],
        [ 3.4946, -2.8895],
        [ 3.7855, -2.8623],
        [-2.2249,  2.4336],
        [-2.4257,  2.4606],
        [ 3.3996, -2.5760],
        [-3.1986,  3.0841],
        [ 3.6883, -2.9492],
        [ 3.2883, -2.3600],
        [ 2.6723, -2.0778],
        [-3.1868,  3.1106]], device='cuda:0'),)

3.5 Running training and evaluation

for epoch in range(epochs):

    train_loss, train_acc = train(model, optimizer)
    print('epoch={}, train accuracy={}, loss={}'.format(epoch, train_acc, train_loss))
    test_acc = evaluate(model)
    print("epoch={}, test accuracy={}".format(epoch, test_acc))

The output looks like this:

  Batch    40  of    500.    Elapsed: 0:00:14.
  Batch    80  of    500.    Elapsed: 0:00:28.
  Batch   120  of    500.    Elapsed: 0:00:42.
  Batch   160  of    500.    Elapsed: 0:00:57.
  Batch   200  of    500.    Elapsed: 0:01:12.
  Batch   240  of    500.    Elapsed: 0:01:26.
  Batch   280  of    500.    Elapsed: 0:01:41.
  Batch   320  of    500.    Elapsed: 0:01:56.
  Batch   360  of    500.    Elapsed: 0:02:11.
  Batch   400  of    500.    Elapsed: 0:02:26.
  Batch   440  of    500.    Elapsed: 0:02:42.
  Batch   480  of    500.    Elapsed: 0:02:57.
epoch=0, train accuracy=0.9015, loss=0.2549531048182398
epoch=0, test accuracy=0.9285
  Batch    40  of    500.    Elapsed: 0:00:16.
  Batch    80  of    500.    Elapsed: 0:00:31.
  Batch   120  of    500.    Elapsed: 0:00:47.
  Batch   160  of    500.    Elapsed: 0:01:03.
  Batch   200  of    500.    Elapsed: 0:01:18.
  Batch   240  of    500.    Elapsed: 0:01:34.
  Batch   280  of    500.    Elapsed: 0:01:50.
  Batch   320  of    500.    Elapsed: 0:02:06.
  Batch   360  of    500.    Elapsed: 0:02:22.
  Batch   400  of    500.    Elapsed: 0:02:37.
  Batch   440  of    500.    Elapsed: 0:02:53.
  Batch   480  of    500.    Elapsed: 0:03:09.
epoch=1, train accuracy=0.9595, loss=0.14357946291333065
epoch=1, test accuracy=0.939

4. Prediction

def predict(sen):

    input_id = convert_text_to_token(tokenizer, sen)
    input_token = torch.tensor(input_id).long().to(device)             # torch.Size([128])

    atten_mask = [float(i > 0) for i in input_id]
    attention_token = torch.tensor(atten_mask).long().to(device)       # torch.Size([128])

    output = model(input_token.view(1, -1), token_type_ids=None, attention_mask=attention_token.view(1, -1))     # reshape torch.Size([128]) to torch.Size([1, 128]), otherwise the model raises an error
    print(output[0])

    return torch.max(output[0], dim=1)[1]

label = predict('酒店位置難找,環境不太好,隔音差,下次不會再來的。')
print('positive' if label == 1 else 'negative')

label = predict('酒店還可以,接待人員很熱情,衛生合格,空間也比較大,不足的地方就是沒有窗戶')
print('positive' if label == 1 else 'negative')

label = predict('"服務各方面沒有不周到的地方, 各方面沒有沒想到的細節"')
print('positive' if label == 1 else 'negative')

tensor([[ 3.5719, -2.7315]], device='cuda:0', grad_fn=<AddmmBackward>)

negative

tensor([[-2.7998, 2.8675]], device='cuda:0', grad_fn=<AddmmBackward>)

positive

tensor([[-1.9614, 1.5925]], device='cuda:0', grad_fn=<AddmmBackward>)

positive

The performance is decent; even the slightly odd phrasing of the third sentence is classified correctly.
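Note that the grad_fn in the tensors printed above shows that predict still builds a computation graph. For pure inference it is slightly cheaper to disable gradient tracking, for example with a small wrapper like this (a sketch, not the original code):

def predict_no_grad(sen):
    model.eval()              # make sure dropout is off
    with torch.no_grad():     # skip building the computation graph
        return predict(sen)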

 

Reference: https://blog.csdn.net/Code_Tookie/article/details/104944888?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param

