BERT is a very powerful NLP model that achieves high accuracy on text classification. This article walks through the basic steps of Chinese text classification with BERT; instructions for obtaining the code are at the end of the article.
Step 1: Read the data
This article uses the Toutiao news classification dataset, in which articles are classified by their headlines.
101 京城最值得你來場文化之旅的博物館_!_保利集團,馬未都,中國科學技術館,博物館,新中國
101 發酵床的墊料種類有哪些?哪種更好?
101 上聯:黃山黃河黃皮膚黃土高原。怎么對下聯?
101 林徽因什么理由拒絕了徐志摩而選擇梁思成為終身伴侶?
101 黃楊木是什么樹?
First, download and unzip the data:
wget http://github.com/skdjfla/toutiao-text-classfication-dataset/raw/master/toutiao_cat_data.txt.zip
unzip toutiao_cat_data.txt.zip
Read the news titles and labels according to the dataset format:
import pandas as pd
import codecs

# Labels: the category code is the second "_!_"-delimited field; subtract 100
news_label = [int(x.split('_!_')[1]) - 100
              for x in codecs.open('toutiao_cat_data.txt', encoding='utf-8')]

# Text: take the last field, unless the line ends with "_!_" (empty last field),
# in which case fall back to the second-to-last field
news_text = [x.strip().split('_!_')[-1] if x.strip()[-3:] != '_!_'
             else x.strip().split('_!_')[-2]
             for x in codecs.open('toutiao_cat_data.txt', encoding='utf-8')]
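The slicing above is easier to follow on a single record. The line below is a hypothetical example in the `_!_`-delimited on-disk format that the parsing code assumes (the id and fields are made up for illustration, not taken from the actual file):

```python
# Hypothetical record: id _!_ category_code _!_ category_name _!_ title _!_ keywords
line = "123456_!_101_!_news_culture_!_黃楊木是什么樹?_!_黃楊木,樹種"

# Label: the category code minus 100
label = int(line.split('_!_')[1]) - 100

# Text: the last field, unless the line ends with "_!_" (empty last field)
text = (line.strip().split('_!_')[-1]
        if line.strip()[-3:] != '_!_'
        else line.strip().split('_!_')[-2])

print(label)  # 1
print(text)   # 黃楊木,樹種
```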
Step 2: Split the dataset
Use train_test_split to hold out 20% of the data as a validation set, stratified so that the training and validation splits share the same label distribution.
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, TensorDataset
import numpy as np
import pandas as pd
import random
import re

# Split into training and validation sets
# stratify: sample by label so both splits have the same class distribution
x_train, x_test, train_label, test_label = train_test_split(
    news_text[:], news_label[:], test_size=0.2, stratify=news_label[:])
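As a quick illustration of what `stratify` guarantees, the toy split below (made-up data, not the news dataset) keeps the class ratio identical in both halves:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0, 20 of class 1 (a 4:1 imbalance)
texts = ['a'] * 80 + ['b'] * 20
labels = [0] * 80 + [1] * 20

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Both splits preserve the 4:1 class ratio
print(Counter(y_tr))  # Counter({0: 64, 1: 16})
print(Counter(y_va))  # Counter({0: 16, 1: 4})
```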
Step 3: Encode the text
Use transformers to tokenize and encode the text. The model used here is bert-base-chinese, so the tokenizer loaded must match it.
# Load the BERT tokenizer from transformers
from transformers import BertTokenizer

# Tokenizer and vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

train_encoding = tokenizer(x_train, truncation=True, padding=True, max_length=64)
test_encoding = tokenizer(x_test, truncation=True, padding=True, max_length=64)
Build a Dataset from the encoded data:
# Dataset wrapper around the encoded text
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # Read a single sample
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encoding, train_label)
test_dataset = NewsDataset(test_encoding, test_label)
The dataset directly wraps the tokenizer's output; its main fields are:
input_ids: the token id of each character
token_type_ids: marks whether a token belongs to the first or the second sentence
attention_mask: marks whether a position is padding
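To see how attention_mask relates to padding without loading the tokenizer, here is a minimal hand-rolled sketch of what batch padding produces (the token ids are toy values, except that 0 really is BERT's [PAD] id):

```python
# Toy token-id sequences of unequal length
batch = [[101, 2769, 102], [101, 2769, 4263, 872, 102]]

max_len = max(len(seq) for seq in batch)

# Pad every sequence to max_len with 0 (the [PAD] id);
# mark real tokens with 1 and padding with 0 in attention_mask
input_ids = [seq + [0] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)       # [[101, 2769, 102, 0, 0], [101, 2769, 4263, 872, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```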
Step 4: Define the BERT model
Since this is a text classification task, simply load BertForSequenceClassification; the number of classes must be specified.
from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=17)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# From single samples to batch loading
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)  # no need to shuffle for evaluation

# Optimizer
optim = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_loader) * 4  # batches per epoch times number of epochs
scheduler = get_linear_schedule_with_warmup(optim,
                                            num_warmup_steps=0,  # default value in run_glue.py
                                            num_training_steps=total_steps)
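The linear schedule with warmup can be sketched in plain Python. With num_warmup_steps=0 as above, it simply decays the learning-rate multiplier linearly from 1 to 0 over the training steps (this is a simplified re-implementation for illustration, not the transformers source):

```python
def linear_schedule_with_warmup(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: ramp up linearly, then decay linearly to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step) /
               max(1, num_training_steps - num_warmup_steps))

# With no warmup and 10 total steps, the multiplier decays 1.0 -> 0.0
print([round(linear_schedule_with_warmup(s, 0, 10), 1) for s in range(11)])
# [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
```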
Step 5: Train and validate the model
A standard forward and backward pass suffices; classification accuracy is computed during validation.
# Training function
def train():
    model.train()
    total_train_loss = 0
    iter_num = 0
    total_iter = len(train_loader)
    for batch in train_loader:
        # Forward pass
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Parameter update
        optim.step()
        scheduler.step()

        iter_num += 1
        if iter_num % 100 == 0:
            print("epoch: %d, iter_num: %d, loss: %.4f, %.2f%%" % (
                epoch, iter_num, loss.item(), iter_num / total_iter * 100))

    print("Epoch: %d, Average training loss: %.4f" % (epoch, total_train_loss / len(train_loader)))
# Accuracy of argmax predictions against the labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass only
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        loss = outputs[0]
        logits = outputs[1]

        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f" % (total_eval_loss / len(test_dataloader)))
    print("-------------------------------")
for epoch in range(4):
    print("------------Epoch: %d ----------------" % epoch)
    train()
    validation()
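The accuracy metric that `validation()` relies on compares argmax predictions with the labels. A standalone sanity check of that logic on toy logits (toy values chosen for this sketch):

```python
import numpy as np

def flat_accuracy(preds, labels):
    """Fraction of samples whose argmax over class logits matches the label."""
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Three samples, two classes: predictions pick classes 1, 0, 1
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([1, 0, 0])  # the last prediction is wrong

print(flat_accuracy(logits, labels))  # 2 of 3 correct -> 0.666...
```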
After just one epoch of training, the output accuracy already reaches 87%; the BERT model is highly effective.
------------Epoch: 0 ----------------
epoch: 0, iter_num: 2500, loss: 0.7519, 100.00%
Epoch: 0, Average training loss: 0.6181
Accuracy: 0.8747
Average testing loss: 0.4602
-------------------------------
Source: https://zhuanlan.zhihu.com/p/388009679