BERT Text Classification


This article fine-tunes BERT using the PyTorch framework.

BERT official documentation: https://huggingface.co/transformers/model_doc/bert.html

BERT source repository: https://github.com/huggingface/transformers

A good introductory article on BERT: https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html

Download URLs for the BERT pretrained models (they can be fetched with a download manager such as Xunlei):

PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
}
PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}

The main difference between Chinese and English classification lies in whether the pretrained model you load is English or Chinese (at least in my experience). For example, "bert-base-uncased" is an English pretrained model, while "bert-base-chinese" is a Chinese one.
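For example, a minimal sketch of loading the Chinese pretrained weights instead of the English ones (the model names come from the tables above; num_labels=3 is just a placeholder):

from transformers import BertTokenizer, BertForSequenceClassification

# English model with a lower-cased vocabulary
en_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Chinese model with a character-level vocabulary
zh_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
zh_model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)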

The examples from the source code below show clearly how to load a pretrained tokenizer:

Examples::

            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer

            # Download vocabulary from S3 and cache.
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

            # Download vocabulary from S3 (user-uploaded) and cache.
            tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')

            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')

            # If the tokenizer uses a single vocabulary file, you can point directly to this file
            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')

            # You can link tokens to special vocabulary when instantiating
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
            # You should be sure '<unk>' is in the vocabulary when doing that.
            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
            assert tokenizer.unk_token == '<unk>'

        """

From the BertForSequenceClassification source code below, we can see that the only model-level difference between multi-class and binary classification is the line  self.num_labels = config.num_labels . So where does config come from?

The config argument comes from:

modelConfig = BertConfig.from_pretrained('bert-base-uncased/bert_config.json')
model = BertForSequenceClassification.from_pretrained('./bert-base-uncased/pytorch_model.bin', config=modelConfig)
From these two lines we can see that all it takes is changing the value of "num_labels" in the bert_config.json file.
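Editing the JSON file is not strictly required; here is a sketch of doing the same thing purely in code (paths follow the ones above, and in current transformers versions num_labels can be passed as a keyword argument):

modelConfig = BertConfig.from_pretrained('bert-base-uncased/bert_config.json', num_labels=10)
model = BertForSequenceClassification.from_pretrained('./bert-base-uncased/pytorch_model.bin', config=modelConfig)

# or let the model loader build the config itself
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=10)

The BertForSequenceClassification source itself is shown below: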
@add_start_docstrings("""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. """,
    BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForSequenceClassification(BertPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
            Labels for computing the sequence classification/regression loss.
            Indices should be in ``[0, ..., config.num_labels - 1]``.
            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
    Examples:
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
        loss, logits = outputs[:2]
    """
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels
 
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
 
        self.init_weights()
 
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
                position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        pooled_output = outputs[1]
 
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
 
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
 
        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
 
        return outputs  # (loss), logits, (hidden_states), (attentions)

There is another way to turn the model into a multi-class classifier: rewrite the final layer on top of BERT yourself, see:

https://blog.csdn.net/weixin_41519463/article/details/100863313

Another author's implementation of multi-class text classification is also worth referring to: https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch

# Use the sequence-classification model BertForSequenceClassification provided by transformers
from transformers import BertForSequenceClassification

PRETRAINED_MODEL_NAME = "bert-base-chinese"
NUM_LABELS = 3

model = BertForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL_NAME, num_labels=NUM_LABELS)
# Build on BertModel and add your own Dropout and linear classification layers; in effect this is equivalent to the BertForSequenceClassification provided by transformers
class Model(nn.Module):

    def __init__(self, config):
        super(Model, self).__init__()
        self.bert = BertModel.from_pretrained(config.bert_path)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        context = x[0]   # the input token ids
        mask = x[2]      # attention mask marking the padded positions
        seg_ids = x[3]   # segment (token type) ids separating sentence pairs
        # Note: output_all_encoded_layers is the argument name from the older
        # pytorch_pretrained_bert package; the transformers BertModel does not
        # accept it, so with transformers simply drop that argument.
        _, pooled = self.bert(input_ids=context, token_type_ids=seg_ids, attention_mask=mask, output_all_encoded_layers=False)
        out = self.dropout(pooled)
        out = self.fc(out)
        return out

model = Model(config)
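A hypothetical usage sketch for the custom model above. The packing of x is my own assumption (in the referenced repo it comes from that project's data iterator), and the token ids are made up purely to illustrate the shapes:

# x = (token ids, sequence lengths, attention mask, segment ids);
# forward() above only reads indices 0, 2 and 3.
input_ids = torch.LongTensor([[101, 2769, 4263, 872, 102]])   # batch of 1, made-up ids
seq_len   = torch.LongTensor([5])                             # unused by forward()
mask      = torch.LongTensor([[1, 1, 1, 1, 1]])
seg_ids   = torch.LongTensor([[0, 0, 0, 0, 0]])
logits = model((input_ids, seq_len, mask, seg_ids))           # shape: (1, config.num_classes)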

 

The examples in the from_pretrained docstring show clearly how to load a pretrained model:

Examples::

            # For example purposes. Not runnable.
            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
            model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)  # Update configuration during loading
            assert model.config.output_attentions == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')
            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
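The counterpart for persisting a fine-tuned model is save_pretrained; a minimal sketch (the directory name './finetuned_bert/' is arbitrary):

import os

os.makedirs('./finetuned_bert/', exist_ok=True)
model.save_pretrained('./finetuned_bert/')       # writes pytorch_model.bin and config.json
tokenizer.save_pretrained('./finetuned_bert/')   # writes vocab.txt and the tokenizer settings
# both can later be reloaded with .from_pretrained('./finetuned_bert/')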

 

From here on, I walk through a simple application of BERT to the AI研習社 (AI Yanxishe) financial-comment classification task.

1. Import the required packages

import pandas as pd
import numpy as np
from transformers import BertTokenizer,AdamW, BertConfig,  BertForSequenceClassification
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from tqdm import trange
from torchviz import make_dot  # a computation-graph visualization tool (optional)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

2. Data processing: convert the text into token IDs:

 
         
train_data = pd.read_csv("./input/train.csv")

# Extract the sentences and wrap them with BERT's special tokens
sentences = ['[CLS] ' + sent + ' [SEP]' for sent in train_data.text.values]
label = train_data.label.values
# print(len(train_data))
# print(train_data.text.values)

# NOTE: for Chinese text, 'bert-base-chinese' (see the discussion above) is the more natural choice
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_sents = [tokenizer.tokenize(sent) for sent in sentences]
# print(tokenized_sents[0])

# Convert the tokenized sentences into ids (word --> idx)
MAX_LEN = 512
input_ids = [tokenizer.convert_tokens_to_ids(sent) for sent in tokenized_sents]

3. Pad and truncate the sequences to the same length:

'''
# Doing the padding by hand; truncating means sequences longer than max_l are cut off
def pad_sequences(inputs, max_l):
    def pad(x):
        return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x))
    feature = np.array([pad(x) for x in inputs], dtype=np.long)
    return feature
'''
# Padding: here we use Keras's pad_sequences to pad and truncate
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
# input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, truncating='post', padding='post')
# print(input_ids[0])

4. Build the attention_mask (BERT needs a mask so that padded positions are ignored):

attention_mask=[]
for seq in input_ids:
    seq_mak = [float(i>0) for i in seq]
    attention_mask.append(seq_mak)

5. Split into training and validation sets:

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids,label,
                                                                    random_state=2020,test_size=0.1)

train_masks, valid_masks, _, _ = train_test_split(attention_mask, input_ids,random_state=2020,test_size=0.1)

6. Build the DataLoader iterators:

# Convert to tensors
train_inputs = torch.LongTensor(train_inputs)
valid_inputs = torch.LongTensor(validation_inputs)
train_labels = torch.tensor(train_labels)
valid_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
valid_masks = torch.tensor(valid_masks)

# Build the DataLoaders
batch_size = 16
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_data = TensorDataset(valid_inputs, valid_masks, valid_labels)
valid_dataloader = DataLoader(valid_data, batch_size=batch_size)

7. Load the model:

# Load the model
modelConfig = BertConfig.from_pretrained('bert-base-uncased/bert_config.json')
model = BertForSequenceClassification.from_pretrained('./bert-base-uncased/pytorch_model.bin', config=modelConfig)

# model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# fc_features = model.classifier.in_features
# model.classifier = nn.Linear(fc_features, 11)

model.to(device)

8. Add the optimizer:

Define the optimizer. Note that BertAdam and AdamW are Adam variants from different library versions; the API changes quickly, so it is enough to know which one you are calling. We also specify which parameters get weight decay: 'gamma' and 'beta' refer to the LayerNorm parameters and should not be decayed, so they are trained as-is; every other parameter except the biases is trained with weight decay. Roughly speaking, weight decay is an L2-style regularization applied on top of Adam, which is what AdamW implements (strictly, AdamW decouples the decay from the gradient update). Also note that in current transformers versions the LayerNorm parameters are named 'LayerNorm.weight' and 'LayerNorm.bias' rather than 'gamma' and 'beta', so you may want to add those names to the no_decay list as well.

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']   # these parameters are trained without weight decay
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},   # AdamW expects the key 'weight_decay' (not 'weight_decay_rate')
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]

optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)

9. Fine-tuning:

Next comes the training loop. Note that when labels are passed in, the model returns the loss directly; if no labels are passed, the output contains only the logits.
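A quick sketch of the two call forms (the variable names are the ones used in the loop below):

outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)   # (loss, logits, ...)
loss, logits = outputs[:2]
outputs = model(b_input_ids, attention_mask=b_input_mask)                    # (logits, ...)
logits = outputs[0]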

# Compute accuracy
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

train_loss_set = []
epochs=10
for _ in trange(epochs, desc='Epoch'):
    model.train()
    tr_loss= 0.0 
    nb_tr_example, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        # Take the first element: for BertForSequenceClassification the first output is the loss, the second is the logits computed from the [CLS] representation
        loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)[0]
        train_loss_set.append(loss.item())
        loss.backward()
        optimizer.step()


        tr_loss+=loss.item()
        nb_tr_example+=b_input_ids.size(0)
        nb_tr_steps+=1

    print("Train loss: {}".format(tr_loss / nb_tr_steps))

(If you want to visualize the computation graph, you can run:)

    '''
    y = model(x)   # x: a batch of input ids
    vis_graph = make_dot(y, params=dict(list(model.named_parameters()) + [('x', x)]))
    vis_graph.view()
    '''

10. Evaluation:

    # Model evaluation (still inside the epoch loop, so validation runs after every training epoch)
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)[0]
        
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1
    print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))

11. Prediction

test_df = pd.read_csv("./input/test.csv")

test_sentences = ['[CLS] ' + sent + ' [SEP]' for sent in test_df.text.values]
test_tokens = [tokenizer.tokenize(sent) for sent in test_sentences]

test_input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in test_tokens]

input_ids = pad_sequences(test_input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
# Build the mask
test_mask = []
for seq in input_ids:
    seq_mak = [float(i>0) for i in seq]
    test_mask.append(seq_mak)

test_label = []

model.eval()
with torch.no_grad():
    for in_ids, mask in zip(input_ids, test_mask):
        # add a batch dimension of size 1
        in_ids = torch.LongTensor(in_ids).unsqueeze(0).to(device)
        mask = torch.tensor(mask).unsqueeze(0).to(device)
        logits = model(in_ids, token_type_ids=None, attention_mask=mask)[0]
        test_label.append(int(np.argmax(logits.cpu().numpy(), axis=1)[0]))

bert_Test = pd.DataFrame(test_df.index, columns=['id'])
bert_Test['label'] = test_label
bert_Test.to_csv('bert_test.csv', index=None, header=None)

 

References used for this example:

https://github.com/huggingface/transformers

https://zhuanlan.zhihu.com/p/56103665

https://github.com/real-brilliant/bert_chinese_pytorch

https://blog.csdn.net/Real_Brilliant/article/details/84880528

https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch

https://blog.csdn.net/Ascn_/article/details/106194326

