BERT in Practice with Keras

This article is a repost. Originally published 2019-10-05 17:09. Tags: NLP / Deep Learning / Keras / Python

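keras_bert and bert4keras

keras_bert is a Keras implementation of BERT packaged by CyberZHG. It can load the officially released pre-trained weights directly.

GitHub: https://github.com/CyberZHG/keras-bert

Quick install: pip install keras-bert

bert4keras is a Keras implementation of BERT rewritten by Su Jianlin (蘇劍林) with keras-bert as a reference, so the two feel much the same to use. bert4keras can additionally load ALBERT.

GitHub: https://github.com/bojone/bert4keras

Quick install: pip install git+https://www.github.com/bojone/bert4keras.git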

keras_bert

Tokenizer

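In keras-bert, the Tokenizer splits text into tokens and maps each token to its id.

We have to supply a dictionary that maps tokens to ids. The dictionary also holds BERT's special tokens, such as [CLS], [SEP] and [UNK].

In the example below, any piece that is not in the dictionary gets id 5, which stands for [UNK] (unknown).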

from keras_bert import Tokenizer
# Dictionary: token -> id
token_dict = {
    '[CLS]': 0,
    '[SEP]': 1,
    'un': 2,
    '##aff': 3,
    '##able': 4,
    '[UNK]': 5,
}

tokenizer = Tokenizer(token_dict)

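# Example: split a word into word pieces
print(tokenizer.tokenize('unaffable'))
# ['[CLS]', 'un', '##aff', '##able', '[SEP]']

# indices holds the id of each token
# segments marks whether each position belongs to the first or the second sentence
# There is only one sentence here ('unaffable'), so segments is all zeros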
indices, segments = tokenizer.encode('unaffable')
print(indices)  
# [0, 2, 3, 4, 1]
print(segments)  
# [0, 0, 0, 0, 0]

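With the same dictionary, splitting a word that is not in the dictionary gives the result below. For English text, the part that is not in the dictionary is split into single letters.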

print(tokenizer.tokenize('unknown')) 
# ['[CLS]', 'un', '##k', '##n', '##o', '##w', '##n', '[SEP]']

indices, segments = tokenizer.encode('unknown')
# [0, 2, 5, 5, 5, 5, 5, 1]
# [0, 0, 0, 0, 0, 0, 0, 0]
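
The splitting behaviour shown above can be reproduced with a greedy longest-match loop. The following is only a minimal sketch written for illustration, not keras_bert's actual code; the helper names wordpiece and to_ids are hypothetical, and it handles a single lowercase word without adding [CLS] or [SEP].

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word (illustrative sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary
        # or only a single character is left.
        while end > start + 1:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece
            if piece in vocab:
                break
            end -= 1
        piece = word[start:end]
        if start > 0:
            piece = '##' + piece  # continuation pieces carry the ## prefix
        pieces.append(piece)
        start = end
    return pieces


def to_ids(pieces, vocab, unk='[UNK]'):
    """Map pieces to ids, falling back to the [UNK] id."""
    return [vocab.get(p, vocab[unk]) for p in pieces]


token_dict = {'[CLS]': 0, '[SEP]': 1, 'un': 2, '##aff': 3, '##able': 4, '[UNK]': 5}
for word in ('unaffable', 'unknown'):
    pieces = wordpiece(word, token_dict)
    print(pieces, to_ids(pieces, token_dict))
```

Every single letter that falls out of the loop unmatched ends up as [UNK], which is why 'unknown' yields a run of 5s after 'un'.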

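Below is an example with two input sentences. encode accepts a max_len argument, which keeps only the first max_len tokens of the tokenized text.

If there are fewer than max_len tokens, the rest is padded with 0.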

print(tokenizer.tokenize(first='unaffable', second='鋼'))
# ['[CLS]', 'un', '##aff', '##able', '[SEP]', '鋼', '[SEP]']
indices, segments = tokenizer.encode(first='unaffable', second='鋼', max_len=10)
print(indices)  
# [0, 2, 3, 4, 1, 5, 1, 0, 0, 0]
print(segments)  
# [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

Note that max_len counts BERT's special tokens too. Take the code below:
tokenizer.encode('unaffable', max_len=3)
# [0, 2, 1]
The indices come out as [0, 2, 1], where 0 and 1 are [CLS] and [SEP], so only one real token survives.
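
How max_len, the special tokens and the padding interact can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: encode_ids is a hypothetical helper, the ids come from the toy dictionary above (where [CLS] and padding both happen to be 0), and dropping tokens from the end of the longer sentence is one plausible truncation rule.

```python
CLS_ID, SEP_ID, PAD_ID = 0, 1, 0  # ids taken from the toy dictionary above


def encode_ids(first, second=None, max_len=None):
    """Build (indices, segments) from already-converted token ids."""
    first = list(first)
    second = list(second) if second is not None else None
    if max_len is not None:
        # Reserve room for [CLS] and one [SEP] per sentence.
        budget = max(max_len - (2 if second is None else 3), 0)
        while len(first) + (len(second) if second else 0) > budget:
            # Assumption: trim the end of whichever sentence is longer.
            if second and len(second) > len(first):
                second.pop()
            else:
                first.pop()
    indices = [CLS_ID] + first + [SEP_ID]
    segments = [0] * len(indices)
    if second is not None:
        indices += second + [SEP_ID]
        segments += [1] * (len(second) + 1)
    if max_len is not None:
        pad = max_len - len(indices)
        indices += [PAD_ID] * pad
        segments += [0] * pad
    return indices, segments


print(encode_ids([2, 3, 4], max_len=3))
print(encode_ids([2, 3, 4], [5], max_len=10))
```

The two calls mirror the single-sentence max_len=3 case and the two-sentence max_len=10 case from the text.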

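Training and Using the Model

Function Overview

In keras_bert, get_model() builds a BERT model. It accepts the following parameters:

token_num: number of tokens

pos_num: maximum position. Default 512

seq_len: maximum length of the input sequence; no limit when None. Default 512

embed_dim: embedding dimension. Default 768

transformer_num: number of transformer blocks. Default 12

head_num: number of heads in the multi-head attention of each transformer block. Default 12

feed_forward_dim: dimension of the feed-forward layer in each transformer block. Default 3072

dropout_rate: dropout probability

attention_activation: activation function of the attention layers

feed_forward_activation: activation function of the feed-forward layers. Default gelu

training: if True, the returned model includes the MLM and NSP outputs; otherwise the input layers and the last feature-extraction layer are returned. Default True

trainable: whether the model is trainable. Defaults to the same value as training

output_layer_num: how many FeedForward-Norm layer outputs are concatenated into a single output. Only available when training is False. Default 1

use_task_embed: whether to add a task embedding to the existing embeddings. Default False

task_num: number of tasks. Default 10

use_adapter: whether to use a feed-forward adapter before each residual connection. Default False

adapter_units: dimension of the first transformation in the feed-forward adapter

For more on adapters, see this paper: https://arxiv.org/pdf/1902.00751.pdf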

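gen_batch_inputs() generates the data used for training. It accepts the following parameters:

sentence_pairs: a list of sentence pairs, where each sentence is a list of tokens

token_dict: the dictionary, including BERT's special tokens

token_list: a list of all tokens

seq_len: sequence length. Default 512

mask_rate: probability that a token is selected for masking; the model is then trained to predict the selected token. Default 0.15

mask_mask_rate: for a selected token, the probability that it is actually replaced with [MASK]. Default 0.8

mask_random_rate: for a selected token, the probability that it is replaced with a random token. Default 0.1

swap_sentence_rate: probability of swapping the first and second sentences. Default 0.5

force_mask: make sure at least one position is masked. Default True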
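
The three masking rates combine roughly as in the sketch below. It is a simplified illustration, not the code inside gen_batch_inputs(): mask_tokens is a hypothetical helper, the rng argument is added only to make runs repeatable, and force_mask and sentence swapping are left out.

```python
import random


def mask_tokens(tokens, token_list, mask_rate=0.15, mask_mask_rate=0.8,
                mask_random_rate=0.1, rng=None):
    """Return (masked tokens, 0/1 flags marking positions to predict)."""
    rng = rng or random.Random()
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            targets.append(1)  # this position must be predicted
            r = rng.random()
            if r < mask_mask_rate:
                masked.append('[MASK]')                # usual case
            elif r < mask_mask_rate + mask_random_rate:
                masked.append(rng.choice(token_list))  # random replacement
            else:
                masked.append(token)                   # left unchanged
        else:
            targets.append(0)
            masked.append(token)
    return masked, targets


sentence = ['all', 'work', 'and', 'no', 'play']
print(mask_tokens(sentence, sentence, mask_rate=0.3, rng=random.Random(0)))
```

With the default rates, a selected token becomes [MASK] 80% of the time, a random token 10% of the time, and stays as it is for the remaining 10%.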

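compile_model() compiles the model. It accepts the following parameters:

model: the model to compile

weight_decay: weight decay rate. Default 0.01

decay_steps: the learning rate decays linearly to 0 over this many steps. Default 100000

warmup_steps: the learning rate grows linearly to the configured learning rate over the warmup steps. Default 10000

learning_rate: learning rate. Default 1e-4

For more on warmup, see this article: https://yinguobing.com/tensorflowzhong-de-xue-xi-lu-re-shen/

While the current step is below warmup_steps, the learning rate equals base learning rate × (current step / warmup_steps). That ratio is below 1, so the learning rate rises throughout warmup. Once warmup ends, the learning rate starts to decay.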
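
The warmup-then-decay schedule can be written as a small function. This is a sketch of one plausible reading of the description, not the optimizer's actual code: lr_at is a hypothetical helper, and it assumes the linear decay is measured against the total decay_steps counted from step 0.

```python
def lr_at(step, base_lr=1e-4, warmup_steps=10000, decay_steps=100000):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step <= warmup_steps:
        # Warmup: base_lr scaled by a ratio that grows from 0 to 1.
        return base_lr * step / warmup_steps
    # Decay: fraction of the decay budget used so far, capped at 1.
    progress = min(step, decay_steps) / decay_steps
    return base_lr * (1.0 - progress)


for step in (0, 5000, 10000, 50000, 100000):
    print(step, lr_at(step))
```

Under this reading the rate peaks at the end of warmup and reaches 0 once decay_steps have elapsed.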

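load_trained_model_from_checkpoint() loads an officially trained model. It accepts the following parameters:

config_file: path to the JSON configuration file

checkpoint_file: path to the checkpoint file

training: if True, the whole model is returned; otherwise the MLM and NSP parts are dropped. Default False

trainable: whether the model is trainable. Defaults to the same value as training

output_layer_num: how many FeedForward-Norm layer outputs are concatenated into a single output. Only available when training is False. Default 1

seq_len: if this value is smaller than the length in the configuration file, the position embeddings are sliced to fit this length. Default 1e9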

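Building and Training a Model

In this example we do not use the Tokenizer to split text into word pieces; the model takes word-level tokens as input instead.

This is much like text preprocessing in Keras; see the following article:

https://www.cnblogs.com/dogecheng/p/11565530.html

For a sentiment-analysis example built with keras_bert, see the following article:

https://www.cnblogs.com/dogecheng/p/11824494.html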

import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs


# Example input
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build the token dictionary
# This dictionary stores whole words
token_dict = get_base_dict()
# get_base_dict() returns a dictionary
# pre-filled with the special tokens:
# {'': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
# token_dict now maps words to ids, roughly like this:
# {'': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4, 'all': 5, 'work': 6,..., 'me': 26, 'more': 27}

token_list = list(token_dict.keys())


# Build and train the model
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()

def _generator():
    while True:
        yield gen_batch_inputs(
            sentence_pairs,
            token_dict,
            token_list,
            seq_len=20,
            mask_rate=0.3,
            swap_sentence_rate=1.0,
        )

model.fit_generator(
    # Training and validation use the same data here;
    # never do this in real use
    generator=_generator(),
    steps_per_epoch=1000,
    epochs=100,
    validation_data=_generator(),
    validation_steps=100,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
)


# Use the trained model
# Get the input layers and the last feature-extraction layer
inputs, output_layer = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
    training=False,
    trainable=False,
    output_layer_num=4,
)

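Downloading and Using a Pre-trained Model

Reference: https://github.com/CyberZHG/keras-bert/tree/master/demo

load_trained_model_from_checkpoint() loads a pre-trained model that has already been downloaded locally. Download links are listed on the BERT GitHub page.

Google BERT: https://github.com/google-research/bert

Chinese pre-trained BERT-wwm: https://github.com/ymcui/Chinese-BERT-wwm

The code below uses a pre-trained model to extract features from the input text.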

import os

# Paths to the pre-trained model
pretrained_path = 'chinese_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

# Build the dictionary
# Alternatively, use load_vocabulary() from keras_bert
# and pass it vocab_path:
# from keras_bert import load_vocabulary
# token_dict = load_vocabulary(vocab_path)
import codecs
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

# Load the pre-trained model
from keras_bert import load_trained_model_from_checkpoint
model = load_trained_model_from_checkpoint(config_path, checkpoint_path)

# Tokenization
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)
text = '语言模型'
tokens = tokenizer.tokenize(text)
# ['[CLS]', '语', '言', '模', '型', '[SEP]']
indices, segments = tokenizer.encode(first=text, max_len=512)
print(indices[:10])
# [101, 6427, 6241, 3563, 1798, 102, 0, 0, 0, 0]
print(segments[:10])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Extract features
import numpy as np

predicts = model.predict([np.array([indices]), np.array([segments])])[0]
for i, token in enumerate(tokens):
    print(token, predicts[i].tolist()[:5])

Next we use the pre-trained model to predict the masked tokens in a sentence.

token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

token_dict_rev = {v: k for k, v in token_dict.items()}

model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=True)

text = '數學是利用符號語言研究數量、結構、變化以及空間等概念的一門學科'
tokens = tokenizer.tokenize(text)
tokens[1] = tokens[2] = '[MASK]'
# ['[CLS]', '[MASK]', '[MASK]', '是', '利',..., '學', '科', '[SEP]']

indices = np.array([[token_dict[token] for token in tokens] + [0] * (512 - len(tokens))])
segments = np.array([[0] * len(tokens) + [0] * (512 - len(tokens))])
masks = np.array([[0, 1, 1] + [0] * (512 - 3)])
predicts = model.predict([indices, segments, masks])[0].argmax(axis=-1).tolist()
print('Fill with: ', list(map(lambda x: token_dict_rev[x], predicts[0][1:3])))
# Fill with:  ['數', '學']
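
The last step above, argmax followed by a reverse lookup, can be seen in isolation on a toy example. This is only an illustrative sketch: fill_masks is a hypothetical helper, and the four-token vocabulary and the probabilities are made up for the demonstration rather than produced by a model.

```python
def fill_masks(probas, mask_positions, id_to_token):
    """Pick the highest-scoring token at each masked position."""
    filled = []
    for pos in mask_positions:
        scores = probas[pos]
        # argmax over the vocabulary axis, in plain Python
        best_id = max(range(len(scores)), key=scores.__getitem__)
        filled.append(id_to_token[best_id])
    return filled


# Made-up vocabulary and per-position probabilities (one row per position)
id_to_token = {0: '[PAD]', 1: '數', 2: '學', 3: '是'}
probas = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
print('Fill with: ', fill_masks(probas, [1, 2], id_to_token))
```

In the real code the rows come from model.predict and id_to_token plays the role of token_dict_rev.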

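ALBERT and bert4keras

Usage examples: https://github.com/bojone/bert4keras/tree/master/examples

Chinese pre-trained ALBERT models: https://github.com/brightmart/albert_zh

Basic Usage

Not all of the code in this article still works with the latest bert4keras; some function names and locations have changed.

For the latest version, see: https://www.cnblogs.com/dogecheng/p/11824494.html

bert4keras is a Keras implementation of BERT rewritten on the basis of keras-bert. It can load ALBERT: just pass albert=True to load_pretrained_model.

Using it feels much like keras_bert. Below is the usage example from the GitHub repository.

SimpleTokenizer is a simple tokenizer that splits text straight into a sequence of single characters. It is designed for Chinese and in principle only suits Chinese models.

load_pretrained_model accepts the following parameters:

config_path: path to the JSON configuration file

checkpoint_file: path to the checkpoint file

with_mlm: whether to include the MLM part. Default False

seq2seq: if True, build a BERT for seq2seq tasks. Default False

keep_words: list of token ids to keep

albert: whether the model is ALBERT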

from bert4keras.bert import load_pretrained_model
from bert4keras.utils import SimpleTokenizer, load_vocab
import numpy as np

config_path = './albert/albert_config_large.json'
checkpoint_path = './albert/albert_model.ckpt'
dict_path = './albert/vocab.txt'

token_dict = load_vocab(dict_path)
tokenizer = SimpleTokenizer(token_dict)
# Load as ALBERT
model = load_pretrained_model(config_path, checkpoint_path, albert=True) 

# Encoding test
token_ids, segment_ids = tokenizer.encode(u'語言模型')
print(model.predict([np.array([token_ids]), np.array([segment_ids])]))

Predicting Masked Tokens

# Build the ALBERT model and load the weights
# Predicting masked tokens requires the MLM layer
model = load_pretrained_model(config_path, checkpoint_path, with_mlm=True, albert=True)

token_ids, segment_ids = tokenizer.encode(u'科學技術是第一生產力')

# Mask out "技術"
token_ids[3] = token_ids[4] = token_dict['[MASK]']

# Use the MLM model to predict the masked part
probas = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
print(tokenizer.decode(probas[3:5].argmax(axis=1))) 
# 技術

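Sentiment Analysis Example

Dataset: https://github.com/bojone/bert4keras/tree/master/examples/datasets

Or download from Baidu Netdisk: link: https://pan.baidu.com/s/1OAhNbRYpU1HW25_vChdRng extraction code: uxax

Test environment:

Ubuntu 16.04.6

Anaconda Python 3.7.3

The dataset consists of two Excel spreadsheets, one holding the positive reviews and one the negative reviews.

First set the paths to the pre-trained model and read the raw data.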

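import json
import os

import numpy as np
import pandas as pd
from bert4keras.bert import load_pretrained_model
from bert4keras.utils import SimpleTokenizer, load_vocab

# Maximum sequence length
maxlen = 100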
config_path = './albert_base_zh/bert_config.json'
checkpoint_path = './albert_base_zh/bert_model.ckpt'
dict_path = './albert_base_zh/vocab.txt'


neg = pd.read_excel('datasets/neg.xls', header=None)
pos = pd.read_excel('datasets/pos.xls', header=None)

Building the Dictionary and the Tokenizer

# Number of occurrences of each character
chars = {}
# Dataset
data = []

for d in neg[0]:
    data.append((d, 0))
    for c in d:
        chars[c] = chars.get(c, 0) + 1

for d in pos[0]:
    data.append((d, 1))
    for c in d:
        chars[c] = chars.get(c, 0) + 1

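# Keep characters that appear at least 4 times
chars = {i: j for i, j in chars.items() if j >= 4}

# Load the full vocabulary
_token_dict = load_vocab(dict_path)
# Build the task dictionary
# token_dict holds only the characters this task actually uses
# keep_words holds their ids in the full vocabulary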
token_dict, keep_words = {}, []

for c in ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[unused1]']:
    token_dict[c] = len(token_dict)
    keep_words.append(_token_dict[c])

for c in chars:
    if c in _token_dict:
        token_dict[c] = len(token_dict)
        keep_words.append(_token_dict[c])

tokenizer = SimpleTokenizer(token_dict)  # build the tokenizer
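
How token_dict and keep_words line up is easier to see on a toy vocabulary. The sketch below restates the loop above as a function; build_task_vocab is a hypothetical helper, and the vocabulary ids and character counts are invented for the illustration.

```python
def build_task_vocab(full_vocab, char_counts, min_count=4,
                     specials=('[PAD]', '[UNK]', '[CLS]', '[SEP]', '[unused1]')):
    """Return (compact token -> new id dict, list of ids in the full vocab)."""
    token_dict, keep_words = {}, []
    for tok in specials:
        token_dict[tok] = len(token_dict)     # new, compact id
        keep_words.append(full_vocab[tok])    # id in the pre-trained vocab
    for ch, count in char_counts.items():
        if count >= min_count and ch in full_vocab:
            token_dict[ch] = len(token_dict)
            keep_words.append(full_vocab[ch])
    return token_dict, keep_words


# Invented ids and counts, for illustration only
full_vocab = {'[PAD]': 0, '[unused1]': 1, '[UNK]': 2, '[CLS]': 3,
              '[SEP]': 4, '好': 5, '差': 6, '的': 7}
char_counts = {'好': 10, '差': 2, '的': 7, '#': 9}

token_dict_small, keep_words_small = build_task_vocab(full_vocab, char_counts)
print(token_dict_small)
print(keep_words_small)
```

Position i of keep_words is the original id of the token whose new id is i, which is what lets the loader keep only those embedding rows.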

Building the Training and Validation Data

if not os.path.exists('./random_order.json'):
    random_order = list(range(len(data)))
    np.random.shuffle(random_order)
    json.dump(
        random_order,
        open('./random_order.json', 'w'),
        indent=4
    )
else:
    random_order = json.load(open('./random_order.json'))

# Split into training and validation sets at a 9:1 ratio
train_data = [data[j] for i, j in enumerate(random_order) if i % 10 != 0]
valid_data = [data[j] for i, j in enumerate(random_order) if i % 10 == 0]

def seq_padding(X, padding=0):
    # Pad sequences with 0
    # so that all input sequences have the same length
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])

class data_generator:
    def __init__(self, data, batch_size=32):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1
    def __len__(self):
        return self.steps
    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                # x1: token ids
                # x2: segment ids
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append([y])
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []


train_D = data_generator(train_data)
valid_D = data_generator(valid_data)
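
The split, the padding and the step count used above can be checked on small inputs without NumPy or a model. This is a plain-Python sketch of the same ideas; split_by_order, pad_batch and batch_count are hypothetical names chosen for the illustration.

```python
def split_by_order(data, order, every=10):
    """Send every `every`-th item of the shuffled order to validation."""
    train = [data[j] for i, j in enumerate(order) if i % every != 0]
    valid = [data[j] for i, j in enumerate(order) if i % every == 0]
    return train, valid


def pad_batch(seqs, padding=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(s) for s in seqs)
    return [list(s) + [padding] * (longest - len(s)) for s in seqs]


def batch_count(n_samples, batch_size=32):
    """Number of batches per epoch, counting a final partial batch."""
    return -(-n_samples // batch_size)  # ceiling division


train, valid = split_by_order(list(range(20)), list(range(20)))
print(len(train), len(valid))
print(pad_batch([[1, 2, 3], [4]]))
print(batch_count(65))
```

Padding is done per batch, so each batch is only as long as its own longest sequence.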

Building and Training the Model

from keras.layers import *
from keras.models import Model
import keras.backend as K
from keras.optimizers import Adam

model = load_pretrained_model(
    config_path,
    checkpoint_path,
    keep_words=keep_words,
    albert=True
)

output = Lambda(lambda x: x[:, 0])(model.output)
output = Dense(1, activation='sigmoid')(output)
model = Model(model.input, output)

model.compile(
    loss='binary_crossentropy',
    optimizer=Adam(1e-5),  # use a sufficiently small learning rate
    # optimizer=PiecewiseLinearLearningRate(Adam(1e-5), {1000: 1e-5, 2000: 6e-5}),
    metrics=['accuracy']
)
model.summary()

model.fit_generator(
    train_D.__iter__(),
    steps_per_epoch=len(train_D),
    epochs=10,
    validation_data=valid_D.__iter__(),
    validation_steps=len(valid_D)
)
