Generating Token-Level Vectors with a BERT Model

Reposted article, 2019-08-23 00:53. Tags: tensorflow, Python, deep learning, natural language processing, data mining and machine learning

This article assumes the reader has some background in the Transformer architecture. If not, please study the Transformer and BERT briefly first.

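There are many ways to generate BERT vectors described online. The most representative is bert-as-service, which produces vectors in a few lines of code. What it produces, however, are sentence vectors. In other words, the correct usage is to input a whole sentence (the example below reads "I am Chinese, and I love every city in China."):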

我是一個中國人，我熱愛着中國的每一個城市。

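The output is the vector for that sentence: a single 768-dimensional vector (this is how Google's pretrained model does it). This vector carries contextual information; see the Transformer architecture for details. Some tutorials online, however, use bert-as-service to generate word-level vectors: they input, for example, ['我', '是', '一個', '中國', '人'] and get back five 768-dimensional vectors to use as word vectors. This is wrong! As described above, the service encodes each input as a sentence, so every word in the list is encoded on its own as a separate one-word sentence, and the resulting vectors carry no context from the rest of the sentence. Since the idea itself is wrong, it is no surprise that the results are poor, so in this situation please don't rush to conclude that the BERT pretrained model doesn't work.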
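To see why encoding each word on its own loses context, here is a toy sketch. It is not BERT: it is a single self-attention step with identity projections over made-up 2-dimensional embeddings, and the names `EMB` and `self_attention` are hypothetical, chosen only for illustration. A token encoded alone can only attend to itself, so its output is just its own embedding; the same token encoded inside a sentence receives a mixture that depends on its neighbours.

```python
import math

# Hypothetical 2-d embeddings, chosen only for illustration.
EMB = {"我": [1.0, 0.0], "是": [0.0, 1.0], "中國": [1.0, 1.0]}

def self_attention(tokens):
    """One toy self-attention step: queries, keys and values are the raw embeddings."""
    vecs = [EMB[t] for t in tokens]
    outputs = []
    for q in vecs:
        # Dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in vecs]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, vecs)) for d in range(len(q))]
        outputs.append(out)
    return outputs

alone = self_attention(["中國"])[0]
in_context = self_attention(["我", "是", "中國"])[2]
print(alone)       # the token's own embedding, unchanged
print(in_context)  # a mixture influenced by the other tokens
```

The two printed vectors differ, which is the whole point: a vector produced for an isolated word is not the contextual vector that word would receive inside the sentence.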

For generating token-level vectors with BERT, the following two articles get it fairly right (part of my code is based on the second one):

https://blog.csdn.net/u012526436/article/details/87697242

https://blog.csdn.net/shine19930820/article/details/85730536

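Why token-level vectors? Because the Transformer architecture produces an output sequence of the same length as its input, and because the Chinese pretrained model is trained on text split into individual characters, each token is a single character. For a sentence, we prepend [CLS] and append [SEP]. BERT is pretrained on multiple tasks, one of which takes a sentence pair (text_a, text_b); here we assume text_a is the sentence whose vectors we want and text_b is empty. So, given the input: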

我是一個中國人，我熱愛着中國的每一個城市。

After processing:

[CLS]我是一個中國人，我熱愛着中國的每一個城市。[SEP]
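The preprocessing step above can be sketched in a few lines. This is a minimal illustration, assuming character-level splitting and an empty text_b; `build_inputs` is a hypothetical helper, not a function from the BERT repository.

```python
def build_inputs(text_a):
    """Wrap a sentence with [CLS]/[SEP] and build segment ids for a single-sentence input."""
    tokens = ["[CLS]"] + list(text_a) + ["[SEP]"]
    # With text_b empty, every position belongs to segment 0.
    segment_ids = [0] * len(tokens)
    return tokens, segment_ids

tokens, segment_ids = build_inputs("我是一個中國人")
print(tokens)
print(len(tokens), len(segment_ids))
```

The token list is two longer than the sentence, and the segment ids have the same length as the token list.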

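Usually the output vector of the first token, [CLS] (768 dimensions), is used as the representation of the whole sentence and fed into a fully connected layer and softmax for classification. Here I want to obtain the vector of every token in the sentence and store it for downstream tasks. If I only need the [CLS] vector for classification, I take just the first vector; if I want to run a convolution over all token vectors, I drop the first and last vectors and convolve over the ones in between. Storing this information in a file means the downstream code needs few changes and can use the vectors flexibly.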
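The two ways of consuming the stored vectors can be sketched as follows. This is a toy illustration using nested lists with tiny made-up dimensions in place of the real (N, seq_len, 768) array; the variable names are hypothetical.

```python
# Toy stand-in for the stored embeddings: 2 sentences, 5 positions, 3 dims
# (the real array would be (N, seq_len, 768)).
N, SEQ, DIM = 2, 5, 3
emb = [[[float(n * 100 + t * 10 + d) for d in range(DIM)]
        for t in range(SEQ)]
       for n in range(N)]

# Sentence-level feature: the vector at position 0, i.e. [CLS].
cls_vectors = [sentence[0] for sentence in emb]

# Token-level features: drop the first ([CLS]) and last ([SEP]) positions.
token_vectors = [sentence[1:-1] for sentence in emb]

# With a NumPy array the same selections would be emb[:, 0, :] and emb[:, 1:-1, :].
print(len(cls_vectors), len(cls_vectors[0]))
print(len(token_vectors), len(token_vectors[0]), len(token_vectors[0][0]))
```

Because both selections are plain slices of one stored array, a downstream model can switch between them without regenerating the embeddings.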

Storing an ndarray

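To save the vectors for later use, we need to know how to store an ndarray, because the vectors come back as an ndarray of shape (N, seq_len, 768), where N is the number of sentences, seq_len is the sentence length (including the head and tail tokens), and 768 is the vector dimension. Here I use h5py to store the ndarray; other methods exist as well.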

import h5py
import numpy as np

# a has shape (3, 4, 5)
a = np.array([[[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1]],
              [[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1]],
              [[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1],[1,0.5,1,0.3,-1]]])
print(a.shape)

# Write the array to an HDF5 file as the dataset 'train'
with h5py.File('../downstream/input_c_emb.h5', 'w') as save_file:
    save_file.create_dataset('train', data=a)

# Read it back; [:] loads the whole dataset into memory as an ndarray
with h5py.File('../downstream/input_c_emb.h5', 'r') as open_file:
    data = open_file['train'][:]
print(data)
print(type(data))
print(data.shape)
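Before storing every token vector it is worth estimating how large the array will be. The sketch below assumes float32 values (4 bytes each), 512 positions per sentence and a hypothetical count of 5000 sentences; `embedding_bytes` is an illustrative helper, not part of the project.

```python
def embedding_bytes(num_sentences, seq_len=512, hidden=768, bytes_per_value=4):
    """Size in bytes of a dense (num_sentences, seq_len, hidden) array."""
    return num_sentences * seq_len * hidden * bytes_per_value

size = embedding_bytes(5000)
print(size, "bytes")
print(round(size / 1024 ** 3, 2), "GiB")
```

An array of this size may not fit comfortably in memory. h5py also allows creating a dataset with a fixed shape up front and writing it slice by slice, which may be worth considering for larger corpora.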

Character-Level Tokens

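I want Chinese text to be split strictly one character at a time, rather than having runs of non-Chinese characters kept together (which is Google's logic). For example, for the sentence "我出生於1996年" I want the tokens '我', '出', '生', '於', '1', '9', '9', '6', '年'. This requires writing a tokenizer class of my own, which goes in the tokenization.py file of the bert project.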

class CharTokenizer(object):
  """Runs end-to-end tokenization, one character per token."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in token:
        # Some characters are not in the pretrained vocabulary;
        # replace them with the [UNK] symbol.
        if sub_token not in self.vocab:
          split_tokens.append('[UNK]')
        else:
          split_tokens.append(sub_token)
    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    return convert_tokens_to_ids(self.vocab, tokens)
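The splitting behaviour can be tried without the BERT code base. The sketch below is a simplified stand-in, not the class above: it skips the cleaning and lower-casing done by BasicTokenizer, and both `char_tokenize` and `toy_vocab` are hypothetical names with a made-up miniature vocabulary.

```python
def char_tokenize(text, vocab):
    """Split text into single characters, mapping out-of-vocabulary ones to [UNK]."""
    tokens = []
    for ch in text:
        if ch.isspace():
            continue  # whitespace produces no token
        tokens.append(ch if ch in vocab else "[UNK]")
    return tokens

# Hypothetical miniature vocabulary; a real run would load vocab.txt.
toy_vocab = set("我出生於年0123456789")

print(char_tokenize("我出生於1996年", toy_vocab))
print(char_tokenize("我出生於1996年!", toy_vocab))
```

The digits of 1996 come out as four separate tokens, and the "!" in the second call, which is missing from the toy vocabulary, becomes [UNK].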

Extracting Vectors as Features

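A brief note on inputs and outputs. There are three input files, train.txt, val.txt and test.txt, whose roles are clear from their names (in the script below they appear as train_x_c.txt, val_x_c.txt and test_x_c.txt). Each file contains a series of sentences, one per line; train.txt, for example, has more than 5000 lines, i.e. more than 5000 sentences. Each sentence is a sequence already separated by spaces, for example "我愛中國". The output is a single file, input_c_emb.h5, which holds all the embedding vectors, stored as the datasets train, val and test.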

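The code is reasonably well commented, so I won't go through it in detail.

In the project this code is token_features.py; the project link is given at the end.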

# Extract token features, i.e. the vector of every character; use the [CLS] vector as the sentence vector, or use the per-character vectors
import os
import sys
curPath = os.path.abspath(os.path.dirname(__file__))
rootPath = os.path.split(curPath)[0]
sys.path.append(rootPath)
print(sys.path)
import tensorflow as tf
import tokenization
import modeling
import numpy as np
import h5py



# Configuration
# data_root is the model directory; it can hold the pretrained model or one fine-tuned on a classification task
data_root = '../chinese_wwm_ext_L-12_H-768_A-12/'
bert_config_file = data_root + 'bert_config.json'
bert_config = modeling.BertConfig.from_json_file(bert_config_file)
init_checkpoint = data_root + 'bert_model.ckpt'
bert_vocab_file = data_root + 'vocab.txt'

# Paths of the preprocessed input files
file_input_x_c_train = '../data/legal_domain/train_x_c.txt'
file_input_x_c_val = '../data/legal_domain/val_x_c.txt'
file_input_x_c_test = '../data/legal_domain/test_x_c.txt'

# Path for storing the embeddings
emb_file_dir = '../data/legal_domain/emb.h5'

# graph
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')

BATCH_SIZE = 16
SEQ_LEN = 510


def batch_iter(x, batch_size=64, shuffle=False):
    """Generate batches of data, yielding the sentence inputs one batch at a time."""
    data_len = len(x)
    num_batch = int((data_len - 1) / batch_size) + 1

    if shuffle:
        indices = np.random.permutation(np.arange(data_len))
        x_shuffle = np.array(x)[indices]
    else:
        x_shuffle = x[:]

    word_mask = [[1] * (SEQ_LEN + 2) for i in range(data_len)]
    word_segment_ids = [[0] * (SEQ_LEN + 2) for i in range(data_len)]

    for i in range(num_batch):
        start_id = i * batch_size
        end_id = min((i + 1) * batch_size, data_len)
        yield x_shuffle[start_id:end_id], word_mask[start_id:end_id], word_segment_ids[start_id:end_id]


def read_input(file_dir):
    # Read all the sentences to be converted from the file
    # Every sequence is truncated or padded to a fixed length of 510 here
    # input_list = []
    with open(file_dir, 'r', encoding='utf-8') as f:
        input_list = f.readlines()

    # input_list is a list in which each element is a str holding one input text
    # It now needs to be converted into a list of id lists
    word_id_list = []
    for query in input_list:
        split_tokens = token.tokenize(query)
        if len(split_tokens) > SEQ_LEN:
            split_tokens = split_tokens[:SEQ_LEN]
        else:
            while len(split_tokens) < SEQ_LEN:
                split_tokens.append('[PAD]')
        # ****************************************************
        # Needed when the sentence vector will be used:
        # add a [CLS] head and a [SEP] tail
        tokens = []
        tokens.append("[CLS]")
        for i_token in split_tokens:
            tokens.append(i_token)
        tokens.append("[SEP]")
        # ****************************************************
        word_ids = token.convert_tokens_to_ids(tokens)
        word_id_list.append(word_ids)
    return word_id_list


# Initialize BERT
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False
)

# Load the BERT weights from the checkpoint
tvars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment)
# Get the last layer and the second-to-last layer
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[-2]

# Read the data
token = tokenization.CharTokenizer(vocab_file=bert_vocab_file)

input_train_data = read_input(file_dir=file_input_x_c_train)
input_val_data = read_input(file_dir=file_input_x_c_val)
input_test_data = read_input(file_dir=file_input_x_c_test)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    save_file = h5py.File('../downstream/input_c_emb.h5', 'w')
    emb_train = []
    train_batches = batch_iter(input_train_data, batch_size=BATCH_SIZE, shuffle=False)
    for word_id, mask, segment in train_batches:
        feed_data = {input_ids: word_id, input_mask: mask, segment_ids: segment}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        # print(last2.shape)
        for sub_array in last2:
            emb_train.append(sub_array)
    # Ready to save
    emb_train_array = np.asarray(emb_train)
    save_file.create_dataset('train', data=emb_train_array)

    # val
    emb_val = []
    val_batches = batch_iter(input_val_data, batch_size=BATCH_SIZE, shuffle=False)
    for word_id, mask, segment in val_batches:
        feed_data = {input_ids: word_id, input_mask: mask, segment_ids: segment}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        # print(last2.shape)
        for sub_array in last2:
            emb_val.append(sub_array)
    # Ready to save
    emb_val_array = np.asarray(emb_val)
    save_file.create_dataset('val', data=emb_val_array)

    # test
    emb_test = []
    test_batches = batch_iter(input_test_data, batch_size=BATCH_SIZE, shuffle=False)
    for word_id, mask, segment in test_batches:
        feed_data = {input_ids: word_id, input_mask: mask, segment_ids: segment}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        # print(last2.shape)
        for sub_array in last2:
            emb_test.append(sub_array)
    # Ready to save
    emb_test_array = np.asarray(emb_test)
    save_file.create_dataset('test', data=emb_test_array)

    save_file.close()

    print(emb_train_array.shape)
    print(emb_val_array.shape)
    print(emb_test_array.shape)

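    # The goal here is a downstream CNN task, so the 768-dim embeddings of all tokens are written out
    # The stored shape is (N, max_seq_len + 2, 768)
    # Downstream, drop the head and tail for convolution, or use the head directly for a fully connected layer
    # max_seq_len is fixed at 510 here; adding [CLS] and [SEP] gives 512
    # The (N, 512, 768) ndarray is written to file and read back when needed, so the downstream model can skip its embedding layer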
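One point to be aware of in the script above: `batch_iter` sets `input_mask` to 1 at every position, so the [PAD] positions are treated as real tokens by attention. A padding-aware mask could be derived from the ids as sketched below. This assumes [PAD] has id 0, which should be checked against the vocab.txt in use; `build_input_mask` and `PAD_ID` are hypothetical names, not part of the project.

```python
PAD_ID = 0  # assumption: [PAD] is the first entry in vocab.txt

def build_input_mask(batch_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding, one row per sequence."""
    return [[0 if token_id == pad_id else 1 for token_id in ids]
            for ids in batch_ids]

# Toy ids; the nonzero values are arbitrary stand-ins for real token ids.
batch = [[101, 7, 8, 0, 0, 102],
         [101, 7, 0, 0, 0, 102]]
print(build_input_mask(batch))
```

Whether masking the padding changes the downstream results here is not something the original post reports; the sketch only shows how such a mask would be built.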

Project Link

Click here
