中文字、詞Bert向量生成
利用Bert預訓練模型生成中文的字、詞向量,字向量是直接截取Bert的輸出結果;詞向量則是把詞語中的每個字向量進行累計求平均(畢竟原生Bert是基於字符訓練的),Bert預訓練模型采用的是科大訊飛的chinese_wwm_ext_pytorch,網盤下載地址:
鏈接:https://pan.baidu.com/s/1Tnewi3mbKN2x1XsX5IQl6g
提取碼:9qv7
希望對各位有幫助!!!這里輸出的詞向量和字向量還是靜態的,因為沒有進行fine tuning微調,如果想使用動態詞向量,直接加載Bert預訓練模型后進行fine tuning,可以在Bert后連接其他網絡進行微調。
# coding = utf-8
# Generate static Chinese character / word embeddings from a pretrained BERT.
# Character vectors are taken directly from BERT's output; word vectors are
# the mean of the vectors of the word's characters (the original Chinese BERT
# is character-based).
import jieba
import logging

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

jieba.setLogLevel(logging.INFO)

# Directory containing the pretrained chinese_wwm_ext_pytorch model.
bert_path = "../chinese_wwm_ext_pytorch"
bert = BertModel.from_pretrained(bert_path)
token = BertTokenizer.from_pretrained(bert_path)


def get_data(path, char):
    """Read `path` and return the deduplicated vocabulary in first-seen order.

    char=True  -> individual characters (whitespace stripped)
    char=False -> jieba-segmented words

    Fix: the original tested `word not in words` inside a list comprehension,
    so duplicates occurring within a single sentence were appended twice.
    A `seen` set makes dedup correct and O(1) per lookup.
    """
    words = []
    seen = set()
    with open(path, "r", encoding="utf-8") as f:
        sentences = f.readlines()
    for sent in sentences:
        text = sent.strip().replace(" ", "")
        tokens = list(text) if char else jieba.lcut(text)
        for w in tokens:
            if w and w not in seen:
                seen.add(w)
                words.append(w)
    return words


def _encode(word, max_length):
    """Run `word` through BERT and return hidden states as a numpy array
    of shape (1, max_length, 768). no_grad avoids building an autograd
    graph for pure inference."""
    inputs = token.encode_plus(word,
                               padding="max_length",
                               truncation=True,
                               max_length=max_length,
                               add_special_tokens=True,
                               return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out[0].numpy()


def get_bert_embed(path, char=False):
    """Write word2vec-style text embeddings for the vocabulary found in `path`.

    char=True  -> additionally write one 768-d vector per character to
                  char_embed.txt (hidden state at position 1, after [CLS]).
    Word vectors always go to word_embed.txt, as the mean of the word's
    per-character vectors (positions 1..len(word), skipping [CLS]).

    NOTE(review): files are opened in "a+" mode, so repeated runs append —
    presumably intentional; switch to "w" if each run should overwrite.
    `with` blocks guarantee the files are closed even on exception
    (the original leaked them on any error).
    """
    words = get_data(path, char)
    # Character vectors.
    if char:
        with open("char_embed.txt", "a+", encoding="utf-8") as file_char:
            file_char.write(str(len(words)) + " " + "768" + "\n")
            for word in words:
                out = _encode(word, 10)
                out_str = " ".join("%s" % embed for embed in out[0][1].tolist())
                file_char.write(word + " " + out_str + "\n")
    # Word vectors (mean over the word's character vectors).
    with open("word_embed.txt", "a+", encoding="utf-8") as file_word:
        file_word.write(str(len(words)) + " " + "768" + "\n")
        for word in words:
            out = _encode(word, 50)
            word_len = len(word)
            words_embed = np.zeros(768)  # BERT hidden size
            for i in range(1, word_len + 1):
                words_embed += out[0][i]
            words_embed = words_embed / word_len
            words_embedding = words_embed.tolist()
            result = word + " " + " ".join("%s" % embed for embed in words_embedding) + "\n"
            file_word.write(result)


# char=False generates word vectors; char=True also generates character vectors.
get_bert_embed("text.txt", char=False)
print("Generate Finished!!!")