pytorch BiLSTM+CRF代碼詳解重點

本文轉載自查看原文 2019-07-14 20:49 865

一. BILSTM + CRF介紹

https://www.jianshu.com/p/97cb3b6db573

1.介紹

基於神經網絡的方法，在命名實體識別任務中非常流行和普遍。如果你不知道Bi-LSTM和CRF是什么，你只需要記住他們分別是命名實體識別模型中的兩個層。

1.1開始之前

我們假設我們的數據集中有兩類實體——人名和地名，與之相對應在我們的訓練數據集中，有五類標簽：

B-Person， I- Person，B-Organization，I-Organization

假設句子x由五個字符w1,w2,w3,w4,w5組成，其中【w1,w2】為人名類實體，【w3】為地名類實體，其他字符標簽為“O”。

1.2BiLSTM-CRF模型

以下將給出模型的結構：
第一，句子x中的每一個單元都代表着由字嵌入或詞嵌入構成的向量。其中，字嵌入是隨機初始化的，詞嵌入是通過數據訓練得到的。所有的嵌入在訓練過程中都會調整到最優。
第二，這些字或詞嵌入為BiLSTM-CRF模型的輸入，輸出的是句子x中每個單元的標簽。

Bi-LSTM結構圖

盡管一般不需要詳細了解BiLSTM層的原理，但是為了更容易知道CRF層的運行原理，我們需要知道BiLSTM的輸出層。

圖2.Bi-LSTM標簽預測原理圖

如上圖所示，BiLSTM層的輸出為每一個標簽的預測分值，例如，對於單元w0,BiLSTM層輸出的是

1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization) 0.05 (O).

這些分值將作為CRF的輸入。

1.3 如果沒有CRF層會怎樣

你也許已經發現了，即使沒有CRF層，我們也可以訓練一個BiLSTM命名實體識別模型，如圖所示：

圖3.去除CRF的BiLSTM命名實體識別模型

由於BiLSTM的輸出為單元的每一個標簽分值，我們可以挑選分值最高的一個作為該單元的標簽。例如，對於單元w0,“B-Person”有最高分值—— 1.5，因此我們可以挑選“B-Person”作為w0的預測標簽。同理，我們可以得到w1——“I-Person”，w2—— “O” ，w3——“B-Organization”，w4——“O”。
雖然我們可以得到句子x中每個單元的正確標簽，但是我們不能保證標簽每次都是預測正確的。例如，圖4.中的例子，標簽序列是“I-Organization I-Person” and “B-Organization I-Person”，很顯然這是錯誤的。

在這里插入圖片描述

1.4 CRF層能從訓練數據中獲得約束性的規則

CRF層可以為最后預測的標簽添加一些約束來保證預測的標簽是合法的。在訓練數據訓練過程中，這些約束可以通過CRF層自動學習到。
這些約束可以是：
I：句子中第一個詞總是以標簽“B-“ 或 “O”開始，而不是“I-”
II：標簽“B-label1 I-label2 I-label3 I-…”,label1, label2, label3應該屬於同一類實體。例如，“B-Person I-Person” 是合法的序列, 但是“B-Person I-Organization” 是非法標簽序列.
III：標簽序列“O I-label” is 非法的.實體標簽的首個標簽應該是 “B-“ ，而非 “I-“, 換句話說,有效的標簽序列應該是“O B-label”。
有了這些約束，標簽序列預測中非法序列出現的概率將會大大降低。

二. 標簽的score和損失函數的定義

https://zhuanlan.zhihu.com/p/27338210

Bi-LSTM layer的輸出維度是tag size，這就相當於是每個詞 w_i 映射到tag的發射概率值，設Bi-LSTM的輸出矩陣為P，其中P_i,j代表詞w_i映射到tag_j的非歸一化概率。對於CRF來說，我們假定存在一個轉移矩陣A，則A_i,j代表tag_i轉移到tag_j的轉移概率。
對於輸入序列 X 對應的輸出tag序列 y，定義分數為

在這里插入圖片描述

利用Softmax函數，我們為每一個正確的tag序列y定義一個概率值（Y_X代表所有的tag序列，包括不可能出現的）

在這里插入圖片描述

因而在訓練中，我們只需要最大化似然概率p(y|X)即可，這里我們利用對數似然

在這里插入圖片描述

所以我們將損失函數定義為-log(p(y|X))，就可以利用梯度下降法來進行網絡的學習了。
loss function:

在這里插入圖片描述

在對損失函數進行計算的時候，S(X,y)的計算很簡單，而

在這里插入圖片描述

（下面記作logsumexp）的計算稍微復雜一些，因為需要計算每一條可能路徑的分數。這里用一種簡便的方法，對於到詞w _i+1的路徑，可以先把到詞w _i的logsumexp計算出來，因為

在這里插入圖片描述

因此先計算每一步的路徑分數和直接計算全局分數相同，但這樣可以大大減少計算的時間。

三. 對於損失函數的詳細解釋

這篇文章對於理解十分有用

https://blog.csdn.net/cuihuijun1hao/article/details/79405740

舉例說【我愛中國人民】對應標簽【N V N】那這個標簽就是一個完整的路徑，也就對應一個Score值。
接下來我想講的是這個公式:

在這里插入圖片描述

這個公式成立是很顯然的，動筆算一算就知道了，代碼里其實就是用了這個公式的原理。

def _forward_alg(self, feats): # Do the forward algorithm to compute the partition function init_alphas = torch.full((1, self.tagset_size), -10000.) # START_TAG has all of the score. init_alphas[0][self.tag_to_ix[START_TAG]] = 0. # Wrap in a variable so that we will get automatic backprop forward_var = init_alphas # Iterate through the sentence for feat in feats: alphas_t = [] # The forward tensors at this timestep for next_tag in range(self.tagset_size): # broadcast the emission score: it is the same regardless of # the previous tag emit_score = feat[next_tag].view( 1, -1).expand(1, self.tagset_size) # the ith entry of trans_score is the score of transitioning to # next_tag from i trans_score = self.transitions[next_tag].view(1, -1) # The ith entry of next_tag_var is the value for the # edge (i -> next_tag) before we do log-sum-exp next_tag_var = forward_var + trans_score + emit_score # The forward variable for this tag is log-sum-exp of all the # scores. alphas_t.append(log_sum_exp(next_tag_var).view(1)) forward_var = torch.cat(alphas_t).view(1, -1) terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]] alpha = log_sum_exp(terminal_var) return alpha

我們看到有這么一段代碼 :

next_tag_var = forward_var + trans_score + emit_score

我們主要就是來講講他。
首先這個算法的思想是：假設我們要做一個詞性標注的任務，對句子【我愛中華人民】，我們要對這個句子做

在這里插入圖片描述

意思就是對這個句子所有可能的標注，都算出來他們的Score，然后按照指數次冪加起來，再取對數。一般來說取所有可能的標注情況比較復雜，我們這里舉例是長度為三，但是實際過程中，可能比這個要大得多，所以我們需要有一個簡單高效得算法。也就是我們程序中得用得算法，他是這么算得:
先算出【我，愛】可能標注得所有情況，取 log_sum_exp 然后加上轉換到【中國人民】得特征值再加上【中國人民】對應得某個標簽得特征值。其等價於【我，愛，中國人民】所有可能特征值指數次冪相加，然后取對數.
接下來我們來驗證一下是不是這樣

首先我們假設詞性一共只有兩種名詞N 和動詞 V
那么【我，愛】得詞性組合一共有四種 N + N，N + V, V + N, V + V
那么【愛】標注為N時得log_sum_exp 為

在這里插入圖片描述

【愛】標注為 V時的 log_sum_exp為

在這里插入圖片描述

我們的forward列表里就是存在着這兩個值，即：

在這里插入圖片描述

假設【中華人民】得詞性為N,我們按照代碼來寫一下公式,在forward列表對應位置相加就是這樣

在這里插入圖片描述

四. 代碼塊詳細說明：

https://blog.csdn.net/Jason__Liang/article/details/81772632

先說明兩個重要的矩陣:

feats: 發射矩陣(emit score)是sentence 在embedding后,再經過LSTM后得到的矩陣(也就是LSTM的輸出), 維度為11 * 5 (11為sentence 的length，5是標簽數）。這個矩陣表示經過LSTM后sentence的每個word對應的每個labels的得分)。表示發射概率。

self.transitions:轉移矩陣，維度為55，transitions[i][j]表示label j轉移到label i的概率。transtion[i]維度為15,表示每個label轉移到label i的概率。表示概率矩陣

1. def log_sum_exp(vec)

# compute log sum exp in numerically stable way for the forward algorithm def log_sum_exp(vec): #vec是1*5, type是Variable max_score = vec[0, argmax(vec)] # max_score維度是１，　max_score.view(1,-1)維度是１＊１， # max_score.view(1, -1).expand(1, vec.size()[1])的維度是１＊５ max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1]) # 里面先做減法，減去最大值可以避免e的指數次，計算機上溢 return max_score + \ torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

你可能會疑問return 的結果為什么先減去max_score.其實這是一個小技巧，因為一上來就做指數運算可能會引起計算結果溢出，先減去score，經過log_sum_exp后，再把max_score給加上。
其實就等同於：

return torch.log(torch.sum(torch.exp(vec)))

2. def neg_log_likelihood(self, sentence, tags)

如果你完整地把代碼讀完，你會發現neg_log_likelihood()這個函數是loss function.

loss = model.neg_log_likelihood(sentence_in, targets)

我們來分析一下neg_log_likelihood()函數代碼：

    def neg_log_likelihood(self, sentence, tags): # feats: 11*5 經過了LSTM+Linear矩陣后的輸出，之后作為CRF的輸入。 feats = self._get_lstm_features(sentence) forward_score = self._forward_alg(feats) gold_score = self._score_sentence(feats, tags) return forward_score - gold_score

你在這里可能會有疑問：問什么forward_score - gold_score可以作為loss呢。
這里，我們回顧一下我們在上文中說明的loss function函數公式:

在這里插入圖片描述

你就會發現forward_score和gold_score分別表示上述等式右邊的兩項。

3. def _forward_alg(self, feats)：

我們通過上一個函數的分析得知這個函數就是用來求forward_score的,也就是loss function等式右邊的第一項：

在這里插入圖片描述

 # 預測序列的得分 # 只是根據隨機的transitions，前向傳播算出的一個score #用到了動態規划的思想，但因為用的是隨機的轉移矩陣，算出的值很大score>20 def _forward_alg(self, feats): # do the forward algorithm to compute the partition function init_alphas = torch.full((1, self.tagset_size), -10000.) # 1*5 而且全是-10000 # START_TAG has all of the score # 因為start tag是4，所以tensor([[-10000., -10000., -10000., 0., -10000.]])， # 將start的值為零，表示開始進行網絡的傳播， init_alphas[0][self.tag_to_ix[START_TAG]] = 0 # warp in a variable so that we will get automatic backprop forward_var = init_alphas # 初始狀態的forward_var，隨着step t變化 # iterate through the sentence # 會迭代feats的行數次 for feat in feats: #feat的維度是５ 依次把每一行取出來~ alphas_t = [] # the forward tensors at this timestep for next_tag in range(self.tagset_size): #next tag 就是簡單 i，從0到len # broadcast the emission(發射) score: # it is the same regardless of the previous tag # 維度是1*5 LSTM生成的矩陣是emit score emit_score = feat[next_tag].view( 1, -1).expand(1, self.tagset_size) # the i_th entry of trans_score is the score of transitioning # to next_tag from i trans_score = self.transitions[next_tag].view(1, -1) # 維度是1*5 # The ith entry of next_tag_var is the value for the # edge (i -> next_tag) before we do log-sum-exp # 第一次迭代時理解： # trans_score所有其他標簽到Ｂ標簽的概率 # 由lstm運行進入隱層再到輸出層得到標簽Ｂ的概率，emit_score維度是１＊５，5個值是相同的 next_tag_var = forward_var + trans_score + emit_score # The forward variable for this tag is log-sum-exp of all the scores alphas_t.append(log_sum_exp(next_tag_var).view(1)) # 此時的alphas t 是一個長度為5，例如<class 'list'>: # [tensor(0.8259), tensor(2.1739), tensor(1.3526), tensor(-9999.7168), tensor(-0.7102)] forward_var = torch.cat(alphas_t).view(1, -1) #到第（t-1）step時５個標簽的各自分數 # 最后只將最后一個單詞的forward var與轉移 stop tag的概率相加 # tensor([[ 21.1036, 18.8673, 20.7906, -9982.2734, -9980.3135]]) terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]] alpha = log_sum_exp(terminal_var) # alpha是一個0維的tensor return alpha

4. def _score_sentence(self, feats, tags)

由2的函數分析，我們知道這個函數就是求gold_score,即loss function的第二項

# 根據真實的標簽算出的一個score， # 這與上面的def _forward_alg(self, feats)共同之處在於： # 兩者都是用的隨機轉移矩陣算的score # 不同地方在於，上面那個函數算了一個最大可能路徑，但實際上可能不是真實的各個標簽轉移的值 # 例如：真實標簽是N V V,但是因為transitions是隨機的，所以上面的函數得到其實是N N N這樣， # 兩者之間的score就有了差距。而后來的反向傳播，就能夠更新transitions，使得轉移矩陣逼近 #真實的“轉移矩陣” # 得到gold_seq tag的score 即根據真實的label 來計算一個score， # 但是因為轉移矩陣是隨機生成的，故算出來的score不是最理想的值 def _score_sentence(self, feats, tags): #feats 11*5 tag 11 維 # gives the score of a provied tag sequence score = torch.zeros(1) # 將START_TAG的標簽３拼接到tag序列最前面，這樣tag就是12個了 tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags]) for i, feat in enumerate(feats): # self.transitions[tags[i + 1], tags[i]] 實際得到的是從標簽i到標簽i+1的轉移概率 # feat[tags[i+1]], feat是step i 的輸出結果，有５個值， # 對應B, I, E, START_TAG, END_TAG, 取對應標簽的值 # transition【j,i】 就是從i ->j 的轉移概率值 score = score + \ self.transitions[tags[i+1], tags[i]] + feat[tags[i + 1]] score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]] return score

5. def _viterbi_decode(self, feats):

# 維特比解碼， 實際上就是在預測的時候使用了， 輸出得分與路徑值 # 預測序列的得分 def _viterbi_decode(self, feats): backpointers = [] # initialize the viterbi variables in long space init_vvars = torch.full((1, self.tagset_size), -10000.) init_vvars[0][self.tag_to_ix[START_TAG]] = 0 # forward_var at step i holds the viterbi variables for step i-1 forward_var = init_vvars for feat in feats: bptrs_t = [] # holds the backpointers for this step viterbivars_t = [] # holds the viterbi variables for this step for next_tag in range(self.tagset_size): # next-tag_var[i] holds the viterbi variable for tag i # at the previous step, plus the score of transitioning # from tag i to next_tag. # we don't include the emission scores here because the max # does not depend on them(we add them in below) # 其他標簽（B,I,E,Start,End）到標簽next_tag的概率 next_tag_var = forward_var + self.transitions[next_tag] best_tag_id = argmax(next_tag_var) bptrs_t.append(best_tag_id) viterbivars_t.append(next_tag_var[0][best_tag_id].view(1)) # now add in the emssion scores, and assign forward_var to the set # of viterbi variables we just computed # 從step0到step(i-1)時5個序列中每個序列的最大score forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1) backpointers.append(bptrs_t) # bptrs_t有５個元素 # transition to STOP_TAG # 其他標簽到STOP_TAG的轉移概率 terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]] best_tag_id = argmax(terminal_var) path_score = terminal_var[0][best_tag_id] # follow the back pointers to decode the best path best_path = [best_tag_id] for bptrs_t in reversed(backpointers): best_tag_id = bptrs_t[best_tag_id] best_path.append(best_tag_id) # pop off the start tag # we don't want to return that ti the caller start = best_path.pop() assert start == self.tag_to_ix[START_TAG] # Sanity check best_path.reverse() # 把從后向前的路徑正過來 return path_score, best_path

如果對於該函數還沒有太理解，可以參考這篇博客：

https://blog.csdn.net/appleml/article/details/78579675

總結

以上就是我結合了幾篇比較不錯的博客后的總結，歡迎大家提問

作者：17shouuu
鏈接：https://www.jianshu.com/p/566c6faace64
來源：簡書
簡書著作權歸作者所有，任何形式的轉載都請聯系作者獲得授權並注明出處。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 中文NER的那些事兒1. Bert-Bilstm-CRF基線模型詳解&代碼實現 BiLSTM-CRF模型理解命名實體識別 BiLSTM——CRF ALBERT+BiLSTM+CRF實現序列標注 NLP --- 條件隨機場CRF詳解重點特征函數轉移矩陣通俗理解BiLSTM-CRF命名實體識別模型中的CRF層 CRF keras代碼實現基於BERT的中文命名實體識別任務(BERT-BiLSTM-CRF-NER) DL4NLP —— 序列標注：BiLSTM-CRF模型做基於字的中文命名實體識別 crf進行詞性標記詳解

pytorch BiLSTM+CRF代碼詳解 重點

一. BILSTM + CRF介紹

1.介紹

1.1開始之前

1.2BiLSTM-CRF模型

1.3 如果沒有CRF層會怎樣

1.4 CRF層能從訓練數據中獲得約束性的規則

二. 標簽的score和損失函數的定義

三. 對於損失函數的詳細解釋

四. 代碼塊詳細說明：

1. def log_sum_exp(vec)

2. def neg_log_likelihood(self, sentence, tags)

3. def _forward_alg(self, feats)：

4. def _score_sentence(self, feats, tags)

5. def _viterbi_decode(self, feats):

總結

免責聲明！

pytorch BiLSTM+CRF代碼詳解重點