Coding Group Notes, Session 4: Joint Word Segmentation and POS Tagging with a CRF


1. Background Knowledge

1.1 What is word segmentation?

  The foundational tasks of NLP fall into three parts: lexical analysis, syntactic analysis, and semantic analysis. Lexical analysis includes tokenization; for Chinese, processing text character by character to recover word boundaries is called word segmentation.

  Example:  我  去  北  京

       S    S    B    E

  Note: S marks a single-character word, B the beginning of a word, and E the end of a word (北京 "Beijing" is one word; 我 = I, 去 = go).
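To make the scheme concrete, here is a minimal sketch (my own illustration, not from the original notes) that converts a segmented sentence into per-character tags; words longer than two characters would also need a middle tag, assumed here to be "M":

    def to_seg_tags(words):
        tags = []
        for w in words:
            if len(w) == 1:
                tags.append("S")                  # single-character word
            else:
                tags.append("B")                  # word beginning
                tags.extend("M" * (len(w) - 2))   # word middle (assumed tag, not in the note above)
                tags.append("E")                  # word end
        return tags

    print(to_seg_tags(["我", "去", "北京"]))   # ['S', 'S', 'B', 'E']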

 

1.2 What is POS tagging?

  Syntactic analysis includes POS tagging (part-of-speech tagging), whose goal is to label each word or phrase in a sentence with tags such as PN and VV.

  Example:         I   can  open  this  can  .

  POS tagging ->   PN  MD   VV    PN    NN   PU

  Note: PN = pronoun, MD = modal verb, VV = verb, NN = noun, PU = punctuation.

 

1.3 What is joint segmentation and POS tagging?

  Joint segmentation and POS tagging performs the two tasks at the same time, inside a single model, which reduces error propagation between them.

  Example:   我    去    北    京

       S-PN   S-VV   B-NN   E-NN

  Note: for more on basic NLP tasks, see my notes from day one of Zhang Yue's summer school.
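Building on the sketch from 1.1 (again my own illustration, not from the original notes), a joint tag simply attaches the word's POS tag to each character's segmentation tag:

    def to_joint_tags(tagged_words):
        # tagged_words: list of (word, pos) pairs
        joint = []
        for word, pos in tagged_words:
            for seg in to_seg_tags([word]):   # reuses the sketch from 1.1
                joint.append(seg + "-" + pos)
        return joint

    print(to_joint_tags([("我", "PN"), ("去", "VV"), ("北京", "NN")]))
    # ['S-PN', 'S-VV', 'B-NN', 'E-NN']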

 

1.4 What is a CRF?

  A conditional random field (CRF) is a statistical model for labeling and segmenting sequential data. In NLP it can be used for sequence labeling tasks.

   Note: for more theoretical background on conditional random fields, see the following:

  A Survey of Conditional Random Fields

  An Easy and Intuitive Introduction to Conditional Random Fields (CRF)

  Introduction to Conditional Random Fields (translation)

  A Brief Introduction to CRFs

 

2. Sequence Labeling with a CRF

2.1 Model Architecture

[Figure: model structure diagram]

At the bottom is the word-embedding layer, the two layers above it are Bi-LSTM layers, and the top layer is the CRF. Computation flows upward from the bottom layer; a minimal skeleton of this stack follows below.
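A hedged skeleton of that architecture (class and variable names and sizes are my own, not the post's code; a single Bi-LSTM layer is shown for brevity, while num_layers=2 would match the description above):

    import torch
    import torch.nn as nn

    class BiLSTMCRFSkeleton(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim, label_size):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)           # bottom: word embeddings
            self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True)
            self.hidden2label = nn.Linear(hidden_dim, label_size)  # emission scores h_i for the CRF

        def forward(self, sentence):                  # sentence: LongTensor of word indices
            x = self.emb(sentence).unsqueeze(1)       # (seq_len, batch=1, emb_dim)
            out, _ = self.lstm(x)                     # (seq_len, 1, hidden_dim)
            return self.hidden2label(out.squeeze(1))  # (seq_len, label_size)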

 

2.2 The CRF Layer

2.2.1 Theory

Point 1: In a CRF, each feature function takes the following as input and outputs a real value:

(1) a sentence s

(2) the position i of a word in the sentence

(3) the tag of the current word

(4) the tag of the previous word

Note: by restricting features to depend only on the current and previous tags, rather than on arbitrary tags elsewhere in the sentence, we are in fact building a special linear-chain CRF rather than a fully general CRF. A toy example of such a feature function follows below.
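For instance, a hand-written feature function might look like this (a hypothetical illustration of the signature above; in this post the scores instead come from the Bi-LSTM and the learned transition matrix T, not hand-written features):

    def f_vv_after_pn(s, i, curr_tag, prev_tag):
        # fires when a verb directly follows a pronoun, e.g. 我/PN 去/VV
        return 1.0 if prev_tag == "PN" and curr_tag == "VV" else 0.0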

 

Point 2: The quantities involved in training the CRF

(1) Input: x = {我, 去, 北京}

(2) Answer: ygold = {PN, VV, NN}

(3) y' ranges over all label sequences the CRF could assign; here there are 3*3*3 = 27 of them;

(4) The matrix T stores the transition scores: T[yi, yi-1] is the score of the current tag being yi given that the previous tag is yi-1;

(5) hi is the vector for position i in the sequence produced by the Bi-LSTM; hi[yi] is the emission score for tagging position i as yi;

(6) score(x, y) is the score the model assigns to labeling x with y, a real value:

$\mathrm{score}(x, y) = h_1[y_1] + \sum_{i=2}^{n} \big( T[y_i, y_{i-1}] + h_i[y_i] \big)$

  Example: 我  去  北京

           PN   VV   NN

$\mathrm{score}(x, y_{gold}) = h_1[\mathrm{PN}] + T[\mathrm{VV}, \mathrm{PN}] + h_2[\mathrm{VV}] + T[\mathrm{NN}, \mathrm{VV}] + h_3[\mathrm{NN}]$

(7) P(ygold|x) is the probability the model assigns to the gold labeling ygold of x:

$P(y_{gold} \mid x) = \dfrac{\exp(\mathrm{score}(x, y_{gold}))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$
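To make definitions (3) through (7) concrete, here is a brute-force sketch (my own code with assumed shapes, using the modern torch.logsumexp API; the model itself computes the denominator with the forward algorithm shown later instead of enumerating all 27 sequences):

    import itertools
    import torch

    n, k = 3, 3                      # sentence length, label-set size
    h = torch.randn(n, k)            # emission scores h_i from the Bi-LSTM
    T = torch.randn(k, k)            # T[curr, prev] transition scores

    def score(y):
        s = h[0, y[0]]               # first position: emission only
        for i in range(1, n):
            s = s + T[y[i], y[i - 1]] + h[i, y[i]]
        return s

    gold = [0, 1, 2]                 # e.g. PN, VV, NN as label indices
    all_scores = torch.stack([score(list(y))
                              for y in itertools.product(range(k), repeat=n)])
    p_gold = torch.exp(score(gold) - torch.logsumexp(all_scores, dim=0))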

 

Point 3: The training objective: train the model so that P(ygold|x) becomes as large as possible

Step 1: Transform P(ygold|x) by taking the logarithm:

$\log P(y_{gold} \mid x) = \mathrm{score}(x, y_{gold}) - \log \sum_{y'} \exp(\mathrm{score}(x, y'))$

Step 2: The final objective function, minimized with gradient descent, is the negative log-likelihood:

$\mathrm{loss} = -\log P(y_{gold} \mid x) = \log \sum_{y'} \exp(\mathrm{score}(x, y')) - \mathrm{score}(x, y_{gold})$

Step 3: Implementation

    def _forward_alg(self, feats):
        # do the forward algorithm to compute the partition function
        init_alphas = torch.Tensor(1, self.labelSize).fill_(0)
        # wrap in a variable so that we will get automatic backprop
        forward_var = autograd.Variable(init_alphas)

        # iterate through the sentence
        for idx in range(len(feats)):
            feat = feats[idx]
            alphas_t = []           # the forward variables at this timestep
            for next_tag in range(self.labelSize):
                if idx == 0:
                    # first position: emission score only, no transition
                    alphas_t.append(feat[next_tag].view(1, -1))
                else:
                    # broadcast the emission score: it is the same regardless of the previous tag
                    emit_score = feat[next_tag].view(1, -1).expand(1, self.labelSize)
                    # the ith entry of trans_score is the score of transitioning to next_tag from i
                    trans_score = self.T[next_tag]
                    # the ith entry of next_tag_var is the value for the edge (i -> next_tag) before we do log-sum-exp
                    next_tag_var = forward_var + trans_score + emit_score
                    # the forward variable for this tag is log-sum-exp of all the scores
                    alphas_t.append(self.log_sum_exp(next_tag_var))
            forward_var = torch.cat(alphas_t).view(1, -1)
        alpha_score = self.log_sum_exp(forward_var)
        return alpha_score
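The returned alpha_score is exactly the log-sum-exp over all label sequences, i.e. the term log Σy' exp(score(x, y')) from Step 1, computed in O(n·k²) time rather than by enumerating every sequence.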
    # compute log-sum-exp in a numerically stable way for the forward algorithm
    def log_sum_exp(self, vec):
        max_score = vec[0, self.argmax(vec)]
        max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
        return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
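Why subtract the maximum before exponentiating? A quick check (my own snippet, not from the post):

    import torch

    vec = torch.Tensor([[1000.0, 1000.0]])
    naive = torch.log(torch.sum(torch.exp(vec)))                     # exp(1000) overflows to inf
    stable = 1000.0 + torch.log(torch.sum(torch.exp(vec - 1000.0)))  # 1000 + log(2) ≈ 1000.6931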

     

    def neg_log_likelihood(self, feats, tags):
        forward_score = self._forward_alg(feats)        # log-sum-exp over all label sequences (the denominator)
        gold_score = self._score_sentence(feats, tags)  # score(x, ygold)
        return forward_score - gold_score               # the loss from Step 2

The training part of train():

        for iter in range(self.hyperParams.maxIter):
            print('###Iteration' + str(iter) + "###")
            random.shuffle(indexes)
            for idx in range(len(trainExamples)):
                # Step 1. PyTorch accumulates gradients, so clear them before each instance
                self.model.zero_grad()
                # Step 2. Reset the LSTM hidden state for the new instance
                self.model.LSTMHidden = self.model.init_hidden()
                exam = trainExamples[indexes[idx]]
                # Step 3. Run the forward pass, compute the loss and gradients,
                # and update the parameters with optimizer.step()
                lstm_feats = self.model(exam.feat)
                loss = self.model.crf.neg_log_likelihood(lstm_feats, exam.labelIndexs)
                loss.backward()
                optimizer.step()
                if (idx + 1) % self.hyperParams.verboseIter == 0:
                    print('current: ', idx + 1, ", cost:", loss.data[0])

 

Point 4: Predicting a sequence with the model

We use the Viterbi decoding algorithm, which solves the optimal-path problem over the lattice (fence graph) of all tag sequences.
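Written as a recurrence (this is what the code below maintains), the Viterbi variable δ_i(y) for tag y at position i is:

$\delta_1(y) = h_1[y], \qquad \delta_i(y) = \max_{y'} \big( \delta_{i-1}(y') + T[y, y'] + h_i[y] \big)$

with a back pointer at (i, y) recording the maximizing y'.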

 

Step 1: The initial node has no transition score.

 

                if idx == 0:
                    viterbi_var.append(feat[next_tag].view(1, -1))

 

Step 2: Each node's value is the sum of three parts (the previous position's Viterbi variable, the transition score, and the emission score); taking the maximum yields the index of the best previous label.

 

            for next_tag in range(self.labelSize):
                if idx == 0:
                    viterbi_var.append(feat[next_tag].view(1, -1))
                else:
                    emit_score = feat[next_tag].view(1, -1).expand(1, self.labelSize)
                    trans_score = self.T[next_tag]
                    next_tag_var = forward_var + trans_score + emit_score
                    best_label_id = self.argmax(next_tag_var)
                    bptrs_t.append(best_label_id)
                    viterbi_var.append(next_tag_var[0][best_label_id])

 

Step 3: After all nodes are computed, compare the values at the last word, take the maximum, and then follow the back pointers to recover the best sequence.

 

 

Full Viterbi decoding implementation for prediction:

    def _viterbi_decode(self, feats):
        init_score = torch.Tensor(1, self.labelSize).fill_(0)
        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = autograd.Variable(init_score)
        back = []
        for idx in range(len(feats)):
            feat = feats[idx]
            bptrs_t = []                        # holds the backpointers for this step
            viterbi_var = []                    # holds the viterbi variables for this step
            for next_tag in range(self.labelSize):
                if idx == 0:
                    # first position: emission score only, no transition
                    viterbi_var.append(feat[next_tag].view(1, -1))
                else:
                    # next_tag_var[i] = previous viterbi variable for tag i
                    # + score of transitioning from tag i to next_tag
                    # + emission score of next_tag (constant over i, so it does not affect the argmax)
                    emit_score = feat[next_tag].view(1, -1).expand(1, self.labelSize)
                    trans_score = self.T[next_tag]
                    next_tag_var = forward_var + trans_score + emit_score
                    best_label_id = self.argmax(next_tag_var)
                    bptrs_t.append(best_label_id)
                    viterbi_var.append(next_tag_var[0][best_label_id])
            # assign forward_var to the set of viterbi variables we just computed
            forward_var = (torch.cat(viterbi_var)).view(1, -1)
            if idx > 0:
                back.append(bptrs_t)
        best_label_id = self.argmax(forward_var)
        # follow the back pointers to decode the best path
        best_path = [best_label_id]
        path_score = forward_var[0][best_label_id]
        for bptrs_t in reversed(back):
            best_label_id = bptrs_t[best_label_id]
            best_path.append(best_label_id)
        best_path.reverse()
        return path_score, best_path

 

The prediction part of train():

            # check predictions after training
            eval_dev = Eval()
            for idx in range(len(devExamples)):
                predictLabels = self.predict(devExamples[idx])
                devInsts[idx].evalPRF(predictLabels, eval_dev)
            print('Dev: ', end="")
            eval_dev.getFscore()

            eval_test = Eval()
            for idx in range(len(testExamples)):
                predictLabels = self.predict(testExamples[idx])
                testInsts[idx].evalPRF(predictLabels, eval_test)
            print('Test: ', end="")
            eval_test.getFscore()

    def predict(self, exam):
        tag_hiddens = self.model(exam.feat)
        _, best_path = self.model.crf._viterbi_decode(tag_hiddens)
        predictLabels = []
        for idx in range(len(best_path)):
            predictLabels.append(self.hyperParams.labelAlpha.from_id(best_path[idx]))
        return predictLabels

 

Point 5: Accuracy is measured with the F1 score; the best value is 1 and the worst is 0.

$\mathrm{precision} = \frac{\mathrm{correct}}{\mathrm{predicted}}, \qquad \mathrm{recall} = \frac{\mathrm{correct}}{\mathrm{gold}}, \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

    def getFscore(self):
        if self.predict_num == 0:
            self.precision = 0
        else:
            self.precision = self.correct_num / self.predict_num

        if self.gold_num == 0:
            self.recall = 0
        else:
            self.recall = self.correct_num / self.gold_num

        if self.precision + self.recall == 0:
            self.fscore = 0
        else:
            self.fscore = 2 * (self.precision * self.recall) / (self.precision + self.recall)
        print("precision: ", self.precision, ", recall: ", self.recall, ", fscore: ", self.fscore)

 Note: link to the full code and comments.

Extension: the second column of the data could be fed into the Bi-LSTM together with the first column for feature extraction; this experiment used only the first and third columns.
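A hedged sketch of that extension (vocabulary sizes and names are assumptions of mine, not the post's code): embed the two columns separately, concatenate the embeddings per position, and feed the result to the Bi-LSTM.

    import torch
    import torch.nn as nn

    char_emb = nn.Embedding(5000, 100)    # column 1 vocabulary (assumed size)
    extra_emb = nn.Embedding(300, 50)     # column 2 vocabulary (assumed size)
    lstm = nn.LSTM(150, 100, bidirectional=True)

    chars = torch.LongTensor([3, 17, 42])     # toy index sequences
    extras = torch.LongTensor([1, 0, 2])
    x = torch.cat([char_emb(chars), extra_emb(extras)], dim=-1)  # (3, 150)
    out, _ = lstm(x.unsqueeze(1))             # (3, 1, 200): features for the CRF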

 

