BLEU, METEOR, ROUGE, and CIDEr: Explanation and Implementation

Reposted article · 2019-12-14 22:35 · NLP

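I. Overview of the Metrics

These four metrics are automatic evaluation metrics for generated text. BLEU and METEOR were designed for machine translation, ROUGE for summarization, and CIDEr for image captioning; all four are also commonly used for other text generation tasks.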

II. How BLEU Works

BLEU was proposed by IBM in 2002. Throughout, the human translation is called the reference and the machine translation is called the candidate.

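1. The earliest BLEU algorithm

The earliest BLEU algorithm simply counts how many words in the candidate also appear in the reference:

$BLEU=\frac{\text{number of candidate words that appear in the reference}}{\text{total number of words in the candidate}}$

Take the following example:

candidate: the the the the the the the

reference: the cat is on the mat

Every word in the candidate appears in the reference, so:

$BLEU=\frac{7}{7}=1$

This result is clearly unreasonable. The problem lies in how the numerator is counted, so the numerator was revised.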
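The failure case above can be reproduced in a few lines. This is a minimal sketch, assuming whitespace tokenization and a non-empty candidate; the function name `naive_unigram_precision` is made up for illustration.

```python
def naive_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that occur anywhere in the reference."""
    cand_tokens = candidate.split()
    ref_vocab = set(reference.split())
    # Every occurrence counts as a hit, no matter how often the word
    # actually appears in the reference.
    hits = sum(1 for tok in cand_tokens if tok in ref_vocab)
    return hits / len(cand_tokens)


score = naive_unigram_precision(
    "the the the the the the the", "the cat is on the mat"
)
print(score)  # 1.0
```

A candidate that just repeats one common word gets a perfect score, which is exactly the weakness the clipped count addresses.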

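2. The improved BLEU algorithm

To fix the unreasonable result above, the numerator is computed with clipped counts:

$BLEU=\frac{\sum_i Count^{clip}_{w_i}}{\text{total number of words in the candidate}}$

$Count^{clip}_{w_i}=\min(Count_{w_i},RefCount_{w_i})$

where:

$Count_{w_i}$ is the number of times the word $w_i$ appears in the candidate;

$RefCount_{w_i}$ is the number of times $w_i$ appears in the reference, and the sum runs over the distinct words $w_i$ of the candidate.

In general there may be several reference sentences, in which case:

$Count^{clip}_{w_i}=\max_j Count^{clip}_{w_i,j},\quad j=1,2,3,\ldots$

where j indexes the j-th reference and $Count^{clip}_{w_i,j}=\min(Count_{w_i},RefCount_{w_i,j})$.

Returning to the example above, the candidate contains only one distinct word, "the", so only one $Count^{clip}$ needs to be computed. "the" appears only twice in the reference, therefore:

$BLEU=\frac{2}{7}$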
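The clipped count can be sketched as follows, assuming whitespace tokenization and at least one reference. The function name `clipped_unigram_precision` is illustrative. Each candidate word is credited at most as often as it appears in the reference that contains it most often.

```python
from collections import Counter
from typing import Sequence


def clipped_unigram_precision(candidate: str, references: Sequence[str]) -> float:
    """Unigram precision with each word's count clipped by the references."""
    cand_counts = Counter(candidate.split())
    ref_counts = [Counter(ref.split()) for ref in references]
    clipped = 0
    for word, count in cand_counts.items():
        # Largest count of this word in any single reference.
        max_ref_count = max(rc[word] for rc in ref_counts)
        clipped += min(count, max_ref_count)
    return clipped / sum(cand_counts.values())


score = clipped_unigram_precision(
    "the the the the the the the", ["the cat is on the mat"]
)
print(score)  # 2/7, about 0.2857
```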

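3. Introducing n-grams

So far everything has been computed over single words. A single word is a 1-gram. 1-gram precision captures the adequacy of a translation, that is, how well individual words are translated, but it says nothing about fluency. n-grams are therefore introduced, usually with n no greater than 4. The resulting expression is:

$p_n=\frac{\sum_{c\in candidates}\sum_{n\text{-}gram\in c}Count_{clip}(n\text{-}gram)}{\sum_{c'\in candidates}\sum_{n\text{-}gram'\in c'}Count(n\text{-}gram')}$

A system is usually evaluated on many candidate sentences, which is why the formula sums over a candidate set, candidates. The n in $p_n$ is the n-gram order and $p_n$ is the n-gram precision; for 1-grams, n=1.

To unpack the formula, start with the numerator:

1) The first $\sum$ runs over all candidates, i.e. over all the sentences.

2) The second $\sum$ runs over all n-grams within one candidate.

3) $Count_{clip}(n\text{-}gram)$ is the clipped count of a given n-gram.

In the denominator the two $\sum$ have the same meaning as in the numerator, and $Count(n\text{-}gram')$ is the count of n-gram′ in the candidate.

Put simply, the denominator is the number of n-grams in the candidates, and the numerator is the number of candidate n-grams that also appear in the reference (after clipping).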

Here is a worked example:

candidate: the cat sat on the mat

reference: the cat is on the mat

The n-gram precisions are:

$p_1=\frac{5}{6}=0.83333$

$p_2=\frac{3}{5}=0.6$

$p_3=\frac{1}{4}=0.25$

$p_4=\frac{0}{3}=0$
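The four precisions above can be checked with a short script. This is a sketch for a single sentence, assuming whitespace tokenization; the helper names `ngrams` and `modified_precision` are illustrative. For a corpus-level $p_n$, the clipped counts and totals would be summed over all candidates before dividing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one sentence."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = [Counter(ngrams(ref.split(), n)) for ref in references]
    clipped = sum(
        min(count, max(rc[gram] for rc in ref_counts))
        for gram, count in cand_counts.items()
    )
    return clipped, sum(cand_counts.values())


results = [
    modified_precision("the cat sat on the mat", ["the cat is on the mat"], n)
    for n in range(1, 5)
]
for n, (clipped, total) in enumerate(results, start=1):
    print(f"p{n} = {clipped}/{total}")
# prints p1 = 5/6, p2 = 3/5, p3 = 1/4, p4 = 0/3
```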

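4. Adding a multiplicative factor for sentence length

Very short translations tend to get high BLEU scores, because precision alone does not penalize missing content. A multiplicative factor for sentence length, the brevity penalty (BP), is therefore introduced:

$BP=\begin{cases}1 & \text{if } c>r\\ e^{1-r/c} & \text{if } c\le r\end{cases}$

Here c is the length of the candidate and r is the length of the reference.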
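A minimal sketch of this length factor, assuming the standard BLEU brevity penalty (1 when c > r, otherwise exp(1 - r/c)). The function name `brevity_penalty` and the handling of an empty candidate are choices made for this example.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and reference length r."""
    if c == 0:
        # Assumption for this sketch: an empty candidate scores zero.
        return 0.0
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


print(brevity_penalty(6, 6))  # 1.0
print(brevity_penalty(3, 6))  # exp(-1), about 0.3679
```

A candidate at least as long as the reference is not penalized, while one half as long has its score multiplied by roughly 0.37.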

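Putting everything together gives the final expression:

$BLEU=BP\cdot\exp(\sum^N_{n=1}w_n\log p_n)$

where $\exp(\sum^N_{n=1}w_n\log p_n)$ is the exponential of the weighted sum of the log n-gram precisions, i.e. their weighted geometric mean.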
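The pieces can be combined into a small sentence-level BLEU function. This is a sketch under several assumptions: one reference, whitespace tokenization, uniform weights $w_n=1/N$, and no smoothing, so the score is 0 as soon as any $p_n$ is 0. The function names are illustrative and this is not the implementation used by any particular library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU with uniform weights, one reference."""
    cand, ref = candidate.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
        if clipped == 0:
            # log(0) is undefined; without smoothing the score is 0.
            return 0.0
        total = sum(cand_counts.values())
        log_sum += (1.0 / max_n) * math.log(clipped / total)
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)


cand = "the cat sat on the mat"
ref = "the cat is on the mat"
print(bleu(cand, ref, max_n=2))  # sqrt(5/6 * 3/5), about 0.7071
print(bleu(cand, ref, max_n=4))  # 0.0 because p_4 = 0
```

The second call shows why unsmoothed BLEU is usually reported at corpus level: a single sentence often has no matching 4-gram, which zeroes the whole score.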

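III. Implementation

GitHub download link: https://github.com/Maluuba/nlg-eval

Place the downloaded files in the project directory, then compute the results with the code below.

The calling format is as follows: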

from nlgeval import NLGEval

nlgeval = NLGEval()
# The model generated three sentences; each one has two reference sentences.
hyp = ['this is the model generated sentence1 which seems good enough', 'this is sentence2 which has been generated by your model', 'this is sentence3 which has been generated by your model']
ref1 = ['this is one reference sentence for sentence1', 'this is a reference sentence for sentence2 which was generated by your model', 'this is a reference sentence for sentence3 which was generated by your model']
ref2 = ['this is one more reference sentence for sentence1', 'this is the second reference sentence for sentence2', 'this is a reference sentence for sentence3 which was generated by your model']
# Each inner list is one set of references, aligned with hyp by position.
lis = [ref1, ref2]
ans = nlgeval.compute_metrics(hyp_list=hyp, ref_list=lis)
# File-based alternative (needs: from nlgeval import compute_metrics):
# res = compute_metrics(hypothesis='nlg-eval-master/examples/hyp.txt',
#                       references=['nlg-eval-master/examples/ref1.txt', 'nlg-eval-master/examples/ref2.txt'])
print(ans)

The output is as follows:

{'Bleu_2': 0.5079613089004589, 'Bleu_3': 0.35035098185199764, 'Bleu_1': 0.6333333333122222, 'Bleu_4': 0.25297649984340986, 'ROUGE_L': 0.5746244363308142, 'CIDEr': 1.496565428735557, 'METEOR': 0.3311277692098822}

Reference: https://www.cnblogs.com/jiangxinyang/p/10523585.html
