Machine Translation Evaluation Metrics: the BLEU Algorithm

Reposted article. 2019-03-13 15:42. Category: Natural Language Processing

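1. Overview

A commonly used automatic evaluation metric in machine translation is the $BLEU$ algorithm. Beyond machine translation, it is also used in other $seq2seq$ tasks, such as dialogue systems.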

2. The $BLEU$ algorithm in detail

Let the human translation be the $reference$ and the machine translation be the $candidate$.

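1) The earliest $BLEU$ algorithm

The earliest form of $BLEU$ simply counts how many words of the $candidate$ also appear in the $reference$:

$BLEU = \frac{\text{number of candidate words that appear in the reference}}{\text{total number of words in the candidate}}$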

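Consider the following example:

$candidate:$ the the the the the the the

$reference:$ the cat is on the mat

Every word in the $candidate$ appears in the $reference$, so:

$BLEU = \frac {7} {7} = 1$

This result is clearly unreasonable: a meaningless translation gets a perfect score. The problem lies mainly in how the numerator is counted, so the numerator is the part that gets revised.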
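The naive score above can be reproduced with a few lines of Python. This is only a minimal sketch of the formula as stated in this section; the function name `naive_unigram_precision` and the whitespace tokenization are assumptions made for illustration.

```python
def naive_unigram_precision(candidate, reference):
    """Fraction of candidate tokens that appear anywhere in the reference."""
    reference_vocab = set(reference)
    # Every occurrence counts, no matter how often the word repeats.
    matched = sum(1 for token in candidate if token in reference_vocab)
    return matched / len(candidate)


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(naive_unigram_precision(candidate, reference))  # 1.0
```

Because each of the seven occurrences of "the" is counted as a match, the degenerate candidate reaches the maximum score.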

2) Improved $BLEU$: clipped counts in the numerator

To fix the unreasonable result above, the numerator is computed with clipped counts:

$Count_{w_i}^{clip} = \min(Count_{w_i}, Ref\_Count_{w_i})$

where:

$Count_{w_i}$ is the number of times the word $w_i$ appears in the $candidate$;

$Ref\_Count_{w_i}$ is the number of times the word $w_i$ appears in the $reference$.

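In general there may be several $reference$ translations, so:

$Count^{clip} = \max_j(Count_{w_i,j}^{clip}), \quad j=1,2,3,\ldots$

where $j$ indexes the $j$-th $reference$, and $Count_{w_i,j}^{clip}$ is the clipped count of $w_i$ computed against that $reference$.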

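Returning to the example above, the $candidate$ contains only one distinct word, $the$, so only one $Count^{clip}$ needs to be computed. $the$ appears 7 times in the $candidate$ but only twice in the $reference$, so the clipped count is $\min(7, 2) = 2$ and:

$BLEU = \frac {2} {7}$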
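The clipped count with several references can be sketched as follows. This is an illustrative sketch rather than a reference implementation; the function name `clipped_unigram_precision` is an assumption, and sentences are assumed to be already tokenized into word lists.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Unigram precision where each word's count is clipped by the
    largest count of that word in any single reference."""
    candidate_counts = Counter(candidate)
    reference_counts = [Counter(ref) for ref in references]
    clipped_total = 0
    for word, count in candidate_counts.items():
        # Counter returns 0 for words a reference does not contain.
        max_ref_count = max(ref[word] for ref in reference_counts)
        clipped_total += min(count, max_ref_count)
    return clipped_total / len(candidate)


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split()]
print(clipped_unigram_precision(candidate, references))  # 2/7, about 0.2857
```

Taking the minimum against the maximum reference count gives the same value as computing the clipped count per reference and then taking the maximum over references.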

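3) Introducing $n$-grams

Everything so far has been computed on single words, which can be viewed as $1$-grams. $1$-gram precision captures the adequacy of a translation, that is, how well it translates word by word, but it says nothing about fluency. $n$-grams are introduced for this reason, with $n$ usually no larger than 4. With $n$-grams the expression becomes: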

$p_{n}=\frac{\sum_{c \in candidates}\sum_{n\text{-}gram \in c}Count_{clip}(n\text{-}gram)}{\sum_{c' \in candidates}\sum_{n\text{-}gram' \in c'}Count(n\text{-}gram')}$

A system is often evaluated on many $candidate$ sentences at once, which is why the expression sums over a set of $candidates$. $p_{n}$ is the precision of the $n$-grams, where $n$ is the $n$-gram order; for $1$-grams, $n = 1$.

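To unpack the expression, start with the numerator:

1) the first $\sum$ runs over all $candidate$ sentences;

2) the second $\sum$ runs over all $n$-grams in one $candidate$;

3) $Count_{clip}(n\text{-}gram)$ is the clipped count of a given $n$-gram.

In the denominator, the two $\sum$ have the same meaning as in the numerator, and $Count(n\text{-}gram')$ is the count of $n\text{-}gram'$ in the $candidate$.

Put simply, the denominator is the total number of $n$-grams in the $candidate$, and the numerator is the number of those $n$-grams that also appear in the $reference$, with counts clipped.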

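Here is a worked example:

$candidate:$ the cat sat on the mat

$reference:$ the cat is on the mat

The $n$-gram precisions are listed below. For instance, the $candidate$ has five $2$-grams (the cat, cat sat, sat on, on the, the mat), and three of them (the cat, on the, the mat) appear in the $reference$.

$p_1 = \frac {5} {6} = 0.83333$

$p_2 = \frac {3} {5} = 0.6$

$p_3 = \frac {1} {4} = 0.25$

$p_4 = \frac {0} {3} = 0$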
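The four precisions above can be checked with a short script. This is a sketch under stated assumptions: the helper names `ngrams` and `modified_precision` are made up for illustration, sentences are pre-tokenized word lists, and the function returns the numerator and denominator of $p_n$ separately so the fractions are easy to compare with the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidates, references_per_candidate, n):
    """Return (clipped matches, total candidate n-grams) summed over
    all candidate sentences, i.e. the numerator and denominator of p_n."""
    clipped, total = 0, 0
    for candidate, references in zip(candidates, references_per_candidate):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = [Counter(ngrams(ref, n)) for ref in references]
        for gram, count in cand_counts.items():
            # Clip by the largest count of this n-gram in any reference.
            clipped += min(count, max(rc[gram] for rc in ref_counts))
        total += sum(cand_counts.values())
    return clipped, total


candidates = ["the cat sat on the mat".split()]
references = [["the cat is on the mat".split()]]
for n in range(1, 5):
    print(n, modified_precision(candidates, references, n))
# (5, 6), (3, 5), (1, 4), (0, 3) for n = 1..4
```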

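4) Adding a multiplicative factor for sentence length

Very short translations tend to get high $BLEU$ values, because the precisions $p_n$ do not penalize a $candidate$ for leaving words out. A multiplicative factor on sentence length, the brevity penalty $BP$, is therefore introduced:

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$

Here $c$ is the length of the $candidate$ and $r$ is the length of the $reference$.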
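As a small illustration, the standard BLEU brevity penalty equals 1 when $c > r$ and $e^{1 - r/c}$ otherwise. The sketch below assumes that definition; the function name `brevity_penalty` is chosen here for illustration, and $c$ is assumed to be greater than zero.

```python
import math


def brevity_penalty(c, r):
    """Brevity penalty for candidate length c and reference length r.
    Assumes c > 0."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


print(brevity_penalty(6, 6))  # 1.0: equal lengths are not penalized
print(brevity_penalty(3, 6))  # exp(-1), about 0.368
print(brevity_penalty(8, 6))  # 1.0: longer candidates are not penalized
```

A candidate half as long as the reference keeps only about 37% of its precision-based score, which offsets the advantage short outputs would otherwise have.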

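Putting everything together gives the final expression:

$BLEU = BP \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right)$

Here $\sum_{n=1}^N w_n \log p_n$ is the weighted sum of the logarithms of the $n$-gram precisions, so its exponential is the weighted geometric mean of the precisions. $N$ is the largest $n$-gram order used and $w_n$ is the weight given to order $n$.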
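The final expression can be sketched in a few lines, using the precisions from the "the cat sat on the mat" example. The function name `combine_bleu`, the default of uniform weights, and the choice to return 0 when any precision is 0 (the limit of the formula, since $\log 0$ is negative infinity) are assumptions of this sketch.

```python
import math


def combine_bleu(precisions, c, r, weights=None):
    """BP * exp(sum_n w_n * log p_n) for given n-gram precisions,
    candidate length c and reference length r."""
    if weights is None:
        # Uniform weights w_n = 1/N, an assumption of this sketch.
        weights = [1.0 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):
        # log(0) is undefined; the geometric mean tends to 0 here.
        return 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)


# N = 2 with p_1 = 5/6, p_2 = 3/5 and c = r = 6:
print(combine_bleu([5 / 6, 3 / 5], c=6, r=6))  # about 0.7071
# N = 4: p_4 = 0 drives the unsmoothed score to 0.
print(combine_bleu([5 / 6, 3 / 5, 1 / 4, 0 / 3], c=6, r=6))  # 0.0
```

With $N = 2$ the score is $\sqrt{\frac{5}{6} \cdot \frac{3}{5}} = \sqrt{0.5}$. With $N = 4$ a single zero precision wipes out the whole score, which is the situation smoothing is meant to handle.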

3. $NLTK$ implementation

$BLEU$ can be computed directly with the toolkit:

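from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction

# A list of tokenized reference translations (here just one) for a single candidate
reference = [['The', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['The', 'cat', 'sat', 'on', 'the', 'mat']
smooth = SmoothingFunction()  # smoothing function object
# Sentence-level BLEU; the keyword is `weights`, with equal weight on 1-gram to 4-gram precision
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth.method1)
# Corpus-level BLEU: one list of references per candidate, and a list of candidates
corpus_score = corpus_bleu([reference], [candidate], smoothing_function=smooth.method1)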

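$NLTK$ provides two ways to compute $BLEU$; sentence_bleu actually calls corpus_bleu internally. Take care to get the list nesting of the $reference$ and $candidate$ arguments right: sentence_bleu takes a list of references and a single candidate, while corpus_bleu takes one such list of references per candidate together with a list of candidates. The weights parameter sets the weight of each $n$-gram order, and the number of entries in the weights tuple determines how many $n$-gram orders are used; in the example above these are $1$-gram, $2$-gram, $3$-gram and $4$-gram. SmoothingFunction smooths the precisions so that the logarithm does not become negative infinity when $p_n = 0$.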
