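BLEU stands for Bilingual Evaluation Understudy. It was proposed in 2002 as a method for evaluating machine translation output, and it is simple, quick to compute, and easy to understand. Because it works reasonably well, it has been widely carried over to all kinds of evaluation tasks in natural language processing. A few proverbs sum it up: when there is no tiger on the mountain, the monkey calls itself king; in an age without heroes, a nobody makes his name; when Shu has no great generals, Liao Hua leads the vanguard.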
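Problem description
First, let's build an intuition for the BLEU algorithm.
There are two kinds of problem:
1. Given one sentence and a set of reference sentences, compute the BLEU score. This is called sentence_bleu.
2. Given a batch of sentences and a set of reference sentences for each of them, compute the BLEU score. This is called corpus_bleu.
The sentence produced by machine translation is called the candidate; the set of sentences it is compared against is called the references.
The score measures the overlap between the candidate and the references: the more they have in common, the better the translation.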
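Computing BLEU for one sentence and a set of references
BLEU considers four n-gram orders, n = 1, 2, 3, 4, and each order can be given its own weight.
For a given n:
- Split the candidate and each reference into n-grams.
- Count how often each n-gram occurs in the candidate and in each reference.
- Clip the count of each n-gram in the candidate so that it never exceeds that n-gram's maximum count in any single reference.
This step deals with candidates such as "the the the the the": without clipping, every "the" would count as a match and the score would be 1. Well-formed references are used to constrain such abnormal candidates.
- Divide the sum of the clipped counts by the total number of n-grams in the candidate. The result is the score for this n.
- Multiply the score by the brevity penalty to get the final BLEU score. The penalty is 1 if the candidate is at least as long as the closest reference, and e^(1 - r/c) otherwise, where c is the candidate length and r is the length of the closest reference.
This step deals with short sentences. For example, if the candidate is the single word "the" and "the" appears in the references, the score would be 1. In other words, some people stay silent for fear of saying something wrong.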
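The two bad cases above can be checked with a few lines of Python. This is a minimal sketch rather than the NLTK implementation: `clipped_precision` and `brevity_penalty` are hypothetical helper names, the sentences are assumed to be split on whitespace, and the reference sentences are made up for illustration.

```python
import math
from collections import Counter


def clipped_precision(references, candidate):
    """Return (clipped matches, total unigrams) for a tokenized candidate."""
    cand_counts = Counter(candidate)
    ref_counts = [Counter(ref) for ref in references]
    # Each word's count is capped by its maximum count in any single reference.
    clipped = sum(
        min(count, max(rc[word] for rc in ref_counts))
        for word, count in cand_counts.items()
    )
    return clipped, len(candidate)


def brevity_penalty(reference_len, candidate_len):
    """1 if the candidate is long enough, exp(1 - r/c) otherwise."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

repeated = "the the the the the the the".split()
print(clipped_precision(references, repeated))  # (2, 7): "the" occurs at most twice in a reference

short = ["the"]
print(clipped_precision(references, short))     # (1, 1): precision alone would be perfect
print(brevity_penalty(6, len(short)))           # exp(-5), about 0.0067
```

Clipping brings the repeated candidate down from 7/7 to 2/7, and the brevity penalty shrinks the one-word candidate's perfect precision to almost nothing.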
BLEU was not built in a day. Many people kept finding loopholes in it and proposing fixes. Its history is a good lesson in how to design rules that deal with bad cases.
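Finally, the scores for the different n-gram orders (for example 1-gram, 2-gram, and 3-gram) should be combined with a weighted geometric mean, s1^w1 * s2^w2 * s3^w3, rather than a weighted arithmetic mean, w1*s1 + w2*s2 + w3*s3.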
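To see what difference the choice of mean makes, here is a small sketch. The precision values are made up for illustration, and `geometric_mean` and `arithmetic_mean` are hypothetical helper names.

```python
import math


def geometric_mean(scores, weights):
    """Weighted geometric mean: s1**w1 * s2**w2 * ..."""
    return math.prod(s ** w for s, w in zip(scores, weights))


def arithmetic_mean(scores, weights):
    """Weighted arithmetic mean: w1*s1 + w2*s2 + ..."""
    return sum(w * s for s, w in zip(scores, weights))


weights = [1 / 3, 1 / 3, 1 / 3]
scores = [0.9, 0.5, 0.0]  # hypothetical 1-gram, 2-gram, 3-gram precisions

print(geometric_mean(scores, weights))   # 0.0: a single zero precision zeroes the score
print(arithmetic_mean(scores, weights))  # about 0.467
```

With the geometric mean, a candidate that matches no 3-gram at all cannot be rescued by a high 1-gram score.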
from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references, candidate):
    # Brevity penalty: penalizes candidates shorter than the closest reference
    ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
    if len(references[ind]) < len(candidate):
        return 1
    # exp(1 - r/c), with r the closest reference length and c the candidate length
    scale = 1 - (len(references[ind]) / len(candidate))
    return np.e ** scale


def parse_ngram(sentence, gram):
    # Split a sentence into n-grams
    # Note the + 1, otherwise the last n-gram is dropped
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def sentence_bleu(references, candidate, weight):
    bp_value = bp(references, candidate)
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        # Split into n-grams
        ref = [parse_ngram(i, gram) for i in references]
        can = parse_ngram(candidate, gram)
        # Count n-gram occurrences
        ref_counter = [Counter(i) for i in ref]
        can_counter = Counter(can)
        # Clip each candidate n-gram count by its maximum count in any reference
        appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
        score = appear / len(can)
        # Each n-gram order has its own weight
        s *= score ** wei
    s *= bp_value  # The final score is multiplied by the brevity penalty
    return s


# Plain strings are treated as sequences of characters,
# so the n-grams here are character n-grams
references = [
    "the dog jumps high",
    "the cat runs fast",
    "dog and cats are good friends"
]
candidate = "the d o g jump s hig"
weights = [0.25, 0.25, 0.25, 0.25]
print(sentence_bleu(references, candidate, weights))
print(bleu_score.sentence_bleu(references, candidate, weights))
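One thing to keep in mind about the example above: the references and the candidate are plain strings, so slicing them yields character n-grams rather than word n-grams. The sketch below shows the difference. `ngrams_as_tuples` is a hypothetical helper, not part of the code above or of NLTK; it returns tuples so that word n-grams stay hashable and can be counted with Counter.

```python
from collections import Counter


def ngrams_as_tuples(sequence, gram):
    """Return all contiguous n-grams of a string or token list as tuples."""
    return [tuple(sequence[i:i + gram]) for i in range(len(sequence) - gram + 1)]


text = "the dog"

char_bigrams = ngrams_as_tuples(text, 2)          # n-grams over characters
word_bigrams = ngrams_as_tuples(text.split(), 2)  # n-grams over words

print(len(char_bigrams))  # 6
print(word_bigrams)       # [('the', 'dog')]
print(Counter(word_bigrams))
```

For word-level BLEU, split each sentence into tokens before counting.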
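A corpus consists of many sentences. Computing corpus_bleu is not a matter of averaging sentence_bleu; it uses a slightly more involved calculation, one that could fairly be called a somewhat arbitrary flight of fancy.
corpus_bleu
Suppose a document contains 3 sentences whose scores for a given n are a1/b1, a2/b2, and a3/b3.
The score for all the sentences together is then (a1+a2+a3)/(b1+b2+b3).
The brevity penalty is pooled in the same way: if the three candidates have lengths l1, l2, l3 and their closest references have lengths k1, k2, k3, the penalty is bp(k1+k2+k3, l1+l2+l3), with the total reference length as the first argument and the total candidate length as the second.
In other words, corpus_bleu is not simply the mean of sentence_bleu; it pools the counts of all sentences into one unified calculation.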
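The difference between pooling and averaging is easy to see with numbers. The counts below are made up for illustration; exact fractions are used so the two results can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical (matched, total) n-gram counts for three sentences.
counts = [(1, 2), (3, 10), (4, 4)]

# corpus_bleu style: sum the numerators and denominators first, then divide.
pooled = Fraction(sum(a for a, _ in counts), sum(b for _, b in counts))

# Naive alternative: average the per-sentence ratios.
averaged = sum(Fraction(a, b) for a, b in counts) / len(counts)

print(pooled)    # 1/2
print(averaged)  # 3/5
```

Pooling weights each sentence by its number of n-grams, so long sentences count for more than short ones.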
from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # Brevity penalty computed from the total lengths
    if references_len < candidate_len:
        return 1
    # exp(1 - r/c), with r the reference length and c the candidate length
    scale = 1 - (references_len / candidate_len)
    return np.e ** scale


def parse_ngram(sentence, gram):
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = 0
    for candidate, references in zip(candidate_list, references_list):
        # Use the reference whose length is closest to the candidate's
        ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
        reference_len += len(references[ind])
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator
        down = 0  # denominator
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            # Split into n-grams
            ref = [parse_ngram(i, gram) for i in references]
            can = parse_ngram(candidate, gram)
            # Count n-gram occurrences
            ref_counter = [Counter(i) for i in ref]
            can_counter = Counter(can)
            # Clip each candidate n-gram count by its maximum count in any reference
            appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
            up += appear
            down += len(can)
        s *= (up / down) ** wei
    return bp(reference_len, candidate_len) * s


references = [
    [
        "the dog jumps high",
        "the cat runs fast",
        "dog and cats are good friends"],
    [
        "ba ga ya",
        "lu ha a df",
    ]
]
candidate = ["the d o g jump s hig", 'it is too bad']
weights = [0.25, 0.25, 0.25, 0.25]
print(corpus_bleu(references, candidate, weights))
print(bleu_score.corpus_bleu(references, candidate, weights))
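If you are using NLTK 3.2, which was released in March 2016, there is a bug in its corpus_bleu calculation; NLTK fixed it in October 2016. To sum the per-sentence scores, the NLTK code used Fraction, and Fraction automatically reduces the numerator and denominator, so the summed result came out wrong.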
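The effect of automatic reduction can be reproduced with Python's standard fractions.Fraction. The sketch below only illustrates the kind of error described above; it is not NLTK's code, and the counts are made up.

```python
from fractions import Fraction

# Hypothetical per-sentence (matched, total) n-gram counts.
counts = [(2, 4), (1, 3)]

# Correct pooling works on the raw counts: (2 + 1) / (4 + 3).
correct = Fraction(sum(a for a, _ in counts), sum(b for _, b in counts))

# Fraction(2, 4) is stored as 1/2, so reading the numerator and
# denominator back afterwards no longer gives the raw counts.
reduced = [Fraction(a, b) for a, b in counts]
wrong = Fraction(sum(f.numerator for f in reduced), sum(f.denominator for f in reduced))

print(correct)  # 3/7
print(wrong)    # 2/5
```

Keeping the matched and total counts as plain integers until the final division avoids the problem.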
Simplified code
Many of the steps in computing sentence_bleu and corpus_bleu are similar and can be merged. The condensed code is as follows:
from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # Brevity penalty: exp(1 - r/c) when the candidate is shorter than the reference, else 1
    return np.e ** (1 - (references_len / candidate_len)) if references_len > candidate_len else 1


def nearest_len(references, candidate):
    # Length of the reference whose length is closest to the candidate's
    return len(references[np.argmin([abs(len(i) - len(candidate)) for i in references])])


def parse_ngram(sentence, gram):
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def appear_count(references, candidate, gram):
    ref = [parse_ngram(i, gram) for i in references]
    can = parse_ngram(candidate, gram)
    # Count n-gram occurrences
    ref_counter = [Counter(i) for i in ref]
    can_counter = Counter(can)
    # Clip each candidate n-gram count by its maximum count in any reference
    appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
    return appear, len(can)


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = sum(nearest_len(references, candidate) for candidate, references in zip(candidate_list, references_list))
    bp_value = bp(reference_len, candidate_len)
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator
        down = 0  # denominator
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            appear, total = appear_count(references, candidate, gram)
            up += appear
            down += total
        s *= (up / down) ** wei
    return bp_value * s


def sentence_bleu(references, candidate, weight):
    bp_value = bp(nearest_len(references, candidate), len(candidate))
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        appear, total = appear_count(references, candidate, gram)
        score = appear / total
        # Each n-gram order has its own weight
        s *= score ** wei
    # The final score is multiplied by the brevity penalty
    return s * bp_value


if __name__ == '__main__':
    references = [
        [
            "the dog jumps high",
            "the cat runs fast",
            "dog and cats are good friends"],
        [
            "ba ga ya",
            "lu ha a df",
        ]
    ]
    candidate = ["the d o g jump s hig", 'it is too bad']
    weights = [0.25, 0.25, 0.25, 0.25]
    print(corpus_bleu(references, candidate, weights))
    print(bleu_score.corpus_bleu(references, candidate, weights))
    print(sentence_bleu(references[0], candidate[0], weights))
    print(bleu_score.sentence_bleu(references[0], candidate[0], weights))
References
https://cloud.tencent.com/developer/article/1042161
https://en.wikipedia.org/wiki/BLEU
https://blog.csdn.net/qq_31584157/article/details/77709454
https://www.jianshu.com/p/15c22fadcba5