Understanding BLEU


BLEU, short for Bilingual Evaluation Understudy, is a method proposed in 2002 for evaluating machine translation quality. It is simple, plain, quick, and easy to understand. Because its results are passable, it has been widely carried over to all kinds of evaluation tasks in natural language processing. You could say of it: in the land of the blind, the one-eyed man is king.

Problem description

First, let's build an intuitive picture of the BLEU algorithm.
There are two kinds of problems:
1. Given one sentence and one set of reference sentences, compute the BLEU score; this problem is called sentence_bleu.
2. Given many sentences, each with its own set of reference sentences, compute the BLEU score; this problem is called corpus_bleu.

The sentence produced by the machine translation system is called the candidate, and the set of reference sentences is called the references.
The computation boils down to measuring the overlap between the candidate and the references: the more they have in common, the better the translation.

Computing the BLEU score given one sentence and a set of references

BLEU considers n-grams of orders 1, 2, 3, and 4, and each order can be assigned a weight.

For a given n-gram order:

  • Split the candidate and each reference into n-grams.
  • Count how many times each word occurs in the candidate and in the references.
  • For each word in the candidate, its count must not exceed the maximum count of that word in any single reference.
    This step reins in candidates of the form "the the the the the": "the" occurs so many times in the candidate that the score would otherwise be 1. Normal references are used to constrain such abnormal candidates; the sketch just after this list makes the clipping concrete.
  • The sum of the (clipped) counts of the candidate's words, divided by the total number of words, is the score.
  • Multiplying the score by the brevity penalty gives the final BLEU score.
    This step reins in short sentences: a candidate consisting of the single word "the", which does occur in the references, would otherwise score 1. In other words, some people stay silent for fear of saying something wrong.
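
To make the clipping step concrete, here is a minimal word-level sketch (toy data, not from the original paper):

from collections import Counter

candidate = "the the the the the".split()
reference = "the cat is on the mat".split()
cand_counter = Counter(candidate)   # {'the': 5}
ref_counter = Counter(reference)    # 'the' occurs twice in the reference
clipped = {w: min(c, ref_counter.get(w, 0)) for w, c in cand_counter.items()}
print(sum(clipped.values()) / len(candidate))  # 2/5 = 0.4, instead of an unclipped 5/5 = 1.0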

BLEU did not take shape all at once: many people kept finding loopholes in it and proposing fixes. From its evolution we can learn how to design rules that tame bad cases.

Finally, the per-order scores s1, s2, s3 for 1-grams, 2-grams, and 3-grams should be combined with a geometric mean, that is s1^w1 * s2^w2 * s3^w3, not the arithmetic mean w1*s1 + w2*s2 + w3*s3.
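
The geometric mean matters because a single zero precision zeroes out the whole score, which an arithmetic mean would not. A minimal sketch with hypothetical per-order scores:

import numpy as np

scores = [0.8, 0.6, 0.0]         # hypothetical 1-gram, 2-gram, 3-gram precisions
weights = [1 / 3, 1 / 3, 1 / 3]
geometric = np.prod([s ** w for s, w in zip(scores, weights)])
arithmetic = sum(w * s for w, s in zip(weights, scores))
print(geometric)   # 0.0: no 3-gram match means no credit at all
print(arithmetic)  # ~0.467: would still reward the candidate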

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references, candidate):
    # brevity penalty: penalize a candidate that is shorter than its closest reference
    # pick the reference whose length is closest to the candidate's
    # (ties go to the first one in the list; NLTK breaks ties toward the shorter reference)
    ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
    if len(references[ind]) < len(candidate):
        return 1
    scale = 1 - (len(candidate) / len(references[ind]))
    return np.e ** scale


def parse_ngram(sentence, gram):
    # split a sentence into its n-grams
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]  # mind the +1 here, or the last gram is lost


def sentence_bleu(references, candidate, weight):
    bp_value = bp(references, candidate)
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        # split the candidate and the references into n-grams
        ref = [parse_ngram(i, gram) for i in references]
        can = parse_ngram(candidate, gram)
        # count n-gram occurrences
        ref_counter = [Counter(i) for i in ref]
        can_counter = Counter(can)
        # clipped count: cap each n-gram's count at its max count in any single reference
        appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
        score = appear / len(can)
        # each order's score is raised to its own weight
        s *= score ** wei
    s *= bp_value  # the final score is multiplied by the brevity penalty
    return s


references = [
    "the dog jumps high",
    "the cat runs fast",
    "dog and cats are good friends"
]
candidate = "the d o g  jump s hig"
weights = [0.25, 0.25, 0.25, 0.25]
print(sentence_bleu(references, candidate, weights))
print(bleu_score.sentence_bleu(references, candidate, weights))
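
Note that references and candidate above are plain strings, so parse_ngram slices characters, and NLTK likewise iterates over the strings character by character; both scores here are character-level, which is why they agree. NLTK's sentence_bleu normally expects pre-tokenized token lists. A word-level call would look like this (same toy data, so expect NLTK to warn about zero higher-order overlaps and return a score near 0):

refs_tok = [r.split() for r in references]
cand_tok = candidate.split()
print(bleu_score.sentence_bleu(refs_tok, cand_tok, weights))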

A corpus is made up of multiple sentences, and corpus_bleu is not the mean of the sentence_bleu values but a slightly more involved computation; one might call it a rhapsody without much rhyme or reason.

corpus_bleu

Suppose a document contains 3 sentences whose per-sentence scores are a1/b1, a2/b2, a3/b3.
Then the score over all sentences is (a1+a2+a3)/(b1+b2+b3).

The brevity penalty is pooled the same way: if the three candidates have lengths l1, l2, l3 and their closest references have lengths k1, k2, k3, the penalty amounts to bp(k1+k2+k3, l1+l2+l3), matching the bp(references_len, candidate_len) signature below.

In other words, corpus_bleu does not simply average the sentence_bleu scores; it pools the counts in one unified computation.
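
A tiny numeric sketch of the difference, using hypothetical per-sentence 1-gram counts (matched, total):

sentence_counts = [(1, 2), (2, 3)]
pooled = sum(a for a, b in sentence_counts) / sum(b for a, b in sentence_counts)
averaged = sum(a / b for a, b in sentence_counts) / len(sentence_counts)
print(pooled)    # (1+2)/(2+3) = 0.6
print(averaged)  # (0.5 + 0.667)/2 ≈ 0.583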

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # brevity penalty over the pooled lengths of the whole corpus
    if references_len < candidate_len:
        return 1
    scale = 1 - (candidate_len / references_len)
    return np.e ** scale


def parse_ngram(sentence, gram):
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = 0
    for candidate, references in zip(candidate_list, references_list):
        # accumulate the closest reference length for each candidate
        ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
        reference_len += len(references[ind])
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator: pooled clipped matches
        down = 0  # denominator: pooled candidate n-gram totals
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            # split into n-grams
            ref = [parse_ngram(i, gram) for i in references]
            can = parse_ngram(candidate, gram)
            # count n-gram occurrences
            ref_counter = [Counter(i) for i in ref]
            can_counter = Counter(can)
            # clipped count: cap each n-gram's count at its max count in any single reference
            appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
            up += appear
            down += len(can)
        s *= (up / down) ** wei
    return bp(reference_len, candidate_len) * s


references = [
    [
        "the dog jumps high",
        "the cat runs fast",
        "dog and cats are good friends"],
    [
        "ba ga ya",
        "lu ha a df",
    ]
]
candidate = ["the d o g  jump s hig", 'it is too bad']
weights = [0.25, 0.25, 0.25, 0.25]
print(corpus_bleu(references, candidate, weights))
print(bleu_score.corpus_bleu(references, candidate, weights))

If the NLTK version you are using is 3.2, released in March 2016, there is a bug in its corpus_bleu; NLTK fixed it in October 2016. When summing the per-sentence scores, the NLTK code used Fraction, and Fraction automatically reduces its numerator and denominator, so the sums came out wrong.
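
The effect is easy to reproduce in isolation (a minimal sketch, not the actual NLTK code):

from fractions import Fraction

p1 = Fraction(2, 4)  # silently reduced to 1/2
p2 = Fraction(1, 3)
wrong = (p1.numerator + p2.numerator) / (p1.denominator + p2.denominator)  # 2/5 = 0.4
right = (2 + 1) / (4 + 3)  # the intended pooled score, 3/7 ≈ 0.429
print(wrong, right)

NLTK's fix keeps the raw counts by constructing the fractions with _normalize=False.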

Simplified code

Many steps in the sentence_bleu and corpus_bleu computations are similar and can be merged. The streamlined code is below:

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # brevity penalty: 1 unless the pooled candidate length falls short of the pooled reference length
    return np.e ** (1 - (candidate_len / references_len)) if references_len > candidate_len else 1


def nearest_len(references, candidate):
    # length of the reference closest in length to the candidate
    return len(references[np.argmin([abs(len(i) - len(candidate)) for i in references])])


def parse_ngram(sentence, gram):
    # split a sentence into its n-grams
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def appear_count(references, candidate, gram):
    ref = [parse_ngram(i, gram) for i in references]
    can = parse_ngram(candidate, gram)
    # count n-gram occurrences
    ref_counter = [Counter(i) for i in ref]
    can_counter = Counter(can)
    # clipped count: cap each n-gram's count at its max count in any single reference
    appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
    return appear, len(can)


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = sum(nearest_len(references, candidate) for candidate, references in zip(candidate_list, references_list))
    bp_value = bp(reference_len, candidate_len)
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator: pooled clipped matches
        down = 0  # denominator: pooled candidate n-gram totals
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            appear, total = appear_count(references, candidate, gram)
            up += appear
            down += total
        s *= (up / down) ** wei
    return bp_value * s


def sentence_bleu(references, candidate, weight):
    bp_value = bp(nearest_len(references, candidate), len(candidate))
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        appear, total = appear_count(references, candidate, gram)
        score = appear / total
        # each order's score is raised to its own weight
        s *= score ** wei
    # the final score is multiplied by the brevity penalty
    return s * bp_value


if __name__ == '__main__':
    references = [
        [
            "the dog jumps high",
            "the cat runs fast",
            "dog and cats are good friends"],
        [
            "ba ga ya",
            "lu ha a df",
        ]
    ]
    candidate = ["the d o g  jump s hig", 'it is too bad']
    weights = [0.25, 0.25, 0.25, 0.25]
    print(corpus_bleu(references, candidate, weights))
    print(bleu_score.corpus_bleu(references, candidate, weights))
    print(sentence_bleu(references[0], candidate[0], weights))
    print(bleu_score.sentence_bleu(references[0], candidate[0], weights))


