Understanding BLEU

Reposted article, dated 2018-10-26 00:38.

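BLEU stands for Bilingual Evaluation Understudy. It was proposed in 2002 as a way to evaluate machine translation output. The method is simple, quick to compute and easy to understand. Because its results are passable, it has been carried over to all kinds of evaluation tasks in natural language processing. You could say of it: when there is no tiger on the mountain, the monkey is king. When the age has no heroes, a nobody makes his name. When Shu has no great generals left, Liao Hua leads the vanguard.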

Problem description

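First, let us build an intuitive picture of the BLEU algorithm.
There are two kinds of problem:
1. Given one sentence and a set of reference sentences, compute its BLEU score. This is called sentence_bleu.
2. Given a batch of sentences, each with its own set of reference sentences, compute a single BLEU score. This is called corpus_bleu.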

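The sentence produced by machine translation is called the candidate; the set of sentences it is compared against is called the references.
The computation measures the overlap between the candidate and the references. The more they have in common, the better the translation is judged to be.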

Computing BLEU for one sentence against a set of references

BLEU looks at four n-gram orders, n = 1, 2, 3 and 4, and each order can be given its own weight.

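For a given n:

1. Split the candidate and each reference into n-grams.
2. Count how often each n-gram occurs in the candidate and in each reference.
3. Clip the counts: an n-gram's count in the candidate may not exceed its highest count in any single reference. This step deals with candidates such as "the the the the the". Without clipping, every "the" would count as a match and the score would be 1. The references, being normal sentences, cap how often a repeated word can be rewarded.
4. Divide the sum of the clipped counts by the total number of n-grams in the candidate. The result is the score for this n.
5. Multiply the score by the brevity penalty to get the final BLEU score. The penalty is 1 when the candidate is longer than the closest reference and exp(1 - r/c) otherwise, where c is the candidate length and r is the length of the closest reference. This step deals with overly short candidates. For example, a candidate consisting of the single word "the", where "the" occurs in the references, would otherwise score 1. In other words, it stops a system from staying silent for fear of saying something wrong.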
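
The clipping and brevity-penalty steps above can be sketched as follows. This is a minimal illustration rather than the code used later in this post: the function names `clipped_precision` and `brevity_penalty`, the whitespace tokenisation and the two example references are all assumptions chosen for the demonstration.

```python
import math
from collections import Counter


def clipped_precision(candidate, references, n=1):
    """Return (clipped matches, total candidate n-grams) for token lists."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand_counts = Counter(ngrams(candidate))
    ref_counts = [Counter(ngrams(ref)) for ref in references]
    # Clip each candidate count at the largest count seen in any single reference.
    matched = sum(
        min(count, max(rc[gram] for rc in ref_counts))
        for gram, count in cand_counts.items()
    )
    return matched, sum(cand_counts.values())


def brevity_penalty(candidate_len, reference_len):
    """1 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len > reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1 - reference_len / candidate_len)


references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

# Degenerate candidate: unclipped precision would be 7/7, clipping gives 2/7.
spam = "the the the the the the the".split()
print(clipped_precision(spam, references))  # (2, 7)

# One-word candidate: precision is 1/1, but the brevity penalty shrinks the score.
short = ["the"]
matched, total = clipped_precision(short, references)
print(matched / total * brevity_penalty(len(short), 6))  # exp(-5), well below 1
```

The closest reference to the one-word candidate has 6 tokens, which is why 6 is passed as the reference length.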

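BLEU did not reach its present form in one step. Many people kept finding loopholes in it and proposing fixes. Its history is a good lesson in how to design rules that deal with bad cases.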

Finally, the scores for 1-gram, 2-gram and 3-gram are combined with a weighted geometric mean, s1^w1 * s2^w2 * s3^w3, rather than a weighted arithmetic mean, w1*s1 + w2*s2 + w3*s3.
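
To make the difference between the two ways of combining scores concrete, here is a small sketch. The helper name `combine` and the precision values are made up for illustration. The product is computed in log space, which is a common way to evaluate a weighted geometric mean.

```python
import math


def combine(precisions, weights):
    """Weighted geometric mean, computed in log space; 0 if any precision is 0."""
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(w * math.log(p) for p, w in zip(precisions, weights)))


precisions = [0.8, 0.5, 0.2]  # hypothetical s1, s2, s3
weights = [1 / 3, 1 / 3, 1 / 3]

geometric = combine(precisions, weights)
arithmetic = sum(w * p for p, w in zip(precisions, weights))
print(geometric, arithmetic)  # the geometric mean is the smaller of the two

# A single zero precision zeroes the geometric mean but not the arithmetic one.
print(combine([0.8, 0.5, 0.0], weights))  # 0.0
```

One consequence of the geometric mean is that a candidate with no matching n-grams at some order gets a score of 0, however good the other orders are.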

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


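def bp(references, candidate):
    # Brevity penalty: penalises candidates shorter than the closest reference.
    ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
    if len(references[ind]) < len(candidate):
        return 1
    # BLEU defines the penalty as exp(1 - r / c): r is the reference length, c the candidate length.
    scale = 1 - (len(references[ind]) / len(candidate))
    return np.e ** scale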


def parse_ngram(sentence, gram):
    # Split a sentence into n-grams. A string is sliced character by character here.
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]  # the +1 matters: without it the last n-gram is dropped


def sentence_bleu(references, candidate, weight):
    bp_value = bp(references, candidate)
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        # Split into n-grams
        ref = [parse_ngram(i, gram) for i in references]
        can = parse_ngram(candidate, gram)
        # Count n-gram occurrences
        ref_counter = [Counter(i) for i in ref]
        can_counter = Counter(can)
        # Clip each candidate count at its maximum count in any single reference
        appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
        score = appear / len(can)
        # Each n-gram order has its own weight
        s *= score ** wei
    s *= bp_value  # the final score is multiplied by the brevity penalty
    return s


references = [
    "the dog jumps high",
    "the cat runs fast",
    "dog and cats are good friends"
]
candidate = "the d o g  jump s hig"
weights = [0.25, 0.25, 0.25, 0.25]
print(sentence_bleu(references, candidate, weights))
print(bleu_score.sentence_bleu(references, candidate, weights))
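
Note that the listing above passes plain strings, so slicing produces character n-grams rather than word n-grams. Slicing a list of words instead would produce lists, which cannot be used as `Counter` keys. A word-level variant therefore needs tuples. The sketch below shows the difference; `word_ngrams` is a hypothetical helper, and whitespace splitting is an assumed tokenisation.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Word-level n-grams as tuples, so they can be used as Counter keys."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the dog jumps high"

# Character level: slicing a string gives substrings such as 'th', 'he', 'e '.
char_bigrams = [sentence[i:i + 2] for i in range(len(sentence) - 1)]
print(len(char_bigrams))  # 17

# Word level: split first, then slice the token list.
bigrams = word_ngrams(sentence.split(), 2)
print(bigrams)  # [('the', 'dog'), ('dog', 'jumps'), ('jumps', 'high')]
print(Counter(bigrams)[("the", "dog")])  # 1
```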

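A corpus is made up of many sentences. corpus_bleu is not the mean of the sentence_bleu scores. It is computed in a slightly more involved way, which can look rather arbitrary at first sight.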

corpus_bleu

Suppose a document contains three sentences whose scores are a1/b1, a2/b2 and a3/b3.
The score for all the sentences together is then (a1+a2+a3)/(b1+b2+b3).

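The brevity penalty is pooled in the same way. If the three candidates have lengths l1, l2, l3 and their closest references have lengths k1, k2, k3, the penalty is computed once, from the total candidate length l1+l2+l3 and the total reference length k1+k2+k3.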

In other words, corpus_bleu is not simply the average of the sentence_bleu scores. It pools the counts of all sentences and scores them as one.
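
The following sketch contrasts pooling with averaging. The per-sentence statistics are invented numbers, not the output of any real system.

```python
import math

# Hypothetical per-sentence statistics: (clipped matches, candidate n-grams,
# candidate length, closest reference length).
stats = [(3, 4, 4, 5), (1, 10, 10, 9), (6, 6, 6, 8)]

matches = sum(s[0] for s in stats)
totals = sum(s[1] for s in stats)
pooled = matches / totals  # (a1+a2+a3)/(b1+b2+b3)
mean_of_ratios = sum(s[0] / s[1] for s in stats) / len(stats)
print(pooled, mean_of_ratios)  # 0.5 versus roughly 0.617

cand_len = sum(s[2] for s in stats)  # l1+l2+l3 = 20
ref_len = sum(s[3] for s in stats)  # k1+k2+k3 = 22
penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
print(penalty)  # exp(-0.1), roughly 0.905
```

Pooling weights each sentence by its number of n-grams, so the long second sentence pulls the pooled score down further than it pulls the plain average.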

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    if references_len < candidate_len:
        return 1
    # exp(1 - r / c): r is the reference length, c the candidate length
    scale = 1 - (references_len / candidate_len)
    return np.e ** scale


def parse_ngram(sentence, gram):
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = 0
    for candidate, references in zip(candidate_list, references_list):
        ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
        reference_len += len(references[ind])
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator: clipped matches summed over all sentences
        down = 0  # denominator: candidate n-grams summed over all sentences
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            # Split into n-grams
            ref = [parse_ngram(i, gram) for i in references]
            can = parse_ngram(candidate, gram)
            # Count n-gram occurrences
            ref_counter = [Counter(i) for i in ref]
            can_counter = Counter(can)
            # Clip each candidate count at its maximum count in any single reference
            appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
            up += appear
            down += len(can)
        s *= (up / down) ** wei
    return bp(reference_len, candidate_len) * s


references = [
    [
        "the dog jumps high",
        "the cat runs fast",
        "dog and cats are good friends"],
    [
        "ba ga ya",
        "lu ha a df",
    ]
]
candidate = ["the d o g  jump s hig", 'it is too bad']
weights = [0.25, 0.25, 0.25, 0.25]
print(corpus_bleu(references, candidate, weights))
print(bleu_score.corpus_bleu(references, candidate, weights))

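If you are using NLTK 3.2, released in March 2016, there is a bug in its corpus_bleu computation. NLTK fixed it in October 2016. To sum the per-sentence scores, the NLTK code used Fraction, which automatically reduces the numerator and denominator, so the pooled sums came out wrong.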
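
The effect of that automatic reduction can be reproduced with the standard library alone. The sketch below only illustrates the arithmetic; it is not a copy of NLTK's code, and the counts are invented.

```python
from fractions import Fraction

# Hypothetical per-sentence counts: (clipped matches, candidate n-grams).
counts = [(2, 4), (1, 3)]

# Pooling the raw counts gives the intended corpus-level precision, 3/7.
pooled = sum(m for m, _ in counts) / sum(t for _, t in counts)
print(pooled)  # roughly 0.4286

# Fraction(2, 4) is stored as 1/2, so the original counts are lost.
reduced = [Fraction(m, t) for m, t in counts]
print(reduced[0].numerator, reduced[0].denominator)  # 1 2

# Summing numerators and denominators of the reduced fractions gives 2/5.
skewed = sum(f.numerator for f in reduced) / sum(f.denominator for f in reduced)
print(skewed)  # 0.4
```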

Simplified code

Many of the steps in sentence_bleu and corpus_bleu are alike and can be merged. The condensed code is as follows:

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # exp(1 - r / c) when the candidate is shorter than the reference, otherwise 1
    return np.e ** (1 - (references_len / candidate_len)) if references_len > candidate_len else 1


def nearest_len(references, candidate):
    return len(references[np.argmin([abs(len(i) - len(candidate)) for i in references])])


def parse_ngram(sentence, gram):
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]


def appear_count(references, candidate, gram):
    ref = [parse_ngram(i, gram) for i in references]
    can = parse_ngram(candidate, gram)
    # Count n-gram occurrences
    ref_counter = [Counter(i) for i in ref]
    can_counter = Counter(can)
    # Clip each candidate count at its maximum count in any single reference
    appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
    return appear, len(can)


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = sum(nearest_len(references, candidate) for candidate, references in zip(candidate_list, references_list))
    bp_value = bp(reference_len, candidate_len)
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator
        down = 0  # denominator
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            appear, total = appear_count(references, candidate, gram)
            up += appear 
            down += total 
        s *= (up / down) ** wei
    return bp_value * s


def sentence_bleu(references, candidate, weight):
    bp_value = bp(nearest_len(references, candidate), len(candidate))
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        appear, total = appear_count(references, candidate, gram)
        score = appear / total
        # Each n-gram order has its own weight
        s *= score ** wei
    # The final score is multiplied by the brevity penalty
    return s * bp_value


if __name__ == '__main__':
    references = [
        [
            "the dog jumps high",
            "the cat runs fast",
            "dog and cats are good friends"],
        [
            "ba ga ya",
            "lu ha a df",
        ]
    ]
    candidate = ["the d o g  jump s hig", 'it is too bad']
    weights = [0.25, 0.25, 0.25, 0.25]
    print(corpus_bleu(references, candidate, weights))
    print(bleu_score.corpus_bleu(references, candidate, weights))
    print(sentence_bleu(references[0], candidate[0], weights))
    print(bleu_score.sentence_bleu(references[0], candidate[0], weights))

