[Study Notes] A Brief Look at Sampling Methods in Text Generation

Reposted article. 2021-06-29 16:24. NLP techniques / natural language processing

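These notes record how text generation works and how the next token is selected/sampled during generation. They first briefly walk through the text generation process, then introduce commonly used sampling methods, and finally implement three of them (Greedy Search, Temperature Sampling, and Top-K Sampling) and discuss their strengths and weaknesses.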

What is NLG?

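A language model (LM) is used to generate text, typically at either the word level (word-by-word) or the character level (character-by-character).
During training, the model is fit on token sequences (input, X) paired with target tokens (target, y), with the goal of obtaining a model that, given an input sequence, outputs a conditional probability distribution over the next token (a vector of vocabulary length). The figure in the original post shows a word-level LM predicting such a vocabulary-length distribution for the input sequence "I want to cook".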

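Text generation proceeds in the following five steps:

step1: Feed a sequence (for example, a sentence) to the LM as input.
step2: The LM outputs a probability distribution of vocabulary length.
step3: Sample one token from that distribution according to some strategy.
step4: Append the sampled token to the generated text.
step5: Feed the extended sequence back in and repeat the process.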
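
The five steps above can be sketched as a short loop. This is a minimal illustration only: `toy_lm`, `generate`, and `VOCAB` are hypothetical names, and the "model" simply returns a random distribution instead of conditioning on its input.

```python
import numpy as np

VOCAB = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def toy_lm(sequence, rng):
    """Stand-in for a trained LM (assumption): ignores the input and
    returns a random probability distribution over the vocabulary."""
    scores = rng.random(len(VOCAB))
    return scores / scores.sum()

def generate(prompt, num_steps, rng):
    text = prompt                                  # step 1: initial input sequence
    for _ in range(num_steps):                     # step 5: repeat with the longer sequence
        probs = toy_lm(text, rng)                  # step 2: distribution over the vocabulary
        idx = rng.choice(len(VOCAB), p=probs)      # step 3: pick a token (here: plain sampling)
        text += VOCAB[idx]                         # step 4: append it to the generated text
    return text

rng = np.random.default_rng(0)
generated = generate("abc", 10, rng)
print(generated)
```

Only step 3 changes between the decoding strategies discussed in the rest of the article.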

NLG decoding strategies

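In text generation, sampling means randomly picking a token according to the conditional probability distribution the LM produces over all tokens. Once the model has produced that distribution, the strategy used to pick the next token therefore matters a great deal. Common strategies include: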

Greedy Search (Maximization)
Beam Search
Temperature Sampling
Top-K Sampling
Top-P Sampling (Nucleus sampling)
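This article implements Greedy Search, Temperature Sampling, and Top-K Sampling in detail, and covers Beam Search and Top-P Sampling more briefly.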

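1. Training a language model

The focus here is on sampling methods and their implementation, so we assume an LM already exists that, given an input, outputs a probability distribution of vocabulary length. Specifically, we assume:

Text is generated character-by-character.
The vocabulary consists of the characters 'a' to 'z'.
A language model has already been trained on some corpus.
The language model produces a conditional probability distribution of vocabulary length for any input sequence.
What remains is to sample (select) the next token from that distribution.

1.1 Define the vocabulary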

dictionary =[]
for c in range(ord('a'), ord('z')+1):
    dictionary.append(chr(c))

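1.2 Simulate a trained LM

import numpy as np
import matplotlib.pyplot as plt

class language_model:
    def __init__(self, dictionary):
        self.dictionary = dictionary
    def predict(self):
        # random scores, normalized so they sum to 1 like a real LM output
        output = np.random.rand(len(self.dictionary))
        output = output / output.sum()
        return output

model = language_model(dictionary)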

1.3 Simulate the generated conditional probability distribution

predictions= model.predict()
plt.bar(dictionary,predictions)
plt.show()

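With these assumptions in place, we can implement the sampling strategies below.

Common sampling strategies

Greedy Search decoding

Greedy search is the simplest idea: pick the token (or character) with the highest probability in the distribution as the decoded output at the current step. The problem is that always choosing the most probable token tends to produce repetitive text that gets stuck in loops, for example "I don't know. I don't know. I don't know. I don't know." Sample code: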

def greedy_search(conditional_probability):
#     print(np.argmax(conditional_probability))
    return (np.argmax(conditional_probability))

print(predictions)
next_token = greedy_search(predictions)
print(next_token)
print("Sampled token: ",dictionary[next_token])

Output:
[0.01558192 0.00141205 0.05824388 0.05974056 0.07144658 0.02249477
0.03664056 0.07573829 0.0782964 0.07217844 0.01622408 0.02825687
0.02290704 0.04392459 0.04238757 0.03190642 0.00968754 0.02540264
0.00605495 0.02393471 0.03006855 0.00061328 0.07406862 0.06144887
0.06505202 0.02628881]
8
Sampled token: i
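
The looping behaviour of greedy search can be reproduced with a toy example. The sketch below assumes a hypothetical three-token vocabulary and a made-up next-token table (`transition`); because argmax is deterministic, the decoder falls into a cycle.

```python
import numpy as np

# Hypothetical vocabulary and next-token table (assumptions for illustration):
# row i holds P(next token | current token i).
tokens = ["I", "don't", "know."]
transition = np.array([
    [0.1, 0.6, 0.3],   # after "I"
    [0.2, 0.1, 0.7],   # after "don't"
    [0.8, 0.1, 0.1],   # after "know."
])

def greedy_decode(start, steps):
    seq = [start]
    for _ in range(steps):
        # always take the most probable next token
        seq.append(int(np.argmax(transition[seq[-1]])))
    return seq

ids = greedy_decode(0, 8)
print(" ".join(tokens[i] for i in ids))
# prints: I don't know. I don't know. I don't know.
```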

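Beam Search decoding

Another popular decoding method is beam search, an extension of greedy search that returns a list of the most likely output sequences.
Whereas greedy search keeps only the single most likely token at each step, beam search at step $t$ expands every kept sequence with all possible tokens for step $t + 1$ and keeps the k most probable candidates, where k is a user-specified search parameter (the beam width).
At the first position we do not pick at random; we keep the k most likely first tokens as the starts of k candidate sequences.
With k = 1, beam search reduces to greedy search; most machine translation work uses k between 5 and 10. A larger k usually gives better results because keeping more candidates makes it more likely that the best sequence survives, but it also raises the computational cost and slows decoding.
To make this concrete, we define a function that takes the probability distributions for a sequence (an N x V matrix, for sequence length N and vocabulary size V) and the search parameter k, and returns the decoded result. At each step, every candidate sequence is expanded with all possible next tokens, the expanded candidates are sorted by score, and the k best are kept as the result for that step. This repeats until the last step.
Probabilities are small numbers, and multiplying many of them together produces even smaller ones. To avoid floating-point underflow, the product is replaced by a sum of logarithms. Sample code: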

from math import log
from numpy import array
from numpy import argmax

def beam_search_decoder(data, k):
    # each entry is [token index sequence, score]; the score is the
    # negative log-probability, so lower is better
    sequences = [[list(), 0.0]]
    # iterate over every step of the sequence
    for row in data:
        all_candidates = list()
        # extend each hypothesis with every token and score the result
        for i in range(len(sequences)):
            seq, score = sequences[i]
            for j in range(len(row)):
                candidate = [seq + [j], score - log(row[j])]
                all_candidates.append(candidate)
        # sort all candidates by score (ascending negative log-probability)
        ordered = sorted(all_candidates, key=lambda tup: tup[1])
        # keep the k most probable candidates
        sequences = ordered[:k]
    return sequences

It can be run as follows:

n = 10

data = []
for i in range(n):
    prediction = model.predict()
    data.append(prediction)

data = array(data)
# print(data)
result = beam_search_decoder(data, 5)

for seq in result:
    print(seq)
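
The reason for summing logarithms instead of multiplying probabilities can be checked numerically. The sketch below uses made-up values (200 steps, each with probability 0.01) purely for illustration.

```python
import math

probs = [0.01] * 200           # hypothetical per-step probabilities

# Naive product: 0.01 ** 200 = 1e-400, below the smallest positive float64
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays comfortably within floating-point range
log_score = sum(math.log(p) for p in probs)

print(product)     # 0.0, the product has underflowed
print(log_score)   # about -921.03
```

Once the product has collapsed to 0.0, all candidate sequences look equally (im)probable, whereas the log scores can still be compared.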

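Temperature sampling

Temperature sampling borrows its idea from statistical thermodynamics, where a higher temperature makes otherwise unlikely states more likely to occur. In a probabilistic model the logits play the role of energies: dividing them by the temperature before passing them to softmax yields the final sampling distribution. Below is a Keras-style implementation example of temperature sampling.

0. Plot the conditional probability distribution produced by the model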

plt.bar(dictionary,predictions)
plt.show()

1. Reweight the distribution using "temperature"

temperature=0.2
conditional_probability = np.asarray(predictions).astype("float64")
conditional_probability = np.log(conditional_probability) / temperature
plt.bar(dictionary,conditional_probability)
plt.show()

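2. Apply the softmax function

softmax exponentiates every element and divides each result by the sum of all the exponentials, i.e. $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$:

def softmax(z):
    # subtracting the max avoids overflow and leaves the result unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)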

reweighted_conditional_probability = softmax(conditional_probability)
plt.bar(dictionary,reweighted_conditional_probability)
plt.show()
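
Combining the reweighting step with softmax gives a closed form that may help explain what the temperature does. The short derivation below assumes $p$ is the original distribution and $T > 0$ is the temperature (the symbols are introduced here for illustration):

```latex
\mathrm{softmax}\!\left(\frac{\log p}{T}\right)_i
  = \frac{\exp\!\left(\frac{\log p_i}{T}\right)}{\sum_j \exp\!\left(\frac{\log p_j}{T}\right)}
  = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}}
```

With $T = 1$ the distribution is unchanged, $T < 1$ raises each probability to a power greater than 1 and so sharpens the distribution, and $T > 1$ flattens it toward uniform.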

3. Resample the next letter from the reweighted distribution

We sample a token using a multinomial distribution (np.random.multinomial). Its parameters are:

n: int, Number of experiments.
pvals: sequence of floats, length p. Probabilities of each of the p different outcomes. These must sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
size: int or tuple of ints, optional. Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

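We call the multinomial function with arguments (1, reweighted_conditional_probability, 1) because we need only one experiment and one draw from the distribution. The result is a one-hot vector marking the sampled token. Sample code: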

probas = np.random.multinomial(1, reweighted_conditional_probability, 1)
plt.bar(dictionary,np.squeeze(probas))
plt.show()

4. Putting the previous steps together

def temperature_sampling(conditional_probability, temperature=1.0):
    conditional_probability = np.asarray(conditional_probability).astype("float64")
    # rescale the log-probabilities by the temperature, then renormalize
    conditional_probability = np.log(conditional_probability) / temperature
    reweighted_conditional_probability = softmax(conditional_probability)
    # one multinomial draw gives a one-hot vector; argmax recovers the index
    probas = np.random.multinomial(1, reweighted_conditional_probability, 1)
    plt.bar(dictionary, reweighted_conditional_probability)
    plt.show()

    return np.argmax(probas)

for temp in np.arange(0.2, 1.6, 0.8):
    next_token = temperature_sampling(predictions, temperature=temp)
    print("Temperature: ", temp)
    print("Sampled token: ", dictionary[next_token], "\n")

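5. Some observations

Across most studies, the choice of temperature tends to follow these patterns:

As the temperature approaches 0, temperature sampling becomes equivalent to greedy search, which always picks the most probable token.
A low temperature produces highly repetitive and predictable text, but the content stays close to the training corpus (highly realistic), with essentially every word coming from the corpus.
A higher temperature makes the generated text more random, more interesting, and even creative; it sometimes invents new words (which can look like misspelled words).
At very high temperatures the local structure of the text starts to break down and most words look like semi-random strings.
In practice, experiment with multiple temperature values. A value that keeps some randomness without destroying structure usually yields the most interesting text.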
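
The effect of the temperature on the distribution can be seen without any plotting. The sketch below uses a made-up four-token distribution and a hypothetical helper `reweight` that applies the same log, divide, softmax steps as above.

```python
import numpy as np

def reweight(probs, temperature):
    """Return softmax(log(probs) / temperature)."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    logits -= logits.max()              # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical LM output

r_low = reweight(probs, 0.2)    # sharper: mass moves to the top token
r_one = reweight(probs, 1.0)    # unchanged
r_high = reweight(probs, 2.0)   # flatter: closer to uniform

for t, r in ((0.2, r_low), (1.0, r_one), (2.0, r_high)):
    print(t, np.round(r, 3))
```

At temperature 0.2 almost all of the mass sits on the most probable token, which is why low temperatures behave much like greedy search.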

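Top-K sampling

Original paper: (Fan et al., 2018). The paper introduced a simple but effective sampling method, Top-K sampling.

Top-K sampling still selects the k most probable words from the distribution. The difference is that the probability mass is then redistributed among those k words, and the next token is drawn from this new distribution. GPT-2 adopted this sampling scheme, which contributed to its success in story generation. The steps are as follows.

1. Start with a probability distribution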

predictions= model.predict()
plt.bar(dictionary,predictions)
plt.show()

2. Select the top-K distribution

We use tf.math.top_k() to obtain the k largest values in the distribution and their indices. The indices give us the corresponding tokens.

import tensorflow as tf

k=5
top_k_probabilities, top_k_indices= tf.math.top_k(predictions, k=k, sorted=True)
top_k_indices = np.asarray(top_k_indices).astype("int32")
top_k_tokens=[dictionary[i] for i in top_k_indices]
top_k_indices, top_k_tokens
# top_k_probabilities.numpy().sum()

(array([ 8, 7, 22, 9, 4]), ['i', 'h', 'w', 'j', 'e'])

3. Apply the softmax function

top_k_redistributed_probability=softmax(np.log(top_k_probabilities))
top_k_redistributed_probability = np.asarray(top_k_redistributed_probability).astype("float32")
print('top_k_tokens: ',top_k_tokens)
print('top_k_redistributed_probability: ',top_k_redistributed_probability)
print('Total probability: ', top_k_redistributed_probability.sum())

top_k_tokens: ['h', 'p', 'n', 'i', 'k']
top_k_redistributed_probability: [0.21983118 0.21332353 0.21130912 0.19023508 0.16530107]
Total probability: 1.0

plt.bar(top_k_tokens,top_k_redistributed_probability)
plt.show()

4. Resample the next letter from the reweighted distribution

sampled_token = np.random.choice(top_k_indices, 
                                 p=top_k_redistributed_probability)
print("Sampled token id: ",sampled_token, 
      " token: ",dictionary[sampled_token])

Sampled token id: 11 token: l

5. The complete procedure

def top_k_sampling(conditional_probability, k):
    # use the distribution passed in rather than the global predictions
    top_k_probabilities, top_k_indices = tf.math.top_k(conditional_probability, k=k, sorted=True)
    top_k_indices = np.asarray(top_k_indices).astype("int32")
    # softmax(log(p)) renormalizes the k kept probabilities so they sum to 1
    top_k_redistributed_probability = softmax(np.log(top_k_probabilities))
    top_k_redistributed_probability = np.asarray(top_k_redistributed_probability).astype("float32")
    sampled_token = np.random.choice(top_k_indices, p=top_k_redistributed_probability)
    top_k_tokens = [dictionary[i] for i in top_k_indices]
    plt.bar(top_k_tokens, top_k_redistributed_probability)
    plt.show()
    return sampled_token

predictions= model.predict()
plt.bar(dictionary,predictions)
plt.show()

6. Top-K sampling with different k values

for k in range (5, 25, 5):
  next_token = top_k_sampling(predictions, k=k)
  print("k: ", k)
  print("Sampled token: ",dictionary[next_token],"\n")

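7. Some observations

In general, top-K sampling improves generation quality because it discards low-probability results (removing the tail), which keeps the generated text from drifting off topic.
However:

In some cases the discarded tail contains many words, which leaves us with few words to choose from.
In other cases the discarded tail contains few words, and we can still generate fairly varied text.

The choice of k therefore has a large effect on the generated result.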
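
For readers without TensorFlow installed, the same idea can be sketched with NumPy alone. This is an illustrative alternative, not the code used above; `top_k_sample` is a hypothetical helper and the distribution is random.

```python
import numpy as np

def top_k_sample(probs, k, rng):
    probs = np.asarray(probs, dtype="float64")
    top_idx = np.argsort(probs)[::-1][:k]            # indices of the k largest probabilities
    top_p = probs[top_idx] / probs[top_idx].sum()    # redistribute mass over those k tokens
    return int(rng.choice(top_idx, p=top_p))         # draw one of the k token ids

rng = np.random.default_rng(0)
probs = rng.random(26)
probs /= probs.sum()                                 # hypothetical LM output over 'a'..'z'

samples = [top_k_sample(probs, 5, rng) for _ in range(100)]
print(sorted(set(samples)))
```

Whatever the random draws are, every sampled id belongs to the five most probable tokens, since the tail has been removed before sampling.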

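Top-P sampling

Many sampling methods have been proposed; top-P is one of the most common.

Top-P Sampling (Nucleus sampling): unlike top-K, which keeps a fixed number of words and discards the rest, top-P works with cumulative probability. It samples from the smallest set of most probable words whose cumulative probability exceeds a threshold p. In other words, the parameter p (0<=p<=1) controls how much of the tail is kept, so compared with top-K, lower-probability words can get a chance to be generated. The sample code below gives more detail: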

def scatter_values_on_batch_indices(values, batch_indices):
    shape = shape_list(batch_indices)
    # broadcast batch dim to shape
    broad_casted_batch_dims = tf.reshape(tf.broadcast_to(tf.expand_dims(tf.range(shape[0]), axis=-1), shape), [1, -1])
    # transform batch_indices to pair_indices
    pair_indices = tf.transpose(tf.concat([broad_casted_batch_dims, tf.reshape(batch_indices, [1, -1])], 0))
    # scatter values to pair indices
    return tf.scatter_nd(pair_indices, tf.reshape(values, [-1]), shape)


def set_tensor_by_indices_to_value(tensor, indices, value):
    # create value_tensor since tensor value assignment is not possible in TF
    value_tensor = tf.zeros_like(tensor) + value
    return tf.where(indices, value_tensor, tensor)


def shape_list(x):
    """Deal with dynamic shape in tensorflow cleanly."""
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]

def top_p_decoding(logits, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
    sorted_indices = tf.argsort(logits, direction="DESCENDING")
    sorted_logits = tf.gather(
        logits, sorted_indices, axis=-1, batch_dims=1
    )  # expects logits to be of dim (batch_size, vocab_size)

    cumulative_probs = tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)

    # Remove tokens with cumulative probability above the threshold (token with 0 are kept)
    sorted_indices_to_remove = cumulative_probs > top_p

    if min_tokens_to_keep > 1:
        # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
        sorted_indices_to_remove = tf.concat(
            [
                tf.zeros_like(sorted_indices_to_remove[:, :min_tokens_to_keep]),
                sorted_indices_to_remove[:, min_tokens_to_keep:],
            ],
            -1,
        )

    # Shift the indices to the right to keep also the first token above the threshold
    sorted_indices_to_remove = tf.roll(sorted_indices_to_remove, 1, axis=-1)
    sorted_indices_to_remove = tf.concat(
        [tf.zeros_like(sorted_indices_to_remove[:, :1]), sorted_indices_to_remove[:, 1:]],
        -1,
    )
    # scatter sorted tensors to original indexing
    indices_to_remove = scatter_values_on_batch_indices(sorted_indices_to_remove, sorted_indices)
    logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)
    return logits

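n = 10

data = []
for i in range(n):
    prediction = model.predict()
    data.append(prediction)
data = array(data)
print(data)
# top_p_decoding expects logits of shape (batch_size, vocab_size), so pass
# log-probabilities rather than the probabilities themselves
result = top_p_decoding(np.log(data), 0.5)

for seq in result:
    print(seq)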
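
The core of top-P filtering can also be written in a few lines of NumPy for a single distribution. This is a simplified sketch, not the batched TensorFlow code above: `top_p_filter` is a hypothetical helper that keeps the smallest set of most probable tokens whose cumulative probability reaches p, and the input distribution is made up.

```python
import numpy as np

def top_p_filter(probs, p):
    """Zero out everything outside the nucleus and renormalize."""
    probs = np.asarray(probs, dtype="float64")
    order = np.argsort(probs)[::-1]                   # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # first position where the cumulative probability reaches p
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])              # hypothetical LM output
nucleus = top_p_filter(probs, 0.7)
print(nucleus)                                        # only the first two tokens keep mass

rng = np.random.default_rng(0)
sampled = int(rng.choice(len(probs), p=nucleus))      # sample from the nucleus
print(sampled)
```

With p = 0.7 the first two tokens (cumulative probability 0.8) form the nucleus; a flatter distribution would keep more tokens for the same p, which is the adaptive behaviour that distinguishes top-P from top-K.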

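Summary

This article discussed several common sampling methods used in text generation, along with partial implementations, and compared their strengths and weaknesses. No single method is best overall, only the one best suited to the task, so it is worth experimenting on the specific task to find the best generation method. Try different parameter values to trade off structure against randomness and obtain interesting generated text.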
