Notes on "Deep Learning with Python" --- 8.1 Text generation with LSTM
I. Summary
One-sentence summary:
The principle is quite simple: a single LSTM layer learns the statistical regularities of the words and characters in the training data, and the softmax layer then acts as a classifier that outputs a probability for each character in the vocabulary.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
1. What is the purpose of artificial intelligence?
【AI is not meant to replace our own intelligence】: Admittedly, the AI-generated works of art we have seen so far are still quite poor. AI is nowhere near the level of human screenwriters, painters, and composers. But replacing humans was never the point: artificial intelligence will not replace our own intelligence,
【but will bring more intelligence into our lives and work】: rather, it will bring more intelligence into our lives and work, that is, intelligence of a different kind. In many fields, especially creative ones, humans will use AI as a tool to augment their own abilities, achieving intelligence more powerful than AI itself.
2. Where does artificial intelligence play a role?
【Simple pattern recognition and technical skill】: A large part of artistic creation consists of simple pattern recognition and technical skill, which is exactly the part of the process that many people find unappealing or even dispensable.
【Our perceptual patterns, our language, and our artworks all have statistical structure】: learning this structure is what deep learning algorithms excel at.
3. Is a machine learning model just a mathematical operation?
【Machine learning models can learn the statistical latent space of images, music, and stories, and then sample from that space】: creating new works with characteristics similar to those the model saw in its training data.
【A machine learning model is just a mathematical operation】: Of course, such sampling is not in itself an act of artistic creation. It is merely a mathematical operation; the algorithm has no grounding in human life, human emotion, or our experience of the world. Instead, it learns from an experience entirely different from ours.
4. In the LSTM text generation example, how is sequence data generated?
【Use the previous tokens as input and train a network to predict the next token(s) in the sequence】: the general way to generate sequence data with deep learning is to use the previous tokens as input and train a network (usually a recurrent or convolutional neural network) to predict one or more subsequent tokens in the sequence.
【For example, given the input "the cat is on the ma", the network is trained to predict the target "t", i.e. the next character.】
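As a minimal sketch of this idea (my own illustration, not code from the book), here is how such (input, target) pairs could be built from a raw string; the window size of 20 is a made-up value:

sentence = "the cat is on the mat"
window = 20  # hypothetical context length

# Slide a window over the string: everything inside the window is the input,
# and the character right after the window is the prediction target.
pairs = [(sentence[i:i + window], sentence[i + window])
         for i in range(len(sentence) - window)]
print(pairs[0])  # ('the cat is on the ma', 't')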
5. What is a language model?
【Any network that can model the probability of the next token given the previous tokens】: as in earlier text-processing work, a token is usually a word or a character. Any network that can model the probability of the next token given the previous tokens is called a language model.
【The latent space of language, i.e. its statistical structure】: a language model captures the latent space of language, that is, its statistical structure.
6. In the LSTM text generation example, what are sampling and conditioning data?
【Sampling (i.e. generating new sequences)】: once such a language model has been trained, you can sample from it, i.e. generate new sequences.
【An initial text string, i.e. the conditioning data】: feed the model an initial text string (the conditioning data), ask it to generate the next character or word (or even several tokens at once), append the generated output to the input data, and repeat this process many times.
7. When generating text, why is the choice of the next character crucial?
【Greedy sampling】: a simple approach is greedy sampling: always pick the most likely next character. But this produces repetitive, predictable strings that do not look like coherent language.
【Stochastic sampling】: a more interesting approach makes slightly surprising choices by introducing randomness into the sampling process, i.e. sampling from the probability distribution over the next character. This is called stochastic sampling ("stochasticity" is simply the technical term for randomness in this field). Under this scheme, if the model assigns the next character "e" a probability of 0.3, you will pick it 30% of the time.
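A minimal sketch of the difference (my own illustration, not from the book), using a made-up three-character vocabulary and NumPy:

import numpy as np

vocab = ['a', 'e', 't']            # hypothetical tiny vocabulary
probs = np.array([0.5, 0.3, 0.2])  # model-predicted next-character probabilities

# Greedy sampling: always take the single most likely character.
greedy_char = vocab[int(np.argmax(probs))]

# Stochastic sampling: 'e' is picked roughly 30% of the time.
sampled_char = np.random.choice(vocab, p=probs)

print(greedy_char, sampled_char)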
8. Why does sampling (generating new sequences) need a certain amount of randomness?
【Pure random sampling has maximum entropy, i.e. maximum randomness】: consider one extreme, pure random sampling, where the next character is drawn from a uniform distribution and every character is equally likely. This scheme has maximum randomness; in other words, the probability distribution has maximum entropy. Of course, it never produces anything interesting.
【Greedy sampling has minimum entropy, i.e. no randomness at all】: now consider the other extreme, greedy sampling. It does not produce anything interesting either: it has no randomness, and the corresponding probability distribution has minimum entropy.
【Lower entropy gives the generated sequences a more predictable structure (so they may look more realistic), while higher entropy yields more surprising and creative sequences】: but there are many intermediate points with more or less entropy that are worth exploring. Lower entropy makes the generated sequences more predictable in structure (and therefore potentially more realistic looking), while higher entropy produces more surprising and creative sequences.
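As a small numerical illustration (my own, not from the book), the Shannon entropy H(p) = -sum_i p_i log p_i is maximal for a uniform distribution and close to zero for a sharply peaked, near-greedy one:

import numpy as np

def entropy(p):
    # Shannon entropy in nats: H(p) = -sum_i p_i * log(p_i)
    p = np.asarray(p, dtype='float64')
    p = p[p > 0]  # drop zero-probability entries (0 * log 0 is taken as 0)
    return float(-np.sum(p * np.log(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: maximum entropy (~1.39)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # near-greedy: entropy close to 0 (~0.17)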
9. What is the softmax temperature?
【To control the amount of randomness during sampling】: we introduce a parameter called the softmax temperature,
【which characterizes the entropy of the sampling distribution, i.e. how surprising or predictable the choice of the next character will be】.
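A minimal sketch of how temperature reweights a probability distribution (following the book's reweight_distribution idea; the example probabilities are made up): higher temperatures flatten the distribution and increase its entropy, lower temperatures sharpen it.

import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Rescale the log-probabilities by the temperature, then re-exponentiate
    # and renormalize so the result is again a valid probability distribution.
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

probs = np.array([0.5, 0.3, 0.2])
print(reweight_distribution(probs, temperature=0.1))  # sharper, more predictable
print(reweight_distribution(probs, temperature=1.0))  # unchanged
print(reweight_distribution(probs, temperature=2.0))  # flatter, more surprising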
10. The single-layer LSTM model used to predict the next character?
The principle is quite simple: a single LSTM layer learns the statistical regularities of the words and characters in the training data, and the softmax layer then acts as a classifier that outputs a probability for each character in the vocabulary.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
11. Key points for generating text with an LSTM?
We can generate discrete sequence data by training a model to predict the next token or tokens given the previous tokens.
For text, such a model is called a language model. It can be word-level or character-level.
Sampling the next token requires a balance between trusting the model's judgment and introducing randomness.
One way to handle this is the softmax temperature. Always try a range of different temperatures to find the right one.
II. 8.1 Text generation with LSTM
Video location for the course corresponding to this blog post:
[...]
Implementing character-level LSTM text generation
Let's put these ideas in practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. In this example we will use some of the writings of Nietzsche, the late-19th century German philosopher (translated to English). The language model we will learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the English language.
Preparing the data
Let's start by downloading the corpus and converting it to lowercase:
from tensorflow import keras
import numpy as np

path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))
print(text[0:400])
Next, we will extract partially overlapping sequences of length maxlen, one-hot encode them, and pack them into a 3D Numpy array x of shape (sequences, maxlen, unique_characters). Simultaneously, we prepare an array y containing the corresponding targets: the one-hot encoded characters that come right after each extracted sequence.
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))

# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print(chars)
Building the network
Our network is a single LSTM layer followed by a Dense classifier with a softmax over all possible characters. Note, however, that recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in recent times.
from tensorflow.keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
Since our targets are one-hot encoded, we will use categorical_crossentropy as the loss to train the model:
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
Training the language model and sampling from it
Given a trained model and a seed text snippet, we generate new text by repeatedly:
- 1) Drawing from the model a probability distribution over the next character given the text available so far
- 2) Reweighting the distribution to a certain "temperature"
- 3) Sampling the next character at random according to the reweighted distribution
- 4) Adding the new character at the end of the available text
This is the code we use to reweight the original probability distribution coming out of the model, and draw a character index from it (the "sampling function"):
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
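As a quick sanity check (my own example, not from the book), you can call sample on a hand-made distribution and watch how temperature changes how often the most likely index wins:

toy_preds = [0.7, 0.2, 0.1]  # hypothetical next-character probabilities
print([sample(toy_preds, temperature=0.2) for _ in range(10)])  # mostly index 0
print([sample(toy_preds, temperature=1.5) for _ in range(10)])  # more varied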
Finally, this is the loop where we repeatedly train the model and generate text. We start generating text using a range of different temperatures after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of temperature on the sampling strategy.
import random
import sys

for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y, batch_size=128, epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
As you can see, a low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as "eterned" or "troveration"). With a high temperature, the local structure starts breaking down and most words look like semi-random strings of characters. Without a doubt, here 0.5 is the most interesting temperature for text generation in this specific setup. Always experiment with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.
Note that by training a bigger model, longer, on more data, you can achieve generated samples that will look much more coherent and realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is a distinction between what communications are about, and the statistical structure of the messages in which communications are encoded. To evidence this distinction, here is a thought experiment: what if human language did a better job at compressing communications, much like our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic statistical structure, thus making it impossible to learn a language model like we just did.
Takeaways
- We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
- In the case of text, such a model is called a "language model" and could be based on either words or characters.
- Sampling the next token requires balance between adhering to what the model judges likely, and introducing randomness.
- One way to handle this is the notion of softmax temperature. Always experiment with different temperatures to find the "right" one.