Recurrent Neural Networks and LSTM Networks
Recurrent Neural Networks (RNN)
Recurrent neural networks are widely applied to sequence data such as natural language, speech, and other sequential signals. Sequence data has a strong ordering structure, natural language being a typical example. Deep learning algorithms for sequence data have improved greatly over those of just a couple of years ago, which has given rise to many interesting applications: speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and more.
Notation:
Superscript [l]: denotes the l-th layer. For example, a^[4] is the activation of the 4th layer, and W^[5] and b^[5] are the parameters of the 5th layer.
Superscript (i): denotes the i-th example. For example, x^(i) is the input of the i-th training example.
Superscript <t>: denotes the t-th time step. For example, x^<t> is the input x at the t-th time step, and x^(i)<t> is the input at the t-th time step of example i.
Subscript i: denotes the i-th entry of a vector. For example, a_i^[l] denotes the i-th entry of the activations in layer l.
The basic structure of a recurrent neural network is as follows:
Briefly, x is the input sequence. If each x^<t> represents one character, then the word "hello" gives five inputs, representing h, e, l, l, o respectively. y is the output; we want ŷ^<t> to be the character that follows x^<t>. In this example x^<1> is the first character of "hello", namely 'h'; the character after 'h' is 'e', so we want ŷ^<1> to be 'e'.
RNN input x: (h, e, l, l, o)
RNN output ŷ: (whatever the network actually predicts)
Correct output labels y: (e, l, l, o, )
Here a^<t> is the hidden variable that records the state information from previous steps; it is used to compute the next hidden state a^<t+1> as well as ŷ^<t>.
Every RNN-cell in the figure has the same structure (and all of them share the same parameters), so the recurrent network above can also be drawn in the following form:
The model is somewhat like a hidden Markov model: each prediction depends only on the information passed along from the previous node, not directly on nodes further back.
The internal computation of each RNN-cell is as follows:
This is the basic RNN unit: from the current input x^<t> and a^<t-1> (the previous hidden state, which carries information about the past) it computes the predicted next character ŷ^<t>, as well as a^<t>, which is passed on to the next RNN unit.
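In the usual textbook formulation (the original figure is not reproduced here) the cell computes a^<t> = tanh(W_aa a^<t-1> + W_ax x^<t> + b_a) and ŷ^<t> = softmax(W_ya a^<t> + b_y). To make that concrete, here is a minimal numpy sketch of one RNN-cell forward step; the parameter names (Waa, Wax, Wya, ba, by) and the tiny sizes are assumptions made for this illustration, not code from the original post:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, params):
    """One forward step of a basic RNN cell.
    xt: input at time t, shape (n_x, m); a_prev: previous hidden state, shape (n_a, m)."""
    Waa, Wax, Wya, ba, by = (params[k] for k in ("Waa", "Wax", "Wya", "ba", "by"))
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)   # new hidden state a<t>
    yt_hat = softmax(Wya @ a_next + by)              # prediction y_hat<t>
    return a_next, yt_hat

# Tiny usage example with random parameters (hypothetical sizes).
n_x, n_a, n_y, m = 5, 4, 5, 1
rng = np.random.default_rng(0)
params = {"Waa": rng.standard_normal((n_a, n_a)), "Wax": rng.standard_normal((n_a, n_x)),
          "Wya": rng.standard_normal((n_y, n_a)), "ba": np.zeros((n_a, 1)), "by": np.zeros((n_y, 1))}
a, y_hat = rnn_cell_forward(rng.standard_normal((n_x, m)), np.zeros((n_a, m)), params)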
Computing the cost function
Every prediction ŷ^<t> produced during forward propagation is used to compute the cost function. The cost can be computed with the cross-entropy introduced earlier: take the cross-entropy between each prediction and the corresponding correct output label.
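For reference, this per-sequence loss can be written in the standard cross-entropy form (a standard formula, not reproduced from the original figure, assuming one-hot labels y^<t>):

\mathcal{L}(\hat{y}, y) = \sum_{t} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\sum_{t} \sum_{i} y_i^{<t>} \log \hat{y}_i^{<t>}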
With the RNN-cell computation defined, the recurrent network can now perform forward propagation. Modern deep learning frameworks such as TensorFlow make this very convenient: once the forward pass and the cost function are defined, the framework performs the backpropagation computation for you.
If you are not building the network with TensorFlow, the backpropagation computation for the recurrent network is still outlined here. (Note: if the differentiation becomes too complicated, software such as Calculus can compute the derivatives and show the steps.)
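The formulas themselves are not reproduced in this post, so as an illustration here is a minimal numpy sketch of the backward pass through a single RNN-cell (only the tanh hidden-state update, ignoring the output layer); the variable names mirror the forward sketch above and are assumptions for this example:

def rnn_cell_backward(da_next, cache):
    """Backward pass through one RNN-cell for a<t> = tanh(Waa a<t-1> + Wax x<t> + ba).
    da_next: gradient of the loss w.r.t. a<t>; cache holds numpy arrays saved in the forward pass."""
    a_next, a_prev, xt, Waa, Wax = cache
    dz = (1 - a_next ** 2) * da_next          # derivative through tanh
    dWax = dz @ xt.T                          # gradient w.r.t. Wax
    dWaa = dz @ a_prev.T                      # gradient w.r.t. Waa
    dba = dz.sum(axis=1, keepdims=True)       # gradient w.r.t. ba
    dxt = Wax.T @ dz                          # gradient passed back to the input
    da_prev = Waa.T @ dz                      # gradient passed back to the previous hidden state
    return {"dWax": dWax, "dWaa": dWaa, "dba": dba, "dxt": dxt, "da_prev": da_prev}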
Sampling from a trained model
Continuing with the example above: during sampling the user supplies the first character x^<1> = 'h'. The network predicts that the most probable character following 'h' is 'e'; the predicted 'e' is fed back in as the next input, and the process repeats recursively. As for when to stop, we can simply stop after a fixed number of characters has been generated. Note: when training on a large corpus, in order to keep the generated text diverse we usually do not pick the single most probable character as the next input; instead we sample from the predicted probability distribution (probabilistic or random sampling).
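A minimal sketch of this generation loop, assuming a predict(one_hot) helper that returns the next-character distribution, plus vocab_to_int / int_to_vocab dictionaries like the ones built in the code later in this post:

import numpy as np

def generate(predict, seed_char, length, vocab_to_int, int_to_vocab, greedy=False):
    """Generate `length` characters starting from `seed_char`.
    `predict(one_hot)` is assumed to return a normalized distribution over the vocabulary."""
    vocab_size = len(vocab_to_int)
    x = np.zeros(vocab_size)
    x[vocab_to_int[seed_char]] = 1.0                      # one-hot encode the first character
    out = seed_char
    for _ in range(length):
        p = predict(x)
        if greedy:
            idx = int(np.argmax(p))                       # always take the most probable character
        else:
            idx = int(np.random.choice(vocab_size, p=p))  # sample to keep the output diverse
        out += int_to_vocab[idx]
        x = np.zeros(vocab_size)
        x[idx] = 1.0                                      # feed the chosen character back in
    return out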
LSTM (Long Short-Term Memory networks)
The sections above showed how a sequence model can be trained and sampled from. However, because the unrolled recurrent network has very many layers, it suffers from the vanishing gradient problem. The direct consequence of this problem is illustrated below. Suppose the following two sentences are used for training:
"The cat, which already ate……,was full"
"The cats, which already ate……,were full"
逗號表示由很多個單詞。
We want the model to learn that when "cat" appears it should later use "was", and when "cats" appears it should use "were". Because of the vanishing gradient problem, a plain recurrent network is essentially unable to learn this kind of long-range dependency in verb agreement. This is what motivates the LSTM (Long Short-Term Memory network).
The overall structure of an LSTM is very similar to that of an RNN: it is still applied recurrently, but the RNN-cell is replaced by an LSTM-cell. An LSTM-cell can be represented as follows:
Γ_f: Forget gate. In this example, suppose each input is a single word and we want the LSTM to keep track of grammatical structure, for example whether the subject is singular or plural. If the subject changes from singular to plural, we need a way to discard the previously stored singular/plural state. In an LSTM, the forget gate lets us do this:
Γ_f^<t> = σ(W_f [a^<t-1>, x^<t>] + b_f)
The values computed by this formula lie between 0 and 1. The forget-gate vector is multiplied element-wise with the previous cell state: where a value is 0 (or close to 0), the LSTM should discard that piece of the previous state (for example, that the subject was singular); where it is 1, the information should be kept.
Γ_u: Update gate. Once we have forgotten that the subject is singular, we need a way to update the state to reflect that the new subject is plural. The update gate is:
Γ_u^<t> = σ(W_u [a^<t-1>, x^<t>] + b_u)
The update gate is multiplied element-wise with the candidate value c̃^<t> when computing the new cell state c^<t>.
To store the new subject we need to create a new candidate vector c̃^<t> that can be added into the previous cell state. Its expression is:
c̃^<t> = tanh(W_c [a^<t-1>, x^<t>] + b_c)
Finally, the new cell state is (with element-wise products):
c^<t> = Γ_f^<t> * c^<t-1> + Γ_u^<t> * c̃^<t>
Γ_o: Output gate, which controls what the cell outputs:
Γ_o^<t> = σ(W_o [a^<t-1>, x^<t>] + b_o)
a^<t> = Γ_o^<t> * tanh(c^<t>)
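Putting the four gates together, here is a minimal numpy sketch of one LSTM-cell forward step following the equations above. The parameterization (weights Wf, Wu, Wc, Wo acting on the concatenation [a^<t-1>, x^<t>]) is an assumption for this illustration; the TensorFlow code further below uses separate input and recurrent weight matrices instead.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(xt, a_prev, c_prev, p):
    """One forward step of an LSTM cell.
    xt: (n_x, m); a_prev, c_prev: (n_a, m); weights in `p` act on the concatenation [a_prev; xt]."""
    concat = np.vstack((a_prev, xt))
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])    # forget gate
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])    # update gate
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])    # candidate cell state
    c_next = gamma_f * c_prev + gamma_u * c_tilde    # new cell state
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])    # output gate
    a_next = gamma_o * np.tanh(c_next)               # new hidden state
    cache = (gamma_f, gamma_u, c_tilde, gamma_o, c_prev, a_prev, xt, c_next)
    return a_next, c_next, cache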
Connecting the LSTM-cells one after another gives the forward propagation of the LSTM.
In TensorFlow, the forward-propagation computation can be constructed directly from this network structure; the LSTM code is given below.
If you are not using TensorFlow, here is a brief outline of what the LSTM backward pass computes (a numpy sketch follows the list):
Gate derivatives
Parameter derivatives
Derivatives with respect to the previous hidden state, the previous memory (cell) state, and the input
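As an illustration of those three groups of derivatives, here is a minimal numpy sketch of the backward pass through one LSTM-cell, matching the forward sketch given earlier (it ignores peephole connections, and the names are assumptions rather than formulas taken from the original post):

import numpy as np

def lstm_cell_backward(da_next, dc_next, cache, p, n_a):
    """Backward pass through one LSTM cell. `cache` is the tuple returned by
    lstm_cell_forward; `p` holds the weights Wf, Wu, Wc, Wo; n_a is the hidden size."""
    gamma_f, gamma_u, c_tilde, gamma_o, c_prev, a_prev, xt, c_next = cache
    tanh_c = np.tanh(c_next)
    dc = dc_next + da_next * gamma_o * (1 - tanh_c ** 2)   # total gradient reaching c<t>

    # Gate derivatives (w.r.t. the gates' pre-activations).
    dgo = da_next * tanh_c * gamma_o * (1 - gamma_o)
    dct = dc * gamma_u * (1 - c_tilde ** 2)
    dgu = dc * c_tilde * gamma_u * (1 - gamma_u)
    dgf = dc * c_prev * gamma_f * (1 - gamma_f)

    # Parameter derivatives.
    concat = np.vstack((a_prev, xt))
    grads = {"dWf": dgf @ concat.T, "dWu": dgu @ concat.T,
             "dWc": dct @ concat.T, "dWo": dgo @ concat.T,
             "dbf": dgf.sum(axis=1, keepdims=True), "dbu": dgu.sum(axis=1, keepdims=True),
             "dbc": dct.sum(axis=1, keepdims=True), "dbo": dgo.sum(axis=1, keepdims=True)}

    # Derivatives w.r.t. the previous hidden state, previous cell state, and input.
    dconcat = (p["Wf"].T @ dgf + p["Wu"].T @ dgu +
               p["Wc"].T @ dct + p["Wo"].T @ dgo)
    grads["da_prev"] = dconcat[:n_a, :]
    grads["dc_prev"] = dc * gamma_f
    grads["dxt"] = dconcat[n_a:, :]
    return grads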
# Import the required packages
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

# Read the Shakespeare text files
path = "./shakespeare"              # folder containing the text files
files = os.listdir(path)            # all file names in the folder
text = ""
for file in files:                  # iterate over the folder
    if not os.path.isdir(file):     # only open entries that are not directories
        f = open(path + "/" + file, 'r')
        iter_f = iter(f)            # create an iterator over the file
        str = ""
        for line in iter_f:         # read the file line by line
            str = str + line
        text += str                 # append this file's text
print('Data size %d' % len(text))

# Build a dictionary of every character appearing in the text and number them
vocab = set(text)
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))

# Split the text into a validation set and a training set
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

vocabulary_size = len(set(text))
batch_size = 64
num_unrollings = 10

# Batch generator used for training
class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
        for b in range(self._batch_size):
            batch[b, vocab_to_int[self._text[self._cursor[b]]]] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones."""
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

# Turn probability distributions back into characters
def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [int_to_vocab[c] for c in np.argmax(probabilities, 1)]

# Convert batches back into strings
def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

# Sampling helpers used for character generation
def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of
    normalized probabilities."""
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b / np.sum(b, 1)[:, None]

num_nodes = 64

# Build the LSTM graph
graph = tf.Graph()
with graph.as_default():
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier; it only runs after saved_output and saved_state were assigned.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits, labels=tf.concat(train_labels, 0)))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
    saver = tf.train.Saver()

num_steps = 37001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        # Feed each unrolled position with the corresponding batch.
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        # Run one training step.
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f'
                  % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f'
                  % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(1379):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(prediction)[0]
                    print(sentence)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f'
                  % float(np.exp(valid_logprob / valid_size)))
Experimental results:
The output after running for 1 hour on an Alibaba Cloud server:
Iivind to sh te no sh bnt if tete t segsunsed mrovd of teauty sormw And ttmdi n,
Aut teatme
Tr whene tn toer tone tf tftwaas srmtoundsng miser and to e stnce whrks anl the r sfer s aar thoarered
Tnd tore toaet oned
Th thol'nte miye thle theu h the seert
And stael soar,
Then trewgd then sirt thauld ttyle oegg, trwroun hhe weaath
Ahve tou sere nles thet ts teailesTour noy th mtye Autt ther ihere ore tavh trr h st tavh tyes
And tese,tove a dot su g,sart,
You can see that it has already learned some words, as well as how to split the text into segments.
The output after running for 3 hours on an Alibaba Cloud server:
L
Thosethet tn the sime tirte orove , Ao tolb tn the srlt shen t antrtn,
wo myst n st
Aut tune eye ds tocjenedtitl be tot ing mh trne
7
To ivesi soi teat uent temieeksand tvteem,oolrh oeintitn the sear frel d snd mfe
Tided itml r tnpge taovang snuie tor trrg r ios morld ty Ay ledd, o sii e thi ift tfrerlewe aow that toar f beathed wea
Tnd tincle silh tuape the saerher of atne tor mhes ht
Thet teve sn tour siic m'hle teseishne tn ty hlay ahat tldyn t coture tor my sene,
It has begun to learn that paragraphs can be given labels, like the '7' that appears above.