RNN Tutorial Part 1 – Introduction to RNNs


Repost – Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs

Recurrent Neural Networks (RNNs) are popular models with important applications in natural language processing. However, detailed write-ups of the RNN architecture and of how to implement the algorithm are still scarce, so the purpose of this post is to translate the original English article and help readers understand it. The original article is quite in-depth and this translation may contain mistakes; if you are interested, please refer to the English original. The tutorial is divided into four parts:

  1. Introduction to RNNs (this post)
  2. Implementing an RNN with numpy and Theano
  3. The BPTT algorithm and the vanishing gradient problem
  4. Implementing LSTM and GRU models

As part of the tutorial we will implement a recurrent neural network based language model. The applications of language models are two-fold: First, it allows us to score arbitrary sentences based on how likely they are to occur in the real world. This gives us a measure of grammatical and semantic correctness. Such models are typically used as part of Machine Translation systems. Secondly, a language model allows us to generate new text (I think that’s the much cooler application). Training a language model on Shakespeare allows us to generate Shakespeare-like text. This fun post by Andrej Karpathy demonstrates what character-level language models based on RNNs are capable of.

I’m assuming that you are somewhat familiar with basic Neural Networks. If you’re not, you may want to head over to Implementing A Neural Network From Scratch, which guides you through the ideas and implementation behind non-recurrent networks.

 

What are RNNs?

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence you better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later). Here is what a typical RNN looks like:

Figure 1. A recurrent neural network (RNN) unrolled over time.

 

The above diagram shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. The formulas that govern the computation happening in an RNN are as follows:

  • $x_t$ is the input at time step $t$. For example, $x_1$ could be a one-hot vector corresponding to the second word of a sentence.
  • $s_t$ is the hidden state at time step $t$. It’s the “memory” of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step: \[s_t=f(Ux_t+Ws_{t-1})\] The function $f$ is usually a nonlinearity such as tanh or ReLU. $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeroes.
  • $o_t$ is the output at time step $t$. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary: \[o_t=\mathrm{softmax}(Vs_t)\]
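To make the two formulas above concrete, here is a minimal numpy sketch of this forward pass. The dimensions, the random initialization, and helper names such as forward_pass are illustrative assumptions for this translation, not code from the original post:

    import numpy as np

    def softmax(z):
        # Numerically stable softmax over a vector of scores.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def forward_pass(x, U, V, W):
        # x: list of word indices; a one-hot input just selects a column of U.
        # U: (hidden_dim, vocab_size), W: (hidden_dim, hidden_dim), V: (vocab_size, hidden_dim)
        s = np.zeros(W.shape[0])              # s_{-1}, initialized to all zeroes
        states, outputs = [], []
        for t in range(len(x)):
            s = np.tanh(U[:, x[t]] + W @ s)   # s_t = f(U x_t + W s_{t-1}), with f = tanh
            o = softmax(V @ s)                # o_t = softmax(V s_t)
            states.append(s)
            outputs.append(o)
        return np.array(outputs), np.array(states)

    # Tiny usage example with random parameters (sizes are arbitrary).
    vocab_size, hidden_dim = 8, 4
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (hidden_dim, vocab_size))
    W = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
    V = rng.normal(0, 0.1, (vocab_size, hidden_dim))
    o, s = forward_pass([1, 5, 2], U, V, W)   # a three-word "sentence" of word indices
    print(o.shape)                            # (3, 8): one probability vector per time step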

A few things to note:

  • You can think of the hidden state $s_t$ as the memory of the network, but in practice it cannot retain information from many steps back.
  • Unlike a traditional neural network, which uses different parameters at each layer, an RNN shares the same parameters ($U, V, W$) across all time steps. This means we perform the same computation at every step, only with a different input $x_t$, which greatly reduces the total number of parameters we need to learn.
  • There is an output at each time step, but depending on the task we may not need all of them. For example, when predicting the sentiment of a sentence we may only care about the final output, rather than the output after each word (although using those is also possible). The main feature of this model is that the information about the whole sequence is captured in the RNN’s hidden state.

 

What can RNNs do?

RNNs have shown great success in many NLP tasks. At this point I should mention that the most commonly used type of RNNs are LSTMs, which are much better at capturing long-term dependencies than vanilla RNNs are. But don’t worry, LSTMs are essentially the same thing as the RNN we will develop in this tutorial; they just have a different way of computing the hidden state. We’ll cover LSTMs in more detail in a later post. Here are some example applications of RNNs in NLP (by no means an exhaustive list).

 

Language Modeling and Generating Text

Given a sequence of words we want to predict the probability of each word given the previous words. Language Models allow us to measure how likely a sentence is, which is an important input for Machine Translation (since high-probability sentences are typically correct). A side-effect of being able to predict the next word is that we get a generative model, which allows us to generate new text by sampling from the output probabilities. And depending on what our training data is we can generate all kinds of stuff. In Language Modeling our input is typically a sequence of words (encoded as one-hot vectors for example), and our output is the sequence of predicted words. When training the network we set $o_t=x_{t+1}$ since we want the output at step $t$ to be the actual next word.
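To illustrate the $o_t=x_{t+1}$ setup, here is a small, hedged sketch of how the training pairs for such a language model can be built. The toy sentence, the tokenization, and the variable names are assumptions, not material from the original post:

    # The target at each step is simply the next word of the sentence.
    sentence = "the cat sat on the mat".split()
    vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}

    ids = [vocab[w] for w in sentence]
    x_train = ids[:-1]   # x_t:     every word except the last
    y_train = ids[1:]    # x_{t+1}: the same sequence shifted left by one

    print(list(zip(x_train, y_train)))
    # Each pair asks the RNN to predict y_t after seeing x_t
    # (plus all earlier words, through the hidden state).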

 

Related papers on language modeling and text generation:

 

Machine Translation

Machine Translation is similar to language modeling in that our input is a sequence of words in our source language (e.g. German). We want to output a sequence of words in our target language (e.g. English). A key difference is that our output only starts after we have seen the complete input, because the first word of our translated sentences may require information captured from the complete input sequence.

Figure 2. RNN for Machine Translation.

Related papers on Machine Translation:

 

Speech Recognition

Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.

Related papers on Speech Recognition:

 

Generating Image Descriptions

Together with convolutional Neural Networks, RNNs have been used as part of a model to generate descriptions for unlabeled images. It’s quite amazing how well this seems to work. The combined model even aligns the generated words with features found in the images.

Figure 3. Deep Visual-Semantic Alignments for Generating Image Descriptions.

 

Training RNNs

Training an RNN is very similar to training a traditional Neural Network: we also use the backpropagation algorithm, but with a twist. Because the parameters are shared by all time steps, the gradient at each step depends not only on the calculations of the current step, but also on the gradients from all previous steps. For example, in order to calculate the gradient at $t=4$ we would need to backpropagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT). If this doesn’t make a whole lot of sense yet, don’t worry, we’ll have a whole post on the gory details. For now, just be aware of the fact that vanilla RNNs trained with BPTT have difficulties learning long-term dependencies (e.g. dependencies between steps that are far apart) due to what is called the vanishing/exploding gradient problem. There exists some machinery to deal with these problems, and certain types of RNNs (like LSTMs) were specifically designed to get around them.
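As a rough preview of what BPTT looks like in code (the full implementation is the subject of the next posts), here is a hedged sketch that reuses the $U, V, W$ layout from the forward-pass sketch above; the function name and the exact loop structure are assumptions of this translation, not the original author’s code:

    import numpy as np

    def bptt(x, y, s, o, U, V, W):
        # x, y: input and target word indices for one sentence
        # s: (T, hidden_dim) hidden states, o: (T, vocab_size) softmax outputs from the forward pass
        T = len(y)
        dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
        for t in reversed(range(T)):
            # Gradient of the cross-entropy loss at step t w.r.t. the softmax input V s_t.
            delta_o = o[t].copy()
            delta_o[y[t]] -= 1.0
            dV += np.outer(delta_o, s[t])
            # Backpropagate into the hidden state, then keep going back through time,
            # summing each earlier step's contribution (the "through time" part).
            delta_t = (V.T @ delta_o) * (1.0 - s[t] ** 2)        # tanh'(a) = 1 - tanh(a)^2
            for step in range(t, -1, -1):
                s_prev = s[step - 1] if step > 0 else np.zeros_like(s[0])   # s_{-1} = 0
                dW += np.outer(delta_t, s_prev)
                dU[:, x[step]] += delta_t                        # one-hot input: one column of U
                delta_t = (W.T @ delta_t) * (1.0 - s_prev ** 2)
        return dU, dV, dW

The repeated multiplication by $W^T$ and by the tanh derivative in the inner loop is exactly where the vanishing/exploding gradient problem mentioned above comes from.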

RNN Extensions

Over the years researchers have developed more sophisticated types of RNNs to deal with some of the shortcomings of the vanilla RNN model. We will cover them in more detail in a later post, but I want this section to serve as a brief overview so that you are familiar with the taxonomy of models.

  • Bidirectional RNNs are based on the idea that the output at time $t$ may depend not only on the previous elements in the sequence, but also on future elements. For example, to predict a missing word in a sequence you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other. The output is then computed based on the hidden states of both RNNs.

(Figure: a bidirectional RNN)

  • Deep (Bidirectional) RNNs are similar to Bidirectional RNNs, except that we now have multiple layers per time step. In practice this gives us a higher learning capacity (but we also need a lot of training data).

(Figure: a deep bidirectional RNN)

  • LSTM networks are quite popular these days and we briefly talked about them above. LSTMs don’t have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state. The memory in LSTMs is held in cells, which you can think of as black boxes that take as input the previous state $h_{t-1}$ and the current input $x_t$. Internally these cells decide what to keep in (and what to erase from) memory. They then combine the previous state, the current memory, and the input. It turns out that these types of units are very efficient at capturing long-term dependencies. LSTMs can be quite confusing in the beginning, but if you’re interested in learning more, this post has an excellent explanation.
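For reference, one common formulation of the LSTM update, which Part 4 of this tutorial covers in detail, looks as follows. The hidden state is written as $s_t$ for consistency with the equations above; the gate symbols are not defined in the original text of this part, and $o_t$ here denotes the output gate, not the softmax output from earlier:

\[i_t=\sigma(U^i x_t + W^i s_{t-1})\]
\[f_t=\sigma(U^f x_t + W^f s_{t-1})\]
\[o_t=\sigma(U^o x_t + W^o s_{t-1})\]
\[g_t=\tanh(U^g x_t + W^g s_{t-1})\]
\[c_t=c_{t-1}\circ f_t + g_t\circ i_t\]
\[s_t=\tanh(c_t)\circ o_t\]

Here $\sigma$ is the sigmoid function and $\circ$ denotes element-wise multiplication. The input, forget, and output gates $i_t$, $f_t$, $o_t$ decide how much of the new candidate $g_t$ to write into the cell $c_t$, how much of the old cell to keep, and how much of the cell to expose as the new hidden state.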

Conclusion

So far we have introduced the basic RNN architecture and the areas where it is applied. In the next post we will implement a first RNN model using Python’s numpy and Theano.

If you have any questions, please leave a comment.

