[cs231n] Assignment 3, Problem 1 Selected Notes: Understanding RNNs and Image-Captioning Training Through Code


A good set of assignment 3 reference materials (with answers)

Understanding RNN Cells

Behavior of a single RNN cell

Dimensions are given in parentheses.

Forward pass

def rnn_step_forward(x, prev_h, Wx, Wh, b):
  """
  Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
  activation function.

  The input data has dimension D, the hidden state has dimension H, and we use
  a minibatch size of N.

  Inputs:
  - x: Input data for this timestep, of shape (N, D).
  - prev_h: Hidden state from previous timestep, of shape (N, H)
  - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
  - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
  - b: Biases of shape (H,)

  Returns a tuple of:
  - next_h: Next hidden state, of shape (N, H)
  - cache: Tuple of values needed for the backward pass.
  """
  next_h, cache = None, None
  ##############################################################################
  # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
  # hidden state and any values you need for the backward pass in the next_h   #
  # and cache variables respectively.                                          #
  ##############################################################################
  next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)
  cache = (x, Wx, Wh, prev_h, next_h)
  ##############################################################################
  #                               END OF YOUR CODE                             #
  ##############################################################################
  return next_h, cache
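
As a quick sanity check (sizes below are arbitrary, not taken from the assignment), calling the step function on random arrays confirms that it maps an (N, D) input and an (N, H) hidden state to a new (N, H) hidden state:

import numpy as np

N, D, H = 3, 10, 4
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)   # (3, 4) -- one new hidden state per sample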

Backward pass

def rnn_step_backward(dnext_h, cache):
  """
  Backward pass for a single timestep of a vanilla RNN.
  
  Inputs:
  - dnext_h: Gradient of loss with respect to next hidden state
  - cache: Cache object from the forward pass
  
  Returns a tuple of:
  - dx: Gradients of input data, of shape (N, D)
  - dprev_h: Gradients of previous hidden state, of shape (N, H)
  - dWx: Gradients of input-to-hidden weights, of shape (D, H)
  - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
  - db: Gradients of bias vector, of shape (H,)
  """
  dx, dprev_h, dWx, dWh, db = None, None, None, None, None
  ##############################################################################
  # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
  #                                                                            #
  # HINT: For the tanh function, you can compute the local derivative in terms #
  # of the output value from tanh.                                             #
  ##############################################################################
  x, Wx, Wh, prev_h, next_h = cache
  dtanh = 1 - next_h**2
  dx = (dnext_h*dtanh).dot(Wx.T)
  dWx = x.T.dot(dnext_h*dtanh)
  dprev_h = (dnext_h*dtanh).dot(Wh.T)
  dWh = prev_h.T.dot(dnext_h*dtanh)
  db = np.sum(dnext_h*dtanh,axis=0)
  ##############################################################################
  #                               END OF YOUR CODE                             #
  ##############################################################################
  return dx, dprev_h, dWx, dWh, db
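
A centered-difference gradient check is the easiest way to verify the tanh local derivative. The sketch below uses a small hand-rolled numeric gradient rather than the assignment's eval_numerical_gradient helpers, so treat it as illustrative:

import numpy as np

def num_grad(f, x, df, h=1e-5):
  """Numeric gradient of an array-valued f at x, contracted with the upstream gradient df."""
  grad = np.zeros_like(x)
  it = np.nditer(x, flags=['multi_index'])
  while not it.finished:
    ix = it.multi_index
    old = x[ix]
    x[ix] = old + h; pos = f(x).copy()
    x[ix] = old - h; neg = f(x).copy()
    x[ix] = old
    grad[ix] = np.sum((pos - neg) * df) / (2 * h)
    it.iternext()
  return grad

N, D, H = 2, 3, 4
x, prev_h = np.random.randn(N, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
dnext_h = np.random.randn(N, H)

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

dx_num = num_grad(lambda x_: rnn_step_forward(x_, prev_h, Wx, Wh, b)[0], x, dnext_h)
print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8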

Behavior of a single-layer RNN over a sequence

x of shape (N, T, D): a minibatch of N samples, each a sequence of T word vectors, each vector of dimension D.

Each RNN hidden state sends its output in two directions: up to the next layer (the output layer) and sideways to the next timestep within the same layer. During backpropagation the two incoming gradients must therefore be added. The gradient from the output layer can be computed directly (or obtained recursively from the layer above), so it is stored in dh of shape (N, T, H); the gradient flowing along the time dimension, however, has to be computed recursively within the layer itself.

Forward pass

def rnn_forward(x, h0, Wx, Wh, b):
  """
  Run a vanilla RNN forward on an entire sequence of data. We assume an input
  sequence composed of T vectors, each of dimension D. The RNN uses a hidden
  size of H, and we work over a minibatch containing N sequences. After running
  the RNN forward, we return the hidden states for all timesteps.
  
  Inputs:
  - x: Input data for the entire timeseries, of shape (N, T, D).
  - h0: Initial hidden state, of shape (N, H)
  - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
  - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
  - b: Biases of shape (H,)
  
  Returns a tuple of:
  - h: Hidden states for the entire timeseries, of shape (N, T, H).
  - cache: Values needed in the backward pass
  """
  h, cache = None, None
  ##############################################################################
  # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
  # input data. You should use the rnn_step_forward function that you defined  #
  # above.                                                                     #
  ##############################################################################
  N, T, D = x.shape
  _, H = h0.shape
  h = np.zeros((N,T,H))
  h_next = h0
  cache = []
  for i in range(T):
    # run one timestep; the hidden state it produces feeds the next timestep
    h[:,i,:], cache_next = rnn_step_forward(x[:,i,:], h_next, Wx, Wh, b)
    h_next = h[:,i,:]
    cache.append(cache_next)
  ##############################################################################
  #                               END OF YOUR CODE                             #
  ##############################################################################
  return h, cache
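
A small shape check (arbitrary sizes again) makes the loop concrete: the T per-step hidden states are stacked along axis 1, and one cache entry is stored per timestep:

import numpy as np

N, T, D, H = 2, 5, 3, 4
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)

h, cache = rnn_forward(x, h0, Wx, Wh, b)
print(h.shape, len(cache))   # (2, 5, 4) 5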

Backward pass for a single-layer RNN

def rnn_backward(dh, cache):
  """
  Compute the backward pass for a vanilla RNN over an entire sequence of data.
  
  Inputs:
  - dh: Upstream gradients of all hidden states, of shape (N, T, H)
  
  Returns a tuple of:
  - dx: Gradient of inputs, of shape (N, T, D)
  - dh0: Gradient of initial hidden state, of shape (N, H)
  - dWx: Gradient of input-to-hidden weights, of shape (D, H)
  - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
  - db: Gradient of biases, of shape (H,)
  """
  dx, dh0, dWx, dWh, db = None, None, None, None, None
  ##############################################################################
  # TODO: Implement the backward pass for a vanilla RNN running an entire      #
  # sequence of data. You should use the rnn_step_backward function that you   #
  # defined above.                                                             #
  ##############################################################################
  x, Wx, Wh, prev_h, next_h = cache[-1]
  _, D = x.shape
  N, T, H = dh.shape
  dx = np.zeros((N,T,D))
  dh0 = np.zeros((N,H))
  dWx = np.zeros((D,H))
  dWh = np.zeros((H,H))
  db = np.zeros(H)
  dprev_h_ = np.zeros((N,H))
  for i in range(T-1,-1,-1):
    # the gradient arriving at timestep i is the upstream gradient dh[:,i,:]
    # plus the gradient flowing back through time from timestep i+1
    dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:,i,:] + dprev_h_, cache.pop())
    dx[:,i,:] = dx_
    dh0 = dprev_h_   # after the last iteration this is the gradient w.r.t. h0
    dWx += dWx_      # weights are shared across timesteps, so their gradients accumulate
    dWh += dWh_
    db += db_
  ##############################################################################
  #                               END OF YOUR CODE                             #
  ##############################################################################
  return dx, dh0, dWx, dWh, db

Understanding the image-captioning pipeline

The forward-pass pipeline is as follows:

(figure: forward-pass flow of the captioning model)

A few interesting points

Mapping between words and vectors

Two mappings are involved:

  • One maps caption_in to vectors of the output-node dimensionality; the mapping matrix is a parameter that must be learned.
  • The other maps an output-node vector back to a word; this uses a dedicated mapping function, while the output nodes themselves are what the network learns.

The first mapping:

caption_in and caption_out are the input and the target (caption_in = caption[:-1], caption_out = caption[1:]); ignoring the batch dimension they are 1-D integer arrays. The embedding matrix We (W_embed in the code) maps them into the word-vector space; the transformation and its backward pass are shown below.
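
For a concrete (entirely made-up) caption the offset looks like this:

# hypothetical vocabulary indices: 1 = <START>, 2 = <END>
caption     = [1, 5, 8, 3, 2]   # <START> a cat sleeps <END>
caption_in  = caption[:-1]      # [1, 5, 8, 3]  fed into the RNN
caption_out = caption[1:]       # [5, 8, 3, 2]  what the RNN is trained to predict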

def word_embedding_forward(x, W):
  """
  Forward pass for word embeddings. We operate on minibatches of size N where
  each sequence has length T. We assume a vocabulary of V words, assigning each
  to a vector of dimension D.
  
  Inputs:
  - x: Integer array of shape (N, T) giving indices of words. Each element idx
    of x must be in the range 0 <= idx < V.
  - W: Weight matrix of shape (V, D) giving word vectors for all words.
  
  Returns a tuple of:
  - out: Array of shape (N, T, D) giving word vectors for all input words.
  - cache: Values needed for the backward pass
  """

  out = W[x, :]
  cache = (W, x)
  
  return out, cache
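
The whole forward pass is just NumPy fancy indexing: indexing W with an integer array of shape (N, T) gathers the corresponding rows into an (N, T, D) array. A tiny made-up example:

import numpy as np

W = np.arange(12).reshape(4, 3)    # V = 4 words, D = 3
x = np.array([[0, 2],
              [3, 3]])             # N = 2 sequences of T = 2 word indices
out = W[x, :]
print(out.shape)                   # (2, 2, 3)
print(out[1, 0])                   # [ 9 10 11] -- the embedding of word index 3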

A note on the backward pass: this is not a chain-rule gate in the usual sense. Reasoning through the mapping, the gradients for repeated word indices have to be accumulated into the same rows of dW; note the use of np.add.at() (the ufunc .at() method).

def word_embedding_backward(dout, cache):
  """
  Backward pass for word embeddings. We cannot back-propagate into the words
  since they are integers, so we only return gradient for the word embedding
  matrix.
  
  HINT: Look up the function np.add.at
  
  Inputs:
  - dout: Upstream gradients of shape (N, T, D)
  - cache: Values from the forward pass
  
  Returns:
  - dW: Gradient of word embedding matrix, of shape (V, D).
  """
  W, x = cache
  dW = np.zeros_like(W)
  #dW[x] += dout # this will not work, see the doc of np.add.at
  np.add.at(dW, x, dout)
  
  return dW
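
The reason dW[x] += dout does not work is that NumPy buffers fancy-indexed in-place assignment, so a word index that appears several times in x only receives one contribution; np.add.at performs the addition unbuffered, once per occurrence. A minimal demonstration:

import numpy as np

idx = np.array([1, 1, 2])   # index 1 appears twice

dW = np.zeros(4)
dW[idx] += 1.0
print(dW)                   # [0. 1. 1. 0.] -- the second contribution to index 1 is lost

dW = np.zeros(4)
np.add.at(dW, idx, 1.0)
print(dW)                   # [0. 2. 1. 0.] -- both contributions are accumulated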

The second mapping:

The usual multi-dimensional y = xW computation:

 y = x.reshape(x.shape[0], -1).dot(w) + b                         # keep N; flatten everything after it into one dimension

The y = xW computation used here:

y = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b         # y = x.dot(w) + b gives the same result: dot contracts over the last axis of x, and b broadcasts over the trailing dimension

The forward pass is not the issue; the difference shows up when computing the gradients, where the two cases are handled slightly differently. As long as you keep this in mind and check the dimensions carefully while deriving the expressions (make sure the two factors being multiplied have compatible shapes, and that the shape of the result matches what the formula requires), the derivation goes through without trouble.
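
A quick check (arbitrary sizes) that the two forward formulations really do agree, which also fixes the shapes to track in the backward pass, namely dout of shape (N, T, M), x of shape (N, T, D), and w of shape (D, M):

import numpy as np

N, T, D, M = 2, 3, 4, 5
x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)

out1 = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
out2 = x.dot(w) + b              # dot contracts over the last axis of x; b broadcasts
print(np.allclose(out1, out2))   # True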

Code for case 1:

def affine_forward(x, w, b):
  """
  Computes the forward pass for an affine (fully-connected) layer.

  The input x has shape (N, d_1, ..., d_k) where x[i] is the ith input.
  We multiply this against a weight matrix of shape (D, M) where
  D = \prod_i d_i

  Inputs:
  x - Input data, of shape (N, d_1, ..., d_k)
  w - Weights, of shape (D, M)
  b - Biases, of shape (M,)
  
  Returns a tuple of:
  - out: output, of shape (N, M)
  - cache: (x, w, b)
  """
  out = x.reshape(x.shape[0], -1).dot(w) + b
  cache = (x, w, b)
  return out, cache


def affine_backward(dout, cache):
  """
  Computes the backward pass for an affine layer.

  Inputs:
  - dout: Upstream derivative, of shape (N, M)
  - cache: Tuple of:
    - x: Input data, of shape (N, d_1, ... d_k)
    - w: Weights, of shape (D, M)

  Returns a tuple of:
  - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
  - dw: Gradient with respect to w, of shape (D, M)
  - db: Gradient with respect to b, of shape (M,)
  """
  x, w, b = cache
  dx = dout.dot(w.T).reshape(x.shape)
  dw = x.reshape(x.shape[0], -1).T.dot(dout)
  db = np.sum(dout, axis=0)
  return dx, dw, db

Code for case 2:

def temporal_affine_forward(x, w, b):
  """
  Forward pass for a temporal affine layer. The input is a set of D-dimensional
  vectors arranged into a minibatch of N timeseries, each of length T. We use
  an affine function to transform each of those vectors into a new vector of
  dimension M.

  Inputs:
  - x: Input data of shape (N, T, D)
  - w: Weights of shape (D, M)
  - b: Biases of shape (M,)
  
  Returns a tuple of:
  - out: Output data of shape (N, T, M)
  - cache: Values needed for the backward pass
  """
  N, T, D = x.shape
  M = b.shape[0]
  # out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
  out = x.dot(w) + b
  cache = x, w, b, out
  return out, cache


def temporal_affine_backward(dout, cache):
  """
  Backward pass for temporal affine layer.

  Input:
  - dout: Upstream gradients of shape (N, T, M)
  - cache: Values from forward pass

  Returns a tuple of:
  - dx: Gradient of input, of shape (N, T, D)
  - dw: Gradient of weights, of shape (D, M)
  - db: Gradient of biases, of shape (M,)
  """
  x, w, b, out = cache
  N, T, D = x.shape
  M = b.shape[0]

  #dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
  #dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
  dx = dout.dot(w.T)
  dw = x.reshape(N * T, D).T.dot(dout.reshape(N * T, M))
  db = dout.sum(axis=(0, 1))

  return dx, dw, db

The loss function: one forward pass and one backward pass, used during training

def loss(self, features, captions):
    """
    Compute training-time loss for the RNN. We input image features and
    ground-truth captions for those images, and use an RNN (or LSTM) to compute
    loss and gradients on all parameters.
    
    Inputs:
    - features: Input image features, of shape (N, D)
    - captions: Ground-truth captions; an integer array of shape (N, T) where
      each element is in the range 0 <= y[i, t] < V
      
    Returns a tuple of:
    - loss: Scalar loss
    - grads: Dictionary of gradients parallel to self.params
    """
    # Cut captions into two pieces: captions_in has everything but the last word
    # and will be input to the RNN; captions_out has everything but the first
    # word and this is what we will expect the RNN to generate. These are offset
    # by one relative to each other because the RNN should produce word (t+1)
    # after receiving word t. The first element of captions_in will be the START
    # token, and the first element of captions_out will be the first word.
    captions_in = captions[:, :-1]
    captions_out = captions[:, 1:]
    
    # You'll need this 
    mask = (captions_out != self._null)

    # Weight and bias for the affine transform from image features to initial
    # hidden state
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    
    # Word embedding matrix
    W_embed = self.params['W_embed']

    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

    # Weight and bias for the hidden-to-vocab transformation.
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
    
    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
    # In the forward pass you will need to do the following:                   #
    # (1) Use an affine transformation to compute the initial hidden state     #
    #     from the image features. This should produce an array of shape (N, H)#
    # (2) Use a word embedding layer to transform the words in captions_in     #
    #     from indices to vectors, giving an array of shape (N, T, W).         #
    # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
    #     process the sequence of input word vectors and produce hidden state  #
    #     vectors for all timesteps, producing an array of shape (N, T, H).    #
    # (4) Use a (temporal) affine transformation to compute scores over the    #
    #     vocabulary at every timestep using the hidden states, giving an      #
    #     array of shape (N, T, V).                                            #
    # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
    #     the points where the output word is <NULL> using the mask above.     #
    #                                                                          #
    # In the backward pass you will need to compute the gradient of the loss   #
    # with respect to all model parameters. Use the loss and grads variables   #
    # defined above to store loss and gradients; grads[k] should give the      #
    # gradients for self.params[k].                                            #
    ############################################################################
    captions_in_emb, emb_cache = word_embedding_forward(captions_in, W_embed)
    
    h_0, feature_cache = affine_forward(features, W_proj, b_proj)
    h, rnn_cache = rnn_forward(captions_in_emb, h_0, Wx, Wh, b)
    temporal_out, temporal_cache = temporal_affine_forward(h, W_vocab, b_vocab)
    
    loss, dout = temporal_softmax_loss(temporal_out, captions_out, mask)
    
    dtemp, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dout, temporal_cache)
    drnn, dh0, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dtemp, rnn_cache)
    dfeatures, grads['W_proj'], grads['b_proj'] = affine_backward(dh0, feature_cache)
    
    grads['W_embed'] = word_embedding_backward(drnn, emb_cache)
    
    return loss, grads
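
The loss function above also calls temporal_softmax_loss, which this post does not reproduce. For reference, a sketch of how it is typically implemented in the assignment — a softmax cross-entropy over every timestep, masked so that <NULL> positions contribute nothing, averaged over the batch size N, and returning the gradient with respect to the scores:

def temporal_softmax_loss(x, y, mask):
  """
  x: scores of shape (N, T, V); y: ground-truth indices (N, T); mask: booleans (N, T).
  Returns the scalar loss and the gradient dx of shape (N, T, V).
  """
  N, T, V = x.shape
  x_flat = x.reshape(N * T, V)
  y_flat = y.reshape(N * T)
  mask_flat = mask.reshape(N * T)

  # numerically stable softmax
  probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True))
  probs /= np.sum(probs, axis=1, keepdims=True)
  loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N

  dx_flat = probs.copy()
  dx_flat[np.arange(N * T), y_flat] -= 1
  dx_flat /= N
  dx_flat *= mask_flat[:, None]

  return loss, dx_flat.reshape(N, T, V)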

The sampling function, used for prediction

A quick note on how an RNN generates a sentence. Training (train) and prediction (inference/test) generally work differently: during training we have labels, so the input at every timestep is the ground-truth word; at inference time the model can only feed its own output from the previous timestep (the word with the highest predicted probability) back in as the next input, or at most sample from the top few candidates. The comparison:

    • Train
      • Feed the ground-truth word as the RNN input at each timestep; this is also known as teacher forcing.
      • The most common approach.
    • Inference
      • Feed the RNN's output from the previous timestep as its input at the next timestep.
      • If training were done this way as well, the model would be much harder to train.

def sample(self, features, max_length=30):
    """
    Run a test-time forward pass for the model, sampling captions for input
    feature vectors.

    At each timestep, we embed the current word, pass it and the previous hidden
    state to the RNN to get the next hidden state, use the hidden state to get
    scores for all vocab words, and choose the word with the highest score as
    the next word. The initial hidden state is computed by applying an affine
    transform to the input image features, and the initial word is the <START>
    token.

    For LSTMs you will also have to keep track of the cell state; in that case
    the initial cell state should be zero.

    Inputs:
    - features: Array of input image features of shape (N, D).
    - max_length: Maximum length T of generated captions.

    Returns:
    - captions: Array of shape (N, max_length) giving sampled captions,
      where each element is an integer in the range [0, V). The first element
      of captions should be the first sampled word, not the <START> token.
    """
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)

    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
    
    ###########################################################################
    # TODO: Implement test-time sampling for the model. You will need to      #
    # initialize the hidden state of the RNN by applying the learned affine   #
    # transform to the input image features. The first word that you feed to  #
    # the RNN should be the <START> token; its value is stored in the         #
    # variable self._start. At each timestep you will need to do to:          #
    # (1) Embed the previous word using the learned word embeddings           #
    # (2) Make an RNN step using the previous hidden state and the embedded   #
    #     current word to get the next hidden state.                          #
    # (3) Apply the learned affine transformation to the next hidden state to #
    #     get scores for all words in the vocabulary                          #
    # (4) Select the word with the highest score as the next word, writing it #
    #     to the appropriate slot in the captions variable                    #
    #                                                                         #
    # For simplicity, you do not need to stop generating after an <END> token #
    # is sampled, but you can if you want to.                                 #
    #                                                                         #
    # HINT: You will not be able to use the rnn_forward or lstm_forward       #
    # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
    # a loop.                                                                 #
    ###########################################################################
    
    # initialize the hidden state of the RNN by applying the learned affine transform to the input image features.
    prev_h, _ = affine_forward(features, W_proj, b_proj)
    # The first word that you feed to the RNN should be the <START> token
    x = np.array([self._start for i in range(N)])  # an array of N <START> tokens

    for t in range(max_length):
        x_emb, _ = word_embedding_forward(x, W_embed)              # (1) embed the previous word
        next_h, _ = rnn_step_forward(x_emb, prev_h, Wx, Wh, b)     # (2) one RNN step
        prev_h = next_h
        vocab_out, _ = affine_forward(next_h, W_vocab, b_vocab)    # (3) scores over the vocabulary
        x = vocab_out.argmax(axis=1)                               # (4) greedy: take the highest-scoring word
        captions[:, t] = x

    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    return captions
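
To turn the sampled index matrix back into readable sentences you need the inverse vocabulary mapping; the helper below is a minimal sketch (it assumes an idx_to_word list and the <NULL>/<END> token strings used by the assignment's data, which are not shown in this post):

def decode_captions(captions, idx_to_word):
  """Convert an (N, T) array of word indices into N space-joined strings."""
  decoded = []
  for row in captions:
    words = []
    for idx in row:
      word = idx_to_word[idx]
      if word == '<END>':
        break                     # stop at the end token
      if word != '<NULL>':
        words.append(word)
    decoded.append(' '.join(words))
  return decoded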

 

