Study Notes TF035: Implementing an LSTM-Based Language Model

Reposted article; originally published 2017-08-12 11:05.

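Advances in neural network architectures and in GPU training efficiency have driven the recent progress of deep learning. Recurrent neural networks (RNNs) are effective on time-series data: each neuron keeps information about earlier inputs in an internal component.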

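A convolutional neural network classifies images well, but it cannot relate what happens across the frames of a video, because it has no way to use information from earlier frames. The defining feature of an RNN is that some of a neuron's outputs are fed back into the neuron as inputs, so the network can use earlier information.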

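Here x_t is the RNN input, A is the RNN node, and h_t is the output. For each input x_t the network computes an output h_t and passes some information (the state) back into its own input. The output h_t is compared with the label to obtain an error, and the network is trained with gradient descent and Back-Propagation Through Time (BPTT). BPTT computes gradients by backpropagation and uses them to update the network weights. Real-Time Recurrent Learning (RTRL) computes gradients in the forward direction instead, at a higher computational cost. Hybrid methods that sit between BPTT and RTRL mitigate the vanishing-gradient problem caused by long gaps in a time series.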

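Unrolled over time, an RNN becomes a chain: it looks like a series of ordinary neural networks, each with its own input x and its own output, connected in sequence so that each network passes information on to the next. This makes RNNs well suited to processing time-series data. Every unrolled step uses the same parameters, so only one set of RNN parameters has to be trained. This parameter sharing is similar to weight sharing in convolutional neural networks.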
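The unrolling and parameter sharing described above can be sketched in NumPy. This is a minimal illustration, not the TensorFlow implementation: the function name `rnn_forward`, the tanh activation, and the weight names `W_xh`, `W_hh`, `b_h` are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, h0):
    """Run a plain tanh RNN over a sequence, reusing the same weights at every step."""
    h = h0
    hs = []
    for x_t in xs:  # xs has shape [num_steps, input_dim]
        # The previous state h is fed back in together with the new input x_t.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        hs.append(h)
    return np.stack(hs), h

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_steps = 3, 4, 5
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
xs = rng.normal(size=(num_steps, input_dim))

hs, h_last = rnn_forward(xs, W_xh, W_hh, b_h, np.zeros(hidden_dim))
print(hs.shape)  # one hidden vector per time step
```

Only one set of weights exists, however many steps the loop runs; that is the sense in which the unrolled layers share parameters.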

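When an RNN processes a whole time series, it remembers the most recent inputs best, and the influence of earlier inputs grows weaker and weaker. Long Short-Term Memory (LSTM) was the breakthrough here, with applications in speech recognition, text classification, language modeling, dialogue systems, machine translation, and image captioning.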

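Long-term dependencies are the key weakness of traditional RNNs. LSTM, proposed by Hochreiter and Schmidhuber in 1997, addresses long-term dependencies. It does not need especially elaborate hyperparameter tuning and remembers long-term information by default.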

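Internally an LSTM unit has 4 neural network layers. In the standard diagram, small circles are point-wise operations (vector addition, element-wise multiplication, and so on) and small rectangles are neural network layers with learnable parameters. The straight line along the top of the unit is the LSTM state, which runs through all of the chained LSTM units from the first to the last with only a few linear interactions and changes. As the state is passed along, each LSTM unit can add information to it or remove information from it, and the LSTM gates control these modifications. A gate consists of a sigmoid layer and an element-wise multiplication. The sigmoid layer outputs values between 0 and 1 that directly control what fraction of the information passes: 0 lets nothing through and 1 lets everything through. Each LSTM unit has 3 gates that maintain and control the unit's state. By storing and modifying this state, the LSTM unit achieves long-term memory.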
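A single LSTM step can be written out in NumPy to make the role of the gates concrete. This is a sketch of the standard LSTM equations under stated assumptions: the function `lstm_step`, the packed weight matrix `W`, and the ordering of the four blocks (input gate, candidate, forget gate, output gate) are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b, forget_bias=0.0):
    """One LSTM step. W maps [x, h_prev] to four stacked pre-activations."""
    z = np.concatenate([x, h_prev]) @ W + b
    i, j, f, o = np.split(z, 4)  # input gate, candidate, forget gate, output gate
    # The forget gate scales the old state; the input gate scales the new candidate.
    c = sigmoid(f + forget_bias) * c_prev + sigmoid(i) * np.tanh(j)
    # The output gate decides how much of the state is exposed as output.
    h = sigmoid(o) * np.tanh(c)
    return h, c

hidden, inp = 3, 2
W = np.zeros((inp + hidden, 4 * hidden))
b = np.concatenate([np.full(hidden, -100.0),  # input gate closed
                    np.zeros(hidden),         # candidate values
                    np.full(hidden, 100.0),   # forget gate fully open
                    np.zeros(hidden)])        # output gate half open
c_prev = np.array([0.5, -1.0, 2.0])
h, c = lstm_step(np.ones(inp), np.zeros(hidden), c_prev, W, b)
print(c)  # the state passes through essentially unchanged
```

With the input gate closed and the forget gate open, the state is carried forward untouched, which is how an LSTM can hold information over many steps.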

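LSTM and the Gated Recurrent Unit (GRU) are both RNN variants. A GRU has one fewer gate than an LSTM. It is cheaper to compute (each unit saves several matrix operations) and uses less memory. A GRU typically needs fewer iterations to converge and trains faster.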
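A rough parameter count shows where the savings come from. The sketch assumes the textbook formulations, in which every gate and the candidate layer is a dense layer applied to the concatenation of the input and the previous hidden state; the helper `gated_rnn_params` is hypothetical, and real implementations may differ in details such as extra bias terms.

```python
def gated_rnn_params(input_dim, hidden_dim, num_blocks):
    """Weights plus biases for `num_blocks` dense layers acting on [x, h]."""
    return num_blocks * ((input_dim + hidden_dim) * hidden_dim + hidden_dim)

# LSTM: input, forget, and output gates plus the candidate layer.
lstm_params = gated_rnn_params(200, 200, 4)
# GRU: update and reset gates plus the candidate layer.
gru_params = gated_rnn_params(200, 200, 3)

print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.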

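RNNs are widely used in natural language processing, and in language modeling in particular. A language model is a probabilistic model of sentences: given the context, meaning the words seen so far, it predicts the probability of the next word. Language models underpin NLP, speech recognition, machine translation, and image captioning. Penn Tree Bank (PTB) is a commonly used dataset: it is high quality, small, and quick to train on. See the paper "Recurrent Neural Network Regularization".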
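To make "predicting the next word from the words so far" concrete, here is a toy count-based bigram model. It is deliberately not an LSTM: the bigram counts stand in for the neural network purely for illustration, and the corpus and function names are made up for the example.

```python
import math
from collections import Counter

corpus = "the cat sat <eos> the cat ran <eos> the dog sat <eos>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # every word that has a successor

def next_word_prob(prev, word):
    """Estimate P(word | prev) from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_log_prob(words):
    """Sum of log P(w_t | w_{t-1}) over the sentence (chain rule, bigram context)."""
    return sum(math.log(next_word_prob(p, w)) for p, w in zip(words, words[1:]))

log_p = sentence_log_prob(["the", "cat", "sat", "<eos>"])
print(math.exp(log_p))
```

An LSTM language model plays the same role as `next_word_prob`, except that it conditions on the whole history through its state rather than on the previous word only.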

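Download the PTB dataset and extract it. Make sure the extracted files sit in the directory from which Python is run. The dataset contains 10,000 distinct words, has end-of-sentence markers, and maps rare words to a single special token. Download with wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz and extract with tar xvf simple-examples.tgz.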

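Clone the TensorFlow Models repository (git clone https://github.com/tensorflow/models.git) and cd into models/tutorials/rnn/ptb. Import the usual libraries plus the PTB reader from TensorFlow Models, which reads the data and converts each word to a unique integer id.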
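The word-to-id conversion can be sketched in a few lines of plain Python. This is an illustration of the idea, not the reader module itself: the function `build_vocab` and the frequency-based ordering are assumptions of the example.

```python
from collections import Counter

def build_vocab(words):
    """Map each distinct word to an integer id, most frequent word first."""
    counts = Counter(words)
    # Sort by descending count, breaking ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}

text = "the cat sat on the mat <eos> the dog sat <eos>"
words = text.split()
word_to_id = build_vocab(words)
ids = [word_to_id[w] for w in words]
print(ids)
```

After this step the model never sees strings, only integer ids that index into the embedding matrix.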

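Define PTBInput, the class that handles the language model's input data. Its __init__() method copies batch_size and num_steps from config into local variables; num_steps is the number of unrolled steps of the LSTM. It computes epoch_size, the number of training iterations in one epoch, by integer-dividing the data length by batch_size, subtracting 1, and integer-dividing the result by num_steps. reader.ptb_producer returns the feature data input_data and the label data targets; each evaluation yields one batch.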
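The batching arithmetic is easier to follow on a tiny example. The generator below is a NumPy sketch of the idea, written on the assumption that targets are simply the inputs shifted by one word; `ptb_batches` is a hypothetical helper, not part of the reader module.

```python
import numpy as np

def ptb_batches(raw_ids, batch_size, num_steps):
    """Yield (inputs, targets) pairs; targets are the inputs shifted by one word."""
    raw = np.asarray(raw_ids)
    batch_len = len(raw) // batch_size
    # Lay the corpus out as batch_size parallel streams of length batch_len.
    data = raw[:batch_size * batch_len].reshape(batch_size, batch_len)
    # The "- 1" leaves room for the final target word.
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

batches = list(ptb_batches(list(range(50)), batch_size=2, num_steps=4))
x0, y0 = batches[0]
print(len(batches))
print(x0)
print(y0)
```

Because consecutive batches continue the same streams, the final LSTM state of one batch is a sensible initial state for the next.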

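Define PTBModel, the language model class. Its __init__() takes the training flag is_training, the configuration config, and a PTBInput instance input_. It reads batch_size and num_steps from input_, and hidden_size (the number of LSTM units) and vocab_size (the vocabulary size) from config, into local variables.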

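tf.contrib.rnn.BasicLSTMCell creates the default LSTM cell with hidden_size hidden units, forget_bias (the forget gate bias) set to 0, and state_is_tuple=True, so the state it accepts and returns is a 2-tuple. When training with a Dropout keep_prob below 1, lstm_cell is wrapped in a Dropout layer using tf.contrib.rnn.DropoutWrapper. The stacking function tf.contrib.rnn.MultiRNNCell stacks several lstm_cell layers into cell; the number of layers is config.num_layers, and state_is_tuple is again set to True. cell.zero_state sets the initial LSTM state to zeros. The LSTM reads one word at a time, combines it with the stored state to compute the probability distribution of the next word, and updates the state after each word.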

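Create the word embedding, which turns words from one-hot encoding into dense vector representations. with tf.device("/cpu:0") pins this computation to the CPU. The embedding matrix has vocab_size rows and hidden_size columns (the dimension of the word vectors), matching the number of hidden units in the LSTM cell. The embedding parameters are optimized and updated during training. tf.nn.embedding_lookup looks up the vector for each word to produce inputs. In training mode a Dropout layer is added on top.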
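An embedding lookup is just row selection from a matrix, which NumPy indexing shows directly. The sizes below are made up for the example; the comparison with an explicit one-hot product is included only to show that the two are equivalent.

```python
import numpy as np

vocab_size, hidden_size = 6, 4
rng = np.random.default_rng(0)
embedding = rng.uniform(-0.1, 0.1, size=(vocab_size, hidden_size))

word_ids = np.array([[2, 5, 0],
                     [1, 1, 3]])        # [batch_size, num_steps]

# Row lookup: the result has shape [batch_size, num_steps, hidden_size].
inputs = embedding[word_ids]

# The same result via explicit one-hot vectors, which is far more wasteful.
one_hot = np.eye(vocab_size)[word_ids]  # [batch_size, num_steps, vocab_size]
inputs_via_matmul = one_hot @ embedding

print(inputs.shape)
```

The lookup avoids ever materializing the one-hot vectors, which matters when the vocabulary has 10,000 entries.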

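Define the output list outputs inside a tf.variable_scope named RNN. To keep training under control, gradient backpropagation is limited to a fixed number of unrolled steps, num_steps: the loop runs num_steps times, which bounds how far gradients propagate. From the second iteration on, tf.get_variable_scope().reuse_variables() makes the loop reuse variables. Each iteration passes inputs and state into the stacked LSTM cell (cell). inputs has 3 dimensions: the first indexes the sample within the batch, the second indexes the word within the sample, and the third is the word-vector dimension. inputs[:, time_step, :] selects the time_step-th word of every sample. The call returns cell_output and the updated state, and cell_output is appended to outputs.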
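The slicing and the later concatenate-and-reshape step are mostly a matter of shapes, which the NumPy sketch below traces. The cell itself is replaced by an identity stand-in, so this only demonstrates the data layout, not an LSTM.

```python
import numpy as np

batch_size, num_steps, size = 2, 3, 4
inputs = np.arange(batch_size * num_steps * size, dtype=np.float32).reshape(
    batch_size, num_steps, size)

outputs = []
for time_step in range(num_steps):
    step_input = inputs[:, time_step, :]  # [batch_size, size]
    cell_output = step_input              # stand-in for cell(step_input, state)
    outputs.append(cell_output)

# Concatenate along the feature axis, then fold back into one row per word.
output = np.concatenate(outputs, axis=1).reshape(-1, size)
print(output.shape)
```

The rows of output come out ordered sample by sample and, within a sample, step by step, which is the same order produced by flattening a [batch_size, num_steps] array of targets.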

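tf.concat joins the contents of outputs, and tf.reshape turns the result into a matrix with one row of length hidden_size per word. For the softmax layer, define the weights softmax_w and the bias softmax_b. tf.matmul multiplies output by the weights, and adding the bias gives the network's final output logits. Define the loss: tf.contrib.legacy_seq2seq.sequence_loss_by_example computes the discrepancy between logits and targets. This sequence loss is the average negative log probability of the target words, $\text{loss} = -\frac{1}{N}\sum_{i=1}^{N}\ln p_{\text{target}_i}$. tf.reduce_sum adds up the errors over the batch, and dividing by batch_size gives the average per-sample cost. Keep the final state as final_state. If the model is not in training mode, return at this point.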
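The loss can be checked by hand on a case where the answer is known. The NumPy sketch below computes a per-word negative log-likelihood from logits; `nll_per_word` is a hypothetical helper written for this example, with all word weights assumed to be 1.

```python
import numpy as np

def nll_per_word(logits, targets):
    """Per-word negative log-likelihood from unnormalized logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

batch_size, num_steps, vocab_size = 2, 3, 5
logits = np.zeros((batch_size * num_steps, vocab_size))   # uniform prediction
targets = np.array([0, 1, 2, 3, 4, 0])

loss = nll_per_word(logits, targets)
cost = loss.sum() / batch_size
print(loss)
print(cost)
```

With a uniform prediction every word costs ln(vocab_size), so the cost per sample is num_steps times that value. Dividing the cost by num_steps gives the average per-word loss used for perplexity.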

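Define the learning rate variable lr and mark it non-trainable. tf.trainable_variables returns all trainable parameters tvars. Compute the gradients of cost with respect to tvars, and use tf.clip_by_global_norm to cap their global norm, which has a regularizing effect. Gradient clipping guards against exploding gradients: without a limit, gradients can grow too large across iterations and training struggles to converge. Define the optimizer as gradient descent. Create the training op _train_op with optimizer.apply_gradients, which applies the clipped gradients to all trainable parameters tvars; tf.contrib.framework.get_or_create_global_step provides the global training step counter.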
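Clipping by global norm rescales all gradients by one common factor so that their joint norm does not exceed the limit. The NumPy version below is a sketch of that rule as I understand it; the local function `clip_by_global_norm` is written for this example and is not the TensorFlow op.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by the same factor so the joint norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The factor is 1 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)
print(clipped)
```

Because every tensor is scaled by the same factor, the direction of the overall update is preserved and only its length changes.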

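Define the placeholder _new_lr (new learning rate) to control the learning rate. Define the op _lr_update, which uses tf.assign to copy the value of _new_lr into the current learning rate _lr. Define the assign_lr function so that the learning rate can be controlled from outside the model: it feeds the new value into the _new_lr placeholder and runs _lr_update to change the learning rate.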

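Define the properties of the PTBModel class. The Python @property decorator exposes a variable as read-only, which prevents problems caused by modifying it. The properties are input, initial_state, cost, final_state, lr, and train_op.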
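The read-only behavior of @property is easy to confirm in isolation. The class below is a stripped-down stand-in with a single attribute, made up for this example.

```python
class Model(object):
    def __init__(self, cost):
        self._cost = cost

    @property
    def cost(self):
        # No setter is defined, so the attribute can be read but not assigned.
        return self._cost

m = Model(cost=1.5)
try:
    m.cost = 0.0
    assigned = True
except AttributeError:
    assigned = False

print(m.cost, assigned)
```

The underlying _cost attribute is still writable by convention-breaking code; the property only protects the public name.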

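Define the model configurations. init_scale is the initial scale of the network weights. learning_rate is the initial learning rate. max_grad_norm is the maximum gradient norm. num_layers is the number of stacked LSTM layers. num_steps is the number of unrolled steps for LSTM backpropagation. hidden_size is the number of hidden units in the LSTM. max_epoch is the number of epochs trained at the initial learning rate, after which the learning rate starts to be adjusted. max_max_epoch is the total number of training epochs. keep_prob is the fraction of units kept by the dropout layer. lr_decay is the learning rate decay factor. batch_size is the number of samples per batch.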

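MediumConfig, the medium model, reduces init_scale from 0.1 to 0.05 so that the initial weights are not too large; smaller values make for gentler training. The learning rate, the maximum gradient norm, and the number of LSTM layers are unchanged. The number of unrolled steps num_steps increases from 20 to 35. hidden_size (200 to 650) and max_max_epoch (13 to 39) grow roughly threefold. The dropout keep_prob is set to 0.5. Because training runs for more epochs, the learning rate decays more slowly: lr_decay rises from 0.5 to 0.8. batch_size and the vocabulary size vocab_size are unchanged.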

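LargeConfig, the large model, shrinks init_scale further, to 0.04. It relaxes the maximum gradient norm max_grad_norm to 10 and raises hidden_size to 1500. max_epoch and max_max_epoch increase to 14 and 55. keep_prob drops further, to 0.35, because the model is more complex. The learning rate decays more slowly still, with lr_decay set to 1/1.15.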

TestConfig is for testing only; most of its parameters are set to very small values.

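Define run_epoch, the function that runs one epoch of data. It records the current time, initializes the accumulated loss costs and the iteration count iters, and runs model.initial_state to obtain the initial state. It builds the dictionary fetches of results to retrieve, containing cost and final_state; if an evaluation op eval_op is given, that is added to fetches as well. The training loop runs epoch_size times. Each iteration builds a feed_dict containing the state of every LSTM layer, passes it in, and runs fetches to train the network and get back cost and state. It adds cost to costs and num_steps to iters. After roughly every 10% of the epoch it prints the progress through the epoch, the perplexity, and the training speed in words per second. Perplexity is the exponential (base e) of the average cost; it is an important metric for comparing language models, and lower values mean the model's output distribution predicts the samples better. The function returns the perplexity.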
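Perplexity has a useful sanity check: a model that guesses uniformly over the vocabulary should score a perplexity equal to the vocabulary size. The sketch below accumulates costs and iters the same way the text describes; the `perplexity` helper and the numbers are made up for the example.

```python
import math

def perplexity(total_cost, total_steps):
    """exp of the average per-word negative log-likelihood."""
    return math.exp(total_cost / total_steps)

vocab_size, num_steps, num_batches = 10000, 20, 3

# A uniform model pays ln(vocab_size) per word, so each batch costs
# num_steps * ln(vocab_size) once the sum is divided by batch_size.
costs = num_batches * num_steps * math.log(vocab_size)
iters = num_batches * num_steps

ppl = perplexity(costs, iters)
print(ppl)
```

Seen this way, a test perplexity of around 100 means the model is on average about as uncertain as a uniform choice among 100 words.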

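reader.ptb_raw_data reads the extracted data and returns the training, validation, and test data. The training configuration is SmallConfig. The test configuration eval_config must match the training configuration, except that its batch_size and num_steps are set to 1.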

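Create a default Graph. tf.random_uniform_initializer sets the parameter initializer, with values in the range [-init_scale, init_scale]. Use PTBInput and PTBModel to create the training model m, the validation model mvalid, and the test model mtest. The training and validation models use config, and the test model uses the test configuration eval_config.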

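tf.train.Supervisor() creates the training supervisor sv, and sv.managed_session creates the default session in which the loop over training epochs runs. In each epoch, compute the cumulative learning rate decay: only epochs beyond max_epoch count, and the decay is lr_decay raised to the number of such epochs. Multiply the initial learning rate by the cumulative decay to update the learning rate. Inside the loop, run one epoch of training and one of validation, and print the current learning rate and the training and validation perplexity. After all training is complete, compute and print the model's perplexity on the test set.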
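The decay rule is short enough to tabulate. The sketch below applies the same expression to the SmallConfig values (learning_rate 1.0, lr_decay 0.5, max_epoch 4, max_max_epoch 13); the helper name `learning_rate_for_epoch` is hypothetical.

```python
def learning_rate_for_epoch(i, learning_rate, lr_decay, max_epoch):
    """Learning rate for zero-based epoch i: constant at first, then geometric decay."""
    return learning_rate * lr_decay ** max(i + 1 - max_epoch, 0)

schedule = [learning_rate_for_epoch(i, 1.0, 0.5, 4) for i in range(13)]
for epoch, lr in enumerate(schedule, start=1):
    print("Epoch: %d Learning rate: %.6f" % (epoch, lr))
```

The first max_epoch epochs keep the initial rate, and each later epoch multiplies it by lr_decay once more.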

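With SmallConfig, the small model trains at about 21,000 words per second on an i7 6900K with a GTX 1080. After the final epoch, perplexity is 36.9 on the training set, 122.3 on the validation set, and 116.7 on the test set.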

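The medium model reaches a perplexity of 48.45 on the training set, 86.16 on the validation set, and 82.07 on the test set. The large model reaches 37.87 on the training set, 82.62 on the validation set, and 78.29 on the test set.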

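An LSTM stores a state and relies on it to process, analyze, and make predictions from the current input. RNNs and LSTMs give neural networks the ability to remember and store past information, imitating simple human memory and reasoning. The attention mechanism is an active research topic in RNNs and NLP that lets machines mimic the human brain more closely. In image captioning, an RNN with attention analyzes regions of the image and generates the corresponding text. See "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention".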

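    # Requires TensorFlow 1.x (the code uses tf.contrib) and reader.py from
    # models/tutorials/rnn/ptb in the TensorFlow Models repository.
    import time
    import numpy as np
    import tensorflow as tf
    import reader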
    #flags = tf.flags
    #logging = tf.logging
    #flags.DEFINE_string("save_path", None,
    #                    "Model output directory.")
    #flags.DEFINE_bool("use_fp16", False,
    #                  "Train using 16-bit floats instead of 32bit floats")
    #FLAGS = flags.FLAGS
    #def data_type():
    #  return tf.float16 if FLAGS.use_fp16 else tf.float32
    class PTBInput(object):
      """The input data."""
      def __init__(self, config, data, name=None):
        self.batch_size = batch_size = config.batch_size
        self.num_steps = num_steps = config.num_steps
        self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
        self.input_data, self.targets = reader.ptb_producer(
            data, batch_size, num_steps, name=name)
    class PTBModel(object):
      """The PTB model."""
      def __init__(self, is_training, config, input_):
        self._input = input_
        batch_size = input_.batch_size
        num_steps = input_.num_steps
        size = config.hidden_size
        vocab_size = config.vocab_size
        # Slightly better results can be obtained with forget gate biases
        # initialized to 1 but the hyperparameters of the model would need to be
        # different than reported in the paper.
        def lstm_cell():
          return tf.contrib.rnn.BasicLSTMCell(
              size, forget_bias=0.0, state_is_tuple=True)
        attn_cell = lstm_cell
        if is_training and config.keep_prob < 1:
          def attn_cell():
            return tf.contrib.rnn.DropoutWrapper(
                lstm_cell(), output_keep_prob=config.keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell(
            [attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)
        self._initial_state = cell.zero_state(batch_size, tf.float32)
        with tf.device("/cpu:0"):
          embedding = tf.get_variable(
              "embedding", [vocab_size, size], dtype=tf.float32)
          inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
        if is_training and config.keep_prob < 1:
          inputs = tf.nn.dropout(inputs, config.keep_prob)
        # Simplified version of models/tutorials/rnn/rnn.py's rnn().
        # This builds an unrolled LSTM for tutorial purposes only.
        # In general, use the rnn() or state_saving_rnn() from rnn.py.
        #
        # The alternative version of the code below is:
        #
        # inputs = tf.unstack(inputs, num=num_steps, axis=1)
        # outputs, state = tf.nn.rnn(cell, inputs,
        #                            initial_state=self._initial_state)
        outputs = []
        state = self._initial_state
        with tf.variable_scope("RNN"):
          for time_step in range(num_steps):
            if time_step > 0: tf.get_variable_scope().reuse_variables()
            (cell_output, state) = cell(inputs[:, time_step, :], state)
            outputs.append(cell_output)
        output = tf.reshape(tf.concat(outputs, 1), [-1, size])
        softmax_w = tf.get_variable(
            "softmax_w", [size, vocab_size], dtype=tf.float32)
        softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=tf.float32)
        logits = tf.matmul(output, softmax_w) + softmax_b
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [logits],
            [tf.reshape(input_.targets, [-1])],
            [tf.ones([batch_size * num_steps], dtype=tf.float32)])
        self._cost = cost = tf.reduce_sum(loss) / batch_size
        self._final_state = state
        if not is_training:
          return
        self._lr = tf.Variable(0.0, trainable=False)
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                                      config.max_grad_norm)
        optimizer = tf.train.GradientDescentOptimizer(self._lr)
        self._train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())
        self._new_lr = tf.placeholder(
            tf.float32, shape=[], name="new_learning_rate")
        self._lr_update = tf.assign(self._lr, self._new_lr)
      def assign_lr(self, session, lr_value):
        session.run(self._lr_update, feed_dict={self._new_lr: lr_value})
      @property
      def input(self):
        return self._input
      @property
      def initial_state(self):
        return self._initial_state
      @property
      def cost(self):
        return self._cost
      @property
      def final_state(self):
        return self._final_state
      @property
      def lr(self):
        return self._lr
      @property
      def train_op(self):
        return self._train_op
    class SmallConfig(object):
      """Small config."""
      init_scale = 0.1
      learning_rate = 1.0
      max_grad_norm = 5
      num_layers = 2
      num_steps = 20
      hidden_size = 200
      max_epoch = 4
      max_max_epoch = 13
      keep_prob = 1.0
      lr_decay = 0.5
      batch_size = 20
      vocab_size = 10000
    class MediumConfig(object):
      """Medium config."""
      init_scale = 0.05
      learning_rate = 1.0
      max_grad_norm = 5
      num_layers = 2
      num_steps = 35
      hidden_size = 650
      max_epoch = 6
      max_max_epoch = 39
      keep_prob = 0.5
      lr_decay = 0.8
      batch_size = 20
      vocab_size = 10000
    class LargeConfig(object):
      """Large config."""
      init_scale = 0.04
      learning_rate = 1.0
      max_grad_norm = 10
      num_layers = 2
      num_steps = 35
      hidden_size = 1500
      max_epoch = 14
      max_max_epoch = 55
      keep_prob = 0.35
      lr_decay = 1 / 1.15
      batch_size = 20
      vocab_size = 10000
    class TestConfig(object):
      """Tiny config, for testing."""
      init_scale = 0.1
      learning_rate = 1.0
      max_grad_norm = 1
      num_layers = 1
      num_steps = 2
      hidden_size = 2
      max_epoch = 1
      max_max_epoch = 1
      keep_prob = 1.0
      lr_decay = 0.5
      batch_size = 20
      vocab_size = 10000
    def run_epoch(session, model, eval_op=None, verbose=False):
      """Runs the model on the given data."""
      start_time = time.time()
      costs = 0.0
      iters = 0
      state = session.run(model.initial_state)
      fetches = {
          "cost": model.cost,
          "final_state": model.final_state,
      }
      if eval_op is not None:
        fetches["eval_op"] = eval_op
      for step in range(model.input.epoch_size):
        feed_dict = {}
        for i, (c, h) in enumerate(model.initial_state):
          feed_dict[c] = state[i].c
          feed_dict[h] = state[i].h
        vals = session.run(fetches, feed_dict)
        cost = vals["cost"]
        state = vals["final_state"]
        costs += cost
        iters += model.input.num_steps
        if verbose and step % (model.input.epoch_size // 10) == 10:
          print("%.3f perplexity: %.3f speed: %.0f wps" %
                (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
                 iters * model.input.batch_size / (time.time() - start_time)))
      return np.exp(costs / iters)
    raw_data = reader.ptb_raw_data('simple-examples/data/')
    train_data, valid_data, test_data, _ = raw_data
    config = SmallConfig()
    eval_config = SmallConfig()
    eval_config.batch_size = 1
    eval_config.num_steps = 1
    with tf.Graph().as_default():
      initializer = tf.random_uniform_initializer(-config.init_scale,
                                              config.init_scale)
      with tf.name_scope("Train"):
        train_input = PTBInput(config=config, data=train_data, name="TrainInput")
        with tf.variable_scope("Model", reuse=None, initializer=initializer):
          m = PTBModel(is_training=True, config=config, input_=train_input)
          #tf.scalar_summary("Training Loss", m.cost)
          #tf.scalar_summary("Learning Rate", m.lr)
      with tf.name_scope("Valid"):
        valid_input = PTBInput(config=config, data=valid_data, name="ValidInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
          mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
          #tf.scalar_summary("Validation Loss", mvalid.cost)
      with tf.name_scope("Test"):
        test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
          mtest = PTBModel(is_training=False, config=eval_config,
                       input_=test_input)
      sv = tf.train.Supervisor()
      with sv.managed_session() as session:
        for i in range(config.max_max_epoch):
          lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
          m.assign_lr(session, config.learning_rate * lr_decay)
          print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
          train_perplexity = run_epoch(session, m, eval_op=m.train_op,
                                   verbose=True)
          print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
          valid_perplexity = run_epoch(session, mvalid)
          print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))
        test_perplexity = run_epoch(session, mtest)
        print("Test Perplexity: %.3f" % test_perplexity)
         # if FLAGS.save_path:
         #   print("Saving model to %s." % FLAGS.save_path)
         #   sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)
    #if __name__ == "__main__":
    #  tf.app.run()

References:
《TensorFlow實戰》
