After several versions of accumulation, the RNN code in TensorFlow is rather messy. As for which implementation is actually the fastest, my benchmarks show results that do not match what the API documentation claims, so I am recording them here.
Note: all code and conclusions below were run on tensorflow==1.10.
1. Sorting out the APIs
In TensorFlow 1.10, the RNN-related code generally lives in the following four packages:
- tf.contrib.rnn
- tf.contrib.cudnn_rnn
- tf.nn.rnn_cell
- tf.compat.v1.nn.rnn_cell
Since tf.nn.rnn_cell, tf.compat.v1.nn.rnn_cell, and tf.contrib.rnn are equivalent to each other, this simplifies to two parts:
- tf.contrib.rnn
- tf.contrib.cudnn_rnn
Here we only examine the commonly used RNN cells (e.g. LSTM, GRU) and compare the performance of the equivalent implementations. The cells found in the packages above break down as follows:
LSTM:
- tf.contrib.rnn.LSTMCell
- tf.contrib.rnn.LSTMBlockCell
- tf.contrib.rnn.LSTMBlockFusedCell
- tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell

GRU:
- tf.contrib.rnn.GRUCell
- tf.contrib.rnn.GRUBlockCellV2
- tf.contrib.cudnn_rnn.CudnnCompatibleGRUCell

SRU:
- tf.contrib.rnn.SRUCell
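For a rough sense of why a GRU cell is lighter than an LSTM cell, here is a quick parameter count (not from the original post, just the standard per-gate formulas) using the benchmark's hidden size of 512 with an input of the same size:

```python
def rnn_params(input_dim, hidden_dim, num_gates):
    # Each gate/candidate has a weight matrix of shape
    # (input_dim + hidden_dim, hidden_dim) plus a bias of size hidden_dim.
    return num_gates * ((input_dim + hidden_dim) * hidden_dim + hidden_dim)

hidden = 512
lstm = rnn_params(hidden, hidden, 4)  # LSTM: input/forget/output gates + cell candidate
gru = rnn_params(hidden, hidden, 3)   # GRU: update/reset gates + candidate

print(lstm)  # 2099200
print(gru)   # 1574400
```

So at this size a GRU layer carries about 25% fewer parameters than an LSTM layer; note that parameter count alone does not predict the benchmark results below, where kernel fusion matters more.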
2. Benchmark code
The code is below. Besides simple wall-clock timing of session.run, it also profiles with the timeline tool, but the analysis that follows mainly uses the timing results, because the timeline output is broken down to every smallest op, which is cumbersome to work with.
Since the input data is randomly generated with numpy, you can copy the code and run it directly. It compares several configurations: LSTM, bi-LSTM, and 2-layer bi-LSTM.
```python
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import time
import json
from tensorflow.python.client import timeline


def update_timeline(chrome_trace, _timeline_dict):
    # convert the chrome trace to a python dict
    chrome_trace_dict = json.loads(chrome_trace)
    # for the first run, store the full trace
    if _timeline_dict is None:
        _timeline_dict = chrome_trace_dict
    # for later runs, append only timing events, not definitions
    else:
        for event in chrome_trace_dict['traceEvents']:
            # timing events carry a 'ts' field
            if 'ts' in event:
                _timeline_dict['traceEvents'].append(event)
    return _timeline_dict


batch_size = 1
time_step = 70
hidden_num = 512
stack_num = 1

# fix the random seed so every run sees the same data
np.random.seed(0)
# normally distributed input, batch_size * time_step * hidden_num
np_input_data = np.random.randn(batch_size, time_step, hidden_num).astype(np.float32)
np_input_len = [time_step] * batch_size

# LSTM cells
# rnn_cell = tf.contrib.rnn.LSTMCell            # child of RNNCell
rnn_cell = tf.contrib.rnn.LSTMBlockCell         # child of RNNCell
# rnn_cell = tf.contrib.rnn.LSTMBlockFusedCell  # not a child of RNNCell
# rnn_cell = tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell  # child of RNNCell

# GRU cells
# rnn_cell = tf.contrib.rnn.GRUCell
# rnn_cell = tf.contrib.rnn.GRUBlockCellV2
# rnn_cell = tf.contrib.cudnn_rnn.CudnnCompatibleGRUCell

# SRU cells
# rnn_cell = tf.contrib.rnn.SRUCell

# build a simple stacked (optionally bidirectional) LSTM network
input_data = tf.placeholder(dtype=tf.float32, shape=[batch_size, time_step, hidden_num],
                            name='input_data')
trans_data = tf.transpose(input_data, [1, 0, 2])  # to time-major layout

outputs = [trans_data]
for i in range(stack_num):
    fw_rnn = rnn_cell(hidden_num, name='fw_cell_%d' % i)
    # bw_rnn = rnn_cell(hidden_num, name='bw_cell_%d' % i)
    if rnn_cell is not tf.contrib.rnn.LSTMBlockFusedCell:
        fw_rnn = tf.contrib.rnn.FusedRNNCellAdaptor(fw_rnn, use_dynamic_rnn=False)
        # bw_rnn = tf.contrib.rnn.FusedRNNCellAdaptor(bw_rnn, use_dynamic_rnn=True)
        # bw_rnn = tf.contrib.rnn.TimeReversedFusedRNN(bw_rnn)
    outputs1, state1 = fw_rnn(outputs[-1], sequence_length=np_input_len, dtype=tf.float32)
    # outputs2, state2 = bw_rnn(outputs[-1], sequence_length=np_input_len, dtype=tf.float32)
    # next_layer_input = tf.concat([outputs1, outputs2], axis=-1)
    # outputs.append(next_layer_input)
    outputs.append(outputs1)

total_time = 0
_timeline_dict = None
runs = 1000
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for i in range(runs):
    # options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    t1 = time.time()
    # result = sess.run([outputs[-1]], feed_dict={input_data: np_input_data},
    #                   options=options, run_metadata=run_metadata)
    result = sess.run([outputs[-1]], feed_dict={input_data: np_input_data},
                      run_metadata=run_metadata)
    t2 = time.time()
    total_time += (t2 - t1)
    # fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    # chrome_trace = fetched_timeline.generate_chrome_trace_format()
    # _timeline_dict = update_timeline(chrome_trace, _timeline_dict)

print(rnn_cell)
print('average time %f ms' % (total_time / float(runs) * 1000.0))
# with open('fused_%d_runs.json' % runs, 'w') as f:
#     json.dump(_timeline_dict, f)
```
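As a standalone illustration of how the update_timeline helper above merges traces across runs, here is a minimal run with two hand-made chrome-trace dicts (the event contents are made up for the example; real traces come from timeline.Timeline(...).generate_chrome_trace_format()):

```python
import json

def update_timeline(chrome_trace, _timeline_dict):
    # same merging logic as in the benchmark script above
    chrome_trace_dict = json.loads(chrome_trace)
    if _timeline_dict is None:
        _timeline_dict = chrome_trace_dict
    else:
        for event in chrome_trace_dict['traceEvents']:
            if 'ts' in event:
                _timeline_dict['traceEvents'].append(event)
    return _timeline_dict

# first trace: one metadata event (no 'ts') and one timing event
trace1 = json.dumps({'traceEvents': [
    {'name': 'process_name', 'ph': 'M'},
    {'name': 'MatMul', 'ts': 100, 'dur': 5},
]})
# second trace: only its timing events should be appended
trace2 = json.dumps({'traceEvents': [
    {'name': 'process_name', 'ph': 'M'},
    {'name': 'MatMul', 'ts': 200, 'dur': 4},
]})

merged = update_timeline(trace1, None)
merged = update_timeline(trace2, merged)
print(len(merged['traceEvents']))  # 3
```

The first trace is kept whole (metadata plus timing), while later traces contribute only their 'ts' events, so the merged file keeps one set of process/thread definitions and accumulates timings from every run.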
3. Performance analysis
Testing on a V100 gave the results below.
From them we can see that neither the cuDNN-optimized variants nor the vanilla implementations perform well on the GPU. Surprisingly, the LSTMBlockFusedCell implementation does remarkably well, even beating GRU and SRU on the GPU, and the gap on CPU is also small. Explaining why would require reading the underlying source of these implementations; with limited time I have not studied it closely, and interested readers can dig deeper. To summarize:
1. If you use the RNNs in TF1, prefer LSTMBlockFusedCell.
2. On the GPU, the runtime of all the RNN implementations scales roughly linearly with the number of layers and directions.
3. On the CPU, runtime is not necessarily linear in the number of layers and directions.
4. The GRU and SRU implementations run faster on CPU than on GPU.
5. TF1's RNN stack has many pitfalls; use it with care.