1. sess.run() hangs when called / sess.run() gets stuck / freezes so badly that Ctrl+C cannot kill the process
Solution:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
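A fuller sketch of the same fix (my own minimal example, not from the note): the hang happens because queue-based input ops wait on queues that nothing fills until the queue runners are started; remember to also request a stop and join the threads when done.

import tensorflow as tf

# Minimal queue-based pipeline: an input producer plus tf.train.batch both
# register queue runners that must be started explicitly.
num_queue = tf.train.range_input_producer(10, num_epochs=1, shuffle=False)
value = num_queue.dequeue()
batch = tf.train.batch([value], batch_size=4, allow_smaller_final_batch=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())   # num_epochs uses a local variable
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            print(sess.run(batch))
    except tf.errors.OutOfRangeError:
        print("queue exhausted")
    finally:
        # Stop the queue-runner threads and wait for them to finish.
        coord.request_stop()
        coord.join(threads)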
2. What the values argument of tf.name_scope is for
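As far as I understand it (this explanation and sketch are mine, not part of the original note), the values argument is meant for library-style op functions: name_scope infers the graph from the listed tensors, raises an error if they belong to different graphs, and opens the scope in that graph.

import tensorflow as tf

def scaled_sum(a, b, name=None):
    # values=[a, b] tells name_scope which tensors this function operates on:
    # it resolves the graph from them (and complains if they come from
    # different graphs), then opens the scope in that graph.
    with tf.name_scope(name, default_name="scaled_sum", values=[a, b]):
        return tf.multiply(2.0, tf.add(a, b))

x = tf.constant(1.0)
y = tf.constant(2.0)
out = scaled_sum(x, y)
print(out.op.name)  # something like "scaled_sum/Mul"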
3. Loss not decreasing when implementing VGG by hand
a. If you use the SGD optimizer, note that batch_size and lr are linked: take the largest batch_size that fits in GPU memory, then adjust lr accordingly.
b. Remember to shuffle the dataset; it is bad if every example in a batch has the same label.
c. Change the bias initialization from the original 1 to 0.7 or even 0.5 [though I found 0 works too]; this point can be ignored for now.
d. Check whether the labels are indexed from 0 or from 1.
e. Weight initialization: LSUV ("All you need is a good init") > MSRA init ("Delving into ...", Kaiming He) > Xavier > Gaussian
f. Note that for VGG image preprocessing, subtracting the mean to center the data is essential; everything else is optional.
The above are guesses; below is what actually solved the problem:
a. Always subtract the mean; scaling was not used for now.
b. bias=0 was left unchanged; the weight init used is MSRA init.
c. When training ImageNet on a single GPU, set batch_size to 64, not 32.
d. The indices of the labels fed in and of the computed logits must be aligned.
e. Use the Adam optimizer with an initial learning rate of 2e-4.
f. Note that tf.losses.softmax_cross_entropy(onehot_labels, logits) adds the cross-entropy term to tf.GraphKeys.LOSSES; to get the full training loss, use tf.losses.get_total_loss(), but that includes the regularization loss and here comes to about 13, which is larger than 6.91. The cross-entropy term alone at initialization should be around 6.91 (ln 1000). (Sketch below.)
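A small sketch of this point, with made-up shapes and names, assuming the TF 1.x tf.losses API: the cross-entropy term is registered in tf.GraphKeys.LOSSES, and get_total_loss() adds any regularization losses on top.

import tensorflow as tf

logits = tf.random_normal([8, 1000])                     # stand-in model output
onehot_labels = tf.one_hot(tf.zeros([8], tf.int32), 1000)

# Returns the (weighted) cross-entropy term and also adds it to
# tf.GraphKeys.LOSSES.
xent = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)

# Sum of everything in tf.GraphKeys.LOSSES plus the regularization losses;
# with weight decay this is larger than the cross-entropy alone.
total = tf.losses.get_total_loss(add_regularization_losses=True)

with tf.Session() as sess:
    print(sess.run([xent, total]))   # at init, xent is roughly ln(1000) ~ 6.9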
4. tf.nn.conv2d: padding="SAME" vs padding="VALID"
Note that SAME does not mean the output is always the same size as the input; it means that when stride=1 the output size equals the input size.
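A quick shape check (my own example): with padding="SAME" the output spatial size is ceil(input / stride), while with padding="VALID" it is ceil((input - filter + 1) / stride).

import tensorflow as tf

x = tf.ones([1, 10, 10, 3])   # NHWC input
w = tf.ones([3, 3, 3, 8])     # 3x3 filter, 3 -> 8 channels

same_s1 = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
same_s2 = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding="SAME")
valid_s1 = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="VALID")

print(same_s1.shape)   # (1, 10, 10, 8) -> same as input only because stride=1
print(same_s2.shape)   # (1, 5, 5, 8)   -> ceil(10 / 2)
print(valid_s1.shape)  # (1, 8, 8, 8)   -> 10 - 3 + 1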
5. When setting up global_step, I found it is not incremented automatically
If you want to increment it manually, write it like this:
global_step = tf.Variable(0, trainable=False)
increment_op = tf.assign_add(global_step, tf.constant(1))
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
for step in range(0, 10):
    ....
    sess.run(increment_op)
However, tf.train.Optimizer's apply_gradients can do the increment for you. If you use the compute_gradients / apply_gradients pair, pass global_step into apply_gradients(global_step=global_step); if you use minimize, you also need to pass global_step manually, like this:
global_step = tf.Variable(0, trainable=False, name='global_step')
opt = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)
6. If you run into the following problem
when using code like this:
if __name__ == '__main__':
    tf.app.run()
tf.app.run() expects an entry-point function, which defaults to main. If no main function is defined, you must pass the entry point to tf.app.run() yourself, like this:
tf.app.run(my_func)
def run(main=None, argv=None):
  """Runs the program with an optional 'main' function and 'argv' list."""
  f = flags.FLAGS

  # Extract the args from the optional `argv` list.
  args = argv[1:] if argv else None

  # Parse the known flags from that list, or from the command
  # line otherwise.
  # pylint: disable=protected-access
  flags_passthrough = f._parse_flags(args=args)
  # pylint: enable=protected-access

  main = main or sys.modules['__main__'].main

  # Call the main function, passing through any arguments
  # to the final program.
  sys.exit(main(sys.argv[:1] + flags_passthrough))
In main = main or sys.modules['__main__'].main, the first main on the right-hand side is the argument passed into run(main=None, argv=None), i.e. the entry-point function you specified. sys.modules['__main__'] is the currently running file (e.g. my_model.py), so the second term is the main function defined in that file. If no entry point is passed in, it falls back to that module-level main function; if that does not exist either, an error is raised. In that case pass the entry point explicitly:
tf.app.run(my_main_running_function)
Because many flags are defined, the main function has to accept an argument; writing it as follows is enough (a fuller sketch follows below):
def main(_):
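A minimal sketch of the whole pattern (the flag names here are made up): tf.app.run() parses the flags and then calls this module's main with the remaining argv.

import tensorflow as tf

# Hypothetical flags, just for illustration.
tf.app.flags.DEFINE_integer('batch_size', 64, 'batch size')
tf.app.flags.DEFINE_float('lr', 2e-4, 'initial learning rate')
FLAGS = tf.app.flags.FLAGS

def main(_):                       # receives the unused argv
    print('batch_size =', FLAGS.batch_size)
    print('lr =', FLAGS.lr)

if __name__ == '__main__':
    tf.app.run()                   # finds this module's main() by default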
7. If the test-set size is not evenly divisible by batch_size, handle it as follows. tf.data.Dataset.batch() automatically emits the leftover data as a final, smaller batch, so no manual handling is needed.
dataset = tf.data.Dataset.range(200)
batched = dataset.apply(tf.contrib.data.batch_and_drop_remainder(128))
print(batched.output_shapes)  # ==> "(128,)" (the batch dimension is known)
tf.contrib.data.batch_and_drop_remainder(128) drops the final batch with fewer than 128 elements, so every emitted batch has a fixed size of 128.
By contrast, dataset.batch(128) would yield a two-element dataset with shapes (128,) and (72,), so the batch dimension would not be statically known. In other words, with dataset.batch(128) the output size is not fixed, and the final batch with fewer than 128 elements is still emitted as a batch.
However, once the end of the data is reached, trying to fetch another batch raises an error, so it is best to always write it in the following way:
dataset = tf.data.Dataset.range(200)
next_elements = dataset.batch(128).make_one_shot_iterator().get_next()
while True:
    try:
        next_batch = sess.run(next_elements)
    except tf.errors.OutOfRangeError:
        print("finish")
        break
The code above yields two batches, one of 128 elements and one of 72; when a third batch is requested, execution falls into the except branch.
8. variable_averages = tf.train.ExponentialMovingAverage(): the moving-average model maintains a shadow variable for every variable. If the model's parameters fluctuate sharply, you cannot tell which saved checkpoint will perform best; keeping a moving average of all parameters improves stability at test time.
Zhihu: 芯尚刃
Reference: https://www.zybuluo.com/irving512/note/957702
From the official documentation:
When training a model, it is often beneficial to maintain moving averages of the trained parameters. Evaluations that use averaged parameters sometimes produce significantly better results than the final trained values.
They help use the moving averages in place of the last trained values for evaluations. (Used at evaluation time to make the results more stable.)
Reasonable values for decay are close to 1.0, typically in the multiple-nines range: 0.999, 0.9999, etc.
The apply() method adds shadow copies of trained variables and add ops that maintain a moving average of the trained variables in their shadow copies. It is used when building the training model. The ops that maintain moving averages are typically run after each training step. The average() and average_name() methods give access to the shadow variables and their names. They are useful when building an evaluation model, or when restoring a model from a checkpoint file. They help use the moving averages in place of the last trained values for evaluations.
The apply() method adds a shadow variable for each trained variable, plus ops that maintain the moving averages in those shadow copies; this is set up when building the training model, and the maintenance ops are typically run after each training step.
The apply method creates a shadow variable for each variable (you can also restrict it to specific variables). It is called a shadow variable because it follows the model variable throughout training: it is initialized to the model variable's value and then updated once per training step, according to:
shadow_variable = decay * shadow_variable + (1 - decay) * updated_model_variable
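For example, with decay = 0.999, a shadow value of 0.5 and a model variable just updated to 0.7, the new shadow value is 0.999 * 0.5 + 0.001 * 0.7 = 0.5002, i.e. the shadow variable drifts only slowly toward the current value.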
The average() method returns a variable's shadow value, and average_name() returns the shadow variable's name; both are commonly used during evaluation.
Usage example:
# Create variables.
var0 = tf.Variable(...)
var1 = tf.Variable(...)
# ... use the variables to build a training model...
...
# Create an op that applies the optimizer. This is what we usually
# would use as a training op.
opt_op = opt.minimize(my_loss, [var0, var1])

# Create an ExponentialMovingAverage object
ema = tf.train.ExponentialMovingAverage(decay=0.9999)

# Create the shadow variables, and add ops to maintain moving averages
# of var0 and var1.
maintain_averages_op = ema.apply([var0, var1])

# Create an op that will update the moving averages after each training
# step. This is what we will use in place of the usual training op.
with tf.control_dependencies([opt_op]):
    training_op = tf.group(maintain_averages_op)

...train the model by running training_op...
There are two ways to use the moving averages for evaluations:
- Build a model that uses the shadow variables instead of the variables. For this, use the average() method which returns the shadow variable for a given variable.
- Build a model normally but load the checkpoint files to evaluate by using the shadow variable names. For this use the average_name() method. See the tf.train.Saver for more information on restoring saved variables.
There are two ways to obtain the moving averages at test time:
1. Use average() to get each variable's moving-average value (sketch below).
2. Load the moving averages from the checkpoint by their shadow-variable names.
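A small sketch of the first way (the variable names are my own): build the evaluation path so that it reads the shadow variables returned by average().

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
w = tf.Variable(tf.random_normal([4, 1]), name='weight')
b = tf.Variable(0.0, name='bias')

ema = tf.train.ExponentialMovingAverage(decay=0.999)
maintain_averages_op = ema.apply([w, b])    # creates the shadow variables

# Training path uses the raw variables.
train_logits = tf.matmul(x, w) + b

# Evaluation path uses the moving averages instead.
eval_logits = tf.matmul(x, ema.average(w)) + ema.average(b)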
Restore from checkpoint:
If ExponentialMovingAverage is used during training, the saved checkpoint contains not only the model parameters and the optimizer parameters (e.g. Momentum) but also the ExponentialMovingAverage shadow variables.
Previously we could restore the model parameters directly with the following code, but this does not use the ExponentialMovingAverage results:
saver = tf.train.Saver()
saver.restore(sess, save_path)
To restore the parameters saved by ExponentialMovingAverage instead:
variables_to_restore = ema.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
saver.restore(sess, save_path)
- During training, maintain moving averages of the model parameters.
- During evaluation, use the moving averages as the model parameters.
Example
variables in checkpoint:
bias/ExponentialMovingAverage 0.664593
bias/Momentum 4.12663
weight [[ 0.01567289] [ 0.17180483]]
weight/ExponentialMovingAverage [[ 0.10421171] [ 0.26470858]]
weight/Momentum [[ 5.95625305] [ 6.24084663]]
bias 0.602739
==============================================
variables restored not from ExponentialMovingAverage:
weight:0 [[ 0.01567289] [ 0.17180483]]
bias:0 0.602739
==============================================
variables restored from ExponentialMovingAverage:
weight:0 [[ 0.10421171] [ 0.26470858]]
bias:0 0.664593
By default ema.variables_to_restore() uses tf.moving_average_variables() + tf.trainable_variables().
tf.moving_average_variables: If an ExponentialMovingAverage object is created and the apply() method is called on a list of variables, these variables will be added to the GraphKeys.MOVING_AVERAGE_VARIABLES collection. This convenience function returns the contents of that collection.
It is used in training code like this:
variable_averages = tf.train.ExponentialMovingAverage(decay, global_step)

variables_to_average = (tf.trainable_variables() + tf.moving_average_variables())
variables_averages_op = variable_averages.apply(variables_to_average)

train_op = tf.group(opt, variables_averages_op)
Why add tf.moving_average_variables()? (##todo: write some code to check what variables tf.moving_average_variables() already contains at this point; see the sketch below)
Why is this said to double-average the batch-norm parameters? Because batch_norm(trainable=True) adds its variables to GraphKeys.TRAINABLE_VARIABLES.
It is only there for compatibility with older code; the batch-norm parameters do not need any special handling here.
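A minimal sketch for the todo above (my own check, using a plain variable rather than batch norm): apply() adds the original variable to GraphKeys.MOVING_AVERAGE_VARIABLES, which is exactly what tf.moving_average_variables() returns.

import tensorflow as tf

v = tf.Variable(1.0, name='v')                        # trainable by default
ema = tf.train.ExponentialMovingAverage(decay=0.99)
maintain_op = ema.apply([v])

# apply() put v into GraphKeys.MOVING_AVERAGE_VARIABLES, so it shows up in
# both collections below.
print([var.name for var in tf.moving_average_variables()])   # ['v:0']
print([var.name for var in tf.trainable_variables()])        # ['v:0']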
9. The update_op issue with batch_norm
Reference: http://www.cnblogs.com/hrlnw/p/7227447.html
The scale γ and offset β are optional; if present, they are trainable. The mean and variance use the statistics of the current batch during training, and at test time they use the moving averages accumulated during training.
During training:
(1) Pass training=True.
(2) When building the loss, add the following code so that update_ops are attached to the final train_op; otherwise the moving averages of the mean and variance will never be updated (a combined sketch follows at the end of this item):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
But if the moving averages are already computed here, why does the code in item 8 also include the batch_norm variables?
At test time:
Pass training=False. The batch_norm code branches on this flag: if True it uses the mini-batch statistics, otherwise it uses the global (moving-average) statistics.
Note: the batch-norm moving-average updates are added to tf.GraphKeys.UPDATE_OPS, while the moving averages of other variables are added to tf.GraphKeys.MOVING_AVERAGE_VARIABLES.
http://www.aiboy.pub/2017/11/26/TensorFlow_BN_Layer/
http://ruishu.io/2016/12/27/batchnorm/ (this one is explained in great detail; very, very important)
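A minimal end-to-end sketch of steps (1) and (2) above using tf.layers.batch_normalization (my own example; the referenced posts use similar code): the training flag is fed at run time, and the UPDATE_OPS dependency keeps moving_mean and moving_variance up to date.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 16])
labels = tf.placeholder(tf.float32, [None, 1])
is_training = tf.placeholder(tf.bool)

h = tf.layers.dense(x, 32)
# training=True -> batch statistics; training=False -> moving averages.
h = tf.layers.batch_normalization(h, training=is_training)
h = tf.nn.relu(h)
logits = tf.layers.dense(h, 1)

loss = tf.losses.sigmoid_cross_entropy(labels, logits)

# The ops that update moving_mean / moving_variance live in UPDATE_OPS;
# without this dependency they would never run.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(2e-4).minimize(loss)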
10. Using variables in TensorFlow
Note that TensorFlow code is still just Python code. If you only want an ordinary variable rather than a node in the graph, plain Python rules apply. With tf.Variable you have to call sess.run to read the value and you cannot assign to it directly, which is annoying; its only advantage is that it can be recorded with summaries. So for simple bookkeeping values, just use Python variables.
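A tiny sketch of the contrast (my own example): a plain Python value is read and written directly, while a tf.Variable needs an assign op plus sess.run, but only the graph variable can be recorded with tf.summary.

import tensorflow as tf

# Plain Python bookkeeping: no graph involved, read and write directly.
best_acc = 0.0
best_acc = max(best_acc, 0.91)

# Graph variable: reads and writes go through the session.
acc_var = tf.Variable(0.0, trainable=False)
update_acc = tf.assign(acc_var, 0.91)
tf.summary.scalar('accuracy', acc_var)   # only graph tensors can be summarized

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update_acc)
    print(best_acc, sess.run(acc_var))   # 0.91 0.91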
11. Doing validation during training
Reference: More than one Graph – Code Reuse in TensorFlow
Typically a model will be used in at least three ways:
- Training – finding the correct weights or parameters for the model given some training data. Often done periodically as new data arrives.
- Evaluation – calculating various metrics during training on a different data set to evaluate training quality or for cross validation.
- Serving – on-demand prediction for new data
Different purposes require different graphs, and sometimes a single model needs several graphs.
When building a TensorFlow graph you can only add nodes; deleting them is not supported, and neither is overwriting them:
with tf.Session() as sess:
    my_sqrt = tf.sqrt(4.0, name='my_sqrt')
    # override
    my_sqrt = tf.sqrt(2.0, name='my_sqrt')
    # print all nodes
    print(sess.graph._nodes_by_name.keys())
[u'my_sqrt_1/x', u'my_sqrt_1', u'my_sqrt/x', u'my_sqrt']
So rewriting an existing computation graph is impossible; the only option is to build another graph.
TensorFlow even provides tf.variable_scope specifically to make building different graphs easier, but this approach has a big drawback: even a tiny change requires knowing how the original graph was built, which is too cumbersome, because every time you have to modify the code and rebuild the full graph.
If all we need is a single graph that can do both training and evaluation, we can use conditional logic:
TensorFlow does have a way to encode different behaviors into a single graph – the tf.cond operation.
tf.cond only guards the ops created inside its branch functions.
The following code is correct:
# Good - dropout inside the conditional
is_train = tf.placeholder(tf.bool)
activations = tf.cond(is_train,
                      lambda: tf.nn.dropout(activations, 0.7),
                      lambda: activations)
The following code is incorrect: the dropout op is evaluated every time, even when is_train == False.
# Bad - dropout outside the conditional, evaluated every time!
is_train = tf.placeholder(tf.bool)
do_activations = tf.nn.dropout(activations, 0.7)
activations = tf.cond(is_train,
                      lambda: do_activations,
                      lambda: activations)
Queues and the conditional operator
tf.cond(is_eval,
        lambda: tf.train.shuffle_batch(eval_tensors, 1024, 100000, 10000),
        lambda: tf.train.shuffle_batch(train_tensors, 1024, 100000, 10000))
The code above fails with the error "operation has been marked as not fetchable" and crashes.
TensorFlow does not allow enqueueing conditionally, so to avoid this problem we need to split the operation into two parts: an unconditional part that creates the queues, and a conditional part that dequeues from the correct queue.
Here is an example:
def create_queue(tensors, capacity, ...):
    ...
    queue = data_flow_ops.RandomShuffleQueue(
        capacity=capacity, min_after_dequeue=min_after_dequeue, seed=seed,
        dtypes=types, shapes=shapes, shared_name=shared_name)
    return queue

def create_dequeue(queue, ...):
    ...
    dequeued = queue.dequeue_up_to(batch_size, name=name)
    ...
    return dequeued

def merge_queues(self, is_train_tensor, train_tensors, test_tensors, ...):
    train_queue = self.create_queue(tensors=train_tensors,
                                    capacity=...)
    test_queue = self.create_queue(tensors=test_tensors,
                                   capacity=...)

    input_values = tf.cond(is_train_tensor,
                           lambda: self.create_dequeue(train_queue, ...),
                           lambda: self.create_dequeue(test_queue, ...))
Working with saved graphs
This is too complicated; skip it.
In practice the common, if less efficient, approach is to use placeholders: read the training and validation data with tf.data.Dataset, pull a batch out with sess.run, and feed it into the graph's placeholders via feed_dict.
Using feed_dict may cost some runtime performance, but it is by far the simplest approach. In evaluation, I want to use the same graph and the same session. I can use tf.cond to read from another separate queue instead of the RandomShuffleQueue, or I could use feed_dict in evaluation.
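A rough sketch of that feed_dict approach (all names and shapes here are made up): one graph, one session, two tf.data pipelines whose batches are pulled with sess.run and fed into the same placeholders.

import tensorflow as tf
import numpy as np

# Model inputs as placeholders, shared by training and validation.
images = tf.placeholder(tf.float32, [None, 8])
labels = tf.placeholder(tf.int64, [None])
# ... build the model / loss / train_op from `images` and `labels` here ...

def make_iterator(features, targets, batch_size):
    ds = tf.data.Dataset.from_tensor_slices((features, targets)).batch(batch_size)
    return ds.make_initializable_iterator()

# Synthetic stand-ins for the real training / validation sets.
train_iter = make_iterator(np.random.rand(100, 8).astype(np.float32),
                           np.random.randint(0, 2, 100), 32)
val_iter = make_iterator(np.random.rand(40, 8).astype(np.float32),
                         np.random.randint(0, 2, 40), 32)
train_next, val_next = train_iter.get_next(), val_iter.get_next()

with tf.Session() as sess:
    sess.run([train_iter.initializer, val_iter.initializer])
    img, lbl = sess.run(train_next)          # a training batch
    # sess.run(train_op, feed_dict={images: img, labels: lbl})
    img, lbl = sess.run(val_next)            # a validation batch, same session
    # sess.run(loss_op, feed_dict={images: img, labels: lbl})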
On using tf.cond
tf.cond is evaluated at runtime.
https://stackoverflow.com/questions/45517940/whats-the-difference-between-tf-cond-and-if-else
It says that only a value supplied at runtime, such as a placeholder, can change which branch is taken.
tf.cond is evaluated at the runtime, whereas if-else is evaluated at the graph construction time. If you want to evaluate your condition depending on the value of the tensor at the runtime, tf.cond is the best option.
Do you mean the if-condition is resolved at graph construction time, so if the condition changes at runtime it has no effect on the if block, because it was determined at graph construction time?
The graph is fixed once you finish building it, and a Python if-else condition does not affect the graph while it is executing. If the condition itself lives in the graph, for example as a placeholder, then it can take effect.
If you want to run validation during training, doing it with tf.cond is basically impossible: it raises OutOfRangeError: End of sequence. According to GitHub this is a TensorFlow bug: https://github.com/tensorflow/tensorflow/issues/12414
However, the problem I ran into in my own implementation is different: at validation time image_batch and label_batch need to be reset, i.e. the batch pointer has to go back to the beginning, and I could not get that to work yet.
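One possible workaround I can think of (not from the original note): if the validation input comes from a tf.data initializable iterator, re-running its initializer rewinds it to the first batch, which is exactly the reset described above.

import tensorflow as tf

val_ds = tf.data.Dataset.range(10).batch(4)
val_iter = val_ds.make_initializable_iterator()
val_next = val_iter.get_next()

with tf.Session() as sess:
    for epoch in range(2):
        # Re-initializing rewinds the validation pipeline to the first batch.
        sess.run(val_iter.initializer)
        while True:
            try:
                print(epoch, sess.run(val_next))
            except tf.errors.OutOfRangeError:
                break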
12. Plotting two metrics on the same chart in TensorBoard
For example, showing the training loss and the validation loss on the same chart.
By using two FileWriters with a single tf.summary.scalar you can plot two scalars on a single graph.
import tensorflow as tf
from numpy import random

"""
Plotting multiple scalars on the same graph
"""

writer_val = tf.summary.FileWriter('./logs/plot_val')
writer_train = tf.summary.FileWriter('./logs/plot_train')
loss_var = tf.Variable(0.0)
tf.summary.scalar("loss", loss_var)
write_op = tf.summary.merge_all()
session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())
for i in range(100):
    # loss validation
    summary = session.run(write_op, {loss_var: random.rand()})
    writer_val.add_summary(summary, i)
    writer_val.flush()
    # loss train
    summary = session.run(write_op, {loss_var: random.rand()})
    writer_train.add_summary(summary, i)
    writer_train.flush()