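Notes on a few multi-GPU pitfalls in TensorFlow

I recently needed to write some tf code again, and a few points about multi-GPU code are worth writing down. I have long wanted to tidy up the snippets I use regularly, but I never find the time and new work keeps starting every week, so that will have to wait. For now, a few scattered notes.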

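Key takeaways

When building the tf graph, place the compute nodes on the GPU and avoid putting them on the CPU where you can. The speedup is dramatic!

When reading data for multiple cards, create the queue outside the for i in range(num_gpus) loop and fetch the data inside the loop.

That is all of the key takeaways. If they already make sense, feel free to hit ctrl+w.
If not, read on.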

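Building the tf graph on the GPU

When building the tf graph, place the compute nodes on the GPU and avoid putting them on the CPU where you can. The speedup is dramatic!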

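import tensorflow as tf

# Pin every op created in this block to the first GPU.
with tf.Graph().as_default(), tf.device('/gpu:0'):
	y = interface(x)
	...
	# allow_soft_placement lets ops without a GPU kernel fall back to the CPU.
	config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
	with tf.Session(config=config) as sess:
		sess.run(...)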

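When building the static graph inside the train function, put the nodes and ops on the GPU, and also set allow_soft_placement=True. This flag keeps the program running: some ops cannot run on the GPU, and with the flag set those ops are moved to the CPU instead. Without it, pinning such an op to the GPU fails with a device-assignment error.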
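To make the fallback rule concrete, here is a toy model in plain Python. It is only a sketch of the idea described above: the kernel table, the op name SomeCpuOnlyOp and the place() helper are all hypothetical, and this is not how TensorFlow's placer is actually implemented.

```python
# Toy model of soft placement. The kernel table and op names below are
# hypothetical, for illustration only; this is not TensorFlow's real placer.
KERNELS = {
    "MatMul": {"CPU", "GPU"},   # has both a CPU and a GPU kernel
    "SomeCpuOnlyOp": {"CPU"},   # made-up op with no GPU kernel
}


def place(op, requested, allow_soft_placement):
    """Return the device an op ends up on when `requested` is asked for."""
    if requested in KERNELS[op]:
        return requested
    # Soft placement: fall back to the CPU instead of failing.
    if allow_soft_placement and "CPU" in KERNELS[op]:
        return "CPU"
    raise ValueError("cannot place %s on %s" % (op, requested))


# With soft placement, everything is requested on the GPU and the
# CPU-only op quietly lands on the CPU.
soft = {op: place(op, "GPU", allow_soft_placement=True) for op in KERNELS}
print(soft)

# Without it, the same request is an error.
try:
    place("SomeCpuOnlyOp", "GPU", allow_soft_placement=False)
    strict_failed = False
except ValueError:
    strict_failed = True
print(strict_failed)
```

In the real session config, log_device_placement=True is what lets you see where each op actually landed.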

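Results

These numbers come from a single-card run with os.environ["CUDA_VISIBLE_DEVICES"] = "0". This is not a rigorous benchmark, just a rough feel for the speedup. With the graph on the CPU it took 113 s per 100 batches; after moving it to the GPU that dropped to 11 s.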

[Figures: CPU run and GPU run, single card]

The same experiment on 4 cards, per 100 batches: 295 s on the CPU versus 65 s on the GPU.

[Figures: CPU run and GPU run, 4 cards]
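For a quick sense of scale, the speedup factors can be worked out from the timings quoted above. This is just arithmetic on those four numbers; the dictionary keys are labels I made up for the two setups.

```python
# Seconds per 100 batches, (CPU placement, GPU placement), as quoted above.
timings = {
    "single card": (113.0, 11.0),
    "4 cards": (295.0, 65.0),
}

# Speedup = time with the graph on the CPU / time with the graph on the GPU.
speedups = {name: cpu / gpu for name, (cpu, gpu) in timings.items()}

for name, factor in speedups.items():
    print("%s: %.1fx faster on the GPU" % (name, factor))
```

That is roughly a 10x gain on a single card and about 4.5x on 4 cards.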

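Reading data with multiple GPUs

On r1.2 I am still using the queue-based approach. It seems that as of r1.4 the official recommendation is the Dataset API, but I have not switched versions yet.

When reading data for multiple cards, create the queue outside the for i in range(num_gpus) loop and fetch the data inside the loop.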

def get_input(self, data_path_list, batch_size):
	# data_path_list is a directory that holds the TFRecord files
	file_list = [os.path.join(data_path_list, i) for i in os.listdir(data_path_list)]
	with tf.variable_scope("tfrecords_input"):
		# One filename queue and one reader, shared by every GPU tower
		filename_queue = tf.train.string_input_producer(file_list)
		reader = tf.TFRecordReader()
		_, serialized_example = reader.read(filename_queue)
		# Parse a single serialized example
		features = tf.parse_single_example(
			serialized_example,
			features={
				'x': tf.FixedLenFeature([], tf.string),
				'labels': tf.FixedLenFeature([], tf.string)
			})
		x = tf.reshape(tf.decode_raw(features['x'], tf.int32), [2, self.n_features])
		labels = tf.reshape(tf.decode_raw(features['labels'], tf.int32), [1, 2])
		return x, labels

def train(self):
	min_after_dequeue = 1000
	capacity = min_after_dequeue + 4 * batch_size
	# Build the reading pipeline once, outside the per-GPU loop
	x_, labels = self.get_input(train_tfrecords_path, batch_size)
	with tf.variable_scope(tf.get_variable_scope()):
		for i in range(num_gpus):
			with tf.device('/gpu:%d' % i):
				with tf.name_scope('GPU_%d' % i) as scope:
					# A separate batching op per GPU, so each tower dequeues its own batch
					train_data = tf.train.shuffle_batch([x_, labels],
						batch_size=batch_size,
						capacity=capacity,
						min_after_dequeue=min_after_dequeue,
						num_threads=3)
					x = train_data[0]
					y_ = train_data[-1]
					interface_output = interface(x)
					cur_loss = compute_loss(interface_output, y_, reg=None)
					...

At first I put the tf.train.shuffle_batch step outside the for loop, right after the get_input() call, but it turned out that every GPU then read the very same batch of data.
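Why does that happen? A toy graph in plain Python shows it. This is only a sketch under my own assumptions: Node, run() and make_dequeue() are hypothetical stand-ins for tf ops, sess.run and the batching op, and the only behaviour they imitate is that a node is evaluated at most once per run.

```python
import itertools


class Node:
    """A toy graph node: creating it computes nothing, run() evaluates it."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs


def run(fetches):
    """Evaluate every node at most once per call, like one session step."""
    cache = {}

    def evaluate(node):
        if id(node) not in cache:
            args = [evaluate(i) for i in node.inputs]
            cache[id(node)] = node.fn(*args)
        return cache[id(node)]

    return [evaluate(f) for f in fetches]


def make_dequeue(source, batch_size):
    """Stand-in for a batching op: pulls batch_size items when evaluated."""
    return Node(lambda: [next(source) for _ in range(batch_size)])


num_gpus, batch_size = 2, 3

# Batching op created once, outside the loop: every tower shares one node.
shared = make_dequeue(itertools.count(), batch_size)
towers_shared = [Node(lambda b: list(b), shared) for _ in range(num_gpus)]
shared_batches = run(towers_shared)
print(shared_batches)

# Batching op created inside the loop: one node per tower, same data source.
source = itertools.count()
towers_own = [Node(lambda b: list(b), make_dequeue(source, batch_size))
              for _ in range(num_gpus)]
own_batches = run(towers_own)
print(own_batches)
```

In the first case the single dequeue node runs once per step and both towers consume its result. In the second, each tower has its own dequeue node, so each one pulls fresh items from the shared source.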

haha

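When writing tf, remember to switch your way of thinking. You cannot assume, as you would in C++, that every step already holds a concrete value. Until sess.run is called, everything in tf is only a symbolic placeholder in the graph, and no data is flowing at all!!!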
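The same point in a few lines of plain Python. Again this is just a sketch: the Sym class and the placeholder() and evaluate() helpers are hypothetical and only mimic the build-then-run split, they are not tf APIs.

```python
class Sym:
    """A symbolic value: it records an operation instead of computing it."""

    def __init__(self, op, args=(), name=None):
        self.op = op
        self.args = args
        self.name = name

    def __add__(self, other):
        return Sym("add", (self, other))

    def __mul__(self, other):
        return Sym("mul", (self, other))


def placeholder(name):
    """A named input whose value is supplied only at evaluation time."""
    return Sym("placeholder", name=name)


def evaluate(node, feed):
    """Walk the recorded graph and compute a value from the fed inputs."""
    if node.op == "placeholder":
        return feed[node.name]
    left, right = (evaluate(arg, feed) for arg in node.args)
    return left + right if node.op == "add" else left * right


# "Graph construction": nothing is computed here.
x = placeholder("x")
y = x * x + x
is_value = isinstance(y, (int, float))
print(is_value)

# "Running": the same graph yields different values for different feeds.
result_a = evaluate(y, {"x": 3})
result_b = evaluate(y, {"x": 5})
print(result_a, result_b)
```

Here y is just a description of x * x + x until evaluate() is called with a feed, which is the mindset shift the paragraph above is asking for.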
