Distributed TensorFlow

Todo list:

Introduction to Distributed TensorFlow
Deploying and running Distributed TensorFlow
Comparing multi-GPU training on three hosts against multi-GPU training on two hosts

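Distributed TensorFlow is meant to load a model onto the GPUs of multiple hosts in order to speed up training.
With distributed TensorFlow, larger models can be run faster. Distributed TensorFlow can run on a distributed cluster, and can also run on ...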

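Distributed TensorFlow is an improvement built on DistBelief. DistBelief has two different kinds of processes: the Parameter Server (PS) and the worker replicas.
The PS is responsible for storing the model state (that is, the parameter values after each update) and for updating it with the gradients that arrive afterwards. Its role is to tie together the graphs in the individual workers.
The worker is responsible for computing the gradients of the weights.
TensorFlow borrows this approach and makes the code friendlier to write: in DistBelief the worker and the PS are processes that run two different programs, whereas in TensorFlow the worker and the PS run exactly the same code.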
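
The division of labor between the PS and the workers can be illustrated with a toy, single-process sketch. This is plain Python, not TensorFlow code; the class name, the function name, and the data (generated from y = 3x) are all made up for illustration.

```python
# Toy illustration of the PS / worker split (not TensorFlow code).

class ParameterServer:
    """Holds the model state and applies incoming gradients."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        # Workers read the current parameter value before computing.
        return self.weight

    def push(self, gradient):
        # Plain gradient-descent update of the stored parameter.
        self.weight -= self.learning_rate * gradient


def worker_gradient(weight, shard):
    """Gradient of the mean squared error of y = weight * x on one shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


# Hypothetical data generated from y = 3x, split into two shards.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
ps = ParameterServer(weight=0.0, learning_rate=0.01)

for step in range(200):
    for shard in shards:  # each inner iteration plays the role of one worker
        ps.push(worker_gradient(ps.pull(), shard))

print(round(ps.weight, 3))
```

The workers never store the weight themselves: they pull it, compute a gradient on their own data, and push the gradient back, which is the role split described above.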

Worker Replication

There are two kinds of worker replication: in-graph and between-graph.

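In-graph:
Different parts of the model's computation graph are placed on different machines for execution.
In-graph mode extends computation from a single machine with multiple GPUs to multiple machines with multiple GPUs, but data distribution still happens on a single node. The advantage is simple configuration: the other multi-GPU compute nodes only need to expose a network interface and wait there for tasks. Those interfaces are used just like a local GPU device: specify tf.device("/job:worker/task:n"). The PS is responsible for the join operation.
Between-graph:
Data parallelism: every machine uses exactly the same computation graph. In between-graph mode the trained parameters are stored on the parameter server and the data does not need to be distributed: it is stored in shards on the compute nodes. Each node computes on its own shard and, when done, tells the parameter server which parameters to update; the parameter server then updates them. The advantage is that no training data has to be distributed, which saves a great deal of time when the data volume is at the TB level, so between-graph mode is the recommended choice for deep learning on big data.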
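
One way to picture the "each node computes on its own shard" idea is to let every worker select its share of the input files from its task index. The sketch below is an assumption about how such sharding could be done; the function name and the file names are hypothetical and do not come from the test script.

```python
def shard_for_worker(files, task_index, num_workers):
    """Return the subset of input files that one worker reads locally."""
    if not 0 <= task_index < num_workers:
        raise ValueError("task_index must be in [0, num_workers)")
    # Strided slicing gives every worker a disjoint subset of the files.
    return files[task_index::num_workers]


# Hypothetical file names, for illustration only.
files = ["part-%02d.tfrecord" % i for i in range(5)]
for task_index in range(2):
    print(task_index, shard_for_worker(files, task_index, 2))
```

Because the subsets are disjoint and together cover all files, no node has to ship training data to another node.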

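Both modes support synchronous and asynchronous updates.
With synchronous updates, each gradient update waits until all of the distributed data has been processed and the results returned; the gradients are then summed and averaged before the parameters are updated. The advantage is that the loss decreases fairly steadily. The drawback is just as clear: throughput is determined by the slowest shard.

With asynchronous updates, every compute node works on its own and updates the parameters with its own results. The advantage is speed and full use of the compute resources; the drawback is that the loss decreases unsteadily, with large fluctuations.

When the data volume is small and the nodes have comparable compute power, synchronous mode is recommended; when the data volume is large and the machines differ widely in performance, asynchronous mode is recommended.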
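
The difference between the two update modes can be sketched in a few lines of plain Python. This is a simplified model, not TensorFlow code: the gradients and timings are made-up numbers, and the asynchronous case ignores the fact that real workers may compute their gradients against stale parameters.

```python
def sync_update(weight, gradients, learning_rate):
    """Synchronous mode: wait for every worker, average, update once."""
    mean_gradient = sum(gradients) / len(gradients)
    return weight - learning_rate * mean_gradient


def async_update(weight, gradients, learning_rate):
    """Asynchronous mode: apply each worker's gradient as it arrives."""
    for gradient in gradients:
        weight -= learning_rate * gradient
    return weight


def sync_round_time(worker_seconds):
    """A synchronous round is gated by the slowest worker."""
    return max(worker_seconds)


gradients = [4.0, 2.0]  # hypothetical gradients from two workers
print(sync_update(1.0, gradients, 0.1))   # one averaged step
print(async_update(1.0, gradients, 0.1))  # two independent steps
print(sync_round_time([0.013, 0.040]))    # hypothetical per-worker times
```

The synchronous version takes a single, smoother step per round, while the asynchronous version takes one step per worker without waiting, which is where both its speed and its noisier loss come from.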

How do you deploy distributed TensorFlow?

Demo:

Environment:

Ubuntu 16.04 servers x 3, ip=[172.16.60.114, 172.16.60.107, 172.16.50.111]
CUDA 8.0, cuDNN 6
TensorFlow 1.10.0
Anaconda3 | Python 3.6

Test file

For the full code, see GitHub: Leechen2014/tec4tensorflow

Walkthrough:

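How to use the distributed API:
cluster = tf.train.ClusterSpec({'ps': ['<host:port of the ps server>'], 'worker': ['<host:port of each worker>']})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)  # job_name is this process's own job: 'ps' or 'worker'
The ps process then only needs to call:
server.join()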
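
As a sketch of where the cluster definition comes from, the comma-separated host strings can be split into the lists that a cluster specification expects. The worker addresses below are the ones used in this post; the ps address is an assumption inferred from the logs (port 22221 on host 114), and the helper name is hypothetical.

```python
def parse_hosts(comma_separated):
    """Split 'host:port,host:port' into a list, dropping stray whitespace."""
    return [h.strip() for h in comma_separated.split(",") if h.strip()]


# Assumed ps address (inferred from the logs); worker addresses from this post.
ps_hosts = "172.16.60.114:22221"
worker_hosts = "172.16.60.107:22221,172.16.50.111:22221"

cluster_dict = {
    "ps": parse_hosts(ps_hosts),
    "worker": parse_hosts(worker_hosts),
}
print(cluster_dict)
# With TensorFlow 1.x installed, this dict could be passed to
# tf.train.ClusterSpec(cluster_dict).
```

The position of each address in its list is the task index of that process, which is why the order of the addresses matters.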

Multi-GPU implementation:
with tf.device(tf.train.replica_device_setter(cluster=cluster)):  # alternatively, set worker_device = '/job:worker/task:%d/gpu:0' % FLAGS.task_index on each worker, which is a bit more cumbersome
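
By default, tf.train.replica_device_setter places variables on the ps tasks in round-robin order and leaves the other operations on the worker. The plain-Python sketch below only mimics that placement rule to show the idea; it is not TensorFlow's implementation, and the function and variable names are hypothetical.

```python
import itertools


def round_robin_placement(variable_names, num_ps_tasks):
    """Assign each variable to a ps task in turn (round-robin)."""
    if num_ps_tasks < 1:
        raise ValueError("need at least one ps task")
    ps_tasks = itertools.cycle(range(num_ps_tasks))
    return {name: "/job:ps/task:%d" % next(ps_tasks) for name in variable_names}


# Hypothetical variable names for a two-layer network.
placement = round_robin_placement(["w1", "b1", "w2", "b2"], num_ps_tasks=2)
for name, device in placement.items():
    print(name, "->", device)
```

With a single ps, as in this demo, every variable ends up on /job:ps/task:0.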

How to run:

# Start the gRPC service on the ps host with the following command:
CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=ps --task_index=0

# On 107, run:
CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0

# On 111, run:
CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1

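Notes:

Passwordless SSH login does not need to be set up.
Because the code assigns devices with
with tf.device(tf.train.replica_device_setter(cluster=XXX)):
the task_index given to each worker, and the order in which the workers are started, should match the order of the addresses in
flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221','Comma-separated list of hostname:port pairs')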
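
A small check can make the relationship between task_index and the worker_hosts order explicit. The helper below is a hypothetical sketch in plain Python: it looks up which address a given task index maps to, so a mismatch between the machine and the index shows up before the server is started.

```python
def address_for_task(worker_hosts, task_index):
    """Return the host:port that a given worker task index maps to."""
    hosts = [h.strip() for h in worker_hosts.split(",")]
    if not 0 <= task_index < len(hosts):
        raise ValueError("task_index %d is not listed in worker_hosts" % task_index)
    return hosts[task_index]


worker_hosts = "172.16.60.107:22221,172.16.50.111:22221"
print(address_for_task(worker_hosts, 0))  # should be run on 107
print(address_for_task(worker_hosts, 1))  # should be run on 111
```

If the process started with --task_index=0 is not on the first listed host, it will try to bind an address that belongs to another machine.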

Results:

# 114 is the ps; it starts the gRPC service
2018-09-12 16:07:55.938936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-09-12 16:07:55.938944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y 
2018-09-12 16:07:55.938949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N 
2018-09-12 16:07:55.940175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 10403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0, compute capability: 6.1)
2018-09-12 16:07:56.080591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 10403 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)
2018-09-12 16:07:56.742461: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}
2018-09-12 16:07:56.742526: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 172.16.60.107:22221, 1 -> 172.16.50.111:22221}
2018-09-12 16:07:56.764061: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:375] Started server with target: grpc://localhost:22221

------------------------
# 107 is worker 0
1536739841.883745: Worker 0: traing step 7599 dome (global step:9986)
1536739841.897058: Worker 0: traing step 7600 dome (global step:9988)
1536739841.910197: Worker 0: traing step 7601 dome (global step:9990)
1536739841.923900: Worker 0: traing step 7602 dome (global step:9992)
1536739841.936971: Worker 0: traing step 7603 dome (global step:9994)
1536739841.950250: Worker 0: traing step 7604 dome (global step:9996)
1536739841.964122: Worker 0: traing step 7605 dome (global step:9998)
1536739841.978155: Worker 0: traing step 7606 dome (global step:10000)
Training ends @ 1536739841.978258
Training elapsed time:98.617033 s
After 10000 training step(s), validation cross entropy = 1141.94


----------------------------
# 111 is worker 1
1536739841.872289: Worker 1: traing step 2389 dome (global step:9985)
1536739841.885433: Worker 1: traing step 2390 dome (global step:9987)
1536739841.898431: Worker 1: traing step 2391 dome (global step:9989)
1536739841.911799: Worker 1: traing step 2392 dome (global step:9991)
1536739841.924894: Worker 1: traing step 2393 dome (global step:9993)
1536739841.938620: Worker 1: traing step 2394 dome (global step:9995)
1536739841.952448: Worker 1: traing step 2395 dome (global step:9997)
1536739841.966328: Worker 1: traing step 2396 dome (global step:9999)
1536739841.979593: Worker 1: traing step 2397 dome (global step:10001)
Training ends @ 1536739841.979693
Training elapsed time:41.149895 s
After 10000 training step(s), validation cross entropy = 1141.94
D0912 16:10:42.498070727   37760 dns_resolver.cc:280]        Start resolving.

The output above shows that 114 started the gRPC service but never shut it down. A solution to this has been posted on Stack Overflow under "Shut down server in TensorFlow". For details on gRPC, see using-grpc-in-python.

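Remarks:

A ps and a worker can coexist on the same host, much as a master and a slave can share a host in Hadoop. To avoid a port conflict, the ps and the worker on the same host should use different ports.
There can be more than one ps; they are written the same way as the workers.
To stress this again: because with tf.device(tf.train.replica_device_setter(cluster=XXX)): is used, starting the workers in an order different from the one written in flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221','Comma-separated list of hostname:port pairs') will cause an OS Error.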
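
The port rule in the first remark can be checked mechanically before anything is launched. The sketch below is a hypothetical helper in plain Python that reports any host:port used by more than one task; the ps address is an assumption inferred from the logs.

```python
def find_conflicts(cluster):
    """Return addresses that are claimed by more than one task."""
    seen = {}
    for job, addresses in cluster.items():
        for index, address in enumerate(addresses):
            seen.setdefault(address, []).append("%s:%d" % (job, index))
    return {addr: tasks for addr, tasks in seen.items() if len(tasks) > 1}


# ps and a worker share host 114 but use different ports: no conflict.
ok_cluster = {
    "ps": ["172.16.60.114:22221"],  # assumed ps address
    "worker": ["172.16.60.107:22221", "172.16.60.114:22222"],
}
# Same host and same port for ps and worker: conflict.
bad_cluster = {
    "ps": ["172.16.60.114:22221"],
    "worker": ["172.16.60.114:22221"],
}
print(find_conflicts(ok_cluster))
print(find_conflicts(bad_cluster))
```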

To make the ps host run a worker process as well:
Take line 20: flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221', 'Comma-separated list of hostname:port pairs')
and add the IP and port of 114, changing it to: flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221,172.16.60.114:22222', 'Comma-separated list of hostname:port pairs')
Then rerun, paying attention to the start order.
Results:

##############114 ps##################################
h strength 1 edge matrix:
2018-09-12 16:38:41.432822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-09-12 16:38:41.432830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y 
2018-09-12 16:38:41.432835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N 
2018-09-12 16:38:41.433475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 10403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0, compute capability: 6.1)
2018-09-12 16:38:41.949217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 10403 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)
2018-09-12 16:38:42.086615: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}
2018-09-12 16:38:42.086674: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 172.16.60.107:22221, 1 -> 172.16.50.111:22221, 2 -> 172.16.60.114:22222}
2018-09-12 16:38:42.094741: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:375] Started server with target: grpc://localhost:22221

###############107 worker 0##########################
#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0
1536741807.352432: Worker 0: traing step 3305 dome (global step:9997)
1536741807.388893: Worker 0: traing step 3306 dome (global step:10000)
Training ends @ 1536741807.388980
Training elapsed time:80.524482 s
After 10000 training step(s), validation cross entropy = 1127

####################111 worker 1###################################
#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1
1536741807.370341: Worker 1: traing step 3222 dome (global step:9998)
1536741807.398533: Worker 1: traing step 3223 dome (global step:10002)
Training ends @ 1536741807.398634
Training elapsed time:79.786702 s
After 10000 training step(s), validation cross entropy = 1127

#################114 worker2 #############
#CUDA_VISIBLE_DEVICES='0,1' python TestDistributed.py --job_name=worker --task_index=2
1536741807.346162: Worker 2: traing step 3474 dome (global step:9996)
1536741807.359073: Worker 2: traing step 3475 dome (global step:10000)
Training ends @ 1536741807.359174
Training elapsed time:79.858818 s
After 10000 training step(s), validation cross entropy = 1127

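Result comparison

A preliminary comparison can be made from the logs:
With two workers, the average elapsed time is about 69.88 s with loss = 1141.94; with three workers, the average elapsed time is about 80.06 s with loss = 1127.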
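
The averages can be recomputed directly from the "Training elapsed time" values printed in the logs above. The snippet below only does that arithmetic; the variable names are made up.

```python
# Elapsed times in seconds, copied from the worker logs above.
two_workers = [98.617033, 41.149895]
three_workers = [80.524482, 79.786702, 79.858818]


def mean(values):
    return sum(values) / len(values)


print("two workers:   %.3f s" % mean(two_workers))
print("three workers: %.3f s" % mean(three_workers))
```

Note that in the two-worker run the individual times differ widely (98.6 s against 41.1 s), so the average hides a large spread between the two workers.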

