Introduction
The Tensorflow API provides Cluster, Server, and Supervisor to support distributed training of models.
For an introduction to distributed training in Tensorflow, see Distributed Tensorflow. A brief summary:
- A Tensorflow distributed Cluster consists of multiple Tasks; each Task corresponds to one tf.train.Server instance and acts as a separate node of the Cluster;
- Multiple Tasks that play the same role can be grouped into a job. For example, a ps job acts as the parameter server and only stores the parameters of the Tensorflow model, while a worker job acts as a compute node and only runs the compute-intensive Graph operations;
- Tasks within the Cluster communicate with each other to synchronize state, update parameters, and so on (a minimal sketch of this structure follows the list).
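As a minimal sketch of this structure, a cluster with two ps tasks and two worker tasks could be declared as follows; the host names and ports are placeholders, and each process starts exactly one Server for its own task.

import tensorflow as tf

# Hypothetical cluster layout: a "ps" job with two tasks and a "worker" job with two tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# The process for, e.g., worker task 1 starts its own server like this:
server = tf.train.Server(cluster, job_name="worker", task_index=1)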
All nodes of a Tensorflow distributed cluster run the same code. Distributed task code follows a fixed pattern:
# Step 1: parse the command-line flags to get the cluster information (ps_hosts and
#         worker_hosts) and the current node's role (job_name and task_index).

# Step 2: create the Server for the current task node.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

# Step 3: if the current node is a ps, call server.join() and wait indefinitely;
#         if it is a worker, continue with step 4.
if FLAGS.job_name == "ps":
  server.join()

# Step 4: build the model to be trained.
# build tensorflow graph model

# Step 5: create a tf.train.Supervisor to manage the training process.
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir="/tmp/train_logs")
# The supervisor takes care of session initialization and restoring from a checkpoint.
sess = sv.prepare_or_wait_for_session(server.target)
# Loop until the supervisor shuts down.
while not sv.should_stop():
  # train model
A framework for Tensorflow distributed training code
Based on the fixed pattern described above, a distributed Tensorflow program can be written with the framework shown below.
import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.merge_all_summaries()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization and restoring from
    # a checkpoint.
    sess = sv.prepare_or_wait_for_session(server.target)

    # Start queue runners for the input pipelines (if any).
    sv.start_queue_runners(sess)

    # Loop until the supervisor shuts down (or 1000000 steps have completed).
    step = 0
    while not sv.should_stop() and step < 1000000:
      # Run a training step asynchronously.
      # See `tf.train.SyncReplicasOptimizer` for additional details on how to
      # perform *synchronous* training.
      _, step = sess.run([train_op, global_step])


if __name__ == "__main__":
  tf.app.run()
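Two design points in this framework are worth noting: tf.train.replica_device_setter pins the compute ops to the local worker while placing the model variables on the ps tasks (round-robin over the ps tasks by default), and is_chief=(FLAGS.task_index == 0) makes worker 0 the chief, so its Supervisor is the one responsible for initializing variables and writing checkpoints and summaries.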
For any distributed Tensorflow program, only two parts vary:
- the code that builds the tensorflow graph model;
- the code that runs each training step (see the sketch after this list).
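As a rough illustration of these two variable parts, here is a hedged sketch based on MNIST softmax regression; the helper function names are made up for illustration and are not the contents of the mnist_dist.py used below.

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


def load_mnist(data_dir="/tmp/mnist-data"):
  # Loads the MNIST data set used to feed the training loop.
  return input_data.read_data_sets(data_dir, one_hot=True)


def build_model():
  # Variable part 1: graph construction. In the framework above, this code sits
  # inside the tf.device(tf.train.replica_device_setter(...)) block.
  x = tf.placeholder(tf.float32, [None, 784])
  y_ = tf.placeholder(tf.float32, [None, 10])
  W = tf.Variable(tf.zeros([784, 10]))
  b = tf.Variable(tf.zeros([10]))
  y = tf.nn.softmax(tf.matmul(x, W) + b)
  loss = -tf.reduce_sum(y_ * tf.log(y))
  global_step = tf.Variable(0)
  train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)
  return x, y_, train_op, global_step


def run_one_step(sess, mnist, x, y_, train_op, global_step):
  # Variable part 2: one asynchronous training step. In the framework above, this
  # code sits inside the "while not sv.should_stop()" loop.
  batch_xs, batch_ys = mnist.train.next_batch(100)
  _, step = sess.run([train_op, global_step], feed_dict={x: batch_xs, y_: batch_ys})
  return step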
A distributed MNIST task
We verify the setup by modifying the mnist_softmax.py provided in tensorflow/tensorflow into a distributed MNIST example. The modified code is available as mnist_dist.py.
As before, we start a container from the tensorflow Docker image to run the verification.
$ docker run -d -v /path/to/your/code:/tensorflow/mnist --name tensorflow tensorflow/tensorflow
After the tensorflow container has started, open 4 terminals, then enter the tensorflow container with the commands below and switch to the /tensorflow/mnist directory:
$ docker exec -ti tensorflow /bin/bash
$ cd /tensorflow/mnist
Then, in each of the four terminals, run one of the following commands to start one task node of the Tensorflow cluster:
# Start ps 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=0

# Start ps 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=ps --task_index=1

# Start worker 0
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=0

# Start worker 1
python mnist_dist.py --ps_hosts=localhost:2221,localhost:2222 --worker_hosts=localhost:2223,localhost:2224 --job_name=worker --task_index=1
Check the results for yourself.