[源碼解析] PyTorch 分布式(1)------歷史和概述

使用 torch.multiprocessing 封裝了 Python 原生 multiprocessing模塊，這樣就可以利用多個CPU核。
導入了 THD (distributed pytorch)，這就有了用於分布式計算的底層庫。
引入了torch.distributed包，它允許在多台機器之間交換張量。使用這個包可以在多台機器之上使用更大的batch進行訓練。
發布了 c10d 庫，這成為 torch.distributed package和torch.nn.parallel.DistributedDataParallel 包的基礎后端，同時 THD 被廢棄。
提供了一個分布式RPC框架來支持分布式模型並行訓練。它允許遠程運行函數和引用遠程對象，而無需復制周圍的真實數據，並提供autograd和optimizer API以透明地進行后向傳播和跨RPC邊界更新參數。
引入了彈性訓練，Torchelastic提供了“torch.distributed.launch”CLI的一個嚴格超集，並添加了容錯和彈性功能。
引入了流水線並行，就是 torchgpipe。

其歷史演進圖如下：

                                           v1.0

                                           v1.1

                                           v1.2

               v0.1.8                      v1.3                  v1.7

                THD                        C10D               TorchElastic
                 +                          +                      +
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
+-------+--------+------------+-------------+-----------+----------+------------+----------> Time
        |                     |                         |                       |
        |                     |                         |                       |
        |                     |                         |                       |
        |                     |                         |                       |
        |                     |                         |                       |
        +                     +                         +                       +

  Multiprocessing       torch.distributed              RPC                   Pipeline

     v0.1.2                  v0.2                      v1.4                   v1.8

     v0.1.6                  v0.4                      v1.5                   v1.9

                                                       v1.6

具體歷史如下，有興趣的朋友可以研究一下，看看一個機器學習系統如何一步一步進入分布式世界，沒有興趣的朋友可以直接跳過到后續概述部分。

1.1 Multiprocessing

PyTorch 0.1.2

使用 torch.multiprocessing 封裝了 Python 原生 multiprocessing模塊，這樣就可以利用多個CPU核。

具體原因是，在Python 之中，使用線程是有技術問題的，主要就是 Global Interpreter Lock，因此應該使用多進程。

With Python, one cannot use threads because of a few technical issues.
Python has what is called Global Interpreter Lock, which does not allow threads to concurrently execute python code.

Hence, the most pythonic way to use multiple CPU cores is multiprocessing

We made PyTorch to seamlessly integrate with python multiprocessing.
This involved solving some complex technical problems to make this an air-tight solution, and more can be read in this in-depth technical discussion.

PyTorch 0.1.6

Multiprocessing 支持 CUDA。

Uptil now, Tensor sharing using multiprocessing only worked for CPU Tensors.
We've now enabled Tensor sharing for CUDA tensors when using python-3.
You can read more notes here: http://pytorch.org/docs/notes/multiprocessing.html

1.2 THD 底層庫

PyTorch 0.1.8

導入了 THD (distributed pytorch)，這就有了用於分布式計算的底層庫。

Merged an initial version of THD (distributed pytorch)

1.3 torch.distributed 庫

PyTorch 0.2

We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).

For each of the machines to first identify each other and assign unique numbers to each other (ranks), we provide simple initialization methods:

shared file system (requires that all processes can access a single file system)

IP multicast (requires that all processes are in the same network)

environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)

這個版本引入了torch.distributed包，它允許在多台機器之間交換張量。使用這個包可以在多台機器之上使用更大的batch進行訓練。

該distributed包遵循 MPI 風格的編程模型，即distributed包提供了比如send, recv, all_reduce 這樣的方法來在不同的節點（機器）之間交換張量。

因為需要多台機器之間彼此識別，所以需要有一個機制來唯一標示每台機器，這就是rank。distributed包提供了幾種簡單的初始化方法：

共享文件系統（所有機器上的所有進程都可以訪問這個文件系統）
IP組播（要求所有進程在同一個網絡中）
環境變量（需要用戶手動指定rank，並且提供一個所有進程可訪問的節點地址）

World size是將參與訓練的進程數。每個進程都將被分配一個rank，該rank是一個介於 0 和 world_size - 1 之間的數字，在此作業中是唯一的。它將用作進程標識符，並將用於代替地址，例如，指定應將張量發送到哪個 rank（進程）。

分布式計算中的原語包括同步模式的send, recv 和異步模式的 isend，irecv。因為某些通信模式出現的太頻繁了，所以 PyTorch 開發了高階函數，比如all_reduce，這些集合通信原語會用於整個進程組，並且更加高效。

但是分布式包還是太底層，所以基本還是基於它來實現更高階的算法或者定制特殊算法，因為數據並行訓練是如此常見，PyTorch 為此創建了高級幫助程序DistributedDataParallel，它幾乎是 nn.DataParallel 的替代品。

PyTorch 0.4

這個版本有了幾處相關。

增加了DistributedDataParallelCPU，這個類和DistributedDataParallel很相似，但是主要支持在CPU之上訓練（DistributedDataParallel 目標是 GPU）。這個類支持 mpi, gloo and tcp 這些后端（tcp后端后來被廢除）。

Add DistributedDataParallelCPU. This is similar to DistributedDataParallel, but with specific support for models running on the CPU (contrary to DistributedDataParallel, which targets GPU), and supports mpi, gloo and tcp backends #5919.

增加了新工具腳本。此腳本可以在單個機器或者多個機器之上使用 DistributedDataParallel。

Helper utility for launching Distributed Training jobs

We have added an utility function to help launch jobs on a distributed setup.
In order to launch a script that leverages DistributedDataParallel on either single-node multiple-nodes, we can make use of torch.distributed launch as follows
python -m torch.distributed.launch my_script.py --arg1 --arg2 --arg3

增加了基於 NCCL 2.0 的新分布式后端，這樣速度得到很大提升，也可以基於多個GPU提供集合通信的新API。

A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed.
It also provides new APIs for collective operations on multiple GPUs.
You can enable the new backend via
torch.distributed.init_process_group("nccl")

其他改進如下，比如聚合多個小廣播操作，混合精度，Infiniband 支持等。

Coalesce many small broadcasts to improve performance #4978

Add mixed-precision support for distributed training #4891

Release NCCL distributed backend. Previously it was marked as experimental. #4921

Enable Infiniband support for Gloo data channel with automatic IB device detection #4795

1.4 c10d庫

PyTorch 1.0

torch.distributed new "C10D" library

The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by the new "C10D" library. The main highlights of the new library are:

C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.

Significant Distributed Data Parallel performance improvements especially for slower network like ethernet-based hosts

Adds async support for all distributed collective operations in the torch.distributed package.

Adds send and recv support in the Gloo backend

這個版本發布了 c10d 庫，這成為 torch.distributed package和torch.nn.parallel.DistributedDataParallel 包的基礎后端，這個庫的主要亮點是：

因為C10D完全是異步操作，所以對於所有后端（Gloo, NCCL, 和 MPI）的性能都提升很大。
對於類似基於以太網的慢速網絡，分布式數據並行得到了巨大提升。
對torch.distributed 包中的所有分布式集合操作都添加了異步支持。
對Gloo后端添加了send、recv支持。

另外還有幾點修改。

TCP后端被移除，Gloo和 MPI 后端被推薦用於CPU集合通信，NCCL被推薦用於GPU集合通信。
舊的（基於THD）torch.distributed 包被廢棄。
舊的（基於THD）torch.nn.parallel.DistributedDataParallel包被廢棄。

torch.distributed: the TCP backend is removed, we recommend to use Gloo and MPI backends for CPU collectives and NCCL backend for GPU collectives.

the old (THD-backed) torch.distributed package is deprecated but still available at torch.distributed.deprecated.

The old (THD-backed) torch.nn.parallel.DistributedDataParallel is deprecated but still available at torch.nn.parallel.deprecated.DistributedDataParallel.

PyTorch 1.1

nn.parallel.DistributedDataParallel 可以支持多GPU模型，這樣模型並行和數據並行可以跨server進行協作。

DistributedDataParallel new functionality and tutorials

nn.parallel.DistributedDataParallel: can now wrap multi-GPU modules, which enables use cases such as model parallel (tutorial) on one server and data parallel (tutorial) across servers.

c10d ProcessGroup::getGroupRank 被移除。

PyTorch 1.2

此版本做了如下改進：

Distributed Package 可以支持CPU modules，稀疏張量，本地梯度累積。

Distributed Package

DistributedDataParallel: support CPU modules. (20236)

DistributedDataParallel: support sparse tensors. (19146)

DistributedDataParallel: support local gradient accumulation. (21736)

另外也有一些其他小改進，比如對於MPI操作加入了device guard 。

PyTorch 1.3

添加了torch.distributed對macOS的支持，但是只能使用Gloo后端，用戶只需要修改一行代碼就可以復用其他平台的代碼。也做了一些其他改進。

This release adds macOS support for torch.distributed with the Gloo backend. You can more easily switch from development (e.g. on macOS) to deployment (e.g. on Linux) without having to change a single line of code. The prebuilt binaries for macOS (stable and nightly) include support out of the box.

torch.distributed.all_reduce_coalesced Support allreduce of a list of same-device tensors (24949, 25470, 24876)

torch.distributed.all_reduce Add bitwise reduction ops (BAND, BOR, BXOR) (26824)

1.5 RPC框架

PyTorch 1.4.0

此版本開始試驗分布式模型訓練。

隨着RoBERTa等模型的規模不斷擴大直到數十億個參數，模型並行訓練變得越來越重要，因為其可以幫助研究人員突破極限。1.4.0 版本提供了一個分布式RPC框架來支持分布式模型並行訓練。它允許遠程運行函數和引用遠程對象，而無需復制相關真實數據，並提供autograd和optimizer API以透明地進行后向傳播和跨RPC邊界更新參數。

Distributed Model Parallel Training [Experimental]

With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.

torch.distributed.rpc是一個新引入的包。它的基本構建塊可以在模型訓練和推理中遠程運行函數，這對於分布式模型並行或實現參數服務器框架等場景非常有用。更具體地說，它包含四個支柱：RPC、遠程引用、分布式autograd和分布式優化器。請參閱文件和教程更多細節。

RPC [Experimental]

torch.distributed.rpc is a newly introduced package. It contains basic building blocks to run functions remotely in model training and inference, which will be useful for scenarios like distributed model parallel or implementing parameter server frameworks. More specifically, it contains four pillars: RPC, Remote Reference, Distributed Autograd, and Distributed Optimizer. Please refer to the documentation and the tutorial for more details.

PyTorch 1.5

正式發布了 torch.distributed.rpc 。

“torch.distributed.rpc”包旨在支持不適合 “DistributedDataParallel”的各種分布式訓練范式。示例包括參數服務器訓練、分布式模型並行和分布式管道並行。torch.distributed.rpc包中的功能可以分為四組主要的API。

**RPC **API允許在指定目標工作進程上使用給定的參數來運行函數，並且可以獲取返回值或創建對返回值的分布式引用。
RRef（遠程引用）是另一個worker上對象的引用。持有RRef的工作者可以顯式地請求對象的副本，並且還可以與其他worker共享輕量級RRef，而不必擔心引用計數。當多個worker需要重復訪問同一遠程對象的不同版本時，這尤其有用。
使用分布式自動加載，應用程序可以自動計算梯度，即使模型已經使用RPC在多個worker上拆分過。PyTorch 在向前傳播過程中將RPC邊界處的局部autograd圖縫合在一起，並向在后向傳播之中穿越邊界讓參與者啟動本地autograd。
分布式優化器使用分布式autograd計算的梯度來更新模型參數。它的構造函數接受一個本地優化器（例如SGD，Adagrad等）和一個參數RRef列表，它的step函數在所有不同的 RRef 所有者（worker）之上自動使用本地優化器來更新參數。

Distributed RPC framework APIs [Now Stable]

The torch.distributed.rpc package aims at supporting a wide range of distributed training paradigms that do not fit into DistributedDataParallel. Examples include parameter server training, distributed model parallelism, and distributed pipeline parallelism. Features in the torch.distributed.rpc package can be categorized into four main sets of APIs.

The RPC API allows running a function on a specified destination worker with given arguments and fetches the return value or creates a distributed reference to the return value.

The RRef (Remote REFerence) serves as a reference to an object on another worker. A worker holding an RRef can explicitly request copies of the object, and it can also share the light-weight RRef with other workers without worrying about reference counting. This is especially useful when multiple workers need to repeatedly access different versions of the same remote object.

With Distributed Autograd, applications can automatically compute gradients even if a model is split on multiple workers using RPC. This is achieved by stitching together local autograd graphs at RPC boundaries in the forward pass and reaching out to participants to transparently launch local autograd in the backward pass.

The Distributed Optimizer uses gradients computed by Distributed Autograd to update model parameters. Its constructor takes a local optimizer (e.g., SGD, Adagrad, etc.) and a list of parameter RRefs, and its step() function automatically uses the local optimizer to update parameters on all distinct RRef owner workers.

PyTorch 1.6

此版本對 DDP 和 RPC 進行了大量的改進，也增加了新特性，幾個大特性包括：

Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedural call (RPC) packages.

RPC的TensorPipe后端

PyTorch 1.6為RPC模塊引入了一個新的后端，它利用了TensorPipe庫。TensorPipe庫是一個面向機器學習的張量感知的點對點通信原語，旨在對PyTorch中分布式培訓的當前原語（Gloo、MPI等）進行補足，這些原語是集合通信和分塊的。TensorPipe的成對性和異步性使其有助於構建超越數據並行的新的網絡模式：客戶機-服務器方式（例如，嵌入的參數服務器、actor-learner separation in Impala-style RL等）和模型與管道並行訓練（比如GPipe），gossip SGD等。

TensorPipe backend for RPC

PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, ...) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, ...) and model and pipeline parallel training (think GPipe), gossip SGD, etc.

[Beta] DDP+RPC

PyTorch分布式支持兩種強大的范式：DDP用於完全同步的數據並行訓練，RPC框架允許分布式模型並行。

目前，這兩個特性是獨立工作的，用戶不能混合和匹配這兩個特性來嘗試混合並行模式。從PyTorch 1.6開始，我們已經使DDP和RPC能夠無縫地協同工作，這樣用戶就可以將這兩種技術結合起來，實現數據並行和模型並行。例如，用戶希望在參數服務器上放置大型嵌入表，並使用RPC框架進行嵌入查找，但在培訓器上存儲較小的dense參數，並使用DDP同步dense參數。

[Beta] DDP+RPC

PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Currently, these two features work independently and users can’t mix and match these to try out hybrid parallelism paradigms.

Starting PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.

[Beta] RPC - Asynchronous User Functions

RPC異步用戶函數支持在執行用戶定義的函數時在服務器端進行yield 和resume。在此功能之前，當被調用方處理請求時，一個RPC線程將等待用戶函數返回。如果用戶函數包含IO（例如，嵌套RPC）或信令（例如，等待另一個請求解除阻止），則相應的RPC線程將處於空閑狀態，等待這些事件。因此，一些應用程序必須使用大量線程並且發送額外的RPC請求，這可能會導致性能下降。要使用戶函數在此類事件中yield，應用程序需要：1）使用@rpc.functions.async_executiondecorator封裝函數；2）讓函數返回'torch.futures.Future'，並將恢復邏輯作為回調安裝到'Future'對象上。

[Beta] RPC - Asynchronous User Functions

RPC Asynchronous User Functions supports the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when an callee processes a request, one RPC thread waits until the user function returns. If the user function contains IO (e.g., nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications have to use a very large number of threads and send additional RPC requests, which can potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) Decorate the function with the @rpc.functions.async_execution decorator; and 2) Let the function return a torch.futures.Future and install the resume logic as callbacks on the Future object.

[Beta] Fork/Join Parallelism

此版本增加了對語言級構造的支持，以及對TorchScript代碼中粗粒度並行性的運行時支持。這種支持對於並行運行集成中的模型或並行運行遞歸網絡中的雙向組件等情況非常有用，並為任務級並行解鎖了並行體系結構（例如許多核心CPU）的計算能力。

TorchScript程序的並行執行通過兩個原語：“torch.jit.fork”和“torch.jit.wait” 完成支持。

[Beta] Fork/Join Parallelism

This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and allows the ability to unlock the computational power of parallel architectures (e.g. many-core CPUs) for task level parallelism.

Parallel execution of TorchScript programs is enabled through two primitives: torch.jit.fork and torch.jit.wait.

1.6 彈性訓練

PyTorch 1.7

此版本對 DDP 和 RPC 進行了一些的改進，也增加了新特性，幾個大特性包括：

[Stable] TorchElastic now bundled into PyTorch docker image

Torchelastic提供了“torch.distributed.launch”CLI的一個嚴格超集，並添加了容錯和彈性功能。如果用戶對容錯不感興趣，他們可以通過設置“max_restarts=0”來獲得准確的功能/行為，並增加自動分配“RANK”和“MASTER_ADDR”端口的便利性（而不是在“torch.distributed.launch”中手動指定）。

通過將“torchelastic”與PyTorch捆綁在同一docker映像中，用戶可以立即開始試用torchelastic，而無需單獨安裝“torchelastic”。除了方便之外，在現有Kubeflow的分布式PyTorch操作符中添加對彈性參數的支持也是一個很好的選擇。

[Stable] TorchElastic now bundled into PyTorch docker image

Torchelastic offers a strict superset of the current torch.distributed.launch CLI with the added features for fault-tolerance and elasticity. If the user is not be interested in fault-tolerance, they can get the exact functionality/behavior parity by setting max_restarts=0 with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specified in torch.distributed.launch).

By bundling torchelastic in the same docker image as PyTorch, users can start experimenting with torchelastic right-away without having to separately install torchelastic. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators.

[Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7引入了一個新的上下文管理器，與使用“torch.nn.parallel.DistributedDataParallel”進行訓練的模型結合使用，以支持使用跨不同進程的大小不均勻的數據集進行訓練。此功能在使用DDP時提供了更大的靈活性，並防止用戶必須手動確保不同進程中的數據集大小相同。使用此上下文管理器，DDP將自動處理不均勻的數據集大小，這可以防止在訓練結束時出現錯誤或掛起。

[Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to enable training with uneven dataset size across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different process. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.

其他特性包括：

[Beta] NCCL Reliability - Async Error/Timeout Handling
[Beta] TorchScript remote and rpc_sync
[Beta] Distributed optimizer with TorchScript support
[Beta] Enhancements to RPC-based Profiling
[Prototype] Windows support for Distributed Training

1.7 流水線訓練

PyTorch 1.8

此版本加入了一些重大改進，比如：提高NCCL可靠性；流水線並行支撐；RPC profiling；並支持添加梯度壓縮的通信hook。

其中流水線並行是把 fairscale.nn.Pipe引入進來，其實就是 torchgpipe。

Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.

Upstream fairscale.nn.Pipe into PyTorch as torch.distributed.pipeline (#44090)

PyTorch 1.9

主要是

Distributed / TorchElastic 的一些bug fix。
RPC 的重大改進以支持大規模GPU分布式訓練。
在PyTorch Profiler中支持分布式培訓、GPU利用率和SM效率。

研究完歷史之后，我們再看看分布式概述。

0x02 分布式概述

以下主要是基於https://pytorch.org/tutorials/beginner/dist_overview.html 官方文檔為基礎，加上自己的理解。

2.1 引論

2.1.1 torch.distributed 包

PyTorch 中的 torch.distributed包對於多進程並行提供了通信原語，使得這些進程可以在一個或多個計算機上運行的幾個計算節點之間進行通訊。 torch.distributed包的並行方式與multiprocessing ( torch.multiprocessing) 包不同，torch.distributed包支持多個通過網絡連接的機器，並且用戶必須為每個進程顯式啟動主訓練腳本的單獨副本。

在單機且同步模型的情況下，torch.distributed或着 torch.nn.parallel.DistributedDataParallel()包裝器可能仍然比其他數據並行方法（比如torch.nn.DataParallel）具有優勢：

每個進程維護自己的優化器，並在每次迭代中執行一個完整的優化步驟。雖然這可能看起來是多余的，但由於梯度已經聚集在一起並且是跨進程平均，因此對於每個進程都是相同的，這意味着不需要參數廣播步驟，減少了在節點之間傳輸張量所花費的時間。
每個進程都包含一個獨立的 Python 解釋器，消除了額外的解釋器開銷和“GIL 顛簸”，這些開銷來自單個 Python 進程驅動多個執行線程，多個模型副本或多個GPU 的開銷。這對於嚴重依賴 Python 運行時的模型尤其重要，這樣的模型通常具有遞歸層或許多小組件。

從 PyTorch v1.6.0 開始，功能torch.distributed可以分為三個主要組件：

分布式數據並行訓練 (DDP) 是一種廣泛采用的單程序多數據訓練范式。使用 DDP，模型會在每個進程上復制，並且每個模型副本都將被提供一組不同的輸入數據樣本。DDP 負責梯度通信以保持模型副本同步並將其與梯度計算重疊以加速訓練。
基於 RPC 的分布式訓練 (RPC) 旨在支持無法適應數據並行訓練的通用訓練結構，例如分布式管道並行、參數服務器范式以及 DDP 與其他訓練范式的組合。它有助於管理遠程對象生命周期並將 autograd 引擎擴展到機器邊界之外。
集體通信 (c10d) 庫支持跨組內的進程發送張量。它提供集體通信 API（例如all_reduce 和 all_gather）和 P2P 通信 API（例如send 和 isend）。DDP 和 RPC（進程組后端）) 在 v1.6.0 的 c10d 上構建，其中前者使用集體通信，后者使用 P2P 通信。通常，開發者不需要直接使用這個原始通信 API，因為上面的 DDP 和 RPC 特性可以服務於許多分布式訓練場景。但是，在某些用例中，此 API 仍然有用。一個例子是分布式參數平均，應用程序希望在反向傳遞后計算所有模型參數的平均值，而不是使用 DDP 來傳達梯度。這可以將通信與計算分離，並允許對通信內容進行更細粒度的控制，但另一方面，它也放棄了 DDP 提供的性能優化。在用PyTorch編寫分布式應用程序顯示了使用 c10d 通信 API 的示例。
- 通信方式：torch.distributed 的底層通信主要使用 Collective Communication (c10d) library 來支持跨組內的進程發送張量，並主要支持兩種類型的通信 API：
  - collective communication APIs: Distributed Data-Parallel Training (DDP)
  - P2P communication APIs: RPC-Based Distributed Training (RPC)
這兩種通信 API 在 PyTorch 中分別對應了兩種分布式訓練方式：Distributed Data-Parallel Training (DDP) 和 RPC-Based Distributed Training (RPC)。

大多數現有文檔是為 DDP 或 RPC 編寫的，本文的其余部分將詳細說明這兩個組件的材料。

2.1.2 知識鏈接

PyTorch的multiprocessing模塊封裝了python原生的multiprocessing模塊，在API上百分之百的兼容，它也注冊了定制的reducers, 可以使用IPC機制（共享內存）來讓不同的進程對同一份數據進行讀寫。但是其工作方式在CUDA上有很多弱點，比如必須規定各種進程的生命周期如何如何，導致CUDA上的multiprocessing經常行為超出預期。

2.2 數據並行訓練

在官方文檔中，可以了解到，在掌握 torch.distributed 的基礎的前提下，我們可以根據自身機器和任務的具體情況使用不同的分布式或並行訓練方式。PyTorch 為數據並行訓練提供了多種選項。一般來說，應用會從簡單到復雜，從原型到量產。這些應用共同的發展軌跡是：

如果數據和模型可以放在一個 GPU 中，並且不關心訓練速度，就使用單設備（single-device）訓練。
如果服務器上有多個 GPU，並且您希望以最少的代碼更改來加速訓練，那么可以使用單機多 GPU DataParallel。
如果您想進一步加快訓練速度並願意編寫更多代碼來設置它，可以使用單機多 GPU DistributedDataParallel。
如果應用程序需要跨機器邊界進行擴展，請使用多機 DistributedDataParallel 和啟動腳本。
如果預期會出現錯誤（例如，OOM）或者資源可以在訓練期間動態加入和離開，則使用torchelastic啟動分布式訓練。

2.3 `torch.nn.DataParallel`

DataParallel 包使用最低代碼量就可以利用單機多GPU達到並行性。它只需要對應用程序代碼進行一行更改。教程 Optional: Data Parallelism 展示了一個示例。需要注意的是，雖然DataParallel非常易於使用，但通常不能提供最佳性能。這是因為DataParallel的實現在每個前向傳遞中都會復制模型，並且其單進程多線程並行性會受到 GIL 競爭的影響。為了獲得更好的性能，請考慮使用 DistributedDataParallel。

2.4 `torch.nn.parallel.DistributedDataParallel`

與DataParallel相比， DistributedDataParallel 需要多一步設置，即調用 init_process_group。DDP 使用多進程並行，因此模型副本之間不存在 GIL 競爭。此外，模型在 DDP 構建時廣播，而不是在每次前向傳播時廣播，這也有助於加快訓練速度。DDP 附帶了多種性能優化技術。如需更深入的解釋，請參閱這篇 DDP 論文(VLDB'20)。

DDP材料如下：

DDP 筆記提供了一個入門示例及其設計和實現的一些簡要說明。如果這是您第一次使用 DDP，請從本文檔開始。
Getting Started with Distributed Data Parallel 解釋了 DDP 訓練的一些常見問題，包括不平衡的工作負載、檢查點和多設備模型。請注意，DDP 可以輕松地與單機模型並行最佳實踐教程中描述的單機多設備模型並行性相結合。
在啟動並配置分布式數據並行應用程序文件顯示如何使用DDP啟動腳本。
該 Shard Optimizer States With ZeroRedundancyOptimizer 配方演示了ZeroRedundancyOptimizer 如何有助於減少優化內存占用分布式數據並行訓練。

2.5 TorchElastic

隨着應用程序復雜性和規模的增長，故障恢復成為一項迫切的要求。

有時，在使用 DDP 時不可避免地會遇到 OOM 之類的錯誤，但 DDP 本身無法從這些錯誤中恢復，基本try-except塊也無法工作。這是因為 DDP 要求所有進程以緊密同步的方式運行，並且在不同進程中啟動的所有AllReduce通信必須匹配。

如果組中的某個進程拋出 OOM 異常，則很可能會導致不同步（不匹配的 AllReduce操作），從而導致崩潰或掛起。如果您預計訓練期間會發生故障，或者資源可能會動態離開和加入，請使用torchelastic啟動分布式數據並行訓練。

2.6 通用分布式訓練

許多訓練范式不適合數據並行，例如參數服務器范式，分布式管道並行，具有多個觀察者或代理的強化學習應用等。 torch.distributed.rpc目標是支持通用分布式訓練場景。

torch.distributed.rpc包有四大支柱：

RPC支持在遠程工作者上運行給定的函數。
RRef有助於管理遠程對象的生命周期。RRef 注釋中介紹了引用計數協議。
分布式 Autograd 將 autograd 引擎擴展到機器邊界之外。有關更多詳細信息，請參閱分布式 Autograd 設計。
分布式優化器可以自動聯系所有參與的workers，以使用分布式 autograd 引擎計算的梯度來更新參數。

RPC 教程如下（后續會選擇部分文章進行分析）：

在開始使用分布式RPC框架教程首先使用一個簡單的強化學習（RL）為例來說明RPC和器RRef。然后，它將基本的分布式模型並行應用於 RNN 示例，以展示如何使用分布式 autograd 和分布式優化器。
在使用分布式RPC框架實現一個參數服務器教程借用 HogWild！訓練的精華，將其應用於異步參數服務器 (PS) 訓練應用程序。
使用 RPC的分布式管道並行教程將單機管道並行示例（在單機模型並行最佳實踐中介紹）擴展到分布式環境，並展示了如何使用 RPC 實現它。
使用異步執行來實施批量RPC處理教程演示如何使用 @ rpc.functions.async_execution 裝飾器來實現RPC批處理，它可以幫助加快推理和培訓。它使用了上述教程 1 和 2 中使用的類似 RL 和 PS 示例。
將分布式RPC框架相與分布式數據並行結合教程演示了如何將DDP與RPC結合起來，這樣可以將分布式數據並行與分布式模型並行相結合訓練模型。