Collective Communication Behavior Analysis - Based on NCCL


姚偉峰

Phases

from
Phase 1 - Bootstrap Phase: Initiate all nodes and then all ranks in a collective. It makes sure all ranks know about all other ranks, so any rank is able to communicate with any other rank.
Phase 2 - Topology Phase: Each node detects and maps out what hardware is located on the machine. Hardware includes CPUs, GPUs, NICs and interconnect types. Each node then creates an intra-machine graph, connects hardware with PCIe or NVLink interconnect, and evaluates the graph. When the intra-machine topology is decided, the system will decide what pattern to use for the whole system. The two main patterns are a tree or a ring. While the topology is evaluated, NCCL is also tuning it by performing tests. This allows each rank to pre-compute thresholds for message sizes.
Phase 3 - Collective Phase: A user can dispatch many collective operations using the same topology.

NCCL

Topology Phase

Build Physical Topology (i.e. System Topology)

Build an adjacency list between ranks.

Transport Types

#define TRANSPORT_P2P 0
#define TRANSPORT_SHM 1
#define TRANSPORT_NET 2
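
As a rough illustration of how an adjacency list between ranks might carry these transport types, here is a minimal Python sketch. The selection rule (P2P for peers on the same host, SHM as the same-host fallback, NET across hosts) and the names `pick_transport`, `build_adjacency`, and `p2p_capable` are assumptions made for this example, not NCCL's actual code; only the three constant values come from the defines above.

```python
from collections import defaultdict

# Values copied from the #define lines above.
TRANSPORT_P2P = 0
TRANSPORT_SHM = 1
TRANSPORT_NET = 2


def pick_transport(host_a, host_b, p2p_capable=True):
    """Hypothetical rule: same host -> P2P (or SHM if P2P is unavailable), else NET."""
    if host_a == host_b:
        return TRANSPORT_P2P if p2p_capable else TRANSPORT_SHM
    return TRANSPORT_NET


def build_adjacency(rank_to_host):
    """Return {rank: [(peer, transport), ...]} covering every ordered pair of ranks."""
    adj = defaultdict(list)
    for a, host_a in rank_to_host.items():
        for b, host_b in rank_to_host.items():
            if a != b:
                adj[a].append((b, pick_transport(host_a, host_b)))
    return dict(adj)


# Toy layout (assumed): ranks 0-1 on one host, ranks 2-3 on another.
rank_to_host = {0: "10.0.2.12", 1: "10.0.2.12", 2: "10.0.2.11", 3: "10.0.2.11"}
adjacency = build_adjacency(rank_to_host)
print(adjacency[0])  # [(1, 0), (2, 2), (3, 2)]
```

Rank 0 reaches rank 1 over P2P because they share a host, and reaches ranks 2 and 3 over NET.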

Build Logical Topology (i.e. Graph Topology)

Take 2 machines with 16 GPUs in total, running NCCL 2.8.4, as an example.

NCCL builds both tree and ring graphs.

Tree Logical Topology
log

10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12
10.0.2.11: 2be7fa6883db:57977:58920 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13
10.0.2.11: 2be7fa6883db:57978:58913 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14
10.0.2.11: 2be7fa6883db:57975:58907 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11
10.0.2.11: 2be7fa6883db:57974:58908 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/0/-1->8->-1
10.0.2.11: 2be7fa6883db:57973:58909 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->9
10.0.2.11: 2be7fa6883db:57972:58904 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->8
10.0.2.12: 94f182076445:82266:83142 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
10.0.2.12: 94f182076445:82263:83145 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
10.0.2.12: 94f182076445:82262:83144 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
10.0.2.12: 94f182076445:82267:83151 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
10.0.2.12: 94f182076445:82265:83143 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8
10.0.2.12: 94f182076445:82268:83149 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
10.0.2.12: 94f182076445:82264:83150 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
10.0.2.12: 94f182076445:82263:83145 [2] NCCL INFO Channel 00 : 2[1d000] -> 1[1b000] via P2P/IPC
10.0.2.12: 94f182076445:82265:83143 [4] NCCL INFO Channel 00 : 4[3d000] -> 3[1e000] via P2P/IPC
10.0.2.12: 94f182076445:82264:83150 [3] NCCL INFO Channel 00 : 3[1e000] -> 2[1d000] via P2P/IPC
10.0.2.12: 94f182076445:82262:83144 [1] NCCL INFO Channel 00 : 1[1b000] -> 0[1a000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Channel 00 : 13[3e000] -> 12[3d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57973:58909 [2] NCCL INFO Channel 00 : 10[1d000] -> 9[1b000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57975:58907 [4] NCCL INFO Channel 00 : 12[3d000] -> 11[1e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57972:58904 [1] NCCL INFO Channel 00 : 9[1b000] -> 8[1a000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57974:58908 [3] NCCL INFO Channel 00 : 11[1e000] -> 10[1d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57977:58920 [6] NCCL INFO Channel 00 : 14[40000] -> 13[3e000] via P2P/IPC
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00 : 8[1a000] -> 0[1a000] [receive] via NET/Socket/0
10.0.2.11: 2be7fa6883db:57978:58913 [7] NCCL INFO Channel 00 : 15[41000] -> 14[40000] via P2P/IPC
10.0.2.12: 94f182076445:82267:83151 [6] NCCL INFO Channel 00 : 6[40000] -> 5[3e000] via P2P/IPC
10.0.2.12: 94f182076445:82266:83142 [5] NCCL INFO Channel 00 : 5[3e000] -> 4[3d000] via P2P/IPC
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00 : 0[1a000] -> 8[1a000] [send] via NET/Socket/0
10.0.2.12: 94f182076445:82268:83149 [7] NCCL INFO Channel 00 : 7[41000] -> 6[40000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 00 : 0[1a000] -> 8[1a000] [receive] via NET/Socket/0
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 00 : 8[1a000] -> 0[1a000] [send] via NET/Socket/0
10.0.2.12: 94f182076445:82263:83145 [2] NCCL INFO Channel 01 : 2[1d000] -> 1[1b000] via P2P/IPC
10.0.2.12: 94f182076445:82265:83143 [4] NCCL INFO Channel 01 : 4[3d000] -> 3[1e000] via P2P/IPC
10.0.2.12: 94f182076445:82264:83150 [3] NCCL INFO Channel 01 : 3[1e000] -> 2[1d000] via P2P/IPC
10.0.2.12: 94f182076445:82262:83144 [1] NCCL INFO Channel 01 : 1[1b000] -> 0[1a000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Channel 01 : 13[3e000] -> 12[3d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57973:58909 [2] NCCL INFO Channel 01 : 10[1d000] -> 9[1b000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57975:58907 [4] NCCL INFO Channel 01 : 12[3d000] -> 11[1e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57972:58904 [1] NCCL INFO Channel 01 : 9[1b000] -> 8[1a000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57974:58908 [3] NCCL INFO Channel 01 : 11[1e000] -> 10[1d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57977:58920 [6] NCCL INFO Channel 01 : 14[40000] -> 13[3e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57978:58913 [7] NCCL INFO Channel 01 : 15[41000] -> 14[40000] via P2P/IPC
10.0.2.12: 94f182076445:82267:83151 [6] NCCL INFO Channel 01 : 6[40000] -> 5[3e000] via P2P/IPC
10.0.2.12: 94f182076445:82266:83142 [5] NCCL INFO Channel 01 : 5[3e000] -> 4[3d000] via P2P/IPC
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 01 : 8[1a000] -> 0[1a000] [receive] via NET/Socket/0
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 01 : 0[1a000] -> 8[1a000] [send] via NET/Socket/0
10.0.2.12: 94f182076445:82268:83149 [7] NCCL INFO Channel 01 : 7[41000] -> 6[40000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 01 : 0[1a000] -> 8[1a000] [receive] via NET/Socket/0
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 01 : 8[1a000] -> 0[1a000] [send] via NET/Socket/0
Analysis
  • Topology log format

    IP: hostname:pid:tid [cudaDev] NCCL INFO Trees [channel ID] down0 rank/down1 rank/down2 rank->current rank->up rank
    For example, the log line:
    10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12
    can be read as follows: device 5 on 10.0.2.11 has rank 13 and belongs to two trees, one on channel 0 and one on channel 1. On channel 0 its only child is rank 14 and its parent is rank 12 (-1 marks an empty slot); channel 1 is the same.

  • Channel log format

    IP: hostname:pid:tid [cudaDev] NCCL INFO Channel [channel ID] current rank[bus ID]->successor rank[bus ID] via transport type
    For example, the log line (this particular line appears in the ring log below):
    10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Channel 00 : 13[3e000] -> 14[40000] via P2P/IPC
    can be read as follows: device 5 on 10.0.2.11 (rank 13, bus ID 3e000) connects on channel 0 to rank 14, using the P2P/IPC transport.
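
The two formats above can be parsed mechanically. The following is a minimal sketch using regular expressions; the function names and the returned dictionary keys are choices made for this example, and the patterns are only checked against the log lines shown in this article, not against every NCCL version.

```python
import re

# One "[channel] down0/down1/down2->rank->up" group inside a Trees line.
TREES_RE = re.compile(r"\[(\d+)\] (-?\d+)/(-?\d+)/(-?\d+)->(-?\d+)->(-?\d+)")
# A per-connection Channel line, with an optional [send]/[receive] tag.
CHANNEL_RE = re.compile(
    r"Channel (\d+) : (\d+)\[(\w+)\] -> (\d+)\[(\w+)\](?: \[(send|receive)\])? via (\S+)"
)


def parse_trees(line):
    """Return {channel: {"rank", "children", "parent"}} for one Trees line."""
    body = line.split("NCCL INFO Trees", 1)[1]
    trees = {}
    for ch, d0, d1, d2, rank, up in TREES_RE.findall(body):
        children = [int(d) for d in (d0, d1, d2) if int(d) != -1]
        parent = None if int(up) == -1 else int(up)
        trees[int(ch)] = {"rank": int(rank), "children": children, "parent": parent}
    return trees


def parse_channel(line):
    """Return the fields of one Channel connection line, or None if it does not match."""
    m = CHANNEL_RE.search(line)
    if m is None:
        return None
    ch, src, src_bus, dst, dst_bus, direction, transport = m.groups()
    return {
        "channel": int(ch),
        "src": int(src),
        "src_bus": src_bus,
        "dst": int(dst),
        "dst_bus": dst_bus,
        "direction": direction,  # None for intra-node P2P lines
        "transport": transport,
    }


trees_line = ("10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Trees "
              "[0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12")
channel_line = ("10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00 : "
                "8[1a000] -> 0[1a000] [receive] via NET/Socket/0")
print(parse_trees(trees_line)[0])
print(parse_channel(channel_line))
```

Because the pattern requires ` : ` directly after the channel number, ring summary lines such as `Channel 00/02 : 0 1 2 ...` are not matched by `parse_channel` and return `None`.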

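Result

Parsing the log this way yields two trees with identical intra-node chains. They differ only at the top: per the Trees lines for ranks 0 and 8, rank 0 is the root on channel 0 and rank 8 is the root on channel 1. The logical topology is as follows:
[Figure: tree logical topology]
The socket duplex connections are established as follows (one duplex pair counts as one channel):
[Figure: socket duplex connections]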
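
As a cross-check of the tree log, the sketch below rebuilds the parent map of each channel from the Trees lines above and reports the root, the maximum depth, and the edges that cross machines. The helper names and the `rank // 8` host rule are assumptions for this example (the log shows ranks 0-7 on 10.0.2.12 and ranks 8-15 on 10.0.2.11).

```python
def chain_parents(first, last):
    """Parent map for a chain in which every rank's parent is rank - 1."""
    return {r: r - 1 for r in range(first + 1, last + 1)}


# Channel 0 (from the Trees lines): rank 0 is the root, rank 8 hangs under rank 0.
parents_ch0 = {0: None, 8: 0, **chain_parents(0, 7), **chain_parents(8, 15)}
# Channel 1: rank 8 is the root, rank 0 hangs under rank 8.
parents_ch1 = {8: None, 0: 8, **chain_parents(0, 7), **chain_parents(8, 15)}


def root_of(parents):
    """Return the single rank that has no parent."""
    roots = [r for r, p in parents.items() if p is None]
    assert len(roots) == 1, "a tree must have exactly one root"
    return roots[0]


def depth_of(parents, rank):
    """Number of hops from rank up to the root."""
    depth = 0
    while parents[rank] is not None:
        rank = parents[rank]
        depth += 1
    return depth


def node_of(rank):
    """Assumed host index: ranks 0-7 on one machine, ranks 8-15 on the other."""
    return rank // 8


def inter_node_edges(parents):
    """(child, parent) pairs whose endpoints sit on different machines."""
    return sorted((c, p) for c, p in parents.items()
                  if p is not None and node_of(c) != node_of(p))


for name, parents in (("channel 0", parents_ch0), ("channel 1", parents_ch1)):
    max_depth = max(depth_of(parents, r) for r in parents)
    print(name, "root:", root_of(parents), "max depth:", max_depth,
          "inter-node edges:", inter_node_edges(parents))
```

Each tree crosses the network on exactly one edge (between ranks 0 and 8), which matches the NET/Socket lines in the log; every other edge is an intra-node P2P/IPC hop.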

Ring Logical Topology
log

10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
10.0.2.12: 94f182076445:82263:83145 [2] NCCL INFO Channel 00 : 2[1d000] -> 3[1e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57978:58913 [7] NCCL INFO Channel 00 : 15[41000] -> 0[1a000] [send] via NET/Socket/0
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00 : 15[41000] -> 0[1a000] [receive] via NET/Socket/0
10.0.2.12: 94f182076445:82266:83142 [5] NCCL INFO Channel 00 : 5[3e000] -> 6[40000] via P2P/IPC
10.0.2.12: 94f182076445:82265:83143 [4] NCCL INFO Channel 00 : 4[3d000] -> 5[3e000] via P2P/IPC
10.0.2.12: 94f182076445:82262:83144 [1] NCCL INFO Channel 00 : 1[1b000] -> 2[1d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Channel 00 : 13[3e000] -> 14[40000] via P2P/IPC
10.0.2.12: 94f182076445:82264:83150 [3] NCCL INFO Channel 00 : 3[1e000] -> 4[3d000] via P2P/IPC
10.0.2.12: 94f182076445:82267:83151 [6] NCCL INFO Channel 00 : 6[40000] -> 7[41000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57977:58920 [6] NCCL INFO Channel 00 : 14[40000] -> 15[41000] via P2P/IPC
10.0.2.12: 94f182076445:82268:83149 [7] NCCL INFO Channel 00 : 7[41000] -> 8[1a000] [send] via NET/Socket/0
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 00 : 7[41000] -> 8[1a000] [receive] via NET/Socket/0
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1b000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57972:58904 [1] NCCL INFO Channel 00 : 9[1b000] -> 10[1d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57974:58908 [3] NCCL INFO Channel 00 : 11[1e000] -> 12[3d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57975:58907 [4] NCCL INFO Channel 00 : 12[3d000] -> 13[3e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57973:58909 [2] NCCL INFO Channel 00 : 10[1d000] -> 11[1e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 00 : 8[1a000] -> 9[1b000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57974:58908 [3] NCCL INFO Channel 01 : 11[1e000] -> 12[3d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57976:58906 [5] NCCL INFO Channel 01 : 13[3e000] -> 14[40000] via P2P/IPC
10.0.2.12: 94f182076445:82263:83145 [2] NCCL INFO Channel 01 : 2[1d000] -> 3[1e000] via P2P/IPC
10.0.2.12: 94f182076445:82266:83142 [5] NCCL INFO Channel 01 : 5[3e000] -> 6[40000] via P2P/IPC
10.0.2.12: 94f182076445:82265:83143 [4] NCCL INFO Channel 01 : 4[3d000] -> 5[3e000] via P2P/IPC
10.0.2.12: 94f182076445:82262:83144 [1] NCCL INFO Channel 01 : 1[1b000] -> 2[1d000] via P2P/IPC
10.0.2.12: 94f182076445:82264:83150 [3] NCCL INFO Channel 01 : 3[1e000] -> 4[3d000] via P2P/IPC
10.0.2.12: 94f182076445:82267:83151 [6] NCCL INFO Channel 01 : 6[40000] -> 7[41000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57978:58913 [7] NCCL INFO Channel 01 : 15[41000] -> 0[1a000] [send] via NET/Socket/0
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 01 : 15[41000] -> 0[1a000] [receive] via NET/Socket/0
10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1b000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57977:58920 [6] NCCL INFO Channel 01 : 14[40000] -> 15[41000] via P2P/IPC
10.0.2.12: 94f182076445:82268:83149 [7] NCCL INFO Channel 01 : 7[41000] -> 8[1a000] [send] via NET/Socket/0
10.0.2.11: 2be7fa6883db:57972:58904 [1] NCCL INFO Channel 01 : 9[1b000] -> 10[1d000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57975:58907 [4] NCCL INFO Channel 01 : 12[3d000] -> 13[3e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57973:58909 [2] NCCL INFO Channel 01 : 10[1d000] -> 11[1e000] via P2P/IPC
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 01 : 7[41000] -> 8[1a000] [receive] via NET/Socket/0
10.0.2.11: 2be7fa6883db:57971:58905 [0] NCCL INFO Channel 01 : 8[1a000] -> 9[1b000] via P2P/IPC
Analysis
  • Topology log format

    IP: hostname:pid:tid [cudaDev] NCCL INFO Channel ring_ID/ring_number: rank0 rank1 ... last_rank
    For example, the log line:
    10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    can be read as follows: 2 rings were built, and ring 0 consists of ranks 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15, 16 ranks in total.

  • Channel log format
    Same as the channel log format in the tree topology.
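
The ring summary line can be parsed in the same spirit. This is a minimal sketch; `parse_ring` and `successors` are names invented for this example, and the pattern is only checked against the line format shown above.

```python
import re

# "Channel <ring_id>/<ring_count> : r0 r1 ... rN"
RING_RE = re.compile(r"NCCL INFO Channel (\d+)/(\d+) :((?: \d+)+)")


def parse_ring(line):
    """Return ring id, ring count and the ordered rank list, or None if no match."""
    m = RING_RE.search(line)
    if m is None:
        return None
    ring_id, ring_count, ranks = m.groups()
    return {
        "ring_id": int(ring_id),
        "ring_count": int(ring_count),
        "ranks": [int(r) for r in ranks.split()],
    }


def successors(ranks):
    """Map each rank to the next rank in the ring, wrapping around at the end."""
    n = len(ranks)
    return {r: ranks[(i + 1) % n] for i, r in enumerate(ranks)}


line = ("10.0.2.12: 94f182076445:82261:83141 [0] NCCL INFO Channel 00/02 : "
        "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15")
ring = parse_ring(line)
nxt = successors(ring["ranks"])
print(ring["ring_count"], len(ring["ranks"]), nxt[7], nxt[15])  # 2 16 8 0
```

The successor map is what the per-connection Channel lines spell out one hop at a time: rank 7 sends to rank 8, and rank 15 wraps around to rank 0.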

Result

Parsing the log this way yields two identical rings, with the following logical topology:
[Figure: ring logical topology]
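
To see why each ring in the log has exactly two NET/Socket hops, the sketch below walks the ring order and labels each hop by whether both endpoints sit on the same host. The host map is read off the log (ranks 0-7 on 10.0.2.12, ranks 8-15 on 10.0.2.11); the labelling rule and the name `ring_hops` are simplifying assumptions for this example rather than NCCL's transport-selection logic.

```python
from collections import Counter

ring = list(range(16))
host_of = {r: ("10.0.2.12" if r < 8 else "10.0.2.11") for r in ring}


def ring_hops(ring, host_of):
    """Return (src, dst, kind) for every hop, including the wrap-around hop."""
    hops = []
    for i, src in enumerate(ring):
        dst = ring[(i + 1) % len(ring)]
        # Assumed rule: same host -> P2P/IPC, different host -> NET/Socket.
        kind = "P2P/IPC" if host_of[src] == host_of[dst] else "NET/Socket"
        hops.append((src, dst, kind))
    return hops


hops = ring_hops(ring, host_of)
counts = Counter(kind for _, _, kind in hops)
net_hops = [(s, d) for s, d, kind in hops if kind == "NET/Socket"]
print(dict(counts))  # {'P2P/IPC': 14, 'NET/Socket': 2}
print(net_hops)      # [(7, 8), (15, 0)]
```

The two network hops, 7 -> 8 and 15 -> 0, are the same ones that appear with [send] and [receive] tags in the ring log.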

Collective Phase

Users communicate by calling the communication primitives that NCCL supports:

  • Collective communication primitives

    • AllReduce

    • Broadcast

    • Reduce

    • AllGather

    • ReduceScatter

  • Point-to-point communication primitives

    • Send

    • Recv

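Inside getAlgoInfo, NCCL calls ncclTopoGetAlgoTime for each (algorithm, protocol) pair and selects the pair that is predicted to finish the given collective primitive fastest for the given data size. That algorithm and protocol are then used to run the primitive.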
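
The selection loop can be sketched as below. This is not NCCL's code: the cost model (a fixed latency plus bytes divided by bandwidth) is a simplification assumed for this example, every latency and bandwidth number in `MODEL` is invented, and `predicted_time_us` and `select_algo_proto` are hypothetical names. Only the idea of scanning all (algorithm, protocol) pairs and keeping the fastest prediction comes from the text.

```python
import itertools

ALGORITHMS = ("TREE", "RING")
PROTOCOLS = ("LL", "LL128", "SIMPLE")

# Invented (latency in microseconds, bandwidth in GB/s) per pair; NOT measured values.
MODEL = {
    ("TREE", "LL"): (10.0, 2.0),
    ("TREE", "LL128"): (15.0, 8.0),
    ("TREE", "SIMPLE"): (30.0, 10.0),
    ("RING", "LL"): (20.0, 3.0),
    ("RING", "LL128"): (30.0, 9.0),
    ("RING", "SIMPLE"): (60.0, 12.0),
}


def predicted_time_us(algo, proto, nbytes):
    """Assumed model: latency + nbytes / bandwidth (1 GB/s equals 1000 bytes/us)."""
    latency_us, bandwidth_gbps = MODEL[(algo, proto)]
    return latency_us + nbytes / (bandwidth_gbps * 1000.0)


def select_algo_proto(nbytes):
    """Scan every (algorithm, protocol) pair and return the fastest prediction."""
    return min(
        itertools.product(ALGORITHMS, PROTOCOLS),
        key=lambda pair: predicted_time_us(pair[0], pair[1], nbytes),
    )


for size in (1024, 256 * 1024, 64 * 1024 * 1024):
    print(size, select_algo_proto(size))
```

With these made-up numbers, small messages favor the low-latency pair and large messages favor the high-bandwidth pair, which is the kind of message-size threshold behavior the topology phase is described as pre-computing.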

Reference

  1. Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

  2. NCCL GitHub

  3. Massively Scale Your Deep Learning Training with NCCL 2.4

  4. Distributed Deep Neural Network Training: NCCL on SUMMIT

  5. Synthesizing Optimal Collective Algorithms

  6. Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

  7. Fast Multi-GPU communication over PCI Express

  8. Tencent Jizhi Team: The Past and Present of the AllReduce Algorithm (騰訊機智團隊分享–AllReduce算法的前世今生)

  9. Fast Multi-GPU collectives with NCCL

  10. NCCL: Accelerated Multi-GPU Collective Communications

  11. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

  12. BLINK: Fast and Generic Collectives for Distributed ML

  13. MPI Tutorial

  14. Optimizing Communication for Clusters of GPUs

  15. CNCL Design Docs

