A Brief Overview of the NVIDIA GPU Turing Architecture

This article is excerpted from the official Turing whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

SM

  1. The Turing Streaming Multiprocessor (SM) is based on the same major architecture (compute capability 7.x) as Volta.

The Turing Streaming Multiprocessor (SM) is based on the same major architecture (7.x) as Volta, and provides similar improvements over Pascal.

  2. Each SM contains 4 warp scheduler units; each scheduler handles a static set of warps and issues to its own dedicated set of arithmetic instruction units.

The Turing SM includes 4 warp-scheduler units. Each scheduler handles a static set of warps and issues to a dedicated set of arithmetic instruction units.

  3. Instructions are performed over two cycles, and the schedulers can issue independent instructions every cycle. The dependent-issue latency of core math instructions such as FMA is four clock cycles, the same as Volta and shorter than Pascal's six cycles.

Instructions are performed over two cycles, and the schedulers can issue independent instructions every cycle. Dependent instruction issue latency for core FMA math operations is four clock cycles, like Volta, compared to six cycles on Pascal.

  4. Like Volta, Turing provides 64 FP32 cores, 64 INT32 cores, and 8 improved mixed-precision Tensor Cores per SM, which allows FP32 and INT32 operations to execute simultaneously (see the sketch below). Turing, however, has only 2 FP64 cores per SM, so its double-precision throughput is low.

Like Volta, the Turing SM provides 64 FP32 cores, 64 INT32 cores and 8 improved mixed-precision Tensor Cores. This enables simultaneous execution of FP32 and INT32 operations. Turing has a lower double precision throughput than Volta with only 2 FP64 cores.
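
To make the benefit of concurrent FP32/INT32 execution concrete, here is a minimal sketch (a hypothetical SAXPY-style kernel, not taken from the whitepaper) in which the integer index arithmetic of a grid-stride loop can overlap with the floating-point FMAs:

    // Hypothetical grid-stride SAXPY: index/stride updates run on the INT32
    // cores while fmaf() runs on the FP32 cores, so on Volta/Turing the two
    // instruction streams can be issued concurrently.
    __global__ void saxpy_grid_stride(int n, float a, const float* x, float* y)
    {
        int i      = blockIdx.x * blockDim.x + threadIdx.x;   // INT32 work
        int stride = gridDim.x * blockDim.x;                   // INT32 work
        for (; i < n; i += stride) {
            y[i] = fmaf(a, x[i], y[i]);                        // FP32 FMA work
        }
    }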

  5. Turing supports Independent Thread Scheduling (introduced with Volta) and therefore supports intra-warp thread synchronization via __syncwarp().

The Turing architecture features the same Independent Thread Scheduling introduced with Volta. This enables intra-warp synchronization patterns previously unavailable and simplifies code changes when porting CPU code.

  6. Note the following points when programming for the Turing architecture (a code sketch follows these notes):
  • Replace the legacy warp intrinsics (__shfl*, __any, __all, and __ballot) with their _sync-suffixed counterparts.

To avoid data corruption, applications using warp intrinsics (__shfl*, __any, __all, and __ballot) should transition to the new, safe, synchronizing counterparts, with the *_sync suffix.

  • Insert __syncwarp() at every point where warp-wide synchronization is required.

Applications that assume reads and writes are implicitly visible to other threads in the same warp need to insert the new __syncwarp() warp-wide barrier synchronization instruction between steps where data is exchanged between threads via global or shared memory.

  • When using __syncthreads(), make sure that all non-exited threads in the thread block can reach it.

Applications using __syncthreads() or the PTX bar.sync (and their derivatives) in such a way that a barrier will not be reached by some non-exited thread in the thread block must be modified to ensure that all non-exited threads reach the barrier.
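
A minimal sketch of the first two points (assumed helper functions, not code from the whitepaper): a warp-level reduction that passes an explicit member mask to __shfl_down_sync, and a shared-memory variant that uses __syncwarp() between the exchange steps:

    // Warp-wide sum using the _sync shuffle intrinsic; every lane named in the
    // full mask must participate in each call.
    __device__ float warp_reduce_shfl(float val)
    {
        const unsigned mask = 0xffffffffu;               // all 32 lanes
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);
        return val;                                       // lane 0 holds the sum
    }

    // Same reduction through shared memory: __syncwarp() makes each round of
    // writes visible to the other lanes before they are read back.
    __device__ float warp_reduce_smem(float val, float* buf /* 32 floats per warp */)
    {
        const int lane = threadIdx.x % 32;
        buf[lane] = val;
        __syncwarp();
        for (int offset = 16; offset > 0; offset /= 2) {
            if (lane < offset) buf[lane] += buf[lane + offset];
            __syncwarp();
        }
        return buf[0];
    }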

  7. Turing supports at most 32 concurrently resident warps per SM, compared with 64 on Volta.

The maximum number of concurrent warps per SM is 32 on Turing (versus 64 on Volta).

  8. As on Pascal and Volta, Turing has a register file of 64K 32-bit registers per SM, and each thread can use at most 255 registers. Turing supports at most 16 resident thread blocks per SM, and the shared memory capacity per SM is 64 KB. These limits can be queried at run time, as sketched below.

The register file size is 64k 32-bit registers per SM.
The maximum registers per thread is 255.
The maximum number of thread blocks per SM is 16.
Shared memory capacity per SM is 64KB.
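
A hedged sketch of reading these per-SM limits back at run time with cudaGetDeviceProperties (standard cudaDeviceProp fields; device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // query device 0
        printf("compute capability      : %d.%d\n", prop.major, prop.minor);
        printf("32-bit registers per SM : %d\n", prop.regsPerMultiprocessor);
        printf("max resident warps / SM : %d\n", prop.maxThreadsPerMultiProcessor / 32);
        printf("shared memory per SM    : %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }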

Tensor Cores

  1. Volta introduced Tensor Cores to accelerate matrix-multiply operations on mixed-precision floating-point data, and Turing goes further by adding acceleration for integer matrix multiplies. Tensor Cores reach peak efficiency when each warp processes a small matrix fragment; in practice they are used to build much larger matrix operations out of these small fragments.

Volta introduced Tensor Cores to accelerate matrix multiply operations on mixed precision floating point data. Turing adds acceleration for integer matrix multiply operations. In practice, Tensor Cores are used to perform much larger 2D or higher dimensional matrix operations, built up from these smaller matrix fragments.

  2. Each Tensor Core performs the matrix multiply-accumulate D = A * B + C. Tensor Cores support half-precision (FP16) matrix multiplication: the input matrices A and B are FP16, while the accumulators C and D may be either FP16 or FP32. CUDA 10 supports FP16 fragment sizes of 16x16x16, 32x8x16, and 8x32x16 on Volta and Turing (see the WMMA sketch below).

Each Tensor Core performs the matrix multiply-accumulate: D = A x B + C. The Tensor Cores support half precision matrix multiplication, where the matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be either FP16 or FP32 matrices. When accumulating in FP32, the FP16 multiply results in a full precision product that is then accumulated using FP32 addition. CUDA 10 supports several fragment sizes, 16x16x16, 32x8x16, and 8x32x16 to use the Tensor Cores on Volta or Turing with FP16 inputs.
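
A hedged sketch of driving the Tensor Cores through the CUDA WMMA API with the 16x16x16 FP16 fragment size (one warp computes one 16x16 tile; the kernel name and matrix layouts are illustrative assumptions):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // D = A * B + C for one 16x16 tile: A and B are FP16, C/D accumulate in FP32.
    __global__ void wmma_fp16_16x16x16(const half* A, const half* B,
                                       const float* C, float* D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::load_matrix_sync(a_frag, A, 16);              // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);

        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);     // Tensor Core MMA

        wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
    }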

  3. Turing supports integer matrix multiplies with 8-bit, 4-bit, and 1-bit integer inputs and 32-bit integer accumulation. For 8-bit inputs, CUDA exposes the fragment sizes 16x16x16, 32x8x16, and 8x32x16; for 4-bit inputs the fragment size is 8x8x32, and for 1-bit inputs it is 8x8x128 (an 8-bit sketch follows below).

Turing's Tensor Core supports integer matrix multiply operations, which can operate on 8-bit, 4-bit and 1-bit integer inputs, with 32-bit integer accumulation. When operating on 8-bit inputs, CUDA exposes fragment sizes of 16x16x16, 32x8x16, and 8x32x16. For sub-byte operations the fragment sizes available are 8x8x32 for 4-bit inputs, or 8x8x128 for 1-bit inputs.
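
The 8-bit integer path uses the same WMMA API with signed char inputs and INT32 accumulation. A hedged sketch follows (requires compute capability 7.2 or higher; the 4-bit and 1-bit types live in the experimental WMMA namespace and are not shown):

    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a 16x16 INT32 tile from INT8 inputs, starting from C = 0.
    __global__ void wmma_s8_16x16x16(const signed char* A, const signed char* B, int* D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc_frag;

        wmma::fill_fragment(acc_frag, 0);                   // C = 0
        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }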

  4. Turing has a 96 KB unified L1 / shared memory cache, which can be configured as 64 KB shared + 32 KB L1 or 32 KB shared + 64 KB L1.

The total size of the unified L1 / Shared Memory cache in Turing is 96 KB.

  5. Turing allows a single thread block to address the full 64 KB of shared memory, but static shared memory allocations remain limited to 48 KB; allocations larger than 48 KB must be dynamic and require an explicit opt-in (see the sketch below).

Turing allows a single thread block to address the full 64 KB of shared memory. To maintain architectural compatibility, static shared memory allocations remain limited to 48 KB, and an explicit opt-in is also required to enable dynamic allocations above this limit.
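
A hedged sketch of that opt-in with the CUDA runtime API (the kernel name and the 64 KB request are illustrative; the carveout hint is optional):

    #include <cuda_runtime.h>

    extern __shared__ float smem[];                   // dynamically sized shared memory

    __global__ void big_smem_kernel()
    {
        smem[threadIdx.x] = (float)threadIdx.x;       // placeholder use of smem
    }

    int main()
    {
        // Opt in to dynamic shared memory allocations above the default 48 KB.
        cudaFuncSetAttribute(big_smem_kernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             64 * 1024);

        // Optionally prefer the maximum shared-memory carveout of the 96 KB cache.
        cudaFuncSetAttribute(big_smem_kernel,
                             cudaFuncAttributePreferredSharedMemoryCarveout,
                             cudaSharedmemCarveoutMaxShared);

        big_smem_kernel<<<1, 256, 64 * 1024>>>();     // 64 KB dynamic shared memory
        return (int)cudaDeviceSynchronize();
    }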

Overall GPU Architecture

  • The Turing architecture family includes the TU102, TU104, and TU106 GPUs. Taking TU102 as an example, it contains 6 Graphics Processing Clusters (GPCs), 36 Texture Processing Clusters (TPCs), and 72 Streaming Multiprocessors (SMs). Each GPC includes a dedicated raster engine and 6 TPCs; each TPC contains 2 SMs; and each SM contains 64 CUDA cores, 8 Tensor Cores, a 256 KB register file, 4 texture units, and 96 KB of configurable L1/shared memory. Turing also provides ray-tracing acceleration through one new RT Core per SM.
  • A full TU102 GPU includes 4608 CUDA cores, 72 RT Cores, 576 Tensor Cores, 288 texture units, and 12 32-bit GDDR6 memory controllers (384-bit in total). Each memory controller is attached to 8 ROP units and 512 KB of L2 cache, so the full TU102 GPU contains 96 ROP units and 6144 KB of L2 cache.
  • Turing is the first architecture to use GDDR6 DRAM, which reaches speeds of 14 Gbps.

SM Architecture

  • Turing introduces a new SM architecture. Each TPC contains 2 SMs; each SM contains 2 FP64 units (144 in total), 64 FP32 cores, and 64 INT32 cores, which lets Turing execute FP32 and INT32 operations in parallel. Each SM also contains 8 mixed-precision Tensor Cores.
  • The Turing SM is partitioned into four processing blocks, each with 16 FP32 cores, 16 INT32 cores, 2 Tensor Cores, one warp scheduler, and one dispatch unit. Each processing block also includes a new L0 instruction cache and a 64 KB register file. The four processing blocks share a configurable 96 KB L1 data cache / shared memory region.
  • Turing also reworks the core execution path. In contrast to the single execution path of earlier generations, Turing adds a second parallel execution unit next to each CUDA core, so integer operations can execute in parallel with floating-point arithmetic instructions such as FADD and FMAD.
  • Each Tensor Core in the SM can execute up to 64 FP16-input fused multiply-add (FMA) operations per clock, so the 8 Tensor Cores in an SM can perform a total of 512 FP16 multiply-adds per clock, or 1024 individual floating-point operations per clock. Turing's new INT8 precision mode doubles this to 2048 integer operations per clock. The Tesla T4, the first GPU to use the Turing architecture, contains 2560 CUDA cores and 320 Tensor Cores and delivers 130 TOPS (tera operations per second) of INT8 performance and up to 260 TOPS of INT4 performance.
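
As a rough sanity check on the T4 figures (assuming the publicly listed boost clock of roughly 1.59 GHz, which this article does not state): 2560 CUDA cores / 64 cores per SM gives 40 SMs; 40 SMs x 2048 INT8 operations per clock is about 82K operations per clock, which at ~1.59 GHz works out to roughly 130 TOPS.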

Appendix 1: Turing T4 specifications

[Figure: T4 configuration data]

Appendix 2: Turing T4 vs Volta V100 vs Pascal P40 comparison

[Figure: T4 vs V100 vs P40 comparison, from Yang Jun's GTC 2019 talk on general inference optimization tools]

