A Brief Overview of the NVIDIA GPU Volta Architecture

This article is excerpted from NVIDIA's official Volta architecture whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/sc18-tesla-democratization-tech-overview-r4-web.pdf

SM

  1. The Volta architecture currently has only one implementation: GV100.

Volta architecture comprises a single variant: GV100.

  1. Each Volta SM contains 4 warp schedulers. Each scheduler handles a static set of warps and issues instructions to a dedicated set of arithmetic units.

Each Volta SM includes 4 warp-scheduler units. Each scheduler handles a static set of warps and issues to a dedicated set of arithmetic instruction units.

  1. Instructions are executed over two cycles, and the schedulers can issue independent instructions every cycle. Dependent issue latency for core math instructions such as FMA is four cycles, compared to six cycles on Pascal.

Instructions are performed over two cycles, and the schedulers can issue independent instructions every cycle. Dependent instruction issue latency for core FMA math operations is four clock cycles, compared to six cycles on Pascal.
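A sketch of what this means for code (a hypothetical kernel, not from the whitepaper): a single chain of dependent FMAs pays the four-cycle latency on every step, so keeping several independent accumulators gives the schedulers independent instructions to issue each cycle.

```cuda
__global__ void fma_ilp(const float* __restrict__ x, float* __restrict__ out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // Four independent dependency chains: while one FMA waits out its
    // four-cycle latency, the schedulers can issue from the other chains.
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    for (int i = tid; i < n; i += 4 * stride) {
        a0 = fmaf(x[i], 2.0f, a0);
        if (i + stride     < n) a1 = fmaf(x[i + stride],     2.0f, a1);
        if (i + 2 * stride < n) a2 = fmaf(x[i + 2 * stride], 2.0f, a2);
        if (i + 3 * stride < n) a3 = fmaf(x[i + 3 * stride], 2.0f, a3);
    }
    out[tid] = a0 + a1 + a2 + a3;   // one partial sum per thread
}
```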

  1. Each Volta SM provides 64 FP32 cores, 32 FP64 cores, 64 INT32 cores, and 8 mixed-precision Tensor Cores; GV100 provides up to 84 SMs. Unlike Pascal, Volta includes dedicated FP32 and INT32 cores, which means it can execute FP32 and INT32 operations simultaneously.

Similar to GP100, the GV100 SM provides 64 FP32 cores and 32 FP64 cores. The GV100 SM additionally includes 64 INT32 cores and 8 mixed-precision Tensor Cores. GV100 provides up to 84 SMs. Unlike Pascal GPUs, the GV100 SM includes dedicated FP32 and INT32 cores. This enables simultaneous execution of FP32 and INT32 operations.
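A small sketch of the consequence (hypothetical kernel): in a grid-stride loop, the INT32 units can advance the index while the FP32 units execute the multiply-add, so the two kinds of work no longer compete for the same pipeline.

```cuda
__global__ void saxpy(int n, float a,
                      const float* __restrict__ x, float* __restrict__ y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32: index math...
         i < n;
         i += gridDim.x * blockDim.x)
        y[i] = fmaf(a, x[i], y[i]);                      // ...overlaps FP32 FMA
}
```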

  1. Volta introduces independent thread scheduling for the threads within a warp, which now makes intra-warp thread synchronization via __syncwarp() possible.

The Volta architecture introduces Independent Thread Scheduling among threads in a warp. This feature enables intra-warp synchronization patterns previously unavailable and simplifies code changes when porting CPU code.
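A minimal sketch of why the barrier is needed (hypothetical kernel, assuming a single 32-thread warp per block): with independent thread scheduling, the lanes of a warp no longer execute in lockstep, so data exchanged through shared memory must be guarded by __syncwarp().

```cuda
__global__ void rotate_within_warp(const int* in, int* out)
{
    __shared__ int buf[32];
    unsigned lane = threadIdx.x & 31;   // lane index within the warp
    buf[lane] = in[lane];
    __syncwarp();                       // make the writes visible warp-wide
    out[lane] = buf[(lane + 1) & 31];   // read a neighbour lane's value
}
```

Launched as rotate_within_warp<<<1, 32>>>(d_in, d_out); without the __syncwarp(), a lane could read buf before its neighbour has written it.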

  1. Note the following when programming for the Volta architecture (a combined sketch follows this list):
  • Replace the legacy warp intrinsics (__shfl*, __any, __all, and __ballot) with their _sync-suffixed counterparts

To avoid data corruption, applications using warp intrinsics (__shfl*, __any, __all, and __ballot) should transition to the new, safe, synchronizing counterparts, with the *_sync suffix.

  • Insert __syncwarp() wherever warp-wide synchronization is required

Applications that assume reads and writes are implicitly visible to other threads in the same warp need to insert the new __syncwarp() warp-wide barrier synchronization instruction between steps where data is exchanged between threads via global or shared memory.

  • When using __syncthreads(), ensure that every non-exited thread in the thread block can reach it

Applications using __syncthreads() or the PTX bar.sync (and their derivatives) in such a way that a barrier will not be reached by some non-exited thread in the thread block must be modified to ensure that all non-exited threads reach the barrier.
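As a combined illustration of the first two points, a minimal warp-sum sketch (hypothetical helper, assuming all 32 lanes participate): the legacy __shfl_down becomes __shfl_down_sync with an explicit participation mask.

```cuda
__inline__ __device__ float warp_reduce_sum(float val)
{
    const unsigned FULL_MASK = 0xffffffffu;  // all 32 lanes take part
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(FULL_MASK, val, offset);  // was: __shfl_down(val, offset)
    return val;  // lane 0 ends up holding the warp-wide sum
}
```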

  1. As on Pascal, Volta supports up to 64 concurrently resident warps per SM.

The maximum number of concurrent warps per SM remains the same as in Pascal (i.e., 64).

  1. As on Pascal, each Volta SM has 64K 32-bit registers, each thread can use at most 255 registers, up to 32 thread blocks can be resident per SM, and shared memory capacity is 96 KB per SM.

The register file size is 64k 32-bit registers per SM.
The maximum registers per thread is 255.
The maximum number of thread blocks per SM is 32.
Shared memory capacity per SM is 96KB, similar to GP104, and a 50% increase compared to GP100.
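These per-SM limits (64 warps, 64K registers, 32 blocks, 96 KB shared memory) jointly bound occupancy. A minimal sketch using the CUDA occupancy API to see what they allow for a given kernel (my_kernel is a hypothetical placeholder):

```cuda
#include <cstdio>

__global__ void my_kernel() {}   // placeholder kernel

int main()
{
    int blocks_per_sm = 0;
    // Ask the runtime how many 256-thread blocks of my_kernel can be
    // resident on one SM, given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d (up to %d warps)\n",
           blocks_per_sm, blocks_per_sm * 256 / 32);
    return 0;
}
```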

Tensor Cores

  1. Each Tensor Core performs a matrix multiply-accumulate, D = AxB + C, where A, B, C, and D are 4x4 matrices; A and B are FP16, while C and D may be FP16 or FP32.

Each Tensor Core performs the following operation: D = AxB + C, where A, B, C, and D are 4x4 matrices. The matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be FP16 or FP32 matrices.

  1. At the CUDA level, the warp-level interface assumes 16x16 matrices spanning all 32 threads of the warp.

At the CUDA level, the warp-level interface assumes 16x16 size matrices spanning all 32 threads of the warp.
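A minimal sketch of that warp-level interface via the WMMA API in mma.h (illustrative layouts, a single 16x16x16 tile, compiled for sm_70 or newer): all 32 threads of the warp cooperatively load the fragments and issue the Tensor Core multiply-accumulate.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* a, const half* b, float* d)
{
    // Fragments for one 16x16x16 tile: FP16 inputs, FP32 accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);      // C = 0
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);    // D = A x B + C on the Tensor Cores
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```

Launched with a single warp, e.g. wmma_tile<<<1, 32>>>(d_a, d_b, d_d).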

  1. GV100 uses up to 8 memory dies per HBM2 stack and up to 4 stacks, supporting a maximum of 32 GB of GPU memory; HBM2 delivers up to 900 GB/s of peak memory bandwidth.

GV100 uses up to eight memory dies per HBM2 stack and four stacks, with a maximum of 32 GB of GPU memory. A faster and more efficient HBM2 implementation delivers up to 900 GB/s of peak memory bandwidth, compared to 732 GB/s for GP100.

  1. In the Volta architecture, the L1 cache, texture cache, and shared memory share a combined 128 KB data cache; the shared-memory capacity can be configured to 0, 8, 16, 32, 64, or 96 KB per SM.

In Volta the L1 cache, texture cache, and shared memory are backed by a combined 128 KB data cache. Volta supports shared memory capacities of 0, 8, 16, 32, 64, or 96 KB per SM.
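A sketch of how that split can be requested (some_kernel is a placeholder name; the carveout is a hint, not a guarantee):

```cuda
__global__ void some_kernel() {}   // placeholder kernel

void prefer_shared_memory()
{
    // Hint: carve the 128 KB unified data cache in favour of shared
    // memory (96 KB on Volta); the runtime treats this as a preference.
    cudaFuncSetAttribute(some_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
}
```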

  1. Volta allows a single thread block to use the full 96 KB of shared memory. Static allocations remain limited to 48 KB; beyond that, the memory must be allocated dynamically with an explicit opt-in.

Volta enables a single thread block to address the full 96 KB of shared memory. To maintain architectural compatibility, static shared memory allocations remain limited to 48 KB, and an explicit opt-in is also required to enable dynamic allocations above this limit.
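A sketch of that opt-in (big_smem_kernel is a hypothetical kernel): dynamic allocations above 48 KB require raising the kernel's attribute before launch.

```cuda
__global__ void big_smem_kernel(float* out)
{
    extern __shared__ float smem[];          // dynamically sized shared memory
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

void launch(float* d_out)
{
    // Opt in: raise the cap above the default 48 KB, then request the
    // full 96 KB as the third launch-configuration argument.
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         96 * 1024);
    big_smem_kernel<<<1, 256, 96 * 1024>>>(d_out);
}
```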

Appendix 1: V100 (Volta) configuration data (table not reproduced here)

