A Brief Overview of the NVIDIA Pascal GPU Architecture
This article is excerpted from NVIDIA's official Pascal architecture whitepaper: https://www.nvidia.com/en-us/data-center/resources/pascal-architecture-whitepaper/
SM
- Compared with Maxwell, Pascal moves to an improved 16-nm FinFET manufacturing process and adds various other architectural improvements.
Pascal further improves the already excellent power efficiency provided by the Maxwell architecture through both an improved 16-nm FinFET manufacturing process and various architectural modifications.
- Like Maxwell, Pascal uses a power-of-two number of CUDA cores per SM partition, and each warp scheduler issues to a dedicated set of CUDA cores equal to the warp width (32), which simplifies scheduling.
Like Maxwell, Pascal employs a power-of-two number of CUDA Cores per partition. This simplifies scheduling compared to Kepler, since each of the SM's warp schedulers issue to a dedicated set of CUDA Cores equal to the warp width (32).
- Each warp scheduler can still dual-issue (for example, issuing a math operation to a CUDA core in the same cycle as a memory operation to a load/store unit), but single-issue is now enough to fully utilize all CUDA cores.
Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.
- Each GP104 SM provides four warp schedulers managing a total of 128 FP32 cores and 4 FP64 cores, while each GP100 SM provides two warp schedulers managing 64 FP32 cores and 32 FP64 cores. GP104 provides up to 20 SMs, the similar GP102 up to 30, and GP100 up to 60.
Like Maxwell, each GP104 SM provides four warp schedulers managing a total of 128 single-precision (FP32) and four double-precision (FP64) cores. A GP104 processor provides up to 20 SMs, and the similar GP102 design provides up to 30 SMs. By contrast, GP100 provides smaller but more numerous SMs. Each GP100 provides up to 60 SMs. Each SM contains two warp schedulers managing a total of 64 FP32 and 32 FP64 cores.
- As with Maxwell and Kepler, Pascal supports up to 64 concurrent warps per SM.
The maximum number of concurrent warps per SM remains the same as in Maxwell and Kepler (i.e., 64).
- Pascal has 64k 32-bit registers per SM, allows at most 255 registers per thread, and supports up to 32 resident thread blocks per SM (compared with Kepler, occupancy improves automatically for kernels with thread blocks of 64 or fewer threads). Shared memory per SM is 64 KB on GP100 and 96 KB on GP104, but each thread block may use at most 48 KB of shared memory. NVIDIA recommends using at most 32 KB of shared memory per thread block, so that at least two blocks can reside on each SM concurrently.
The register file size (64k 32-bit registers) is the same as that of Maxwell and Kepler GK110.
The maximum number of registers per thread remains 255.
The maximum number of thread blocks per SM is 32. Compared to Kepler, Pascal should see an automatic occupancy improvement for kernels with thread blocks of 64 or fewer threads (shared memory and register file resource requirements permitting).
Shared memory capacity per SM is 64KB for GP100 and 96KB for GP104. The maximum shared memory per block remains limited to 48KB as with prior architectures. NVIDIA recommends that applications use at most 32 KB of shared memory in any one thread block.
- Pascal improves support for the FP16 (half) floating-point format. Peak FP16 throughput is reached by using a paired operation that executes two FP16 instructions per core simultaneously; to qualify for pairing, the operands must be stored in half2 format. GP100 provides FP16 throughput up to 2x that of FP32, whereas GP104 provides only 1/64th of its FP32 rate, but GP104 adds INT8 support instead.
Pascal provides improved FP16 support for applications. Peak FP16 throughput is attained by using a paired operation to perform two FP16 instructions per core simultaneously. GP100, designed with training deep neural networks in mind, provides FP16 throughput up to 2x that of FP32 arithmetic. On GP104, FP16 throughput is lower, 1/64th that of FP32. However, compensating for reduced FP16 throughput, GP104 provides additional high-throughput INT8 support not available in GP100.
- GP104 provides dedicated two-way and four-way integer dot-product instructions for INT8. The __dp4a intrinsic computes a dot product of four 8-bit integers, accumulating into a 32-bit integer; similarly, __dp2a computes a dot product of two 16-bit integers with two 8-bit integers, accumulating into a 32-bit integer.
GP104 provides specialized instructions for two-way and four-way integer dot products. The __dp4a intrinsic computes a dot product of four 8-bit integers with accumulation into a 32-bit integer. Similarly, __dp2a performs a two-element dot product between two 16-bit integers in one vector, and two 8-bit integers in another with accumulation into a 32-bit integer.
- GP100 uses High Bandwidth Memory 2 (HBM2) for its DRAM. The HBM2 memory dies are stacked on the same silicon package as the GPU die, which allows much higher bandwidth than traditional GDDR technology. GP100 connects to up to four HBM2 stacks, with two 512-bit memory controllers per stack, for an effective memory bus width of 4096 bits; compared with GM200's 384 bits, this greatly raises peak bandwidth. The GP100-based Tesla P100 thus reaches a peak bandwidth of 732 GB/s at a modest 715 MHz memory clock.
GP100 uses High Bandwidth Memory 2 (HBM2) for its DRAM. HBM2 memories are stacked on a single silicon package along with the GPU die. This allows much wider interfaces at similar power compared to traditional GDDR technology. GP100 is linked to up to four stacks of HBM2 and uses two 512-bit memory controllers for each stack. The effective width of the memory bus is then 4096 bits, a significant increase over the 384 bits in GM200. This allows a tremendous boost in peak bandwidth even at reduced memory clocks. Thus, the GP100 equipped Tesla P100 has a peak bandwidth of 732 GB/s with a modest 715 MHz memory clock.
- GP100 provides ECC protection. Traditionally this costs 6.25% of memory capacity and roughly 20% of bandwidth, but HBM2 provides dedicated ECC resources, enabling overhead-free ECC.
Like Kepler GK210, the GP100 GPU's register files, shared memories, L1 and L2 caches, and DRAM are all protected by Single-Error Correct Double-Error Detect (SECDED) ECC code. When enabling ECC support on a Kepler GK210, the available DRAM would be reduced by 6.25% to allow for the storage of ECC bits. Fetching ECC bits for each memory transaction also reduced the effective bandwidth by approximately 20% compared to the same GPU with ECC disabled. HBM2 memories, on the other hand, provide dedicated ECC resources, allowing overhead-free ECC protection.
- By default, GP100 caches global loads in the L1/texture cache. GP104, in contrast, caches global loads only in L2 (as with Kepler and Maxwell), but allows developers to opt in to L1/texture caching by passing -Xptxas -dlcm=ca to nvcc at compile time.
By default, GP100 caches global loads in the L1/Texture cache. In contrast, GP104 follows Kepler and Maxwell in caching global loads in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler. As with previous architectures, GP104 allows the developer to opt-in to caching all global loads in the unified L1/Texture cache by passing the -Xptxas -dlcm=ca flag to nvcc at compile time.
- When L1 caching of global loads was enabled, Kepler serviced loads at a granularity of 128 B, and at 32 B otherwise. On Pascal, the access granularity is always 32 B.
Kepler serviced loads at a granularity of 128B when L1 caching of global loads was enabled and 32B otherwise. On Pascal the data access unit is 32B regardless of whether global loads are cached in L1.
- Like Maxwell, Pascal provides native shared-memory atomic instructions for 32-bit integer arithmetic, along with native 32-bit and 64-bit CAS. Pascal also adds FP64 atomic instructions in global memory; atomicAdd() now supports 32-bit and 64-bit integer and floating-point operands.
Like Maxwell, Pascal provides native shared memory atomic operations for 32-bit integer arithmetic, along with native 32 or 64-bit compare-and-swap (CAS). Pascal also extends atomic addition in global memory to function on FP64 data. The atomicAdd() function in CUDA has thus been generalized to support 32 and 64-bit integer and floating-point types.
- On GP100, atomic operations may target memory on peer GPUs connected via NVLink, but this is not supported over PCIe. In general, an atomic targeting a migratable address backed by a remote memory space triggers a page fault so that the page can be migrated to local memory.
For GP100 atomic operations may target the memories of peer GPUs connected through NVLink. Peer-to-peer atomics over NVLink use the same API as atomics targeting global memory. GPUs connected via PCIE do not support this feature. When an atomic targets a migratable address backed by a remote memory space, the local processor page-faults so that the kernel can migrate the appropriate memory page to local memory.
- Unlike Kepler, where L1 and shared memory shared the same on-chip storage, Maxwell and Pascal merge the L1 and texture caches and therefore give each SM dedicated shared-memory storage: 64 KB per SM on GP100 and 96 KB per SM on GP104.
For Kepler, shared memory and the L1 cache shared the same on-chip storage. Maxwell and Pascal, by contrast, provide dedicated space to the shared memory of each SM, since the functionality of the L1 and texture caches have been merged. This increases the shared memory space available per SM as compared to Kepler: GP100 offers 64 KB shared memory per SM, and GP104 provides 96 KB per SM.
- Kepler offered an optional 8-byte shared-memory bank mode; Pascal, like Maxwell, returns to fixed 4-byte banks, so shared-memory accesses to 4-byte types such as int and fp32 achieve full bandwidth.
Kepler provided an optional 8-byte shared memory banking mode. Pascal follows Maxwell in returning to fixed four-byte banks.
- GP100 adds compute preemption, which allows compute tasks running on the GPU to be interrupted at instruction-level granularity; the execution context (registers, shared memory, etc.) is saved to GPU DRAM so that another application can be swapped in and run.
Compute Preemption is a new feature specific to GP100. Compute Preemption allows compute tasks running on the GPU to be interrupted at instruction-level granularity. The execution context (registers, shared memory, etc.) are swapped to GPU DRAM so that another application can be swapped in and run.
- Pascal uses a 49-bit virtual address space (covering the full 48-bit virtual address space of modern CPUs) to unify the memory of all GPUs in the system into a single address space. Pascal also supports page faulting, which lets applications access managed memory from both host and device without explicit synchronization: when a kernel touches a non-resident page, there is no need to pre-synchronize all managed memory; the system migrates the page to local memory or maps it into the local address space on demand.
Pascal offers new hardware capabilities to extend Unified Memory (UM) support. An extended 49-bit virtual addressing space allows Pascal GPUs to address the full 48-bit virtual address space of modern CPUs as well as the memories of all GPUs in the system through a single virtual address space, not limited by the physical memory sizes of any one processor. Pascal GPUs also support memory page faulting. Page faulting allows applications to access the same managed memory allocations from both host and device without explicit synchronization. It also removes the need for the CUDA runtime to pre-synchronize all managed memory allocations before each kernel launch. Instead, when a kernel accesses a non-resident memory page, it faults, and the page can be migrated to the GPU memory on-demand, or mapped into the GPU address space for access over PCIe/NVLink interfaces.
- On supported operating systems, memory located on either the GPU or the CPU can be accessed through the same pointer; in fact, the GPU can access the system's entire virtual memory, so on such systems there is no need to explicitly allocate managed memory with cudaMallocManaged().
On supporting operating system platforms, any memory allocated with the default OS allocator (for example, malloc or new) can be accessed from both GPU and CPU code using the same pointer. In fact, all system virtual memory can be accessed from the GPU. On such systems, there is no need to explicitly allocate managed memory using cudaMallocManaged().
Configuration Data for the Pascal-Based P40
