Demystifying the Google TPU
Ever since the Google TPU (Tensor Processing Unit) was unveiled, people have speculated about its architecture and performance. Google's paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" finally gives us a chance to see for ourselves.
First, let's look at the abstract:
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC—called a Tensor Processing Unit (TPU)—deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, …) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
This abstract is packed with information; I have highlighted the key points. First, the TPU chip targets datacenter inference applications. Its core is a matrix multiply unit built from 65,536 8-bit MACs, with a peak throughput of 92 TeraOps/second (TOPS), plus a large on-chip memory totaling 28 MiB. It supports the common NN workloads (MLPs, CNNs, and LSTMs) and the TensorFlow framework. The abstract also tells us that the TPU forgoes the techniques used by conventional CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching), because its target applications follow a deterministic execution model, which is also why it can be so efficient. Its average performance (TOPS) reaches 15-30x that of the contemporary CPU and GPU, and its energy efficiency (TOPS/Watt) reaches 30-80x. Using the GPU's GDDR5 memory in the TPU would triple the achieved TOPS and raise TOPS/Watt to nearly 70x the GPU and 200x the CPU.
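As a quick sanity check on those headline numbers, here is a minimal back-of-the-envelope sketch in Python; the 700 MHz clock rate comes from the body of the paper, not the abstract:

```python
# Back-of-the-envelope check of the TPU's peak throughput.
macs = 256 * 256          # the 65,536 8-bit MACs of the matrix unit
clock_hz = 700e6          # 700 MHz TPU clock, from the body of the paper
ops_per_mac = 2           # each MAC is one multiply plus one add

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"peak throughput = {peak_tops:.1f} TOPS")   # ~91.8, i.e. the quoted 92 TOPS
```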
At this point we can see that Google's TPU is a dedicated processor, or hardware accelerator, architecture. This is far more radical than the previously imagined approach of improving on a GPU architecture, and of course it achieves correspondingly higher efficiency.
Overall Architecture
The figure above shows the TPU's architecture block diagram. In the paper's words, the goal of the chip was:
“The goal was to run whole inference models in the TPU to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond, instead of just what was required for 2013 NNs.”
The specific architecture details are as follows:
"TPU instructions are sent from the host over the PCIe Gen3 x16 bus into an instruction buffer. The internal blocks are generally connected by 256-byte-wide paths. Starting from the upper right, the Matrix Multiply Unit is the heart of the TPU. It contains 256x256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the 4 MiB of 32-bit accumulators below it; 4 MiB represents 4096, 256-element, 32-bit accumulators. The matrix unit produces one 256-element partial sum per clock cycle. We picked 4096 by first noting that the operations per byte needed to reach peak performance is about 1350, rounding that up to 2048, and then duplicating it so the compiler could use double buffering while running at peak."
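The accumulator numbers in that paragraph can be recomputed directly; a small sketch using only the figures quoted above:

```python
import math

# 4096 accumulators x 256 elements x 4 bytes should equal the quoted 4 MiB.
total_bytes = 4096 * 256 * 4
print(total_bytes / 2**20, "MiB")                  # 4.0 MiB

# Why 4096: ~1350 operations per byte are needed for peak performance;
# round up to the next power of two, then double for double buffering.
ops_per_byte = 1350
rounded = 2 ** math.ceil(math.log2(ops_per_byte))  # 2048
print(rounded * 2)                                 # 4096
```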
"When using a mix of 8-bit weights and 16-bit activations (or vice versa), the Matrix Unit computes at half speed, and at a quarter speed when both are 16 bits. It reads and writes 256 values per clock cycle and can perform either a matrix multiply or a convolution. The matrix unit holds one 64 KiB tile of weights and uses double buffering (so the next tile's weights can be loaded while it computes). The unit is designed for dense matrices; architectural support for sparsity was omitted for time-to-deployment reasons. Sparsity will have high priority in future designs."
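Both the tile size and the precision-dependent throughput follow directly from the 256x256 geometry; a small sketch, where the TOPS figures assume the 92 TOPS 8-bit peak from the abstract:

```python
# One tile of weights: 256 x 256 entries at 1 byte each.
print(256 * 256 // 1024, "KiB per weight tile")    # 64 KiB, as quoted

# Relative throughput for each weight/activation precision mix,
# per the half-speed and quarter-speed rules quoted above.
PEAK_8BIT_TOPS = 92
speed = {
    ("8-bit", "8-bit"): 1.0,      # native precision: full speed
    ("8-bit", "16-bit"): 0.5,     # mixed precision: half speed
    ("16-bit", "8-bit"): 0.5,
    ("16-bit", "16-bit"): 0.25,   # both 16-bit: quarter speed
}
for (w, a), s in speed.items():
    print(f"weights {w:>6}, activations {a:>6}: {s * PEAK_8BIT_TOPS:.0f} TOPS")
```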
"The weights for the matrix multiply unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM. Since this is for inference, the weights are read-only; 8 GiB can support many simultaneously active models. The Weight FIFO is four tiles deep. Intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as input to the matrix multiply unit. A programmable DMA controller transfers data between CPU host memory and the Unified Buffer."
The figure below shows the floorplan of the TPU die, which gives a rough sense of the relative area of each block.
Instruction Set
According to the authors, the TPU's instruction set is CISC-style ("TPU instructions follow the CISC tradition"), with an average CPI (clock cycles per instruction) of 10 to 20. There are only about a dozen instructions in total; the important ones are listed below:
1. Read_Host_Memory reads data from the CPU host memory into the Unified Buffer (UB).
2. Read_Weights reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit.
3. MatrixMultiply/Convolve causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators. A matrix operation takes a variable-sized B*256 input, multiplies it by a 256x256 constant weight input, and produces a B*256 output, taking B pipelined cycles to complete.
4. Activate performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on. Its inputs are the Accumulators, and its output is the Unified Buffer. It can also perform the pooling operations needed for convolutions using the dedicated hardware on the die, as it is connected to nonlinear function logic.
5. Write_Host_Memory writes data from the Unified Buffer into the CPU host memory.
The other instructions include:
The other instructions are alternate host memory read/write, set configuration, two versions of synchronization, interrupt host, debug-tag, nop, and halt. The CISC MatrixMultiply instruction is 12 bytes, of which 3 are Unified Buffer address; 2 are accumulator address; 4 are length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags.
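To make that 12-byte encoding concrete, here is a hypothetical packing: the field sizes come from the quote above, but the field order, opcode value, and flag layout are my own assumptions for illustration.

```python
import struct

def encode_matrix_multiply(ub_addr, acc_addr, length,
                           opcode=0x03, flags=0x0000):
    """Pack a hypothetical 12-byte MatrixMultiply instruction.

    Field sizes (3B Unified Buffer address, 2B accumulator address,
    4B length, 3B opcode + flags) follow the paper; the layout and
    the opcode value are invented.
    """
    ub_bytes = ub_addr.to_bytes(3, "little")           # 3-byte UB address
    acc_bytes = struct.pack("<H", acc_addr)            # 2-byte accumulator address
    len_bytes = struct.pack("<I", length)              # 4-byte length
    tail = bytes([opcode]) + struct.pack("<H", flags)  # remaining 3 bytes
    insn = ub_bytes + acc_bytes + len_bytes + tail
    assert len(insn) == 12
    return insn

print(encode_matrix_multiply(ub_addr=0x000100, acc_addr=0x0000, length=256).hex())
```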
This is a highly specialized instruction set, consisting mainly of compute and memory-access instructions. From another angle, this confirms that the TPU is a very domain-specific processor.
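To see how these instructions compose, here is a hypothetical instruction stream for a single fully connected layer; the opcode names follow the paper, while the batch size and buffer addresses are invented:

```python
# Hypothetical TPU instruction stream for one fully connected layer with
# batch size B = 128. Per the descriptions above, MatrixMultiply consumes
# a B x 256 input and takes B pipelined cycles to complete.
program = [
    ("Read_Host_Memory",  {"dst": "UB[0]",      "src": "host inputs"}),
    ("Read_Weights",      {"dst": "WeightFIFO", "src": "WeightMem tile 0"}),
    ("MatrixMultiply",    {"src": "UB[0]",      "dst": "Acc[0]", "B": 128}),
    ("Activate",          {"fn": "ReLU",        "src": "Acc[0]", "dst": "UB[128]"}),
    ("Write_Host_Memory", {"src": "UB[128]",    "dst": "host outputs"}),
]
for op, args in program:
    print(f"{op:<18} {args}")
```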
Microarchitecture
First, the most important sentence: "The philosophy of the TPU microarchitecture is to keep the matrix unit busy." I suspect this is also the philosophy of everyone designing an NN accelerator.
"The TPU uses a 4-stage pipeline for these CISC instructions, where each instruction executes in a separate stage. The design hides the execution time of the other instructions by overlapping them with the MatrixMultiply instruction. To that end, the Read_Weights instruction follows the decoupled-access/execute philosophy: it can complete after sending its address but before the weights have been fetched from Weight Memory. The matrix unit will stall if the input activations or weight data are not ready."
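A minimal sketch of how this overlap plays out; the cycle counts below are invented, and only the prefetch-and-stall pattern follows the paper's description:

```python
# Illustrative schedule of decoupled weight loads overlapping matrix
# multiplies via double buffering: once the pipeline is primed, only
# the very first tile's load is exposed as a stall.
WEIGHT_LOAD = 200        # hypothetical cycles to load one weight tile
MULTIPLY = 256           # B pipelined cycles for a B = 256 batch

t = 0
ready = {0: WEIGHT_LOAD}                    # tile 0 load starts at cycle 0
for tile in range(4):
    start = max(t, ready[tile])             # stall if weights aren't ready
    if start > t:
        print(f"tile {tile}: matrix unit stalls {start - t} cycles")
    ready[tile + 1] = start + WEIGHT_LOAD   # prefetch the next tile now
    t = start + MULTIPLY
    print(f"tile {tile}: multiply finishes at cycle {t}")
```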
"Because the TPU's CISC instructions can occupy a pipeline station for thousands of clock cycles, unlike a traditional RISC pipeline with one clock cycle per stage, the TPU does not have clean pipeline overlap diagrams. An interesting case arises when the activations of one NN layer must complete before the matrix multiplications of the next layer can begin; we see a "delay slot" where the matrix unit waits for an explicit synchronization signal before it can safely read from the Unified Buffer."
Since reading a large SRAM costs far more energy than arithmetic, the matrix unit reduces reads and writes of the Unified Buffer to save energy, using what is called systolic execution. The figure below shows data flowing in from the left and weights being loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined, giving the illusion that the 256 inputs are read all at once and immediately update one location in each of the 256 accumulators. From a correctness standpoint, the systolic nature of the matrix unit is invisible to software; for performance, however, its latency is something software must take into account.
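A toy simulation of this diagonal wavefront, shrunk to a 4x4 array instead of 256x256 (a sketch that models only the weight-stationary dataflow, not cycle-accurate hardware):

```python
import numpy as np

# Toy weight-stationary systolic array: activations enter from the left,
# skewed one cycle per row; partial sums flow down one PE per cycle, so
# finished dot products leave the bottom as a diagonal wavefront.
N = 4
rng = np.random.default_rng(0)
W = rng.integers(0, 5, size=(N, N)).astype(float)   # stationary weights
x = np.array([1.0, 2.0, 3.0, 4.0])                  # one input vector

act = np.zeros((N, N))     # activation arriving at each PE this cycle
psum = np.zeros((N, N))    # partial sum arriving at each PE this cycle
act[0, 0] = x[0]           # x[0] enters row 0 at cycle 0
y = np.zeros(N)

for t in range(2 * N):
    mac = psum + act * W                     # every PE does one MAC
    for j in range(N):
        if t == (N - 1) + j:                 # column j reaches the bottom
            y[j] = mac[N - 1, j]
    psum = np.vstack([np.zeros((1, N)), mac[:-1, :]])   # psums move down
    act = np.hstack([np.zeros((N, 1)), act[:, :-1]])    # acts move right
    if t + 1 < N:
        act[t + 1, 0] = x[t + 1]             # skewed left-edge injection

print("systolic:", y)
print("numpy   :", x @ W)                    # identical result
```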
Software
"The TPU software stack had to be compatible with those developed for CPUs and GPUs so that applications could be ported to the TPU quickly. The portion of an application run on the TPU is typically written in TensorFlow and compiled into an API that can run on GPUs or TPUs. Like GPUs, the TPU stack is split into a User Space Driver and a Kernel Driver. The Kernel Driver is lightweight, handling only memory management and interrupts; it is designed for long-term stability. The User Space Driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, and translates API calls into TPU instructions, turning them into an application binary. The User Space Driver compiles a model the first time it is evaluated, caching the program image and writing the weight image into the TPU's weight memory; the second and subsequent evaluations run at full speed. The TPU runs most models completely from inputs to outputs, maximizing the ratio of TPU compute time to I/O time. Computation is often done one layer at a time, with overlapped execution allowing the matrix multiply unit to hide most non-critical-path operations."
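The compile-on-first-evaluation behavior is essentially memoization of the compiled program; a minimal sketch of that pattern, where every name is hypothetical rather than Google's actual driver API:

```python
class TPUUserDriver:
    """Hypothetical sketch of the compile-then-cache behavior described
    above; none of these names are Google's real driver API."""

    def __init__(self):
        self._programs = {}              # model id -> cached program image

    def evaluate(self, model_id, weights, inputs):
        if model_id not in self._programs:
            # First evaluation: compile the model, cache the program
            # image, and write the weight image into Weight Memory.
            print(f"compiling {model_id}, writing weights to the TPU")
            self._programs[model_id] = f"{model_id}-program-image"
        # Second and later evaluations run at full speed from the cache.
        return f"ran {self._programs[model_id]} on {len(inputs)} inputs"

driver = TPUUserDriver()
print(driver.evaluate("mlp0", weights=[0.1, 0.2], inputs=[1, 2, 3]))  # compiles
print(driver.evaluate("mlp0", weights=[0.1, 0.2], inputs=[4, 5, 6]))  # cache hit
```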
Much of the rest of the paper compares the TPU's performance against CPUs and GPUs; it is well worth a careful read, and perhaps a topic for another discussion. Honestly, the fact that Google could deploy such an ASIC tensor processor back in 2015 commands real respect.
T.S.
References:
1. Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.