CUDA Programming (1): Tesla Servers, the Kepler Architecture, and the Perennial Hello World

Reposted article. 2016-03-21 10:26. Tags: CUDA / NVIDIA / Tesla / GPU

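This series draws on two books, CUDA by Example (CUDA范例精解) and CUDA並行編程. I am learning CUDA and using it heavily, so as a study exercise I translate the chapters and sentences I consider most important; the programs are run on an NVIDIA K40 server. If you want to learn CUDA programming from the book but do not know much about CUDA and GPU architecture, you can use this blog as a starting point. Some background helps, though: my coverage is far from complete and mostly picks out points that puzzled me for a long time before they clicked, so if GPUs are entirely new to you, I recommend reading the whole book along with a book on GPU architecture. I will try to finish covering (parts of) the book within a month, adding some of my own understanding along the way. Comments and questions are welcome; please credit the source when reposting.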

A gripe: in places the book's English is overly formal, which really does not make it easy for beginners to follow.

Main text

Key background:

Unlike previous generations that partitioned computing resources into vertex and pixel shaders, the CUDA Architecture included a unified shader pipeline, allowing each and every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. Because NVIDIA intended this new family of graphics processors to be used for general purpose computing, these ALUs were built to comply with IEEE requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics.

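In plainer terms: earlier generations split the computing resources between vertex and pixel shaders, whereas the CUDA Architecture has a unified shader pipeline, so every ALU (arithmetic logic unit) on the chip can be put to work by a program doing general-purpose computation. Because NVIDIA intended this new family of graphics processors for general-purpose computing, the ALUs were built to meet the IEEE requirements for single-precision floating-point arithmetic, and their instruction set was designed for general computation rather than specifically for graphics.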

Furthermore, the execution units on the GPU were allowed arbitrary read and write access to memory as well as access to a software-managed cache known as shared memory. All of these features of the CUDA Architecture were added in order to create a GPU that would excel at computation in addition to performing well at traditional graphics tasks.

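In plainer terms: the execution units on the GPU can read and write memory at arbitrary addresses, and they can also access a software-managed cache known as shared memory. All of these design features of the CUDA Architecture aim at a GPU that excels at computation in addition to traditional graphics work (that is, one that can also play a larger role in fields such as general-purpose computing).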

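Chapter 1 covers the background of parallel computing, why parallel computing matters, the rise of GPU computing, early GPU computing, and CUDA applications in medical imaging, computational fluid dynamics, and environmental science. Read it closely if you are interested, though I personally did not find it very useful. Tutorials for installing CUDA are all over the web, so I will not repeat them here. One thing to note: your machine must have an NVIDIA graphics card.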

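Chapter 2 mainly covers the components to install and the compilers you need, such as gcc and g++. If everything is already installed, I do not think you need to read it either.

Chapter 3: the first example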

// book.h comes with the book's sample code and includes <stdio.h>
#include "../common/book.h"

int main()
{
    printf("hello world!\n");
    return 0;
}

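The example above looks like any C code you have ever written, yet it is still a valid CUDA program: all of it simply runs on the host. This brings in a pair of concepts, the host and the device. The device executes CUDA kernel functions, while the host executes the ordinary CPU code. The book's next example shows how device-side code is written.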
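The device-side listing referred to here is not reproduced in this copy. A minimal sketch consistent with the description that follows (an empty `__global__` function and a call decorated with `<<<1,1>>>`) might look like this. It assumes `<stdio.h>` in place of the book's `../common/book.h` so that it compiles on its own.

```cuda
#include <stdio.h>  // assumption: stands in for the book's "../common/book.h"

// An empty function qualified with __global__, so it is compiled for the device.
__global__ void kernel(void) {
}

int main(void) {
    // A call to the empty function, embellished with <<<1,1>>>.
    kernel<<<1, 1>>>();
    printf("Hello, World!\n");
    return 0;
}
```

The kernel does nothing, so the visible behavior is the same as the host-only version; the point is only that the file now contains code destined for the device.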

This program makes two notable additions to the original “Hello, World!”
example:
• An empty function named kernel() qualified with __global__
• A call to the empty function, embellished with <<<1,1>>>

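Compared with the original "hello world", that program has two additions worth noting:

• An empty kernel() function that takes no arguments, prefixed with the __global__ keyword
• A call to that kernel function, decorated with <<<1,1>>>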

As we saw in the previous section, code is compiled by your system’s standard C compiler by default. For example, GNU gcc might compile your host code on Linux operating systems, while Microsoft Visual C compiles it on Windows systems. The NVIDIA tools simply feed this host compiler your code, and everything behaves as it would in a world without CUDA.

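In plainer terms: as the previous section showed, by default the code is compiled by your system's standard C compiler. GNU gcc might compile your host code on Linux, for example, while Microsoft Visual C compiles it on Windows. The NVIDIA tools simply feed your code to this host compiler, and from there everything behaves as it would without CUDA.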

Now we see that CUDA C adds the __global__ qualifier to standard C. This mechanism alerts the compiler that a function should be compiled to run on a device instead of the host. In this simple example, nvcc gives the function kernel() to the compiler that handles device code, and it feeds main() to the host compiler as it did in the previous example. So, what is the mysterious call to kernel(), and why must we vandalize our standard C with angle brackets and a numeric tuple? Brace yourself, because this is where the magic happens.

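In plainer terms: CUDA C adds the __global__ qualifier to standard C. It tells the compiler that a function should be compiled to run on the device rather than on the host. In this simple example, nvcc hands the function kernel() to the compiler that handles device code and feeds main() to the host compiler, just as in the previous example (roughly: the kernel runs on the device and all other code runs on the CPU; the original is quite a mouthful, isn't it?!). So what is this mysterious call to kernel(), and why must we deface our standard C with angle brackets and a numeric tuple? Brace yourself, because this is where the magic happens.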
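To make the host/device split visible, here is a hedged sketch in which the kernel prints from the device. It is not from the book. It assumes a GPU of compute capability 2.0 or later (which includes Kepler) for device-side `printf`, and `cuda_ok` is a hypothetical helper introduced only for this illustration. The two numbers in `<<<1, 1>>>` are the number of blocks and the number of threads per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA runtime error and say whether the call succeeded.
static bool cuda_ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// __global__ marks this function for the device compiler.
__global__ void kernel(void) {
    printf("hello from the device\n");  // device-side printf
}

int main(void) {
    // One block containing one thread.
    kernel<<<1, 1>>>();
    if (!cuda_ok(cudaGetLastError(), "kernel launch")) return 1;

    // Launches are asynchronous; wait for the kernel so its output is flushed.
    if (!cuda_ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;

    printf("hello from the host\n");
    return 0;
}
```

Everything in `main()` goes through the host compiler; only `kernel()` is compiled for the GPU.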

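Supplementary notes

A GPU is built from a set of SMX units (streaming multiprocessors), and each SMX is made up of many SPs (streaming processors). One SMX is configured as follows: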

192 cores (all SIMT cores: Single Instruction Multiple Threads) and 64K registers
         SIMT on the GPU corresponds to SIMD (Single Instruction Multiple Data) on the CPU
64KB of shared memory / L1 cache
8KB cache for constants
48KB texture cache for read-only arrays
up to 2K threads per SMX
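Some of these limits can be read back from the CUDA runtime. The sketch below assumes the card of interest is device 0, and `bytes_to_kib` is a hypothetical helper for printing. Note that the runtime reports per-block limits for registers and shared memory, and that `totalConstMem` is the total constant memory, not the size of the constant cache; the cache sizes listed above are not exposed through this structure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: convert a byte count to KiB for display.
static size_t bytes_to_kib(size_t bytes) {
    return bytes / 1024;
}

int main(void) {
    cudaDeviceProp prop;
    // Assumption: query device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("shared memory per block: %zu KiB\n", bytes_to_kib(prop.sharedMemPerBlock));
    printf("total constant memory: %zu KiB\n", bytes_to_kib(prop.totalConstMem));
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}
```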

The Tesla K40 is built on the NVIDIA Kepler™ architecture, shown in the figure below (see NVIDIA's official Kepler architecture documentation).

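In the figure, L2 Cache is the second-level cache. The K40 has 15 SMX units, each with the 192 cores described above, giving 2880 CUDA cores in total (15 × 192), and 12 GB of memory. The parameters are listed below: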
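The core-count arithmetic can be checked against a real card. In this sketch the runtime supplies the SMX count, while the figure of 192 cores per SMX is taken from the text, since the runtime does not report it. `estimate_cuda_cores` is a hypothetical helper, and device 0 is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: total cores = SMX count x cores per SMX.
static int estimate_cuda_cores(int smx_count, int cores_per_smx) {
    return smx_count * cores_per_smx;
}

int main(void) {
    // Kepler figure quoted in the text; not reported by the runtime.
    const int kCoresPerSmx = 192;

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("SMX count: %d\n", prop.multiProcessorCount);
    printf("estimated CUDA cores: %d\n",
           estimate_cuda_cores(prop.multiProcessorCount, kCoresPerSmx));
    printf("global memory: %.1f GiB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("L2 cache: %d KiB\n", prop.l2CacheSize / 1024);
    return 0;
}
```

The estimate is only meaningful on a Kepler part; other architectures have a different number of cores per multiprocessor.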

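The internal structure of an SMX is shown below. A warp is the smallest unit of CUDA thread execution: 32 threads that execute in parallel. The register file holds 65536 × 32-bit registers. There are 32 special function units (SFU), 32 load/store units (LD/ST), and a 48 KB read-only data cache. Shared memory is 64 KB or 128 KB, depending on the platform. Tex denotes the texture units.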
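How threads in a block map onto warps can be sketched as follows. The helpers `warp_index` and `lane_index` are hypothetical names introduced for illustration; the mapping assumes a one-dimensional block, where consecutive groups of `warpSize` threads form a warp. Only the first thread of each warp prints, and device-side `printf` again assumes compute capability 2.0 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers: which warp a thread belongs to, and its position inside it.
__host__ __device__ inline int warp_index(int thread_id, int warp_size) {
    return thread_id / warp_size;
}
__host__ __device__ inline int lane_index(int thread_id, int warp_size) {
    return thread_id % warp_size;
}

__global__ void show_warps(void) {
    int tid = (int)threadIdx.x;
    // warpSize is a built-in device variable (32 on Kepler).
    if (lane_index(tid, warpSize) == 0) {
        printf("thread %d starts warp %d\n", tid, warp_index(tid, warpSize));
    }
}

int main(void) {
    // One block of 64 threads, i.e. two warps of 32 threads each.
    show_warps<<<1, 64>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```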

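The K80, the newest model at the time of writing, carries two GPUs, and its memory capacity is doubled. Look it up if you are interested.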

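1. Save the finished program with a .cu extension, not .cpp or anything else. Compile it with $ nvcc test.cu -o test and then run the result with $ ./test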

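2. Understanding the GPU's underlying architecture makes it much clearer how threads execute. (I will not go into GPU architecture at length and only cover the architectural features of the K40.) If it all still feels hazy, see the article 顯卡帝教你讀懂GPU架構圖輕松做達人 and the official PDFs below.

References: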

nv-ds-tesla-kcompute-arch-may-2012-cn

NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper

Tesla K80 An Inside Look Developer Whitepaper_CN

TESLA K10 GPU ACCELERATOR
