CUDA Study Notes 1: A First Look at Its True Face

This article is a repost. 2017-05-25 11:12 · cuda / CUDA study series

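CUDA is the best-known platform for GPU computing, and it can be extremely efficient because a GPU is in effect a massively parallel machine. Anyone who has used MPI for parallel computing knows the basic idea: use many compute units to finish a job that a single unit would otherwise do alone. With MPI, each unit is typically a process running on its own CPU core. A CPU has 1, 2, 4, or 8 cores, and even a very powerful one had only about 16 at the time of writing. To run large parallel jobs, people used to pack piles of CPUs into racks to form a compute cluster, but such equipment is enormous, and few people ever get to use one. A GPU is different: a single small chip can run thousands of threads, organized in a three-level hierarchy of grid, block, and thread. (I am a beginner writing things down as I learn them, so borrow from this with caution.)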

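Parallel computing means having several compute units complete one task together. How hard can that be? It happens in everyday life all the time, so there seems to be little to learn. Imagine building a house: we gather a group of people and hand out tasks, some build the walls, some lay the floor, some do the roof, and then we give the word to start. Everyone does their own part, and the house gets built. But it is not that simple. Many problems along the way can hurt efficiency or even the result. The wall builder needs a hammer, but someone else is using it. The floor is finished, and then the wall builder accidentally smashes it. The structure below is not done yet, and someone is already trying to put the roof on. What is needed is a precise management scheme, and providing one is the job of every parallel API. Shared memory and synchronization, which come up later, are examples. The book I am reading at the moment is CUDA by Example. It suits beginners well and has been a pleasure to read. Let's get started. I will skip hello world and go straight to the useful material.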

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 *
 */


#include "../common/book.h"

#define N   10

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // one thread per block, so the block index picks the element
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}

This code adds two vectors of length 10. It is about as basic as GPU computing gets, so we can use this simple example to get familiar with some of CUDA's basic rules.

First, every CUDA program that runs work on the GPU has a kernel function, that is, a function marked with __global__, such as:

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // one thread per block, so the block index picks the element
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

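The point of a kernel is that every GPU thread launched for it executes the same kernel code, which is where the parallelism comes from. The three arguments a, b, and c are passed to every thread, and each thread performs one addition. For each thread to work on a different array element, the variable tid is crucial. blockIdx.x is a built-in variable provided by the runtime. Each kernel launch runs as one grid. A grid contains many blocks, arranged as a one-, two-, or three-dimensional array, and blockIdx.x is a block's index (its sequence number) along the x direction. Each block in turn is divided into threads, which can likewise be arranged in one, two, or three dimensions.

[Figure: diagram of the grid, block, and thread hierarchy, taken from the web; image not included]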

The figure also shows that each kernel launch gets its own grid, so multiple kernels mean multiple grids. Good to know.

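Back to tid = blockIdx.x: it assigns each block's index to tid. Block indices run 0, 1, 2, ..., and the indices into the arrays a, b, and c also run 0, 1, 2, ..., so each block computes exactly its own element of the arrays. The check tid < N guards against a launch with more blocks than there are elements.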
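To make the indices concrete, here is a minimal sketch in which every thread records which block and which thread it is. The names `record_ids` and `launch_record_ids` and the 3 by 4 launch shape are illustrative assumptions, not taken from the book. It also uses the common `blockIdx.x * blockDim.x + threadIdx.x` formula for a global index, which the listing above does not need because it launches only one thread per block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes down its own block and thread index.
__global__ void record_ids(int *block_of, int *thread_of, int n) {
    // Global index across the whole one-dimensional grid.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        block_of[gid]  = blockIdx.x;
        thread_of[gid] = threadIdx.x;
    }
}

// Launches `blocks` blocks of `threads` threads each and copies the recorded
// indices back to the host. Returns false if any CUDA call fails.
bool launch_record_ids(int blocks, int threads,
                       std::vector<int> &block_of, std::vector<int> &thread_of) {
    int n = blocks * threads;
    size_t bytes = n * sizeof(int);
    block_of.assign(n, -1);
    thread_of.assign(n, -1);

    int *d_block = nullptr, *d_thread = nullptr;
    bool ok = cudaMalloc((void**)&d_block, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_thread, bytes) == cudaSuccess;
    if (ok) {
        record_ids<<<blocks, threads>>>(d_block, d_thread, n);
        ok = cudaMemcpy(block_of.data(), d_block, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(thread_of.data(), d_thread, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_block);   // cudaFree on a null pointer does nothing
    cudaFree(d_thread);
    return ok;
}

int main() {
    std::vector<int> block_of, thread_of;
    if (!launch_record_ids(3, 4, block_of, thread_of)) {
        fprintf(stderr, "a CUDA call failed\n");
        return 1;
    }
    for (size_t i = 0; i < block_of.size(); ++i)
        printf("element %zu: block %d, thread %d\n", i, block_of[i], thread_of[i]);
    return 0;
}
```

With one thread per block, `blockDim.x` is 1 and `threadIdx.x` is always 0, so the global index reduces to `blockIdx.x`, which is exactly what the book's kernel uses.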

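cudaMalloc( (void**)&dev_a, N * sizeof(int) ) allocates memory on the device. One rule applies here: host code must not dereference a pointer returned by cudaMalloc. It may only pass the pointer to runtime functions such as cudaMemcpy or to a kernel, and it releases the memory with cudaFree. The kernel is what actually reads and writes the memory allocated by cudaMalloc, and any other function a kernel calls must be marked __device__.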
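As a sketch of the `__device__` rule, the variant below moves the addition into a helper that the kernel calls. The names `add_pair`, `add_with_helper`, and `add_on_gpu` are hypothetical, and the host wrapper returns a status flag instead of using the book's `HANDLE_ERROR` so that the example does not depend on `book.h`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: __device__ means it can only be called from device code.
__device__ int add_pair(int x, int y) {
    return x + y;
}

// Same shape as the book's kernel, but the addition goes through the helper.
__global__ void add_with_helper(const int *a, const int *b, int *c, int n) {
    int tid = blockIdx.x;
    if (tid < n)
        c[tid] = add_pair(a[tid], b[tid]);
}

// Host wrapper: adds n ints on the GPU. Returns false if any CUDA call fails.
bool add_on_gpu(const int *a, const int *b, int *c, int n) {
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    size_t bytes = n * sizeof(int);

    bool ok = cudaMalloc((void**)&dev_a, bytes) == cudaSuccess
           && cudaMalloc((void**)&dev_b, bytes) == cudaSuccess
           && cudaMalloc((void**)&dev_c, bytes) == cudaSuccess
           && cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Host code only passes the device pointers along. Writing
        // dev_c[0] = 0 here would dereference device memory on the host.
        add_with_helper<<<n, 1>>>(dev_a, dev_b, dev_c, n);
        ok = cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Device memory is released with cudaFree; a null pointer is ignored.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return ok;
}

int main() {
    int a[3] = {1, 2, 3}, b[3] = {10, 20, 30}, c[3] = {0, 0, 0};
    if (!add_on_gpu(a, b, c, 3)) {
        fprintf(stderr, "a CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 3; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
```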

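After the memory is allocated, cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) copies the host array a into the device buffer dev_a. The device then does its own share of the work, namely the computation in the kernel. Afterwards, cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) copies the result in dev_c back to the host. That is the overall flow.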
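The listing relies on `HANDLE_ERROR` from the book's `book.h`. For readers without that header, the sketch below defines a local stand-in, here called `CUDA_CHECK`, which is assumed to behave roughly like the book's macro by printing the error and exiting. It then runs the copy steps in isolation: host to device and straight back. The function name `round_trip` is also hypothetical. Note the argument order of `cudaMemcpy`: destination first, then source.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the book's HANDLE_ERROR:
// print the CUDA error with its location and exit on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s in %s at line %d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Copies n ints from src to a device buffer and back into dst.
void round_trip(const int *src, int *dst, int n) {
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);

    CUDA_CHECK(cudaMalloc((void**)&dev, bytes));
    // cudaMemcpy(destination, source, size in bytes, direction)
    CUDA_CHECK(cudaMemcpy(dev, src, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dst, dev, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
}

int main() {
    int src[5] = {4, 8, 15, 16, 23};
    int dst[5] = {0, 0, 0, 0, 0};
    round_trip(src, dst, 5);
    for (int i = 0; i < 5; i++)
        printf("src[%d] = %d, dst[%d] = %d\n", i, src[i], i, dst[i]);
    return 0;
}
```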

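Note the statement that launches the computation, add<<<N,1>>>( dev_a, dev_b, dev_c );. A kernel call takes two required parameters between the angle brackets. They tell the runtime how to launch the kernel, that is, how many parallel copies of it to run and how to group them. Here N means a one-dimensional grid of N blocks, and 1 means each block contains one thread.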
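To see what the two launch parameters control, the sketch below runs the same addition in two shapes: N blocks of one thread, as in the listing, and one block of N threads, where the kernel has to index by `threadIdx.x` instead. The names `add_by_block`, `add_by_thread`, and `run_add` are assumptions made for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Shape used in the listing: one thread per block, indexed by block.
__global__ void add_by_block(const int *a, const int *b, int *c, int n) {
    int tid = blockIdx.x;
    if (tid < n) c[tid] = a[tid] + b[tid];
}

// Hypothetical alternative: a single block of n threads, indexed by thread.
__global__ void add_by_thread(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x;
    if (tid < n) c[tid] = a[tid] + b[tid];
}

// Runs one of the two launch shapes. Returns false if any CUDA call fails.
bool run_add(bool one_block, const int *a, const int *b, int *c, int n) {
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    size_t bytes = n * sizeof(int);

    bool ok = cudaMalloc((void**)&dev_a, bytes) == cudaSuccess
           && cudaMalloc((void**)&dev_b, bytes) == cudaSuccess
           && cudaMalloc((void**)&dev_c, bytes) == cudaSuccess
           && cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        if (one_block)
            add_by_thread<<<1, n>>>(dev_a, dev_b, dev_c, n);  // 1 block, n threads
        else
            add_by_block<<<n, 1>>>(dev_a, dev_b, dev_c, n);   // n blocks, 1 thread each
        // cudaGetLastError reports a rejected launch configuration.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return ok;
}

int main() {
    const int n = 10;
    int a[n], b[n], c[n];
    for (int i = 0; i < n; i++) { a[i] = -i; b[i] = i * i; }

    for (int mode = 0; mode < 2; mode++) {
        if (!run_add(mode == 1, a, b, c, n)) {
            fprintf(stderr, "a CUDA call failed\n");
            return 1;
        }
        printf("%s: c[3] = %d\n", mode ? "<<<1,N>>>" : "<<<N,1>>>", c[3]);
    }
    return 0;
}
```

Both shapes give the same result for a small N. The number of threads in a single block is limited by the hardware, so the one-block shape stops working once N exceeds that limit, while the block count can grow much further.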
