上一篇我們介紹了runtime庫中的一些函數,接下來我們來介紹cuda隨機數的生成。
回顧
cuda將函數與變量根據其所在位置,分割成兩部分。其中主機端(host)的函數與變量可以互相自由調用,設備端(device)的函數與變量也可自由調用,不過設備端有一種特殊的函數——__kernel__函數(核函數),這是主機端調用設備端函數的唯一方法。
核函數的調用需要運行時參數,host端和device端都是如此。
對於變量的訪問,host端可以修改和讀取__device__和__constant__,卻無法訪問__shared__;device端可以讀取上述三種,卻只能修改__device__和__shared__變量。至於host端變量,自然只有host端函數才能修改和讀取,device端無法訪問。
問題的提出
神經網絡的初始化、游戲NPC的隨機生成等,都需要用到隨機數。cuda的訪問控制,給了隨機數生成以一定的問題,host端當然可以用cstdlib中的rand()函數生成隨機數,但設備端如何使用這些隨機數?每調用一次rand(),就通過cudaMemcpy傳遞給顯存嗎?顯然不是,這會消耗太多I/O時間;先調用n次,一次性傳到device中嗎?雖然可行,但並不是最優解。能否用一種方法,讓主機僅僅傳給設備一個信號,使得多個隨機數在device端被生成,這樣就能避免過多的I/O帶來的耗時問題;或者調用一個設備端函數,使得設備端可以根據需要動態生成隨機數。於是curand庫和curand_kernel庫進入了我們的視野。
curand庫——host端的隨機數生成
curand庫使用了我們的第一種思路,讓主機向設備傳遞信號,指定隨機數的生成數量,生成若干隨機數以待使用。
隨機數生成器
curand提供了多種不同的隨機數生成器:
1 /** 2 * CURAND generator types 3 */
4 enum curandRngType { 5 CURAND_RNG_TEST = 0, 6 CURAND_RNG_PSEUDO_DEFAULT = 100, ///< Default pseudorandom generator 7 CURAND_RNG_PSEUDO_XORWOW = 101, ///< XORWOW pseudorandom generator 8 CURAND_RNG_PSEUDO_MRG32K3A = 121, ///< MRG32k3a pseudorandom generator 9 CURAND_RNG_PSEUDO_MTGP32 = 141, ///< Mersenne Twister MTGP32 pseudorandom generator 10 CURAND_RNG_PSEUDO_MT19937 = 142, ///< Mersenne Twister MT19937 pseudorandom generator 11 CURAND_RNG_PSEUDO_PHILOX4_32_10 = 161, ///< PHILOX-4x32-10 pseudorandom generator 12 CURAND_RNG_QUASI_DEFAULT = 200, ///< Default quasirandom generator 13 CURAND_RNG_QUASI_SOBOL32 = 201, ///< Sobol32 quasirandom generator 14 CURAND_RNG_QUASI_SCRAMBLED_SOBOL32 = 202, ///< Scrambled Sobol32 quasirandom generator 15 CURAND_RNG_QUASI_SOBOL64 = 203, ///< Sobol64 quasirandom generator 16 CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 = 204 ///< Scrambled Sobol64 quasirandom generator 17 };
18 typedef enum curandRngType curandRngType_t;
許多英語比我好的同學,通過看注釋就能看明白,枚舉值大於等於100且小於200的這六個隨機數發生器是pseudorandom(偽隨機數),而枚舉值大於等於200的這五個隨機數發生器則是quasirandom(真隨機數)。
學過高中數學,我們知道,根據算法得到的隨機數為偽隨機數,而根據現實中的不依賴算法的隨機事件得到的隨機數為真隨機數。換句話說,偽隨機數可以被破解,而真隨機數是無法預測的。
比如cstdlib庫中的rand函數是這樣實現的:
1 unsigned int _Status = 0; 2 void __cdecl srand(unsigned int _Seed) { 3 _Status = _Seed; 4 } 5 int __cdecl rand(void) { 6 _Status = 214013u * _Status + 2531011u; 7 return (int)(_Status >> 16u & 0x7fffu); 8 }
很明顯這是通過算法模擬出來的隨機,當然是偽隨機數。
而模擬電視在無信號的時候出現的無序噪點,每一個噪點的值都是真隨機數生成的。
curand庫生成偽隨機數的算法,也是通過算法得到的,如XORWOW算法則是NVIDIA根據XORSHIFT算法改進而成的。而真隨機數,則是直接對顯卡電流等實際數據的采樣,再加上簡單的計算處理得到的。在密碼學加密中,隨機數盡量由真隨機數生成器生成,防止敵人預測出序列,輕松破解。
隨機數的生成
生成隨機數的方法也很簡單,首先要初始化隨機數生成器,然后調用“傳遞信號”的函數,讓device端生成若干隨機數。device端使用時只需知道隨機數數組的指針即可。
比如通過下面的操作:
1 curandGenerator_t gen; //隨機數生成器
2 curandStatus_t cst; //curand庫函數返回值暫存
3 cudaError_t cet; //runtime庫函數返回值暫存
4 unsigned int* arr; //隨機數存儲位置
5 const size_t N = 128; 6
7 //初始化生成器
8 cst = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW); 9 assert(CURAND_STATUS_SUCCESS == cst); 10 //動態申請內存
11 cet = cudaMalloc(&arr, sizeof(unsigned int) * N); 12 assert(cudaSuccess == cet); 13 //將N個隨機數生成在數組arr中
14 cst = curandGenerate(gen, arr, N); 15 assert(CURAND_STATUS_SUCCESS == cst);
本例中使用的是XORWOW生成器,同理我們也可以用其他生成器,包括真隨機數生成器,生成其他類型的數、其他類型的分布,同樣也可以簡略來寫:
1 curandGenerator_t gen; //隨機數生成器
2 unsigned long long* uint64_arr; //64位無符號整數數組
3 double* double_arr; //64位浮點數數組
4 const size_t N = 128; 5
6 //初始化生成器
7 assert(CURAND_STATUS_SUCCESS == curandCreateGenerator(&gen, CURAND_RNG_QUASI_SOBOL64)); 8 //動態申請內存
9 assert(cudaSuccess == cudaMalloc(&uint64_arr, sizeof(unsigned long long) * N)); 10 assert(cudaSuccess == cudaMalloc(&double_arr, sizeof(double) * N)); 11 //將隨機數生成在兩個數組中 12 //生成unsigned long long類型的若干個隨機數
13 assert(CURAND_STATUS_SUCCESS == curandGenerateLongLong(gen, uint64_arr, N)); 14 //生成符合μ=0、σ²=1標准正態分布的double類型的若干個隨機數
15 assert(CURAND_STATUS_SUCCESS == curandGenerateNormalDouble(gen, double_arr, N, 0.0, 1.0));
除此之外,curand庫依然有很多生成不同類型隨機數的函數,其中常用的函數原型如下:
1 curandStatus_t CURANDAPI curandGenerate ( curandGenerator_t generator, unsigned int* outputPtr, size_t num ); 2 //Generate 32-bit pseudo or quasirandom numbers.
3
4 curandStatus_t CURANDAPI curandGenerateLogNormal ( curandGenerator_t generator, float* outputPtr, size_t n, float mean, float stddev ); 5 //Generate log-normally distributed floats.
6
7 curandStatus_t CURANDAPI curandGenerateLogNormalDouble ( curandGenerator_t generator, double* outputPtr, size_t n, double mean, double stddev ); 8 //Generate log-normally distributed doubles.
9
10 curandStatus_t CURANDAPI curandGenerateLongLong ( curandGenerator_t generator, unsigned long long* outputPtr, size_t num ); 11 //Generate 64-bit quasirandom numbers.
12
13 curandStatus_t CURANDAPI curandGenerateNormal ( curandGenerator_t generator, float* outputPtr, size_t n, float mean, float stddev ); 14 //Generate normally distributed doubles.
15
16 curandStatus_t CURANDAPI curandGenerateNormalDouble ( curandGenerator_t generator, double* outputPtr, size_t n, double mean, double stddev ); 17 //Generate normally distributed doubles.
18
19 curandStatus_t CURANDAPI curandGeneratePoisson ( curandGenerator_t generator, unsigned int* outputPtr, size_t n, double lambda ); 20 //Generate Poisson-distributed unsigned ints.
21
22 curandStatus_t CURANDAPI curandGenerateUniform ( curandGenerator_t generator, float* outputPtr, size_t num ); 23 //Generate uniformly distributed floats.
24
25 curandStatus_t CURANDAPI curandGenerateUniformDouble ( curandGenerator_t generator, double* outputPtr, size_t num ); 26 //Generate uniformly distributed doubles.
編譯錯誤的處理
需要注意的是,用VS直接創建的cuda工程,直接編譯上述代碼很可能出現一些編譯錯誤:
error LNK2019: 無法解析的外部符號 curandCreateGenerator,函數 main 中引用了該符號 error LNK2019: 無法解析的外部符號 curandGenerate,函數 main 中引用了該符號
有編程經驗的朋友們看到LNK就知道這是鏈接器報錯,肯定是少鏈接了一些文件。沒錯,只需要在項目-屬性-鏈接器-輸入-附加依賴項中添加一項"curand.lib"即可。
curand_kernel庫——device端的動態隨機數生成
也行有人覺得上述方法太過呆板,能生成多少隨機數必須在調用device端函數前指定,實際工程中可能會生成過多的隨機數,占用更多內存,也有可能生成的隨機數不夠,以致訪問無效內存或隨機數從頭循環。
但用第二種思路,使用一些__device__函數動態生成隨機數,則會避免這些問題。
可惜的是,國內國外論壇,甚至包括github,很少介紹和使用curand_kernel庫,真隨機數生成器的初始化方法更是無人介紹,連cuda文檔也沒有給出例子。
不過也好,我的機會來了——本篇將是歷史上第一篇詳細介紹curand_kernel庫使用方法,並給出使用實例的文章(狗頭)。
隨機數生成器
curand_kernel提供了八種隨機數生成器,並分別封裝為結構體:
1 //偽隨機數生成器原型
2 typedef struct curandStateMtgp32 curandStateMtgp32_t; 3 typedef struct curandStatePhilox4_32_10 curandStatePhilox4_32_10_t; 4 typedef struct curandStateXORWOW curandStateXORWOW_t; 5 typedef struct curandStateMRG32k3a curandStateMRG32k3a_t; 6 //真隨機數生成器原型
7 typedef struct curandStateSobol32 curandStateSobol32_t; 8 typedef struct curandStateScrambledSobol32 curandStateScrambledSobol32_t; 9 typedef struct curandStateSobol64 curandStateSobol64_t; 10 typedef struct curandStateScrambledSobol64 curandStateScrambledSobol64_t;
名字都很眼熟吧,沒錯,前四個依舊是偽隨機數生成器,后四個依舊是真隨機數生成器。除了使用了MTGP32算法的curandStateMtgp32_t生成器外,其他七個生成器都可以用__device__函數——curand_init()初始化。
不過,curand_kernel庫的生成器類型並不和curand庫的生成器一樣是枚舉類型,而是定義成了不同的結構體,甚至沒使用繼承或共用體。這也暗示了不同生成器的初始化需要傳入不同的參數。
MRG32K3A、Philox4_32_10與XORWOW生成器的初始化
除去MTGP32生成器外,剩下的三個生成器的初始化方法非常相似:
1 __device__ void curand_init ( unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStateMRG32k3a_t* state ) 2 //Initialize MRG32k3a state.
3
4 __device__ void curand_init ( unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStatePhilox4_32_10_t* state ) 5 //Initialize Philox4_32_10 state.
6
7 __device__ void curand_init ( unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStateXORWOW_t* state ) 8 //Initialize XORWOW state.
以XORWOW為例,我們可以在核函數中這么寫來初始化多個curandStateXORWOW_t:
1 __global__ void initialize_generator(curandStateXORWOW_t* states, unsigned long long seed, size_t size) { 2 size_t Idx = threadIdx.x + blockDim.x * blockIdx.x; 3 if(Idx >= size) return; 4
5 curand_init(seed, Idx, 0, &states[Idx]); 6 } 7
8 //調用參數准備
9 curandStateXORWOW_t* states; 10 size_t size = 640 * 360; 11 assert(cudaSuccess == cudaMalloc(&states, sizeof(curandStateXORWOW_t) * size)); 12 //調用運行時參數准備
13 size_t thread_count = 1024; 14 size_t block_count = (size - 1) / thread_count + 1; 15 //調用核函數
16 initialize_generator<<<block_count, thread_count>>>(states, time(nullptr), size); 17 cudaDeviceSynchronize(); 18 assert(cudaSuccess == cudaGetLastError());
MTGP32生成器的初始化
而MTGP32生成器的初始化,需要包含curand_mtgp32_host.h庫。庫中有兩個函數:
1 __host__ __forceinline__ curandStatus_t curandMakeMTGP32Constants ( const mtgp32_params_fast_t params[], mtgp32_kernel_params_t* p ) 2 //Set up constant parameters for the mtgp32 generator.
3 __host__ __forceinline__ curandStatus_t CURANDAPI curandMakeMTGP32KernelState ( curandStateMtgp32_t* s, mtgp32_params_fast_t params[], mtgp32_kernel_params_t* k, int n, unsigned long long seed ) 4 //Set up initial states for the mtgp32 generator.
具體初始化方法如下:
1 mtgp32_kernel_params* devKernelParams; 2 curandStateMtgp32_t* states; 3 size_t size = 128; 4
5 assert(cudaSuccess == cudaMalloc(&devKernelParams, sizeof(mtgp32_kernel_params))); 6 assert(cudaSuccess == cudaMalloc(&states, size * sizeof(curandStateMtgp32_t))); 7 assert(CURAND_STATUS_SUCCESS == curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams)); 8 assert(CURAND_STATUS_SUCCESS == curandMakeMTGP32KernelState(states, mtgp32dc_params_fast_11213, devKernelParams, size, time(nullptr)));
真隨機數生成器的初始化
curand_kernel庫中的四個真隨機數生成器的初始化都需要用到方向向量:
1 __device__ void curand_init ( curandDirectionVectors64_t direction_vectors, unsigned long long scramble_c, unsigned long long offset, curandStateScrambledSobol64_t* state ); 2 //Initialize Scrambled Sobol64 state.
3 __device__ void curand_init ( curandDirectionVectors64_t direction_vectors, unsigned long long offset, curandStateSobol64_t* state ); 4 //Initialize Sobol64 state.
5 __device__ void curand_init ( curandDirectionVectors32_t direction_vectors, unsigned int scramble_c, unsigned int offset, curandStateScrambledSobol32_t* state ); 6 //Initialize Scrambled Sobol32 state.
7 __device__ void curand_init ( curandDirectionVectors32_t direction_vectors, unsigned int offset, curandStateSobol32_t* state ); 8 //Initialize Sobol32 state.
curand庫有獲取方向向量的函數:
1 curandStatus_t CURANDAPI curandGetDirectionVectors32 ( curandDirectionVectors32_t* vectors, curandDirectionVectorSet_t set ); 2 //Get direction vectors for 32-bit quasirandom number generation.
3 curandStatus_t CURANDAPI curandGetDirectionVectors64 ( curandDirectionVectors64_t* vectors, curandDirectionVectorSet_t set ); 4 //Get direction vectors for 64-bit quasirandom number generation.
同時有兩個隨機數生成器也需要施加干擾。curand庫也有獲取干擾因子的函數:
1 curandStatus_t CURANDAPI curandGetScrambleConstants32 ( unsigned int** constants ) 2 //Get scramble constants for 32-bit scrambled Sobol'.
3 curandStatus_t CURANDAPI curandGetScrambleConstants64 ( unsigned long long** constants ) 4 //Get scramble constants for 64-bit scrambled Sobol'.
具體的生成方法就比較冗長了,這里以Scrambled Sobol 32為例:
1 __global__ void initialize_generator(curandStateScrambledSobol32_t* states, curandDirectionVectors32_t* devVectors, 2 unsigned int* devScrambleConstants, size_t size) { 3 size_t Idx = threadIdx.x + blockDim.x * blockIdx.x; 4 if(Idx >= size) return; 5
6 curand_init(devVectors[Idx], devScrambleConstants[Idx], Idx * 1024, &states[Idx]); 7 } 8
9 //host端變量
10 curandDirectionVectors32_t* hostVectors; 11 unsigned int* hostScrambleConstants; 12 //device端變量
13 curandStateScrambledSobol32_t* states; 14 curandDirectionVectors32_t* devVectors; 15 unsigned int* devScrambleConstants; 16
17 const size_t size = 128; 18 //獲取方向向量和擾動量
19 assert(CURAND_STATUS_SUCCESS == curandGetDirectionVectors32(&hostVectors, CURAND_SCRAMBLED_DIRECTION_VECTORS_32_JOEKUO6)); 20 assert(CURAND_STATUS_SUCCESS == curandGetScrambleConstants32(&hostScrambleConstants)); 21 //申請device端內存
22 assert(cudaSuccess == cudaMalloc(&states, size * sizeof(curandStateScrambledSobol32_t))); 23 assert(cudaSuccess == cudaMalloc(&devVectors, size * sizeof(curandDirectionVectors32_t))); 24 assert(cudaSuccess == cudaMalloc(&devScrambleConstants, size * sizeof(unsigned int))); 25 //將獲取到的方向向量和擾動量拷貝到device端
26 assert(cudaSuccess == cudaMemcpy(devVectors, hostVectors, size * sizeof(curandDirectionVectors32_t), cudaMemcpyHostToDevice)); 27 assert(cudaSuccess == cudaMemcpy(devScrambleConstants, hostScrambleConstants, size * sizeof(unsigned int), cudaMemcpyHostToDevice)); 28
29 size_t thread_count = 256; 30 size_t block_count = (size - 1) / thread_count + 1; 31 initialize_generator<<<block_count, thread_count>>>(states, devVectors, devScrambleConstants, size); 32 cudaDeviceSynchronize(); 33 assert(cudaSuccess == cudaGetLastError());
由於真隨機數的初始化沒有subsequence參數,所以不同線程隨機數的區分要靠offset來完成,因此curand_init的offset參數被我填了Idx * 1024,這個1024不是固定的,只要足夠大,使得不同線程生成的隨機數互不重合,但又不會太大以至於訪問無效內存即可。
生成器的使用
實際使用隨機數的時候則直接使用curand函數即可,非常方便。curand函數原型如下:
1 __device__ unsigned int curand ( curandStateMtgp32_t* state ) 2 //Return 32-bits of pseudorandomness from a mtgp32 generator.
3
4 __device__ unsigned long long curand ( curandStateScrambledSobol64_t* state ) 5 //Return 64-bits of quasirandomness from a scrambled Sobol64 generator.
6
7 __device__ unsigned long long curand ( curandStateSobol64_t* state ) 8 //Return 64-bits of quasirandomness from a Sobol64 generator.
9
10 __device__ unsigned int curand ( curandStateScrambledSobol32_t* state ) 11 //Return 32-bits of quasirandomness from a scrambled Sobol32 generator.
12
13 __device__ unsigned int curand ( curandStateSobol32_t* state ) 14 //Return 32-bits of quasirandomness from a Sobol32 generator.
15
16 __device__ unsigned int curand ( curandStateMRG32k3a_t* state ) 17 //Return 32-bits of pseudorandomness from an MRG32k3a generator.
18
19 __device__ unsigned int curand ( curandStatePhilox4_32_10_t* state ) 20 //Return 32-bits of pseudorandomness from an Philox4_32_10 generator.
21
22 __device__ unsigned int curand ( curandStateXORWOW_t* state ) 23 //Return 32-bits of pseudorandomness from an XORWOW generator.
依舊以XORWOW為例,我們來使用上述代碼生成若干可以被2或3整除的隨機數:
1 __global__ void use_generator(curandStateXORWOW_t* states, unsigned int* arr, size_t size) { 2 size_t Idx = threadIdx.x + blockDim.x * blockIdx.x; 3 if(Idx >= size) return; 4
5 arr[Idx] = curand(&states[Idx]); 6 while((arr[Idx] % 2 != 0) && (arr[Idx] % 3 != 0)) { 7 arr[Idx] = curand(&states[Idx]); 8 } 9 } 10
11 unsigned int* arr; 12 assert(cudaSuccess == cudaMalloc(&arr, sizeof(unsigned int) * size)); 13 use_generator<<<block_count, thread_count>>>(states, arr, size); 14 cudaDeviceSynchronize(); 15 assert(cudaSuccess == cudaGetLastError());
除curand()函數外,還有若干其他的生成函數,下例代碼我們用MTGP32生成器生成區間內均勻分布的double型變量:
1 __global__ void use_generator(curandStateMtgp32_t* states, double* arr, size_t size) { 2 size_t Idx = threadIdx.x + blockDim.x * blockIdx.x; 3 if(Idx >= size) return; 4
5 arr[Idx] = curand_uniform_double(&states[Idx]); 6 } 7
8 double* arr; 9 assert(cudaSuccess == cudaMalloc(&arr, sizeof(double) * size)); 10 size_t thread_count = 256; 11 size_t block_count = (size - 1) / thread_count + 1; 12 use_generator<<<block_count, thread_count>>>(states, arr, size); 13 cudaDeviceSynchronize(); 14 assert(cudaSuccess == cudaGetLastError());
其他常用的分布函數如下(注意,下文中gen_type可被替換為任何一種curand_kernel庫的生成器類型):
1 __device__ float curand_log_normal ( gen_type* state, float mean, float stddev ); 2 //Return a log-normally distributed float
3
4 __device__ double curand_log_normal_double ( gen_type* state, double mean, double stddev ); 5 //Return a log-normally distributed double
6
7 __device__ float curand_normal ( gen_type* state ); 8 //Return a normally distributed float
9
10 __device__ double curand_normal_double ( gen_type* state ); 11 //Return a normally distributed double
12
13 __device__ unsigned int curand_poisson ( gen_type* state, double lambda ); 14 //Return a Poisson-distributed unsigned int
15
16 __device__ float curand_uniform ( gen_type* state ); 17 //Return a uniformly distributed float
18
19 __device__ double curand_uniform_double ( gen_type* state ); 20 //Return a uniformly distributed double