Blog link: https://blog.csdn.net/sinat_23619409/article/details/84202651

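1. Installing CUDA

Download the toolkit from https://developer.nvidia.com/cuda-toolkit. During installation, be sure to choose the custom install; otherwise many unneeded components will be installed. Among the options, you can choose not to update the display driver.

Alternatively, download the offline installer.

Run the installer and select the custom installation.

After installation, the CUDA-related NVIDIA programs are as shown in the figure below.

Note: do not check Nsight Visual Studio Edition 2019.2 or similar components that you do not need.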

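2. Verifying the installation

Open cmd and enter nvcc --version to see the version number.

Enter set cuda to list the environment variables whose names begin with "cuda", such as CUDA_PATH, which the installer sets.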
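The same check can be made from inside a program. The following is a minimal sketch that reads an environment variable with std::getenv and returns a placeholder when it is not set. CUDA_PATH is the variable the project settings later in this post refer to as $(CUDA_PATH); the helper name env_or and the fallback text are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of the environment variable `name`,
// or `fallback` when the variable is not set.
std::string env_or(const char* name, const std::string& fallback) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value != nullptr ? std::string(value) : fallback;
}
```

For example, env_or("CUDA_PATH", "<not set>") should return the toolkit root on a machine where the installer set the variable, and "<not set>" otherwise.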

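3. Running the bundled official demos

Search for "Browse CUDA Samples" from the Start menu or taskbar search. The samples are usually located in C:\ProgramData\NVIDIA Corporation\CUDA Samples

Before the samples are built, the Debug folder contains only three files, as shown in the figure.

After a successful build, many files are generated in this location (see the figure above for the exact path). Find deviceQueryDrv.exe among them, drag it into a cmd window, and press Enter to run it.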

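4. Configuring your own CUDA project

(1) Open VS2017 and create an empty Win32 project named cuda_test.

(2) Right-click cuda_test -> Build Dependencies -> Build Customizations, and check CUDA 10.1.

(3) Right-click the Source Files folder -> Add -> New Item -> select CUDA C/C++ File, and name it cuda_main.

(4) Open the properties of cuda_main.cu and, under Configuration Properties -> General -> Item Type, select "CUDA C/C++".

Note: all project property settings in the following steps apply to the x64 platform.

(5) Include directories

Right-click the project -> Properties -> Configuration Properties -> VC++ Directories -> Include Directories

Add the include directory: $(CUDA_PATH)\include

(6) Library directories

In the same VC++ Directories page, add the library directory: $(CUDA_PATH)\lib\x64

(7) Dependencies

Configuration Properties -> Linker -> Input -> Additional Dependencies

Add the library files: cublas.lib;cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;OpenCL.lib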
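When the linker reports that it cannot open one of these .lib files, it helps to write out what the settings in steps (5) to (7) resolve to. The sketch below derives the two directories from a toolkit root and expands a semicolon-separated dependency list into full paths. The names CudaDirs, cuda_dirs and library_paths are made up for illustration, and the root is passed in as a parameter rather than read from $(CUDA_PATH).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Directories that steps (5) and (6) add, derived from the toolkit root
// (the value that $(CUDA_PATH) expands to).
struct CudaDirs {
    std::string include_dir;  // $(CUDA_PATH)\include
    std::string lib_dir;      // $(CUDA_PATH)\lib\x64
};

CudaDirs cuda_dirs(const std::string& cuda_root) {
    assert(!cuda_root.empty());
    return CudaDirs{cuda_root + "\\include", cuda_root + "\\lib\\x64"};
}

// Splits the semicolon-separated "Additional Dependencies" value of step (7)
// and prefixes each library name with the library directory.
std::vector<std::string> library_paths(const std::string& lib_dir,
                                       const std::string& dependencies) {
    std::vector<std::string> paths;
    std::string::size_type start = 0;
    while (start <= dependencies.size()) {
        std::string::size_type end = dependencies.find(';', start);
        if (end == std::string::npos) end = dependencies.size();
        if (end > start)  // skip empty entries such as a trailing ';'
            paths.push_back(lib_dir + "\\" + dependencies.substr(start, end - start));
        start = end + 1;
    }
    return paths;
}
```

With a hypothetical root of C:\CUDA, cuda_dirs gives C:\CUDA\include and C:\CUDA\lib\x64, and library_paths turns "cublas.lib;cudart.lib" into the two full paths under that library directory.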

The code of cuda_main.cu is as follows:

 
    #include "cuda_runtime.h"
    #include "cublas_v2.h"

    #include <cstdio>    // getchar
    #include <cstdlib>   // malloc, free, rand, EXIT_FAILURE
    #include <iostream>

    using namespace std;

    // Dimensions of the test matrices
    int const M = 5;
    int const N = 10;

    int main()
    {
        // Status variable
        cublasStatus_t status;

        // Allocate host memory for the input matrices
        float *h_A = (float *)malloc(N * M * sizeof(float));
        float *h_B = (float *)malloc(N * M * sizeof(float));

        // Allocate host memory for the result matrix
        float *h_C = (float *)malloc(M * M * sizeof(float));

        // Fill the input matrices with random integers in the range 1-10
        for (int i = 0; i < N * M; i++) {
            h_A[i] = (float)(rand() % 10 + 1);
            h_B[i] = (float)(rand() % 10 + 1);
        }

        // Print the input matrices: A as M rows of N values, B as N rows of M values
        cout << "Matrix A:" << endl;
        for (int i = 0; i < N * M; i++) {
            cout << h_A[i] << " ";
            if ((i + 1) % N == 0) cout << endl;
        }
        cout << endl;

        cout << "Matrix B:" << endl;
        for (int i = 0; i < N * M; i++) {
            cout << h_B[i] << " ";
            if ((i + 1) % M == 0) cout << endl;
        }
        cout << endl;

        /*
        ** Matrix multiplication on the GPU
        */

        // Create and initialize the cuBLAS handle
        cublasHandle_t handle;
        status = cublasCreate(&handle);
        if (status != CUBLAS_STATUS_SUCCESS)
        {
            if (status == CUBLAS_STATUS_NOT_INITIALIZED) {
                cout << "Failed to create the cuBLAS handle" << endl;
            }
            getchar();
            return EXIT_FAILURE;
        }

        float *d_A, *d_B, *d_C;

        // Allocate device memory for the input matrices
        cudaMalloc(
            (void **)&d_A,          // receives the pointer to the allocated memory
            N * M * sizeof(float)   // number of bytes to allocate
        );
        cudaMalloc((void **)&d_B, N * M * sizeof(float));

        // Allocate device memory for the result matrix
        cudaMalloc((void **)&d_C, M * M * sizeof(float));

        // Copy the matrix data into the allocated device memory
        cublasSetVector(
            N * M,          // number of elements to copy
            sizeof(float),  // size of each element in bytes
            h_A,            // host source address
            1,              // stride between consecutive source elements
            d_A,            // device destination address
            1               // stride between consecutive destination elements
        );
        cublasSetVector(N * M, sizeof(float), h_B, 1, d_B, 1);

        // Wait for the device to finish
        cudaDeviceSynchronize();

        // Scalars for the multiplication; see the cuBLAS manual for their meaning
        float a = 1;
        float b = 0;

        // Matrix multiplication. cuBLAS always interprets arrays as column-major.
        cublasSgemm(
            handle,       // cuBLAS handle
            CUBLAS_OP_T,  // operation applied to A (transpose)
            CUBLAS_OP_T,  // operation applied to B (transpose)
            M,            // number of rows of op(A) and C
            M,            // number of columns of op(B) and C
            N,            // number of columns of op(A) and rows of op(B)
            &a,           // alpha
            d_A,          // device address of A
            N,            // lda
            d_B,          // device address of B
            M,            // ldb
            &b,           // beta
            d_C,          // device address of C (the result matrix)
            M             // ldc
        );

        // Wait for the device to finish
        cudaDeviceSynchronize();

        // Copy the result from device memory back to host memory
        cublasGetVector(
            M * M,          // number of elements to copy
            sizeof(float),  // size of each element in bytes
            d_C,            // device source address
            1,              // stride between consecutive source elements
            h_C,            // host destination address
            1               // stride between consecutive destination elements
        );

        // Print the result
        cout << "Transpose of the result ( transpose of (A*B) ):" << endl;
        for (int i = 0; i < M * M; i++) {
            cout << h_C[i] << " ";
            if ((i + 1) % M == 0) cout << endl;
        }

        // Free the memory that was used
        free(h_A);
        free(h_B);
        free(h_C);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);

        // Release the cuBLAS handle
        cublasDestroy(handle);

        getchar();
        return 0;
    }
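Why does the program print the transpose of A*B? cuBLAS reads every buffer as column-major. The host arrays are filled and printed row by row, so cuBLAS sees each one as the transpose of the printed matrix, and CUBLAS_OP_T turns it back. The product A*B is then written to d_C column by column, and printing h_C row by row shows its transpose. The following CPU sketch mirrors that behavior for checking results; it is a plain reference loop, not how cuBLAS computes the product, it assumes alpha = 1 and beta = 0 as in the listing, and the name gemm_tt_reference is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the cublasSgemm call in the listing.
// a holds A (m x k) and b holds B (k x m), both stored row by row like h_A and h_B.
// Returns C = A * B stored column by column, which is how cuBLAS writes d_C.
std::vector<float> gemm_tt_reference(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t m, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * m);
    std::vector<float> c(m * m, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * m + j];
            c[j * m + i] = sum;  // column-major: element (i, j) lives at j * m + i
        }
    }
    return c;
}
```

Called with the contents of h_A and h_B, m = M and k = N, the returned vector should match h_C element by element, up to floating-point rounding.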

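5. Creating a project from the VS template

Open VS 2017. An "NVIDIA/CUDA 10.1" entry now appears under the templates in VS2017.

Simply create a new CUDA 10.1 Runtime project.

Right-click the project -> Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories, and add the following directories:

The sample code is as follows: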

 
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #include <stdio.h>

    int main() {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        int dev;
        for (dev = 0; dev < deviceCount; dev++)
        {
            int driver_version(0), runtime_version(0);
            cudaDeviceProp deviceProp;
            cudaGetDeviceProperties(&deviceProp, dev);

            // Old CUDA toolkits reported compute capability 9999.9999
            // when no CUDA-capable device was present
            if (dev == 0)
                if (deviceProp.minor == 9999 && deviceProp.major == 9999)
                    printf("\n");

            printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);

            // Versions are encoded as 1000 * major + 10 * minor
            cudaDriverGetVersion(&driver_version);
            printf("CUDA driver version:                            %d.%d\n",
                   driver_version / 1000, (driver_version % 1000) / 10);
            cudaRuntimeGetVersion(&runtime_version);
            printf("CUDA runtime version:                           %d.%d\n",
                   runtime_version / 1000, (runtime_version % 1000) / 10);

            printf("Device compute capability:                      %d.%d\n",
                   deviceProp.major, deviceProp.minor);

            // Memory sizes are size_t, so they are printed with %zu
            printf("Total amount of Global Memory:                  %zu bytes\n",
                   deviceProp.totalGlobalMem);
            printf("Number of SMs:                                  %d\n",
                   deviceProp.multiProcessorCount);
            printf("Total amount of Constant Memory:                %zu bytes\n",
                   deviceProp.totalConstMem);
            printf("Total amount of Shared Memory per block:        %zu bytes\n",
                   deviceProp.sharedMemPerBlock);
            printf("Total number of registers available per block:  %d\n",
                   deviceProp.regsPerBlock);
            printf("Warp size:                                      %d\n",
                   deviceProp.warpSize);
            printf("Maximum number of threads per SM:               %d\n",
                   deviceProp.maxThreadsPerMultiProcessor);
            printf("Maximum number of threads per block:            %d\n",
                   deviceProp.maxThreadsPerBlock);
            printf("Maximum size of each dimension of a block:      %d x %d x %d\n",
                   deviceProp.maxThreadsDim[0],
                   deviceProp.maxThreadsDim[1],
                   deviceProp.maxThreadsDim[2]);
            printf("Maximum size of each dimension of a grid:       %d x %d x %d\n",
                   deviceProp.maxGridSize[0],
                   deviceProp.maxGridSize[1],
                   deviceProp.maxGridSize[2]);
            printf("Maximum memory pitch:                           %zu bytes\n",
                   deviceProp.memPitch);
            printf("Texture alignment:                              %zu bytes\n",
                   deviceProp.texturePitchAlignment);

            // clockRate and memoryClockRate are reported in kHz
            printf("Clock rate:                                     %.2f GHz\n",
                   deviceProp.clockRate * 1e-6f);
            printf("Memory Clock rate:                              %.0f MHz\n",
                   deviceProp.memoryClockRate * 1e-3f);
            printf("Memory Bus Width:                               %d-bit\n",
                   deviceProp.memoryBusWidth);
        }

        return 0;
    }
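The expressions driver_version / 1000 and (driver_version % 1000) / 10 in the sample work because cudaDriverGetVersion and cudaRuntimeGetVersion return the version as a single integer, 1000 * major + 10 * minor. A minimal sketch of that decoding as a standalone helper follows; the names CudaVersion and decode_cuda_version are made up for illustration and are not part of the CUDA API.

```cpp
#include <cassert>

// Major/minor pair decoded from the integer returned by cudaDriverGetVersion
// or cudaRuntimeGetVersion (encoded as 1000 * major + 10 * minor).
struct CudaVersion {
    int major_version;
    int minor_version;
};

CudaVersion decode_cuda_version(int encoded) {
    assert(encoded >= 0);
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}
```

For the CUDA 10.1 toolkit used in this post the runtime reports 10010, which decodes to major 10 and minor 1.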

