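This post mainly follows the English original. I have left that text untranslated, since it is easy to follow.
First, my platform:
1. IDE: Visual Studio 2012, C#, .NET Framework 4.5, installed to the default path.
2. Graphics card: NVIDIA GeForce GT 755M (a mobile GPU for laptops). CUDA Toolkit version: cuda_6.5.14_windows_general_64, installed to the default path.
3. managedCUDA version and download link: managedCUDA, by kunzmi, version 15. To be clear, copyright belongs to the original author. My thanks to kunzmi.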
——————————————————————————————————————————————————————————————
Configuring and using managedCUDA with C# and .NET Framework 4.5
I. About managedCuda
ManagedCuda provides intuitive access to the Cuda driver API for any .net language. It is roughly an equivalent of the runtime API (a comfortable wrapper of the driver API for C/C++), but written entirely in C# for .net. In contrast to the runtime API, managedCuda represents CUDA specifics in an object-oriented way: in general you can find a C# class for each Cuda handle in the driver API. For example, instead of a CUContext handle, managedCuda provides a CudaContext class. This design gives simple access to all API calls through corresponding methods on each class. A good example of this wrapping approach is a device variable. In the original Cuda driver API, device variables are plain C pointers. In managedCuda they are represented by the class Cuda[Pitched]DeviceVariable<T>, a generic class that allows type-safe, object-oriented access to the Cuda driver API. Because a CudaDeviceVariable instance knows its wrapped data type, array size, dimensions and, where applicable, its memory alignment pitch, a simple call to CopyToHost(hostArray) is enough. The user does not need to supply all the C-style function arguments; this is done automatically. Furthermore, managedCuda throws specific exceptions when something goes wrong, so you do not need to check API call return values; you only need to catch CudaException like any other exception.
Still, as a developer using managedCuda you need to know Cuda. You must know how to use contexts, how to set kernel launch grid configurations, and so on.
Below I briefly describe the main classes used to implement a fully functional Cuda application in C#:
The CudaContext class: This is one of the three main classes and represents a Cuda context. From Cuda 4.0 on, the Cuda API demands (at least) one context per process per device. So for each device you want to use, you need to create a CudaContext instance. The different constructors let you define several properties, e.g. the deviceID to use. Like nearly all managedCuda classes, CudaContext implements IDisposable, and the wrapped Cuda context is valid until Dispose() is called. CudaContext also defines a number of static methods to retrieve general information about the available Cuda devices. Important for multi-threaded applications: in order to use any Cuda object related to a context, you must activate the CudaContext by calling the SetCurrent() method from the current thread. This holds for all thread switches. (See the Cuda programming guide for more information.)
CudaKernel: Cuda kernels are loaded from cubin or ptx files. You can load a kernel using the LoadKernel…() methods of a CudaContext, either from a byte array representation of the kernel file (e.g. an embedded resource) or by specifying the name of the file where the kernel is stored. You also need the kernel name as defined in the source *.cu file. The LoadKernel methods return a CudaKernel object bound to the given context. CudaKernel does not implement IDisposable, as the kernels are automatically destroyed as soon as the corresponding context is destroyed.
CudaDeviceVariable and its variations: A CudaDeviceVariable object represents allocated memory on the device. The class knows the exact memory layout (array length, array dimension, memory pitch, etc.). As the class is generic, it also knows its element type and type size. All this dramatically simplifies data copying, as no size parameters are needed. Only the source or destination array must be given (either a default C# host array or another device variable). Device memory is freed as soon as the CudaDeviceVariable object is disposed.
With these three main classes one can create an entire Cuda accelerated application in C# using only very few code lines.
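As a rough illustration of how the three classes fit together, here is a minimal vector-add sketch. It is an assumption-laden example, not the author's code: the file name vectorAdd_x64.ptx comes from the setup steps later in this post, while the kernel name "VecAdd" and its parameter order (two input pointers, one output pointer, element count) are hypothetical and must match whatever your own *.cu file declares. The GridSize helper is also introduced here only for illustration.

```csharp
using System;
using ManagedCuda;

public static class VectorAddDemo
{
    // Number of blocks needed so that blocks * threadsPerBlock >= n (rounds up).
    public static int GridSize(int n, int threadsPerBlock)
    {
        return (n + threadsPerBlock - 1) / threadsPerBlock;
    }

    public static void Main()
    {
        const int N = 50000;
        float[] hA = new float[N];
        float[] hB = new float[N];
        for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

        try
        {
            // CudaContext wraps the driver-API context for device 0.
            using (var ctx = new CudaContext(0))
            {
                // Assumed file and kernel names; the kernel is expected to be
                // declared inside extern "C" so its name is not mangled.
                CudaKernel kernel = ctx.LoadKernel("vectorAdd_x64.ptx", "VecAdd");

                int threadsPerBlock = 256;
                kernel.BlockDimensions = threadsPerBlock;
                kernel.GridDimensions = GridSize(N, threadsPerBlock);

                // Each CudaDeviceVariable owns device memory until disposed.
                using (var dA = new CudaDeviceVariable<float>(N))
                using (var dB = new CudaDeviceVariable<float>(N))
                using (var dC = new CudaDeviceVariable<float>(N))
                {
                    dA.CopyToDevice(hA);
                    dB.CopyToDevice(hB);

                    // Argument order must match the (hypothetical) kernel signature.
                    kernel.Run(dA.DevicePointer, dB.DevicePointer, dC.DevicePointer, N);

                    float[] hC = new float[N];
                    dC.CopyToHost(hC);
                    Console.WriteLine("C[10] = {0}", hC[10]);
                }
            }
        }
        catch (CudaException ex)
        {
            // managedCuda reports failed driver calls as exceptions.
            Console.WriteLine("CUDA error: " + ex.Message);
        }
    }
}
```

No size arguments appear in the copy calls because each device variable already knows its length and element type. The kernel itself needs no cleanup, since it is destroyed together with the context.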
Other managedCuda classes:
CudaPagelockedHostMemory: In order to use asynchronous copy methods (host to device or device to host), the host array must be allocated as pinned or page-locked memory. To realize this, CudaPagelockedHostMemory[2D,3D] allocates the memory using Cuda's cuMemHostAlloc. To simplify per-element access, the class provides an index property to get or set single values. When working with large datasets, be aware that every single per-element access crosses the managed/unmanaged memory barrier and must be marshaled. Access is therefore not really fast. To handle large amounts of data, copying a managed array to the unmanaged memory in one block would be faster.
CudaPagelockedHostMemory_[Type]: As the previous approach using generics and marshalling was not satisfying in terms of speed, and direct pointer arithmetic with generics is not possible in C#, I tried something new, which I would call "templates with C#" using T4: a T4 template creates all possible variants like 'float', 'int4', etc., which then access memory directly via pointers. The performance achieved with this approach is close to that of native arrays. In case you want to use CudaPagelockedHostMemory with your own datatypes, simply copy the tt-file to your project and modify the list of types to process (but be aware of the license: managedCUDA is LGPL!).
CudaManagedMemory_[Type]: Using the same approach as for page-locked memory, CudaManagedMemory gives access in .net to the full feature set of managed memory introduced with Cuda 6.0.
CudaRegisteredHostMemory: In C++, registered host memory is normally allocated memory that becomes usable for asynchronous copies once it is registered. But in the .net world this doesn't work as expected: although CudaRegisteredHostMemory is part of managedCuda, it shouldn't be used. Use CudaPagelockedHostMemory instead.
CudaArray[1D,2D,3D]: Represents a CUArray. Either you specify an already existing CUArray as storage location, e.g. from graphics interop, or a new CUArray is created internally. Only if the inner CUArray was allocated by the constructor, it will be freed while disposing.
CudaTextureFoo: Represents a Cuda texture reference. The device memory to bind this texture to can either be created internally by the constructor or passed as an argument. Only if memory is allocated by the constructor it will be freed while disposing.
GraphicsInterop: Several graphics interop resource classes exist, one for every graphics API (DirectX or OpenGL). All these resources must be registered and can be mapped to cuda variables, cuda textures or cuda arrays, depending on their type. For efficient mapping, all resources can be grouped in a CudaGraphicsInteropResourceCollection, so that one single Map() call is enough to finish the task. Have a look at the sample applications to see how to use the collection.
II. Additional libraries:
- CudaFFT: Managed access to cufft*.dll
- CudaRand: Managed access to curand*.dll
- CudaSparse: Managed access to cusparse*.dll
- CudaBlas: Managed access to cublas*.dll
- CudaSolve: Managed access to cusolve*.dll
- NPP: Managed access to npp*.dll
- NVRTC: Managed access to nvrtc*.dll
All libraries have in common that they compile either to 32 or 64 bit in order to handle different wrapped dll names for 32 or 64 bit. They include a basic representation called *NativeMethods to call directly the API functions and wrap handles with C# classes.
CudaBitmapSource is a simple attempt to use Cuda device memory as a BitmapSource in WPF. It is more a proof of concept than a ready-to-use library; in particular, the fact that BitmapSource is a sealed class makes a proper implementation difficult. If you have ideas for improvements or a better design, please let me know ;-)
III. How To: Setup a C# Cuda project using Visual Studio 2010 (Solution 1):
(My Visual Studio is a German edition, some “translated” menu entries might therefore differ slightly from the original English menu entries.)
You need: Microsoft Visual Studio 201x, Nvidia Cuda Toolkit 7.0, Nvidia Parallel Nsight 4.0 for debugging and of course managedCuda. (Note: this post uses CUDA 6.5, so read every 7.0 below as 6.5.)
- Create a normal C# project (ConsoleApplication, Library, WinForms, WPF, etc.). Here I chose a C# console application.
Steps: open the VS IDE > File > New > Project > Visual C# > Console Application, enter "vectorAdd" as the name, and click OK.
- Add a new CudaRuntime 6.5.0 project to the same solution.
Steps: in Solution Explorer, right-click the solution "vectorAdd" and choose Add > New > Project > NVIDIA > CUDA 6.5 > CUDA 6.5 Runtime, enter "vectorAddKernel" as the name, and click OK. You can then rename the automatically created CUDA source file kernel.cu in the new vectorAddKernel project to vectorAdd.cu.
- Delete the Cuda sample code. To enable proper IntelliSense functionality you need to include the following header files in your *.cu file (from the toolkit include folder):
#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include <builtin_types.h>
#include <vector_functions.h>
#include "float.h"
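So that the IDE can find these .h files, add the library and header paths: right-click the project "vectorAddKernel" > Properties > Configuration Properties > VC++ Directories, and set the following in turn:
Include Directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include
Library Directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\lib\x64
Alternatively, you can solve this once and for all with environment variables, so that you no longer need to add library and include directories to every project. Set them up as follows:
After the toolkit is installed, the system has two new environment variables, CUDA_PATH and CUDA_PATH_V6_5. In addition, add the following environment variables:
CUDA_SDK_PATH = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5
CUDA_LIB_PATH = %CUDA_PATH%\lib\x64
CUDA_BIN_PATH = %CUDA_PATH%\bin
CUDA_SDK_BIN_PATH = %CUDA_SDK_PATH%\bin\x64
CUDA_SDK_LIB_PATH = %CUDA_SDK_PATH%\common\lib\x64
Then append the following to the end of the system PATH variable:
;%CUDA_LIB_PATH%;%CUDA_BIN_PATH%;%CUDA_SDK_LIB_PATH%;%CUDA_SDK_BIN_PATH%;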
- Also add the following defines:
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif
- Write your kernel code in an extern "C" {} scope:
//Includes for IntelliSense
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif
#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include "float.h"
#include <builtin_types.h>
#include <vector_functions.h>

// Texture reference
texture<float2, 2> texref;

extern "C" {
    //kernel code
    __global__ void kernel(/* parameters */)
    {
    }
}
- You can also omit extern "C" in order to use templated kernels. Kernel names then get mangled (“_Z18GMMReductionKernelILi4ELb1EEviPfiPK6uchar4iPhiiiPj” instead of “GMMReductionKernel”); to look up the right mangled name, open the compiled ptx file with a text editor. To load a kernel you need the full mangled name.
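Instead of opening the ptx file by hand, the lookup can be scripted. The sketch below is only an illustration and assumes that each kernel appears in the PTX text as a line containing ".entry" followed by its (possibly mangled) name; the class and method names are hypothetical helpers, not part of managedCuda.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PtxEntryPoints
{
    // Returns every kernel entry name declared in the given PTX text.
    public static List<string> List(string ptxText)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(ptxText, @"\.entry\s+([A-Za-z_$%][A-Za-z0-9_$]*)"))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    // Returns the first entry name that contains the unmangled kernel name,
    // or null if there is none. This is a substring match, so a kernel whose
    // name is contained in another kernel's name can give a false hit.
    public static string FindMangled(string ptxText, string kernelName)
    {
        foreach (string name in List(ptxText))
        {
            if (name.Contains(kernelName)) return name;
        }
        return null;
    }
}
```

A possible use is PtxEntryPoints.FindMangled(System.IO.File.ReadAllText(ptxPath), "GMMReductionKernel"), where ptxPath points at your compiled ptx file; the result is the name to pass to the LoadKernel methods.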
- Change the following project properties of the CudaRuntime 6.5 project:
General:
* Output directory: Set it to the source file directory of the C# project, i.e. vectorAdd\vectorAdd. The first vectorAdd is the solution name; the second is the default name of the C# console application project.
* Application type: Utility. This avoids a call to the Visual C++ compiler, no C++ output will be created.
CUDA C/C++:
* Compiler Output: $(OutDir)%(FileName)_x64.ptx or .cubin. Note: the _x64 suffix must be given explicitly here, otherwise the build fails. To build for the 32-bit platform, set the compiler output to $(OutDir)%(FileName)_x86.ptx or .cubin.
* NVCC Compilation Type: “Generate .ptx file (-ptx)” or “Generate .cubin file (-cubin)” respectively. This must match the previous setting.
* Target Machine Platform: 64-bit (--machine 64).
You need to set these properties for all possible targets and configurations (x86/x64, Debug/Release). To handle mixed mode platform kernels, give the kernel files different names for x86 and x64, for example $(OutDir)%(FileName)_x86.ptx and $(OutDir)%(FileName)_x64.ptx.
- Delete the post build event: We don’t need the CUDA runtime libraries copied.
- Build the Cuda project once for each platform. The CUDA project needs the CUDA build customization enabled: right-click the project "vectorAddKernel" > Build Customizations, check CUDA (.targets, .props), and click OK.
- In the C# project, add the newly built kernel files in the C# project source directory to the project.
- Set the file properties either to embedded resource (access files by stream (byte[]) when loading kernel images) or set “copy to output directory” to “always” and load the kernel image from file.
Note: besides adding the vectorAdd_x64.ptx file generated in the previous step to the vectorAdd project (right-click the project "vectorAdd" > Add > Existing Item, select vectorAdd_x64.ptx and add it), you also need to set the file's properties so that the kernel can be read from it: either right-click the file "vectorAdd_x64.ptx" > Properties > Build Action > Embedded Resource, which lets you read the kernel through a stream, or set Copy to Output Directory to "Copy always".
- Add a reference to the managedCuda assembly.
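With the ptx files embedded, loading a kernel might look like the sketch below. It picks the _x86 or _x64 file by the pointer size of the running process, following the naming scheme used above. The helper class, its method names, and the assumption that the resource is named "<default namespace>.<file name>" (the usual Visual Studio behaviour for a file in the project root) are mine, not from the managedCuda documentation.

```csharp
using System;
using System.IO;
using System.Reflection;
using ManagedCuda;

public static class KernelLoader
{
    // Picks the PTX file name that matches the bitness of the process:
    // pointerSize is 8 in a 64-bit process and 4 in a 32-bit one.
    public static string PtxFileName(string baseName, int pointerSize)
    {
        return baseName + (pointerSize == 8 ? "_x64.ptx" : "_x86.ptx");
    }

    // Loads a kernel from a PTX file embedded in the executing assembly.
    // defaultNamespace is assumed to be the C# project's default namespace.
    public static CudaKernel LoadEmbedded(CudaContext ctx, string defaultNamespace,
                                          string baseName, string kernelName)
    {
        string resource = defaultNamespace + "." + PtxFileName(baseName, IntPtr.Size);
        Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resource);
        if (stream == null)
        {
            throw new FileNotFoundException("Embedded PTX resource not found: " + resource);
        }
        using (stream)
        {
            return ctx.LoadKernelPTX(stream, kernelName);
        }
    }
}
```

A call such as KernelLoader.LoadEmbedded(ctx, "vectorAdd", "vectorAdd", "VecAdd") would then work in both 32-bit and 64-bit builds, provided both ptx files are embedded; "VecAdd" stands in for whatever kernel name your *.cu file defines.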
IV. How To: Setup a C# Cuda project using Visual Studio 2010 (Solution 2 from Brian Jimdar)
Using pre-build events:
In the project properties-page of your C# project, add the following pre-build event:
call "%VS100COMNTOOLS%vsvars32.bat"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 64 -o "$(ProjectDir)PTX\%%~na_64.ptx" "$(ProjectDir)Kernels\%%~na.cu"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 32 -o "$(ProjectDir)PTX\%%~na.ptx" "$(ProjectDir)Kernels\%%~na.cu"
This builds an x86 and an x64 version of each file in the .\Kernels directory and writes them to the .\PTX directory.
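V. Solutions to common problems
1. Assembly.GetManifestResourceStream always returns null.
When running or debugging the code, you find that Assembly.GetManifestResourceStream always returns null. Check whether the process is running as 32-bit or 64-bit:
if (System.IntPtr.Size == 4)
    MessageBox.Show("32-bit process");
else if (System.IntPtr.Size == 8)
    MessageBox.Show("64-bit process");
If your operating system is already 64-bit Windows 7 and you still get IntPtr.Size == 4, the reason is that the C# project is set to "Prefer 32-bit". To turn this off: right-click the project "vectorAdd" > Properties > Build > uncheck "Prefer 32-bit".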
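Another frequent cause of a null result is a resource name that does not match exactly: the lookup is case-sensitive, and the name normally includes the project's default namespace and any folder path. A small diagnostic like the following sketch (the class and method names are hypothetical) lists the names that were actually embedded, so you can compare them with the string you pass to GetManifestResourceStream.

```csharp
using System;
using System.Reflection;

public static class ResourceDiagnostics
{
    // True if the assembly contains an embedded resource with exactly this name.
    public static bool HasResource(Assembly assembly, string name)
    {
        return Array.IndexOf(assembly.GetManifestResourceNames(), name) >= 0;
    }

    // Prints the process bitness and every embedded resource name.
    public static void Print(Assembly assembly)
    {
        Console.WriteLine("Process is {0}-bit", IntPtr.Size * 8);
        foreach (string name in assembly.GetManifestResourceNames())
        {
            Console.WriteLine("Embedded resource: " + name);
        }
    }
}
```

Calling ResourceDiagnostics.Print(Assembly.GetExecutingAssembly()) at startup shows at a glance whether the ptx file for the current bitness was embedded at all.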