1. deviceQuery is very important: its output is a direct guide for the block/grid settings and memory-hierarchy choices you face when programming.
deviceQuery is actually one of the CUDA samples and must be compiled before use. It is located under /opt/cuda/cuda70/NVIDIA_CUDA-7.0_Samples, or possibly in the local cuda folder (I am not certain which).
Because these are read-only files, copy them into your home directory; since the build uses files from NVIDIA_CUDA-7.0_Samples/common, copy the whole NVIDIA_CUDA-7.0_Samples directory.
Run make, and you get the deviceQuery executable.
For any GPU programming task, I recommend that the first step be compiling and running deviceQuery.
One field in the output I did not understand was the compute mode (I found an explanation in the nvidia-smi manual):
the compute mode determines whether multiple programs are allowed to use the GPU at the same time.
Compute Mode: The compute mode flag indicates whether individual or multiple compute applications may run on the GPU.
"Default" means multiple contexts are allowed per device.
"Exclusive Thread" means only one context is allowed per device, usable from one thread at a time.
"Exclusive Process" means only one context is allowed per device, usable from multiple threads at a time.
"Prohibited" means no contexts are allowed per device (no compute apps).
"EXCLUSIVE_PROCESS" was added in CUDA 4.0. Prior CUDA releases supported only one exclusive mode, which is equivalent to "EXCLUSIVE_THREAD" in CUDA 4.0 and beyond.
(For all CUDA-capable products.)
2. nvidia-smi has a manual: http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf
nvidia-smi (NVIDIA System Management Interface) is a command-line utility built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
GPU configuration options (such as ECC memory capability) may be enabled and disabled with it.
The nvidia-smi command is installed along with the driver, so it is already available.
nvidia-smi -i 0 -q displays all the information for a device (-i selects the GPU index).
nvidia-smi -h shows the help.
3. Using cudaGetDeviceProperties()
deviceQuery actually calls cudaGetDeviceProperties() and prints each piece of information in turn.
For example (program plus result):
// Requires <stdio.h> and cuda_runtime.h (for cudaDeviceProp).
void PrintDeviceProperties(cudaDeviceProp devProp)
{
    FILE *deviceProperties = fopen("DeviceProperties.txt", "a+");
    if (deviceProperties == NULL)
        return;
    fprintf(deviceProperties, "Major revision number: %d\n", devProp.major);
    fprintf(deviceProperties, "Minor revision number: %d\n", devProp.minor);
    fprintf(deviceProperties, "Name: %s\n", devProp.name);
    // size_t fields are printed with %zu, not %u
    fprintf(deviceProperties, "Total global memory: %zu\n", devProp.totalGlobalMem);
    fprintf(deviceProperties, "Total shared memory per block: %zu\n", devProp.sharedMemPerBlock);
    fprintf(deviceProperties, "Total registers per block: %d\n", devProp.regsPerBlock);
    fprintf(deviceProperties, "Warp size: %d\n", devProp.warpSize);
    fprintf(deviceProperties, "Maximum memory pitch: %zu\n", devProp.memPitch);
    fprintf(deviceProperties, "Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
    for (int i = 0; i < 3; ++i)
        fprintf(deviceProperties, "Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
    for (int i = 0; i < 3; ++i)
        fprintf(deviceProperties, "Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
    fprintf(deviceProperties, "Clock rate: %d\n", devProp.clockRate);
    fprintf(deviceProperties, "Total constant memory: %zu\n", devProp.totalConstMem);
    fprintf(deviceProperties, "Texture alignment: %zu\n", devProp.textureAlignment);
    fprintf(deviceProperties, "Concurrent copy and execution: %s\n", devProp.deviceOverlap ? "Yes" : "No");
    fprintf(deviceProperties, "Number of multiprocessors: %d\n", devProp.multiProcessorCount);
    fprintf(deviceProperties, "Kernel execution timeout: %s\n", devProp.kernelExecTimeoutEnabled ? "Yes" : "No");
    fclose(deviceProperties);
}
And the result is as follows:
Major revision number: 2
Minor revision number: 0
Name: Tesla C2075
Total global memory: 1341849600
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1147000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: No
