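PyCUDA gives Python access to NVIDIA's CUDA parallel computing API.
For an introduction and installation instructions, see the official PyCUDA documentation and the pycuda page on PyPI.
This article covers:
- Querying GPU information with PyCUDA.
- Converting between NumPy arrays and gpuarray objects.
- Basic arithmetic with gpuarray.
- Elementwise operations with ElementwiseKernel.
- Reduce-style operations with InclusiveScanKernel and ReductionKernel.
The examples were run in a Jupyter Notebook on a machine with a GPU, with the following packages imported: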
import sys
from time import time
from functools import reduce

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
from IPython.core.interactiveshell import InteractiveShell

import pycuda
import pycuda.autoinit  # initializes CUDA and creates a context on import
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.scan import InclusiveScanKernel
from pycuda.reduction import ReductionKernel

# Display the value of every expression in a cell, not only the last one.
InteractiveShell.ast_node_interactivity = "all"
print(f'The version of PyCUDA: {pycuda.VERSION}')
print(f'The version of Python: {sys.version}')
Output:
The version of PyCUDA: (2019, 1, 2)
The version of Python: 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16) [GCC 7.3.0]
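Querying GPU information
Querying the GPU is a very basic operation. The most commonly needed details are the device name, the amount of device memory, and the number of cores.
Define the function: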
def query_device():
    drv.init()
    print('CUDA device query (PyCUDA version) \n')
    print(f'Detected {drv.Device.count()} CUDA Capable device(s) \n')
    for i in range(drv.Device.count()):

        gpu_device = drv.Device(i)
        print(f'Device {i}: {gpu_device.name()}')
        compute_capability = float('%d.%d' % gpu_device.compute_capability())
        print(f'\t Compute Capability: {compute_capability}')
        print(f'\t Total Memory: {gpu_device.total_memory()//(1024**2)} megabytes')

        # The following will give us all remaining device attributes as seen
        # in the original deviceQuery.
        # We set up a dictionary as such so that we can easily index
        # the values using a string descriptor.

        device_attributes_tuples = gpu_device.get_attributes().items()
        device_attributes = {}

        for k, v in device_attributes_tuples:
            device_attributes[str(k)] = v

        num_mp = device_attributes['MULTIPROCESSOR_COUNT']

        # Cores per multiprocessor is not reported by the GPU!
        # We must use a lookup table based on compute capability.
        # See the following:
        # http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

        cuda_cores_per_mp = {5.0: 128, 5.1: 128, 5.2: 128,
                             6.0: 64, 6.1: 128, 6.2: 128}.get(compute_capability)

        if cuda_cores_per_mp is None:
            # The table only covers compute capability 5.x and 6.x;
            # report "unknown" instead of raising KeyError on other GPUs.
            print(f'\t ({num_mp}) Multiprocessors, CUDA Cores / Multiprocessor: unknown')
        else:
            print(f'\t ({num_mp}) Multiprocessors, ({cuda_cores_per_mp}) CUDA Cores / Multiprocessor: {num_mp*cuda_cores_per_mp} CUDA Cores')

        device_attributes.pop('MULTIPROCESSOR_COUNT')

        for k in device_attributes.keys():
            print(f'\t {k}: {device_attributes[k]}')
Calling query_device() runs the query and prints:
CUDA device query (PyCUDA version)

Detected 1 CUDA Capable device(s)

Device 0: Tesla P100-PCIE-16GB
    Compute Capability: 6.0
    Total Memory: 16280 megabytes
    (56) Multiprocessors, (64) CUDA Cores / Multiprocessor: 3584 CUDA Cores
    ASYNC_ENGINE_COUNT: 2
    CAN_MAP_HOST_MEMORY: 1
    CLOCK_RATE: 1328500
    COMPUTE_CAPABILITY_MAJOR: 6
    COMPUTE_CAPABILITY_MINOR: 0
    COMPUTE_MODE: DEFAULT
    CONCURRENT_KERNELS: 1
    ECC_ENABLED: 1
    GLOBAL_L1_CACHE_SUPPORTED: 1
    GLOBAL_MEMORY_BUS_WIDTH: 4096
    GPU_OVERLAP: 1
    INTEGRATED: 0
    KERNEL_EXEC_TIMEOUT: 0
    L2_CACHE_SIZE: 4194304
    LOCAL_L1_CACHE_SUPPORTED: 1
    MANAGED_MEMORY: 1
    MAXIMUM_SURFACE1D_LAYERED_LAYERS: 2048
    MAXIMUM_SURFACE1D_LAYERED_WIDTH: 32768
    MAXIMUM_SURFACE1D_WIDTH: 32768
    MAXIMUM_SURFACE2D_HEIGHT: 65536
    MAXIMUM_SURFACE2D_LAYERED_HEIGHT: 32768
    MAXIMUM_SURFACE2D_LAYERED_LAYERS: 2048
    MAXIMUM_SURFACE2D_LAYERED_WIDTH: 32768
    MAXIMUM_SURFACE2D_WIDTH: 131072
    MAXIMUM_SURFACE3D_DEPTH: 16384
    MAXIMUM_SURFACE3D_HEIGHT: 16384
    MAXIMUM_SURFACE3D_WIDTH: 16384
    MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS: 2046
    MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH: 32768
    MAXIMUM_SURFACECUBEMAP_WIDTH: 32768
    MAXIMUM_TEXTURE1D_LAYERED_LAYERS: 2048
    MAXIMUM_TEXTURE1D_LAYERED_WIDTH: 32768
    MAXIMUM_TEXTURE1D_LINEAR_WIDTH: 134217728
    MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH: 16384
    MAXIMUM_TEXTURE1D_WIDTH: 131072
    MAXIMUM_TEXTURE2D_ARRAY_HEIGHT: 32768
    MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES: 2048
    MAXIMUM_TEXTURE2D_ARRAY_WIDTH: 32768
    MAXIMUM_TEXTURE2D_GATHER_HEIGHT: 32768
    MAXIMUM_TEXTURE2D_GATHER_WIDTH: 32768
    MAXIMUM_TEXTURE2D_HEIGHT: 65536
    MAXIMUM_TEXTURE2D_LINEAR_HEIGHT: 65000
    MAXIMUM_TEXTURE2D_LINEAR_PITCH: 2097120
    MAXIMUM_TEXTURE2D_LINEAR_WIDTH: 131072
    MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT: 32768
    MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH: 32768
    MAXIMUM_TEXTURE2D_WIDTH: 131072
    MAXIMUM_TEXTURE3D_DEPTH: 16384
    MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE: 32768
    MAXIMUM_TEXTURE3D_HEIGHT: 16384
    MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE: 8192
    MAXIMUM_TEXTURE3D_WIDTH: 16384
    MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE: 8192
    MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS: 2046
    MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH: 32768
    MAXIMUM_TEXTURECUBEMAP_WIDTH: 32768
    MAX_BLOCK_DIM_X: 1024
    MAX_BLOCK_DIM_Y: 1024
    MAX_BLOCK_DIM_Z: 64
    MAX_GRID_DIM_X: 2147483647
    MAX_GRID_DIM_Y: 65535
    MAX_GRID_DIM_Z: 65535
    MAX_PITCH: 2147483647
    MAX_REGISTERS_PER_BLOCK: 65536
    MAX_REGISTERS_PER_MULTIPROCESSOR: 65536
    MAX_SHARED_MEMORY_PER_BLOCK: 49152
    MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 65536
    MAX_THREADS_PER_BLOCK: 1024
    MAX_THREADS_PER_MULTIPROCESSOR: 2048
    MEMORY_CLOCK_RATE: 715000
    MULTI_GPU_BOARD: 0
    MULTI_GPU_BOARD_GROUP_ID: 0
    PCI_BUS_ID: 0
    PCI_DEVICE_ID: 4
    PCI_DOMAIN_ID: 0
    STREAM_PRIORITIES_SUPPORTED: 1
    SURFACE_ALIGNMENT: 512
    TCC_DRIVER: 0
    TEXTURE_ALIGNMENT: 512
    TEXTURE_PITCH_ALIGNMENT: 32
    TOTAL_CONSTANT_MEMORY: 65536
    UNIFIED_ADDRESSING: 1
    WARP_SIZE: 32
Here we found a single GPU device, a Tesla P100-PCIE-16GB, with about 16 GB of device memory (16280 MB) and 3584 CUDA cores.
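The core count in that output is not read from the device. It is derived from the multiprocessor count and a per-architecture lookup table. The following is a minimal sketch of that arithmetic in plain Python, using the same table values as query_device. The name cuda_core_count is hypothetical, and compute capabilities missing from the table yield None rather than a guess.

```python
# Cores per multiprocessor, keyed by (major, minor) compute capability.
# Values copied from the lookup table in query_device; other architectures
# are deliberately left out.
CORES_PER_MP = {
    (5, 0): 128, (5, 1): 128, (5, 2): 128,
    (6, 0): 64, (6, 1): 128, (6, 2): 128,
}


def cuda_core_count(compute_capability, num_mp):
    """Hypothetical helper: total CUDA cores, or None if the capability is not in the table."""
    cores_per_mp = CORES_PER_MP.get(tuple(compute_capability))
    if cores_per_mp is None:
        return None
    return num_mp * cores_per_mp


# Tesla P100 from the query output: compute capability 6.0, 56 multiprocessors.
print(cuda_core_count((6, 0), 56))
```

Keying the table by the (major, minor) tuple avoids comparing floating-point keys, which is an assumption about what is more robust rather than a change the original requires.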
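Converting between NumPy arrays and gpuarray
A GPU has its own memory, separate from the host's main memory. It is called device memory.
NumPy arrays live on the CPU side (the host), while gpuarray objects live on the GPU side (the device). The two often need to be converted into each other, which means transferring data between CPU memory and GPU memory.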
host_data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)   # copy host -> device
device_data_x2 = 2 * device_data           # computed on the GPU
host_data_x2 = device_data_x2.get()        # copy device -> host
print(host_data_x2)
Output:
[ 2. 4. 6. 8. 10.]
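When converting, specify the element type explicitly with dtype whenever possible to avoid unnecessary performance loss. NumPy defaults to float64 for floating-point data, which doubles the amount of data to transfer compared with float32, and double-precision arithmetic is slower than single precision on many GPUs.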
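To put a number on the dtype advice, here is a small sketch that uses only the standard library to compare how much data a host-to-device copy would move for float32 versus float64. The helper name transfer_mebibytes is hypothetical, and the element count of 50 million is simply the array size used in this article's speed test.

```python
import struct

# Size in bytes of one C float (float32) and one C double (float64).
FLOAT32_BYTES = struct.calcsize('f')
FLOAT64_BYTES = struct.calcsize('d')


def transfer_mebibytes(num_elements, bytes_per_element):
    """Hypothetical helper: size of one array copy in MiB."""
    return num_elements * bytes_per_element / (1024 ** 2)


n = 50_000_000
print(f'float32: {transfer_mebibytes(n, FLOAT32_BYTES):.1f} MiB')
print(f'float64: {transfer_mebibytes(n, FLOAT64_BYTES):.1f} MiB')
```

The float64 copy is twice as large, so leaving the dtype at NumPy's default doubles the transfer before any computation starts.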
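Basic arithmetic with gpuarray
Elementwise operations are inherently parallelizable. When performing them, gpuarray automatically uses the GPU's many cores to compute in parallel.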
x_host = np.array([1, 2, 3], dtype=np.float32)
y_host = np.array([1, 1, 1], dtype=np.float32)
z_host = np.array([2, 2, 2], dtype=np.float32)
x_device = gpuarray.to_gpu(x_host)
y_device = gpuarray.to_gpu(y_host)
z_device = gpuarray.to_gpu(z_host)

# Each pair evaluates the same expression on the CPU (NumPy) and on the GPU (gpuarray).
x_host + y_host
(x_device + y_device).get()

x_host ** z_host
(x_device ** z_device).get()

x_host / x_host
(x_device / x_device).get()

z_host - x_host
(z_device - x_device).get()

z_host / 2
(z_device / 2).get()

x_host - 1
(x_device - 1).get()
Output:
array([2., 3., 4.], dtype=float32)
array([2., 3., 4.], dtype=float32)
array([1., 4., 9.], dtype=float32)
array([1., 4., 9.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([ 1., 0., -1.], dtype=float32)
array([ 1., 0., -1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([0., 1., 2.], dtype=float32)
array([0., 1., 2.], dtype=float32)
Performance comparison
def simple_speed_test():
    host_data = np.float32(np.random.random(50000000))

    # Time the multiplication on the CPU.
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()

    print(f'total time to compute on CPU: {t2 - t1}')

    device_data = gpuarray.to_gpu(host_data)

    # Time the same multiplication on the GPU (the transfers are not timed).
    t1 = time()
    device_data_2x = device_data * np.float32(2)
    t2 = time()

    from_device = device_data_2x.get()

    print(f'total time to compute on GPU: {t2 - t1}')
    print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')

simple_speed_test()
On the first run, the output looks something like this:
total time to compute on CPU: 0.14141535758972168
total time to compute on GPU: 2.010883092880249
Is the host computation the same as the GPU computation? : True
On subsequent runs, the output looks something like this:
total time to compute on CPU: 0.1373155117034912
total time to compute on GPU: 0.0006959438323974609
Is the host computation the same as the GPU computation? : True
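This is because PyCUDA usually compiles the GPU code with the nvcc compiler the first time it is run and only then invokes it. That compilation time is the extra overhead seen in the first run.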
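Since the first call carries that one-time cost, a fairer comparison discards it before measuring. The following is a minimal sketch of a warm-up-then-measure helper in plain Python. The name time_after_warmup and its parameters are hypothetical, and the workload is a CPU stand-in so the sketch runs without a GPU. For GPU work, keep in mind that kernel launches are asynchronous, so the timed callable should wait for the result (for example by calling .get()) if the goal is to measure execution rather than launch overhead.

```python
from time import perf_counter


def time_after_warmup(fn, warmup=1, repeats=5):
    """Hypothetical helper: run fn `warmup` times untimed, then return the
    fastest of `repeats` timed runs, in seconds."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as compilation
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return min(timings)


# CPU stand-in workload; replace with a GPU call that ends in .get().
best = time_after_warmup(lambda: sum(range(100_000)))
print(f'best of 5 runs: {best:.6f} s')
```

Taking the minimum of several runs is one common convention for reducing noise. Reporting the mean or median would be an equally reasonable choice.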
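ElementwiseKernel: elementwise operations
Let's first look at Python's built-in map function.
map(function, iterable) calls function on each element of the sequence and returns an iterator over the results (in Python 2, map returns a list). We use list() to turn the iterator into a list so the result can be inspected.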
list(map(lambda x: x + 10, [1, 2, 3, 4, 5]))
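Output:
[11, 12, 13, 14, 15]
ElementwiseKernel is very similar to map.
ElementwiseKernel lets you define a custom kernel for elementwise operations. Using it requires embedding CUDA C code.
A kernel here can be understood simply as a function that CUDA runs directly on the GPU.
Here is the code: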
gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",   # kernel arguments (C declarations)
    "out[i] = 2 * in[i];",     # operation applied at each index i
    "gpu_2x_ker"               # kernel name
)

def elementwise_kernel_example():
    host_data = np.float32(np.random.random(50000000))
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()
    print(f'total time to compute on CPU: {t2 - t1}')

    device_data = gpuarray.to_gpu(host_data)
    # allocate memory for output
    device_data_2x = gpuarray.empty_like(device_data)

    t1 = time()
    gpu_2x_ker(device_data, device_data_2x)
    t2 = time()
    from_device = device_data_2x.get()
    print(f'total time to compute on GPU: {t2 - t1}')
    print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')

elementwise_kernel_example()
elementwise_kernel_example()
elementwise_kernel_example()
elementwise_kernel_example()
elementwise_kernel_example()
Output:
total time to compute on CPU: 0.13545799255371094
total time to compute on GPU: 0.4059629440307617
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.13948774337768555
total time to compute on GPU: 0.0001266002655029297
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.1357274055480957
total time to compute on GPU: 0.0001552104949951172
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.13451647758483887
total time to compute on GPU: 0.0001761913299560547
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.1362597942352295
total time to compute on GPU: 0.00011849403381347656
Is the host computation the same as the GPU computation? : True
Again, the first run shows the overhead of nvcc compilation.
The parameters of ElementwiseKernel:
class pycuda.elementwise.ElementwiseKernel(arguments, operation, name="kernel", keep=False, options=[], preamble="")
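- arguments: the kernel's argument list, written as C declarations.
- operation: the embedded CUDA C code that the kernel runs for each element. The index variable i is supplied by PyCUDA.
- name: the name of the kernel.
gpuarray.empty_like allocates device memory with the same shape and dtype as device_data.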
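Conceptually, the operation string is the body of a loop over the index i, with each index handled by its own GPU thread. The plain-Python sketch below mirrors the semantics of gpu_2x_ker on the CPU. The function name gpu_2x_reference is hypothetical, and this is only an illustration of what the kernel computes, not of how PyCUDA executes it.

```python
def gpu_2x_reference(in_values, out_values):
    """CPU stand-in for the kernel body "out[i] = 2 * in[i];" (hypothetical name)."""
    # On the GPU every index i is processed by a separate thread;
    # here the same work is done by a serial loop.
    for i in range(len(in_values)):
        out_values[i] = 2 * in_values[i]


data = [1.0, 2.0, 3.0, 4.0, 5.0]
result = [0.0] * len(data)   # plays the role of gpuarray.empty_like
gpu_2x_reference(data, result)

print(result)
# The same computation expressed with map, as in the earlier analogy.
print(result == list(map(lambda v: 2 * v, data)))
```

Because no iteration reads a value written by another iteration, the loop body can run for all indices at once, which is what makes this pattern a good fit for the GPU.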
Reduce-style operations with InclusiveScanKernel and ReductionKernel
Let's first look at the reduce function in Python's standard functools module.
reduce(lambda x, y : x + y, [1, 2, 3, 4])
Output:
10
Unlike map, reduce applies a binary operation iteratively and produces a single value.
We will use InclusiveScanKernel and ReductionKernel to implement reduce-like operations.
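InclusiveScanKernel
InclusiveScanKernel is similar to reduce, except that it does not produce a single value: its output has the same shape as its input.
For a sum, the output is the sequence of running totals: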
seq = np.array([1, 2, 3, 4], dtype=np.int32)
seq_gpu = gpuarray.to_gpu(seq)
sum_gpu = InclusiveScanKernel(np.int32, "a+b")   # running sum on the GPU
print(sum_gpu(seq_gpu).get())
print(np.cumsum(seq))                            # NumPy reference
Output:
[ 1 3 6 10]
[ 1 3 6 10]
Finding the maximum (the maximum bubbles toward the end):
seq = np.array([1, 100, -3, -10000, 4, 10000, 66, 14, 21], dtype=np.int32)
seq_gpu = gpuarray.to_gpu(seq)
max_gpu = InclusiveScanKernel(np.int32, "a > b ? a : b")   # running maximum
seq_max_bubble = max_gpu(seq_gpu)
print(seq_max_bubble)
print(seq_max_bubble.get()[-1])   # last element is the overall maximum
print(np.max(seq))                # NumPy reference
Output:
[ 1 100 100 100 100 10000 10000 10000 10000]
10000
10000
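For a > b ? a : b, we can picture a traversal from front to back (in reality it runs in parallel): each current element cur is compared with the already updated element before it, and the larger of the two is assigned to cur.
In this way the maximum "bubbles" toward the end, and the answer is simply the last element.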
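The same two scans can be reproduced on the CPU with itertools.accumulate from the standard library, which makes the "running result" behaviour easy to inspect. This is a serial reference sketch with hypothetical variable names, not the parallel algorithm that InclusiveScanKernel uses.

```python
from itertools import accumulate

# Running sum: the CPU counterpart of the scan expression "a+b".
seq_sum = [1, 2, 3, 4]
running_sum = list(accumulate(seq_sum, lambda a, b: a + b))
print(running_sum)

# Running maximum: the CPU counterpart of "a > b ? a : b".
seq_max = [1, 100, -3, -10000, 4, 10000, 66, 14, 21]
running_max = list(accumulate(seq_max, lambda a, b: a if a > b else b))
print(running_max)
print(running_max[-1])   # the overall maximum ends up in the last position
```

Both operators are associative, so the grouping of the pairwise combinations does not change the result. That property is what allows a scan to be evaluated in parallel.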
ReductionKernel
In effect, ReductionKernel behaves like an ElementwiseKernel followed by a parallel scan kernel.
An example that computes the inner product of two vectors:
a_host = np.array([1, 2, 3], dtype=np.float32)
b_host = np.array([4, 5, 6], dtype=np.float32)
print(a_host.dot(b_host))   # NumPy reference

dot_prod = ReductionKernel(np.float32, neutral="0", reduce_expr="a+b",
                           map_expr="x[i]*y[i]", arguments="float *x, float *y")
a_device = gpuarray.to_gpu(a_host)
b_device = gpuarray.to_gpu(b_host)
print(dot_prod(a_device, b_device).get())
Output:
32.0
32.0
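First, map_expr is evaluated for each pair of corresponding elements of the two vectors. The results are then combined with reduce_expr (neutral is the initial value), which yields the inner product of the two vectors.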
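The map-then-reduce structure can be written out on the CPU with functools.reduce, which shows where each of the three pieces (map_expr, reduce_expr, neutral) enters. The function name dot_reference is hypothetical, and this is a serial sketch of the semantics rather than PyCUDA's implementation.

```python
from functools import reduce


def dot_reference(x, y, neutral=0.0):
    """Hypothetical CPU counterpart of the dot_prod ReductionKernel."""
    mapped = [xi * yi for xi, yi in zip(x, y)]           # map_expr: x[i]*y[i]
    return reduce(lambda a, b: a + b, mapped, neutral)   # reduce_expr: a+b


print(dot_reference([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
# With empty inputs only the neutral element remains.
print(dot_reference([], []))
```

The neutral element is the value that leaves the reduction unchanged, which is why 0 is the natural choice for a sum.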
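That wraps up this first look at some basic PyCUDA operations.
Original author: 雨先生
Original link: https://www.cnblogs.com/noluye/p/11465389.html
License: Creative Commons Attribution-NonCommercial 4.0 International License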
References
- Official PyCUDA documentation
- pycuda on PyPI
- "Hands-On GPU Programming with Python and CUDA" by Dr. Brian Tuomanen
- CUDA Study Series (5): Basic GPU Algorithms: Reduce, Scan, Histogram