[GPU Acceleration Series] PyCUDA (1): Getting Started with Basic Operations


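PyCUDA gives Python access to NVIDIA's CUDA parallel computing API.

For a detailed introduction and installation instructions, see the official PyCUDA documentation and the pycuda page on PyPI.

This article covers:

  1. Querying GPU information with PyCUDA.
  2. Converting between NumPy arrays and gpuarray.
  3. Basic arithmetic with gpuarray.
  4. Elementwise operations with ElementwiseKernel.
  5. Scan and reduce operations with InclusiveScanKernel and ReductionKernel.

The examples were run in a Jupyter Notebook on a machine with a GPU, with the following imports: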

import sys
from time import time
from functools import reduce

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
from IPython.core.interactiveshell import InteractiveShell

import pycuda
import pycuda.autoinit  # initializes CUDA and creates a context on import
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.scan import InclusiveScanKernel
from pycuda.reduction import ReductionKernel

# Echo the value of every expression in a cell, not only the last one.
InteractiveShell.ast_node_interactivity = "all"
print(f'The version of PyCUDA: {pycuda.VERSION}')
print(f'The version of Python: {sys.version}')

Output:

The version of PyCUDA: (2019, 1, 2)
The version of Python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0]

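Querying GPU information

Querying the GPU is a basic operation. The most commonly needed details are the device name, the amount of device memory, and the number of cores.

Define the function: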

def query_device():
    drv.init()
    print('CUDA device query (PyCUDA version) \n')
    print(f'Detected {drv.Device.count()} CUDA Capable device(s) \n')
    for i in range(drv.Device.count()):

        gpu_device = drv.Device(i)
        print(f'Device {i}: {gpu_device.name()}')
        compute_capability = float( '%d.%d' % gpu_device.compute_capability() )
        print(f'\t Compute Capability: {compute_capability}')
        print(f'\t Total Memory: {gpu_device.total_memory()//(1024**2)} megabytes')

        # The following will give us all remaining device attributes as seen 
        # in the original deviceQuery.
        # We set up a dictionary as such so that we can easily index
        # the values using a string descriptor.

        device_attributes_tuples = gpu_device.get_attributes().items() 
        device_attributes = {}

        for k, v in device_attributes_tuples:
            device_attributes[str(k)] = v

        num_mp = device_attributes['MULTIPROCESSOR_COUNT']

        # Cores per multiprocessor is not reported by the GPU!  
        # We must use a lookup table based on compute capability.
        # See the following:
        # http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
        # Note: this table only covers compute capabilities 5.0 to 6.2;
        # any other device raises a KeyError here.

        cuda_cores_per_mp = { 5.0 : 128, 5.1 : 128, 5.2 : 128, 6.0 : 64, 6.1 : 128, 6.2 : 128}[compute_capability]

        print(f'\t ({num_mp}) Multiprocessors, ({cuda_cores_per_mp}) CUDA Cores / Multiprocessor: {num_mp*cuda_cores_per_mp} CUDA Cores')

        device_attributes.pop('MULTIPROCESSOR_COUNT')

        for k in device_attributes.keys():
            print(f'\t {k}: {device_attributes[k]}')

Calling query_device() produces:

CUDA device query (PyCUDA version) 

Detected 1 CUDA Capable device(s) 

Device 0: Tesla P100-PCIE-16GB
	 Compute Capability: 6.0
	 Total Memory: 16280 megabytes
	 (56) Multiprocessors, (64) CUDA Cores / Multiprocessor: 3584 CUDA Cores
	 ASYNC_ENGINE_COUNT: 2
	 CAN_MAP_HOST_MEMORY: 1
	 CLOCK_RATE: 1328500
	 COMPUTE_CAPABILITY_MAJOR: 6
	 COMPUTE_CAPABILITY_MINOR: 0
	 COMPUTE_MODE: DEFAULT
	 CONCURRENT_KERNELS: 1
	 ECC_ENABLED: 1
	 GLOBAL_L1_CACHE_SUPPORTED: 1
	 GLOBAL_MEMORY_BUS_WIDTH: 4096
	 GPU_OVERLAP: 1
	 INTEGRATED: 0
	 KERNEL_EXEC_TIMEOUT: 0
	 L2_CACHE_SIZE: 4194304
	 LOCAL_L1_CACHE_SUPPORTED: 1
	 MANAGED_MEMORY: 1
	 MAXIMUM_SURFACE1D_LAYERED_LAYERS: 2048
	 MAXIMUM_SURFACE1D_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACE1D_WIDTH: 32768
	 MAXIMUM_SURFACE2D_HEIGHT: 65536
	 MAXIMUM_SURFACE2D_LAYERED_HEIGHT: 32768
	 MAXIMUM_SURFACE2D_LAYERED_LAYERS: 2048
	 MAXIMUM_SURFACE2D_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACE2D_WIDTH: 131072
	 MAXIMUM_SURFACE3D_DEPTH: 16384
	 MAXIMUM_SURFACE3D_HEIGHT: 16384
	 MAXIMUM_SURFACE3D_WIDTH: 16384
	 MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS: 2046
	 MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACECUBEMAP_WIDTH: 32768
	 MAXIMUM_TEXTURE1D_LAYERED_LAYERS: 2048
	 MAXIMUM_TEXTURE1D_LAYERED_WIDTH: 32768
	 MAXIMUM_TEXTURE1D_LINEAR_WIDTH: 134217728
	 MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH: 16384
	 MAXIMUM_TEXTURE1D_WIDTH: 131072
	 MAXIMUM_TEXTURE2D_ARRAY_HEIGHT: 32768
	 MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES: 2048
	 MAXIMUM_TEXTURE2D_ARRAY_WIDTH: 32768
	 MAXIMUM_TEXTURE2D_GATHER_HEIGHT: 32768
	 MAXIMUM_TEXTURE2D_GATHER_WIDTH: 32768
	 MAXIMUM_TEXTURE2D_HEIGHT: 65536
	 MAXIMUM_TEXTURE2D_LINEAR_HEIGHT: 65000
	 MAXIMUM_TEXTURE2D_LINEAR_PITCH: 2097120
	 MAXIMUM_TEXTURE2D_LINEAR_WIDTH: 131072
	 MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT: 32768
	 MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH: 32768
	 MAXIMUM_TEXTURE2D_WIDTH: 131072
	 MAXIMUM_TEXTURE3D_DEPTH: 16384
	 MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE: 32768
	 MAXIMUM_TEXTURE3D_HEIGHT: 16384
	 MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE: 8192
	 MAXIMUM_TEXTURE3D_WIDTH: 16384
	 MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE: 8192
	 MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS: 2046
	 MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH: 32768
	 MAXIMUM_TEXTURECUBEMAP_WIDTH: 32768
	 MAX_BLOCK_DIM_X: 1024
	 MAX_BLOCK_DIM_Y: 1024
	 MAX_BLOCK_DIM_Z: 64
	 MAX_GRID_DIM_X: 2147483647
	 MAX_GRID_DIM_Y: 65535
	 MAX_GRID_DIM_Z: 65535
	 MAX_PITCH: 2147483647
	 MAX_REGISTERS_PER_BLOCK: 65536
	 MAX_REGISTERS_PER_MULTIPROCESSOR: 65536
	 MAX_SHARED_MEMORY_PER_BLOCK: 49152
	 MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 65536
	 MAX_THREADS_PER_BLOCK: 1024
	 MAX_THREADS_PER_MULTIPROCESSOR: 2048
	 MEMORY_CLOCK_RATE: 715000
	 MULTI_GPU_BOARD: 0
	 MULTI_GPU_BOARD_GROUP_ID: 0
	 PCI_BUS_ID: 0
	 PCI_DEVICE_ID: 4
	 PCI_DOMAIN_ID: 0
	 STREAM_PRIORITIES_SUPPORTED: 1
	 SURFACE_ALIGNMENT: 512
	 TCC_DRIVER: 0
	 TEXTURE_ALIGNMENT: 512
	 TEXTURE_PITCH_ALIGNMENT: 32
	 TOTAL_CONSTANT_MEMORY: 65536
	 UNIFIED_ADDRESSING: 1
	 WARP_SIZE: 32

Here we find a single GPU device, a Tesla P100-PCIE-16GB, with 16 GB of device memory and 3584 CUDA cores (56 multiprocessors × 64 cores each).
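The core count is simply the multiprocessor count multiplied by a per-multiprocessor figure taken from a lookup table. The sketch below isolates that arithmetic in plain Python. The names CORES_PER_MP and total_cuda_cores are hypothetical, the table only repeats the entries used in query_device(), and returning None for an unknown compute capability is an assumption made here for illustration, not something the original function does.

```python
# Hypothetical helper; the table repeats the entries used in query_device().
CORES_PER_MP = {
    (5, 0): 128, (5, 1): 128, (5, 2): 128,
    (6, 0): 64, (6, 1): 128, (6, 2): 128,
}


def total_cuda_cores(num_mp, capability):
    """Return multiprocessors x cores per multiprocessor, or None if unknown."""
    cores_per_mp = CORES_PER_MP.get(tuple(capability))
    if cores_per_mp is None:
        return None  # capability not in the table: do not guess
    return num_mp * cores_per_mp


print(total_cuda_cores(56, (6, 0)))  # Tesla P100 figures from the query output
print(total_cuda_cores(80, (7, 0)))  # capability missing from the table
```

Keying the table on the (major, minor) tuple avoids the round trip through a float, and the None result lets the caller decide how to report a device the table does not know about.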

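Converting between NumPy arrays and gpuarray

A GPU has its own memory, separate from the host's main memory; it is called device memory.

NumPy arrays live on the CPU (host) side, while gpuarray objects live on the GPU (device) side. The two often have to be converted into each other, which means transferring data between the CPU and the GPU.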

host_data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)   # copy host -> device
device_data_x2 = 2 * device_data           # computed on the GPU
host_data_x2 = device_data_x2.get()        # copy device -> host
print(host_data_x2)

Output:

[ 2.  4.  6.  8. 10.]

When converting, specify the element type explicitly with dtype whenever possible, to avoid unnecessary performance overhead.
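One reason the dtype matters can be seen on the host alone: without a dtype, NumPy stores floating-point literals as float64, which takes twice as many bytes to transfer as float32. The small sketch below only uses NumPy; the variable names are illustrative. Double-precision arithmetic is also slower than single precision on many GPUs, although how much depends on the device.

```python
import numpy as np

# No dtype given: NumPy picks float64 for floating-point literals.
default_data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Explicit single precision, as in the example above.
float32_data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

print(default_data.dtype, default_data.nbytes)
print(float32_data.dtype, float32_data.nbytes)
```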

Basic arithmetic with gpuarray

Elementwise operations are naturally parallel, and gpuarray automatically spreads them across the GPU's cores.

x_host = np.array([1, 2, 3], dtype=np.float32)
y_host = np.array([1, 1, 1], dtype=np.float32)
z_host = np.array([2, 2, 2], dtype=np.float32)
x_device = gpuarray.to_gpu(x_host)
y_device = gpuarray.to_gpu(y_host)
z_device = gpuarray.to_gpu(z_host)

x_host + y_host
(x_device + y_device).get()

x_host ** z_host
(x_device ** z_device).get()

x_host / x_host
(x_device / x_device).get()

z_host - x_host
(z_device - x_device).get()

z_host / 2
(z_device / 2).get()

x_host - 1
(x_device - 1).get()

Output:

array([2., 3., 4.], dtype=float32)
array([2., 3., 4.], dtype=float32)
array([1., 4., 9.], dtype=float32)
array([1., 4., 9.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([ 1.,  0., -1.], dtype=float32)
array([ 1.,  0., -1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([0., 1., 2.], dtype=float32)
array([0., 1., 2.], dtype=float32)

Performance comparison

def simple_speed_test():
    host_data = np.float32(np.random.random(50000000))

    # Time the multiplication on the CPU.
    t1 = time()
    host_data_2x =  host_data * np.float32(2)
    t2 = time()

    print(f'total time to compute on CPU: {t2 - t1}')

    # Copy to the GPU before starting the timer, so the transfer is not measured.
    device_data = gpuarray.to_gpu(host_data)

    t1 = time()
    device_data_2x =  device_data * np.float32(2)
    t2 = time()

    from_device = device_data_2x.get()

    print(f'total time to compute on GPU: {t2 - t1}')
    print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')
    
simple_speed_test()

On the first run the output looks something like this:

total time to compute on CPU: 0.14141535758972168
total time to compute on GPU: 2.010883092880249
Is the host computation the same as the GPU computation? : True

Subsequent runs print something like this:

total time to compute on CPU: 0.1373155117034912
total time to compute on GPU: 0.0006959438323974609
Is the host computation the same as the GPU computation? : True

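This is because PyCUDA typically compiles GPU code with the nvcc compiler the first time it is used in a program and then calls the compiled result. That compilation time is the extra overhead seen on the first run.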
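A common way to keep one-time costs such as compilation out of a measurement is to run the operation once as a warm-up and time only the later runs. The helper below is a minimal sketch of that idea in plain Python; time_after_warmup and its parameters are hypothetical names, not part of PyCUDA. The optional sync argument is there because CUDA kernel launches are asynchronous with respect to the host, so a timer may otherwise stop before the GPU has finished; with PyCUDA one candidate to pass would be pycuda.driver.Context.synchronize, which is an assumption about how the helper would be used rather than something the original article does.

```python
from time import perf_counter


def time_after_warmup(fn, repeats=5, sync=None):
    """Run fn once as a warm-up, then return the best of `repeats` timings (seconds)."""
    fn()  # warm-up: absorbs one-time costs such as kernel compilation
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        best = min(best, perf_counter() - start)
    return best


# CPU-only demonstration: the callable is invoked 1 + repeats times in total.
calls = []
elapsed = time_after_warmup(lambda: calls.append(sum(range(1000))))
print(f"calls: {len(calls)}, best time after warm-up: {elapsed:.6f} s")
```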

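ElementwiseKernel: elementwise operations

Let us first look at Python's built-in function map.

map takes a function as its first argument and calls it on every element of the given sequence, returning an iterator over the results (in Python 2, map returned a list). We use list() to turn the iterator into a list so that we can inspect the result.

list(map(lambda x: x + 10, [1, 2, 3, 4, 5]))

Output:

[11, 12, 13, 14, 15]

ElementwiseKernel is very similar to map.

ElementwiseKernel lets you define a custom elementwise kernel. Using it requires embedding CUDA C code.

A kernel here can be understood simply as a function that CUDA runs directly on the GPU.

Here is the code: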

gpu_2x_ker = ElementwiseKernel(
        "float *in, float *out",
        "out[i] = 2 * in[i];",
        "gpu_2x_ker"
    )

def elementwise_kernel_example():
    host_data = np.float32(np.random.random(50000000))
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()
    print(f'total time to compute on CPU: {t2 - t1}')

    device_data = gpuarray.to_gpu(host_data)
    # allocate memory for output
    device_data_2x = gpuarray.empty_like(device_data)
    
    t1 = time()
    gpu_2x_ker(device_data, device_data_2x)
    t2 = time()
    from_device = device_data_2x.get()
    print(f'total time to compute on GPU: {t2 - t1}')
    print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')
    
elementwise_kernel_example()
elementwise_kernel_example()
elementwise_kernel_example()
elementwise_kernel_example()
elementwise_kernel_example()

Output:

total time to compute on CPU: 0.13545799255371094
total time to compute on GPU: 0.4059629440307617
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.13948774337768555
total time to compute on GPU: 0.0001266002655029297
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.1357274055480957
total time to compute on GPU: 0.0001552104949951172
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.13451647758483887
total time to compute on GPU: 0.0001761913299560547
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.1362597942352295
total time to compute on GPU: 0.00011849403381347656
Is the host computation the same as the GPU computation? : True

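Again, the first run shows the overhead of nvcc compilation.

Parameters of ElementwiseKernel:

class pycuda.elementwise.ElementwiseKernel(arguments, operation, name="kernel", keep=False, options=[], preamble="")

  • arguments: the argument list of the kernel, written as C declarations.
  • operation: the CUDA C code embedded in the kernel, executed once per index i.
  • name: the name of the kernel.

gpuarray.empty_like allocates device memory with the same shape and dtype as device_data.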
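As a further illustration of the arguments and operation parameters, the sketch below defines a kernel that also takes a scalar argument and computes a * x[i] + y[i]. The names saxpy_reference, saxpy_gpu and saxpy_ker are hypothetical and do not appear in the original article. The GPU function needs PyCUDA and a CUDA device, so its imports are placed inside the function and only the NumPy reference is executed here; treat the GPU part as an untested sketch under those assumptions.

```python
import numpy as np


def saxpy_reference(a, x, y):
    """CPU reference: a * x + y, element by element."""
    return a * x + y


def saxpy_gpu(a, x, y):
    """Same computation with a custom ElementwiseKernel (needs a CUDA GPU)."""
    # Imported lazily so this file still loads on a machine without PyCUDA.
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    from pycuda import gpuarray
    from pycuda.elementwise import ElementwiseKernel

    saxpy_ker = ElementwiseKernel(
        "float a, float *x, float *y, float *out",  # one scalar, three arrays
        "out[i] = a * x[i] + y[i];",                # runs once per index i
        "saxpy_ker",
    )
    x_device = gpuarray.to_gpu(x)
    y_device = gpuarray.to_gpu(y)
    out_device = gpuarray.empty_like(x_device)  # output buffer on the device
    saxpy_ker(np.float32(a), x_device, y_device, out_device)
    return out_device.get()


x = np.array([1, 2, 3], dtype=np.float32)
y = np.array([1, 1, 1], dtype=np.float32)
print(saxpy_reference(np.float32(2), x, y))
```

On a machine with a GPU, saxpy_gpu(2, x, y) would be expected to agree with the reference; comparing the two with np.allclose follows the same pattern as the speed tests in this article.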

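Scan and reduce operations with InclusiveScanKernel and ReductionKernel

Let us first look at the reduce function from Python's standard functools module.

reduce(lambda x, y : x + y, [1, 2, 3, 4])

Output:

10

Unlike map, reduce applies a binary operation iteratively and outputs a single value.

We will use InclusiveScanKernel and ReductionKernel to implement operations similar to reduce.

InclusiveScanKernel

InclusiveScanKernel resembles reduce, except that it does not output a single value: its output has the same shape as its input, and each output element is the reduction of all input elements up to and including that position.

For a sum, the output is the sequence of running totals: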

seq = np.array([1, 2, 3, 4], dtype=np.int32)
seq_gpu = gpuarray.to_gpu(seq)
sum_gpu = InclusiveScanKernel(np.int32, "a+b")
print(sum_gpu(seq_gpu).get())
print(np.cumsum(seq))

Output:

[ 1  3  6 10]
[ 1  3  6 10]

Finding the maximum (the maximum bubbles toward the end):

seq = np.array([1,100,-3,-10000, 4, 10000, 66, 14, 21], dtype=np.int32)
seq_gpu = gpuarray.to_gpu(seq)
max_gpu = InclusiveScanKernel(np.int32, "a > b ? a : b")
seq_max_bubble = max_gpu(seq_gpu)
print(seq_max_bubble)
print(seq_max_bubble.get()[-1])
print(np.max(seq))

Output:

[    1   100   100   100   100 10000 10000 10000 10000]
10000
10000

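For a > b ? a : b, we can picture a front-to-back traversal (in reality the work is done in parallel): each position ends up holding the larger of its own value and the running maximum of everything before it.

The maximum thus moves toward the end like a bubble, and the last element is the overall maximum.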
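To see how a scan can be computed in parallel rather than by a single front-to-back loop, the sketch below simulates the textbook Hillis-Steele inclusive scan in plain Python. This is a swapped-in illustration: PyCUDA's InclusiveScanKernel may well use a different algorithm internally, and hillis_steele_scan is a hypothetical name. In each pass, every position combines its value with the one `offset` places to its left; on a GPU all positions in a pass could be handled at the same time, and the number of passes grows only logarithmically with the length.

```python
def hillis_steele_scan(values, op):
    """Inclusive scan of `values` under the associative binary operator `op`."""
    result = list(values)
    offset = 1
    while offset < len(result):
        previous = result[:]  # snapshot: every "thread" reads before any writes
        for i in range(offset, len(result)):  # these iterations are independent
            result[i] = op(previous[i - offset], previous[i])
        offset *= 2  # double the reach in each pass
    return result


seq = [1, 100, -3, -10000, 4, 10000, 66, 14, 21]
running_max = hillis_steele_scan(seq, lambda a, b: a if a > b else b)
running_sum = hillis_steele_scan([1, 2, 3, 4], lambda a, b: a + b)
print(running_max)
print(running_sum)
```

The two calls reproduce the running maximum and the cumulative sum shown in the InclusiveScanKernel examples, using the same a > b ? a : b and a+b operators written as Python lambdas.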

ReductionKernel

In effect, a ReductionKernel behaves like an ElementwiseKernel followed by a parallel scan kernel.

An example that computes the inner product of two vectors:

a_host = np.array([1, 2, 3], dtype=np.float32)
b_host = np.array([4, 5, 6], dtype=np.float32)
print(a_host.dot(b_host))

dot_prod = ReductionKernel(np.float32, neutral="0", reduce_expr="a+b",
                           map_expr="x[i]*y[i]", arguments="float *x, float *y")
a_device = gpuarray.to_gpu(a_host)
b_device = gpuarray.to_gpu(b_host)
print(dot_prod(a_device, b_device).get())

Output:

32.0
32.0

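Each pair of elements is first processed with map_expr, and the results are then combined with reduce_expr (neutral is the initial value), which yields the inner product of the two vectors.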
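The same map-then-reduce structure can be written on the CPU with the built-in map and functools.reduce, which may help in reading the ReductionKernel arguments. This is only a host-side analogy under that assumption: map_reduce is a hypothetical helper, and the lambdas stand in for the map_expr and reduce_expr strings.

```python
from functools import reduce


def map_reduce(map_fn, reduce_fn, neutral, *sequences):
    """Apply map_fn elementwise, then fold the results with reduce_fn."""
    mapped = map(map_fn, *sequences)           # plays the role of map_expr
    return reduce(reduce_fn, mapped, neutral)  # reduce_expr, starting at neutral


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
dot = map_reduce(lambda xi, yi: xi * yi, lambda a, b: a + b, 0.0, x, y)
empty_dot = map_reduce(lambda xi, yi: xi * yi, lambda a, b: a + b, 0.0, [], [])
print(dot)        # 1*4 + 2*5 + 3*6
print(empty_dot)  # with no elements, the neutral value is returned
```

The second call shows why a neutral value is needed: it is the result when there is nothing to combine, and for addition it must be 0 so that it does not change any sum.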

That concludes this first look at basic PyCUDA operations.

 

Original author: 雨先生
Original link: https://www.cnblogs.com/noluye/p/11465389.html
License: Creative Commons Attribution-NonCommercial 4.0 International License
