Writing CUDA Programs in Python

There are two ways to write CUDA programs in Python:

Numba
PyCUDA

numbapro is no longer recommended; its features have been split up and folded into accelerate and Numba.

Examples

Numba

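Numba optimizes Python code through just-in-time (JIT) compilation. It can optimize for the local hardware, supports both CPU and GPU targets, and integrates with NumPy. To run Python code on the GPU, you only need to add the appropriate decorator above the function, as shown below: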

import numpy as np 
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cuda')
def vectorAdd(a, b):
    return a + b

def main():
    N = 320000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    # No need to preallocate C: vectorAdd returns a new array.

    start = timer()
    C = vectorAdd(A, B)
    vectorAdd_time = timer() - start

    print("c[:5] = " + str(C[:5]))
    print("c[-5:] = " + str(C[-5:]))

    print("vectorAdd took %f seconds " % vectorAdd_time)

if __name__ == '__main__':
    main()
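The example above times a single call, so the figure may include one-off start-up costs as well as the host-to-device copies. A minimal sketch of a steadier measurement, assuming a hypothetical helper named `time_call` (not part of Numba) that warms up first and then keeps the best of several runs:

```python
from timeit import default_timer as timer


def time_call(fn, *args, warmup=1, repeat=3):
    """Return (best_seconds, result) for fn(*args) after warm-up calls."""
    for _ in range(warmup):
        fn(*args)                      # absorb one-off start-up costs
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = timer()
        result = fn(*args)
        best = min(best, timer() - start)
    return best, result


if __name__ == "__main__":
    # Plain-Python stand-in so the sketch runs without a GPU.
    seconds, total = time_call(sum, range(1_000_000))
    print("sum took %f seconds, result %d" % (seconds, total))
```

With the Numba example, the call would look like `time_call(vectorAdd, A, B)`. Because NumPy arrays are passed in, each timed run still includes the transfers to and from the GPU.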

PyCUDA

A PyCUDA kernel is actually written in C/C++ and compiled at run time into GPU code; the Python code then interacts with it, as shown below:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer

from pycuda.compiler import SourceModule
mod = SourceModule("""
// N is declared as int so that it matches the np.int32 passed from Python.
__global__ void func(float *a, float *b, int N)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N)
  {
    return;
  }
  float temp_a = a[i];
  float temp_b = b[i];
  a[i] = (temp_a * 10 + 2 ) * ((temp_b + 2) * 10 - 5 ) * 5;
  // a[i] = a[i] + b[i];
}
""")

func = mod.get_function("func")   

def test(N):
    # N = 1024 * 1024 * 90   # float: 4M = 1024 * 1024

    print("N = %d" % N)

    N = np.int32(N)
    
    a = np.random.randn(N).astype(np.float32)
    b = np.random.randn(N).astype(np.float32)   
    # copy a to aa
    aa = np.empty_like(a)
    aa[:] = a
    # GPU run (the timing includes the host/device copies done by InOut and In)
    n_threads = 256
    n_blocks = int((N + n_threads - 1) // n_threads)  # round up to cover all N elements
    start = timer()
    func(
            drv.InOut(a), drv.In(b), N,
            block=(n_threads, 1, 1), grid=(n_blocks, 1))
    run_time = timer() - start
    print("gpu run time %f seconds " % run_time)
    # cpu run
    start = timer()
    aa = (aa * 10 + 2 ) * ((b + 2) * 10 - 5 ) * 5
    run_time = timer() - start  

    print("cpu run time %f seconds " % run_time)  

    # check result: smallest and largest element-wise difference
    r = a - aa
    print(r.min(), r.max())

def main():
    for n in range(1, 10):
        N = 1024 * 1024 * (n * 10)
        print("------------%d---------------" % n)
        test(N)

if __name__ == '__main__':
    main()
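The launch configuration in the PyCUDA example rounds the block count up so that every element gets a thread, and the kernel's bounds check discards the surplus threads in the last block. The pure-Python sketch below mimics that index arithmetic without a GPU; the names `blocks_needed` and `simulate_launch` are illustrative and not part of PyCUDA.

```python
def blocks_needed(n, threads_per_block):
    """Ceiling division: the smallest block count that covers n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def simulate_launch(n, threads_per_block):
    """Return the element indices touched by a simulated 1-D launch."""
    touched = []
    for block_idx in range(blocks_needed(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Same formula as blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i >= n:                 # same bounds check as the kernel
                continue
            touched.append(i)
    return touched


if __name__ == "__main__":
    n, tpb = 1000, 256
    print(blocks_needed(n, tpb))                       # 4 blocks, 1024 threads
    print(simulate_launch(n, tpb) == list(range(n)))   # each index exactly once
```

For 1000 elements and 256 threads per block, four blocks are launched and the last 24 threads return without doing any work.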

Comparison

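Numba accelerates selected functions by marking them with decorators (kernels can also be written in Python), which is similar to OpenACC. PyCUDA requires you to write the kernel yourself; it is compiled at run time, and the underlying implementation is C/C++. In testing, the two approaches achieved roughly the same speedup. However, Numba is more of a black box: it is not obvious what it does internally, whereas PyCUDA is very transparent. The two approaches therefore suit different uses: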

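If you only want to accelerate your own algorithm and do not care about CUDA programming, using Numba directly is the better choice.
If you want to learn or research CUDA programming, or to test whether an algorithm is feasible under CUDA, use PyCUDA.
If the program will later be ported to C/C++, you should definitely use PyCUDA, because a kernel written with PyCUDA is itself written in CUDA C/C++.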
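The remark that Numba kernels can also be written in Python can be illustrated with `numba.cuda.jit`. The following is a minimal sketch, not the author's code: the names `add_kernel` and `gpu_add` are illustrative, and the plain-Python fallback is an addition made here so that the block still runs on a machine without Numba or a CUDA device.

```python
try:
    import numpy as np
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:
    HAVE_CUDA = False

if HAVE_CUDA:
    @cuda.jit
    def add_kernel(a, b, out):
        i = cuda.grid(1)               # global 1-D thread index
        if i < out.size:               # skip surplus threads in the last block
            out[i] = a[i] + b[i]


def gpu_add(a, b, threads_per_block=256):
    """Element-wise a + b; uses the GPU when Numba and CUDA are available."""
    if not HAVE_CUDA:
        # Illustrative CPU fallback (an assumption of this sketch).
        return [x + y for x, y in zip(a, b)]
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    out = np.empty_like(a)
    blocks = (a.size + threads_per_block - 1) // threads_per_block
    # Passing NumPy arrays lets Numba copy them to and from the device.
    add_kernel[blocks, threads_per_block](a, b, out)
    return out


if __name__ == "__main__":
    print(list(gpu_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])))
```

Compared with the `@vectorize` example, this form exposes the thread index and the launch configuration, which puts it closer to the PyCUDA style while staying in Python.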

