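Theano 2.1.12 - Basics: Using the GPU

Reposted article, originally published 2015-06-17 19:51.

Source: http://deeplearning.net/software/theano/tutorial/using_gpu.html

Using the GPU

For an introductory discussion of GPUs and their use for intensive parallel computation, see: GPGPU.

One of Theano's design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility in how it carries them out. One way it uses that flexibility is by running computations on a graphics card.

There are currently two ways to use a GPU: one supports only NVIDIA cards (CUDA backend); the other, still under development, should support any OpenCL device as well as NVIDIA cards (GpuArray Backend).

I. CUDA backend

If you have not done so already, you need to install Nvidia's GPU programming toolchain (CUDA) and configure Theano to use it. Installation instructions are provided for Linux, MacOS and Windows.

1.1 Testing Theano with GPU

To check whether your GPU is being used, cut and paste the following program into a file and run it.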

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

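The program simply computes exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.

If you run this program (saved as check1.py) with device=cpu, it takes about 3 seconds; on the GPU it takes only about 0.64 seconds. The GPU will not always produce exactly the same floating-point numbers as the CPU. As a benchmark, a loop that calls numpy.exp(x.get_value()) takes about 46 seconds.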

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
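
The CPU and GPU listings above differ only in the last printed digits of a few entries (for example 1.61879337 versus 1.61879349). The sketch below uses only the Python standard library and values copied from the two listings. It suggests that these differences amount to a single float32 rounding step. The helper name float32_bits is made up for this illustration.

```python
import struct

def float32_bits(value):
    """Return the IEEE-754 float32 bit pattern of value as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

# Pairs of (CPU result, GPU result) copied from the two listings above.
pairs = [(1.61879337, 1.61879349), (1.62323284, 1.62323296)]

# For positive floats of similar magnitude, the difference between the bit
# patterns counts how many float32 steps separate the two values.
ulp_gaps = [abs(float32_bits(gpu) - float32_bits(cpu)) for cpu, gpu in pairs]
print(ulp_gaps)
```

A gap of one step per pair is consistent with the two devices rounding the last bit of exp() differently, rather than with a computational error.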

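Note that GPU operations in Theano currently require floatX to be float32.

1.2 Returning a handle to device-allocated data

The speedup in the previous example is limited because the function returns its result as a NumPy ndarray, which for your convenience has already been copied from the device to the host. This is what makes it so easy to swap in device=gpu. If you don't mind less portability, you can get a bigger speedup by changing the graph to express the computation with a GPU-stored result. The gpu_from_host op means "copy the input from the host to the GPU"; it is optimized away once T.exp(x) has been replaced by the GPU version of exp().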

from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

The output is:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu

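Here we shaved off roughly half of the run time (about 45% with the timings above) simply by not copying the resulting array back to the host. The object returned by each function call is no longer a NumPy array but a "CudaNdarray", which can be converted to a NumPy ndarray by the normal NumPy casting mechanism, for example numpy.asarray().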
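
As a quick check of the figures quoted so far, the following sketch recomputes the speedups from the wall-clock times printed in the three listings above. The variable names are chosen for this illustration only, and the timings are those of the original author's GeForce GTX 580 run.

```python
# Wall-clock times (seconds) copied from the three listings above.
cpu_time = 3.06635117531             # device=cpu
gpu_copy_back_time = 0.638810873032  # device=gpu, result copied back to the host
gpu_on_device_time = 0.34898686409   # device=gpu, result left on the device

speedup_vs_cpu = cpu_time / gpu_copy_back_time
fraction_saved = 1.0 - gpu_on_device_time / gpu_copy_back_time

print("GPU vs CPU speedup: %.1fx" % speedup_vs_cpu)
print("Run time saved by skipping the copy: %.0f%%" % (100 * fraction_saved))
```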

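For even more speed you can play with the borrow flag. See: Borrowing when Constructing Function Objects.

1.3 What can be accelerated on the GPU?

The performance characteristics will change as we continue to optimize our implementations, and they vary from device to device, but to give a rough idea of what to expect right now:

Only computations with the float32 data type can be accelerated. Better support for float64 is expected in upcoming hardware, but float64 computations are still relatively slow (as of January 2010).
Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when the arguments are large enough to keep 30 processors busy.
Indexing, dimension-shuffling and constant-time reshaping are equally fast on GPU as on CPU.
Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
Copying large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting a performance gain from the GPU hinges on keeping the time spent transferring data to the device small.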
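
The last point can be illustrated with a back-of-the-envelope model. This is only a sketch: the function name gpu_pays_off and all timings are hypothetical, and real transfers may partly overlap with computation.

```python
def gpu_pays_off(cpu_seconds, gpu_compute_seconds, transfer_seconds):
    """Return True when GPU compute plus host<->device transfer beats the CPU time."""
    return gpu_compute_seconds + transfer_seconds < cpu_seconds

# Hypothetical timings for one function call, chosen only for illustration.
cpu_seconds = 0.010
gpu_compute_seconds = 0.001   # a kernel that is 10x faster than the CPU code

# Cheap transfer: the speedup survives.
cheap = gpu_pays_off(cpu_seconds, gpu_compute_seconds, transfer_seconds=0.002)
# Expensive transfer: copying the data costs more than the kernel saves.
expensive = gpu_pays_off(cpu_seconds, gpu_compute_seconds, transfer_seconds=0.012)
print(cheap, expensive)
```

Under these assumed numbers, a 10x faster kernel is still a net loss once the copy takes longer than the CPU computation it replaces.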

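1.4 Tips for improving performance on GPU

Consider adding floatX=float32 to your .theanorc file.
Use the Theano flag allow_gc=False. See GPU Async capabilities.
Prefer constructors like matrix, vector and scalar to dmatrix, dvector and dscalar, because the former give you float32 variables when floatX=float32.
Ensure that your output variables have a float32 dtype and not float64. The more float32 variables there are in your graph, the more work the GPU can do for you.
Minimize transfers to the GPU device by using shared float32 variables to store frequently accessed data (see shared()). When using the GPU, float32 tensor shared variables are stored on the GPU by default, which eliminates the transfer time for GPU ops that use them. (Translator's note: the data is placed on the GPU when the variable is created and stays there, instead of being sent from the CPU on every call.)
If you aren't happy with the performance you see, try building your functions with mode='ProfileMode'. This prints some timing information at program termination. If an op or Apply is taking more time than its share and you know something about GPU programming, have a look at how it is implemented in theano.sandbox.cuda. Check the proportion of time spent in CPU ops, Xs(X%), in GPU ops, Xs(X%), and in transfer ops, Xs(X%). This tells you whether your graph spends its time on the GPU or mostly in memory transfers.
Use nvcc options. nvcc supports several options that speed up some computations: -ftz=true flushes denormal values to zero, and --prec-div=false and --prec-sqrt=false speed up division and square root by being less precise. You can enable all of them at once with the nvcc.flags=--use_fast_math Theano flag, or enable them individually, as in nvcc.flags=-ftz=true --prec-div=false.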
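
Several of these tips are just configuration flags. As a minimal sketch, the flags can be assembled into a THEANO_FLAGS string from Python. The helper build_theano_flags is hypothetical, and the particular combination of flags is only an example, not a recommended setting.

```python
import os

def build_theano_flags(options):
    """Join (key, value) pairs into the comma-separated THEANO_FLAGS format."""
    return ",".join("%s=%s" % (key, value) for key, value in options)

flags = build_theano_flags([
    ("floatX", "float32"),
    ("device", "gpu"),
    ("allow_gc", "False"),
    ("nvcc.flags", "--use_fast_math"),
])

# Theano reads THEANO_FLAGS when it is first imported, so the variable has to
# be set before any `import theano` statement runs in this process.
os.environ["THEANO_FLAGS"] = flags
print(flags)
```

The same settings can also be written permanently into .theanorc or passed on the command line, as in the listings earlier in this section.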

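1.5 GPU Async capabilities

Ever since Theano 0.6 we have been using the asynchronous capability of GPUs. This allows us to be faster, but some errors may be raised later than the point where they actually occur, which can make it difficult to profile Theano apply nodes. There is an NVIDIA driver feature that helps with these issues: if you set the environment variable CUDA_LAUNCH_BLOCKING=1, all kernel calls are automatically synchronized. This reduces performance but gives good profiling and appropriately placed error messages.

This feature interacts with Theano's garbage collection of intermediate results. To get the most out of it, you need to disable the gc, because it inserts synchronization points in the graph. Set the Theano flag allow_gc=False to get even faster speed. Be aware that this raises memory usage.

1.6 Changing the value of shared variables

To change the value of a shared variable, e.g. to provide new data to the process, use shared_variable.set_value(new_value). For more detail, see Understanding Memory Aliasing for Speed and Correctness.

Exercise: logistic regression revisited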

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])

# Compile expressions to functions
train = theano.function(
            inputs=[x,y],
            outputs=[prediction, xent],
            updates={w:w-0.01*gw, b:b-0.01*gb},
            name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
            name = "predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]

print "prediction on D"
print predict(D[0])

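Modify and execute this example to run on the GPU with floatX=float32, and time the execution using the command line time python file.py. (Help: Configuration Settings and Compiling Mode.)

Is there an increase in speed from CPU to GPU?

Where does it come from? (Use ProfileMode)

What can be done to further increase the speed of the GPU version?

Note:

Only 32-bit floats are currently supported (development is in progress).
Shared variables with float32 dtype are by default moved to the GPU memory space.
There is a limit of one GPU per process.
Use the Theano flag device=gpu to require use of the GPU device.
Use device=gpu{0, 1, ...} to specify which GPU to use if you have more than one.
Apply the Theano flag floatX=float32 (through theano.config.floatX) in your code.
Cast inputs before storing them into a shared variable.
Circumvent the automatic cast of int32 with float32 to float64:
- Insert manual casts in your code or use [u]int{8,16}.
- Insert manual casts around the mean operator (this involves a division by the length, which is an int64).
- Note that a new casting mechanism is being developed.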
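
The upcast described in the last items can be observed in plain NumPy, whose promotion rules behave the same way for these dtypes. The sketch below uses NumPy as a stand-in and does not touch Theano itself; the array names are made up for the illustration.

```python
import numpy

# Mixing int32 with float32 promotes to float64, which the GPU path cannot use.
int32_mix = numpy.result_type(numpy.int32, numpy.float32)
# Smaller integer types fit into float32, so no upcast happens.
int16_mix = numpy.result_type(numpy.int16, numpy.float32)
uint8_mix = numpy.result_type(numpy.uint8, numpy.float32)
print(int32_mix, int16_mix, uint8_mix)

# A manual cast keeps the whole expression in float32.
counts = numpy.arange(4, dtype=numpy.int32)
weights = numpy.ones(4, dtype=numpy.float32)
product = counts.astype(numpy.float32) * weights
print(product.dtype)
```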

Solution

#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'


# 1. Raw results


from __future__ import print_function
import numpy
import theano
import theano.tensor as tt

from theano import sandbox, Out

theano.config.floatX = 'float32'

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = theano.shared(D[0], name="x")
y = theano.shared(D[1], name="y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
       0.01 * (w ** 2).sum()  # The cost to optimize
gw, gb = tt.grad(cost, [w, b])

"""
# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
            name="predict")
"""

# Compile expressions to functions
train = theano.function(
            inputs=[],
            outputs=[prediction, xent],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[], outputs=prediction,
            name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train()
#print "Final model:"
#print w.get_value(), b.get_value()

print("target values for D")
print(D[1])

print("prediction on D")
print(predict())

"""

# 2. Profiling


# 2.1 Profiling for CPU computations

# In your terminal, type:
$ THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D

# Followed by the output of profiling.. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:

Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 1.590916e+00s
  Time in Function.fn.__call__: 1.492365e+00s (93.805%)
  Time in thunks: 1.408159e+00s (88.512%)
  Total compile time: 6.309664e+00s
    Number of Apply nodes: 25
    Theano Optimizer time: 4.848340e-01s
       Theano validate time: 5.454302e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  59.6%    59.6%       0.839s       4.19e-05s     C    20001       3   theano.tensor.blas_c.CGemv
  30.1%    89.7%       0.424s       4.71e-06s     C    90001      10   theano.tensor.elemwise.Elemwise
   5.5%    95.2%       0.078s       7.79e-02s     Py       1       1   theano.tensor.blas.Gemv
   1.9%    97.1%       0.026s       1.30e-06s     C    20001       3   theano.tensor.basic.Alloc
   1.3%    98.4%       0.018s       1.85e-06s     C    10000       1   theano.tensor.elemwise.Sum
   1.0%    99.4%       0.014s       4.78e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
   0.6%   100.0%       0.008s       4.23e-07s     C    20001       3   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  59.6%    59.6%       0.839s       4.19e-05s     C     20001        3   CGemv{inplace}
  15.8%    75.4%       0.223s       2.23e-05s     C     10000        1   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
   7.7%    83.1%       0.109s       1.09e-05s     C     10000        1   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
   5.5%    88.7%       0.078s       7.79e-02s     Py       1        1   Gemv{no_inplace}
   4.3%    92.9%       0.060s       6.00e-06s     C     10000        1   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
   1.9%    94.8%       0.026s       1.30e-06s     C     20001        3   Alloc
   1.3%    96.1%       0.018s       1.85e-06s     C     10000        1   Sum{acc_dtype=float64}
   0.7%    96.8%       0.009s       4.73e-07s     C     20001        3   InplaceDimShuffle{x}
   0.6%    97.4%       0.009s       8.52e-07s     C     10000        1   Elemwise{sub,no_inplace}
   0.6%    98.0%       0.008s       4.23e-07s     C     20001        3   Shape_i{0}
   0.5%    98.5%       0.007s       7.06e-07s     C     10000        1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   0.5%    98.9%       0.007s       6.57e-07s     C     10000        1   Elemwise{neg,no_inplace}
   0.3%    99.3%       0.005s       4.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
   0.3%    99.5%       0.004s       3.78e-07s     C     10000        1   Elemwise{inv,no_inplace}
   0.2%    99.8%       0.003s       3.44e-07s     C     10000        1   Elemwise{Cast{float32}}
   0.2%   100.0%       0.003s       3.01e-07s     C     10000        1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.0%   100.0%       0.000s       8.11e-06s     C        1        1   Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  31.6%    31.6%       0.445s       4.45e-05s   10000     7   CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
  27.9%    59.6%       0.393s       3.93e-05s   10000    17   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
  15.8%    75.4%       0.223s       2.23e-05s   10000    14   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
   7.7%    83.1%       0.109s       1.09e-05s   10000    15   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
   5.5%    88.7%       0.078s       7.79e-02s      1     0   Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
   4.3%    92.9%       0.060s       6.00e-06s   10000    13   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
   1.3%    94.2%       0.018s       1.85e-06s   10000    16   Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
   1.0%    95.2%       0.013s       1.34e-06s   10000     5   Alloc(TensorConstant{0.0}, Shape_i{0}.0)
   0.9%    96.1%       0.013s       1.27e-06s   10000    12   Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
   0.6%    96.7%       0.009s       8.52e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
   0.5%    97.2%       0.007s       7.06e-07s   10000     9   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
   0.5%    97.6%       0.007s       6.57e-07s   10000    11   Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   0.4%    98.1%       0.006s       6.27e-07s   10000     0   InplaceDimShuffle{x}(b)
   0.4%    98.5%       0.006s       5.90e-07s   10000     1   Shape_i{0}(x)
   0.3%    98.9%       0.005s       4.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
   0.3%    99.1%       0.004s       3.78e-07s   10000    10   Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
   0.2%    99.4%       0.003s       3.44e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
   0.2%    99.6%       0.003s       3.19e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
   0.2%    99.8%       0.003s       3.01e-07s   10000    18   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
   0.2%   100.0%       0.003s       2.56e-07s   10000     3   Shape_i{0}(y)
   ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)



# 2.2 Profiling for GPU computations

# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D

Results were produced using a GeForce GTX TITAN

# Profiling summary for all functions:

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 3.535239e+00s
  Time in Function.fn.__call__: 3.420863e+00s (96.765%)
  Time in thunks: 2.865905e+00s (81.067%)
  Total compile time: 4.728150e-01s
    Number of Apply nodes: 36
    Theano Optimizer time: 4.283385e-01s
       Theano validate time: 7.687330e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  45.7%    45.7%       1.308s       1.64e-05s     C    80001       9   theano.sandbox.cuda.basic_ops.GpuElemwise
  17.2%    62.8%       0.492s       2.46e-05s     C    20002       4   theano.sandbox.cuda.blas.GpuGemv
  15.1%    77.9%       0.433s       2.17e-05s     C    20001       3   theano.sandbox.cuda.basic_ops.GpuAlloc
   8.2%    86.1%       0.234s       1.17e-05s     C    20002       4   theano.sandbox.cuda.basic_ops.HostFromGpu
   7.2%    93.3%       0.207s       2.07e-05s     C    10000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
   4.4%    97.7%       0.127s       1.27e-05s     C    10003       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.9%    98.6%       0.025s       8.23e-07s     C    30001       4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.7%    99.3%       0.020s       9.88e-07s     C    20001       3   theano.tensor.elemwise.Elemwise
   0.5%    99.8%       0.014s       7.18e-07s     C    20001       3   theano.compile.ops.Shape_i
   0.2%   100.0%       0.006s       5.78e-07s     C    10000       1   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  17.2%    17.2%       0.492s       2.46e-05s     C     20001        3   GpuGemv{inplace}
   8.2%    25.3%       0.234s       1.17e-05s     C     20002        4   HostFromGpu
   8.0%    33.3%       0.228s       2.28e-05s     C     10001        2   GpuAlloc{memset_0=True}
   7.4%    40.7%       0.211s       2.11e-05s     C     10000        1   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
   7.2%    47.9%       0.207s       2.07e-05s     C     10000        1   GpuCAReduce{add}{1}
   7.1%    55.0%       0.205s       2.05e-05s     C     10000        1   GpuAlloc
   6.9%    62.0%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
   6.9%    68.9%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{inv,no_inplace}
   6.2%    75.1%       0.178s       1.78e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
   5.6%    80.6%       0.159s       1.59e-05s     C     10000        1   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
   4.4%    85.1%       0.127s       1.27e-05s     C     10003        4   GpuFromHost
   4.3%    89.4%       0.124s       1.24e-05s     C     10000        1   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   4.2%    93.6%       0.121s       1.21e-05s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]
   4.2%    97.7%       0.119s       1.19e-05s     C     10000        1   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.5%    98.2%       0.014s       7.18e-07s     C     20001        3   Shape_i{0}
   0.5%    98.7%       0.013s       1.33e-06s     C     10001        2   Elemwise{gt,no_inplace}
   0.3%    99.0%       0.010s       9.81e-07s     C     10000        1   GpuDimShuffle{1,0}
   0.3%    99.3%       0.008s       7.90e-07s     C     10000        1   GpuDimShuffle{0}
   0.2%    99.6%       0.007s       6.97e-07s     C     10001        2   GpuDimShuffle{x}
   0.2%    99.8%       0.006s       6.50e-07s     C     10000        1   Elemwise{Cast{float32}}
   ... (remaining 3 Ops account for   0.20%(0.01s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   8.8%     8.8%       0.251s       2.51e-05s   10000    22   GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
   8.4%    17.2%       0.241s       2.41e-05s   10000     7   GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
   8.0%    25.1%       0.228s       2.28e-05s   10000     5   GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
   7.4%    32.5%       0.211s       2.11e-05s   10000    13   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
   7.2%    39.7%       0.207s       2.07e-05s   10000    21   GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
   7.1%    46.9%       0.205s       2.05e-05s   10000    17   GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
   6.9%    53.8%       0.198s       1.98e-05s   10000     4   GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
   6.9%    60.7%       0.198s       1.98e-05s   10000    12   GpuElemwise{inv,no_inplace}(GpuFromHost.0)
   6.2%    66.9%       0.178s       1.78e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   5.6%    72.5%       0.159s       1.59e-05s   10000    19   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
   4.8%    77.3%       0.138s       1.38e-05s   10000    18   HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
   4.4%    81.7%       0.126s       1.26e-05s   10000    10   GpuFromHost(Elemwise{Cast{float32}}.0)
   4.3%    86.0%       0.124s       1.24e-05s   10000     9   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
   4.2%    90.2%       0.121s       1.21e-05s   10000    15   GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
   4.2%    94.4%       0.119s       1.19e-05s   10000    23   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
   3.4%    97.7%       0.096s       9.61e-06s   10000    16   HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
   0.5%    98.2%       0.013s       1.33e-06s   10000    20   Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
   0.3%    98.5%       0.010s       9.81e-07s   10000     2   GpuDimShuffle{1,0}(x)
   0.3%    98.8%       0.008s       8.27e-07s   10000     1   Shape_i{0}(x)
   0.3%    99.1%       0.008s       7.90e-07s   10000    14   GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
   ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)


# 3. Conclusions

Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.

"""

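The conclusions of the solution above can be checked against the profiler output. The sketch below copies a few figures from the GPU and CPU summaries and recomputes the share of time spent in transfer ops and the overall slowdown at N = 400. The variable names are illustrative only.

```python
# Figures copied from the GPU "Function profiling" and "Class" summaries (seconds).
time_in_thunks = 2.865905
host_from_gpu = 0.234
gpu_from_host = 0.127

transfer_share = (host_from_gpu + gpu_from_host) / time_in_thunks
print("Transfer ops: %.1f%% of thunk time" % (100 * transfer_share))

# Total time in Function.__call__ for the CPU and GPU runs (seconds).
cpu_total = 1.590916
gpu_total = 3.535239
slowdown = gpu_total / cpu_total
print("GPU run is %.1fx slower at N = 400" % slowdown)
```

With inputs this small, the transfers and the fixed cost of launching each GPU op outweigh any gain in the computation itself, which matches the remark in the conclusions about increasing N.
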
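II. GpuArray Backend

If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. See the instructions for libgpuarray.

While all types of devices are supported if using OpenCL, for the remainder of this section whatever compute device you are using will be referred to as the GPU.

Warning: while we fully intend to support OpenCL, as of May 2014 this support is still in its infancy. Many useful ops still do not support it, because they were ported from the old backend with minimal changes.

2.1 Testing Theano with GPU

To see whether your GPU is being used, cut and paste the following program into a file and run it: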

from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

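The program simply computes exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure that the input x is stored on the GPU.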

$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the cpu

$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu

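2.2 Returning a handle to device-allocated data

By default, functions that execute on the GPU still return a standard numpy ndarray. A transfer operation is inserted just before the results are returned to ensure compatibility with CPU code. This lets you change where the code runs by changing only the device flag, without touching the source code.

If you don't mind a loss of flexibility, you can ask Theano to return the GPU object directly. The following code is modified to do just that: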

from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

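Here the theano.sandbox.gpuarray.basic_ops.gpu_from_host() call means "copy the input to the GPU". During the optimization phase, however, it is removed, because the result is already on the GPU. It is used here only to tell Theano that we want the result on the GPU.

The output is: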

$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu

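The time per call appears to be much lower than in the two previous invocations, and it should indeed be lower, since we avoid a transfer. However, most of this large speedup is due to the asynchronous nature of execution on GPUs: the work has not been completed yet, only "launched".

The object returned is a GpuArray from pygpu. It mostly acts like a numpy ndarray, with some exceptions due to its data being on the GPU. You can copy it to the host and convert it to a regular ndarray using the usual numpy casting, such as numpy.asarray().

For even more speed, you can play with the borrow flag. See: Borrowing when Constructing Function Objects.

2.3 What can be accelerated on the GPU?

The performance characteristics will of course vary from device to device, and will also change as we refine our implementation.

This backend supports all regular Theano data types (float32, float64, int, ...). However, GPU support varies, and some units cannot deal with double (float64) or small (less than 32 bits, like int16) data types. If this is the case, you will get an error at compile time or at run time.

Complex number support is untested and most likely does not work.

In general, large operations such as matrix multiplication, or element-wise operations with large inputs, will be significantly faster.

2.4 GPU Async capabilities

By default, all operations on the GPU run asynchronously. This is made transparent by the underlying libgpuarray.

A forced synchronization point is introduced when doing memory transfers between device and host. Another is introduced when releasing active memory buffers on the GPU (active buffers are buffers that are still in use by a kernel).

It is possible to force synchronization for a particular GpuArray by calling its sync() method. This is useful for getting accurate timings when doing benchmarks.

The forced synchronization points interact with the garbage collection of intermediate results. To get the fastest speed possible, you should disable the garbage collector with the Theano flag allow_gc=False. Be aware that this increases memory usage.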
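
Why asynchronous launches distort benchmarks can be shown with a CPU-only analogy. The sketch below uses a thread pool from the Python standard library as a stand-in for the GPU queue: submit() plays the role of a kernel launch and result() plays the role of a synchronization point such as sync(). It is an analogy under these assumptions, not libgpuarray code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_kernel():
    """Stand-in for a GPU kernel: it simply sleeps for 0.2 seconds."""
    time.sleep(0.2)
    return "done"

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_kernel)     # returns as soon as the work is launched
    launched_after = time.perf_counter() - start

    result = future.result()              # blocks until the work has finished
    finished_after = time.perf_counter() - start

print("launched after %.3f s, finished after %.3f s" % (launched_after, finished_after))
```

Stopping the clock right after the launch would report only the small launch overhead. Waiting at the synchronization point gives the real duration.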

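III. Software for directly programming a GPU

Leaving aside Theano, which is a meta-programmer, there are:

CUDA: GPU programming API by NVIDIA, based on an extension to C (CUDA C)
- Vendor-specific.
- Mature numeric libraries (BLAS, RNG, FFT).
OpenCL: multi-vendor version of CUDA
- More general and standardized.
- Fewer libraries, less widespread.
PyCUDA: Python bindings to the CUDA driver interface, giving access to Nvidia's CUDA parallel computation API from Python
- Convenience:
  - Makes GPU meta-programming from within Python easier.
  - Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).
  - GPU memory buffer (pycuda.gpuarray.GPUArray).
  - Helpful documentation.
- Completeness: binding to all of CUDA's driver API.
- Automatic error checking: all CUDA errors are automatically translated into Python exceptions.
- Speed: PyCUDA's base layer is written in C++.
- Good memory management of GPU objects:
  - Object cleanup is tied to the lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
  - Makes it much easier to write correct, leak-free and crash-free code.
  - PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory allocated in it is also freed).
(See PyCUDA's documentation and Andreas Kloeckner's website on PyCUDA.)
PyOpenCL: PyCUDA for OpenCL

IV. Learning to program with PyCUDA

If you are already proficient in C, you can easily build on that knowledge by first learning to program a GPU with CUDA C, and then using PyCUDA to access the CUDA API from Python.

The following resources will assist you in this learning process:

CUDA API and CUDA C: Introductory
- NVIDIA's slides
- Stein's (NYU) slides
CUDA API and CUDA C: Advanced
- MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
- Course U. of Illinois (full lectures, Kirk-Hwu textbook)
- NVIDIA's knowledge base (extensive coverage, from introductory to advanced)
- practical issues (on the relationship between grids, blocks and threads; see also the related issues on the same page)
- CUDA optimisation
PyCUDA: Introductory
- Kloeckner's slides
- Kloeckner's website
PyCUDA: Advanced
- PyCUDA documentation website

The following examples give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you may try the corresponding exercises.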

Example: PyCUDA

# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

assert numpy.allclose(dest, a*b)
print dest

Exercise

Run the preceding example.

Modify and execute it to work for a matrix of shape (20, 10).

Example: Theano + PyCUDA

import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
           cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<size){
        o0[i] = i0[i]*2;
    }
  }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk

Use this code to test it:

>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
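
The launch configuration in thunk() rounds the number of blocks up, and the kernel guards against the surplus threads with if(i<size). The following CPU-only sketch emulates that indexing with plain Python loops, one iteration per thread. The function launch_double is a made-up illustration and does not call PyCUDA.

```python
import math

def launch_double(values, threads_per_block=512):
    """Emulate the my_fct kernel launch on the CPU, one loop iteration per thread."""
    size = len(values)
    # Same rounding as `grid` in thunk(): enough blocks to cover every element.
    blocks = int(math.ceil(size / float(threads_per_block)))
    out = [None] * size
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # global thread index
            if i < size:                                    # bounds guard from the kernel
                out[i] = values[i] * 2
    return blocks, out

blocks, doubled = launch_double([1.0] * 20)   # a 4 x 5 matrix has 20 elements
print(blocks, doubled[:3])
blocks_large, _ = launch_double([0.0] * 1025)
print(blocks_large)
```

For the 4 x 5 test matrix a single block of 512 threads is launched and only the first 20 threads write a result. An input of 1025 elements needs three blocks.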

Exercise

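Run the preceding example.

Modify and execute it to multiply two matrices: x * y.

Modify and execute it to return two outputs: x + y and x - y.

(Notice that Theano's current elemwise fusion optimization is only applicable to computations involving a single output. Hence, to gain efficiency over the basic solution asked for here, the two operations would have to be jointly optimized explicitly in the code.)

Modify and execute it to support strides (i.e. to avoid being constrained to C-contiguous inputs).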
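
As background for the last exercise, the sketch below shows what strides mean, using a flat Python list as the memory buffer. The helper element_offset is hypothetical, and strides are counted in elements here, whereas NumPy reports them in bytes.

```python
def element_offset(index, strides):
    """Return the offset (in elements) of a multi-dimensional index given per-axis strides."""
    return sum(i * s for i, s in zip(index, strides))

# A 2 x 3 matrix stored row by row in a flat buffer.
buffer = [10, 11, 12,
          20, 21, 22]
c_strides = (3, 1)   # C-contiguous: moving down one row skips 3 elements

# The transpose (3 x 2) reuses the same buffer with the strides swapped,
# so it is no longer C-contiguous.
t_strides = (1, 3)

original_value = buffer[element_offset((1, 2), c_strides)]    # row 1, column 2
transposed_value = buffer[element_offset((2, 1), t_strides)]  # row 2, column 1 of the transpose
print(original_value, transposed_value)
```

A kernel that supports strides would compute each element's address this way instead of assuming that element i lives at offset i.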

V. Note

See Other Implementations to learn how to handle random numbers on the GPU.

