Please credit the source when reposting:
http://www.cnblogs.com/darkknightzh/p/5655957.html
References:
http://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda
http://blog.csdn.net/revolver/article/details/49682131
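I. Installing directly from the terminal
Note: NVIDIA has not released a CUDA toolkit for Ubuntu 16, so this method may not work for you. It worked on my machine, but that feels like pure luck... I did not install the samples; the only result is that other programs can now use the GPU.
1. The first link suggests using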
sudo apt-get install nvidia-cuda-toolkit
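to install the CUDA toolkit. The download is slow, depending on your connection. The link also warns that rebooting Ubuntu afterwards can cause trouble ("I can't log in to my computer and end up in infinite login screen"). After installing, I logged in normally and did not hit that problem.
2. Information after installation:
The installed version is 7.5.17 rather than the latest 7.5.18, but it works, and that is enough.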
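To check the installed version from a script rather than by eye, one option is to parse the output of nvcc --version. This is only a sketch: it assumes nvcc prints its usual "release X.Y, VX.Y.Z" line, and the two function names are made up for illustration.

```shell
# Extract the toolkit version (e.g. 7.5.17) from `nvcc --version` output on stdin.
# nvcc prints a line such as: "Cuda compilation tools, release 7.5, V7.5.17".
cuda_version_from_nvcc() {
    sed -n 's/^.*release [0-9.]*, V\([0-9.]*\).*$/\1/p'
}

# Report the installed version, or say that nvcc is not on PATH.
report_cuda_version() {
    if command -v nvcc >/dev/null 2>&1; then
        nvcc --version | cuda_version_from_nvcc
    else
        echo "nvcc not found on PATH"
    fi
}
```

Calling report_cuda_version after the install should print the version number on a machine where nvcc is available.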

3. In the second link, qed gives a command that keeps printing the current GPU usage in the terminal (NVIDIA cards only):
nvidia-smi -l 1
Output:

Note: the command above apparently needs support from the card. Alternatively, use the command given by Jonathan (not tested yet):
watch -n0.1 "nvidia-settings -q GPUUtilization -q useddedicatedgpumemory"
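Note 160713: a. This command prints the following:

b. In fact this command just shows, in the terminal, some of the information from "NVIDIA X Server Settings", as below (the NVIDIA X Server Settings launcher is in /usr/share/applications; you can also open the application directly to look):

c. This screenshot was taken with a different GPU from the one used earlier, so the figures differ (for example the amount of video memory).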
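If only utilization and memory are of interest, nvidia-smi can also print just those fields as CSV through its --query-gpu option. The following is a rough sketch, not something taken from the links above: the function names are invented here, and it assumes a driver recent enough to support the query options.

```shell
# Turn one CSV row "util, used, total" (as printed by the query in watch_gpu)
# into a readable line. With nounits, memory values are in MiB.
format_gpu_row() {
    awk -F', *' '{ printf "GPU %s%% | memory %s / %s MiB\n", $1, $2, $3 }'
}

# Poll the GPU once per second until interrupted with Ctrl+C.
# Needs an NVIDIA driver with nvidia-smi installed.
watch_gpu() {
    while true; do
        nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
                   --format=csv,noheader,nounits | format_gpu_row
        sleep 1
    done
}
```

Run watch_gpu in a second terminal while a training script is running; on a machine with several cards it prints one line per GPU each second.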
4. After installing CUDA I installed cutorch and then cunn; both installed successfully, and programs that use the GPU run normally.
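For reference, the two packages are normally installed through luarocks. The helper below is a small sketch under that assumption (a Torch installation whose luarocks is on PATH); the function name and the DRY_RUN switch are hypothetical conveniences, not part of Torch.

```shell
# Install the Torch CUDA rocks in the order used above: cutorch first, then cunn.
# Set DRY_RUN=1 to print the commands without running them.
install_torch_cuda_rocks() {
    for rock in cutorch cunn; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "luarocks install $rock"
        else
            luarocks install "$rock" || return 1
        fi
    done
}
```

Running it with DRY_RUN=1 first shows exactly what would be executed.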
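5. The third reference link gives a test program. I modified it slightly to print the time taken by each loop iteration (the CPU and GPU versions are practically the same code):
① CPU version: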
require 'torch'
require 'nn'
require 'optim'
--require 'cunn'
--require 'cutorch'
mnist = require 'mnist'

fullset = mnist.traindataset()
testset = mnist.testdataset()

trainset = {
    size = 50000,
    data = fullset.data[{{1,50000}}]:double(),
    label = fullset.label[{{1,50000}}]
}

validationset = {
    size = 10000,
    data = fullset.data[{{50001,60000}}]:double(),
    label = fullset.label[{{50001,60000}}]
}

trainset.data = trainset.data - trainset.data:mean()
validationset.data = validationset.data - validationset.data:mean()


model = nn.Sequential()
model:add(nn.Reshape(1, 28, 28))
model:add(nn.MulConstant(1/256.0*3.2))
model:add(nn.SpatialConvolutionMM(1, 20, 5, 5, 1, 1, 0, 0))
model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
model:add(nn.SpatialConvolutionMM(20, 50, 5, 5, 1, 1, 0, 0))
model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
model:add(nn.Reshape(4*4*50))
model:add(nn.Linear(4*4*50, 500))
model:add(nn.ReLU())
model:add(nn.Linear(500, 10))
model:add(nn.LogSoftMax())

model = require('weight-init')(model, 'xavier')

criterion = nn.ClassNLLCriterion()

--model = model:cuda()
--criterion = criterion:cuda()
--trainset.data = trainset.data:cuda()
--trainset.label = trainset.label:cuda()
--validationset.data = validationset.data:cuda()
--validationset.label = validationset.label:cuda()--[[]]

sgd_params = {
    learningRate = 1e-2,
    learningRateDecay = 1e-4,
    weightDecay = 1e-3,
    momentum = 1e-4
}

x, dl_dx = model:getParameters()

step = function(batch_size)
    local current_loss = 0
    local count = 0
    local shuffle = torch.randperm(trainset.size)
    batch_size = batch_size or 200
    for t = 1,trainset.size,batch_size do
        -- setup inputs and targets for this mini-batch
        local size = math.min(t + batch_size - 1, trainset.size) - t
        local inputs = torch.Tensor(size, 28, 28)--:cuda()
        local targets = torch.Tensor(size)--:cuda()
        for i = 1,size do
            local input = trainset.data[shuffle[i+t]]
            local target = trainset.label[shuffle[i+t]]
            -- if target == 0 then target = 10 end
            inputs[i] = input
            targets[i] = target
        end
        targets:add(1)
        local feval = function(x_new)
            -- reset data
            if x ~= x_new then x:copy(x_new) end
            dl_dx:zero()

            -- perform mini-batch gradient descent
            local loss = criterion:forward(model:forward(inputs), targets)
            model:backward(inputs, criterion:backward(model.output, targets))

            return loss, dl_dx
        end

        _, fs = optim.sgd(feval, x, sgd_params)

        -- fs is a table containing value of the loss function
        -- (just 1 value for the SGD optimization)
        count = count + 1
        current_loss = current_loss + fs[1]
    end

    -- normalize loss
    return current_loss / count
end

eval = function(dataset, batch_size)
    local count = 0
    batch_size = batch_size or 200

    for i = 1,dataset.size,batch_size do
        local size = math.min(i + batch_size - 1, dataset.size) - i
        local inputs = dataset.data[{{i,i+size-1}}]--:cuda()
        local targets = dataset.label[{{i,i+size-1}}]:long()--:cuda()
        local outputs = model:forward(inputs)
        local _, indices = torch.max(outputs, 2)
        indices:add(-1)
        local guessed_right = indices:eq(targets):sum()
        count = count + guessed_right
    end

    return count / dataset.size
end

max_iters = 5

do
    local last_accuracy = 0
    local decreasing = 0
    local threshold = 1 -- how many decreasing epochs we allow
    for i = 1,max_iters do
        timer = torch.Timer()

        local loss = step()
        print(string.format('Epoch: %d Current loss: %4f', i, loss))
        local accuracy = eval(validationset)
        print(string.format('Accuracy on the validation set: %4f', accuracy))
        if accuracy < last_accuracy then
            if decreasing > threshold then break end
            decreasing = decreasing + 1
        else
            decreasing = 0
        end
        last_accuracy = accuracy

        print('Time elapsed: ' .. i .. 'iter: ' .. timer:time().real .. ' seconds')
    end
end

testset.data = testset.data:double()
eval(testset)
② GPU version:
  1  require 'torch'
  2  require 'nn'
  3  require 'optim'
  4  require 'cunn'
  5  require 'cutorch'
  6  mnist = require 'mnist'
  7
  8  fullset = mnist.traindataset()
  9  testset = mnist.testdataset()
 10
 11  trainset = {
 12      size = 50000,
 13      data = fullset.data[{{1,50000}}]:double(),
 14      label = fullset.label[{{1,50000}}]
 15  }
 16
 17  validationset = {
 18      size = 10000,
 19      data = fullset.data[{{50001,60000}}]:double(),
 20      label = fullset.label[{{50001,60000}}]
 21  }
 22
 23  trainset.data = trainset.data - trainset.data:mean()
 24  validationset.data = validationset.data - validationset.data:mean()
 25
 26
 27  model = nn.Sequential()
 28  model:add(nn.Reshape(1, 28, 28))
 29  model:add(nn.MulConstant(1/256.0*3.2))
 30  model:add(nn.SpatialConvolutionMM(1, 20, 5, 5, 1, 1, 0, 0))
 31  model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
 32  model:add(nn.SpatialConvolutionMM(20, 50, 5, 5, 1, 1, 0, 0))
 33  model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
 34  model:add(nn.Reshape(4*4*50))
 35  model:add(nn.Linear(4*4*50, 500))
 36  model:add(nn.ReLU())
 37  model:add(nn.Linear(500, 10))
 38  model:add(nn.LogSoftMax())
 39
 40  model = require('weight-init')(model, 'xavier')
 41
 42  criterion = nn.ClassNLLCriterion()
 43
 44  model = model:cuda()
 45  criterion = criterion:cuda()
 46  trainset.data = trainset.data:cuda()
 47  trainset.label = trainset.label:cuda()
 48  validationset.data = validationset.data:cuda()
 49  validationset.label = validationset.label:cuda()--[[]]
 50
 51  sgd_params = {
 52      learningRate = 1e-2,
 53      learningRateDecay = 1e-4,
 54      weightDecay = 1e-3,
 55      momentum = 1e-4
 56  }
 57
 58  x, dl_dx = model:getParameters()
 59
 60  step = function(batch_size)
 61      local current_loss = 0
 62      local count = 0
 63      local shuffle = torch.randperm(trainset.size)
 64      batch_size = batch_size or 200
 65      for t = 1,trainset.size,batch_size do
 66          -- setup inputs and targets for this mini-batch
 67          local size = math.min(t + batch_size - 1, trainset.size) - t
 68          local inputs = torch.Tensor(size, 28, 28):cuda()
 69          local targets = torch.Tensor(size):cuda()
 70          for i = 1,size do
 71              local input = trainset.data[shuffle[i+t]]
 72              local target = trainset.label[shuffle[i+t]]
 73              -- if target == 0 then target = 10 end
 74              inputs[i] = input
 75              targets[i] = target
 76          end
 77          targets:add(1)
 78          local feval = function(x_new)
 79              -- reset data
 80              if x ~= x_new then x:copy(x_new) end
 81              dl_dx:zero()
 82
 83              -- perform mini-batch gradient descent
 84              local loss = criterion:forward(model:forward(inputs), targets)
 85              model:backward(inputs, criterion:backward(model.output, targets))
 86
 87              return loss, dl_dx
 88          end
 89
 90          _, fs = optim.sgd(feval, x, sgd_params)
 91
 92          -- fs is a table containing value of the loss function
 93          -- (just 1 value for the SGD optimization)
 94          count = count + 1
 95          current_loss = current_loss + fs[1]
 96      end
 97
 98      -- normalize loss
 99      return current_loss / count
100  end
101
102  eval = function(dataset, batch_size)
103      local count = 0
104      batch_size = batch_size or 200
105
106      for i = 1,dataset.size,batch_size do
107          local size = math.min(i + batch_size - 1, dataset.size) - i
108          local inputs = dataset.data[{{i,i+size-1}}]:cuda()
109          local targets = dataset.label[{{i,i+size-1}}]:long():cuda()
110          local outputs = model:forward(inputs)
111          local _, indices = torch.max(outputs, 2)
112          indices:add(-1)
113          local guessed_right = indices:eq(targets):sum()
114          count = count + guessed_right
115      end
116
117      return count / dataset.size
118  end
119
120  max_iters = 5
121
122  do
123      local last_accuracy = 0
124      local decreasing = 0
125      local threshold = 1 -- how many decreasing epochs we allow
126      for i = 1,max_iters do
127          timer = torch.Timer()
128
129          local loss = step()
130          print(string.format('Epoch: %d Current loss: %4f', i, loss))
131          local accuracy = eval(validationset)
132          print(string.format('Accuracy on the validation set: %4f', accuracy))
133          if accuracy < last_accuracy then
134              if decreasing > threshold then break end
135              decreasing = decreasing + 1
136          else
137              decreasing = 0
138          end
139          last_accuracy = accuracy
140
141          print('Time elapsed: ' .. i .. 'iter: ' .. timer:time().real .. ' seconds')
142      end
143  end
144
145  testset.data = testset.data:double()
146  eval(testset)
==================================================================================
Update 170121:
I ran the program above again today and got the following error:
Epoch: 1 Current loss: 0.652170
/home/XXX/torch/install/bin/luajit: testGPU.lua:113: invalid arguments: CudaLongTensor CudaTensor
expected arguments: [*CudaByteTensor*] CudaLongTensor long | *CudaLongTensor* CudaLongTensor long | [*CudaByteTensor*] CudaLongTensor CudaLongTensor | *CudaLongTensor* CudaLongTensor CudaLongTensor
stack traceback:
    [C]: in function 'eq'
    testGPU.lua:113: in function 'eval'
    testGPU.lua:131: in main chunk
    [C]: in function 'dofile'
    ...gram/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50
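Inserting the following statement just before line 113 of the GPU code (the indices:eq(targets) call) makes it run again:
indices=indices:cuda()
The error message shows the cause: indices (returned by torch.max) is a CudaLongTensor while targets is a CudaTensor, and eq expects both to have the same type. Still, truly bizarre...
End of update 170121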
==================================================================================
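6. CPU and GPU utilization
① CPU version
CPU:

GPU:

② GPU version
CPU:

GPU:

7. As you can see, the CPU version uses all of the CPU while the GPU is essentially idle. The GPU version fully loads only one CPU core (thread) while the others just look on... and GPU utilization is already very high.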
8. Timing comparison
CPU version:
Epoch: 1 Current loss: 0.619644
Accuracy on the validation set: 0.924800
Time elapsed: 1iter: 895.69850516319 seconds
Epoch: 2 Current loss: 0.225129
Accuracy on the validation set: 0.949000
Time elapsed: 2iter: 914.15352702141 seconds
GPU version:
Epoch: 1 Current loss: 0.687380
Accuracy on the validation set: 0.925300
Time elapsed: 1iter: 14.031280994415 seconds
Epoch: 2 Current loss: 0.231011
Accuracy on the validation set: 0.944000
Time elapsed: 2iter: 13.848378896713 seconds
Epoch: 3 Current loss: 0.167991
Accuracy on the validation set: 0.959800
Time elapsed: 3iter: 14.071791887283 seconds
Epoch: 4 Current loss: 0.135209
Accuracy on the validation set: 0.963700
Time elapsed: 4iter: 14.238609790802 seconds
Epoch: 5 Current loss: 0.113471
Accuracy on the validation set: 0.966800
Time elapsed: 5iter: 14.328102111816 seconds
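Notes: ① The CPU is a 4790K @ 4.4 GHz (with all 8 threads loaded the clock is probably lower than that; I did not check the exact figure). The GPU is an NVIDIA GTX 970.
② The CPU version ran for so long that I began to suspect the program was broken... But seeing the CPU working flat out at 100%, I could not bring myself to stop it. Only when the first epoch finished, after nearly 900 s, did I realize the program was probably fine... After the second epoch I stopped the test. The GPU version needs only about 14 s per epoch. As for the gap... well, the CPU run takes about 64 times as long as the GPU run.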
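The 64x figure can be reproduced from the logs themselves. Below is a minimal sketch that averages the "Time elapsed" lines of each run and divides the two means. It assumes each print lands on its own line, as the test program produces; the helper names and the log file names in the usage note are placeholders.

```shell
# Average the per-epoch seconds from log lines such as
#   Time elapsed: 1iter: 895.69850516319 seconds
# read on stdin (the number is the 4th whitespace-separated field).
mean_epoch_seconds() {
    awk '/^Time elapsed:/ { sum += $4; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# Ratio of two timings, e.g. CPU seconds over GPU seconds.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}
```

With the output of the two runs saved to cpu.log and gpu.log (hypothetical names), speedup "$(mean_epoch_seconds < cpu.log)" "$(mean_epoch_seconds < gpu.log)" gives a ratio of roughly 64 for the timings listed above.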
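Update 160727:
I also tested with a 780 and a K80. The 780 needs 18 s per epoch, and the K80... well, 23 s per epoch (using one of its GPU cores). Of course, this result holds only for the program here; I cannot say anything about other workloads.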
============================================================================================
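Update 170121
I tested the program above on a laptop with a 1060: about 10 s per epoch (other conditions are not guaranteed to be identical; this run uses CUDA 8.0). Even the mobile 1060 (10-series mobile parts no longer carry the "M" suffix, but their specifications still differ somewhat from the desktop versions) is a little faster than the desktop 970.
End of update 170121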
============================================================================================
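Update 170505
I set up Torch again, this time with a 1080 Ti. Yet the program above still takes 9 s per epoch (other conditions are not guaranteed to be identical; this run uses CUDA 8.0). In theory the 1080 Ti should be at least twice as fast as the 1060, but the gap does not show up with this program. Sigh... /(ㄒoㄒ)/~~
End of update 170505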
============================================================================================
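Update 170613
Training with TensorFlow, running one and the same program, one iteration takes a little over 1.2 s on a single K80 core and 0.36 s on the 1080 Ti. Here the performance gap does show. The reason it did not show before is that the test program above is too simple to fully exploit the 1080 Ti (I do not know whether it fully exploited the K80). This depends on the program, not on Torch versus TensorFlow: a complex program on Torch would show roughly the same gap between the two cards. With the new test program, GPU utilization on both the 1080 Ti and the K80 stays at about 90% to 100%, and only under that condition is the gap between the two cards really tested.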
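Before comparing two cards it helps to check that the benchmark actually keeps the GPU busy. The sketch below samples utilization once per second while training runs and prints the mean. It is an illustration only: the function names and the default of 30 samples are arbitrary choices, and on a machine with several GPUs it averages over all of them (add -i 0 to the nvidia-smi call to pick one).

```shell
# Average GPU utilization (percent) from one sample per line on stdin.
mean_utilization() {
    awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Collect N one-second samples (default 30) from nvidia-smi and average them.
# Runs in a subshell so the loop variables do not leak into the caller.
sample_gpu_utilization() (
    count=${1:-30}
    i=0
    while [ "$i" -lt "$count" ]; do
        nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
        sleep 1
        i=$((i + 1))
    done | mean_utilization
)
```

A mean well below 90% suggests the program is too light to separate a fast card from a slow one.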
End of update 170613
============================================================================================
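II. Downloading and installing from the official site
Update 170121
CUDA can be downloaded from https://developer.nvidia.com/cuda-downloads.
1. If you download the deb file
install it with the following commands: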
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
Then edit .bashrc in your home directory:
gedit ~/.bashrc
and add:
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
Then, in the terminal, run:
source ~/.bashrc
and then:
nvcc --version
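Every time .bashrc is sourced again, plain export lines prepend the same directories once more. A guarded variant avoids the duplicates. This is a sketch, not part of NVIDIA's instructions: prepend_once and CUDA_HOME are names introduced here, and the prefix assumes the default /usr/local/cuda-8.0 install location, with the libraries in lib64 directly under it.

```shell
# Print the colon-separated list $2 with directory $1 prepended,
# unless $1 is already one of its entries.
prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)
            if [ -n "$2" ]; then
                printf '%s\n' "$1:$2"
            else
                printf '%s\n' "$1"
            fi ;;
    esac
}

CUDA_HOME=/usr/local/cuda-8.0   # assumed install prefix (the default for CUDA 8.0)
PATH=$(prepend_once "$CUDA_HOME/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_once "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

These lines can go in .bashrc in place of the two export lines; sourcing the file repeatedly then leaves PATH unchanged after the first time.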

2. If you download the run file
In the terminal, run:
sudo sh cuda_8.0.61_375.26_linux.run
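Then follow the on-screen instructions. (I have not used this method, so I am not sure whether the PATH variables need to be added by hand. If nvcc is not recognized, add them to .bashrc as described for the deb file and run source ~/.bashrc.)
End of update 170121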
============================================================================================
