(Original) Installing the CUDA toolkit on Ubuntu 16

Reposted article (originally published 2016-07-09 15:21, category: linux)

Please cite the source when reposting:

http://www.cnblogs.com/darkknightzh/p/5655957.html

Reference URLs:

https://devtalk.nvidia.com/default/topic/862537/cuda-setup-and-installation/installing-cuda-toolkit-on-ubuntu-14-04/

http://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda

http://blog.csdn.net/revolver/article/details/49682131

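I. Installing directly from the terminal

Note: NVIDIA does not provide a CUDA toolkit for Ubuntu 16 (as of this writing), so the method described here is not guaranteed to work. The installation succeeded on my machine, which frankly feels like blind luck. I did not install the samples; the only result I verified is that other programs can now use the GPU.

1. Following the first reference, run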

sudo apt-get install nvidia-cuda-toolkit

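to install the CUDA toolkit. The download is slow, so how long it takes depends on your connection. The first reference also reports a problem after rebooting Ubuntu ("I can't log in to my computer and end up in infinite login screen"). After installing, I was able to log in normally and did not run into that problem.

2. Version information after installation:

The installed version is 7.5.17 rather than the latest 7.5.18, but it works, and that is good enough.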
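
One way to confirm which toolkit version was installed is to parse the output of nvcc --version. The sketch below assumes nvcc prints a line of the form "Cuda compilation tools, release 7.5, V7.5.17"; the helper name parse_nvcc_version is made up for this example and is not part of CUDA.

```shell
# Extract the full toolkit version (for example 7.5.17) from
# `nvcc --version` output supplied on stdin.
parse_nvcc_version() {
    sed -n 's/.*release [0-9.]*, V\([0-9.]*\).*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | parse_nvcc_version
else
    echo "nvcc not found on PATH"
fi
```

If nothing is printed even though nvcc exists, the output format probably differs from the one assumed here, and the sed pattern needs adjusting.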

3. In the second reference, qed gives a command that continuously shows the current GPU utilization in the terminal (NVIDIA cards only):

nvidia-smi -l 1

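Result (screenshot omitted):

Note: the command above apparently only works if the graphics card supports it. Alternatively, you can use the command provided by Jonathan (not tested yet):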

watch -n0.1 "nvidia-settings -q GPUUtilization -q useddedicatedgpumemory"

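160713 note: a. This command prints the following information (screenshot omitted):

b. In effect, the command just prints to the terminal some of the information shown in 'NVIDIA X Server Settings' (screenshot omitted). The NVIDIA X Server Settings launcher is located in /usr/share/applications; you can also open the application directly to view the values.

c. The GPU in that screenshot is not the one used earlier, so some parameters (video memory, for example) differ.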
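
If you prefer machine-readable numbers to the full nvidia-smi table, nvidia-smi also has a CSV query mode. A minimal sketch, assuming a driver whose nvidia-smi supports --query-gpu; format_gpu_sample is a hypothetical helper name introduced for this example.

```shell
# Turn one CSV sample such as "87, 1024" into a readable line.
format_gpu_sample() {
    awk -F', *' '{ printf "util=%s%% mem=%sMiB\n", $1, $2 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Take one sample; add "-l 1" to the nvidia-smi call to repeat every second.
    nvidia-smi --query-gpu=utilization.gpu,memory.used \
               --format=csv,noheader,nounits | format_gpu_sample
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```

On cards that do not report utilization, the first field may come back as a "not supported" marker instead of a number, which matches the caveat noted above.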

4. After installing CUDA, I installed cutorch and then cunn; both installed successfully, and programs that use the GPU run normally.
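
For reference, with a Torch7 installation that provides luarocks, the two packages are typically installed as shown below, in the same order as described above. This is only a sketch: the run wrapper and the DRY_RUN variable are conventions invented for this example, and with DRY_RUN left at its default of 1 the commands are printed rather than executed.

```shell
# Print each command; execute it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# cutorch is the CUDA backend for torch; cunn provides the CUDA
# implementations of the nn modules and is installed second.
install_torch_gpu_rocks() {
    run luarocks install cutorch &&
        run luarocks install cunn
}

install_torch_gpu_rocks
```

Run the script with DRY_RUN=0 in the environment to perform the installation for real.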

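5. The third reference provides a test program. I modified it slightly so that it prints the time taken by each loop iteration (epoch). The CPU and GPU versions of the code are nearly identical:

① CPU version: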

require 'torch'
require 'nn'
require 'optim'
--require 'cunn'
--require 'cutorch'
mnist = require 'mnist'

fullset = mnist.traindataset()
testset = mnist.testdataset()

trainset = {
    size = 50000,
    data = fullset.data[{{1,50000}}]:double(),
    label = fullset.label[{{1,50000}}]
}

validationset = {
    size = 10000,
    data = fullset.data[{{50001,60000}}]:double(),
    label = fullset.label[{{50001,60000}}]
}

trainset.data = trainset.data - trainset.data:mean()
validationset.data = validationset.data - validationset.data:mean()


model = nn.Sequential()
model:add(nn.Reshape(1, 28, 28))
model:add(nn.MulConstant(1/256.0*3.2))
model:add(nn.SpatialConvolutionMM(1, 20, 5, 5, 1, 1, 0, 0))
model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
model:add(nn.SpatialConvolutionMM(20, 50, 5, 5, 1, 1, 0, 0))
model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
model:add(nn.Reshape(4*4*50))
model:add(nn.Linear(4*4*50, 500))
model:add(nn.ReLU())
model:add(nn.Linear(500, 10))
model:add(nn.LogSoftMax())

model = require('weight-init')(model, 'xavier')

criterion = nn.ClassNLLCriterion()

--model = model:cuda()
--criterion = criterion:cuda()
--trainset.data = trainset.data:cuda()
--trainset.label = trainset.label:cuda()
--validationset.data = validationset.data:cuda()
--validationset.label = validationset.label:cuda()--[[]]

sgd_params = {
   learningRate = 1e-2,
   learningRateDecay = 1e-4,
   weightDecay = 1e-3,
   momentum = 1e-4
}

x, dl_dx = model:getParameters()

step = function(batch_size)
    local current_loss = 0
    local count = 0
    local shuffle = torch.randperm(trainset.size)
    batch_size = batch_size or 200
    for t = 1,trainset.size,batch_size do
        -- setup inputs and targets for this mini-batch
        -- (note: size evaluates to batch_size - 1, and shuffle[t] itself is never used)
        local size = math.min(t + batch_size - 1, trainset.size) - t
        local inputs = torch.Tensor(size, 28, 28)--:cuda()
        local targets = torch.Tensor(size)--:cuda()
        for i = 1,size do
            local input = trainset.data[shuffle[i+t]]
            local target = trainset.label[shuffle[i+t]]
            -- if target == 0 then target = 10 end
            inputs[i] = input
            targets[i] = target
        end
        targets:add(1)
        local feval = function(x_new)
            -- reset data
            if x ~= x_new then x:copy(x_new) end
            dl_dx:zero()

            -- perform mini-batch gradient descent
            local loss = criterion:forward(model:forward(inputs), targets)
            model:backward(inputs, criterion:backward(model.output, targets))

            return loss, dl_dx
        end

        _, fs = optim.sgd(feval, x, sgd_params)

        -- fs is a table containing value of the loss function
        -- (just 1 value for the SGD optimization)
        count = count + 1
        current_loss = current_loss + fs[1]
    end

    -- normalize loss
    return current_loss / count
end

eval = function(dataset, batch_size)
    local count = 0
    batch_size = batch_size or 200
    
    for i = 1,dataset.size,batch_size do
        -- note: size evaluates to batch_size - 1, so the last sample of each batch is skipped
        local size = math.min(i + batch_size - 1, dataset.size) - i
        local inputs = dataset.data[{{i,i+size-1}}]--:cuda()
        local targets = dataset.label[{{i,i+size-1}}]:long()--:cuda()
        local outputs = model:forward(inputs)
        local _, indices = torch.max(outputs, 2)
        indices:add(-1)
        local guessed_right = indices:eq(targets):sum()
        count = count + guessed_right
    end

    return count / dataset.size
end

max_iters = 5

do
    local last_accuracy = 0
    local decreasing = 0
    local threshold = 1 -- how many decreasing epochs we allow
    for i = 1,max_iters do
        timer = torch.Timer()
      
        local loss = step()
        print(string.format('Epoch: %d Current loss: %4f', i, loss))
        local accuracy = eval(validationset)
        print(string.format('Accuracy on the validation set: %4f', accuracy))
        if accuracy < last_accuracy then
            if decreasing > threshold then break end
            decreasing = decreasing + 1
        else
            decreasing = 0
        end
        last_accuracy = accuracy
        
        print('Time elapsed: ' .. i .. 'iter: ' .. timer:time().real .. ' seconds')
    end
end

testset.data = testset.data:double()
eval(testset)

② GPU version:

  1 require 'torch'
  2 require 'nn'
  3 require 'optim'
  4 require 'cunn'
  5 require 'cutorch'
  6 mnist = require 'mnist'
  7 
  8 fullset = mnist.traindataset()
  9 testset = mnist.testdataset()
 10 
 11 trainset = {
 12     size = 50000,
 13     data = fullset.data[{{1,50000}}]:double(),
 14     label = fullset.label[{{1,50000}}]
 15 }
 16 
 17 validationset = {
 18     size = 10000,
 19     data = fullset.data[{{50001,60000}}]:double(),
 20     label = fullset.label[{{50001,60000}}]
 21 }
 22 
 23 trainset.data = trainset.data - trainset.data:mean()
 24 validationset.data = validationset.data - validationset.data:mean()
 25 
 26 
 27 model = nn.Sequential()
 28 model:add(nn.Reshape(1, 28, 28))
 29 model:add(nn.MulConstant(1/256.0*3.2))
 30 model:add(nn.SpatialConvolutionMM(1, 20, 5, 5, 1, 1, 0, 0))
 31 model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
 32 model:add(nn.SpatialConvolutionMM(20, 50, 5, 5, 1, 1, 0, 0))
 33 model:add(nn.SpatialMaxPooling(2, 2 , 2, 2, 0, 0))
 34 model:add(nn.Reshape(4*4*50))
 35 model:add(nn.Linear(4*4*50, 500))
 36 model:add(nn.ReLU())
 37 model:add(nn.Linear(500, 10))
 38 model:add(nn.LogSoftMax())
 39 
 40 model = require('weight-init')(model, 'xavier')
 41 
 42 criterion = nn.ClassNLLCriterion()
 43 
 44 model = model:cuda()
 45 criterion = criterion:cuda()
 46 trainset.data = trainset.data:cuda()
 47 trainset.label = trainset.label:cuda()
 48 validationset.data = validationset.data:cuda()
 49 validationset.label = validationset.label:cuda()--[[]]
 50 
 51 sgd_params = {
 52    learningRate = 1e-2,
 53    learningRateDecay = 1e-4,
 54    weightDecay = 1e-3,
 55    momentum = 1e-4
 56 }
 57 
 58 x, dl_dx = model:getParameters()
 59 
 60 step = function(batch_size)
 61     local current_loss = 0
 62     local count = 0
 63     local shuffle = torch.randperm(trainset.size)
 64     batch_size = batch_size or 200
 65     for t = 1,trainset.size,batch_size do
 66         -- setup inputs and targets for this mini-batch
 67         local size = math.min(t + batch_size - 1, trainset.size) - t
 68         local inputs = torch.Tensor(size, 28, 28):cuda()
 69         local targets = torch.Tensor(size):cuda()
 70         for i = 1,size do
 71             local input = trainset.data[shuffle[i+t]]
 72             local target = trainset.label[shuffle[i+t]]
 73             -- if target == 0 then target = 10 end
 74             inputs[i] = input
 75             targets[i] = target
 76         end
 77         targets:add(1)
 78         local feval = function(x_new)
 79             -- reset data
 80             if x ~= x_new then x:copy(x_new) end
 81             dl_dx:zero()
 82 
 83             -- perform mini-batch gradient descent
 84             local loss = criterion:forward(model:forward(inputs), targets)
 85             model:backward(inputs, criterion:backward(model.output, targets))
 86 
 87             return loss, dl_dx
 88         end
 89 
 90         _, fs = optim.sgd(feval, x, sgd_params)
 91 
 92         -- fs is a table containing value of the loss function
 93         -- (just 1 value for the SGD optimization)
 94         count = count + 1
 95         current_loss = current_loss + fs[1]
 96     end
 97 
 98     -- normalize loss
 99     return current_loss / count
100 end
101 
102 eval = function(dataset, batch_size)
103     local count = 0
104     batch_size = batch_size or 200
105     
106     for i = 1,dataset.size,batch_size do
107         local size = math.min(i + batch_size - 1, dataset.size) - i
108         local inputs = dataset.data[{{i,i+size-1}}]:cuda()
109         local targets = dataset.label[{{i,i+size-1}}]:long():cuda()
110         local outputs = model:forward(inputs)
111         local _, indices = torch.max(outputs, 2)
112         indices:add(-1)
113         local guessed_right = indices:eq(targets):sum()
114         count = count + guessed_right
115     end
116 
117     return count / dataset.size
118 end
119 
120 max_iters = 5
121 
122 do
123     local last_accuracy = 0
124     local decreasing = 0
125     local threshold = 1 -- how many decreasing epochs we allow
126     for i = 1,max_iters do
127         timer = torch.Timer()
128       
129         local loss = step()
130         print(string.format('Epoch: %d Current loss: %4f', i, loss))
131         local accuracy = eval(validationset)
132         print(string.format('Accuracy on the validation set: %4f', accuracy))
133         if accuracy < last_accuracy then
134             if decreasing > threshold then break end
135             decreasing = decreasing + 1
136         else
137             decreasing = 0
138         end
139         last_accuracy = accuracy
140         
141         print('Time elapsed: ' .. i .. 'iter: ' .. timer:time().real .. ' seconds')
142     end
143 end
144 
145 testset.data = testset.data:double()
146 eval(testset)

==================================================================================

170121 update:

I re-ran the program above today and got the following error:

Epoch: 1 Current loss: 0.652170	
/home/XXX/torch/install/bin/luajit: testGPU.lua:113: invalid arguments: CudaLongTensor CudaTensor 
expected arguments: [*CudaByteTensor*] CudaLongTensor long | *CudaLongTensor* CudaLongTensor long | [*CudaByteTensor*] CudaLongTensor CudaLongTensor | *CudaLongTensor* CudaLongTensor CudaLongTensor
stack traceback:
	[C]: in function 'eq'
	testGPU.lua:113: in function 'eval'
	testGPU.lua:131: in main chunk
	[C]: in function 'dofile'
	...gram/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50

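Inserting the following statement at line 113 of the GPU code, so that it runs before the indices:eq(targets) call, lets the program run successfully: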

indices=indices:cuda()

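Truly bizarre... Judging from the error message, indices is now a CudaLongTensor while targets is a CudaTensor, and eq does not accept that combination; converting indices with :cuda() makes the two types match.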
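
To apply that one-line change without opening an editor, a generic insert-before-line helper can be used. This is only a sketch: insert_before_line is a hypothetical helper written for this example, and it assumes the script is saved as testGPU.lua (the name that appears in the traceback) with the same line numbering as the GPU listing.

```shell
# Insert text ($3) immediately before line number $2 of file $1.
insert_before_line() {
    awk -v n="$2" -v text="$3" 'NR == n { print text } { print }' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Patch the script only if it is present in the current directory.
if [ -f testGPU.lua ]; then
    cp testGPU.lua testGPU.lua.bak   # keep a backup of the unpatched script
    insert_before_line testGPU.lua 113 '        indices = indices:cuda()'
fi
```

Run it once only: every additional run inserts the statement again and shifts the following lines down by one.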

End of 170121 update

==================================================================================

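6. CPU and GPU utilization

① CPU version

CPU usage (screenshot omitted):

GPU usage (screenshot omitted):

② GPU version

CPU usage (screenshot omitted):

GPU usage (screenshot omitted):

7. As the screenshots showed, the CPU version uses every CPU core while the GPU sits essentially idle. The GPU version fully loads only one CPU core (thread) and leaves the others idle, while GPU utilization is already very high.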

8. Timing comparison

CPU version:

Epoch: 1 Current loss: 0.619644
Accuracy on the validation set: 0.924800
Time elapsed: 1iter: 895.69850516319 seconds
Epoch: 2 Current loss: 0.225129
Accuracy on the validation set: 0.949000
Time elapsed: 2iter: 914.15352702141 seconds

GPU version:

Epoch: 1 Current loss: 0.687380
Accuracy on the validation set: 0.925300
Time elapsed: 1iter: 14.031280994415 seconds
Epoch: 2 Current loss: 0.231011
Accuracy on the validation set: 0.944000
Time elapsed: 2iter: 13.848378896713 seconds
Epoch: 3 Current loss: 0.167991
Accuracy on the validation set: 0.959800
Time elapsed: 3iter: 14.071791887283 seconds
Epoch: 4 Current loss: 0.135209
Accuracy on the validation set: 0.963700
Time elapsed: 4iter: 14.238609790802 seconds
Epoch: 5 Current loss: 0.113471
Accuracy on the validation set: 0.966800
Time elapsed: 5iter: 14.328102111816 seconds

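Notes: ① The CPU is a 4790K @ 4.4 GHz (with all 8 threads loaded the clock is probably lower than that; I did not check the exact figure). The GPU is an NVIDIA GTX 970.

② The CPU version took so long that I began to suspect the program was broken. But with the CPU working flat out at 100% the whole time, I could not bring myself to interrupt it. Only when the first epoch finished, after nearly 900 s, did I realize the program was probably fine. I stopped the test once the second epoch finished. The GPU version needs only about 14 s per epoch, so the CPU run takes roughly 64 times as long as the GPU run.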
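
The ratio quoted above can be reproduced from the log lines themselves. The sketch below averages the per-epoch times printed by the two runs and divides them; mean_epoch_time and speedup are helper names invented for this example, and the log lines fed to them are copied from the timings listed above.

```shell
# Average the seconds reported on "Time elapsed: ..." lines read from stdin.
mean_epoch_time() {
    awk '/Time elapsed/ { sum += $(NF - 1); n++ }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Ratio of two times ($1 / $2), rounded to the nearest integer.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.0f\n", slow / fast }'
}

cpu_mean="$(printf '%s\n' \
    'Time elapsed: 1iter: 895.69850516319 seconds' \
    'Time elapsed: 2iter: 914.15352702141 seconds' | mean_epoch_time)"

gpu_mean="$(printf '%s\n' \
    'Time elapsed: 1iter: 14.031280994415 seconds' \
    'Time elapsed: 2iter: 13.848378896713 seconds' \
    'Time elapsed: 3iter: 14.071791887283 seconds' \
    'Time elapsed: 4iter: 14.238609790802 seconds' \
    'Time elapsed: 5iter: 14.328102111816 seconds' | mean_epoch_time)"

echo "CPU ${cpu_mean}s, GPU ${gpu_mean}s, ratio about $(speedup "$cpu_mean" "$gpu_mean")x"
```

To use it on your own runs, redirect the program output to a file and pipe that file into mean_epoch_time instead of the hard-coded lines.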

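160727 update:

I also tested with a 780 and a K80. The 780 takes 18 s per epoch, and the K80 takes 23 s per epoch (using a single core). These results apply only to this particular program; I cannot say how other workloads would compare.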

============================================================================================

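170121 update

I ran the program above on a laptop with a 1060 GPU: one epoch takes 10 s (other conditions are not guaranteed to be identical; this run used CUDA 8.0). Although the 10-series mobile parts no longer carry the "m" suffix, their specifications still differ somewhat from the desktop versions. Even so, the mobile 1060 is a bit faster than the desktop 970.

End of 170121 update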

============================================================================================

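170505 update

I set up torch again, this time with a 1080Ti. Running the program above, however, one epoch still takes 9 s (other conditions are not guaranteed to be identical; this run used CUDA 8.0). In theory the 1080Ti should be at least twice as fast as the 1060, but this program does not show that gap. Disheartening... /(ㄒoㄒ)/~~

End of 170505 update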

============================================================================================

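170613 update

Training with tensorflow and running the same program on both cards, one iteration takes a little over 1.2 s on a single K80 core and 0.36 s on the 1080Ti, so here the performance gap does show. The gap was hidden earlier because the test program above is too simple to make full use of the 1080Ti (I do not know whether it fully loaded the K80). This depends on the program, not on torch versus tensorflow: with a sufficiently complex program in torch, the gap between the two cards would be about the same. With the new test program, GPU utilization on both the 1080Ti and the K80 stays at roughly 90% to 100%, and only under that kind of load can the real performance difference between the two cards be measured.

End of 170613 update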

============================================================================================

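II. Downloading from the official website and installing

170121 update

CUDA can be downloaded from https://developer.nvidia.com/cuda-downloads.

1. If you download the deb file

install it with the following commands: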

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

Then edit .bashrc:

gedit ~/.bashrc

and add:

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

Then, in the terminal, run:

source ~/.bashrc

and verify the installation with:

nvcc --version
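
If you would rather not edit ~/.bashrc by hand, the two export lines can be appended from the shell. A minimal sketch, assuming the toolkit was installed under /usr/local/cuda-8.0; append_once is a hypothetical helper defined here so that rerunning the snippet does not add duplicate lines, and the demonstration writes to a scratch file rather than to your real configuration.

```shell
# Append a line ($2) to a file ($1) only if that exact line is not already present.
append_once() {
    touch "$1"
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Demonstrate on a scratch file; pass "$HOME/.bashrc" instead to change your real setup.
rc_file="$(mktemp)"
append_once "$rc_file" 'export PATH=/usr/local/cuda-8.0/bin:$PATH'
append_once "$rc_file" 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH'
# Repeating a call with an identical line is a no-op.
append_once "$rc_file" 'export PATH=/usr/local/cuda-8.0/bin:$PATH'
cat "$rc_file"
```

The single quotes matter: they keep $PATH and $LD_LIBRARY_PATH unexpanded, so the variables are evaluated each time the shell starts rather than once when the line is written.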

2. If you download the run file

In the terminal, run:

sudo sh cuda_8.0.61_375.26_linux.run

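then follow the installer's prompts. (I have not used this method myself, so I am not sure whether the PATH variable must be added manually. If nvcc is not recognized, add the PATH entries to ~/.bashrc and then run source ~/.bashrc.)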
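
A quick way to tell whether the PATH entry is still missing after a run-file install is to test for the directory in PATH and for nvcc itself. The sketch assumes the install prefix /usr/local/cuda-8.0; path_contains is a helper name made up for this example.

```shell
# Succeed if directory $1 appears as an entry of the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

cuda_bin=/usr/local/cuda-8.0/bin   # assumed install location

if path_contains "$cuda_bin" "$PATH"; then
    echo "$cuda_bin is already on PATH"
else
    echo "$cuda_bin is not on PATH; add the export lines to ~/.bashrc"
fi

if command -v nvcc >/dev/null 2>&1; then
    echo "nvcc found at $(command -v nvcc)"
else
    echo "nvcc not found"
fi
```

Wrapping both the list and the directory in colons ensures that only whole PATH entries match, so /usr/local/cuda-8.0 alone is not mistaken for /usr/local/cuda-8.0/bin.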

End of 170121 update

============================================================================================
