Pitfalls I hit while compiling tensorflow-gpu

Reposted article. 2019-04-07 10:15. Tags: nccl2, centos7, cuda10.1, tensorflow-gpu, cuda10.0

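Without further ado, here is the combination that finally worked: OS => CentOS 7, CUDA => 10.0, cuDNN => 7.5, NCCL => built from source, TensorFlow => latest version, built from source.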

First attempt: CUDA => 10.1, cuDNN => 7.5, NCCL => 2.4.2

1. CUDA comes as a *.run package. Run sh ./*.run and follow the prompts. Keeping the default path /usr/local/cuda makes the later steps easier.

Configure the environment by appending the following to the end of /etc/profile:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
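
After reloading /etc/profile, it is worth confirming that both entries really ended up in the environment. The following is a minimal sketch, assuming CUDA sits in the default /usr/local/cuda; the helper names path_contains and report_cuda_env are made up for illustration.

```shell
# path_contains LIST ENTRY
# Succeeds if the colon-separated LIST contains ENTRY exactly.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_cuda_env: print whether the two expected entries are present.
# Assumption: CUDA was installed to the default /usr/local/cuda.
report_cuda_env() {
    if path_contains "$PATH" "/usr/local/cuda/bin"; then
        echo "PATH: ok"
    else
        echo "PATH: /usr/local/cuda/bin missing"
    fi
    if path_contains "${LD_LIBRARY_PATH:-}" "/usr/local/cuda/lib64"; then
        echo "LD_LIBRARY_PATH: ok"
    else
        echo "LD_LIBRARY_PATH: /usr/local/cuda/lib64 missing"
    fi
}

# Usage (after the functions are defined in the current shell):
#   source /etc/profile && report_cuda_env
```

The match is exact, so a slightly mistyped directory is reported as missing rather than silently accepted.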

2. cuDNN extracts into a folder named cuda. Copy the header and library files into the matching CUDA directories:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

Make the files readable by all users:

sudo chmod a+r /usr/local/cuda/include/cudnn.h 
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
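
To double-check which cuDNN version was actually copied in, the version can be read from the header. This is a sketch that assumes a cuDNN 7.x header, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined directly in cudnn.h; the function name cudnn_version is hypothetical.

```shell
# cudnn_version HEADER
# Print MAJOR.MINOR.PATCH read from the #define lines of a cuDNN 7.x header.
cudnn_version() {
    local header="$1" major minor patch
    major=$(awk '$1 == "#define" && $2 == "CUDNN_MAJOR" {print $3}' "$header")
    minor=$(awk '$1 == "#define" && $2 == "CUDNN_MINOR" {print $3}' "$header")
    patch=$(awk '$1 == "#define" && $2 == "CUDNN_PATCHLEVEL" {print $3}' "$header")
    echo "${major}.${minor}.${patch}"
}

# Usage:
#   cudnn_version /usr/local/cuda/include/cudnn.h
```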

Check that nvcc works:

nvcc --version

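3. Install NCCL

The official site currently only offers the *.rpm format. I could not find the deb format that online guides mention, so I could not test it and installed from the rpm instead: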

rpm -ivh nccl*.rpm

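This step only unpacks the package, though: it extracts into a /var/nccl* directory containing three rpm files, which are then installed one by one with rpm.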

4. Install Bazel

Building TensorFlow requires Google's Bazel. Following online tutorials, I downloaded bazel-0.24.1-dist.zip, extracted it, and compiled it:

./compile.sh

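This failed with an error saying cmake had to be installed (see below).

The build then failed again. I have forgotten the exact error, and searching turned up nothing, so I downloaded bazel-0.24.1-installer-linux-x86_64.sh instead and ran it directly. That worked!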

5. Install cmake

Download a cmake release newer than 3.4, extract it, then build and install it:

./configure
gmake
make install

Set the environment variable:

PATH=/usr/local/cmake/bin:$PATH
export PATH
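
Version numbers do not compare correctly as plain strings (3.14 sorts before 3.4), so a quick check of the installed cmake is easier with a version-aware comparison. The sketch below relies on GNU sort -V and treats the requirement as "at least 3.4", which is my reading of it; the name version_ge is hypothetical.

```shell
# version_ge A B
# Succeeds if dotted version A is greater than or equal to B.
# Relies on the version ordering of GNU sort (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage: the first line of `cmake --version` reads "cmake version X.Y.Z".
#   version_ge "$(cmake --version | awk 'NR==1 {print $3}')" 3.4 && echo "cmake is new enough"
```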

6. Build TensorFlow

Follow the prompts to choose paths and plugins:

Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:10.1
Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.1]:  
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]: 2.4.2
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: 
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1] 
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

Run the build command:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

It fails with:

Cuda Configuration Error: No library found under: /usr/local/cuda-10.1/lib64/libcublas.so.10.1, /usr/local/cuda-10.1/lib64/stubs/libcublas.so.10.1, /usr/local/cuda-10.1/lib/powerpc64le-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x86_64-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x64/libcublas.so.10.1, /usr/local/cuda-10.1/lib/libcublas.so.10.1, /usr/local/cuda-10.1/libcublas.so.10.1

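Searching showed that most people consider CUDA 10.1 unusable for now, so I had to give up on it. Along the way I tried adding a symlink (https://github.com/tensorflow/tensorflow/issues/26289):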

sudo ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105 /usr/lib64/libcublas.so.10.0

Building again produced a new error:

Cuda Configuration Error: None of the libraries match their SONAME: /home/bernard/opt/cuda_test/cuda/lib64/libcublas.so.10.1
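
The SONAME in this message is the name recorded inside the shared library itself, and a symlink under a different file name does not change it. To see what a given library reports, the dynamic section can be read with readelf. This is only an inspection sketch; the helper names are made up, and it assumes readelf from binutils is installed.

```shell
# soname_from_readelf
# Read `readelf -d` output on stdin and print the SONAME entry, if any.
soname_from_readelf() {
    sed -n 's/.*(SONAME).*\[\(.*\)\].*/\1/p'
}

# lib_soname FILE
# Print the SONAME recorded inside a shared library (needs binutils readelf).
lib_soname() {
    readelf -d "$1" | soname_from_readelf
}

# Usage, with the library path from the symlink command above:
#   lib_soname /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105
```

If the printed name differs from the file name the build is looking for, renaming or linking the file will not satisfy the check.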

I decided to uninstall 10.1 and install 10.0 instead.

Second attempt: CUDA => 10.0, cuDNN => 7.5, NCCL => 2.4.2

1. Download the CUDA 10.0 installer; everything else stays the same.

2. Building TensorFlow now fails with a new error:

fatal error: nccl.h: No such file or directory

nccl.h cannot be found, which means the rpm-based installation above did not work.
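
Before reinstalling anything, it helps to find out whether nccl.h exists anywhere the build might look. A minimal sketch follows; the function name find_nccl_header is hypothetical and the directories in the usage line are only guesses at likely locations.

```shell
# find_nccl_header DIR...
# Print every nccl.h found under the given directories; fail if there is none.
# -H makes find follow symlinks given on the command line (e.g. /usr/local/cuda).
find_nccl_header() {
    local found
    found=$(find -H "$@" -name nccl.h 2>/dev/null)
    if [ -z "$found" ]; then
        echo "nccl.h not found under: $*" >&2
        return 1
    fi
    echo "$found"
}

# Usage (directories are assumptions, adjust to your system):
#   find_nccl_header /usr/include /usr/local/cuda /usr/local/cuda-10.0
```

If the header turns up outside the directory given at the NCCL location prompt, the package may be installed but simply not where the build expects it.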

Searching suggested installing libnccl2, libnccl-dev and libnccl-static, but the online guides are all for Ubuntu and use apt-get, while CentOS only has yum. Trying it with yum fails:

No package "libnccl" available

3. Uninstall NCCL with rpm, then build and install NCCL from source

Clone the nccl project from GitHub, then build and install it:

cd nccl
make -j src.build
make src.build
yum install build-essential devscripts debhelper
make pkg.debian.build
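
The package names and the pkg.debian.build target above are the Debian/Ubuntu ones. As far as I know from the NCCL README, Red Hat-family systems such as CentOS use rpm-build and rpmdevtools together with make pkg.redhat.build, and there is also an OS-agnostic pkg.txz.build. The sketch below picks a target from /etc/os-release; the function name nccl_pkg_target and the mapping are my own assumptions, not part of NCCL.

```shell
# nccl_pkg_target [OS_RELEASE_FILE]
# Print the NCCL packaging make target that matches the distribution family.
nccl_pkg_target() {
    local file="${1:-/etc/os-release}" id="" like=""
    # Source the file in subshells so its variables do not leak out.
    id=$( . "$file" && echo "${ID:-}" )
    like=$( . "$file" && echo "${ID_LIKE:-}" )
    case " $id $like " in
        *" rhel "*|*" centos "*|*" fedora "*) echo "pkg.redhat.build" ;;
        *" debian "*|*" ubuntu "*) echo "pkg.debian.build" ;;
        *) echo "pkg.txz.build" ;;
    esac
}

# Usage on CentOS, from inside the nccl checkout:
#   sudo yum install rpm-build rpmdevtools
#   make "$(nccl_pkg_target)"
```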

4. Rebuild TensorFlow

Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:
Please specify the location where CUDA 10.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:  
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]: 
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: 
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1] 
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

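Only the answers marked in red in the original post changed (the CUDA version and the NCCL version, both now left at their defaults); everything else is the same. The build finished after roughly an hour.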

Build the whl package:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install it with pip:

pip install /tmp/tensorflow_pkg/*.whl

[Screenshot: successful installation]

5. Test whether TensorFlow can use the GPU

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

This raised a very strange error.

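At first I thought the tensorboard dependency had not been built, but the source shows that nothing extra needs to be downloaded. Finally I checked the timestamps of the tensorboard files and found that an earlier installation had not been removed cleanly. After removing it with pip uninstall and reinstalling, everything worked.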

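Summary

In fact, once CUDA and cuDNN are installed you can simply pip install tensorflow-gpu without compiling anything yourself (so cmake and bazel are not needed either). I assumed the latest version was not available that way, so I built it myself, and only later found that the prebuilt package is itself built against CUDA 10.0. Still, a build tailored to your own system always works better, haha!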

With the direct install (screenshot in the original post), AVX2 is not used, so building it yourself is still somewhat better.
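
A source build only gains AVX2 if the CPU of the build machine supports it, since the default optimization flag in the prompts above is -march=native. A quick sketch for checking this on Linux, assuming the x86 /proc/cpuinfo layout with a "flags" line; the helper name cpu_has_flag is hypothetical.

```shell
# cpu_has_flag FLAG [CPUINFO_FILE]
# Succeeds if the first "flags" line of the cpuinfo file lists FLAG as a whole word.
cpu_has_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

# Usage:
#   cpu_has_flag avx2 && echo "AVX2 supported" || echo "AVX2 not supported"
```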
