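This series is basically a record of my own work: I write up each step as I reach it.
I recently installed a GPU (GTX 750 Ti) together with CUDA 9.0 and cuDNN 7.1, so I want TensorFlow to run on the GPU. This also makes good on an earlier promise.
That is the motivation. Because the prebuilt TensorFlow packages do not yet work well with the newer CUDA releases, building from source is the only option.
The process breaks down roughly into the following steps: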
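1. Install the dependencies (I had already done this, so I will not cover it here; see the dependency list in the earlier posts, which is essentially the same)
2. Install Git (skip this step if you already have it)
3. Install bazel, TensorFlow's build tool
4. Configure and compile the TensorFlow source
5. Install and set up the environment variables
1. Install the dependencies
The JDK is required by bazel:
sudo apt-get install openjdk-8-jdk
2. Install Git
Install Git and clone the TensorFlow source with: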
sudo apt-get install git
git clone --recursive https://github.com/tensorflow/tensorflow
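3. Install bazel, TensorFlow's build tool
This step is a bit more troublesome because bazel is not available through apt-get.
You therefore have to download it from GitHub first and then install it. The download page is https://github.com/bazelbuild/bazel/releases
Pick the correct version to download. You need to check which bazel versions your TensorFlow checkout requires; the exact requirement is in configure.py. My version, for example, contains this: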
_TF_BAZELRC_FILENAME = '.tf_configure.bazelrc'
_TF_WORKSPACE_ROOT = ''
_TF_BAZELRC = ''
_TF_CURRENT_BAZEL_VERSION = None
_TF_MIN_BAZEL_VERSION = '0.27.1'
_TF_MAX_BAZEL_VERSION = '1.1.0'
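The meaning of each field is clear from its name. _TF_BAZELRC_FILENAME is the configuration file used when building with bazel (I have not studied it in detail; https://www.cnblogs.com/shouhuxianjian/p/9416934.html has an explanation). _TF_MIN_BAZEL_VERSION = '0.27.1' is the minimum bazel version required, and _TF_MAX_BAZEL_VERSION is the maximum.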
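To make the version range concrete, here is a minimal sketch of checking a bazel version string against those two bounds. It is not the code configure.py uses; the function names `parse_version` and `bazel_version_in_range` are made up for illustration, and the sketch assumes the version is a plain dotted string of integers with no suffix.

```python
def parse_version(text):
    """Turn '0.27.1' into (0, 27, 1) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def bazel_version_in_range(current, minimum="0.27.1", maximum="1.1.0"):
    """Return True if minimum <= current <= maximum (bounds are inclusive)."""
    return parse_version(minimum) <= parse_version(current) <= parse_version(maximum)


# Hypothetical version strings, used only to exercise the function.
for candidate in ("0.14.1", "0.29.1", "1.2.0"):
    print(candidate, bazel_version_in_range(candidate))
```

Comparing tuples of integers rather than raw strings matters here: as strings, '0.9.0' sorts after '0.27.1', which would give the wrong answer.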
Make the downloaded .sh installer executable and run it with sudo:
sudo chmod +x ./bazel*.sh
sudo ./bazel-0.*.sh
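4. Configure and compile the TensorFlow source
Configuration comes first; you can select and trim features to suit your needs. This step is particularly tedious because there are many options to answer. My choices were: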
jourluohua@jour:~/tools/tensorflow$ ./configure
WARNING: Running Bazel server needs to be killed, because the startup options are different.
You have bazel 0.14.1 installed.
Please specify the location of python. [Default is /usr/bin/python]:

Found possible Python library paths:
  /usr/local/lib/python2.7/dist-packages
  /usr/lib/python2.7/dist-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: y
GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: y
VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 8

Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Do you wish to build TensorFlow with TensorRT support? [y/N]: N
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]:

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 5.0]

Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]: N
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
    --config=mkl         # Build with MKL support.
    --config=monolithic  # Config for mostly static monolithic build.
Configuration finished
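Then build with bazel (this step is very error-prone and also very time-consuming). Passing -c opt builds the release version and -c dbg builds the debug version. The second command below packages the build output into a pip wheel under /tmp/tensorflow_pkg.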
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
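Many problems come up along the way. Here are some of the errors that are hard to look up.
1) You may hit a CXX error whose actual cause is hard to track down, because the output only shows which line of which build file failed and not the specific error. To see the detailed error message, add the --verbose_failures option.
2) A CXX error may also be caused by the gcc version. (Anyone who does build work knows that mature C++ code is usually stable, compatible and easy to port, so compiler and environment problems are uncommon.) In that case you can add --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"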
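The optional flags mentioned in this troubleshooting list can be hard to keep straight, so here is a small sketch that assembles the build command from them. The helper name `bazel_build_command` and its parameters are invented for this illustration; only the flags themselves (--verbose_failures, --cxxopt, and the -j option discussed under the memory error) come from this post. It only prints the command and does not run bazel.

```python
import shlex


def bazel_build_command(verbose_failures=False, old_cxx_abi=False, jobs=None):
    """Build the argument list for the pip-package build, adding optional flags."""
    command = ["bazel", "build", "--config=opt", "--config=cuda"]
    if verbose_failures:
        # Show the full failing command instead of just the build file and line.
        command.append("--verbose_failures")
    if old_cxx_abi:
        # Work around gcc ABI mismatches.
        command.append("--cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0")
    if jobs is not None:
        # Limit the number of parallel build jobs to reduce memory pressure.
        command.extend(["-j", str(jobs)])
    command.append("//tensorflow/tools/pip_package:build_pip_package")
    return command


# Print a shell-ready command line with all three options enabled.
print(" ".join(shlex.quote(arg) for arg in bazel_build_command(True, True, 2)))
```

Building the command as a list keeps each flag a separate argument, so it could also be handed to subprocess.call without shell quoting issues.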
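3) The error virtual memory exhausted: Cannot allocate memory. This happens because no swap space is configured or the swap space is too small; running free -m confirms this. The fix is to enlarge the swap space. The commands are roughly as follows: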
mkdir /home/jourluohua/swap
rm -rf /home/jourluohua/swap
dd if=/dev/zero of=/home/jourluohua/swap bs=1024 count=4096000
mkswap /home/jourluohua/swap
sudo swapon /home/jourluohua/swap
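This writes 4,096,000 blocks of 1024 bytes each, about 4 GB in total. If the problem still persists, note that bazel builds with multiple parallel jobs by default; you can add the -j 2 option manually to limit the build to 2 jobs.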
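The size arithmetic behind the dd command can be checked with a few lines. This is only a worked calculation; the function name `dd_file_size` is made up for the example.

```python
def dd_file_size(block_size_bytes, block_count):
    """Total number of bytes dd writes for the given bs= and count= values."""
    return block_size_bytes * block_count


size = dd_file_size(1024, 4096000)
print(size)                # total bytes
print(size / 1000.0 ** 3)  # decimal gigabytes (GB)
print(size / 1024.0 ** 3)  # binary gibibytes (GiB)
```

The result is 4,194,304,000 bytes, which is a little over 4 GB in decimal units and a little under 4 GiB in binary units, so "4G" is a fair round figure either way.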
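4) The error AttributeError: 'module' object has no attribute 'IntEnum'. This one is rather obscure: python -c "import enum" runs without error, yet the module really has no IntEnum attribute. After some searching, the fix turned out to be installing the enum34 package. One of Python's weaker points is how messy its packages can be.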
pip install enum34 --user
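A quick way to tell whether the installed enum module is the one TensorFlow needs is to test for the attribute directly instead of only importing the module. The sketch below assumes nothing beyond the standard library; the helper name `module_has_attr` is invented for illustration.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True if the module imports cleanly and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)


# On Python 2.7 without enum34 this may print False even though
# "import enum" succeeds, which is the situation described above.
print(module_has_attr("enum", "IntEnum"))
```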
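5) The error AttributeError: attribute '__doc__' of 'type' objects is not writable. This one is actually quite tricky. My own field is computer architecture and I mostly work in C++, so I am not very familiar with Python; perhaps my build environment was at fault? A quick check showed that __doc__ is the docstring attribute in Python.
I first wrote a small program to reproduce the problem: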
#!/usr/bin/python
from functools import wraps
#from https://stackoverflow.com/questions/39010366/functools-wrapper-attributeerror-attribute-doc-of-type-objects-is-not
def memoize(f):
    """ Memoization decorator for functions taking one or more arguments.
        Saves repeated api calls for a given value, by caching it.
    """
    @wraps(f)
    class memodict(dict):
        """memodict"""
        def __init__(self, f):
            self.f = f
        def __call__(self, *args):
            return self[args]
        def __missing__(self, key):
            ret = self[key] = self.f(*key)
            return ret
    return memodict(f)

@memoize
def a():
    """blah"""
    pass
It produced the same error:
Traceback (most recent call last):
  File "ipy.py", line 20, in <module>
    @memoize
  File "ipy.py", line 9, in memoize
    class memodict(dict):
  File "/usr/lib/python2.7/functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: attribute '__doc__' of 'type' objects is not writable
Opening the Python file that triggers the error, the original code looks like this:
@tf_export(v1=["VariableAggregation"])
class VariableAggregation(enum.Enum):
  NONE = 0
  SUM = 1
  MEAN = 2
  ONLY_FIRST_REPLICA = 3
  ONLY_FIRST_TOWER = 3  # DEPRECATED

  def __hash__(self):
    return hash(self.value)

# LINT.ThenChange(//tensorflow/core/framework/variable.proto)
#
# Note that we are currently relying on the integer values of the Python enums
# matching the integer values of the proto enums.

VariableAggregation.__doc__ = (
    VariableAggregationV2.__doc__ +
    "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")
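Roughly, this sets the docstring of VariableAggregation to that of VariableAggregationV2 plus the extra line "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n ". In Python 2 the __doc__ of a class cannot be reassigned once the class has been created, which is why the assignment fails. My guess was: since this is not allowed outside the class declaration, would setting it directly inside the class body work?
The modified code is as follows: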
@tf_export(v1=["VariableAggregation"])
class VariableAggregation(enum.Enum):
  NONE = 0
  SUM = 1
  MEAN = 2
  ONLY_FIRST_REPLICA = 3
  ONLY_FIRST_TOWER = 3  # DEPRECATED
  __doc__ = (VariableAggregationV2.__doc__ +
             "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")

  def __hash__(self):
    return hash(self.value)

# LINT.ThenChange(//tensorflow/core/framework/variable.proto)
#
# Note that we are currently relying on the integer values of the Python enums
# matching the integer values of the proto enums.

#VariableAggregation.__doc__ = (
#    VariableAggregationV2.__doc__ +
#    "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")
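The same idea can be tried in isolation, without TensorFlow. The sketch below uses two made-up enum classes, `AggregationV2` and `Aggregation`, as stand-ins for the real ones; it only shows that assigning __doc__ inside the class body yields the combined docstring and does not create an extra enum member. Note that on Python 3 the original outside-the-class assignment is also allowed, so the error itself only reproduces on Python 2.

```python
import enum


class AggregationV2(enum.Enum):
    """Stand-in base docstring.

    * `NONE`: no aggregation.
    * `SUM`: add the updates.
    """
    NONE = 0
    SUM = 1


class Aggregation(enum.Enum):
    # Assigned in the class body, so the docstring is set while the class
    # is being created rather than patched afterwards.
    __doc__ = AggregationV2.__doc__ + "* `EXTRA`: appended note.\n"
    NONE = 0
    SUM = 1


print(Aggregation.__doc__)
print(list(Aggregation))
```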
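6) The error LargeZipFile: Zipfile size would require ZIP64 extensions. The cause is fairly obvious: the file is too large, so the ZIP64 option has to be enabled when compressing, and it is apparently off by default. Edit the file /usr/lib/python2.7/dist-packages/wheel/archive.py and change
zip = zipfile.ZipFile(open(zip_filename, "wb+"), "w",compression=zipfile.ZIP_DEFLATED)
to
zip = zipfile.ZipFile(open(zip_filename, "wb+"), "w",compression=zipfile.ZIP_DEFLATED, allowZip64=True)
To be honest, though, the debug build is still too large and exceeds what zip can compress; it mainly fails at the CRC32 check. Since I do not urgently need it, I left this alone. Python 2.7 is no longer being updated, so it is not worth the effort, and Python 3.5 and later do not have this problem.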
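For reference, the allowZip64 argument used in that patch can be exercised on a tiny archive. This is a minimal sketch, not the wheel package's code: the helper name `write_archive` is invented, and it writes to an in-memory buffer rather than a file on disk. The flag only changes behaviour once an archive actually exceeds the classic zip limits, so the small example merely confirms the call is valid.

```python
import io
import zipfile


def write_archive(files, allow_zip64=True):
    """Write a name -> bytes mapping into an in-memory zip and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=allow_zip64) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


payload = write_archive({"hello.txt": b"hello"})
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    print(archive.namelist())
```

In Python 2.7 allowZip64 defaults to False, which is why the wheel code has to pass it explicitly; in newer Python 3 releases it defaults to True.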
There are also some missing-library problems. They are generally easy to search for, so I will not list them one by one here.
5. Install and set up the environment variables
Install with pip:
$ pip install /tmp/tensorflow_pkg/tensorflow --user # with no spaces after tensorflow hit tab before hitting enter to fill in blanks
Finally, test the installation:
import tensorflow as tf
sess = tf.InteractiveSession()
sess.close()
If none of these steps reports an error, TensorFlow has been compiled and installed successfully.
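Since the point of the build was GPU support, one further check is to ask TensorFlow which devices it can see. The sketch below is an assumption-laden illustration: it relies on `device_lib.list_local_devices()` from the `tensorflow.python.client` module, which is widely used but not part of the documented public API, and the helper name `gpu_device_names` is made up. It returns None when TensorFlow cannot be imported at all.

```python
def gpu_device_names():
    """Return the names of GPU devices TensorFlow sees, or None if it is not installed."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [device.name
            for device in device_lib.list_local_devices()
            if device.device_type == "GPU"]


names = gpu_device_names()
if names is None:
    print("TensorFlow is not importable in this environment")
elif names:
    print("GPU devices:", names)
else:
    print("TensorFlow imported, but no GPU device was found")
```

An empty list after a successful import would suggest the build ran without CUDA support or that the driver is not visible to the process.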
