Getting Started with TensorFlow: Building with bazel (with GPU Support)

Reposted article. 2018-06-13 23:12. Tags: Machine Learning / Deep Learning / DL / Parallel Computing / Python / TensorFlow / ML

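This series is basically a record of each step, written up as I reach it.

I recently installed a GPU (GTX 750 Ti) together with CUDA 9.0 and cuDNN 7.1, so I want TensorFlow to run on the GPU. This also makes good on an earlier promise.

That is the motivation. Because the newer CUDA versions are poorly supported by TensorFlow, the only option is to build from source.

The process breaks down roughly into the following steps:

1. Install the dependencies (I had already done this, so I will not cover it in detail; see the dependencies in the earlier posts, which are basically the same)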

sudo apt-get install openjdk-8-jdk

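The JDK is required by bazel.

2. Install Git (skip this step if you already have it)

3. Install bazel, TensorFlow's build tool

4. Configure and build the TensorFlow source

5. Install and configure environment variables

1. Install the dependencies

2. Install Git

Run: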

sudo apt-get install git
git clone --recursive https://github.com/tensorflow/tensorflow

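3. Install bazel, TensorFlow's build tool

This step is a little more work because bazel is not available through apt-get.

You therefore need to download it from GitHub first and then install it. The download page is https://github.com/bazelbuild/bazel/releases

Download the correct version. You need to check which bazel versions your TensorFlow checkout accepts; the exact requirement is in the configure.py file. For example, my version contains this snippet: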

_TF_BAZELRC_FILENAME = '.tf_configure.bazelrc'
_TF_WORKSPACE_ROOT = ''
_TF_BAZELRC = ''
_TF_CURRENT_BAZEL_VERSION = None
_TF_MIN_BAZEL_VERSION = '0.27.1'
_TF_MAX_BAZEL_VERSION = '1.1.0'

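The meaning of each field is clear from its name. _TF_BAZELRC_FILENAME is the configuration file used when building with bazel (I have not studied it in detail; https://www.cnblogs.com/shouhuxianjian/p/9416934.html has an explanation), and _TF_MIN_BAZEL_VERSION = '0.27.1' is the minimum required bazel version.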
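As a rough illustration of how the minimum and maximum fields can be used, the sketch below checks whether a given bazel version falls inside that range. The helper names parse_version and bazel_version_ok are made up for this example, and it assumes plain dotted numeric versions; it does not reproduce the actual logic in configure.py.

```python
# Minimal sketch: check whether a bazel version lies inside the range
# declared by _TF_MIN_BAZEL_VERSION and _TF_MAX_BAZEL_VERSION.
# Both helper names are hypothetical and not taken from configure.py.


def parse_version(text):
    """Turn a dotted version such as '0.27.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def bazel_version_ok(current, minimum="0.27.1", maximum="1.1.0"):
    """Return True if minimum <= current <= maximum (inclusive)."""
    return parse_version(minimum) <= parse_version(current) <= parse_version(maximum)


if __name__ == "__main__":
    for candidate in ("0.14.1", "0.29.1", "1.1.0", "2.0.0"):
        print(candidate, bazel_version_ok(candidate))
```

Comparing tuples of integers avoids the trap of comparing version strings character by character, where '0.9.0' would sort after '0.27.1'.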

Then install the downloaded .sh file with sudo:

sudo chmod +x ./bazel*.sh
sudo ./bazel-0.*.sh

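4. Configure and build the TensorFlow source

First comes configuration, where you can select and trim features to suit your needs. This step is tedious because there are many options to choose from. My choices were as follows: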

jourluohua@jour:~/tools/tensorflow$ ./configure
WARNING: Running Bazel server needs to be killed, because the startup options are different.
You have bazel 0.14.1 installed.
Please specify the location of python. [Default is /usr/bin/python]:


Found possible Python library paths:
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: y
GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: y
VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 8


Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Do you wish to build TensorFlow with TensorRT support? [y/N]: N
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]:


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 5.0]


Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:


Do you wish to build TensorFlow with MPI support? [y/N]: N
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
Configuration finished


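Then build with bazel (this step is very error-prone and especially time-consuming). With bazel, -c opt builds an optimized release version and -c dbg builds a debug version; the command below uses --config=opt, which applies the optimization flags chosen during ./configure.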

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

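You will run into many problems along the way. Here are some errors that are hard to look up.

1) You may hit a CXX error whose real cause is hard to track down (bazel only reports which line of which build file failed, not the underlying error). To see the detailed error message, add the --verbose_failures option.

2) If you hit a CXX error (anyone who builds software knows that mature C++ code is fairly stable, compatible, and portable, and normally does not run into compiler or environment problems), it may be caused by the gcc version. In that case you can add --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"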
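To keep track of which of these two flags are in effect, the build command can be assembled in a small script. This is only a sketch: bazel_build_args is a hypothetical helper written for this post, not part of bazel or TensorFlow, and it only builds the argument list without running anything.

```python
# Sketch: assemble the bazel command line from two troubleshooting switches.
# bazel_build_args is a hypothetical helper, not part of bazel or TensorFlow.

PIP_TARGET = "//tensorflow/tools/pip_package:build_pip_package"


def bazel_build_args(verbose_failures=False, old_cxx_abi=False):
    """Return the build command as a list of arguments."""
    args = ["bazel", "build", "--config=opt", "--config=cuda"]
    if verbose_failures:
        # Ask bazel to print the full failing command and its error output.
        args.append("--verbose_failures")
    if old_cxx_abi:
        # Pass the ABI define through to the C++ compiler.
        args.append("--cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0")
    args.append(PIP_TARGET)
    return args


if __name__ == "__main__":
    print(" ".join(bazel_build_args(verbose_failures=True, old_cxx_abi=True)))
```

The resulting list could be passed to subprocess.call from the TensorFlow source directory; that part is left out here because it depends on the local setup.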

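3) You may hit the error virtual memory exhausted: Cannot allocate memory. This happens when no swap space is configured or the swap space is too small; free -m shows how much swap is available. The fix is to add more swap. The commands are roughly as follows: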

# remove any stale file left at the swap path
rm -f /home/jourluohua/swap
dd if=/dev/zero of=/home/jourluohua/swap bs=1024 count=4096000
mkswap /home/jourluohua/swap
sudo swapon /home/jourluohua/swap

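This writes 4,096,000 blocks of 1024 bytes each, roughly 4 GB in total. If the problem persists, the cause is that bazel builds with multiple parallel jobs by default; you can add the -j 2 option manually to limit the build to 2 jobs.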
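The size arithmetic behind the bs and count arguments can be checked with a few lines of Python. The function name swap_size_bytes is only for illustration.

```python
# Sketch: the arithmetic behind the dd command's bs and count arguments.
# swap_size_bytes is a hypothetical helper name.


def swap_size_bytes(block_size=1024, count=4096000):
    """Total size of the swap file that dd writes, in bytes."""
    return block_size * count


if __name__ == "__main__":
    size = swap_size_bytes()
    print(size)           # total bytes
    print(size / 10**9)   # decimal gigabytes
    print(size / 2**30)   # binary gibibytes
```

The total is 4,194,304,000 bytes, which is about 4.19 GB in decimal units or about 3.9 GiB in binary units, so "4 GB" is a rounded figure.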

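4) You may hit AttributeError: 'module' object has no attribute 'IntEnum'. This one is confusing: python -c "import enum" runs without error, yet the module really has no IntEnum attribute. After some searching, it turned out that installing the enum34 package fixes it. One of Python's weaker points is how messy its packages can be.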

pip install enum34 --user
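A quick way to confirm whether the enum module that Python actually imports provides IntEnum is a check like the one below. The function name has_int_enum is made up for this sketch. It also prints where the module was loaded from, which helps spot a different package that happens to be named enum; that this is the cause in any particular environment is an assumption.

```python
# Sketch: check whether the importable enum module provides IntEnum.
# has_int_enum is a hypothetical helper name.
import enum


def has_int_enum(module=enum):
    """Return True if the given module exposes an IntEnum attribute."""
    return hasattr(module, "IntEnum")


if __name__ == "__main__":
    # The file path shows which enum implementation is being imported.
    print(getattr(enum, "__file__", "<built-in>"))
    print(has_int_enum())
```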

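5) You may hit AttributeError: attribute '__doc__' of 'type' objects is not writable. This one is quite tricky. My background is computer architecture and I mostly work in C++, so I am not very familiar with Python; maybe my build environment was broken? I looked it up: __doc__ is the docstring of a Python object.

I first wrote a small program to reproduce the problem: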

#!/usr/bin/python
from functools import wraps

#from https://stackoverflow.com/questions/39010366/functools-wrapper-attributeerror-attribute-doc-of-type-objects-is-not
def memoize(f):
    """ Memoization decorator for functions taking one or more arguments.
        Saves repeated api calls for a given value, by caching it.
    """
    @wraps(f)
    class memodict(dict):
       """memodict"""
       def __init__(self, f):
           self.f = f
       def __call__(self, *args):
           return self[args]
       def __missing__(self, key):
           ret = self[key] = self.f(*key)
           return ret
    return memodict(f)

@memoize
def a():
    """blah"""
    pass

It produced the same error:

Traceback (most recent call last):
  File "ipy.py", line 20, in <module>
    @memoize
  File "ipy.py", line 9, in memoize
    class memodict(dict):
  File "/usr/lib/python2.7/functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: attribute '__doc__' of 'type' objects is not writable

Opening the Python file that fails, the original code looks like this:

@tf_export(v1=["VariableAggregation"])
class VariableAggregation(enum.Enum):
  NONE = 0
  SUM = 1
  MEAN = 2
  ONLY_FIRST_REPLICA = 3
  ONLY_FIRST_TOWER = 3  # DEPRECATED
  
  def __hash__(self):
    return hash(self.value)


# LINT.ThenChange(//tensorflow/core/framework/variable.proto)
#
# Note that we are currently relying on the integer values of the Python enums
# matching the integer values of the proto enums.

VariableAggregation.__doc__ = (
    VariableAggregationV2.__doc__ +
    "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")

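Roughly speaking, it sets the docstring of VariableAggregation to the docstring of VariableAggregationV2 plus the extra text "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n ". My guess was: since this assignment is not allowed outside the class declaration, would setting it directly inside the class work?

The modified code is as follows: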

@tf_export(v1=["VariableAggregation"])
class VariableAggregation(enum.Enum):
  NONE = 0
  SUM = 1
  MEAN = 2
  ONLY_FIRST_REPLICA = 3
  ONLY_FIRST_TOWER = 3  # DEPRECATED
  __doc__ = (VariableAggregationV2.__doc__ + "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")
  def __hash__(self):
    return hash(self.value)


# LINT.ThenChange(//tensorflow/core/framework/variable.proto)
#
# Note that we are currently relying on the integer values of the Python enums
# matching the integer values of the proto enums.

#VariableAggregation.__doc__ = (
 #   VariableAggregationV2.__doc__ +
  #  "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")
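The same idea can be tried in isolation, without the TensorFlow source. The sketch below uses two small stand-in enums, AggregationV2 and Aggregation, which are hypothetical names for this example only. Assigning __doc__ inside the class body takes effect while the class is being created, so nothing has to write to the attribute afterwards.

```python
# Sketch: build a class docstring inside the class body instead of
# assigning __doc__ after the class has been created.
# AggregationV2 and Aggregation are hypothetical stand-ins.
import enum


class AggregationV2(enum.Enum):
    """Stand-in for the newer enum whose docstring is reused.

    * `NONE`: no aggregation.
    """
    NONE = 0


class Aggregation(enum.Enum):
    # Set while the class body runs, before the class object exists.
    __doc__ = AggregationV2.__doc__ + "* `ONLY_FIRST_TOWER`: deprecated alias.\n"
    NONE = 0
    ONLY_FIRST_REPLICA = 3
    ONLY_FIRST_TOWER = 3  # same value, so this becomes an alias


if __name__ == "__main__":
    print(Aggregation.__doc__)
    print(Aggregation.ONLY_FIRST_TOWER is Aggregation.ONLY_FIRST_REPLICA)
```

Names with double underscores on both sides are not turned into enum members, so __doc__ does not interfere with the member list.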

6) You may hit LargeZipFile: Zipfile size would require ZIP64 extensions. This one is fairly obvious: the file is too large, so the ZIP64 option has to be enabled when compressing, and it appears to be disabled by default. Edit the file /usr/lib/python2.7/dist-packages/wheel/archive.py

Change zip = zipfile.ZipFile(open(zip_filename, "wb+"), "w", compression=zipfile.ZIP_DEFLATED) to zip = zipfile.ZipFile(open(zip_filename, "wb+"), "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True) and it will work.
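For reference, this is how the allowZip64 argument looks in a standalone script. It is a minimal sketch using only the standard zipfile module; write_archive is a hypothetical helper and is not the code in wheel/archive.py. The tiny in-memory archive here does not need ZIP64 at all; the flag only permits the extension when an archive grows large enough to require it.

```python
# Sketch: create a deflated archive with ZIP64 extensions allowed.
# write_archive is a hypothetical helper, not the code in wheel/archive.py.
import io
import zipfile


def write_archive(target, files):
    """Write {name: bytes} entries into target with ZIP64 permitted."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for name, data in files.items():
            archive.writestr(name, data)


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_archive(buffer, {"hello.txt": b"hello"})
    with zipfile.ZipFile(buffer) as archive:
        print(archive.namelist())
```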

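To be honest, though, the debug build is still too large and exceeds what zip can compress; it mainly fails at the CRC32 check. Since I did not urgently need it, I left this part unpatched. After all, Python 2.7 is no longer being updated, so it is not worth the effort, and Python 3.5 and later do not have this problem.

There are also some other missing-library problems, which are generally easy to search for, so I will not list them here.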

5. Install and configure environment variables

Install with pip:

$ pip install /tmp/tensorflow_pkg/tensorflow --user

# with no spaces after tensorflow hit tab before hitting enter to fill in blanks

Finally, test it:

import tensorflow as tf
sess = tf.InteractiveSession()
sess.close()

If none of the steps reports an error, TensorFlow has been built and installed successfully.
