A Failed Attempt at Machine Learning on an AMD Card (AMD Radeon RX 470)


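Keeping up with the machine learning (ML) and deep learning craze isn't easy; the graphics card alone is a sizable investment. After some searching online, I read that AMD cards can, with some effort, be used for ML, so I decided to learn with the AMD card I already had (an RX 470). The process was far from smooth, so I'm writing it down.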

1. Trying WSL2: failed.

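The AMD ROCm website says Windows is not supported and basically recommends Ubuntu. I figured that was a good excuse to install WSL2 on Windows 10, whose latest Ubuntu image had already reached 20.04 (I'll skip the installation details). After installing Anaconda and ROCm, I ran rocminfo, and it reported that no GPU could be found. Searching online confirmed that WSL cannot currently access the hardware directly (Microsoft says a solution is under development), so this approach failed.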
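A quick way to confirm this from inside Linux is to check whether the GPU device nodes exist at all. The sketch below assumes that the ROCm runtime reaches the card through /dev/kfd and /dev/dri, as it does on a bare-metal install. WSL2 at the time did not expose /dev/kfd, so a missing node would point to the same conclusion rocminfo reached. The helper name missing_rocm_nodes is hypothetical.

```python
import os

# Assumption: device nodes the ROCm stack (amdgpu/KFD driver) uses on bare-metal Linux
ROCM_DEVICE_NODES = ("/dev/kfd", "/dev/dri")


def missing_rocm_nodes(nodes=ROCM_DEVICE_NODES):
    """Return the subset of `nodes` that does not exist on this system."""
    return [n for n in nodes if not os.path.exists(n)]


if __name__ == "__main__":
    missing = missing_rocm_nodes()
    if missing:
        print("ROCm is unlikely to see a GPU; missing:", ", ".join(missing))
    else:
        print("ROCm device nodes are present")
```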

2. Installing Ubuntu 20.04 on a physical machine

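Following a tutorial, I installed ROCm and Anaconda, then installed tensorflow-rocm. The installation went smoothly and everything looked ready, so I started Python and ran import tensorflow. Error!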

(base) python@python-MS-7972:~$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: librocsolver.so.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 40, in <module>
    from tensorflow.python.eager import context
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 35, in <module>
    from tensorflow.python import pywrap_tfe
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tfe.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 83, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: librocsolver.so.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
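The line that matters is the innermost ImportError: the dynamic loader cannot find librocsolver.so.0. One way to check that directly, without involving TensorFlow, is to ask the loader to open the library with Python's standard ctypes module. This is a minimal sketch, and can_load is a hypothetical helper name. A failure here only tells you the library is not on the loader's search path; it does not tell you why.

```python
import ctypes


def can_load(libname):
    """Try to dlopen a shared library by name; True if the dynamic loader finds it."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # ctypes raises OSError when dlopen fails (e.g. "cannot open shared object file")
        return False


if __name__ == "__main__":
    # Library name taken from the traceback
    for lib in ("librocsolver.so.0",):
        print(lib, "OK" if can_load(lib) else "NOT FOUND")
```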

After another heroic round of searching :), I finally found the right fix, and it turned out to be nothing more than installing rocm-libs!

sudo apt install rocm-libs

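However, rocm-libs installs its library files into several subdirectories under /opt/rocm-4.3.0, so those directories have to be added to the dynamic linker's search path.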

I did this by creating a separate configuration file, rocm_4.3.0_libs.conf, under /etc/ld.so.conf.d, with the following contents:

/opt/rocm-4.3.0/lib
/opt/rocm-4.3.0/rocsolver/lib
/opt/rocm-4.3.0/rocblas/lib
/opt/rocm-4.3.0/rocclr/lib
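The same file can also be generated with a small script. This is a sketch that assumes the four subdirectories listed above. It skips directories that do not exist, so a different ROCm release does not leave dead entries. The helper rocm_ld_conf is a hypothetical name, and it only produces the text. Writing into /etc/ld.so.conf.d/ needs root, and the new file takes effect only after `sudo ldconfig` rebuilds the linker cache. After that, `ldconfig -p | grep rocsolver` should list the library.

```python
import os


def rocm_ld_conf(rocm_root, subdirs, only_existing=True):
    """Return the text of an ld.so.conf.d file listing ROCm library directories."""
    dirs = [os.path.join(rocm_root, d) for d in subdirs]
    if only_existing:
        # Drop directories this ROCm install does not actually contain
        dirs = [d for d in dirs if os.path.isdir(d)]
    # One absolute directory per line, as ld.so.conf expects
    return "".join(d + "\n" for d in dirs)


if __name__ == "__main__":
    text = rocm_ld_conf(
        "/opt/rocm-4.3.0",
        ["lib", "rocsolver/lib", "rocblas/lib", "rocclr/lib"],
    )
    # Save this as /etc/ld.so.conf.d/rocm_4.3.0_libs.conf (as root), then run `sudo ldconfig`
    print(text, end="")
```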

Back in Python, import tensorflow finally worked!

But don't celebrate too early. I wrote a quick test script:

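# tensorflow-rocm ships Keras as tf.keras, so import from there
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import normalize, to_categorical


def main():
    # MNIST: 28x28 grayscale digit images with labels 0-9
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    # L2-normalize along axis 2, i.e. each 28-pixel row of every image
    x_train = normalize(x_train, axis=2)
    x_test = normalize(x_test, axis=2)
    # One-hot encode the labels into vectors of length 10
    y_train = to_categorical(y_train, 10)
    y_test = to_categorical(y_test, 10)

    # Dense() requires the number of units; flatten each image first
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128, activation="relu"),
        Dropout(0.2),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=1, batch_size=128, validation_data=(x_test, y_test))


if __name__ == "__main__":
    main()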

And it crashed again:

2021-09-26 09:42:15.973798: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)

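After more searching, it turns out that gfx803-series cards (RX 470/480/570/580/590) are no longer on the support list of AMD's ROCm 3.7 and later!! Let me cry for a moment :(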
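To see which architecture ROCm reports for your own card, you can pull the gfx target names out of rocminfo's output. This sketch assumes that rocminfo lists each GPU agent with a `Name:` line such as `Name: gfx803`. The regular expression and the gfx_targets helper are illustrative only, and the script does not encode AMD's actual support list.

```python
import re
import shutil
import subprocess

# Assumption: GPU agents appear in rocminfo output as "Name:   gfxNNN"
GFX_NAME = re.compile(r"^\s*Name:\s+(gfx[0-9a-f]+)\b", re.MULTILINE)


def gfx_targets(rocminfo_text):
    """Extract unique gfx target names (e.g. 'gfx803') in order of appearance."""
    seen = []
    for name in GFX_NAME.findall(rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    if shutil.which("rocminfo") is None:
        print("rocminfo not found; is ROCm installed?")
    else:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
        print(gfx_targets(out))
```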

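Reportedly the method at this link can work around it. I installed everything step by step as described (except the two PyTorch packages). After that, TensorFlow could be imported and used, but it still did not find the GPU and ran on the CPU! I gave up. If any of you give it a try and get it working, please be sure to let me know.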
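You can check whether TensorFlow actually sees a GPU without training anything. This sketch uses tf.config.list_physical_devices, which exists in TensorFlow 2.1 and later. The helper visible_gpus is hypothetical. It returns an empty list if TensorFlow cannot be imported, and also when TensorFlow finds no GPU and falls back to the CPU, as happened here.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Each PhysicalDevice has a .name such as "/physical_device:GPU:0"
    return [d.name for d in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print("GPUs:", gpus if gpus else "none, TensorFlow will run on the CPU")
```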

 

