TVM Optimization and C++ Deployment in Practice

Reposted article, 2021-11-10 06:09

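Importing a neural network model with TVM:

Supported model formats include PyTorch, TensorFlow, ONNX, Caffe, and others. Since PyTorch is what I use most, here is one way to import a PyTorch model.

GitHub code repository: https://github.com/leoluopy/autotvm_tutorial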

import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor


def relay_import_from_torch(model, direct_to_mod_param=False):
    # The model input is in NCHW order; TVM currently supports dynamic shapes
    input_shape = [1, 3, 544, 960]
    input_data = torch.randn(input_shape)
    # Run the model once on random data to record the tensor operations
    scripted_model = torch.jit.trace(model, input_data).eval()
    input_name = "input0"
    shape_list = [(input_name, input_shape)]
    # Import the model and its weights
    mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
    if direct_to_mod_param:
        return mod, params
    # target = tvm.target.Target("llvm", host="llvm")
    # dev = tvm.cpu(0)
    # Set the target platform and device ID; other platforms such as ARM GPUs or iPhone GPUs are also possible
    target = tvm.target.cuda()
    dev = tvm.device(str(target), 0)
    with tvm.transform.PassContext(opt_level=3):
        # Compile the model for the target platform; the result is kept in lib and can be exported later
        lib = relay.build(mod, target=target, params=params)
    # Initialize graph_executor from the compiled lib; it is used for inference later
    tvm_model = graph_executor.GraphModule(lib["default"](dev))
    return tvm_model, dev

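This initializes the graph_executor needed for inference. The code is simple and can be taken from the GitHub repository. What follows is a different approach: export the model as a .so file, then load the .so file for inference.

Exporting inference code for the target platform with TVM:

lib.export_library("centerFace_relay.so")

Of course, no schedule parameter search has been done at this point. The result is already somewhat faster than the original PyTorch interface, but it does not yet reach full performance.

The TVM Python inference interface in practice:

Here is the code. The .so file is the inference library exported above; it can also be the library produced by the search described later.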

import time

import cv2
import torch
import tvm
from tvm.contrib import graph_executor as runtime

# centerFacePreprocess, centerFacePostProcess and centerFaceWriteOut are helper functions not shown here
frame = cv2.imread("./ims/6.jpg")
target = tvm.target.cuda()
dev = tvm.device(str(target), 0)
lib = tvm.runtime.load_module("centerFace_relay.so")
tvm_centerPoseModel = runtime.GraphModule(lib["default"](dev))
input_tensor, img_h_new, img_w_new, scale_w, scale_h, raw_scale = centerFacePreprocess(frame)
tvm_centerPoseModel.set_input("input0", tvm.nd.array(input_tensor.astype("float32")))
for i in range(100):
    # Inference speed demo: the time stabilizes after several runs
    t0 = time.time()
    tvm_centerPoseModel.run()
    print("tvm inference cost: {}".format(time.time() - t0))
heatmap, scale, offset, lms = torch.tensor(tvm_centerPoseModel.get_output(0).asnumpy()), \
    torch.tensor(tvm_centerPoseModel.get_output(1).asnumpy()), \
    torch.tensor(tvm_centerPoseModel.get_output(2).asnumpy()), \
    torch.tensor(tvm_centerPoseModel.get_output(3).asnumpy())
dets, lms = centerFacePostProcess(heatmap, scale, offset, lms, img_h_new, img_w_new, scale_w, scale_h, raw_scale)
centerFaceWriteOut(dets, lms, frame)
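The loop above prints the latency of every iteration, including the slow first runs. A minimal sketch of a helper that discards warm-up iterations and summarizes the rest might look like the following. `benchmark` and its parameters are hypothetical names, not part of TVM, and the sketch assumes the callable only returns once the work has actually finished.

```python
import statistics
import time


def benchmark(run_fn, warmup=10, repeat=100):
    """Time run_fn and return (mean_ms, stdev_ms) over `repeat` calls.

    The first `warmup` calls are executed but not measured, because the
    first few inferences are usually slower than the steady state.
    """
    for _ in range(warmup):
        run_fn()
    samples_ms = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        run_fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


if __name__ == "__main__":
    # Stand-in workload. In the setting of this article the callable would
    # be something like tvm_centerPoseModel.run (an assumption about wiring).
    mean_ms, stdev_ms = benchmark(lambda: sum(range(10000)), warmup=5, repeat=20)
    print("mean %.3f ms, stdev %.3f ms" % (mean_ms, stdev_ms))
```

Reporting a mean and a standard deviation over many runs makes it easier to compare builds than reading individual per-iteration prints.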

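This completes a full pipeline: import the model with TVM ---> compile and export the .so library ---> load the .so library ---> run inference

The compile-and-export step above produces a .so library; on Windows it produces a .dll library. Compilation uses TVM's default schedule parameters, which already gives some speedup. In my tests, a CenterFace PyTorch model needed about 12 ms to run inference on an image of roughly 500,000 pixels [1080 Ti]; after default compilation the inference time is about 6 ms.

In addition to running with the default schedule parameters as above, better schedule parameters can be searched for. Under the same test conditions, CenterFace inference takes 3.5 ms, close to a 2x speedup over the default build (6 ms / 3.5 ms ≈ 1.7x). Crucially, accuracy is not affected!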
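The figures quoted above can be turned into speedup ratios with a few lines of arithmetic. The sketch below uses only the three latencies reported in this article (12 ms, 6 ms, 3.5 ms); `speedup` is just an illustrative helper name.

```python
# Latencies reported in this article for CenterFace on a 1080 Ti (milliseconds).
latency_ms = {
    "pytorch": 12.0,
    "tvm_default": 6.0,
    "tvm_tuned": 3.5,
}


def speedup(baseline_ms, optimized_ms):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


for name, ms in latency_ms.items():
    print(f"{name}: {ms} ms, {1000.0 / ms:.0f} frames per second, "
          f"{speedup(latency_ms['pytorch'], ms):.2f}x vs PyTorch")
```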

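The overall pipeline then becomes:

import the model with TVM ---> search for the best schedule parameters ---> compile and export the .so library ---> load the .so library ---> run inference

Searching for the best inference code with AutoTVM:

The Python search code: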

import tvm
from tvm import autotvm, relay


def case_autotvm_relay_centerFace():
    # InitCenterFacePy wraps the PyTorch model-loading code
    model = InitCenterFacePy()
    # After the search finishes, TVM saves the results in the .log file
    log_file = "centerFace.log"
    dtype = "float32"
    # Set up the tuner and the tuning options
    tuning_option = {
        "log_filename": log_file,
        "tuner": "xgb",
        # "n_trial": 1,
        "n_trial": 2000,
        "early_stopping": 600,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(timeout=10),
            runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
        ),
    }
    print("Extract tasks centerFace...")
    mod, params = relay_import_from_torch(model.module.cpu(), direct_to_mod_param=True)
    input_shape = [1, 3, 544, 960]
    target = tvm.target.cuda()
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
    )
    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_option)
    # compile kernels with history best records
    # After the search finishes, measure the inference time.
    profile_autvm_centerFace(mod, target, params, input_shape, dtype, log_file)
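The `n_trial` and `early_stopping` options bound how long the search runs: at most `n_trial` configurations are measured, and the search stops sooner if no better configuration has been found for `early_stopping` consecutive trials. The following is a simplified, library-free sketch of that stopping rule. It is not AutoTVM's implementation; `tune`, `measure`, and the toy cost model are hypothetical.

```python
import random


def tune(measure, candidates, n_trial, early_stopping):
    """Try up to n_trial candidates and return (best_config, best_cost, trials).

    Stops early once `early_stopping` consecutive trials have failed to
    improve on the best cost seen so far.
    """
    best_config, best_cost = None, float("inf")
    since_improvement = 0
    trials = 0
    for config in candidates[:n_trial]:
        cost = measure(config)
        trials += 1
        if cost < best_cost:
            best_config, best_cost = config, cost
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= early_stopping:
            break
    return best_config, best_cost, trials


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical search space of tile sizes with a made-up cost model.
    space = [random.choice([1, 2, 4, 8, 16, 32]) for _ in range(2000)]
    best, cost, used = tune(lambda t: abs(t - 8) + 1.0, space,
                            n_trial=2000, early_stopping=600)
    print("best config:", best, "cost:", cost, "trials used:", used)
```

With a small `early_stopping` value the search may stop before reaching a better configuration that appears later, which is the trade-off between tuning time and final speed.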

Verifying inference time with TVM:

TVM provides a timing utility; the code is below.

import numpy as np
import tvm
from tvm import autotvm, relay
from tvm.contrib import graph_executor as runtime


def profile_autvm_centerFace(mod, target, params, input_shape, dtype, log_file):
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(mod, target=target, params=params)
        # load parameters
        dev = tvm.device(str(target), 0)
        module = runtime.GraphModule(lib["default"](dev))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("input0", data_tvm)
        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", dev, number=1, repeat=100)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )
        lib.export_library("centerFace_relay.so")

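The TVM C++ inference interface in practice:

That covers the Python side, and we now have a dynamic library compiled for the target platform. Deploying a neural network involves more than inference: there is other code as well, often in scenarios with strict efficiency requirements, so C++ is usually the implementation language on the target platform. Once the .so library is available, how is inference run? The code follows. [It has two main parts; the complete code is in the Git repository, or available through Knowledge Planet (知識星球).]

Initialization: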

// Newer DLPack headers name this device type kDLCUDA instead of kDLGPU
DLDevice dev{kDLGPU, 0};
// On Windows the suffix should be dll
mod_factory = tvm::runtime::Module::LoadFromFile(lib_path, "so");
// Get the model instance gmod from the dynamic library
gmod = mod_factory.GetFunction("default")(dev);
// Get the packed functions for setting the input, fetching the output, and running inference
set_input = gmod.GetFunction("set_input");
get_output = gmod.GetFunction("get_output");
run = gmod.GetFunction("run");
// Use the C++ API
// Input and output buffers, allocated on the GPU device
x = tvm::runtime::NDArray::Empty({1, 3, 544, 960}, DLDataType{kDLFloat, 32, 1}, dev);
heatmap_gpu = tvm::runtime::NDArray::Empty({1, 1, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
scale_gpu = tvm::runtime::NDArray::Empty({1, 2, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
offset_gpu = tvm::runtime::NDArray::Empty({1, 2, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
lms_gpu = tvm::runtime::NDArray::Empty({1, 10, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
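The buffer shapes above determine how many bytes each tensor occupies, which matters later when raw bytes are copied into the input buffer. The sketch below recomputes those sizes from the shapes in the code above; `num_bytes` is an illustrative helper, and the observation that the outputs are at one quarter of the input resolution is inferred from the shapes, not stated by the source.

```python
from functools import reduce
from operator import mul

FLOAT32_BYTES = 4

# Shapes copied from the NDArray::Empty calls above.
shapes = {
    "input": (1, 3, 544, 960),
    "heatmap": (1, 1, 136, 240),
    "scale": (1, 2, 136, 240),
    "offset": (1, 2, 136, 240),
    "lms": (1, 10, 136, 240),
}


def num_bytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed for a dense tensor of the given shape."""
    return reduce(mul, shape, 1) * itemsize


for name, shape in shapes.items():
    print(name, shape, num_bytes(shape), "bytes")
```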

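Inference:

Worth noting: cv::dnn::blobFromImage is a really handy function. It builds the input memory block in NCHW layout, OpenCV has built-in OpenMP acceleration, and the function also works well on the Raspberry Pi and on various phones.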

int h = frame.rows;
int w = frame.cols;
// Round up to a multiple of 32; divide in floating point so that ceil() actually rounds up
float img_h_new = int(ceil(h / 32.0) * 32);
float img_w_new = int(ceil(w / 32.0) * 32);
float scale_h = img_h_new / float(h);
float scale_w = img_w_new / float(w);
cv::Mat input_tensor = cv::dnn::blobFromImage(frame, 1.0, cv::Size(img_w_new, img_h_new),
                                              cv::Scalar(0, 0, 0),
                                              true,
                                              false, CV_32F);
// The byte count assumes the resized input is 1 x 3 x 544 x 960
x.CopyFromBytes(input_tensor.data, 1 * 3 * 544 * 960 * sizeof(float));
set_input("input0", x);
timeval t0, t1;
gettimeofday(&t0, NULL);
run();
gettimeofday(&t1, NULL);
printf("inference cost: %f \n", t1.tv_sec - t0.tv_sec + (t1.tv_usec - t0.tv_usec) / 1000000.);
get_output(0, heatmap_gpu);
get_output(1, scale_gpu);
get_output(2, offset_gpu);
get_output(3, lms_gpu);
// Copy the outputs from the GPU back to CPU memory
tvm::runtime::NDArray heatmap_cpu = heatmap_gpu.CopyTo(DLDevice{kDLCPU, 0});
tvm::runtime::NDArray scale_cpu = scale_gpu.CopyTo(DLDevice{kDLCPU, 0});
tvm::runtime::NDArray offset_cpu = offset_gpu.CopyTo(DLDevice{kDLCPU, 0});
tvm::runtime::NDArray lms_cpu = lms_gpu.CopyTo(DLDevice{kDLCPU, 0});
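The preprocessing rounds the frame size up to a multiple of 32 and derives the scale factors from the result. One detail is easy to get wrong: in C++, `h / 32` with an integer `h` truncates before `ceil` is applied, so the division has to be done in floating point. A small Python sketch of the intended computation follows; `round_up` and `resize_plan` are illustrative names.

```python
import math


def round_up(value, multiple=32):
    """Smallest multiple of `multiple` that is >= value."""
    return int(math.ceil(value / multiple) * multiple)


def resize_plan(h, w, multiple=32):
    """Return (new_h, new_w, scale_h, scale_w) for a frame of size h x w."""
    new_h, new_w = round_up(h, multiple), round_up(w, multiple)
    return new_h, new_w, new_h / h, new_w / w


print(resize_plan(540, 960))
```

For a 540 x 960 frame this gives a 544 x 960 network input, which matches the fixed input shape used throughout this article, whereas truncating integer division would give 512.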

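Deploying a convolutional neural network on the Raspberry Pi with TVM

This part describes how to compile a neural network with TVM, export a dynamic link library file, and finally deploy and run it on the Raspberry Pi (or on a PC).

Environment setup

LLVM must be installed. The main runtime environment is the CPU (the Raspberry Pi's GPU is not used for now, since its memory is rather small), so LLVM is required.

Installing the cross compiler:

Cross Compiler

What is a cross compiler? It is a compiler that runs on the PC but generates executables that run directly on the Raspberry Pi. With TVM, the cross compiler is used on the PC to compile and optimize the model and then generate a dynamic link library for the Raspberry Pi (ARM architecture).

With this dynamic link library, the TVM runtime on the Raspberry Pi can load it directly and execute the network's forward pass.

How is it installed? The required cross compiler is /usr/bin/arm-linux-gnueabihf-g++. On Ubuntu, simply run sudo apt-get install g++-arm-linux-gnueabihf. The package name must be exact: the hf (hard-float) version is the one needed.

After installation, run /usr/bin/arm-linux-gnueabihf-g++ -v to see output like this: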

prototype@prototype-X299-UD4-Pro:~/$ /usr/bin/arm-linux-gnueabihf-g++ -v

Using built-in specs.

COLLECT_GCC=/usr/bin/arm-linux-gnueabihf-g++

COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/lto-wrapper

Target: arm-linux-gnueabihf

Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include

Thread model: posix

gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)
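Before exporting a model it can be convenient to check from a script that the cross compiler is actually on the PATH. The sketch below does this with the Python standard library only. The helper names are hypothetical, and `-dumpmachine` is the GCC option that prints the target triple (the same value as the `Target:` line above).

```python
import shutil
import subprocess

CROSS_COMPILER = "arm-linux-gnueabihf-g++"


def find_cross_compiler(name=CROSS_COMPILER):
    """Return the full path of the cross compiler, or None if not on PATH."""
    return shutil.which(name)


def compiler_target(path):
    """Return the target triple reported by `<compiler> -dumpmachine`."""
    result = subprocess.run([path, "-dumpmachine"], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_cross_compiler()
    if path is None:
        print("cross compiler not found; install g++-arm-linux-gnueabihf")
    else:
        print(path, "targets", compiler_target(path))
```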

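Raspberry Pi environment setup

Because the neural network is compiled with TVM on the PC, only the TVM runtime needs to be built on the Raspberry Pi (TVM consists of two parts, the compiler and the runtime, and the two can be separated).

The official commands are given below. Note that LLVM is also needed on the Raspberry Pi; prebuilt archives for it are available from the official LLVM site. Extract one and add it to the environment variables: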

git clone --recursive https://github.com/dmlc/tvm

cd tvm

mkdir build

cp cmake/config.cmake build # edit config.cmake here to enable LLVM support

cd build

cmake ..

make runtime

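Building the TVM runtime on the Raspberry Pi does not take long.

Completing the deployment

Deploying the model in C++ with TVM on the PC

The official documentation describes deployment through TVM's C++ API in some detail. Here, TVM and OpenCV are used to read an image and run the neural network on it through the previously exported dynamic link library.

The required headers are: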

#include <cstdio>

#include <dlpack/dlpack.h>

#include <opencv4/opencv2/opencv.hpp>

#include <tvm/runtime/module.h>

#include <tvm/runtime/registry.h>

#include <tvm/runtime/packed_func.h>

#include <fstream>

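Only the TVM runtime is actually needed here; dlpack is a structure for holding tensors. OpenCV is used to read the image, and fstream is used to read the JSON and parameter files: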

tvm::runtime::Module mod_dylib =
    tvm::runtime::Module::LoadFromFile("../files/mobilenet.so");

std::ifstream json_in("../files/mobilenet.json", std::ios::in);

std::string json_data((std::istreambuf_iterator<char>(json_in)), std::istreambuf_iterator<char>());

json_in.close();

// parameters in binary

std::ifstream params_in("../files/mobilenet.params", std::ios::binary);

std::string params_data((std::istreambuf_iterator<char>(params_in)), std::istreambuf_iterator<char>());

params_in.close();

TVMByteArray params_arr;

params_arr.data = params_data.c_str();

params_arr.size = params_data.length();

After reading these files, use their contents to build TVM's graph runtime (graph_runtime):

int dtype_code = kDLFloat;

int dtype_bits = 32;

int dtype_lanes = 1;

int device_type = kDLCPU;

int device_id = 0;

// Newer TVM versions register this function as "tvm.graph_executor.create"
tvm::runtime::Module mod = (*tvm::runtime::Registry::Get("tvm.graph_runtime.create"))(
    json_data, mod_dylib, device_type, device_id);

Then use a TVM function to create an input tensor and allocate its memory:

DLTensor *x;

int in_ndim = 4;

int64_t in_shape[4] = {1, 3, 128, 128};

TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);

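DLTensor is a flexible structure that can hold tensors of various types. After creating the tensor, the image data read by OpenCV must be copied into it:

cv::Mat image, frame, input;
// Again read the image paper.png
image = cv::imread("/home/prototype/CLionProjects/tvm-cpp/data/paper.png");
cv::cvtColor(image, frame, cv::COLOR_BGR2RGB);
cv::resize(frame, input, cv::Size(128,128));
float data[128 * 128 * 3];
// This function converts the OpenCV image data to CHW layout
Mat_to_CHW(data, input);

Note that OpenCV stores image data in (128,128,3) order, that is, height, width, channel, so it has to be rearranged. The Mat_to_CHW function is implemented as follows: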

void Mat_to_CHW(float *data, cv::Mat &frame)
{
    assert(data && !frame.empty());
    // Number of pixels in one channel plane
    unsigned int volChl = 128 * 128;
    for (int c = 0; c < 3; ++c)
    {
        for (unsigned j = 0; j < volChl; ++j)
            data[c * volChl + j] = static_cast<float>(float(frame.data[j * 3 + c]) / 255.0);
    }
}

Do not forget to divide by 255.0, because the PyTorch model expects input pixel values in the range 0-1.
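The index arithmetic behind the layout change is easy to check on a tiny image. The sketch below is a pure-Python version of the same HWC to CHW reordering with the division by 255; `hwc_to_chw` is an illustrative name and the 1 x 2 test image is made up.

```python
def hwc_to_chw(pixels, height, width, channels=3, scale=255.0):
    """Reorder a flat HWC byte sequence into a flat CHW list of floats.

    pixels[(y * width + x) * channels + c] is the input layout;
    out[c * height * width + y * width + x] is the output layout.
    """
    plane = height * width
    out = [0.0] * (channels * plane)
    for c in range(channels):
        for j in range(plane):
            out[c * plane + j] = pixels[j * channels + c] / scale
    return out


# A 1x2 image with pixels (255, 0, 51) and (0, 255, 102).
print(hwc_to_chw([255, 0, 51, 0, 255, 102], height=1, width=2))
```

The first two output values are the first channel of both pixels, followed by the second channel and then the third, which is the order the network input expects.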

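After converting the OpenCV image data, copy the converted data into the tensor created earlier:

// x is the tensor created earlier; data is the float buffer allocated earlier
memcpy(x->data, data, 3 * 128 * 128 * sizeof(float));

Then set the input (x) and output (y) of the graph: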

// get the function from the module(set input data)

tvm::runtime::PackedFunc set_input = mod.GetFunction("set_input");

set_input("0", x);

// get the function from the module(load parameters)

tvm::runtime::PackedFunc load_params = mod.GetFunction("load_params");

load_params(params_arr);

DLTensor* y;

int out_ndim = 2;

int64_t out_shape[2] = {1, 3,};

TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);

// get the function from the module(run it)

tvm::runtime::PackedFunc run = mod.GetFunction("run");

// get the function from the module(get output data)

tvm::runtime::PackedFunc get_output = mod.GetFunction("get_output");

Now the model can be run:

run();
get_output(0, y);
// Print the output values
auto result = static_cast<float*>(y->data);
for (int i = 0; i < 3; i++)
    std::cout << result[i] << std::endl;

The final output is

13.8204

-7.31387

-6.8253

As shown, the image is correctly recognized as "paper". This completes deployment on the C++ side.
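The three printed numbers are raw scores, and the prediction is simply the index of the largest one. The sketch below makes that step explicit and also converts the scores to probabilities. It assumes the outputs are unnormalized logits and that index 0 corresponds to the "paper" class, as the result above suggests; the meaning of the other two indices is not given in this article.

```python
import math

# Raw scores printed by the C++ program above.
scores = [13.8204, -7.31387, -6.8253]


def softmax(values):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


probs = softmax(scores)
print("predicted class index:", argmax(scores))
print("probabilities:", ["%.6f" % p for p in probs])
```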

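Deployment on the Raspberry Pi

Deploying on the Raspberry Pi is also straightforward. The difference from the steps above is that the target must be set specifically for the Raspberry Pi:

target = tvm.target.arm_cpu('rasp3b')

Looking at the implementation shows that rasp3b corresponds to -target=armv7l-linux-gnueabihf: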

trans_table = {
    "pixel2": ["-model=snapdragon835", "-target=arm64-linux-android -mattr=+neon"],
    "mate10": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "mate10pro": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "p20": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "p20pro": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "rasp3b": ["-model=bcm2837", "-target=armv7l-linux-gnueabihf -mattr=+neon"],
    "rk3399": ["-model=rk3399", "-target=aarch64-linux-gnu -mattr=+neon"],
    "pynq": ["-model=pynq", "-target=armv7a-linux-eabi -mattr=+neon"],
    "ultra96": ["-model=ultra96", "-target=aarch64-linux-gnu -mattr=+neon"],
}

One more change: when exporting the .so, pass cc="/usr/bin/arm-linux-gnueabihf-g++", where /usr/bin/arm-linux-gnueabihf-g++ is the cross compiler installed earlier.

path_lib = '../tvm/deploy_lib.so'

lib.export_library(path_lib, cc="/usr/bin/arm-linux-gnueabihf-g++")

This produces the files needed on the Raspberry Pi. Copy them to the Raspberry Pi, then deploy using the C++ deployment code described above.

References:

https://blog.csdn.net/weixin_33514140/article/details/112775067

https://blog.csdn.net/m0_62789066/article/details/120855166

GitHub code repository: https://github.com/leoluopy/autotvm_tutorial
