This post covers:
- How to download and build OpenVINO
- How to evaluate performance with the benchmark app
- How to use the Multi-device plugin provided by OpenVINO to load a model onto multiple devices
OpenVINO targets IoT scenarios. For low-compute edge devices, it can dispatch to the MKLDNN and clDNN libraries to accelerate inference of deployed models on CPU, iGPU, FPGA and other devices.
A standard edge inference workflow can be split into the following steps: build the model, optimize the model, and deploy the model.
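To give a feel for what the deploy step looks like in code, here is a minimal sketch using the Inference Engine C++ API. The file names and the device string are placeholders, not taken from the setup described below:
#include <inference_engine.hpp>

using namespace InferenceEngine;

int main() {
    Core ie;
    // Read the IR produced by the Model Optimizer (placeholder file names)
    CNNNetwork network = ie.ReadNetwork("alexnet.xml", "alexnet.bin");
    // Load the network onto a device; "MULTI:CPU,GPU" hands scheduling to the Multi-device plugin
    ExecutableNetwork exeNetwork = ie.LoadNetwork(network, "MULTI:CPU,GPU");
    // Create an inference request and run it synchronously
    InferRequest request = exeNetwork.CreateInferRequest();
    request.Infer();
    return 0;
}
The rest of this post walks through how each of these steps is implemented inside benchmark_app and the MULTI plugin.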
1. Download and build OpenVINO
Clone the repository and build from source:
# 1. Clone the OpenVINO source
$ git clone https://github.com/openvinotoolkit/openvino.git
$ cd openvino
$ git submodule update --init --recursive
$ chmod +x install_dependencies.sh
$ ./install_dependencies.sh
# 2. Build from source
$ mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON \
    -DPYTHON_EXECUTABLE=/usr/bin/python3.6 ..
$ make --jobs=$(nproc --all)
# 3. Install
$ cmake --install . --prefix /opt/intel/coneypo_test/
# 4. Set up the OpenVINO environment
$ source /opt/intel/coneypo_test/bin/setupvars.sh
# Also export the path of the OpenCV build that ships with OpenVINO
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hddl/code/openvino/inference-engine/temp/opencv_4.5.2_ubuntu18/opencv/lib/
2. Performance evaluation with the benchmark app
2.1 Hardware configuration
The hardware configuration of the host I used for testing:
2.2 Download and convert the model
2.2.1 Download the model
Taking the alexnet network as an example, use the downloader tool shipped with OpenVINO to fetch it from the Open Model Zoo by model name:
$ cd /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/
$ /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader# python3 downloader.py --name alexnet
################|| Downloading models ||################
========== Downloading /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.prototxt
... 100%, 3 KB, 36857 KB/s, 0 seconds passed

========== Downloading /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.caffemodel
... 100%, 238146 KB, 13041 KB/s, 18 seconds passed

################|| Post-processing ||################
========== Replacing text in /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.prototxt
2.2.2 Convert the model
The Caffe model needs to be converted into OpenVINO IR files:
$ cd /opt/intel/zt_debug/deployment_tools/model_optimizer
$ python3 mo.py --input_model /opt/intel/zt_debug/deployment_tools/open_model_zoo/tool
Model Optimizer arguments:
Common parameters:
    - Path to the Input Model: /opt/intel/zt_debug/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.caffemodel
    - Path for generated IR: /opt/intel/zt_debug/deployment_tools/model_optimizer/.
    - IR output name: alexnet
    - Log level: ERROR
    - Batch: Not specified, inherited from the model
    - Input layers: Not specified, inherited from the model
    - Output layers: Not specified, inherited from the model
    - Input shapes: Not specified, inherited from the model
    - Mean values: Not specified
    - Scale values: Not specified
    - Scale factor: Not specified
    - Precision of IR: FP32
    - Enable fusing: True
    - Enable grouped convolutions fusing: True
    - Move mean values to preprocess section: None
    - Reverse input channels: False
Caffe specific parameters:
    - Path to Python Caffe* parser generated from caffe.proto: /opt/intel/zt_debug/deployment_tools/model_optimizer/mo/utils/../front/caffe/proto
    - Enable resnet optimization: True
    - Path to the Input prototxt: /opt/intel/zt_debug/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.prototxt
    - Path to CustomLayersMapping.xml: /opt/intel/zt_debug/deployment_tools/model_optimizer/mo/utils/../../extensions/front/caffe/CustomLayersMappin
    - Path to a mean file: Not specified
    - Offsets for a mean file: Not specified
    - Inference Engine found in: /opt/intel/zt_auto/python/python3.6/openvino
Inference Engine version: 2.1.custom_zt/AutoPlugin_c6e9314a9e96f74183023323dc6a026cd0b4549e
Model Optimizer version: custom_zt/AutoPlugin_c6e9314a9e96f74183023323dc6a026cd0b4549e

[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: /opt/intel/zt_debug/deployment_tools/model_optimizer/alexnet.xml
[ SUCCESS ] BIN file: /opt/intel/zt_debug/deployment_tools/model_optimizer/alexnet.bin
[ SUCCESS ] Total execution time: 19.84 seconds.
[ SUCCESS ] Memory consumed: 1513 MB.
alexnet.xml and alexnet.bin are the model IR files that OpenVINO needs.
2.3 Benchmark
Then, in either sync or async mode, load the alexnet model onto different devices through the Multi-device plugin and evaluate performance with the benchmark_app provided by OpenVINO:
$ cd /home/test/code/openvino/bin/intel64/Release
# sync, using CPU and GPU
$ ./benchmark_app -m /opt/intel/coneypo_test/deployment_tools/model_optimizer/alexnet.xml -i /opt/intel/coneypo_test/deployment_tools/inference_engine/samples/python/hello_classification/images/cat_1.png -api sync -d "MULTI:CPU,GPU"

# async, using CPU and MYRIAD
$ ./benchmark_app -m /opt/intel/coneypo_test/deployment_tools/model_optimizer/alexnet.xml -i /opt/intel/coneypo_test/deployment_tools/inference_engine/samples/python/hello_classification/images/cat_1.png -api async -d "MULTI:CPU,MYRIAD"
You can see that in sync mode the performance does not differ much.
With async, using the MULTI device plugin to load the model onto multiple devices and run inference asynchronously improves the FPS significantly:
2.4 Loading workflow
Let's walk through the implementation, taking the following command as an example:
./benchmark_app \
    -m alexnet.xml \
    -i cat_1.png \
    -api sync \
    -d MULTI:CPU,GPU,MYRIAD
2.4.1 Parse and validate the input
The input is an IR file rather than a .blob, so isNetworkCompiled is false:
bool isNetworkCompiled = fileExt(FLAGS_m) == "blob";
2.4.2 Load the Inference Engine
Core ie;
To use the CPU, the MKLDNN library needs to be loaded as a shared library:
const auto extension_ptr = std::make_shared<InferenceEngine::Extension>(FLAGS_l);
ie.AddExtension(extension_ptr);
To use the iGPU, the clDNN library is loaded:
auto ext = config.at("GPU").at(CONFIG_KEY(CONFIG_FILE));
ie.SetConfig({{CONFIG_KEY(CONFIG_FILE), ext}}, "GPU");
2.4.3 Configure the devices
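As an illustrative sketch only (not the exact benchmark_app code), this step sets per-device options on the Core before the network is loaded, for example the number of throughput streams, using the standard Inference Engine config keys:
// Illustrative only: per-device configuration before LoadNetwork
ie.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS), CONFIG_VALUE(CPU_THROUGHPUT_AUTO)}}, "CPU");
ie.SetConfig({{CONFIG_KEY(GPU_THROUGHPUT_STREAMS), CONFIG_VALUE(GPU_THROUGHPUT_AUTO)}}, "GPU");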
2.4.4 Read the IR files
CNNNetwork cnnNetwork = ie.ReadNetwork(input_model);
2.4.5 Reshape the network to match the input image size and batch size
cnnNetwork.reshape(shapes);
batchSize = (!FLAGS_layout.empty()) ? getBatchSize(app_inputs_info) : cnnNetwork.getBatchSize();
2.4.6 Configure the network inputs and outputs
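A minimal sketch of what this step typically does (the precisions and layout below are illustrative assumptions; benchmark_app derives the actual values from the model and the command-line options):
// Illustrative sketch: configure input/output precision and layout on the CNNNetwork
for (auto& item : cnnNetwork.getInputsInfo()) {
    item.second->setPrecision(Precision::U8);   // image inputs are commonly fed as U8
    item.second->setLayout(Layout::NCHW);
}
for (auto& item : cnnNetwork.getOutputsInfo()) {
    item.second->setPrecision(Precision::FP32);
}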
2.4.7 Load the model onto the devices
Note that device_name is passed in here as a raw string such as "MULTI:CPU,GPU,MYRIAD"; it has not been parsed yet:
exeNetwork = ie.LoadNetwork(cnnNetwork, device_name);
Because -d MULTI:CPU,GPU,MYRIAD was used, execution enters the MULTI plugin.
The network is loaded by MultiDeviceInferencePlugin::LoadExeNetworkImpl() in openvino/inference-engine/src/multi_device/multi_device_plugin.cpp.
For each device (CPU, GPU, MYRIAD), LoadNetwork is called once:
ExecutableNetworkInternal::Ptr MultiDeviceInferencePlugin::LoadExeNetworkImpl() {
    ...
    for (auto& p : metaDevices) {
        ...
        auto exec_net = GetCore()->LoadNetwork(network, deviceName, deviceConfig);
        ...
    }
    ...
    return std::make_shared<MultiDeviceExecutableNetwork>(executableNetworkPerDevice,
                                                          metaDevices,
                                                          multiNetworkConfig,
                                                          enablePerfCounters);
}
The LoadExeNetworkImpl function above creates a MultiDeviceExecutableNetwork object.
Inside it, for each device, networkValue.first is CPU/GPU/MYRIAD and networkValue.second is the AlexNet network loaded on that device.
If the -d option does not specify an nireq for a device, GetMetric() is used to obtain that device's optimalNum, i.e. the optimal number of inference requests for the device to handle:
MultiDeviceExecutableNetwork::MultiDeviceExecutableNetwork() {
    for (auto&& networkValue : _networksPerDevice) {
        ...
        auto numRequests = (_devicePriorities.end() == itNumRequests ||
                            itNumRequests->numRequestsPerDevices == -1)
                               ? optimalNum
                               : itNumRequests->numRequestsPerDevices;
        ...
        for (auto&& workerRequest : workerRequests) {
            ...
            workerRequest._inferRequest = network.CreateInferRequest();
            ...
        }
    }
}
Note that network.CreateInferRequest() here is implemented in openvino/inference-engine/src/inference_engine/cpp/ie_executable_network.cpp; it is not the CreateInferRequest() inside the Multi plugin.
If we set nireq=6 and create six inference requests, the workerRequests sizes for CPU/GPU/MYRIAD are 2/1/4 respectively, so 2/1/4 workerRequests are started on those devices.
The definitions of idleWorkerRequests and workerRequests:
DeviceMap<NotBusyWorkerRequests> _idleWorkerRequests;
DeviceMap<std::vector<WorkerInferRequest>> _workerRequests;
How WorkerInferRequest works will be covered in detail later.
2.4.8 Configure the optimal parameters
If the number of inference requests has already been specified, this step is skipped.
If it has not, GetMetric(METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)) is used to obtain the optimal nireq for this network:
nireq = exeNetwork.GetMetric(key).as<unsigned int>();
2.4.9 Create the inference requests (Infer Requests)
InferRequestsQueue dispatches nireq inference requests to the executable network by creating nireq InferReqWrap objects:
InferRequestsQueue inferRequestsQueue(exeNetwork, nireq);
The definition of the InferRequestsQueue class can be found in openvino/inference-engine/samples/benchmark_app/infer_request_wrap.hpp:
InferRequestsQueue(InferenceEngine::ExecutableNetwork& net, size_t nireq) {
    for (size_t id = 0; id < nireq; id++) {
        requests.push_back(std::make_shared<InferReqWrap>(
            net, id,
            std::bind(&InferRequestsQueue::putIdleRequest, this,
                      std::placeholders::_1, std::placeholders::_2)));
        _idleIds.push(id);
    }
    resetTimes();
}
Inside InferReqWrap, the inference request is created via net.CreateInferRequest():
explicit InferReqWrap(InferenceEngine::ExecutableNetwork& net, size_t id, QueueCallbackFunction callbackQueue)
    : _request(net.CreateInferRequest()), _id(id), _callbackQueue(callbackQueue) {
    _request.SetCompletionCallback([&]() {
        _endTime = Time::now();
        _callbackQueue(_id, getExecutionTimeInMilliseconds());
    });
}
For example, with nireq=6 we need to create six inference requests, via CreateInferRequest in openvino/inference-engine/src/multi_device/multi_device_exec_network.cpp:
IInferRequestInternal::Ptr MultiDeviceExecutableNetwork::CreateInferRequest() {
    auto syncRequestImpl = CreateInferRequestImpl(_networkInputs, _networkOutputs);
    syncRequestImpl->setPointerToExecutableNetworkInternal(shared_from_this());
    return std::make_shared<MultiDeviceAsyncInferRequest>(
        std::static_pointer_cast<MultiDeviceInferRequest>(syncRequestImpl),
        _needPerfCounters,
        std::static_pointer_cast<MultiDeviceExecutableNetwork>(shared_from_this()),
        _callbackExecutor);
}
CreateInferRequestImpl() picks request_to_share_blobs_with based on how many requests have been created so far, and thereby assigns each request to one of the devices.
It walks _devicePrioritiesInitial in priority order (together with the running counter _numRequestsCreated) to decide which device's worker request the current Infer Request is mapped to:
InferenceEngine::IInferRequestInternal::Ptr MultiDeviceExecutableNetwork::CreateInferRequestImpl(
        InferenceEngine::InputsDataMap networkInputs,
        InferenceEngine::OutputsDataMap networkOutputs) {
    auto num = _numRequestsCreated++;
    ...
    for (const auto& device : _devicePrioritiesInitial) {
        auto& dev_requests = _workerRequests[device.deviceName];
        ...
        if ((num - sum) < dev_requests.size()) {
            request_to_share_blobs_with = dev_requests.at(num - sum)._inferRequest;
            break;
        }
        sum += dev_requests.size();
    }
    return std::make_shared<MultiDeviceInferRequest>(networkInputs, networkOutputs, request_to_share_blobs_with);
}
This function creates a MultiDeviceInferRequest object, a class defined in openvino/inference-engine/src/multi_device/multi_device_infer_request.cpp.
When the MultiDeviceInferRequest object is created, the Infer Request is configured with the network's inputs and outputs:
MultiDeviceInferRequest::MultiDeviceInferRequest(const InputsDataMap&  networkInputs,
                                                 const OutputsDataMap& networkOutputs,
                                                 InferRequest request_to_share_blobs_with)
    : IInferRequestInternal(networkInputs, networkOutputs) {
    if (request_to_share_blobs_with) {
        for (const auto &it : _networkInputs) {
            _inputs[it.first] = request_to_share_blobs_with.GetBlob(it.first);
        }
        for (const auto &it : _networkOutputs)
            _outputs[it.first] = request_to_share_blobs_with.GetBlob(it.first);
        return;
    }
    // Allocate all input blobs
    for (const auto &it : networkInputs) {
        Layout l = it.second->getLayout();
        Precision p = it.second->getPrecision();
        SizeVector dims = it.second->getTensorDesc().getDims();

        TensorDesc desc = TensorDesc(p, dims, l);
        _inputs[it.first] = make_blob_with_precision(desc);
        _inputs[it.first]->allocate();
    }
    // Allocate all output blobs
    for (const auto &it : networkOutputs) {
        Layout l = it.second->getLayout();
        Precision p = it.second->getPrecision();
        SizeVector dims = it.second->getTensorDesc().getDims();

        TensorDesc desc = TensorDesc(p, dims, l);
        _outputs[it.first] = make_blob_with_precision(desc);
        _outputs[it.first]->allocate();
    }
}
The OpenVINO documentation describes GetBlob() as follows:
Get blobs allocated by an infer request using InferenceEngine::InferRequest::GetBlob() and feed an image and the input data to the blobs.
In other words, an OpenVINO infer request uses InferenceEngine::InferRequest::GetBlob() to feed the image and other input data into the network's blobs.
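As a small, hedged sketch of that pattern (it assumes the input is named "data" and takes FP32 data; adjust to the actual model), the blob returned by GetBlob() can be written to directly before running the request:
// Illustrative sketch: write image data into an infer request's input blob
InferRequest request = exeNetwork.CreateInferRequest();
Blob::Ptr inputBlob = request.GetBlob("data");        // "data" is the assumed input name
MemoryBlob::Ptr minput = as<MemoryBlob>(inputBlob);
auto holder = minput->wmap();                         // lock the blob for writing
float* blobData = holder.as<float*>();
// ... copy the preprocessed NCHW image into blobData here ...
request.Infer();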
For its return value, MultiDeviceExecutableNetwork::CreateInferRequest() also creates a MultiDeviceAsyncInferRequest object.
This class is defined in openvino/inference-engine/src/multi_device/multi_device_async_infer_request.cpp:
MultiDeviceAsyncInferRequest::MultiDeviceAsyncInferRequest(
    const MultiDeviceInferRequest::Ptr&         inferRequest,
    const bool                                  needPerfCounters,
    const MultiDeviceExecutableNetwork::Ptr&    multiDeviceExecutableNetwork,
    const ITaskExecutor::Ptr&                   callbackExecutor) :
    AsyncInferRequestThreadSafeDefault(inferRequest, nullptr, callbackExecutor),
    _multiDeviceExecutableNetwork{multiDeviceExecutableNetwork},
    _inferRequest{inferRequest},
    _needPerfCounters{needPerfCounters} {
    // this executor starts the inference while the task (checking the result) is passed to the next stage
    struct ThisRequestExecutor : public ITaskExecutor {
        explicit ThisRequestExecutor(MultiDeviceAsyncInferRequest* _this_) : _this{_this_} {}
        void run(Task task) override {
            auto workerInferRequest = _this->_workerInferRequest;
            workerInferRequest->_task = std::move(task);
            workerInferRequest->_inferRequest.StartAsync();
        };
        MultiDeviceAsyncInferRequest* _this = nullptr;
    };
    _pipeline = {
        // if the request is coming with device-specific remote blobs make sure it is scheduled to the specific device only:
        { /*TaskExecutor*/ std::make_shared<ImmediateExecutor>(), /*task*/ [this] {
            // by default, no preferred device:
            _multiDeviceExecutableNetwork->_thisPreferredDeviceName = "";
            // if any input is remote (e.g. was set with SetBlob), let' use the corresponding device
            for (const auto &it : _multiDeviceExecutableNetwork->GetInputsInfo()) {
                auto b = _inferRequest->GetBlob(it.first);
                auto r = b->as<RemoteBlob>();
                if (r) {
                    const auto name = r->getDeviceName();
                    const auto res = std::find_if(
                        _multiDeviceExecutableNetwork->_devicePrioritiesInitial.cbegin(),
                        _multiDeviceExecutableNetwork->_devicePrioritiesInitial.cend(),
                        [&name](const MultiDevicePlugin::DeviceInformation& d){ return d.deviceName == name; });
                    if (_multiDeviceExecutableNetwork->_devicePrioritiesInitial.cend() == res) {
                        IE_THROW() << "None of the devices (for which current MULTI-device configuration was "
                                      "initialized) supports a remote blob created on the device named " << name;
                    } else {
                        // it is ok to take the c_str() here (as pointed in the multi_device_exec_network.hpp we need to use const char*)
                        // as the original strings are from the "persistent" vector (with the right lifetime)
                        _multiDeviceExecutableNetwork->_thisPreferredDeviceName = res->deviceName.c_str();
                        break;
                    }
                }
            }
        }},
        // as the scheduling algo may select any device, this stage accepts the scheduling decision (actual workerRequest)
        // then sets the device-agnostic blobs to the actual (device-specific) request
        { /*TaskExecutor*/ _multiDeviceExecutableNetwork, /*task*/ [this] {
            _workerInferRequest = MultiDeviceExecutableNetwork::_thisWorkerInferRequest;
            _inferRequest->SetBlobsToAnotherRequest(_workerInferRequest->_inferRequest);
        }},
        // final task in the pipeline:
        { /*TaskExecutor*/ std::make_shared<ThisRequestExecutor>(this), /*task*/ [this] {
            auto status = _workerInferRequest->_status;
            if (InferenceEngine::StatusCode::OK != status) {
                if (nullptr != InferenceEngine::CurrentException())
                    std::rethrow_exception(InferenceEngine::CurrentException());
                else
                    IE_EXCEPTION_SWITCH(status, ExceptionType,
                        InferenceEngine::details::ThrowNow<ExceptionType>{}
                            <<= std::stringstream{} << IE_LOCATION
                            << InferenceEngine::details::ExceptionTraits<ExceptionType>::string());
            }
            if (_needPerfCounters)
                _perfMap = _workerInferRequest->_inferRequest.GetPerformanceCounts();
        }}
    };
}
2.4.10 Run inference
Inference is run on the created inferRequest according to whether sync or async mode is used:
if (FLAGS_api == "sync") {
    inferRequest->infer();
} else {
    inferRequest->wait();
    inferRequest->startAsync();
}
In sync mode, the infer() defined in the InferReqWrap class is executed:
void infer() {
    _startTime = Time::now();
    _request.Infer();
    _endTime = Time::now();
    _callbackQueue(_id, getExecutionTimeInMilliseconds());
}
This runs the tasks created earlier in the MultiDeviceAsyncInferRequest pipeline, which call SetBlobsToAnotherRequest:
{ /*TaskExecutor*/ _multiDeviceExecutableNetwork, /*task*/ [this] {
    _workerInferRequest = MultiDeviceExecutableNetwork::_thisWorkerInferRequest;
    _inferRequest->SetBlobsToAnotherRequest(_workerInferRequest->_inferRequest);
}},
And inside SetBlobsToAnotherRequest() is where GetBlob() is actually called, so the blobs are handed to the device-specific request that performs the inference:
void MultiDeviceInferRequest::SetBlobsToAnotherRequest(InferRequest& req) {
    for (const auto &it : _networkInputs) {
        auto &name = it.first;
        // this request is already in BUSY state, so using the internal functions safely
        auto blob = GetBlob(name);
        if (req.GetBlob(name) != blob) {
            req.SetBlob(name, blob);
        }
    }
    for (const auto &it : _networkOutputs) {
        auto &name = it.first;
        // this request is already in BUSY state, so using the internal functions safely
        auto blob = GetBlob(name);
        if (req.GetBlob(name) != blob) {
            req.SetBlob(name, blob);
        }
    }
}
If we specify -d "MULTI:CPU(1),MYRIAD(3),GPU(2)", then 1/3/2 inference requests are created on CPU/MYRIAD/GPU respectively:
If we specify -d "MULTI:CPU,MYRIAD,GPU" without giving each device an Infer request count,
the Infer requests are created in device order: first on the CPU (in async mode, with four physical cores, four can be created there), and the remaining two Infer requests are created on the second device, MYRIAD:
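The same per-device request counts can also be requested directly through the C++ API; a minimal sketch, where the device string simply mirrors the -d option above:
// Illustrative sketch: pass per-device request counts in the MULTI device string
Core ie;
CNNNetwork network = ie.ReadNetwork("alexnet.xml");
// Ask MULTI for 1/3/2 worker requests on CPU/MYRIAD/GPU respectively
ExecutableNetwork exeNetwork = ie.LoadNetwork(network, "MULTI:CPU(1),MYRIAD(3),GPU(2)");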