This post covers:
- How to download and build OpenVINO
- How to evaluate performance with the benchmark app
- How to use the Multi-device plugin provided by OpenVINO to load a model onto multiple devices
OpenVINO targets IoT scenarios. For low-compute edge devices, it can dispatch to the MKLDNN and clDNN libraries to accelerate inference of deployed models on CPU, iGPU, FPGA and other devices.
A standard edge inference workflow can be split into the following steps: build the model, optimize the model, and deploy the model.
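To give a feel for what the deploy step looks like in code, here is a minimal sketch using the Inference Engine C++ API. The file names and the device string are placeholders, not taken from the setup described below:
#include <inference_engine.hpp>

using namespace InferenceEngine;

int main() {
    Core ie;
    // Read the IR produced by the Model Optimizer (placeholder file names)
    CNNNetwork network = ie.ReadNetwork("alexnet.xml", "alexnet.bin");
    // Load the network onto a device; "MULTI:CPU,GPU" hands scheduling to the Multi-device plugin
    ExecutableNetwork exeNetwork = ie.LoadNetwork(network, "MULTI:CPU,GPU");
    // Create an inference request and run it synchronously
    InferRequest request = exeNetwork.CreateInferRequest();
    request.Infer();
    return 0;
}
The rest of this post walks through how each of these steps is implemented inside benchmark_app and the MULTI plugin.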
1. Download and build OpenVINO
Clone the repository and build from source:
# 1. Clone the OpenVINO source
$ git clone https://github.com/openvinotoolkit/openvino.git
$ cd openvino
$ git submodule update --init --recursive
$ chmod +x install_dependencies.sh
$ ./install_dependencies.sh
# 2. Build from source
$ mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON \
    -DPYTHON_EXECUTABLE=/usr/bin/python3.6 ..
$ make --jobs=$(nproc --all)
# 3. Install
$ cmake --install . --prefix /opt/intel/coneypo_test/
# 4. Set up the OpenVINO environment
$ source /opt/intel/coneypo_test/bin/setupvars.sh
# Also export the path of the OpenCV build that ships with OpenVINO
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hddl/code/openvino/inference-engine/temp/opencv_4.5.2_ubuntu18/opencv/lib/
2. Performance evaluation with the benchmark app
2.1 Hardware configuration
The hardware configuration of the host I used for testing:
2.2 Download and convert the model
2.2.1 Download the model
Taking the alexnet network as an example, use the downloader tool shipped with OpenVINO to fetch it from the Open Model Zoo by model name:
$ cd /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/
$ /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader# python3 downloader.py --name alexnet
################|| Downloading models ||################
========== Downloading /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.prototxt
... 100%, 3 KB, 36857 KB/s, 0 seconds passed

========== Downloading /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.caffemodel
... 100%, 238146 KB, 13041 KB/s, 18 seconds passed

################|| Post-processing ||################
========== Replacing text in /opt/intel/coneypo_test/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.prototxt
2.2.2 Convert the model
The Caffe model needs to be converted into OpenVINO IR files:
$ cd /opt/intel/zt_debug/deployment_tools/model_optimizer
$ python3 mo.py --input_model /opt/intel/zt_debug/deployment_tools/open_model_zoo/tool
Model Optimizer arguments:
Common parameters:
    - Path to the Input Model: /opt/intel/zt_debug/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.caffemodel
    - Path for generated IR: /opt/intel/zt_debug/deployment_tools/model_optimizer/.
    - IR output name: alexnet
    - Log level: ERROR
    - Batch: Not specified, inherited from the model
    - Input layers: Not specified, inherited from the model
    - Output layers: Not specified, inherited from the model
    - Input shapes: Not specified, inherited from the model
    - Mean values: Not specified
    - Scale values: Not specified
    - Scale factor: Not specified
    - Precision of IR: FP32
    - Enable fusing: True
    - Enable grouped convolutions fusing: True
    - Move mean values to preprocess section: None
    - Reverse input channels: False
Caffe specific parameters:
    - Path to Python Caffe* parser generated from caffe.proto: /opt/intel/zt_debug/deployment_tools/model_optimizer/mo/utils/../front/caffe/proto
    - Enable resnet optimization: True
    - Path to the Input prototxt: /opt/intel/zt_debug/deployment_tools/open_model_zoo/tools/downloader/public/alexnet/alexnet.prototxt
    - Path to CustomLayersMapping.xml: /opt/intel/zt_debug/deployment_tools/model_optimizer/mo/utils/../../extensions/front/caffe/CustomLayersMappin
    - Path to a mean file: Not specified
    - Offsets for a mean file: Not specified
    - Inference Engine found in: /opt/intel/zt_auto/python/python3.6/openvino
Inference Engine version: 2.1.custom_zt/AutoPlugin_c6e9314a9e96f74183023323dc6a026cd0b4549e
Model Optimizer version: custom_zt/AutoPlugin_c6e9314a9e96f74183023323dc6a026cd0b4549e

[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: /opt/intel/zt_debug/deployment_tools/model_optimizer/alexnet.xml
[ SUCCESS ] BIN file: /opt/intel/zt_debug/deployment_tools/model_optimizer/alexnet.bin
[ SUCCESS ] Total execution time: 19.84 seconds.
[ SUCCESS ] Memory consumed: 1513 MB.
alexnet.xml and alexnet.bin are the model IR files that OpenVINO needs.
2.3 Benchmark
Then, in either sync or async mode, load the alexnet model onto different devices through the Multi-device plugin and evaluate performance with the benchmark_app provided by OpenVINO:
$ cd /home/test/code/openvino/bin/intel64/Release
# sync, using CPU and GPU
$ ./benchmark_app -m /opt/intel/coneypo_test/deployment_tools/model_optimizer/alexnet.xml -i /opt/intel/coneypo_test/deployment_tools/inference_engine/samples/python/hello_classification/images/cat_1.png -api sync -d "MULTI:CPU,GPU"

# async, using CPU and MYRIAD
$ ./benchmark_app -m /opt/intel/coneypo_test/deployment_tools/model_optimizer/alexnet.xml -i /opt/intel/coneypo_test/deployment_tools/inference_engine/samples/python/hello_classification/images/cat_1.png -api async -d "MULTI:CPU,MYRIAD"
You can see that in sync mode the performance does not differ much.
With async, using the MULTI device plugin to load the model onto multiple devices and run inference asynchronously improves the FPS significantly:
2.4 Loading workflow
Let's walk through the implementation, taking the following command as an example:
./benchmark_app \
    -m alexnet.xml \
    -i cat_1.png \
    -api sync \
    -d MULTI:CPU,GPU,MYRIAD
2.4.1 Parse and validate the input
The input is an IR file rather than a .blob, so isNetworkCompiled is false:
bool isNetworkCompiled = fileExt(FLAGS_m) == "blob";
2.4.2 Load the Inference Engine
Core ie;
To use the CPU, the MKLDNN library needs to be loaded as a shared library:
const auto extension_ptr = std::make_shared<InferenceEngine::Extension>(FLAGS_l);
ie.AddExtension(extension_ptr);
To use the iGPU, the clDNN library is loaded:
auto ext = config.at("GPU").at(CONFIG_KEY(CONFIG_FILE));
ie.SetConfig({{CONFIG_KEY(CONFIG_FILE), ext}}, "GPU");
2.4.3 Configure the devices
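As an illustrative sketch only (not the exact benchmark_app code), this step sets per-device options on the Core before the network is loaded, for example the number of throughput streams, using the standard Inference Engine config keys:
// Illustrative only: per-device configuration before LoadNetwork
ie.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS), CONFIG_VALUE(CPU_THROUGHPUT_AUTO)}}, "CPU");
ie.SetConfig({{CONFIG_KEY(GPU_THROUGHPUT_STREAMS), CONFIG_VALUE(GPU_THROUGHPUT_AUTO)}}, "GPU");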
2.4.4 Read the IR files
CNNNetwork cnnNetwork = ie.ReadNetwork(input_model);
2.4.5 Reshape the network to match the input image size and batch size
cnnNetwork.reshape(shapes);
batchSize = (!FLAGS_layout.empty()) ? getBatchSize(app_inputs_info) : cnnNetwork.getBatchSize();
2.4.6 Configure the network inputs and outputs
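A minimal sketch of what this step typically does (the precisions and layout below are illustrative assumptions; benchmark_app derives the actual values from the model and the command-line options):
// Illustrative sketch: configure input/output precision and layout on the CNNNetwork
for (auto& item : cnnNetwork.getInputsInfo()) {
    item.second->setPrecision(Precision::U8);   // image inputs are commonly fed as U8
    item.second->setLayout(Layout::NCHW);
}
for (auto& item : cnnNetwork.getOutputsInfo()) {
    item.second->setPrecision(Precision::FP32);
}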
2.4.7 Load the model onto the devices
Note that device_name is passed in here as a raw string such as "MULTI:CPU,GPU,MYRIAD"; it has not been parsed yet:
exeNetwork = ie.LoadNetwork(cnnNetwork, device_name);
Because -d MULTI:CPU,GPU,MYRIAD was used, execution enters the MULTI plugin.
The network is loaded by MultiDeviceInferencePlugin::LoadExeNetworkImpl() in openvino/inference-engine/src/multi_device/multi_device_plugin.cpp.
For each device (CPU, GPU, MYRIAD), LoadNetwork is called once:
ExecutableNetworkInternal::Ptr MultiDeviceInferencePlugin::LoadExeNetworkImpl() {
    ...
    for (auto& p : metaDevices) {
        ...
        auto exec_net = GetCore()->LoadNetwork(network, deviceName, deviceConfig);
        ...
    }
    ...
    return std::make_shared<MultiDeviceExecutableNetwork>(executableNetworkPerDevice,
                                                          metaDevices,
                                                          multiNetworkConfig,
                                                          enablePerfCounters);
}
The LoadExeNetworkImpl function above creates a MultiDeviceExecutableNetwork object.
Inside it, for each device, networkValue.first is CPU/GPU/MYRIAD and networkValue.second is the AlexNet network loaded on that device.
If the -d option does not specify an nireq for a device, GetMetric() is used to obtain that device's optimalNum, i.e. the optimal number of inference requests for the device to handle:
MultiDeviceExecutableNetwork::MultiDeviceExecutableNetwork() {
    for (auto&& networkValue : _networksPerDevice) {
        ...
        auto numRequests = (_devicePriorities.end() == itNumRequests ||
                            itNumRequests->numRequestsPerDevices == -1)
                               ? optimalNum
                               : itNumRequests->numRequestsPerDevices;
        ...
        for (auto&& workerRequest : workerRequests) {
            ...
            workerRequest._inferRequest = network.CreateInferRequest();
            ...
        }
    }
}
Note that network.CreateInferRequest() here is implemented in openvino/inference-engine/src/inference_engine/cpp/ie_executable_network.cpp; it is not the CreateInferRequest() inside the Multi plugin.
If we set nireq=6 and create six inference requests, the workerRequests sizes for CPU/GPU/MYRIAD are 2/1/4 respectively, so 2/1/4 workerRequests are started on those devices.
The definitions of idleWorkerRequests and workerRequests:
DeviceMap<NotBusyWorkerRequests> _idleWorkerRequests;
DeviceMap<std::vector<WorkerInferRequest>> _workerRequests;
How WorkerInferRequest works will be covered in detail later.
2.4.8 Configure the optimal parameters
If the number of inference requests has already been specified, this step is skipped.
If it has not, GetMetric(METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)) is used to obtain the optimal nireq for this network:
nireq = exeNetwork.GetMetric(key).as<unsigned int>();
2.4.9 Create the inference requests (Infer Requests)
InferRequestsQueue dispatches nireq inference requests to the executable network by creating nireq InferReqWrap objects:
InferRequestsQueue inferRequestsQueue(exeNetwork, nireq);
The definition of the InferRequestsQueue class can be found in openvino/inference-engine/samples/benchmark_app/infer_request_wrap.hpp:
InferRequestsQueue(InferenceEngine::ExecutableNetwork& net, size_t nireq) {
    for (size_t id = 0; id < nireq; id++) {
        requests.push_back(std::make_shared<InferReqWrap>(
            net, id,
            std::bind(&InferRequestsQueue::putIdleRequest, this,
                      std::placeholders::_1, std::placeholders::_2)));
        _idleIds.push(id);
    }
    resetTimes();
}
Inside InferReqWrap, the inference request is created via net.CreateInferRequest():
explicit InferReqWrap(InferenceEngine::ExecutableNetwork& net, size_t id, QueueCallbackFunction callbackQueue)
    : _request(net.CreateInferRequest()), _id(id), _callbackQueue(callbackQueue) {
    _request.SetCompletionCallback([&]() {
        _endTime = Time::now();
        _callbackQueue(_id, getExecutionTimeInMilliseconds());
    });
}
For example, with nireq=6 we need to create six inference requests, via CreateInferRequest in openvino/inference-engine/src/multi_device/multi_device_exec_network.cpp:
IInferRequestInternal::Ptr MultiDeviceExecutableNetwork::CreateInferRequest() {
    auto syncRequestImpl = CreateInferRequestImpl(_networkInputs, _networkOutputs);
    syncRequestImpl->setPointerToExecutableNetworkInternal(shared_from_this());
    return std::make_shared<MultiDeviceAsyncInferRequest>(
        std::static_pointer_cast<MultiDeviceInferRequest>(syncRequestImpl),
        _needPerfCounters,
        std::static_pointer_cast<MultiDeviceExecutableNetwork>(shared_from_this()),
        _callbackExecutor);
}
CreateInferRequestImpl() picks request_to_share_blobs_with based on how many requests have been created so far, and thereby assigns each request to one of the devices.
It walks _devicePrioritiesInitial in priority order (together with the running counter _numRequestsCreated) to decide which device's worker request the current Infer Request is mapped to:
InferenceEngine::IInferRequestInternal::Ptr MultiDeviceExecutableNetwork::CreateInferRequestImpl(
        InferenceEngine::InputsDataMap networkInputs,
        InferenceEngine::OutputsDataMap networkOutputs) {
    auto num = _numRequestsCreated++;
    ...
    for (const auto& device : _devicePrioritiesInitial) {
        auto& dev_requests = _workerRequests[device.deviceName];
        ...
        if ((num - sum) < dev_requests.size()) {
            request_to_share_blobs_with = dev_requests.at(num - sum)._inferRequest;
            break;
        }
        sum += dev_requests.size();
    }
    return std::make_shared<MultiDeviceInferRequest>(networkInputs, networkOutputs, request_to_share_blobs_with);
}
This function creates a MultiDeviceInferRequest object, a class defined in openvino/inference-engine/src/multi_device/multi_device_infer_request.cpp.
When the MultiDeviceInferRequest object is created, the Infer Request is configured with the network's inputs and outputs:
MultiDeviceInferRequest::MultiDeviceInferRequest(const InputsDataMap&  networkInputs,
                                                 const OutputsDataMap& networkOutputs,
                                                 InferRequest request_to_share_blobs_with)
    : IInferRequestInternal(networkInputs, networkOutputs) {
    if (request_to_share_blobs_with) {
        for (const auto &it : _networkInputs) {
            _inputs[it.first] = request_to_share_blobs_with.GetBlob(it.first);
        }
        for (const auto &it : _networkOutputs)
            _outputs[it.first] = request_to_share_blobs_with.GetBlob(it.first);
        return;
    }
    // Allocate all input blobs
    for (const auto &it : networkInputs) {
        Layout l = it.second->getLayout();
        Precision p = it.second->getPrecision();
        SizeVector dims = it.second->getTensorDesc().getDims();

        TensorDesc desc = TensorDesc(p, dims, l);
        _inputs[it.first] = make_blob_with_precision(desc);
        _inputs[it.first]->allocate();
    }
    // Allocate all output blobs
    for (const auto &it : networkOutputs) {
        Layout l = it.second->getLayout();
        Precision p = it.second->getPrecision();
        SizeVector dims = it.second->getTensorDesc().getDims();

        TensorDesc desc = TensorDesc(p, dims, l);
        _outputs[it.first] = make_blob_with_precision(desc);
        _outputs[it.first]->allocate();
    }
}
The OpenVINO documentation describes GetBlob() as follows:
Get blobs allocated by an infer request using InferenceEngine::InferRequest::GetBlob() and feed an image and the input data to the blobs.
In other words, an OpenVINO infer request uses InferenceEngine::InferRequest::GetBlob() to feed the image and other input data into the network's blobs.
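As a small, hedged sketch of that pattern (it assumes the input is named "data" and takes FP32 data; adjust to the actual model), the blob returned by GetBlob() can be written to directly before running the request:
// Illustrative sketch: write image data into an infer request's input blob
InferRequest request = exeNetwork.CreateInferRequest();
Blob::Ptr inputBlob = request.GetBlob("data");        // "data" is the assumed input name
MemoryBlob::Ptr minput = as<MemoryBlob>(inputBlob);
auto holder = minput->wmap();                         // lock the blob for writing
float* blobData = holder.as<float*>();
// ... copy the preprocessed NCHW image into blobData here ...
request.Infer();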
For its return value, MultiDeviceExecutableNetwork::CreateInferRequest() also creates a MultiDeviceAsyncInferRequest object.
This class is defined in openvino/inference-engine/src/multi_device/multi_device_async_infer_request.cpp:
MultiDeviceAsyncInferRequest::MultiDeviceAsyncInferRequest(
    const MultiDeviceInferRequest::Ptr&         inferRequest,
    const bool                                  needPerfCounters,
    const MultiDeviceExecutableNetwork::Ptr&    multiDeviceExecutableNetwork,
    const ITaskExecutor::Ptr&                   callbackExecutor) :
    AsyncInferRequestThreadSafeDefault(inferRequest, nullptr, callbackExecutor),
    _multiDeviceExecutableNetwork{multiDeviceExecutableNetwork},
    _inferRequest{inferRequest},
    _needPerfCounters{needPerfCounters} {
    // this executor starts the inference while the task (checking the result) is passed to the next stage
    struct ThisRequestExecutor : public ITaskExecutor {
        explicit ThisRequestExecutor(MultiDeviceAsyncInferRequest* _this_) : _this{_this_} {}
        void run(Task task) override {
            auto workerInferRequest = _this->_workerInferRequest;
            workerInferRequest->_task = std::move(task);
            workerInferRequest->_inferRequest.StartAsync();
        };
        MultiDeviceAsyncInferRequest* _this = nullptr;
    };
    _pipeline = {
        // if the request is coming with device-specific remote blobs make sure it is scheduled to the specific device only:
        { /*TaskExecutor*/ std::make_shared<ImmediateExecutor>(), /*task*/ [this] {
            // by default, no preferred device:
            _multiDeviceExecutableNetwork->_thisPreferredDeviceName = "";
            // if any input is remote (e.g. was set with SetBlob), let' use the corresponding device
            for (const auto &it : _multiDeviceExecutableNetwork->GetInputsInfo()) {
                auto b = _inferRequest->GetBlob(it.first);
                auto r = b->as<RemoteBlob>();
                if (r) {
                    const auto name = r->getDeviceName();
                    const auto res = std::find_if(
                        _multiDeviceExecutableNetwork->_devicePrioritiesInitial.cbegin(),
                        _multiDeviceExecutableNetwork->_devicePrioritiesInitial.cend(),
                        [&name](const MultiDevicePlugin::DeviceInformation& d){ return d.deviceName == name; });
                    if (_multiDeviceExecutableNetwork->_devicePrioritiesInitial.cend() == res) {
                        IE_THROW() << "None of the devices (for which current MULTI-device configuration was "
                                      "initialized) supports a remote blob created on the device named " << name;
                    } else {
                        // it is ok to take the c_str() here (as pointed in the multi_device_exec_network.hpp we need to use const char*)
                        // as the original strings are from the "persistent" vector (with the right lifetime)
                        _multiDeviceExecutableNetwork->_thisPreferredDeviceName = res->deviceName.c_str();
                        break;
                    }
                }
            }
        }},
        // as the scheduling algo may select any device, this stage accepts the scheduling decision (actual workerRequest)
        // then sets the device-agnostic blobs to the actual (device-specific) request
        { /*TaskExecutor*/ _multiDeviceExecutableNetwork, /*task*/ [this] {
            _workerInferRequest = MultiDeviceExecutableNetwork::_thisWorkerInferRequest;
            _inferRequest->SetBlobsToAnotherRequest(_workerInferRequest->_inferRequest);
        }},
        // final task in the pipeline:
        { /*TaskExecutor*/ std::make_shared<ThisRequestExecutor>(this), /*task*/ [this] {
            auto status = _workerInferRequest->_status;
            if (InferenceEngine::StatusCode::OK != status) {
                if (nullptr != InferenceEngine::CurrentException())
                    std::rethrow_exception(InferenceEngine::CurrentException());
                else
                    IE_EXCEPTION_SWITCH(status, ExceptionType,
                        InferenceEngine::details::ThrowNow<ExceptionType>{}
                            <<= std::stringstream{} << IE_LOCATION
                            << InferenceEngine::details::ExceptionTraits<ExceptionType>::string());
            }
            if (_needPerfCounters)
                _perfMap = _workerInferRequest->_inferRequest.GetPerformanceCounts();
        }}
    };
}
2.4.10 Run inference
Inference is run on the created inferRequest according to whether sync or async mode is used:
if (FLAGS_api == "sync") {
    inferRequest->infer();
} else {
    inferRequest->wait();
    inferRequest->startAsync();
}
In sync mode, the infer() defined in the InferReqWrap class is executed:
void infer() {
    _startTime = Time::now();
    _request.Infer();
    _endTime = Time::now();
    _callbackQueue(_id, getExecutionTimeInMilliseconds());
}
This runs the tasks created earlier in the MultiDeviceAsyncInferRequest pipeline, which call SetBlobsToAnotherRequest:
{ /*TaskExecutor*/ _multiDeviceExecutableNetwork, /*task*/ [this] {
    _workerInferRequest = MultiDeviceExecutableNetwork::_thisWorkerInferRequest;
    _inferRequest->SetBlobsToAnotherRequest(_workerInferRequest->_inferRequest);
}},
And inside SetBlobsToAnotherRequest() is where GetBlob() is actually called, so the blobs are handed to the device-specific request that performs the inference:
void MultiDeviceInferRequest::SetBlobsToAnotherRequest(InferRequest& req) {
    for (const auto &it : _networkInputs) {
        auto &name = it.first;
        // this request is already in BUSY state, so using the internal functions safely
        auto blob = GetBlob(name);
        if (req.GetBlob(name) != blob) {
            req.SetBlob(name, blob);
        }
    }
    for (const auto &it : _networkOutputs) {
        auto &name = it.first;
        // this request is already in BUSY state, so using the internal functions safely
        auto blob = GetBlob(name);
        if (req.GetBlob(name) != blob) {
            req.SetBlob(name, blob);
        }
    }
}
If we specify -d "MULTI:CPU(1),MYRIAD(3),GPU(2)", then 1/3/2 inference requests are created on CPU/MYRIAD/GPU respectively:
If we specify -d "MULTI:CPU,MYRIAD,GPU" without giving each device an Infer request count,
the Infer requests are created in device order: first on the CPU (in async mode, with four physical cores, four can be created there), and the remaining two Infer requests are created on the second device, MYRIAD:
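The same per-device request counts can also be requested directly through the C++ API; a minimal sketch, where the device string simply mirrors the -d option above:
// Illustrative sketch: pass per-device request counts in the MULTI device string
Core ie;
CNNNetwork network = ie.ReadNetwork("alexnet.xml");
// Ask MULTI for 1/3/2 worker requests on CPU/MYRIAD/GPU respectively
ExecutableNetwork exeNetwork = ie.LoadNetwork(network, "MULTI:CPU(1),MYRIAD(3),GPU(2)");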