Building an MLP Network with the TensorRT C++ API: A Detailed Guide


This article explains in detail how to build an MLP network with the TensorRT C++ API and run inference, to help beginners like me get up to speed with the C++ workflow faster. It covers: a brief introduction, how to configure Visual Studio, the steps for building the MLP network with detailed explanations, and the original and modified code.

The companion article covering the Python API is: https://www.cnblogs.com/tangjunjun/p/16154788.html

1. Brief Introduction

TensorRT is an acceleration toolkit that NVIDIA built for its own platforms. You can think of it as a forward-pass-only deep learning framework: it parses network models from frameworks such as Caffe and TensorFlow, maps their layers one-to-one onto corresponding TensorRT layers, and thereby converts models from other frameworks into a single TensorRT representation, which it then optimizes for NVIDIA GPUs and deploys with acceleration. According to the official documentation, TensorRT can deliver a 10x or even 100x speedup compared with running in CPU or GPU mode; in my own experience it has provided roughly a 20x speedup.

TensorRT mainly does two things to speed up model execution:

  1. TensorRT supports INT8 and FP16 computation. Deep learning networks are usually trained with 32-bit or 16-bit data; at inference time TensorRT can use lower precision to speed up inference (see the sketch after this list).
  2. TensorRT restructures the network, fusing operations that can be merged, and optimizes for the characteristics of the GPU.
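
As a concrete illustration of point 1, the following is a minimal sketch of how reduced precision can be requested through the builder configuration in the TensorRT 8 C++ API. It is not part of the original article's code: it assumes a builder created as in section 3, and INT8 additionally requires a calibrator that is not shown.

// Sketch: asking the builder config for reduced-precision kernels (TensorRT 8 API).
IBuilderConfig* config = builder->createBuilderConfig();
if (builder->platformHasFastFp16())
    config->setFlag(BuilderFlag::kFP16);   // allow FP16 kernel implementations
if (builder->platformHasFastInt8())
    config->setFlag(BuilderFlag::kINT8);   // INT8 additionally needs an Int8 calibrator (not shown)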

 

 

2. Visual Studio Environment Configuration

A brief walkthrough of the Visual Studio environment configuration. It assumes you have already copied the TensorRT libraries into the CUDA folders as follows:

Prerequisites:

Copy the header files from C:\bag-tangjunjun\TensorRT-8.2.5.1.Windows10.x86_64.cuda-11.4.cudnn8.2\TensorRT-8.2.5.1\include to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include
Copy all .lib files from C:\bag-tangjunjun\TensorRT-8.2.5.1.Windows10.x86_64.cuda-11.4.cudnn8.2\TensorRT-8.2.5.1\lib to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\lib\x64
Copy all .dll files from C:\bag-tangjunjun\TensorRT-8.2.5.1.Windows10.x86_64.cuda-11.4.cudnn8.2\TensorRT-8.2.5.1\lib to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin

Steps:

① Go to Project -> <your project name> Properties -> VC++ Directories -> Include Directories, and add the include (header) path;

e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include

② Go to Project -> <your project name> Properties -> VC++ Directories -> Library Directories, and add the library path;

e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\lib\x64

③ Go to Project -> <your project name> Properties -> Linker -> Input -> Additional Dependencies, and add the following files:

nvinfer.lib
nvinfer_plugin.lib
cudart.lib

Note: alternatively, you can simply add every .lib file found in the TensorRT lib folder.
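
For reference (the exact set depends on the TensorRT package version, so check your own lib folder), the TensorRT 8.2 Windows distribution typically ships at least the following import libraries, with cudart.lib coming from the CUDA toolkit:

nvinfer.lib
nvinfer_plugin.lib
nvonnxparser.lib
nvparsers.lib
cudart.lib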

3. Building the MLP Network with the TensorRT C++ API, Step by Step

The required headers are:

#include "NvInferRuntimeCommon.h"
#include <cassert>
#include "NvInfer.h"    // TensorRT library
#include "iostream"     // Standard input/output library
#include <map>          // for weight maps
#include <fstream>      // for file-handling
#include <chrono>       // for timing the execution
using namespace nvinfer1;

Build the engine and save it to a file

① Create the logger (gLogger) in preparation for creating the builder; a minimal implementation is as follows:

class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

② Create the builder, using gLogger

IBuilder* builder = createInferBuilder(gLogger); // Create the builder with the help of the logger

③ Build the network

INetworkDefinition* network = builder->createNetworkV2(0U); // create the network definition

Once the empty network has been created, it needs to be populated with layers. This can be done with the ONNX/Caffe/UFF parsers, but this post builds the network directly with the C++ API, as follows:

ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{ 1, 1, 1 });// Create the input tensor; arguments: name, data type, dimensions
IFullyConnectedLayer* fc1 = network->addFullyConnected(*data, 1, weightMap["linear.weight"], weightMap["linear.bias"]);    // Add the fully connected layer; arguments: input tensor, number of outputs, weights, biases
fc1->getOutput(0)->setName("out");  // Name the output of fc1 via ITensor::setName() so it can be referenced later; the network's output node must be specified explicitly, otherwise TensorRT may optimize it away
network->markOutput(*fc1->getOutput(0)); // Mark it as the network output so it is not optimized away

Here weightMap is the variable that stores the weights, similar to a dictionary.
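
For intuition, a hypothetical, hand-built weightMap for this one-layer MLP might look like the sketch below; the real code parses the .wts file instead (see section 4), and the numeric values here are made up.

// Hypothetical illustration only: weightMap maps parameter names to TensorRT Weights structs.
static float w[1] = { 2.0f };   // made-up weight value
static float b[1] = { 0.5f };   // made-up bias value
std::map<std::string, Weights> weightMap;
weightMap["linear.weight"] = Weights{ DataType::kFLOAT, w, 1 };
weightMap["linear.bias"] = Weights{ DataType::kFLOAT, b, 1 };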

④ Configure the builder

This step calls TensorRT's builder to create an optimized runtime. One of the builder's functions is to search its catalog of CUDA kernels for the fastest implementation, so the GPU used to build the optimized engine must be the same as the GPU that will actually run it; this is also why an engine cannot simply be reused in a different environment.

The builder has many properties that can be set to control the precision at which the network runs, as well as auto-tuning parameters. You can also query the builder to find out which reduced-precision types the hardware itself supports.

One particularly important property is the maximum batch size: it specifies the batch size for which TensorRT will optimize. At runtime, only batch sizes no larger than this value can be used.

The config has a workspace size: layer algorithms often need temporary workspace, and this parameter limits the maximum workspace available to any layer in the network. If the allocated space is insufficient, TensorRT may not be able to find an implementation for a given layer.

IBuilderConfig* config = builder->createBuilderConfig(); // Create the builder configuration
builder->setMaxBatchSize(1); // Set the maximum batch size
config->setMaxWorkspaceSize(1 << 20); // Set the workspace size

⑤ Build the engine

ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config); // Build the CUDA engine from the network and config
network->destroy(); // the network is no longer needed once the engine is built; destroy it to free memory

⑥ Serialize the engine

IHostMemory* modelStream = engine->serialize(); // serialize the engine; modelStream holds the serialized model

⑦ Free memory

engine->destroy();  // release the engine
builder->destroy(); // release the builder

⑧ Save the serialized engine

 // Open the file and write the contents there in binary format
    std::ofstream p(file_engine, std::ios::binary);
    if (!p) {
        std::cerr << "could not open plan output file" << std::endl;
        return;
    }
    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());

Here modelStream is the serialized model, and file_engine is the path where the engine will be saved, e.g. "C:\\Users\\Administrator\\Desktop\\code\\tensorrt-code\\mlp\\mlp.engine"

⑨ Release the serialized memory

modelStream->destroy();

  

The above is the complete flow for compiling a network into an engine with the TensorRT C++ API and saving it. If you later want to build a different network, the main thing to change is step ③, the network construction, as sketched below.
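
For example, a slightly deeper MLP could be described by replacing the body of step ③ with something like the sketch below. This is only an illustration under assumptions: the layer names fc1/fc2 and the hidden width of 16 are invented, and matching entries would have to exist in weightMap.

// Hypothetical two-layer MLP with a ReLU, replacing the single fully connected layer of step 3.
ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{ 1, 1, 1 });
IFullyConnectedLayer* fc1 = network->addFullyConnected(*data, 16, weightMap["fc1.weight"], weightMap["fc1.bias"]);
IActivationLayer* relu1 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu1->getOutput(0), 1, weightMap["fc2.weight"], weightMap["fc2.bias"]);
fc2->getOutput(0)->setName("out");
network->markOutput(*fc2->getOutput(0));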

 

Reload the engine file and run inference:

① Read the engine file

    char* trtModelStream{ nullptr }; // buffer that will hold the serialized engine read from disk
    size_t size{ 0 };

    // read model from the engine file
    std::ifstream file(file_engine, std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    }

Here file_engine is: file_engine = "C:\\Users\\Administrator\\Desktop\\code\\tensorrt-code\\mlp\\mlp.engine"

② Deserialize

   // create a runtime (required for deserialization of model) with NVIDIA's logger
    IRuntime* runtime = createInferRuntime(gLogger); // create the runtime, needed to deserialize the engine
    assert(runtime != nullptr);
    // deserialize engine for using the char-stream
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
    assert(engine != nullptr);
    /*
    An engine can have multiple execution contexts, allowing one set of weights to be used for several inference tasks, e.g. processing images in parallel CUDA streams with one context per stream. Each context is created on the same GPU as the engine.
    */
    runtime->destroy(); // the runtime is no longer needed after deserialization; destroy it to free memory

Here gLogger is the same logger object created in step ① of the engine-building section.
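
The comment in the code above mentions that one engine can serve several execution contexts, one per CUDA stream. A minimal sketch of that idea follows; it assumes separate device buffers buffers0 and buffers1 have been allocated for each context, which is not shown.

// Sketch: one engine, two contexts, two CUDA streams running in parallel (buffers0/buffers1 assumed).
IExecutionContext* ctx0 = engine->createExecutionContext();
IExecutionContext* ctx1 = engine->createExecutionContext();
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
// ctx0->enqueue(batchSize, buffers0, s0, nullptr);
// ctx1->enqueue(batchSize, buffers1, s1, nullptr);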

 

The steps above are the initialization; the steps below perform the actual inference.

③ Create the execution context

IExecutionContext* context = engine->createExecutionContext(); // create execution context -- required for inference executions

 

④ Prepare the input and output

    float out[1];  // array for output
    float data[1]; // array for input
    for (float& i : data)
        i = 12.0;   // put any value for input

⑤ Run inference

   // do inference using the parameters
    doInference(*context, data, out, 1);

  

 

void doInference(IExecutionContext& context, float* input, float* output, int batchSize) {
    const ICudaEngine& engine = context.getEngine();  // Get engine from the context
    // Pointers to input and output device buffers to pass to engine.
    void* buffers[2]; // Engine requires exactly IEngine::getNbBindings() number of buffers.

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex("data");
    const int outputIndex = engine.getBindingIndex("out");

    // Create GPU buffers on device -- allocate memory for input and output
    cudaMalloc(&buffers[inputIndex], batchSize * INPUT_SIZE * sizeof(float));
    cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float));

    // create CUDA stream for simultaneous CUDA operations
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // copy input from host (CPU) to device (GPU)  in stream
    cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_SIZE * sizeof(float), cudaMemcpyHostToDevice, stream);

    // execute inference using context provided by engine
    context.enqueue(batchSize, buffers, stream, nullptr); // <-- the actual inference call

    // copy output back from device (GPU) to host (CPU)
    cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,
        stream);

    // synchronize the stream to prevent issues (block CUDA and wait for CUDA operations to be completed)
    cudaStreamSynchronize(stream);

    // Release stream and buffers (memory)
    cudaStreamDestroy(stream);
    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
}

This completes the TensorRT inference procedure.
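
Note that doInference above relies on two constants that are defined in the full source file in section 4; for this article's one-input, one-output MLP they are simply:

const int INPUT_SIZE = 1;   // number of input elements per batch item
const int OUTPUT_SIZE = 1;  // number of output elements per batch item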

 

4. Code

Ⅰ. Weight-loading code and the .wts file format:

How TensorRT consumes weights: how a TensorRT .wts weight file is saved, and how to load a .wts weight file in C++.

① The .wts weight file format:

(The example values below refer to a LeNet-style .wts file shown as a screenshot in the original post; the image is not reproduced here.)

The leading 12 means that 12 layer tensors are stored in the file.

conv1bias is the bias of the first convolution layer; the 0 that follows it indicates the kFLOAT type, i.e. float32, and the 20 after that is the number of coefficients (the layer has 20 outputs, so there are 20 bias values).

conv1filter holds the convolution kernel coefficients: 20 kernels of size 5 x 5, i.e. 20 x 5 x 5 = 500 parameters.

You simply use an appropriate tool to parse the trained model and store the layer-name / weight-value key-value pairs in this file.
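
For the single-layer MLP used in this article the .wts file is tiny; it is reproduced again at the end of section Ⅱ. The first line is the number of entries, and each following line holds a parameter name, the number of values, and the values encoded as hexadecimal 32-bit floats:

2
linear.weight 1 3fff7e32
linear.bias 1 3c138a5a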

② Loading the .wts weights in C++

/*
Weights is a struct type provided by TensorRT:
class Weights
{
public:
    DataType type;      //!< The type of the weights.
    const void* values; //!< The weight values, in a contiguous array.
    int64_t count;      //!< The number of weights in the array.
};

*/
// file is the path to the .wts file
std::map<std::string, Weights> loadWeights(const std::string file) {
    /**
     * Parse the .wts file and store weights in dict format.
     *
     * @param file path to .wts file
     * @return weight_map: dictionary containing weights and their values
     */

    std::cout << "[INFO]: Loading weights..." << file << std::endl;
    std::map<std::string, Weights> weightMap;  // map to be filled and returned

    // Open Weight file
    std::ifstream input(file);
    assert(input.is_open() && "[ERROR]: Unable to load weight file...");

    // Read number of weights
    int32_t count;
    input >> count;
    assert(count > 0 && "Invalid weight map file.");

    // Loop through number of line, actually the number of weights & biases
    while (count--) {
        // TensorRT weights
        Weights wt{DataType::kFLOAT, nullptr, 0};
        uint32_t size;
        // Read name and type of weights
        std::string w_name;
        input >> w_name >> std::dec >> size;
        wt.type = DataType::kFLOAT;

        uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(val) * size));
        for (uint32_t x = 0, y = size; x < y; ++x) {
            // Change hex values to uint32 (for higher values)
            input >> std::hex >> val[x]; // values are stored as hexadecimal
        }
        wt.values = val;
        wt.count = size;

        // Add weight values against its name (key)
        weightMap[w_name] = wt;  // store the weights under their parameter name
    }
    return weightMap;
}
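
A short usage sketch: the map returned by loadWeights is consumed while building the network, and because the values were allocated with malloc, the host-side copies should be freed once the engine has been built (the full code in section Ⅱ does this inside createMLPEngine). The file path below is illustrative.

// Sketch: load the weights, build layers from them, then free the host-side copies.
std::map<std::string, Weights> weightMap = loadWeights("mlp.wts"); // illustrative path
// ... network->addFullyConnected(...) etc., using entries of weightMap ...
for (auto& mem : weightMap) {
    free((void*)(mem.second.values));
}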

  

 

Ⅱ. Original code:

Implementing MLP inference with the TensorRT C++ API requires three files: the header file logging.h, the source file mlp.cpp, and the weight file mlp.wts.

The three files are given below:

Header file logging.h

/*
 * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef TENSORRT_LOGGING_H
#define TENSORRT_LOGGING_H


#include "NvInferRuntimeCommon.h"
#include <cassert>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

//using namespace nvinfer1;

using Severity = nvinfer1::ILogger::Severity;


class LogStreamConsumerBuffer : public std::stringbuf
{
public:
    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)
        : mOutput(stream)
        , mPrefix(prefix)
        , mShouldLog(shouldLog)
    {
    }

    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)
        : mOutput(other.mOutput)
    {
    }

    ~LogStreamConsumerBuffer()
    {
        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence
        // std::streambuf::pptr() gives a pointer to the current position of the output sequence
        // if the pointer to the beginning is not equal to the pointer to the current position,
        // call putOutput() to log the output to the stream
        if (pbase() != pptr())
        {
            putOutput();
        }
    }

    // synchronizes the stream buffer and returns 0 on success
    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,
    // resetting the buffer and flushing the stream
    virtual int sync()
    {
        putOutput();
        return 0;
    }

    void putOutput()
    {
        if (mShouldLog)
        {
            // prepend timestamp
            std::time_t timestamp = std::time(nullptr);
            tm* tm_local = std::localtime(&timestamp);
            std::cout << "[";
            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << "/";
            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << "/";
            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << "-";
            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << ":";
            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << ":";
            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << "] ";
            // std::stringbuf::str() gets the string contents of the buffer
            // insert the buffer contents pre-appended by the appropriate prefix into the stream
            mOutput << mPrefix << str();
            // set the buffer to empty
            str("");
            // flush the stream
            mOutput.flush();
        }
    }

    void setShouldLog(bool shouldLog)
    {
        mShouldLog = shouldLog;
    }

private:
    std::ostream& mOutput;
    std::string mPrefix;
    bool mShouldLog;
};

//!
//! \class LogStreamConsumerBase
//! \brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer
//!
class LogStreamConsumerBase
{
public:
    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)
        : mBuffer(stream, prefix, shouldLog)
    {
    }

protected:
    LogStreamConsumerBuffer mBuffer;
};

//!
//! \class LogStreamConsumer
//! \brief Convenience object used to facilitate use of C++ stream syntax when logging messages.
//!  Order of base classes is LogStreamConsumerBase and then std::ostream.
//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field
//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.
//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.
//!  Please do not change the order of the parent classes.
//!
class LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream
{
public:
    //! \brief Creates a LogStreamConsumer which logs messages with level severity.
    //!  Reportable severity determines if the messages are severe enough to be logged.
    LogStreamConsumer(Severity reportableSeverity, Severity severity)
        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)
        , std::ostream(&mBuffer) // links the stream buffer with the stream
        , mShouldLog(severity <= reportableSeverity)
        , mSeverity(severity)
    {
    }

    LogStreamConsumer(LogStreamConsumer&& other)
        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)
        , std::ostream(&mBuffer) // links the stream buffer with the stream
        , mShouldLog(other.mShouldLog)
        , mSeverity(other.mSeverity)
    {
    }

    void setReportableSeverity(Severity reportableSeverity)
    {
        mShouldLog = mSeverity <= reportableSeverity;
        mBuffer.setShouldLog(mShouldLog);
    }

private:
    static std::ostream& severityOstream(Severity severity)
    {
        return severity >= Severity::kINFO ? std::cout : std::cerr;
    }

    static std::string severityPrefix(Severity severity)
    {
        switch (severity)
        {
        case Severity::kINTERNAL_ERROR: return "[F] ";
        case Severity::kERROR: return "[E] ";
        case Severity::kWARNING: return "[W] ";
        case Severity::kINFO: return "[I] ";
        case Severity::kVERBOSE: return "[V] ";
        default: assert(0); return "";
        }
    }

    bool mShouldLog;
    Severity mSeverity;
};

//! \class Logger
//!
//! \brief Class which manages logging of TensorRT tools and samples
//!
//! \details This class provides a common interface for TensorRT tools and samples to log information to the console,
//! and supports logging two types of messages:
//!
//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)
//! - Test pass/fail messages
//!
//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is
//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.
//!
//! In the future, this class could be extended to support dumping test results to a file in some standard format
//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).
//!
//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger
//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT
//! library and messages coming from the sample.
//!
//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the
//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger
//! object.

class Logger : public nvinfer1::ILogger
{
public:
    Logger(Severity severity = Severity::kWARNING)
        : mReportableSeverity(severity)
    {
    }

    //!
    //! \enum TestResult
    //! \brief Represents the state of a given test
    //!
    enum class TestResult
    {
        kRUNNING, //!< The test is running
        kPASSED,  //!< The test passed
        kFAILED,  //!< The test failed
        kWAIVED   //!< The test was waived
    };

    //!
    //! \brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger
    //! \return The nvinfer1::ILogger associated with this Logger
    //!
    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,
    //! we can eliminate the inheritance of Logger from ILogger
    //!
    nvinfer1::ILogger& getTRTLogger()
    {
        return *this;
    }

    //!
    //! \brief Implementation of the nvinfer1::ILogger::log() virtual method
    //!
    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the
    //! inheritance from nvinfer1::ILogger
    //!
    void log(Severity severity, const char* msg) noexcept override
    {
        LogStreamConsumer(mReportableSeverity, severity) << "[TRT] " << std::string(msg) << std::endl;
    }

    //!
    //! \brief Method for controlling the verbosity of logging output
    //!
    //! \param severity The logger will only emit messages that have severity of this level or higher.
    //!
    void setReportableSeverity(Severity severity)
    {
        mReportableSeverity = severity;
    }

    //!
    //! \brief Opaque handle that holds logging information for a particular test
    //!
    //! This object is an opaque handle to information used by the Logger to print test results.
    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used
    //! with Logger::reportTest{Start,End}().
    //!
   
    class TestAtom
    {
    public:
        TestAtom(TestAtom&&) = default;

    private:
        friend class Logger;

        TestAtom(bool started, const std::string& name, const std::string& cmdline)
            : mStarted(started)
            , mName(name)
            , mCmdline(cmdline)
        {
        }

        bool mStarted;
        std::string mName;
        std::string mCmdline;
    };

    //!
    //! \brief Define a test for logging
    //!
    //! \param[in] name The name of the test.  This should be a string starting with
    //!                  "TensorRT" and containing dot-separated strings containing
    //!                  the characters [A-Za-z0-9_].
    //!                  For example, "TensorRT.sample_googlenet"
    //! \param[in] cmdline The command line used to reproduce the test
    //
    //! \return a TestAtom that can be used in Logger::reportTest{Start,End}().
    //!
    static TestAtom defineTest(const std::string& name, const std::string& cmdline)
    {
        return TestAtom(false, name, cmdline);
    }

    //!
    //! \brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments
    //!        as input
    //!
    //! \param[in] name The name of the test
    //! \param[in] argc The number of command-line arguments
    //! \param[in] argv The array of command-line arguments (given as C strings)
    //!
    //! \return a TestAtom that can be used in Logger::reportTest{Start,End}().
    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)
    {
        auto cmdline = genCmdlineString(argc, argv);
        return defineTest(name, cmdline);
    }

    //!
    //! \brief Report that a test has started.
    //!
    //! \pre reportTestStart() has not been called yet for the given testAtom
    //!
    //! \param[in] testAtom The handle to the test that has started
    //!
    static void reportTestStart(TestAtom& testAtom)
    {
        reportTestResult(testAtom, TestResult::kRUNNING);
        assert(!testAtom.mStarted);
        testAtom.mStarted = true;
    }

    //!
    //! \brief Report that a test has ended.
    //!
    //! \pre reportTestStart() has been called for the given testAtom
    //!
    //! \param[in] testAtom The handle to the test that has ended
    //! \param[in] result The result of the test. Should be one of TestResult::kPASSED,
    //!                   TestResult::kFAILED, TestResult::kWAIVED
    //!
    static void reportTestEnd(const TestAtom& testAtom, TestResult result)
    {
        assert(result != TestResult::kRUNNING);
        assert(testAtom.mStarted);
        reportTestResult(testAtom, result);
    }

    static int reportPass(const TestAtom& testAtom)
    {
        reportTestEnd(testAtom, TestResult::kPASSED);
        return EXIT_SUCCESS;
    }

    static int reportFail(const TestAtom& testAtom)
    {
        reportTestEnd(testAtom, TestResult::kFAILED);
        return EXIT_FAILURE;
    }

    static int reportWaive(const TestAtom& testAtom)
    {
        reportTestEnd(testAtom, TestResult::kWAIVED);
        return EXIT_SUCCESS;
    }

    static int reportTest(const TestAtom& testAtom, bool pass)
    {
        return pass ? reportPass(testAtom) : reportFail(testAtom);
    }

    Severity getReportableSeverity() const
    {
        return mReportableSeverity;
    }

private:
    //!
    //! \brief returns an appropriate string for prefixing a log message with the given severity
    //!
    static const char* severityPrefix(Severity severity)
    {
        switch (severity)
        {
        case Severity::kINTERNAL_ERROR: return "[F] ";
        case Severity::kERROR: return "[E] ";
        case Severity::kWARNING: return "[W] ";
        case Severity::kINFO: return "[I] ";
        case Severity::kVERBOSE: return "[V] ";
        default: assert(0); return "";
        }
    }

    //!
    //! \brief returns an appropriate string for prefixing a test result message with the given result
    //!
    static const char* testResultString(TestResult result)
    {
        switch (result)
        {
        case TestResult::kRUNNING: return "RUNNING";
        case TestResult::kPASSED: return "PASSED";
        case TestResult::kFAILED: return "FAILED";
        case TestResult::kWAIVED: return "WAIVED";
        default: assert(0); return "";
        }
    }

    //!
    //! \brief returns an appropriate output stream (cout or cerr) to use with the given severity
    //!
    static std::ostream& severityOstream(Severity severity)
    {
        return severity >= Severity::kINFO ? std::cout : std::cerr;
    }

    //!
    //! \brief method that implements logging test results
    //!
    static void reportTestResult(const TestAtom& testAtom, TestResult result)
    {
        severityOstream(Severity::kINFO) << "&&&& " << testResultString(result) << " " << testAtom.mName << " # "
                                         << testAtom.mCmdline << std::endl;
    }

    //!
    //! \brief generate a command line string from the given (argc, argv) values
    //!
    static std::string genCmdlineString(int argc, char const* const* argv)
    {
        std::stringstream ss;
        for (int i = 0; i < argc; i++)
        {
            if (i > 0)
                ss << " ";
            ss << argv[i];
        }
        return ss.str();
    }

    Severity mReportableSeverity;
};

namespace
{

//!
//! \brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE
//!
//! Example usage:
//!
//!     LOG_VERBOSE(logger) << "hello world" << std::endl;
//!
inline LogStreamConsumer LOG_VERBOSE(const Logger& logger)
{
    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);
}

//!
//! \brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO
//!
//! Example usage:
//!
//!     LOG_INFO(logger) << "hello world" << std::endl;
//!
inline LogStreamConsumer LOG_INFO(const Logger& logger)
{
    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);
}

//!
//! \brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING
//!
//! Example usage:
//!
//!     LOG_WARN(logger) << "hello world" << std::endl;
//!
inline LogStreamConsumer LOG_WARN(const Logger& logger)
{
    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);
}

//!
//! \brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR
//!
//! Example usage:
//!
//!     LOG_ERROR(logger) << "hello world" << std::endl;
//!
inline LogStreamConsumer LOG_ERROR(const Logger& logger)
{
    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);
}

//!
//! \brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR
//         ("fatal" severity)
//!
//! Example usage:
//!
//!     LOG_FATAL(logger) << "hello world" << std::endl;
//!
inline LogStreamConsumer LOG_FATAL(const Logger& logger)
{
    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);
}

} // anonymous namespace

#endif // TENSORRT_LOGGING_H

Source file mlp.cpp

#include "NvInfer.h"    // TensorRT library
#include "iostream"     // Standard input/output library
#include "logging.h"    // logging file -- by NVIDIA
#include <map>          // for weight maps
#include <fstream>      // for file-handling
#include <chrono>       // for timing the execution





// provided by nvidia for using TensorRT APIs
using namespace nvinfer1;

// Logger from TRT API
static Logger gLogger;




const int INPUT_SIZE = 1;
const int OUTPUT_SIZE = 1;











/******************************************************* Build the engine, serialize it, and save it *******************************************************/

/*
Weights is a struct type provided by TensorRT:
class Weights
{
public:
    DataType type;      //!< The type of the weights.
    const void* values; //!< The weight values, in a contiguous array.
    int64_t count;      //!< The number of weights in the array.
};
*/

// weight-loading function
std::map<std::string, Weights> loadWeights(const std::string file) {
    /**
     * Parse the .wts file and store weights in dict format.
     *
     * @param file path to .wts file
     * @return weight_map: dictionary containing weights and their values
     */

    std::cout << "[INFO]: Loading weights..." << file << std::endl;
    std::map<std::string, Weights> weightMap;  // map to be filled and returned

    // Open Weight file
    std::ifstream input(file);
    assert(input.is_open() && "[ERROR]: Unable to load weight file...");

    // Read number of weights
    int32_t count;
    input >> count;
    assert(count > 0 && "Invalid weight map file.");

    // Loop through number of line, actually the number of weights & biases
    while (count--) {
        // TensorRT weights
        Weights wt{DataType::kFLOAT, nullptr, 0};
        uint32_t size;
        // Read name and type of weights
        std::string w_name;
        input >> w_name >> std::dec >> size;
        wt.type = DataType::kFLOAT;

        uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(val) * size));
        for (uint32_t x = 0, y = size; x < y; ++x) {
            // Change hex values to uint32 (for higher values)
            input >> std::hex >> val[x]; // values are stored as hexadecimal
        }
        wt.values = val;
        wt.count = size;

        // Add weight values against its name (key)
        weightMap[w_name] = wt;  // store the weights under their parameter name
    }
    return weightMap;
}
// engine-building function
ICudaEngine *createMLPEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt, 
    const std::string file_wts) {
    /**
     * Create Multi-Layer Perceptron using the TRT Builder and Configurations
     *
     * @param maxBatchSize: batch size for built TRT model
     * @param builder: to build engine and networks
     * @param config: configuration related to Hardware
     * @param dt: datatype for model layers
     * @return engine: TRT model
     */

    //std::cout << "[INFO]: Creating MLP using TensorRT..." << std::endl;

    // Load Weights from relevant file
    std::map<std::string, Weights> weightMap = loadWeights(file_wts);  // load the weights

    // Create an empty network
    INetworkDefinition *network = builder->createNetworkV2(0U); // create the network definition

    // Create an input with proper *name
    ITensor *data = network->addInput("data", DataType::kFLOAT, Dims3{1, 1, 1});
    assert(data);

    // Add the fully connected layer: input tensor, number of outputs, weights, biases
    IFullyConnectedLayer *fc1 = network->addFullyConnected(*data, 1,
                                                           weightMap["linear.weight"],
                                                           weightMap["linear.bias"]);
    assert(fc1);

    // set output with *name
    fc1->getOutput(0)->setName("out");

    // mark the output
    network->markOutput(*fc1->getOutput(0));

    // Set configurations
    builder->setMaxBatchSize(1);
    // Set workspace size
    config->setMaxWorkspaceSize(1 << 20);

    // Build CUDA Engine using network and configurations
    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
    assert(engine != nullptr);

    // Don't need the network any more
    // free captured memory
    network->destroy(); // destroy the network

    // Release host memory
    for (auto &mem: weightMap) {
        free((void *) (mem.second.values));
    }

    return engine;
}
// Build the model; mainly calls createMLPEngine and serializes the result
void APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream,
    const std::string file_wts) {
    /**
     * Create engine using TensorRT APIs
     *
     * @param maxBatchSize: for the deployed model configs
     * @param modelStream: shared memory to store serialized model
     */

    // Create builder with the help of logger
    IBuilder *builder = createInferBuilder(gLogger); // create the builder

    // Create hardware configs
    IBuilderConfig *config = builder->createBuilderConfig(); // create the builder configuration

    // Build an engine
    ICudaEngine *engine = createMLPEngine(maxBatchSize, builder, config, DataType::kFLOAT, file_wts); // build the engine
    // DataType::kFLOAT selects the data type: kFLOAT = 0 (32-bit float), kHALF = 1 (16-bit float),
    // kINT8 = 2 (8-bit quantized integer), kINT32 = 3 (signed 32-bit integer),
    // kBOOL = 4 (8-bit boolean: 0 = false, 1 = true, other values undefined).
       


    assert(engine != nullptr);

    // serialize the engine into binary stream
    (*modelStream) = engine->serialize(); // serialize the engine

    // free up the memory
    engine->destroy();  // release the engine
    builder->destroy(); // release the builder
}
// Load the .wts weights, build the network, convert it to an engine, and save the serialized engine
void performSerialization(std::string file_wts= "C:\\Users\\Administrator\\Desktop\\C++code\\trt-try\\mlp.wts",
    std::string file_engine= "C:\\Users\\Administrator\\Desktop\\C++code\\trt-try\\mlp1.engine") {
    /**
     * Serialize the network and save it in engine format.
     * file_wts: path of the weight file to load
     * file_engine: path where the engine will be saved
     */
    // Shared memory object
    IHostMemory *modelStream{nullptr}; // IHostMemory object that will hold the serialized engine

    // Write model into stream
    APIToModel(1, &modelStream,file_wts);
    assert(modelStream != nullptr);


    std::cout << "[INFO]: Writing engine into binary..." << std::endl;

    // Open the file and write the contents there in binary format
    std::ofstream p(file_engine, std::ios::binary);
    if (!p) {
        std::cerr << "could not open plan output file" << std::endl;
        return;
    }
    p.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());

    // Release the memory
    modelStream->destroy();

    std::cout << "[INFO]: Successfully created TensorRT engine..." << std::endl;
    std::cout << "\n\tRun inference using `./mlp -d`" << std::endl;

}










/****************************************************************** Load the engine and run inference ******************************************************/


void doInference(IExecutionContext &context, float *input, float *output, int batchSize) {
    /**
     * Perform inference using the CUDA context
     *
     * @param context: context created by engine
     * @param input: input from the host
     * @param output: output to save on host
     * @param batchSize: batch size for TRT model
     */

    // Get engine from the context
    const ICudaEngine &engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex("data");
    const int outputIndex = engine.getBindingIndex("out");

    // Create GPU buffers on device -- allocate memory for input and output
    cudaMalloc(&buffers[inputIndex], batchSize * INPUT_SIZE * sizeof(float));
    cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float));

    // create CUDA stream for simultaneous CUDA operations
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // copy input from host (CPU) to device (GPU)  in stream
    cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_SIZE * sizeof(float), cudaMemcpyHostToDevice, stream);

    // execute inference using context provided by engine
    context.enqueue(batchSize, buffers, stream, nullptr);

    // copy output back from device (GPU) to host (CPU)
    cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,
                    stream);

    // synchronize the stream to prevent issues
    //      (block CUDA and wait for CUDA operations to be completed)
    cudaStreamSynchronize(stream);

    // Release stream and buffers (memory)
    cudaStreamDestroy(stream);
    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
}









ICudaEngine* inite_engine(std::string file_engine = "C:\\Users\\Administrator\\Desktop\\C++code\\trt-try\\mlp.engine"){
    /*
    Read the engine file, deserialize it, and construct the engine; this is effectively the network initialization.
    */

    char* trtModelStream{ nullptr }; // buffer that will hold the serialized engine read from disk
    size_t size{ 0 };

    // read model from the engine file
    std::ifstream file(file_engine, std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    }

    // create a runtime (required for deserialization of model) with NVIDIA's logger
    IRuntime* runtime = createInferRuntime(gLogger); // create the runtime, needed to deserialize the engine
    assert(runtime != nullptr);
    // deserialize engine for using the char-stream
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
    assert(engine != nullptr);

    /*
    An engine can have multiple execution contexts, allowing one set of weights to be used for several inference tasks, e.g. processing images in parallel CUDA streams with one context per stream. Each context is created on the same GPU as the engine.
    */
    runtime->destroy();
    return engine;

};



auto infer(ICudaEngine* engine) {

 
    // create execution context -- required for inference executions
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);

    float out[1];  // array for output
    float data[1]; // array for input
    for (float& i : data)
        i = 12.0;   // put any value for input

    // time the execution
    auto start = std::chrono::system_clock::now();

    // do inference using the parameters
    doInference(*context, data, out, 1);

    // time the execution
    auto end = std::chrono::system_clock::now();
    std::cout << "\n[INFO]: Time taken by execution: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;


    // free the captured space
    context->destroy();
    engine->destroy();
    //runtime->destroy();

    std::cout << "\nInput:\t" << data[0];
    std::cout << "\nOutput:\t";
    for (float i : out) {
        std::cout << i;
    }
    std::cout << std::endl;

    return out[0]; // return the scalar result by value (returning the local array would leave a dangling pointer)

};



int checkArgs(int argc, char **argv) {
    /**
     * Parse command line arguments
     *
     * @param argc: argument count
     * @param argv: arguments vector
     * @return int: a flag to perform operation
     */

    if (argc != 2) {
        std::cerr << "[ERROR]: Arguments not right!" << std::endl;
        std::cerr << "./mlp -s   // serialize model to plan file" << std::endl;
        std::cerr << "./mlp -d   // deserialize plan file and run inference" << std::endl;
        return -1;
    }
    if (std::string(argv[1]) == "-s") {
        return 1;
    } else if (std::string(argv[1]) == "-d") {
        return 2;
    }
    return -1;
}

int argc = 2;

int main() {
    
    //int argc = 2;
    int args = 2;
    //int args = checkArgs(argc, argv);
    if (args == 1)
    {
        performSerialization();
    }
    else if (args == 2)
    {
        ICudaEngine* engine = inite_engine();
        auto out=infer(engine);
        std::cout << out << std::endl;
        //performInference();
    }
    return 0;
}

Weight file mlp.wts

2
linear.weight 1 3fff7e32
linear.bias 1 3c138a5a

A screenshot of the resulting console output appears in the original post (not reproduced here).

 

Ⅲ. Modified code

With a little investigation, a simple logger definition can be used instead, so the logging.h file is no longer needed and the program still runs. The code is as follows:

#include "NvInfer.h"    // TensorRT library
#include "iostream"     // Standard input/output library
//#include "logging.h"    // logging file -- by NVIDIA
#include <map>          // for weight maps
#include <fstream>      // for file-handling
#include <chrono>       // for timing the execution



// provided by nvidia for using TensorRT APIs
using namespace nvinfer1;


/***************************************** Logger used by the original code ********************************************/
// Logger from TRT API
//static Logger gLogger;


//using namespace nvinfer1;
/**************************************** Logger used after the modification *****************************************/
#include "NvInferRuntimeCommon.h"
#include <cassert>
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;








const int INPUT_SIZE = 1;
const int OUTPUT_SIZE = 1;











/******************************************************* Build the engine, serialize it, and save it *******************************************************/

/*
Weights is a struct type provided by TensorRT:
class Weights
{
public:
    DataType type;      //!< The type of the weights.
    const void* values; //!< The weight values, in a contiguous array.
    int64_t count;      //!< The number of weights in the array.
};
*/

// weight-loading function
std::map<std::string, Weights> loadWeights(const std::string file) {
    /**
     * Parse the .wts file and store weights in dict format.
     *
     * @param file path to .wts file
     * @return weight_map: dictionary containing weights and their values
     */

    std::cout << "[INFO]: Loading weights..." << file << std::endl;
    std::map<std::string, Weights> weightMap;  // map to be filled and returned

    // Open Weight file
    std::ifstream input(file);
    assert(input.is_open() && "[ERROR]: Unable to load weight file...");

    // Read number of weights
    int32_t count;
    input >> count;
    assert(count > 0 && "Invalid weight map file.");

    // Loop through number of line, actually the number of weights & biases
    while (count--) {
        // TensorRT weights
        Weights wt{ DataType::kFLOAT, nullptr, 0 };
        uint32_t size;
        // Read name and type of weights
        std::string w_name;
        input >> w_name >> std::dec >> size;
        wt.type = DataType::kFLOAT;

        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));
        for (uint32_t x = 0, y = size; x < y; ++x) {
            // Change hex values to uint32 (for higher values)
            input >> std::hex >> val[x]; // values are stored as hexadecimal
        }
        wt.values = val;
        wt.count = size;

        // Add weight values against its name (key)
        weightMap[w_name] = wt;  // store the weights under their parameter name
    }
    return weightMap;
}
// engine-building function
ICudaEngine* createMLPEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,
    const std::string file_wts) {
    /**
     * Create Multi-Layer Perceptron using the TRT Builder and Configurations
     *
     * @param maxBatchSize: batch size for built TRT model
     * @param builder: to build engine and networks
     * @param config: configuration related to Hardware
     * @param dt: datatype for model layers
     * @return engine: TRT model
     */

     //std::cout << "[INFO]: Creating MLP using TensorRT..." << std::endl;

     // Load Weights from relevant file
    std::map<std::string, Weights> weightMap = loadWeights(file_wts);  // load the weights

    // Create an empty network
    INetworkDefinition* network = builder->createNetworkV2(0U); // create the network definition

    // Create an input with proper *name
    ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{ 1, 1, 1 });
    assert(data);

    // Add the fully connected layer: input tensor, number of outputs, weights, biases
    IFullyConnectedLayer* fc1 = network->addFullyConnected(*data, 1,
        weightMap["linear.weight"],
        weightMap["linear.bias"]);
    assert(fc1);

    // set output with *name
    fc1->getOutput(0)->setName("out");
    /* Name the output of fc1 via ITensor::setName() so it can be referenced later.
    The network's output node must be specified explicitly; otherwise TensorRT may optimize it away.
    */
    // mark the output
    network->markOutput(*fc1->getOutput(0)); // mark it as the network output so it is not optimized away

    // Set configurations
    builder->setMaxBatchSize(1);
    // Set workspace size
    config->setMaxWorkspaceSize(1 << 20);

    // Build CUDA Engine using network and configurations
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    assert(engine != nullptr);

    // Don't need the network any more
    // free captured memory
    network->destroy(); // destroy the network

    // Release host memory
    for (auto& mem : weightMap) {
        free((void*)(mem.second.values));
    }

    return engine;
}
// Build the model; mainly calls createMLPEngine and serializes the result
void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream,
    const std::string file_wts) {
    /**
     * Create engine using TensorRT APIs
     *
     * @param maxBatchSize: for the deployed model configs
     * @param modelStream: shared memory to store serialized model
     */

     // Create builder with the help of logger
    IBuilder* builder = createInferBuilder(gLogger); // create the builder

    // Create hardware configs
    IBuilderConfig* config = builder->createBuilderConfig(); // create the builder configuration

    // Build an engine
    ICudaEngine* engine = createMLPEngine(maxBatchSize, builder, config, DataType::kFLOAT, file_wts); // build the engine
    // DataType::kFLOAT selects the data type: kFLOAT = 0 (32-bit float), kHALF = 1 (16-bit float),
    // kINT8 = 2 (8-bit quantized integer), kINT32 = 3 (signed 32-bit integer),
    // kBOOL = 4 (8-bit boolean: 0 = false, 1 = true, other values undefined).



    assert(engine != nullptr);

    // serialize the engine into binary stream
    (*modelStream) = engine->serialize(); // serialize the engine

    // free up the memory
    engine->destroy();  // release the engine
    builder->destroy(); // release the builder
}
// Load the .wts weights, build the network, convert it to an engine, and save the serialized engine
void performSerialization(std::string file_wts = "C:\\Users\\Administrator\\Desktop\\code\\tensorrt-code\\mlp\\mlp.wts",
    std::string file_engine = "C:\\Users\\Administrator\\Desktop\\code\\tensorrt-code\\mlp\\mlp.engine") {
    /**
     * Serialize the network and save it in engine format.
     * file_wts: path of the weight file to load
     * file_engine: path where the engine will be saved
     */
     // Shared memory object
    IHostMemory* modelStream{ nullptr }; // IHostMemory object that will hold the serialized engine

    // Write model into stream
    APIToModel(1, &modelStream, file_wts);
    assert(modelStream != nullptr);


    std::cout << "[INFO]: Writing engine into binary..." << std::endl;

    // Open the file and write the contents there in binary format
    std::ofstream p(file_engine, std::ios::binary);
    if (!p) {
        std::cerr << "could not open plan output file" << std::endl;
        return;
    }
    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());

    // Release the memory
    modelStream->destroy();

    std::cout << "[INFO]: Successfully created TensorRT engine..." << std::endl;
    std::cout << "\n\tRun inference using `./mlp -d`" << std::endl;

}










/****************************************************************** Load the engine and run inference ******************************************************/


void doInference(IExecutionContext& context, float* input, float* output, int batchSize) {
    /**
     * Perform inference using the CUDA context
     *
     * @param context: context created by engine
     * @param input: input from the host
     * @param output: output to save on host
     * @param batchSize: batch size for TRT model
     */

     // Get engine from the context
    const ICudaEngine& engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex("data");
    const int outputIndex = engine.getBindingIndex("out");

    // Create GPU buffers on device -- allocate memory for input and output
    cudaMalloc(&buffers[inputIndex], batchSize * INPUT_SIZE * sizeof(float));
    cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float));

    // create CUDA stream for simultaneous CUDA operations
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // copy input from host (CPU) to device (GPU)  in stream
    cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_SIZE * sizeof(float), cudaMemcpyHostToDevice, stream);

    // execute inference using context provided by engine
    context.enqueue(batchSize, buffers, stream, nullptr); // <-- the actual inference call

    // copy output back from device (GPU) to host (CPU)
    cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,
        stream);

    // synchronize the stream to prevent issues (block CUDA and wait for CUDA operations to be completed)
    cudaStreamSynchronize(stream);

    // Release stream and buffers (memory)
    cudaStreamDestroy(stream);
    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
}









ICudaEngine* inite_engine(std::string file_engine = "C:\\Users\\Administrator\\Desktop\\code\\tensorrt-code\\mlp\\mlp.engine") {
    /*
    Read the engine file, deserialize it, and construct the engine; this is effectively the network initialization.
    */

    char* trtModelStream{ nullptr }; // buffer that will hold the serialized engine read from disk
    size_t size{ 0 };

    // read model from the engine file
    std::ifstream file(file_engine, std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    }

    // create a runtime (required for deserialization of model) with NVIDIA's logger
    IRuntime* runtime = createInferRuntime(gLogger); // create the runtime, needed to deserialize the engine
    assert(runtime != nullptr);
    // deserialize engine for using the char-stream
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
    assert(engine != nullptr);

    /*
    An engine can have multiple execution contexts, allowing one set of weights to be used for several inference tasks, e.g. processing images in parallel CUDA streams with one context per stream. Each context is created on the same GPU as the engine.
    */
    runtime->destroy();
    return engine;

};



auto infer(ICudaEngine* engine) {


    // create execution context -- required for inference executions
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);

    float out[1];  // array for output
    float data[1]; // array for input
    for (float& i : data)
        i = 12.0;   // put any value for input

    // time the execution
    auto start = std::chrono::system_clock::now();

    // do inference using the parameters
    doInference(*context, data, out, 1);

    // time the execution
    auto end = std::chrono::system_clock::now();
    std::cout << "\n[INFO]: Time taken by execution: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;


    // free the captured space
    context->destroy();
    engine->destroy();
    //runtime->destroy();

    std::cout << "\nInput:\t" << data[0];
    std::cout << "\nOutput:\t";
    for (float i : out) {
        std::cout << i;
    }
    std::cout << std::endl;

    return out[0]; // return the scalar result by value (returning the local array would leave a dangling pointer)

};






int main() {

    //int argc = 2;
    int args = 1;
    //int args = checkArgs(argc, argv);
    if (args == 1)
    {
        performSerialization();
    }
    else if (args == 2)
    {
        ICudaEngine* engine = inite_engine();
        auto out = infer(engine);
        std::cout << out << std::endl;
        //performInference();
    }
    return 0;
}

 

 

 

TensorRT download page: https://developer.nvidia.com/nvidia-tensorrt-8x-download

Reference article:

https://blog.csdn.net/just_sort/article/details/104772653

 

