Python Backend

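Triton provides a pipeline feature, but a Triton pipeline can only chain inputs to outputs. It is too simple and static: it does not support control flow such as loops and conditionals, and the data passed between models is inflexible because it can only be tensors. Since pipelines are this rigid, is there a way to support more flexible operations? The answer is to use the Python Backend or to develop your own C++ Backend.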

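The advantage of the Python Backend is fast development combined with the flexibility of the Python language. Take the face detection model MTCNN as an example. MTCNN consists of three models, each of which is a neural network. Apart from the forward passes of these three networks, the remaining operations are so free-form that it is almost impossible to assemble them into a single pipeline model that completes the whole flow in one request. With the Python Backend, we can integrate all of this complex logic in one place, so the client gets the final result with a single request.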

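Because NVIDIA does not provide documentation for the Python interfaces of the Python Backend, I have collected the relevant interfaces at the end of this article for easy lookup. Some of these Python interfaces are defined in Python files, and others are exported through pybind11.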

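Here is a relatively complex and complete example: a face recognition model. The client only needs to send one image, and the output is the same image annotated with face information. The whole flow consists of several steps: face detection, image cropping, feature extraction, feature matching, and so on. This article does not walk through that example. Instead it uses the simple add_sub example from NVIDIA's repository, mainly to show how to use the Python Backend.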

Address: https://github.com/zzk0/triton/tree/master/face/face_ensemble_python

Example

This simple example comes from NVIDIA's repository.

Directory structure:

(transformers) percent1@ubuntu:~/triton/triton/models/example_python$ tree
.
├── 1
│   ├── model.py         # script file for the model
│   └── __pycache__
├── client.py            # client script; it does not have to live here
└── config.pbtxt         # model configuration file

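Server Configuration

This example is called add_sub. It has two inputs and two outputs, and the outputs are the sum and the difference of the two inputs.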

name: "example_python"
backend: "python"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
input [
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]

instance_group [
  {
    kind: KIND_CPU
  }
]

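Server

model.py must provide three interfaces: initialize, execute, and finalize. initialize is called when a model instance is initialized, and finalize is called when a model instance is cleaned up. If there are n model instances, each of these two functions is called n times.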

import json
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    def initialize(self, args):
        self.model_config = model_config = json.loads(args['model_config'])
        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1")
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        self.output1_dtype = pb_utils.triton_string_to_numpy(output1_config['data_type'])

    def execute(self, requests):
        output0_dtype = self.output0_dtype
        output1_dtype = self.output1_dtype
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, 'INPUT0')
            in_1 = pb_utils.get_input_tensor_by_name(request, 'INPUT1')
            out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(),
                            in_0.as_numpy() - in_1.as_numpy())
            out_tensor_0 = pb_utils.Tensor('OUTPUT0', out_0.astype(output0_dtype))
            out_tensor_1 = pb_utils.Tensor('OUTPUT1', out_1.astype(output1_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0, out_tensor_1])
            responses.append(inference_response)
        return responses

    def finalize(self):
        print('Cleaning up...')
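To make the lifecycle concrete, here is a minimal sketch that runs without Triton. CountingModel and run_instances are hypothetical stand-ins, not part of the Python Backend, and the round-robin dispatch is only an assumption for illustration (Triton's scheduler decides which instance receives a batch). The point is that initialize and finalize run once per instance, while execute runs once per batch and returns one response per request.

```python
class CountingModel:
    """Stand-in with the same three methods as TritonPythonModel (no Triton needed)."""

    calls = []  # shared log of lifecycle events across all instances

    def __init__(self, instance_id):
        self.instance_id = instance_id

    def initialize(self, args):
        CountingModel.calls.append(("initialize", self.instance_id))

    def execute(self, requests):
        CountingModel.calls.append(("execute", self.instance_id))
        # One response per request, in the same order, as in the add_sub model.
        return [(a + b, a - b) for a, b in requests]

    def finalize(self):
        CountingModel.calls.append(("finalize", self.instance_id))


def run_instances(n, batches):
    """Hypothetical driver: load n instances, feed batches round-robin, unload."""
    instances = [CountingModel(i) for i in range(n)]
    for inst in instances:
        inst.initialize({"model_config": "{}"})
    results = []
    for i, batch in enumerate(batches):
        results.append(instances[i % n].execute(batch))
    for inst in instances:
        inst.finalize()
    return results


# Two instances, three batches of (INPUT0, INPUT1) pairs.
results = run_instances(2, [[(3, 1)], [(5, 2), (7, 7)], [(0, 4)]])
init_calls = [c for c in CountingModel.calls if c[0] == "initialize"]
exec_calls = [c for c in CountingModel.calls if c[0] == "execute"]
final_calls = [c for c in CountingModel.calls if c[0] == "finalize"]
print(len(init_calls), len(exec_calls), len(final_calls))  # 2 3 2
```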

Client

Next, write a script that calls the service.

import numpy as np
import tritonclient.http as httpclient


if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')

    inputs = []
    inputs.append(httpclient.InferInput('INPUT0', [4], "FP32"))
    inputs.append(httpclient.InferInput('INPUT1', [4], "FP32"))
    input_data0 = np.random.randn(4).astype(np.float32)
    input_data1 = np.random.randn(4).astype(np.float32)
    inputs[0].set_data_from_numpy(input_data0, binary_data=False)
    inputs[1].set_data_from_numpy(input_data1, binary_data=False)
    outputs = []
    outputs.append(httpclient.InferRequestedOutput('OUTPUT0', binary_data=False))
    outputs.append(httpclient.InferRequestedOutput('OUTPUT1', binary_data=False))

    results = triton_client.infer('example_python', inputs=inputs, outputs=outputs)
    output_data0 = results.as_numpy('OUTPUT0')
    output_data1 = results.as_numpy('OUTPUT1')

    print(input_data0)
    print(input_data1)
    print(output_data0)
    print(output_data1)

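We can verify the result. The first two lines are the inputs and the last two are the outputs: the first two lines add up to the third, and their difference is the fourth. With that, a simple Python backend is complete.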

[ 0.81060416 -0.18330468 -1.6702048  -0.28633776]
[ 0.06001309  0.801739   -0.9306069   0.18076313]
[ 0.8706173   0.6184343  -2.6008117  -0.10557464]
[ 0.75059104 -0.98504364 -0.73959786 -0.4671009 ]
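The check can also be done programmatically. The sketch below copies the four printed rows and compares them in plain Python, so it runs without NumPy or a Triton server. The helper all_close and the tolerance of 1e-6 are my own choices, picked to absorb float32 rounding in the printed values.

```python
import math

# Values copied from the printed output above.
input0 = [0.81060416, -0.18330468, -1.6702048, -0.28633776]
input1 = [0.06001309, 0.801739, -0.9306069, 0.18076313]
output0 = [0.8706173, 0.6184343, -2.6008117, -0.10557464]
output1 = [0.75059104, -0.98504364, -0.73959786, -0.4671009]


def all_close(expected, actual, tol=1e-6):
    """Element-wise comparison with an absolute tolerance."""
    return all(math.isclose(e, a, abs_tol=tol) for e, a in zip(expected, actual))


sums = [a + b for a, b in zip(input0, input1)]
diffs = [a - b for a, b in zip(input0, input1)]
sum_ok = all_close(sums, output0)    # OUTPUT0 should equal INPUT0 + INPUT1
diff_ok = all_close(diffs, output1)  # OUTPUT1 should equal INPUT0 - INPUT1
print(sum_ok, diff_ok)
```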

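Exporting the Python Environment

In practice we need all kinds of packages that are missing from the default Python environment, so we have to pack our own environment and ship it with the model. The Python backend uses Python 3.8 by default. If you need a different Python version, you have to build the Python Backend yourself. If you do not need to change the version, exporting the environment is enough.

Taking opencv as an example, we create a new conda environment from scratch: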

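export PYTHONNOUSERSITE=True
conda create -n triton python=3.8
conda activate triton      # switch to the new environment before installing packages
pip install opencv-python  # could not be installed with conda, so use pip
conda install conda-pack
conda-pack  # run the packer; the archive is written to the current directory
apt update
apt install ffmpeg libsm6 libxext6 -y --fix-missing  # install the opencv dependencies; -y answers yes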

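Calling Other Models

Inside a Python backend we can call other models, which gives us pipeline-like behavior and avoids excessive round trips to the client.

The procedure is as follows:

Build an InferenceRequest object, setting the model name, the names of the required outputs, and the input tensors.
Call the exec method to run the forward pass.
Fetch the output with get_output_tensor_by_name.

The following example calls facenet: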

inference_request = pb_utils.InferenceRequest(
    model_name='facenet',
    requested_output_names=[self.Facenet_outputs[0]],
    inputs=[pb_utils.Tensor(self.Facenet_inputs[0], face_img.astype(np.float32))]
)
inference_response = inference_request.exec()
pre = utils.pb_tensor_to_numpy(pb_utils.get_output_tensor_by_name(inference_response, self.Facenet_outputs[0]))
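The snippet above assumes the call succeeded. The bindings listed in the appendix expose has_error() and error().message() on the response, so a small guard can run before reading outputs. The sketch below uses hypothetical FakeError and FakeResponse stubs so that it runs outside Triton, and unwrap_response is my own helper name. It raises RuntimeError here, whereas inside a real model one would more likely raise pb_utils.TritonModelException.

```python
class FakeError:
    """Stub mimicking TritonError: only message() is needed here."""

    def __init__(self, message):
        self._message = message

    def message(self):
        return self._message


class FakeResponse:
    """Stub mimicking InferenceResponse's has_error() and error() methods."""

    def __init__(self, error=None):
        self._error = error

    def has_error(self):
        return self._error is not None

    def error(self):
        return self._error


def unwrap_response(response, model_name):
    """Return the response unchanged, or raise if the called model reported an error."""
    if response.has_error():
        # Inside Triton, pb_utils.TritonModelException would be the natural choice.
        raise RuntimeError(f"call to {model_name} failed: {response.error().message()}")
    return response


ok = FakeResponse()
bad = FakeResponse(error=FakeError("model not ready"))

same = unwrap_response(ok, "facenet")
try:
    unwrap_response(bad, "facenet")
    failure = None
except RuntimeError as exc:
    failure = str(exc)
print(failure)
```

In the facenet example, the guard would sit between the exec() call and get_output_tensor_by_name.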

Problems Encountered and Solutions

Tensor is stored in GPU and cannot be converted to NumPy.

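The Tensor is stored on the GPU and cannot be converted to NumPy, and Triton does not provide another interface for fetching the data. There is currently no good solution, so I use a crude workaround. It costs some performance, though not much. A proper fix has to wait until Triton adds the corresponding interface. I first convert the Tensor into a PyTorch tensor through DLPack, move it to the CPU, and then call its numpy method.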

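from torch.utils.dlpack import from_dlpack  # requires PyTorch

def pb_tensor_to_numpy(pb_tensor):
    # CPU tensors convert directly; GPU tensors go through DLPack and PyTorch.
    if pb_tensor.is_cpu():
        return pb_tensor.as_numpy()
    else:
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        return pytorch_tensor.cpu().numpy()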

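Appendix: Python Backend Interfaces

Because the Python Backend repository does not document its interfaces, this section lists them so we can see which methods are available.

utils

The interfaces of triton_python_backend_utils are defined in this file.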

serialize_byte_tensor(input_tensor)
deserialize_bytes_tensor(encoded_tensor)
get_input_tensor_by_name(inference_request, name)
get_output_tensor_by_name(inference_response, name)
get_input_config_by_name(model_config, name)
get_output_config_by_name(model_config, name)
triton_to_numpy_type(data_type)
numpy_to_triton_type(data_type)
triton_string_to_numpy(triton_type_string)

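Classes Exported via pybind11

The Python Backend uses pybind11 to export some of its Python classes. The definitions of Tensor, InferenceRequest, InferenceResponse, and TritonError can be found in this file.

Tensor

The most important method on Tensor is as_numpy, which converts the Tensor into a NumPy array.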

  py::class_<PbTensor, std::shared_ptr<PbTensor>>(module, "Tensor")
      .def(py::init(&PbTensor::FromNumpy))
      .def("name", &PbTensor::Name)
      .def("as_numpy", &PbTensor::AsNumpy)
      .def("triton_dtype", &PbTensor::TritonDtype)
      .def("to_dlpack", &PbTensor::ToDLPack)
      .def("is_cpu", &PbTensor::IsCPU)
      .def("from_dlpack", &PbTensor::FromDLPack);

InferenceRequest

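We can use InferenceRequest to build a call to another model. As the pybind11 definition shows, we set model_name, model_version, requested_output_names, and inputs to choose the model, its version, the requested outputs, and the inputs, and then call the exec method to run it.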

  py::class_<InferRequest, std::shared_ptr<InferRequest>>(
      module, "InferenceRequest")
      .def(
          py::init<
              const std::string&, uint64_t,
              const std::vector<std::shared_ptr<PbTensor>>&,
              const std::vector<std::string>&, const std::string&,
              const int64_t>(),
          py::arg("request_id") = "", py::arg("correlation_id") = 0,
          py::arg("inputs"), py::arg("requested_output_names"),
          py::arg("model_name"), py::arg("model_version") = -1)
      .def(
          "inputs", &InferRequest::Inputs,
          py::return_value_policy::reference_internal)
      .def("request_id", &InferRequest::RequestId)
      .def("correlation_id", &InferRequest::CorrelationId)
      .def("exec", &InferRequest::Exec)
      .def(
          "async_exec",
          [](std::shared_ptr<InferRequest>& infer_request) {
            py::object loop =
                py::module_::import("asyncio").attr("get_running_loop")();
            py::cpp_function callback = [infer_request]() {
              auto response = infer_request->Exec();
              return response;
            };
            py::object future =
                loop.attr("run_in_executor")(py::none(), callback);
            return future;
          })
      .def(
          "requested_output_names", &InferRequest::RequestedOutputNames,
          py::return_value_policy::reference_internal);
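The async_exec binding above fetches the running asyncio loop and passes the blocking Exec call to loop.run_in_executor, returning a future. A pure-Python analogue of that pattern is sketched below. blocking_exec is a hypothetical stand-in for the blocking call and is not part of the Python Backend. The sketch shows that two such calls can be awaited together.

```python
import asyncio
import time


def blocking_exec(name, delay=0.05):
    """Hypothetical stand-in for a blocking InferRequest::Exec call."""
    time.sleep(delay)
    return f"response from {name}"


def async_exec(name):
    """Like the binding: run the blocking call on the loop's default executor."""
    loop = asyncio.get_running_loop()  # raises if no event loop is running
    return loop.run_in_executor(None, blocking_exec, name)


async def main():
    # Both calls are in flight at once; gather keeps the argument order.
    return await asyncio.gather(async_exec("model_a"), async_exec("model_b"))


responses = asyncio.run(main())
print(responses)  # ['response from model_a', 'response from model_b']
```

Because the binding calls get_running_loop, async_exec only works from inside a coroutine that is already running on an event loop.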

InferenceResponse

  py::class_<InferResponse>(module, "InferenceResponse")
      .def(
          py::init<
              const std::vector<std::shared_ptr<PbTensor>>&,
              std::shared_ptr<PbError>>(),
          py::arg("output_tensors"), py::arg("error") = nullptr)
      .def(
          "output_tensors", &InferResponse::OutputTensors,
          py::return_value_policy::reference)
      .def("has_error", &InferResponse::HasError)
      .def("error", &InferResponse::Error);

TritonError

  py::class_<PbError, std::shared_ptr<PbError>>(module, "TritonError")
      .def(py::init<std::string>())
      .def("message", &PbError::Message);

TritonModelException

  py::register_exception<PythonBackendException>(
      module, "TritonModelException");
