TVM Code Flow Analysis

This article is a repost (2021-11-13 18:00).

TVM - Code Generation Flow

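This section walks through TVM's code generation flow, i.e., what happens after relay.build or tvm.build is called, by digging into the TVM source code. (As before, TVM v0.6 is used.)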

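First, the difference between the two build functions: tvm.build targets a single operator (see the Tensor Expression section), whereas relay.build compiles an entire model (see the GCN optimization article). Relay itself eventually calls tvm::build for code generation.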

relay.build

Model compilation is typically done with the following two statements.

# Build with Relay

with relay.build_config(opt_level=0):

    graph, lib, params = relay.build(func, target, params=params)

Tracing the Details

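Tracing relay.build leads into python/tvm/relay/build_module.py. (relay/__init__.py imports the build function directly into the relay namespace, which is why the build_module level does not appear in the call.) The build function there is a module-level helper function in build_module.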

def build(mod, target=None, target_host=None, params=None):

    # do something

    if isinstance(autotvm.DispatchContext.current, autotvm.FallbackContext):

        tophub_context = autotvm.tophub.context(list(target.values()))

    else:

        tophub_context = autotvm.util.EmptyContext()

    with tophub_context:

        bld_mod = BuildModule()

        graph_json, mod, params = bld_mod.build(func, target, target_host, params)

    return graph_json, mod, params

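The function first checks whether AutoTVM has pre-tuned parameter records and constructs tophub_context accordingly. Inside that context it creates a BuildModule and then jumps to BuildModule.build. The contents of BuildModule, starting with __init__, are as follows.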

class BuildModule(object):

    """Build a Relay function to run on TVM graph runtime. This class is used

    to expose the `RelayBuildModule` APIs implemented in C++.

"""

    def __init__(self):

        self.mod = _build_module._BuildModule()

        self._get_graph_json = self.mod["get_graph_json"]

        self._get_module = self.mod["get_module"]

        self._build = self.mod["build"]

        self._optimize = self.mod["optimize"]

        self._set_params_func = self.mod["set_params"]

        self._get_params_func = self.mod["get_params"]

    def build(self, func, target=None, target_host=None, params=None):

        target = _update_target(target)

        # Setup the params.

        if params:

            self._set_params(params)

        # Build the function

        self._build(func, target, target_host)

        # Get artifacts

        graph_json = self.get_json()

        mod = self.get_module()

        params = self.get_params()

        return graph_json, mod, params

_build_module._BuildModule() is in turn bound to a C++ function through the FFI in python/tvm/relay/_build_module.py (the call goes through tvm._ffi._ctypes.function.Function.__call__).

from tvm._ffi.function import _init_api

_init_api("relay.build_module", __name__)

The corresponding C++ function is in src/relay/backend/build_module.cc:

runtime::Module RelayBuildCreate() {

  auto exec = make_object<RelayBuildModule>();

  return runtime::Module(exec);

}

TVM_REGISTER_GLOBAL("relay.build_module._BuildModule")

.set_body([](TVMArgs args, TVMRetValue* rv) {

  *rv = RelayBuildCreate();

});

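In other words, a RelayBuildModule is registered for the Python side to call. Since the build function is the one mainly used, look for the corresponding function in RelayBuildModule. Here TVM adds another layer of wrapping with PackedFunc, as shown below.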

PackedFunc GetFunction(const std::string& name,

                         const ObjectPtr<Object>& sptr_to_self) final {

      // ...

      if (name == "build") {

        return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {

          CHECK_EQ(args.num_args, 3);

          this->Build(args[0], args[1], args[2]);

        });

      }

      // ...
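The name-based lookup in GetFunction is what makes self.mod["build"] work on the Python side. The following is a minimal pure-Python sketch of that pattern, not TVM's implementation; ToyBuildModule and its methods are hypothetical names introduced only for illustration.

```python
class ToyBuildModule(object):
    """Hypothetical stand-in for RelayBuildModule: functions are fetched by name."""

    def __init__(self):
        self._built = None

    def _build(self, func, target, target_host):
        # Placeholder for the real work; it only records its arguments.
        self._built = (func, target, target_host)

    def get_function(self, name):
        # Return a closure bound to this object, similar to a PackedFunc
        # whose lambda captures `this`.
        if name == "build":
            def packed(*args):
                if len(args) != 3:
                    raise TypeError("build expects 3 arguments, got %d" % len(args))
                self._build(*args)
            return packed
        if name == "get_graph_json":
            return lambda: "graph for %s" % (self._built[0],)
        raise KeyError(name)

    # Lets callers write mod["build"], as BuildModule.__init__ does.
    __getitem__ = get_function


mod = ToyBuildModule()
build = mod["build"]
build("main_func", "llvm", None)
print(mod["get_graph_json"]())
```

The caller never touches the class's methods directly; it only holds callables obtained by string name, which is what lets the same interface cross the Python/C++ boundary.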

So the call goes to this->Build, which in turn leads to BuildRelay.

  void BuildRelay(

      Function func,

      const std::unordered_map<std::string, tvm::runtime::NDArray>& params) {

    // Optimize input Relay Function and returns Relay Module

    relay::Module relay_module = Optimize(func, targets_, params);

    // Get the updated function.

    func = relay_module->Lookup("main");

    // Generate code for the updated function.

    graph_codegen_ = std::unique_ptr<GraphCodegen>(new GraphCodegen());

    graph_codegen_->Init(nullptr, targets_);

    graph_codegen_->Codegen(func);

    ret_.graph_json = graph_codegen_->GetJSON();

    ret_.params = graph_codegen_->GetParams();

    auto lowered_funcs = graph_codegen_->GetLoweredFunc();

    if (lowered_funcs.size() == 0) {

      LOG(WARNING) << "no lowered funcs exist in the compiled module";

    } else {

      ret_.mod = tvm::build(

        lowered_funcs,

        target_host_,

        BuildConfig::Current());

    }

  }

After all these jumps we finally reach the core of build. TVM performs the following steps in order.

Optimization
Computation graph generation
Backend code generation

Optimization

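First comes Optimize. The optimizations here are mainly device-independent, graph-level optimizations on tensor operations. (These passes are all implemented in C++; in the earlier NNVM they were apparently still invoked from Python.)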

  relay::Module Optimize(

      Function func,

      const TargetsMap& targets,

      const std::unordered_map<std::string, runtime::NDArray>& params) {

    // BindParamsByName(func, params)

    // Perform Module->Module optimizations.

    relay::Module relay_module = relay::ModuleNode::FromExpr(func);

    Array<Pass> pass_seqs;

    // Run all dialect legalization passes.

    // ...

    pass_seqs.push_back(transform::SimplifyInference());

    // ... (definition of fskip elided)

    pass_seqs.push_back(transform::EliminateCommonSubexpr(fskip));

    pass_seqs.push_back(transform::CombineParallelConv2D(3));

    pass_seqs.push_back(transform::CombineParallelDense(3));

    pass_seqs.push_back(transform::FoldConstant());

    pass_seqs.push_back(transform::FoldScaleAxis());

    pass_seqs.push_back(transform::CanonicalizeCast());

    pass_seqs.push_back(transform::CanonicalizeOps());

    // ...AlterOpLayout

    pass_seqs.push_back(transform::FoldConstant());

    // Create a sequential pass and perform optimizations.

    transform::Pass seq = transform::Sequential(pass_seqs);

    // ... judge & do

    relay_module = seq(relay_module);

    // Handle heterogeneous compilation.

    transform::PassContext pass_ctx = PassContext::Current();

    if (targets_.size() > 1) {

      relay_module =

          RunDeviceAnnotationPass(relay_module, pass_ctx->fallback_device);

    }

    // Fuse the operations if it is needed.

    relay_module = transform::FuseOps()(relay_module);

    relay_module = transform::InferType()(relay_module);

    CHECK(relay_module.defined());

    return relay_module;

  }
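To give a feel for what a graph-level pass such as FoldConstant does, here is a toy sketch on a tiny expression tree. It is an illustration under simplifying assumptions, not Relay's implementation: the tuple-based IR and the fold_constant function are hypothetical.

```python
# Toy expression IR: a number, a variable name (str), or a tuple (op, lhs, rhs).
def fold_constant(expr):
    """Recursively replace operator nodes whose operands are both numbers."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a variable: nothing to fold
    op, lhs, rhs = expr
    lhs, rhs = fold_constant(lhs), fold_constant(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        if op == "add":
            return lhs + rhs
        if op == "mul":
            return lhs * rhs
    # At least one operand is not constant, so the node is kept.
    return (op, lhs, rhs)


# x * (2 + 3) becomes x * 5; the variable x is left untouched.
expr = ("mul", "x", ("add", 2, 3))
print(fold_constant(expr))
```

The rewrite needs no knowledge of the target device, which is why passes of this kind can run before any backend is chosen.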

Computation Graph Generation

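This step corresponds to the GraphCodegen class. In the same way (again through the FFI), it calls relay.build_module._GraphRuntimeCodegen from src/relay/backend/build_module.cc, which leads to src/relay/backend/graph_runtime_codegen.cc. The function is registered there with TVM_REGISTER_GLOBAL; that is, a GraphRuntimeCodegenModule is used to create the corresponding Object.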

The function actually invoked by graph_codegen_->Codegen is therefore a PackedFunc, defined in GraphRuntimeCodegen.Codegen, which traverses the relay::Function func and generates the computation graph.

Backend Code Generation

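Once Relay has the lowered functions, the last step is to hand them to tvm::build for code generation. This leads to the build function in src/codegen/build_module.cc (note that there are several overloads) and then to the core build. This build function supports heterogeneous compilation: the functions only need to be grouped by target device in inputs.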

// Build for heterogeneous execution.

runtime::Module build(const Map<Target, Array<LoweredFunc>>& inputs,

                      const Target& target_host,

                      const BuildConfig& config) {

  Array<LoweredFunc> fhost_all;

  std::vector<runtime::Module> device_modules;

  Target target_host_val = target_host;

  if (!target_host.defined()) {

    for (const auto& it : inputs) {

      if (it.first->device_type == kDLCPU) {

        target_host_val = it.first;

        break;

      }

    }

  }

  if (!target_host_val.defined()) {

    target_host_val = DefaultTargetHost(target_host_val);

  }

  for (const auto& it : inputs) {

    auto host_dev_funcs =

        split_dev_host_funcs(it.second, it.first, target_host_val, config);

    auto& fhost = host_dev_funcs[0];

    auto& fdevice = host_dev_funcs[1];

    // Get the module for a certain target.

    runtime::Module mdev = DeviceBuild(fdevice, it.first);

    for (const auto& it : fhost) {

      fhost_all.push_back(it);

    }

    device_modules.push_back(mdev);

  }

  runtime::Module mhost = codegen::Build(fhost_all, target_host_val->str());

  // Import all modules

  for (const auto& it : device_modules) {

    if (it.operator->()) {

      mhost.Import(it);

    }

  }

  return mhost;

}
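The control flow of the heterogeneous build can be sketched in pure Python as follows. This is a simplified model, not TVM code: Target, Func, pick_host_target and build are hypothetical names, and whole functions are tagged as host or device here, whereas the listing's split_dev_host_funcs does the splitting itself.

```python
from collections import namedtuple

# Hypothetical stand-ins for TVM's Target and LoweredFunc.
Target = namedtuple("Target", ["name", "device_type"])
Func = namedtuple("Func", ["name", "runs_on_host"])


def pick_host_target(inputs, target_host=None, default_host=Target("llvm", "cpu")):
    """An explicit target_host wins; otherwise the first CPU target; otherwise a default."""
    if target_host is not None:
        return target_host
    for target in inputs:
        if target.device_type == "cpu":
            return target
    return default_host


def build(inputs, target_host=None):
    host = pick_host_target(inputs, target_host)
    host_funcs = []       # collected across all targets, like fhost_all
    device_modules = []   # one entry per target that has device code
    for target, funcs in inputs.items():
        host_funcs.extend(f.name for f in funcs if f.runs_on_host)
        device_funcs = [f.name for f in funcs if not f.runs_on_host]
        if device_funcs:
            device_modules.append((target.name, device_funcs))
    # The host module "imports" every device module.
    return {"host": host.name, "host_funcs": host_funcs, "imported": device_modules}


inputs = {
    Target("cuda", "gpu"): [Func("conv_kernel", False), Func("conv_launch", True)],
    Target("llvm", "cpu"): [Func("dense", True)],
}
result = build(inputs)
print(result["host"], result["host_funcs"], result["imported"])
```

All host-side functions end up in a single host module regardless of which target they came from, and that module carries the device modules with it.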

The most important call here is mhost = codegen::Build, which finally enters the code generation module (src/codegen/codegen.cc).

runtime::Module Build(const Array<LoweredFunc>& funcs,

                      const std::string& target) {

  // do something

  std::string build_f_name = "codegen.build_" + mode;

  // the build function.

  const PackedFunc* bf = runtime::Registry::Get(build_f_name);

  runtime::Module m = transformed_funcs.empty() ?

                      (*bf)(funcs, target) :

                      (*bf)(transformed_funcs, target);

  return m;

}
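The lookup of "codegen.build_" + mode amounts to dispatching on a string key. Below is a minimal Python sketch of that dispatch. It assumes that mode is the first whitespace-separated token of the target string (the listing elides how mode is computed), and register_builder, build_llvm and build_c are hypothetical names.

```python
_BUILDERS = {}  # toy replacement for the global function registry


def register_builder(mode):
    """Register a backend builder under the name codegen.build_<mode>."""
    def decorator(func):
        _BUILDERS["codegen.build_" + mode] = func
        return func
    return decorator


@register_builder("llvm")
def build_llvm(funcs, target):
    return "llvm module with %d function(s) for target %r" % (len(funcs), target)


@register_builder("c")
def build_c(funcs, target):
    return "c module with %d function(s) for target %r" % (len(funcs), target)


def build(funcs, target):
    # Assumption: the backend name is the first token of the target string.
    mode = target.split(" ", 1)[0]
    name = "codegen.build_" + mode
    builder = _BUILDERS.get(name)
    if builder is None:
        raise ValueError("%s is not registered" % name)
    return builder(funcs, target)


print(build(["myadd"], "llvm -mcpu=core-avx2"))
```

Adding a backend in this scheme only requires registering one more function; build itself does not change.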

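Taking LLVM IR generation as an example, codegen.build_llvm is registered in src/codegen/llvm/llvm_module.cc and calls LLVMModuleNode->Init in the same file. From there control passes to the CodeGenLLVM class in src/codegen/llvm/codegen_llvm.cc, which performs the code generation.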

tvm.build

Compiling an operator with tvm.build is done as follows; the example comes from the Tensor Expression section.

s = tvm.create_schedule(C.op)

tgt = "llvm" # "cuda"

fadd = tvm.build(s,[A,B,C],target=tgt,name="myadd")

Calling tvm.build first leads to python/tvm/build_module.py, where the build function performs two main steps:

Lowering the high-level code
Backend code generation

Code Transformation

Lowering the high-level code corresponds to

flist = lower(inputs,args,name=name,binds=binds)

The lower function is also in python/tvm/build_module.py. It is analogous to Optimize in relay.build, but performs operator-level optimizations, mainly loop transformations.

def lower(sch,

          args,

          name="default_function",

          binds=None,

          simple_mode=False):

    # initialization

    # Phase 0

    if isinstance(sch, schedule.Schedule):

        stmt = form_body(sch)

    for f in lower_phase0:

        stmt = f(stmt)

    compact = ir_pass.VerifyCompactBuffer(stmt)

    binds, arg_list = get_binds(args, compact, binds)

    # Phase 1

    stmt = ir_pass.RewriteForTensorCore(stmt, sch, binds)

    stmt = ir_pass.StorageFlatten(stmt, binds, 64, cfg.instrument_bound_checkers)

    stmt = ir_pass.CanonicalSimplify(stmt)

    for f in lower_phase1:

        stmt = f(stmt)

    # Phase 2

    if not simple_mode:

        stmt = ir_pass.LoopPartition(stmt, cfg.partition_const_loop)

    if cfg.disable_vectorize:

        stmt = ir_pass.SkipVectorize(stmt)

    else:

        stmt = ir_pass.VectorizeLoop(stmt)

    stmt = ir_pass.InjectVirtualThread(stmt)

    stmt = ir_pass.InjectDoubleBuffer(stmt, cfg.double_buffer_split_loop)

    stmt = ir_pass.StorageRewrite(stmt)

    stmt = ir_pass.UnrollLoop(

        stmt,

        cfg.auto_unroll_max_step,

        cfg.auto_unroll_max_depth,

        cfg.auto_unroll_max_extent,

        cfg.unroll_explicit)

    for f in lower_phase2:

        stmt = f(stmt)

    # Phase 3

    stmt = ir_pass.Simplify(stmt)

    stmt = ir_pass.RemoveNoOp(stmt)

    if not cfg.disable_select_rewriting:

        stmt = ir_pass.RewriteUnsafeSelect(stmt)

    for f in lower_phase3:

        stmt = f(stmt)

    # Instrument BoundCheckers

    if cfg.instrument_bound_checkers:

        stmt = ir_pass.InstrumentBoundCheckers(stmt)

    if simple_mode:

        return stmt

    return ir_pass.MakeAPI(stmt, name, arg_list, 0, cfg.restricted_func)

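The optimization passes are exposed in src/api/api_pass.cc, where they are registered under tvm.ir_pass. (Because the C++ functions already live in the tvm namespace, search for ir_pass directly to find the corresponding API.)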

Code Generation

After lowering comes backend code generation, which corresponds to the following line in the build function:

mhost = codegen.build_module(fhost_all, str(target_host))

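By the same mechanism, this leads to tvm/codegen.py, which initializes the tvm.codegen API codegen._Build and calls through the FFI into src/api/api_codegen.cc, and finally into codegen::Build in src/codegen/codegen.cc. From there on, backend code generation is the same as for relay.build.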

TVM - Tensor Expression

This section uses vector addition as an example to record the most basic use of TVM's Tensor Expression, together with a simple compile-and-run flow.

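The code below implements a simple vector addition. It is based on the official Tensor Expression tutorial and runs under TVM v0.6 (note that the module layout differs in v0.7dev).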

import tvm

import numpy as np

# Tensor Expression

# args: (shape, label)

A = tvm.placeholder((10,), name='A')

B = tvm.placeholder((10,), name='B')

# args: (shape, function, label)

# function represented in lambda expression (element-wise)

#     lambda axis1, axis2, ... : f(axis1, axis2, ...)

C = tvm.compute((10,), lambda i: A[i] + B[i], name="C")

# generate schedule

s = tvm.create_schedule(C.op)

# print low level codes

print(tvm.lower(s,[A,B,C],simple_mode=True))

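Here placeholder represents a tensor of a given shape. The generated code will require the user to pass in two tensors; for C++ code, that means two float* arguments. Note that no computation actually takes place during this process: it only defines how the computation is to be carried out.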

The printed low-level code is shown below. It is easy to follow: i loops from 0 to 10, and each iteration computes C[i].

produce C {

  for (i, 0, 10) {

    C[i] = (A[i] + B[i])

  }

}
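The point that placeholder and compute only describe a computation, without running it, can be illustrated with a toy model in pure Python. This is a sketch and not how TVM is implemented; Expr, Placeholder and compute here are hypothetical classes that merely imitate the interface used above.

```python
class Expr(object):
    """A symbolic expression that remembers its own textual form."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Adding two expressions builds a bigger expression; nothing is computed.
        return Expr("(%s + %s)" % (self.text, other.text))


class Placeholder(object):
    """A named tensor of a given shape that holds no data."""

    def __init__(self, shape, name):
        self.shape = shape
        self.name = name

    def __getitem__(self, index):
        # Indexing returns a symbolic load instead of reading memory.
        return Expr("%s[%s]" % (self.name, index))


def compute(shape, fcompute, name):
    # Call the lambda once with a symbolic index to capture the loop body.
    body = fcompute("i")
    return "for (i, 0, %d) { %s[i] = %s }" % (shape[0], name, body.text)


A = Placeholder((10,), "A")
B = Placeholder((10,), "B")
print(compute((10,), lambda i: A[i] + B[i], "C"))
```

The lambda runs exactly once, with a symbol rather than a number, so what comes out is a description of the loop and not a result.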

Commonly used loop-optimization primitives are listed in the TVM schedule API documentation. Loop splitting with split is tried here.

split(parent[, factor, nparts])

Split the stage either by factor providing outer scope, or both. Returns the outer and inner iteration variables.

bx, tx = s[C].split(C.op.axis[0],factor=2)

print(tvm.lower(s,[A,B,C],simple_mode=True))

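Because schedule operations transform the schedule in place, the lowered code can be printed directly. It has indeed changed: the original loop has become a 5*2 loop nest.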

produce C {

  for (i.outer, 0, 5) {

    for (i.inner, 0, 2) {

      C[((i.outer*2) + i.inner)] = (A[((i.outer*2) + i.inner)] + B[((i.outer*2) + i.inner)])

    }

  }

}
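That the split loop nest visits exactly the same indices as the original loop can be checked with a few lines of plain Python. This sketch assumes the factor divides the extent evenly, as in the 10 = 5 * 2 case above; split_order and vector_add_split are hypothetical helper names.

```python
def split_order(n, factor):
    """Indices visited by the split loop nest, in execution order."""
    order = []
    for outer in range(n // factor):
        for inner in range(factor):
            order.append(outer * factor + inner)
    return order


def vector_add_split(a, b, factor):
    """Vector addition written with the split loop structure."""
    c = [0] * len(a)
    for outer in range(len(a) // factor):
        for inner in range(factor):
            i = outer * factor + inner
            c[i] = a[i] + b[i]
    return c


a = list(range(10))
b = [10 * x for x in range(10)]
print(split_order(10, 2) == list(range(10)))
print(vector_add_split(a, b, 2) == [x + y for x, y in zip(a, b)])
```

Both checks print True: the split only restructures the iteration space and leaves the computed values unchanged.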

Of course, this schedule transformation brings no benefit by itself; it only illustrates how Tensor Expression is used.

build can then be called to generate the target code; both target and target_host can be set.

tgt = "c" # "llvm", "cuda"

fadd = tvm.build(s,[A,B,C],target=tgt,name="myadd")

A runtime context can then be created to run and test the function.

n = 10

ctx = tvm.context(tgt,0)

a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)

b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), ctx)

c = tvm.nd.array(np.zeros(n,dtype=C.dtype), ctx)

fadd(a,b,c) # run

# test

tvm.testing.assert_allclose(c.asnumpy(),a.asnumpy() + b.asnumpy())

print(fadd.get_source())

The generated C code is as follows:

for (int32_t i_outer = 0; i_outer < 5; ++i_outer) {

  for (int32_t i_inner = 0; i_inner < 2; ++i_inner) {

    C[((i_outer * 2) + i_inner)] = (A[((i_outer * 2) + i_inner)] + B[((i_outer * 2) + i_inner)]);

  }

}

The complete generated code (not reproduced here) can be written to myadd.c with fadd.save("myadd.c").

TVM - Relay IR Pass

This section describes how Relay IR passes are constructed.

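The core of the Relay IR passes is still implemented in C++, but Python interfaces are provided so that upper layers can call them directly to transform and optimize the computation graph.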

The pass manager is in include/tvm/relay/transform.h, which contains the declarations of all passes. Its goals are to:

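manage and schedule the different optimization passes
collect the required analysis information and keep it up to date
reduce the effort needed for programmers to implement new passes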

The Python interface functions are declared in python/tvm/relay/transform.py; python/tvm/relay/_transform.py calls the C++ functions through the FFI, under the namespace relay._transform.

The C++ implementation is split into two parts:

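High-level IR graph transformations: the source is in src/relay/pass, and they are applied together in relay::Module Optimize in src/relay/backend/build_module.cc
Graph transformations for backend code: the source is in src/relay/backend/vm, and they are applied together in the lower function in python/tvm/build_module.py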

Pass Construction

PassInfo

class PassInfoNode : public RelayNode {

  std::string name;

  int opt_level;

  std::vector<std::string> required;

};

PassContext

class PassContextNode : public RelayNode {

 public:

  ErrorReporter err_reporter;

  int opt_level{2};

  int fallback_device{static_cast<int>(kDLCPU)};

  tvm::Array<tvm::Expr> required_pass;

  tvm::Array<tvm::Expr> disabled_pass;

};

class PassContext : public NodeRef {

 public:

  TVM_DLL static PassContext Create();

  TVM_DLL static PassContext Current();

  /* Other fields are omitted. */

 private:

  // The entry of a pass context scope.

  TVM_DLL void EnterWithScope();

  // The exit of a pass context scope.

  TVM_DLL void ExitWithScope();

  // Classes to get the Python `with` like syntax.

  friend class tvm::With<PassContext>;

};

struct RelayPassContextThreadLocalEntry {

  /*! \brief The default pass context. */

  PassContext default_context;

  /*! \brief The current pass context. */

  std::stack<PassContext> context_stack;

  RelayPassContextThreadLocalEntry() {

    default_context = PassContext(make_node<PassContextNode>());

  }

};

/*! \brief The thread-local store to hold the pass context. */

typedef dmlc::ThreadLocalStore<RelayPassContextThreadLocalEntry>

    RelayPassContextThreadLocalStore;
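The combination of a thread-local default context, a context stack, and scoped entry and exit can be sketched in Python as follows. This is a minimal model of the idea rather than TVM's Python binding; the PassContext class below and its fields are simplified stand-ins.

```python
import threading


class PassContext(object):
    """Toy pass context with a per-thread default and a per-thread stack."""

    _local = threading.local()

    def __init__(self, opt_level=2, required_pass=(), disabled_pass=()):
        self.opt_level = opt_level
        self.required_pass = list(required_pass)
        self.disabled_pass = list(disabled_pass)

    @classmethod
    def _entry(cls):
        # Lazily create this thread's entry: a default context plus a stack.
        if not hasattr(cls._local, "stack"):
            cls._local.stack = []
            cls._local.default = cls()
        return cls._local

    @classmethod
    def current(cls):
        entry = cls._entry()
        return entry.stack[-1] if entry.stack else entry.default

    def __enter__(self):
        # Plays the role of EnterWithScope: push onto the stack.
        self._entry().stack.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Plays the role of ExitWithScope: pop from the stack.
        self._entry().stack.pop()
        return False


print(PassContext.current().opt_level)
with PassContext(opt_level=3):
    print(PassContext.current().opt_level)
print(PassContext.current().opt_level)
```

This prints 2, 3, 2: inside the with block the pushed context is current, and the default is restored on exit. Because the stack lives in thread-local storage, contexts entered in one thread are invisible to others.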

Pass Constructs: the base class

class PassNode : RelayNode {

  virtual PassInfo Info() const = 0;

  virtual Module operator()(const IRModule& mod,

                            const PassContext& pass_ctx) const = 0;

};

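In other words, a pass always operates on an IRModule under a specific context, and every pass is designed as a Module-to-Module mapping. The full pass definitions are in src/relay/ir/transform.cc and src/ir/transform.cc.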

Module-Level

class ModulePassNode : PassNode {

  PassInfo pass_info;

  runtime::TypedPackedFunc<Module(Module, PassContext)> pass_func;

  Module operator()(const Module& mod, const PassContext& pass_ctx) const final;

  // Other members/methods are omitted

};

Function-Level

class FunctionPassNode : PassNode {

  PassInfo pass_info;

  runtime::TypedPackedFunc<Function(Function, Module, PassContext)> pass_func;

  Module operator()(const Module& mod, const PassContext& pass_ctx) const final;

  bool SkipFunction(const Function& func) const;

  // Other members/methods are omitted...

};

Sequential

Like nn.Sequential in PyTorch, it runs multiple passes in order.

class SequentialPassNode : PassNode {

  PassInfo pass_info;

  // Passes need to be executed.

  Array<Pass> passes;

  bool PassEnabled(const PassInfo& info) const;

  Module operator()(const Module& mod, const PassContext& pass_ctx) const final;

};
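How a sequential pass decides which of its passes to run can be sketched as below. The enabling rule (disabled passes are skipped, required passes always run, otherwise the context's opt_level must reach the pass's opt_level) is the reading of PassEnabled assumed here. The Pass and Sequential classes, the dict-based context, and the opt_level values given to the example passes are illustrative only.

```python
class Pass(object):
    """A named Module -> Module transformation with a minimum opt_level."""

    def __init__(self, name, opt_level, func):
        self.name = name
        self.opt_level = opt_level
        self.func = func

    def __call__(self, mod, ctx):
        return self.func(mod)


class Sequential(object):
    """Runs the enabled passes one after another."""

    def __init__(self, passes):
        self.passes = list(passes)

    @staticmethod
    def pass_enabled(info, ctx):
        if info.name in ctx["disabled_pass"]:
            return False
        if info.name in ctx["required_pass"]:
            return True
        return ctx["opt_level"] >= info.opt_level

    def __call__(self, mod, ctx):
        for p in self.passes:
            if self.pass_enabled(p, ctx):
                mod = p(mod, ctx)
        return mod


def tracer(name):
    # The toy "module" is just the list of pass names applied so far.
    return lambda mod: mod + [name]


seq = Sequential([
    Pass("SimplifyInference", 0, tracer("SimplifyInference")),
    Pass("FoldConstant", 2, tracer("FoldConstant")),
    Pass("CombineParallelConv2D", 4, tracer("CombineParallelConv2D")),
])

ctx = {"opt_level": 2, "required_pass": [], "disabled_pass": []}
print(seq([], ctx))
```

With opt_level 2 only the first two passes run; listing CombineParallelConv2D in required_pass forces it on, and listing FoldConstant in disabled_pass switches it off regardless of level.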

