mlx5 dpdk ovs offload


 https://github.com/Mellanox/OVS/blob/master/Documentation/topics/dpdk/phy.rst

https://developer.nvidia.com/zh-cn/blog/improving-performance-for-nfv-infrastructure-and-agile-cloud-data-centers/

At the 2018 Red Hat Summit, NVIDIA Mellanox announced an open Network Functions Virtualization Infrastructure (NFVI) and cloud data center solution. The solution combines Red Hat Enterprise Linux cloud software with in-box support for NVIDIA Mellanox NIC hardware. Close collaboration and joint validation with Red Hat produced a fully integrated solution that delivers high performance and efficiency and is easy to deploy. It includes open-source data-path acceleration technologies, namely the Data Plane Development Kit (DPDK) and Open vSwitch (OvS) acceleration.

Private cloud and communication service providers are transforming their infrastructure to achieve the agility and efficiency of hyperscale public cloud providers. This transformation rests on two basic principles: disaggregation and virtualization.

Disaggregation decouples network software from the underlying hardware. Server and network virtualization improve efficiency by sharing industry-standard servers and network devices through hypervisors and overlay networks. These disruptive capabilities bring agility, flexibility, and software programmability. However, kernel-based hypervisors and virtual switching also impose a severe network-performance penalty, because both burn host CPU cycles inefficiently on packet processing. Over-provisioning CPU cores to compensate for this degradation drives up capital expenditure, defeating the hardware-efficiency goal of server virtualization.

To address these challenges, Red Hat and NVIDIA Mellanox brought to market an efficient, hardware-accelerated, tightly integrated NFVI and cloud data center solution that combines the Red Hat Enterprise Linux operating system with NVIDIA Mellanox ConnectX-5 network adapters running DPDK and the Accelerated Switching and Packet Processing (ASAP²) OvS offload technology.

 

Architecture of the Mellanox OFED stack

 

 


mlx4 VPI Driver
ConnectX®-3 can operate either as an InfiniBand (IB) adapter or as an Ethernet NIC. mlx4 is the low-level driver implementation for the ConnectX® family of adapters. The OFED driver supports both IB and Ethernet configurations; to accommodate them, the driver is split into the following modules:

mlx4_core: handles low-level functions such as device initialization and firmware commands, and controls resource allocation so that the IB and Ethernet functions can share the device without interfering with each other
mlx4_ib: handles InfiniBand functionality and plugs into the InfiniBand mid-layer
mlx4_en: a 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles the Ethernet functionality
libmlx4 is a userspace driver for Mellanox ConnectX InfiniBand HCAs. It is a plug-in module for libibverbs that allows programs to use Mellanox hardware directly from userspace.
mlx5 Driver
mlx5 is the low-level driver implementation for the Connect-IB® and ConnectX®-4 adapters. Connect-IB operates as an IB adapter, while ConnectX-4 is a VPI adapter (IB and Ethernet). mlx5 consists of the following kernel modules:

mlx5_core: acts as a library of common functions (for example, re-initializing the device after reset) required by the Connect-IB® and ConnectX®-4 adapter cards. mlx5_core also implements the Ethernet interfaces for ConnectX®-4. Unlike mlx4_en/mlx4_core, the mlx5 driver needs no separate mlx5_en module, because the Ethernet functionality is built into mlx5_core.
mlx5_ib: handles InfiniBand functionality
libmlx5: implements the hardware-specific user-space functionality. If the firmware and the driver are not compatible, this driver will not load, and a message is printed to dmesg. libmlx5 honors the following environment variables (a usage sketch follows the list):
MLX5_FREEZE_ON_ERROR_CQE
MLX5_POST_SEND_PREFER_BF
MLX5_SHUT_UP_BF
MLX5_SINGLE_THREADED
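
These variables are read by libmlx5 when the device context is created, so they must be set before the application (or PMD) opens the device. A minimal sketch, assuming the variable should apply to a context opened with plain Verbs (function name and value are illustrative):

#include <stdlib.h>
#include <infiniband/verbs.h>

/* Hedged sketch: export the variable before ibv_open_device() so libmlx5
 * sees it when it initializes the context. Whether MLX5_SINGLE_THREADED is
 * appropriate depends on the application's threading model. */
static struct ibv_context *
open_mlx5_ctx_single_threaded(struct ibv_device *ibdev)
{
        if (setenv("MLX5_SINGLE_THREADED", "1", 1) != 0)
                return NULL;
        return ibv_open_device(ibdev);
}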
 

ASAP² OvS Offload Acceleration

The OvS hardware-offload solution improves the packet performance of the slow, software-based virtual switch by an order of magnitude. In essence, OvS hardware offload offers the best of both worlds: hardware acceleration of the data path together with an unmodified OvS control path, preserving the flexibility and programmability of match-action rules. NVIDIA Mellanox pioneered this breakthrough technology and has led, in the OvS, Linux kernel, DPDK, and OpenStack open-source communities, the open architecture needed to support it.

Figure 1. The ASAP² OvS offload solution: ASAP² fully and transparently offloads OvS virtual switch and router datapath processing to the NIC.

Figure 1 shows the NVIDIA Mellanox open ASAP² OvS offload technology. It fully and transparently offloads virtual switch and router datapath processing to the NIC's embedded switch (e-switch). NVIDIA Mellanox contributed the upstream development of the core frameworks and APIs, such as tc flower, making them available in Linux kernel and OvS releases. These APIs dramatically accelerate networking functions such as overlays, switching, routing, security, and load balancing.

As confirmed by performance tests run in Red Hat's labs, NVIDIA Mellanox ASAP² technology delivers near-100G line-rate throughput for large Virtual Extensible LAN (VXLAN) packets without consuming any CPU cycles. For small packets, ASAP² boosts the OvS VXLAN packet rate by 10x, from 5 million packets per second using 12 CPU cores to 55 million packets per second while consuming zero CPU cores.

Cloud and communication service providers and enterprises can achieve better overall infrastructure efficiency with the ASAP²-based high-performance solution while freeing CPU cores to pack more virtual network functions (VNFs) and cloud-native applications onto the same server. This helps shrink the server footprint and yields significant capital savings. ASAP² has been available as a technology preview since OSP 13 and RHEL 7.5, and is generally available starting with OSP 16.1 and RHEL 8.2.

OVS-DPDK Acceleration

If you want to keep the existing, slower OvS virtio data path but still need some acceleration, the NVIDIA Mellanox DPDK solution can boost OvS performance. Figure 2 shows the OvS-over-DPDK solution, which uses the DPDK software libraries and poll-mode drivers (PMDs) to significantly improve packet rate at the cost of consuming CPU cores.

Figure 2. The OVS-DPDK solution: the NVIDIA Mellanox DPDK solution accelerates the existing OvS virtio data path and improves packet rate.

Using open-source DPDK technology, the NVIDIA Mellanox ConnectX-5 NIC delivers the industry's best bare-metal packet rate of 139 million packets per second for running OvS, VNFs, or cloud applications over DPDK, fully supported by Red Hat since RHEL 7.5.

Network architects often face many choices when selecting the best technology for their IT infrastructure needs. When deciding between ASAP² and DPDK, the decision is made easier by the significant advantages ASAP² technology has over DPDK.

Thanks to the SR-IOV data path, ASAP² OvS offload achieves significantly higher performance than DPDK, which uses the traditional, slower virtio data path. Furthermore, ASAP² saves CPU cores by offloading flows to the NIC, whereas DPDK consumes CPU cores to process packets in a less efficient way. Like DPDK, ASAP² OvS offload is an open-source technology, fully supported in the open-source community and widely adopted across the industry.

For more information, see the following resources:

https://isrc.iscas.ac.cn/gitlab/mirrors/git.yoctoproject.org/linux-yocto/commit/f8e8fa0262eaf544490a11746c524333158ef0b6

 drivers/infiniband/hw/mlx5/ib_rep.c 

 

 

Kernel: https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c

 

 

 

 

 

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:2664:   const struct flow_action_entry *act;

 

 

dpdk 

 

mlx5_pci_probe

          mlx5_init_once

static int
mlx5_init_once(void)
{
        struct mlx5_shared_data *sd;
        struct mlx5_local_data *ld = &mlx5_local_data;
        int ret = 0;

        if (mlx5_init_shared_data())
                return -rte_errno;
        sd = mlx5_shared_data;
        assert(sd);
        rte_spinlock_lock(&sd->lock);
        switch (rte_eal_process_type()) {
        case RTE_PROC_PRIMARY:
                if (sd->init_done)
                        break;
                LIST_INIT(&sd->mem_event_cb_list);
                rte_rwlock_init(&sd->mem_event_rwlock);
                rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
                                                mlx5_mr_mem_event_cb, NULL);
                ret = mlx5_mp_init_primary();
                if (ret)
                        goto out;
                sd->init_done = true;
                break;
        case RTE_PROC_SECONDARY:
                if (ld->init_done)
                        break;
                ret = mlx5_mp_init_secondary();
                if (ret)
                        goto out;
                ++sd->secondary_cnt;
                ld->init_done = true;
                break;
        default:
                break;
        }
out:
        rte_spinlock_unlock(&sd->lock);
        return ret;
}

 

 

 

 

Kernel

mlx5_init_once
     mlx5_eswitch_init
               esw_offloads_init_reps

 

kernel 

 

Issuing commands (rule insertion)

err = parse_tc_fdb_actions(priv, &rule->action, flow, extack);
err = mlx5e_tc_add_fdb_flow(priv, flow, extack);
static int parse_tc_fdb_actions(struct mlx5e_priv *priv,
                                struct flow_action *flow_action,
                                struct mlx5e_tc_flow *flow,
                                struct netlink_ext_ack *extack)
{
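        /* Excerpt: the cases below run inside flow_action_for_each(i, act,
         * flow_action) and a switch (act->id) over each parsed TC action. */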
        
        case FLOW_ACTION_VLAN_PUSH:
        case FLOW_ACTION_VLAN_POP:
                        if (act->id == FLOW_ACTION_VLAN_PUSH &&
                            (action & MLX5_FLOW_CONTEXT_ACTION_VLAN_POP)) {
                                /* Replace vlan pop+push with vlan modify */
                                action &= ~MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
                                err = add_vlan_rewrite_action(priv,
                                                              MLX5_FLOW_NAMESPACE_FDB,
                                                              act, parse_attr, hdrs,
                                                              &action, extack);
                        } else {
                                err = parse_tc_vlan_action(priv, act, attr, &action);
                        }
                        if (err)
                                return err;

                        attr->split_count = attr->out_count;
                        break;
}
mlx5e_tc_add_fdb_flow
{
     struct mlx5_esw_flow_attr *attr = flow->esw_attr;
     err = mlx5_eswitch_add_vlan_action(esw, attr);  
    flow_flag_set(flow, OFFLOADED);
}

 flow_flag_set(flow, OFFLOADED);
 MLX5E_TC_FLOW_FLAG_OFFLOADED
static bool mlx5e_is_offloaded_flow(struct mlx5e_tc_flow *flow)
{
        return flow_flag_test(flow, OFFLOADED);
}

int mlx5_eswitch_add_vlan_action(struct mlx5_eswitch *esw,
                                 struct mlx5_esw_flow_attr *attr)
{
        struct offloads_fdb *offloads = &esw->fdb_table.offloads;
        struct mlx5_eswitch_rep *vport = NULL;
        bool push, pop, fwd;
        int err = 0;

        /* nop if we're on the vlan push/pop non emulation mode */
        if (mlx5_eswitch_vlan_actions_supported(esw->dev, 1))
                return 0;
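
        /* In the full function, push/pop/fwd are derived here from
         * attr->action flags (VLAN_PUSH / VLAN_POP / FWD_DEST) before
         * esw_vlan_action_get_vport() is called - elided in this excerpt. */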
        vport = esw_vlan_action_get_vport(attr, push, pop);

        if (push) {
                if (vport->vlan_refcount)
                        goto skip_set_push;

                err = __mlx5_eswitch_set_vport_vlan(esw, vport->vport, attr->vlan_vid[0], 0,
                                                    SET_VLAN_INSERT | SET_VLAN_STRIP);
                if (err)
                        goto out;
                vport->vlan = attr->vlan_vid[0];
skip_set_push:
                vport->vlan_refcount++;
        }

}
mlx5_eswitch_add_vlan_action
    --> __mlx5_eswitch_set_vport_vlan
        --> modify_esw_vport_cvlan
            --> MLX5_SET
            --> modify_esw_vport_context_cmd
                --> mlx5_cmd_exec
                    --> mlx5_cmd_invoke
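
The modify step in the middle of this chain is a thin wrapper that stamps the MODIFY_ESW_VPORT_CONTEXT opcode and the target vport into the command buffer and hands it to mlx5_cmd_exec(). A paraphrased sketch (not a verbatim copy; the exact body differs between kernel versions):

/* Paraphrased from drivers/net/ethernet/mellanox/mlx5/core/eswitch.c;
 * details vary by kernel version. */
static int modify_esw_vport_context_cmd(struct mlx5_core_dev *dev, u16 vport,
                                        void *in, int inlen)
{
        u32 out[MLX5_ST_SZ_DW(modify_esw_vport_context_out)] = {};

        MLX5_SET(modify_esw_vport_context_in, in, opcode,
                 MLX5_CMD_OP_MODIFY_ESW_VPORT_CONTEXT);
        MLX5_SET(modify_esw_vport_context_in, in, vport_number, vport);
        MLX5_SET(modify_esw_vport_context_in, in, other_vport, 1);
        return mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
}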
int mlx5_cmd_exec(struct mlx5_core_dev *dev, void *in, int in_size, void *out,
                  int out_size)
{
        int err;

        err = cmd_exec(dev, in, in_size, out, out_size, NULL, NULL, false);
        return err ? : mlx5_cmd_check(dev, in, out);
}

 

 

static int query_esw_vport_context_cmd(struct mlx5_core_dev *dev, u16 vport,
                                       void *out, int outlen)
{
        u32 in[MLX5_ST_SZ_DW(query_esw_vport_context_in)] = {};

        MLX5_SET(query_esw_vport_context_in, in, opcode,
                 MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT);
        MLX5_SET(modify_esw_vport_context_in, in, vport_number, vport);
        MLX5_SET(modify_esw_vport_context_in, in, other_vport, 1);
        return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
}

 

 

 dpdk

flow_list_create-->
      flow_drv_translate-->fops->translate
 

__flow_dv_translate --> flow_dv_matcher_register
cache_matcher->matcher_object = mlx5_glue->dv_create_flow_matcher(sh->ctx, &dv_attr, tbl->obj);

 

static inline int
flow_drv_translate(struct rte_eth_dev *dev, struct mlx5_flow *dev_flow,
                   const struct rte_flow_attr *attr,
                   const struct rte_flow_item items[],
                   const struct rte_flow_action actions[],
                   struct rte_flow_error *error)
{
        const struct mlx5_flow_driver_ops *fops;
        enum mlx5_flow_drv_type type = dev_flow->flow->drv_type;

        assert(type > MLX5_FLOW_TYPE_MIN && type < MLX5_FLOW_TYPE_MAX);
        fops = flow_get_drv_ops(type);
        return fops->translate(dev, dev_flow, attr, items, actions, error);
}

 

const struct mlx5_flow_driver_ops mlx5_flow_dv_drv_ops = {
        .validate = flow_dv_validate,
        .prepare = flow_dv_prepare,
        .translate = flow_dv_translate,
        .apply = flow_dv_apply,
        .remove = flow_dv_remove,
        .destroy = flow_dv_destroy,
        .query = flow_dv_query,
        .create_mtr_tbls = flow_dv_create_mtr_tbl,
        .destroy_mtr_tbls = flow_dv_destroy_mtr_tbl,
        .create_policer_rules = flow_dv_create_policer_rules,
        .destroy_policer_rules = flow_dv_destroy_policer_rules,
        .counter_alloc = flow_dv_counter_allocate,
        .counter_free = flow_dv_counter_free,
        .counter_query = flow_dv_counter_query,
};

 

query_nic_vport

 

enum {
        MLX5_GET_HCA_CAP_OP_MOD_GENERAL_DEVICE = 0x0 << 1,
        MLX5_GET_HCA_CAP_OP_MOD_ETHERNET_OFFLOAD_CAPS = 0x1 << 1,
        MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP = 0xc << 1,
};

enum {
        MLX5_HCA_CAP_OPMOD_GET_MAX   = 0,
        MLX5_HCA_CAP_OPMOD_GET_CUR   = 1,
};

enum {
        MLX5_CAP_INLINE_MODE_L2,
        MLX5_CAP_INLINE_MODE_VPORT_CONTEXT,
        MLX5_CAP_INLINE_MODE_NOT_REQUIRED,
};

enum {
        MLX5_INLINE_MODE_NONE,
        MLX5_INLINE_MODE_L2,
        MLX5_INLINE_MODE_IP,
        MLX5_INLINE_MODE_TCP_UDP,
        MLX5_INLINE_MODE_RESERVED4,
        MLX5_INLINE_MODE_INNER_L2,
        MLX5_INLINE_MODE_INNER_IP,
        MLX5_INLINE_MODE_INNER_TCP_UDP,
};

/* HCA bit masks indicating which Flex parser protocols are already enabled. */
#define MLX5_HCA_FLEX_IPV4_OVER_VXLAN_ENABLED (1UL << 0)
#define MLX5_HCA_FLEX_IPV6_OVER_VXLAN_ENABLED (1UL << 1)
#define MLX5_HCA_FLEX_IPV6_OVER_IP_ENABLED (1UL << 2)
#define MLX5_HCA_FLEX_GENEVE_ENABLED (1UL << 3)
#define MLX5_HCA_FLEX_CW_MPLS_OVER_GRE_ENABLED (1UL << 4)
#define MLX5_HCA_FLEX_CW_MPLS_OVER_UDP_ENABLED (1UL << 5)
#define MLX5_HCA_FLEX_P_BIT_VXLAN_GPE_ENABLED (1UL << 6)
#define MLX5_HCA_FLEX_VXLAN_GPE_ENABLED (1UL << 7)
 

mlx5_dev_spawn
   --> mlx5_devx_cmd_query_hca_attr
           MLX5_GET_HCA_CAP_OP_MOD_GENERAL_DEVICE

mlx5_devx_cmd_query_hca_attr(struct ibv_context *ctx,
                             struct mlx5_hca_attr *attr)
{
        uint32_t in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {0};
        uint32_t out[MLX5_ST_SZ_DW(query_hca_cap_out)] = {0};
        void *hcattr;
        int status, syndrome, rc;

        MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
        MLX5_SET(query_hca_cap_in, in, op_mod,
                 MLX5_GET_HCA_CAP_OP_MOD_GENERAL_DEVICE |
                 MLX5_HCA_CAP_OPMOD_GET_CUR);

        rc = mlx5_glue->devx_general_cmd(ctx,
                                         in, sizeof(in), out, sizeof(out));
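        /* ... the full function then checks rc/status/syndrome and copies the
         * individual capability fields from the response into *attr with
         * MLX5_GET() (elided in this excerpt). */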
}
.devx_general_cmd = mlx5_glue_devx_general_cmd
mlx5_glue_devx_general_cmd(struct ibv_context *ctx,
                           const void *in, size_t inlen,
                           void *out, size_t outlen)
{
#ifdef HAVE_IBV_DEVX_OBJ
        return mlx5dv_devx_general_cmd(ctx, in, inlen, out, outlen);
#else
        (void)ctx;
        (void)in;
        (void)inlen;
        (void)out;
        (void)outlen;
        return -ENOTSUP;
#endif
}

 

 

 

static int
mlx5_devx_cmd_query_nic_vport_context(struct ibv_context *ctx,
                                      unsigned int vport,
                                      struct mlx5_hca_attr *attr)
{
        uint32_t in[MLX5_ST_SZ_DW(query_nic_vport_context_in)] = {0};
        uint32_t out[MLX5_ST_SZ_DW(query_nic_vport_context_out)] = {0};
        void *vctx;
        int status, syndrome, rc;

        /* Query NIC vport context to determine inline mode. */
        MLX5_SET(query_nic_vport_context_in, in, opcode,
                 MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT);
        MLX5_SET(query_nic_vport_context_in, in, vport_number, vport);
        if (vport)
                MLX5_SET(query_nic_vport_context_in, in, other_vport, 1);
        rc = mlx5_glue->devx_general_cmd(ctx,
                                         in, sizeof(in),
                                         out, sizeof(out));
        if (rc)
                goto error;
        status = MLX5_GET(query_nic_vport_context_out, out, status);
        syndrome = MLX5_GET(query_nic_vport_context_out, out, syndrome);
        if (status) {
                DRV_LOG(DEBUG, "Failed to query NIC vport context, "
                        "status %x, syndrome = %x",
                        status, syndrome);
                return -1;
        }
        vctx = MLX5_ADDR_OF(query_nic_vport_context_out, out,
                            nic_vport_context);
        attr->vport_inline_mode = MLX5_GET(nic_vport_context, vctx,
                                           min_wqe_inline_mode);
        return 0;
error:
        rc = (rc > 0) ? -rc : rc;
        return rc;
}

 

Issuing the command: flow_dv_create_action_push_vlan

 

flow_dv_create_action_push_vlan
    --> flow_dv_push_vlan_action_resource_register
        --> mlx5_glue->dr_create_flow_action_push_vlan

.dr_create_flow_action_push_vlan = mlx5_glue_dr_create_flow_action_push_vlan,
    --> mlx5dv_dr_action_create_push_vlan

 

$Q sh -- '$<' '$@' \
                HAVE_MLX5DV_DR_VLAN \
                infiniband/mlx5dv.h \
                func mlx5dv_dr_action_create_push_vlan \
                $(AUTOCONF_OUTPUT)

 

 mlx5dv_dr_action_create_push_vlan

https://github.com/linux-rdma/rdma-core/blob/364dbcc7c04afce7d3085a9b420bb59bc9d99d95/providers/mlx5/dr_action.c

struct mlx5dv_dr_action *mlx5dv_dr_action_create_push_vlan(struct mlx5dv_dr_domain *dmn,
                               __be32 vlan_hdr)
{
    uint32_t vlan_hdr_h = be32toh(vlan_hdr);
    uint16_t ethertype = vlan_hdr_h >> 16;
    struct mlx5dv_dr_action *action;

    if (ethertype != SVLAN_ETHERTYPE && ethertype != CVLAN_ETHERTYPE) {
        dr_dbg(dmn, "Invalid vlan ethertype\n");
        errno = EINVAL;
        return NULL;
    }

    action = dr_action_create_generic(DR_ACTION_TYP_PUSH_VLAN);
    if (!action)
        return NULL;

    action->push_vlan.vlan_hdr = vlan_hdr_h;
    return action;
}
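
For reference, the single __be32 argument packs the ethertype into the upper 16 bits (validated against C/SVLAN in the function above) and PCP/DEI/VID into the lower 16. A hedged illustration with made-up values (the helper name, PCP and VID are examples, not taken from the PMD):

#include <stdint.h>
#include <endian.h>
#include <infiniband/mlx5dv.h>

/* Hedged illustration: ethertype in the upper 16 bits, TCI in the lower 16;
 * the packed word is passed big-endian. Values are examples only. */
static struct mlx5dv_dr_action *
push_vlan_example(struct mlx5dv_dr_domain *dmn)
{
        uint16_t tci = (3u << 13) | 100;              /* PCP 3, DEI 0, VID 100 */
        uint32_t vlan_hdr_h = (0x8100u << 16) | tci;  /* CVLAN ethertype 0x8100 */

        return mlx5dv_dr_action_create_push_vlan(dmn, htobe32(vlan_hdr_h));
}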

 


 

                case RTE_FLOW_ACTION_TYPE_OF_PUSH_VLAN:
                        flow_dev_get_vlan_info_from_items(items, &vlan);
                        vlan.eth_proto = rte_be_to_cpu_16
                             ((((const struct rte_flow_action_of_push_vlan *)
                                                   actions->conf)->ethertype));
                        found_action = mlx5_flow_find_action
                                        (actions + 1,
                                         RTE_FLOW_ACTION_TYPE_OF_SET_VLAN_VID);
                        if (found_action)
                                mlx5_update_vlan_vid_pcp(found_action, &vlan);
                        found_action = mlx5_flow_find_action
                                        (actions + 1,
                                         RTE_FLOW_ACTION_TYPE_OF_SET_VLAN_PCP);
                        if (found_action)
                                mlx5_update_vlan_vid_pcp(found_action, &vlan);
                        if (flow_dv_create_action_push_vlan
                                            (dev, attr, &vlan, dev_flow, error))
                                return -rte_errno;
                        dev_flow->dv.actions[actions_n++] =
                                           dev_flow->dv.push_vlan_res->action;
                        action_flags |= MLX5_FLOW_ACTION_OF_PUSH_VLAN;
                        break;
    if (flow_dv_matcher_register(dev, &matcher, &tbl_key, dev_flow, error))
                return -rte_errno;
        return 0;

 

 

flow_dv_matcher_register
    --> mlx5_glue->dv_create_flow_matcher
        --> mlx5_glue_dv_create_flow_matcher
            --> mlx5dv_create_flow_matcher
            --> mlx5dv_dr_matcher_create

 

 https://www.mellanox.com/related-docs/user_manuals/Ethernet_Adapters_Programming_Manual.pdf

 

RDMA Mellanox VPI verbs API +  DV APIs

 

 

mlx5: Introduce DEVX APIs

 ibv_create_flow

 

 

drivers/net/mlx5/mlx5_flow_dv.c:7341:   mlx5_glue->dv_create_flow(dv->matcher->matcher_object,
drivers/net/mlx5/mlx5_flow_dv.c:7909:   mlx5_glue->dv_create_flow(dtb->any_matcher,
drivers/net/mlx5/mlx5_flow_dv.c:8098:   mlx5_glue->dv_create_flow(dtb->color_matcher,
drivers/net/mlx5/mlx5_flow_verbs.c:1722:                verbs->flow = mlx5_glue->create_flow(verbs->hrxq->qp,
drivers/net/mlx4/mlx4_flow.c:1117:      flow->ibv_flow = mlx4_glue->create_flow(qp, flow->ibv_attr);
drivers/net/mlx5/mlx5_flow.c:511:               flow = mlx5_glue->create_flow(drop->qp, &flow_attr.attr);

mlx4_glue_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow)
{
        return ibv_create_flow(qp, flow);
}
drivers/net/mlx5/mlx5_flow.c:511
mlx5_flow_discover_priorities -->
          mlx5_glue->create_flow
const struct mlx4_glue *mlx4_glue = &(const struct mlx4_glue){
        .create_flow = mlx4_glue_create_flow,
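
For comparison with the DV/DR path, the plain Verbs path ends in ibv_create_flow(), which attaches an ibv_flow_attr plus a list of flow specs to a QP. A minimal, self-contained sketch (the MAC address, port number and priority are illustrative, not taken from the PMD):

#include <infiniband/verbs.h>

/* Combined attribute + single Ethernet spec, laid out back to back as
 * ibv_create_flow() expects. */
struct eth_flow {
        struct ibv_flow_attr     attr;
        struct ibv_flow_spec_eth eth;
} __attribute__((packed));

/* Steer frames with a given destination MAC to the QP (illustrative values). */
static struct ibv_flow *steer_dmac_to_qp(struct ibv_qp *qp)
{
        struct eth_flow f = {
                .attr = {
                        .type         = IBV_FLOW_ATTR_NORMAL,
                        .size         = sizeof(f),
                        .priority     = 0,
                        .num_of_specs = 1,
                        .port         = 1,
                },
                .eth = {
                        .type = IBV_FLOW_SPEC_ETH,
                        .size = sizeof(struct ibv_flow_spec_eth),
                        .val.dst_mac  = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55},
                        .mask.dst_mac = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff},
                },
        };

        return ibv_create_flow(qp, &f.attr);
}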

 

Prerequisites


This driver relies on external libraries and kernel drivers for resources allocations and initialization. The following dependencies are not part of DPDK and must be installed separately:

libibverbs

User space Verbs framework used by librte_pmd_mlx5. This library provides a generic interface between the kernel and low-level user space drivers such as libmlx5.

It allows slow and privileged operations (context initialization, hardware resources allocations) to be managed by the kernel and fast operations to never leave user space.

libmlx5

Low-level user space driver library for Mellanox ConnectX-4 devices, it is automatically loaded by libibverbs.

This library basically implements send/receive calls to the hardware queues.

Kernel modules (mlnx-ofed-kernel)

They provide the kernel-side Verbs API and low level device drivers that manage actual hardware initialization and resources sharing with user space processes.

Unlike most other PMDs, these modules must remain loaded and bound to their devices:

mlx5_core: hardware driver managing Mellanox ConnectX-4 devices and related Ethernet kernel network devices.
mlx5_ib: InfiniBand device driver.
ib_uverbs: user space driver for Verbs (entry point for libibverbs).
Firmware update

Mellanox OFED releases include firmware updates for ConnectX-4 adapters.

Because each release provides new features, these updates must be applied to match the kernel modules and libraries they come with.

 

 

Load the kernel modules:

modprobe -a ib_uverbs mlx5_core mlx5_ib 

Alternatively if MLNX_OFED is fully installed, the following script can be run:

/etc/init.d/openibd restart

MLX5_SET + flow_dv_matcher_register

 

__flow_dv_translate
{
     flow_dv_translate_item_vlan
        case RTE_FLOW_ITEM_TYPE_VXLAN:
                        flow_dv_translate_item_vxlan(match_mask, match_value,
                                                     items, tunnel);
                        last_item = MLX5_FLOW_LAYER_VXLAN;
                        break;
    tbl_key.domain = attr->transfer;
        tbl_key.direction = attr->egress;
        tbl_key.table_id = dev_flow->group;
        if (flow_dv_matcher_register(dev, &matcher, &tbl_key, dev_flow, error))
                return -rte_errno;
        return 0;
}
flow_dv_translate_item_vlan(struct mlx5_flow *dev_flow,
                            void *matcher, void *key,
                            const struct rte_flow_item *item,
                            int inner)
{
        const struct rte_flow_item_vlan *vlan_m = item->mask;
        const struct rte_flow_item_vlan *vlan_v = item->spec;
        void *headers_m;
        void *headers_v;
        uint16_t tci_m;
        uint16_t tci_v;

        if (!vlan_v)
                return;
        if (!vlan_m)
                vlan_m = &rte_flow_item_vlan_mask;
        if (inner) {
                headers_m = MLX5_ADDR_OF(fte_match_param, matcher,
                                         inner_headers);
                headers_v = MLX5_ADDR_OF(fte_match_param, key, inner_headers);
        } else {
                headers_m = MLX5_ADDR_OF(fte_match_param, matcher,
                                         outer_headers);
                headers_v = MLX5_ADDR_OF(fte_match_param, key, outer_headers);
                /*
                 * This is workaround, masks are not supported,
                 * and pre-validated.
                 */
                dev_flow->dv.vf_vlan.tag =
                        rte_be_to_cpu_16(vlan_v->tci) & 0x0fff;
        }
        tci_m = rte_be_to_cpu_16(vlan_m->tci);
        tci_v = rte_be_to_cpu_16(vlan_m->tci & vlan_v->tci);
        MLX5_SET(fte_match_set_lyr_2_4, headers_m, cvlan_tag, 1);
        MLX5_SET(fte_match_set_lyr_2_4, headers_v, cvlan_tag, 1);
        MLX5_SET(fte_match_set_lyr_2_4, headers_m, first_vid, tci_m);
        MLX5_SET(fte_match_set_lyr_2_4, headers_v, first_vid, tci_v);
        MLX5_SET(fte_match_set_lyr_2_4, headers_m, first_cfi, tci_m >> 12);
        MLX5_SET(fte_match_set_lyr_2_4, headers_v, first_cfi, tci_v >> 12);
        MLX5_SET(fte_match_set_lyr_2_4, headers_m, first_prio, tci_m >> 13);
        MLX5_SET(fte_match_set_lyr_2_4, headers_v, first_prio, tci_v >> 13);
        MLX5_SET(fte_match_set_lyr_2_4, headers_m, ethertype,
                 rte_be_to_cpu_16(vlan_m->inner_type));
        MLX5_SET(fte_match_set_lyr_2_4, headers_v, ethertype,
                 rte_be_to_cpu_16(vlan_m->inner_type & vlan_v->inner_type));
}
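
The MLX5_SET calls above split the 16-bit VLAN TCI into its PCP/DEI/VID components. A small worked example (values are illustrative):

#include <stdint.h>

/* Worked example of the TCI decomposition performed above. */
static inline void tci_example(void)
{
        uint16_t tci  = 0x6064;            /* PCP=3, DEI=0, VID=100 */
        uint16_t vid  = tci & 0x0fff;      /* -> first_vid  = 100 */
        uint16_t cfi  = (tci >> 12) & 0x1; /* -> first_cfi  = 0   */
        uint16_t prio = tci >> 13;         /* -> first_prio = 3   */
        (void)vid; (void)cfi; (void)prio;
}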

 

Port information

static int
__flow_dv_translate
{
     case RTE_FLOW_ACTION_TYPE_PORT_ID:
                        if (flow_dv_translate_action_port_id(dev, action,
                                                             &port_id, error))
                                return -rte_errno;
                        port_id_resource.port_id = port_id;
                        if (flow_dv_port_id_action_resource_register
                            (dev, &port_id_resource, dev_flow, error))
                                return -rte_errno;
                        dev_flow->dv.actions[actions_n++] =
                                dev_flow->dv.port_id_action->action;
                        action_flags |= MLX5_FLOW_ACTION_PORT_ID;
                        break;

 

flow_dv_translate_action_port_id(struct rte_eth_dev *dev,
                                 const struct rte_flow_action *action,
                                 uint32_t *dst_port_id,
                                 struct rte_flow_error *error)
{
        uint32_t port;
        struct mlx5_priv *priv;
        const struct rte_flow_action_port_id *conf =
                        (const struct rte_flow_action_port_id *)action->conf;

        port = conf->original ? dev->data->port_id : conf->id;
        priv = mlx5_port_to_eswitch_info(port, false);
        if (!priv)
                return rte_flow_error_set(error, -rte_errno,
                                          RTE_FLOW_ERROR_TYPE_ACTION,
                                          NULL,
                                          "No eswitch info was found for port");
#ifdef HAVE_MLX5DV_DR_DEVX_PORT
        /*
         * This parameter is transferred to
         * mlx5dv_dr_action_create_dest_ib_port().
         */
        *dst_port_id = priv->ibv_port;
#else
        /*
         * Legacy mode, no LAG configurations is supported.
         * This parameter is transferred to
         * mlx5dv_dr_action_create_dest_vport().
         */
        *dst_port_id = priv->vport_id;
#endif
        return 0;
}

 

struct mlx5_priv *
mlx5_port_to_eswitch_info(uint16_t port, bool valid)
{
        struct rte_eth_dev *dev;
        struct mlx5_priv *priv;

        if (port >= RTE_MAX_ETHPORTS) {
                rte_errno = EINVAL;
                return NULL;
        }
        if (!valid && !rte_eth_dev_is_valid_port(port)) {
                rte_errno = ENODEV;
                return NULL;
        }
        dev = &rte_eth_devices[port];
        priv = dev->data->dev_private;
        if (!(priv->representor || priv->master)) {
                rte_errno = EINVAL;
                return NULL;
        }
        return priv;
}

 

fops->apply

mlx5_flow_start->flow_drv_apply -->fops->apply(dev, flow, error)
           --> __flow_dv_apply

 

 

old version

 

static int flow_dv_translate(struct rte_eth_dev *dev,
          struct mlx5_flow *dev_flow,
          const struct rte_flow_attr *attr,
          const struct rte_flow_item items[],
          const struct rte_flow_action actions[] __rte_unused,
          struct rte_flow_error *error)
{
    struct priv *priv = dev->data->dev_private;
    uint64_t priority = attr->priority;
    struct mlx5_flow_dv_matcher matcher = {
        .mask = {
            .size = sizeof(matcher.mask.buf),
        },
    };
    void *match_value = dev_flow->dv.value.buf;
    uint8_t inner = 0;

    if (priority == MLX5_FLOW_PRIO_RSVD)
        priority = priv->config.flow_prio - 1;
    for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++)
        flow_dv_create_item(&matcher, match_value, items, dev_flow,
                    inner);
    matcher.crc = rte_raw_cksum((const void *)matcher.mask.buf,
                     matcher.mask.size);
    if (priority == MLX5_FLOW_PRIO_RSVD)
        priority = priv->config.flow_prio - 1;
    matcher.priority = mlx5_flow_adjust_priority(dev, priority,
                             matcher.priority);
    matcher.egress = attr->egress;
    if (flow_dv_matcher_register(dev, &matcher, dev_flow, error))
        return -rte_errno;
    for (; actions->type != RTE_FLOW_ACTION_TYPE_END; actions++)
        flow_dv_create_action(actions, dev_flow);
    return 0;
}


flow_dv_create_item(void *matcher, void *key,
            const struct rte_flow_item *item,
            struct mlx5_flow *dev_flow,
            int inner)
{
    struct mlx5_flow_dv_matcher *tmatcher = matcher;

    switch (item->type) {
    case RTE_FLOW_ITEM_TYPE_VOID:
    case RTE_FLOW_ITEM_TYPE_END:
        break;
    case RTE_FLOW_ITEM_TYPE_ETH:
        flow_dv_translate_item_eth(tmatcher->mask.buf, key, item,
                       inner);
        tmatcher->priority = MLX5_PRIORITY_MAP_L2;
        break;
    case RTE_FLOW_ITEM_TYPE_VLAN:
        flow_dv_translate_item_vlan(tmatcher->mask.buf, key, item,
                        inner);
        break;
    case RTE_FLOW_ITEM_TYPE_IPV4:
        flow_dv_translate_item_ipv4(tmatcher->mask.buf, key, item,
                        inner);
        tmatcher->priority = MLX5_PRIORITY_MAP_L3;
        dev_flow->dv.hash_fields |=
            mlx5_flow_hashfields_adjust(dev_flow, inner,
                            MLX5_IPV4_LAYER_TYPES,
                            MLX5_IPV4_IBV_RX_HASH);
        break;
    case RTE_FLOW_ITEM_TYPE_IPV6:
        flow_dv_translate_item_ipv6(tmatcher->mask.buf, key, item,
                        inner);
        tmatcher->priority = MLX5_PRIORITY_MAP_L3;
        dev_flow->dv.hash_fields |=
            mlx5_flow_hashfields_adjust(dev_flow, inner,
                            MLX5_IPV6_LAYER_TYPES,
                            MLX5_IPV6_IBV_RX_HASH);
        break;
    case RTE_FLOW_ITEM_TYPE_TCP:
        flow_dv_translate_item_tcp(tmatcher->mask.buf, key, item,
                       inner);
        tmatcher->priority = MLX5_PRIORITY_MAP_L4;
        dev_flow->dv.hash_fields |=
            mlx5_flow_hashfields_adjust(dev_flow, inner,
                            ETH_RSS_TCP,
                            (IBV_RX_HASH_SRC_PORT_TCP |
                             IBV_RX_HASH_DST_PORT_TCP));
        break;
    case RTE_FLOW_ITEM_TYPE_UDP:
        flow_dv_translate_item_udp(tmatcher->mask.buf, key, item,
                       inner);
        tmatcher->priority = MLX5_PRIORITY_MAP_L4;
        dev_flow->verbs.hash_fields |=
            mlx5_flow_hashfields_adjust(dev_flow, inner,
                            ETH_RSS_TCP,
                            (IBV_RX_HASH_SRC_PORT_TCP |
                             IBV_RX_HASH_DST_PORT_TCP));
        break;
    case RTE_FLOW_ITEM_TYPE_NVGRE:
        flow_dv_translate_item_nvgre(tmatcher->mask.buf, key, item,
                         inner);
        break;
    case RTE_FLOW_ITEM_TYPE_GRE:
        flow_dv_translate_item_gre(tmatcher->mask.buf, key, item,
                       inner);
        break;
    case RTE_FLOW_ITEM_TYPE_VXLAN:
    case RTE_FLOW_ITEM_TYPE_VXLAN_GPE:
        flow_dv_translate_item_vxlan(tmatcher->mask.buf, key, item,
                         inner);
        break;
    default:
        break;
    }
}

flow_dv_create_action(struct rte_eth_dev *dev,
                      const struct rte_flow_action *action,
                      struct mlx5_flow *dev_flow,
+                     const struct rte_flow_attr *attr,
                      struct rte_flow_error *error)
 {
        const struct rte_flow_action_queue *queue;
        const struct rte_flow_action_rss *rss;
        int actions_n = dev_flow->dv.actions_n;
        struct rte_flow *flow = dev_flow->flow;
+       const struct rte_flow_action *action_ptr = action;
 
        switch (action->type) {
        case RTE_FLOW_ACTION_TYPE_VOID:
                actions_n++;
                break;
+       case RTE_FLOW_ACTION_TYPE_RAW_ENCAP:
+               /* Handle encap action with preceding decap */
+               if (flow->actions & MLX5_FLOW_ACTION_RAW_DECAP) {
+                       dev_flow->dv.actions[actions_n].type =
+                               MLX5DV_FLOW_ACTION_IBV_FLOW_ACTION;
+                       dev_flow->dv.actions[actions_n].action =
+                                       flow_dv_create_action_raw_encap
+                                                               (dev, action,
+                                                                attr, error);
+                       if (!(dev_flow->dv.actions[actions_n].action))
+                               return -rte_errno;
+                       dev_flow->dv.encap_decap_verbs_action =
+                               dev_flow->dv.actions[actions_n].action;
+               } else {
+                       /* Handle encap action without preceding decap */
+                       dev_flow->dv.actions[actions_n].type =
+                               MLX5DV_FLOW_ACTION_IBV_FLOW_ACTION;
+                       dev_flow->dv.actions[actions_n].action =
+                                       flow_dv_create_action_l2_encap
+                                                       (dev, action, error);
+                       if (!(dev_flow->dv.actions[actions_n].action))
+                               return -rte_errno;
+                       dev_flow->dv.encap_decap_verbs_action =
+                               dev_flow->dv.actions[actions_n].action;
+               }
+               flow->actions |= MLX5_FLOW_ACTION_RAW_ENCAP;
+               actions_n++;
+               break;
+       case RTE_FLOW_ACTION_TYPE_RAW_DECAP:
+               /* Check if this decap action is followed by encap. */
+               for (; action_ptr->type != RTE_FLOW_ACTION_TYPE_END &&
+                      action_ptr->type != RTE_FLOW_ACTION_TYPE_RAW_ENCAP;
+                      action_ptr++) {
+               }
+               /* Handle decap action only if it isn't followed by encap */
+               if (action_ptr->type != RTE_FLOW_ACTION_TYPE_RAW_ENCAP) {
+                       dev_flow->dv.actions[actions_n].type =
+                               MLX5DV_FLOW_ACTION_IBV_FLOW_ACTION;
+                       dev_flow->dv.actions[actions_n].action =
+                                       flow_dv_create_action_l2_decap(dev,
+                                                                      error);
+                       if (!(dev_flow->dv.actions[actions_n].action))
+                               return -rte_errno;
+                       dev_flow->dv.encap_decap_verbs_action =
+                               dev_flow->dv.actions[actions_n].action;
+                       actions_n++;
+               }
+               /* If decap is followed by encap, handle it at encap case. */
+               break;
+               flow->actions |= MLX5_FLOW_ACTION_RAW_DECAP;
        default:
                break;
        }

 

 ===========================================================================================================================

 

/** Switch information returned by mlx5_nl_switch_info(). */
struct mlx5_switch_info {
        uint32_t master:1; /**< Master device. */
        uint32_t representor:1; /**< Representor device. */
        enum mlx5_phys_port_name_type name_type; /** < Port name type. */
        int32_t pf_num; /**< PF number (valid for pfxvfx format only). */
        int32_t port_name; /**< Representor port name. */
        uint64_t switch_id; /**< Switch identifier. */
};
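
As an illustration of how these fields are populated, a VF representor whose kernel phys_port_name is "pf0vf2" would be parsed roughly as follows (an assumption based on the pfXvfY naming convention; the exact name_type enum value depends on the DPDK version):

/* Hypothetical example values for a representor with phys_port_name "pf0vf2";
 * not taken from a real system. */
struct mlx5_switch_info example = {
        .master      = 0,
        .representor = 1,
        .name_type   = MLX5_PHYS_PORT_NAME_TYPE_PFVF, /* assumed enum name */
        .pf_num      = 0,
        .port_name   = 2,   /* VF index parsed from "pf0vf2" */
        .switch_id   = 0,   /* phys_switch_id of the e-switch */
};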



static const struct rte_flow_ops mlx5_flow_ops = {
        .validate = mlx5_flow_validate,
        .create = mlx5_flow_create,
        .destroy = mlx5_flow_destroy,
        .flush = mlx5_flow_flush,
        .isolate = mlx5_flow_isolate,
        .query = mlx5_flow_query,
};

 

 

rte_flow_create

        

dp_netdev_flow_offload_put
    netdev_flow_put --> flow_api->flow_put
        netdev_offload_dpdk_flow_put
            netdev_offload_dpdk_add_flow
                netdev_offload_dpdk_action
                    netdev_offload_dpdk_flow_create
                        netdev_dpdk_rte_flow_create
                            rte_flow_create --> mlx5_flow_create --> flow_list_create

 

OVS already supports a basic HW-offload framework: OVS's internal flow representation is converted into a DPDK rte_flow description and pushed down to the hardware through the rte_flow_create() API. Packets subsequently received by OVS are tagged by the hardware with a flow mark, so OVS can quickly look up the corresponding flow from the mark and execute its actions.
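
A minimal sketch of such a rule using the public rte_flow API (the port id, mark value, IPv4-ethertype match and queue index are illustrative, not what OVS actually builds internally):

#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_byteorder.h>
#include <rte_flow.h>

/* Hedged sketch of a partial-offload rule: the NIC matches the flow and
 * writes a mark into the mbuf, so the software datapath can find the flow
 * by mark instead of reclassifying the packet. */
static struct rte_flow *
offload_with_mark(uint16_t port_id, uint32_t mark_id, struct rte_flow_error *err)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_item_eth eth_spec = { .type = RTE_BE16(RTE_ETHER_TYPE_IPV4) };
        struct rte_flow_item_eth eth_mask = { .type = RTE_BE16(0xffff) };
        struct rte_flow_item pattern[] = {
                { .type = RTE_FLOW_ITEM_TYPE_ETH,
                  .spec = &eth_spec, .mask = &eth_mask },
                { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action_mark mark = { .id = mark_id };
        struct rte_flow_action_queue queue = { .index = 0 };
        struct rte_flow_action actions[] = {
                { .type = RTE_FLOW_ACTION_TYPE_MARK,  .conf = &mark },
                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
                { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, err);
}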

/* Create a flow rule on a given port. */
struct rte_flow *
rte_flow_create(uint16_t port_id,
                const struct rte_flow_attr *attr,
                const struct rte_flow_item pattern[],
                const struct rte_flow_action actions[],
                struct rte_flow_error *error)
{
        struct rte_eth_dev *dev = &rte_eth_devices[port_id];
        struct rte_flow *flow;
        const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);

        if (unlikely(!ops))
                return NULL;
        if (likely(!!ops->create)) {
                flow = ops->create(dev, attr, pattern, actions, error);
                if (flow == NULL)
                        flow_err(port_id, -rte_errno, error);
                return flow;
        }
        rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
                           NULL, rte_strerror(ENOSYS));
        return NULL;
}


/* Get generic flow operations structure from a port. */
const struct rte_flow_ops *
rte_flow_ops_get(uint16_t port_id, struct rte_flow_error *error)
{
        struct rte_eth_dev *dev = &rte_eth_devices[port_id];
        const struct rte_flow_ops *ops;
        int code;

        if (unlikely(!rte_eth_dev_is_valid_port(port_id)))
                code = ENODEV;
        else if (unlikely(!dev->dev_ops->filter_ctrl ||
                          dev->dev_ops->filter_ctrl(dev,
                                                    RTE_ETH_FILTER_GENERIC,
                                                    RTE_ETH_FILTER_GET,
                                                    &ops) ||
                          !ops))
                code = ENOSYS;
        else
                return ops;
        rte_flow_error_set(error, code, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
                           NULL, rte_strerror(code));
        return NULL;
}


 

mlx5_dev_filter_ctrl

mlx5_dev_filter_ctrl(struct rte_eth_dev *dev,
                     enum rte_filter_type filter_type,
                     enum rte_filter_op filter_op,
                     void *arg)
{
        switch (filter_type) {
        case RTE_ETH_FILTER_GENERIC:
                if (filter_op != RTE_ETH_FILTER_GET) {
                        rte_errno = EINVAL;
                        return -rte_errno;
                }
                *(const void **)arg = &mlx5_flow_ops;
                return 0;
        case RTE_ETH_FILTER_FDIR:
                return flow_fdir_ctrl_func(dev, filter_op, arg);
        default:
                DRV_LOG(ERR, "port %u filter type (%d) not supported",
                        dev->data->port_id, filter_type);
                rte_errno = ENOTSUP;
                return -rte_errno;
        }
        return 0;
}

 .filter_ctrl = mlx5_dev_filter_ctrl,

 

 

 

mlx5_pci_probe
    --> mlx5_sysfs_switch_info

int
mlx5_sysfs_switch_info(unsigned int ifindex, struct mlx5_switch_info *info)
{
        char ifname[IF_NAMESIZE];
        char port_name[IF_NAMESIZE];
        FILE *file;
        struct mlx5_switch_info data = {
                .master = 0,
                .representor = 0,
                .name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET,
                .port_name = 0,
                .switch_id = 0,
        };
        DIR *dir;
        bool port_switch_id_set = false;
        bool device_dir = false;
        char c;
        int ret;

        if (!if_indextoname(ifindex, ifname)) {
                rte_errno = errno;
                return -rte_errno;
        }

        MKSTR(phys_port_name, "/sys/class/net/%s/phys_port_name",
              ifname);
        MKSTR(phys_switch_id, "/sys/class/net/%s/phys_switch_id",
              ifname);
        MKSTR(pci_device, "/sys/class/net/%s/device",
              ifname);

        file = fopen(phys_port_name, "rb");
        if (file != NULL) {
                ret = fscanf(file, "%s", port_name);
                fclose(file);
                if (ret == 1)
                        mlx5_translate_port_name(port_name, &data);
        }
        file = fopen(phys_switch_id, "rb");
        if (file == NULL) {
                rte_errno = errno;
                return -rte_errno;
        }
        port_switch_id_set =
                fscanf(file, "%" SCNx64 "%c", &data.switch_id, &c) == 2 &&
                c == '\n';
        fclose(file);
        dir = opendir(pci_device);
        if (dir != NULL) {
                closedir(dir);
                device_dir = true;
        }
        if (port_switch_id_set) {
                /* We have some E-Switch configuration. */
                mlx5_sysfs_check_switch_info(device_dir, &data);
        }
        *info = data;
        assert(!(data.master && data.representor));
        if (data.master && data.representor) {
                DRV_LOG(ERR, "ifindex %u device is recognized as master"
                             " and as representor", ifindex);
                rte_errno = ENODEV;
                return -rte_errno;
        }
        return 0;
}

 

 

struct mlx5_priv

Each vport corresponds to one mlx5_priv.

struct mlx5_priv {
        struct rte_eth_dev_data *dev_data;  /* Pointer to device data. */
        struct mlx5_ibv_shared *sh; /* Shared IB device context. */
        uint32_t ibv_port; /* IB device port number. */
        struct rte_pci_device *pci_dev; /* Backend PCI device. */
        struct rte_ether_addr mac[MLX5_MAX_MAC_ADDRESSES]; /* MAC addresses. */
        BITFIELD_DECLARE(mac_own, uint64_t, MLX5_MAX_MAC_ADDRESSES);
        /* Bit-field of MAC addresses owned by the PMD. */
        uint16_t vlan_filter[MLX5_MAX_VLAN_IDS]; /* VLAN filters table. */
        unsigned int vlan_filter_n; /* Number of configured VLAN filters. */
        /* Device properties. */
        uint16_t mtu; /* Configured MTU. */
        unsigned int isolated:1; /* Whether isolated mode is enabled. */
        unsigned int representor:1; /* Device is a port representor. */
        unsigned int master:1; /* Device is a E-Switch master. */
        unsigned int dr_shared:1; /* DV/DR data is shared. */
        unsigned int counter_fallback:1; /* Use counter fallback management. */
        unsigned int mtr_en:1; /* Whether support meter. */
        uint16_t domain_id; /* Switch domain identifier. */
        uint16_t vport_id; /* Associated VF vport index (if any). */
        uint32_t vport_meta_tag; /* Used for vport index match ove VF LAG. */
        uint32_t vport_meta_mask; /* Used for vport index field match mask. */
        int32_t representor_id; /* Port representor identifier. */
        int32_t pf_bond; /* >=0 means PF index in bonding configuration. */
}


struct mlx5_priv *
mlx5_dev_to_eswitch_info(struct rte_eth_dev *dev)
{
        struct mlx5_priv *priv;

        priv = dev->data->dev_private;
        if (!(priv->representor || priv->master)) {
                rte_errno = EINVAL;
                return NULL;
        }
        return priv;
}



 

static struct rte_eth_dev *
mlx5_dev_spawn(struct rte_device *dpdk_dev,
           struct mlx5_dev_spawn_data *spawn,
           struct mlx5_dev_config config)
{
    const struct mlx5_switch_info *switch_info = &spawn->info;
    struct mlx5_ibv_shared *sh = NULL;
    struct ibv_port_attr port_attr;
    struct mlx5dv_context dv_attr = { .comp_mask = 0 };
    struct rte_eth_dev *eth_dev = NULL;
    struct mlx5_priv *priv = NULL;
    int err = 0;
    unsigned int hw_padding = 0;
    unsigned int mps;
    unsigned int cqe_comp;
    unsigned int cqe_pad = 0;
    unsigned int tunnel_en = 0;
    unsigned int mpls_en = 0;
    unsigned int swp = 0;
    unsigned int mprq = 0;
    unsigned int mprq_min_stride_size_n = 0;
    unsigned int mprq_max_stride_size_n = 0;
    unsigned int mprq_min_stride_num_n = 0;
    unsigned int mprq_max_stride_num_n = 0;
    struct rte_ether_addr mac;
    char name[RTE_ETH_NAME_MAX_LEN];
    int own_domain_id = 0;
    uint16_t port_id;
    unsigned int i;

    /* Determine if this port representor is supposed to be spawned. */
    if (switch_info->representor && dpdk_dev->devargs) {
        struct rte_eth_devargs eth_da;

        err = rte_eth_devargs_parse(dpdk_dev->devargs->args, &eth_da);
        if (err) {
            rte_errno = -err;
            DRV_LOG(ERR, "failed to process device arguments: %s",
                strerror(rte_errno));
            return NULL;
        }
        for (i = 0; i < eth_da.nb_representor_ports; ++i)
            if (eth_da.representor_ports[i] ==
                (uint16_t)switch_info->port_name)
                break;
        if (i == eth_da.nb_representor_ports) {
            rte_errno = EBUSY;
            return NULL;
        }
    }
    /* Build device name. */
    if (!switch_info->representor)
        strlcpy(name, dpdk_dev->name, sizeof(name));
    else
        snprintf(name, sizeof(name), "%s_representor_%u",
             dpdk_dev->name, switch_info->port_name);
    /* check if the device is already spawned */
    if (rte_eth_dev_get_port_by_name(name, &port_id) == 0) {
        rte_errno = EEXIST;
        return NULL;
    }
    DRV_LOG(DEBUG, "naming Ethernet device \"%s\"", name);
    if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
        eth_dev = rte_eth_dev_attach_secondary(name);
        if (eth_dev == NULL) {
            DRV_LOG(ERR, "can not attach rte ethdev");
            rte_errno = ENOMEM;
            return NULL;
        }
        eth_dev->device = dpdk_dev;
        eth_dev->dev_ops = &mlx5_dev_sec_ops;
        err = mlx5_proc_priv_init(eth_dev);
        if (err)
            return NULL;
        /* Receive command fd from primary process */
        err = mlx5_mp_req_verbs_cmd_fd(eth_dev);
        if (err < 0)
            return NULL;
        /* Remap UAR for Tx queues. */
        err = mlx5_tx_uar_init_secondary(eth_dev, err);
        if (err)
            return NULL;
        /*
         * Ethdev pointer is still required as input since
         * the primary device is not accessible from the
         * secondary process.
         */
        eth_dev->rx_pkt_burst = mlx5_select_rx_function(eth_dev);
        eth_dev->tx_pkt_burst = mlx5_select_tx_function(eth_dev);
        return eth_dev;
    }
    sh = mlx5_alloc_shared_ibctx(spawn);
    if (!sh)
        return NULL;
    config.devx = sh->devx;
#ifdef HAVE_IBV_MLX5_MOD_SWP
    dv_attr.comp_mask |= MLX5DV_CONTEXT_MASK_SWP;
#endif
    /*
     * Multi-packet send is supported by ConnectX-4 Lx PF as well
     * as all ConnectX-5 devices.
     */
#ifdef HAVE_IBV_DEVICE_TUNNEL_SUPPORT
    dv_attr.comp_mask |= MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS;
#endif
#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
    dv_attr.comp_mask |= MLX5DV_CONTEXT_MASK_STRIDING_RQ;
#endif
    mlx5_glue->dv_query_device(sh->ctx, &dv_attr);
    if (dv_attr.flags & MLX5DV_CONTEXT_FLAGS_MPW_ALLOWED) {
        if (dv_attr.flags & MLX5DV_CONTEXT_FLAGS_ENHANCED_MPW) {
            DRV_LOG(DEBUG, "enhanced MPW is supported");
            mps = MLX5_MPW_ENHANCED;
        } else {
            DRV_LOG(DEBUG, "MPW is supported");
            mps = MLX5_MPW;
        }
    } else {
        DRV_LOG(DEBUG, "MPW isn't supported");
        mps = MLX5_MPW_DISABLED;
    }
#ifdef HAVE_IBV_MLX5_MOD_SWP
    if (dv_attr.comp_mask & MLX5DV_CONTEXT_MASK_SWP)
        swp = dv_attr.sw_parsing_caps.sw_parsing_offloads;
    DRV_LOG(DEBUG, "SWP support: %u", swp);
#endif
    config.swp = !!swp;
#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
    if (dv_attr.comp_mask & MLX5DV_CONTEXT_MASK_STRIDING_RQ) {
        struct mlx5dv_striding_rq_caps mprq_caps =
            dv_attr.striding_rq_caps;

        DRV_LOG(DEBUG, "\tmin_single_stride_log_num_of_bytes: %d",
            mprq_caps.min_single_stride_log_num_of_bytes);
        DRV_LOG(DEBUG, "\tmax_single_stride_log_num_of_bytes: %d",
            mprq_caps.max_single_stride_log_num_of_bytes);
        DRV_LOG(DEBUG, "\tmin_single_wqe_log_num_of_strides: %d",
            mprq_caps.min_single_wqe_log_num_of_strides);
        DRV_LOG(DEBUG, "\tmax_single_wqe_log_num_of_strides: %d",
            mprq_caps.max_single_wqe_log_num_of_strides);
        DRV_LOG(DEBUG, "\tsupported_qpts: %d",
            mprq_caps.supported_qpts);
        DRV_LOG(DEBUG, "device supports Multi-Packet RQ");
        mprq = 1;
        mprq_min_stride_size_n =
            mprq_caps.min_single_stride_log_num_of_bytes;
        mprq_max_stride_size_n =
            mprq_caps.max_single_stride_log_num_of_bytes;
        mprq_min_stride_num_n =
            mprq_caps.min_single_wqe_log_num_of_strides;
        mprq_max_stride_num_n =
            mprq_caps.max_single_wqe_log_num_of_strides;
        config.mprq.stride_num_n = RTE_MAX(MLX5_MPRQ_STRIDE_NUM_N,
                           mprq_min_stride_num_n);
    }
#endif
    if (RTE_CACHE_LINE_SIZE == 128 &&
        !(dv_attr.flags & MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP))
        cqe_comp = 0;
    else
        cqe_comp = 1;
    config.cqe_comp = cqe_comp;
#ifdef HAVE_IBV_MLX5_MOD_CQE_128B_PAD
    /* Whether device supports 128B Rx CQE padding. */
    cqe_pad = RTE_CACHE_LINE_SIZE == 128 &&
          (dv_attr.flags & MLX5DV_CONTEXT_FLAGS_CQE_128B_PAD);
#endif
#ifdef HAVE_IBV_DEVICE_TUNNEL_SUPPORT
    if (dv_attr.comp_mask & MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS) {
        tunnel_en = ((dv_attr.tunnel_offloads_caps &
                  MLX5DV_RAW_PACKET_CAP_TUNNELED_OFFLOAD_VXLAN) &&
                 (dv_attr.tunnel_offloads_caps &
                  MLX5DV_RAW_PACKET_CAP_TUNNELED_OFFLOAD_GRE));
    }
    DRV_LOG(DEBUG, "tunnel offloading is %ssupported",
        tunnel_en ? "" : "not ");
#else
    DRV_LOG(WARNING,
        "tunnel offloading disabled due to old OFED/rdma-core version");
#endif
    config.tunnel_en = tunnel_en;
#ifdef HAVE_IBV_DEVICE_MPLS_SUPPORT
    mpls_en = ((dv_attr.tunnel_offloads_caps &
            MLX5DV_RAW_PACKET_CAP_TUNNELED_OFFLOAD_CW_MPLS_OVER_GRE) &&
           (dv_attr.tunnel_offloads_caps &
            MLX5DV_RAW_PACKET_CAP_TUNNELED_OFFLOAD_CW_MPLS_OVER_UDP));
    DRV_LOG(DEBUG, "MPLS over GRE/UDP tunnel offloading is %ssupported",
        mpls_en ? "" : "not ");
#else
    DRV_LOG(WARNING, "MPLS over GRE/UDP tunnel offloading disabled due to"
        " old OFED/rdma-core version or firmware configuration");
#endif
    config.mpls_en = mpls_en;
    /* Check port status. */
    err = mlx5_glue->query_port(sh->ctx, spawn->ibv_port, &port_attr);
    if (err) {
        DRV_LOG(ERR, "port query failed: %s", strerror(err));
        goto error;
    }
    if (port_attr.link_layer != IBV_LINK_LAYER_ETHERNET) {
        DRV_LOG(ERR, "port is not configured in Ethernet mode");
        err = EINVAL;
        goto error;
    }
    if (port_attr.state != IBV_PORT_ACTIVE)
        DRV_LOG(DEBUG, "port is not active: \"%s\" (%d)",
            mlx5_glue->port_state_str(port_attr.state),
            port_attr.state);
    /* Allocate private eth device data. */
    priv = rte_zmalloc("ethdev private structure",
               sizeof(*priv),
               RTE_CACHE_LINE_SIZE);
    if (priv == NULL) {
        DRV_LOG(ERR, "priv allocation failure");
        err = ENOMEM;
        goto error;
    }
    priv->sh = sh;
    priv->ibv_port = spawn->ibv_port;
    priv->mtu = RTE_ETHER_MTU;
#ifndef RTE_ARCH_64
    /* Initialize UAR access locks for 32bit implementations. */
    rte_spinlock_init(&priv->uar_lock_cq);
    for (i = 0; i < MLX5_UAR_PAGE_NUM_MAX; i++)
        rte_spinlock_init(&priv->uar_lock[i]);
#endif
    /* Some internal functions rely on Netlink sockets, open them now. */
    priv->nl_socket_rdma = mlx5_nl_init(NETLINK_RDMA);
    priv->nl_socket_route = mlx5_nl_init(NETLINK_ROUTE);
    priv->nl_sn = 0;
    priv->representor = !!switch_info->representor;
    priv->master = !!switch_info->master;
    priv->domain_id = RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID;
    /*
     * Currently we support single E-Switch per PF configurations
     * only and vport_id field contains the vport index for
     * associated VF, which is deduced from representor port name.
     * For example, let's have the IB device port 10, it has
     * attached network device eth0, which has port name attribute
     * pf0vf2, we can deduce the VF number as 2, and set vport index
     * as 3 (2+1). This assigning schema should be changed if the
     * multiple E-Switch instances per PF configurations or/and PCI
     * subfunctions are added.
     */
    priv->vport_id = switch_info->representor ?
                 switch_info->port_name + 1 : -1;
    /* representor_id field keeps the unmodified port/VF index. */
    priv->representor_id = switch_info->representor ?
                   switch_info->port_name : -1;
    /*
     * Look for sibling devices in order to reuse their switch domain
     * if any, otherwise allocate one.
     */
    RTE_ETH_FOREACH_DEV_OF(port_id, dpdk_dev) {
        const struct mlx5_priv *opriv =
            rte_eth_devices[port_id].data->dev_private;

        if (!opriv ||
            opriv->domain_id ==
            RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID)
            continue;
        priv->domain_id = opriv->domain_id;
        break;
    }
    if (priv->domain_id == RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
        err = rte_eth_switch_domain_alloc(&priv->domain_id);
        if (err) {
            err = rte_errno;
            DRV_LOG(ERR, "unable to allocate switch domain: %s",
                strerror(rte_errno));
            goto error;
        }
        own_domain_id = 1;
    }
    err = mlx5_args(&config, dpdk_dev->devargs);
    if (err) {
        err = rte_errno;
        DRV_LOG(ERR, "failed to process device arguments: %s",
            strerror(rte_errno));
        goto error;
    }
    config.hw_csum = !!(sh->device_attr.device_cap_flags_ex &
                IBV_DEVICE_RAW_IP_CSUM);
    DRV_LOG(DEBUG, "checksum offloading is %ssupported",
        (config.hw_csum ? "" : "not "));
#if !defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42) && \
    !defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
    DRV_LOG(DEBUG, "counters are not supported");
#endif
#ifndef HAVE_IBV_FLOW_DV_SUPPORT
    if (config.dv_flow_en) {
        DRV_LOG(WARNING, "DV flow is not supported");
        config.dv_flow_en = 0;
    }
#endif
    config.ind_table_max_size =
        sh->device_attr.rss_caps.max_rwq_indirection_table_size;
    /*
     * Remove this check once DPDK supports larger/variable
     * indirection tables.
     */
    if (config.ind_table_max_size > (unsigned int)ETH_RSS_RETA_SIZE_512)
        config.ind_table_max_size = ETH_RSS_RETA_SIZE_512;
    DRV_LOG(DEBUG, "maximum Rx indirection table size is %u",
        config.ind_table_max_size);
    config.hw_vlan_strip = !!(sh->device_attr.raw_packet_caps &
                  IBV_RAW_PACKET_CAP_CVLAN_STRIPPING);
    DRV_LOG(DEBUG, "VLAN stripping is %ssupported",
        (config.hw_vlan_strip ? "" : "not "));
    config.hw_fcs_strip = !!(sh->device_attr.raw_packet_caps &
                 IBV_RAW_PACKET_CAP_SCATTER_FCS);
    DRV_LOG(DEBUG, "FCS stripping configuration is %ssupported",
        (config.hw_fcs_strip ? "" : "not "));
#if defined(HAVE_IBV_WQ_FLAG_RX_END_PADDING)
    hw_padding = !!sh->device_attr.rx_pad_end_addr_align;
#elif defined(HAVE_IBV_WQ_FLAGS_PCI_WRITE_END_PADDING)
    hw_padding = !!(sh->device_attr.device_cap_flags_ex &
            IBV_DEVICE_PCI_WRITE_END_PADDING);
#endif
    if (config.hw_padding && !hw_padding) {
        DRV_LOG(DEBUG, "Rx end alignment padding isn't supported");
        config.hw_padding = 0;
    } else if (config.hw_padding) {
        DRV_LOG(DEBUG, "Rx end alignment padding is enabled");
    }
    config.tso = (sh->device_attr.tso_caps.max_tso > 0 &&
              (sh->device_attr.tso_caps.supported_qpts &
               (1 << IBV_QPT_RAW_PACKET)));
    if (config.tso)
        config.tso_max_payload_sz = sh->device_attr.tso_caps.max_tso;
    /*
     * MPW is disabled by default, while the Enhanced MPW is enabled
     * by default.
     */
    if (config.mps == MLX5_ARG_UNSET)
        config.mps = (mps == MLX5_MPW_ENHANCED) ? MLX5_MPW_ENHANCED :
                              MLX5_MPW_DISABLED;
    else
        config.mps = config.mps ? mps : MLX5_MPW_DISABLED;
    DRV_LOG(INFO, "%sMPS is %s",
        config.mps == MLX5_MPW_ENHANCED ? "enhanced " : "",
        config.mps != MLX5_MPW_DISABLED ? "enabled" : "disabled");
    if (config.cqe_comp && !cqe_comp) {
        DRV_LOG(WARNING, "Rx CQE compression isn't supported");
        config.cqe_comp = 0;
    }
    if (config.cqe_pad && !cqe_pad) {
        DRV_LOG(WARNING, "Rx CQE padding isn't supported");
        config.cqe_pad = 0;
    } else if (config.cqe_pad) {
        DRV_LOG(INFO, "Rx CQE padding is enabled");
    }
    if (config.mprq.enabled && mprq) {
        if (config.mprq.stride_num_n > mprq_max_stride_num_n ||
            config.mprq.stride_num_n < mprq_min_stride_num_n) {
            config.mprq.stride_num_n =
                RTE_MAX(MLX5_MPRQ_STRIDE_NUM_N,
                    mprq_min_stride_num_n);
            DRV_LOG(WARNING,
                "the number of strides"
                " for Multi-Packet RQ is out of range,"
                " setting default value (%u)",
                1 << config.mprq.stride_num_n);
        }
        config.mprq.min_stride_size_n = mprq_min_stride_size_n;
        config.mprq.max_stride_size_n = mprq_max_stride_size_n;
    } else if (config.mprq.enabled && !mprq) {
        DRV_LOG(WARNING, "Multi-Packet RQ isn't supported");
        config.mprq.enabled = 0;
    }
    if (config.max_dump_files_num == 0)
        config.max_dump_files_num = 128;
    eth_dev = rte_eth_dev_allocate(name);
    if (eth_dev == NULL) {
        DRV_LOG(ERR, "can not allocate rte ethdev");
        err = ENOMEM;
        goto error;
    }
    /* Flag to call rte_eth_dev_release_port() in rte_eth_dev_close(). */
    eth_dev->data->dev_flags |= RTE_ETH_DEV_CLOSE_REMOVE;
    if (priv->representor) {
        eth_dev->data->dev_flags |= RTE_ETH_DEV_REPRESENTOR;
        eth_dev->data->representor_id = priv->representor_id;
    }
    eth_dev->data->dev_private = priv;
    priv->dev_data = eth_dev->data;
    eth_dev->data->mac_addrs = priv->mac;
    eth_dev->device = dpdk_dev;
    /* Configure the first MAC address by default. */
    if (mlx5_get_mac(eth_dev, &mac.addr_bytes)) {
        DRV_LOG(ERR,
            "port %u cannot get MAC address, is mlx5_en"
            " loaded? (errno: %s)",
            eth_dev->data->port_id, strerror(rte_errno));
        err = ENODEV;
        goto error;
    }
    DRV_LOG(INFO,
        "port %u MAC address is %02x:%02x:%02x:%02x:%02x:%02x",
        eth_dev->data->port_id,
        mac.addr_bytes[0], mac.addr_bytes[1],
        mac.addr_bytes[2], mac.addr_bytes[3],
        mac.addr_bytes[4], mac.addr_bytes[5]);
#ifndef NDEBUG
    {
        char ifname[IF_NAMESIZE];

        if (mlx5_get_ifname(eth_dev, &ifname) == 0)
            DRV_LOG(DEBUG, "port %u ifname is \"%s\"",
                eth_dev->data->port_id, ifname);
        else
            DRV_LOG(DEBUG, "port %u ifname is unknown",
                eth_dev->data->port_id);
    }
#endif
    /* Get actual MTU if possible. */
    err = mlx5_get_mtu(eth_dev, &priv->mtu);
    if (err) {
        err = rte_errno;
        goto error;
    }
    DRV_LOG(DEBUG, "port %u MTU is %u", eth_dev->data->port_id,
        priv->mtu);
    /* Initialize burst functions to prevent crashes before link-up. */
    eth_dev->rx_pkt_burst = removed_rx_burst;
    eth_dev->tx_pkt_burst = removed_tx_burst;
    eth_dev->dev_ops = &mlx5_dev_ops;
    /* Register MAC address. */
    claim_zero(mlx5_mac_addr_add(eth_dev, &mac, 0, 0));
    if (config.vf && config.vf_nl_en)
        mlx5_nl_mac_addr_sync(eth_dev);
    priv->tcf_context = mlx5_flow_tcf_context_create();
    if (!priv->tcf_context) {
        err = -rte_errno;
        DRV_LOG(WARNING,
            "flow rules relying on switch offloads will not be"
            " supported: cannot open libmnl socket: %s",
            strerror(rte_errno));
    } else {
        struct rte_flow_error error;
        unsigned int ifindex = mlx5_ifindex(eth_dev);

        if (!ifindex) {
            err = -rte_errno;
            error.message =
                "cannot retrieve network interface index";
        } else {
            err = mlx5_flow_tcf_init(priv->tcf_context,
                         ifindex, &error);
        }
        if (err) {
            DRV_LOG(WARNING,
                "flow rules relying on switch offloads will"
                " not be supported: %s: %s",
                error.message, strerror(rte_errno));
            mlx5_flow_tcf_context_destroy(priv->tcf_context);
            priv->tcf_context = NULL;
        }
    }
    TAILQ_INIT(&priv->flows);
    TAILQ_INIT(&priv->ctrl_flows);
    /* Hint libmlx5 to use PMD allocator for data plane resources */
    struct mlx5dv_ctx_allocators alctr = {
        .alloc = &mlx5_alloc_verbs_buf,
        .free = &mlx5_free_verbs_buf,
        .data = priv,
    };
    mlx5_glue->dv_set_context_attr(sh->ctx,
                       MLX5DV_CTX_ATTR_BUF_ALLOCATORS,
                       (void *)((uintptr_t)&alctr));
    /* Bring Ethernet device up. */
    DRV_LOG(DEBUG, "port %u forcing Ethernet interface up",
        eth_dev->data->port_id);
    mlx5_set_link_up(eth_dev);
    /*
     * Even though the interrupt handler is not installed yet,
     * interrupts will still trigger on the async_fd from
     * Verbs context returned by ibv_open_device().
     */
    mlx5_link_update(eth_dev, 0);
#ifdef HAVE_IBV_DEVX_OBJ
    if (config.devx) {
        err = mlx5_devx_cmd_query_hca_attr(sh->ctx, &config.hca_attr);
        if (err) {
            err = -err;
            goto error;
        }
    }
#endif
#ifdef HAVE_MLX5DV_DR_ESWITCH
    if (!(config.hca_attr.eswitch_manager && config.dv_flow_en &&
          (switch_info->representor || switch_info->master)))
        config.dv_esw_en = 0;
#else
    config.dv_esw_en = 0;
#endif
    /* Store device configuration on private structure. */
    priv->config = config;
    if (config.dv_flow_en) {
        err = mlx5_alloc_shared_dr(priv);
        if (err)
            goto error;
    }
    /* Supported Verbs flow priority number detection. */
    err = mlx5_flow_discover_priorities(eth_dev);
    if (err < 0) {
        err = -err;
        goto error;
    }
    priv->config.flow_prio = err;
    /* Add device to memory callback list. */
    rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
    LIST_INSERT_HEAD(&mlx5_shared_data->mem_event_cb_list,
             sh, mem_event_cb);
    rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
    return eth_dev;
error:
    if (priv) {
        if (priv->sh)
            mlx5_free_shared_dr(priv);
        if (priv->nl_socket_route >= 0)
            close(priv->nl_socket_route);
        if (priv->nl_socket_rdma >= 0)
            close(priv->nl_socket_rdma);
        if (priv->tcf_context)
            mlx5_flow_tcf_context_destroy(priv->tcf_context);
        if (own_domain_id)
            claim_zero(rte_eth_switch_domain_free(priv->domain_id));
        rte_free(priv);
        if (eth_dev != NULL)
            eth_dev->data->dev_private = NULL;
    }
    if (eth_dev != NULL) {
        /* mac_addrs must not be freed alone because part of dev_private */
        eth_dev->data->mac_addrs = NULL;
        rte_eth_dev_release_port(eth_dev);
    }
    if (sh)
        mlx5_free_shared_ibctx(sh);
    assert(err > 0);
    rte_errno = err;
    return NULL;
}

 

A gdb backtrace (quoted from a mailing-list discussion) showing the DV flow table creation path taken during mlx5 PCI probe: rte_eal_init() → mlx5_pci_probe() → mlx5_dev_spawn() → mlx5_flow_discover_mreg_c() → flow_list_create() → flow_dv_translate() → mlx5dv_dr_table_create():

#0  mlx5dv_dr_table_create (dmn=0x555556c641b0, level=65534)
    at ../providers/mlx5/dr_table.c:183
#1  0x0000555555dfaeaa in flow_dv_tbl_resource_get (dev=<optimized out>,
    table_id=65534, egress=<optimized out>, transfer=<optimized out>,
    error=0x7fffffffdca0)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow_dv.c:6746
#2  0x0000555555e02b28 in __flow_dv_translate (dev=dev@entry=0x555556bbcdc0
    <rte_eth_devices>, dev_flow=0x100388300, attr=<optimized out>,
    items=<optimized out>, actions=<optimized out>, error=<optimized out>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow_dv.c:7503
#3  0x0000555555e04954 in flow_dv_translate (dev=0x555556bbcdc0
    <rte_eth_devices>, dev_flow=<optimized out>, attr=<optimized out>,
    items=<optimized out>, actions=<optimized out>, error=<optimized out>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow_dv.c:8841
#4  0x0000555555df152f in flow_drv_translate (error=0x7fffffffdca0,
    actions=0x7fffffffdce0, items=0x7fffffffdcc0, attr=0x7fffffffbb88,
    dev_flow=<optimized out>, dev=0x555556bbcdc0 <rte_eth_devices>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:2571
#5  flow_create_split_inner (error=0x7fffffffdca0, external=false,
    actions=0x7fffffffdce0, items=0x7fffffffdcc0, attr=0x7fffffffbb88,
    prefix_layers=0, sub_flow=0x0, flow=0x1003885c0,
    dev=0x555556bbcdc0 <rte_eth_devices>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:3490
#6  flow_create_split_metadata (error=0x7fffffffdca0, external=false,
    actions=0x7fffffffdce0, items=0x7fffffffdcc0, attr=0x7fffffffbb88,
    prefix_layers=0, flow=0x1003885c0, dev=0x555556bbcdc0 <rte_eth_devices>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:3865
#7  flow_create_split_meter (error=0x7fffffffdca0, external=false,
    actions=0x7fffffffdce0, items=<optimized out>, attr=0x7fffffffdc94,
    flow=0x1003885c0, dev=0x555556bbcdc0 <rte_eth_devices>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:4121
#8  flow_create_split_outer (error=0x7fffffffdca0, external=false,
    actions=0x7fffffffdce0, items=<optimized out>, attr=0x7fffffffdc94,
    flow=0x1003885c0, dev=0x555556bbcdc0 <rte_eth_devices>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:4178
#9  flow_list_create (dev=dev@entry=0x555556bbcdc0 <rte_eth_devices>,
    list=list@entry=0x0, attr=attr@entry=0x7fffffffdc94,
    items=items@entry=0x7fffffffdcc0, actions=actions@entry=0x7fffffffdce0,
    external=external@entry=false, error=0x7fffffffdca0)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:4306
#10 0x0000555555df8587 in mlx5_flow_discover_mreg_c
    (dev=dev@entry=0x555556bbcdc0 <rte_eth_devices>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5_flow.c:5747
#11 0x0000555555d692a6 in mlx5_dev_spawn (config=..., spawn=0x1003e9e00,
    dpdk_dev=0x555556dd6fe0)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5.c:2763
#12 mlx5_pci_probe (pci_drv=<optimized out>, pci_dev=<optimized out>)
    at /home/bganne/src/dpdk/drivers/net/mlx5/mlx5.c:3363
#13 0x0000555555a411c8 in pci_probe_all_drivers ()
#14 0x0000555555a412f8 in rte_pci_probe ()
#15 0x0000555555a083da in rte_bus_probe ()
#16 0x00005555559f204d in rte_eal_init ()
#17 0x00005555556a0d45 in main ()

 

 netdev-offload-dpdk

 https://github.com/Mellanox/OVS/blob/master/lib/netdev-offload-dpdk.c

 

const struct netdev_flow_api netdev_offload_dpdk = {
    .type = "dpdk_flow_api",
    .flow_put = netdev_offload_dpdk_flow_put,
    .flow_del = netdev_offload_dpdk_flow_del,
    .init_flow_api = netdev_offload_dpdk_init_flow_api,
};
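The flow_put callback is the entry point OVS uses to install a datapath flow in hardware. Below is a rough, simplified sketch (not the real code in lib/netdev-offload-dpdk.c) of its general shape: the OVS match and actions are first translated into an rte_flow attr/pattern/actions triple, which is then handed to the PMD through rte_flow_create(); the returned handle is kept so that flow_del can later call rte_flow_destroy().

#include <rte_flow.h>

/* Simplified sketch: install one already-translated rule on a DPDK port. */
static struct rte_flow *
offload_dpdk_install_rule(uint16_t dpdk_port_id,
                          const struct rte_flow_attr *attr,
                          const struct rte_flow_item *pattern,
                          const struct rte_flow_action *actions)
{
    struct rte_flow_error error;
    struct rte_flow *flow;

    /* attr->transfer selects the NIC e-switch (full offload); plain
     * attr->ingress rules only steer packets to Rx queues. */
    flow = rte_flow_create(dpdk_port_id, attr, pattern, actions, &error);
    if (flow == NULL) {
        /* On failure OVS keeps handling the flow in the software datapath. */
        return NULL;
    }
    return flow;   /* stored by the caller; freed via rte_flow_destroy(). */
}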

 

netdev_offload_tc

const struct netdev_flow_api netdev_offload_tc = {
   .type = "linux_tc",
   .flow_flush = netdev_tc_flow_flush,
   .flow_dump_create = netdev_tc_flow_dump_create,
   .flow_dump_destroy = netdev_tc_flow_dump_destroy,
   .flow_dump_next = netdev_tc_flow_dump_next,
   .flow_put = netdev_tc_flow_put,
   .flow_get = netdev_tc_flow_get,
   .flow_del = netdev_tc_flow_del,
   .init_flow_api = netdev_tc_init_flow_api,
};
netdev_tc_flow_put
        tc_replace_flower    
            nl_msg_put_flower_options
               tc_get_tc_cls_policy
static int
tc_get_tc_cls_policy(enum tc_offload_policy policy)
{
    if (policy == TC_POLICY_SKIP_HW) {
        return TCA_CLS_FLAGS_SKIP_HW;
    } else if (policy == TC_POLICY_SKIP_SW) {
        return TCA_CLS_FLAGS_SKIP_SW;
    }

    return 0;
}            
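The flag returned here ends up in the flower classifier's TCA_FLOWER_FLAGS netlink attribute. With the default policy neither flag is set and the kernel may install the rule in both hardware and software; TC_POLICY_SKIP_SW requests hardware-only offload, while TC_POLICY_SKIP_HW keeps the rule in software. In OVS this is selected with other_config:tc-policy (none | skip_sw | skip_hw). A rough sketch of how nl_msg_put_flower_options() emits the flag (the exact call site in lib/tc.c may differ):

    /* Illustrative only: attach the offload policy to the flower rule. */
    nl_msg_put_u32(request, TCA_FLOWER_FLAGS, tc_get_tc_cls_policy(policy));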

 

Introduction to the Generic Flow API

Classification

Classification means that, when receiving packets, the NIC steers packets matching a given rule into a designated queue. A NIC generally supports one or more classification features. The Intel 700-series NICs, for example, support MAC/VLAN filters, Ethertype filters, cloud filters, flow director, and so on. Different NICs may support different kinds of filters: the Intel 82599-series NICs support n-tuple, L2_tunnel, flow director, etc. And even though both the 82599 series and the 700 series support flow director, they implement it in different ways. So how does DPDK support the filters of all these different NICs?

The existing filter APIs

DPDK defines the filter types of every NIC and the basic attributes of each filter, and exposes the corresponding interfaces to applications, so users are expected to understand the filter attributes to some degree. Taking flow director as an example, two of its data structures are:

/**
 * A structure used to contain extend input of flow
 */
struct rte_eth_fdir_flow_ext {
        uint16_t vlan_tci;
        uint8_t flexbytes[RTE_ETH_FDIR_MAX_FLEXLEN];
        /**< It is filled by the flexible payload to match. */
        uint8_t is_vf;   /**< 1 for VF, 0 for port dev */
        uint16_t dst_id; /**< VF ID, available when is_vf is 1*/
};
/**
 * A structure used to define the input for a flow director filter entry
 */
struct rte_eth_fdir_input {
        uint16_t flow_type;
        union rte_eth_fdir_flow flow;
        /**< Flow fields to match, dependent on flow_type */
        struct rte_eth_fdir_flow_ext flow_ext;
        /**< Additional fields to match */
};

When adding a flow director rule, the user must fill in the information above. Not every NIC cares about flow_type and is_vf, however; the Intel 82599-series NICs, for example, need neither flow_type nor is_vf. Other filter types have similar quirks.
This shows that the existing scheme has serious shortcomings. First, DPDK abstracts one unified set of attributes for all NICs, yet some attributes are meaningful only to a single NIC. Second, as DPDK supports more and more NICs, the number of filter types it must define keeps growing, and every NIC filter upgrade requires matching changes in DPDK, which easily breaks the API/ABI.
From the application's point of view the scheme is also inconvenient, making the API hard to use and unfriendly: optional or defaultable attributes easily confuse users; NIC-specific attributes are often inserted arbitrarily into a filter type; and the design is complex, with no detailed documentation.
For these reasons, a generic flow API became indispensable.

Generic flow API

DPDK v17.02 introduced the generic flow API (rte_flow), which abstracts a flow rule into two parts: a pattern and a list of actions.
A pattern consists of a number of items. Items are mostly protocol-related, supporting ETH, IPV4, IPV6, UDP, TCP, VXLAN, and so on; items also include identifiers such as PF, VF, and END. The item types currently supported by DPDK are defined in enum rte_flow_item_type in rte_flow.h. When describing an item, a spec and a mask can be added to tell the driver which fields need to match. Take an Ethernet flow rule as an example (see the sketch below): the item describes an exact match on the L2 destination address 11:22:33:44:55:66.
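A minimal sketch (not from the original article) of such a pattern. Field names assume a recent DPDK release (struct rte_ether_addr); older releases use struct ether_addr instead.

#include <rte_flow.h>

/* Exact match on the L2 destination MAC 11:22:33:44:55:66. */
static const struct rte_flow_item_eth eth_spec = {
    .dst.addr_bytes = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66 },
};
static const struct rte_flow_item_eth eth_mask = {
    .dst.addr_bytes = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
};
/* A pattern is an array of items terminated by an END item. */
static const struct rte_flow_item pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_spec, .mask = &eth_mask },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};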

Actions describe what a flow rule does, for example QUEUE, DROP, PF, VF, and END; the action types supported by DPDK are defined in enum rte_flow_action_type in rte_flow.h. The actions below (a sketch) place packets matching a pattern into queue 3.
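A matching sketch of the action list, again only illustrative:

/* Steer matching packets into Rx queue 3; END terminates the list. */
static const struct rte_flow_action_queue queue_conf = { .index = 3 };
static const struct rte_flow_action actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_conf },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};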

This approach pushes the complexity of distinguishing filter types down into the driver; users no longer need to care about the hardware's capabilities, so applications can easily add or remove flow rules.
To add a flow rule, the application only needs to call rte_flow_create() with the appropriate pattern and actions; to remove one, it only needs to call rte_flow_destroy(). Again taking a flow-director-style rule as an example, a rule can be defined and created as sketched below; for users this is much easier to work with.
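A hedged usage sketch that ties the pattern and actions above together on port 0 (port number and error handling are illustrative only):

struct rte_flow_error error;
struct rte_flow_attr attr = { .ingress = 1 };
struct rte_flow *flow = NULL;

/* Ask the PMD whether the rule is supported before installing it. */
if (rte_flow_validate(0, &attr, pattern, actions, &error) == 0)
    flow = rte_flow_create(0, &attr, pattern, actions, &error);
if (flow == NULL)
    printf("flow rule not installed: %s\n",
           error.message ? error.message : "unknown reason");

/* ... and later, to remove the rule: */
if (flow != NULL)
    rte_flow_destroy(0, flow, &error);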

In short, the generic flow API is clearly much more convenient.

https://www.sdnlab.com/24697.html

Analysis of the DPDK flow-filtering-rule example program: a peek at NIC flow-processing features

https://www.sdnlab.com/23216.html

[SPDK/NVMe storage technology analysis] 012 - Source-code analysis of the userspace ibv_post_send()

https://www.cnblogs.com/vlhn/p/7997457.html

 

Setting up a DPDK development environment with Mellanox NICs on CentOS

https://blog.csdn.net/botao_li/article/details/108075832

 

