Recommendation Systems (RecSys): the "Silent Majority"
- Internet companies
  "At Alibaba and at many other internet companies there is a 'silent majority' application, the recommendation system: it often consumes more than 80% or even 90% of the machine-learning compute."
- Facebook AI cycles allocation
  Recommendation systems account for 50% of Facebook's AI training compute and 80% of its AI inference compute.

Compute providers
The RecSys black box
Input and output
Given a user and the user's context (entry point, time, region, the user's demographics, etc.), compute the probability that the user interacts (click, purchase, connect, etc.) with inventory items (products, articles, other users, etc.), then pick the inventory items most likely to be interacted with, recommend them to the user, and drive interaction and conversion.

KPIs
- Algorithm KPI: grow revenue
  Raise users' interaction and conversion rates on the recommended results; this is the domain of algorithm research.
- Performance KPI: availability + cost savings
  Latency-Bound Throughput: maximize system throughput while meeting the required latency SLA (Service Level Agreement); this is the domain of systems work.

RecSys Algorithm Models
Taxonomy of RecSys algorithms
Algorithmically, models can be roughly divided as in the figure below. Mainstream industrial usage today is dominated by DNN models, which are the target workload of this article.

The DNN RecSys model paradigm
DNN RecSys Model = Feature Engineering + Feature Interaction + Predictor DNN
Different choices of feature engineering, feature interaction, and predictor DNN yield different models and different workload characteristics.
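The three-stage paradigm can be sketched end to end as follows. All shapes, the table size, and the single-FC predictor are hypothetical placeholders for illustration, not taken from any specific production model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 categorical features, one 100k-row embedding table,
# embedding dim 16, 13 dense (numerical) features.
NUM_CAT, TABLE_ROWS, EMB_DIM, NUM_DENSE = 8, 100_000, 16, 13
emb_table = rng.normal(size=(TABLE_ROWS, EMB_DIM)).astype(np.float32)

def feature_engineering(cat_ids, dense):
    # embedding_lookup for categorical features; dense features pass through
    emb = emb_table[cat_ids]                 # (batch, NUM_CAT, EMB_DIM)
    return emb, dense

def feature_interaction(emb, dense):
    # simplest choice: flatten + concat (tensor manipulation)
    flat = emb.reshape(emb.shape[0], -1)     # (batch, NUM_CAT * EMB_DIM)
    return np.concatenate([flat, dense], axis=1)

def predictor_dnn(x, w, b):
    # one FC layer + sigmoid -> interaction probability
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

batch = 4
cat_ids = rng.integers(0, TABLE_ROWS, size=(batch, NUM_CAT))
dense = rng.normal(size=(batch, NUM_DENSE)).astype(np.float32)
emb, dense = feature_engineering(cat_ids, dense)
x = feature_interaction(emb, dense)
w = rng.normal(size=(x.shape[1], 1)).astype(np.float32)
probs = predictor_dnn(x, w, np.float32(0.0))
print(probs.shape)   # (4, 1): one interaction probability per (user, item) pair
```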

Typical DNN RecSys models
- Wide and Deep Learning (WDL)
- Deep Interest Network (DIN)
- Deep Interest Evolution Network (DIEN)
- Deep Learning Recommendation Model (DLRM)
WDL
- Main idea
  Wide for memorization, deep for generalization
- Component choices
  - Feature Engineering
    - embedding_lookup
    - hash bucketing
    - slice (tensor manipulation)
    - concat (tensor manipulation)
    - dense FC
  - Feature Interaction
    - concat (tensor manipulation)
    - MLP (Multi-Layer Perceptron)
  - Predictor DNN
    - FC
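The wide-plus-deep split can be sketched as follows, with hypothetical feature dimensions and a one-hidden-layer deep tower; the wide input stands in for the crossed sparse features:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, wide_dim, deep_in, hidden = 4, 32, 24, 16     # hypothetical sizes

x_wide = rng.normal(size=(batch, wide_dim)).astype(np.float32)  # crossed sparse features
x_deep = rng.normal(size=(batch, deep_in)).astype(np.float32)   # concatenated embeddings

w_wide = rng.normal(size=(wide_dim, 1)).astype(np.float32)
w1 = rng.normal(size=(deep_in, hidden)).astype(np.float32)
w2 = rng.normal(size=(hidden, 1)).astype(np.float32)

wide_logit = x_wide @ w_wide                  # memorization: linear model on wide features
h = np.maximum(x_deep @ w1, 0.0)              # generalization: MLP on dense embeddings
deep_logit = h @ w2
prob = 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))  # joint logit -> sigmoid
print(prob.shape)  # (4, 1)
```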
DIN
- Main idea
  Attention: weighting interaction influence by similarity
- Component choices
  - Feature Engineering
    - embedding_lookup
    - concat (tensor manipulation)
  - Feature Interaction
    - batch matrix multiplication
    - sum pooling (tensor manipulation)
    - concat (tensor manipulation)
  - Predictor DNN
    - MLP
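The DIN-style interaction can be sketched as follows. As a simplification, a plain dot product stands in for DIN's small attention MLP, and all shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
batch, seq_len, dim = 4, 10, 8     # hypothetical: 10 historical behaviors, dim-8 embeddings

behaviors = rng.normal(size=(batch, seq_len, dim)).astype(np.float32)  # user history
candidate = rng.normal(size=(batch, dim)).astype(np.float32)           # candidate item

# similarity scores via batch matrix multiplication: (b, s, d) @ (b, d, 1) -> (b, s, 1)
scores = behaviors @ candidate[:, :, None]
pooled = (scores * behaviors).sum(axis=1)     # weighted sum pooling -> (batch, dim)
interacted = np.concatenate([pooled, candidate], axis=1)  # concat for the predictor MLP
print(interacted.shape)  # (4, 16)
```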
DIEN
- Main idea
  Introduces a time-decay effect into attention
- Component choices
  - Feature Engineering
    - embedding_lookup
    - concat (tensor manipulation)
  - Feature Interaction
    - GRU (Gated Recurrent Unit)
    - concat (tensor manipulation)
  - Predictor DNN
    - MLP
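The GRU at the heart of DIEN's interest-evolution modeling can be sketched as a single recurrent step. Note the sequential dependence on the previous hidden state, which makes GRU-based interaction harder to parallelize than batch matmul; sizes and weight initialization here are hypothetical:

```python
import numpy as np

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gates decide how much past interest to keep vs. update."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sig(x @ Wz + h @ Uz)                  # update gate
    r = sig(x @ Wr + h @ Ur)                  # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1.0 - z) * h + z * h_tilde

rng = np.random.default_rng(3)
batch, seq_len, dim = 4, 10, 8                # hypothetical sizes
seq = rng.normal(size=(seq_len, batch, dim)).astype(np.float32)
Ws = [rng.normal(size=(dim, dim)).astype(np.float32) * 0.1 for _ in range(6)]

h = np.zeros((batch, dim), dtype=np.float32)
for x in seq:                                 # step t depends on h from step t-1
    h = gru_step(h, x, *Ws)
print(h.shape)  # (4, 8)
```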
DLRM
- Main idea
  Feature interaction via auto-correlation (pairwise dot products)
- Component choices
  - Feature Engineering
    - embedding_lookup
    - sum pooling (tensor manipulation)
    - FC
  - Feature Interaction
    - batch matrix multiplication
  - Predictor DNN
    - MLP
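The auto-correlation-style interaction can be sketched as a batched A·Aᵀ over the stacked feature embeddings, keeping only the lower triangle since the result is symmetric; shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
batch, num_feat, dim = 4, 8, 16        # hypothetical: 8 feature embeddings of dim 16

A = rng.normal(size=(batch, num_feat, dim)).astype(np.float32)
Z = A @ A.transpose(0, 2, 1)           # batched A @ A^T: pairwise dot products, (b, f, f)

# keep only the strictly lower triangle: Z is symmetric, so the rest is redundant
i, j = np.tril_indices(num_feat, k=-1)
interactions = Z[:, i, j]              # (batch, f*(f-1)/2), fed to the predictor MLP
print(interactions.shape)  # (4, 28)
```

The symmetry of Z is also what later makes it possible to compute A·Aᵀ while loading only A.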
Characteristics of DNN RecSys Models
Small tensors + big models
- Each record of the Criteo TeraByte dataset:
  13 numerical features + 26 categorical features = 156 B
- Open-source DLRM model:
  ~24 billion parameters = 96 GB, most of it embedding tables
This leads to a lower computational intensity than CNN workloads.
Tensor operations matter
Tensor operations, namely embedding lookup and tensor manipulation, occupy a non-negligible share of the runtime.
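A quick back-of-envelope check of the two numbers above, assuming 4-byte values (FP32 for numerical features and parameters, int32 for categorical ids):

```python
# Per-record size of the Criteo TeraByte dataset: 39 values x 4 bytes.
record_bytes = (13 + 26) * 4
print(record_bytes)          # 156 B per record

# Open-source DLRM model size at FP32.
params = 24e9                # ~24 billion parameters
model_gb = params * 4 / 1e9
print(model_gb)              # 96.0 GB
```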

Workload heterogeneity
Diverse combinations of feature engineering, feature interaction, and predictor DNN choices lead to workload heterogeneity.
RecSys Workload Performance Optimization
Overview
Model optimization focuses on the performance of the model itself, while deployment optimization focuses on its performance in the deployment environment, especially under co-located (mixed) deployment.
Model Optimization
Optimization principles
- #1. Minimize system (HW/SW) overheads
  - minimize scheduling overhead
    - minimize function calls
      - use thread pools
      - use big threads (i.e., graph fusion/stitching)
    - [accelerator cases] minimize kernel launch overhead
      - use big kernels (i.e., graph fusion)
- #2. Roofline-analysis-driven TFLOPS improvement
  - improve attainable TFLOPS
  - improve actual TFLOPS
  Concretely:
  - 2.1: improve computational intensity by decreasing data traffic
  - 2.2: improve attainable TFLOPS by improving peak memory BW
  - 2.3: improve actual TFLOPS with more parallelism
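Principle #2 follows the standard roofline formula, attainable TFLOPS = min(peak TFLOPS, CI × peak memory BW), where CI (computational intensity) is FLOPs per byte moved. A sketch with hypothetical accelerator parameters:

```python
def attainable_tflops(peak_tflops, peak_bw_tbs, ci_flop_per_byte):
    """Roofline: performance is capped by either compute or memory bandwidth."""
    return min(peak_tflops, ci_flop_per_byte * peak_bw_tbs)

# Hypothetical accelerator: 512 BF16 TFLOPS peak, 0.8 TB/s HBM bandwidth.
print(attainable_tflops(512, 0.8, 10))    # 8.0 -> memory bound: raise CI or BW
print(attainable_tflops(512, 0.8, 1000))  # 512 -> compute bound
```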
Tensor Operation Sub-graph
Main optimization technique
graph fusion/stitching
Principles involved:
- [#1] minimize kernel launch overhead
- [#1] minimize unnecessary bad-argument checks
- [#2.2] in-register/in-cache computing
- [#2.3] more parallelism
Case studies
- embedding_lookup fusion
  Facebook's fusion of multiple embedding_lookups brings a 7x unit-level performance improvement.
- tensor manipulation sub-graph fusion
  Feature-engineering sub-graph fusion brings a 2x unit-level performance improvement with the XLA CPUInstructionFusion pass.
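The effect of embedding_lookup fusion can be illustrated functionally as follows. This is a sketch of the idea only, not Facebook's implementation; table counts and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical: 26 embedding tables, each 1000 x 16.
tables = [rng.normal(size=(1000, 16)).astype(np.float32) for _ in range(26)]
ids = rng.integers(0, 1000, size=(4, 26))    # batch of 4, one id per table

# Unfused: 26 separate lookup ops plus a concat (27 kernel launches).
unfused = np.concatenate([t[ids[:, k]] for k, t in enumerate(tables)], axis=1)

# Fused: stack the tables once, offset the ids, then one gather + one reshape.
stacked = np.concatenate(tables, axis=0)               # (26 * 1000, 16)
offsets = np.arange(26) * 1000
fused = stacked[ids + offsets].reshape(4, -1)          # a single gather

print(np.allclose(unfused, fused))  # True: same result, far fewer ops
```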

FC & Attention Sub-graph
Sub-graph fusion
MatMul + BiasAdd + Activation
"MatMul + BiasAdd + Activation" is the canonical sub-graph inside FC layers, and its fusion is a graph-optimization pass that graph optimizers (e.g., TF Grappler) generally implement, currently mostly via template matching.
One complication in RecSys is that the same "MatMul + BiasAdd + Activation" semantics often appear in different sub-graph forms; two of them are shown below:
Although both sub-graphs are still semantically "MatMul + BiasAdd + Activation", their shapes differ, so a template-matching fusion pass cannot correctly recognize and fuse them; a fusion pass operating at a higher level of abstraction is needed. Practice also shows that the enhanced pass reduces online inference latency by about 20%.
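Two semantically equivalent forms can be illustrated as follows; the bias-folding variant is one hypothetical example of a graph-shape change that defeats template matching:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=(4, 8)).astype(np.float32)
w = rng.normal(size=(8, 3)).astype(np.float32)
b = rng.normal(size=(3,)).astype(np.float32)

relu = lambda v: np.maximum(v, 0.0)

# Form 1: the canonical pattern a template-based pass matches.
y1 = relu(x @ w + b)                         # MatMul -> BiasAdd -> Activation

# Form 2: same semantics, different graph shape: the bias is folded into
# the matmul by appending a constant-1 column to the input.
x_aug = np.concatenate([x, np.ones((4, 1), np.float32)], axis=1)
w_aug = np.concatenate([w, b[None, :]], axis=0)
y2 = relu(x_aug @ w_aug)                     # Concat -> MatMul -> Activation

print(np.allclose(y1, y2, atol=1e-5))  # True: same FC, but no BiasAdd node to match
```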
Multi-Head Attention
Multi-Head Attention is the basic sub-graph of attention structures, so analyzing it carefully and optimizing it aggressively is well worthwhile.
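For reference, the multi-head attention computation being optimized looks like this (learned Q/K/V/output projections omitted for brevity; shapes are hypothetical):

```python
import numpy as np

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention split across heads."""
    b, s, d = q.shape
    hd = d // num_heads
    split = lambda t: t.reshape(b, s, num_heads, hd).transpose(0, 2, 1, 3)
    q, k, v = split(q), split(k), split(v)           # (b, h, s, hd)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    out = p @ v                                      # (b, h, s, hd)
    return out.transpose(0, 2, 1, 3).reshape(b, s, d)

rng = np.random.default_rng(7)
x = rng.normal(size=(2, 5, 16)).astype(np.float32)
y = multi_head_attention(x, x, x, num_heads=4)
print(y.shape)  # (2, 5, 16)
```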
Operator Optimization
Increase computational intensity
- reduce precision: FP32 → BF16
- reduce data traffic
  - FC: keep packed weights to amortize weight-packing traffic
  - DLRM batchMatMul: load only A while computing A·Aᵀ by leveraging the HW transposer
  - DLRM index: de-duplicate indices to remove redundant data traffic
Increase peak memory BW
- improve cache residence
Example
Hypothetical system parameters: L2$ peak BW = 4 TB/s, HBM2e peak BW = 0.8 TB/s, BF16 peak = 512 TFLOPS.
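Under these hypothetical parameters, the computational intensity required to reach peak compute differs by 5x between L2-resident and HBM-resident data, which is why improving cache residence pays off:

```python
# CI threshold (ridge point) = peak TFLOPS / peak BW, in FLOP per byte.
peak_tflops = 512          # BF16
l2_bw, hbm_bw = 4.0, 0.8   # TB/s

print(peak_tflops / l2_bw)   # 128.0 FLOP/B needed if data is L2-resident
print(peak_tflops / hbm_bw)  # 640.0 FLOP/B needed if data streams from HBM
```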
Deployment Optimization
Problem statement
Mixed deployment creates deployment-optimization problems:
- Model co-location brings performance variance (noisy neighbors)
- The optimal hardware varies across dynamic batch sizes ([1, 100]) and across models

Early exploration
Facebook proposed DeepRecSched, which searches for good deployment configurations via dry runs. Facebook's experiments report ~2x QPS on CPUs and ~5x QPS on GPUs.

Other
For other explorations, see the deployment-optimization part of "深度學習推理性能優化" (Deep Learning Inference Performance Optimization).
Micro-Architecture Exploration
There are two main directions:
- Near-memory computing
  The representative work is Facebook's NMP (Near-Memory Processor), which moves the embedding_lookup_reduction operation into the memory module, raising effective bandwidth without raising the physical memory bandwidth. Facebook reports a 9.8x latency reduction and a 4.2x throughput improvement on an internal family of embedding-dominated models.
- Data pipeline in SoC
  - Intel
    Intel plans to introduce data-accelerator IPs, such as DSA (Data Streaming Accelerator), into the Sapphire Rapids CPU. Offloading the memory-intensive parts from CPU instructions to a dedicated IP opens up one way to build an on-chip data pipeline and raise workload throughput.

References
- DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
- The Architectural Implications of Facebook's DNN-based Personalized Recommendation
- Cross-Stack Workload Characterization of Deep Recommendation Systems
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
- Accelerating the Wide & Deep Model Workflow from 25 Hours to 10 Minutes Using NVIDIA GPUs
- Applying the Roofline Model for Deep Learning performance optimizations
- RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
- MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions
- AI Matrix: A Deep Learning Benchmark for Alibaba Data Centers
- Deep Learning Recommendation Model for Personalization and Recommendation Systems
- Optimizing Recommendation System Inference Performance Based on GPU

