Recommendation Systems (RecSys): the "Silent Majority"
- Internet companies
  "At Alibaba and at many other internet companies there is a 'silent majority' application, the recommendation system: it often consumes more than 80% or even 90% of the machine-learning compute."
- Facebook AI cycles allocation
  Recommendation systems account for 50% of Facebook's AI training compute and 80% of its AI inference compute.

Compute providers
The RecSys black box
Input and output
Given a user and the user's context (entry point, time, region, the user's demographics, etc.), compute the probability that the user interacts (click, purchase, connect, etc.) with inventory items (products, articles, other users, etc.), then pick the inventory items most likely to be interacted with, recommend them to the user, and drive interaction and conversion.

KPIs
- Algorithm KPI: grow revenue
  Raise users' interaction and conversion rates on the recommended results; this is the domain of algorithm research.
- Performance KPI: availability + cost savings
  Latency-Bound Throughput: maximize system throughput while meeting the required latency SLA (Service Level Agreement); this is the domain of systems work.

RecSys Algorithm Models
Taxonomy of RecSys algorithms
Algorithmically, models can be roughly divided as in the figure below. Mainstream industrial usage today is dominated by DNN models, which are the target workload of this article.

The DNN RecSys model paradigm
DNN RecSys Model = Feature Engineering + Feature Interaction + Predictor DNN
Different choices of feature engineering, feature interaction, and predictor DNN yield different models and different workload characteristics.
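The three-stage paradigm can be sketched end to end as follows. All shapes, the table size, and the single-FC predictor are hypothetical placeholders for illustration, not taken from any specific production model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 categorical features, one 100k-row embedding table,
# embedding dim 16, 13 dense (numerical) features.
NUM_CAT, TABLE_ROWS, EMB_DIM, NUM_DENSE = 8, 100_000, 16, 13
emb_table = rng.normal(size=(TABLE_ROWS, EMB_DIM)).astype(np.float32)

def feature_engineering(cat_ids, dense):
    # embedding_lookup for categorical features; dense features pass through
    emb = emb_table[cat_ids]                 # (batch, NUM_CAT, EMB_DIM)
    return emb, dense

def feature_interaction(emb, dense):
    # simplest choice: flatten + concat (tensor manipulation)
    flat = emb.reshape(emb.shape[0], -1)     # (batch, NUM_CAT * EMB_DIM)
    return np.concatenate([flat, dense], axis=1)

def predictor_dnn(x, w, b):
    # one FC layer + sigmoid -> interaction probability
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

batch = 4
cat_ids = rng.integers(0, TABLE_ROWS, size=(batch, NUM_CAT))
dense = rng.normal(size=(batch, NUM_DENSE)).astype(np.float32)
emb, dense = feature_engineering(cat_ids, dense)
x = feature_interaction(emb, dense)
w = rng.normal(size=(x.shape[1], 1)).astype(np.float32)
probs = predictor_dnn(x, w, np.float32(0.0))
print(probs.shape)   # (4, 1): one interaction probability per (user, item) pair
```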

Typical DNN RecSys models
- Wide and Deep Learning (WDL)
- Deep Interest Network (DIN)
- Deep Interest Evolution Network (DIEN)
- Deep Learning Recommendation Model (DLRM)
WDL
- Main idea
  Wide for memorization, deep for generalization
- Component choices
  - Feature Engineering
    - embedding_lookup
    - hash bucketing
    - slice (tensor manipulation)
    - concat (tensor manipulation)
    - dense FC
  - Feature Interaction
    - concat (tensor manipulation)
    - MLP (Multi-Layer Perceptron)
  - Predictor DNN
    - FC
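The wide-plus-deep split can be sketched as follows, with hypothetical feature dimensions and a one-hidden-layer deep tower; the wide input stands in for the crossed sparse features:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, wide_dim, deep_in, hidden = 4, 32, 24, 16     # hypothetical sizes

x_wide = rng.normal(size=(batch, wide_dim)).astype(np.float32)  # crossed sparse features
x_deep = rng.normal(size=(batch, deep_in)).astype(np.float32)   # concatenated embeddings

w_wide = rng.normal(size=(wide_dim, 1)).astype(np.float32)
w1 = rng.normal(size=(deep_in, hidden)).astype(np.float32)
w2 = rng.normal(size=(hidden, 1)).astype(np.float32)

wide_logit = x_wide @ w_wide                  # memorization: linear model on wide features
h = np.maximum(x_deep @ w1, 0.0)              # generalization: MLP on dense embeddings
deep_logit = h @ w2
prob = 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))  # joint logit -> sigmoid
print(prob.shape)  # (4, 1)
```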
DIN
- Main idea
  Attention: weighting interaction influence by similarity
- Component choices
  - Feature Engineering
    - embedding_lookup
    - concat (tensor manipulation)
  - Feature Interaction
    - batch matrix multiplication
    - sum pooling (tensor manipulation)
    - concat (tensor manipulation)
  - Predictor DNN
    - MLP
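The DIN-style interaction can be sketched as follows. As a simplification, a plain dot product stands in for DIN's small attention MLP, and all shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
batch, seq_len, dim = 4, 10, 8     # hypothetical: 10 historical behaviors, dim-8 embeddings

behaviors = rng.normal(size=(batch, seq_len, dim)).astype(np.float32)  # user history
candidate = rng.normal(size=(batch, dim)).astype(np.float32)           # candidate item

# similarity scores via batch matrix multiplication: (b, s, d) @ (b, d, 1) -> (b, s, 1)
scores = behaviors @ candidate[:, :, None]
pooled = (scores * behaviors).sum(axis=1)     # weighted sum pooling -> (batch, dim)
interacted = np.concatenate([pooled, candidate], axis=1)  # concat for the predictor MLP
print(interacted.shape)  # (4, 16)
```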
DIEN
- Main idea
  Introduces a time-decay effect into attention
- Component choices
  - Feature Engineering
    - embedding_lookup
    - concat (tensor manipulation)
  - Feature Interaction
    - GRU (Gated Recurrent Unit)
    - concat (tensor manipulation)
  - Predictor DNN
    - MLP
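The GRU at the heart of DIEN's interest-evolution modeling can be sketched as a single recurrent step. Note the sequential dependence on the previous hidden state, which makes GRU-based interaction harder to parallelize than batch matmul; sizes and weight initialization here are hypothetical:

```python
import numpy as np

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gates decide how much past interest to keep vs. update."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sig(x @ Wz + h @ Uz)                  # update gate
    r = sig(x @ Wr + h @ Ur)                  # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1.0 - z) * h + z * h_tilde

rng = np.random.default_rng(3)
batch, seq_len, dim = 4, 10, 8                # hypothetical sizes
seq = rng.normal(size=(seq_len, batch, dim)).astype(np.float32)
Ws = [rng.normal(size=(dim, dim)).astype(np.float32) * 0.1 for _ in range(6)]

h = np.zeros((batch, dim), dtype=np.float32)
for x in seq:                                 # step t depends on h from step t-1
    h = gru_step(h, x, *Ws)
print(h.shape)  # (4, 8)
```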
DLRM
- Main idea
  Feature interaction via auto-correlation (pairwise dot products)
- Component choices
  - Feature Engineering
    - embedding_lookup
    - sum pooling (tensor manipulation)
    - FC
  - Feature Interaction
    - batch matrix multiplication
  - Predictor DNN
    - MLP
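The auto-correlation-style interaction can be sketched as a batched A·Aᵀ over the stacked feature embeddings, keeping only the lower triangle since the result is symmetric; shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
batch, num_feat, dim = 4, 8, 16        # hypothetical: 8 feature embeddings of dim 16

A = rng.normal(size=(batch, num_feat, dim)).astype(np.float32)
Z = A @ A.transpose(0, 2, 1)           # batched A @ A^T: pairwise dot products, (b, f, f)

# keep only the strictly lower triangle: Z is symmetric, so the rest is redundant
i, j = np.tril_indices(num_feat, k=-1)
interactions = Z[:, i, j]              # (batch, f*(f-1)/2), fed to the predictor MLP
print(interactions.shape)  # (4, 28)
```

The symmetry of Z is also what later makes it possible to compute A·Aᵀ while loading only A.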
Characteristics of DNN RecSys Models
Small tensors + big models
- Each record of the Criteo TeraByte dataset:
  13 numerical features + 26 categorical features = 156 B
- Open-source DLRM model:
  ~24 billion parameters = 96 GB, most of it embedding tables
This leads to a lower computational intensity than CNN workloads.
Tensor operations matter
Tensor operations, namely embedding lookup and tensor manipulation, occupy a non-negligible share of the runtime.
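A quick back-of-envelope check of the two numbers above, assuming 4-byte values (FP32 for numerical features and parameters, int32 for categorical ids):

```python
# Per-record size of the Criteo TeraByte dataset: 39 values x 4 bytes.
record_bytes = (13 + 26) * 4
print(record_bytes)          # 156 B per record

# Open-source DLRM model size at FP32.
params = 24e9                # ~24 billion parameters
model_gb = params * 4 / 1e9
print(model_gb)              # 96.0 GB
```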

Workload heterogeneity
Diverse combinations of feature engineering, feature interaction, and predictor DNN choices lead to workload heterogeneity.
RecSys Workload Performance Optimization
Overview
Model optimization focuses on the performance of the model itself, while deployment optimization focuses on its performance in the deployment environment, especially under co-located (mixed) deployment.
Model Optimization
Optimization principles
- #1. Minimize system (HW/SW) overheads
  - minimize scheduling overhead
    - minimize function calls
      - use thread pools
      - use big threads (i.e., graph fusion/stitching)
    - [accelerator cases] minimize kernel launch overhead
      - use big kernels (i.e., graph fusion)
- #2. Roofline-analysis-driven TFLOPS improvement
  - improve attainable TFLOPS
  - improve actual TFLOPS
  Concretely:
  - 2.1: improve computational intensity by decreasing data traffic
  - 2.2: improve attainable TFLOPS by improving peak memory BW
  - 2.3: improve actual TFLOPS with more parallelism
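Principle #2 follows the standard roofline formula, attainable TFLOPS = min(peak TFLOPS, CI × peak memory BW), where CI (computational intensity) is FLOPs per byte moved. A sketch with hypothetical accelerator parameters:

```python
def attainable_tflops(peak_tflops, peak_bw_tbs, ci_flop_per_byte):
    """Roofline: performance is capped by either compute or memory bandwidth."""
    return min(peak_tflops, ci_flop_per_byte * peak_bw_tbs)

# Hypothetical accelerator: 512 BF16 TFLOPS peak, 0.8 TB/s HBM bandwidth.
print(attainable_tflops(512, 0.8, 10))    # 8.0 -> memory bound: raise CI or BW
print(attainable_tflops(512, 0.8, 1000))  # 512 -> compute bound
```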
Tensor Operation Sub-graph
Main optimization technique
graph fusion/stitching
Principles involved:
- [#1] minimize kernel launch overhead
- [#1] minimize unnecessary bad-argument checks
- [#2.2] in-register/in-cache computing
- [#2.3] more parallelism
Case studies
- embedding_lookup fusion
  Facebook's fusion of multiple embedding_lookups brings a 7x unit-level performance improvement.
- tensor manipulation sub-graph fusion
  Feature-engineering sub-graph fusion brings a 2x unit-level performance improvement with the XLA CPUInstructionFusion pass.
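The effect of embedding_lookup fusion can be illustrated functionally as follows. This is a sketch of the idea only, not Facebook's implementation; table counts and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical: 26 embedding tables, each 1000 x 16.
tables = [rng.normal(size=(1000, 16)).astype(np.float32) for _ in range(26)]
ids = rng.integers(0, 1000, size=(4, 26))    # batch of 4, one id per table

# Unfused: 26 separate lookup ops plus a concat (27 kernel launches).
unfused = np.concatenate([t[ids[:, k]] for k, t in enumerate(tables)], axis=1)

# Fused: stack the tables once, offset the ids, then one gather + one reshape.
stacked = np.concatenate(tables, axis=0)               # (26 * 1000, 16)
offsets = np.arange(26) * 1000
fused = stacked[ids + offsets].reshape(4, -1)          # a single gather

print(np.allclose(unfused, fused))  # True: same result, far fewer ops
```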

FC & Attention Sub-graph
Sub-graph fusion
MatMul + BiasAdd + Activation
"MatMul + BiasAdd + Activation" is the canonical sub-graph inside FC layers, and its fusion is a graph-optimization pass that graph optimizers (e.g., TF Grappler) generally implement, currently mostly via template matching.
One complication in RecSys is that the same "MatMul + BiasAdd + Activation" semantics often appear in different sub-graph forms; two of them are shown below:
Although both sub-graphs are still semantically "MatMul + BiasAdd + Activation", their shapes differ, so a template-matching fusion pass cannot correctly recognize and fuse them; a fusion pass operating at a higher level of abstraction is needed. Practice also shows that the enhanced pass reduces online inference latency by about 20%.
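Two semantically equivalent forms can be illustrated as follows; the bias-folding variant is one hypothetical example of a graph-shape change that defeats template matching:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=(4, 8)).astype(np.float32)
w = rng.normal(size=(8, 3)).astype(np.float32)
b = rng.normal(size=(3,)).astype(np.float32)

relu = lambda v: np.maximum(v, 0.0)

# Form 1: the canonical pattern a template-based pass matches.
y1 = relu(x @ w + b)                         # MatMul -> BiasAdd -> Activation

# Form 2: same semantics, different graph shape: the bias is folded into
# the matmul by appending a constant-1 column to the input.
x_aug = np.concatenate([x, np.ones((4, 1), np.float32)], axis=1)
w_aug = np.concatenate([w, b[None, :]], axis=0)
y2 = relu(x_aug @ w_aug)                     # Concat -> MatMul -> Activation

print(np.allclose(y1, y2, atol=1e-5))  # True: same FC, but no BiasAdd node to match
```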
Multi-Head Attention
Multi-Head Attention is the basic sub-graph of attention structures, so analyzing it carefully and optimizing it aggressively is well worthwhile.
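For reference, the multi-head attention computation being optimized looks like this (learned Q/K/V/output projections omitted for brevity; shapes are hypothetical):

```python
import numpy as np

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention split across heads."""
    b, s, d = q.shape
    hd = d // num_heads
    split = lambda t: t.reshape(b, s, num_heads, hd).transpose(0, 2, 1, 3)
    q, k, v = split(q), split(k), split(v)           # (b, h, s, hd)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    out = p @ v                                      # (b, h, s, hd)
    return out.transpose(0, 2, 1, 3).reshape(b, s, d)

rng = np.random.default_rng(7)
x = rng.normal(size=(2, 5, 16)).astype(np.float32)
y = multi_head_attention(x, x, x, num_heads=4)
print(y.shape)  # (2, 5, 16)
```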
Operator Optimization
Increase computational intensity
- reduce precision: FP32 → BF16
- reduce data traffic
  - FC: keep packed weights to amortize weight-packing traffic
  - DLRM batchMatMul: load only A while computing A·Aᵀ by leveraging the HW transposer
  - DLRM index: de-duplicate indices to remove redundant data traffic
Increase peak memory BW
- improve cache residence
Example
Hypothetical system parameters: L2$ peak BW = 4 TB/s, HBM2e peak BW = 0.8 TB/s, BF16 peak = 512 TFLOPS.
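Under these hypothetical parameters, the computational intensity required to reach peak compute differs by 5x between L2-resident and HBM-resident data, which is why improving cache residence pays off:

```python
# CI threshold (ridge point) = peak TFLOPS / peak BW, in FLOP per byte.
peak_tflops = 512          # BF16
l2_bw, hbm_bw = 4.0, 0.8   # TB/s

print(peak_tflops / l2_bw)   # 128.0 FLOP/B needed if data is L2-resident
print(peak_tflops / hbm_bw)  # 640.0 FLOP/B needed if data streams from HBM
```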
Deployment Optimization
Problem statement
Mixed deployment creates deployment-optimization problems:
- Model co-location brings performance variance (noisy neighbors)
- The optimal hardware varies across dynamic batch sizes ([1, 100]) and across models

Early exploration
Facebook proposed DeepRecSched, which searches for good deployment configurations via dry runs. Facebook's experiments report ~2x QPS on CPUs and ~5x QPS on GPUs.

Other
For other explorations, see the deployment-optimization part of "深度學習推理性能優化" (Deep Learning Inference Performance Optimization).
Micro-Architecture Exploration
There are two main directions:
- Near-memory computing
  The representative work is Facebook's NMP (Near-Memory Processor), which moves the embedding_lookup_reduction operation into the memory module, raising effective bandwidth without raising the physical memory bandwidth. Facebook reports a 9.8x latency reduction and a 4.2x throughput improvement on an internal family of embedding-dominated models.
- Data pipeline in SoC
  - Intel
    Intel plans to introduce data-accelerator IPs, such as DSA (Data Streaming Accelerator), into the Sapphire Rapids CPU. Offloading the memory-intensive parts from CPU instructions to a dedicated IP opens up one way to build an on-chip data pipeline and raise workload throughput.

References
- DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
- The Architectural Implications of Facebook's DNN-based Personalized Recommendation
- Cross-Stack Workload Characterization of Deep Recommendation Systems
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
- Accelerating the Wide & Deep Model Workflow from 25 Hours to 10 Minutes Using NVIDIA GPUs
- Applying the Roofline Model for Deep Learning performance optimizations
- RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
- MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions
- AI Matrix: A Deep Learning Benchmark for Alibaba Data Centers
- Deep Learning Recommendation Model for Personalization and Recommendation Systems
- Optimizing Recommendation System Inference Performance Based on GPU

