This paper appeared in the KDD 2018 Research Track: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts.
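I. Abstract
Multi-task learning is used in many applications, recommender systems among them. In movie recommendation, for example, users tend to both purchase and rate highly the movies they like, so a single model can predict whether a user will purchase a movie and what rating they will give it.
Multi-task models are often sensitive to how closely the tasks are related, so it is important to study the trade-off between task-specific objectives and inter-task relationships.
MMoE is designed to learn these relationships. The paper adapts the Mixture-of-Experts (MoE) structure to multi-task learning by sharing expert subnetworks across all tasks, and it trains a gating network for each task to optimize that task.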
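II. Introduction
- Many DNN-based multi-task models are sensitive to factors such as differences in data distribution and task relatedness; inherent conflicts between tasks can hurt the predictions of some of them.
- Some papers propose new modeling techniques to handle task differences, but these techniques usually add more parameters for each task, which increases the computational cost.
- MMoE: learns the relationships between tasks and learns task-specific functions. It automatically allocates parameters to capture either shared or task-specific information, which avoids adding many new parameters per task.
By learning both what the tasks have in common and how they differ, a multi-task model can improve the learning efficiency and quality of each task.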
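(1) Multi-task frameworks widely use the shared-bottom structure, in which the tasks share the bottom hidden layers.
This structure reduces the risk of overfitting, but its performance can suffer from task differences and from differences in data distribution.
(2) Other structures exist as well. In one, two tasks keep separate parameters but an L2-norm constraint is added between the two sets of parameters. In another, each task learns its own set of hidden layers and the model then learns a combination of all the hidden layers.
Compared with shared-bottom, these models add task-specific parameters, which improves the final result when task differences would otherwise interfere with the shared parameters.
The drawback is that the extra parameters require more training data, and the more complex model is harder to deploy in a real production environment.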
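The paper therefore proposes a multi-task structure called Multi-gate Mixture-of-Experts (MMoE). MMoE models task relationships explicitly and learns task-specific functions on top of shared representations, without a significant increase in parameters.
The MMoE structure (panel c of the figure below) builds on the widely used Shared-Bottom structure (panel a) and the MoE structure; panel (b) is a special case of panel (c).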
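III. The General Multi-task Learning Model
1. Framework:
As panel (a) above shows, the shared-bottom network (denoted by the function f) sits at the bottom and is shared by all tasks. Above it, each of the K tasks has its own tower network h^{k}, and the output of task k is y_{k} = h^{k}(f(x)).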
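As a rough illustration of y_{k} = h^{k}(f(x)), the NumPy sketch below wires one shared layer to K linear towers. The layer sizes, the random weights, and the helper name `shared_bottom_forward` are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shared_bottom_forward(x, w_shared, tower_weights):
    """Return one output per task: y_k = h^k(f(x))."""
    shared = relu(x @ w_shared)                      # f(x), computed once for all tasks
    return [shared @ w_k for w_k in tower_weights]   # h^k: one linear tower per task

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))                         # 5 examples, 10 features (assumed sizes)
w_shared = rng.normal(size=(10, 4))                  # shared bottom: 10 -> 4
tower_weights = [rng.normal(size=(4, 1)) for _ in range(2)]  # K = 2 towers
outputs = shared_bottom_forward(x, w_shared, tower_weights)
print([y.shape for y in outputs])
```

Every task reads the same `shared` activations, which is why conflicting tasks can interfere with each other in this structure.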
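2. Task-relatedness experiment
Next, an experiment explores how task relatedness affects multi-task learning.
Suppose the model has two regression tasks and the data is generated by sampling, with both tasks sharing the same input but having different labels. Task relatedness is then measured by the Pearson correlation coefficient between the labels: the larger the coefficient, the more related the tasks. The data is generated as follows:

First, generate two orthogonal unit vectors u1 and u2, and build the model coefficients w1 and w2 from them (step 2 of the procedure): w_{1} = c * u_{1} and w_{2} = c * (p * u_{1} + sqrt(1 - p^2) * u_{2}), where c is a scale constant. The cosine similarity between w1 and w2 is then exactly p, which follows directly from the cosine formula.
Then sample the input x from a normal distribution and compute the labels from the two formulas below, where alpha_{i} and beta_{i} are given constants that shape the sine terms and epsilon_{1}, epsilon_{2} are Gaussian noise:
y_{1} = w_{1}^{T} * x + sum_{i=1}^{m}(sin(alpha_{i} * w_{1}^{T} * x + beta_{i})) + epsilon_{1}
y_{2} = w_{2}^{T} * x + sum_{i=1}^{m}(sin(alpha_{i} * w_{2}^{T} * x + beta_{i})) + epsilon_{2}
Note that x and y are not linearly related, because each label includes a sum of sine functions. As a result, the Pearson correlation between the labels is not equal to the cosine similarity between w1 and w2, although the two are positively correlated, as the figure below shows:

The paper therefore uses the cosine similarity of the weight vectors as a proxy for task relatedness.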
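The generation procedure can be sketched in NumPy as follows. The function name `make_synthetic_tasks`, the sampling ranges for alpha and beta, the default noise level, and the use of a QR decomposition to obtain orthogonal unit vectors are all assumptions for illustration, not settings taken from the paper.

```python
import numpy as np

def make_synthetic_tasks(num_samples, dim, p, c=1.0, m=3, noise_std=0.1, seed=0):
    """Generate inputs and two regression labels whose weight vectors have cosine similarity p."""
    rng = np.random.default_rng(seed)
    # Two orthonormal vectors u1, u2: columns of Q from a QR decomposition.
    q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    u1, u2 = q[:, 0], q[:, 1]
    w1 = c * u1
    w2 = c * (p * u1 + np.sqrt(1.0 - p ** 2) * u2)
    x = rng.normal(size=(num_samples, dim))
    alpha = rng.uniform(0.5, 2.0, size=m)   # assumed range for the sine frequencies
    beta = rng.uniform(0.0, 1.0, size=m)    # assumed range for the sine phases

    def label(w):
        z = x @ w                                            # linear part w^T x
        sines = np.sin(np.outer(z, alpha) + beta).sum(axis=1)
        return z + sines + rng.normal(scale=noise_std, size=num_samples)

    return x, label(w1), label(w2), w1, w2

x, y1, y2, w1, w2 = make_synthetic_tasks(num_samples=1000, dim=16, p=0.5)
cosine = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
pearson = np.corrcoef(y1, y2)[0, 1]
print("cosine similarity of weights:", round(float(cosine), 3))
print("Pearson correlation of labels:", round(float(pearson), 3))
```

The cosine similarity equals p by construction, while the label correlation only tracks it approximately because of the sine terms and the noise.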
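3. Results
Using the data generation process and the relatedness measure described above, the multi-task model is evaluated at task correlations of 0.5, 0.9, and 1, as shown in the figure below:

The loss decreases as task relatedness increases, which confirms the earlier hypothesis.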
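IV. The MMoE Model
1. The MoE model
First consider the Mixture-of-Experts (MoE) model, which the paper later calls One-gate Mixture-of-Experts (OMoE), shown in the figure below:

Compared with the general multi-task framework, the shared bottom is split into several experts, and a gate is added so that different inputs can use the shared layer in different ways. The output of the shared layer is then:
y = sum_{i=1}^{n}(g(x)_{i} * f_{i}(x))
Here f_{i}, i = 1, ..., n, are the n expert networks (each can be viewed as a neural network) and f_{i}(x) is the output of the i-th expert. g(x)_{i} is the weight of the i-th expert. It is computed from the input as g(x) = softmax(W_{g} * x), where W_{g} is a trainable n x d matrix (n experts, d input features). g is the gating network that combines the expert results: it produces a probability distribution over the n experts, and the final output is the weighted sum of all expert outputs. MoE can therefore be viewed as an ensemble of several independent models.
Later work treats MoE as a basic building block and stacks several MoE layers in one large network: an MoE layer takes the output of the previous MoE layer as its input and passes its own output on to the next layer.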
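The single-gate MoE output described above can be sketched in NumPy as follows. The sizes, the random weights, the ReLU experts, and the function name `moe_forward` are illustrative assumptions; the kernels are stored transposed relative to W_{g} so that a batch of row vectors can be multiplied from the left.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, expert_kernels, gate_kernel):
    """y = sum_i g(x)_i * f_i(x), with one gate shared by all tasks.

    x: (batch, d), expert_kernels: (d, units, n), gate_kernel: (d, n)
    """
    # f_i(x) for all experts at once: (batch, units, n)
    expert_out = np.maximum(np.tensordot(x, expert_kernels, axes=1), 0.0)
    gate = softmax(x @ gate_kernel)                       # (batch, n), rows sum to 1
    output = (expert_out * gate[:, None, :]).sum(axis=2)  # weighted sum over experts
    return output, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10))
expert_kernels = rng.normal(size=(10, 4, 8))   # d=10, 4 units per expert, n=8 experts
gate_kernel = rng.normal(size=(10, 8))
output, gate = moe_forward(x, expert_kernels, gate_kernel)
print(output.shape, gate.sum(axis=1))
```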
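2. The MMoE model
The proposed model (MMoE) aims to capture task differences without requiring significantly more parameters than the shared-bottom structure. Its core idea is to replace the function f in the shared-bottom network with an MoE layer.
Compared with MoE, the Multi-gate Mixture-of-Experts (MMoE) model gives every task its own gate, so that different tasks as well as different inputs can use the shared layer in different ways. The model structure is:

The shared-layer output now differs from task to task. For task k it is:
f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
g^{k}(x) = softmax(W_{gk} * x)
The shared-layer output of each task then passes through that task's fully connected layers (the tower network h^{k}) to produce the task output:
y_{k} = h^{k}(f^{k}(x))
Intuitively, if two tasks are not closely related, their gates will produce quite different weights, so each task draws on a different subset of the experts, which approximates several single-task models. If two tasks are closely related, their gates should produce similar weight distributions, which resembles the general multi-task framework.
In contrast to sharing one gating network across all tasks (the One-gate MoE model, panel b above), MMoE (panel c above) uses a separate gating network for each task. Because each gating network outputs different weights, each task uses the experts selectively. The gating networks of different tasks can learn different ways of combining the experts, so the model captures both what the tasks share and how they differ.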
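The experts are subnetworks that partition the shared layer. In an implementation they can be stored as a single three-dimensional tensor of shape:
dim of input feature * number of units per expert * number of experts
Training updates this three-dimensional tensor.
The gates have shape:
dim of input feature * number of experts * number of tasks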
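To make these shapes concrete, the small helper below counts the MMoE layer parameters. The name `mmoe_param_count` is hypothetical; the sizes (499 input features, 4 units, 8 experts, 2 tasks) are the ones used in the demo code later in this post, and the bias shapes are assumed to be one bias per expert unit and one per gate output.

```python
def mmoe_param_count(input_dim, units, num_experts, num_tasks, use_bias=True):
    """Return (expert parameters, gate parameters) for one MMoE layer."""
    expert_params = input_dim * units * num_experts    # 3-D expert tensor
    gate_params = input_dim * num_experts * num_tasks  # one (input_dim, num_experts) matrix per task
    if use_bias:
        expert_params += units * num_experts           # one bias per expert unit
        gate_params += num_experts * num_tasks         # one bias per gate output
    return expert_params, gate_params

expert_params, gate_params = mmoe_param_count(499, 4, 8, 2)
print(expert_params, gate_params, expert_params + gate_params)  # 16000 8000 24000
```

Adding a task adds only one more gate (here 499 * 8 + 8 = 4000 parameters) plus its tower, while the experts stay shared.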
A few small implementation details of the network, listed here for reference:
f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper
g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
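The three formulas above translate almost line by line into NumPy. The sketch below is a simplified stand-in for a real layer: biases are omitted, the weights are random, and the name `mmoe_forward` and all sizes are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_kernels, gate_kernels):
    """Return a list with f^k(x) for every task k.

    x: (batch, d), expert_kernels: (d, units, n), gate_kernels: list of (d, n)
    """
    # f_i(x) = ReLU(W_i * x) for all experts: (batch, units, n)
    expert_out = np.maximum(np.tensordot(x, expert_kernels, axes=1), 0.0)
    task_outputs = []
    for gate_kernel in gate_kernels:
        gate = softmax(x @ gate_kernel)     # g^k(x): (batch, n)
        # f^k(x) = sum_i g^k(x)_i * f_i(x): (batch, units)
        task_outputs.append((expert_out * gate[:, None, :]).sum(axis=2))
    return task_outputs

rng = np.random.default_rng(0)
d, units, n, k = 10, 4, 8, 2
x = rng.normal(size=(3, d))
expert_kernels = rng.normal(size=(d, units, n))
gate_kernels = [rng.normal(size=(d, n)) for _ in range(k)]
outs = mmoe_forward(x, expert_kernels, gate_kernels)
print([o.shape for o in outs])

# With identical gates for every task, all tasks receive the same mixture (the one-gate case).
same = mmoe_forward(x, expert_kernels, [gate_kernels[0], gate_kernels[0]])
print(np.allclose(same[0], same[1]))
```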
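V. Experimental Results
1. Synthetic dataset
The figure below shows the results; OMoE is the single-gate MoE. On strongly correlated data, OMoE and MMoE perform about the same, but on weakly correlated data MMoE clearly outperforms the other two methods.
2. UCI census-income dataset
3. Large-scale Content Recommendation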
VI. Main Code
1. Imports
import pandas as pd
from keras.utils import to_categorical
from keras import backend as K
from keras.optimizers import Adam
from keras.initializers import VarianceScaling
from keras.layers import Input, Dense
from keras.models import Model
from keras.callbacks import Callback
from sklearn.metrics import roc_auc_score
import numpy as np
import random
import tensorflow as tf
from mmoe import MMoE  # model code (the MMoE layer class listed below)

SEED = 1
# Fix numpy seed for reproducibility
np.random.seed(SEED)
# Fix random seed for reproducibility
random.seed(SEED)
# Fix TensorFlow graph-level seed for reproducibility (TensorFlow 1.x API)
tf.set_random_seed(SEED)
# Set up the TensorFlow session
2. Loading the data: the 1994 census income data
column_names = [
    'age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education',
    'wage_per_hour', 'hs_college', 'marital_stat', 'major_ind_code',
    'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
    'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses',
    'stock_dividends', 'tax_filer_stat', 'region_prev_res', 'state_prev_res',
    'det_hh_fam_stat', 'det_hh_summ', 'instance_weight', 'mig_chg_msa',
    'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt', 'num_emp',
    'fam_under_18', 'country_father', 'country_mother', 'country_self',
    'citizenship', 'own_or_self', 'vet_question', 'vet_benefits',
    'weeks_worked', 'year', 'income_50k'
]

# Load the dataset in Pandas
train_df = pd.read_csv(
    'data/census-income.data.gz',
    delimiter=',',
    header=None,
    index_col=None,
    names=column_names
)
other_df = pd.read_csv(
    'data/census-income.test.gz',
    delimiter=',',
    header=None,
    index_col=None,
    names=column_names
)
Separating features and labels
label_columns = ['income_50k', 'marital_stat']

# One-hot encoding categorical columns
categorical_columns = [
    'class_worker', 'det_ind_code', 'det_occ_code', 'education', 'hs_college',
    'major_ind_code', 'major_occ_code', 'race', 'hisp_origin', 'sex',
    'union_member', 'unemp_reason', 'full_or_part_emp', 'tax_filer_stat',
    'region_prev_res', 'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ',
    'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same',
    'mig_prev_sunbelt', 'fam_under_18', 'country_father', 'country_mother',
    'country_self', 'citizenship', 'vet_question'
]
train_raw_labels = train_df[label_columns]
other_raw_labels = other_df[label_columns]
transformed_train = pd.get_dummies(train_df.drop(label_columns, axis=1), columns=categorical_columns)
transformed_other = pd.get_dummies(other_df.drop(label_columns, axis=1), columns=categorical_columns)
Building the labels
# This one-hot column exists only in the training set, so add it to the other set as all zeros
transformed_other['det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0

# One-hot encoding categorical labels
# income: 1 if the income is above 50000, otherwise 0
train_income = to_categorical((train_raw_labels.income_50k == ' 50000+.').astype(int), num_classes=2)
# marital: 1 if never married, otherwise 0
train_marital = to_categorical((train_raw_labels.marital_stat == ' Never married').astype(int), num_classes=2)
other_income = to_categorical((other_raw_labels.income_50k == ' 50000+.').astype(int), num_classes=2)
other_marital = to_categorical((other_raw_labels.marital_stat == ' Never married').astype(int), num_classes=2)

# dict_outputs = {'income': 2, 'marital': 2}
dict_outputs = {
    'income': train_income.shape[1],
    'marital': train_marital.shape[1]
}
dict_train_labels = {
    'income': train_income,
    'marital': train_marital
}
dict_other_labels = {
    'income': other_income,
    'marital': other_marital
}
# output_info = [(2, 'income'), (2, 'marital')]
output_info = [(dict_outputs[key], key) for key in sorted(dict_outputs.keys())]
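The label encoding above (compare against the positive value, cast to int, then one-hot encode) can be reproduced on a toy input with NumPy alone. The helper name `one_hot_binary` and the sample strings are illustrative assumptions; `np.eye(2)[...]` stands in for `to_categorical(..., num_classes=2)` and returns float64 rather than Keras's default dtype.

```python
import numpy as np

def one_hot_binary(raw_labels, positive_value):
    """Map raw labels to one-hot rows: [1, 0] for negative, [0, 1] for positive."""
    as_int = (np.asarray(raw_labels) == positive_value).astype(int)
    return np.eye(2)[as_int]   # row 0 or row 1 of the 2x2 identity matrix

raw = [' 50000+.', ' - 50000.', ' 50000+.']   # illustrative raw values
encoded = one_hot_binary(raw, ' 50000+.')
print(encoded)
```

Note the leading space in the label strings: the comparison is an exact string match, so a value without that space would silently be encoded as the negative class.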
Splitting into training, validation, and test sets
# Split the other dataset into 1:1 validation to test according to the paper
validation_indices = transformed_other.sample(frac=0.5, replace=False, random_state=SEED).index
test_indices = list(set(transformed_other.index) - set(validation_indices))
validation_data = transformed_other.iloc[validation_indices]
validation_label = [dict_other_labels[key][validation_indices] for key in sorted(dict_other_labels.keys())]
test_data = transformed_other.iloc[test_indices]
test_label = [dict_other_labels[key][test_indices] for key in sorted(dict_other_labels.keys())]
train_data = transformed_train
train_label = [dict_train_labels[key] for key in sorted(dict_train_labels.keys())]
num_features = train_data.shape[1]

print('Training data shape = {}'.format(train_data.shape))
print('Validation data shape = {}'.format(validation_data.shape))
print('Test data shape = {}'.format(test_data.shape))
############
# Training data shape = (199523, 499)
# Validation data shape = (49881, 499)
# Test data shape = (49881, 499)
3. Building the model
Input layer
input_layer = Input(shape=(num_features,))
MMoE layer
mmoe_layers = MMoE(
    units=4,
    num_experts=8,
    num_tasks=2
)(input_layer)
output_layers = []
The MMoE layer class:
from keras import backend as K
from keras import activations, initializers, regularizers, constraints
from keras.engine.topology import Layer, InputSpec


class MMoE(Layer):
    """
    Multi-gate Mixture-of-Experts model.
    """

    def __init__(self,
                 units,
                 num_experts,
                 num_tasks,
                 use_expert_bias=True,
                 use_gate_bias=True,
                 expert_activation='relu',
                 gate_activation='softmax',
                 expert_bias_initializer='zeros',
                 gate_bias_initializer='zeros',
                 expert_bias_regularizer=None,
                 gate_bias_regularizer=None,
                 expert_bias_constraint=None,
                 gate_bias_constraint=None,
                 expert_kernel_initializer='VarianceScaling',
                 gate_kernel_initializer='VarianceScaling',
                 expert_kernel_regularizer=None,
                 gate_kernel_regularizer=None,
                 expert_kernel_constraint=None,
                 gate_kernel_constraint=None,
                 activity_regularizer=None,
                 **kwargs):
        """
        Method for instantiating MMoE layer.

        :param units: Number of hidden units
        :param num_experts: Number of experts
        :param num_tasks: Number of tasks
        :param use_expert_bias: Boolean to indicate the usage of bias in the expert weights
        :param use_gate_bias: Boolean to indicate the usage of bias in the gate weights
        :param expert_activation: Activation function of the expert weights
        :param gate_activation: Activation function of the gate weights
        :param expert_bias_initializer: Initializer for the expert bias
        :param gate_bias_initializer: Initializer for the gate bias
        :param expert_bias_regularizer: Regularizer for the expert bias
        :param gate_bias_regularizer: Regularizer for the gate bias
        :param expert_bias_constraint: Constraint for the expert bias
        :param gate_bias_constraint: Constraint for the gate bias
        :param expert_kernel_initializer: Initializer for the expert weights
        :param gate_kernel_initializer: Initializer for the gate weights
        :param expert_kernel_regularizer: Regularizer for the expert weights
        :param gate_kernel_regularizer: Regularizer for the gate weights
        :param expert_kernel_constraint: Constraint for the expert weights
        :param gate_kernel_constraint: Constraint for the gate weights
        :param activity_regularizer: Regularizer for the activity
        :param kwargs: Additional keyword arguments for the Layer class
        """
        # Hidden nodes parameter
        self.units = units
        self.num_experts = num_experts
        self.num_tasks = num_tasks

        # Weight parameter
        self.expert_kernels = None
        self.gate_kernels = None
        self.expert_kernel_initializer = initializers.get(expert_kernel_initializer)
        self.gate_kernel_initializer = initializers.get(gate_kernel_initializer)
        self.expert_kernel_regularizer = regularizers.get(expert_kernel_regularizer)
        self.gate_kernel_regularizer = regularizers.get(gate_kernel_regularizer)
        self.expert_kernel_constraint = constraints.get(expert_kernel_constraint)
        self.gate_kernel_constraint = constraints.get(gate_kernel_constraint)

        # Activation parameter
        self.expert_activation = activations.get(expert_activation)
        self.gate_activation = activations.get(gate_activation)

        # Bias parameter
        self.expert_bias = None
        self.gate_bias = None
        self.use_expert_bias = use_expert_bias
        self.use_gate_bias = use_gate_bias
        self.expert_bias_initializer = initializers.get(expert_bias_initializer)
        self.gate_bias_initializer = initializers.get(gate_bias_initializer)
        self.expert_bias_regularizer = regularizers.get(expert_bias_regularizer)
        self.gate_bias_regularizer = regularizers.get(gate_bias_regularizer)
        self.expert_bias_constraint = constraints.get(expert_bias_constraint)
        self.gate_bias_constraint = constraints.get(gate_bias_constraint)

        # Activity parameter
        self.activity_regularizer = regularizers.get(activity_regularizer)

        # Keras parameter
        self.input_spec = InputSpec(min_ndim=2)
        self.supports_masking = True

        super(MMoE, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        Method for creating the layer weights.

        :param input_shape: Keras tensor (future input to layer)
                            or list/tuple of Keras tensors to reference
                            for weight shape computations
        """
        assert input_shape is not None and len(input_shape) >= 2

        input_dimension = input_shape[-1]

        # Initialize expert weights (number of input features * number of units per expert * number of experts)
        self.expert_kernels = self.add_weight(
            name='expert_kernel',
            shape=(input_dimension, self.units, self.num_experts),
            initializer=self.expert_kernel_initializer,
            regularizer=self.expert_kernel_regularizer,
            constraint=self.expert_kernel_constraint,
        )

        # Initialize expert bias (number of units per expert * number of experts)
        if self.use_expert_bias:
            self.expert_bias = self.add_weight(
                name='expert_bias',
                shape=(self.units, self.num_experts),
                initializer=self.expert_bias_initializer,
                regularizer=self.expert_bias_regularizer,
                constraint=self.expert_bias_constraint,
            )

        # Initialize gate weights: one (number of input features * number of experts) matrix per task
        self.gate_kernels = [self.add_weight(
            name='gate_kernel_task_{}'.format(i),
            shape=(input_dimension, self.num_experts),
            initializer=self.gate_kernel_initializer,
            regularizer=self.gate_kernel_regularizer,
            constraint=self.gate_kernel_constraint
        ) for i in range(self.num_tasks)]

        # Initialize gate bias: one vector of length (number of experts) per task
        if self.use_gate_bias:
            self.gate_bias = [self.add_weight(
                name='gate_bias_task_{}'.format(i),
                shape=(self.num_experts,),
                initializer=self.gate_bias_initializer,
                regularizer=self.gate_bias_regularizer,
                constraint=self.gate_bias_constraint
            ) for i in range(self.num_tasks)]

        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dimension})

        super(MMoE, self).build(input_shape)

    def call(self, inputs, **kwargs):
        """
        Method for the forward function of the layer.

        :param inputs: Input tensor
        :param kwargs: Additional keyword arguments for the base method
        :return: A list of tensors, one per task
        """
        gate_outputs = []
        final_outputs = []

        # f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper
        # expert_outputs has shape (batch_size, units per expert, number of experts)
        expert_outputs = K.tf.tensordot(a=inputs, b=self.expert_kernels, axes=1)
        # Add the bias term to the expert weights if necessary
        if self.use_expert_bias:
            expert_outputs = K.bias_add(x=expert_outputs, bias=self.expert_bias)
        expert_outputs = self.expert_activation(expert_outputs)

        # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
        # each gate_output has shape (batch_size, number of experts)
        for index, gate_kernel in enumerate(self.gate_kernels):
            gate_output = K.dot(x=inputs, y=gate_kernel)
            # Add the bias term to the gate weights if necessary
            if self.use_gate_bias:
                gate_output = K.bias_add(x=gate_output, bias=self.gate_bias[index])
            gate_output = self.gate_activation(gate_output)
            gate_outputs.append(gate_output)

        # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
        for gate_output in gate_outputs:
            # (batch_size, 1, number of experts), repeated along the units axis
            expanded_gate_output = K.expand_dims(gate_output, axis=1)
            weighted_expert_output = expert_outputs * K.repeat_elements(expanded_gate_output, self.units, axis=1)
            # Sum over the experts axis: (batch_size, units)
            final_outputs.append(K.sum(weighted_expert_output, axis=2))

        return final_outputs

    def compute_output_shape(self, input_shape):
        """
        Method for computing the output shape of the MMoE layer.

        :param input_shape: Shape tuple (tuple of integers)
        :return: List of output shape tuples, one per task
        """
        assert input_shape is not None and len(input_shape) >= 2

        output_shape = list(input_shape)
        output_shape[-1] = self.units
        output_shape = tuple(output_shape)

        return [output_shape for _ in range(self.num_tasks)]

    def get_config(self):
        """
        Method for returning the configuration of the MMoE layer.

        :return: Config dictionary
        """
        config = {
            'units': self.units,
            'num_experts': self.num_experts,
            'num_tasks': self.num_tasks,
            'use_expert_bias': self.use_expert_bias,
            'use_gate_bias': self.use_gate_bias,
            'expert_activation': activations.serialize(self.expert_activation),
            'gate_activation': activations.serialize(self.gate_activation),
            'expert_bias_initializer': initializers.serialize(self.expert_bias_initializer),
            'gate_bias_initializer': initializers.serialize(self.gate_bias_initializer),
            'expert_bias_regularizer': regularizers.serialize(self.expert_bias_regularizer),
            'gate_bias_regularizer': regularizers.serialize(self.gate_bias_regularizer),
            'expert_bias_constraint': constraints.serialize(self.expert_bias_constraint),
            'gate_bias_constraint': constraints.serialize(self.gate_bias_constraint),
            'expert_kernel_initializer': initializers.serialize(self.expert_kernel_initializer),
            'gate_kernel_initializer': initializers.serialize(self.gate_kernel_initializer),
            'expert_kernel_regularizer': regularizers.serialize(self.expert_kernel_regularizer),
            'gate_kernel_regularizer': regularizers.serialize(self.gate_kernel_regularizer),
            'expert_kernel_constraint': constraints.serialize(self.expert_kernel_constraint),
            'gate_kernel_constraint': constraints.serialize(self.gate_kernel_constraint),
            'activity_regularizer': regularizers.serialize(self.activity_regularizer)
        }
        base_config = super(MMoE, self).get_config()

        return dict(list(base_config.items()) + list(config.items()))
Output layers (tower layers)
# Build tower layer from MMoE layer
for index, task_layer in enumerate(mmoe_layers):
    tower_layer = Dense(
        units=8,
        activation='relu',
        kernel_initializer=VarianceScaling())(task_layer)
    output_layer = Dense(
        units=output_info[index][0],
        name=output_info[index][1],
        activation='softmax',
        kernel_initializer=VarianceScaling())(tower_layer)
    output_layers.append(output_layer)
4. Training the model
model = Model(inputs=[input_layer], outputs=output_layers)
adam_optimizer = Adam()
model.compile(
    # One loss per named output, so that both tasks are trained
    loss={'income': 'binary_crossentropy', 'marital': 'binary_crossentropy'},
    optimizer=adam_optimizer,
    metrics=['accuracy']
)

# Print out model architecture summary
model.summary()

# Train the model
# ROCCallback is a user-defined Keras Callback whose code is not listed in this post
model.fit(
    x=train_data,
    y=train_label,
    validation_data=(validation_data, validation_label),
    callbacks=[
        ROCCallback(
            training_data=(train_data, train_label),
            validation_data=(validation_data, validation_label),
            test_data=(test_data, test_label)
        )
    ],
    epochs=100
)
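The `ROCCallback` used above is not listed in this post. Judging by its name and by the `roc_auc_score` import, it presumably reports ROC AUC for each task. As a reminder of what that metric measures, here is a plain-Python sketch; the function name `pairwise_auc` is hypothetical, and the quadratic loop is only suitable for small inputs (in practice `sklearn.metrics.roc_auc_score` would be used).

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```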
References:
https://zhuanlan.zhihu.com/p/55752344
https://zhuanlan.zhihu.com/p/96796043
A detailed look at the multi-task learning model Multi-gate Mixture-of-Experts (MMoE, Google, KDD 2018)
MMoE paper notes (includes an explanation of the tensor dimensions)