22. Google's MMoE Multi-Task Learning Model (Repost)

This article is a repost. Posted 2020-07-09 16:24. Category: recommender system papers.

The paper was published in the KDD 2018 Research Track under the title Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts.

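I. Abstract

Multi-task learning is used in many applications, recommender systems among them. In movie recommendation, for example, a user may both purchase a movie and enjoy watching it, so a single model can predict the user's purchases and the ratings the user gives at the same time.

Multi-task learning is often sensitive to how related the tasks are, so it is important to model the trade-off between task-specific objectives and inter-task relationships.

The MMoE model learns the relationships among tasks. The paper adapts the Mixture-of-Experts (MoE) structure to multi-task learning by sharing expert subnetworks across all tasks, and trains a gating network for each task to optimize that task.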

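II. Introduction

Many DNN-based multi-task learning models are sensitive to factors such as differences in data distribution and the relationships among tasks; inherent conflicts arising from task differences can hurt the predictions of some tasks.
Some papers propose new modeling techniques to handle task differences in multi-task learning, but these techniques usually add more model parameters for each task, which increases the computational cost.
MMoE learns the relationships among tasks, learns task-specific functions, and automatically allocates parameters to capture either shared or task-specific information, without adding many new parameters for each task.

By learning the connections and differences among tasks, a multi-task model can improve the learning efficiency and quality of each task.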

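(1) Multi-task learning frameworks widely adopt the shared-bottom structure, in which different tasks share the bottom hidden layers.

This structure inherently reduces the risk of overfitting, but its performance can suffer from task differences and differences in data distribution.

(2) There are other structures as well. In one, two tasks do not share parameters, but an L2-norm constraint is placed between the parameters of the two tasks. In another, a separate set of hidden layers is learned for each task, followed by a learned combination of all the hidden layers.

Compared with the shared-bottom structure, these models add task-specific parameters, which improves the final result when task differences would otherwise interfere with the shared parameters.

The drawback is that the extra parameters require more training data, and the more complex models are harder to deploy in a real production environment.

The paper therefore proposes a multi-task learning structure called Multi-gate Mixture-of-Experts (MMoE). MMoE explicitly models task relationships and learns task-specific functions on top of shared representations, while avoiding a significant increase in parameters.

The MMoE structure (figure c below) builds on the widely used Shared-Bottom structure (figure a below) and the MoE structure; figure (b) is a special case of figure (c).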

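III. The General Multi-Task Learning Model

1. Framework:

As shown in figure (a) above, the shared-bottom network (denoted by the function $f$) sits at the bottom and is shared by all tasks. Above it, each of the $K$ tasks has its own tower network (denoted $h^k$), and the output of task $k$ is $y_k=h^k(f(x))$.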
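A minimal NumPy sketch of the shared-bottom forward pass $y_k=h^k(f(x))$ may help here. The layer sizes, the random weights, and the helper names `dense` and `shared_bottom_forward` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, relu=True):
    """One fully connected layer; the ReLU is optional."""
    z = x @ w + b
    return np.maximum(z, 0.0) if relu else z

def shared_bottom_forward(x, bottom, towers):
    """Compute y_k = h^k(f(x)) for every task k."""
    shared = dense(x, *bottom)                       # f(x), shared by all tasks
    return [dense(shared, w, b, relu=False) for w, b in towers]  # one h^k per task

d, hidden, num_tasks = 6, 4, 2                       # assumed sizes
bottom = (rng.normal(size=(d, hidden)), np.zeros(hidden))
towers = [(rng.normal(size=(hidden, 1)), np.zeros(1)) for _ in range(num_tasks)]

x = rng.normal(size=(5, d))                          # a batch of 5 examples
outputs = shared_bottom_forward(x, bottom, towers)
print([o.shape for o in outputs])                    # one (5, 1) prediction per task
```

Every tower reads the same `shared` tensor, which is why a conflict between tasks has to be resolved inside the single bottom network.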

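2. Task relatedness experiment

Next, an experiment examines how task relatedness affects multi-task learning performance.

Suppose the model has two regression tasks. The data are generated by sampling, with both tasks sharing the same input but having different labels. Task relatedness is then measured by the Pearson correlation coefficient between the labels: the larger the coefficient, the more related the tasks. The data are generated as follows:

First, two orthogonal unit vectors $u_1$ and $u_2$ are generated, and the model coefficients are derived from them as $w_1 = c\,u_1$ and $w_2 = c\,(p\,u_1 + \sqrt{1-p^2}\,u_2)$, as in step 2 of the figure above. The cosine similarity between $w_1$ and $w_2$ is then $p$, which can be verified from the cosine formula.

The input $x$ is then sampled from a normal distribution, and the labels are obtained from the following two equations:

$$y_1 = w_1^T x + \sum_{i=1}^{m} \sin(\alpha_i w_1^T x + \beta_i) + \epsilon_1$$

$$y_2 = w_2^T x + \sum_{i=1}^{m} \sin(\alpha_i w_2^T x + \beta_i) + \epsilon_2$$

Note that the relationship between $x$ and $y$ is not linear, because the second term of each equation is a sum of sine functions. The Pearson correlation between the labels therefore does not equal the cosine similarity between $w_1$ and $w_2$, but the two are positively correlated, as the figure below shows:

The paper therefore uses the cosine similarity of the weight vectors as an approximation of the relatedness between tasks.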
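The generation procedure can be sketched in NumPy as follows. This follows the paper's description, with $w_1 = c\,u_1$, $w_2 = c\,(p\,u_1 + \sqrt{1-p^2}\,u_2)$, and labels equal to a linear term plus a sum of sines plus noise. The function name `make_synthetic`, the default sizes, the scale `c`, the randomly drawn sine parameters `alpha` and `beta`, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def make_synthetic(p, d=100, n=10000, c=1.0, m=10, seed=0):
    """Two regression tasks whose weight vectors have cosine similarity p."""
    rng = np.random.default_rng(seed)
    # Two orthonormal vectors from the QR decomposition of a random matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
    u1, u2 = q[:, 0], q[:, 1]
    w1 = c * u1
    w2 = c * (p * u1 + np.sqrt(1.0 - p ** 2) * u2)
    x = rng.normal(size=(n, d))                      # inputs from N(0, I)
    alpha = rng.normal(size=m)                       # assumed sine parameters
    beta = rng.normal(size=m)

    def label(w):
        z = x @ w                                    # linear part w^T x
        sines = np.sin(np.outer(z, alpha) + beta).sum(axis=1)
        return z + sines + rng.normal(scale=0.1, size=n)  # assumed noise std

    return x, label(w1), label(w2), w1, w2

for p in (0.5, 0.9, 1.0):
    x, y1, y2, w1, w2 = make_synthetic(p)
    cosine = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
    pearson = np.corrcoef(y1, y2)[0, 1]
    print(f"p={p:.1f}  cosine={cosine:.3f}  pearson={pearson:.3f}")
```

The printed cosine value matches `p` by construction, while the Pearson correlation of the labels is a separate quantity that depends on the sine terms and the noise.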

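3. Experimental results

Using the data generation procedure and the relatedness measure above, the multi-task model is evaluated at task correlations of 0.5, 0.9, and 1, as shown in the figure below:

As task relatedness increases, the model's loss decreases and performance improves, which confirms the earlier conjecture.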

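IV. The MMoE Model

1. The MoE model

First consider the Mixture-of-Experts (MoE) model, which the paper later calls One-gate Mixture-of-Experts (OMoE), shown in the figure below:

Compared with the general multi-task framework, the shared bottom is split into several experts, and a gate is added so that different inputs can use the shared layer in different ways. The output of the shared layer can then be written as:

$$y = \sum_{i=1}^{n} g(x)_i f_i(x)$$

Here $f_i,i=1,\cdots,n$ are the $n$ expert networks (each can be regarded as a neural network), and $f_i(x)$ is the output of the $i$-th expert. $g(x)_i$ is the weight of the $i$-th expert, computed from the input as $g(x) = \mathrm{softmax}(W_g x)$, so that $\sum_{i=1}^{n}{g(x)_i}=1$. $g$ is the gating network that combines the experts' results: it produces a probability distribution over the $n$ experts, and the final output is the weighted sum of all expert outputs. MoE can thus be viewed as an ensemble method built on several individual models.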
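A small NumPy sketch of this one-gate MoE layer, assuming single-layer ReLU experts without bias; the sizes and the names `softmax` and `moe_forward` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, expert_weights, gate_weight):
    """y = sum_i g(x)_i * f_i(x) with g(x) = softmax(W_g x)."""
    # Stack expert outputs into shape (n, batch, units).
    expert_outputs = np.stack([np.maximum(x @ w, 0.0) for w in expert_weights])
    gate = softmax(x @ gate_weight)          # (batch, n); each row sums to 1
    # Weight every expert's output by its gate value and sum over experts.
    return np.einsum('bn,nbu->bu', gate, expert_outputs), gate

rng = np.random.default_rng(0)
d, units, n = 6, 4, 3                        # assumed sizes
experts = [rng.normal(size=(d, units)) for _ in range(n)]
w_g = rng.normal(size=(d, n))
y, gate = moe_forward(rng.normal(size=(5, d)), experts, w_g)
print(y.shape, gate.sum(axis=1))
```

Because the gate depends on `x`, each example in the batch gets its own mixture of experts.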

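Some later papers use MoE as a basic building block and stack multiple MoE layers in one large network: an MoE layer takes the output of the previous MoE layer as its input, and its own output feeds the next layer.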

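2. The MMoE model

The goal of the proposed model (MMoE) is to capture task differences without significantly increasing the number of parameters relative to the shared-bottom structure. The core idea is to replace the function $f$ of the shared-bottom network with an MoE layer.

Unlike the MoE model, Multi-gate Mixture-of-Experts (MMoE) gives every task its own gate, so that different tasks and different inputs can use the shared layer in different ways. The model structure is shown below:

Each task now gets a different shared-layer output. For task $k$ it is computed as:

$$f^k(x) = \sum_{i=1}^{n} g^k(x)_i f_i(x), \qquad g^k(x) = \mathrm{softmax}(W_{gk} x)$$

The input of each gating network is the input feature vector, and its output is a set of weights over all experts. Because the gating networks are usually lightweight and the expert networks are shared by all tasks, MMoE has an advantage in computation and parameter count over some of the baseline methods discussed in the paper.

Each task's shared-layer output then passes through a multi-layer fully connected network (the tower) to produce the task's output:

$$y_k = h^k(f^k(x))$$

Intuitively, if two tasks are not closely related, their gates produce quite different weights, so each task draws on a different subset of the experts' outputs, which approximates several single-task models. If two tasks are closely related, their gates produce similar weight distributions, which resembles the general multi-task framework.

In contrast to sharing a single gating network across all tasks (the One-gate MoE model, figure b above), MMoE (figure c above) uses a separate gating network per task. Each task's gating network makes selective use of the experts through the weights it outputs. Because the gating networks of different tasks can learn different ways of combining the experts, the model captures both the relatedness and the differences among tasks.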
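The difference between one gate and one gate per task can be illustrated with a short NumPy sketch. It assumes single-layer ReLU experts without bias; the sizes and the name `mmoe_forward` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_weights, gate_weights):
    """Return one mixture of the shared experts per task."""
    # Shared expert outputs, shape (batch, units, n).
    expert_outputs = np.stack(
        [np.maximum(x @ w, 0.0) for w in expert_weights], axis=-1)
    outputs = []
    for w_gk in gate_weights:                        # one gate per task
        gate = softmax(x @ w_gk)                     # (batch, n)
        outputs.append((expert_outputs * gate[:, None, :]).sum(axis=-1))
    return outputs

rng = np.random.default_rng(0)
d, units, n = 6, 4, 3                                # assumed sizes
experts = [rng.normal(size=(d, units)) for _ in range(n)]
x = rng.normal(size=(5, d))

# Separate gates: each task mixes the shared experts in its own way.
gates = [rng.normal(size=(d, n)) for _ in range(2)]
f1, f2 = mmoe_forward(x, experts, gates)

# The same gate for every task: the layer collapses to a one-gate MoE.
s1, s2 = mmoe_forward(x, experts, [gates[0], gates[0]])
print(np.allclose(f1, f2), np.allclose(s1, s2))
```

The experts are computed once and reused by every task; only the small gate matrices are task-specific.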

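Each expert in the network is a slice of one larger sub-network, so in the implementation the experts can be stored as a single three-dimensional tensor of shape:

dim of input feature * number of units per expert * number of experts

Training updates this three-dimensional tensor.

The gate has shape:
dim of input feature * number of experts * number of tasks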
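As a quick check of these shapes, the parameter count of the layer can be worked out directly. The sketch below uses the settings that appear in the demo code of this article (499 input features, 4 units per expert, 8 experts, 2 tasks) and assumes bias shapes of units * experts for the experts and one value per expert for each gate, as in the Keras layer's `build` method; the helper name `mmoe_param_counts` is made up.

```python
def mmoe_param_counts(input_dim, units, num_experts, num_tasks, use_bias=True):
    """Count the trainable parameters of a single-layer MMoE block."""
    counts = {
        'expert_kernel': input_dim * units * num_experts,
        'gate_kernel': input_dim * num_experts * num_tasks,
        'expert_bias': units * num_experts if use_bias else 0,
        'gate_bias': num_experts * num_tasks if use_bias else 0,
    }
    counts['total'] = sum(counts.values())
    return counts

counts = mmoe_param_counts(input_dim=499, units=4, num_experts=8, num_tasks=2)
print(counts)
# {'expert_kernel': 15968, 'gate_kernel': 7984, 'expert_bias': 32, 'gate_bias': 16, 'total': 24000}
```

Adding a task adds only one more gate (499 * 8 weights plus 8 biases here); the expert tensor is unchanged.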

A few small implementation details of the network are listed here for reference:

f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper

g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
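To see how a single 3-D kernel evaluates all experts at once, here is a NumPy sketch comparing a `tensordot` over the feature axis with an explicit loop over experts. The sizes and random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d, units, n = 5, 6, 4, 3
x = rng.normal(size=(batch, d))
expert_kernels = rng.normal(size=(d, units, n))      # one 3-D tensor holds all experts
expert_bias = rng.normal(size=(units, n))

# Vectorized: contract the feature axis of x with the first axis of the kernel.
vectorized = np.maximum(np.tensordot(x, expert_kernels, axes=1) + expert_bias, 0.0)

# Loop: apply f_i(x) = ReLU(W_i x + b_i) one expert at a time.
looped = np.stack(
    [np.maximum(x @ expert_kernels[:, :, i] + expert_bias[:, i], 0.0) for i in range(n)],
    axis=-1,
)
print(vectorized.shape, np.allclose(vectorized, looped))   # (5, 4, 3) True
```

The result has shape (batch, units per expert, number of experts), so summing over the last axis after multiplying by a gate's weights gives $f^k(x)$.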

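V. Experimental Results

1. Synthetic dataset

The figure below shows the results; OMoE is the one-gate MoE. On highly correlated data, OMoE and MMoE perform similarly, but on weakly correlated data MMoE clearly outperforms the other two methods.

2. UCI census-income dataset

3. Large-scale Content Recommendation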

VI. Main Code

1. Imports

import pandas as pd
from keras.utils import to_categorical
from keras import backend as K
from keras.optimizers import Adam
from keras.initializers import VarianceScaling
from keras.layers import Input, Dense
from keras.models import Model
from keras.callbacks import Callback
from sklearn.metrics import roc_auc_score

import numpy as np
import random

import tensorflow as tf
from mmoe import MMoE  # model code (the MMoE layer class listed in section 3)

SEED = 1

# Fix numpy seed for reproducibility
np.random.seed(SEED)

# Fix random seed for reproducibility
random.seed(SEED)

# Fix TensorFlow graph-level seed for reproducibility
tf.set_random_seed(SEED)


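# Set up the TensorFlow session (TensorFlow 1.x API, consistent with tf.set_random_seed above)
tf_session = tf.Session(graph=tf.get_default_graph())
K.set_session(tf_session)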

2. Loading the data: 1994 income data

column_names = ['age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education', 'wage_per_hour', 'hs_college',
                'marital_stat', 'major_ind_code', 'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
                'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses', 'stock_dividends',
                'tax_filer_stat', 'region_prev_res', 'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ',
                'instance_weight', 'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
                'num_emp', 'fam_under_18', 'country_father', 'country_mother', 'country_self', 'citizenship',
                'own_or_self', 'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k']

# Load the dataset in Pandas
train_df = pd.read_csv(
    'data/census-income.data.gz',
    delimiter=',',
    header=None,
    index_col=None,
    names=column_names
)
other_df = pd.read_csv(
    'data/census-income.test.gz',
    delimiter=',',
    header=None,
    index_col=None,
    names=column_names
)

Splitting features and labels

label_columns = ['income_50k', 'marital_stat']

# One-hot encoding categorical columns
categorical_columns = ['class_worker', 'det_ind_code', 'det_occ_code', 'education', 'hs_college', 'major_ind_code',
                       'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member', 'unemp_reason',
                       'full_or_part_emp', 'tax_filer_stat', 'region_prev_res', 'state_prev_res', 'det_hh_fam_stat',
                       'det_hh_summ', 'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
                       'fam_under_18', 'country_father', 'country_mother', 'country_self', 'citizenship',
                       'vet_question']
train_raw_labels = train_df[label_columns]
other_raw_labels = other_df[label_columns]
transformed_train = pd.get_dummies(train_df.drop(label_columns, axis=1), columns=categorical_columns)
transformed_other = pd.get_dummies(other_df.drop(label_columns, axis=1), columns=categorical_columns)
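One-hot encoding the two files separately can produce different column sets when a category occurs in only one of them. A general way to align them, sketched here on made-up toy data, is to reindex the second frame against the training columns; the frames `train` and `other` and their values are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'age': [30, 41, 25], 'color': ['red', 'green', 'blue']})
other = pd.DataFrame({'age': [52, 19], 'color': ['red', 'green']})   # 'blue' never appears

train_x = pd.get_dummies(train, columns=['color'])
other_x = pd.get_dummies(other, columns=['color'])

# Add columns seen only in training (filled with 0), drop columns unknown to
# training, and put everything in the training column order.
other_x = other_x.reindex(columns=train_x.columns, fill_value=0)

print(list(other_x.columns) == list(train_x.columns))
```

This keeps the feature dimension identical across splits without naming the missing column by hand.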

Building the labels

transformed_other['det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0

# One-hot encoding categorical labels
train_income = to_categorical((train_raw_labels.income_50k == ' 50000+.').astype(int), num_classes=2)   # income above 50,000 -> 1, otherwise 0
train_marital = to_categorical((train_raw_labels.marital_stat == ' Never married').astype(int), num_classes=2)  # Never married -> 1, any other status -> 0

other_income = to_categorical((other_raw_labels.income_50k == ' 50000+.').astype(int), num_classes=2) 
other_marital = to_categorical((other_raw_labels.marital_stat == ' Never married').astype(int), num_classes=2)

dict_outputs = {
    'income': train_income.shape[1],
    'marital': train_marital.shape[1]
}  ## dict_outputs = {'income' : 2, 'marital' : 2}

dict_train_labels = { 'income': train_income, 'marital': train_marital } 
dict_other_labels = { 'income': other_income, 'marital': other_marital } 
output_info = [(dict_outputs[key], key) for key in sorted(dict_outputs.keys())]  ## output_info = [(2, 'income'), (2, 'marital')]
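For readers unfamiliar with `to_categorical`, the label encoding above amounts to a comparison followed by a one-hot lookup. The NumPy sketch below reproduces the same layout on a made-up sample of `income_50k` strings; the sample values are assumptions.

```python
import numpy as np

raw = np.array([' - 50000.', ' 50000+.', ' - 50000.'])   # made-up sample values
binary = (raw == ' 50000+.').astype(int)                  # 1 if income is above 50,000
one_hot = np.eye(2, dtype=int)[binary]                    # column 0: class 0, column 1: class 1
print(one_hot.tolist())                                   # [[1, 0], [0, 1], [1, 0]]
```

Each task therefore has a two-column target, which is why `dict_outputs` maps both `income` and `marital` to 2.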

Splitting into validation, test, and training sets

# Split the other dataset into 1:1 validation to test according to the paper
validation_indices = transformed_other.sample(frac=0.5, replace=False, random_state=SEED).index
test_indices = list(set(transformed_other.index) - set(validation_indices))
validation_data = transformed_other.iloc[validation_indices]
validation_label = [dict_other_labels[key][validation_indices] for key in sorted(dict_other_labels.keys())]
test_data = transformed_other.iloc[test_indices]
test_label = [dict_other_labels[key][test_indices] for key in sorted(dict_other_labels.keys())]
train_data = transformed_train
train_label = [dict_train_labels[key] for key in sorted(dict_train_labels.keys())]

num_features = train_data.shape[1]
print('Training data shape = {}'.format(train_data.shape))
print('Validation data shape = {}'.format(validation_data.shape))
print('Test data shape = {}'.format(test_data.shape))


############
# Training data shape = (199523, 499)
# Validation data shape = (49881, 499)
# Test data shape = (49881, 499)

3. Building the model

Input layer

input_layer = Input(shape=(num_features,))

MMoE layer

mmoe_layers = MMoE(
    units=4,
    num_experts=8,
    num_tasks=2
)(input_layer)

output_layers = []

The MMoE layer class:

from keras import backend as K
from keras import activations, initializers, regularizers, constraints
from keras.engine.topology import Layer, InputSpec


class MMoE(Layer):
    """
    Multi-gate Mixture-of-Experts model.
    """

    def __init__(self,
                 units,
                 num_experts,
                 num_tasks,
                 use_expert_bias=True,
                 use_gate_bias=True,
                 expert_activation='relu',
                 gate_activation='softmax',
                 expert_bias_initializer='zeros',
                 gate_bias_initializer='zeros',
                 expert_bias_regularizer=None,
                 gate_bias_regularizer=None,
                 expert_bias_constraint=None,
                 gate_bias_constraint=None,
                 expert_kernel_initializer='VarianceScaling',
                 gate_kernel_initializer='VarianceScaling',
                 expert_kernel_regularizer=None,
                 gate_kernel_regularizer=None,
                 expert_kernel_constraint=None,
                 gate_kernel_constraint=None,
                 activity_regularizer=None,
                 **kwargs):
        """
         Method for instantiating MMoE layer.

        :param units: Number of hidden units
        :param num_experts: Number of experts
        :param num_tasks: Number of tasks
        :param use_expert_bias: Boolean to indicate the usage of bias in the expert weights
        :param use_gate_bias: Boolean to indicate the usage of bias in the gate weights
        :param expert_activation: Activation function of the expert weights
        :param gate_activation: Activation function of the gate weights
        :param expert_bias_initializer: Initializer for the expert bias
        :param gate_bias_initializer: Initializer for the gate bias
        :param expert_bias_regularizer: Regularizer for the expert bias
        :param gate_bias_regularizer: Regularizer for the gate bias
        :param expert_bias_constraint: Constraint for the expert bias
        :param gate_bias_constraint: Constraint for the gate bias
        :param expert_kernel_initializer: Initializer for the expert weights
        :param gate_kernel_initializer: Initializer for the gate weights
        :param expert_kernel_regularizer: Regularizer for the expert weights
        :param gate_kernel_regularizer: Regularizer for the gate weights
        :param expert_kernel_constraint: Constraint for the expert weights
        :param gate_kernel_constraint: Constraint for the gate weights
        :param activity_regularizer: Regularizer for the activity
        :param kwargs: Additional keyword arguments for the Layer class
        """
        # Hidden nodes parameter
        self.units = units
        self.num_experts = num_experts
        self.num_tasks = num_tasks

        # Weight parameter
        self.expert_kernels = None
        self.gate_kernels = None
        self.expert_kernel_initializer = initializers.get(expert_kernel_initializer)
        self.gate_kernel_initializer = initializers.get(gate_kernel_initializer)
        self.expert_kernel_regularizer = regularizers.get(expert_kernel_regularizer)
        self.gate_kernel_regularizer = regularizers.get(gate_kernel_regularizer)
        self.expert_kernel_constraint = constraints.get(expert_kernel_constraint)
        self.gate_kernel_constraint = constraints.get(gate_kernel_constraint)

        # Activation parameter
        self.expert_activation = activations.get(expert_activation)
        self.gate_activation = activations.get(gate_activation)

        # Bias parameter
        self.expert_bias = None
        self.gate_bias = None
        self.use_expert_bias = use_expert_bias
        self.use_gate_bias = use_gate_bias
        self.expert_bias_initializer = initializers.get(expert_bias_initializer)
        self.gate_bias_initializer = initializers.get(gate_bias_initializer)
        self.expert_bias_regularizer = regularizers.get(expert_bias_regularizer)
        self.gate_bias_regularizer = regularizers.get(gate_bias_regularizer)
        self.expert_bias_constraint = constraints.get(expert_bias_constraint)
        self.gate_bias_constraint = constraints.get(gate_bias_constraint)

        # Activity parameter
        self.activity_regularizer = regularizers.get(activity_regularizer)

        # Keras parameter
        self.input_spec = InputSpec(min_ndim=2)
        self.supports_masking = True

        super(MMoE, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        Method for creating the layer weights.

        :param input_shape: Keras tensor (future input to layer)
                            or list/tuple of Keras tensors to reference
                            for weight shape computations
        """
        assert input_shape is not None and len(input_shape) >= 2

        input_dimension = input_shape[-1]

        # Initialize expert weights (number of input features * number of units per expert * number of experts)
        self.expert_kernels = self.add_weight(
            name='expert_kernel',
            shape=(input_dimension, self.units, self.num_experts),
            initializer=self.expert_kernel_initializer,
            regularizer=self.expert_kernel_regularizer,
            constraint=self.expert_kernel_constraint,
        )

        # Initialize expert bias (number of units per expert * number of experts)
        if self.use_expert_bias:
            self.expert_bias = self.add_weight(
                name='expert_bias',
                shape=(self.units, self.num_experts),
                initializer=self.expert_bias_initializer,
                regularizer=self.expert_bias_regularizer,
                constraint=self.expert_bias_constraint,
            )

        # Initialize gate weights (number of input features * number of experts * number of tasks)
        self.gate_kernels = [self.add_weight(
            name='gate_kernel_task_{}'.format(i),
            shape=(input_dimension, self.num_experts),
            initializer=self.gate_kernel_initializer,
            regularizer=self.gate_kernel_regularizer,
            constraint=self.gate_kernel_constraint
        ) for i in range(self.num_tasks)]

        # Initialize gate bias (number of experts * number of tasks)
        if self.use_gate_bias:
            self.gate_bias = [self.add_weight(
                name='gate_bias_task_{}'.format(i),
                shape=(self.num_experts,),
                initializer=self.gate_bias_initializer,
                regularizer=self.gate_bias_regularizer,
                constraint=self.gate_bias_constraint
            ) for i in range(self.num_tasks)]

        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dimension})

        super(MMoE, self).build(input_shape)

    def call(self, inputs, **kwargs):
        """
        Method for the forward function of the layer.

        :param inputs: Input tensor
        :param kwargs: Additional keyword arguments for the base method
        :return: A tensor
        """
        gate_outputs = []
        final_outputs = []

        # f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper; expert_outputs has shape (batch_size, units per expert, number of experts)
        expert_outputs = K.tf.tensordot(a=inputs, b=self.expert_kernels, axes=1)
        # Add the bias term to the expert weights if necessary
        if self.use_expert_bias:
            expert_outputs = K.bias_add(x=expert_outputs, bias=self.expert_bias)
        expert_outputs = self.expert_activation(expert_outputs)

        # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper; each gate_output has shape (batch_size, number of experts)
        for index, gate_kernel in enumerate(self.gate_kernels):
            gate_output = K.dot(x=inputs, y=gate_kernel)
            # Add the bias term to the gate weights if necessary
            if self.use_gate_bias:
                gate_output = K.bias_add(x=gate_output, bias=self.gate_bias[index])
            gate_output = self.gate_activation(gate_output)
            gate_outputs.append(gate_output)

        # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
        for gate_output in gate_outputs:
            expanded_gate_output = K.expand_dims(gate_output, axis=1)
            weighted_expert_output = expert_outputs * K.repeat_elements(expanded_gate_output, self.units, axis=1)
            final_outputs.append(K.sum(weighted_expert_output, axis=2))

        return final_outputs

    def compute_output_shape(self, input_shape):
        """
        Method for computing the output shape of the MMoE layer.

        :param input_shape: Shape tuple (tuple of integers)
        :return: List of output shape tuples, one per task
        """
        assert input_shape is not None and len(input_shape) >= 2

        output_shape = list(input_shape)
        output_shape[-1] = self.units
        output_shape = tuple(output_shape)

        return [output_shape for _ in range(self.num_tasks)]

    def get_config(self):
        """
        Method for returning the configuration of the MMoE layer.

        :return: Config dictionary
        """
        config = {
            'units': self.units,
            'num_experts': self.num_experts,
            'num_tasks': self.num_tasks,
            'use_expert_bias': self.use_expert_bias,
            'use_gate_bias': self.use_gate_bias,
            'expert_activation': activations.serialize(self.expert_activation),
            'gate_activation': activations.serialize(self.gate_activation),
            'expert_bias_initializer': initializers.serialize(self.expert_bias_initializer),
            'gate_bias_initializer': initializers.serialize(self.gate_bias_initializer),
            'expert_bias_regularizer': regularizers.serialize(self.expert_bias_regularizer),
            'gate_bias_regularizer': regularizers.serialize(self.gate_bias_regularizer),
            'expert_bias_constraint': constraints.serialize(self.expert_bias_constraint),
            'gate_bias_constraint': constraints.serialize(self.gate_bias_constraint),
            'expert_kernel_initializer': initializers.serialize(self.expert_kernel_initializer),
            'gate_kernel_initializer': initializers.serialize(self.gate_kernel_initializer),
            'expert_kernel_regularizer': regularizers.serialize(self.expert_kernel_regularizer),
            'gate_kernel_regularizer': regularizers.serialize(self.gate_kernel_regularizer),
            'expert_kernel_constraint': constraints.serialize(self.expert_kernel_constraint),
            'gate_kernel_constraint': constraints.serialize(self.gate_kernel_constraint),
            'activity_regularizer': regularizers.serialize(self.activity_regularizer)
        }
        base_config = super(MMoE, self).get_config()

        return dict(list(base_config.items()) + list(config.items()))

Output layers (tower layers)

# Build tower layer from MMoE layer
for index, task_layer in enumerate(mmoe_layers):
    tower_layer = Dense(
        units=8,
        activation='relu',
        kernel_initializer=VarianceScaling())(task_layer)
    output_layer = Dense(
        units=output_info[index][0],
        name=output_info[index][1],
        activation='softmax',
        kernel_initializer=VarianceScaling())(tower_layer)
    output_layers.append(output_layer)

4. Training the model

model = Model(inputs=[input_layer], outputs=output_layers)
adam_optimizer = Adam()
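model.compile(
    # Both outputs need a loss; with only 'income' listed, the 'marital' tower would not be trained
    loss={'income': 'binary_crossentropy', 'marital': 'binary_crossentropy'},
    optimizer=adam_optimizer,
    metrics=['accuracy']
)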
# Print out model architecture summary
model.summary()

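# Train the model
# NOTE: ROCCallback is not defined in this post; a keras.callbacks.Callback subclass
# with that name (presumably reporting ROC-AUC per task) must be defined first.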
model.fit(
    x=train_data,
    y=train_label,
    validation_data=(validation_data, validation_label),
    callbacks=[
        ROCCallback(
            training_data=(train_data, train_label),
            validation_data=(validation_data, validation_label),
            test_data=(test_data, test_label)
        )
    ],
    epochs=100
)
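The training call passes a `ROCCallback` whose definition is not shown. As a rough illustration of the metric such a callback would presumably report after each epoch, the sketch below computes ROC-AUC per task from two-column softmax outputs using the rank-sum formulation. The function names `roc_auc` and `auc_per_task` and the toy scores are assumptions; in practice `sklearn.metrics.roc_auc_score`, already imported above, does the same job.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum formulation; tied scores get average ranks."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(y_score, kind='mergesort')
    sorted_scores = y_score[order]
    ranks = np.empty(len(y_score), dtype=float)
    i = 0
    while i < len(sorted_scores):
        j = i
        while j + 1 < len(sorted_scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0   # average 1-based rank of the tie group
        i = j + 1
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def auc_per_task(predictions, labels, names):
    """predictions and labels hold one (n, 2) array per task; column 1 is the positive class."""
    return {name: roc_auc(lab[:, 1], pred[:, 1])
            for name, pred, lab in zip(names, predictions, labels)}

scores = np.array([0.1, 0.4, 0.35, 0.8])              # toy positive-class probabilities
preds = np.column_stack([1.0 - scores, scores])
labels = np.eye(2)[[0, 0, 1, 1]]                      # one-hot labels, as in the demo
print(auc_per_task([preds], [labels], ['income']))    # {'income': 0.75}
```

A callback built on this would call `model.predict` on the training, validation, and test sets at the end of each epoch and print one AUC per task.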

References:

https://zhuanlan.zhihu.com/p/55752344

https://zhuanlan.zhihu.com/p/96796043

A detailed look at a multi-task learning model: Multi-gate Mixture-of-Experts (MMoE, Google, KDD 2018)

MMoE paper notes (with an explanation of the dimensions in the paper)
