Recommender Systems Series (4): PNN Theory and Practice

Reposted article; originally published 2019-11-01 11:21. Tags: PNN / machine learning / recommender systems / machine learning & deep learning

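Background

The previous article introduced FNN [2], which adds a DNN on top of FM to combine features at higher order and improve model performance. FNN is not perfect, however. To address its shortcomings, Shanghai Jiao Tong University and UCL jointly proposed an improved model in 2016: PNN (Product-based Neural Network).

Like FNN, PNN uses a DNN to combine low-order features. Unlike FNN, it does not rely on fully connected layers alone; it adds a product layer that performs finer-grained feature crossing. As noted in "Recommender Systems Series (3): FNN Theory and Practice", if activation functions are ignored, combining features through fully connected layers is equivalent to taking a weighted sum of all features. The PNN authors recognized the same problem and argued that the "add" operation is not sufficient to capture the correlations between features from different fields. In their words [1]: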

the “add” operations of the perceptron layer might not be useful to explore the interactions of categorical data in multiple fields.
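The remark about fully connected layers can be checked numerically. The following is a minimal NumPy sketch, with illustrative sizes and random weights that are assumptions rather than anything from the paper: two stacked layers without activation collapse into a single weighted sum of the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)          # a concatenated low-order feature vector (illustrative)
W1 = rng.normal(size=(4, 5))    # first fully connected layer, no activation
W2 = rng.normal(size=(3, 4))    # second fully connected layer, no activation

stacked = W2 @ (W1 @ x)         # the two layers applied one after the other
collapsed = (W2 @ W1) @ x       # a single equivalent weighted sum of the inputs

print(np.allclose(stacked, collapsed))
```

Without a nonlinearity or an explicit product, no amount of stacking produces a term such as $x_i x_j$, which is the gap the product layer is meant to fill.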

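Studies have shown that the "product" operation is more effective than "add", and the strength of FM comes precisely from inner products of feature vectors. On this basis, the PNN authors designed a product layer to combine features, with two variants: inner product and outer product. The experiments in [1] report a significant improvement, and the product layer has since become a classic building block in deep recommendation models.

Analysis

1. PNN Architecture

The PNN network architecture is shown in the figure below:

Reading from top to bottom, the top layer outputs the predicted CTR, $\hat{y}=\sigma(W_3l_2+b_3)$. The notation follows the original paper.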

Second hidden layer: $l_2=relu(W_2l_1+b_2)$

First hidden layer: $l_1=relu(l_z+l_p+b_1)$

The core of PNN is computing $l_z$ and $l_p$. First, define the tensor inner product $A \bigodot B \triangleq \sum_{i,j}A_{i,j}B_{i,j}$

Then:

\[\begin{align} l_z=(l_{z}^1,l_{z}^2,\dots,l_{z}^n,\dots,l_{z}^{D_1}), \qquad l_z^n=W_z^n \bigodot z \notag \\ \end{align} \tag{1} \]

\[\begin{align} l_p=(l_{p}^1,l_{p}^2,\dots,l_{p}^n,\dots,l_{p}^{D_1}), \qquad l_p^n=W_p^n \bigodot p \notag \\ \end{align} \tag{2} \]

where:

\[\begin{align} z=(z_1,z_2,\dots,z_N) \triangleq (f_1,f_2,\dots,f_N) \notag \\ \end{align} \tag{3} \]

\[\begin{align} p=\{p_{i,j}\},i=1 \dots N,j=1 \dots N \notag \\ \end{align} \tag{4} \]

Combining Eqs. (1) and (3) gives:

\[\begin{align} l_z^n=W_z^n \bigodot z=\sum_{i=1}^N\sum_{j=1}^M(W_z^n)_{i,j}z_{i,j} \notag \\ \end{align} \tag{5} \]
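Eq. (5) can be sketched in a few lines of NumPy. The sizes and random values below are illustrative assumptions; `tensor_inner` is a hypothetical helper name for the $\bigodot$ operation defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D1 = 4, 3, 5                  # fields, embedding size, width of l_1 (illustrative)
f = rng.normal(size=(N, M))         # z = (f_1, ..., f_N), one embedding per field
W_z = rng.normal(size=(D1, N, M))   # one weight tensor W_z^n per output unit

def tensor_inner(A, B):
    """Element-wise product summed over all entries (the article's A ⊙ B)."""
    return np.sum(A * B)

# Eq. (5), one output unit at a time
l_z = np.array([tensor_inner(W_z[n], f) for n in range(D1)])

# The same result as a single matrix product over the flattened embeddings
l_z_flat = W_z.reshape(D1, N * M) @ f.reshape(N * M)
```

The flattened form is why $l_z$ amounts to an ordinary linear layer over the concatenated embeddings.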

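In Eq. (3), $f_i \in \mathbb{R}^M$ is the feature vector of field $i$ after embedding; the embedding step is the same as in FNN. From the PNN architecture figure together with Eqs. (1) and (3), this part of the computation mainly serves to retain low-order features. FNN, by contrast, discards the low-order features and only builds higher-order combinations on top of the second-order ones, and PNN is clearly designed to fix this shortcoming.

In Eq. (4), $p_{i,j}=g(f_i,f_j)$ is the pairwise feature-crossing function; different choices of $g$ yield different PNN variants. The paper uses two: the vector inner product, giving IPNN (Inner Product-based Neural Network), and the vector outer product, giving OPNN (Outer Product-based Neural Network).

1.1 IPNN Analysis

Define $p_{i,j}=g(f_i,f_j)=\langle f_i,f_j \rangle$ and rewrite Eq. (2) as: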

\[\begin{align} l_p^n=W_p^n \bigodot p=\sum_{i=1}^N\sum_{j=1}^N(W_p^n)_{i,j}p_{i,j}= \sum_{i=1}^N\sum_{j=1}^N(W_p^n)_{i,j}\langle f_i,f_j \rangle \notag \\ \end{align} \tag{6} \]
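As a rough illustration of Eq. (6) before any optimization, the sketch below builds the full $N \times N$ matrix $p$ and contracts it with each $W_p^n$. Sizes and random values are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D1 = 4, 3, 5                  # fields, embedding size, width of l_1 (illustrative)
f = rng.normal(size=(N, M))         # embeddings f_1, ..., f_N
W_p = rng.normal(size=(D1, N, N))   # one N x N weight matrix W_p^n per output unit

p = f @ f.T                              # p[i, j] = <f_i, f_j>, shape (N, N)
l_p = np.einsum('nij,ij->n', W_p, p)     # Eq. (6) for every n at once, shape (D1,)
```

Storing `W_p` is what drives the $N^2$ term in the cost of this unoptimized form.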

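Space complexity of the IPNN product layer:

From Eqs. (1) and (5), computing $l_z$ takes $O(D_1NM)$ space. From Eqs. (2) and (6), $p$ takes $O(N^2)$ space and each $l_p^n$ takes $O(N^2)$ space (for its weight matrix $W_p^n$), so $l_p$ takes $O(D_1N^2)$ space. The product layer as a whole therefore takes $O(D_1N(M+N))$ space.

Time complexity of the IPNN product layer:

From Eqs. (1) and (5), computing $l_z$ takes $O(D_1NM)$ time. From Eqs. (2) and (6), each $p_{i,j}$ takes $O(M)$ time, so $p$ takes $O(N^2M)$ time; each $l_p^n$ then takes $O(N^2)$ time, so $l_p$ takes $O(N^2(M+D_1))$ time. The product layer as a whole therefore takes $O(N^2(M+D_1))$ time.

Computational optimization

Such high time and space complexity is impractical in production, so the computation has to be optimized. Since $l_z$ is cheap to compute, the focus is on $l_p$, or more precisely on Eq. (6).

Inspired by the parameter-matrix factorization in FM, and because $p$ and $W_p^n$ are both symmetric square matrices, a first-order (rank-one) factorization is used: assume $W_p^n=\theta^n\theta^{nT}$ with $\theta^n \in \mathbb{R}^N$. The matrix $W_p^n$ with $N \times N$ parameters is thereby replaced by the vector $\theta^n$ with $N$ parameters, and Eq. (6) becomes: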

\[\begin{align} l_p^n ={} & W_p^n \bigodot p =\sum_{i=1}^N\sum_{j=1}^N(W_p^n)_{i,j}\langle f_i,f_j \rangle \notag \\ ={} & \sum_{i=1}^N\sum_{j=1}^N \theta_i^n \theta_j^n \langle f_i,f_j \rangle \notag \\ ={} &\sum_{i=1}^N\sum_{j=1}^N \langle \theta_i^nf_i, \theta_j^nf_j \rangle \notag \\ ={} & \langle \sum_{i=1}^N\theta_i^nf_i, \sum_{j=1}^N\theta_j^nf_j \rangle \notag \\ ={} & \langle \sum_{i=1}^N\delta_i^n, \sum_{j=1}^N\delta_j^n \rangle \notag \\ ={} & \Vert \sum_{i=1}^N\delta_i^n \Vert^2 \notag \\ \end{align} \tag{7} \]

where $\delta_i^n=\theta_i^nf_i$ and $\delta_i^n \in \mathbb{R}^M$. Combining Eqs. (2) and (7) gives:

\[\begin{align} l_p=(\Vert \sum_{i=1}^N\delta_i^1 \Vert^2,\dots,\Vert \sum_{i=1}^N\delta_i^n \Vert^2,\dots,\Vert \sum_{i=1}^N\delta_i^{D_1} \Vert^2) \notag \\ \end{align} \tag{8} \]
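The identity behind Eqs. (7) and (8) can be verified numerically. This is a minimal sketch under assumed sizes and random values: when $W_p^n=\theta^n\theta^{nT}$, the naive double sum and the squared-norm form agree.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D1 = 4, 3, 5                  # fields, embedding size, width of l_1 (illustrative)
f = rng.normal(size=(N, M))         # embeddings f_1, ..., f_N
theta = rng.normal(size=(D1, N))    # one vector theta^n per output unit

# Naive route: materialize W_p^n = theta^n theta^n^T and apply Eq. (6)
W_p = np.einsum('ni,nj->nij', theta, theta)     # (D1, N, N)
p = f @ f.T                                     # (N, N)
l_p_naive = np.einsum('nij,ij->n', W_p, p)

# Factorized route: sum_i theta_i^n f_i, then the squared norm (Eqs. 7 and 8)
delta_sum = theta @ f                           # (D1, N) @ (N, M) -> (D1, M)
l_p_fast = np.sum(delta_sum ** 2, axis=1)       # (D1,)

print(np.allclose(l_p_naive, l_p_fast))
```

The factorized route never forms an $N \times N$ matrix, which is where the saving comes from.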

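Complexity after optimization

Space complexity drops from $O(D_1N(M+N))$ to $O(D_1NM)$;

Time complexity drops from $O(N^2(M+D_1))$ to $O(D_1NM)$.

Although factorizing the parameter matrix reduces the computational cost, approximating the matrix with a first-order factorization loses some accuracy. A K-order factorization would be more accurate, but at a higher computational cost.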
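To make the K-order case concrete, here is one way it could be written out. The superscript $k$ and the vectors $\theta^{n,k}$ are notation introduced here for illustration; they do not appear in the derivation above.

```latex
\begin{align}
W_p^n &= \sum_{k=1}^{K}\theta^{n,k}\,(\theta^{n,k})^T, \qquad \theta^{n,k}\in\mathbb{R}^N \notag \\
l_p^n &= \sum_{i=1}^N\sum_{j=1}^N (W_p^n)_{i,j}\langle f_i,f_j\rangle
       = \sum_{k=1}^{K}\sum_{i=1}^N\sum_{j=1}^N \theta_i^{n,k}\theta_j^{n,k}\langle f_i,f_j\rangle
       = \sum_{k=1}^{K}\Big\Vert \sum_{i=1}^N \theta_i^{n,k} f_i \Big\Vert^2 \notag
\end{align}
```

Each of the $K$ terms costs the same as the first-order case, so under this assumption computing $l_p$ takes $O(KD_1NM)$ time, which is the extra cost paid for the better approximation.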

1.2 OPNN Analysis

Changing the feature-crossing operation from inner product to outer product gives the other form of PNN, OPNN.

Define $p_{i,j}=g(f_i,f_j)=f_if_j^T$ and rewrite Eq. (2) as:

\[\begin{align} l_p^n=W_p^n \bigodot p=\sum_{i=1}^N\sum_{j=1}^N(W_p^n)_{i,j}p_{i,j}= \sum_{i=1}^N\sum_{j=1}^N(W_p^n)_{i,j}f_if_j^T \notag \\ \end{align} \tag{9} \]

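By an analysis similar to IPNN's, both the time and space complexity of OPNN are $O(D_1M^2N^2)$.

To reduce this cost, the idea of superposition (sum pooling) is introduced, and $p$ is redefined as: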

\[\begin{align} p=\sum_{i=1}^N\sum_{j=1}^Nf_if_j^T=f_{\sum}f_{\sum}^T, \qquad f_{\sum}=\sum_{i=1}^Nf_i \notag \\ \end{align} \tag{10} \]

Eq. (9) is then redefined as follows (note that now $p \in \mathbb{R}^{M \times M}$):

\[\begin{align} l_p^n=W_p^n \bigodot p=\sum_{i=1}^M\sum_{j=1}^M(W_p^n)_{i,j}p_{i,j} \notag \\ \end{align} \tag{11} \]

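From Eq. (10), computing $f_{\sum}$ takes $O(MN)$ time, $p$ takes $O(M^2)$ time and space, and each $l_p^n$ takes $O(M^2)$ time and space, so $l_p$ takes $O(D_1M^2)$ time and space. From the previous subsection, $l_z$ takes $O(D_1MN)$ time and space. The overall complexity of OPNN is therefore $O(D_1M(M+N))$ in both time and space.

In other words, the time and space complexity of OPNN drops from $O(D_1M^2N^2)$ to $O(D_1M(M+N))$.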
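The sum-pooling step in Eqs. (10) and (11) can be sketched as follows, again with assumed sizes and random values: summing all $N^2$ outer products gives the same $M \times M$ matrix as one outer product of the summed embedding with itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D1 = 4, 3, 5                  # fields, embedding size, width of l_1 (illustrative)
f = rng.normal(size=(N, M))         # embeddings f_1, ..., f_N
W_p = rng.normal(size=(D1, M, M))   # one M x M weight matrix W_p^n per output unit

# Left-hand side of Eq. (10): sum over all N*N outer products f_i f_j^T
p_naive = sum(np.outer(f[i], f[j]) for i in range(N) for j in range(N))

# Right-hand side of Eq. (10): pool the embeddings first, then one outer product
f_sum = f.sum(axis=0)               # shape (M,)
p = np.outer(f_sum, f_sum)          # shape (M, M)

l_p = np.einsum('nij,ij->n', W_p, p)    # Eq. (11) for every n, shape (D1,)

print(np.allclose(p_naive, p))
```

After pooling, every field pair shares the same weights, which is the source of the accuracy loss that comes with this optimization.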

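Likewise, although superposition lowers the computational cost, the accuracy lost along the way is considerable; it is a trade-off between performance and accuracy.

The specific strategy for reducing complexity depends on the choice of product function. Through matrix factorization, IPNN in effect skips the explicit product layer and goes from the embedding layer straight to the hidden layer $l_1$ by algebraic transformation, whereas OPNN optimizes the product layer itself [3].

2. Performance Analysis

The authors ran experiments on the Criteo and iPinYou datasets; the comparison is shown below. $PNN^*$ computes both the inner and outer products of the features and concatenates them before feeding the next layer.

Experimental comparisons of dropout rate, activation function, and hidden-layer settings are shown below:

3. Pros and Cons

Pros:

Compared with FNN, PNN incorporates low-order features while building high-order feature combinations, and it does not require two-stage training.

Experiment

The MovieLens 100K dataset is used; the core code is as follows.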

import tensorflow as tf  # written against the TensorFlow 1.x API (tf.placeholder, tf.contrib)


class PNN(object):
    def __init__(self, vec_dim=None, field_lens=None, lr=None, dnn_layers=None, dropout_rate=None, lamda=None, use_inner=True):
        self.vec_dim = vec_dim
        self.field_lens = field_lens
        self.field_num = len(field_lens)
        self.lr = lr
        self.dnn_layers = dnn_layers
        self.dropout_rate = dropout_rate
        self.lamda = float(lamda)
        self.use_inner = use_inner
        assert dnn_layers[-1] == 1
        self.l2_reg = tf.contrib.layers.l2_regularizer(self.lamda)

        self._build_graph()

    def _build_graph(self):
        self.add_input()
        self.inference()

    def add_input(self):
        # one one-hot input per field, shape (batch, field_lens[i])
        self.x = [tf.placeholder(tf.float32, shape=[None, self.field_lens[i]], name='input_x_%d'%i) for i in range(self.field_num)]
        self.y = tf.placeholder(tf.float32, shape=[None], name='input_y')
        self.is_train = tf.placeholder(tf.bool)

    def inference(self):
        with tf.variable_scope('emb_part'):
            emb = [tf.get_variable(name='emb_%d'%i, shape=[self.field_lens[i], self.vec_dim], dtype=tf.float32, regularizer=self.l2_reg) for i in range(self.field_num)]
            emb_layer = tf.concat([tf.matmul(self.x[i], emb[i]) for i in range(self.field_num)], axis=1) # (batch, F*K)
            emb_layer = tf.reshape(emb_layer, shape=[-1, self.field_num, self.vec_dim]) # (batch, F, K)

        with tf.variable_scope('linear_part'):
            linear_part = tf.reshape(emb_layer, shape=[-1, self.field_num*self.vec_dim]) # (batch, F*K)
            linear_w = tf.get_variable(name='linear_w', shape=[self.field_num*self.vec_dim, self.dnn_layers[0]], dtype=tf.float32, regularizer=self.l2_reg) # (F*K, D)
            self.lz = tf.matmul(linear_part, linear_w) # (batch, D)

        with tf.variable_scope('product_part'):
            product_out = []
            if self.use_inner:
                inner_product_w = tf.get_variable(name='inner_product_w', shape=[self.dnn_layers[0], self.field_num], dtype=tf.float32, regularizer=self.l2_reg) # (D, F)
                for i in range(self.dnn_layers[0]):
                    delta = tf.multiply(emb_layer, tf.expand_dims(inner_product_w[i], axis=1)) # (batch, F, K)
                    delta = tf.reduce_sum(delta, axis=1) # (batch, K)
                    product_out.append(tf.reduce_sum(tf.square(delta), axis=1, keepdims=True)) # (batch, 1)
            else:
                outer_product_w = tf.get_variable(name='outer_product_w', shape=[self.dnn_layers[0], self.vec_dim, self.vec_dim], dtype=tf.float32, regularizer=self.l2_reg) # (D, K, K)
                field_sum = tf.reduce_sum(emb_layer, axis=1) # (batch, K)
                p = tf.matmul(tf.expand_dims(field_sum, axis=2), tf.expand_dims(field_sum, axis=1)) # (batch, K, K)
                for i in range(self.dnn_layers[0]):
                    lpi = tf.multiply(p, tf.expand_dims(outer_product_w[i], axis=0)) # (batch, K, K)
                    product_out.append(tf.expand_dims(tf.reduce_sum(lpi, axis=[1,2]), axis=1)) # (batch, 1)
            self.lp = tf.concat(product_out, axis=1)  # (batch, D)
            bias = tf.get_variable(name='bias', shape=[self.dnn_layers[0]], dtype=tf.float32)
            self.product_layer = tf.nn.relu(self.lz+self.lp+bias)

        x = self.product_layer
        in_node = self.dnn_layers[0]
        with tf.variable_scope('dnn_part'):
            for i in range(1, len(self.dnn_layers)):
                out_node = self.dnn_layers[i]
                w = tf.get_variable(name='w_%d'%i, shape=[in_node, out_node], dtype=tf.float32, regularizer=self.l2_reg)
                b = tf.get_variable(name='b_%d'%i, shape=[out_node], dtype=tf.float32)
                x = tf.matmul(x, w) + b
                if out_node == 1:
                    self.y_logits = x
                else:
                    x = tf.layers.dropout(tf.nn.relu(x), rate=self.dropout_rate, training=self.is_train)
                in_node = out_node

        self.y_hat = tf.nn.sigmoid(self.y_logits)
        self.pred_label = tf.cast(self.y_hat > 0.5, tf.int32)
        self.loss = -tf.reduce_mean(self.y*tf.log(self.y_hat+1e-8) + (1-self.y)*tf.log(1-self.y_hat+1e-8))
        reg_variables = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        if len(reg_variables) > 0:
            self.loss += tf.add_n(reg_variables)
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
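The model above expects one one-hot matrix per field. As a minimal sketch of how such inputs might be prepared, the snippet below uses a hypothetical helper `one_hot_fields` and made-up vocabulary sizes; neither comes from the original code.

```python
import numpy as np

def one_hot_fields(indices, field_lens):
    """Hypothetical helper: turn integer category ids of shape (batch, F)
    into a list of F one-hot matrices, one per field."""
    indices = np.asarray(indices)
    return [np.eye(n, dtype=np.float32)[indices[:, i]]
            for i, n in enumerate(field_lens)]

field_lens = [3, 2]                                   # illustrative vocabulary size per field
batch = one_hot_fields([[0, 1], [2, 0]], field_lens)  # two samples, two fields
# batch[i] has shape (batch_size, field_lens[i])
```

With a constructed `PNN` instance, here called `model` for illustration, each `batch[i]` would be fed to `model.x[i]` in the `feed_dict`, alongside `model.y` and `model.is_train`.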

References

[1] Qu, Yanru, et al. "Product-based neural networks for user response prediction." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.

[2] Zhang, Weinan, Tianming Du, and Jun Wang. "Deep learning over multi-field categorical data." European conference on information retrieval. Springer, Cham, 2016.

[3] https://zhuanlan.zhihu.com/p/56651241

