Generalized linear models with nonlinear feature transformations (feature engineering plus a linear model) are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable (the feature coefficients learned by a linear model are easy to interpret), while generalization requires more feature engineering effort (a linear model needs extensive feature engineering to generalize well).
With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features (the DNN learns low-dimensional embedding vectors from the sparse feature vectors and generalizes well, but may underfit). However, deep neural networks with embeddings can over-generalize.
Wide & Deep learning jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems.

The Wide Component
The wide component is a generalized linear model of the form $y = w^T x + b$, as illustrated in Figure 1 (left). Here $y$ is the prediction, $x = [x_1, x_2, ..., x_d]$ is a vector of $d$ features, $w = [w_1, w_2, ..., w_d]$ are the model parameters, and $b$ is the bias. The feature set includes raw input features and transformed features (for example, cross-product features).
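As a minimal sketch of the wide component, the snippet below evaluates $y = w^T x + b$ over two raw binary features plus one cross-product feature. The feature names, the weights, and the helper functions are made up for illustration; the cross-product is taken here as the AND of binary features, which is how the Wide & Deep paper defines it.

```python
def cross_product(features, names):
    """Return 1.0 only if every named binary feature is active, else 0.0."""
    return 1.0 if all(features.get(n, 0.0) == 1.0 for n in names) else 0.0


def wide_predict(features, weights, bias):
    """Linear model y = w^T x + b over a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items()) + bias


# Raw one-hot features for one example (hypothetical names).
raw = {"gender=female": 1.0, "language=en": 1.0}

# Transformed feature: AND(gender=female, language=en).
x = dict(raw)
x["gender=female&language=en"] = cross_product(
    raw, ["gender=female", "language=en"])

# Hypothetical learned weights; the cross feature gets its own weight.
w = {"gender=female": 0.25, "language=en": -0.5,
     "gender=female&language=en": 1.0}
b = 0.5

y = wide_predict(x, w, b)
print(y)  # 0.25 - 0.5 + 1.0 + 0.5 = 1.25
```

Because the cross feature has its own weight, the model can memorize that this particular combination matters, independently of the two single-feature weights.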
The Deep Component
The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the original inputs are feature strings (e.g., “language=en”). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional, dense, real-valued vector, often referred to as an embedding vector. (There are two ways to do this: embed the features of each field separately to get one vector per field, or one-hot encode the features of all fields and pass them through a single embedding layer to get one vector. The latter is used here.) The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then the values are trained to minimize the final loss function during model training. These low-dimensional dense embedding vectors are then fed into the hidden layers of a neural network in the forward pass. Specifically, each hidden layer performs the following computation:
$a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$
where $l$ is the layer number and $f$ is the activation function, often rectified linear units (ReLUs). $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and model weights at the $l$-th layer.
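The forward pass of the deep component can be sketched in plain Python as follows. This is only an illustration under assumed sizes: the vocabulary, the embedding dimension of 4, and the hidden layer sizes 3 and 2 are arbitrary, and the weights are random rather than trained.

```python
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def dense(W, a, b):
    """Compute W a + b for a weight matrix W given as a list of rows."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


def hidden_layer(W, a, b):
    """One hidden layer: a_next = f(W a + b) with f = ReLU."""
    return relu(dense(W, a, b))


random.seed(0)
vocab = {"language=en": 0, "language=fr": 1, "language=zh": 2}  # hypothetical
embedding_dim = 4

# Embedding table: one randomly initialized dense row per categorical value.
embeddings = [[random.gauss(0.0, 0.1) for _ in range(embedding_dim)]
              for _ in vocab]

# Looking up a feature string replaces the sparse one-hot vector
# with its dense row in the table.
a0 = embeddings[vocab["language=en"]]

# Two hidden layers of sizes 3 and 2 with randomly initialized weights.
W0 = [[random.gauss(0.0, 0.5) for _ in range(embedding_dim)] for _ in range(3)]
b0 = [0.0] * 3
W1 = [[random.gauss(0.0, 0.5) for _ in range(3)] for _ in range(2)]
b1 = [0.0] * 2

a1 = hidden_layer(W0, a0, b0)
a2 = hidden_layer(W1, a1, b1)
print(len(a0), len(a1), len(a2))
```

In training, both the embedding table and the layer weights would be updated by backpropagation; the sketch only covers the forward computation.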
Joint Training of Wide & Deep Model
The wide component and deep component are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function for joint training. (Note: the Wide and Deep parts each produce an output, the two outputs are combined with learned weights, and the result is fed to the log loss; in effect this is logistic regression over the two outputs. In the Paddle implementation, the LR and DNN outputs are concatenated into a two-dimensional vector and passed through a fully connected layer with a sigmoid activation to produce the final probability, which amounts to the same thing.)
Note that there is a distinction between joint training and ensemble. In an ensemble, individual models are trained separately without knowing each other, and their predictions are combined only at inference time but not at training time. In contrast, joint training optimizes all parameters simultaneously by taking both the wide and deep part as well as the weights of their sum into account at training time. For joint training, the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than a full-size wide model.
Joint training of a Wide & Deep Model is done by backpropagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization. In the experiments, we used the Follow-the-regularized-leader (FTRL) algorithm [3] with L1 regularization as the optimizer for the wide part of the model, and AdaGrad [1] for the deep part. (Note: the Wide part uses the FTRL optimizer, roughly SGD with L1 regularization, and the Deep part uses AdaGrad. In Paddle, however, only one optimization method can be specified per model.)
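To make the two optimizers concrete, here is a rough per-coordinate sketch of both update rules. It follows the commonly published forms of FTRL-Proximal and AdaGrad; the class names and hyperparameter values are illustrative assumptions and are not PaddlePaddle APIs.

```python
import math


class AdaGrad:
    """Per-coordinate AdaGrad: the step shrinks as squared gradients accumulate."""

    def __init__(self, lr=0.1, eps=1e-8):
        self.lr, self.eps, self.g2 = lr, eps, 0.0

    def step(self, w, g):
        self.g2 += g * g
        return w - self.lr * g / (math.sqrt(self.g2) + self.eps)


class FTRLProximal:
    """Per-coordinate FTRL-Proximal with L1 (and optional L2) regularization."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z, self.n = 0.0, 0.0

    def weight(self):
        # L1 thresholding: the weight is exactly zero while |z| <= l1.
        if abs(self.z) <= self.l1:
            return 0.0
        sign = 1.0 if self.z > 0 else -1.0
        return -(self.z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n)) / self.alpha + self.l2)

    def step(self, g):
        w = self.weight()
        sigma = (math.sqrt(self.n + g * g) - math.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g
        return self.weight()


# One wide weight and one deep weight, each with its own optimizer.
wide_opt = FTRLProximal(l1=1.0)
deep_opt = AdaGrad(lr=0.1)

w_wide_after_small = wide_opt.step(0.5)  # |z| <= l1, so the weight stays at 0
w_wide_after_large = wide_opt.step(2.0)  # z now exceeds l1, weight becomes nonzero
w_deep = deep_opt.step(1.0, 0.5)
print(w_wide_after_small, w_wide_after_large, w_deep)
```

The L1 threshold in FTRL keeps rarely useful wide weights at exactly zero, which matters when the wide part has a very large number of sparse cross-product features.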
The combined model is illustrated in Figure 1 (center). For a logistic regression problem, the model’s prediction is:
$P(Y = 1 | x) = \sigma(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b)$
where $Y$ is the binary class label, $\sigma(\cdot)$ is the sigmoid function, $\phi(x)$ are the cross-product transformations of the original features $x$, and $b$ is the bias term. $w_{wide}$ is the vector of all wide model weights, and $w_{deep}$ are the weights applied on the final activations $a^{(l_f)}$.
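The combined prediction, a sigmoid over the sum of the wide term $w_{wide}^T [x, \phi(x)]$, the deep term $w_{deep}^T a^{(l_f)}$, and the bias $b$, can be sketched as below together with the logistic loss it feeds. All numeric values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def wide_deep_predict(x_and_cross, w_wide, a_final, w_deep, b):
    """P(Y=1|x) = sigmoid(w_wide^T [x, phi(x)] + w_deep^T a^(l_f) + b)."""
    logit = dot(w_wide, x_and_cross) + dot(w_deep, a_final) + b
    return sigmoid(logit)


def log_loss(p, y):
    """Logistic loss for a binary label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical values: two raw features plus one cross-product feature,
# and final deep activations of size 2.
x_and_cross = [1.0, 0.0, 1.0]
w_wide = [0.5, -0.25, 1.0]
a_final = [2.0, 0.0]
w_deep = [-0.5, 0.75]
b = -0.5

# Wide term 1.5, deep term -1.0, bias -0.5: the logit is 0, so p is 0.5.
p = wide_deep_predict(x_and_cross, w_wide, a_final, w_deep, b)
print(p, log_loss(p, 1))
```

Since a single loss is computed on the summed logit, its gradient flows into both `w_wide` and `w_deep` (and further into the deep layers), which is what distinguishes joint training from an ensemble.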
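In a real project, the input to the Deep part consists of categorical features, which must be one-hot encoded. The Wide part is essentially an LR model that uses statistical features, CVR features, and so on: statistical features are one-hot encoded, and CVR features must be discretized first and then one-hot encoded.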
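The preprocessing described above might look like the following sketch. The bucket boundaries, the vocabulary, and the helper names are assumptions for illustration; the two output formats (a list of active indices for the Deep input, and a list of (index, value) pairs for the Wide input) mirror the sparse_binary_vector and sparse_vector inputs used by the network code in this post.

```python
import bisect


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into."""
    return bisect.bisect_right(boundaries, value)


def one_hot_index(offset, index):
    """Position of an active one-hot entry inside the merged sparse vector."""
    return offset + index


# Hypothetical CVR buckets: [0, 0.01), [0.01, 0.05), [0.05, 0.2), [0.2, 1].
cvr_boundaries = [0.01, 0.05, 0.2]
num_cvr_buckets = len(cvr_boundaries) + 1

# Hypothetical categorical vocabulary for the Deep input.
device_vocab = {"ios": 0, "android": 1, "web": 2}

example = {"device": "android", "cvr": 0.03}

# Deep input: indices of the active entries (a sparse binary vector).
dnn_input = [one_hot_index(0, device_vocab[example["device"]])]

# Wide input: (index, value) pairs (a sparse float vector).
cvr_bucket = bucketize(example["cvr"], cvr_boundaries)
lr_input = [(one_hot_index(0, cvr_bucket), 1.0)]

print(dnn_input, lr_input)  # [1] [(1, 1.0)]
```

With several fields, each field would get its own offset so that its one-hot block occupies a distinct index range in the merged vector.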
Building the network:

    import paddle.v2 as paddle
    from paddle.v2 import layer
    from paddle.v2 import data_type as dtype
    # ModelType comes from the project's own helper module; the module
    # name `utils` is assumed here.
    from utils import ModelType


    class CTRmodel(object):
        '''
        A CTR model which implements wide && deep learning model.
        '''

        def __init__(self,
                     dnn_layer_dims,
                     dnn_input_dim,
                     lr_input_dim,
                     model_type=ModelType.create_classification(),
                     is_infer=False):
            '''
            @dnn_layer_dims: list of integer
                dims of each layer in dnn
            @dnn_input_dim: int
                size of dnn's input layer
            @lr_input_dim: int
                size of lr's input layer
            @is_infer: bool
                whether to build a infer model
            '''
            self.dnn_layer_dims = dnn_layer_dims
            self.dnn_input_dim = dnn_input_dim
            self.lr_input_dim = lr_input_dim
            self.model_type = model_type
            self.is_infer = is_infer

            self._declare_input_layers()

            self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
            self.lr = self._build_lr_submodel_()

            # model's prediction
            # TODO(superjom) rename it to prediction
            if self.model_type.is_classification():
                self.model = self._build_classification_model(self.dnn, self.lr)
            if self.model_type.is_regression():
                self.model = self._build_regression_model(self.dnn, self.lr)

        # layer.data: define DataLayer For NeuralNetwork.
        def _declare_input_layers(self):
            # Input of the Deep part: categorical features, one-hot encoded.
            # Sparse binary vector: the input feature is a sparse vector and
            # every element in this vector is either zero or one.
            # A sparse_binary_vector input is the list of indices of the
            # active features.
            self.dnn_merged_input = layer.data(
                name='dnn_input',
                type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))

            # Input of the Wide part: statistical features, CVR features, etc.
            # Statistical features are one-hot encoded; CVR features are
            # discretized first and then one-hot encoded.
            # Sparse vector: the input feature is a sparse vector. Most of the
            # elements in this vector are zero, others could be any float value.
            # A sparse_vector input is a list of (index, value) pairs.
            self.lr_merged_input = layer.data(
                name='lr_input',
                type=paddle.data_type.sparse_vector(self.lr_input_dim))

            # Label learned by the binary classifier.
            # Dense vector: the input feature is a dense float vector.
            if not self.is_infer:
                self.click = paddle.layer.data(
                    name='click', type=dtype.dense_vector(1))

        # The Deep part is a standard multi-layer feed-forward DNN. All input
        # features are one-hot encoded and then embedded as a whole into a
        # single vector.
        # DeepFM, by contrast, embeds each field separately and therefore has
        # several embedding layers.
        # Note: in use, dnn_layer_dims = [128, 64, 32, 1]
        def _build_dnn_submodel_(self, dnn_layer_dims):
            '''
            build DNN submodel.
            '''
            dnn_embedding = layer.fc(
                input=self.dnn_merged_input, size=dnn_layer_dims[0])
            _input_layer = dnn_embedding
            for i, dim in enumerate(dnn_layer_dims[1:]):
                fc = layer.fc(
                    input=_input_layer,
                    size=dim,
                    act=paddle.activation.Relu(),
                    name='dnn-fc-%d' % i)
                _input_layer = fc
            return _input_layer

        # The Wide part is a plain LR model, with the activation changed to
        # ReLU for speed.
        def _build_lr_submodel_(self):
            '''
            config LR submodel
            '''
            # size is the layer dimension; as I understand it, the number of
            # neurons in the layer.
            fc = layer.fc(
                input=self.lr_merged_input,
                size=1,
                act=paddle.activation.Relu())
            return fc

        # Merge the Wide and Deep parts.
        def _build_classification_model(self, dnn, lr):
            merge_layer = layer.concat(input=[dnn, lr])
            # The sigmoid outputs a probability.
            self.output = layer.fc(
                input=merge_layer,
                size=1,
                # use sigmoid function to approximate ctr rate, a float value
                # between 0 and 1.
                act=paddle.activation.Sigmoid())

            if not self.is_infer:
                # multi_binary_label_cross_entropy_cost: a loss layer for
                # multi binary label cross entropy.
                # Classification uses the cross-entropy loss.
                self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
                    input=self.output, label=self.click)
            return self.output

        def _build_regression_model(self, dnn, lr):
            merge_layer = layer.concat(input=[dnn, lr])
            self.output = layer.fc(
                input=merge_layer, size=1, act=paddle.activation.Sigmoid())
            if not self.is_infer:
                # Regression uses the MSE loss.
                self.train_cost = paddle.layer.mse_cost(
                    input=self.output, label=self.click)
            return self.output

Training the model:

    import gzip

    import paddle.v2 as paddle
    # reader, utils and network_conf are the project's own modules; these
    # module names are assumed here. parse_args() is the script's argument
    # parsing helper and is not shown.
    import reader
    from utils import logger, ModelType
    from network_conf import CTRmodel

    dnn_layer_dims = [128, 64, 32, 1]

    # ==============================================================================
    #                   cost and train period
    # ==============================================================================


    def train():
        args = parse_args()
        args.model_type = ModelType(args.model_type)
        paddle.init(use_gpu=False, trainer_count=1)
        dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)

        # create ctr model.
        model = CTRmodel(
            dnn_layer_dims,
            dnn_input_dim,
            lr_input_dim,
            model_type=args.model_type,
            is_infer=False)

        # Parameters is a dictionary that holds Paddle's parameters; it is
        # created from the network's cost layer.
        params = paddle.parameters.create(model.train_cost)
        optimizer = paddle.optimizer.AdaGrad()

        trainer = paddle.trainer.SGD(
            cost=model.train_cost, parameters=params, update_equation=optimizer)

        dataset = reader.Dataset()

        def __event_handler__(event):
            if isinstance(event, paddle.event.EndIteration):
                num_samples = event.batch_id * args.batch_size
                if event.batch_id % 100 == 0:
                    logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
                        event.pass_id, num_samples, event.cost, event.metrics))

                if event.batch_id % 1000 == 0:
                    if args.test_data_path:
                        result = trainer.test(
                            reader=paddle.batch(
                                dataset.test(args.test_data_path),
                                batch_size=args.batch_size),
                            feeding=reader.feeding_index)
                        logger.warning("Test %d-%d, Cost %f, %s" %
                                       (event.pass_id, event.batch_id,
                                        result.cost, result.metrics))

                        path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
                            args.model_output_prefix, event.pass_id,
                            event.batch_id, result.cost)
                        with gzip.open(path, 'w') as f:
                            params.to_tar(f)

        trainer.train(
            # shuffle: reads buf_size training samples into a buffer, shuffles
            # them, and yields them one at a time.
            # A batched reader yields one mini-batch at a time.
            # num_passes: The total train passes.
            # feeding (dict|list): Feeding is a map of neural network input
            # name and array index that reader returns.
            reader=paddle.batch(
                paddle.reader.shuffle(
                    dataset.train(args.train_data_path), buf_size=500),
                batch_size=args.batch_size),
            feeding=reader.feeding_index,
            event_handler=__event_handler__,
            num_passes=args.num_passes)

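Hyperparameter tuning
Parameter initialization:
By default, PaddlePaddle initializes parameters with mean 0 and standard deviation $\frac{1}{\sqrt{d}}$, where $d$ is the width of the parameter matrix. This initialization generally does not produce poor results. If you want to customize the initialization, PaddlePaddle currently provides two options, set when defining the layer:
- Gaussian distribution. Set param_attr to param_attr=ParamAttr(initial_mean=0.0, initial_std=1.0)
- Uniform distribution. Set param_attr to param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0)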
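To show what these two schemes amount to numerically, here is a small sketch in plain Python that does not use PaddlePaddle. The function names are hypothetical, and treating the number of columns as the "width" $d$ of the parameter matrix is an assumption.

```python
import math
import random


def gaussian_init(rows, cols, mean=0.0, std=None, rng=random):
    """Gaussian init; std defaults to 1/sqrt(cols), taking cols as the width d."""
    if std is None:
        std = 1.0 / math.sqrt(cols)
    return [[rng.gauss(mean, std) for _ in range(cols)] for _ in range(rows)]


def uniform_init(rows, cols, low=-1.0, high=1.0, rng=random):
    """Uniform init on [low, high]."""
    return [[rng.uniform(low, high) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
W_default = gaussian_init(64, 100, rng=rng)            # std = 1/sqrt(100) = 0.1
W_gauss = gaussian_init(64, 100, std=1.0, rng=rng)     # like initial_std=1.0
W_uniform = uniform_init(64, 100, -1.0, 1.0, rng=rng)  # like initial_min/max

values = [v for row in W_default for v in row]
mean = sum(values) / len(values)
sample_std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
print(round(sample_std, 3))
```

The default keeps the scale of a layer's output roughly independent of its input width, whereas a fixed standard deviation of 1.0 produces much larger initial activations for wide layers.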
The parameters of Paddle's fc layer are:

Adjusting the learning rate:


The relevant parameters can be specified when the optimizer is defined, as with optimizer = paddle.optimizer.AdaGrad() above. For example:

    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=1e-3,
        regularization=paddle.optimizer.L2Regularization(rate=1e-3),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5))

http://www.datakit.cn/blog/2016/08/21/wdnn.html
Using pre-trained word vectors in embedding layer:
https://github.com/PaddlePaddle/Paddle/issues/490
Recommender system implemented with Paddle:
http://book.paddlepaddle.org/index.cn.html
