XGBoost: Principles and Parallel Implementation


XGBoost training: It is not easy to train all the trees at once. Instead, we use an additive strategy: fix what we have learned, and add one new tree at a time. Writing the prediction value at step t as $\hat{y}^{(t)}_i$, we have

                   $\hat{y}^{(0)}_i = 0$

                   $\hat{y}^{(1)}_i = f_1(x_i) = \hat{y}^{(0)}_i +  f_1(x_i)$

                   $\hat{y}_i^{(t)} = \sum^t_{k=1}f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$
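A toy Python sketch of this recursion (the three stand-in "trees" below are arbitrary made-up functions, not fitted models):

# Toy illustration of additive training: the step-t prediction is the running
# sum of the first t trees' outputs.
trees = [lambda x: 0.5 * x, lambda x: 0.1 * (x - 1.0), lambda x: -0.05 * x]

def predict_at_step(x, t):
    """hat{y}^{(t)} = sum over k = 1..t of f_k(x); t = 0 gives the initial prediction 0."""
    return sum(f(x) for f in trees[:t])

# Each step adds one more tree's output on top of the previous prediction.
print([predict_at_step(2.0, t) for t in range(len(trees) + 1)])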

Optimization objective:

If we consider using MSE (mean squared error) as our loss function, the objective takes the following form:

                   $\text{obj}^{(t)} = \sum^n_{i=1}\left(y_i - (\hat{y}^{(t-1)}_i + f_t(x_i))\right)^2 + \sum^t_{k=1}\Omega(f_k) = \sum^n_{i=1}\left[2(\hat{y}^{(t-1)}_i - y_i)f_t(x_i) + f_t(x_i)^2\right] + \Omega(f_t) + constant$

The form of MSE is friendly, with a first-order (linear) term (usually called the residual) and a quadratic term. For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form. So in the general case, we take the Taylor expansion of the loss function up to the second order (i.e., the loss function is expanded to second order around the previous round's model).

Taylor expansion of the objective:

                   $\text{obj}^{(t)} = \sum^n_{i=1}\left[l(y_i, \hat{y}^{(t-1)}_i) + g_i f_t(x_i) + \frac{1}{2}h_i f^2_t(x_i)\right] + \Omega(f_t) + constant$

where $g_i$ and $h_i$ are defined as

                   $g_i = \partial_{\hat{y}^{(t-1)}_i} l(y_i, \hat{y}^{(t-1)}_i), \quad h_i = \partial^2_{\hat{y}^{(t-1)}_i} l(y_i, \hat{y}^{(t-1)}_i)$

(For squared loss, for example, $g_i = 2(\hat{y}^{(t-1)}_i - y_i)$ and $h_i = 2$.)

 (The expansion variable is $\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + f_t(x_i)$, and the Taylor expansion is taken at $\hat{y}^{(t-1)}_i$.)

After we remove all the constants, the specific objective at step t becomes

                   $\sum^n_{i=1}\left[g_i f_t(x_i) + \frac{1}{2}h_i f^2_t(x_i)\right] + \Omega(f_t)$

This becomes our optimization goal for the new tree. One important advantage of this definition is that it only depends on $g_i$ and $h_i$.

This is how XGBoost can support custom loss functions. We can optimize every loss function, including logistic regression and weighted logistic regression, using exactly the same solver that takes $g_i$ and $h_i$ as input.

Custom objective function and evaluation function: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
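For reference, a minimal sketch of what such a custom objective and evaluation function look like with the Python API (xgb.train accepts an obj callback that returns per-instance gradients and Hessians, and a feval callback for a custom metric; the data below is made up):

import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    # Custom binary logistic loss: return per-instance g_i and h_i.
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # preds are raw margins (logits)
    grad = p - labels                  # first-order gradient g_i
    hess = p * (1.0 - p)               # second-order gradient h_i
    return grad, hess

def error_metric(preds, dtrain):
    # Custom evaluation metric: classification error at threshold 0.
    labels = dtrain.get_label()
    return 'error', float(np.sum((preds > 0.0) != labels)) / len(labels)

X = np.random.rand(100, 5)             # made-up data for illustration
y = (X[:, 0] > 0.5).astype(float)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'max_depth': 2, 'eta': 0.1}, dtrain, num_boost_round=10,
                obj=logistic_obj, feval=error_metric)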

The GBM algorithm

GBDT fits a regression tree to the negative gradient of the loss function evaluated at the current model, used as an approximation of the residual. A regression tree corresponds to a partition of the input (feature) space together with an output value on each cell of that partition. The g_m(x) step above is problematic: the correct order is to first compute the (approximate) residuals r_i from the previous round's model and then fit a regression tree g_m(x) to the pairs (x_i, r_i); follow the pseudocode below. (See Section 8.4 of 統計學習方法.)
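A minimal sketch of that loop for the regression case, assuming squared loss (so the negative gradient is simply the residual y - F(x)); sklearn's DecisionTreeRegressor stands in here for the regression tree learner:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_regression_fit(X, y, n_rounds=50, lr=0.1, max_depth=3):
    F = np.full(len(y), y.mean())        # f_0: initial constant prediction
    trees = []
    for _ in range(n_rounds):
        r = y - F                        # (approximate) residuals at the current model
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # fit g_m(x) to (x_i, r_i)
        F += lr * tree.predict(X)        # add the new regression tree (with shrinkage lr)
        trees.append(tree)
    return y.mean(), trees

def gbdt_regression_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)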

GBDT for classification is conceptually the same as GBDT for regression, but because the sample outputs are discrete class labels rather than continuous values, we cannot directly fit the error between the predicted class and the true class (for example, with labels 0 and 1 the predicted class is either 0 or 1, so the error is 0 or ±1, whereas in regression the output value keeps approaching the true value). There are two main ways to solve this. One is to use the exponential loss, in which case GBDT degenerates into AdaBoost. The other is to use a log loss similar to that of logistic regression; that is, we fit the difference between the predicted class probability and the true probability. Reference: http://www.cnblogs.com/pinard/p/6140514.html

Log loss: $L(y, f(x)) = \log(1 + \exp(-y f(x)))$ with $y \in \{-1, +1\}$, and the corresponding negative gradient is $r_i = y_i / (1 + \exp(y_i f(x_i)))$.

 (My understanding: for regression problems GBDT uses the squared loss; for classification problems it uses either the exponential loss, in which case it is equivalent to AdaBoost with decision trees as base classifiers, or the log loss, in which case the predicted value is the class probability.)

Using GBDT for multi-class classification essentially turns it into a regression problem. In regression, GBDT builds one tree per iteration; in essence it builds a function f such that, for an input x, the tree outputs f(x).

For multi-class classification with K classes, each iteration actually builds K trees, and the prediction for a sample x is the vector of scores $(f_1(x), f_2(x), \dots, f_K(x))$.

Here, following multi-class logistic regression, we use softmax to produce probabilities, so the probability of class c is $p(y = c \mid x) = \exp(f_c(x)) / \sum_{i = 1}^{K}\exp(f_i(x))$.

The loss for this sample can then be expressed with the logit loss (the same loss as softmax regression), and a gradient can be computed for each of $f_1$ through $f_K$, which gives the residuals of the current round to be learned in the next iteration. At prediction time, an input x yields K output values, which are passed through softmax to obtain the probability of each class. The xgboost source code implements multi-class classification in exactly this way.
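A small numpy sketch of this idea (not xgboost's actual code): the K per-class scores are turned into probabilities with softmax, and the per-class gradient p - onehot(y), together with the diagonal Hessian p(1 - p), becomes the "residual" for the next round.

import numpy as np

def softmax(scores):
    # Row-wise softmax over the K per-class scores f_1(x)..f_K(x).
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

def multiclass_grad_hess(scores, labels, num_class):
    # Gradient and diagonal Hessian of the softmax log loss w.r.t. each f_c(x).
    p = softmax(scores)                 # shape (n, K)
    onehot = np.eye(num_class)[labels]  # shape (n, K)
    grad = p - onehot                   # g_i for each class score
    hess = p * (1.0 - p)                # h_i for each class score
    return grad, hess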

http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf  slides by the original xgboost author; read in full

Model Complexity

Let us first refine the definition of the tree $f(x)$ as $f_t(x) = w_{q(x)}$, where $w \in \mathbb{R}^T$ and $q: \mathbb{R}^d \rightarrow \{1, 2, \dots, T\}$.

$w$ is the vector of scores on leaves, $q$ is a function assigning each data point to the corresponding leaf, and $T$ is the number of leaves. In XGBoost, we define the complexity as

                   $\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum^T_{j=1}w^2_j$

(This is the complexity of the current tree. For a regression tree, the number of leaves is the number of distinct prediction values of the function f; the score is the weight of a leaf node, not a raw output value, and these weights have to be learned.)

After reformulating the tree model, we can write the objective value with the t-th tree as (loss term + regularization term):

                   $\text{obj}^{(t)} = \sum^n_{i=1}\left[g_i w_{q(x_i)} + \frac{1}{2}h_i w^2_{q(x_i)}\right] + \gamma T + \frac{1}{2}\lambda\sum^T_{j=1}w^2_j$

                   $\phantom{\text{obj}^{(t)}} = \sum^T_{j=1}\left[\left(\sum_{i \in I_j}g_i\right)w_j + \frac{1}{2}\left(\sum_{i \in I_j}h_i + \lambda\right)w^2_j\right] + \gamma T$

For a fixed structure q(x), we can compute the optimal weight $w^*_j$ of leaf j by

                   $w^*_j = -\frac{G_j}{H_j + \lambda}$

and calculate the corresponding optimal value by (the loss value of the current tree, used as the criterion for evaluating splits):

                   $\text{obj}^* = -\frac{1}{2}\sum^T_{j=1}\frac{G^2_j}{H_j + \lambda} + \gamma T$

Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after the split. Letting $I = I_L \cup I_R$, the loss reduction after the split is given by

                   $Gain = \frac{1}{2}\left[\frac{G^2_L}{H_L + \lambda} + \frac{G^2_R}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$

(The loss after the split is smaller than the parent's loss, so taking the difference directly gives a negative number; hence it is multiplied by -1.)

 

$I_j = \{i \mid q(x_i) = j\}$ is the set of indices of data points assigned to the j-th leaf. Notice that in the second line we have changed the index of the summation, because all the data points on the same leaf get the same score. We can further compress the expression by defining $G_j = \sum_{i \in I_j}g_i$ and $H_j = \sum_{i \in I_j}h_i$:

                   $\text{obj}^{(t)} = \sum^T_{j=1}\left[G_j w_j + \frac{1}{2}(H_j + \lambda)w^2_j\right] + \gamma T$
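A tiny numeric illustration of these formulas, with made-up per-leaf statistics:

lam, gamma = 1.0, 0.0
G = [4.0, -2.5]          # hypothetical per-leaf sums of g_i
H = [3.0, 5.0]           # hypothetical per-leaf sums of h_i

w_star = [-g / (h + lam) for g, h in zip(G, H)]          # optimal leaf weights w_j*
obj_star = -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * len(G)
print(w_star, obj_star)  # structure score used to compare candidate trees/splits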

https://stackoverflow.com/questions/41433209/can-someone-explain-how-these-scores-are-derived-in-this-xgboost-trees

https://www.kaggle.com/general/20322 

Predictions are made by summing up the corresponding leaf values of each tree. Additionally, you need to transform those values depending on the objective you have chosen. For instance, if you trained your xgb model with binary:logistic, the sum of the leaf values is the logit score, so you need to apply the logistic function to get the desired probabilities.
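A hedged sketch of that last point, using predict()'s output_margin option and made-up data: with binary:logistic the probability output is just the sigmoid of the raw margin (the base score plus the summed leaf values).

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 4)                        # made-up data
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic', 'max_depth': 3}, dtrain,
                num_boost_round=20)

margin = bst.predict(dtrain, output_margin=True)  # raw logit score
prob = bst.predict(dtrain)                        # probability output
assert np.allclose(prob, 1.0 / (1.0 + np.exp(-margin)), atol=1e-5)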

 


 

parallel gradient boosting decision trees

GBDT is a sequential algorithm, so we cannot parallelize it across trees the way Random Forest does; we can only parallelize the tree-building step. The problem therefore reduces to parallel decision tree building.

method 1: parallelize node building at each level (my understanding: once the split feature and split point are determined, the instances in each child node are fixed, and the child nodes, together with their descendants, can be built in parallel; the problem is that different nodes contain very different numbers of instances, causing a workload-imbalance problem)

A simple idea for parallel decision tree building is to parallelize node building at each level. However, this method has a serious workload-imbalance problem. The reason is that a decision tree tends to purify its nodes to obtain high prediction accuracy, so many of the nodes will contain only a small group of training instances with purified results, while other nodes contain large groups of training instances. The figure below shows an example of the imbalanced workload problem. Suppose we are going to build the nodes in the red box in parallel. We can see that the first and the third nodes contain far fewer training instances than the second and the fourth nodes, which causes the workload imbalance.

 

method 2: parallelize split finding on each node (search the candidate split features of a node in parallel, i.e., handle multiple features concurrently; if the node contains only a few instances, the benefit of parallelism is not enough to cover the cost of context switching, thread joins, and so on)

Recall that in the split finding process on a node (the process is shown below), we need to enumerate each feature to find the split. The idea of this method is to parallelize the split finding process, so that in each node the algorithm finds splits for different features in parallel.

 

The main problem of this method is that it has too much overhead for small nodes. When a decision tree grows deeper, most of the nodes contain only a small number of training instances. In this case, the computation cost per node is very small, and the benefit brought by parallel computing cannot cover the overhead brought by context switching, thread joining, and so on, so the method fails to achieve a good speedup. However, this method does point us in the right direction, and the final method is based on parallel split finding by features.

method 3: parallelize split finding at each level by features (for each feature, find the best split point on every leaf node in parallel; multiple features are processed concurrently)

As shown above, at each level the sequential building process of a decision tree has two loops: an outer loop enumerating the leaf nodes and an inner loop enumerating the features. The idea of this method is to swap the order of these two loops, so that we can parallelize split finding for different features at the same level. A pseudocode of the algorithm is shown below. By changing the order of the loops, we also avoid sorting the instances in each node: we can sort the instances once at the start of the whole building process and reuse the same sorted order at each level. On the other hand, note that to keep the algorithm correct, each thread needs to carefully maintain its scanning status for each leaf node during the linear scan, which significantly increases the coding complexity of the algorithm. (One thread handles one feature: before tree building starts, all the data are sorted by that feature's values and the sorted order is recorded; then at each level's split step, the thread scans the data of each leaf node one by one and computes the gain of splitting at the current data point's feature value.)
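A rough sketch (not xgboost's or the reference's actual code) of the per-feature job in method 3: each thread takes one presorted feature, makes a single linear scan, and accumulates per-leaf G/H statistics so that the best split of every leaf at this level is found in one pass. Names such as find_splits_for_feature and leaf_of are made up for illustration.

from collections import defaultdict

def find_splits_for_feature(feat_id, sorted_idx, grad, hess, leaf_of,
                            G_total, H_total, lam=1.0):
    # sorted_idx: instance indices presorted by this feature's value (done once,
    # before tree building starts). G_total/H_total: per-leaf totals of g_i/h_i.
    G = defaultdict(float)      # running per-leaf sum of g_i (left side so far)
    H = defaultdict(float)      # running per-leaf sum of h_i
    best = {}                   # leaf -> (gain, instance index to split after)

    def score(g, h):
        return g * g / (h + lam)

    for i in sorted_idx:
        leaf = leaf_of[i]       # the thread's scan state is kept per leaf
        G[leaf] += grad[i]
        H[leaf] += hess[i]
        GL, HL = G[leaf], H[leaf]
        GR, HR = G_total[leaf] - GL, H_total[leaf] - HL
        gain = score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)
        if leaf not in best or gain > best[leaf][0]:
            best[leaf] = (gain, i)
    # Real implementations also avoid splitting between identical feature values.
    return feat_id, best        # the main thread then picks, per leaf, the best feature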

 

The advantages of the method are:

  • Workload is totally balanced. Since the number of instances for each feature is the same, the workload of different jobs is the same. Thus, we avoid the workload-imbalance problem of method 1.
  • Overhead for parallelization is small. Since we parallelize split finding over the whole level rather than a single node, the benefit from parallel computing is more than enough to cover its overhead.

 


 

the xgboost advantage

  1. Regularization: xgboost adds a regularization term to the loss function (see the formula above).
    • The standard GBM implementation has no regularization like XGBoost, so this regularization also helps to reduce overfitting.
    • In fact, XGBoost is also known as a 'regularized boosting' technique.
  2. Parallel Processing:
    • XGBoost implements parallel processing and is blazingly fast compared to GBM.
    • XGBoost also supports implementation on Hadoop.
  3. High Flexibility
    • XGBoost allows users to define custom optimization objectives and evaluation criteria.
    • This adds a whole new dimension to the model and there is no limit to what we can do.
  4. Handling Missing Values (instances with a missing value on the split feature are tried in the left child and in the right child, the gain is computed for each, and the direction with the larger gain is chosen as the default split direction)
    • XGBoost has an in-built routine to handle missing values. (How does xgboost handle missing values?)
    • The user is required to supply a value different from other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in the future. (A missing value here means the instance has no value for the split feature, so it cannot be assigned to a child node directly.)
  5. Tree Pruning:
    • A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
    • XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
    • Another advantage is that sometimes a split with negative loss, say -2, may be followed by a split with positive loss +10. GBM would stop when it encounters the -2, but XGBoost goes deeper, sees the combined effect of +8, and keeps both splits.
  6. Built-in Cross-Validation
    • XGBoost allows the user to run a cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations (the number of trees) in a single run (see the sketch after this list).
    • This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
  7. Continue on Existing Model
    • The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain specific applications.
    • The sklearn implementation of GBM also has this feature, so the two are even on this point.
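A hedged sketch of points 6 and 7 above, with made-up data: xgb.cv with early_stopping_rounds picks the number of boosting rounds in one run, and xgb.train can continue from an existing booster via its xgb_model argument.

import numpy as np
import xgboost as xgb

X = np.random.rand(300, 5)              # made-up data
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}

# Built-in cross-validation with early stopping to pick the number of rounds.
cv = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
            metrics='logloss', early_stopping_rounds=10)
best_rounds = len(cv)                   # history is trimmed at the best iteration

# Train, then continue training the same model for a few more rounds.
bst = xgb.train(params, dtrain, num_boost_round=best_rounds)
bst = xgb.train(params, dtrain, num_boost_round=20, xgb_model=bst)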

 


Source-code reading and the differences between xgboost and GBDT:

http://mlnote.com/2016/10/29/xgboost-code-review-with-paper/ 

https://www.zhihu.com/question/41354392

1. Traditional GBDT uses only first-order derivative information during optimization (as an approximation of the residual), whereas xgboost performs a second-order Taylor expansion of the loss function and uses both the first- and second-order derivatives in the loss term. Incidentally, the xgboost toolkit supports custom cost functions, as long as the function has first and second derivatives.

2. xgboost adds a regularization term to the loss function to control model complexity. The regularization term includes the number of leaf nodes and the sum of squared L2 norms of the leaf scores. From a bias-variance tradeoff perspective, the regularization term reduces the model's variance, making the learned model simpler and preventing overfitting; this is another way xgboost improves on traditional GBDT. (L2 regularization + shrinkage + column and row subsampling.)

3. Handling of missing values: for samples with missing feature values, xgboost can automatically learn the split direction.
4. xgboost supports parallelism.

 

https://arxiv.org/pdf/1603.02754.pdf  the original xgboost paper

Handling missing values

When a value is missing in the sparse matrix x, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries $I_k$. The presented algorithm treats non-presence as a missing value and learns the best direction to handle missing values.
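A condensed sketch of that idea (not the paper's Alg. 3 verbatim): scan only the non-missing entries of the feature, evaluate each split twice, once with all missing instances sent right and once sent left, and keep the direction with the larger gain as the default. All names below are made up for illustration.

def best_split_with_default(values, grad, hess, g_miss, h_miss, lam=1.0):
    # values/grad/hess: parallel lists over the NON-missing instances of one
    # feature, already sorted by value; g_miss/h_miss: gradient/Hessian sums
    # over the instances whose value for this feature is missing.
    G_all = sum(grad) + g_miss
    H_all = sum(hess) + h_miss

    def score(g, h):
        return g * g / (h + lam)

    best = (float('-inf'), None, None)     # (gain, split value, default direction)
    for direction in ('right', 'left'):    # where the missing instances are sent
        GL = g_miss if direction == 'left' else 0.0
        HL = h_miss if direction == 'left' else 0.0
        for v, g, h in zip(values, grad, hess):
            GL += g
            HL += h
            GR, HR = G_all - GL, H_all - HL
            gain = score(GL, HL) + score(GR, HR) - score(G_all, H_all)
            if gain > best[0]:
                best = (gain, v, direction)
    return best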

Regularization measures: L2 regularization + shrinkage + column and row subsampling

Besides the regularized objective, two additional techniques are used to further prevent overfitting. The first technique is shrinkage, introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling. According to user feedback, column subsampling prevents overfitting even more than traditional row subsampling (which is also supported).
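These knobs map to standard xgboost training parameters; the values below are a hedged illustration of typical settings, not a tuned configuration.

params = {
    'eta': 0.1,               # shrinkage: scale each new tree's weights by eta
    'lambda': 1.0,            # L2 regularization on leaf weights
    'gamma': 0.0,             # minimum loss reduction (the gamma*T penalty) required to split
    'colsample_bytree': 0.8,  # column (feature) subsampling per tree
    'subsample': 0.8,         # row (instance) subsampling per tree
}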

 

 

References:

http://xgboost.readthedocs.io/en/latest/model.html#elements-of-supervised-learning

http://zhanpengfang.github.io/418home.html

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

