An Overview of the XGB Algorithm

This article is a repost (2019-03-04 21:07, Machine Learning).

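Contents:

1. CART trees

2. Algorithm principle

3. Loss function

4. Node-splitting algorithm

5. Regularization

6. Handling of missing values

7. Advantages and disadvantages

8. Application scenarios

9. sklearn parameters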

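1. CART trees

The CART algorithm is a binary recursive partitioning technique: it divides the current sample set into two subsets, so that every non-leaf node has exactly two branches. The decision tree that CART produces is therefore a structurally simple binary tree. Because the tree is binary, each decision can only be "yes" or "no"; even when a feature takes more than two values, the data is still divided into two parts. The CART algorithm consists of two main steps:

Recursively partition the samples to grow the tree.
Prune the tree using validation data.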

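2. Algorithm principle

Input: the training dataset $D$ and the stopping condition;

Output: a CART decision tree.

Starting from the root node, recursively perform the following operations on each node to build a binary decision tree from the training dataset:

1) Let the training dataset at the node be $D$, and compute the Gini index of each available feature on this dataset. For each feature $A$ and each value $a$ it can take, split $D$ into $D_1$ and $D_2$ according to whether a sample answers "yes" or "no" to the test $A = a$, and compute the Gini index for $A = a$.

2) Among all features $A$ and all their possible split points $a$, choose the feature and split point with the smallest Gini index as the optimal feature and optimal split point. Using them, generate two child nodes from the current node and distribute the training data to the two children according to the feature.

3) Recursively apply 1) and 2) to the two child nodes until the stopping condition is met.

4) Output the resulting CART decision tree.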
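The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by any library: the function names (`gini`, `best_split`, `build_tree`, `predict`), the dictionary-based tree, the `max_depth` stopping condition and the toy weather data are all assumptions made for the example. The score of a split is taken to be the size-weighted Gini index of the two parts, which is the usual CART definition.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Steps 1-2: try every test A == a and keep the lowest weighted Gini index."""
    best = None  # (weighted_gini, feature_index, value)
    n = len(rows)
    for j in range(len(rows[0])):
        for a in sorted(set(r[j] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[j] == a]
            right = [y for r, y in zip(rows, labels) if r[j] != a]
            if not left or not right:
                continue  # this test does not actually divide the node
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, j, a)
    return best

def build_tree(rows, labels, max_depth=3):
    """Steps 3-4: recurse until the node is pure, unsplittable, or max_depth is hit."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or gini(labels) == 0.0:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:
        return {"leaf": majority}
    _, j, a = split
    yes = [(r, y) for r, y in zip(rows, labels) if r[j] == a]
    no = [(r, y) for r, y in zip(rows, labels) if r[j] != a]
    return {
        "feature": j,
        "value": a,
        "yes": build_tree([r for r, _ in yes], [y for _, y in yes], max_depth - 1),
        "no": build_tree([r for r, _ in no], [y for _, y in no], max_depth - 1),
    }

def predict(tree, row):
    """Follow the yes/no branches until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["feature"]] == tree["value"] else tree["no"]
    return tree["leaf"]

# Toy data (hypothetical): features are (outlook, temperature).
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "hot"), ("rain", "mild")]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels)
```

On this toy data the first feature separates the two classes perfectly, so the tree consists of a single split on it. Pruning with validation data, the second step of CART, is not covered by this sketch.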

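3. Loss function

$L = \sum\limits_{x_i \in R_m} (y_i - f(x_i))^2 + \sum\limits_{k=1}^K \Omega (f_k) $

Here $R_m$ is the region of the input space covered by a leaf, $f_k$ is the $k$-th tree, and $\Omega$ is the regularization term that penalizes tree complexity.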
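To make the first (squared-error) term of the loss concrete, here is a small sketch for a regression tree with a single split. It assumes that each leaf predicts the mean target of the samples in its region, which is the usual choice under squared error; the function name `squared_error`, the threshold values and the data are made up for illustration, and the regularization term is not evaluated here.

```python
def squared_error(xs, ys, threshold):
    """Sum over both regions of (y_i - f(x_i))^2 for a one-split regression tree.

    f predicts the mean target of the region a sample falls in.
    """
    regions = {
        "left": [y for x, y in zip(xs, ys) if x <= threshold],
        "right": [y for x, y in zip(xs, ys) if x > threshold],
    }
    total = 0.0
    for region_ys in regions.values():
        if not region_ys:
            continue
        prediction = sum(region_ys) / len(region_ys)  # constant output for this region
        total += sum((y - prediction) ** 2 for y in region_ys)
    return total

xs = [1, 2, 3, 4]
ys = [1.0, 1.2, 3.0, 3.4]
loss_good = squared_error(xs, ys, threshold=2.5)  # separates the low and high targets
loss_bad = squared_error(xs, ys, threshold=1.5)   # leaves three dissimilar targets together
```

The split at 2.5 groups similar targets together and therefore yields a much smaller squared error than the split at 1.5.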

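4. Node-splitting algorithm

The Gini index is used as the criterion for splitting nodes.

The Gini index of a probability distribution is defined as: $$Gini(p) = \sum\limits_{k=1}^K p_k (1-p_k) = 1 - \sum\limits_{k=1}^K p_k^2 $$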
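As a quick numerical check of this definition, the sketch below evaluates both forms of the formula on a few distributions. The function names and the example distributions are chosen for illustration only.

```python
def gini_from_probs(probs):
    """Gini(p) = sum_k p_k * (1 - p_k)."""
    return sum(p * (1.0 - p) for p in probs)

def gini_alt(probs):
    """Equivalent form 1 - sum_k p_k^2, valid when the probabilities sum to 1."""
    return 1.0 - sum(p * p for p in probs)

uniform = [0.5, 0.5]  # maximally mixed two-class node
skewed = [0.9, 0.1]   # mostly one class
pure = [1.0, 0.0]     # a single class

values = {name: gini_from_probs(p)
          for name, p in [("uniform", uniform), ("skewed", skewed), ("pure", pure)]}
```

The uniform distribution gives the largest value of the three and the pure one gives zero, so a lower Gini index corresponds to a purer node.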

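If the sample set $D$ is split into two parts $D_1$ and $D_2$ according to whether feature $A$ takes a particular value $a$, that is, $$D_1 = \{(x,y) \in D | A(x) = a \} , D_2 = D - D_1 $$

then the Gini index of $D$ under feature $A$ is defined as: $$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2) $$

The larger the Gini index, the greater the uncertainty of the sample set.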

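5. Regularization

Standard GBM implementations have no regularization step like the one in XGBoost. Regularization also helps reduce overfitting; in fact, XGBoost is known as a "regularized boosting" technique.

$ \Omega (f) = \gamma T + \frac{1}{2} \lambda ||\omega||^2 $

Here $T$ is the number of leaves in the tree and $\omega$ is the vector of leaf weights.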
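The penalty is easy to evaluate by hand. The following sketch computes it for two hypothetical trees, taking $T$ to be the number of leaves and $\omega$ the leaf weights; the function name `omega`, the leaf weights and the values of `gamma` and `lam` are assumptions made for illustration.

```python
def omega(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, with T = number of leaves."""
    num_leaves = len(leaf_weights)
    squared_norm = sum(w * w for w in leaf_weights)
    return gamma * num_leaves + 0.5 * lam * squared_norm

small_tree = [0.5, -0.5]             # two leaves with small weights
large_tree = [0.5, -0.5, 1.0, -1.0]  # four leaves, two with larger weights

penalty_small = omega(small_tree, gamma=1.0, lam=1.0)
penalty_large = omega(large_tree, gamma=1.0, lam=1.0)
```

The larger tree is penalized more, both through the leaf count (the `gamma` term) and through the size of its weights (the `lam` term), which is how the regularizer discourages overly complex trees.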

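6. Handling of missing values

XGBoost has built-in rules for handling missing values. The user supplies a value that differs from all ordinary sample values and passes it in as a parameter, and XGBoost treats that value as the marker for missing data. XGBoost may handle missing values differently at different nodes, and it learns during training how to handle the missing values it encounters later.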
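One way to picture "learning how to handle missing values" is a default direction per split: during training, the samples with a missing feature are tried on each side of the split, and the side that fits better is remembered for prediction time. The sketch below is a deliberately simplified illustration under that assumption. It scores the two options with plain squared error rather than XGBoost's gradient-based gain, and the function names, the sentinel `-999.0` and the data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def learn_default_direction(xs, ys, threshold, missing=-999.0):
    """Pick the branch for missing values that gives the lower squared error."""
    left = [y for x, y in zip(xs, ys) if x != missing and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x != missing and x > threshold]
    absent = [y for x, y in zip(xs, ys) if x == missing]
    loss_if_left = sse(left + absent) + sse(right)
    loss_if_right = sse(left) + sse(right + absent)
    return "left" if loss_if_left <= loss_if_right else "right"

def route(x, threshold, default, missing=-999.0):
    """Send a sample down the tree, using the learned default for missing values."""
    if x == missing:
        return default
    return "left" if x <= threshold else "right"

# The two samples marked -999.0 have targets that resemble the right-hand group.
xs = [1.0, 2.0, -999.0, 8.0, 9.0, -999.0]
ys = [1.0, 1.1, 5.0, 5.2, 4.9, 5.1]
default = learn_default_direction(xs, ys, threshold=5.0)
```

Because the targets of the missing-value samples are close to those on the right of the split, the learned default direction here is the right branch, and `route` applies it to any later sample whose feature equals the sentinel.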

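7. Advantages and disadvantages

Advantages:

XGBoost supports parallel processing, which makes it far faster than GBM. LightGBM, released by Microsoft, is another algorithm aimed at faster training. XGBoost also supports running on Hadoop.

XGBoost lets users define their own objective functions and evaluation functions; the only requirement is that the objective function be twice differentiable.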
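The twice-differentiable requirement exists because a custom objective has to supply the first and second derivatives of the loss with respect to the prediction. The sketch below shows that math for the loss $\frac{1}{2}(\hat{y} - y)^2$, together with a simple evaluation function. It uses plain Python lists so that it runs without any dependencies; the function names are illustrative, and the exact callback signature XGBoost expects depends on which interface is used, so treat this as an outline rather than a drop-in callback.

```python
def squared_error_objective(y_true, y_pred):
    """Per-sample gradient and hessian of 0.5 * (y_pred - y_true)^2 w.r.t. y_pred."""
    grad = [p - t for t, p in zip(y_true, y_pred)]  # first derivative
    hess = [1.0 for _ in y_true]                    # second derivative is constant
    return grad, hess

def rmse(y_true, y_pred):
    """Root mean squared error, usable as a custom evaluation metric."""
    n = len(y_true)
    return (sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
grad, hess = squared_error_objective(y_true, y_pred)
score = rmse(y_true, y_pred)
```

A loss whose second derivative does not exist would leave `hess` undefined, which is why such objectives cannot be used.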

8. Application scenarios

Scoring systems, intelligent spam detection, and advertising recommendation systems.

9. sklearn parameters

　　　　　　class xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, importance_type='gain', **kwargs)
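As a rough illustration of how these constructor arguments are typically supplied, the sketch below keeps a dictionary of settings and passes it to `XGBRegressor` as keyword arguments. The helper names `BASE_PARAMS`, `make_params` and `build_regressor` are hypothetical, the values are illustrative and not tuned, and only parameters that appear in the signature above are used. `nthread` and `seed` are left out because the parameter descriptions mark them as deprecated.

```python
# Illustrative settings only; every key is a parameter from the signature above.
BASE_PARAMS = {
    "max_depth": 3,
    "learning_rate": 0.1,
    "n_estimators": 100,
    "gamma": 0,
    "min_child_weight": 1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_alpha": 0,
    "reg_lambda": 1,
    "random_state": 0,
}

def make_params(**overrides):
    """Return a copy of BASE_PARAMS with the given overrides applied."""
    params = dict(BASE_PARAMS)
    params.update(overrides)
    return params

def build_regressor(**overrides):
    """Construct an XGBRegressor from the merged settings (requires xgboost)."""
    # Imported lazily so the dictionary helpers work without xgboost installed.
    from xgboost import XGBRegressor
    return XGBRegressor(**make_params(**overrides))

deeper = make_params(max_depth=5, learning_rate=0.05)
```

With xgboost installed, `build_regressor(max_depth=5)` would return an estimator that can be trained with `fit(X, y)` and queried with `predict(X)`, as with other scikit-learn style regressors.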

max_depth: type (int) – Maximum tree depth for base learners.

learning_rate: type (float) – Boosting learning rate (xgb's "eta").

n_estimators: type (int) – Number of boosted trees to fit.

silent: type (boolean) – Whether to print messages while running boosting.

objective: type (string or callable) – Specify the learning task and the corresponding learning objective, or a custom objective function to be used.

booster: type (string) – Specify which booster to use: gbtree, gblinear or dart.

nthread: type (int) – Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs.)

n_jobs: type (int) – Number of parallel threads used to run xgboost. (Replaces nthread.)

gamma: type (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

min_child_weight: type (int) – Minimum sum of instance weight (hessian) needed in a child.

max_delta_step: type (int) – Maximum delta step we allow each tree's weight estimation to be. This limits the size of each weight update; it is not an iteration count.

subsample: type (float) – Subsample ratio of the training instances.

colsample_bytree: type (float) – Subsample ratio of columns when constructing each tree.

colsample_bylevel: type (float) – Subsample ratio of columns for each split, in each level.

reg_alpha: type (float, xgb's alpha) – L1 regularization term on weights.

reg_lambda: type (float, xgb's lambda) – L2 regularization term on weights.

scale_pos_weight: type (float) – Balancing of positive and negative weights.

base_score: The initial prediction score of all instances, global bias.

seed: type (int) – Random number seed. (Deprecated, please use random_state.)

random_state: type (int) – Random number seed. (Replaces seed.)

missing: type (float, optional) – Value in the data which needs to be present as a missing value. If None, defaults to np.nan. Entries equal to this value are treated as missing.

importance_type: type (string, default "gain") – The feature importance type for the feature_importances_ property: either "gain", "weight", "cover", "total_gain" or "total_cover".

**kwargs: type (dict, optional) – Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here:
