Machine Learning: Notes on XGBoost Parameters

Reposted on 2019-09-27 22:03, filed under: Machine Learning - Notes

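XGBoost exposes two interfaces: its native API and a Scikit-Learn-compatible API. Their usage differs only in minor details.

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task parameters.

General parameters relate to which booster we use for boosting, commonly a tree model or a linear model.
Booster parameters depend on which booster you have chosen.
Learning task parameters decide the learning scenario; for example, regression tasks may use different parameters than ranking tasks.
Command line parameters relate to the behavior of the CLI version of xgboost.
This article covers only the native xgboost API; the Scikit-Learn API can be read against it for comparison.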

xgboost.train(params, dtrain, num_boost_round=10, evals=(),
              obj=None, feval=None, maximize=False, early_stopping_rounds=None,
              evals_result=None, verbose_eval=True, learning_rates=None,
              xgb_model=None, callbacks=None)

params: a dict that maps the training parameter names to their values, in the following form:

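params = {
    'booster': 'gbtree',          # tree-based booster
    'min_child_weight': 100,
    'eta': 0.02,                  # shrinkage step size
    'colsample_bytree': 0.7,      # fraction of features sampled per tree
    'max_depth': 12,
    'subsample': 0.7,             # fraction of rows sampled per tree
    'alpha': 1,
    'gamma': 1,                   # minimum loss reduction needed to split
    'silent': 1,                  # do not print runtime messages
    'objective': 'reg:linear',
    # verbose_eval is passed to xgboost.train() directly, not through params
    'seed': 12
}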

The individual parameters are described below.

General Parameters
booster [default=gbtree]

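Two boosters are available: gbtree and gblinear. gbtree boosts with tree-based models, while gblinear boosts with linear models. The default is gbtree.
silent [default=0]

0 prints runtime messages; 1 runs in silent mode and prints none. The default is 0.
nthread [default to maximum number of threads available if not set]

Number of threads XGBoost uses at run time. The default is the maximum number of threads available on the current system.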
num_pbuffer [set automatically by xgboost, no need to be set by user]

size of prediction buffer, normally set to number of training instances. The buffers are used to save the prediction results of last boosting step.
num_feature [set automatically by xgboost, no need to be set by user]

Feature dimension used in boosting, set to the number of features. XGBoost sets it automatically; there is no need to set it by hand.
Booster Parameters
eta [default=0.3]

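Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of the new features can be obtained directly; eta shrinks these weights to make the boosting process more conservative. The default is 0.3.
range: [0,1]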
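To make the effect of eta concrete, here is a minimal toy sketch in plain Python. It is not XGBoost's implementation: it assumes a single target value, squared error, and an idealized weak learner that predicts the current residual exactly. The function name boosted_prediction is hypothetical.

```python
def boosted_prediction(target, eta, rounds):
    """Toy boosting on one target value under squared error.

    Each round an idealized weak learner predicts the current residual
    exactly; its output is scaled by eta before being added.
    """
    prediction = 0.0
    for _ in range(rounds):
        residual = target - prediction   # what the next learner would fit
        prediction += eta * residual     # shrink the update by eta
    return prediction


# eta=1.0 reaches the target in one round; eta=0.3 moves only 30% of the way.
fast = boosted_prediction(target=10.0, eta=1.0, rounds=1)
slow = boosted_prediction(target=10.0, eta=0.3, rounds=1)
# With many rounds, a small eta still converges to the target.
many = boosted_prediction(target=10.0, eta=0.3, rounds=50)
print(fast, slow, many)
```

A smaller eta therefore usually needs a larger num_boost_round to reach the same fit.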
gamma [default=0]

minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
range: [0,∞]
max_depth [default=6]

Maximum depth of a tree. The default is 6.
range: [1,∞]
min_child_weight [default=1]

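Minimum sum of instance weights required in a child node. If a split would produce a leaf node whose sum of instance weights is less than min_child_weight, the tree stops splitting at that point. In linear regression mode, this corresponds to the minimum number of instances required in each node. The larger the value, the more conservative the algorithm.
range: [0,∞]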
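The roles of gamma and min_child_weight in a split decision can be sketched as follows. This is a simplified illustration based on the split gain formula from the XGBoost paper, not the library's actual code path. The names split_gain and accept_split are hypothetical, and reg_lambda stands for the L2 regularization term, which this article does not cover. The g and h arguments are the sums of first and second order gradients on each side of the split.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma


def accept_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject splits whose children are too light or whose gain is too small."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Same candidate split under three settings.
print(accept_split(-4.0, 4.0, 4.0, 4.0))                        # default settings
print(accept_split(-4.0, 4.0, 4.0, 4.0, gamma=5.0))             # larger gamma
print(accept_split(-4.0, 4.0, 4.0, 4.0, min_child_weight=5.0))  # heavier children required
```

Under squared error each instance contributes a second order gradient of 1, which is why min_child_weight behaves like a minimum instance count in that case.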
max_delta_step [default=0]

Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. This parameter is usually not needed, but it may help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 may help control the update.
range: [0,∞]
subsample [default=1]

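Ratio of the training instances sampled to build each tree. Setting it to 0.5 means that XGBoost randomly draws 50% of the instances from the full training set to grow trees, which helps prevent overfitting.
range: (0,1]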
colsample_bytree [default=1]
Ratio of features (columns) sampled when constructing each tree. The default is 1.
range: (0,1]
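As a rough picture of what subsample and colsample_bytree do before a tree is grown, the sketch below draws row and column indices with the standard library. It only illustrates the idea: the helper name sample_rows_and_columns is hypothetical, and the rounding used here is an assumption, not XGBoost's exact behavior.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample=1.0,
                            colsample_bytree=1.0, seed=0):
    """Pick the row and column indices one tree would be trained on."""
    rng = random.Random(seed)
    n_row_keep = max(1, round(n_rows * subsample))
    n_col_keep = max(1, round(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_row_keep))   # without replacement
    cols = sorted(rng.sample(range(n_cols), n_col_keep))   # without replacement
    return rows, cols


rows, cols = sample_rows_and_columns(1000, 10, subsample=0.5,
                                     colsample_bytree=0.7, seed=12)
print(len(rows), len(cols))
```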
Task Parameters
objective [ default=reg:linear ]
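Specifies the learning task and the corresponding learning objective. The available objectives are:

“reg:linear”: linear regression.
“reg:logistic”: logistic regression.
“binary:logistic”: logistic regression for binary classification; outputs a probability.
“binary:logitraw”: logistic regression for binary classification; outputs the raw score wTx before the logistic transformation.
“count:poisson”: Poisson regression for count data; outputs the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
“multi:softmax”: makes XGBoost handle multiclass classification with the softmax objective; you also need to set num_class (the number of classes).
“multi:softprob”: same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the predicted probability of the sample belonging to each class.
“rank:pairwise”: set XGBoost to do ranking task by minimizing the pairwise loss.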
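The reshape described for multi:softprob can be sketched in plain Python. This assumes the flat vector is laid out row by row, so that each sample's nclass probabilities are contiguous. The helper names and the sample values are made up for illustration.

```python
def reshape_softprob(flat, ndata, nclass):
    """Turn a flat vector of ndata * nclass values into ndata rows."""
    if len(flat) != ndata * nclass:
        raise ValueError("length of flat must equal ndata * nclass")
    return [flat[i * nclass:(i + 1) * nclass] for i in range(ndata)]


def predicted_classes(prob_matrix):
    """Index of the most probable class in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]


flat = [0.1, 0.7, 0.2,   # sample 0
        0.8, 0.1, 0.1]   # sample 1
matrix = reshape_softprob(flat, ndata=2, nclass=3)
labels = predicted_classes(matrix)
print(matrix, labels)
```

Taking the most probable class per row gives the kind of label that multi:softmax outputs directly.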
base_score [ default=0.5 ]

the initial prediction score of all instances, global bias
eval_metric [ default according to objective ]
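Evaluation metrics for the validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
Users can add multiple evaluation metrics. Python users should pass the parameters as a list of pairs rather than a map, so that a later 'eval_metric' entry does not overwrite an earlier one.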
The choices are listed below:
“rmse”: root mean square error
“logloss”: negative log-likelihood
“error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
“merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
“mlogloss”: Multiclass logloss
“auc”: Area under the curve for ranking evaluation.
“ndcg”:Normalized Discounted Cumulative Gain
“map”:Mean average precision
“ndcg@n”,”map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
“ndcg-”,”map-”,”ndcg@n-”,”map@n-”: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” in the evaluation metric XGBoost will evaluate these scores as 0 to be consistent under some conditions.
“gamma-deviance”: [residual deviance for gamma regression]
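The advice to pass several metrics as a list of pairs rather than a map comes down to how a Python dict treats repeated keys. The sketch below shows the difference without calling XGBoost; the parameter values are arbitrary examples.

```python
params = {'objective': 'binary:logistic', 'eta': 0.1}

# In a dict, assigning the same key twice keeps only the last value.
as_dict = dict(params)
as_dict['eval_metric'] = 'auc'
as_dict['eval_metric'] = 'logloss'

# A list of (key, value) pairs keeps every metric.
as_pairs = list(params.items()) + [('eval_metric', 'auc'),
                                   ('eval_metric', 'logloss')]

metrics = [value for key, value in as_pairs if key == 'eval_metric']
print(as_dict['eval_metric'])   # only the last metric survives
print(metrics)                  # both metrics are kept
```

The as_pairs list is then what would be handed to xgboost.train as params.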
seed [default=0]

Random number seed. The default is 0.

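dtrain: the training data.

num_boost_round: the number of boosting iterations, that is, how many base models are built.

evals: a list of (DMatrix, name) pairs to evaluate during training, of the form evals = [(dtrain,'train'),(dval,'val')] or evals = [(dtrain,'train')]. The first form lets us watch performance on the validation set while training.

obj: a customized objective function.

feval: a customized evaluation function.

maximize: whether to maximize the evaluation function.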
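As an illustration of what obj and feval callables can look like, here is a sketch of a logistic objective and an error-rate metric written with NumPy. It assumes the usual native API convention that both receive the predictions and the training DMatrix, that obj returns the gradient and the second order gradient, and that feval returns a (name, value) pair. The function names are hypothetical, and newer XGBoost releases may name the feval argument differently.

```python
import numpy as np


def logistic_obj(preds, dtrain):
    """Gradient and second order gradient of log loss w.r.t. raw scores."""
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw score
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess


def error_rate(preds, dtrain):
    """Fraction of wrong predictions; lower is better."""
    labels = dtrain.get_label()
    # Assumption: with a custom objective, preds are raw scores, so a
    # score above 0 corresponds to a probability above 0.5.
    wrong = (preds > 0.0) != (labels > 0.5)
    return 'my_error', float(np.mean(wrong))
```

Because error_rate should be minimized, it would be passed together with maximize=False, for example xgboost.train(params, dtrain, obj=logistic_obj, feval=error_rate, maximize=False).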

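early_stopping_rounds: the early stopping window. With a value of 100, training stops once the validation error has failed to improve for 100 consecutive rounds. This requires at least one item in evals; if there is more than one, the last is used. The returned model is from the last iteration (not the best one). If early_stopping_rounds is set, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit.

evals_result: a dict that stores the evaluation results of every item in the watchlist (the evals list).

verbose_eval: a bool or an int; also requires at least one item in evals. If True, the evaluation results for the items in evals are printed at every boosting round; if an int, say 5, they are printed every 5 rounds.

learning_rates: a list of learning rates, one for each boosting round.

xgb_model: an xgb model to load before training starts.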

Original article for the above: https://blog.csdn.net/iyuanshuo/article/details/80142730
