The XGBoost authors divide the parameters into three categories:
- General Parameters: Guide the overall functioning
- Booster Parameters: Guide the individual booster (tree/regression) at each step
- Learning Task Parameters: Guide the optimization performed
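As a rough illustration of this grouping, the sketch below keeps one plain Python dictionary per category and merges them into a single parameter dictionary, which is the form the native XGBoost training functions take. The dictionary variable names are made up for this example, and the values are illustrative only, not recommendations.

```python
# Hypothetical grouping: the three variable names below are illustrative,
# the keys are XGBoost parameter names described in the rest of this post.
general_params = {"booster": "gbtree", "nthread": 4}

booster_params = {
    "eta": 0.1,
    "max_depth": 5,
    "min_child_weight": 1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

learning_task_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "seed": 27,
}

# XGBoost itself takes one flat dictionary; the categories are only a
# way of organizing the parameters for the reader.
params = {**general_params, **booster_params, **learning_task_params}

for name, value in params.items():
    print(f"{name} = {value}")
```

Keeping the groups separate makes it easier to tune one category at a time while leaving the others fixed.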
General Parameters
- booster [default=gbtree] (type of base learner)
- Select the type of model to run at each iteration. The two options covered here are:
- gbtree: tree-based models
- gblinear: linear models
- silent [default=0]:
- Setting this to 1 activates silent mode, i.e. no running messages are printed. (Recent XGBoost releases replace this parameter with verbosity.)
- It’s generally good to keep it at 0, as the messages can help in understanding the model.
- nthread [defaults to the maximum number of threads available if not set]
- Used for parallel processing; enter the number of cores to use.
- To run on all cores, leave the value unset and the algorithm will detect the number automatically.
Booster Parameters
Though there are 2 types of boosters, I’ll consider only the tree booster here because it usually outperforms the linear booster, and thus the latter is rarely used.
- eta [default=0.3] (learning rate; a very important parameter)
- Analogous to learning rate in GBM
- Makes the model more robust by shrinking the weights on each step
- Typical final values to be used: 0.01-0.2
- min_child_weight [default=1] (controls overfitting; too large a value causes underfitting)
- Defines the minimum sum of weights of all observations required in a child. (In XGBoost the “weight” of an observation is its Hessian, i.e. the second-order gradient of the loss.)
- This is similar to min_samples_leaf in GBM but not exactly the same: it refers to the minimum “sum of weights” of observations, while GBM uses the minimum “number of observations”.
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- Values that are too high can lead to under-fitting, so it should be tuned using CV.
- max_depth [default=6] (maximum depth; in general, as more features are added, the depth can be smaller)
- The maximum depth of a tree, same as GBM.
- Used to control over-fitting, as a higher depth allows the model to learn relations very specific to a particular sample.
- Should be tuned using CV.
- Typical values: 3-10
- max_leaf_nodes (named max_leaves in XGBoost)
- The maximum number of terminal nodes (leaves) in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of n produces at most 2^n leaves.
- If this is defined, GBM will ignore max_depth.
- gamma [default=0] (minimum loss reduction required for a split)
- A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
- Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
- max_delta_step [default=0]
- The maximum delta step allowed for each tree’s weight estimation. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
- Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
- This is generally not used, but you can explore it further if you wish.
- subsample [default=1] (fraction of rows sampled; helps prevent overfitting)
- Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
- Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
- Typical values: 0.5-1
- colsample_bytree [default=1] (fraction of features sampled; helps prevent overfitting)
- Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
- Typical values: 0.5-1
- colsample_bylevel [default=1]
- Denotes the subsample ratio of columns for each split, in each level.
- I don’t use this often because subsample and colsample_bytree will do the job, but you can explore it further if you wish.
- lambda [default=1] (L2 regularization coefficient)
- L2 regularization term on weights (analogous to Ridge regression)
- This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
- alpha [default=0] (L1 regularization coefficient)
- L1 regularization term on weights (analogous to Lasso regression)
- Can be used with very high-dimensional data so that the algorithm runs faster.
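- scale_pos_weight [default=1] (weight of the positive class; with log loss the default of 1 is usually fine)
- A value greater than 0 should be used in case of high class imbalance, as it helps in faster convergence. A typical starting point is sum(negative instances) / sum(positive instances).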
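To make the roles of gamma, lambda, min_child_weight and eta more concrete, here is a minimal sketch of how a single candidate split could be scored. It follows the split-gain formula given in the XGBoost documentation's introduction to boosted trees, where G and H are the sums of first- and second-order gradients in a node. The function names are hypothetical, alpha and max_delta_step are ignored, and this is a simplification rather than the library's actual code.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, eta=1.0):
    """Leaf weight -G / (H + lambda), shrunk by the learning rate eta."""
    return -eta * grad_sum / (hess_sum + reg_lambda)


def _node_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single node."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(gl, hl, gr, hr, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, minus the gamma penalty.

    gl, hl: gradient and Hessian sums of the left child
    gr, hr: gradient and Hessian sums of the right child
    """
    children = _node_score(gl, hl, reg_lambda) + _node_score(gr, hr, reg_lambda)
    parent = _node_score(gl + gr, hl + hr, reg_lambda)
    return 0.5 * (children - parent) - gamma


def split_allowed(gl, hl, gr, hr, reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if hl < min_child_weight or hr < min_child_weight:
        return False
    return split_gain(gl, hl, gr, hr, reg_lambda, gamma) > 0


# Toy example with squared-error loss, where every instance has Hessian 1,
# so the Hessian sum of a child equals its number of instances.
left = (-4.0, 2.0)   # (gradient sum, Hessian sum)
right = (4.0, 2.0)

print(split_gain(*left, *right))                          # gain with gamma = 0
print(split_allowed(*left, *right, gamma=6.0))            # gamma larger than the raw gain
print(split_allowed(*left, *right, min_child_weight=3))   # children too light
print(leaf_weight(*left, eta=0.3))                        # shrunken leaf weight
```

In this toy case the raw gain is 16/3, so a gamma of 6 rejects the split, and so does a min_child_weight of 3, since each child has a Hessian sum of only 2. A larger lambda shrinks both the gain and the leaf weights, which is why it acts as a regularizer.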
Learning Task Parameters
These parameters define the optimization objective and the metric to be calculated at each step.
- objective [default=reg:linear] (loss function; reg:linear is named reg:squarederror in recent XGBoost releases)
- This defines the loss function to be minimized. The most commonly used values are:
- binary:logistic – logistic regression for binary classification; returns the predicted probability (not the class)
- multi:softmax – multiclass classification using the softmax objective; returns the predicted class (not probabilities)
- you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
- multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class.
- eval_metric [ default according to objective ]
- The metric to be used for validation data.
- The default values are rmse for regression and error for classification.
- Typical values are:
- rmse – root mean square error
- mae – mean absolute error
- logloss – negative log-likelihood
- error – Binary classification error rate (0.5 threshold)
- merror – Multiclass classification error rate
- mlogloss – Multiclass logloss
- auc – area under the curve
- seed [default=0] (random seed; several different seeds can be used to train different models, which are then ensembled)
- The random number seed.
- Can be used for generating reproducible results and also for parameter tuning.
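The difference between the objectives' outputs and a couple of the metrics above can be sketched in plain Python. This is only an illustration of the definitions, not XGBoost's implementation; all function names are hypothetical, and the input "margins" stand for the raw scores a boosted model would produce.

```python
import math


def sigmoid(margin):
    """binary:logistic style output: a raw score mapped to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def softprob(margins):
    """multi:softprob style output: one probability per class."""
    top = max(margins)  # subtract the maximum for numerical stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_class(margins):
    """multi:softmax style output: only the index of the most likely class."""
    probs = softprob(margins)
    return probs.index(max(probs))


def error_rate(labels, probs, threshold=0.5):
    """Binary classification error: predictions above the threshold count as positive."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != (y == 1))
    return wrong / len(labels)


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = [1, 0, 1, 0]
probs = [sigmoid(m) for m in [2.0, -1.0, -0.5, 0.3]]

print(error_rate(labels, probs))
print(logloss(labels, probs))
print(softprob([0.1, 2.0, -1.0]))
print(softmax_class([0.1, 2.0, -1.0]))
```

Here the last two observations fall on the wrong side of the 0.5 threshold, so the error rate is 0.5, while logloss also reflects how confident each wrong prediction was.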
General Approach for Parameter Tuning
- Choose a relatively high learning rate. A learning rate of 0.1 generally works, but values between 0.05 and 0.3 can suit different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called “cv”, which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
- Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees. Note that we can choose different parameters to define a tree, and I’ll take up an example here.
- Tune the regularization parameters (lambda, alpha) for xgboost, which can help reduce model complexity and enhance performance.
- Lower the learning rate and decide the optimal parameters.
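In short: start with a relatively high learning rate and use XGBoost's built-in cv to choose the number of trees; next, tune the tree-related parameters such as the depth and the minimum child weight; finally, tune the learning rate.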
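The first step relies on cross-validation with early stopping to decide how many trees to keep. The sketch below shows the idea on a plain list of per-round validation scores. It is a simplified stand-in, not XGBoost's implementation; the function name and the score values are made up for illustration.

```python
def best_num_rounds(scores, early_stopping_rounds, maximize=True):
    """Return how many boosting rounds to keep.

    scores: validation metric after each boosting round (round 1 first)
    early_stopping_rounds: stop once this many consecutive rounds bring
        no improvement over the best score seen so far
    maximize: True for metrics such as auc, False for metrics such as rmse
    """
    best_score = None
    best_round = 0
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score

        if improved:
            best_score, best_round = score, i
        elif i - best_round >= early_stopping_rounds:
            break  # no improvement for long enough: stop adding trees
    return best_round + 1  # rounds are counted from 1


# Made-up cross-validated AUC values, one per boosting round.
cv_auc = [0.70, 0.75, 0.78, 0.80, 0.81, 0.805, 0.808, 0.809, 0.80]
print(best_num_rounds(cv_auc, early_stopping_rounds=3))
```

With these scores the best AUC is reached at round 5 and the following three rounds do not beat it, so training would stop there and five trees would be kept.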

    import pandas as pd
    import numpy as np
    import xgboost as xgb
    from xgboost.sklearn import XGBClassifier
    from sklearn import metrics
    from sklearn.model_selection import GridSearchCV

    # xgb is the native XGBoost library. XGBClassifier is a scikit-learn wrapper
    # for XGBoost, which lets us use scikit-learn's grid search with parallel processing.

    train = pd.read_csv("train_modified.csv")
    target = "Disbursed"
    IDcol = "ID"


    def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5,
                 early_stopping_rounds=50):
        if useTrainCV:
            # Use xgb.cv with early stopping to choose the number of trees.
            xgb_param = alg.get_xgb_params()
            xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
            cvresult = xgb.cv(xgb_param, xgtrain,
                              num_boost_round=alg.get_params()['n_estimators'],
                              nfold=cv_folds, metrics='auc',
                              early_stopping_rounds=early_stopping_rounds)
            alg.set_params(n_estimators=cvresult.shape[0])

        # Fit on the training data. eval_metric is not passed to fit() here:
        # recent XGBoost releases deprecate that argument, and the train AUC
        # is computed below with scikit-learn anyway.
        alg.fit(dtrain[predictors], dtrain[target])

        # Predict on the training set.
        dtrain_predictions = alg.predict(dtrain[predictors])
        dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

        # Print the model report.
        print("\nModel Report")
        print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
        print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))


    predictors = [x for x in train.columns if x not in [target, IDcol]]
    # n_jobs and random_state are the scikit-learn wrapper's names for nthread and seed.
    xgb1 = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5,
                         min_child_weight=1, gamma=0, subsample=0.8,
                         colsample_bytree=0.8, objective='binary:logistic',
                         n_jobs=4, scale_pos_weight=1, random_state=27)
    modelfit(xgb1, train, predictors)

From the figure above, given learning_rate = 0.1, n_estimators = 120 is the optimal number of trees.
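With the learning rate and the number of trees fixed, the next step in the approach is to search over the tree-specific parameters. The sketch below only enumerates the candidate settings by hand to show what such a grid contains. The candidate values for max_depth and min_child_weight are illustrative assumptions; in practice a dictionary like param_grid would be passed to scikit-learn's GridSearchCV, which trains and scores one model per combination.

```python
from itertools import product

# Fixed by the first tuning step (values taken from the text above).
fixed_params = {"learning_rate": 0.1, "n_estimators": 120}

# Hypothetical candidate values for the second tuning step.
param_grid = {
    "max_depth": [3, 5, 7, 9],
    "min_child_weight": [1, 3, 5],
}

names = list(param_grid)
candidates = [
    {**fixed_params, **dict(zip(names, values))}
    for values in product(*(param_grid[name] for name in names))
]

print(len(candidates), "parameter combinations to evaluate")
for candidate in candidates:
    print(candidate)
```

The number of combinations grows multiplicatively with each added parameter, which is one reason to tune the parameters in stages rather than all at once.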
References:
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
http://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
https://github.com/dmlc/xgboost/tree/master/demo/guide-python
http://xgboost.readthedocs.io/en/latest/python/python_api.html
